picardythird t1_j66kza2 wrote

Whenever I see music generation models, I immediately go to the "classical" examples (or whatever is closest to classical among the examples provided). The reason is that some genres, such as techno, drum 'n' bass, 8-bit, and hip hop, are "simple" from a music theory perspective, while other genres, such as ambient, relaxing jazz, swing, and dream pop, are vague enough that the model can get by just by spitting out the right general timbre. Generating classical music, by contrast, requires an understanding of structure, style, and form.

Frankly, I'm not particularly impressed. The piano snippets seem to have string sounds mixed in, and both the "professional piano player" and "crazy fast piano player" snippets are basically just random notes with no particular structure. Meanwhile, the "opera" snippet uses piano sounds, which are not idiomatic to opera. The "string quartet" snippets are not idiomatic to string quartet writing (in particular, the "camptown races" snippet completely falls apart at the end, and the "fingerstyle guitar" snippet barely even sounds like string instruments).

I'm also not especially convinced by the Painting Caption Conditioning section. I suspect there is quite a bit of the Barnum effect going on here; we are primed to accept each caption as corresponding to the "correct" painting because it is presented that way, but this is just a framing device. As a self-experiment, play a track from one of the paintings while looking at any of the other paintings. Can you really say that the track could not feasibly correspond to the "other" painting? (Also, as someone who has literally written a piece of music inspired by the Caspar David Friedrich painting, I find myself unconvinced by the model's interpretation... but this is a wholly subjective critique.)

This is not to say that the model is not impressive in other ways. Its ability to mimic the styles of different genres is quite good (although the "swing" example in the Long Generation section loses focus halfway through), and the style transfer elements are quite interesting as well. However, music generation models have a long way to go when it comes to idiomatic understanding of the structural elements of music.

22

picardythird t1_iuqi550 wrote

This is akin to a full research paper about a bug report. It's utterly unsurprising that a program may harbor unusual or unexpected behavior in extremely uncommon edge cases. To me, this anomalous behavior simply arises from an edge case in how KataGo handles unusual rulesets; it can be easily patched (as is extremely common for game-playing engines in general) or, as other commenters have pointed out, avoided by not artificially restricting KataGo's strength.

0

picardythird t1_iuqhsya wrote

It is absolutely misleading to claim that Tromp-Taylor is "the standard for evaluation" in computer go.

Tromp-Taylor scoring has been used occasionally as a convenient means of simplifying how games are scored for the purposes of quantitative evaluation. However, area scoring (as in standard Chinese rules) and territory scoring (as in standard Japanese rules) are overwhelmingly more common, not to mention that these are actual rulesets used by actual go players.
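
To make the distinction concrete, here is a toy sketch of the two scoring families (my own illustration, not from the paper or from KataGo; the function names and board encoding are invented for this example, and dead-stone removal, komi, and seki are all ignored):

```python
# Toy illustration of the two scoring families (my own sketch, not from
# KataGo or the paper). Board cells are 'B', 'W', or '.' for empty.

def count_territory(board, color):
    """Count empty points whose connected empty region borders only `color`."""
    n, m = len(board), len(board[0])
    seen, total = set(), 0
    for i in range(n):
        for j in range(m):
            if board[i][j] != '.' or (i, j) in seen:
                continue
            # Flood-fill the empty region, recording which colors border it.
            stack, region, borders = [(i, j)], set(), set()
            while stack:
                a, b = stack.pop()
                if (a, b) in region:
                    continue
                region.add((a, b))
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    na, nb = a + da, b + db
                    if 0 <= na < n and 0 <= nb < m:
                        if board[na][nb] == '.':
                            stack.append((na, nb))
                        else:
                            borders.add(board[na][nb])
            seen |= region
            if borders == {color}:
                total += len(region)
    return total

def area_score(board, color):
    """Area scoring (Chinese-style; what Tromp-Taylor formalizes):
    own stones on the board plus surrounded empty territory."""
    stones = sum(row.count(color) for row in board)
    return stones + count_territory(board, color)

def territory_score(board, color, prisoners):
    """Territory scoring (Japanese-style): surrounded empty territory
    plus captured prisoners; stones on the board do not count."""
    return count_territory(board, color) + prisoners
```

The appeal of Tromp-Taylor for automated evaluation is visible here: area scoring needs no prisoner bookkeeping and no human agreement phase about dead stones, so it is trivial to compute mechanically, even though human games are overwhelmingly played and scored under Chinese- or Japanese-style rules.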

Your claims are inflated and rely on overly specific problem statements that do not map to normal (or even common) usage.

−7

picardythird t1_iroxxxd wrote

I'll admit that VAEs are one of the (imo very few) methods with a genuinely principled motivation. The original VAE paper is actually one of my favorite papers in all of ML; it's well-structured, easy to read, and the steps leading from problem to solution are clear.
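
For readers who haven't gone through it, the core step of that problem-to-solution chain is worth recapping (standard notation, my summary rather than a quote from the paper): the log-evidence is intractable, but it decomposes into a nonnegative KL gap plus a tractable lower bound, so maximizing the bound is a principled surrogate:

```latex
% Decomposition of the intractable log-evidence; since the first KL term
% is nonnegative, the ELBO is a valid lower bound to optimize instead.
\log p_\theta(x)
  = \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p_\theta(z \mid x) \right)
  + \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)}_{\text{ELBO}}
  \;\ge\; \text{ELBO}
```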

This is not at all the case for the vast majority of ML research. How many times have we seen "provably optimal" bounds on something that are then shown to, whoops, not actually be optimal a few years (or months! or weeks!) later? How many times have we seen "theoretical proofs" of why something works where it turns out that, whoops, the proofs were wrong (looking at you, Adam)? Perhaps most damningly, how many times have we seen a mathematical model or structure that seems to work well in the lab completely fail when exposed to real-world data that does not follow the extraordinarily convenient assumptions and simplifications serving as its foundational underpinnings? This last point is extremely common in fields with nonstandard settings, such as continual learning, adversarial ML, and noisy-label learning.

Most certainly, some (perhaps even "much"... I hesitate to say "most" or "all") of the work that comes out of the top top top tier labs is motivated by theory before practice. However, outside of these extraordinary groups it is plainly evident that most methods you see in the wild are empirical, not theoretical.

12

picardythird t1_irnu71n wrote

Let's be honest, 99.99% of the time the math is a post-hoc, hand-wavy description of the architecture that happened to work after dozens or hundreds of iterations of architecture search. End up with an autoencoder? Great, you have a latent space. Did you perform normalizations or add noise? Fantastic, you can make (reductive and unjustified) claims about probabilistic interpretations using simple distributions. Know some linear algebra? Awesome, you can take pretty much any vaguely accurate observation about the architecture and use it to support whatever interpretation you want.

There is a vanishingly small (but admittedly nonzero) amount of research that starts with a theoretical formulation and designs a hypothesis test around that formulation. Much, much, much more common is for researchers to start with a high-level (but ultimately heuristic) systems approach to architecture selection, and then twiddle and tweak the hyperparameters until the numbers look good enough to publish. Then they might look back, conjure up some vague bullshit involving math notation, and claim that as "theoretical justification," knowing that 99% of the time no one is going to read it, much less check their work. Bonus points if they make the notation intentionally confusing, so as to dissuade people from looking too closely. If you mix in as many fonts, subscripts, and superscripts as possible, plus plenty of set notation, most people's eyes will glaze over and they will assume you're right.

It's sad, but it's the reality of the state of our field.

Edit: As for intractable equations, some people are going to answer with talk of variational inference and theoretical guarantees about lower bounds. This is nice and all, but the real answer is that "neural networks go brrr." Neural networks are used as function approximators, and except for some high level architectural descriptions, no one actually knows how or why neural networks learn the functions that they do (or even if they learn the intended functions at all). Anyone that tells you otherwise is selling you something.
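
To be concrete about what "function approximator" means here, a minimal toy sketch (my own example; the architecture and hyperparameters are arbitrary choices): a tiny MLP fits sin(x) to low error, yet nothing in the learned weights tells you how or why the fit works.

```python
# Minimal sketch: a small MLP used as a generic function approximator.
# All sizes, the learning rate, and the step count are arbitrary choices.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: treat sin(x) on [-pi, pi] as the unknown target function.
x = torch.linspace(-torch.pi, torch.pi, 256).unsqueeze(1)
y = torch.sin(x)

# A small MLP; its weights will fit the data, but inspecting them reveals
# essentially nothing about the mechanism of the fit.
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(f"final MSE: {loss.item():.6f}")  # small, but the "how" is opaque
```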

Edit 2: Lol, it's always hilarious getting downvotes from people who can't stand being called out for their bad practices.

51