Programmer with surface-level knowledge of machine learning here, so I might not get this right.

On a super basic level, machine learning models are trained to either recognize a thing or reproduce a thing, using countless real-world examples of whatever you want them to handle. In the case of something like DALL-E, you feed it a billion captioned pictures of people's faces and it eventually connects the dots and figures out that "smiling face" = two dots with a triangle in between them, an upward-facing curve, a couple of loops, some dark lines above the dots, etc. "Frowning face" is similar except the curve faces downward, and so on.

Critically, the model doesn't actually "get" what it's looking at: it doesn't understand the significance of eyes or facial muscles or bone structure. It's simply taking an enormous dataset and associating the visual traits it sees in photographs with the keywords attached to the training data. Then it mashes everything together into an image without any greater thought about meaning.
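
If you want to see that idea in code, here's a deliberately dumb toy version in Python. All the data here is fake and real models learn vastly richer statistics, but the spirit of "association, not understanding" is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 fake 8x8 "photos" and a fake caption keyword for each one.
images = rng.random((1000, 8, 8))
tags = rng.choice(["smiling face", "frowning face"], size=1000)

# "Learning" a keyword here is just averaging every image tagged with it.
prototypes = {
    tag: images[tags == tag].mean(axis=0)
    for tag in ("smiling face", "frowning face")
}

# Asking for "smiling face" then means producing something near that
# prototype -- no eyes, muscles, or bones involved, just pixel statistics.
print(prototypes["smiling face"].shape)  # (8, 8)
```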

So that's all well and good, but what does it have to do with hands? Well, most things in this photo are what I'd call determinate shapes.

In this case, OP might have used a prompt like "Hipster, man with stubble, dinosaur chicken farmer, David Foster Wallace lookalike, fall attire, Carhartts and flannel," et cetera. Give an AI a billion pictures of chickens and they eventually all start to look the same, so you get an "average" (but completely unique) chicken picture. Same with a guy in flannel: give an AI enough Macy's catalogs and it'll get the gist and give you a guy with broad shoulders wearing a shirt with a plaid pattern.
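
Here's a quick numpy sketch of why determinate shapes average so nicely (again, a made-up toy, not how DALL-E actually works): when every example is basically the same pattern plus noise, the noise cancels and the average is crisp.

```python
import numpy as np

rng = np.random.default_rng(1)

# A fixed "determinate" shape: the same square in the same place every time.
template = np.zeros((32, 32))
template[8:24, 8:24] = 1.0

# 1000 noisy training examples of that one shape (stand-ins for, say,
# a thousand near-identical catalog photos of a guy in flannel).
examples = template + rng.normal(0, 0.5, size=(1000, 32, 32))

# The noise cancels out and the average is a crisp copy of the shape.
average = examples.mean(axis=0)
print(np.abs(average - template).max())  # small, roughly 0.05
```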

Hands, on the other hand, are indeterminate (sorta). If you look at a million different pictures of people, they'll all be doing something different with their hands. Remember that the AI doesn't really know what it's trying to make: it doesn't know what bones or tendons are, and it doesn't understand their biological function or purpose. Nor does it care: it takes a "monkey see, monkey do" approach to mimicking a real photograph.

So when you tell it to draw someone with hands, it's drawing on billions of photographs of people and trying to distill the concepts it remembers into a new picture. Some of those people will be waving, some will be holding a flowerpot, some will be playing the piano, some will be shaking their fists in the air, some will be miming finger guns or flipping off a former president. Some will even be shaking hands with another person or holding both hands together in prayer or something. But to the AI, it's all just noise without context. When it spits out a new image, it's working from the assumption that "picture is of person, person has hands, hands have seemingly random arrangement of long and short interlocking digits," and it gives you what it sees as an average hand, "average" in this case being a palm attached to two stumpy opposable toes.
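
And here's the flip side of that same toy: keep the object exactly the same but randomize its pose in every example, the way hands vary across photos, and the average stops looking like the object at all.

```python
import numpy as np

rng = np.random.default_rng(2)
size = 32
yy, xx = np.mgrid[:size, :size] - size // 2  # pixel coords centered on 0

def bar(angle):
    """A thin finger-like bar through the center at the given angle."""
    d = np.abs(xx * np.sin(angle) - yy * np.cos(angle))
    return (d < 1.5).astype(float)

# Same object every time, but in a random pose -- like hands across photos.
poses = np.stack([bar(a) for a in rng.uniform(0, np.pi, size=1000)])

# The average isn't a bar at any particular angle; it's a vague radial smear.
average = poses.mean(axis=0)
print(average[size // 2, size // 2], average[0, 0])  # bright center, dim edges
```

That smear is the toy-model equivalent of the stumpy, merged, too-many-fingered hands the AI produces.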

Anyway, that comment went a little long, but I hope it was informative. Hello from Oregon, love your state <3