Submitted by FunQuarter3511 t3_11ugj0f in deeplearning
hijacked_mojo t1_jcpaon7 wrote
Keys come from a learned weight matrix, and the dot product between a query and each key determines how much attention that query pays to the corresponding value. The weights are then adjusted during backprop to minimize the error, and that in turn modifies the keys.
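To make that concrete, here's a rough NumPy sketch of scaled dot-product attention (my own toy example, not code from the video; the dimensions and random inputs are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_k = 4                      # key/query dimension (arbitrary for this example)
Q = np.random.randn(3, d_k)  # 3 query vectors
K = np.random.randn(5, d_k)  # 5 key vectors
V = np.random.randn(5, d_k)  # 5 value vectors, one per key

scores = Q @ K.T / np.sqrt(d_k)  # dot product of each query with each key
weights = softmax(scores)        # how much attention each query pays to each value
output = weights @ V             # weighted sum of values
```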
I have a video that goes through everything step-by-step:
https://www.youtube.com/watch?v=acxqoltilME
FunQuarter3511 OP t1_jcpmkyt wrote
>I have a video that goes through everything
First off, this video is amazing! You definitely have a new subscriber in me and I will be sharing! Hope you keep making content!!
So I was originally thinking about this like a python dictionary/hash table where you have keys and values, and you retrieve values when the "query" = key.
Rather, what's happening here is that the "loudest" key (by magnitude) is expected to get the highest weight. That's okay, because the query/key (and value) weight matrices are learned anyway, so during backprop the most important key will just learn to be louder (in addition to whatever the value weight matrix learns).
In essence, the python dictionary is just the wrong analogy to be using here. We aren't necessarily giving greater weight to query/key pairs that are similar; rather, we want the most important keys to be large, which the network will learn.
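Something like this toy sketch (my own illustration, with made-up sizes), where queries, keys and values are all learned projections of the same input, so backprop is free to reshape any of them to change which keys end up "loud":

```python
import numpy as np

d_model, d_k = 8, 4
X = np.random.randn(5, d_model)      # 5 input token embeddings

W_q = np.random.randn(d_model, d_k)  # learned query projection
W_k = np.random.randn(d_model, d_k)  # learned key projection
W_v = np.random.randn(d_model, d_k)  # learned value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # queries, keys, values all come from X
scores = Q @ K.T / np.sqrt(d_k)      # larger scores -> larger attention weights
```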
Does that sound right?
hijacked_mojo t1_jcpstsu wrote
Yes, you have the right idea but also add this to your mental model: the queries and values are influenced by their *own* set of weights. So it's not only the keys getting modified, but also queries and values.
In other words, the query, key and value weights all get adjusted via backprop to minimize the error. So it's entirely possible that on a given backprop step the value weights get modified a lot (for example) while the key weights barely change.
It's all about giving the network the "freedom" to adjust itself to best minimize the error.
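A quick PyTorch sketch of that freedom (again just my illustration, with an arbitrary toy target and MSE as a stand-in for the error): each projection is an independent parameter, so backprop computes a separate gradient for each, and one can move a lot on a step while another barely moves.

```python
import torch

d_model, d_k = 8, 4
X = torch.randn(5, d_model)
target = torch.randn(5, d_k)  # made-up target just to have an error to minimize

W_q = torch.randn(d_model, d_k, requires_grad=True)
W_k = torch.randn(d_model, d_k, requires_grad=True)
W_v = torch.randn(d_model, d_k, requires_grad=True)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)
out = weights @ V

loss = torch.nn.functional.mse_loss(out, target)
loss.backward()  # each weight matrix gets its own gradient
print(W_q.grad.norm(), W_k.grad.norm(), W_v.grad.norm())
```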
FunQuarter3511 OP t1_jcptitk wrote
That makes a ton of sense. Thanks for your help! You are a legend!