Submitted by FunQuarter3511 t3_11ugj0f in deeplearning

Hello Everyone!

I have a question about scoring in attention. In attention, you compute the dot product between the query and the keys. From my understanding, the intuition is that we can use the inner product as a mechanism to measure similarity.


I don't think I fully understand this:

Let's say we have a query q_1 = [1, 1, 1]

And we have two keys k_1= [1, 1, 1] and k_2 = [100, 50, 100]


The dot-product for q_1 @ k_1 = 3 while the dot product for q_1 @ k_2 = 250

So in the soft-max, the value associated with k_2 will be weighted much higher, even though q_1 and k_1 were literally the same vector.


Now this would work if all the vectors were unit length (i.e. cosine similarity), but most examples I have seen online don't normalize by the magnitudes of the vectors; they divide by sqrt(d), where d seems to be the number of elements in each vector?
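To make the example above concrete, here is a quick NumPy check (a sketch; the softmax and 1/sqrt(d) scaling follow the standard scaled dot-product formulation, and the numbers are the ones from the question):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

q1 = np.array([1.0, 1.0, 1.0])
k1 = np.array([1.0, 1.0, 1.0])
k2 = np.array([100.0, 50.0, 100.0])

scores = np.array([q1 @ k1, q1 @ k2])   # [3.0, 250.0]
scaled = scores / np.sqrt(q1.shape[0])  # divide by sqrt(d) = sqrt(3)

w = softmax(scaled)
print(w)  # weight on k2 is ~1: dividing by sqrt(d) does not undo magnitude
```

As the output shows, the sqrt(d) scaling shrinks all scores by the same factor, so the large-magnitude key still dominates the softmax; it is not a substitute for normalizing each vector to unit length.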


If someone could explain where I am missing something, I would really appreciate it!

6

Comments


hijacked_mojo t1_jcpaon7 wrote

Keys come from weights, and the dot product determines how much attention a particular query vector should get. The weights are then adjusted during backprop to minimize the error, and thereby modify the keys.

I have a video that goes through everything step-by-step:
https://www.youtube.com/watch?v=acxqoltilME

1

FunQuarter3511 OP t1_jcpmkyt wrote

>I have a video that goes through everything

First off, this video is amazing! You definitely have a new subscriber in me and I will be sharing! Hope you keep making content!!

So I was originally thinking about this like a python dictionary/hash table where you have keys and values, and you retrieve values when the "query" = key.

Rather, what is happening here is that the "loudest" (by magnitude) key is expected to get the highest weight. This is okay because the key/query (and value) weight matrices are learned anyway, so during backprop the most important key will just learn to be louder (in addition to the model being able to learn through the value weight matrix as well).

In essence, the python dictionary is just the wrong analogy to be using here. We are not necessarily giving greater weights to key/query pairs that are similar. Rather, we want the most important keys to produce large scores, which the network will learn.
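To make that concrete, here's a toy single-head attention sketch (a minimal illustration; the names W_q/W_k/W_v and the random weights stand in for learned parameters). The keys come from x @ W_k, so backprop can rescale them to make important keys "louder":

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                        # embedding / head dimension
x = rng.normal(size=(3, d))  # 3 token embeddings

# Learned projections: backprop adjusts these, which rescales keys/queries
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d)  # scaled dot-product scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax

out = weights @ V              # each output is a weighted sum of values
print(weights.sum(axis=-1))    # each row of attention weights sums to 1
```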

Does that sound right?

1

hijacked_mojo t1_jcpstsu wrote

Yes, you have the right idea but also add this to your mental model: the queries and values are influenced by their *own* set of weights. So it's not only the keys getting modified, but also queries and values.

In other words, the queries, keys and values weights all get adjusted via backprop to minimize the error. So it's entirely possible on a backprop that the value weights get modified a lot (for example) while the key weights are changed little.

It's all about giving the network the "freedom" to adjust itself to best minimize the error.
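A sketch of that idea (the toy loss and all names here are illustrative, not anything from a real model): finite-difference gradients on a single entry show that the key weights and value weights each receive their own, generally different, gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
x = rng.normal(size=(2, d))   # two token embeddings
q = rng.normal(size=(d,))     # a single query vector
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

def loss(W_k, W_v):
    # Scaled dot-product attention followed by a toy squared loss
    K, V = x @ W_k, x @ W_v
    s = (q @ K.T) / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return float(((w @ V) ** 2).sum())

# Central finite differences w.r.t. the (0, 0) entry of each weight matrix
eps = 1e-6
E = np.zeros((d, d)); E[0, 0] = eps
grad_k = (loss(W_k + E, W_v) - loss(W_k - E, W_v)) / (2 * eps)
grad_v = (loss(W_k, W_v + E) - loss(W_k, W_v - E)) / (2 * eps)
print(grad_k, grad_v)  # distinct gradients flow to keys and values
```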

1

FunQuarter3511 OP t1_jcptitk wrote

That makes a ton of sense. Thanks for your help! You are a legend!

2

p0p4ks t1_jcppzf4 wrote

I get these confusions all the time. But then I remember we are backpropagating the errors. Imagine your case happening while the model output is incorrect: backprop will take care of the key's score being too big or small and fix the output.

1

FunQuarter3511 OP t1_jcpstej wrote

Fully agree!

I think my issue was that, because of the terms query, key, and value, I was trying to relate them to a database or hash-table context. But in reality, those terms seem to be misnomers, and backprop will set the key/query pair to whatever is needed so that the dot product for important context is large and gets weighted appropriately.

I was over complicating it.

1