FunQuarter3511

FunQuarter3511 OP t1_jcpstej wrote

Reply to comment by p0p4ks in Question on Attention by FunQuarter3511

Fully agree!

I think my issue was that, because of the terms query, key, and value, I was trying to relate them to a database or hash table. In reality those terms seem to be misnomers: backprop will set the key/query projections to whatever is needed so that the dot product for important context is large and that context gets weighted appropriately.
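To make the contrast with a hash table concrete, here is a minimal sketch in plain Python of a "soft lookup", assuming standard scaled dot-product attention. The helper names (`softmax`, `dot`, `attend`) and the toy vectors are made up for illustration. In a real model the keys, queries, and values would come from learned projections.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Soft lookup: every value contributes, weighted by softmax(q . k / sqrt(d))."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return weights, out

# Hypothetical toy vectors, picked by hand (not learned).
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.1]

weights, output = attend(query, keys, values)
print(weights)  # one weight per key, summing to 1
print(output)   # a blend of all the values, not a single retrieved entry
```

A plain dict lookup, by contrast, returns exactly one value or nothing at all. Here no key has to "match" the query, and every value contributes a little.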

I was overcomplicating it.


FunQuarter3511 OP t1_jcpmkyt wrote

>I have a video that goes through everything

First off, this video is amazing! You definitely have a new subscriber in me and I will be sharing! Hope you keep making content!!

So I was originally thinking about this like a Python dictionary/hash table, where you have keys and values and you retrieve a value when the query equals a key.

Rather, what is happening here is that the "loudest" key (by magnitude) is expected to get the highest weight. This is okay because the key/query (and value) weight matrices are learned anyway, so during backprop the most important key will just learn to be louder (and the model can also learn through the value weight matrix).

In essence, the Python dictionary is just the wrong analogy to use here. We are not necessarily giving greater weight to key/query pairs that are similar. Rather, we want the most important keys to be large, and the model will learn to make them so.
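One way to sanity-check that intuition is a small sketch in plain Python, with hand-picked toy vectors standing in for learned keys. The function name `attention_weights` and all the numbers are made up for illustration, and the usual 1/sqrt(d) scaling is left out for brevity. Since the score is a dot product, q · k = |q||k|cos θ, so a key's norm and its direction relative to the query both feed into the weight:

```python
import math

def attention_weights(query, keys):
    # Raw dot-product scores (no 1/sqrt(d) scaling in this toy version).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

query = [1.0, 0.0]
other = [0.5, 0.5]            # a second key to compete against

aligned = [1.0, 0.0]          # same direction as the query, norm 1
loud_aligned = [3.0, 0.0]     # same direction, norm 3
loud_orthogonal = [0.0, 3.0]  # norm 3, but perpendicular to the query

w_base = attention_weights(query, [aligned, other])
w_loud = attention_weights(query, [loud_aligned, other])
w_orth = attention_weights(query, [loud_orthogonal, other])

print(w_base[0], w_loud[0], w_orth[0])
```

In this toy setup, scaling up a key that already points along the query raises its weight, while an equally large key perpendicular to the query scores zero and loses to the smaller competitor. So being "loud" only seems to help when the direction lines up with the query as well.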

Does that sound right?
