I think my issue was that because of the terms query, key, value, I was trying to relate them in a database or hashtable context. But in reality, those terms seem to be misnomers, and backprop will set the key/query pair to whatever is needed such that the dot product for important context will be large and be weighted appropriately.
First off, this video is amazing! You definitely have a new subscriber in me and I will be sharing! Hope you keep making content!!
So I was originally thinking about this like a python dictionary/hash table where you have keys and values, and you retrieve values when the "query" = key.
Rather what is happening here, is that the "loudest" (by magnitude) key is expected to get the highest weight. This is okay, because the key/query (and value) weight matrix are learned anyways, so during backprop, the most important key will just learn to be louder (in addition to being able to learn from the value weights matrix as well).
In essence, the python dictionary is just the wrong analogy to be using here. We are not necessarily giving greater weights to key/query pairs that are similar. But rather, we want the most important keys to be large, which it will learn.
FunQuarter3511 OP t1_jcptitk wrote
Reply to comment by hijacked_mojo in Question on Attention by FunQuarter3511
That makes a ton of sense. Thanks for your help! You are a legend!