Submitted by FunQuarter3511 t3_11ugj0f in deeplearning
Hello Everyone!
I have a question about scoring in attention. In attention, you take the dot product between the query and the keys. From my understanding, the intuition is that the inner product acts as a measure of similarity between the query and each key.
I don't think I fully understand this:
Let's say we have a query q_1 = [1, 1, 1]
And we have two keys k_1= [1, 1, 1] and k_2 = [100, 50, 100]
The dot product q_1 @ k_1 = 3, while the dot product q_1 @ k_2 = 250.
So in the softmax, the value associated with k_2 will be weighted much higher, even though q_1 and k_1 are literally the same vector.
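Here's a quick numpy sketch of what I mean (toy numbers, my own code, not from any library):

    import numpy as np

    # Toy query and keys from the example above
    q_1 = np.array([1.0, 1.0, 1.0])
    k_1 = np.array([1.0, 1.0, 1.0])
    k_2 = np.array([100.0, 50.0, 100.0])

    # Raw dot-product scores
    scores = np.array([q_1 @ k_1, q_1 @ k_2])   # [3.0, 250.0]

    # Softmax over the scores: k_2 gets essentially all of the weight
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    print(weights)   # ~[0.0, 1.0] -- k_1 is ignored even though it equals q_1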
Now this would work if all the vectors were unit length (i.e. cosine similarity), but most examples I have seen online don't normalize by the magnitudes of the vectors; they divide by sqrt(d_k), where d_k seems to be the number of elements in the key vector.
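For concreteness, here is a small sketch (again my own code, just to illustrate the two normalizations I'm comparing):

    import numpy as np

    q_1 = np.array([1.0, 1.0, 1.0])
    k_1 = np.array([1.0, 1.0, 1.0])
    k_2 = np.array([100.0, 50.0, 100.0])
    d = q_1.shape[0]   # d = 3

    # Cosine similarity: divide by the product of the vector norms
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(cos(q_1, k_1), cos(q_1, k_2))   # 1.0 vs ~0.96 -> k_1 wins, as I'd expect

    # Scaled dot product: divide by sqrt(d), as in most transformer examples
    print(q_1 @ k_1 / np.sqrt(d), q_1 @ k_2 / np.sqrt(d))   # ~1.73 vs ~144.3 -> k_2 still dominates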
If someone could explain where I am missing something, I would really appreciate it!