maizeq t1_is57j81 wrote on October 13, 2022 at 12:01 PM

Transformers aren’t my field of expertise, so I don’t know if this has been done before, but hah, neat derivation!

Though I would expect there to be no difference in loss in that case. Was the difference positive or negative? And do you think the difference can be chalked up to numerical precision errors that accumulate due to the double vs. single matrix multiplication? An easy test of this would be to compare K’ and Wq (XWk)^T and see how close they are throughout training for a particular sample.
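
To make that test concrete, here is a rough NumPy sketch of what I have in mind. I'm assuming K’ comes from a single pre-multiplied matrix (Wk Wq^T), that everything runs in float32, and that X holds one token per row; the shapes, the random weights, and the function name `reassociation_gap` are all made up for illustration and are not taken from your code. It compares the two orderings against each other and against a float64 reference:

```python
import numpy as np

def reassociation_gap(n_tokens=64, d_model=128, d_head=32, seed=0):
    """Relative max-abs gaps between two float32 evaluation orders of X Wk Wq^T.

    All shapes and the random weights are hypothetical placeholders.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_tokens, d_model)).astype(np.float32)
    Wq = rng.standard_normal((d_model, d_head)).astype(np.float32)
    Wk = rng.standard_normal((d_model, d_head)).astype(np.float32)

    # Double multiplication: project with Wk first, then with Wq^T.
    # This is the transpose of Wq (X Wk)^T.
    two_step = (X @ Wk) @ Wq.T

    # Single multiplication at run time: Wk Wq^T is merged ahead of time
    # (my assumption about how K' is obtained).
    fused = X @ (Wk @ Wq.T)

    # Higher-precision reference to judge both float32 results against.
    X64, Wq64, Wk64 = (a.astype(np.float64) for a in (X, Wq, Wk))
    reference = (X64 @ Wk64) @ Wq64.T
    scale = np.abs(reference).max()

    return {
        "two_step_vs_fused": float(np.abs(two_step - fused).max() / scale),
        "two_step_vs_ref": float(np.abs(two_step - reference).max() / scale),
        "fused_vs_ref": float(np.abs(fused - reference).max() / scale),
    }

if __name__ == "__main__":
    for name, gap in reassociation_gap().items():
        print(f"{name}: {gap:.3e}")
```

If the gaps stay near float32 rounding level for the real weights at several checkpoints, precision alone probably doesn't explain the loss difference, and something else (initialization or how gradients flow through the merged matrix, say) would be worth a look.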