Submitted by wangyi_fudan t3_y2w87i in MachineLearning
maizeq t1_is57j81 wrote
Transformers aren’t my field of expertise so I don’t know if this has been done before but hah, neat derivation!
Though I would expect there to be no difference in loss in that case. Was the difference positive or negative? And do you think it can be chalked up to numerical precision errors that accumulate differently in the double vs. single matrix multiplication? An easy test of this would be to compare K’ and Wq (XWk)^T and see how close they stay throughout training for a particular sample.
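A rough sketch of that kind of check, in NumPy. The definition of K’ comes from the original post and isn't reproduced in this comment, so this compares the attention scores instead: (X Wq)(X Wk)^T computed with two separate projections versus X (Wq Wk^T) X^T with one pre-multiplied matrix. The function name `score_gap`, the shapes, and the random initialisation are all assumptions for illustration, not anything from the post.

```python
import numpy as np

def score_gap(n=32, d=64, seed=0):
    """Compare two algebraically equal ways of computing attention scores in float32."""
    rng = np.random.default_rng(seed)
    # Hypothetical inputs: n tokens of width d, weights scaled by 1/sqrt(d).
    X = rng.standard_normal((n, d)).astype(np.float32)
    Wq = (rng.standard_normal((d, d)) / np.sqrt(d)).astype(np.float32)
    Wk = (rng.standard_normal((d, d)) / np.sqrt(d)).astype(np.float32)

    two_step = (X @ Wq) @ (X @ Wk).T   # separate query and key projections
    merged = X @ (Wq @ Wk.T) @ X.T     # single pre-multiplied matrix

    # float64 reference, used to see how far each float32 result drifts
    X64, Wq64, Wk64 = (a.astype(np.float64) for a in (X, Wq, Wk))
    ref = (X64 @ Wq64) @ (X64 @ Wk64).T

    gap = float(np.max(np.abs(two_step - merged)))
    err_two = float(np.max(np.abs(two_step - ref)))
    err_merged = float(np.max(np.abs(merged - ref)))
    return gap, err_two, err_merged

gap, err_two, err_merged = score_gap()
print(f"two-step vs merged: {gap:.3e}")
print(f"two-step vs float64: {err_two:.3e}")
print(f"merged vs float64: {err_merged:.3e}")
```

To follow the suggestion more closely you would log the same gap on a fixed sample at intervals during training, using the actual weights rather than random ones, and see whether it stays at rounding-error scale or grows.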