Reasonable_Boss2750 t1_is97cn3 wrote on October 14, 2022 at 5:38 AM
Reply to [R] Wq can be omitted in single head attention by wangyi_fudan
A possible reason the author uses attention with both Wq and Wk is to fuse information from the encoder and the decoder. In that case the score formula is (X_en W_q)(X_de W_k)^T, where X_en and X_de are the encoder and decoder hidden states.
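
As a concrete illustration, here is a minimal numpy sketch of that cross-attention score computation. The shapes and names (d_model, d_k, X_en, X_de) are assumptions for illustration only, not taken from the paper or any particular codebase.

```python
import numpy as np

# Minimal sketch of the cross-attention scores (X_en W_q)(X_de W_k)^T.
# All shapes and variable names here are illustrative assumptions.
rng = np.random.default_rng(0)

d_model, d_k = 64, 64
len_en, len_de = 10, 7

X_en = rng.standard_normal((len_en, d_model))  # encoder hidden states
X_de = rng.standard_normal((len_de, d_model))  # decoder hidden states

W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)

# Queries come from the encoder and keys from the decoder, so W_q and W_k
# project two different sequences before their information is mixed.
scores = (X_en @ W_q) @ (X_de @ W_k).T / np.sqrt(d_k)
print(scores.shape)  # (len_en, len_de)
```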