Reasonable_Boss2750 t1_is97cn3 wrote on October 14, 2022 at 5:38 AM
Reply to [R] Wq can be omitted in single head attention by wangyi_fudan
A possible reason the author uses attention with both Wq and Wk is to fuse information from the encoder and the decoder. In that case the score formula is (X_en W_q)(X_de W_k)^T, where X_en and X_de are the encoder and decoder hidden states.
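
As a concrete illustration, here is a minimal numpy sketch of that cross-attention score computation. The shapes and names (d_model, d_k, X_en, X_de) are assumptions for illustration only, not taken from the paper or any particular codebase.

```python
import numpy as np

# Minimal sketch of the cross-attention scores (X_en W_q)(X_de W_k)^T.
# All shapes and variable names here are illustrative assumptions.
rng = np.random.default_rng(0)

d_model, d_k = 64, 64
len_en, len_de = 10, 7

X_en = rng.standard_normal((len_en, d_model))  # encoder hidden states
X_de = rng.standard_normal((len_de, d_model))  # decoder hidden states

W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)

# Queries come from the encoder and keys from the decoder, so W_q and W_k
# project two different sequences before their information is mixed.
scores = (X_en @ W_q) @ (X_de @ W_k).T / np.sqrt(d_k)
print(scores.shape)  # (len_en, len_de)
```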