[R] Wq can be omitted in single head attention Submitted by wangyi_fudan t3_y2w87i on October 13, 2022 at 11:27 AM in MachineLearning 7 comments 17
Co0k1eGal3xy t1_is6xpmg wrote on October 13, 2022 at 7:13 PM How does the loss of the converged models compare? As far as I remember, removing parameters has an effect similar to decreasing the learning rate, so the two models can't be fairly compared during the early stages of training.