StellaAthena t1_jdi094w wrote

I just posted in response to each reviewer:

> Thank you for taking the time to review our work. We have carefully considered your comments and have provided a thorough rebuttal addressing your concerns. If you feel that your comments have been adequately addressed, we would greatly appreciate it if you could update your score to reflect that. We are also more than happy to continue this conversation over the next few days until the March 26th deadline.

I submitted several papers, all of which got borderline scores (average between 4.3 and 5.3), though one got 7 / 7 / 2 (yikes!). I had been hopeful that a strong rebuttal could nudge one of them over the line, but the longer it goes without any response or updates, the more discouraged I get.

4

StellaAthena t1_is7iss2 wrote

> The proof is even more simple: (xW_q)(xW_k)^T = x(W_q W_k^T)x^T = xWx

The problem is that W_q and W_k are not square matrices. They are d_model × d_head, so their product W = W_q W_k^T is d_model × d_model. In practice d_model >> d_head (e.g., 4096 and 256 respectively in GPT-J), so doing it your way uses far more memory and compute.
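Here's a minimal numpy sketch of the shape argument, using the GPT-J figures quoted above (d_model = 4096, d_head = 256; the sequence length here is arbitrary). Both factorings give identical attention scores, but the fused W is a full d_model × d_model matrix, so it takes far more storage and far more FLOPs to apply:

```python
import numpy as np

# Shapes from the comment above: GPT-J uses d_model=4096, d_head=256.
d_model, d_head, seq_len = 4096, 256, 16

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))

# Factored form: two (seq, d_model) @ (d_model, d_head) projections,
# then a (seq, d_head) @ (d_head, seq) product.
scores_factored = (x @ W_q) @ (x @ W_k).T

# Fused form: precompute W = W_q W_k^T, a full d_model x d_model matrix.
W = W_q @ W_k.T
scores_fused = (x @ W) @ x.T

# Mathematically identical...
assert np.allclose(scores_factored, scores_fused)

# ...but the parameter counts differ by a factor of d_model / (2 * d_head):
print("factored params:", 2 * d_model * d_head)  # 2,097,152
print("fused params:   ", d_model * d_model)     # 16,777,216 (8x more here)
```

The per-token cost tells the same story: applying W_q and W_k costs about 2 · d_model · d_head multiply-adds per token, while applying the fused W costs d_model² — an 8x blowup at these sizes, and worse as d_model grows relative to d_head.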

22