Submitted by bo_peng t3_1135aew in MachineLearning
hfnuser0000 t1_j8qoshn wrote
I am interested in the theoretical aspect of how your model works. Take transformers: you have tokens that attend to other tokens. In the case of RNNs, a piece of information can be preserved for later use, but at the cost of reducing memory capacity for other information, and once the information is lost, it's lost forever. So I think the context length of an RNN scales linearly with its memory capacity (and indirectly with the number of parameters), right?
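For concreteness, here is a minimal NumPy sketch of the storage contrast being described. The weights, sizes, and the plain tanh recurrence are toy illustrations (not RWKV's actual recurrence): an RNN squeezes the whole history into one fixed-size state, while attention keeps a per-token cache that grows with sequence length.

```python
import numpy as np

np.random.seed(0)

d = 8           # hidden/state size (the fixed "memory capacity")
seq_len = 100   # number of input tokens

# Hypothetical toy parameters, for illustration only
W_h = np.random.randn(d, d) * 0.1
W_x = np.random.randn(d, d) * 0.1
tokens = np.random.randn(seq_len, d)

# RNN: the entire history is compressed into one fixed-size vector.
# Storage is O(d) regardless of seq_len, so storing new information
# means overwriting (i.e., losing) some of the old information.
h = np.zeros(d)
for x in tokens:
    h = np.tanh(W_h @ h + W_x @ x)
print("RNN state size:", h.shape)               # (8,)   -- constant

# Transformer: every past token keeps its own key/value entry that
# later tokens can attend back to directly. Storage is O(seq_len * d).
kv_cache = tokens                                # stand-in for keys/values
print("Attention cache size:", kv_cache.shape)   # (100, 8) -- grows with length
```

In this framing, the RNN's usable context is bounded by how much can be packed into that fixed-size state, whereas attention's cost grows with the sequence instead.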