mrpogiface t1_is400t9 wrote
Reply to comment by Historical_Ad2338 in [D] Wide Attention Is The Way Forward For Transformers by SuchOccasion457
Yeah, I don't think the OP paper ran any scaling experiments, so I'm a bit sceptical it holds up long term, but it would be awesome for efficiency if it worked out.
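For anyone who hasn't read it, the basic idea is trading depth for width: one attention layer with lots of heads instead of a deep stack. A minimal sketch of the contrast with stock PyTorch (head counts and dims are mine for illustration, not the paper's configs):

```python
# Sketch: "wide" single-layer transformer (many heads) vs. a standard
# "deep" stack. Illustrative configs only, not the OP paper's setup.
import torch
import torch.nn as nn

d_model = 768

# Deep baseline: 12 layers, 12 heads each.
deep = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True),
    num_layers=12,
)

# Wide variant: a single layer with 4x the heads. Note that for a fixed
# d_model, head count doesn't change the attention parameter count (the
# in/out projections stay d_model x d_model), so this model is roughly
# 1/12 the size of the deep one -- hence the efficiency appeal.
wide = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=48, batch_first=True),
    num_layers=1,
)

x = torch.randn(2, 128, d_model)  # (batch, seq_len, d_model)
print(deep(x).shape, wide(x).shape)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(f"deep: {n_params(deep):,} params, wide: {n_params(wide):,} params")
```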
Also, it turns out that the scaling laws in the paper you linked weren't quite right either (à la Chinchilla), so who knows; maybe something was missed once you move out of the infinite-data regime.
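To make the finite-data point concrete, here's a rough sketch of the Chinchilla parametric loss fit from Hoffmann et al. 2022, using the constants as I recall them from the paper's Approach 3 (double-check before relying on them):

```python
# Chinchilla parametric fit: L(N, D) = E + A / N^alpha + B / D^beta
# Constants as reported in Hoffmann et al. 2022 (Approach 3).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for N parameters trained on D tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# GPT-3-scale (175B params, 300B tokens) vs. Chinchilla (70B, 1.4T tokens):
# the smaller model trained on more data comes out ahead, which is exactly
# the correction the earlier "infinite data" scaling laws missed.
print(chinchilla_loss(175e9, 300e9))  # ~2.00, undertrained for its size
print(chinchilla_loss(70e9, 1.4e12))  # ~1.94, lower loss at similar compute
```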