Submitted by alexnasla t3_yikumt in MachineLearning
Historical_Ad2338 t1_iujw7fv wrote
LSTMs are quite slow in practice (their recurrence prevents parallel computation over the sequence), which is one of the main reasons Transformers have taken off (besides improved performance). In an NLP setting with sequence lengths of ~1024 and models in the 100 million parameter range, a Transformer can get through an epoch about 10x faster in my experience, though it does need more memory. I'd recommend a Transformer, and if recurrence is really important, you can always use SRU++, which gives you parallelizable recurrence.
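To make the parallelism point concrete, here's a minimal sketch in PyTorch (the dimensions are illustrative, not the commenter's actual setup): the LSTM has to carry a hidden state step by step through all 1024 positions, while the Transformer encoder's self-attention processes every position at once, which is where the training speedup comes from.

```python
import torch
import torch.nn as nn

# Illustrative sizes only: batch of 8 already-embedded sequences of length 1024.
batch_size, seq_len, d_model = 8, 1024, 512
x = torch.randn(batch_size, seq_len, d_model)

# Recurrent baseline: the hidden state depends on the previous time step,
# so the sequence dimension cannot be computed in parallel.
lstm = nn.LSTM(input_size=d_model, hidden_size=d_model,
               num_layers=4, batch_first=True)
lstm_out, _ = lstm(x)

# Transformer encoder: self-attention attends over all positions at once,
# so the whole sequence is processed in parallel on the GPU.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
transformer_out = transformer(x)

print(lstm_out.shape, transformer_out.shape)  # both (8, 1024, 512)
```

Wrapping each forward pass in a simple timer (e.g. `time.perf_counter()`) on a GPU makes the gap obvious even at this toy scale; the exact speedup will depend on hardware and model size.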