entropyvsenergy t1_iyd6mw0 wrote

Transformers do well with lots of data because the transformer is an extremely flexible and generic architecture. In a fully connected neural network, each input is mapped to the next layer through a weight matrix that is fixed with respect to any particular input. Transformers instead use attention blocks, where the "effective" weight matrices are computed on the fly by the attention operation from query, key, and value vectors, and therefore depend on the inputs themselves. The upshot is that a transformer needs a lot of training data before it outperforms less flexible architectures such as LSTMs or fully connected networks.
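To make the "input-dependent weights" point concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The shapes and variable names are just illustrative, not from any particular codebase:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(x, W_q, W_k, W_v):
    # Project the input into query, key, and value vectors.
    Q = x @ W_q   # (seq_len, d_k)
    K = x @ W_k   # (seq_len, d_k)
    V = x @ W_v   # (seq_len, d_v)

    # The attention weights are computed from Q and K, so they change
    # whenever the input x changes -- unlike a fixed weight matrix.
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)   # (seq_len, seq_len)

    # The output is an input-dependent mixture of the value vectors.
    return attn @ V

# Toy example: 4 tokens with embedding dimension 8.
torch.manual_seed(0)
x = torch.randn(4, 8)
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))
out = scaled_dot_product_attention(x, W_q, W_k, W_v)
print(out.shape)  # torch.Size([4, 8])
```

Note that W_q, W_k, and W_v are still learned, fixed parameters; it's the attention matrix built from them that varies with the input, which is where the extra flexibility (and the extra data requirement) comes from.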

1

entropyvsenergy t1_iw58dge wrote

It's all frameworks now, some better than others. I haven't written one from scratch outside of demos or interviews in years. That said, I've modified neural networks a whole bunch. Usually you can just tweak parameters in a config file, but sometimes you want additional outputs or to fundamentally change the model in some way; even then it's usually a minor tweak codewise, as in the sketch below.
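As a hypothetical example of the "additional outputs" kind of tweak, here is roughly what bolting a second output head onto an existing PyTorch model might look like. The backbone, class name, and dimensions are made up for illustration:

```python
import torch
import torch.nn as nn

class BackboneWithExtraHead(nn.Module):
    """Wraps an existing model and adds a second output head."""

    def __init__(self, backbone: nn.Module, feature_dim: int, num_aux_outputs: int):
        super().__init__()
        self.backbone = backbone  # the original, possibly pretrained, model
        self.aux_head = nn.Linear(feature_dim, num_aux_outputs)  # the new output

    def forward(self, x):
        features = self.backbone(x)               # original forward pass
        return features, self.aux_head(features)  # original + extra output

# Hypothetical usage: a small MLP backbone with a 3-way auxiliary head.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
model = BackboneWithExtraHead(backbone, feature_dim=32, num_aux_outputs=3)
main_out, aux_out = model(torch.randn(2, 16))
print(main_out.shape, aux_out.shape)  # torch.Size([2, 32]) torch.Size([2, 3])
```

The config-file case is even less code: you change a hyperparameter like a hidden size or number of layers and the framework rebuilds the model for you.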

16