neuroguy123 t1_ix5gflf wrote

I agree with /u/erannare that data size is likely the most relevant issue. I have done a lot of training on time-series data with Transformers and they can be quite difficult to train from scratch on medium-sized datasets. This is even outlined in the main paper. Most people simply do not have enough data in new problem spaces to properly take advantage of Transformer models, despite their allure. My suggestions:

  • Use a hybrid model, as in the original paper. Apply some kind of ResNet, RNN, or whatever is appropriate as a front-end ('header') to the Transformer that generates the tokens for you. This acts as a learned filter bank and can shrink the problem the Transformer has to solve (see the sketch after this list).
  • A learning-rate scheduler is important (the warmup schedule from the original paper is included in the sketch below).
  • Pre-norm (applying layer norm before the attention and feed-forward blocks rather than after) will probably help.
  • Positional encoding is essential and has to be done properly. Write unit tests for it (see the test sketch after this list).
  • Maybe find similar data that you can pre-train with. There may be some decent knowledge transfer from adjacent time-series problems. There is A LOT of audio data out there.
  • The original Transformer architecture becomes very difficult to train beyond roughly 500 tokens, and you may be exceeding that. You will have to either break your data series into fewer tokens or use one of the architectures that works around that limit. I find that, in addition to the quadratic memory cost, you need even more data to train larger Transformers (not surprisingly).
  • As someone else pointed out, double- and triple-check your masking code and create unit tests for it; it's very easy to get wrong (the second test below is an example of the kind of check I mean).
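
To make the first few points concrete, here is a minimal PyTorch sketch of the kind of hybrid setup I mean: a strided Conv1d front-end that builds tokens from the raw series, sinusoidal positional encoding, a pre-norm encoder, and the warmup-then-inverse-sqrt learning-rate schedule from the original paper. All layer sizes, kernel widths, and the `ConvFrontendTransformer` name are made up for illustration.

```python
import math
import torch
import torch.nn as nn

class ConvFrontendTransformer(nn.Module):
    """Conv1d 'header' that turns the raw series into tokens,
    followed by a pre-norm Transformer encoder."""
    def __init__(self, in_channels=1, d_model=128, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        # Strided convolutions act as a learned filter bank and shrink
        # the sequence length (~4x here) before quadratic attention kicks in.
        self.frontend = nn.Sequential(
            nn.Conv1d(in_channels, d_model, kernel_size=7, stride=2, padding=3),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, stride=2, padding=3),
            nn.GELU(),
        )
        # Fixed sinusoidal positional encoding, stored as a buffer.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)  # norm_first=True -> pre-norm
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):                 # x: (batch, channels, time)
        tokens = self.frontend(x)         # (batch, d_model, ~time/4)
        tokens = tokens.transpose(1, 2)   # (batch, tokens, d_model)
        tokens = tokens + self.pe[: tokens.size(1)]
        return self.encoder(tokens)       # (batch, tokens, d_model)

# Warmup + inverse-sqrt decay, as in the original paper.
# Base lr is 1.0 so the lambda below *is* the learning rate.
model = ConvFrontendTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
warmup, d_model = 4000, 128
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: d_model ** -0.5 * min((step + 1) ** -0.5, (step + 1) * warmup ** -1.5))
```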
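And for the positional-encoding and masking points, here is the kind of toy unit test I mean; the test names, sizes, and thresholds are just illustrative.

```python
import math
import torch
import torch.nn as nn

def test_sinusoidal_pe_positions_are_distinct():
    # Every position up to max_len should get a clearly distinct encoding.
    d_model, max_len = 128, 512
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    dists = torch.cdist(pe, pe)                  # pairwise distances between positions
    dists = dists + torch.eye(max_len) * 1e9     # ignore the diagonal
    assert dists.min() > 1e-3, "two positions got (nearly) the same encoding"

def test_causal_mask_blocks_future_positions():
    # With a causal mask, the output at step t must not change when
    # inputs at steps > t are perturbed.
    torch.manual_seed(0)
    layer = nn.TransformerEncoderLayer(d_model=16, nhead=2,
                                       batch_first=True, norm_first=True)
    layer.eval()                                 # turn off dropout
    sz = 10
    mask = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
    x = torch.randn(1, sz, 16)
    y1 = layer(x, src_mask=mask)
    x2 = x.clone()
    x2[:, 5:] += 100.0                           # perturb only the future half
    y2 = layer(x2, src_mask=mask)
    assert torch.allclose(y1[:, :5], y2[:, :5], atol=1e-5), "mask leaks future info"

test_sinusoidal_pe_positions_are_distinct()
test_causal_mask_blocks_future_positions()
```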

All of that being said, benchmark against more specialized architectures if you have long time-series data. It takes a lot of data and a lot of compute to fully take advantage of a Transformer on its own as an end-to-end architecture. RNNs and WaveNet-style models are still relevant, whether your network is autoregressive, a classifier, or both.

neuroguy123 t1_ispbbs0 wrote

I have been a long-time user of Colab Pro+ and that might be true, but I have noticed a significant difference in the quality of GPU I get now. You can now specify a 'premium' GPU vs. a standard one. I can get an A100 (40 GB) any time I want right now, whereas before that was rare. For 100 credits you can train for about 6.5 hours on one. Anyway, I think you're right that they just give you more transparency, but you also get more choice in how aggressively you want to spend the credits. I suspect that the A100s are more available now after these changes.

When I want to use Colab, what I do now is develop on my desktop machine until I have everything working properly and then run training on Colab when I need more compute.

neuroguy123 t1_irtuo1b wrote

I recommend some of the YOLO versions. I had fun with those and learned a lot about complex loss functions.

I also implemented a bunch of attention models, starting with Graves', then Bahdanau and Luong, and then Transformers. The history of attention in deep learning is very interesting and instructive to implement.

Another one I had fun implementing was WaveNet, as it really forces you to dig into convolution variations, PixelCNN, and some gated network structures. Conditioning it was an extra challenge (similar to the attention networks).

One thing I've been meaning to get into is DeepCut and other pose-estimation models, because I don't know much about linear programming and the other math they use in those.
