trendymoniker

trendymoniker t1_j0acn6e wrote

👆

1e6:1 is extreme. 1e3:1 is often realistic (think views to shares on social media). 18:1 is actually a pretty good real-world ratio.

If it were me, I’d just change the weights for each class in the loss function to get them more or less equal.
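
A minimal sketch of what that looks like in PyTorch, assuming a binary problem at roughly 18:1 imbalance (the inverse-frequency weighting here is one common choice, not the only one):

```python
import torch
import torch.nn as nn

# Hypothetical class counts for an ~18:1 binary problem.
class_counts = torch.tensor([180_000.0, 10_000.0])

# Inverse-frequency weights, normalized so they average to 1.
weights = class_counts.sum() / (len(class_counts) * class_counts)

# CrossEntropyLoss multiplies each example's loss by its class weight.
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)            # model outputs for a batch of 8
targets = torch.randint(0, 2, (8,))   # ground-truth class indices
loss = criterion(logits, targets)
```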

190m examples isn’t that many either — don’t worry about it. Compute is cheap — it’s ok if it takes more than one machine and/or more time.

9

trendymoniker t1_izckgr1 wrote

Don’t forget Latent Dirichlet Allocation by Blei, Ng, and Jordan (2003). Deep learning has since far surpassed probabilistic models thanks to its scalability, but they were ascendant throughout the 2000s, and LDA was probably the most impressively complex yet practical of the lot.

Plus: 45,000 citations, earned mostly in the era before machine learning was everywhere and everything.

2

trendymoniker t1_ivf84sd wrote

The easy answer is compact or distilled models like EfficientNet or DistilBERT. You can also get an intuition for the process by taking a small, easy dataset — like MNIST or CIFAR — and running a big hyperparameter search over models. There will be small models that perform close to the best ones.
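
If you want to roll your own rather than use an off-the-shelf distilled model, the core of knowledge distillation is just a soft-target loss. A rough PyTorch sketch, assuming `teacher` and `student` are already-defined classifiers over the same label set:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend the usual hard-label loss with a KL term that pushes the
    student's softened distribution toward the teacher's."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # temperature-squared scaling, as in Hinton et al.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

# Usage inside a training step (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
```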

These days nobody uses ResNet or Inception, but there was a time when they were the bleeding edge. Now it’s all smaller, more precise stuff.

The other dimension where you can win over big models is hardcoding in your priors.
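
A toy illustration of that idea (the specifics are made up for the example): if you know your signal has a daily cycle, encode it directly as features instead of making a large model learn it from raw timestamps.

```python
import numpy as np

def time_features(timestamps_sec):
    """Encode a known daily periodicity as sin/cos features,
    baking the prior into the inputs rather than the model."""
    day = 24 * 3600
    phase = 2 * np.pi * (np.asarray(timestamps_sec) % day) / day
    return np.stack([np.sin(phase), np.cos(phase)], axis=-1)
```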

11