KD_A

KD_A OP t1_jegxas8 wrote

Interesting, and I think I know what you mean. One naive idea is a "top-k tokens" system. This system considers the top k highest probability tokens (conditional on previous ones) for each completion token, and for each completion. And then take the sum of the average likelihoods across all k^n (n = # completion tokens) paths for each completion. That would be one way to address this synonym problem. But ofc it results in way more computation.

Edit: actually, thinking a bit more, I think the synonym problem is more-or-less a non-issue for LMs trained to do next-token prediction.

2

KD_A OP t1_jegsqe6 wrote

That's a good criticism. I'd guess that this issue is quite problem-dependent. And I'd hope that an LM is good enough to discriminate between the correct-but-many-synonyms class and the wrong-but-few-synonyms class. (We're using the word synonym, but we really mean "high probability token path given prompt".) It's hard for me to come up with examples where this problem arises in a real classification task. But they may be out there.

2

KD_A OP t1_jeghvnn wrote

Yeah I was surprised that this wasn't already coded up--it's been 3 years since we've found out that sampling from GPT-3 is a good zero-shot text classifier.

While benchmarking this method on the infamous Winograd Schema Challenge, I ended up finding a 2018 paper^1 w/ pretty much the same idea as CAPPr. The only difference is that CAPPr typically transposes that probability, and it naively incorporates a prior.

  1. Trinh, Trieu H., and Quoc V. Le. “A simple method for commonsense reasoning.” arXiv preprint arXiv:1806.02847 (2018).
3

KD_A OP t1_jegfh7i wrote

Great question! I have no idea lol.

More seriously, it depends on what you mean by "compare". CAPPr w/ powerful GPT-3+ models is likely gonna be more accurate. But you need to pay to hit OpenAI endpoints, so it's not a fair comparison IMO.

If you can't pay to hit OpenAI endpoints, then a fairer comparison would be CAPPr + GPT-2—specifically, the smallest one in HuggingFace, or whatever's closest in inference speed to something like bart-large-mnli. But then another issue which pops up is that GPT-2 was not explicitly trained on the NLI/MNLI task in the same way bart-large-mnli was. So I'd need to finetune GPT-2 (small) on MNLI to make a fairer comparison.

If I had a bunch of compute and time, I'd like to benchmark (or find benchmarks) for the following text classification approaches, varying the amount of training data if feasible, and ideally on tasks which are more realistic than SuperGLUE:

  • similarity embeddings
    • S-BERT
    • GPT-3+ (they claim their ada model is quite good)
  • sampling
  • MNLI-trained models
  • CAPPr
1

KD_A t1_jbf175s wrote

> Do you think data augmentation should also be disabled in that test?

Yes. I've never actually experimented w/ stuff like image augmentation. But in most examples I looked up, augmentation is a training-only computation which may make training loss look higher than it actually is. In general the rule is just this: to unbiasedly estimate training loss, apply the exact same code you're using to estimate validation loss to training data.

2

KD_A t1_jbdi9x4 wrote

Notice that "significantly lower" can't actually be defined. There isn't a useful definition of overfitting which only requires observing a single model's train and test gap.^1 Here's a contrived but illustrative example: say model A has a training error rate of 0.10 and test error rate of 0.30. It's tempting to think "test error is 3x train error, we're overfitting". This may or may not be right; there absolutely could be a (more complex) model B with, e.g., training error rate 0.05, test error rate 0.27. Notice that the train-test gap increased going from A to B. But I don't care. Assuming these estimates are good, and all I care about is minimizing expected error rate, I'd confidently deploy model B over model A.

The useful definition of overfitting is that it refers to a region in function space where test error goes up and training error goes down (as model complexity goes up). Diagram (for underparametrized models). This definition tells us that the only good way to tell whether a model should be made more simple or more complex is to fit multiple models and compare them. This info is expensive to obtain for NNs, and obtaining it makes one look less clever. But it gives a reliable hint as to how a model should be iterated.^2 In the example above, if we really did observe that model B, then perhaps our next one should be even more complex.

If you're asking more specifically about reading NN loss curves, I haven't seen any science which puts claims like #4 here to the test.^3 I'd also like to mention another common issue w/ reading NN loss curves: people usually don't take care in estimating training loss. The standard NN training loop results in overestimates, which will make the gap between training and validation appear bigger than it actually is. I happened to write about this problem today, here in CrossValidated.


Footnotes

  1. 2 exceptions to this: (1) you're already quite familiar w/ the type of task and data, so you can correlate high gaps with overfitting based on previous experience (2) test error is higher than an intercept-only model, and training error is much lower.

  2. Double descent complicates this workflow. For overparametrized models like NNs, one can be deceived into not going far enough when increasing model complexity. Or it's difficult to determine whether a certain intervention is actually increasing or decreasing complexity. This paper characterizes various forms of double descent.

  3. My answer to #4 would be the same as what I wrote when criticizing the caption in my first comment. The provided answer—"reduce model capacity"—is too vague. The answer should be: select the model checkpoint from halfway, simply b/c its test error is the lowest. The graph alone doesn't tell you anything about how the model should be iterated beyond that info. That's b/c each point on the curve is the loss after being trained for n iterations, conditional on all of the other factors which modulate the model's complexity. There could absolutely be a model w/ more depth, more width, etc. which performs better than the simpler model trained halfway.

2

KD_A t1_jbb5kx5 wrote

The section "Check if your model is overfitting" could be improved.

> The model is overfitting (high variance) when it has low error on the training set but high error on the test set.

A big gap between training and validation error does not imply that it is overfitting. In general, an absolute gap between training and validation errors does not tell you how validation error will change if a model is made more complex or more simple. To answer questions about overfitting and underfitting, one needs to train multiple models and compare their training and validation errors.

> Overfitting and underfitting is easy to detect by visualizing loss curves during training.

nit: this caption is phrased too liberally, as the graph only answers this question: given this model architecture, optimizer, and dataset, which model epoch/checkpoint should I select? It does not tell you about any other factors which modulate model complexity.

> This often means that the training set is not representative of the domain it is supposed to run in.

I wouldn't call this a variance issue per se. If it were a variance issue, sampling more data from the training distribution should significantly lower validation error. If the training distribution is biased, sampling more of it will not help a whole lot.

That all being said, I share your passion for greater standardization of ML workflow. And I agree that there needs to be more work on diagnosing problems, and less "throwing stuff at the wall". To add something, I now typically run learning curves. They can cost quite a bit when training big NNs. But even a low-resolution curve can give a short term answer to an important question: how much should I expect this model to improve if I train it on n more observations? And assuming you have a decent sense of your model's capacity, this question is closely related to another common one: should I prioritize collecting more data, or should I make a modeling intervention? Learning curves have motivated big improvements in my experience.

5