ObjectManagerManager t1_j60y1rn wrote

OpenAI's LLM is special because it's open to the public. That's it. Other tech companies' internal LLMs are likely better. Google has a database of billions of websites and their indexes directly at their disposal; I'm quite confident that they could outperform ChatGPT with ease. If Google were really afraid of ChatGPT running them out of business, they'd just release a public API for their own, better model. And they have a near-monopoly on the internet in terms of raw data and R&D; it would be virtually impossible for anyone else to compete.

Besides that, the whole "Google killer" thing is an overreaction, IMO. The public API for ChatGPT doesn't retrain or even prompt-condition on new public internet data, so if you ask it about recent news, it'll spit out utter garbage. An internal version reportedly does seek out and retrain on new public internet data. But how does it find that data? With a neat tool that constantly crawls the web and builds large, efficient databases and indexes. Oh yeah: that's called a search engine.

So even if end users start using LLMs as a substitute for search engines (which is generally not happening at the moment, and it seems unlikely to be a concern in the age of GPT-3, despite what many people believe), most LLM queries will likely be forwarded to some search engine or another for prompt conditioning. Search engines will not die---they'll just have to adapt to be useful for LLM prompt conditioning in addition to being useful to end users.
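
To make the "forwarded for prompt conditioning" idea concrete, here's a toy sketch; `web_search` and `generate` are hypothetical stand-ins, not real APIs:

```python
# Rough sketch of search-backed prompt conditioning. Both `web_search` and
# `generate` are hypothetical stand-ins for a search API and an LLM API.
def answer_with_search(question: str, web_search, generate) -> str:
    # 1. Forward the user's query to a search engine and keep the top hits.
    snippets = web_search(question, top_k=3)

    # 2. Condition the LLM prompt on those fresh snippets.
    context = "\n".join(snippets)
    prompt = f"Using the sources below, answer the question.\n{context}\n\nQ: {question}\nA:"

    # 3. The LLM only fills in the answer; the freshness comes from the search index.
    return generate(prompt)
```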

17

ObjectManagerManager t1_j27h8xa wrote

Temperature scaling (the single-parameter, multiclass cousin of Platt scaling) works well, and it's very simple. Yes, you fit it post-training, usually on a held-out calibration set. Some people then retrain on all of the data and reuse the learned temperature, but that doesn't always work out as well as you'd hope.
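
For reference, a minimal PyTorch sketch of fitting the temperature, assuming you've already collected logits and labels on the calibration split (the names here are mine):

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=50):
    """Fit a single temperature T > 0 on held-out (logits, labels) by minimizing NLL."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage (given logits/labels saved from a held-out calibration split):
# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = F.softmax(test_logits / T, dim=1)
```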

FTR, "multiclass classification" means each instance belongs to exactly one of many classes. When each label can be 0 / 1 irrespective of the other labels, it's referred to as "multilabel classification".

3

ObjectManagerManager t1_iz5xous wrote

(Confession: I haven't read the paper yet.) I have a couple of questions:

  1. If each layer has its own objective function, couldn't you train the layers greedily, one at a time from front to back? e.g., train the first layer to convergence, then train the second layer, and so on. I doubt this would be faster than end-to-end training, but a) during end-to-end training the early layers are constantly changing the representations fed to the later layers anyway, so it probably wouldn't be too much slower, and b) it would use significantly less memory (e.g., if you pre-compute the inputs to a layer just before you begin training it, you could imagine training an arbitrarily deep model within a fixed memory budget).
  2. What's the motivation behind "goodness"? Suppose we're talking about classification. Why doesn't each layer just minimize cross entropy? I guess that'd require each layer to have its own flatten + linear projection layers. But then you wouldn't have to concatenate the label with the input data, so inference complexity would be (mostly) independent of the number of classes. Thinking of a typical CNN, a layer could be organized as follows (a rough code sketch appears after these questions):
    1. Batch norm
    2. Activation (e.g., ReLU)
    3. Convolution (the output of which is fed into the next layer)
    4. Pooling
    5. Flatten
    6. Linear projection
    7. Cross entropy loss

Can anyone (who has read the paper) answer these questions?
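
Since I haven't read the paper, the following is purely my own sketch of what I mean in #2 (plus the greedy loop from #1), not anything from the paper; the layer sizes and helper names are made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallySupervisedBlock(nn.Module):
    """One CNN block with its own cross-entropy head (steps 1-7 from the list above)."""
    def __init__(self, in_ch, out_ch, num_classes):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(out_ch, num_classes)

    def forward(self, x):
        # Batch norm -> activation -> convolution: these features feed the next block.
        features = self.conv(F.relu(self.bn(x)))
        # Pooling -> flatten -> linear projection: local head used only for this block's loss.
        logits = self.classifier(self.pool(features).flatten(1))
        return features, logits

def train_block(block, batches, epochs=1, lr=1e-3):
    """Train a single block to (approximate) convergence with its own CE loss."""
    opt = torch.optim.Adam(block.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in batches:
            _, logits = block(x)
            loss = F.cross_entropy(logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()

def precompute_inputs(block, batches):
    """Replace each batch's inputs with the trained block's detached features,
    so the next block trains with only its own activations in memory."""
    with torch.no_grad():
        return [(block(x)[0], y) for x, y in batches]

# Greedy front-to-back training (question 1): train block 1 on the raw images,
# precompute its outputs, train block 2 on those, and so on.
```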

2

ObjectManagerManager t1_ivwf1f0 wrote

Alternatively, feed the data source in as an output rather than an input; i.e., have your model produce two outputs. For data sourced from dataset A, minimize the loss against the first output; for data sourced from dataset B, minimize it against the second.

I don't remember who, but someone wrote a thesis on how it often works better in practice to incorporate auxiliary information in the form of outputs rather than inputs. It's also a very clean solution, since you can usually just remove the unnecessary output heads after training, which might decrease your model size for inference (albeit by a small amount, unless you have a lot of auxiliary information).
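
A rough PyTorch sketch of the idea (the layer sizes, names, and the assumption of two regression-style datasets are all mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadModel(nn.Module):
    """Shared trunk with one output head per data source (layer sizes are made up)."""
    def __init__(self, in_dim=16, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, 1)  # supervised only by dataset A examples
        self.head_b = nn.Linear(hidden, 1)  # supervised only by dataset B examples

    def forward(self, x):
        h = self.trunk(x)
        return self.head_a(h), self.head_b(h)

def batch_loss(model, x, y, source):
    """`source` is 0 for dataset A and 1 for dataset B (shape [N]); `y` has shape [N, 1].
    Each example only incurs loss through the head that matches its source."""
    pred_a, pred_b = model(x)
    pred = torch.where(source.unsqueeze(1) == 0, pred_a, pred_b)
    return F.mse_loss(pred, y)

# After training, the head you don't need at inference time can simply be dropped.
```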

1

ObjectManagerManager t1_ivwdd0k wrote

No. Your model can do whatever it wants with input features. It's not going to just "choose" to treat this new column as a loss weight. Loss weighting requires a specific computation.

If you're training a neural network or something similar, you'd normally average the loss across every example in a batch, and then you'd backpropagate that averaged loss. With loss weighting, you compute a weighted average loss across the batch. In this case, you'd assign larger weights to the more "reliable" data points.
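
For example, a minimal PyTorch sketch of that weighted average (the function and variable names are mine; `weights` would come from your reliability column, not from the model's inputs):

```python
import torch.nn.functional as F

def weighted_batch_loss(logits, targets, weights):
    # Per-example cross-entropy, combined as a weighted (rather than plain) average.
    per_example = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_example).sum() / weights.sum()
```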

Sample weighting is different, and it can be done with virtually any ML model. It involves weighting the likelihood of sampling each data point. For "full-batch models", you can generate bootstrap samples with the weighted sampling. For "batched" models (e.g., neural networks trained via batched gradient descent), you can use weighted sampling for each batch.
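
In PyTorch, for instance, per-batch weighted sampling is handled by `WeightedRandomSampler`; a toy sketch with made-up data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in data; in practice `weights` is your per-example reliability score.
X, y = torch.randn(100, 8), torch.randint(0, 2, (100,))
weights = torch.rand(100)

# Batches are drawn with probability proportional to the weights (with replacement).
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=16, sampler=sampler)
```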

Most modern ML packages have built-in interfaces for both of these, so there's no need to reinvent the wheel here.

2

ObjectManagerManager t1_ivhzsot wrote

There are diminishing returns on data. It's difficult to get truly new data when you already have billions of data points, and it's difficult to improve a model when it's already very good.

So, like Moore's law, it'll probably slow down eventually. At that point, most significant developments will be a result of improving model efficiency rather than just making them bigger.

Not to mention, models are made more efficient all the time. Sure, DALL-E 2 is huge. But first off, it's smaller than DALL-E. And second, if you compare a model of a fixed size today to a model of the same size from just a couple of years ago, today's model will still win out by a significant margin. Heck, you can definitely train a decent ImageNet1K model on a hobby ML PC (e.g., an RTX graphics card, or even something cheaper if you have enough days to spare on a small learning rate and batch size). And inference takes much less time and memory than training, since you can usually fix the batch size to 1 and you don't have to store a computational graph for a backward pass. A decade ago, this would have been much more difficult.
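
(To illustrate the inference point, a tiny PyTorch sketch with a made-up model: batch size 1 plus `torch.no_grad()`, so no graph is kept around for a backward pass.)

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2048, 4096), nn.ReLU(), nn.Linear(4096, 1000))  # made-up model
model.eval()

x = torch.randn(1, 2048)       # batch size fixed to 1 at inference time
with torch.no_grad():          # no computational graph is stored for a backward pass
    probs = model(x).softmax(dim=-1)
```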

1

ObjectManagerManager t1_ivhwb6m wrote

Given unlimited data, models are at least as good as humans at every task. All you'd need is a dictionary (a lookup table mapping every possible input to its empirical label distribution), and you could perfectly recover the target distribution.

Where humans excel is learning with a relatively small amount of data. But presumably that's just because we're able to transfer knowledge from other, related tasks. Some models can do that too, but not nearly as well. Either way, that invalidates the comparison since the data isn't fixed anymore.

4