ObjectManagerManager t1_j27i9n5 wrote
Reply to comment by arcxtriy in [D] SOTA Multiclass Model Calibration by arcxtriy
Actually, you're completely right. SOTA in open set recognition is still max logit / max softmax, which is to say that the maximum softmax probability is a useful measure of certainty.
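In PyTorch terms, that baseline is just something like this (my own minimal sketch, with random logits standing in for a real model):

```python
import torch
import torch.nn.functional as F

# logits: (batch_size, num_classes) from any trained classifier; random here for illustration
logits = torch.randn(8, 10)

# Max softmax probability (MSP): higher = more confident the input is a known class
msp = F.softmax(logits, dim=-1).max(dim=-1).values

# Max logit: same idea without the softmax
max_logit = logits.max(dim=-1).values

# Flag low-confidence inputs as "unknown"; the threshold would be tuned on held-out data
threshold = 0.5  # placeholder
is_unknown = msp < threshold
```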
ObjectManagerManager t1_j27h8xa wrote
Reply to [D] SOTA Multiclass Model Calibration by arcxtriy
Platt ("temperature") scaling works well, and it's very simple. Yes, you do it post-training, usually on a held-out calibration set. Some people will then retrain on all of the data and reuse the learned temperature, but that doesn't always work out as well as you want it to.
FTR, "multiclass classification" means each instance belongs to exactly one of many classes. When each label can be 0 / 1 irrespective of the other labels, it's referred to as "multilabel classification".
ObjectManagerManager t1_iz5xous wrote
Reply to [R] The Forward-Forward Algorithm: Some Preliminary Investigations [Geoffrey Hinton] by shitboots
(Confession: I haven't read the paper yet.) I have a couple of questions:
- If each layer has its own objective function, couldn't you train the layers one at a time, front to back? e.g., train the first layer to convergence, then train the second layer, and so on. I doubt this would be faster than training end-to-end, but a) as the early layers adapt, they screw up the representations being fed to the later layers anyway, so it probably wouldn't be too much slower, and b) it would use significantly less memory (e.g., if you pre-compute the inputs to a layer just before you begin training it, you could imagine training an arbitrarily deep model with a finite amount of memory).
- What's the motivation behind "goodness"? Suppose we're talking about classification. Why doesn't each layer just minimize cross-entropy? I guess that'd require each layer to have its own flatten + linear projection layers. But then you wouldn't have to concatenate the label with the input data, so inference complexity would be (mostly) independent of the number of classes. Thinking of a typical CNN, a layer could be organized like this (rough sketch after the list):
- Batch norm
- Activation (e.g., ReLU)
- Convolution (the output of which is fed into the next layer)
- Pooling
- Flatten
- Linear projection
- Cross entropy loss
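Here's the rough sketch I mentioned: a per-layer block with its own classification head and a purely local cross-entropy loss (my own PyTorch construction, not something from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalCELayer(nn.Module):
    """One conv block with its own classification head and a purely local cross-entropy loss."""
    def __init__(self, in_ch, out_ch, num_classes):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        )
        # Pooling + flatten + linear projection, used only to compute this layer's loss
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(out_ch, num_classes),
        )

    def forward(self, x, y=None):
        h = self.block(x)
        loss = F.cross_entropy(self.head(h), y) if y is not None else None
        # Detach before handing off to the next layer so gradients stay local;
        # this is also what would let you train layers one at a time, front to back.
        return h.detach(), loss

# e.g.:
# layer = LocalCELayer(3, 16, num_classes=10)
# h, loss = layer(torch.randn(2, 3, 32, 32), torch.tensor([1, 7]))
```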
Can anyone (who has read the paper) answer these questions?
ObjectManagerManager t1_ivwf1f0 wrote
Reply to comment by LurkAroundLurkAround in [Discussion] Can we train with multiple sources of data, some very reliable, others less so? by DreamyPen
Alternatively, feed the data source in as an output; i.e., have your model output two values. For data sourced from dataset A, minimize the loss against the first output. For data sourced from dataset B, minimize it against the second output.
I don't remember who, but someone wrote a thesis on how it often works better in practice to incorporate additional / auxiliary information in the form of outputs rather than inputs. It's also a very clean solution, since you can usually just remove the unnecessary output heads after training, which might shrink your model for inference (albeit by a small amount, unless you have a lot of auxiliary information).
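Here's a rough PyTorch sketch of what I mean (the class and function names are mine, and the "reliable vs. less reliable" split is just for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadModel(nn.Module):
    """Shared backbone with one output head per data source."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.head_a = nn.Linear(feat_dim, num_classes)  # head for dataset A (reliable)
        self.head_b = nn.Linear(feat_dim, num_classes)  # head for dataset B (less reliable)

    def forward(self, x):
        z = self.backbone(x)
        return self.head_a(z), self.head_b(z)

def loss_fn(model, x, y, source):
    """source: 0 for dataset A, 1 for dataset B; each example's loss uses its own head."""
    out_a, out_b = model(x)
    logits = torch.where(source.unsqueeze(1) == 0, out_a, out_b)
    return F.cross_entropy(logits, y)

# e.g.:
# backbone = nn.Sequential(nn.Linear(20, 32), nn.ReLU())
# model = TwoHeadModel(backbone, feat_dim=32, num_classes=5)
# loss = loss_fn(model, torch.randn(8, 20), torch.randint(0, 5, (8,)), torch.randint(0, 2, (8,)))
# At inference time you'd keep only the head you care about (e.g., head_a) and drop the other.
```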
ObjectManagerManager t1_ivwdd0k wrote
Reply to comment by DreamyPen in [Discussion] Can we train with multiple sources of data, some very reliable, others less so? by DreamyPen
No. Your model can do whatever it wants with input features. It's not going to just "choose" to treat this new column as a loss weight. Loss weighting requires a specific computation.
If you're training a neural network or something similar, you'd normally average the loss across every example in a batch, and then you'd backpropagate that averaged loss. With loss weighting, you compute a weighted average loss across the batch. In this case, you'd assign larger weights to the more "reliable" data points.
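For example, in PyTorch it's just this (dummy tensors, and the 1.0 / 0.3 weights are arbitrary placeholders):

```python
import torch
import torch.nn.functional as F

# Dummy batch: logits from some model, integer class targets, and per-example weights
logits = torch.randn(4, 3, requires_grad=True)
targets = torch.tensor([0, 2, 1, 1])
weights = torch.tensor([1.0, 1.0, 0.3, 0.3])  # e.g., 1.0 for the reliable source, 0.3 for the noisy one

# Per-example losses, then a weighted average over the batch
losses = F.cross_entropy(logits, targets, reduction="none")  # shape: (4,)
loss = (weights * losses).sum() / weights.sum()
loss.backward()
```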
Sample weighting is different, and it can be done with virtually any ML model. It involves weighting the likelihood of sampling each data point. For "full-batch models", you can generate bootstrap samples with the weighted sampling. For "batched" models (e.g., neural networks trained via batched gradient descent), you can use weighted sampling for each batch.
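In PyTorch, for instance, batched weighted sampling looks roughly like this (dummy data; the weights are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Dummy dataset: 100 examples; pretend the first 50 come from the reliable source
x, y = torch.randn(100, 8), torch.randint(0, 3, (100,))
dataset = TensorDataset(x, y)
weights = torch.cat([torch.full((50,), 1.0), torch.full((50,), 0.3)])  # placeholder weights

# Each batch is drawn (with replacement) in proportion to the weights
sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

for xb, yb in loader:
    pass  # train on the batch as usual
```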
Most modern ML packages have built-in interfaces for both of these, so there's no need to reinvent the wheel here.
ObjectManagerManager t1_ivhzsot wrote
Reply to [D] Do you think there is a competitive future for smaller, locally trained/served models? by naequs
There are diminishing returns on data. It's difficult to get truly new data when you already have billions of data points, and it's difficult to improve a model when it's already very good.
So, like Moore's law, it'll probably slow down eventually. At that point, most significant developments will be a result of improving model efficiency rather than just making them bigger.
Not to mention, models are made more efficient all the time. Sure, DALL-E 2 is huge. But first off, it's smaller than DALL-E. And second, if you compare a model of a fixed size today to a model of the same size from just a couple of years ago, today's model will still win out by a significant margin. Heck, you can definitely train a decent ImageNet-1K model on a hobby ML PC (e.g., an RTX graphics card, or even something cheaper if you have enough days to spare on a small learning rate and batch size). And inference takes much less time and memory than training, since you can usually fix the batch size to 1 and you don't have to store a computational graph for a backward pass. A decade ago, this would have been much more difficult.
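In PyTorch terms, the "no computational graph" part is just something like this (toy model purely for illustration):

```python
import torch

model = torch.nn.Linear(512, 10)  # stand-in for any trained model
x = torch.randn(1, 512)           # batch size of 1

model.eval()
with torch.inference_mode():  # nothing is recorded for backward, so activations are freed immediately
    preds = model(x)
```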
ObjectManagerManager t1_ivhwb6m wrote
Reply to [D] At what tasks are models better than humans given the same amount of data? by billjames1685
Given unlimited data, models are at least as good as humans at every task. All you'd need is a dictionary, and you could perfectly recover the target distribution.
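And I mean "dictionary" almost literally; a toy sketch of the idea:

```python
from collections import Counter, defaultdict

# With unlimited data you see every input alongside its full label distribution,
# so simply memorizing counts recovers p(y | x) exactly.
counts = defaultdict(Counter)
for x, y in [("cat photo", "cat"), ("cat photo", "cat"), ("dog photo", "dog")]:
    counts[x][y] += 1

def predict(x):
    # Bayes-optimal under 0-1 loss, provided x has been seen (which unlimited data guarantees)
    return counts[x].most_common(1)[0][0]

print(predict("cat photo"))  # "cat"
```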
Where humans excel is learning with a relatively small amount of data. But presumably that's just because we're able to transfer knowledge from other, related tasks. Some models can do that too, but not nearly as well. Either way, that invalidates the comparison since the data isn't fixed anymore.
ObjectManagerManager t1_j60y1rn wrote
Reply to Few questions about scalability of chatGPT [D] by besabestin
OpenAI's LLM is special because it's open to the public. That's it. Other tech companies' internal LLMs are likely better. Google has a database of billions of websites, and the indexes over them, directly at its disposal; I'm quite confident they could outperform ChatGPT with ease. If Google were really afraid of ChatGPT running them out of business, they'd just release a public API for their own, better model. And they have a monopoly over the internet in terms of raw data and R&D; it would be virtually impossible for anyone else to compete.
Besides that, the whole "Google killer" thing is an overreaction, IMO. The public API for ChatGPT doesn't retrain or even prompt-condition on new public internet data, so if you ask it about recent news, it'll spit out utter garbage. An internal version reportedly does seek out and retrain on new public internet data. But how does it find that data? With a neat tool that constantly crawls the web and builds large, efficient databases and indexes. Oh yeah---that's called a search engine.
So even if end users start using LLMs as a substitute for search engines (which is generally not happening at the moment, and seems unlikely to be a concern in the age of GPT-3, despite what many people believe), most LLM queries will likely be forwarded to one search engine or another for prompt conditioning. Search engines will not die---they'll just have to adapt to be useful for LLM prompt conditioning in addition to being useful to end users.
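Concretely, the loop I'm imagining looks something like this (pure sketch; `search` and `llm` are hypothetical stand-in callables, not real APIs):

```python
def answer_with_prompt_conditioning(query, search, llm, k=5):
    """Hypothetical sketch: the search engine feeds the LLM rather than being replaced by it."""
    results = search(query, top_k=k)                   # ordinary web search / index lookup
    context = "\n\n".join(r.snippet for r in results)  # assume each result carries a text snippet
    prompt = (
        "Answer the question using the sources below.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm(prompt)
```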