
gBoostedMachinations t1_j17a22i wrote

  1. Set aside a validation set
  2. Use the rest of the data to train two models: one trained with the duplicate rows kept as-is, and one trained on rows collapsed to their pre-computed mean targets.
  3. Compare performance on the validation set.

Don’t put much weight on other people’s intuitions about these kinds of questions. Just test it. Your question is an empirical one, so just do the experiment. I can’t tell you how many times a colleague has told me that something I was trying wasn’t going to work, only to turn out dead wrong when I tested it anyway. Oh man do I love it when that happens.
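
A minimal sketch of steps 1–3 below (the synthetic data, column names, and choice of GradientBoostingRegressor are just stand-ins for whatever OP is actually working with):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for OP's data: repeated feature rows with noisy targets.
rng = np.random.default_rng(0)
base = pd.DataFrame(rng.normal(size=(300, 3)), columns=["x1", "x2", "x3"])
df = base.loc[base.index.repeat(rng.integers(1, 4, size=len(base)))].reset_index(drop=True)
df["target"] = df[["x1", "x2", "x3"]].sum(axis=1) + rng.normal(scale=0.5, size=len(df))

features = ["x1", "x2", "x3"]
train, valid = train_test_split(df, test_size=0.2, random_state=0)

# Model A: train on the raw rows, duplicates and all.
model_dup = GradientBoostingRegressor().fit(train[features], train["target"])

# Model B: collapse duplicated feature rows into one row with the mean target.
train_mean = train.groupby(features, as_index=False)["target"].mean()
model_mean = GradientBoostingRegressor().fit(train_mean[features], train_mean["target"])

# Compare both on the same validation set (see the EDIT below about how the
# validation targets themselves should be handled).
for name, model in [("duplicates kept", model_dup), ("pre-computed means", model_mean)]:
    preds = model.predict(valid[features])
    print(name, mean_absolute_error(valid["target"], preds))
```

Whichever version scores better on the held-out set is the one to trust, intuitions aside.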

EDIT: it just occurred to me that validation will be somewhat tricky. Does OP allow (non-overlapping) duplicates to remain in the validation set, or does he calculate the averages for the targets there too? He can’t process the validation set differently for each model, yet whichever single method he picks will clearly favor one of the models.

I think the answer to the question depends on how data about future targets will be collected. Is OP going to perform repeated experiments in the future and take repeated measurements of the outcome, or is he only going to perform unique sets of experiments? Whatever the answer, the important thing is for OP to consider the future use-case and process his validation set in a way that most closely mimics that environment (e.g., repeated measurements vs single measurements).

Sorry if this isn’t very clear; I only had a few minutes to type it out.

105

gBoostedMachinations t1_j16pzea wrote

If there’s one thing I’ve learned about Reddit, it’s that you can make the most uncontroversial comment of the year and still get downvoted. I mean, I got banned from r/coronavirus for pointing out that people who recover from covid probably have at least a little tiny bit of immunity to re-infection.

After covid, I’ve learned to completely ignore my comment scores when it comes to feedback on Reddit. The only way to know if one of my comments is valued is to read the replies.

7

gBoostedMachinations t1_j155zas wrote

Training is what takes so much computation in almost all cases. Once the model itself is trained, only a tiny fraction of that compute is needed. Most trained ML models that ship today can generate predictions on a Raspberry Pi or a cell phone. LLMs still require more hardware for inference, but you’d be surprised how little they need compared to what’s needed for training.
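
For a rough sense of the asymmetry, here’s a toy timing comparison (the dataset and gradient-boosted model are arbitrary stand-ins; for LLMs both numbers are far larger, but the gap points the same way):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Arbitrary toy dataset and model, just to show the training/inference asymmetry.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

t0 = time.perf_counter()
model = GradientBoostingClassifier().fit(X, y)   # training: the expensive part
t1 = time.perf_counter()
model.predict(X)                                 # inference: the cheap part
t2 = time.perf_counter()

print(f"train:   {t1 - t0:.1f}s")
print(f"predict: {t2 - t1:.1f}s")
```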

8

gBoostedMachinations t1_j102w53 wrote

Sure, but I have never had any problem separating the wheat from the chaff. I can read them myself and decide whether the work is done well. Often the authors can be vetted as well.

If a reader of my own paper has a problem with me citing preprints they can read the paper themselves and decide if it’s appropriate. But the fact that it’s a preprint itself doesn’t really matter.

1