jonas__m t1_j45cym1 wrote
Reply to [D] Can someone point to research on determining usefulness of samples/datasets for training ML models? by HFSeven
Data Shapley is one option, but it can be computationally expensive. If you're looking for practical code to run on real data, here are some tutorials for finding the least useful data:
https://docs.cleanlab.ai/stable/tutorials/image.html
https://docs.cleanlab.ai/stable/tutorials/outliers.html
as well as the MOST useful data to label next (or collect an extra label for):
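If you just want a feel for the outlier workflow from those tutorials, here's a minimal sketch using cleanlab's OutOfDistribution class (assumes you've already precomputed feature embeddings, e.g. from a pretrained network; the data below is a random placeholder):

```python
import numpy as np
from cleanlab.outlier import OutOfDistribution

# feature_embeddings: (num_examples, num_features), e.g. from a pretrained
# network's penultimate layer (random placeholder here)
feature_embeddings = np.random.rand(1000, 128)

ood = OutOfDistribution()
# Lower scores indicate examples that look most atypical relative to the rest
outlier_scores = ood.fit_score(features=feature_embeddings)
most_atypical_idx = np.argsort(outlier_scores)[:20]  # 20 most atypical examples
print(most_atypical_idx)
```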
jonas__m t1_j2y72oy wrote
Reply to [Discussion] If ML is based on data generated by humans, can it truly outperform humans? by groman434
Yes. For example, a medical image classifier can outperform the average doctor. This can be verified by having, say, 10 doctors review each image in the test set to establish a true consensus.
Even when an ML model is trained on noisy labels, there are techniques to obtain robust models whose accuracy exceeds the quality of the labels they were trained on. One good open-source library for this: https://github.com/cleanlab/cleanlab/
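A minimal sketch of how that looks with cleanlab's CleanLearning wrapper (any sklearn-compatible classifier works; the data here is a random placeholder):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

# Placeholder data: substitute your real features and (possibly noisy) labels
X = np.random.rand(500, 10)
noisy_labels = np.random.randint(0, 3, size=500)

cl = CleanLearning(LogisticRegression(max_iter=1000))
cl.fit(X, noisy_labels)               # trains robustly despite label errors
label_issues = cl.get_label_issues()  # per-example diagnostics found during fit
print(label_issues.head())
```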
Another way to get ML that outperforms individual data labelers is to have multiple annotators label your data. CROWDLAB is an effective method for analyzing such data: https://cleanlab.ai/blog/multiannotator/
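Roughly, the multi-annotator analysis looks like this (a sketch with random placeholder data; assumes a DataFrame of per-annotator labels with NaN for missing entries, plus out-of-sample predicted probabilities from any model):

```python
import numpy as np
import pandas as pd
from cleanlab.multiannotator import get_label_quality_multiannotator

rng = np.random.default_rng(0)

# labels_multiannotator: (num_examples x num_annotators), NaN where an
# annotator didn't label that example (random placeholder here)
labels = rng.integers(0, 2, size=(100, 5)).astype(float)
missing = rng.random((100, 5)) < 0.4
missing[:, 0] = False  # ensure every example has at least one label
labels[missing] = np.nan
labels_multiannotator = pd.DataFrame(labels)

# pred_probs: out-of-sample predicted class probabilities from any classifier
pred_probs = rng.dirichlet([1.0, 1.0], size=100)

results = get_label_quality_multiannotator(labels_multiannotator, pred_probs)
print(results["label_quality"].head())  # consensus labels + quality scores
print(results["annotator_stats"])       # estimated quality of each annotator
```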
jonas__m t1_ix5ey4i wrote
Reply to comment by Mozillah0096 in [R] A relabelling of the COCO 2017 dataset by iknowjerome
cleanlab is an open-source Python library that checks data and label quality.
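Its core check is basically a one-liner once you have out-of-sample predicted probabilities from any model (sketch with random placeholder data):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# labels: given (possibly noisy) class labels; pred_probs: out-of-sample
# predicted probabilities from any classifier (random placeholders here)
labels = np.random.randint(0, 3, size=200)
pred_probs = np.random.dirichlet([1.0, 1.0, 1.0], size=200)

issue_mask = find_label_issues(labels=labels, pred_probs=pred_probs)
print(f"{issue_mask.sum()} examples flagged as likely mislabeled")
```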
jonas__m t1_it9lkpf wrote
Blog about research/code/tutorials related to data-centric AI: https://cleanlab.ai/blog/
jonas__m t1_ircbeek wrote
Reply to [D] Types of Machine Learning Papers by Lost-Parfait568
Missing from the list: present 10 sophisticated innovations when one simple trick would suffice, to ensure reviewers find the paper "novel".
jonas__m t1_j47r617 wrote
Reply to Why is Super Learning / Stacking used rather rarely in practice? [D] by Worth-Advance-1232
This is one of the many strategies used in AutoGluon that enable it to outperform other AutoML tools on most datasets (quick sketch after the links):
https://arxiv.org/abs/2003.06505
https://arxiv.org/abs/2207.12560
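For reference, turning on bagging/stacking in AutoGluon only takes a couple of fit arguments (a sketch; the file path and label column are placeholders):

```python
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")  # placeholder path

# num_bag_folds / num_stack_levels enable bagged ensembling and
# multi-layer stacking, the strategy described in the papers above
predictor = TabularPredictor(label="target").fit(
    train_data,
    num_bag_folds=5,
    num_stack_levels=1,
)
predictor.leaderboard()  # per-model and stacked-ensemble validation scores
```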
One common complaint concerns the latency and complexity of deploying ensemble models, but there are several easy ways to address this:
https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-indepth.html#accelerating-inference
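A few of those options, sketched (method names follow the linked tutorial; some may be renamed in newer AutoGluon versions):

```python
# Continuing from the predictor above:

predictor.refit_full()      # refit ensemble members on all data into faster "_FULL" copies
predictor.persist_models()  # keep models in memory to avoid per-request disk loads
predictor.distill()         # distill the ensemble into a single smaller model
```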