jonas__m t1_j45cym1 wrote
Reply to [D] Can someone point to research on determining usefulness of samples/datasets for training ML models? by HFSeven
Data Shapley is one option, but it can be computationally expensive. If you're looking for practical code to run on real data, here are some tutorials for finding the least useful data:
https://docs.cleanlab.ai/stable/tutorials/image.html
https://docs.cleanlab.ai/stable/tutorials/outliers.html
as well as the MOST useful data to label next (or collect an extra label for):
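If you just want a feel for the outlier workflow from those tutorials, here's a minimal sketch using cleanlab's OutOfDistribution class (assumes you've already precomputed feature embeddings, e.g. from a pretrained network; the data below is a random placeholder):

```python
import numpy as np
from cleanlab.outlier import OutOfDistribution

# feature_embeddings: (num_examples, num_features), e.g. from a pretrained
# network's penultimate layer (random placeholder here)
feature_embeddings = np.random.rand(1000, 128)

ood = OutOfDistribution()
# Lower scores indicate examples that look most atypical relative to the rest
outlier_scores = ood.fit_score(features=feature_embeddings)
most_atypical_idx = np.argsort(outlier_scores)[:20]  # 20 most atypical examples
print(most_atypical_idx)
```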
jonas__m t1_j2y72oy wrote
Reply to [Discussion] If ML is based on data generated by humans, can it truly outperform humans? by groman434
Yes. For example, a medical image classifier can outperform the average doctor. This can be verified by having, say, 10 doctors review each image in the test set to establish a true consensus.
Even when an ML model is trained on noisy labels, there are techniques to obtain robust models whose accuracy exceeds the quality of the labels they were trained on. One good open-source library for this: https://github.com/cleanlab/cleanlab/
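A minimal sketch of how that looks with cleanlab's CleanLearning wrapper (any sklearn-compatible classifier works; the data here is a random placeholder):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

# Placeholder data: substitute your real features and (possibly noisy) labels
X = np.random.rand(500, 10)
noisy_labels = np.random.randint(0, 3, size=500)

cl = CleanLearning(LogisticRegression(max_iter=1000))
cl.fit(X, noisy_labels)               # trains robustly despite label errors
label_issues = cl.get_label_issues()  # per-example diagnostics found during fit
print(label_issues.head())
```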
Another way to get ML that outperforms individual data labelers is to have multiple annotators label your data. CROWDLAB is an effective method for analyzing such data: https://cleanlab.ai/blog/multiannotator/
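Roughly, the multi-annotator analysis looks like this (a sketch with random placeholder data; assumes a DataFrame of per-annotator labels with NaN for missing entries, plus out-of-sample predicted probabilities from any model):

```python
import numpy as np
import pandas as pd
from cleanlab.multiannotator import get_label_quality_multiannotator

rng = np.random.default_rng(0)

# labels_multiannotator: (num_examples x num_annotators), NaN where an
# annotator didn't label that example (random placeholder here)
labels = rng.integers(0, 2, size=(100, 5)).astype(float)
missing = rng.random((100, 5)) < 0.4
missing[:, 0] = False  # ensure every example has at least one label
labels[missing] = np.nan
labels_multiannotator = pd.DataFrame(labels)

# pred_probs: out-of-sample predicted class probabilities from any classifier
pred_probs = rng.dirichlet([1.0, 1.0], size=100)

results = get_label_quality_multiannotator(labels_multiannotator, pred_probs)
print(results["label_quality"].head())  # consensus labels + quality scores
print(results["annotator_stats"])       # estimated quality of each annotator
```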
jonas__m t1_ix5ey4i wrote
Reply to comment by Mozillah0096 in [R] A relabelling of the COCO 2017 dataset by iknowjerome
cleanlab is an open-source Python library that checks data and label quality.
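Its core check is basically a one-liner once you have out-of-sample predicted probabilities from any model (sketch with random placeholder data):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# labels: given (possibly noisy) class labels; pred_probs: out-of-sample
# predicted probabilities from any classifier (random placeholders here)
labels = np.random.randint(0, 3, size=200)
pred_probs = np.random.dirichlet([1.0, 1.0, 1.0], size=200)

issue_mask = find_label_issues(labels=labels, pred_probs=pred_probs)
print(f"{issue_mask.sum()} examples flagged as likely mislabeled")
```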
jonas__m t1_it9lkpf wrote
Blog about research/code/tutorials related to data-centric AI: https://cleanlab.ai/blog/
jonas__m t1_ircbeek wrote
Reply to [D] Types of Machine Learning Papers by Lost-Parfait568
Missing from the list: present 10 sophisticated innovations when one simple trick would suffice, to ensure reviewers find the paper "novel".
jonas__m t1_j47r617 wrote
Reply to Why is Super Learning / Stacking used rather rarely in practice? [D] by Worth-Advance-1232
This is one of the many strategies used in AutoGluon that enable it to outperform other AutoML tools on most datasets (quick sketch after the links):
https://arxiv.org/abs/2003.06505
https://arxiv.org/abs/2207.12560
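For reference, turning on bagging/stacking in AutoGluon only takes a couple of fit arguments (a sketch; the file path and label column are placeholders):

```python
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")  # placeholder path

# num_bag_folds / num_stack_levels enable bagged ensembling and
# multi-layer stacking, the strategy described in the papers above
predictor = TabularPredictor(label="target").fit(
    train_data,
    num_bag_folds=5,
    num_stack_levels=1,
)
predictor.leaderboard()  # per-model and stacked-ensemble validation scores
```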
One common complaint concerns the latency and complexity of deploying ensemble models, but there are several easy ways to address this:
https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-indepth.html#accelerating-inference
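A few of those options, sketched (method names follow the linked tutorial; some may be renamed in newer AutoGluon versions):

```python
# Continuing from the predictor above:

predictor.refit_full()      # refit ensemble members on all data into faster "_FULL" copies
predictor.persist_models()  # keep models in memory to avoid per-request disk loads
predictor.distill()         # distill the ensemble into a single smaller model
```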