jonas__m t1_j47r617 wrote

This is one of the many strategies used in AutoGluon that enable it to outperform other AutoML tools on most datasets:
https://arxiv.org/abs/2003.06505

https://arxiv.org/abs/2207.12560

One complaint people raise is the latency and complexity of deploying ensemble models, but there are many easy ways to address this:
https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-indepth.html#accelerating-inference
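
For a sense of what this looks like in practice, here's a minimal sketch assuming AutoGluon's TabularPredictor API. The file path and label column are placeholders, and the acceleration method names follow the linked tutorial, so check them against your installed version:

```python
# Minimal sketch of the workflow described above, assuming AutoGluon's
# TabularPredictor API; the file path and label column are placeholders.
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")  # hypothetical dataset
predictor = TabularPredictor(label="class").fit(train_data)  # trains & ensembles many models

# Two of the inference-acceleration options from the linked tutorial:
predictor.persist_models()  # keep models in memory to avoid per-request disk loads
predictor.refit_full()      # collapse bagged ensembles into single refit models
```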


jonas__m t1_j45cym1 wrote

Data Shapley is one option, but it can be computationally expensive. If you're looking for practical code to run on real data, here are some tutorials for finding the least useful data:

https://docs.cleanlab.ai/stable/tutorials/image.html

https://docs.cleanlab.ai/stable/tutorials/outliers.html

as well as the MOST useful data to label next (or collect an extra label for):

https://github.com/cleanlab/examples/blob/master/active_learning_multiannotator/active_learning.ipynb
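
To make these concrete, here's a hedged sketch of both workflows, assuming cleanlab's Datalab and multiannotator APIs; the features, predicted probabilities, and annotator labels below are random placeholders standing in for your real model outputs and labeling setup:

```python
import numpy as np
import pandas as pd
from cleanlab import Datalab
from cleanlab.multiannotator import get_active_learning_scores

n, k = 200, 3
features = np.random.rand(n, 16)                      # placeholder embeddings
pred_probs = np.random.dirichlet(np.ones(k), size=n)  # placeholder out-of-sample predicted probs
df = pd.DataFrame({"label": np.random.randint(k, size=n)})

# Least useful data: audit for label issues, outliers, near-duplicates, etc.
lab = Datalab(data=df, label_name="label")
lab.find_issues(features=features, pred_probs=pred_probs)
lab.report()  # summarizes which examples look least trustworthy/useful

# Most useful data to label next: lower score = higher labeling priority
labels_multi = pd.DataFrame(np.random.randint(k, size=(n, 4)))  # n examples x 4 annotators
scores_labeled, scores_unlabeled = get_active_learning_scores(
    labels_multiannotator=labels_multi,
    pred_probs=pred_probs,
    pred_probs_unlabeled=np.random.dirichlet(np.ones(k), size=50),
)
```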


jonas__m t1_j2y72oy wrote

Yes, e.g. a medical image classifier can outperform the average doctor. This can be verified by having 10 doctors review each image in the test set to establish a true consensus.

Even when an ML model is trained with noisy labels, there are techniques to obtain robust models whose accuracy exceeds the overall noise level of the labels. One good open-source library for this: https://github.com/cleanlab/cleanlab/
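
For instance, a minimal sketch of noisy-label-robust training with that library's CleanLearning wrapper (the base classifier and data here are synthetic placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

X = np.random.rand(500, 10)                    # placeholder features
noisy_labels = np.random.randint(2, size=500)  # labels that may contain errors

# CleanLearning prunes likely label errors, then retrains the base classifier
cl = CleanLearning(LogisticRegression())
cl.fit(X, noisy_labels)
label_issues = cl.get_label_issues()  # per-example diagnostics found during fit
preds = cl.predict(X)
```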

Another way to get ML that outperforms individual data labelers is to have multiple annotators label your data. CROWDLAB is an effective method for analyzing such data: https://cleanlab.ai/blog/multiannotator/
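
Here's a hedged sketch of analyzing multi-annotator data with cleanlab's implementation of that method; the label matrix and predicted probabilities are random placeholders, and NaN marks examples an annotator didn't label:

```python
import numpy as np
import pandas as pd
from cleanlab.multiannotator import get_label_quality_multiannotator

n, k = 100, 3
labels_multi = pd.DataFrame(np.random.randint(k, size=(n, 5)).astype(float))  # n examples x 5 annotators
labels_multi.iloc[::3, 0] = np.nan  # NaN = this annotator skipped the example
pred_probs = np.random.dirichlet(np.ones(k), size=n)  # out-of-sample model predictions

results = get_label_quality_multiannotator(labels_multi, pred_probs)
consensus = results["label_quality"]     # consensus label + quality score per example
annotators = results["annotator_stats"]  # estimated quality of each annotator
```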
