Submitted by [deleted] t3_11yccp8 in MachineLearning
[deleted]
Combine leakage, well-known datasets, small test sets, and the wrong metric (i.e., accuracy on very unbalanced data) with otherwise good data science choices, and it's not unexpected to see perfect accuracy. It certainly can be that accurate, but who cares, given all these other possible failings in the analysis.
I don't typically deal with breast cancer histopathology models, but I do work with medical imaging full time as my day job - if I'm reading this correctly, they use the Wisconsin Breast Cancer dataset (originally released in 1995!): https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)
First question - have breast cancer histopathology evaluation techniques changed since 1995? Checking out a quick lit review - yes: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8642363/#Sec2
So is this dataset likely to be useful today? Well… we don't know the demographics of the population, we don't know the split of tumor severity in the population (this could be all easy cancers and not very generalizable or useful for what someone sees day to day!), and the preprocessing would require someone to take the digital image and extract all these features, which honestly probably takes about as long as the pathologist just looking at the image and evaluating it. Also, it sort of looks like they just used the features that came with the dataset…
They report 100% accuracy on the training set and 99% on the testing set - great. Theoretically any model can get to 100% accuracy on the training set, so I almost always ignore this completely when papers report it, unless there is a substantial drop-off between training and testing or vice versa. But next question - are these results in line with similar published results on this particular dataset? Here's an arXiv paper from 2019 with similar results: https://arxiv.org/pdf/1902.03825.pdf
So nothing new here… it seems it’s possible and has been previously published to get 99% accuracy on this dataset…
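For what it's worth, the "any model can hit 100% on the training set" point is easy to demonstrate on this exact dataset, since scikit-learn ships a copy of the Wisconsin Diagnostic data as load_breast_cancer. A minimal sketch with default settings (nothing to do with the paper's actual pipeline):

```python
# A minimal sketch, assuming scikit-learn's bundled copy of the Wisconsin
# Diagnostic dataset (load_breast_cancer); this is not the paper's pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("train accuracy:", clf.score(X_tr, y_tr))  # typically 1.0 - the trees memorize the training split
print("test accuracy: ", clf.score(X_te, y_te))  # typically mid-to-high 90s
```

A default random forest memorizes the training split outright, which is why the 100% training figure tells you essentially nothing.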
Next question - is Procedia a good journal? It publishes conference proceedings and has an impact factor of 0.8 (kind of low). It's unlikely this went through a rigorous peer-review process, although I don't like to dismiss conference venues outright, because some of the big cool clinical trial results and huge breakthroughs get dumped in places like these. But in this case it seems like two researchers trying to get a paper out, not necessarily a groundbreaking discovery (people have published on this dataset before and gotten 99% with a random forest before!).
Final conclusion: meh.
Claims of 100% accuracy always set off alarm bells.
I do work in the medical field and the problem is that there are lots of physicians who want to make easy money: Start a startup, collect some data (which is easy for them), download some model they have read about but don't really understand and start training.
I work for a medical device manufacturer and sometimes have to evaluate startups. And the errors they make are sometimes so basic that it becomes clear that they don't have the first clue what they are doing.
One of those startups claimed 99% accuracy on ultrasound images. But upon closer inspection, their product was worthless. Apparently they knew that they needed to split their data into training/validation/test sets.
So what did they do? They took the videos and randomly assigned frames to one of these sets. And since two consecutive frames are very similar to each other, of course you are going to get 99% accuracy. It just means absolutely nothing.
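For anyone who hasn't seen this failure mode before, here's a minimal sketch with toy data (not their actual pipeline) of the difference between splitting individual frames at random and splitting by video with scikit-learn's GroupShuffleSplit:

```python
# Toy sketch of the frame-splitting leak (made-up data): random frame splits
# put near-identical frames on both sides, while a group-aware split keeps
# each video entirely in train or entirely in test.
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.default_rng(0)
video_id = np.repeat(np.arange(50), 100)      # 50 videos, 100 frames each
X = rng.normal(size=(len(video_id), 16))      # placeholder frame features
y = video_id % 2                              # label is a property of the whole video

# Leaky: frames from the same video land in both train and test
X_tr_bad, X_te_bad, y_tr_bad, y_te_bad = train_test_split(X, y, test_size=0.2, random_state=0)

# Sound: split by video id, so no video contributes frames to both sides
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=video_id))
X_tr, X_te, y_tr, y_te = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```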
So true - I also always think of the skin cancer detection model that turned out to predict anything with an arrow pointing to it as cancer, because all of the cancerous lesions in its training set had arrows. (The paper showing this ended up in JAMA.)
:facepalm:
Yeah, that is exactly the level of mistakes I have to deal with.
Another classic that I see repeated over and over again is wildly unbalanced datasets: Some diseases are very rare, so for every sample of the disease you are looking for, there are 10000 or more samples that are normal. And often, they just throw it into a classifier and hope for the best.
And then you can also easily get 99% accuracy, but the only thing the classifier has learned is to say "normal tissue", regardless of the input.
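You can reproduce that in a few lines with scikit-learn's DummyClassifier standing in for the "always say normal" model (made-up data, just to illustrate):

```python
# Sketch with made-up data: at 1-in-10,000 prevalence, always predicting
# "normal" already looks 99.99% accurate while catching zero disease cases.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.zeros(100_000, dtype=int)
y[:10] = 1                                   # 10 disease samples among 100,000
X = np.zeros((100_000, 1))                   # features are irrelevant to the point

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)
print("accuracy:", accuracy_score(y, pred))  # 0.9999
print("recall:  ", recall_score(y, pred))    # 0.0 - every disease case is missed
```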
A GPT might be engineered to read papers and report findings of common basic errors in analysis design, like the ones you found.
Probability calibration could be added later via telemetry revealing how accurate its own basic-error classifications are.
Could it be there is some tag within the image, set by the doctors?
It's always important to approach any claim of 100% accuracy with a critical eye. Achieving 100% accuracy is nearly impossible on any practical dataset, and it is usually an indication of overfitting or other statistical biases in the model.
It is also essential to examine the data transformation and feature selection process used in the model, as these can have a significant impact on model performance and biases. It's important to ensure that these processes are transparent, unbiased, and validated using appropriate statistical methods.
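A sketch of one such method (not the paper's actual procedure): keep the feature selection inside a Pipeline so it is re-fit on each cross-validation fold rather than being chosen once on the full dataset.

```python
# Feature selection inside the Pipeline means each CV fold picks its own
# features from its own training portion only - a sketch, not the paper's setup.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),          # selection happens per fold
    ("model", RandomForestClassifier(random_state=0)),
])

print(cross_val_score(pipe, X, y, cv=5, scoring="accuracy"))
```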
Looked fine at first glance, imo. I know nothing about medicine, but it's nice to see that each of the models evaluated was in the upper 90s on accuracy.
I would want to see how the model performs on a much larger data set before trusting the validity.
It sounds too good to be true.
A dumb model gets 99% accuracy on a disease with 1% prevalence, easily and "correctly".
Be careful what you ask for - like accuracy instead of F1.
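Quick illustration with made-up numbers, using scikit-learn's metrics:

```python
# At 1% prevalence, the always-"healthy" model gets 99% accuracy but an F1 of 0
# (scikit-learn warns about the undefined precision).
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 10 + [0] * 990   # 1% prevalence
y_pred = [0] * 1000             # dumb model: predicts "healthy" for everyone

print(accuracy_score(y_true, y_pred))  # 0.99
print(f1_score(y_true, y_pred))        # 0.0
```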
It's plausible to see 100% accuracy, especially on well-studied datasets or a small set of new data. I witnessed exactly this on a 20-example test set in the wild. Twenty is super small, but the rules forbade me from using more.
Don't believe a paper until you have their code and run it.
Either it’s an easy problem where 98% - 100% accuracy on samples this size is just typical and not really worth publishing, or (not exclusive) the study is flawed.
One could get a totally independent data set of FNA images with these features extracted from different patients in different years, etc., and run their random forest on those. If it gets 98% - 100% accuracy, then this is not a hard problem (the feature engineering might have been hard - not taking away from that if so!). If it fails miserably or just gets waaaay lower than 100%, you know the study was flawed.
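Mechanically, that external check is trivial once someone has the independent data - a hypothetical sketch (the model file and CSV names here are made up; the hard part is collecting and feature-extracting the new images):

```python
# Hypothetical sketch only - "random_forest_wdbc.joblib" and
# "external_fna_features.csv" are illustrative names, not real artifacts.
import joblib
import pandas as pd
from sklearn.metrics import classification_report

model = joblib.load("random_forest_wdbc.joblib")      # the trained model, assumed saved
external = pd.read_csv("external_fna_features.csv")   # same 30 features, different patients/years
X_ext = external.drop(columns=["diagnosis"])
y_ext = external["diagnosis"]

print(classification_report(y_ext, model.predict(X_ext)))
```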
There are so many ML neophytes making “rookie mistakes” with this stuff who don’t fully grasp basic concepts that I think you always need a totally new independent test set that the authors didn’t have access to in order to really test it. That’s even a good idea for experts to be honest.
The paper’s conclusion is likely wrong either way; i.e., that Random Forests are “superior” for this application. Did they get an expert in XGBoost, neural networks, etc and put as much time and effort into those techniques using the same training and test sets to see if they also got 99% - 100%? It didn’t appear so from my cursory glance.
Meddhouib10 t1_jd792us wrote
Generally, in medicine papers there is some sort of data leakage (like doing data augmentation before splitting into train, val, and test).
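A sketch of that pattern (augment() here is a made-up stand-in for whatever augmentation they use) - the fix is just to split first and augment only the training portion:

```python
# Augmenting before the split leaks near-copies of test samples into training;
# splitting first and augmenting only the training split avoids it.
import numpy as np
from sklearn.model_selection import train_test_split

def augment(X, y):
    # hypothetical augmentation: keep the originals plus a jittered copy of each
    noise = np.random.default_rng(0).normal(scale=0.01, size=X.shape)
    return np.vstack([X, X + noise]), np.concatenate([y, y])

X = np.random.rand(200, 8)
y = np.random.randint(0, 2, 200)

# Leaky: augmented twins of test samples can end up in the training set
X_bad, y_bad = augment(X, y)
X_tr_bad, X_te_bad, y_tr_bad, y_te_bad = train_test_split(X_bad, y_bad, test_size=0.2)

# Sound: split first, then augment only the training split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
X_tr_aug, y_tr_aug = augment(X_tr, y_tr)
```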