KellinPelrine t1_iqolwmz wrote
I think it's noteworthy that the month turns out to be the most informative feature, but that may be more a reflection of the data and its collection process than a strong feature for real-world detection. For example, there have been datasets collected in the past where real and fake examples were collected at different times, which makes month or other date information artificially predictive. See https://arxiv.org/pdf/2104.06952.pdf, especially 3.4.2.
So I'd encourage you to consider why month would be predictive (and the same for any other metadata), in order to make sure it's not an artifact of the dataset.
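One quick diagnostic, as a rough sketch only (it assumes a pandas DataFrame with hypothetical `date` and `label` columns, since I don't know your actual schema): tabulate the class balance per publication month and look for months that are almost entirely one class.

```python
# Quick diagnostic: how does the real/fake balance vary by publication month?
# Assumes a hypothetical DataFrame with `date` and `label` columns;
# adapt the names to your actual schema.
import pandas as pd

def class_balance_by_month(df: pd.DataFrame) -> pd.DataFrame:
    month = pd.to_datetime(df["date"]).dt.to_period("M")
    # Months that are close to 100% one class point to a collection
    # artifact rather than a genuinely useful feature.
    return pd.crosstab(month, df["label"], normalize="index")
```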
loosefer2905 OP t1_iqpyo9n wrote
Hello Kellin
I share your concern. In this case, the date is the article's publication date or last modified date; it is not related to the date on which the data collector found the article.
The reason we didn't delve into the "why" in the research paper itself is that it is meant to be a machine learning paper read by a scientific community. Our group of authors is of the opinion that the reasoning is highly political in nature.
KellinPelrine t1_iqtjzfs wrote
Just using the date of publication or last modification does not avoid the issue I described. In my brief reading I couldn't find a link or reference for your data beyond it coming from Kaggle (I might have missed a more exact reference), but your sample is definitely not random: as you describe, it has exactly 2000 real and 2000 fake examples, while a representative random sample would not be balanced. If the 2000 fake ones have 2016 publication dates and the 2000 real ones have 2017 dates, you haven't found a new optimal detection method, nor shown that every article published in 2016 was fake; you've found an artifact of the dataset. That would still be an important finding, especially if other people are using that data and might be drawing wrong conclusions from it, but it is not a new misinformation detection method.
Of course, it's probably not as extreme a case as that (although something nearly that extreme has occurred in some widely used datasets, as explained in the paper I linked). But here's a more subtle thought experiment: suppose fake articles were collected randomly from a fact-checking website (a not uncommon practice). Further, maybe that fact-checking website expanded its staff near the 2016 US election, say in October, because there was a lot of interest in and public need for misinformation detection at that time. More staff -> more articles checked -> more fake news detected -> a random sample of fake news from the website will contain more examples from October (when there was more staff) than September. So in that data the month is predictive, but that will not generalize to other data.
A machine learning paper, whatever the audience, requires some guarantee of generalization. Since the metadata features used in your paper are known to be problematic in some datasets, and the paper reports results on only one dataset, in my opinion it cannot give confidence in generalization without some explanation of the "why."
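As a concrete version of that check, here's a rough sketch (hypothetical `month` and `label` columns, not your actual data): a trivial "majority class per month" rule, evaluated with cross-validation. If that rule alone scores far above chance, the month feature is almost certainly reflecting how the data was collected rather than anything about the articles.

```python
# Sketch of a month-only baseline: predict, for each article, whichever
# class is most common for its month in the training fold. Assumes
# hypothetical `month` and `label` columns.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def month_only_accuracy(df: pd.DataFrame, n_splits: int = 5) -> float:
    accs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=0).split(df, df["label"]):
        train, test = df.iloc[tr], df.iloc[te]
        # Majority label for each month seen in the training fold.
        majority = train.groupby("month")["label"].agg(lambda s: s.mode().iloc[0])
        # Months unseen in training fall back to the overall majority label.
        preds = test["month"].map(majority).fillna(train["label"].mode().iloc[0])
        accs.append((preds == test["label"]).mean())
    return float(sum(accs) / len(accs))
```

The same idea applies to any metadata feature: if it separates the classes nearly on its own, that is exactly the kind of artifact documented in the paper I linked.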
loosefer2905 OP t1_iqw45vd wrote
In our understanding, choosing the right machine learning model is one thing and choosing the right attributes is another. The good detection accuracy of the Bayesian classifier on this dataset comes from the type of study we did: most past papers have worked on extracting linguistic features from the article text, while others have approached it from a social media perspective, i.e. looking at Twitter profiles and classifying tweets as fake or real on that basis. Month is not the ONLY attribute we used; the type of news (Political, World News, US news) was another factor.
Choosing the right model, the right attributes, and the right methodology together is what matters. Most linguistic-feature-extraction models, for example, are more complicated in nature, yet in most of the previous work we saw they still cannot discern real news from fake news very well... the accuracy is in the 70s. For us, getting the right performance from the right selection of attributes was critical, and we feel we have done a decent job at that.
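For illustration, here is a rough sketch of the kind of pipeline we mean, a Bayesian classifier over categorical attributes such as month and news type; the column names `month`, `subject`, and `label` are placeholders, not our actual code:

```python
# Rough sketch of a naive Bayes classifier over categorical metadata
# (publication month and news type). Column names are placeholders.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

def metadata_nb_accuracy(df: pd.DataFrame) -> float:
    # Encode the categorical attributes as integer codes for CategoricalNB.
    X = OrdinalEncoder().fit_transform(df[["month", "subject"]]).astype(int)
    y = (df["label"] == "fake").astype(int)
    # 5-fold cross-validated accuracy using metadata only, no article text.
    return float(cross_val_score(CategoricalNB(), X, y, cv=5).mean())
```

CategoricalNB is used here simply because both attributes are categorical; any comparable Bayesian classifier would fit the same description.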
The "why" should be left to your interpretation. I have already said what I said: it is political in nature. More than that we cannot say.