Submitted by loosefer2905 t3_xt1odk in MachineLearning
Fake news from the US 2016 elections is a commonly used dataset that has attracted many people to research it with machine learning, so I decided to give it a go.
The classifier I used is actually the simplest one: a Naive Bayes classifier.
Surprisingly, we got higher accuracy than all the past publications on the same dataset, even though the classifier is so simple. The catch, in my view, was selecting the right attributes. We paid attention to the metadata of the news publications, and in particular the month of publication was by itself the most informative attribute for classifying news as fake. I will let readers draw their own conclusions from that finding.
The accuracy was 95.38%. I am sure that with further digging, higher accuracy can be achieved.
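To give a rough idea of the setup, here is a minimal sketch of training a Naive Bayes classifier on publication metadata with scikit-learn. It is not the code from the paper; the column names and the toy data are just placeholders for illustration.

```python
# Minimal sketch (not the paper's code): a Naive Bayes classifier over
# news metadata, where the publication month is one of the attributes.
# Column names ("month", "source", "label") and the data are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score

# Toy stand-in for the real dataset's metadata table.
df = pd.DataFrame({
    "month":  [1, 2, 3, 10, 10, 11, 11, 11, 12, 12] * 20,
    "source": ["a", "b", "a", "c", "c", "d", "d", "c", "d", "a"] * 20,
    "label":  [0, 0, 0, 1, 1, 1, 1, 1, 1, 0] * 20,   # 1 = fake, 0 = real
})

# Encode the categorical metadata as ordinal integers for CategoricalNB.
X = OrdinalEncoder().fit_transform(df[["month", "source"]])
y = df["label"].values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = CategoricalNB().fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

CategoricalNB is the scikit-learn estimator for discrete categorical features, which is why the metadata is ordinal-encoded first.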
The preprint is open access and can be found here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4074884
Thanks!
KellinPelrine t1_iqolwmz wrote
I think it's noteworthy that the month turns out to be the most informative, but it may be more a reflection on the data and its collection process than a strong feature for real-world detection. For example, there have been datasets collected in the past where real and fake examples were collected at different times, which makes month or other date information artificially predictive. See https://arxiv.org/pdf/2104.06952.pdf, especially 3.4.2.
So I'd encourage you to consider why month would be predictive (and the same for any other metadata), in order to make sure it's not an artifact of the dataset.
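A quick way to check this is a contingency table of label by month: if the real and fake articles were collected in disjoint time windows, it shows up immediately. A rough sketch (the column names and toy data are made up):

```python
# Sketch of the sanity check suggested above: look at how the labels are
# spread across publication months. If the two classes were scraped in
# disjoint time windows, a table like this exposes it right away.
# The data below is made up purely to illustrate the pattern.
import pandas as pd

df = pd.DataFrame({
    "month": [1, 2, 3, 4, 9, 10, 11, 12] * 25,
    "label": [0, 0, 0, 0, 1, 1, 1, 1] * 25,   # 1 = fake, 0 = real
})

# Row-normalized contingency table, i.e. an estimate of P(label | month).
print(pd.crosstab(df["month"], df["label"], normalize="index"))
```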