Submitted by regenerated_lawyer t3_zc7oar in MachineLearning

Hi all,

I am working on fine-tuning some language models for tasks specific to current news, and I was wondering if there are any good datasets of articles from a variety of top publications like NYT, WSJ, Economist, etc.? I haven't been able to find a varied dataset like that so far.

18

Comments

beezlebub33 t1_iywqxfg wrote

It will be hard to get a free source of data for publications like that, because they are not free.

However, there is GDELT: https://www.gdeltproject.org/ (wikipedia: https://en.wikipedia.org/wiki/GDELT_Project ) It's a project that collects events and other data from a variety of sources.

3

maxminmax_ t1_iyxu4m7 wrote

I've looked into this and couldn't find any exhaustive news corpus that's freely available. Hence, I am currently working on building an MIT-licensed news archive corpus. DM me if you'd like to collaborate.

3

TheCockatoo t1_iyv5pz0 wrote

Following, very interested in this too.

2

WitnessWWIII t1_iyvdx9b wrote

The current news is shit, shit quality. For old news, I saw data in google’s datasets. For the current ones (which are not completely shit), you will need a subscription, which makes it illegal to use for your purposes in 99 per cent of cases. You can steal data by scraping your favourite sources, but this will be illegal. In short, I do not think you will find ready datasets of good-quality articles without paying for them. I also think that you will need to contact the source to make it legal.

2

still_hexed t1_iyy876c wrote

You could create your own using RSS feeds by category from these journals?
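A rough sketch of what that could look like in Python, using only the standard library. The function names `parse_rss_items` and `fetch_feed` are made up for illustration, and the code assumes the feed is plain RSS 2.0 (Atom feeds use different element names). Feeds often carry only a headline and a short summary rather than the full article, and each publisher's terms of use still apply.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def parse_rss_items(xml_text):
    """Return one dict per <item> element of an RSS 2.0 document.

    Accepts str or bytes. Missing fields come back as empty strings.
    """
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            # An item may carry zero or more <category> tags.
            "categories": [
                c.text.strip() for c in item.findall("category") if c.text
            ],
        })
    return items


def fetch_feed(url, timeout=10):
    """Download a feed and parse it. The URL is supplied by the caller."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_rss_items(resp.read())
```

Calling `fetch_feed` on a category feed URL on a schedule and appending the new items to a file, de-duplicated by `link`, would slowly build up a corpus grouped by category.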

1

United_Quit_7754 t1_iyyhuk7 wrote

For crypto news there is cryptopanic.com. There, I guess, you could pull the higher-rated news, etc., as a feature.

1

fnslyc t1_iyyc9pr wrote

I think the problem with using news sources for models is that it's almost certainly going to be slanted, misleading, deceptive, etc...

Unless you wanted to create some language models to do tasks that involve, say, lying or expressing one-sided opinions (lol, or accepting money from people with an agenda in exchange for pushing it for them), you're probably better off looking somewhere else. That said, I think it would be a very interesting project - no question about that...

−2