Hi all,

I am working on fine-tuning some language models for tasks specific to current news, and I was wondering if there are any good datasets of articles from a variety of top publications like NYT, WSJ, Economist, etc.? I haven't been able to find a varied dataset like that so far.

Comments

beezlebub33 t1_iywqxfg wrote on December 4, 2022 at 7:37 PM

It will be hard to get a free source of data for publications like that, because they are not free.

However, there is GDELT: https://www.gdeltproject.org/ (wikipedia: https://en.wikipedia.org/wiki/GDELT_Project ) It's a project that collects events and other data from a variety of sources.

regenerated_lawyer OP t1_iywrs6y wrote on December 4, 2022 at 7:43 PM

Have you worked with it before and would want to chat?

maxminmax_ t1_iyxu4m7 wrote on December 5, 2022 at 12:00 AM

I've looked into this and couldn't find any exhaustive news corpus that's freely available. Hence, I am currently working on building an MIT-licensed news archive corpus. DM me if you'd like to collaborate.

TheCockatoo t1_iyv5pz0 wrote on December 4, 2022 at 12:13 PM

Following, very interested in this too.

WitnessWWIII t1_iyvdx9b wrote on December 4, 2022 at 1:45 PM

The current news is shit, shit quality. For old news, I saw data in Google’s datasets. For the current ones (those which are not completely shit), you will need a subscription, which makes it illegal to use for your purposes in 99 per cent of cases. You could steal data by scraping your favourite sources, but this will be illegal. In short, I do not think you will find ready datasets of good-quality articles without paying for them. I also think that you will need to contact the source to make it legal.

still_hexed t1_iyy876c wrote on December 5, 2022 at 1:46 AM

You could create your own using RSS feeds by category from these journals?
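A minimal sketch of that RSS idea, using only the Python standard library. It assumes plain RSS 2.0 feeds; the names `parse_rss`, `fetch_feed`, and `SAMPLE_FEED` and the example.com URLs are made up for illustration, so you would swap in the real feed URLs each publication lists. Keep in mind that feeds often carry only a headline and a short summary rather than the full article text, and the publisher's terms still apply.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Tiny hand-written RSS 2.0 document so the parser can be tried offline.
# The titles and example.com links are placeholders, not real articles.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Business Feed</title>
    <item>
      <title>Example headline one</title>
      <link>https://example.com/one</link>
      <pubDate>Mon, 05 Dec 2022 01:00:00 GMT</pubDate>
      <description>Short summary one.</description>
    </item>
    <item>
      <title>Example headline two</title>
      <link>https://example.com/two</link>
      <pubDate>Mon, 05 Dec 2022 02:00:00 GMT</pubDate>
      <description>Short summary two.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text, category=None):
    """Turn RSS 2.0 text (str or bytes) into a list of article dicts."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            # Label supplied by the caller, e.g. the feed's section name.
            "category": category,
        })
    return articles


def fetch_feed(url, category=None, timeout=10):
    """Download one feed over HTTP and parse it (needs network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_rss(resp.read(), category)


if __name__ == "__main__":
    for article in parse_rss(SAMPLE_FEED, category="business"):
        print(article["published"], "|", article["title"])
```

To build a dataset by category, you could call `fetch_feed` on a schedule for each (feed URL, category) pair and append the results to a file, dropping rows whose `link` you have already stored.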

United_Quit_7754 t1_iyyhuk7 wrote on December 5, 2022 at 3:03 AM

For crypto news there is cryptopanic.com. There, I guess, you could get higher-rated news etc. as a function.

United_Quit_7754 t1_iyyhzb1 wrote on December 5, 2022 at 3:04 AM

It is a news aggregator with more

93-summer-days t1_iyzo40k wrote on December 5, 2022 at 11:28 AM

Newscrawl might be useful

https://data.statmt.org/news-crawl/

fnslyc t1_iyyc9pr wrote on December 5, 2022 at 2:18 AM

I think the problem with using news sources for models is that it's almost certainly going to be slanted, misleading, deceptive, etc...

Unless you wanted to create some language models to do tasks that involve, say, lying or expressing one-sided opinions (lol, or accepting money from people with an agenda in exchange for pushing it for them), you're probably better off looking somewhere else. That said, I think it would be a very interesting project, no question about that...