PartisanPlayground OP t1_j6d6ebz wrote on January 29, 2023 at 2:44 PM

Reply to comment by ianhillmedia in [OC] How news stories evolve in the news cycle by PartisanPlayground

I'm getting the data from the Google News API. I've used RSS feeds in the past with similar results.

And actually I'm using a clustering algorithm to identify the specific stories. I have an automated process that pulls all articles from the past five days, clusters them into stories, then produces a bunch of analysis. This saves me a lot of time and brings some objectivity to the process.

ianhillmedia t1_j6db30j wrote on January 29, 2023 at 3:19 PM

Got it thanks for the reply! I know not everyone supports RSS, and it’s a challenge when folks format RSS in different ways, but as they’re a primary source from the publisher I’d encourage you to use RSS over APIs from Google.

I was curious the signals in your algorithm as well. One of the challenges with automating taxonomies for news stories is the inexactitude of language and differences in style. A story might mention DeSantis and books in the headline and description but might actually be about GOP primaries; a story might emphasize DeSantis in the primaries in the headline and title but it might actually be about book banning.

Or a better example: a story that mentions Tyre Nichols may be about the actual incident, police violence or defunding the police.

Digging in even further, a local news organization might use colloquialisms for place names that can make it difficult for folks from outside that market to categorize those stories.

PartisanPlayground OP t1_j6eo5hz wrote on January 29, 2023 at 8:38 PM

You're hitting on the most subjective part of this whole process. I've run into all of the issues you describe, and the question is ultimately: how do you define a story?

Your GOP primaries example is a good one. Let's say we have articles on Trump's legal issues, other articles on Pence's classified documents, and other articles on DeSantis and books. Now let's say all of these articles describe these things in the context of the 2024 GOP primaries. Is this one story called "GOP primaries"? Or three separate stories? You could make a case either way.

I've tuned the algorithm to split stories in a way that "looks about right" to me. That's subjective, but there's no way around it. This is an issue whether you're using an algorithm or doing this manually.

A related challenge is that story definitions may change over time. The classified documents story is a good example for this. Right now there are articles on Trump, Biden, and Pence all mishandling classified documents. The algorithm is categorizing all of them as the same story (fair enough).

But let's say that next week (just making this up), Trump gets indicted for it. Is that a separate story now? If so, how do you treat that? Do you retroactively split out the "Trump" portion of the "classified documents" story as though they were not the same story before? Do you show the classified documents story splitting into two? Do you just create a new story on the day the indictment happens? Currently, the algorithm is set up to do the first of these, but again, you could make a case for any of them.

All of this is to say that there is subjectivity involved in this process.