Submitted by PartisanPlayground t3_10nfwir in dataisbeautiful
ianhillmedia t1_j6cydkh wrote
Hey there, journalist here with 20+ years experience in the news industry, including 10+ years in digital news. To fairly consider the “prevalence” of news stories and how they progress in the news cycle you’d need a much bigger dataset. Right now, as you’ve described it in other comments, a more accurate title for this chart is: “How the top 10 articles were curated and ranked on homepages of 64 U.S. general interest national(?) news sites when the researcher looked at them.” Correct?
That can still be interesting data, but it doesn’t truly reflect which stories are most “prevalent” in the news.
Know that news website homepages are one of a few (many?) places people consume news online. Google Analytics, which many news organizations use to track success online, reports acquisition channels as Search, Social, Referral and Direct. The percentage of users who are delivered (“prevalence”) and consume news via each channel varies by news organization, but a news org with a legacy brand will get a healthy percentage of traffic from each. People who come to news homepages are a subset of Direct traffic, which also includes users who just type in Mylocalsite.com/weather, for example.
Direct traffic also can include visitors to news organization mobile apps, which can be curated differently from website homepages. Direct traffic does not include people who read push alerts from mobile apps but don’t click through, and what those folks see also should be considered when determining the “prevalence” of news stories online. Referrals, meanwhile, can include visitors who click through from news organization email newsletters, which are often written and curated differently from homepages.
So the stories “prevalent” to U.S. news consumers can be different based on the platform on which they’re delivered news.
That brings us to Social and Search, both of which send healthy traffic to U.S. news sites and play a noteworthy role in determining the “prevalence” of news for Americans. Pew research reported in September indicated that 50% of U.S. adults get news from social media sometimes or often. 82% of American adults use YouTube; 25% of those users say they regularly get news on the site. 70% of American adults use Facebook, and 31% of those users regularly get news on the site. 30% of American adults use TikTok, and 10% of those users regularly get news on the site.
So to really track and report the “prevalence” of news stories, you’d also need to track and report which stories are delivered and consumed on social, and that delivery is determined in large part by social network algorithms powered by user behavior. That’s in part how we get vertical communities on social (“BookTok,” “Black Twitter”). The prevalence of news stories in those communities can be community-dependent.
For Search, the good news is that data on what news stories people seek out and are delivered is available from Google Trends. That said, I’d suggest reading the Google Trends help docs before digging into and reporting that data. You need to know what relevance means to Google when looking at those numbers.
Those are just the differences in digital formats that need to be considered when researching the “prevalence” of news stories. We haven’t even discussed that to really measure “prevalence” you’d need to consider what’s in print editions and broadcast newscasts, both of which still help determine the news agenda for the country. We haven’t discussed the role that consumers play in setting the agenda - the number of people clicking on a story on other acquisition channels helps determine if that story is ranked on a homepage, and for how long. And we didn’t discuss the fact that some news organizations are testing personalization of homepages powered by machine learning. What you see on a news homepage can be unique to you and based on how cookies tracked your habits across the web. It might be different from what other visitors see.
It’s also worth noting that 64 news sites may not constitute a useful sample. At a minimum, in the top 100 DMAs in the U.S., there are typically at least four broadcast news websites and one newspaper of record website. That’s 500 local sites that determine the prevalence of news in the U.S. just in the top 100 DMAs. There are 210 Nielsen DMAs in the U.S. Many of those DMAs also are home to hyperlocal startups and alts which also should be considered when tracking and reporting the “prevalence” of news stories. What’s prevalent to people in Cleveland will be different from what’s prevalent to people in Memphis, which will be different from what’s prevalent to people in L.A., etc.
And that’s just the U.S.
That’s not to say that there aren’t worthwhile data-based stories to tell about news consumption and delivery in the U.S. It’s always interesting to learn more about how specific stories are presented by different news organizations in a specific market. You also could subscribe to a bunch of different newsletters and report on what they present, given that newsletters are static. Google Trends data from the previous day also is static.
Here’s the source of the Pew data about news consumption on social: https://www.pewresearch.org/journalism/fact-sheet/social-media-and-news-fact-sheet/
Hope that’s helpful!
PartisanPlayground OP t1_j6d2gwe wrote
This is an excellent comment, thank you for this!
I think I need a clearer way of describing "prevalence". This chart is showing the top ten stories by the share of articles written about them, not by the amount that they are consumed. I take articles from 64 sources every day, cluster them together into "stories", then calculate each story's share based on the number of articles written about it. For example, if there are 1000 articles for a day, and one story has 100 articles written about it, then its share is 10%. Does that make sense?
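(For anyone curious, the share calculation boils down to something like this - the story labels are made up for illustration, the real ones come from the clustering step:)

```python
from collections import Counter

def story_shares(article_story_labels):
    """Given one story label per article, return each story's
    share of the day's articles (fraction of total)."""
    counts = Counter(article_story_labels)
    total = len(article_story_labels)
    return {story: n / total for story, n in counts.items()}

# Hypothetical day: 10 articles, 3 about one story
labels = ["classified-docs"] * 3 + ["gop-primary"] * 2 + ["other"] * 5
shares = story_shares(labels)  # "classified-docs" -> 3/10 = 30%
```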
I've explored measuring consumption of news in the past, and found it to be very difficult! (Facebook's Graph API used to be wide open, so I was able to get likes/engagement on news stories there, but it has since been locked down) Your comment does a great job of explaining the complexity in measuring consumption. You would need to combine:
- GA data from news outlets (which they don't publish)
- Cable news data (sources exist for this, but you would need to make a lot of assumptions to combine this with articles)
- Social media data
And you would need to make a lot of assumptions about what weights to use on each of those. As a result, I'm keeping this simple and focusing on article shares.
I do publish a daily automated Twitter thread on which news outlet gets the most engagement on Twitter. It includes the most liked and ratioed tweets from each "side" of the media. This is limited to Twitter, so does not cover all the channels you described. See an example here: https://twitter.com/PartisanPlayG/status/1619300675094970369
The other thing I've been doing is cutting articles by which "side" of the media they're on using media bias ratings from AllSides. Again, this involves some simplifying assumptions so it's not perfect but gives a good high-level view. You can see examples here: https://partisanplayground.substack.com
Thanks again for your comment. This is exactly the sort of thing I was looking for when I posted.
ianhillmedia t1_j6d3dqb wrote
Happy to help! And I think you’re spot on when you say you need to clarify the definition of prevalence. Just because a news org puts resources into a topic doesn’t mean it’s prevalent to the user. That said, the number of stories a news org produces on a subject is an interesting data point.
As someone on the other side of this, I hear you on the challenges associated with getting useful data. How are you currently tracking all articles published by those news orgs? And how are you parsing that data to identify specific stories - what search terms are you using to filter the data?
PartisanPlayground OP t1_j6d6ebz wrote
I'm getting the data from the Google News API. I've used RSS feeds in the past with similar results.
And actually I'm using a clustering algorithm to identify the specific stories. I have an automated process that pulls all articles from the past five days, clusters them into stories, then produces a bunch of analysis. This saves me a lot of time and brings some objectivity to the process.
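(I won't go into the full algorithm here, but the core idea is in this spirit - a deliberately simplified sketch, not my actual implementation: tokenize headlines and greedily merge articles whose word overlap clears a threshold, where the threshold is the knob that controls how aggressively stories get merged.)

```python
def tokens(headline):
    return set(headline.lower().split())

def cluster_headlines(headlines, threshold=0.3):
    """Greedy single-pass clustering: each headline joins the first
    existing cluster whose representative is similar enough,
    otherwise it starts a new cluster."""
    clusters = []  # list of (representative_tokens, [headlines])
    for h in headlines:
        t = tokens(h)
        for rep, members in clusters:
            # Jaccard similarity of token sets
            overlap = len(t & rep) / len(t | rep)
            if overlap >= threshold:
                members.append(h)
                break
        else:
            clusters.append((t, [h]))
    return [members for _, members in clusters]

stories = cluster_headlines([
    "trump classified documents found",
    "more classified documents found at trump home",
    "desantis book ban expands",
])
# first two headlines merge into one story; the third stands alone
```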
ianhillmedia t1_j6db30j wrote
Got it thanks for the reply! I know not everyone supports RSS, and it’s a challenge when folks format RSS in different ways, but as they’re a primary source from the publisher I’d encourage you to use RSS over APIs from Google.
I was curious about the signals in your algorithm as well. One of the challenges with automating taxonomies for news stories is the inexactitude of language and differences in style. A story might mention DeSantis and books in the headline and description but might actually be about GOP primaries; a story might emphasize DeSantis in the primaries in the headline and description but it might actually be about book banning.
Or a better example: a story that mentions Tyre Nichols may be about the actual incident, police violence or defunding the police.
Digging in even further, a local news organization might use colloquialisms for place names that can make it difficult for folks from outside that market to categorize those stories.
PartisanPlayground OP t1_j6eo5hz wrote
You're hitting on the most subjective part of this whole process. I've run into all of the issues you describe, and the question is ultimately: how do you define a story?
Your GOP primaries example is a good one. Let's say we have articles on Trump's legal issues, other articles on Pence's classified documents, and other articles on DeSantis and books. Now let's say all of these articles describe these things in the context of the 2024 GOP primaries. Is this one story called "GOP primaries"? Or three separate stories? You could make a case either way.
I've tuned the algorithm to split stories in a way that "looks about right" to me. That's subjective, but there's no way around it. This is an issue whether you're using an algorithm or doing this manually.
A related challenge is that story definitions may change over time. The classified documents story is a good example for this. Right now there are articles on Trump, Biden, and Pence all mishandling classified documents. The algorithm is categorizing all of them as the same story (fair enough).
But let's say that next week (just making this up), Trump gets indicted for it. Is that a separate story now? If so, how do you treat that? Do you retroactively split out the "Trump" portion of the "classified documents" story as though they were not the same story before? Do you show the classified documents story splitting into two? Do you just create a new story on the day the indictment happens? Currently, the algorithm is set up to do the first of these, but again, you could make a case for any of them.
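In pseudocode-ish terms, that first option (retroactively splitting) amounts to relabeling historical articles once the sub-story emerges - everything here is an invented illustration, including the keyword trigger:

```python
def retroactive_split(articles, parent_story, sub_story, keyword):
    """Relabel past articles: any article in parent_story whose
    headline mentions `keyword` moves to the new sub-story, as if
    the split had existed all along."""
    for a in articles:
        if a["story"] == parent_story and keyword in a["headline"].lower():
            a["story"] = sub_story
    return articles

articles = [
    {"headline": "Biden classified documents probe continues",
     "story": "classified-docs"},
    {"headline": "Trump indicted over classified documents",
     "story": "classified-docs"},
]
retroactive_split(articles, "classified-docs", "trump-indictment", "trump")
# the Trump article is now its own story; history is rewritten
```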
All of this is to say that there is subjectivity involved in this process.