Are you getting the whole story? Investigating mainstream news sources

Analyzing news headlines across the globe with Kimono and MonkeyLearn

The news is probably one of the first things people check in the morning, but how much does what you know and understand about the world depend on your news source? Will you view the world differently if you head over to CNN instead of BBC? Tools like MonkeyLearn and Kimono allow us to gather and analyze the text we encounter in our daily lives and determine what that text says behind the words – in this case, it turns out that different news sources report on different news with different weighting.

We analyzed the front page articles of five different major global news outlets – Fox, CNN, China Daily, Al Jazeera (English), and the BBC. We amassed the data and then used MonkeyLearn’s text analysis, a combination of entity extraction – a process that recognizes text corresponding to important people, places, and organizations – news classification and keyword extraction.

In this post, we share a few insights and explain how to replicate our process so that you can analyze any text on the web. This analysis is by no means conclusive, as it is based on just one day of data; instead, it demonstrates the types of insights you can get when you use smart tools like Kimono and MonkeyLearn.

Heatmap of top 20 people discussed by source:


Heatmap of top 20 organizations discussed by source:


Wordmap of news:


We did this in three steps:

1. Set up Kimono APIs for top news sites
2. Format our kimono API output
3. Upload data to MonkeyLearn and run it through the appropriate classifiers

1. Set up Kimono APIs for top news sites

For each news source, we created one ‘Headlines API’ to scrape the links to the articles on the front page, and fed those links into a secondary ‘Articles API’ to crawl the full article text behind each link. You can read more about how to do this in this blog post, or you can watch this video tutorial. Setting up this two step crawl enables you to quickly scrape a broad set of content.

Here’s a sample of our unprocessed output from our BBC Articles API (this is the content behind one of the BBC front page articles):

"article_text": "Police in Haiti have clashed with anti-government protesters angry about the high cost of fuel."
"article_text": "Several people were injured as police moved in to clear roadblocks set up in the capital, Port-au-Prince."
"article_text": "Protest organisers said that drivers who ventured into the city centre during the two-day protest would be putting their lives at risk."
"article_text": "Haiti has seen months of protests against President Michel Martelly over delayed elections."
"article_text": "\"Because of the price of fuel, the cost of living is going up,\" said Ralph La Croix, a Port-au-Prince resident."

To make this data compatible with MonkeyLearn, we need to join these ‘article_text’ properties into a single string per article.

2. Format our kimono API output

To format the data, you can use your language of choice (or you can write a JavaScript function directly in the kimono web interface).

In this scenario, we downloaded the data, stitched the article_text properties for each API into one string and pushed it to MonkeyLearn. Here is the code repository along with an IPython Notebook – we suggest you clone the repository and follow the example to see how we formatted the data and called the MonkeyLearn classifiers to analyze the text.
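As a rough illustration of the stitching step, here is a minimal sketch that joins the ‘article_text’ rows of one article into a single string. It assumes the rows arrive as a list of dictionaries shaped like the BBC sample above; the function name stitch_article is our own placeholder, not part of kimono or MonkeyLearn, and the notebook may do this differently.

```python
def stitch_article(rows, key="article_text"):
    """Join the text fragments of one article into a single string."""
    # Skip rows that lack the key or hold only whitespace.
    parts = [(row.get(key) or "").strip() for row in rows]
    return " ".join(part for part in parts if part)


# Rows shaped like the BBC sample output above (assumed structure).
sample_rows = [
    {"article_text": "Police in Haiti have clashed with anti-government "
                     "protesters angry about the high cost of fuel."},
    {"article_text": "Several people were injured as police moved in to clear "
                     "roadblocks set up in the capital, Port-au-Prince."},
]
article = stitch_article(sample_rows)
```

Each stitched string can then be used as one entry in the list of texts sent to MonkeyLearn.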

3. Upload data to MonkeyLearn and run it through the appropriate classifiers

Here are links to each of the kimono APIs we used, if you would like to get the data manually:
China Daily API JSON Endpoint
Al Jazeera API JSON Endpoint

Here is an excerpt of code from the IPython notebook that shows you how to run the MonkeyLearn classifiers on the formatted data:

import json

import requests

# src_data, objects, BATCH_SIZE and API_KEY are defined earlier in the notebook.

# We will store all of our results here. The first list will contain news
# classification, the second extracted entities, and the third will hold
# any extracted keywords. Positions will correspond to the position of the
# original piece of news in the objects variable.
partials = [[], [], []]
while src_data:
    # We will classify data in chunks of BATCH_SIZE.
    items = src_data[:BATCH_SIZE]

    # Prepare the data and headers for our request
    data = {'text_list': items}
    headers = {'Authorization': 'Token {0}'.format(API_KEY),
               'Content-Type': 'application/json'}

    # We will call the classifiers and extractors here, and then add the
    # result to the partials list. The request call itself was lost from
    # this excerpt, so this loop is a reconstruction: ENDPOINT_URLS is an
    # assumed name for a list of three MonkeyLearn URLs (classifier, entity
    # extractor, keyword extractor). See the notebook for the original.
    for i, url in enumerate(ENDPOINT_URLS):
        response = requests.post(url, data=json.dumps(data), headers=headers)
        result = json.loads(response.text)['result']
        partials[i] += result

    # Remove the classified items from the list of items.
    src_data = src_data[BATCH_SIZE:]

# Let's zip the original news with the data MonkeyLearn returned. We will
# end up with a list containing tuples that look like:
# (news text, news categories, extracted entities, extracted keywords)
data = zip(objects, *partials)

By running this scraped data through MonkeyLearn’s categorizers and extractors, we are able to analyze and graph this data to generate different insights. See a few samples below, or view more in the Notebook.
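To give a feel for how extractor output can become a heatmap like the ones at the top of this post, here is a minimal sketch that counts entity mentions per source. It assumes the results have already been flattened into (source, entity) pairs; the function name mention_matrix and the toy data are illustrative placeholders, not code or data from our notebook.

```python
from collections import Counter, defaultdict


def mention_matrix(mentions, top_n=20):
    """Build a source-by-entity count table for the top_n most mentioned entities.

    mentions is an iterable of (source, entity) pairs. Returns a tuple
    (entities, table) where table[source][entity] is a mention count.
    """
    totals = Counter()
    per_source = defaultdict(Counter)
    for source, entity in mentions:
        totals[entity] += 1
        per_source[source][entity] += 1

    # Keep only the entities with the most mentions overall.
    entities = [entity for entity, _ in totals.most_common(top_n)]
    table = {
        source: {entity: counts[entity] for entity in entities}
        for source, counts in per_source.items()
    }
    return entities, table


# Made-up pairs for illustration only; not data from our analysis.
toy_mentions = [
    ("outlet_a", "Person A"),
    ("outlet_a", "Person A"),
    ("outlet_b", "Person A"),
    ("outlet_b", "Person B"),
    ("outlet_c", "Person B"),
]
entities, table = mention_matrix(toy_mentions, top_n=2)
```

Each row of the table corresponds to one source and each column to one entity, which is the shape most plotting libraries expect for a heatmap.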


Using this process, we garnered some preliminary insights on these news outlets, such as:

  • Only 1 person and 4 organizations were consistently mentioned across all global outlets – Barack Obama, Congress, the White House, Facebook and the United Nations
  • Al Jazeera and Fox had the highest overlap, with 98 terms, while Al Jazeera and the BBC had the lowest overlap, with 44 terms
  • Mike Pence, the infamous Indiana Governor who passed a controversial ‘religious freedom law’ in Indiana, made the news in all outlets except China Daily
  • NCAA mentions only appeared on CNN – with March Madness entering the final week.
  • Boko Haram appears consistently only on Fox and Al Jazeera (we ran this analysis for longer, and this is consistent across months)
  • Bowe Bergdahl, the once-missing American soldier charged with deserting his unit, appeared significantly more often (10x) on CNN than on Fox, and was not mentioned outside the US
  • The corruption charges against Senator Menendez of New Jersey (D) had a high number of mentions, but appeared exclusively on Fox and the BBC, not CNN.
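
The overlap figures above boil down to set intersections. The sketch below shows one way to compute them, assuming each source's extracted terms are available as a list; pairwise_overlap and the toy terms are hypothetical and are not taken from our notebook.

```python
from itertools import combinations


def pairwise_overlap(terms_by_source):
    """Return {(source_a, source_b): number of shared terms} for every pair."""
    term_sets = {source: set(terms) for source, terms in terms_by_source.items()}
    return {
        (a, b): len(term_sets[a] & term_sets[b])
        for a, b in combinations(sorted(term_sets), 2)
    }


# Made-up terms for illustration only; not data from our analysis.
toy_terms = {
    "outlet_a": ["economy", "election", "fuel"],
    "outlet_b": ["election", "fuel", "sports"],
    "outlet_c": ["weather", "fuel"],
}
overlaps = pairwise_overlap(toy_terms)
# The pair of sources sharing the most terms.
closest_pair = max(overlaps, key=overlaps.get)
```

Converting each list to a set first means repeated mentions of a term within one source do not inflate the overlap.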

We’d need to run a longer analysis to draw strong conclusions, but it is interesting to note that in this analysis, as in our past news analyses, negativity seems to win in reporting: both CNN and Fox spend more time on negative content about the Democratic party than on positive content about the Republicans.

Open March Madness API

It’s basketball season, and we’re in full March Madness mode here at kimono. As not-so-closeted-nerds, we like to dig into the data behind everything (plus, we’ll do basically anything to win the office pool). But, since there isn’t a fully open and well documented REST API for the March Madness Tournament, we decided to build one.

Our (un)official March Madness API includes:

  • Historical data for each team
  • RESTful JSON endpoints
  • Detailed docs
  • API explorer to easily run queries without writing any code

We can’t wait to see what you will build with the March Madness API! Use the data explorer to see which team has the most wins, which player had the highest field goal percentage, which team scored the most points in a single game, and anything else your heart desires.

Tomato or Tomahto? How Next Caller uses kimono to add pronunciation to caller ID

Next Caller provides Advanced Caller ID™ to businesses. Across industries, inefficient exchanges between sales/support reps and customers cost $14B annually. With the Next Caller API, unknown inbound phone numbers get matched with relevant background data for the caller, giving reps near-instant access to all relevant data on a customer.


Hacking real estate to find the best off-market deals

Winning in real estate is about better information, with data sources like Redfin and MLS as table stakes. The most successful real estate players win by having an information edge. Kimono is a smart web-scraper – letting innovative real estate players like Michael Tomko find leads that others can’t. Here’s how Michael identifies high value properties and unlocks the best deals on the market.


Fox vs. CNN: Who’s got Obama on the mind?

Fox News on 1/29/15

CNN on 1/29/15

Turning the mountains of unstructured text scattered across the web into insights can be a daunting prospect. At kimono, we are working to make this much easier. Take news for example – The New York Times alone publishes 350+ pieces of content per day; The Huffington Post releases 1,200. With just a few kimono APIs, we can create a structured corpus of text that we can mine to understand trends, biases and patterns across sources. We’ll make our first scratch on the surface here by setting up APIs for CNN and Fox News to ‘read’ every article on each site’s front page and compare which meaningful words each one uses.
