Fox vs. CNN: Who’s got Obama on the mind?

Fox News on 1/29/15

CNN on 1/29/15

Turning the mountains of unstructured text scattered across the web into insights can be a daunting prospect. At kimono, we are working to make this much easier. Take news, for example: The New York Times alone publishes 350+ pieces of content per day; The Huffington Post releases 1,200. With just a few kimono APIs, we can create a structured corpus of text that we can mine to understand trends, biases and patterns across sources. We’ll make our first scratch on the surface here by setting up APIs for CNN and Fox News to ‘read’ every article linked from each site’s front page and compare which meaningful words each site uses most.

We’ll follow 3 steps to set this up:

  1. Make 2 APIs for each news site: a Detail API and a Source API
  2. Link the Detail API to the Source API
  3. Prepare the data for analysis with Modify Results

  1. Make Kimono APIs

To start, we need to make a kimono API to crawl our data. We’re using news websites with multiple headlines on the front page, and we want the full text of each news article. With Kimono, we can do this by using two APIs – a source API and a detail API. The source API will crawl the front page and give us a list of links. The detail API will scrape each article, fetching the content within those links.

Let’s start by going to www.cnn.com, launching the kimono chrome extension and selecting all of the headlines from the front page. When we go to inspect the headline elements, we see that all headlines share a common class which is “.cd__headline-text” so we’ll use that as our CSS selector to grab all of the headlines on the page.

Next, we’ll go into one of the links from the CNN source API, and we’ll grab the article text. Make sure to select more than the first paragraph so that kimono grabs the full text of the article. As you can see in the raw data view, kimono will break each paragraph up into a different property, by default. Don’t worry about that for now though, since we’ll be aggregating all of the articles anyway. We’ll name and save our API and then go to the crawl setup.

  2. Link Detail API to Source API

Now we’ll link these two APIs together to get the full article text for all articles linked to on the homepage. Our detail API gives us the full text content of one article, but we need to link it to the source API to tell it where to look to get content for the other articles.

In the crawl setup tab for the detail API, we must set our crawl strategy to “URLs from a source API”. Then, we’re given a list of our APIs to choose from. Selecting our CNN headline API as our source API, we see all of the headline URLs appear in the text box. Awesome! Now let’s start our crawl and look at our JSON output!

The output looks good, but we still need to do a bit of formatting. As you can see, each block of text is an object in the collection1 array, and they are not separated into their own objects by article. Let’s write a function to group all the text elements for each article together.
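If you do want the paragraphs grouped per article before merging everything, a minimal sketch might look like the following. It assumes each item in collection1 carries a `url` property identifying the article it came from, alongside a `complete_text` paragraph. The `url` field, the helper name `groupByArticle`, and the sample rows are all assumptions for illustration, so check your own JSON output for the actual field names.

```javascript
// Hypothetical helper: group paragraph rows by the article they came from.
// Assumes each row looks like { url: '...', complete_text: '...' }.
function groupByArticle(rows) {
  var byUrl = {};
  rows.forEach(function (row) {
    if (!byUrl[row.url]) {
      byUrl[row.url] = [];
    }
    byUrl[row.url].push(row.complete_text);
  });
  // One object per article, with its paragraphs joined into a single string.
  return Object.keys(byUrl).map(function (url) {
    return { url: url, text: byUrl[url].join(' ') };
  });
}

// Made-up sample rows standing in for collection1.
var sampleRows = [
  { url: 'http://example.com/a', complete_text: 'First paragraph.' },
  { url: 'http://example.com/a', complete_text: 'Second paragraph.' },
  { url: 'http://example.com/b', complete_text: 'Another article.' }
];
var articles = groupByArticle(sampleRows);
```

For the word clouds in this post we do not need the per-article grouping, since all of the text ends up in one long string anyway.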

  3. Prepare the data with Modify Results

Now that we’ve gathered our data, let’s operate on it to make it more meaningful. We’ll be using the new Modify Results feature to remove common ‘stop’ words like ‘the’, or ‘be’ from the data set to get to the words that matter. Once we’ve done that, we will feed the data into a word cloud generator to create a simple visual representation of who said what more.

Kimono’s new Modify Results feature lets you take in data from a Kimono API and transform it with a JavaScript function that you write in the browser and that runs on our cloud. You can then retrieve either the transformed data or the raw data from your API.

Let’s outline what our stop word filtering function will do, at a high level:

  1. Take in raw text data from our Kimono API
  2. Iterate through the data to ensure the results are strings (rather than objects made up of strings and hrefs)
  3. Merge all the article data into one long string
  4. Remove all punctuation using a regular expression
  5. Remove all stop words that exist in our stop word corpus
  6. Combine this back into a string and return this object

Here’s the entire function, below. We’ll dive into each part in more depth in a moment.

function transform(data){

// list of common words that we want to remove from the incoming data

var commonWords = ['fox', 'news', 'foxnews', 'advertisement','com','followfox', 'fox411', 'a', 'about', 'above', 'according',] //truncated for clarity.
var output = data.results.collection1

//ensure that all of our results are just strings (no hrefs)
.map(function(item) {
   if(item.complete_text.text) {
      return item.complete_text.text;
   }
   else if(item.complete_text.text === ''){
      return '';
   }
   else {
      return item.complete_text;
   }
})

//combine all articles into one string
.join(' ')

//split on whitespace and punctuation (\W matches any non-word character)
.split(/\W+/)

//filter out words that are in our commonWords array
.filter(function(word) {
   if(commonWords.indexOf(word.toLowerCase()) !== -1) {
      return false;
   }
   else {
      return true;
   }
})
//put it all back together
.join(' ');

//return the string we just processed
return output;
}

Part 1: Take in raw text data from our kimono API

All Modify Results functions start by declaring a function (by default it’s called transform, but you can rename it if you like). The transform function takes a data object as its only argument. This is your normal Kimono data, the same thing you see when you go to your JSON endpoint. In the code snippet below, we declare our function and initialize commonWords, an array of words to filter out.

function transform(data) {

// list of common words that we want to remove from the incoming data

var commonWords = ['fox', 'news', 'foxnews', 'advertisement','com','followfox', 'fox411', 'a', 'about', 'above', 'according',] //truncated for clarity.

Part 2: Iterate through our data to ensure the results are strings

Depending on how you’ve set up your API, you may sometimes get an href returned in addition to your text content. We want to drop the href and keep just the text. We do this by taking the collection1 piece of our data object, which is an array holding our data, and doing some basic processing on each item. If complete_text has a non-empty text property, we return just that, effectively removing the href if it exists. If the text property is an empty string, we return an empty string. Finally, if there is no text property at all, we know there is no href either, so we return the complete_text property itself, which in this case is a string.

Note that this section begins a pattern we’ll see for the rest of our program: we chain a series of array operations, and finally return the result in the output variable we initialize here.

var output = data.results.collection1

//ensure that all of our results are just strings (no hrefs)

.map(function(item) {
   if(item.complete_text.text) {
      return item.complete_text.text;
   }
   else if(item.complete_text.text === ''){
      return '';
   }
   else {
      return item.complete_text;
   }
})
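
To see the three branches in action, here is a small sketch that applies the same check to hand-made sample rows. The helper name `toText` and the sample data are made up for illustration; `complete_text` is the property name used by the API in this post.

```javascript
// Same three-way check as in the map() callback, pulled into a named helper.
// The name toText is hypothetical.
function toText(item) {
  var value = item.complete_text;
  if (value.text) {
    return value.text;   // object with non-empty text: keep the text, drop the href
  } else if (value.text === '') {
    return '';           // object whose text is empty
  } else {
    return value;        // no text property: value is already a plain string
  }
}

// Made-up rows covering each branch.
var rows = [
  { complete_text: { text: 'Linked paragraph', href: 'http://example.com' } },
  { complete_text: { text: '', href: 'http://example.com' } },
  { complete_text: 'Plain paragraph' }
];
var texts = rows.map(toText);
```

Note that this check assumes every row has a complete_text property; a row without one would throw, so you may want to guard for that if your API can return empty rows.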

Part 3: Merge all the article data into one long string

In part 2, we return an array. Now, we use the .join() method to combine this array into a giant string. This sets us up for the next step, where we will remove white space.

//combine all articles into one string

.join(' ')

Part 4:  Remove all punctuation using a regular expression

In this step, we parse our string back into an array that splits the text up based on white space and punctuation. After this step, we are back to an array.

//remove whitespace and punctuation

.split(/\W+/)
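
Here is a quick sketch of splitting on non-word characters with `/\W+/`, using a made-up sample sentence. `\W` matches any character that is not a letter, digit, or underscore, so runs of spaces and punctuation act as separators. One caveat: when the string begins or ends with a separator, `split` leaves an empty string at that end, and apostrophes will cut contractions in two. The extra `filter` step below is an illustrative addition, not part of the function in this post.

```javascript
var sentence = 'Obama said: "Yes, we can."';

// Split on runs of non-word characters.
var words = sentence.split(/\W+/);

// The sentence ends in punctuation, so the last element is an empty string.
// Dropping empty strings keeps them out of any later word counts.
var cleaned = words.filter(function (word) {
  return word.length > 0;
});
```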

Part 5: Remove all stop words that exist in our stop word corpus

Now that we’ve removed whitespace and punctuation, we’re ready to do the main piece of programming: removing common words. In part 1, we initialized the array commonWords with a list of, well, common words. Now, we’ll use the JavaScript filter() method to remove words from our data that can be found in the commonWords array. We use a simple if/else here: if the word (lowercased to match the words in commonWords) is not found in the array, we keep it; otherwise, we remove it. The mechanism is a bit tricky. First, we use the .indexOf() method to check whether the current word exists in the commonWords array. indexOf returns -1 if the value is not found; otherwise it returns the index where the word sits. Therefore, if the value is found in the array (indexOf returns something greater than -1), we return false, which removes the word from our output array. If the value is not found in commonWords, we return true, which means the word stays. When this bit of code finishes executing, we have an array filtered to exclude any of the words in commonWords!

//filter out words that are in our commonWords array

.filter(function(word) {

// if it is found, remove it.
   if(commonWords.indexOf(word.toLowerCase()) !== -1) {
      return false;
   }

//if not found in common words, keep it.
   else {
      return true;
   }
})
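
As the stop word list grows, `indexOf` has to scan the whole array for every word. A possible alternative, which is not what the function in this post uses, is to put the stop words in a `Set` so that each lookup is a direct membership check. This sketch assumes the environment running your function supports the ES2015 `Set`, which is worth verifying before relying on it. The short word list, the name `removeStopWords`, and the sample input are placeholders.

```javascript
// Placeholder stop word list; the real list in this post is much longer.
var stopWords = new Set(['a', 'about', 'above', 'according', 'the', 'to']);

// Hypothetical helper: keep only words that are not in the stop word set.
function removeStopWords(words) {
  return words.filter(function (word) {
    return !stopWords.has(word.toLowerCase());
  });
}

var kept = removeStopWords(['According', 'to', 'the', 'White', 'House']);
```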

Part 6: Combine this back into a string and return this object.

Finally, we combine our output array into a string, and return that string. Note that we do not need to return the data object we took as input; you can return another value or variable. The .join() method finishes our sequence of chained operations. When this is done, output (the variable we started with in part 2) is equal to our filtered string. We then return output, and close the function. That’s it!

//put it all back together

.join(' ');

//return the string we just processed

return output;
}

To access the transformed data, call your API at the normal JSON endpoint with the ‘&kimmodify=1’ parameter appended. This forces the function to run on your data. Otherwise, your data is returned unmodified.
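
For illustration, here is a sketch of building such a URL. Only the `kimmodify=1` parameter comes from this post. The endpoint shape, the `apikey` parameter name, the helper name `modifiedEndpoint`, and the placeholder values `YOUR_API_ID` and `YOUR_API_KEY` are assumptions, so copy the real endpoint from your own API page.

```javascript
// Hypothetical helper: append kimmodify=1 to an existing endpoint URL.
function modifiedEndpoint(baseUrl) {
  // Use '?' if the URL has no query string yet, '&' otherwise.
  var separator = baseUrl.indexOf('?') === -1 ? '?' : '&';
  return baseUrl + separator + 'kimmodify=1';
}

// Placeholder endpoint; the URL shape here is an assumption.
var url = modifiedEndpoint('https://www.kimonolabs.com/api/YOUR_API_ID?apikey=YOUR_API_KEY');
```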

The last step is to do something fun with this newly parsed data! For now, we created 2 word clouds, one for Fox and one for CNN, to see which news source talked about what the most! You can see the results from a scrape done on 1/29/2015 below.
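
Word cloud generators typically size each word by how often it appears. If you want the counts yourself, a sketch like the following tallies them from the filtered string. The function name `countWords` and the sample string are made up for illustration, and lowercasing before counting is a design choice so that differently capitalized copies of a word count together.

```javascript
// Hypothetical helper: count how often each word appears in a string.
function countWords(text) {
  // Object.create(null) avoids clashes with built-in names like "constructor".
  var counts = Object.create(null);
  text.split(/\s+/).forEach(function (word) {
    if (!word) { return; }          // skip empty pieces
    var key = word.toLowerCase();
    counts[key] = (counts[key] || 0) + 1;
  });
  // Turn the tallies into [word, count] pairs, most frequent first.
  return Object.keys(counts)
    .map(function (key) { return [key, counts[key]]; })
    .sort(function (a, b) { return b[1] - a[1]; });
}

var top = countWords('Obama budget Obama Bush budget Obama');
```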

Fox News on 1/29/15

CNN on 1/29/15

Looking at the results, it seems Fox had a lot more to say about Obama on 1/29. Funnily enough, Bush appears larger than Obama on the word cloud for CNN. It’s almost as if the more liberal-leaning CNN focuses on Bush while the more conservative Fox News keys in on Obama.
