Guest blog: Sentiment analysis on web scraped data with kimono and MonkeyLearn

Contributed by Raúl Garreta, co-founder of MonkeyLearn

New tools have enabled businesses of all sizes to understand how their customers are reacting to them – do customers like the location, hate the menu, would they come back? This increased volume of data is incredibly valuable but larger than any mere mortal can assess, understand and turn into action. Several technologies have emerged to help businesses unlock the meaning behind this data.

This blog looks at how Kimono, which scrapes and structures data at scale, and MonkeyLearn, which provides machine learning capabilities, can be used together to translate data into insight.

Kimono + MonkeyLearn

Kimono is a smart web scraper that makes it easy to get data from the web. Use the Chrome extension to launch the kimono toolbar on any website, click on the data you want and kimono does the rest – organizing your data and building an API in seconds.

MonkeyLearn is a platform for getting data from text using machine learning. It enables developers with any level of experience to extract and classify text information for their specific needs quickly and cheaply.

By using Kimono together with MonkeyLearn, users can extract large, changing data sets from the web and apply machine learning models like sentiment analysis, topic, language or keyword detection and entity recognition to enrich that information.

To show just how easy this is, we will build a hotel sentiment meter that detects how users feel about a particular hotel using kimono and MonkeyLearn.

How to create a hotel sentiment analysis detector with Kimono and MonkeyLearn

Our objective is to create a tool that measures the sentiment expressed in user hotel reviews.

We will use Kimono to extract hotel reviews from TripAdvisor and use those reviews to train a machine learning model with MonkeyLearn. The model will learn what makes a hotel review positive or negative and will be able to classify the sentiment of unseen hotel reviews.

Create a Kimono API

The first step is to scrape hotel reviews from TripAdvisor with kimono:

Install the Kimono Chrome extension from www.kimonolabs.com:
For more information on how to install the Kimono extension, visit this article.

Use Kimono on a webpage: To use kimono, navigate to the webpage you want to extract data from, and then click on the Chrome extension. In this tutorial we will use New York Inn reviews to create our hotel sentiment analysis classifier.

Select the data you want to scrape with Kimono: If you need help with this step, follow this simple tutorial. In our case we will extract the review title, the review content and the star rating.

In order to do this, we have to create three properties: “title”, “content” and “stars”. Click on the first field (title), then click the plus in the grey circle to add a new property for content and click on the content you want to extract. Do the same for the star rating. Each time you click on an element, kimono recognizes similar fields and suggests them to you. Click the check mark to accept the suggestions into the selection.

After selecting all the properties, we have to identify the pagination link, i.e. the link that kimono’s crawler must follow to reach the next page of reviews. Do this by clicking on the pagination icon and then clicking on the “>>” link.

Before we’re ready to create our Kimono API, we need to engage Kimono’s advanced mode to get the star ratings, since the rating information is contained in the ‘alt’ attribute in the webpage’s HTML. Enter advanced mode by clicking the Data Model View and then clicking ‘attributes’ to select the ‘alt’ property for the star rating field.

Then return to the Raw Data View to verify that kimono is retrieving the correct values.

That’s it! Now just click the Done button to create the API. Select manual crawl for the API refresh frequency and set the crawl limit to 50 pages max.

Get the Data

Now that we have our Kimono API defined, we are ready to start crawling the data. Go to the Crawl Setup tab in your API Detail and hit the Start Crawl button.

This kicks off a crawl, which will take a few seconds to finish. Once done, go to the Data Preview tab, select CSV as the data format and click the Download link.

Prepare the Data

Now that we have the data in our kimonoData.csv file, let’s pre-process it. We’ll do that with Python and the Pandas library. First, import the CSV file into a data frame, remove duplicates and drop the reviews that are neutral (3 of 5 stars):

import pandas as pd

# We use the Pandas library to read the contents of the scraped data
# obtained by Kimono, skipping the first row (which is the name of
# the collection).
df = pd.read_csv('kimonoData.csv', encoding='utf-8', skiprows=1)

# Now we remove duplicate rows (reviews)
df.drop_duplicates(inplace=True)

# Drop the reviews with 3 stars, since we're doing Positive/Negative
# sentiment analysis.
df = df[df['stars'] != '3 of 5 stars']

Then we create a new column that concatenates the title and the content:

# We want to use both the title and content of the review to
# classify, so we merge them both into a new column.
df['full_content'] = df['title'] + '. ' + df['content']

Then create a new column for the sentiment that we want to predict: Good or Bad. We will map reviews with more than 3 stars to Good, and reviews with fewer than 3 stars to Bad:

def get_class(stars):
    # The rating looks like '4 of 5 stars', so the first character is the score.
    score = int(stars[0])
    if score > 3:
        return 'Good'
    else:
        return 'Bad'

# Transform the number of stars into Good and Bad tags.
df['true_category'] = df['stars'].apply(get_class)
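
The mapping above relies on every rating being a string that starts with a single digit. As a rough sketch of a more defensive variant, the helper below parses the number with a regular expression and returns None for neutral or unparseable values. The names parse_stars and get_class_safe are ours, and the 'N of 5 stars' pattern is assumed from the value used in the filter above; real scraped data may need a looser pattern.

```python
import re

# Assumed rating format, based on the '3 of 5 stars' value seen earlier.
STARS_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s+of\s+5\s+stars")


def parse_stars(stars):
    """Return the numeric score from a string like '4 of 5 stars', or None."""
    match = STARS_PATTERN.match(str(stars))
    if match is None:
        return None  # missing value or unexpected format
    return float(match.group(1))


def get_class_safe(stars):
    """Map a rating string to 'Good' or 'Bad'; None for neutral or unknown."""
    score = parse_stars(stars)
    if score is None or score == 3:
        return None
    return 'Good' if score > 3 else 'Bad'


labels = [get_class_safe(s) for s in
          ['5 of 5 stars', '1 of 5 stars', '3 of 5 stars', None]]
```

Rows that come back as None could then be dropped before the dataset is saved.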

We'll keep only the full_content and true_category columns:

df = df[['full_content', 'true_category']]

A quick overview of the data shows that we have 429 Good reviews and 225 Bad reviews:

# Count the reviews in each sentiment category
df['true_category'].value_counts()

Good    429
Bad     225
dtype: int64
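
Since the two categories are not balanced, it helps to know how well a trivial model would do. The short sketch below (our own addition, using only the counts printed above) computes the accuracy of always predicting the most frequent category, which gives a floor to compare a trained classifier against.

```python
# Class counts taken from the value_counts() output above.
counts = {'Good': 429, 'Bad': 225}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# float() keeps the division exact on Python 2 as well as Python 3.
baseline_accuracy = counts[majority_label] / float(total)

print('%s baseline accuracy: %.3f' % (majority_label, baseline_accuracy))
```

Always answering Good would be right for roughly two thirds of these reviews, so an accuracy figure is only interesting when it is clearly above that.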

Finally, save our dataset in the MonkeyLearn format. To do this, remove the headers and the index column. The first column must be the text content and the second must be the category. We will encode the text in UTF-8:

# Write the data into a CSV file
df.to_csv('kimonoData_MonkeyLearn.csv', header=False, index=False, encoding='utf-8')
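
Before uploading, it can be worth checking that every row has the shape described above: two columns, text first and category second. The following is a minimal sketch with the standard csv module (Python 3). The check_rows helper and the sample rows are illustrative only and are not part of MonkeyLearn's tooling.

```python
import csv
import io


def check_rows(handle, allowed=('Good', 'Bad')):
    """Return (line_number, problem) tuples for rows that look malformed."""
    problems = []
    for line_number, row in enumerate(csv.reader(handle), start=1):
        if len(row) != 2:
            problems.append(
                (line_number, 'expected 2 columns, got %d' % len(row)))
        elif not row[0].strip():
            problems.append((line_number, 'empty text'))
        elif row[1] not in allowed:
            problems.append(
                (line_number, 'unexpected category %r' % row[1]))
    return problems


# Made-up sample rows standing in for the real file.
sample = io.StringIO(
    '"Great stay. Clean room, friendly staff.",Good\n'
    '"What a dump. Musty and overpriced.",Bad\n'
    '"Missing category"\n'
)
problems = check_rows(sample)
print(problems)
```

For the real file, the same function can be called on open('kimonoData_MonkeyLearn.csv', newline='', encoding='utf-8').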

Create a MonkeyLearn classifier

Now it’s time to move to MonkeyLearn to create a text classifier that classifies reviews into two possible categories: Good or Bad, depending on the text in the review. This process of extracting mood from unstructured text is called Sentiment Analysis.

First you have to sign up for MonkeyLearn. After you log in you will see the main dashboard. MonkeyLearn has pre-created text mining modules, but also allows you to create customized ones. In our case, we will build a custom text classifier, so within the Classification page, click the Create Module button.

In the form that pops up, select English as the working language and name the new module “Hotel Sentiment”.

Also, we need to set some advanced options. Click the Show advanced options link and:

  • Set N-gram range to 1-3
  • Disable Use stemming
  • Enable Filter stopwords, and use Custom stopwords: “the, and”
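
To give a feel for what these options mean, here is a small plain-Python sketch of n-gram features with a stopword filter. It is only an illustration: the ngrams function is ours, and how MonkeyLearn actually tokenizes text or orders these steps internally is not described in this post.

```python
def ngrams(text, n_min=1, n_max=3, stopwords=('the', 'and')):
    """Return all word n-grams of length n_min..n_max after stopword removal."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams


features = ngrams('the room was not clean')
print(features)
```

With a range of 1-3, a phrase such as "not clean" becomes a feature of its own, which matters for sentiment because the single word "clean" would otherwise point the wrong way.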

After clicking the Create button, you will see the module detail page.

Feed MonkeyLearn with Kimono

Time to feed the monkey. Go to the Actions menu and select Upload tree, then select the CSV file we created with Kimono’s data.

After the upload completes, MonkeyLearn will create the corresponding category tree on the left. You will see three nodes: Root (the starting point) and our two sentiment categories, Good and Bad. Clicking on each of the categories will reveal its text samples (the reviews we gathered with Kimono) in a list at the bottom right of the screen.

Train MonkeyLearn

Now let’s train the machine learning algorithm. Just click the Train button on the top right of the screen. You will see a progress bar while the machine learning algorithms are training the model in MonkeyLearn’s cloud. This process will take a few seconds or a few minutes depending on the complexity and size of your category tree and samples.

After completing the training, the module’s state will change to a green Trained, and you’ll see some statistics that show how well the module is doing in predicting the correct category (in our case the sentiment).

The metrics are Accuracy, Precision and Recall. These metrics are commonly used to evaluate the performance of machine learning algorithms.
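
For readers new to these metrics, the sketch below computes them from a list of true labels and a list of predicted labels, using the standard definitions and treating Good as the positive class. The evaluate function and the five sample labels are made up for illustration. How MonkeyLearn computes its own figures (for example, on which held-out samples) is not covered in this post.

```python
def evaluate(true_labels, predicted_labels, positive='Good'):
    """Return (accuracy, precision, recall) for the given positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / float(len(pairs))
    # Guard against division by zero when a class is never predicted or seen.
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up labels, only to exercise the function.
true_labels = ['Good', 'Good', 'Good', 'Bad', 'Bad']
predicted_labels = ['Good', 'Good', 'Bad', 'Good', 'Bad']

accuracy, precision, recall = evaluate(true_labels, predicted_labels)
print('accuracy=%.2f precision=%.2f recall=%.2f'
      % (accuracy, precision, recall))
```

Accuracy is the share of all reviews classified correctly, precision is the share of predicted Good reviews that really are Good, and recall is the share of truly Good reviews that the classifier found.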

You can also see a keyword cloud on the right that shows some of the terms that will be used to characterize the samples and predict the sentiment of the text. As you can see, they are terms that are semantically associated with positive and negative expressions about hotel features. Those terms are automatically determined by algorithms within MonkeyLearn.

If you want to look at a finished classifier, we created a public classifier for hotel sentiment analysis.

Test the sentiment analysis results

And voilà, we have a sentiment analyzer, created without writing any machine learning code. We can test the model directly from the GUI within MonkeyLearn. To test it, go to the API tab, write or paste in some text, click submit and you will see the prediction.

The results show an example API response when hitting the MonkeyLearn classifier’s API endpoint. The “result” entry shows the predicted label, in this case “Good”, with a corresponding probability, in this case 1. The label in our case will always be Good or Bad, and the probability is a real number between 0 and 1, where 1 means that the classifier is 100% sure about the prediction.

The classifier may still make some errors, that is, classify good reviews as bad and vice versa, but the good thing is that you can keep improving it. If you gather more training samples with tools like Kimono, you can upload them to the classifier, retrain and improve the results. You can also try different configurations in the advanced settings of your classifier and retrain the algorithm. Different settings usually work better for different classification problems (topic detection and sentiment analysis, for example, call for different setups).

Integrate the module with MonkeyLearn’s API

You can also do this programmatically, so you can easily integrate any MonkeyLearn module into your projects, in any programming language. For example, if you are working with Python, you can go to the API libraries, select Python and copy and paste the appropriate code snippet.

For example, we can classify a batch of new reviews (whose sentiment we don’t know in advance) with the batch classification endpoint. Let’s say we have the following list of reviews:

unlabeled_reviews = [
"""Super location
We stayed here the night before our cruise. Room was a little small and old,
but it was clean and functional for us. The location couldn't be beat. There
was a CVS right down the street within walking distance to pick up last minute
things. It was nice to be able to walk across the street to the shopping and...
""",
"""Fabulous stay
We stayed here for two days prior to our Getaway cruise. Our room was on the
10th floor. Alexandro checked us in and was very professional. Our room was
very clean. The bed was firm, but very comfortable. The shower pressure was
great. There is a pool but no hot tub.We could look out at the harbor and watch
ships.
""",
"""Terrible treatment
I had booked reservations 6 weeks in advance and upon arrival I was told they
would not honor my reservations and did not have a room for me. I traveled 4
hours and the front desk turned me away. I argued with them for over and hour
with no success. STAY AWAY!
""",
"""What a dump
Worst Holiday Inn ever. I've always thought Holiday Inn was quite consistent
but...I was wrong. The price was very high for a closet of a room. For $232.00
a night I would have expected much more. Was told I would have a view and what
I got was a brick wall. The bathroom door wouldn't even open all the way it was
so small. The whole hotel smelled very musty with lots of chipped paint in the
room and in the public areas.
"""
]

The following code shows how to use the batch classification endpoint to classify the previous reviews:

import requests
import json

# At this point we'll use the MonkeyLearn API to classify the new reviews
data = {
    'text_list': unlabeled_reviews
}

# you must put your MonkeyLearn token here:
TOKEN = ''

# the headers must contain our authorization token
headers = {
    'Authorization': 'Token ' + TOKEN,
    'Content-Type': 'application/json'
}

# you must put your MonkeyLearn classifier endpoint here:
CLASSIFIER_ENDPOINT = 'https://api.monkeylearn.com/api/v1/categorizer/cl_rZ2P7hbs/classify_batch_text/'

response = requests.post(
    CLASSIFIER_ENDPOINT,
    data=json.dumps(data),
    headers=headers
)

results = response.json()

The results may look something like this:

{u'consumed_queries': 4,
 u'queries_left': 988,
 u'result': [[{u'label': u'Good', u'probability': 1.0}],
             [{u'label': u'Good', u'probability': 1.0}],
             [{u'label': u'Bad', u'probability': 1.0}],
             [{u'label': u'Bad', u'probability': 1.0}]],
 u'status_code': u'200'}

That is, the first two reviews are classified as Good and the last two as Bad. We can see that the classifier is pretty sure about the prediction as it returns very high probabilities (close to 1.0).
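
To put the response to work, each prediction can be paired with the review it belongs to. The sketch below does this with a copy of the sample response shown above, so it runs without calling the API. It assumes the predictions come back in the same order as the submitted texts and that the first entry of each inner list is the predicted label, as in that sample. MIN_PROBABILITY is a hypothetical threshold of ours for flagging uncertain predictions.

```python
# Copy of the sample response above, so no API call is needed here.
results = {
    'consumed_queries': 4,
    'queries_left': 988,
    'result': [
        [{'label': 'Good', 'probability': 1.0}],
        [{'label': 'Good', 'probability': 1.0}],
        [{'label': 'Bad', 'probability': 1.0}],
        [{'label': 'Bad', 'probability': 1.0}],
    ],
    'status_code': '200',
}

# Titles of the four reviews, in the order they were submitted.
review_titles = ['Super location', 'Fabulous stay',
                 'Terrible treatment', 'What a dump']

MIN_PROBABILITY = 0.8  # hypothetical cut-off for a manual check

labelled = []
for title, predictions in zip(review_titles, results['result']):
    top = predictions[0]
    needs_review = top['probability'] < MIN_PROBABILITY
    labelled.append((title, top['label'], top['probability'], needs_review))

for title, label, probability, needs_review in labelled:
    note = '  (check manually)' if needs_review else ''
    print('%-20s %-5s %.2f%s' % (title, label, probability, note))
```

With real data, predictions below the threshold are good candidates to label by hand and upload as extra training samples.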

Conclusion

We combined Kimono and MonkeyLearn to create a machine learning model that learns to predict the sentiment of hotel reviews. Kimono helped us easily retrieve the training data from the web and MonkeyLearn helped us to build the sentiment analysis classifier.

But this is just the tip of the iceberg. There’s much more we can do.

If you are a Kimono user, you can use MonkeyLearn’s pre-trained modules to easily enrich your Kimono APIs by adding sentiment analysis, topic detection, language detection, keyword extraction, entity recognition and more to the information you gather from the web with Kimono. If you have a specific need, you can create a custom module with MonkeyLearn to process the information you extract the way you need, as we did in this post with our custom sentiment analysis classifier for hotels.

If you are already a MonkeyLearn user, you can use Kimono to easily extract samples to train your custom modules and create powerful machine learning models in just a few minutes. If you’re not, you can sign up for MonkeyLearn using this promo code – KIMONO+MONKEY – to get US$20 off on any plan (until March 17, 2015).

Have any cool ideas on how to combine Kimono and MonkeyLearn? Share them with us in the comments.
