What I learned from Text Mining 400 Spam Comments on my Blog using R

Hey everyone! Welcome back to another amazing analytics post this week. If you are a frequent visitor to my blog and have left a genuine comment here, you may have noticed that your comment never appeared. As the screenshot above shows, out of a total of 999 comments, I have marked 400 as spam.

I was reading through some really interesting comments on my blog and thought, why not try some text analytics to see what spammers are most interested in talking about on my blog?

A simple explanation first: text mining is a common way to summarise, and often to run sentiment analysis on, long stretches of text that many market researchers do not want to read through. By picking out specific words across the whole data set, researchers try to find out what the general public is talking about. In this instance, I want to find out what the spam comments posted to my blog are generally about.

A bit of Data Cleaning: The very manual and boring part…

I started off by copying the 400 comments and saving them in a txt file. As my professor always said, data analytics is about 80% data cleaning and 20% analysis. I would change the 20% analysis to 19% and add 1% for insights, which is what the business world truly values.
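For anyone curious what the cleaning step might look like in R, here is a minimal sketch in base R. It is not the exact script I ran: the sample comments, the file name mentioned in the comment and the helper names clean_comment and tokenise are all made up for illustration.

```r
# Hypothetical stand-ins for the saved comments. In practice they could be read
# with readLines("spam_comments.txt"), where the file name is an assumption.
comments <- c(
  "This is an AMAZING blog! Visit https://example.com/reviews",
  "Great site... see http://example.org/books and http://example.org/trips"
)

# Lower-case the text, fold https into http, and turn punctuation into spaces
clean_comment <- function(text) {
  text <- tolower(text)                              # case-insensitive counting
  text <- gsub("https", "http", text, fixed = TRUE)  # recode https as http
  text <- gsub("[^a-z0-9]+", " ", text)              # punctuation becomes a space
  trimws(text)
}

# Split each cleaned comment into a vector of words (one list entry per comment)
tokenise <- function(text) {
  strsplit(clean_comment(text), " ", fixed = TRUE)
}

tokens <- tokenise(comments)
print(tokens[[1]])
```

Replacing punctuation with a space, rather than deleting it, is a design choice: it keeps http as a word of its own instead of gluing it to the rest of the web address.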

My First Round of Analysis

After a massive cleaning exercise, here is the first set of results, shown as a word cloud of the 30 most frequent words in the spam.
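A top-30 list like the one behind the word cloud can be sketched in a few lines of base R. The words vector and the top_words helper below are hypothetical, and the drawing step assumes the wordcloud package, which is only used if it happens to be installed.

```r
# Hypothetical tokens standing in for the cleaned spam comments
words <- c("http", "blog", "http", "site", "http", "blog", "amazing")

# Count every word and keep the n most frequent ones
top_words <- function(words, n = 30) {
  counts <- sort(table(words), decreasing = TRUE)
  head(counts, n)
}

top <- top_words(words, n = 30)
print(top)

# Optional: draw the cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(top), freq = as.vector(top),
                       min.freq = 1, max.words = 30)
}
```

With real data the words vector would simply be every token from every comment, for example unlist() applied to a list of tokenised comments.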

The most popular keyword is http… which means people are spamming websites.

The most popular keyword in the comments is http, which many web addresses start with. (https was likely counted too, with the s removed and the word recoded as http.) The second most popular keyword is urlqhttp, which is probably also part of a web address.

In 400 comments, http appeared close to 8000 times.

In 400 comments, a web address appeared close to 8000 times, which means that on average spammers were posting about 20 links per comment. (They are probably trying to create backlinks to their websites to improve their search engine rankings, which could also damage my own site's search engine ranking if it carries too many outbound links.) Thankfully these comments never saw the light of day.
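The links-per-comment figure is just a count divided by the number of comments. As a rough sketch, this is how the links could be counted directly from the raw text in base R; the sample comments and the count_links helper are hypothetical, and only the 8000 and 400 figures come from my data.

```r
# Hypothetical raw comments
comments <- c(
  "Nice post http://example.com/a and https://example.com/b",
  "No links here",
  "http://example.org http://example.org http://example.org"
)

# Count how many times a web address starts in each comment
count_links <- function(text) {
  matches <- gregexpr("https?://", text, ignore.case = TRUE)
  # gregexpr returns -1 when nothing matches, so count only real positions
  vapply(matches, function(m) sum(m > 0), integer(1))
}

links_per_comment <- count_links(comments)
print(links_per_comment)        # 2 0 3
print(mean(links_per_comment))  # average links per comment in the sample

# The figures from the post: about 8000 links across 400 comments
print(8000 / 400)               # 20
```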

Site and blog were the next most frequent words, at about 1.5 times per comment, which makes sense. Spam comments tend to say things like "This is an amazing blog/site" before adding in other things.

These links all appeared 582 times, so it is more or less safe to assume they were posted by the same poster.

These websites were also the most frequent in the comments. Since they all appear with the same frequency, it is likely that a poster has built a bot to post the same thing over and over again. (Or perhaps they are that free and did it manually.) It was nice to know that the spammers on my blog are interested in reviews, trips and books, linking things, and something German involving Freiheit, which means freedom (yes, I learned German for 4 years). I did not open the links as I was worried about potential spyware.
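The "same count, probably same poster" reasoning can be checked mechanically by grouping terms that share an identical frequency. A small sketch in base R follows; the term names, the non-582 counts and the shared_counts helper are made up for illustration, and only the 582 figure comes from my results.

```r
# Hypothetical term counts; only the 582 figure comes from the post
term_counts <- c(reviews = 582, trips = 582, books = 582, blog = 600, site = 610)

# Group terms by their count and keep counts shared by several terms
shared_counts <- function(counts, min_terms = 2) {
  groups <- split(names(counts), counts)
  groups[lengths(groups) >= min_terms]
}

print(shared_counts(term_counts))
```

Terms that land in the same group were very likely pasted together in every copy of the same comment, which is what a bot would produce. It is a hint rather than proof, since unrelated words can share a count by coincidence.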

Okay, that is enough analysis for today. If you are interested, do drop by for round 2! If the viewership is high enough, I’ll likely run another analysis on more comments in future.
