Sunday, April 12, 2020

A few insights into blog comment spam

People have been spamming blog comments for years.  From manual beginnings, people developed spambots to automate the process, which understandably resulted in a huge increase in spam traffic.  In earlier days, the intention was a mixture of attempts to build traffic to legitimate sites (for both manual traffic and search engine optimisation), and various scams including pump and dump.

Many blogs responded by turning off comments altogether; in my case I've vetted comments before publishing.

In more recent years, the spam comments have become apparently innocuous, and don't even try to link to other web sites.  This left me wondering what was going on, but I found it difficult to find out.

So with my previous post, I set up a honeypot to gather spam comments over time, with the intention of analysing them to understand better what was going on.  What follows is a first-pass report on the results, with some limited insights; I may add to it when I know more.

I posted the honeypot in August 2019, and spambot comments flowed in for about two and a half months, before abruptly stopping.  The implication is that they all came from the same source - or at least used the same mechanism.

I analysed about 100 comments through a Natural Language Processing framework.  This is a form of Machine Learning (which is popularly referred to as Artificial Intelligence, although I don't think that's an accurate term).  It wasn't able to tell me that much.  Amongst other things, sentiment analysis returned a high positive sentiment score.  This was fairly obvious already: to get past spam vetting systems, the spambots were intentionally fed relatively upbeat phrases.  They were mostly quite general comments, either about liking the blog or asking for help with their own blog (again, no links).  But it was possible to tell in the first instance that a comment was spam simply because it made no specific reference to the subject matter of the blog post.  To make this clear, in the honeypot I asked commenters to include a specified word to flag that they had read the post, which of course is beyond the capability of automated commenting tools.
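
The flag-word check can be sketched in a few lines of Python.  This is only an illustration, not the tooling I actually used: the function name mentions_flag_word, the flag word "pineapple" and the sample comments are all made up for the example.

```python
import re

def mentions_flag_word(comment: str, flag_word: str) -> bool:
    """Return True if the comment contains the flag word as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(flag_word) + r"\b"
    return re.search(pattern, comment, flags=re.IGNORECASE) is not None

# Hypothetical sample comments, invented for illustration only.
comments = [
    "Great blog, I really enjoyed reading this!",
    "Pineapple: interesting that the spam stopped so abruptly.",
]

# Keep only the comments that include the (assumed) flag word.
likely_human = [c for c in comments if mentions_flag_word(c, "pineapple")]
```

Anything that fails the check is not necessarily spam, but in a honeypot that asks for the word explicitly it is a strong hint.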

The only thing I've really gotten from the NLP system so far is that very frequently the comments are very close variants of each other - in groups of two, three or more.  It's as if someone put together three sentences, made some variants on a few keywords/phrases, and then got the spambot to switch the words around frequently enough to avoid getting caught by automated processes that block groups of identical comments.
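
For anyone wanting to look for the same pattern, here is a minimal sketch of grouping near-identical comments using Python's standard difflib module.  It is not the NLP framework I used; the function names, the 0.7 threshold and the sample comments are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def group_near_duplicates(comments, threshold=0.7):
    """Greedily put each comment into the first group whose first member it resembles."""
    groups = []
    for comment in comments:
        for group in groups:
            if similarity(group[0], comment) >= threshold:
                group.append(comment)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([comment])
    return groups

# Hypothetical sample comments, invented for illustration only.
samples = [
    "I really like your blog and the great design",
    "I really love your blog and the great layout",
    "Can you help me set up my own site please",
]
groups = group_near_duplicates(samples)
```

The threshold is a judgement call: set it too low and unrelated upbeat comments get lumped together, too high and the keyword-swapped variants slip through as distinct.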

So it looks like an arms race between sets of automated tools: a battle to inject comments on the one hand, and to deflect them on the other.  What hasn't yet been answered to my satisfaction is why the spambots are still running but are not delivering weblink payloads.  My only guess is as before: that the spambots are being used to pinpoint blogs/news sites that let unfiltered comments through.  My feeling is that there must be more to it than that, so suggestions are welcome.

PS: Will a new post get those spambots started again on this blog?  Let's see.
