Toward better spam filters for blog comments
One of the challenges of running a blog is the proliferation of blog comment spam. Huh? Yeah, believe it or not, the folks who brought you email spam are now dumping info about off-shore gambling and off-shore pharmaceuticals into the comments of your favorite blogs.
Right now, responses seem to be limited to banning particular IP addresses and banning specific words. IP blocks are tough - mostly because the bad guys are able to spoof their way through thousands of different IPs (and because sometimes you'll catch innocent folks on legitimate dial-up IPs). Banning words is worse - it's easy for comment spammers to switch from, say, cialis to cia|is - not to mention the collateral damage of banning good words with bad words hiding inside them - like socialism and specialist.
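To make the word-ban problem concrete, here's a rough sketch in Python. The ban list and function names are made up for illustration and aren't taken from any blog package. It shows how a plain substring match catches innocent words, and how even a stricter whole-word match is beaten by swapping one character.

```python
import re

BANNED = ["cialis"]  # hypothetical ban list, for illustration only


def substring_filter(comment: str) -> bool:
    """Naive filter: block if a banned word appears anywhere in the text."""
    text = comment.lower()
    return any(word in text for word in BANNED)


def whole_word_filter(comment: str) -> bool:
    """Stricter filter: block only when a banned word stands alone."""
    text = comment.lower()
    return any(
        re.search(r"\b" + re.escape(word) + r"\b", text) is not None
        for word in BANNED
    )


# Collateral damage: "socialism" and "specialist" both contain "cialis".
print(substring_filter("A post about socialism"))   # blocked by mistake
print(whole_word_filter("A post about socialism"))  # allowed

# Evasion: swapping one letter for a lookalike slips past both filters.
print(whole_word_filter("Buy cia|is now"))
```

Whole-word matching fixes the collateral damage but does nothing about the lookalike spellings, which is why word lists alone keep losing.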
The cialis/socialism problem was first noticed a couple weeks ago by Jeff Jarvis at the BuzzMachine.
I had to put "cialis" in my comment-spam filter to stay ahead of the swine. But, of course, this is stopping people from putting up legitimate words. I should fix that. But I'm kind of enjoying the discovery. First, they couldn't say "socialism" and thought I was trying to turn that into a dirty word. Now it's "specialist." Can we ask the makers of performance-enhancing drugs to please come up with names whose order of letters does not appear elsewhere in the English language?
That led to some thinking on my part about better comment spam filters. Here's my comment over at the BuzzMachine:
What I can't figure out is why the blog software providers haven't figured out some pretty obvious tools for this stuff. We're not limited to simple word filters, registration schemes - or leaving the door wide open.
For example, a 'click to confirm' mechanism could be easily built. When a comment goes up, an email would be sent. When the emailed link is clicked, the comment goes up. To make it less annoying, you might then allow future comments with the same IP/email pair to automatically go up without clicking. The blog owner could also ban particular email addresses. (This wouldn't stop comment spam altogether, but make it harder on automated systems - and more costly, in time, to spam yours. They'd find another victim.)
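Here's a minimal sketch of how that flow might look in Python. Everything in it is an assumption for illustration: the class name `ConfirmationGate`, the in-memory storage, and the status strings are invented, and the step that actually sends the email is left as a comment.

```python
import secrets


class ConfirmationGate:
    """Holds comments until the commenter clicks an emailed link."""

    def __init__(self):
        self.pending = {}           # token -> (ip, email, text)
        self.trusted = set()        # (ip, email) pairs that have confirmed before
        self.banned_emails = set()  # addresses the blog owner has banned
        self.published = []         # comments that are live on the blog

    def submit(self, ip, email, text):
        """Return (status, token); token is None unless status is 'pending'."""
        if email in self.banned_emails:
            return ("rejected", None)
        if (ip, email) in self.trusted:
            self.published.append(text)
            return ("published", None)
        token = secrets.token_urlsafe(16)  # unguessable value for the link
        self.pending[token] = (ip, email, text)
        # A real system would now email a confirmation link containing `token`.
        return ("pending", token)

    def confirm(self, token):
        """Called when the emailed link is clicked; publish the held comment."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return False  # unknown or already-used token
        ip, email, text = entry
        self.trusted.add((ip, email))
        self.published.append(text)
        return True
```

The first comment from a new IP/email pair waits for the click; later comments from the same pair go straight up, and a banned address is turned away before any email is sent.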
Another option: Use the very power of blogs - their audience - against the comment spammers. Why not a "report this comment as spam" link? If it gets X clicks from audience members, the comment would get pulled and put in an approval queue for the moderator. You would, of course, run the risk of audience censorship of unpopular non-spam comments - but that's why you set that threshold at an appropriate point. (It'd be different for every blog, depending on audience size.)
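As a rough sketch of the threshold idea in Python (the names are invented, and counting each reader only once is my own assumption, added so one person can't cross the threshold by clicking repeatedly):

```python
class FlaggableComment:
    """A comment that readers can report as spam."""

    def __init__(self, text):
        self.text = text
        self.reporters = set()   # ids of readers who have reported it
        self.status = "visible"


def report_spam(comment, reporter_id, threshold):
    """Record one reader's report and return the comment's status."""
    # A set ignores repeat clicks from the same reader (an assumption,
    # not something the proposal above spells out).
    comment.reporters.add(reporter_id)
    if comment.status == "visible" and len(comment.reporters) >= threshold:
        # Pulled from the page and queued for the moderator, not deleted.
        comment.status = "awaiting_moderation"
    return comment.status


comment = FlaggableComment("Win big at our casino!")
for reader in ["ann", "ann", "bob", "cho"]:
    print(reader, report_spam(comment, reader, threshold=3))
```

The `threshold` argument is the X from the paragraph above; a small blog might set it at 2 or 3, a busy one much higher.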
Finally, we could set up Bayesian filters - just like the ones people are using for email these days - to screen for spam. The system would easily and automagically distinguish the words that appear in legitimate comments from those that appear in comment-spam. It'd require some training, but my Bayesian email spam filters work fabulously now.
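For the curious, here's a bare-bones sketch of a naive Bayes word filter in Python. It's a toy under stated assumptions, not how any particular email filter or blog plug-in works: the class name, the tokenizer, and the add-one smoothing are all my own choices for illustration.

```python
import math
import re
from collections import Counter


class NaiveBayesCommentFilter:
    """Toy naive Bayes classifier over the words in a comment."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """label is 'spam' or 'ham' (ham = legitimate comment)."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed prior, so an untrained class never yields log(0).
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing the score.
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = log_p
        # Turn the two log scores into P(spam | text); clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


f = NaiveBayesCommentFilter()
f.train("cheap pills online casino", "spam")
f.train("win big at online casino poker", "spam")
f.train("great post, I agree with your point", "ham")
f.train("thanks for writing about blog software", "ham")
print(f.spam_probability("online casino pills") > 0.5)
print(f.spam_probability("I agree, great post") < 0.5)
```

The training step is the part that takes effort: the blog owner would have to mark a batch of comments as spam or legitimate before the scores mean anything.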
What are the blogging software guys waiting for?
So... my friends at Movable Type, WordPress, HaloScan, and Blogger... what gives? Let's get on it - who wants to be the new blogging industry champ?
Kari Chisholm | August 4, 2005 | Comments (5) |
Comments
Update: Looks like there's a new Bayesian spam filter that's a plug-in to Movable Type.
Haven't tried it yet. Anyone?
Posted by: Kari Chisholm | Aug 4, 2005 12:26:44 PM
Another update: Chris Nicholson just pointed the way to a SpamAssassin plug-in for WordPress.
I'm skeptical, since SpamAssassin is designed for email - and comment spam is a different animal, but it may be useful if it's got a loose screen - and only kicks to comment moderation when you get a positive match.
Posted by: Kari Chisholm | Aug 4, 2005 12:32:00 PM
I think that peer review is an excellent approach. You mentioned a "flag this link as spam" system with configurable limits to regulate at what point comments are sent to the moderated pile.
Slashdot also has a system that recruits moderators from its readership (http://slashdot.org/faq/com-mod.shtml). I'm not an active participant on Slashdot, but its moderation system seems like an interesting model, though it's probably overkill unless your blog has a very large following.
Posted by: Matt | Aug 4, 2005 4:46:13 PM
The only problem I have with peer review is this: what if it's a conservative blog and someone makes a liberal post, or vice versa? Isn't it more than a little possible that people would flag posts they didn't like as spam?
Posted by: Christopher Nicholson | Aug 5, 2005 11:45:16 AM
You're right, Chris, the peer review thing can be a challenge. I think you'd only want to have the "this comment is spam" button on comments that were NOT yet viewed by the blog owner. Or, perhaps, only on new comments (say under 48 hours old).
Then, the key is that there's a threshold - and if it's crossed, the comment just goes into a moderation queue; it isn't deleted.
Once the blog owner approves it, it would stay up for good and couldn't be flagged off again.
Posted by: Kari Chisholm | Aug 5, 2005 11:56:38 AM
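A rough Python sketch of those flagging rules follows. The names are invented for illustration, and combining the two suggestions (not yet approved by the owner, and under 48 hours old) into a single check is an assumption; the comment above offers them as alternatives.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Comment:
    posted_at: datetime
    approved: bool = False       # set once the blog owner has approved it
    in_moderation: bool = False  # hidden and waiting for the owner
    flags: int = 0


def can_be_flagged(comment, now, window=timedelta(hours=48)):
    """Show the 'this comment is spam' button only on new, unreviewed comments."""
    is_new = now - comment.posted_at < window
    return is_new and not comment.approved and not comment.in_moderation


def flag(comment, now, threshold):
    """Count one flag; past the threshold the comment is queued, not deleted."""
    if not can_be_flagged(comment, now):
        return False
    comment.flags += 1
    if comment.flags >= threshold:
        comment.in_moderation = True
    return True


def approve(comment):
    """Owner approval puts the comment back up and ends flagging for good."""
    comment.approved = True
    comment.in_moderation = False
```

Because `approve` sets a flag that `can_be_flagged` checks first, an unpopular but legitimate comment can be knocked into the queue at most once.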
