My spam situation is getting out of hand, especially at work. I now monitor at least a dozen different email addresses through my Outlook box, which has increased my spam intake exponentially (’cause the older the account, the more spam it tends to get).
So I finally decided to give ol’ SpamBayes a try. I’m researching some options for spam control and prefer Bayesian probability methods over blacklists and whitelists, primarily because I want this to be as low maintenance as possible. I haven’t been saving my spam as religiously as I should have been, but I still started out with about 180 pieces of it and more than 5000 instances of “good” email.
I installed the Outlook plugin version of SpamBayes, trained it on my good and bad emails, then set it to work. So far, it’s already caught a piece of spam without even bugging me about it. It didn’t delete it - it just tossed it in the ol’ Junkmail folder - but it was, beyond a shadow of a doubt, spam (specifically: porn mail).
So, one hour of use isn’t enough to make a judgment call yet, but it has been a fascinating look at my email habits so far. For instance, SpamBayes has this way nifty feature that allows me to see how it determined the “spamminess” of an email. It shows each token and the score it calculated based on how many pieces of spam or ham (the good mail) contained the same token. It’s sort of surprising and fun to see which words indicate ham for me.
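For the curious, here is a rough sketch of how that kind of per-token scoring could work. To be clear, this is not SpamBayes' actual code: the `TinyBayes` class, the tokenizer, the smoothing constants and the log-odds combining step are all my own simplified assumptions, just to illustrate the idea of scoring each token by how often it shows up in spam versus ham.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase word tokens, each counted once per message.
    # A real filter also looks at headers, URLs and so on.
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class TinyBayes:
    """Hypothetical toy filter; illustrative only, not the SpamBayes API."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of ham messages containing it
        self.nspam = 0
        self.nham = 0

    def train(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.nspam += 1
        else:
            self.ham_counts.update(tokens)
            self.nham += 1

    def token_score(self, token, strength=1.0, prior=0.5):
        # Score near 1.0 means "seen mostly in spam", near 0.0 "mostly in ham".
        n = self.spam_counts[token] + self.ham_counts[token]
        if n == 0:
            return prior  # never seen: no evidence either way
        spam_ratio = self.spam_counts[token] / max(self.nspam, 1)
        ham_ratio = self.ham_counts[token] / max(self.nham, 1)
        p = spam_ratio / (spam_ratio + ham_ratio)
        # Pull rarely seen tokens toward the neutral prior (assumed constants).
        return (strength * prior + n * p) / (strength + n)

    def score(self, text):
        # Returns (overall spamminess, per-token scores) for a message.
        scores = {t: self.token_score(t) for t in tokenize(text)}
        log_odds = sum(math.log(p) - math.log(1.0 - p) for p in scores.values())
        log_odds = max(-30.0, min(30.0, log_odds))  # keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-log_odds)), scores


f = TinyBayes()
f.train("cheap pills buy now", is_spam=True)
f.train("buy cheap viagra now", is_spam=True)
f.train("meeting notes attached", is_spam=False)
f.train("lunch meeting tomorrow", is_spam=False)

spamminess, clues = f.score("buy cheap pills")
for token, s in sorted(clues.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token:10s} {s:.3f}")
print(f"overall: {spamminess:.3f}")
```

With my real mail, the fun part is the equivalent of that `clues` table: the tokens with scores closest to 0 are the words that mark a message as ham for me.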
What will be a true challenge is the fact that I work for a marketing firm, which means I frequently both send and receive targeted, opt-in marketing communications (not spam). The unfortunate consequence is that a lot of our email may appear spammy to it, and I’ll have to rescue those messages from the spam fryer if they get tagged. This could be an interesting experiment.
I’m hoping that SpamBayes is user-friendly enough for me to install on all of our clients or, better yet, on the server. The thing about Bayesian filtering is that it tries to improve as the spammers change their tactics. Of course, spammers have been trying all kinds of stuff recently to fool the Bayesian systems, like filling the subject line and body with completely irrelevant or, in some cases, nonsensical words. This, of course, makes them even more identifiable as spam, but only to a system (like a human) that is capable of natural language processing. Assuming we figure that one out, I’d be willing to bet the spammers would then turn to using foreign words, quite possibly a mixture of them from different languages (e.g. “Subject: Voulez pinata reichstag missa sunt arrivaderci”). So then we need to make the natural language processor multilingual, grammatically flexible and gibberish resistant. This is costly and annoying for all involved. But there’s a silver lining.
Already, many spam emails are more gibberish than actual marketing. Here’s an example of a subject line from an email I received the other day: “Fwd: V+a+lium - xana+x+ ` v1@grA $ V|cod|:n Som@ % .P.ntermin lnjfscnylwhx”. It resembles the original words just enough that I know it has something to do with Xanax, Valium and Viagra, but the rest of it is almost totally gibberish. How useful is that for a customer? Spammers still exist because people actually buy crap from spammers. But if the spam itself can’t even tell us what it’s trying to sell, how can a sale be made? So, yeah, this may get annoying for a while, and on the surface they may have cracked the Bayesian code, but anti-spammers have driven the spammers so deep into the forest that their message is getting lost in the noise. Hopefully, as more people adopt anti-spam measures, more spammers will find it to be a waste of their time to send out these mass, untargeted and unsolicited emails. And then they’ll probably go back to air-dropping fliers or something.