|
Report From The 2004 MIT Spam Conference
This was a very interesting conference. Mark Ramos of Granite Software also attended, and we sat near each other and had lunch together. Other than having one familiar face from the Lotus community in the crowd, however, this conference was unlike any other conference I've been to in a long time. First of all, it was an academic conference, which means that presentations are much shorter than what we see at industry conferences. It all took place in one auditorium, with no breakout sessions. There were 19 individual presentations on the agenda. That's an awful lot of information in a very short time.
Now, about the venue... MIT is a world-class institution... there's no question about that... but it has all the charm of an urban industrial park. I found it odd that, with as many experts in queueing theory as there are at MIT, there was only a single rest room for an auditorium that held somewhere in the neighborhood of 500 people.
The most frequently heard phrases during the conference were probably "arms race", "white list", and of course "Bayesian". Another common term was "inoculation", which refers to automatically passing on spammer information in near-real-time so that other sites can benefit from information about attacks that you have detected.
A wide variety of viewpoints were given, some of them in direct conflict. One speaker, Terry Sullivan, presented a statistical analysis that strongly challenged the conventional wisdom that spam patterns mutate very quickly, yet many other speakers took the volatility of spam patterns as a given. One of the two speakers who addressed the problem of making statistical filtering fast enough for server implementation baffled the audience with the numbers he cited for his adaptation of the CRM114 approach: he seemed to be implying that he was computing probabilities for eight distinct patterns within a two-token window. Clearly something was lost in translation on that one -- but I actually have to admit that this talk got me thinking about a few things, and I'm going to start a correspondence with some of the presenters to see whether the ideas I've come up with might be useful to them. There were also two different speakers on the subject of legal responses to spam, one of whom was fairly optimistic and the other fairly pessimistic, and two different speakers talking about sender-pays economic solutions, one of whom presented a detailed economic model that was certainly of academic interest even if the practical value is questionable.
I think the most intriguing presentation, in terms of new ideas and techniques, was Marty Lamb's talk about TarProxy. The basic idea is that TarProxy recognizes spammers when they connect, and it does everything it can to keep them connected for as long as possible, while not actually delivering their spam. This turns the tables on spammers, consuming their resources so that they can't send as many spam messages. It's like putting a telemarketer on hold. He's designed it to be pluggable, so integrating it with an existing spam solution should be quite easy, and integrating it with a white list to improve reliability would be a cool idea, too. Eric Kidd of Dartmouth Medical School gave an interesting presentation about Bayesian Whitelisting. He spoke about using statistical analysis of headers to quickly determine whether mail could bypass more expensive analysis. What he's really doing is analyzing the implicit social network within email, and looking for messages that fit the existing communication patterns. Cool idea.
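To make the tarpit idea behind TarProxy a bit more concrete, here is a minimal sketch in Python. It is not TarProxy's actual code: the function name `tarpit_delay`, the parameters, and the linear scaling are all assumptions made up for illustration. It only shows the core decision, which is how long to stall each reply once some classifier has produced a spam score for the connection.

```python
def tarpit_delay(spam_score, threshold=0.9, base_delay=0.5, max_delay=30.0):
    """Return how many seconds to stall before sending each SMTP reply.

    spam_score is assumed to come from some external classifier and to
    lie between 0.0 (surely legitimate) and 1.0 (surely spam). All of the
    parameter values here are hypothetical defaults, not TarProxy's.
    """
    # Keep the score inside the expected range.
    spam_score = min(max(spam_score, 0.0), 1.0)

    # Connections below the threshold are not delayed at all.
    if spam_score < threshold:
        return 0.0

    # Scale linearly from base_delay (at the threshold) to max_delay (at 1.0).
    fraction = (spam_score - threshold) / (1.0 - threshold)
    return base_delay + fraction * (max_delay - base_delay)
```

A proxy built this way would sleep for `tarpit_delay(score)` seconds before each reply to the connecting client, so a confident spam verdict ties up the sender's connection for much longer while legitimate mail passes through without delay.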
The best presentation was actually delivered in absentia by John Graham-Cumming. His video presentation about "How To Beat A Bayesian Filter" took us inside the minds of spammers. He described an attack on Bayesian filtering using -- get this! -- Bayesian analysis on the spammer's side! The spammer sends a massive number of messages, using the "word salad" technique to try to find versions that make it through the filters, and uses statistics on which ones get rejected least often to train his software to use those particular words more often. His conclusion was that all feedback to spammers is harmful. Non-deliveries, challenge/response messages, HTML rendering of images that contain coded acknowledgement data, SMTP rejection messages, etc., all potentially help spammers fine-tune their attacks. He's right, of course, but as is the case with many things that we know are correct in a purely academic sense, the question of what we should do in practical terms remains wide open.
The most controversial topic was one of the last ones: Eric Raymond's presentation on SPF, which is a DNS-based technique for adding authentication into the SMTP protocol. Opinions run very strong about whether SPF does any good at all, whether it does more harm than good, etc. A fellow in the audience, who I'm 99.9% certain was Barry Shein, founder of The World -- the first public dialup ISP anywhere, which happens to be where I had my very first dialup account -- spoke very critically of SPF. I'm not sure where I stand on this. Mass adoption of SPF will cause problems for anyone sending mail directly from dynamic IP addresses, but the fact that many major ISPs are already rejecting mail from dynamic IPs kind of makes this a moot point IMHO. It also causes problems for automated forwarding of messages -- at least the invisible type of forwarding that users are accustomed to when they move from one address to another. I'm leaning toward the belief that SPF is another weapon, not necessarily a definitive way to track spam, and probably not a way to reject spam all by itself either, but perhaps a way to determine which inbound messages need to be subjected to the most stringent analysis before delivery.
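For readers who haven't seen SPF up close: a domain publishes a DNS TXT record listing the hosts allowed to send mail on its behalf, and the receiving server checks the connecting IP address against it. The sketch below is a deliberately tiny Python illustration of that check. It assumes the record text has already been fetched from DNS, and it only understands the `ip4` and `all` mechanisms. The function name `check_spf` and the sample addresses are made up for this example; a real implementation also has to handle `a`, `mx`, `include`, `redirect`, and the rest.

```python
import ipaddress

# SPF qualifiers and the result each one produces when its mechanism matches.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}

def check_spf(record, sender_ip):
    """Evaluate a small subset of an SPF record against a sender IP.

    Only the ip4 and all mechanisms are handled; everything else is
    skipped, which a real checker must not do.
    """
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"  # not an SPF record at all

    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"  # the default qualifier when none is written
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name == "ip4":
            if ip in ipaddress.ip_network(value, strict=False):
                return QUALIFIERS[qualifier]
        elif name == "all":
            return QUALIFIERS[qualifier]
        # Mechanisms that need further DNS lookups are ignored in this sketch.
    return "neutral"  # nothing matched
```

With the record `v=spf1 ip4:192.0.2.0/24 -all`, mail arriving from `192.0.2.10` evaluates to `pass` and mail from any other address evaluates to `fail`. Using that result as one input to decide how hard to scrutinize a message, rather than as an outright accept/reject switch, matches the "another weapon" view above.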
The very last presentation, by Richard Jowsey of Death2Spam (gotta love that name!!), appealed to the latent math geek in me. I say latent, but my friends all know about it, and I've let it show here from time to time, I guess. Anyhow, he showed lots of pretty bell curves while handling the topic of fine-tuning Bayesian filtering. It turns out that my own experiments with Bayesian filters in LotusScript explored some similar tuning concepts, but I never actually did the math to justify what I was doing.
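As a rough illustration of what there is to tune in a Bayesian filter, here is a small Python sketch of the common naive Bayes combining rule, in which per-token spam probabilities are merged into one score under an independence assumption. This is not taken from the Death2Spam talk or from the LotusScript experiments mentioned above; the function name and the `floor` and `ceiling` clamps are assumptions chosen for illustration, and the clamps are exactly the kind of knob that tuning work tends to revolve around.

```python
import math

def combined_spam_probability(token_probs, floor=0.01, ceiling=0.99):
    """Combine per-token spam probabilities into one overall score.

    Uses p = S / (S + H), where S is the product of the token
    probabilities and H is the product of their complements. The sums
    are done in log space so long messages do not underflow.
    """
    log_spam = 0.0
    log_ham = 0.0
    for p in token_probs:
        # Clamp so that no single token can force the result to 0 or 1.
        p = min(max(p, floor), ceiling)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)

    # S / (S + H) rewritten as 1 / (1 + H / S).
    diff = log_ham - log_spam
    if diff > 700:
        # exp() would overflow; the evidence is overwhelmingly non-spam.
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))
```

Tightening or loosening the clamps changes how much weight a single extreme token carries, and the threshold applied to the returned score decides the trade-off between missed spam and false positives.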
|