Tavis Ormandy

$Id: a07cf90837a3c4373b82d6724b97593810766af7 $


I wondered why there wasn’t better spam filtering in USENET clients. Sure, very few people still read USENET, but I like to use Gmane, a service that lets you read and post to mailing lists via NNTP.

I really think we never improved on newsreaders as a way to follow mailing lists.

My only complaint is that spam messages sometimes make it through, and there’s no good way to filter them. I think the problem is that newsreaders predate modern spam filtering, and all the major clients seem to be stuck using kill lists and score files.

I credit “A Plan for Spam” for making email usable again in the early 2000s. If you had a public email address back then, you’ll know it was rough. When I read that essay, it really felt like the tide might be turning.

Maybe USENET clients never got that memo, though?

Can’t we just plug in modern spam filters?

I’m an slrn user, and it’s extensible with a powerful scripting language. Surely I can just plug a spam filter into it and get a cleaned-up feed for free?

The answer is no. NNTP is not email, and the way filtering usually works is different. In general, clients ask the server for an NOV (News OVerview). This gives you a list of articles available to download, but not the articles themselves. You are really expected to do the filtering at this stage, and while you don’t have to, it makes newsreaders more efficient.
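
To make the overview format concrete, here is a minimal Python sketch of parsing one overview line. It assumes the standard tab-separated field order (article number, Subject, From, Date, Message-ID, References, byte count, line count). The `Overview` type, the `parse_overview` helper, and the sample line are made up for illustration and are not part of slrn.

```python
from typing import NamedTuple


class Overview(NamedTuple):
    """One entry from a news overview listing (hypothetical helper type)."""
    number: int
    subject: str
    sender: str
    date: str
    message_id: str
    references: str
    size: int
    lines: int


def parse_overview(line: str) -> Overview:
    """Split a tab-separated overview line into its eight standard fields."""
    fields = line.rstrip("\r\n").split("\t")
    number, subject, sender, date, message_id, references, size, lines = fields[:8]
    # The byte and line counts may be empty on some servers, so fall back to 0.
    return Overview(int(number), subject, sender, date, message_id,
                    references, int(size or 0), int(lines or 0))


# Invented sample line, not real server output.
sample = ("3000\tRe: patch review\tAlice <alice@example.org>\t"
          "Mon, 01 Jan 2024 12:00:00 +0000\t<abc@example.org>\t"
          "<parent@example.org>\t2048\t40")
entry = parse_overview(sample)
print(entry.subject, "from", entry.sender)
```

Note that nothing in these fields is the article body, which is why filtering at this stage has so little text to work with.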

Unfortunately NOV is very limited: it only carries things like Subject, From, References, and so on. Will a naive Bayesian spam filter still work if I can only give it those headers?

It’s worth a shot, so I tried it out. I wrote an S-Lang script for slrn to plug in bogofilter. It took a while to figure out how, but I did get it working.
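
The S-Lang script itself is not reproduced here. As a rough Python stand-in for the same idea, the sketch below builds a header-only message from overview fields and pipes it to bogofilter, reading the verdict from the exit status (0 for spam, 1 for non-spam, 2 for unsure). The function names and the `command` parameter are assumptions for illustration; the real integration lives inside slrn.

```python
import subprocess

# bogofilter reports its verdict through the process exit status.
VERDICTS = {0: "spam", 1: "ham", 2: "unsure"}


def overview_to_message(subject, sender, references=""):
    """Build a header-only pseudo-message from overview fields."""
    headers = [f"From: {sender}", f"Subject: {subject}"]
    if references:
        headers.append(f"References: {references}")
    # A blank line ends the header block; there is no body to include.
    return "\n".join(headers) + "\n\n"


def classify(message, command=("bogofilter",)):
    """Pipe the text to the filter on stdin and map its exit status."""
    result = subprocess.run(list(command), input=message.encode("utf-8"),
                            check=False)
    if result.returncode not in VERDICTS:
        raise RuntimeError(f"filter failed with status {result.returncode}")
    return VERDICTS[result.returncode]


if __name__ == "__main__":
    text = overview_to_message("Cheap watches!!!", "Spammer <x@example.com>")
    print(text, end="")
    # classify(text) needs bogofilter installed and a trained word list.
```

With an untrained word list, everything tends to come back as unsure, so the verdicts only become useful after some training.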

The answer is yes: after a few days of training, it did start to work! Over time, the confidence has increased and I usually don’t even bother checking anymore. Perhaps I have to correct one or two messages a week.

I’ve been able to read some very low-SNR groups, and bogofilter usually correctly identifies the ones I don’t want to read. You have to keep correcting it for it to stay accurate, but a single keybinding takes care of that.
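
For illustration, a correction could boil down to re-registering the same header text with bogofilter's `-Ns` (move from non-spam to spam) or `-Sn` (move from spam to non-spam) options. This Python sketch is an assumption about what such a keybinding might invoke, not the actual slrn binding, and `correction_command` and `correct` are hypothetical names.

```python
import subprocess


def correction_command(actual: str) -> list[str]:
    """Return the bogofilter invocation that fixes a wrong verdict."""
    if actual == "spam":
        # Was wrongly learned as non-spam: unregister there, register as spam.
        return ["bogofilter", "-Ns"]
    if actual == "ham":
        # Was wrongly learned as spam: unregister there, register as non-spam.
        return ["bogofilter", "-Sn"]
    raise ValueError(f"unknown class: {actual!r}")


def correct(message: str, actual: str) -> None:
    """Feed the misclassified text back to bogofilter (requires bogofilter)."""
    subprocess.run(correction_command(actual),
                   input=message.encode("utf-8"), check=True)


print(correction_command("spam"))
```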

Overall, I’m really pleased with it!