\author{Simon Cozens}
\title{A Program which Learns how to Block Spam}

@* Spamtrap. This program uses statistical and computational linguistic
methods, together with a knowledge base, to determine whether a mail message,
news article, or any other given piece of information is likely to be
interesting to the reader. While it was initially designed to stop me having
to look at unsolicited mail messages, I realised quickly that the algorithm
should work just as well for any other medium in which I choose whether or not
to read something based on various factors, especially whether or not I think
it will be interesting.

The databases so far consist of the following data: lists of words that are
considered interesting, uninteresting or irrelevant, plus sentence length and
average word length data.

@ Hints for training. I think it's best to train on two quite large but
approximately equal-sized files, one containing only spam, one containing only
sensible mail messages. (The best place to obtain the latter, of course, is
your own mail directory, since this will be ideally suited to learning what
you personally find interesting.) Now, run them through in batch mode at least
twice each. This means that if the program made any false judgments the first
time round, they can be rectified on the second pass. After that, try training
it interactively on a small mixed file and see how it gets on. If it's still
stuck, try running your large training files through twice again.

@ Calling conventions. The following flags are available:
\begin{itemize}
\item{-t : Training mode.}
\item{-c : Makes a choice for you in training mode - `-c y' marks everything as spam, whereas `-c n' marks everything as interesting.}
\item{-f : File to analyze. Defaults to your mailbox.}
\end{itemize}

@* Overview of the program.

@p @ @ @ @ @

@* Banner. The version number is stored here so that we can easily access it.

@=
print "This is spamtrap, version 0.3\n";

@* Global Variables.
Firstly, we make sure that the software is cleanly shut down by trapping the
interrupt (Ctrl-C) signal. Then we display our lovely banner, and ask for the
getopt library which will process our options. This {\em should} be
|use Getopt::Std;| but for some reason that didn't seem to want to work on
this system. The next cryptic line turns on automatic flushing of the output
buffer, so that our status displays (the line of dots when reading and writing
the database) appear immediately rather than only when a newline is printed.

@=
$SIG{INT}=sub { @ };
@
require "getopt.pl";
$|=1;

@* Determine mode. We can run in a number of modes, depending on database or
what have you. The most important, and the initial choice we have to make, is
whether we are {\bf training} or {\bf acting}: that is, whether we
interactively build up our hypotheses of what is interesting and what is not,
or whether we go through and do something about boringness!

So, {\tt -t} gives us the training mode. If a {\tt -c} parameter is given,
this will be passed to the dialog later on. Otherwise, of course, the program
runs interactively. We pick up the file name from the {\tt -f} option (which
works both in training and in active mode) if there's one given. Otherwise, we
fall back to the user's mailbox. The file name is worked out before the
training message is printed, so that the message can report it.

@=
Getopt("cf");
$file=$opt_f;
$file="/var/spool/mail/".$ENV{USER} unless $opt_f;
if ($opt_t) {
  print "spamtrap running in training mode.\n";
  print "We will train with $file.\n";
  $choice=$opt_c if $opt_c;
} elsif ($opt_h) {
  print "\nA program to identify and delete unwanted information\n\n";
  print "When called without switches, will prompt for information";
  print ".\nTo run in batch mode, run as \n\n spamtrap -t -f -c \n";
  exit;
}

@* Sort out files. This loads up the required data files for a given database
(at the moment, I'm assuming that we're dealing with the ``mail'' database,
since I've not actually written support for switching databases yet\dots) and
puts the data into some variables.
It doesn't actually matter if the files don't exist just yet, as they will be
created at the end of the program.

The structure of the |data| file is something like this:
\begin{itemize}
\item Banner: Text to display on loading
\item Good Word Length: Letters/Words
\item Bad Word Length: Letters/Words
\item Good Sentence Length: Words/Sentences
\item Bad Sentence Length: Words/Sentences
\end{itemize}
(The idea behind the fields is that the statistics should be comprehensible
when viewed outside the program.)

The other files are colon-separated files with the format |word:occurrences|
which are loaded into a hash. The interesting and uninteresting files each
begin with one extra line giving the number of messages counted so far. If a
word occurs in the ``boring'' file, it is ignored in the computation of
interestingness.

@=
print "Loading database.";
$dbname="mail";
open(INTEREST, $dbname.".int");
chomp($intmess=<INTEREST>);
while (<INTEREST>) {
  chomp;
  ($key, $value) = split /:/;
  $interest{$key}=$value;
}
close INTEREST;
print ".";
open(BORING, $dbname.".bor");
while (<BORING>) {
  chomp;
  ($key, $value) = split /:/;
  $boring{$key}=$value;
}
close BORING;
print ".";
open(UNINTER, $dbname.".uni");
chomp($unimess=<UNINTER>);
while (<UNINTER>) {
  chomp;
  ($key, $value) = split /:/;
  $uninterest{$key}=$value;
}
close UNINTER;
print ".";
open(DATA, $dbname.".data");
chomp($banner=<DATA>);
$banner="Banner: ".$dbname if (!$banner);
$banner=~s/.*:\s*(.*)/$1/;
print $banner;
$glw=<DATA>;
($goodlettercount,$goodwordcount) = $glw=~/.*:(.*)\/(.*)/;
print "..";
$blw=<DATA>;
($badlettercount,$badwordcount) = $blw=~/.*:(.*)\/(.*)/;
print "..";
$gsw=<DATA>;
($good2wordcount,$goodsencount) = $gsw=~/.*:(.*)\/(.*)/;
print "..";
$bsw=<DATA>;
($bad2wordcount,$badsencount) = $bsw=~/.*:(.*)\/(.*)/;
print "..";
@
print "\n";
close(DATA);
@

@ Sanity check: We also check that the word count in |letters/words| is the
same as the word count in |words/sentences|, and complain if it is not. The
only way this can actually happen is if some idiot has edited the data file
by hand.

@=
if ($bad2wordcount != $badwordcount or $good2wordcount!=$goodwordcount) {
  print "\n! 
Someone's been messing around with these files behind my back!\nGuessing optimal figures...";
  $goodwordcount=$good2wordcount;
  $badwordcount=$bad2wordcount;
}

@ Pick Up Messages. We read in the messages from the file specified above. A
new message is recognised by a line matching |/^From /|, and everything after
the first blank line is treated as body text. Each line of the file is
appended to the appropriate element of the header or body array for later
reference.

@=
open (MESGS, $file);
print "\nLoading up";
while (<MESGS>) {
  print "." unless $count%100;
  if (/^From /) { $count++; $header=1; }
  if (!/\S/) { $header=0; }
  if ($header) { $messhead[$count].=$_; }
  else { $messbody[$count].=$_; }
}
print "\n";

@* Loop through the messages. Basically, this looks at each message, gathers
the metrics and decides whether or not to call it ``uninteresting'', prints
the headers, gets input from the menu and passes the result to the appropriate
part of the program.

@=
$count=-1;
foreach (@@messhead) {
  $count++;
  next unless $_;
  @ @ @ @
}

@ Pager. Display the head of the message tidily. I don't think I like the use
of the temporary file and the UNIX pager to display the top of the message
headers, and I will write a pager that fits in with this program a little
better when the next box of round tuits arrives.

@=
open(TMP,">spam.tmp") or die "Can't write to spam.tmp";
print TMP $_;
close TMP;
system('more spam.tmp') unless $opt_t;

@* Diagnose. This is where the hard work is done: metrics are gathered,
interesting and uninteresting words are compared against the database, and so
on. First we simplify the message, and then we examine it. I'll deal with that
process later. In this section, we build up measures of how far the text is
from the ``Platonic good'' and the ``Platonic spam''\footnote{What a horrible
thought!} message, in terms of sentence length, word length and the choice of
words used. If this doesn't give us a clear verdict, we fall back to looking
only at the words used.
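In symbols, as a restatement of the code in this section (the letters $S$,
$W$, $G$, $B$, $F$ and $w$ are shorthand introduced here only and appear
nowhere in the program): let $S$ be the message's words per sentence and $W$
its letters per word, and let the subscripts $g$ and $b$ mark the same
averages taken over the good and bad databases. Then
$$G = |S - S_g| + |W - W_g|, \qquad B = |S - S_b| + |W - W_b|,$$
$$F = G - B + 3w,$$
where $w$ is the word score, and whenever $|F| < 5$ the value of $F$ is
replaced by $w$ alone. A term is skipped when the database count it divides by
is zero. Both $G$ and $B$ are distances, so a smaller value means a closer
match.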
A little score chart is printed out.

@=
@ @
$sentenceness=0;
$wordness=0;
$goodness=0;
$badness=0;
$sentenceness=$thiswords/$sentences if $sentences>0;
$wordness=$characters/$thiswords if $thiswords>0;
$goodness=abs($sentenceness- ($goodwordcount/$goodsencount)) if $goodsencount>0;
$goodness+=abs($wordness- ($goodlettercount/$goodwordcount)) if $goodwordcount>0;
$badness=abs($sentenceness- ($badwordcount/$badsencount)) if $badsencount>0;
$badness+=abs($wordness- ($badlettercount/$badwordcount)) if $badwordcount>0;
print "Goodness: $goodness\n";
print "Badness: $badness\n";
print "Word score: $wordscore\n";
$finalscore=$goodness-$badness+3*$wordscore;
$finalscore=$wordscore if (abs($finalscore)<5);
print "This message's total score: \n";
print int($finalscore*100)/100;
print "\n";

@* Prepare Message. We prepare the message for analysis by:
\begin{itemize}
\item Collating header and body.
\item Replacing non-alphabetics with spaces. (Taking care not to erase sentence divisions\footnote{Or full stops, as they're better known.})
\item Taking everything down into lower case.
\item Replacing multiple whitespace with single spaces. (For ease of separation)
\end{itemize}

@=
$message=$_.$messbody[$count];
$message =~ s/\./_fullstop_/g;
$message =~ s/\W/ /g;
$message =~ s/_fullstop_/\./g;
$message =~ tr/A-Z/a-z/;
$message =~ s/\s+/ /mg;

@* Gather Metrics. Here we pull the message apart and see what it's about. We
first split the message into sentences at each full stop and count them. We
then get rid of the full stops, as they're not needed any more. We count the
number of words and the number of characters in the message, and finally add
up the score. Words from the good list add to the score and words from the bad
list subtract from it, so the word score of the message is defined as being
$$\sum_{x \in \hbox{\rm message}} \left(
{\hbox{\rm occurrences of $x$ in good list} \over \hbox{\rm number of good messages}}
- {\hbox{\rm occurrences of $x$ in bad list} \over \hbox{\rm number of bad messages}}
\right)$$

@=
$sentences = scalar split /\. 
/,$message;
$message=~s/\.//g;
undef @@words;
@@words = split / /,$message;
$thiswords=scalar(@@words);
$characters=0;
$wordscore=0;
foreach (@@words) {
  $characters+=length($_);
  $wordscore+= $interest{$_}/$intmess if $intmess>0;
  $wordscore-= $uninterest{$_}/$unimess if $unimess>0;
}

@* Display Menu. Actually, this displays two things. If we're in training, we
either ask (in interactive mode) whether the message is junk, good, or whether
the operator wants to see the body of the message. If we're in training but
not in interactive mode, the choice is made for us. If we're not in training
at all, the verdict is given, and something is done to the message. Since the
prompt is shown in capitals, answers are accepted in either case.

@=
if ($opt_t) {
  if (!$opt_c) {
    do {
      print "Is this junk? (Y/N/M/Q) ";
      $choice=<>;
      if ($choice=~/m/i) { print $messbody[$count]; $choice=""; }
    } until $choice=~/y|n|q/i;
  } else { $choice=$opt_c; }
} else {
  print "\nI think this message is ";
  if ($finalscore>0) { print "fine.\n"; }
  else { print "spam.\n"; }
  @
  print "\n";
}

@ Kill Spam. Well, this doesn't actually do anything yet. It just gives us a
chance to see the verdict. It should eventually give the reader a chance to
overturn the verdict.

@=
<>;

@* Take appropriate action. This section is mainly used by the training
process. It updates the variables that we're comparing things against with the
data derived from the current message. A word seen in both good and bad
messages is moved to the boring list.

@=
if ($choice=~/y/i) {
  $badlettercount+=$characters;
  $badwordcount+=$thiswords;
  $badsencount+=$sentences;
  $unimess++;
  foreach (@@words) {
    if ($interest{$_} or $boring{$_}) {
      delete $interest{$_};
      delete $uninterest{$_};
      $boring{$_}=1 if not $boring{$_};
    } else { $uninterest{$_}++; }
  }
} elsif ($choice=~/n/i) {
  $goodlettercount+=$characters;
  $goodwordcount+=$thiswords;
  $goodsencount+=$sentences;
  foreach (@@words) {
    if ($uninterest{$_} or $boring{$_}) {
      delete $interest{$_};
      delete $uninterest{$_};
      $boring{$_}=1 if not $boring{$_};
    } else { $interest{$_}++; }
  }
  $intmess++;
} else {
  if ($opt_t) { @ }
}

@* Write Out Files.
Finish off by updating the database files, in much the same way as we read
them in.

@=
print "\nClosing down.";
open(INTEREST, ">".$dbname.".int");
print INTEREST $intmess."\n";
while (($key,$value)= each %interest) {
  print INTEREST "$key:$value\n";
}
close INTEREST;
open(BORING, ">".$dbname.".bor");
while (($key,$value)= each %boring) {
  print BORING "$key:$value\n";
}
close BORING;
print ".";
open(UNINTER, ">".$dbname.".uni");
print UNINTER $unimess."\n";
while (($key,$value)= each %uninterest) {
  print UNINTER "$key:$value\n";
}
close UNINTER;
print ".";
open(DATA, ">".$dbname.".data");
print DATA "Banner: ".$banner."\n";
print DATA "Good Word Length: $goodlettercount/$goodwordcount\n";
print DATA "Bad Word Length: $badlettercount/$badwordcount\n";
print DATA "Good Sentence Length: $goodwordcount/$goodsencount\n";
print DATA "Bad Sentence Length: $badwordcount/$badsencount\n";
close DATA;
exit;
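
@ A worked example of the scoring. This is only a sketch to make the
arithmetic of the Diagnose and Gather Metrics sections concrete. Every number
below is invented for illustration and does not come from any real database.
Suppose 20 messages have been trained as interesting and 30 as spam, and that
the message under test contains one word recorded 8 times in the interesting
list, one word recorded 3 times in the uninteresting list, and no other listed
words. Its word score is then
$${8 \over 20} - {3 \over 30} = 0.4 - 0.1 = 0.3.$$
If the length metrics happen to give a goodness of $2.0$ and a badness of
$3.5$, the combined score is
$$2.0 - 3.5 + 3 \times 0.3 = -0.6.$$
Since $|-0.6| < 5$, the program discards the combined figure and uses the word
score $0.3$ on its own. As $0.3 > 0$, the message is reported as fine.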