December 2003
Reducing False Positives in Bayesian Filtering
In this paper I discuss the source of false positives from small e-mail samples and how to reduce them to insignificant levels. I will start by defining some terms and the program flow in my own Bayesian spam filter.
Introduction
There are three phases to the filter program, the first uses the spam and legitimate pools of mail to create Bayesian probability ratios for each of the occurring words. The second phase, which reduces false positives, alters those ratios to be statistically safe. The final phase uses the probability ratios to identify new mails as spam.
Safe Probabilities Example
Our example will concern the single word "super" and a few mails I have received
Suppose I have received 100 spam, ten of which have used the word "super" and 10 legitimate mails with no appearance of "super" in any of them. From stats class you know the expected value of each:
Stats class also told you, that you should NOT be confident with the first two conclusions and definitely not the third, because of the small samples.
Finding out how many times "super" really might appear in legitimate mails depends on your tolerance for error. Since we are dealing with statistics, we can never be sure of anything. All we can do is establish a tolerance for error, and use that to improve our conclusion about how well "super" indicates spam accurately.
Dealing with the legitimate e-mail, we could have just been unlucky in the sample we were given. Suppose the real chance of getting a legitimate e-mail with "super" is ( pl =15%). Then we can ask, what was the chance (P) that 0 of 10 legitimate e-mails will show up without "super":
P = (1 - 0.15)10 = (0.85)10 = .1969 = 20%
Therefore with pl =15%, we can expect about 20% of our 10-samples will show up without any spam at all. I guess we could say that we have a "reasonable chance" of getting a 0/10 result with pl being 15% (or lower).
Suppose we want to be safe, and want to find the range of pl that would reasonably achieve a 0/10 result. This means we want to know what is the range of pl where a 0/10 sample might (>10% chance) happen :
T = 10% < (1 - pl )10 => (0.1)1/10 < 1 - pl
=> pl < 1 - (0.1)1/10
=> pl < 0.2057
Any pl < 20.57% has greater than 10% chance of giving our 0/10 sample. We can alternately say that with pl > 20.57% we have a less than 10% chance of getting a 0/10 sample. Here I have assigned a tolerance ( T ) to represent our aversion to error.
Binomial Distribution
In general, we want to determine the range for of probabilities that might give us the sample we have seen.
T <cn pc(1 - p)n-c
where
Our example above had c=0 and n=10.
The binomial distribution has a single peak at p = c/n and falls monotonically to zero in either side. This means that we can always find a range of p's, between pmin and pmax , that can satisfy this equation.
Implementation
The KASP implementation currently uses Newton's method for finding p. This is quite slow, and could be improved with Taylor approximations.