December 2003

Correlated Words

    The Bayesian spam filter assumes that the words picked from a mail act as independent spam indicators. This is a dangerous assumption because most words often show up in patterns; where a given pair of words indicate spam no better than each individually.

    This document uncovers the formula for a correlation coefficient that can be used to improve Bayesian filtering by not including correlated words in the Bayesian analysis.

Review

    If two words (A and B) are found in a mail, and they are independent spam indicators, then the combined probability that mail is spam is given by:

      PAB = PA PB

    Where PA , PB , and PAB are the respective probabilities that a mail is spam if it contains A, B, or AandB. Be sure you have reviewed Bayesian Group Theory to know the meaning of these circled operators.

The Co-relation Coefficient

    Turns out that most words are not independent spam indicators. Instead, after an analysis of many mails, we find that

      PAB = PA (n PB ) for some n

    We call n the "correlation coefficient of B with respect to A". Only in the rare case that n=1 can we say A and B are truly independent spam indicators.

    Lets have some fun and solve for n:

      PAB = PA (n PB )

                        (PB )n
      => PAB PA = ---------------
                    (PB)n + (1-PB)n

      => k(PB)n + k(1-PB)n = (PB )n where k = PAB PA

      => k(1-PB)n = (1-k)(PB)n

          k     (PB )n
      => --- = -------
         1-k   (1-PB )n

    Let k = g/h = PAB PA . We do this this to simplify the RHS of our equation by eliminating common denominators

       k     (g/h)      (g/h)        g
      --- = ------- = --------- = -------
      1-k   (1-g/h)   ((h-g)/h)    (h-g)

    And returning to our main event we have:

                   PAB (1-PB)               (PB)n
      => ------------------------------- = -------
         PAB(1-PB) + (1-PAB)PB - PAB(1-PB)   (1-PB)n

         PAB(1-PB)    (PB)n
      => --------- = -------
         (1-PAB)PB    (1-PB)n

               / PAB (1-PB)\
             ln| --------- |
               \ (1-PAB)PB /
      => n = ---------------
                /  PB  \
              ln| ---- |
                \ 1-PB /

    This n is not only a correlation coefficient, but can be considered how much B is worth, in terms of whole words) in the context of A. In the case where n=0, we see that B is worth nothing, and does not satisfy the Bayesian filter criterion. When n=1, B is truly independent of A and can be used in a Bayesian filter. Below we have some more analysis.

Analysis of the Co-relation Coefficient

    Partially Co-Related Words (| n | <1 )

    For example, if A="table" or B="td" are found in a mail, then that mail has a 90% chance of being spam. If we assume that those two words appear independently of each other then we could conclude that any mail with "table" AND "td" will have a 98.8% ( =.90 .90 ) chance of being spam. In reality this is a false conclusion because those two words are not independent of each other. It turns out that that a mail with both "table" and "td" has only a 95% chance of being a spam. So if we see A="table" and B="td" in a mail, we can calculate how much B is worth in terms of being an independent word.

            / PAB(1-PB )\       / 95% 10% \
          ln| --------- |    ln| -------- |
            \ (1-PAB)PB /       \  5% 90% /   ln(2.11)
      n = --------------- = --------------- = --------- =.340
              /  PB  \           / 90% \        ln(9)
            ln| ---- |         ln| --- |
              \ 1-PB /           \ 10% /

    The two words are partially correlated , and B is worth only .340 of a word should we need to use it in a Bayesian filter.

    Super Non-Related Words (| n | >1 )

    With | n | >1 the second word (B) is worth more than if it was on its own. Furthermore, we say the words are super non-related . Super non-related words are able to indicate a mail as spam much better than either word can on its own.

    For example A="enlargment" and B="doctor" have PA =90% and PB =70% of indicating spam respectively. If they appear independently, we would expect a mail with both words to have 95.5% ( =.90 .70) chance of being spam. Yet it turns out that a mail with both words actually has a 99% chance of being spam.

          / PAB(1-PA) \      / 99% (10%) \
        ln| --------- |   ln| ---------- |
          \ (1-PAB)PA /      \ (1%) 90%  /   ln(11)
    n = --------------- = --------------- = ------- = 2.83
          /   PB   \          /  70%  \     ln(7/3)
        ln| ------ |       ln| ------ |
          \ (1-PB) /          \ 1-70% /

    So if B="doctor" is found in a mail containing A="enlargment", then B is effectively worth 2.83 times as much (using the Bayesian meaning for "times").

    Spam and Legit Words ( n<0)

    When n<0 either the numerator, or denominator for n is less than zero. This can happen if ( PA >PAB and PB > &func12;) or ( PAB > PA and &func12;>PB ).

    In the first case A is a better indicator of spam than AandB and B is a spam indicator. The second case is where AandB indicate spam better than A, but B is a good legit indicator (a poor spam indicator). These are rare

    Example: Consider A="align=center" and B="trademark" with PA =98% and PB =75% . If these words were independent spam indicators, then we would expect PAB to be greater than either of then. Actually Pab=90% .

          / PAB (1-PA) \      / 90% 2%  \
        ln| ---------_ |   ln| -------- |
          \ (1-PAB )PA /      \ 10% 98% /   ln(.183)
    n = --------------- = --------------- = --------- = -1.54
            /  PB   \         / 75% \        ln(3)
         ln| ------ |      ln| ---- |
            \(1-PB) /         \ 25% /

    In other words, the negative n means that B is acting like a legitimate indicator ( PAB < &func12;) dispite being a spam indicator ( PAB > &func12; ).

Notes on (the lack of) Symmetry

    A candidate for a more symmetric definition of the correlation coefficient is

      PAB = (n PA ) (n PB ) for some n

    This relation has a couple of problems that have been explored:

    1. The form does not help indicate what word is a better choice for indicating spam. The PAB = PA (n PB) formula includes the assumption that A is being used to indicate spam, and we are solving the adjustments needed to include B in our Bayesian filter.
    2. The value for n when PAB = PA is complicated, and does not reveal the important conclusion that B is worth relatively nothing.

Implementation Challenges

  • The problem with finding correlation coefficients is the fact that PAB must be found for all A and B. With potentially hundreds of thousands of "words", the matrix needed to count all the combinations can easily go into the billions.

    KASP goes through two counting stages. The first counts the quantity of each word, and chooses only those that have a good chance of indicating spam (>80% chance of indicating spam) or a good chance of indicating legit mail (<5% chance of indicating spam). These thresholds are chosen so that only about 1500 words are are considered. The second counting stage only counts word pairs found in those 1500 words, and only outputs those pairs that are "good indicators".

  • Limiting the number of word pairs tallied has a consequence of leaving a "hole" in the information available to calculate the correlation coefficient.

    The simple solution is to realized that if A and B are spam indicators, and information in AB is missing, we can assume that AB is a lesser indicator of spam than either one (otherwise it would have passed the threshold). Therefore B is correlated to A. In the case where A is a spam indicator, and B is a legit indicator, we would expect AB to be in the information hole, and we can say that B is non-related to A.

  • The value for PAB suffers in confidence because the tally of AandB is often much less than the tally of either A or B. This reduced confidence introduces significant error if the programmer is too conservative; resulting in the conclusion that almost all words are correlated, and using just one word to indicate spam.

    KASP uses the unadjusted Bayesian probabilities for PAB . Even though PAB can take on extreme values (100% and 0%), it is only used to approximate the correlation coefficient and to ultimately decide if a word is correlated or non-related to other words.

Conclusion

    With the old KASP 1.1, I noticed that the hypertext tags in a legitimate mail were combining to mark the mail as spam. This was a direct consequence of assuming that words appeared independently of each other (and they were not).

    KASP 2.0 corrects these errors significantly.