|
antiSPAM
antiSpam
uses a set of Perl
modules that work
to filter e-mail,
based on criteria
that can be
defined for any
portion of the
message. This
means that the
entire message is
checked from
beginning to end
looking for
characteristics that
are commonly found
in spam. Every
spam-like
characteristic carries
a certain weight
or score. If that
characteristic exists
in an e-mail,
antiSPAM notes
it, records
how much it is
worth, and
continues checking
the message. When
the e-mail has
been completely
checked, the
scores of all the
spam-like checks
that matched are
summed. If that
total exceeds
the pre-defined
threshold (“required_hits”),
the message is
tagged as spam. If
not, then the
message is
delivered as usual
without the
antiSPAM “mark
up”.
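The tagging decision described above can be sketched in a few lines of Python. This is only an illustration, not antiSPAM's actual Perl code: the rule names, the individual scores, and the threshold value of 5.0 are assumptions made for the example. Only the name "required_hits" comes from the description above.

```python
def classify(hit_scores, required_hits=5.0):
    """Sum the scores of every rule that matched and compare with the threshold.

    hit_scores maps a rule name to its score. The 5.0 default for
    required_hits is an illustrative value, not a documented setting.
    """
    total = sum(hit_scores.values())
    # "Exceeds" is read here as strictly greater than the threshold.
    return total, total > required_hits


# Hypothetical rule names and scores, for illustration only.
hits = {"SUBJ_ALL_CAPS": 0.5, "HTML_ONLY": 1.2, "CHECKSUM_HIT": 3.0}
total, is_spam = classify(hits)

# One more matching rule pushes the same message over the threshold.
more_hits = dict(hits, FORGED_SENDER=1.0)
total_more, is_spam_more = classify(more_hits)
```

In this sketch the first message stays below the threshold and is delivered untouched, while the second is tagged.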
In its most basic
(and default)
state, that is,
when no archiving
or deletion rule
has been
established,
antiSpam still
delivers messages
that are marked as
spam to you.
This is done so
that the user
doesn’t lose mail
during the early
set up portion of
antiSpam.
Once the user is
comfortable with
the way that
antiSpam is
“tagging” spam,
they may
enable additional
message handling
rules, such as
archiving or
deleting.
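A minimal sketch of that handling logic follows, assuming a single optional handling rule per user. The Action names and the choose_action helper are hypothetical and exist only for this example; the point is that tagged mail is still delivered unless the user has opted into archiving or deleting.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"                  # not spam: delivered with no mark-up
    TAG_AND_DELIVER = "tag_and_deliver"  # default for spam: tagged, still delivered
    ARCHIVE = "archive"                  # optional rule enabled by the user
    DELETE = "delete"                    # optional rule enabled by the user


def choose_action(is_spam, handling_rule=None):
    """Pick what happens to a message (hypothetical helper)."""
    if not is_spam:
        return Action.DELIVER
    if handling_rule is None:
        # No archiving or deletion rule established yet, so no mail is lost.
        return Action.TAG_AND_DELIVER
    return handling_rule


default_result = choose_action(is_spam=True)
archived_result = choose_action(is_spam=True, handling_rule=Action.ARCHIVE)
```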
The standard
antiSpam rule
set contains
hundreds of rules
for identifying
questionable
messages on the
basis of header
contents, body
contents, message
structure, sender,
and other
heuristics.
Because each rule
is weighted, rules
can be useful even
if they are not
perfect predictors
of spam
individually, or
even if they in
fact match many
legitimate
messages. For
example, one of
the rules matches
messages whose
subject is in ALL
CAPS. On its own,
this rule would be
a poor one -- it
would match too
many legitimate
messages -- but
taken in
conjunction with
the results of
other rules, it
still contributes
effectively to the
spam-identification
process. This is a
big departure from
the requirements
of most previous
spam filters,
where each rule
had to stand
entirely on its
own.
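To illustrate how a weak rule can still be useful, here is a small sketch with three made-up rules, each looking at a different part of the message. The rule names, weights, patterns, and the 5.0 threshold are all assumptions for the example and are not taken from the real rule set.

```python
import re

# (name, weight, test) triples. Everything here is hypothetical.
RULES = [
    # Weak on its own: plenty of legitimate mail has an ALL CAPS subject.
    ("SUBJ_ALL_CAPS", 0.6, lambda m: m["subject"].isupper()),
    ("BODY_FREE_MONEY", 2.5,
     lambda m: re.search(r"free money", m["body"], re.I) is not None),
    ("FROM_NO_REAL_NAME", 2.4, lambda m: "<" not in m["from"]),
]


def score(message):
    """Return the summed weight and the names of the rules that matched."""
    hits = [(name, weight) for name, weight, test in RULES if test(message)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]


legit = {
    "from": "Alice <alice@example.com>",
    "subject": "URGENT: SERVER DOWN",
    "body": "Please restart the database host.",
}
spam = {
    "from": "promo@example.net",
    "subject": "YOU HAVE WON",
    "body": "Claim your FREE MONEY now",
}

legit_score, legit_hits = score(legit)
spam_score, spam_hits = score(spam)
```

Both messages trigger the ALL CAPS rule, but only the second also matches the other rules, so only its total passes the assumed threshold of 5.0.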
The default
weighting for the
rules is
determined using a
statistical
technique, testing
the rules against
a large corpus of
known spam and
legitimate mail.
This training
process assigns
each rule a
weight, positive or
negative, based on
how well it
predicts that a
message is
"probably spam" or
"probably
legitimate." In
this way, even
rules that often
match both
legitimate mail
and spam, but
suggest one or the
other, can still
be very useful in
making inferences
about the
probability that a
given message is
spam. A lot of
small "this might
be spam" hints can
add up to a high
degree of
confidence.
antiSpam also
employs a
statistical
technique, called
auto-whitelisting,
to learn the
characteristics of
the mail you
receive, and uses
that to adjust the
spam score. It
computes the
statistical
distribution of
the spam score of
messages sent by
individual
senders, and uses
this to adjust the
spam score for a
new message sent
by a known sender.
For example, if
you have a friend
who regularly
sends you
(non-spam) e-mail,
but then that
friend forwards
you an
advertisement that
would ordinarily
have a high spam
score,
antiSpam will
use that friend's
history data to
adjust the
message's spam
score downwards.
You can also
supplement the
rule set with
explicit whitelist
and blacklist
entries if you
know that messages
from a particular
sender (or site)
are legitimate or
spam.
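The auto-whitelisting adjustment described above might look roughly like the following sketch. It is a simplification under stated assumptions: it tracks only each sender's running mean score rather than a full distribution, and the adjustment formula and the 0.5 factor are illustrative choices, not confirmed details of antiSpam. The class and method names are hypothetical.

```python
from collections import defaultdict


class SenderHistory:
    """Per-sender score history (a simplified, hypothetical model)."""

    def __init__(self, factor=0.5):
        self.factor = factor              # how strongly history pulls the score
        self.totals = defaultdict(float)  # sum of past scores per sender
        self.counts = defaultdict(int)    # number of past messages per sender

    def adjust(self, sender, score):
        """Move the score toward the sender's historical mean, then record it."""
        count = self.counts[sender]
        if count:
            mean = self.totals[sender] / count
            adjusted = score + (mean - score) * self.factor
        else:
            adjusted = score  # unknown sender: nothing to adjust against
        self.totals[sender] += score
        self.counts[sender] += 1
        return adjusted


history = SenderHistory()
for past_score in (0.0, 1.0, 0.5):        # a friend's usual, non-spam mail
    history.adjust("friend@example.com", past_score)

# The friend forwards an advertisement that scores 8.0 on its own.
adjusted = history.adjust("friend@example.com", 8.0)
stranger = history.adjust("stranger@example.net", 8.0)
```

With these numbers the friend's forwarded advertisement is pulled well below its raw score, while the same message from an unknown sender keeps its full score.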
In addition to the
built-in rules,
antiSPAM can
access external
databases, such as
commercial
blacklist services
and the Razor and
DCC spam checksum
databases. These
external checks
are treated just
like any other
rule, and users
can adjust the
weights associated
with matching one
of these databases
as they see fit.
Razor is a
database of
checksums of known
spam messages, as
reported by users. If a
message's checksum
appears in Razor,
the appropriate
rule is triggered
and its score is
added to the
message score.
Razor is
surprisingly
effective at
catching spam, and because
its decisions are
based on human
classification, it
frequently
identifies
messages that are
not caught by the
other message
heuristics. At the
same time,
antiSPAM
assigns a weight
to Razor hits that
is not, by itself,
enough to mark a
message as spam,
so erroneous or
even malicious
submissions to
Razor don't
usually cause much
trouble for
antiSPAM
users. Razor is a
much sharper tool
than traditional
spam blacklists,
which identify
entire domains or
IP address ranges
as spammers, often
branding innocent
senders in the
process. Because
antiSPAM can
learn over time
and access dynamic
databases, it is
likely to remain
useful longer than
strictly
rule-based
systems, even
without installing
the periodic rule
updates.
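As a rough illustration of a checksum-database rule, the sketch below hashes a normalized message body and looks it up in a set of reported checksums. This is not Razor's real protocol or signature scheme: SHA-256 over whitespace-normalized text is a stand-in chosen for the example, the function names are hypothetical, and the 3.0 weight is an assumed value picked to stay below an assumed spam threshold of 5.0, so that a single database hit cannot mark a message as spam by itself.

```python
import hashlib


def body_checksum(body):
    """Checksum of a message body after simple normalization (stand-in scheme)."""
    normalized = " ".join(body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def checksum_rule_score(body, known_spam_checksums, weight=3.0):
    """Return the rule's weight if the body was reported as spam, else 0.0."""
    return weight if body_checksum(body) in known_spam_checksums else 0.0


# A local set standing in for the external database of user reports.
reported = {body_checksum("Buy   cheap pills NOW")}

hit = checksum_rule_score("buy cheap pills now", reported)
miss = checksum_rule_score("Lunch on Friday?", reported)
```

Because the result is just another score, it is added to the message total like any other rule, and the weight can be raised or lowered to reflect how much the user trusts the database.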
antiSpam
takes a very
different
approach from
previous spam
filters, and
this approach
has proven to be
more flexible
and adaptable.
While it also
uses matching
rules to
identify
possible spam
candidates, it
takes a
probabilistic,
score-based
approach to
classifying
messages instead
of a binary
approach.
Instead of
seeking to
create rules
that identify
messages as
"definitely
spam" or
"definitely not
spam", it uses
rules that each
add evidence
about the
likelihood that
a given message
is spam.
For most users,
antiSPAM can
catch nearly all
of the spam
without
quarantining
legitimate mail,
and offers
extensive
tuning and
customization
options.
After using
antiSPAM for
several months,
the majority of
users report that,
in their
experience,
the false positive
and false negative
rates are
extremely low.
After a little
initial tuning
(mostly whitelist
and blacklist
entries), users
spend no more than
a few minutes a
week scanning the
spam folder for
false positives,
and almost never
find any.
|