NYTProf Performance Profile
For /usr/local/bin/sa-learn
  Run on Sun Nov 5 03:09:29 2017
Reported on Mon Nov 6 13:20:44 2017

Filename: /usr/local/bin/sa-learn
Statements: Executed 4827 statements in 55.7ms
Subroutines
Calls  P   F   Exclusive Time  Inclusive Time  Subroutine
234    1   1   33.0ms          96437s          main::wanted
3017   12  9   25.6ms          25.6ms          UNIVERSAL::can (xsub)
1      1   1   18.4ms          34.0ms          main::BEGIN@24
3936   2   2   18.3ms          18.3ms          utf8::is_utf8 (xsub)
1      1   1   17.3ms          572ms           main::BEGIN@65
1      1   1   11.1ms          15.8ms          main::BEGIN@66
1038   1   1   9.79ms          9.79ms          Encode::XS::decode (xsub)
1      1   1   5.26ms          5.30ms          main::BEGIN@20
1      1   1   4.86ms          94.6ms          main::BEGIN@25
234    1   1   4.53ms          4.54ms          main::result
1      1   1   2.96ms          8.28ms          main::BEGIN@23
1      1   1   2.53ms          7.57ms          main::BEGIN@69
1      1   1   1.39ms          4.10ms          main::BEGIN@68
1      1   1   1.29ms          1.75ms          main::BEGIN@39
1      1   1   1.07ms          1.23ms          main::BEGIN@19
1      1   1   530µs           538µs           main::BEGIN@21
147    1   1   511µs           511µs           mro::method_changed_in (xsub)
148    3   1   489µs           489µs           Internals::SvREADONLY (xsub)
57     4   4   286µs           286µs           UNIVERSAL::isa (xsub)
6      6   5   148µs           148µs           UNIVERSAL::VERSION (xsub)
24     23  10  129µs           129µs           main::CORE:pack (opcode)
26     3   3   105µs           105µs           utf8::encode (xsub)
2      1   1   63µs            63µs            main::CORE:ftis (opcode)
1      1   1   49µs            174µs           main::BEGIN@41
2      1   1   39µs            39µs            main::target
1      1   1   32µs            204µs           main::BEGIN@70
1      1   1   28µs            28µs            main::BEGIN@67
4      2   1   25µs            25µs            main::CORE:match (opcode)
1      1   1   21µs            574µs           main::BEGIN@28
1      1   1   20µs            20µs            main::CORE:print (opcode)
1      1   1   19µs            19µs            main::BEGIN@26
1      1   1   11µs            11µs            main::init_results
2      2   1   11µs            11µs            main::CORE:close (opcode)
1      1   1   6µs             6µs             main::__ANON__[:94]
0      0   0   0s              0s              main::RUNTIME
0      0   0   0s              0s              main::__ANON__[:112]
0      0   0   0s              0s              main::__ANON__[:130]
0      0   0   0s              0s              main::__ANON__[:131]
0      0   0   0s              0s              main::__ANON__[:132]
0      0   0   0s              0s              main::__ANON__[:133]
0      0   0   0s              0s              main::__ANON__[:134]
0      0   0   0s              0s              main::__ANON__[:93]
0      0   0   0s              0s              main::__ANON__[:96]
0      0   0   0s              0s              main::killed
0      0   0   0s              0s              main::usage
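
As a quick sanity check on the figures above, the per-call average and each callee's share of the time can be recomputed from the reported totals. The sketch below uses only numbers that appear in this report: 234 calls to main::wanted, 96437s inclusive, and the 96432s and 4.85s that the line view attributes to Mail::SpamAssassin::learn and Mail::SpamAssassin::parse. The variable names are illustrative and are not part of NYTProf.

```perl
use strict;
use warnings;

# Figures copied from this report; the names are illustrative only.
my $calls       = 234;      # calls to main::wanted
my $inclusive_s = 96437;    # inclusive time of main::wanted, in seconds
my %callee_s    = (
    'Mail::SpamAssassin::learn' => 96432,   # seconds spent in learn()
    'Mail::SpamAssassin::parse' => 4.85,    # seconds spent in parse()
);

# Average inclusive time per call to main::wanted.
my $avg_s = $inclusive_s / $calls;
printf "avg per call: %.0fs\n", $avg_s;

# Share of main::wanted's inclusive time spent in each callee.
my %share;
for my $name (sort keys %callee_s) {
    $share{$name} = 100 * $callee_s{$name} / $inclusive_s;
    printf "%-28s %8.4f%%\n", $name, $share{$name};
}
```

The average rounds to the 412s/call that the report itself prints, and learn() accounts for more than 99.99% of the time spent in main::wanted, so message parsing and the script's own statements are negligible by comparison.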
Line  Statements  Time on line  Calls  Time in subs  Code
0  1  69µs  Profile data that couldn't be associated with a specific line:
# spent 69µs making 1 call to Mail::SpamAssassin::Logger::END
1#!/usr/local/bin/perl -T -w
2# <@LICENSE>
3# Licensed to the Apache Software Foundation (ASF) under one or more
4# contributor license agreements. See the NOTICE file distributed with
5# this work for additional information regarding copyright ownership.
6# The ASF licenses this file to you under the Apache License, Version 2.0
7# (the "License"); you may not use this file except in compliance with
8# the License. You may obtain a copy of the License at:
9#
10# http://www.apache.org/licenses/LICENSE-2.0
11#
12# Unless required by applicable law or agreed to in writing, software
13# distributed under the License is distributed on an "AS IS" BASIS,
14# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15# See the License for the specific language governing permissions and
16# limitations under the License.
17# </@LICENSE>
18
192676µs21.24ms
# spent 1.23ms (1.07+163µs) within main::BEGIN@19 which was called: # once (1.07ms+163µs) by main::NULL at line 19
use strict;
# spent 1.23ms making 1 call to main::BEGIN@19 # spent 12µs making 1 call to strict::import
2025.06ms25.34ms
# spent 5.30ms (5.26+44µs) within main::BEGIN@20 which was called: # once (5.26ms+44µs) by main::NULL at line 20
use warnings;
# spent 5.30ms making 1 call to main::BEGIN@20 # spent 33µs making 1 call to warnings::import
212557µs2545µs
# spent 538µs (530+8) within main::BEGIN@21 which was called: # once (530µs+8µs) by main::NULL at line 21
use bytes;
# spent 538µs making 1 call to main::BEGIN@21 # spent 8µs making 1 call to bytes::import
22
232285µs29.10ms
# spent 8.28ms (2.96+5.32) within main::BEGIN@23 which was called: # once (2.96ms+5.32ms) by main::NULL at line 23
use Errno qw(EBADF);
# spent 8.28ms making 1 call to main::BEGIN@23 # spent 826µs making 1 call to Exporter::import
242351µs237.9ms
# spent 34.0ms (18.4+15.6) within main::BEGIN@24 which was called: # once (18.4ms+15.6ms) by main::NULL at line 24
use Getopt::Long;
# spent 34.0ms making 1 call to main::BEGIN@24 # spent 3.90ms making 1 call to Getopt::Long::import
252374µs294.9ms
# spent 94.6ms (4.86+89.8) within main::BEGIN@25 which was called: # once (4.86ms+89.8ms) by main::NULL at line 25
use Pod::Usage;
# spent 94.6ms making 1 call to main::BEGIN@25 # spent 322µs making 1 call to Exporter::import
26285µs119µs
# spent 19µs within main::BEGIN@26 which was called: # once (19µs+0s) by main::NULL at line 26
use File::Spec;
# spent 19µs making 1 call to main::BEGIN@26
27
2812µs
# spent 574µs (21+553) within main::BEGIN@28 which was called: # once (21µs+553µs) by main::NULL at line 33
use vars qw(
29 $spamtest %opt $isspam $forget
30 $messagecount $learnedcount $messagelimit
31 $progress $total_messages $init_results $start_time
32 $synconly $learnprob @targets $bayes_override_path
33194µs21.13ms);
# spent 574µs making 1 call to main::BEGIN@28 # spent 553µs making 1 call to vars::import
34
3516µsmy $PREFIX = '/usr/local'; # substituted at 'make' time
3612µsmy $DEF_RULES_DIR = '/usr/local/share/spamassassin'; # substituted at 'make' time
3712µsmy $LOCAL_RULES_DIR = '/usr/local/etc/mail/spamassassin'; # substituted at 'make' time
38
392618µs22.07ms
# spent 1.75ms (1.29+466µs) within main::BEGIN@39 which was called: # once (1.29ms+466µs) by main::NULL at line 39
use lib '/usr/local/lib/perl5/site_perl'; # substituted at 'make' time
# spent 1.75ms making 1 call to main::BEGIN@39 # spent 321µs making 1 call to lib::import
40
41
# spent 174µs (49+125) within main::BEGIN@41 which was called: # once (49µs+125µs) by main::NULL at line 63
BEGIN { # see comments in "spamassassin.raw" for doco
42117µs162µs my @bin = File::Spec->splitpath($0);
# spent 62µs making 1 call to File::Spec::Unix::splitpath
4313µs my $bin = ($bin[0] ? File::Spec->catpath(@bin[0..1]) : $bin[1])
44 || File::Spec->curdir;
45
46188µs263µs if (-e $bin.'/lib/Mail/SpamAssassin.pm'
# spent 63µs making 2 calls to main::CORE:ftis, avg 31µs/call
47 || !-e '/usr/local/lib/perl5/site_perl/Mail/SpamAssassin.pm' )
48 {
49 my $searchrelative;
50 if ($searchrelative && $bin eq '../' && -e '../blib/lib/Mail/SpamAssassin.pm')
51 {
52 unshift ( @INC, '../blib/lib' );
53 } else {
54 foreach ( qw(lib ../lib/site_perl
55 ../lib/spamassassin ../share/spamassassin/lib))
56 {
57 my $dir = File::Spec->catdir( $bin, split ( '/', $_ ) );
58 if ( -f File::Spec->catfile( $dir, "Mail", "SpamAssassin.pm" ) )
59 { unshift ( @INC, $dir ); last; }
60 }
61 }
62 }
63172µs1174µs}
# spent 174µs making 1 call to main::BEGIN@41
64
652355µs1572ms
# spent 572ms (17.3+555) within main::BEGIN@65 which was called: # once (17.3ms+555ms) by main::NULL at line 65
use Mail::SpamAssassin;
# spent 572ms making 1 call to main::BEGIN@65
662415µs115.8ms
# spent 15.8ms (11.1+4.72) within main::BEGIN@66 which was called: # once (11.1ms+4.72ms) by main::NULL at line 66
use Mail::SpamAssassin::ArchiveIterator;
# spent 15.8ms making 1 call to main::BEGIN@66
67264µs128µs
# spent 28µs within main::BEGIN@67 which was called: # once (28µs+0s) by main::NULL at line 67
use Mail::SpamAssassin::Message;
# spent 28µs making 1 call to main::BEGIN@67
682432µs14.10ms
# spent 4.10ms (1.39+2.71) within main::BEGIN@68 which was called: # once (1.39ms+2.71ms) by main::NULL at line 68
use Mail::SpamAssassin::PerMsgLearner;
# spent 4.10ms making 1 call to main::BEGIN@68
692420µs17.57ms
# spent 7.57ms (2.53+5.04) within main::BEGIN@69 which was called: # once (2.53ms+5.04ms) by main::NULL at line 69
use Mail::SpamAssassin::Util::Progress;
# spent 7.57ms making 1 call to main::BEGIN@69
7029.58ms2376µs
# spent 204µs (32+172) within main::BEGIN@70 which was called: # once (32µs+172µs) by main::NULL at line 70
use Mail::SpamAssassin::Logger;
# spent 204µs making 1 call to main::BEGIN@70 # spent 172µs making 1 call to Exporter::import
71
72###########################################################################
73
74184µs$SIG{PIPE} = 'IGNORE';
75
76# used to be CmdLearn::cmd_run() ...
77
78116µs%opt = (
79 'force-expire' => 0,
80 'use-ignores' => 0,
81 'nosync' => 0,
82 'quiet' => 0,
83 'cf' => []
84);
85
86118µs1268µsGetopt::Long::Configure(
# spent 268µs making 1 call to Getopt::Long::Configure
87 qw(bundling no_getopt_compat
88 permute no_auto_abbrev no_ignore_case)
89);
90
91GetOptions(
92 'forget' => \$forget,
93 'ham|nonspam' => sub { $isspam = 0; },
94110µs
# spent 6µs within main::__ANON__[/usr/local/bin/sa-learn:94] which was called: # once (6µs+0s) by Getopt::Long::GetOptionsFromArray at line 605 of Getopt/Long.pm
'spam' => sub { $isspam = 1; },
95 'sync' => \$synconly,
96 'rebuild' => sub { $synconly = 1; warn "The --rebuild option has been deprecated. Please use --sync instead.\n" },
97
98 'q|quiet' => \$opt{'quiet'},
99 'username|u=s' => \$opt{'username'},
100 'configpath|config-file|config-dir|c|C=s' => \$opt{'configpath'},
101 'prefspath|prefs-file|p=s' => \$opt{'prefspath'},
102 'siteconfigpath=s' => \$opt{'siteconfigpath'},
10315µs 'cf=s' => \@{$opt{'cf'}},
104
105 'folders|f=s' => \$opt{'folders'},
106 'force-expire|expire' => \$opt{'force-expire'},
107 'local|L' => \$opt{'local'},
108 'no-sync|nosync' => \$opt{'nosync'},
109 'showdots' => \$opt{'showdots'},
110 'progress' => \$opt{'progress'},
111 'use-ignores' => \$opt{'use-ignores'},
112 'no-rebuild|norebuild' => sub { $opt{'nosync'} = 1; warn "The --no-rebuild option has been deprecated. Please use --no-sync instead.\n" },
113
114 'learnprob=f' => \$opt{'learnprob'},
115 'randseed=i' => \$opt{'randseed'},
116 'stopafter=i' => \$opt{'stopafter'},
117 'max-size=i' => \$opt{'max-size'},
118
119 'debug|debug-level|D:s' => \$opt{'debug'},
120 'help|h|?' => \$opt{'help'},
121 'version|V' => \$opt{'version'},
122
123 'dump:s' => \$opt{'dump'},
124 'import' => \$opt{'import'},
125
126 'backup' => \$opt{'backup'},
127 'clear' => \$opt{'clear'},
128 'restore=s' => \$opt{'restore'},
129
130 'dir' => sub { $opt{'old_format'} = 'dir'; },
131 'file' => sub { $opt{'old_format'} = 'file'; },
132 'mbox' => sub { $opt{'format'} = 'mbox'; },
133 'mbx' => sub { $opt{'format'} = 'mbx'; },
134 'single' => sub { $opt{'old_format'} = 'single'; },
135
136 'db|dbpath=s' => \$bayes_override_path,
137198µs135µs 're|regexp=s' => \$opt{'regexp'},
# spent 35µs making 1 call to Getopt::Long::GetOptions
138
139 '<>' => \&target,
140 )
141 or usage( 0, "Unknown option!" );
142
14312µsif ( defined $opt{'help'} ) {
144 usage( 0, "For more information read the manual page" );
145}
14612µsif ( defined $opt{'version'} ) {
147 print "SpamAssassin version " . Mail::SpamAssassin::Version() . "\n";
148 exit 0;
149}
150
151# set debug areas, if any specified (only useful for command-line tools)
15212µsif (defined $opt{'debug'}) {
153 $opt{'debug'} ||= 'all';
154}
155
15612µsif ( $opt{'force-expire'} ) {
157 $synconly = 1;
158}
159
16012µsif ($opt{'showdots'} && $opt{'progress'}) {
161 print "--showdots and --progress may not be used together, please select just one\n";
162 exit 0;
163}
164
16512µsif ( !defined $isspam
166 && !defined $synconly
167 && !defined $forget
168 && !defined $opt{'dump'}
169 && !defined $opt{'import'}
170 && !defined $opt{'clear'}
171 && !defined $opt{'backup'}
172 && !defined $opt{'restore'}
173 && !defined $opt{'folders'} )
174{
175 usage( 0,
176"Please select either --spam, --ham, --folders, --forget, --sync, --import,\n--dump, --clear, --backup or --restore"
177 );
178}
179
180# We need to make sure the journal syncs pre-forget...
18112µsif ( defined $forget && $opt{'nosync'} ) {
182 $opt{'nosync'} = 0;
183 warn
184"sa-learn warning: --forget requires read/write access to the database, and is incompatible with --no-sync\n";
185}
186
18712µsif ( defined $opt{'old_format'} ) {
188
189 #Format specified in the 2.5x form of --dir, --file, --mbox, --mbx or --single.
190 #Convert it to the new behavior:
191 if ( $opt{'old_format'} eq 'single' ) {
192 push ( @ARGV, '-' );
193 }
194}
195
19613µsmy $post_config = '';
197
198# kluge to support old check_bayes_db operation
199# bug 3799: init() will go r/o with the configured DB, and then dbpath needs
200# to override. Just access the dbpath version via post_config_text.
20112µsif ( defined $bayes_override_path ) {
202 # Add a default prefix if the path is a directory
203 if ( -d $bayes_override_path ) {
204 $bayes_override_path = File::Spec->catfile( $bayes_override_path, 'bayes' );
205 }
206
207 $post_config .= "bayes_path $bayes_override_path\n";
208}
209
210# These options require bayes_scanner, which requires "use_bayes 1", but
211# that's not necessary for these commands.
21215µsif (defined $opt{'dump'} || defined $opt{'import'} || defined $opt{'clear'} ||
213 defined $opt{'backup'} || defined $opt{'restore'}) {
214 $post_config .= "use_bayes 1\n";
215}
216
217211µs$post_config .= join("\n", @{$opt{'cf'}})."\n";
218
219# create the tester factory
220$spamtest = new Mail::SpamAssassin(
221 {
222 rules_filename => $opt{'configpath'},
223 site_rules_filename => $opt{'siteconfigpath'},
224 userprefs_filename => $opt{'prefspath'},
225 username => $opt{'username'},
226 debug => $opt{'debug'},
227132µs147.0ms local_tests_only => $opt{'local'},
# spent 47.0ms making 1 call to Mail::SpamAssassin::new
228 dont_copy_prefs => 1,
229 PREFIX => $PREFIX,
230 DEF_RULES_DIR => $DEF_RULES_DIR,
231 LOCAL_RULES_DIR => $LOCAL_RULES_DIR,
232 post_config_text => $post_config,
233 }
234);
235
236  1    12µs    1    12.8s    $spamtest->init(1);
# spent 12.8s making 1 call to Mail::SpamAssassin::init
23718µs17µsdbg("sa-learn: spamtest initialized");
# spent 7µs making 1 call to Mail::SpamAssassin::Logger::dbg
238
239# Bug 6228 hack: bridge the transition gap of moving Bayes.pm into a plugin;
240# To be resolved more cleanly!!!
24116µsif ($spamtest->{bayes_scanner}) {
242211µs foreach my $plugin ( @{ $spamtest->{plugins}->{plugins} } ) {
24327489µs27119µs if ($plugin->isa('Mail::SpamAssassin::Plugin::Bayes')) {
# spent 119µs making 27 calls to UNIVERSAL::isa, avg 4µs/call
244 # copy plugin's "store" object ref one level up!
24515µs $spamtest->{bayes_scanner}->{store} = $plugin->{store};
246 }
247 }
248}
249
25019µs123µsif (Mail::SpamAssassin::Util::am_running_on_windows()) {
251 binmode(STDIN) or die "cannot set binmode on STDIN: $!"; # bug 4363
252 binmode(STDOUT) or die "cannot set binmode on STDOUT: $!";
253}
254
25514µsif ( defined $opt{'dump'} ) {
256 my ( $magic, $toks );
257
258 if ( $opt{'dump'} eq 'all' || $opt{'dump'} eq '' ) { # show us all tokens!
259 ( $magic, $toks ) = ( 1, 1 );
260 }
261 elsif ( $opt{'dump'} eq 'magic' ) { # show us magic tokens only
262 ( $magic, $toks ) = ( 1, 0 );
263 }
264 elsif ( $opt{'dump'} eq 'data' ) { # show us data tokens only
265 ( $magic, $toks ) = ( 0, 1 );
266 }
267 else { # unknown option
268 warn "Unknown dump option '" . $opt{'dump'} . "'\n";
269 $spamtest->finish_learner();
270 exit 1;
271 }
272
273 if (!$spamtest->dump_bayes_db( $magic, $toks, $opt{'regexp'}) ) {
274 $spamtest->finish_learner();
275 die "ERROR: Bayes dump returned an error, please re-run with -D for more information\n";
276 }
277
278 $spamtest->finish_learner();
279 # make sure we notice any write errors while flushing output buffer
280 close STDOUT or die "error closing STDOUT: $!";
281 close STDIN or die "error closing STDIN: $!";
282 exit 0;
283}
284
28513µsif ( defined $opt{'import'} ) {
286 my $ret = $spamtest->{bayes_scanner}->{store}->perform_upgrade();
287 $spamtest->finish_learner();
288 # make sure we notice any write errors while flushing output buffer
289 close STDOUT or die "error closing STDOUT: $!";
290 close STDIN or die "error closing STDIN: $!";
291 exit( !$ret );
292}
293
29413µsif (defined $opt{'clear'}) {
295 unless ($spamtest->{bayes_scanner}->{store}->clear_database()) {
296 $spamtest->finish_learner();
297 die "ERROR: Bayes clear returned an error, please re-run with -D for more information\n";
298 }
299
300 $spamtest->finish_learner();
301 # make sure we notice any write errors while flushing output buffer
302 close STDOUT or die "error closing STDOUT: $!";
303 close STDIN or die "error closing STDIN: $!";
304 exit 0;
305}
306
30712µsif (defined $opt{'backup'}) {
308 unless ($spamtest->{bayes_scanner}->{store}->backup_database()) {
309 $spamtest->finish_learner();
310 die "ERROR: Bayes backup returned an error, please re-run with -D for more information\n";
311 }
312
313 $spamtest->finish_learner();
314 # make sure we notice any write errors while flushing output buffer
315 close STDOUT or die "error closing STDOUT: $!";
316 close STDIN or die "error closing STDIN: $!";
317 exit 0;
318}
319
32013µsif (defined $opt{'restore'}) {
321
322 my $filename = $opt{'restore'};
323
324 unless ($filename) {
325 $spamtest->finish_learner();
326 die "ERROR: You must specify a filename to restore.\n";
327 }
328
329 unless ($spamtest->{bayes_scanner}->{store}->restore_database($filename, $opt{'showdots'})) {
330 $spamtest->finish_learner();
331 die "ERROR: Bayes restore returned an error, please re-run with -D for more information\n";
332 }
333
334 $spamtest->finish_learner();
335 # make sure we notice any write errors while flushing output buffer
336 close STDOUT or die "error closing STDOUT: $!";
337 close STDIN or die "error closing STDIN: $!";
338 exit 0;
339}
340
34114µsif ( !$spamtest->{conf}->{use_bayes} ) {
342 warn "ERROR: configuration specifies 'use_bayes 0', sa-learn disabled\n";
343 exit 1;
344}
345
346$spamtest->init_learner(
347 {
348 force_expire => $opt{'force-expire'},
349123µs1228µs learn_to_journal => $opt{'nosync'},
# spent 228µs making 1 call to Mail::SpamAssassin::init_learner
350 wait_for_lock => 1,
351 caller_will_untie => 1
352 }
353);
354
35514µs$spamtest->{bayes_scanner}{use_ignores} = $opt{'use-ignores'};
356
35712µsif ($synconly) {
358 $spamtest->rebuild_learner_caches(
359 {
360 verbose => !$opt{'quiet'},
361 showdots => $opt{'showdots'}
362 }
363 );
364 $spamtest->finish_learner();
365 # make sure we notice any write errors while flushing output buffer
366 close STDOUT or die "error closing STDOUT: $!";
367 close STDIN or die "error closing STDIN: $!";
368 exit 0;
369}
370
37114µs$messagelimit = $opt{'stopafter'};
37213µs$learnprob = $opt{'learnprob'};
373
37413µsif ( defined $opt{'randseed'} ) {
375 srand( $opt{'randseed'} );
376}
377
378# sync the journal first if we're going to go r/w so we make sure to
379# learn everything before doing anything else.
380#
38112µsif ( !$opt{nosync} ) {
382 $spamtest->rebuild_learner_caches();
383}
384
385# what is the result of the run? will end up being the exit code.
38612µsmy $exit_status = 0;
387
388# run this lot in an eval block, so we can catch die's and clear
389# up the dbs.
390eval {
391120µs $SIG{HUP} = \&killed;
39217µs $SIG{INT} = \&killed;
39318µs $SIG{TERM} = \&killed;
394
39513µs if ( $opt{folders} ) {
396 open( F, $opt{folders} ) or die "cannot open $opt{folders}: $!";
397 for ($!=0; <F>; $!=0) {
398 chomp;
399 next if /^\s*$/;
400 if (/^(?:ham|spam):\w*:/) {
401 push ( @targets, $_ );
402 }
403 else {
404 target($_);
405 }
406 }
407 defined $_ || $!==0 or
408 $!==EBADF ? dbg("error reading from $opt{folders}: $!")
409 : die "error reading from $opt{folders}: $!";
410 close(F) or die "error closing $opt{folders}: $!";
411 }
412
413 ###########################################################################
414 # Deal with the target listing, and STDIN -> tempfile
415
41612µs my $tempfile; # will be defined if stdin -> tempfile
41713µs push(@targets, @ARGV);
41813µs @targets = ('-') unless @targets || $opt{folders};
419
420120µs for(my $elem = 0; $elem <= $#targets; $elem++) {
421 # ArchiveIterator doesn't really like STDIN, so if "-" is specified
422 # as a target, make it a temp file instead.
423232µs213µs if ( $targets[$elem] =~ /(?:^|:)-$/ ) {
# spent 13µs making 2 calls to main::CORE:match, avg 7µs/call
424 if (defined $tempfile) {
425 # uh-oh, stdin specified multiple times?
426 warn "skipping extra stdin target (".$targets[$elem].")\n";
427 splice @targets, $elem, 1;
428 $elem--; # go back to this element again
429 next;
430 }
431 else {
432 my $handle;
433 ( $tempfile, $handle ) = Mail::SpamAssassin::Util::secure_tmpfile();
434 binmode $handle or die "cannot set binmode on file $tempfile: $!";
435
436 # avoid slurping the whole file into memory, copy chunk by chunk
437 my($inbuf,$nread);
438 while ( $nread=sysread(STDIN,$inbuf,16384) )
439 { print {$handle} $inbuf or die "error writing to $tempfile: $!" }
440 defined $nread or die "error reading from STDIN: $!";
441 close $handle or die "error closing $tempfile: $!";
442
443 # re-aim the targets at the tempfile instead of STDIN
444 $targets[$elem] =~ s/-$/$tempfile/;
445 }
446 }
447
448 # make sure the target list is in the normal AI format
449227µs212µs if ($targets[$elem] !~ /^[^:]*:[a-z]+:/) {
# spent 12µs making 2 calls to main::CORE:match, avg 6µs/call
450 my $item = splice @targets, $elem, 1;
451 target($item); # add back to the list
452 $elem--; # go back to this element again
453 next;
454 }
455 }
456
457 ###########################################################################
458
459 my $iter = new Mail::SpamAssassin::ArchiveIterator(
460 {
461 # skip messages larger than max-size bytes,
462 # 0 for no limit, undef defaults to 256 KB
463 'opt_max_size' => $opt{'max-size'},
464 'opt_want_date' => 0,
465 'opt_from_regex' => $spamtest->{conf}->{mbox_format_from_regex},
466 }
467123µs149µs );
# spent 49µs making 1 call to Mail::SpamAssassin::ArchiveIterator::new
468
469111µs115µs $iter->set_functions(\&wanted, \&result);
47013µs $messagecount = 0;
47113µs $learnedcount = 0;
472
47312µs $init_results = 0;
47414µs $start_time = time;
475
476 # if exit_status isn't already set to non-zero, set it to the reverse of the
477 # run result (0 is bad, 1+ is good -- the opposite of exit status codes)
478  3    19µs    1    96438s     my $run_ok = eval { $exit_status ||= ! $iter->run(@targets); 1 };
# spent 96438s making 1 call to Mail::SpamAssassin::ArchiveIterator::run
479
48013µs print STDERR "\n" if ($opt{showdots});
48112µs $progress->final() if ($opt{progress} && $progress);
482
48313µs my $phrase = defined $forget ? "Forgot" : "Learned";
484 print "$phrase tokens from $learnedcount message(s) ($messagecount message(s) examined)\n"
485143µs120µs if !$opt{'quiet'};
# spent 20µs making 1 call to main::CORE:print
486
487 # If we needed to make a tempfile, go delete it.
48812µs if (defined $tempfile) {
489 unlink $tempfile or die "cannot unlink temporary file $tempfile: $!";
490 undef $tempfile;
491 }
492
49312µs if (!$run_ok && $@ !~ /HITLIMIT/) { die $@ }
494130µs 1;
49514µs} or do {
496 my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat;
497 $spamtest->finish_learner();
498 die $eval_stat;
499};
500
501111µs13.09ms$spamtest->finish_learner();
# spent 3.09ms making 1 call to Mail::SpamAssassin::finish_learner
502# make sure we notice any write errors while flushing output buffer
503120µs17µsclose STDOUT or die "error closing STDOUT: $!";
# spent 7µs making 1 call to main::CORE:close
504112µs14µsclose STDIN or die "error closing STDIN: $!";
# spent 4µs making 1 call to main::CORE:close
505172µsexit $exit_status;
506
507###########################################################################
508
509sub killed {
510 $spamtest->finish_learner();
511 die "interrupted";
512}
513
# spent 39µs within main::target which was called 2 times, avg 19µs/call: # 2 times (39µs+0s) by Getopt::Long::GetOptionsFromArray at line 737 of Getopt/Long.pm, avg 19µs/call
514                             sub target {
515  2    6µs                     my ($target) = @_;
516
517  2    4µs                     my $class = ( $isspam ? "spam" : "ham" );
518  2    6µs                     my $format = ( defined( $opt{'format'} ) ? $opt{'format'} : "detect" );
519
520  2    27µs                    push ( @targets, "$class:$format:$target" );
521                             }
522
523                             ###########################################################################
524

# spent 11µs within main::init_results which was called: # once (11µs+0s) by main::result at line 541
525                             sub init_results {
526  1    3µs                     $init_results = 1;
527
528  1    11µs                    return unless $opt{'progress'};
529
530                               $total_messages = $Mail::SpamAssassin::ArchiveIterator::MESSAGES;
531
532                               $progress = Mail::SpamAssassin::Util::Progress->new({total => $total_messages,});
533                             }
534
535                             ###########################################################################
536

# spent 4.54ms (4.53+11µs) within main::result which was called 234 times, avg 19µs/call: # 234 times (4.53ms+11µs) by Mail::SpamAssassin::ArchiveIterator::_run at line 326 of Mail/SpamAssassin/ArchiveIterator.pm, avg 19µs/call
537                             sub result {
538  234  1.35ms                  my ($class, $result, $time) = @_;
539
540                               # don't open results files until we get here to avoid overwriting files
541  234  646µs   1    11µs       &init_results if !$init_results;
# spent 11µs making 1 call to main::init_results
542
543  234  2.49ms                  $progress->update($messagecount) if ($opt{progress} && $progress);
544                             }
545
546                             ###########################################################################
547

# spent 96437s (33.0ms+96437) within main::wanted which was called 234 times, avg 412s/call: # 234 times (33.0ms+96437s) by Mail::SpamAssassin::ArchiveIterator::_run_file at line 414 of Mail/SpamAssassin/ArchiveIterator.pm, avg 412s/call
548                             sub wanted {
549  234  2.12ms                  my ( $class, $id, $time, $dataref ) = @_;
550
551  234  945µs                   my $spam = $class eq "s" ? 1 : 0;
552
553  234  580µs                   if ( defined($learnprob) ) {
554                                 if ( int( rand( 1 / $learnprob ) ) != 0 ) {
555                                   print STDERR '_' if ( $opt{showdots} );
556                                   return 1;
557                                 }
558                               }
559
560  234  510µs                   if ( defined($messagelimit) && $learnedcount > $messagelimit ) {
561                                 $progress->final() if ($opt{progress} && $progress);
562                                 die 'HITLIMIT';
563                               }
564
565  234  574µs                   $messagecount++;
566  234  2.78ms  234  4.85s      my $ma = $spamtest->parse($dataref);
# spent 4.85s making 234 calls to Mail::SpamAssassin::parse, avg 20.7ms/call
567
568  234  2.25ms  234  19.0ms     if ( $ma->get_header("X-Spam-Checker-Version") ) {
# spent 19.0ms making 234 calls to Mail::SpamAssassin::Message::Node::get_header, avg 81µs/call
569                                 my $new_ma = $spamtest->parse($spamtest->remove_spamassassin_markup($ma), 1);
570                                 $ma->finish();
571                                 $ma = $new_ma;
572                               }
573
574  234  2.83ms  234  96432s     my $status = $spamtest->learn( $ma, undef, $spam, $forget );
# spent 96432s making 234 calls to Mail::SpamAssassin::learn, avg 412s/call
575  234  2.60ms  234  2.40ms     my $learned = $status->did_learn();
# spent 2.40ms making 234 calls to Mail::SpamAssassin::PerMsgLearner::did_learn, avg 10µs/call
576
577  234  1.23ms                  if ( !defined $learned ) { # undef=learning unavailable
578                                 die "ERROR: the Bayes learn function returned an error, please re-run with -D for more information\n";
579                               }
580                               elsif ( $learned == 1 ) { # 1=message was learned. 0=message wasn't learned
581  234  655µs                     $learnedcount++;
582                               }
583
584                               # Do cleanup ...
585  234  2.01ms  234  3.72ms     $status->finish();
# spent 3.72ms making 234 calls to Mail::SpamAssassin::PerMsgLearner::finish, avg 16µs/call
586  234  970µs                   undef $status;
587
588  234  2.26ms  234  128ms      $ma->finish();
# spent 128ms making 234 calls to Mail::SpamAssassin::Message::finish, avg 547µs/call
589  234  3.68ms  79   1.54ms     undef $ma;
# spent 1.54ms making 79 calls to Mail::SpamAssassin::Message::DESTROY, avg 19µs/call
590
591  234  1.02ms                  print STDERR '.' if ( $opt{showdots} );
592  234  3.28ms                  return 1;
593                             }
594
595###########################################################################
596
597sub usage {
598 my ( $verbose, $message ) = @_;
599 my $ver = Mail::SpamAssassin::Version();
600 print "SpamAssassin version $ver\n";
601 pod2usage( -verbose => $verbose, -message => $message, -exitval => 64 );
602}
603
604# ---------------------------------------------------------------------------
605
606=head1 NAME
607
608sa-learn - train SpamAssassin's Bayesian classifier
609
610=head1 SYNOPSIS
611
612B<sa-learn> [options] [file]...
613
614B<sa-learn> [options] --dump [ all | data | magic ]
615
616Options:
617
618 --ham                     Learn messages as ham (non-spam)
619 --spam                    Learn messages as spam
620 --forget                  Forget a message
621 --use-ignores             Use bayes_ignore_from and bayes_ignore_to
622 --sync                    Synchronize the database and the journal if needed
623 --force-expire            Force a database sync and expiry run
624 --dbpath <path>           Allows commandline override (in bayes_path form)
625                           for where to read the Bayes DB from
626 --dump [all|data|magic]   Display the contents of the Bayes database
627                           Takes optional argument for what to display
628 --regexp <re>             For dump only, specifies which tokens to
629                           dump based on a regular expression.
630 -f file, --folders=file   Read list of files/directories from file
631 --dir                     Ignored; historical compatibility
632 --file                    Ignored; historical compatibility
633 --mbox                    Input sources are in mbox format
634 --mbx                     Input sources are in mbx format
635 --max-size <b>            Skip messages larger than b bytes;
636                           defaults to 256 KB, 0 implies no limit
637 --showdots                Show progress using dots
638 --progress                Show progress using progress bar
639 --no-sync                 Skip synchronizing the database and journal
640                           after learning
641 -L, --local               Operate locally, no network accesses
642 --import                  Migrate data from older version/non DB_File
643                           based databases
644 --clear                   Wipe out existing database
645 --backup                  Backup, to STDOUT, existing database
646 --restore <filename>      Restore a database from filename
647 -u username, --username=username
648                           Override username taken from the runtime
649                           environment, used with SQL
650 -C path, --configpath=path, --config-file=path
651                           Path to standard configuration dir
652 -p prefs, --prefspath=file, --prefs-file=file
653                           Set user preferences file
654 --siteconfigpath=path     Path for site configs
655                           (default: /etc/mail/spamassassin)
656 --cf='config line'        Additional line of configuration
657 -D, --debug [area=n,...]  Print debugging messages
658 -V, --version             Print version
659 -h, --help                Print usage message
660
661=head1 DESCRIPTION
662
663Given a typical selection of your incoming mail classified as spam or ham
664(non-spam), this tool will feed each mail to SpamAssassin, allowing it
665to 'learn' what signs are likely to mean spam, and which are likely to
666mean ham.
667
668Simply run this command once for each of your mail folders, and it will
669'learn' from the mail therein.
670
671Note that csh-style I<globbing> in the mail folder names is supported;
672in other words, listing a folder name as C<*> will scan every folder
673that matches. See C<Mail::SpamAssassin::ArchiveIterator> for more details.
674
675SpamAssassin remembers which mail messages it has already learnt, and will
676not learn them again unless you use the B<--forget> option. Messages that
677have already been filtered through SpamAssassin have its markup removed on the fly.
678
679If you make a mistake and scan a mail as ham when it is spam, or vice
680versa, simply rerun this command with the correct classification, and the
681mistake will be corrected. SpamAssassin will automatically 'forget' the
682previous indications.
683
684Users of C<spamd> who wish to perform training remotely, over a network,
685should investigate the C<spamc -L> switch.
686
687=head1 OPTIONS
688
689=over 4
690
691=item B<--ham>
692
693Learn the input message(s) as ham. If you have previously learnt any of the
694messages as spam, SpamAssassin will forget them first, then re-learn them as
695ham. Alternatively, if you have previously learnt them as ham, it'll skip them
696this time around. If the messages have already been filtered through
697SpamAssassin, the learner will ignore any modifications SpamAssassin may have
698made.
699
700=item B<--spam>
701
702Learn the input message(s) as spam. If you have previously learnt any of the
703messages as ham, SpamAssassin will forget them first, then re-learn them as
704spam. Alternatively, if you have previously learnt them as spam, it'll skip
705them this time around. If the messages have already been filtered through
706SpamAssassin, the learner will ignore any modifications SpamAssassin may have
707made.
708
709=item B<--folders>=I<filename>, B<-f> I<filename>
710
711sa-learn will read in the list of folders from the specified file, one folder
712per line in the file. If the folder is prefixed with C<ham:type:> or C<spam:type:>,
713sa-learn will learn that folder appropriately, otherwise the folders will be
714assumed to be of the type specified by B<--ham> or B<--spam>.
715
716C<type> above is optional, but is the same as the standard for
717ArchiveIterator: mbox, mbx, dir, file, or detect (the default if not
718specified).
719
720=item B<--mbox>
721
722sa-learn will read in the file(s) containing the emails to be learned,
723and will process them in mbox format (one or more emails per file).
724
725=item B<--mbx>
726
727sa-learn will read in the file(s) containing the emails to be learned,
728and will process them in mbx format (one or more emails per file).
729
730=item B<--use-ignores>
731
732Don't learn the message if a from address matches configuration file
733item C<bayes_ignore_from> or a to address matches C<bayes_ignore_to>.
734The option might be used when learning from a large file of messages
735from which the hammy spam messages or spammy ham messages have not
736been removed.
737
738=item B<--sync>
739
740Synchronize the journal and databases. Upon successfully syncing the
741database with the entries in the journal, the journal file is removed.
742
743=item B<--force-expire>
744
745Forces an expiry attempt, regardless of whether it may be necessary
746or not. Note: This doesn't mean any tokens will actually expire.
747Please see the EXPIRATION section below.
748
749Note: C<--force-expire> also causes the journal data to be synchronized
750into the Bayes databases.
751
752=item B<--forget>
753
754Forget a given message previously learnt.
755
756=item B<--dbpath>
757
758Allows a commandline override of the I<bayes_path> configuration option.
759
760=item B<--dump> I<option>
761
762Display the contents of the Bayes database. Without an option or with
763the I<all> option, all magic tokens and data tokens will be displayed.
764I<magic> will only display magic tokens, and I<data> will only display
765the data tokens.
766
767Can also use the B<--regexp> I<RE> option to specify which tokens to
768display based on a regular expression.
769
770=item B<--clear>
771
772Clear an existing Bayes database by removing all traces of the database.
773
774WARNING: This is destructive and should be used with care.
775
776=item B<--backup>
777
778Performs a dump of the Bayes database in machine/human readable format.
779
780The dump will include token and seen data. It is suitable for input back
781into the --restore command.
782
783=item B<--restore>=I<filename>
784
785Performs a restore of the Bayes database defined by I<filename>.
786
787WARNING: This is a destructive operation, previous Bayes data will be wiped out.
788
789=item B<-h>, B<--help>
790
791Print help message and exit.
792
793=item B<-u> I<username>, B<--username>=I<username>
794
795If specified, this username will override the username taken from the runtime
796environment. You can use this option to specify users in a virtual user
797configuration when using SQL as the Bayes backend.
798
799NOTE: This option will not change to the given I<username>, it will only attempt
800to act on behalf of that user. Because of this you will need to have proper
801permissions to be able to change files owned by I<username>. In the case of SQL
802this generally is not a problem.
803
804=item B<-C> I<path>, B<--configpath>=I<path>, B<--config-file>=I<path>
805
806Use the specified path for locating the distributed configuration files.
807Ignore the default directories (usually C</usr/share/spamassassin> or similar).
808
809=item B<--siteconfigpath>=I<path>
810
811Use the specified path for locating site-specific configuration files. Ignore
812the default directories (usually C</etc/mail/spamassassin> or similar).
813
814=item B<--cf='config line'>
815
816Add additional lines of configuration directly from the command-line, parsed
817after the configuration files are read. Multiple B<--cf> arguments can be
818used, and each will be considered a separate line of configuration.
819
820=item B<-p> I<prefs>, B<--prefspath>=I<prefs>, B<--prefs-file>=I<prefs>
821
822Read user score preferences from I<prefs> (usually C<$HOME/.spamassassin/user_prefs>).
823
824=item B<--progress>
825
826Prints a progress bar (to STDERR) showing the current progress. In the case
827where no valid terminal is found this option will behave very much like the
828--showdots option.
829
830=item B<-D> [I<area,...>], B<--debug> [I<area,...>]
831
832Produce debugging output. If no areas are listed, all debugging information is
833printed. Diagnostic output can also be enabled for each area individually;
834I<area> is the area of the code to instrument. For example, to produce
835diagnostic output on bayes, learn, and dns, use:
836
837 spamassassin -D bayes,learn,dns
838
839For more information about which areas (also known as channels) are available,
840please see the documentation at:
841
842 C<http://wiki.apache.org/spamassassin/DebugChannels>
843
844Higher priority informational messages that are suitable for logging in normal
845circumstances are available with an area of "info".
846
847=item B<--no-sync>
848
849Skip the slow synchronization step which normally takes place after
850changing database entries. If you plan to learn from many folders in
851a batch, or to learn many individual messages one-by-one, it is faster
852to use this switch and run C<sa-learn --sync> once all the folders have
853been scanned.
854
855Clarification: The state of I<--no-sync> overrides the
856I<bayes_learn_to_journal> configuration option. If not specified,
857sa-learn will learn to the database directly. If specified, sa-learn
858will learn to the journal file.
859
860Note: I<--sync> and I<--no-sync> can be specified on the same commandline,
861which is slightly confusing. In this case, the I<--no-sync> option is
862ignored since there is no learn operation.
863
864=item B<-L>, B<--local>
865
866Do not perform any network accesses while learning details about the mail
867messages. This will speed up the learning process, but may result in a
868slightly lower accuracy.
869
870Note that this is currently ignored, as current versions of SpamAssassin will
871not perform network access while learning; but future versions may.
872
873=item B<--import>
874
875If you previously used SpamAssassin's Bayesian learner without the C<DB_File>
876module installed, it will have created files in other formats, such as
877C<GDBM_File>, C<NDBM_File>, or C<SDBM_File>. This switch allows you to migrate
878that old data into the C<DB_File> format. It will overwrite any data currently
879in the C<DB_File>.
880
881Can also be used with the B<--dbpath> I<path> option to specify the location of
882the Bayes files to use.
883
884=back
885
886=head1 MIGRATION
887
888There are now multiple backend storage modules available for storing
889users' Bayesian data. As such, you might want to migrate from one
890backend to another. Here is a simple procedure for performing such a
891migration.
892
893Note that if you have individual user databases you will have to
894perform a similar procedure for each one of them.
895
896=over 4
897
898=item sa-learn --sync
899
900This will sync any outstanding journal entries.
901
902=item sa-learn --backup > backup.txt
903
904This will save all your Bayes data to a plain text file.
905
906=item sa-learn --clear
907
908This is optional, but good to do to clear out the old database.
909
910=item Repeat!
911
912At this point, if you have multiple databases, you should perform the
913procedure above for each of them. (i.e. each user's database needs to
914be backed up before continuing.)
915
916=item Switch backends
917
918Once you have backed up all databases you can update your
919configuration for the new database backend. This will involve at least
920the bayes_store_module config option and may involve some additional
921config options depending on what is required by the module. (For
922example, you may need to configure an SQL database.)
923
924=item sa-learn --restore backup.txt
925
926Again, you need to do this for every database.
927
928=back
929
930If you are migrating to SQL you can make use of the -u <username>
931option in sa-learn to populate each user's database. Otherwise, you
932must run sa-learn as the user whose database you are restoring.
933
934
935=head1 INTRODUCTION TO BAYESIAN FILTERING
936
937(Thanks to Michael Bell for this section!)
938
939For a more lengthy description of how this works, go to
940http://www.paulgraham.com/ and see "A Plan for Spam". It's reasonably
941readable, even if statistics make me break out in hives.
942
943The short semi-inaccurate version: Given training, a spam heuristics engine
944can take the most "spammy" and "hammy" words and apply probabilistic
945analysis. Furthermore, once given a basis for the analysis, the engine can
946continue to learn iteratively by applying both the non-Bayesian and Bayesian
947rulesets together to create evolving "intelligence".
948
949SpamAssassin 2.50 and later supports Bayesian spam analysis, in
950the form of the BAYES rules. This is a new feature, quite powerful,
951and is disabled until enough messages have been learnt.
952
953The pros of Bayesian spam analysis:
954
955=over 4
956
957=item Can greatly reduce false positives and false negatives.
958
959It learns from your mail, so it is tailored to your unique e-mail flow.
960
961=item Once it starts learning, it can continue to learn from SpamAssassin
962and improve over time.
963
964=back
965
966And the cons:
967
968=over 4
969
970=item A decent number of messages are required before results are useful
971for ham/spam determination.
972
973=item It's hard to explain why a message is or isn't marked as spam.
974
975i.e.: a straightforward rule, that matches, say, "VIAGRA" is
976easy to understand. If it generates a false positive or false negative,
977it is fairly easy to understand why.
978
979With Bayesian analysis, it's all probabilities - "because the past says
980it is likely as this falls into a probabilistic distribution common to past
981spam in your systems". Tell that to your users! Tell that to the client
982when he asks "what can I do to change this". (By the way, the answer in
983this case is "use whitelisting".)
984
985=item It will take disk space and memory.
986
987The databases it maintains take quite a lot of resources to store and use.
988
989=back
990
991=head1 GETTING STARTED
992
993Still interested? OK, here are the guidelines for getting this working.
994
995First a high-level overview:
996
997=over 4
998
999=item Build a significant sample of both ham and spam.
1000
1001I suggest several thousand of each, placed in SPAM and HAM directories or
1002mailboxes. Yes, you MUST hand-sort this - otherwise the results won't be much
1003better than SpamAssassin on its own. Verify the spamminess/hamminess of EVERY
1004message. You're urged to avoid using a publicly available corpus (sample) -
1005this must be taken from YOUR mail server, if it is to be statistically useful.
1006Otherwise, the results may be pretty skewed.
1007
1008=item Use this tool to teach SpamAssassin about these samples, like so:
1009
1010 sa-learn --spam /path/to/spam/folder
1011 sa-learn --ham /path/to/ham/folder
1012 ...
1013
1014Let SpamAssassin proceed, learning stuff. When it finds ham and spam
1015it will add the "interesting tokens" to the database.
1016
1017=item If you need SpamAssassin to forget about specific messages, use
1018the B<--forget> option.
1019
1020This can be applied to either ham or spam that has run through the
1021B<sa-learn> processes. It's a bit of a hammer, really, lowering the
1022weighting of the specific tokens in that message (only if that message has
1023been processed before).
1024
1025=item Learning from single messages uses a command like this:
1026
1027 sa-learn --ham --no-sync mailmessage
1028
1029This is handy for binding to a key in your mail user agent. It's very fast, as
1030all the time-consuming stuff is deferred until you run with the C<--sync>
1031option.
1032
1033=item Autolearning is enabled by default
1034
1035If you don't have a corpus of mail saved to learn, you can let
1036SpamAssassin automatically learn the mail that you receive. If you are
1037autolearning from scratch, the amount of mail you receive will determine
1038how long until the BAYES_* rules are activated.
1039
1040=back
1041
1042=head1 EFFECTIVE TRAINING
1043
1044Learning filters require training to be effective. If you don't train
1045them, they won't work. In addition, you need to train them with new
1046messages regularly to keep them up-to-date, or their data will become
1047stale and impact accuracy.
1048
1049You need to train with both spam I<and> ham mails. One type of mail
1050alone will not have any effect.
1051
1052Note that if your mail folders contain things like forwarded spam,
1053discussions of spam-catching rules, etc., this will cause trouble. You
1054should avoid scanning those messages if possible. (An easy way to do this
1055is to move them aside, into a folder which is not scanned.)
1056
1057If the messages you are learning from have already been filtered through
1058SpamAssassin, the learner will compensate for this. In effect, it learns what
1059each message would look like if you had run C<spamassassin -d> over it in
1060advance.
1061
1062Another thing to be aware of is that you should typically aim to train
1063with at least 1000 messages of spam, and 1000 ham messages, if
1064possible. More is better, but anything over about 5000 messages does not
1065improve accuracy significantly in our tests.
1066
1067Be careful that you train from the same source -- for example, if you train
1068on old spam, but new ham mail, then the classifier will think that
1069a mail with an old date stamp is likely to be spam.
1070
1071It's also worth noting that training with a very small quantity of
1072ham will produce atrocious results. You should aim to train with at
1073least as much ham data as spam (or more if possible!).
1074
1075On an on-going basis, it is best to keep training the filter to make
1076sure it has fresh data to work from. There are various ways to do
1077this:
1078
1079=over 4
1080
1081=item 1. Supervised learning
1082
1083This means keeping a copy of all or most of your mail, separated into spam
1084and ham piles, and periodically re-training using those. It produces
1085the best results, but requires more work from you, the user.
1086
1087(An easy way to do this, by the way, is to create a new folder for
1088'deleted' messages, and instead of deleting them from other folders,
1089simply move them in there instead. Then keep all spam in a separate
1090folder and never delete it. As long as you remember to move misclassified
1091mails into the correct folder set, it is easy enough to keep up to date.)
1092
1093=item 2. Unsupervised learning from Bayesian classification
1094
1095Another way to train is to chain the results of the Bayesian classifier
1096back into the training, so it reinforces its own decisions. This is only
1097safe if you then retrain it based on any errors you discover.
1098
1099SpamAssassin does not support this method, due to experimental results
1100which strongly indicate that it does not work well, and since Bayes is
1101only one part of the resulting score presented to the user (while Bayes
1102may have made the wrong decision about a mail, it may have been overridden
1103by another system).
1104
1105=item 3. Unsupervised learning from SpamAssassin rules
1106
1107Also called 'auto-learning' in SpamAssassin. Based on statistical
1108analysis of the SpamAssassin success rates, we can automatically train the
1109Bayesian database with a certain degree of confidence that our training
1110data is accurate.
1111
1112It should be supplemented with some supervised training in addition, if
1113possible.
1114
1115This is the default, but can be turned off by setting the SpamAssassin
1116configuration parameter C<bayes_auto_learn> to 0.
1117
1118=item 4. Mistake-based training
1119
1120This means training on a small number of mails, then only training on
1121messages that SpamAssassin classifies incorrectly. This works, but it
1122takes longer to get it right than a full training session would.
1123
1124=back
1125
1126=head1 FILES
1127
1128B<sa-learn> and the other parts of SpamAssassin's Bayesian learner
1129use a set of persistent database files to store the learnt tokens, as follows.
1130
1131=over 4
1132
1133=item bayes_toks
1134
1135The database of tokens, containing the tokens learnt, their count of
1136occurrences in ham and spam, and the timestamp when the token was last
1137seen in a message.
1138
1139This database also contains some 'magic' tokens, as follows: the version
1140number of the database, the number of ham and spam messages learnt, the
1141number of tokens in the database, and timestamps of: the last journal
1142sync, the last expiry run, the last expiry token reduction count, the
1143last expiry timestamp delta, the oldest token timestamp in the database,
1144and the newest token timestamp in the database.
1145
1146This is a database file, using C<DB_File>. The database 'version
1147number' is 0 for databases from 2.5x, 1 for databases from certain 2.6x
1148development releases, 2 for 2.6x, and 3 for 3.0 and later releases.
1149
1150=item bayes_seen
1151
1152A map of Message-Id and some data from headers and body to what that
1153message was learnt as. This is used so that SpamAssassin can avoid
1154re-learning a message it has already seen, and so it can reverse the
1155training if you later decide that message was learnt incorrectly.
1156
1157This is a database file, using C<DB_File>.
1158
1159=item bayes_journal
1160
1161While SpamAssassin is scanning mails, it needs to track which tokens
1162it uses in its calculations. To avoid the contention of having each
1163SpamAssassin process attempting to gain write access to the Bayes DB,
1164the token timestamps are written to a 'journal' file which will later
1165(either automatically or via C<sa-learn --sync>) be used to synchronize
1166the Bayes DB.
1167
1168Also, through the use of C<bayes_learn_to_journal>, or when using the
1169C<--no-sync> option with sa-learn, the actual learning data will
1170be placed into the journal for later synchronization. This is typically
1171useful for high-traffic sites to avoid the same contention as stated
1172above.
1173
1174=back
1175
1176=head1 EXPIRATION
1177
1178Since SpamAssassin can auto-learn messages, the Bayes database files
1179could increase perpetually until they fill your disk. To control this,
1180SpamAssassin performs journal synchronization and bayes expiration
1181periodically when certain criteria (listed below) are met.
1182
1183SpamAssassin can sync the journal and expire the DB tokens either
1184manually or opportunistically. A journal sync is due if I<--sync>
1185is passed to sa-learn (manual), or if the following is true
1186(opportunistic):
1187
1188=over 4
1189
1190=item - bayes_journal_max_size does not equal 0 (means don't sync)
1191
1192=item - the journal file exists
1193
1194=back
1195
1196and either:
1197
1198=over 4
1199
1200=item - the journal file has a size greater than bayes_journal_max_size
1201
1202=back
1203
1204or
1205
1206=over 4
1207
1208=item - a journal sync has previously occurred, and at least 1 day has
1209passed since that sync
1210
1211=back
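
The journal sync conditions above can be sketched as a single predicate. This is a minimal illustration in Python, not SpamAssassin's own code: the function name C<journal_sync_due> and its parameters are hypothetical, and a missing journal file is represented here by a size of C<None>.

```python
DAY = 24 * 3600  # seconds


def journal_sync_due(manual, journal_max_size, journal_size, last_sync, now):
    """Return True if a journal sync is due.

    manual           -- True if --sync was passed (hypothetical flag name)
    journal_max_size -- value of bayes_journal_max_size (0 means don't sync)
    journal_size     -- journal size in bytes, or None if the file is absent
    last_sync        -- epoch time of the previous sync, or None if never
    now              -- current epoch time
    """
    if manual:
        return True
    # Opportunistic sync: both preconditions must hold.
    if journal_max_size == 0 or journal_size is None:
        return False
    # Then either the journal is too large...
    if journal_size > journal_max_size:
        return True
    # ...or a previous sync exists and is at least one day old.
    return last_sync is not None and now - last_sync >= DAY
```

Note that, under this reading, a journal that has never been synced only triggers an opportunistic sync once it outgrows C<bayes_journal_max_size>.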
1212
1213Expiry is due if I<--force-expire> is passed to sa-learn (manual),
1214or if all of the following are true (opportunistic):
1215
1216=over 4
1217
1218=item - the last expire was attempted at least 12hrs ago
1219
1220=item - bayes_auto_expire does not equal 0
1221
1222=item - the number of tokens in the DB is > 100,000
1223
1224=item - the number of tokens in the DB is > bayes_expiry_max_db_size
1225
1226=item - there is at least a 12 hr difference between the oldest and newest token atimes
1227
1228=back
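
As a rough illustration, the expiry conditions above combine as follows. This Python sketch only restates the list; the function name C<expiry_due> and its parameters are hypothetical, and all times are assumed to be epoch seconds.

```python
HOUR = 3600  # seconds


def expiry_due(force, auto_expire, num_tokens, max_db_size,
               last_expire_attempt, oldest_atime, newest_atime, now):
    """Return True if an expiry run is due.

    force       -- True if --force-expire was passed (manual expiry)
    auto_expire -- value of bayes_auto_expire
    max_db_size -- value of bayes_expiry_max_db_size
    """
    if force:
        return True
    # Opportunistic expiry: every condition must hold.
    return (
        now - last_expire_attempt >= 12 * HOUR
        and auto_expire != 0
        and num_tokens > 100_000
        and num_tokens > max_db_size
        and newest_atime - oldest_atime >= 12 * HOUR
    )
```

Because both token-count tests must pass, setting C<bayes_expiry_max_db_size> below 100,000 does not make opportunistic expiry start any earlier.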
1229
1230=head2 EXPIRE LOGIC
1231
1232If either the manual or opportunistic method causes an expire run
1233to start, here is the logic that is used:
1234
1235=over 4
1236
1237=item - figure out how many tokens to keep: take the larger of
1238either bayes_expiry_max_db_size * 75% or 100,000 tokens. The goal
1239reduction is then the current number of tokens minus the number of tokens to keep.
1240
1241=item - if the reduction number is < 1000 tokens, abort (not worth the effort).
1242
1243=item - if an expire has been done before, guesstimate the new
1244atime delta based on the old atime delta. (new_atime_delta =
1245old_atime_delta * old_reduction_count / goal)
1246
1247=item - if no expire has been done before, or the last expire looks
1248"weird", do an estimation pass. The definition of "weird" is:
1249
1250=over 8
1251
1252=item - last expire over 30 days ago
1253
1254=item - last atime delta was < 12 hrs
1255
1256=item - last reduction count was < 1000 tokens
1257
1258=item - estimated new atime delta is < 12 hrs
1259
1260=item - the difference between the last reduction count and the goal reduction count is > 50%
1261
1262=back
1263
1264=back
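
The decision steps above can be worked through in a short Python sketch. It is an illustration under stated assumptions rather than the actual implementation: C<plan_expiry>, C<goal_reduction>, and the C<last> dictionary keys are hypothetical names, and the "difference > 50%" test is assumed to be measured relative to the goal reduction.

```python
HOUR = 3600
DAY = 24 * HOUR


def goal_reduction(num_tokens, max_db_size):
    """Tokens to remove: current count minus the number to keep."""
    keep = max(int(max_db_size * 0.75), 100_000)
    return num_tokens - keep


def plan_expiry(num_tokens, max_db_size, last=None, now=0):
    """Return (action, atime_delta).

    action is "abort", "estimate" (run an estimation pass) or "use_delta".
    last, if given, describes the previous expire run with the hypothetical
    keys "time", "atime_delta" and "reduction_count".
    """
    goal = goal_reduction(num_tokens, max_db_size)
    if goal < 1000:
        return ("abort", None)  # not worth the effort
    if last is None:
        return ("estimate", None)  # no previous expire to extrapolate from
    # Guesstimate the new delta from the previous run.
    new_delta = last["atime_delta"] * last["reduction_count"] / goal
    weird = (
        now - last["time"] > 30 * DAY
        or last["atime_delta"] < 12 * HOUR
        or last["reduction_count"] < 1000
        or new_delta < 12 * HOUR
        # Assumption: the 50% difference is taken relative to the goal.
        or abs(last["reduction_count"] - goal) / goal > 0.5
    )
    return ("estimate", None) if weird else ("use_delta", new_delta)
```

For example, with 200,000 tokens and the default C<bayes_expiry_max_db_size> of 150,000, the number of tokens to keep is 112,500 and the goal reduction is 87,500.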
1265
1266=head2 ESTIMATION PASS LOGIC
1267
1268Go through each of the DB's tokens. Starting at 12hrs, calculate
1269whether or not the token would be expired (based on the difference
1270between the token's atime and the db's newest token atime) and keep
1271the count. Work out from 12hrs exponentially by powers of 2. ie:
127212hrs * 1, 12hrs * 2, 12hrs * 4, 12hrs * 8, and so on, up to 12hrs
1273* 512 (6144hrs, or 256 days).
1274
1275The larger the delta, the smaller the number of tokens that will
1276be expired. Conversely, the number of tokens goes up as the delta
1277gets smaller. So starting at the largest atime delta, figure out
1278which delta will expire the most tokens without going above the
1279goal expiration count. Use this to choose the atime delta to use,
1280unless one of the following occurs:
1281
1282=over 8
1283
1284=item - the largest atime delta (smallest reduction count) would expire
1285too many tokens. This means the learned tokens are mostly old and
1286new tokens need to be learned before an expire can
1287occur.
1288
1289=item - all of the atime choices result in 0 tokens being removed.
1290this means the tokens are all newer than 12 hours and there needs
1291to be new tokens learned before an expire can occur.
1292
1293=item - the number of tokens that would be removed is < 1000. the
1294benefit isn't worth the effort. more tokens need to be learned.
1295
1296=back
1297
1298If the expire run gets past this point, it will continue to the end.
1299A new DB is created since the majority of DB libraries don't shrink the
1300DB file when tokens are removed. So we do the "create new, migrate old
1301to new, remove old, rename new" shuffle.
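
The estimation pass described above might look roughly like the following Python sketch. It is a simplified model, not the backend code: C<estimate_atime_delta> is a hypothetical name, token atimes are plain epoch seconds, and a token is assumed to count as expired when its age relative to the newest atime is strictly greater than the delta.

```python
HOUR = 3600
# 12hrs * 1, 12hrs * 2, ..., 12hrs * 512
DELTAS = [12 * HOUR * 2 ** k for k in range(10)]


def estimate_atime_delta(token_atimes, newest_atime, goal_reduction,
                         min_reduction=1000):
    """Pick the atime delta that expires the most tokens without
    exceeding goal_reduction, or return None if expiry cannot proceed."""
    # How many tokens each candidate delta would expire.
    counts = {
        d: sum(1 for atime in token_atimes if newest_atime - atime > d)
        for d in DELTAS
    }
    largest = DELTAS[-1]
    if counts[largest] > goal_reduction:
        return None  # even the largest delta would expire too many tokens
    if all(c == 0 for c in counts.values()):
        return None  # every token is newer than 12 hours
    # Walk from the largest delta toward the smallest, stopping before
    # the first delta that would go above the goal.
    chosen = largest
    for d in reversed(DELTAS):
        if counts[d] > goal_reduction:
            break
        chosen = d
    if counts[chosen] < min_reduction:
        return None  # benefit isn't worth the effort
    return chosen
```

Since the counts only grow as the delta shrinks, the last delta accepted by the loop is the one that expires the most tokens while staying at or below the goal.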
1302
1303=head2 EXPIRY RELATED CONFIGURATION SETTINGS
1304
1305=over 4
1306
1307=item C<bayes_auto_expire> is used to specify whether or not SpamAssassin
1308ought to opportunistically attempt to expire the Bayes database.
1309The default is 1 (yes).
1310
1311=item C<bayes_expiry_max_db_size> specifies both the auto-expire token
1312count point, as well as the resulting number of tokens after expiry
1313as described above. The default value is 150,000, which is roughly
1314equivalent to a 6MB database file if you're using DB_File.
1315
1316=item C<bayes_journal_max_size> specifies how large the Bayes
1317journal will grow before it is opportunistically synced. The
1318default value is 102400.
1319
1320=back
1321
1322=head1 INSTALLATION
1323
1324The B<sa-learn> command is part of the B<Mail::SpamAssassin> Perl module.
1325Install this as a normal Perl module, using C<perl -MCPAN -e shell>,
1326or by hand.
1327
1328=head1 SEE ALSO
1329
1330spamassassin(1)
1331spamc(1)
1332Mail::SpamAssassin(3)
1333Mail::SpamAssassin::ArchiveIterator(3)
1334
1335E<lt>http://www.paulgraham.com/E<gt>
1336Paul Graham's "A Plan For Spam" paper
1337
1338E<lt>http://www.linuxjournal.com/article/6467E<gt>
1339Gary Robinson's f(x) and combining algorithms, as used in SpamAssassin
1340
1341E<lt>http://www.bgl.nu/~glouis/bogofilter/E<gt>
1342'Training on error' page. A discussion of various Bayes training regimes,
1343including 'train on error' and unsupervised training.
1344
1345=head1 PREREQUISITES
1346
1347C<Mail::SpamAssassin>
1348
1349=head1 AUTHORS
1350
1351The SpamAssassin(tm) Project E<lt>http://spamassassin.apache.org/E<gt>
1352
1353=cut
1354
 
# spent 9.79ms within Encode::XS::decode which was called 1038 times, avg 9µs/call:
# 1038 times (9.79ms+0s) by Net::DNS::Domain::_decode_ascii at line 299 of Net/DNS/Domain.pm, avg 9µs/call
sub Encode::XS::decode; # xsub
# spent 489µs within Internals::SvREADONLY which was called 148 times, avg 3µs/call:
# 146 times (483µs+0s) by constant::import at line 164 of constant.pm, avg 3µs/call
# once (3µs+0s) by constant::BEGIN@24 at line 33 of constant.pm
# once (2µs+0s) by constant::BEGIN@24 at line 34 of constant.pm
sub Internals::SvREADONLY; # xsub
# spent 148µs within UNIVERSAL::VERSION which was called 6 times, avg 25µs/call:
# once (42µs+0s) by Pod::Simple::BEGIN@8 at line 8 of Pod/Simple.pm
# once (29µs+0s) by NetAddr::IP::BEGIN@8 at line 8 of NetAddr/IP.pm
# once (24µs+0s) by Encode::BEGIN@12 at line 12 of Encode.pm
# once (19µs+0s) by Mail::SpamAssassin::NetSet::BEGIN@26 at line 26 of Mail/SpamAssassin/NetSet.pm
# once (19µs+0s) by Mail::SpamAssassin::Util::BEGIN@76 at line 76 of Mail/SpamAssassin/Util.pm
# once (15µs+0s) by NetAddr::IP::BEGIN@9 at line 21 of NetAddr/IP.pm
sub UNIVERSAL::VERSION; # xsub
# spent 25.6ms within UNIVERSAL::can which was called 3017 times, avg 8µs/call:
# 1968 times (18.0ms+0s) by Mail::SpamAssassin::DnsResolver::new_dns_packet at line 602 of Mail/SpamAssassin/DnsResolver.pm, avg 9µs/call
# 324 times (2.72ms+0s) by Mail::SpamAssassin::PluginHandler::have_callback at line 166 of Mail/SpamAssassin/PluginHandler.pm, avg 8µs/call
# 234 times (2.01ms+0s) by Mail::SpamAssassin::Message::Metadata::parse_received_headers at line 272 of Mail/SpamAssassin/Message/Metadata/Received.pm, avg 9µs/call
# 234 times (1.06ms+0s) by Mail::SpamAssassin::Message::Metadata::parse_received_headers at line 278 of Mail/SpamAssassin/Message/Metadata/Received.pm, avg 5µs/call
# 189 times (1.03ms+0s) by Mail::SpamAssassin::HTML::parse at line 250 of Mail/SpamAssassin/HTML.pm, avg 5µs/call
# 55 times (618µs+0s) by Mail::SpamAssassin::Conf::Parser::cond_clause_can_or_has at line 595 of Mail/SpamAssassin/Conf/Parser.pm, avg 11µs/call
# 6 times (38µs+0s) by Mail::SpamAssassin::Util::reverse_ip_address at line 906 of Mail/SpamAssassin/Util.pm, avg 6µs/call
# 3 times (24µs+0s) by IO::Socket::SSL::BEGIN@389 at line 399 of IO/Socket/SSL.pm, avg 8µs/call
# once (9µs+0s) by Mail::SpamAssassin::DnsResolver::configured_nameservers at line 212 of Mail/SpamAssassin/DnsResolver.pm
# once (7µs+0s) by Mail::SpamAssassin::DnsResolver::configured_nameservers at line 213 of Mail/SpamAssassin/DnsResolver.pm
# once (5µs+0s) by Net::DNS::Domain::BEGIN@54 at line 1 of (eval 27)[Net/DNS/Domain.pm:54]
# once (5µs+0s) by Mail::SpamAssassin::AsyncLoop::BEGIN@49 at line 52 of Mail/SpamAssassin/AsyncLoop.pm
sub UNIVERSAL::can; # xsub
# spent 286µs within UNIVERSAL::isa which was called 57 times, avg 5µs/call:
# 27 times (142µs+0s) by base::import at line 97 of base.pm, avg 5µs/call
# 27 times (119µs+0s) by main::RUNTIME at line 243, avg 4µs/call
# 2 times (18µs+0s) by File::Path::mkpath at line 94 of File/Path.pm, avg 9µs/call
# once (6µs+0s) by Getopt::Long::GetOptionsFromArray at line 474 of Getopt/Long.pm
sub UNIVERSAL::isa; # xsub
# spent 11µs within main::CORE:close which was called 2 times, avg 5µs/call:
# once (7µs+0s) by main::RUNTIME at line 503
# once (4µs+0s) by main::RUNTIME at line 504
sub main::CORE:close; # opcode
# spent 63µs within main::CORE:ftis which was called 2 times, avg 31µs/call:
# 2 times (63µs+0s) by main::BEGIN@41 at line 46, avg 31µs/call
sub main::CORE:ftis; # opcode
# spent 25µs within main::CORE:match which was called 4 times, avg 6µs/call:
# 2 times (13µs+0s) by main::RUNTIME at line 423, avg 7µs/call
# 2 times (12µs+0s) by main::RUNTIME at line 449, avg 6µs/call
sub main::CORE:match; # opcode
# spent 129µs within main::CORE:pack which was called 24 times, avg 5µs/call:
# 2 times (10µs+0s) by Net::DNS::Resolver::Base::BEGIN@33 at line 297 of IO/Socket/INET6.pm, avg 5µs/call
# once (12µs+0s) by NetAddr::IP::BEGIN@8 at line 201 of NetAddr/IP/Lite.pm
# once (9µs+0s) by NetAddr::IP::Lite::BEGIN@18 at line 153 of NetAddr/IP/Util.pm
# once (9µs+0s) by Net::DNS::RR::BEGIN@42 at line 50 of Net/DNS/Domain.pm
# once (9µs+0s) by NetAddr::IP::Lite::BEGIN@9 at line 244 of NetAddr/IP/InetBase.pm
# once (8µs+0s) by Mail::SpamAssassin::PerMsgStatus::BEGIN@35 at line 319 of IO/Socket.pm
# once (8µs+0s) by Net::DNS::Resolver::Base::BEGIN@1.1 at line 523 of IO/Socket/IP.pm
# once (7µs+0s) by Net::DNS::RR::OPT::CLIENT_SUBNET::BEGIN@240 at line 52 of Net/DNS/RR/A.pm
# once (6µs+0s) by Net::DNS::Resolver::Base::BEGIN@57 at line 763 of Net/DNS/Packet.pm
# once (5µs+0s) by NetAddr::IP::BEGIN@8 at line 1420 of NetAddr/IP/Lite.pm
# once (5µs+0s) by NetAddr::IP::BEGIN@8 at line 416 of NetAddr/IP/Lite.pm
# once (4µs+0s) by NetAddr::IP::Lite::BEGIN@18 at line 200 of NetAddr/IP/Util.pm
# once (4µs+0s) by NetAddr::IP::BEGIN@8 at line 683 of NetAddr/IP/Lite.pm
# once (4µs+0s) by Net::DNS::RR::BEGIN@43 at line 72 of Net/DNS/DomainName.pm
# once (4µs+0s) by NetAddr::IP::Lite::BEGIN@9 at line 256 of NetAddr/IP/InetBase.pm
# once (4µs+0s) by NetAddr::IP::BEGIN@8 at line 206 of NetAddr/IP/Lite.pm
# once (3µs+0s) by NetAddr::IP::BEGIN@8 at line 202 of NetAddr/IP/Lite.pm
# once (3µs+0s) by NetAddr::IP::Lite::BEGIN@18 at line 201 of NetAddr/IP/Util.pm
# once (3µs+0s) by NetAddr::IP::BEGIN@8 at line 685 of NetAddr/IP/Lite.pm
# once (3µs+0s) by NetAddr::IP::BEGIN@8 at line 204 of NetAddr/IP/Lite.pm
# once (3µs+0s) by NetAddr::IP::BEGIN@8 at line 684 of NetAddr/IP/Lite.pm
# once (3µs+0s) by Net::DNS::RR::BEGIN@43 at line 213 of Net/DNS/DomainName.pm
# once (3µs+0s) by NetAddr::IP::Lite::BEGIN@9 at line 245 of NetAddr/IP/InetBase.pm
sub main::CORE:pack; # opcode
# spent 20µs within main::CORE:print which was called:
# once (20µs+0s) by main::RUNTIME at line 485
sub main::CORE:print; # opcode
# spent 511µs within mro::method_changed_in which was called 147 times, avg 3µs/call:
# 147 times (511µs+0s) by constant::import at line 198 of constant.pm, avg 3µs/call
sub mro::method_changed_in; # xsub
# spent 105µs within utf8::encode which was called 26 times, avg 4µs/call:
# 24 times (93µs+0s) by base::__ANON__[/usr/local/lib/perl5/5.24/base.pm:77] at line 75 of base.pm, avg 4µs/call
# once (8µs+0s) by Pod::Simple::LinkSection::BEGIN@9 at line 41 of Pod/Simple/BlackBox.pm
# once (4µs+0s) by Encode::encode_utf8 at line 231 of Encode.pm
sub utf8::encode; # xsub
# spent 18.3ms within utf8::is_utf8 which was called 3936 times, avg 5µs/call:
# 1968 times (9.92ms+0s) by Mail::SpamAssassin::DnsResolver::new_dns_packet at line 549 of Mail/SpamAssassin/DnsResolver.pm, avg 5µs/call
# 1968 times (8.43ms+0s) by Mail::SpamAssassin::Util::decode_dns_question_entry at line 940 of Mail/SpamAssassin/Util.pm, avg 4µs/call
sub utf8::is_utf8; # xsub