NYTProf Performance Profile
For /usr/local/bin/sa-learn
  Run on Sun Nov 5 03:09:29 2017
Reported on Mon Nov 6 13:20:48 2017

Filename: /usr/local/lib/perl5/site_perl/Mail/SpamAssassin/Plugin/Bayes.pm
Statements: Executed 2378601 statements in 31.2s
Subroutines
Calls P F Exclusive Time Inclusive Time Subroutine
128224117.3s24.8sMail::SpamAssassin::Plugin::Bayes::::_tokenize_lineMail::SpamAssassin::Plugin::Bayes::_tokenize_line
234112.94s32.4sMail::SpamAssassin::Plugin::Bayes::::tokenizeMail::SpamAssassin::Plugin::Bayes::tokenize
6454091812.90s2.90sMail::SpamAssassin::Plugin::Bayes::::CORE:matchMail::SpamAssassin::Plugin::Bayes::CORE:match (opcode)
4155173012.89s2.89sMail::SpamAssassin::Plugin::Bayes::::CORE:substMail::SpamAssassin::Plugin::Bayes::CORE:subst (opcode)
234111.27s3.16sMail::SpamAssassin::Plugin::Bayes::::_tokenize_headersMail::SpamAssassin::Plugin::Bayes::_tokenize_headers
229852711.04s1.04sMail::SpamAssassin::Plugin::Bayes::::CORE:substcontMail::SpamAssassin::Plugin::Bayes::CORE:substcont (opcode)
16835331976ms976msMail::SpamAssassin::Plugin::Bayes::::CORE:regcompMail::SpamAssassin::Plugin::Bayes::CORE:regcomp (opcode)
70222135ms371msMail::SpamAssassin::Plugin::Bayes::::get_msgidMail::SpamAssassin::Plugin::Bayes::get_msgid
23411119ms38.1sMail::SpamAssassin::Plugin::Bayes::::_learn_trappedMail::SpamAssassin::Plugin::Bayes::_learn_trapped
11502177.0ms105msMail::SpamAssassin::Plugin::Bayes::::_tokenize_mail_addrsMail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs
4681162.3ms109msMail::SpamAssassin::Plugin::Bayes::::_pre_chew_receivedMail::SpamAssassin::Plugin::Bayes::_pre_chew_received
7581142.1ms214msMail::SpamAssassin::Plugin::Bayes::::_pre_chew_addr_headerMail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header
2341131.3ms45.1sMail::SpamAssassin::Plugin::Bayes::::learn_messageMail::SpamAssassin::Plugin::Bayes::learn_message
2341127.2ms6.42sMail::SpamAssassin::Plugin::Bayes::::get_body_from_msgMail::SpamAssassin::Plugin::Bayes::get_body_from_msg
11121.6ms33.8msMail::SpamAssassin::Plugin::Bayes::::learner_newMail::SpamAssassin::Plugin::Bayes::learner_new
2341116.8ms6.27sMail::SpamAssassin::Plugin::Bayes::::_get_msgdata_from_permsgstatusMail::SpamAssassin::Plugin::Bayes::_get_msgdata_from_permsgstatus
2221114.1ms28.3msMail::SpamAssassin::Plugin::Bayes::::_pre_chew_content_typeMail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type
225118.57ms15.7msMail::SpamAssassin::Plugin::Bayes::::_pre_chew_message_idMail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id
236112.34ms2.34msMail::SpamAssassin::Plugin::Bayes::::read_db_configsMail::SpamAssassin::Plugin::Bayes::read_db_configs
1111.38ms2.01msMail::SpamAssassin::Plugin::Bayes::::BEGIN@63Mail::SpamAssassin::Plugin::Bayes::BEGIN@63
11162µs102µsMail::SpamAssassin::Plugin::Bayes::::newMail::SpamAssassin::Plugin::Bayes::new
11147µs1.15msMail::SpamAssassin::Plugin::Bayes::::learner_closeMail::SpamAssassin::Plugin::Bayes::learner_close
11143µs51µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@46Mail::SpamAssassin::Plugin::Bayes::BEGIN@46
11139µs110µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@1509Mail::SpamAssassin::Plugin::Bayes::BEGIN@1509
11131µs249µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@175Mail::SpamAssassin::Plugin::Bayes::BEGIN@175
11131µs754µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@165Mail::SpamAssassin::Plugin::Bayes::BEGIN@165
11130µs240µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@174Mail::SpamAssassin::Plugin::Bayes::BEGIN@174
11129µs214µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@166Mail::SpamAssassin::Plugin::Bayes::BEGIN@166
11128µs162µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@51Mail::SpamAssassin::Plugin::Bayes::BEGIN@51
11128µs233µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@68Mail::SpamAssassin::Plugin::Bayes::BEGIN@68
11126µs546µsMail::SpamAssassin::Plugin::Bayes::::learner_is_scan_availableMail::SpamAssassin::Plugin::Bayes::learner_is_scan_available
11126µs185µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@156Mail::SpamAssassin::Plugin::Bayes::BEGIN@156
11125µs74µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@49Mail::SpamAssassin::Plugin::Bayes::BEGIN@49
11125µs187µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@173Mail::SpamAssassin::Plugin::Bayes::BEGIN@173
11124µs628µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@172Mail::SpamAssassin::Plugin::Bayes::BEGIN@172
11124µs102µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@60Mail::SpamAssassin::Plugin::Bayes::BEGIN@60
11122µs153µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@223Mail::SpamAssassin::Plugin::Bayes::BEGIN@223
11122µs30µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@48Mail::SpamAssassin::Plugin::Bayes::BEGIN@48
11121µs46µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@47Mail::SpamAssassin::Plugin::Bayes::BEGIN@47
11120µs150µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@168Mail::SpamAssassin::Plugin::Bayes::BEGIN@168
11120µs150µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@59Mail::SpamAssassin::Plugin::Bayes::BEGIN@59
11120µs225µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@157Mail::SpamAssassin::Plugin::Bayes::BEGIN@157
11120µs146µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@169Mail::SpamAssassin::Plugin::Bayes::BEGIN@169
11120µs145µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@159Mail::SpamAssassin::Plugin::Bayes::BEGIN@159
11120µs145µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@179Mail::SpamAssassin::Plugin::Bayes::BEGIN@179
11120µs145µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@167Mail::SpamAssassin::Plugin::Bayes::BEGIN@167
11120µs143µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@178Mail::SpamAssassin::Plugin::Bayes::BEGIN@178
11119µs144µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@163Mail::SpamAssassin::Plugin::Bayes::BEGIN@163
11119µs141µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@219Mail::SpamAssassin::Plugin::Bayes::BEGIN@219
11119µs141µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@227Mail::SpamAssassin::Plugin::Bayes::BEGIN@227
11119µs145µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@215Mail::SpamAssassin::Plugin::Bayes::BEGIN@215
11118µs140µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@164Mail::SpamAssassin::Plugin::Bayes::BEGIN@164
11118µs150µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@158Mail::SpamAssassin::Plugin::Bayes::BEGIN@158
11118µs18µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@58Mail::SpamAssassin::Plugin::Bayes::BEGIN@58
11115µs15µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@56Mail::SpamAssassin::Plugin::Bayes::BEGIN@56
22114µs14µsMail::SpamAssassin::Plugin::Bayes::::CORE:qrMail::SpamAssassin::Plugin::Bayes::CORE:qr (opcode)
11111µs11µsMail::SpamAssassin::Plugin::Bayes::::BEGIN@57Mail::SpamAssassin::Plugin::Bayes::BEGIN@57
0000s0sMail::SpamAssassin::Plugin::Bayes::::__ANON__[:1701]Mail::SpamAssassin::Plugin::Bayes::__ANON__[:1701]
0000s0sMail::SpamAssassin::Plugin::Bayes::::__ANON__[:874]Mail::SpamAssassin::Plugin::Bayes::__ANON__[:874]
0000s0sMail::SpamAssassin::Plugin::Bayes::::__ANON__[:880]Mail::SpamAssassin::Plugin::Bayes::__ANON__[:880]
0000s0sMail::SpamAssassin::Plugin::Bayes::::__ANON__[:898]Mail::SpamAssassin::Plugin::Bayes::__ANON__[:898]
0000s0sMail::SpamAssassin::Plugin::Bayes::::_compute_declassification_distanceMail::SpamAssassin::Plugin::Bayes::_compute_declassification_distance
0000s0sMail::SpamAssassin::Plugin::Bayes::::_compute_prob_for_all_tokensMail::SpamAssassin::Plugin::Bayes::_compute_prob_for_all_tokens
0000s0sMail::SpamAssassin::Plugin::Bayes::::_compute_prob_for_tokenMail::SpamAssassin::Plugin::Bayes::_compute_prob_for_token
0000s0sMail::SpamAssassin::Plugin::Bayes::::_forget_trappedMail::SpamAssassin::Plugin::Bayes::_forget_trapped
0000s0sMail::SpamAssassin::Plugin::Bayes::::_opportunistic_callsMail::SpamAssassin::Plugin::Bayes::_opportunistic_calls
0000s0sMail::SpamAssassin::Plugin::Bayes::::bayes_report_make_listMail::SpamAssassin::Plugin::Bayes::bayes_report_make_list
0000s0sMail::SpamAssassin::Plugin::Bayes::::check_bayesMail::SpamAssassin::Plugin::Bayes::check_bayes
0000s0sMail::SpamAssassin::Plugin::Bayes::::finishMail::SpamAssassin::Plugin::Bayes::finish
0000s0sMail::SpamAssassin::Plugin::Bayes::::forget_messageMail::SpamAssassin::Plugin::Bayes::forget_message
0000s0sMail::SpamAssassin::Plugin::Bayes::::ignore_messageMail::SpamAssassin::Plugin::Bayes::ignore_message
0000s0sMail::SpamAssassin::Plugin::Bayes::::learner_dump_databaseMail::SpamAssassin::Plugin::Bayes::learner_dump_database
0000s0sMail::SpamAssassin::Plugin::Bayes::::learner_expire_old_trainingMail::SpamAssassin::Plugin::Bayes::learner_expire_old_training
0000s0sMail::SpamAssassin::Plugin::Bayes::::learner_get_implementationMail::SpamAssassin::Plugin::Bayes::learner_get_implementation
0000s0sMail::SpamAssassin::Plugin::Bayes::::learner_syncMail::SpamAssassin::Plugin::Bayes::learner_sync
0000s0sMail::SpamAssassin::Plugin::Bayes::::prefork_initMail::SpamAssassin::Plugin::Bayes::prefork_init
0000s0sMail::SpamAssassin::Plugin::Bayes::::scanMail::SpamAssassin::Plugin::Bayes::scan
0000s0sMail::SpamAssassin::Plugin::Bayes::::spamd_child_initMail::SpamAssassin::Plugin::Bayes::spamd_child_init
Line Statements Time on line Calls Time in subs Code
1# <@LICENSE>
2# Licensed to the Apache Software Foundation (ASF) under one or more
3# contributor license agreements. See the NOTICE file distributed with
4# this work for additional information regarding copyright ownership.
5# The ASF licenses this file to you under the Apache License, Version 2.0
6# (the "License"); you may not use this file except in compliance with
7# the License. You may obtain a copy of the License at:
8#
9# http://www.apache.org/licenses/LICENSE-2.0
10#
11# Unless required by applicable law or agreed to in writing, software
12# distributed under the License is distributed on an "AS IS" BASIS,
13# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14# See the License for the specific language governing permissions and
15# limitations under the License.
16# </@LICENSE>
17
18=head1 NAME
19
20Mail::SpamAssassin::Plugin::Bayes - determine spammishness using a Bayesian classifier
21
22=head1 DESCRIPTION
23
24This is a Bayesian-style probabilistic classifier, using an algorithm based on
25the one detailed in Paul Graham's I<A Plan For Spam> paper at:
26
27 http://www.paulgraham.com/spam.html
28
29It also incorporates some other aspects taken from Gary Robinson's webpage
30on the subject at:
31
32 http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
33
34And the chi-square probability combiner as described here:
35
36 http://www.linuxjournal.com/print.php?sid=6467
37
38The results are incorporated into SpamAssassin as the BAYES_* rules.
39
40=head1 METHODS
41
42=cut
43
44package Mail::SpamAssassin::Plugin::Bayes;
45
46264µs260µs
# spent 51µs (43+8) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@46 which was called: # once (43µs+8µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 46
use strict;
# spent 51µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@46 # spent 8µs making 1 call to strict::import
47261µs270µs
# spent 46µs (21+25) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@47 which was called: # once (21µs+25µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 47
use warnings;
# spent 46µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@47 # spent 24µs making 1 call to warnings::import
48256µs237µs
# spent 30µs (22+7) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@48 which was called: # once (22µs+7µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 48
use bytes;
# spent 30µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@48 # spent 8µs making 1 call to bytes::import
492140µs2122µs
# spent 74µs (25+48) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@49 which was called: # once (25µs+48µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 49
use re 'taint';
# spent 74µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@49 # spent 48µs making 1 call to re::import
50
51
# spent 162µs (28+133) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@51 which was called: # once (28µs+133µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 54
BEGIN {
52313µs1133µs eval { require Digest::SHA; import Digest::SHA qw(sha1 sha1_hex); 1 }
# spent 133µs making 1 call to Exporter::import
53110µs or do { require Digest::SHA1; import Digest::SHA1 qw(sha1 sha1_hex) }
54144µs1162µs}
# spent 162µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@51
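The BEGIN block above loads Digest::SHA and falls back to Digest::SHA1 if that fails. A minimal standalone sketch of the same load-with-fallback pattern follows; the `$SHA_MODULE` variable and the sample input string are illustrative assumptions, not part of Bayes.pm.

```perl
use strict;
use warnings;

our $SHA_MODULE;   # illustrative only: records which module was loaded

BEGIN {
  # Prefer Digest::SHA; if it cannot be loaded, fall back to Digest::SHA1.
  if (eval { require Digest::SHA; Digest::SHA->import(qw(sha1_hex)); 1 }) {
    $SHA_MODULE = 'Digest::SHA';
  }
  else {
    require Digest::SHA1;
    Digest::SHA1->import(qw(sha1_hex));
    $SHA_MODULE = 'Digest::SHA1';
  }
}

# sha1_hex() comes from whichever module loaded; both export the same name.
my $digest = sha1_hex("example token");
print "$SHA_MODULE produced a ", length($digest), "-character hex digest\n";
```

Doing the load inside BEGIN means the imported function names are known at compile time, so the rest of the file can call them like any other imported sub.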
55
56251µs115µs
# spent 15µs within Mail::SpamAssassin::Plugin::Bayes::BEGIN@56 which was called: # once (15µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 56
use Mail::SpamAssassin;
# spent 15µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@56
57249µs111µs
# spent 11µs within Mail::SpamAssassin::Plugin::Bayes::BEGIN@57 which was called: # once (11µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 57
use Mail::SpamAssassin::Plugin;
# spent 11µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@57
58260µs118µs
# spent 18µs within Mail::SpamAssassin::Plugin::Bayes::BEGIN@58 which was called: # once (18µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 58
use Mail::SpamAssassin::PerMsgStatus;
# spent 18µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@58
59256µs2279µs
# spent 150µs (20+130) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@59 which was called: # once (20µs+130µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 59
use Mail::SpamAssassin::Logger;
# spent 150µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@59 # spent 130µs making 1 call to Exporter::import
60267µs2181µs
# spent 102µs (24+79) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@60 which was called: # once (24µs+79µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 60
use Mail::SpamAssassin::Util qw(untaint_var);
# spent 102µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@60 # spent 79µs making 1 call to Exporter::import
61
62# pick ONLY ONE of these combining implementations.
632438µs12.01ms
# spent 2.01ms (1.38+628µs) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@63 which was called: # once (1.38ms+628µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 63
use Mail::SpamAssassin::Bayes::CombineChi;
# spent 2.01ms making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@63
64# use Mail::SpamAssassin::Bayes::CombineNaiveBayes;
65
66125µsour @ISA = qw(Mail::SpamAssassin::Plugin);
67
6813µs
# spent 233µs (28+205) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@68 which was called: # once (28µs+205µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 73
use vars qw{
69 $IGNORED_HDRS
70 $MARK_PRESENCE_ONLY_HDRS
71 %HEADER_NAME_COMPRESSION
72 $OPPORTUNISTIC_LOCK_VALID
7311.34ms2438µs};
# spent 233µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@68 # spent 205µs making 1 call to vars::import
74
75# Which headers should we scan for tokens? Don't use all of them, as it's easy
76# to pick up spurious clues from some. What we now do is use all of them
77# *less* these well-known headers; that way we can pick up spammers' tracking
78# headers (which are obviously not well-known in advance!).
79
80# Received is handled specially
81126µs110µs$IGNORED_HDRS = qr{(?: (?:X-)?Sender # misc noise
# spent 10µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::CORE:qr
82 |Delivered-To |Delivery-Date
83 |(?:X-)?Envelope-To
84 |X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text
85
86 |Subject # not worth a tiny gain vs. the db size increase
87
88 # Date: can provide invalid cues if your spam corpus is
89 # older/newer than ham
90 |Date
91
92 # List headers: ignore. a spamfiltering mailing list will
93 # become a nonspam sign.
94 |X-List|(?:X-)?Mailing-List
95 |(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe
96 |Unsubscribe|Host|Id|Manager|Admin|Comment
97 |Name|Url)
98 |X-Unsub(?:scribe)?
99 |X-Mailman-Version |X-Been[Tt]here |X-Loop
100 |Mail-Followup-To
101 |X-eGroups-(?:Return|From)
102 |X-MDMailing-List
103 |X-XEmacs-List
104
105 # gatewayed through mailing list (thanks to Allen Smith)
106 |(?:X-)?Resent-(?:From|To|Date)
107 |(?:X-)?Original-(?:From|To|Date)
108
109 # Spamfilter/virus-scanner headers: too easy to chain from
110 # these
111 |X-MailScanner(?:-SpamCheck)?
112 |X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
113 |X-Antispam |X-RBL-Warning |X-Mailscanner
114 |X-MDaemon-Deliver-To |X-Virus-Scanned
115 |X-Mass-Check-Id
116 |X-Pyzor |X-DCC-\S{2,25}-Metrics
117 |X-Filtered-B[Yy] |X-Scanned-By |X-Scanner
118 |X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status
119 |X-SpamCop-[^:]+
120 |X-SMTPD |(?:X-)?Spam-Apparently-To
121 |SPAM |X-Perlmx-Spam
122 |X-Bogosity
123
124 # some noisy Outlook headers that add no good clues:
125 |Content-Class |Thread-(?:Index|Topic)
126 |X-Original[Aa]rrival[Tt]ime
127
128 # Annotations from IMAP, POP, and MH:
129 |(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded
130 |Lines |Content-Length
131 |X-UIDL? |X-IMAPbase
132
133 # Annotations from Bugzilla
134 |X-Bugzilla-[^:]+
135
136 # Annotations from VM: (thanks to Allen Smith)
137 |X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified
138 |Summary-Format|VHeader|v\d-Data|Message-Order)
139
140 # Annotations from Gnus:
141 | X-Gnus-Mail-Source
142 | Xref
143
144)}x;
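The comments before $IGNORED_HDRS describe a deny-list approach: tokenize every header except the well-known ones. The sketch below shows that idea with a much shorter, hypothetical pattern. How Bayes.pm actually applies $IGNORED_HDRS (anchoring, case sensitivity) is not shown in this part of the listing, so the anchored match here is an assumption.

```perl
use strict;
use warnings;

# Hypothetical, much shorter deny-list written in the same style as
# $IGNORED_HDRS; /x lets the alternatives be spread over several lines.
my $ignored_hdrs = qr{(?: (?:X-)?Sender
  |Delivered-To
  |Date
  |X-Spam(?:-(?:Status|Level|Flag))?
)}x;

my @header_names = ('From', 'Date', 'X-Spam-Status', 'X-Tracking-Id', 'Sender');

# Anchor both ends so that only whole header names are rejected.
my @kept = grep { !/\A$ignored_hdrs\z/ } @header_names;

print join(', ', @kept), "\n";   # prints "From, X-Tracking-Id"
```

An unknown header such as the made-up X-Tracking-Id passes through, which is the point of the deny-list: tracking headers that nobody could list in advance still produce tokens.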
145
146# Note only the presence of these headers, in order to reduce the
147# hapaxen they generate.
148112µs14µs$MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
# spent 4µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::CORE:qr
149 |X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
150 |D(?:KIM|omainKey)-Signature
151)}ix;
152
153# tweaks tested as of Nov 18 2002 by jm; see the post to -devel at
154# http://sourceforge.net/p/spamassassin/mailman/message/12977556/
155# for results. The winners are now the default settings.
156267µs2345µs
# spent 185µs (26+160) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@156 which was called: # once (26µs+160µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 156
use constant IGNORE_TITLE_CASE => 1;
# spent 185µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@156 # spent 160µs making 1 call to constant::import
157264µs2430µs
# spent 225µs (20+205) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@157 which was called: # once (20µs+205µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 157
use constant TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES => 0;
# spent 225µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@157 # spent 205µs making 1 call to constant::import
158261µs2283µs
# spent 150µs (18+133) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@158 which was called: # once (18µs+133µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 158
use constant TOKENIZE_LONG_8BIT_SEQS_AS_UTF8_CHARS => 1;
# spent 150µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@158 # spent 133µs making 1 call to constant::import
159260µs2271µs
# spent 145µs (20+126) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@159 which was called: # once (20µs+126µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 159
use constant TOKENIZE_LONG_TOKENS_AS_SKIPS => 1;
# spent 145µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@159 # spent 125µs making 1 call to constant::import
160
161# tweaks by jm on May 12 2003, see -devel email at
162# http://sourceforge.net/p/spamassassin/mailman/message/14844556/
163254µs2269µs
# spent 144µs (19+125) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@163 which was called: # once (19µs+125µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 163
use constant PRE_CHEW_ADDR_HEADERS => 1;
# spent 144µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@163 # spent 125µs making 1 call to constant::import
164267µs2262µs
# spent 140µs (18+122) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@164 which was called: # once (18µs+122µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 164
use constant CHEW_BODY_URIS => 1;
# spent 140µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@164 # spent 122µs making 1 call to constant::import
1652171µs21.48ms
# spent 754µs (31+724) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@165 which was called: # once (31µs+724µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 165
use constant CHEW_BODY_MAILADDRS => 1;
# spent 754µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@165 # spent 724µs making 1 call to constant::import
166266µs2399µs
# spent 214µs (29+185) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@166 which was called: # once (29µs+185µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 166
use constant HDRS_TOKENIZE_LONG_TOKENS_AS_SKIPS => 1;
# spent 214µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@166 # spent 185µs making 1 call to constant::import
167266µs2271µs
# spent 145µs (20+126) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@167 which was called: # once (20µs+126µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 167
use constant BODY_TOKENIZE_LONG_TOKENS_AS_SKIPS => 1;
# spent 145µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@167 # spent 126µs making 1 call to constant::import
168263µs2280µs
# spent 150µs (20+130) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@168 which was called: # once (20µs+130µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 168
use constant URIS_TOKENIZE_LONG_TOKENS_AS_SKIPS => 0;
# spent 150µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@168 # spent 130µs making 1 call to constant::import
169263µs2271µs
# spent 146µs (20+126) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@169 which was called: # once (20µs+126µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 169
use constant IGNORE_MSGID_TOKENS => 0;
# spent 146µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@169 # spent 126µs making 1 call to constant::import
170
171# tweaks of 12 March 2004, see bug 2129.
1722128µs21.23ms
# spent 628µs (24+604) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@172 which was called: # once (24µs+604µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 172
use constant DECOMPOSE_BODY_TOKENS => 1;
# spent 628µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@172 # spent 604µs making 1 call to constant::import
173266µs2350µs
# spent 187µs (25+163) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@173 which was called: # once (25µs+163µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 173
use constant MAP_HEADERS_MID => 1;
# spent 187µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@173 # spent 163µs making 1 call to constant::import
1742188µs2450µs
# spent 240µs (30+210) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@174 which was called: # once (30µs+210µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 174
use constant MAP_HEADERS_FROMTOCC => 1;
# spent 240µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@174 # spent 210µs making 1 call to constant::import
175289µs2468µs
# spent 249µs (31+218) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@175 which was called: # once (31µs+218µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 175
use constant MAP_HEADERS_USERAGENT => 1;
# spent 249µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@175 # spent 218µs making 1 call to constant::import
176
177# tweaks, see http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3173#c26
178267µs2266µs
# spent 143µs (20+123) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@178 which was called: # once (20µs+123µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 178
use constant ADD_INVIZ_TOKENS_I_PREFIX => 1;
# spent 143µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@178 # spent 123µs making 1 call to constant::import
1792222µs2270µs
# spent 145µs (20+125) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@179 which was called: # once (20µs+125µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 179
use constant ADD_INVIZ_TOKENS_NO_PREFIX => 0;
# spent 145µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@179 # spent 125µs making 1 call to constant::import
180
181# We store header-mined tokens in the db with a "HHeaderName:val" format.
182# some headers may contain lots of gibberish tokens, so allow a little basic
183# compression by mapping the header name at least here. these are the headers
184# which appear with the most frequency in my db. note: this doesn't have to
185# be 2-way (ie. LHSes that map to the same RHS are not a problem), but mixing
186# tokens from multiple different headers may impact accuracy, so might as well
187# avoid this if possible. These are the top ones from my corpus, BTW (jm).
188123µs%HEADER_NAME_COMPRESSION = (
189 'Message-Id' => '*m',
190 'Message-ID' => '*M',
191 'Received' => '*r',
192 'User-Agent' => '*u',
193 'References' => '*f',
194 'In-Reply-To' => '*i',
195 'From' => '*F',
196 'Reply-To' => '*R',
197 'Return-Path' => '*p',
198 'Return-path' => '*rp',
199 'X-Mailer' => '*x',
200 'X-Authentication-Warning' => '*a',
201 'Organization' => '*o',
202 'Organisation' => '*o',
203 'Content-Type' => '*c',
204 'x-spam-relays-trusted' => '*RT',
205 'x-spam-relays-untrusted' => '*RU',
206);
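The comment above says header-mined tokens are stored in an "HHeaderName:val" format, with common header names mapped to short codes. A minimal sketch of that mapping follows. The helper `header_token`, the three-entry map, and the sample values are hypothetical; the real token construction lives elsewhere in Bayes.pm and is not shown in this part of the listing.

```perl
use strict;
use warnings;

# A few entries copied from %HEADER_NAME_COMPRESSION for illustration.
my %header_name_compression = (
  'Message-Id' => '*m',
  'Received'   => '*r',
  'From'       => '*F',
);

# Hypothetical helper: build an "H<name>:<value>" token, using the short
# code when the header name has one and the full name otherwise.
sub header_token {
  my ($hdr, $val) = @_;
  my $name = exists $header_name_compression{$hdr}
    ? $header_name_compression{$hdr}
    : $hdr;
  return "H${name}:${val}";
}

print header_token('Received', 'example.org'), "\n";   # H*r:example.org
print header_token('X-Mailer', 'mutt'), "\n";          # HX-Mailer:mutt
```

Because the lookup only goes one way, two header names may share a code (as 'Organization' and 'Organisation' do in the real table) without breaking anything; the cost is that their tokens are pooled together.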
207
208# How many seconds should the opportunistic_expire lock be valid?
20912µs$OPPORTUNISTIC_LOCK_VALID = 300;
210
211# Should we use the Robinson f(w) equation from
212# http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html ?
213# It gives better results, in that scores are more likely to distribute
214# into the <0.5 range for nonspam and >0.5 for spam.
215262µs2271µs
# spent 145µs (19+126) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@215 which was called: # once (19µs+126µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 215
use constant USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS => 1;
# spent 145µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@215 # spent 126µs making 1 call to constant::import
216
217# How many of the most significant tokens should we use for the p(w)
218# calculation?
219266µs2264µs
# spent 141µs (19+122) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@219 which was called: # once (19µs+122µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 219
use constant N_SIGNIFICANT_TOKENS => 150;
# spent 141µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@219 # spent 122µs making 1 call to constant::import
220
221# How many significant tokens are required for a classifier score to
222# be considered usable?
223263µs2285µs
# spent 153µs (22+131) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@223 which was called: # once (22µs+131µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 223
use constant REQUIRE_SIGNIFICANT_TOKENS_TO_SCORE => -1;
# spent 153µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@223 # spent 131µs making 1 call to constant::import
224
225# How long a token should we hold onto? (note: German speakers typically
226# will require a longer token than English ones.)
227213.1ms2262µs
# spent 141µs (19+122) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@227 which was called: # once (19µs+122µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 227
use constant MAX_TOKEN_LENGTH => 15;
# spent 141µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@227 # spent 122µs making 1 call to constant::import
228
229###########################################################################
230
231
# spent 102µs (62+41) within Mail::SpamAssassin::Plugin::Bayes::new which was called: # once (62µs+41µs) by Mail::SpamAssassin::PluginHandler::load_plugin at line 1 of (eval 89)[Mail/SpamAssassin/PluginHandler.pm:129]
sub new {
23213µs my $class = shift;
23313µs my ($main) = @_;
234
23512µs $class = ref($class) || $class;
236113µs118µs my $self = $class->SUPER::new($main);
# spent 18µs making 1 call to Mail::SpamAssassin::Plugin::new
23712µs bless ($self, $class);
238
23917µs $self->{main} = $main;
24013µs $self->{conf} = $main->{conf};
24113µs $self->{use_ignores} = 1;
242
243110µs123µs $self->register_eval_rule("check_bayes");
# spent 23µs making 1 call to Mail::SpamAssassin::Plugin::register_eval_rule
244111µs $self;
245}
246
247sub finish {
248 my $self = shift;
249 if ($self->{store}) {
250 $self->{store}->untie_db();
251 }
252 %{$self} = ();
253}
254
255###########################################################################
256
257# Plugin hook.
258# Return this implementation object, for callers that need to know
259# it. TODO: callers shouldn't *need* to know it!
260# used only in test suite to get access to {store}, internal APIs.
261#
262sub learner_get_implementation { return shift; }
263
264###########################################################################
265
266# Plugin hook.
267# Called in the parent process shortly before forking off child processes.
268sub prefork_init {
269 my ($self) = @_;
270
271 if ($self->{store} && $self->{store}->UNIVERSAL::can('prefork_init')) {
272 $self->{store}->prefork_init;
273 }
274}
275
276###########################################################################
277
278# Plugin hook.
279# Called in a child process shortly after being spawned.
280sub spamd_child_init {
281 my ($self) = @_;
282
283 if ($self->{store} && $self->{store}->UNIVERSAL::can('spamd_child_init')) {
284 $self->{store}->spamd_child_init;
285 }
286}
287
288###########################################################################
289
290# Plugin hook.
291sub check_bayes {
292 my ($self, $pms, $fulltext, $min, $max) = @_;
293
294 return 0 if (!$self->{conf}->{use_learner});
295 return 0 if (!$self->{conf}->{use_bayes} || !$self->{conf}->{use_bayes_rules});
296
297 if (!exists ($pms->{bayes_score})) {
298 my $timer = $self->{main}->time_method("check_bayes");
299 $pms->{bayes_score} = $self->scan($pms, $pms->{msg});
300 }
301
302 if (defined $pms->{bayes_score} &&
303 ($min == 0 || $pms->{bayes_score} > $min) &&
304 ($max eq "undef" || $pms->{bayes_score} <= $max))
305 {
306 if ($self->{conf}->{detailed_bayes_score}) {
307 $pms->test_log(sprintf ("score: %3.4f, hits: %s",
308 $pms->{bayes_score},
309 $pms->{bayes_hits}));
310 }
311 else {
312 $pms->test_log(sprintf ("score: %3.4f", $pms->{bayes_score}));
313 }
314 return 1;
315 }
316
317 return 0;
318}
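The range test in check_bayes can be read in isolation: the lower bound is exclusive unless it is exactly 0, and the upper bound is inclusive unless it is the literal string "undef". The sketch below restates only that condition. The helper name `score_in_range` and the threshold values are hypothetical and are not taken from any SpamAssassin rule file.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the condition used in check_bayes.
sub score_in_range {
  my ($score, $min, $max) = @_;
  return 0 unless defined $score;
  # A minimum of exactly 0 disables the lower-bound test.
  return 0 unless $min == 0 || $score > $min;
  # The string "undef" disables the upper-bound test.
  return 0 unless $max eq "undef" || $score <= $max;
  return 1;
}

printf "%d\n", score_in_range(0.97, 0.95, 0.99);      # inside the range
printf "%d\n", score_in_range(0.95, 0.95, 0.99);      # lower bound is exclusive
printf "%d\n", score_in_range(0.99, 0.95, 0.99);      # upper bound is inclusive
printf "%d\n", score_in_range(0.00, 0,    0.01);      # min of 0 admits a score of 0
printf "%d\n", score_in_range(1.00, 0.999, "undef");  # no upper bound
```

With half-open ranges of this shape, adjacent ranges that share a boundary value never both match the same score.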
319
320###########################################################################
321
322# Plugin hook.
323
# spent 1.15ms (47µs+1.10) within Mail::SpamAssassin::Plugin::Bayes::learner_close which was called: # once (47µs+1.10ms) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm
sub learner_close {
32412µs my ($self, $params) = @_;
32513µs my $quiet = $params->{quiet};
326
327 # do a sanity check here. Weird things happen if we remain tied
328 # after compiling; for example, spamd will never see that the
329 # number of messages has reached the bayes-scanning threshold.
330124µs112µs if ($self->{store}->db_readable()) {
# spent 12µs making 1 call to Mail::SpamAssassin::BayesStore::DBM::db_readable
33113µs warn "bayes: oops! still tied to bayes DBs, untying\n" unless $quiet;
332111µs11.09ms $self->{store}->untie_db();
# spent 1.09ms making 1 call to Mail::SpamAssassin::BayesStore::DBM::untie_db
333 }
334}
335
336###########################################################################
337
338# read configuration items to control bayes behaviour. Called by
339# BayesStore::read_db_configs().
340
# spent 2.34ms within Mail::SpamAssassin::Plugin::Bayes::read_db_configs which was called 236 times, avg 10µs/call: # 236 times (2.34ms+0s) by Mail::SpamAssassin::BayesStore::read_db_configs at line 117 of Mail/SpamAssassin/BayesStore.pm, avg 10µs/call
sub read_db_configs {
341236558µs my ($self) = @_;
342
343 # use of hapaxes. Set on bayes object, since it controls prob
344 # computation.
3452362.44ms $self->{use_hapaxes} = $self->{conf}->{bayes_use_hapaxes};
346}
347###########################################################################
348
349sub ignore_message {
350 my ($self,$PMS) = @_;
351
352 return 0 unless $self->{use_ignores};
353
354 my $ig_from = $self->{main}->call_plugins ("check_wb_list",
355 { permsgstatus => $PMS, type => 'from', list => 'bayes_ignore_from' });
356 my $ig_to = $self->{main}->call_plugins ("check_wb_list",
357 { permsgstatus => $PMS, type => 'to', list => 'bayes_ignore_to' });
358
359 my $ignore = $ig_from || $ig_to;
360
361 dbg("bayes: not using bayes, bayes_ignore_from or _to rule") if $ignore;
362
363 return $ignore;
364}
365
366###########################################################################
367
368# Plugin hook.
369
# spent 45.1s (31.3ms+45.1) within Mail::SpamAssassin::Plugin::Bayes::learn_message which was called 234 times, avg 193ms/call: # 234 times (31.3ms+45.1s) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm, avg 193ms/call
sub learn_message {
370234517µs my ($self, $params) = @_;
3712341.03ms my $isspam = $params->{isspam};
372234843µs my $msg = $params->{msg};
373234706µs my $id = $params->{id};
374
3752341.05ms if (!$self->{conf}->{use_bayes}) { return; }
376
3772342.90ms2346.42s my $msgdata = $self->get_body_from_msg ($msg);
# spent 6.42s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::get_body_from_msg, avg 27.4ms/call
378234502µs my $ret;
379
380 eval {
3812341.86ms local $SIG{'__DIE__'}; # do not run user die() traps in here
3822342.13ms2342.26ms my $timer = $self->{main}->time_method("b_learn");
# spent 2.26ms making 234 calls to Mail::SpamAssassin::time_method, avg 10µs/call
383
384234431µs my $ok;
3852341.31ms if ($self->{main}->{learn_to_journal}) {
386 # If we're going to learn to journal, we'll try going r/o first...
387 # If that fails for some reason, let's try going r/w. This happens
388 # if the DB doesn't exist yet.
3892343.50ms235602ms $ok = $self->{store}->tie_db_readonly() || $self->{store}->tie_db_writable();
# spent 598ms making 234 calls to Mail::SpamAssassin::BayesStore::DBM::tie_db_readonly, avg 2.55ms/call # spent 4.53ms making 1 call to Mail::SpamAssassin::BayesStore::DBM::tie_db_writable
390 } else {
391 $ok = $self->{store}->tie_db_writable();
392 }
393
394234929µs if ($ok) {
3952342.87ms23438.1s $ret = $self->_learn_trapped ($isspam, $msg, $msgdata, $id);
# spent 38.1s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::_learn_trapped, avg 163ms/call
396
3972341.10ms if (!$self->{main}->{learn_caller_will_untie}) {
398 $self->{store}->untie_db();
399 }
400 }
4012342.82ms 1;
402234973µs } or do { # if we died, untie the dbs.
403 my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat;
404 $self->{store}->untie_db();
405 die "bayes: (in learn) $eval_stat\n";
406 };
407
4082343.59ms return $ret;
409}
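# Editor's sketch, not part of Bayes.pm: the "eval { ...; 1 } or do { ... }"
# idiom used by learn_message above, reduced to a standalone helper. The name
# with_cleanup and both callbacks are hypothetical; $cleanup stands in for
# the untie_db() call.

```perl
use strict;
use warnings;

# Hypothetical helper: run $work, and if it dies, run $cleanup before
# re-raising the error with a "bayes:" prefix.
sub with_cleanup {
  my ($work, $cleanup) = @_;
  my $ret;
  eval {
    local $SIG{'__DIE__'};        # do not run user die() traps in here
    $ret = $work->();
    1;                            # make the eval block true on success
  } or do {
    my $eval_stat = $@ ne '' ? $@ : "errno=$!";
    chomp $eval_stat;
    $cleanup->();                 # stands in for untie_db()
    die "bayes: (in learn) $eval_stat\n";
  };
  return $ret;
}

my $cleaned = 0;
my $ok_ret  = with_cleanup(sub { 42 }, sub { $cleaned++ });
my $failed  = eval { with_cleanup(sub { die "db locked\n" }, sub { $cleaned++ }); 1 };
my $error   = $@;
print "ok_ret=$ok_ret cleaned=$cleaned error=$error";
```

# The trailing "1;" matters: without it a successful block whose last
# expression is false would wrongly trigger the "or do" branch.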
410
411# this function is trapped by the wrapper above
412
# spent 38.1s (119ms+37.9) within Mail::SpamAssassin::Plugin::Bayes::_learn_trapped which was called 234 times, avg 163ms/call: # 234 times (119ms+37.9s) by Mail::SpamAssassin::Plugin::Bayes::learn_message at line 395, avg 163ms/call
sub _learn_trapped {
413234721µs my ($self, $isspam, $msg, $msgdata, $msgid) = @_;
414234927µs my @msgid = ( $msgid );
415
4162341.14ms if (!defined $msgid) {
4172342.91ms234135ms @msgid = $self->get_msgid($msg);
# spent 135ms making 234 calls to Mail::SpamAssassin::Plugin::Bayes::get_msgid, avg 578µs/call
418 }
419
4202341.14ms foreach my $msgid_t ( @msgid ) {
4214584.67ms45831.0ms my $seen = $self->{store}->seen_get ($msgid_t);
# spent 31.0ms making 458 calls to Mail::SpamAssassin::BayesStore::DBM::seen_get, avg 68µs/call
422
4234582.90ms if (defined ($seen)) {
424 if (($seen eq 's' && $isspam) || ($seen eq 'h' && !$isspam)) {
425 dbg("bayes: $msgid_t already learnt correctly, not learning twice");
426 return 0;
427 } elsif ($seen !~ /^[hs]$/) {
428 warn("bayes: db_seen corrupt: value='$seen' for $msgid_t, ignored");
429 } else {
430 # bug 3704: If the message was already learned, don't try learning it again.
431 # this prevents, for instance, manually learning as spam, then autolearning
432 # as ham, or vice versa.
433 if ($self->{main}->{learn_no_relearn}) {
434 dbg("bayes: $msgid_t already learnt as opposite, not re-learning");
435 return 0;
436 }
437
438 dbg("bayes: $msgid_t already learnt as opposite, forgetting first");
439
440 # kluge so that forget() won't untie the db on us ...
441 my $orig = $self->{main}->{learn_caller_will_untie};
442 $self->{main}->{learn_caller_will_untie} = 1;
443
444 my $fatal = !defined $self->{main}->{bayes_scanner}->forget ($msg);
445
446 # reset the value post-forget() ...
447 $self->{main}->{learn_caller_will_untie} = $orig;
448
449 # forget() gave us a fatal error, so propagate that up
450 if ($fatal) {
451 dbg("bayes: forget() returned a fatal error, so learn() will too");
452 return;
453 }
454 }
455
456 # we're only going to have seen this once, so stop if it's been
457 # seen already
458 last;
459 }
460 }
461
462 # Now that we're sure we haven't seen this message before ...
463234767µs $msgid = $msgid[0];
464
4652342.64ms2341.48s my $msgatime = $msg->receive_date();
# spent 1.48s making 234 calls to Mail::SpamAssassin::Message::receive_date, avg 6.34ms/call
466
467 # If the message atime comes back as being more than 1 day in the
468 # future, something's messed up and we should revert to current time as
469 # a safety measure.
470 #
4712341.17ms $msgatime = time if ( $msgatime - time > 86400 );
472
4732342.82ms23432.4s my $tokens = $self->tokenize($msg, $msgdata);
# spent 32.4s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::tokenize, avg 138ms/call
474
4754688.58ms2342.94ms { my $timer = $self->{main}->time_method('b_count_change');
# spent 2.94ms making 234 calls to Mail::SpamAssassin::time_method, avg 13µs/call
4762341.01ms if ($isspam) {
4772342.63ms2349.92ms $self->{store}->nspam_nham_change(1, 0);
# spent 9.92ms making 234 calls to Mail::SpamAssassin::BayesStore::DBM::nspam_nham_change, avg 42µs/call
4782342.39ms2343.74s $self->{store}->multi_tok_count_change(1, 0, $tokens, $msgatime);
# spent 3.74s making 234 calls to Mail::SpamAssassin::BayesStore::DBM::multi_tok_count_change, avg 16.0ms/call
479 } else {
480 $self->{store}->nspam_nham_change(0, 1);
481 $self->{store}->multi_tok_count_change(0, 1, $tokens, $msgatime);
482 }
483 }
484
4852342.70ms23410.5ms $self->{store}->seen_put ($msgid, ($isspam ? 's' : 'h'));
# spent 10.5ms making 234 calls to Mail::SpamAssassin::BayesStore::DBM::seen_put, avg 45µs/call
4862342.32ms234122ms $self->{store}->cleanup();
# spent 122ms making 234 calls to Mail::SpamAssassin::BayesStore::DBM::cleanup, avg 523µs/call
487
4882346.11ms2340s $self->{main}->call_plugins("bayes_learn", { toksref => $tokens,
# spent 17.8ms making 234 calls to Mail::SpamAssassin::call_plugins, avg 76µs/call, recursion: max depth 1, sum of overlapping time 17.8ms
489 isspam => $isspam,
490 msgid => $msgid,
491 msgatime => $msgatime,
492 });
493
4942343.11ms2342.79ms dbg("bayes: learned '$msgid', atime: $msgatime");
# spent 2.79ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 12µs/call
495
49623449.3ms 1;
497}
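# Editor's sketch, not part of Bayes.pm: the per-message-ID decision made by
# the "seen" check in _learn_trapped, written as a pure function. The name
# learn_action and its return strings are hypothetical. It covers a single
# ID only; the real loop moves on to the next ID when the entry is undefined.

```perl
use strict;
use warnings;

# $seen is the stored flag ('s', 'h', undef, or something corrupt),
# $isspam is what we are about to learn, $no_relearn mirrors learn_no_relearn.
sub learn_action {
  my ($seen, $isspam, $no_relearn) = @_;
  return 'learn' unless defined $seen;                 # never seen before
  return 'skip'  if ($seen eq 's' && $isspam)          # already learnt the
                 || ($seen eq 'h' && !$isspam);        # same way
  return 'learn' if $seen !~ /^[hs]$/;                 # corrupt entry is ignored
  return $no_relearn ? 'skip' : 'forget-then-learn';   # learnt as the opposite
}

for my $case ([undef, 1, 0], ['s', 1, 0], ['h', 1, 0], ['h', 1, 1], ['x', 0, 0]) {
  my ($seen, $isspam, $no_relearn) = @$case;
  printf "seen=%s isspam=%d no_relearn=%d => %s\n",
    defined $seen ? $seen : 'undef', $isspam, $no_relearn,
    learn_action($seen, $isspam, $no_relearn);
}
```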
498
499###########################################################################
500
501# Plugin hook.
502sub forget_message {
503 my ($self, $params) = @_;
504 my $msg = $params->{msg};
505 my $id = $params->{id};
506
507 if (!$self->{conf}->{use_bayes}) { return; }
508
509 my $msgdata = $self->get_body_from_msg ($msg);
510 my $ret;
511
512 # we still tie for writing here, since we write to the seen db
513 # synchronously
514 eval {
515 local $SIG{'__DIE__'}; # do not run user die() traps in here
516 my $timer = $self->{main}->time_method("b_learn");
517
518 my $ok;
519 if ($self->{main}->{learn_to_journal}) {
520 # If we're going to learn to journal, we'll try going r/o first...
521 # If that fails for some reason, let's try going r/w. This happens
522 # if the DB doesn't exist yet.
523 $ok = $self->{store}->tie_db_readonly() || $self->{store}->tie_db_writable();
524 } else {
525 $ok = $self->{store}->tie_db_writable();
526 }
527
528 if ($ok) {
529 $ret = $self->_forget_trapped ($msg, $msgdata, $id);
530
531 if (!$self->{main}->{learn_caller_will_untie}) {
532 $self->{store}->untie_db();
533 }
534 }
535 1;
536 } or do { # if we died, untie the dbs.
537 my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat;
538 $self->{store}->untie_db();
539 die "bayes: (in forget) $eval_stat\n";
540 };
541
542 return $ret;
543}
544
545# this function is trapped by the wrapper above
546sub _forget_trapped {
547 my ($self, $msg, $msgdata, $msgid) = @_;
548 my @msgid = ( $msgid );
549 my $isspam;
550
551 if (!defined $msgid) {
552 @msgid = $self->get_msgid($msg);
553 }
554
555 while( $msgid = shift @msgid ) {
556 my $seen = $self->{store}->seen_get ($msgid);
557
558 if (defined ($seen)) {
559 if ($seen eq 's') {
560 $isspam = 1;
561 } elsif ($seen eq 'h') {
562 $isspam = 0;
563 } else {
564 dbg("bayes: forget: msgid $msgid seen entry is neither ham nor spam, ignored");
565 return 0;
566 }
567
568 # messages should only be learned once, so stop if we find a msgid
569 # which was seen before
570 last;
571 }
572 else {
573 dbg("bayes: forget: msgid $msgid not learnt, ignored");
574 }
575 }
576
577 # This message wasn't learnt before, so return
578 if (!defined $isspam) {
579 dbg("bayes: forget: no msgid from this message has been learnt, skipping message");
580 return 0;
581 }
582 elsif ($isspam) {
583 $self->{store}->nspam_nham_change (-1, 0);
584 }
585 else {
586 $self->{store}->nspam_nham_change (0, -1);
587 }
588
589 my $tokens = $self->tokenize($msg, $msgdata);
590
591 if ($isspam) {
592 $self->{store}->multi_tok_count_change (-1, 0, $tokens);
593 } else {
594 $self->{store}->multi_tok_count_change (0, -1, $tokens);
595 }
596
597 $self->{store}->seen_delete ($msgid);
598 $self->{store}->cleanup();
599
600 $self->{main}->call_plugins("bayes_forget", { toksref => $tokens,
601 isspam => $isspam,
602 msgid => $msgid,
603 });
604
605 1;
606}
607
608###########################################################################
609
610# Plugin hook.
611sub learner_sync {
612 my ($self, $params) = @_;
613 if (!$self->{conf}->{use_bayes}) { return 0; }
614 dbg("bayes: bayes journal sync starting");
615 $self->{store}->sync($params);
616 dbg("bayes: bayes journal sync completed");
617}
618
619###########################################################################
620
621# Plugin hook.
622sub learner_expire_old_training {
623 my ($self, $params) = @_;
624 if (!$self->{conf}->{use_bayes}) { return 0; }
625 dbg("bayes: expiry starting");
626 my $timer = $self->{main}->time_method("expire_bayes");
627 $self->{store}->expire_old_tokens($params);
628 dbg("bayes: expiry completed");
629}
630
631###########################################################################
632
633# Plugin hook.
634# Check to make sure we can tie() the DB, and we have enough entries to do a scan
635# if we're told the caller will untie(), go ahead and leave the db tied.
636
# spent 546µs (26+520) within Mail::SpamAssassin::Plugin::Bayes::learner_is_scan_available which was called: # once (26µs+520µs) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm
sub learner_is_scan_available {
63712µs my ($self, $params) = @_;
638
63914µs return 0 unless $self->{conf}->{use_bayes};
640118µs1520µs return 0 unless $self->{store}->tie_db_readonly();
641
642 # We need the DB to stay tied, so if the journal sync occurs, don't untie!
643 my $caller_untie = $self->{main}->{learn_caller_will_untie};
644 $self->{main}->{learn_caller_will_untie} = 1;
645
646 # Do a journal sync if necessary. Do this before the nspam_nham_get()
647 # call since the sync may cause an update in the number of messages
648 # learnt.
649 $self->_opportunistic_calls(1);
650
651 # Reset the variable appropriately
652 $self->{main}->{learn_caller_will_untie} = $caller_untie;
653
654 my ($ns, $nn) = $self->{store}->nspam_nham_get();
655
656 if ($ns < $self->{conf}->{bayes_min_spam_num}) {
657 dbg("bayes: not available for scanning, only $ns spam(s) in bayes DB < ".$self->{conf}->{bayes_min_spam_num});
658 if (!$self->{main}->{learn_caller_will_untie}) {
659 $self->{store}->untie_db();
660 }
661 return 0;
662 }
663 if ($nn < $self->{conf}->{bayes_min_ham_num}) {
664 dbg("bayes: not available for scanning, only $nn ham(s) in bayes DB < ".$self->{conf}->{bayes_min_ham_num});
665 if (!$self->{main}->{learn_caller_will_untie}) {
666 $self->{store}->untie_db();
667 }
668 return 0;
669 }
670
671 return 1;
672}
673
674###########################################################################
675
676sub scan {
677 my ($self, $permsgstatus, $msg) = @_;
678 my $score;
679
680 return unless $self->{conf}->{use_learner};
681
682 # When we're doing a scan, we'll guarantee that we'll do the untie,
683 # so override the global setting until we're done.
684 my $caller_untie = $self->{main}->{learn_caller_will_untie};
685 $self->{main}->{learn_caller_will_untie} = 1;
686
687 goto skip if ($self->{main}->{bayes_scanner}->ignore_message($permsgstatus));
688
689 goto skip unless $self->learner_is_scan_available();
690
691 my ($ns, $nn) = $self->{store}->nspam_nham_get();
692
693 ## if ($self->{log_raw_counts}) { # see _compute_prob_for_token()
694 ## $self->{raw_counts} = " ns=$ns nn=$nn ";
695 ## }
696
697 dbg("bayes: corpus size: nspam = $ns, nham = $nn");
698
699 my $msgtokens;
700 { my $timer = $self->{main}->time_method('b_tokenize');
701 my $msgdata = $self->_get_msgdata_from_permsgstatus ($permsgstatus);
702 $msgtokens = $self->tokenize($msg, $msgdata);
703 }
704
705 my $tokensdata;
706 { my $timer = $self->{main}->time_method('b_tok_get_all');
707 $tokensdata = $self->{store}->tok_get_all(keys %{$msgtokens});
708 }
709
710 my $timer_compute_prob = $self->{main}->time_method('b_comp_prob');
711
712 my $probabilities_ref =
713 $self->_compute_prob_for_all_tokens($tokensdata, $ns, $nn);
714
715 my %pw;
716 foreach my $tokendata (@{$tokensdata}) {
717 my $prob = shift(@$probabilities_ref);
718 next unless defined $prob;
719 my ($token, $tok_spam, $tok_ham, $atime) = @{$tokendata};
720 $pw{$token} = {
721 prob => $prob,
722 spam_count => $tok_spam,
723 ham_count => $tok_ham,
724 atime => $atime
725 };
726 }
727
728 my @pw_keys = keys %pw;
729
730 # If none of the tokens were found in the DB, we're going to skip
731 # this message...
732 if (!@pw_keys) {
733 dbg("bayes: cannot use bayes on this message; none of the tokens were found in the database");
734 goto skip;
735 }
736
737 my $tcount_total = keys %{$msgtokens};
738 my $tcount_learned = scalar @pw_keys;
739
740 # Figure out the message receive time (used as atime below)
741 # If the message atime comes back as being in the future, something's
742 # messed up and we should revert to current time as a safety measure.
743 #
744 my $msgatime = $msg->receive_date();
745 my $now = time;
746 $msgatime = $now if ( $msgatime > $now );
747
748 my @touch_tokens;
749 my $tinfo_spammy = $permsgstatus->{bayes_token_info_spammy} = [];
750 my $tinfo_hammy = $permsgstatus->{bayes_token_info_hammy} = [];
751
752 my %tok_strength = map( ($_, abs($pw{$_}->{prob} - 0.5)), @pw_keys);
753 my $log_each_token = (would_log('dbg', 'bayes') > 1);
754
755 # now take the most significant tokens and calculate probs using
756 # Robinson's formula.
757
758 @pw_keys = sort { $tok_strength{$b} <=> $tok_strength{$a} } @pw_keys;
759
760 if (@pw_keys > N_SIGNIFICANT_TOKENS) { $#pw_keys = N_SIGNIFICANT_TOKENS - 1 }
761
762 my @sorted;
763 foreach my $tok (@pw_keys) {
764 next if $tok_strength{$tok} <
765 $Mail::SpamAssassin::Bayes::Combine::MIN_PROB_STRENGTH;
766
767 my $pw_tok = $pw{$tok};
768 my $pw_prob = $pw_tok->{prob};
769
770 # What's more expensive, scanning headers for HAMMYTOKENS and
771 # SPAMMYTOKENS tags that aren't there or collecting data that
772 # won't be used? Just collecting the data is certainly simpler.
773 #
774 my $raw_token = $msgtokens->{$tok} || "(unknown)";
775 my $s = $pw_tok->{spam_count};
776 my $n = $pw_tok->{ham_count};
777 my $a = $pw_tok->{atime};
778
779 push( @{ $pw_prob < 0.5 ? $tinfo_hammy : $tinfo_spammy },
780 [$raw_token, $pw_prob, $s, $n, $a] );
781
782 push(@sorted, $pw_prob);
783
784 # update the atime on this token, it proved useful
785 push(@touch_tokens, $tok);
786
787 if ($log_each_token) {
788 dbg("bayes: token '$raw_token' => $pw_prob");
789 }
790 }
791
792 if (!@sorted || (REQUIRE_SIGNIFICANT_TOKENS_TO_SCORE > 0 &&
793 $#sorted <= REQUIRE_SIGNIFICANT_TOKENS_TO_SCORE))
794 {
795 dbg("bayes: cannot use bayes on this message; not enough usable tokens found");
796 goto skip;
797 }
798
799 $score = Mail::SpamAssassin::Bayes::Combine::combine($ns, $nn, \@sorted);
800 undef $timer_compute_prob; # end a timing section
801
802 # Couldn't come up with a probability?
803 goto skip unless defined $score;
804
805 dbg("bayes: score = $score");
806
807 # no need to call tok_touch_all unless there were significant
808 # tokens and a score was returned
809 # we don't really care about the return value here
810
811 { my $timer = $self->{main}->time_method('b_tok_touch_all');
812 $self->{store}->tok_touch_all(\@touch_tokens, $msgatime);
813 }
814
815 my $timer_finish = $self->{main}->time_method('b_finish');
816
817 $permsgstatus->{bayes_nspam} = $ns;
818 $permsgstatus->{bayes_nham} = $nn;
819
820 ## if ($self->{log_raw_counts}) { # see _compute_prob_for_token()
821 ## print "#Bayes-Raw-Counts: $self->{raw_counts}\n";
822 ## }
823
824 $self->{main}->call_plugins("bayes_scan", { toksref => $msgtokens,
825 probsref => \%pw,
826 score => $score,
827 msgatime => $msgatime,
828 significant_tokens => \@touch_tokens,
829 });
830
831skip:
832 if (!defined $score) {
833 dbg("bayes: not scoring message, returning undef");
834 }
835
836 undef $timer_compute_prob; # end a timing section if still running
837 if (!defined $timer_finish) {
838 $timer_finish = $self->{main}->time_method('b_finish');
839 }
840
841 # Take any opportunistic actions we can take
842 if ($self->{main}->{opportunistic_expire_check_only}) {
843 # we're supposed to report on expiry only -- so do the
844 # _opportunistic_calls() run for the journal only.
845 $self->_opportunistic_calls(1);
846 $permsgstatus->{bayes_expiry_due} = $self->{store}->expiry_due();
847 }
848 else {
849 $self->_opportunistic_calls();
850 }
851
852 # Do any cleanup we need to do
853 $self->{store}->cleanup();
854
855 # Reset the value accordingly
856 $self->{main}->{learn_caller_will_untie} = $caller_untie;
857
858 # If our caller won't untie the db, we need to do it.
859 if (!$caller_untie) {
860 $self->{store}->untie_db();
861 }
862
863 $permsgstatus->set_tag ('BAYESTCHAMMY',
864 ($tinfo_hammy ? scalar @{$tinfo_hammy} : 0));
865 $permsgstatus->set_tag ('BAYESTCSPAMMY',
866 ($tinfo_spammy ? scalar @{$tinfo_spammy} : 0));
867 $permsgstatus->set_tag ('BAYESTCLEARNED', $tcount_learned);
868 $permsgstatus->set_tag ('BAYESTC', $tcount_total);
869
870 $permsgstatus->set_tag ('HAMMYTOKENS', sub {
871 my $pms = shift;
872 $self->bayes_report_make_list
873 ($pms, $pms->{bayes_token_info_hammy}, shift);
874 });
875
876 $permsgstatus->set_tag ('SPAMMYTOKENS', sub {
877 my $pms = shift;
878 $self->bayes_report_make_list
879 ($pms, $pms->{bayes_token_info_spammy}, shift);
880 });
881
882 $permsgstatus->set_tag ('TOKENSUMMARY', sub {
883 my $pms = shift;
884 if ( defined $pms->{tag_data}{BAYESTC} )
885 {
886 my $tcount_neutral = $pms->{tag_data}{BAYESTCLEARNED}
887 - $pms->{tag_data}{BAYESTCSPAMMY}
888 - $pms->{tag_data}{BAYESTCHAMMY};
889 my $tcount_new = $pms->{tag_data}{BAYESTC}
890 - $pms->{tag_data}{BAYESTCLEARNED};
891 "Tokens: new, $tcount_new; "
892 ."hammy, $pms->{tag_data}{BAYESTCHAMMY}; "
893 ."neutral, $tcount_neutral; "
894 ."spammy, $pms->{tag_data}{BAYESTCSPAMMY}."
895 } else {
896 "Bayes not run.";
897 }
898 });
899
900
901 return $score;
902}
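# Editor's sketch, not part of Bayes.pm: how scan() picks its most significant
# tokens, by distance of each probability from 0.5. The function name, the
# limit of 3 and the 0.3 cut-off are made-up values for the demo; they are not
# the real N_SIGNIFICANT_TOKENS or MIN_PROB_STRENGTH. The "|| $a cmp $b"
# tie-break is also an addition, only there to make the output deterministic.

```perl
use strict;
use warnings;

# $probs: hashref of token => probability.
sub most_significant {
  my ($probs, $max_tokens, $min_strength) = @_;
  my %strength = map { $_ => abs($probs->{$_} - 0.5) } keys %$probs;
  # strongest first
  my @keys = sort { $strength{$b} <=> $strength{$a} || $a cmp $b } keys %strength;
  # truncate to the N strongest, as scan() does with $#pw_keys
  $#keys = $max_tokens - 1 if @keys > $max_tokens;
  # then drop anything too close to 0.5 to be useful
  return grep { $strength{$_} >= $min_strength } @keys;
}

my %probs  = (a => 0.99, b => 0.5, c => 0.02, d => 0.6, e => 0.9);
my @picked = most_significant(\%probs, 3, 0.3);
print "picked: @picked\n";
```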
903
904###########################################################################
905
906# Plugin hook.
907sub learner_dump_database {
908 my ($self, $params) = @_;
909 my $magic = $params->{magic};
910 my $toks = $params->{toks};
911 my $regex = $params->{regex};
912
913 # allow dump to occur even if use_bayes disables everything else ...
914 #return 0 unless $self->{conf}->{use_bayes};
915 return 0 unless $self->{store}->tie_db_readonly();
916
917 my @vars = $self->{store}->get_storage_variables();
918
919 my($sb,$ns,$nh,$nt,$le,$oa,$bv,$js,$ad,$er,$na) = @vars;
920
921 my $template = '%3.3f %10u %10u %10u %s'."\n";
922
923 if ( $magic ) {
924 printf($template, 0.0, 0, $bv, 0, 'non-token data: bayes db version')
925 or die "Error writing: $!";
926 printf($template, 0.0, 0, $ns, 0, 'non-token data: nspam')
927 or die "Error writing: $!";
928 printf($template, 0.0, 0, $nh, 0, 'non-token data: nham')
929 or die "Error writing: $!";
930 printf($template, 0.0, 0, $nt, 0, 'non-token data: ntokens')
931 or die "Error writing: $!";
932 printf($template, 0.0, 0, $oa, 0, 'non-token data: oldest atime')
933 or die "Error writing: $!";
934 if ( $bv >= 2 ) {
935 printf($template, 0.0, 0, $na, 0, 'non-token data: newest atime')
936 or die "Error writing: $!";
937 }
938 if ( $bv < 2 ) {
939 printf($template, 0.0, 0, $sb, 0, 'non-token data: current scan-count')
940 or die "Error writing: $!";
941 }
942 if ( $bv >= 2 ) {
943 printf($template, 0.0, 0, $js, 0, 'non-token data: last journal sync atime')
944 or die "Error writing: $!";
945 }
946 printf($template, 0.0, 0, $le, 0, 'non-token data: last expiry atime')
947 or die "Error writing: $!";
948 if ( $bv >= 2 ) {
949 printf($template, 0.0, 0, $ad, 0, 'non-token data: last expire atime delta')
950 or die "Error writing: $!";
951
952 printf($template, 0.0, 0, $er, 0, 'non-token data: last expire reduction count')
953 or die "Error writing: $!";
954 }
955 }
956
957 if ( $toks ) {
958 # let the store sort out the db_toks
959 $self->{store}->dump_db_toks($template, $regex, @vars);
960 }
961
962 if (!$self->{main}->{learn_caller_will_untie}) {
963 $self->{store}->untie_db();
964 }
965 return 1;
966}
967
968###########################################################################
969# TODO: these are NOT public, but the test suite needs to call them.
970
971
# spent 371ms (135+236) within Mail::SpamAssassin::Plugin::Bayes::get_msgid which was called 702 times, avg 528µs/call: # 468 times (88.5ms+147ms) by Mail::SpamAssassin::Plugin::TxRep::check_senders_reputation at line 1251 of Mail/SpamAssassin/Plugin/TxRep.pm, avg 503µs/call # 234 times (46.3ms+89.0ms) by Mail::SpamAssassin::Plugin::Bayes::_learn_trapped at line 417, avg 578µs/call
sub get_msgid {
9727021.77ms my ($self, $msg) = @_;
973
9747021.47ms my @msgid;
975
9767026.99ms70289.0ms my $msgid = $msg->get_header("Message-Id");
# spent 89.0ms making 702 calls to Mail::SpamAssassin::Message::Node::get_header, avg 127µs/call
97770213.2ms6725.11ms if (defined $msgid && $msgid ne '' && $msgid !~ /^\s*<\s*(?:\@sa_generated)?>.*$/) {
# spent 5.11ms making 672 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 8µs/call
978 # remove \r and < and > prefix/suffixes
9796723.16ms chomp $msgid;
980134428.3ms134410.6ms $msgid =~ s/^<//; $msgid =~ s/>.*$//g;
# spent 10.6ms making 1344 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 8µs/call
9816722.50ms push(@msgid, $msgid);
982 }
983
984 # Modified 2012-01-17 per bug 5185 to remove last received from msg_id calculation
985
986 # Use sha1_hex(Date: and top N bytes of body)
987 # where N is MIN(1024 bytes, 1/2 of body length)
988 #
9897026.48ms70274.9ms my $date = $msg->get_header("Date");
# spent 74.9ms making 702 calls to Mail::SpamAssassin::Message::Node::get_header, avg 107µs/call
9907022.27ms $date = "None" if (!defined $date || $date eq ''); # No Date?
991
992 #Removed per bug 5185
993 #my @rcvd = $msg->get_header("Received");
994 #my $rcvd = $rcvd[$#rcvd];
995 #$rcvd = "None" if (!defined $rcvd || $rcvd eq ''); # No Received?
996
997 # Make a copy since pristine_body is a reference ...
99870229.9ms7027.85ms my $body = join('', $msg->get_pristine_body());
# spent 7.85ms making 702 calls to Mail::SpamAssassin::Message::get_pristine_body, avg 11µs/call
999
10007023.72ms if (length($body) > 64) { # Small Body?
10017023.25ms my $keep = ( length $body > 2048 ? 1024 : int(length($body) / 2) );
10027024.13ms substr($body, $keep) = '';
1003 }
1004
1005 #Stripping all CR and LF so that testing midstream from MTA and post delivery don't
1006 #generate different id's simply because of LF<->CR<->CRLF changes.
100770236.8ms70231.3ms $body =~ s/[\r\n]//g;
# spent 31.3ms making 702 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 45µs/call
1008
100970229.5ms70217.1ms unshift(@msgid, sha1_hex($date."\000".$body).'@sa_generated');
# spent 17.1ms making 702 calls to Digest::SHA::sha1_hex, avg 24µs/call
1010
10117028.71ms return wantarray ? @msgid : $msgid[0];
1012}
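# Editor's sketch, not part of Bayes.pm: the generated fallback ID from
# get_msgid, isolated from the Message object. generated_msgid is a
# hypothetical name, and the Date value and body below are made-up sample
# input. Note that truncation happens before CR/LF stripping, as in the
# original.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

sub generated_msgid {
  my ($date, $body) = @_;
  $date = "None" if !defined $date || $date eq '';      # no Date header
  if (length($body) > 64) {                              # leave small bodies alone
    my $keep = length($body) > 2048 ? 1024 : int(length($body) / 2);
    substr($body, $keep) = '';                           # keep only the head
  }
  $body =~ s/[\r\n]//g;                                  # ignore line-ending changes
  return sha1_hex($date . "\000" . $body) . '@sa_generated';
}

my $id = generated_msgid("Sun, 5 Nov 2017 03:09:29 +0000", "Hello\r\nworld\r\n");
print "$id\n";
```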
1013
1014
# spent 6.42s (27.2ms+6.39) within Mail::SpamAssassin::Plugin::Bayes::get_body_from_msg which was called 234 times, avg 27.4ms/call: # 234 times (27.2ms+6.39s) by Mail::SpamAssassin::Plugin::Bayes::learn_message at line 377, avg 27.4ms/call
sub get_body_from_msg {
1015234570µs my ($self, $msg) = @_;
1016
10172341.19ms if (!ref $msg) {
1018 # I have no idea why this seems to happen. TODO
1019 warn "bayes: msg not a ref: '$msg'";
1020 return { };
1021 }
1022
1023 my $permsgstatus =
10242343.77ms23473.4ms Mail::SpamAssassin::PerMsgStatus->new($self->{main}, $msg);
# spent 73.4ms making 234 calls to Mail::SpamAssassin::PerMsgStatus::new, avg 314µs/call
10252342.75ms2342.62ms $msg->extract_message_metadata ($permsgstatus);
# spent 2.62ms making 234 calls to Mail::SpamAssassin::Message::extract_message_metadata, avg 11µs/call
10262342.38ms2346.27s my $msgdata = $self->_get_msgdata_from_permsgstatus ($permsgstatus);
# spent 6.27s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::_get_msgdata_from_permsgstatus, avg 26.8ms/call
10272342.29ms23435.5ms $permsgstatus->finish();
# spent 35.5ms making 234 calls to Mail::SpamAssassin::PerMsgStatus::finish, avg 152µs/call
1028
1029234537µs if (!defined $msgdata) {
1030 # why?!
1031 warn "bayes: failed to get body for ".scalar($self->get_msgid($msg))."\n";
1032 return { };
1033 }
1034
10352344.29ms2348.58ms return $msgdata;
# spent 8.58ms making 234 calls to Mail::SpamAssassin::PerMsgStatus::DESTROY, avg 37µs/call
1036}
1037
1038
# spent 6.27s (16.8ms+6.26) within Mail::SpamAssassin::Plugin::Bayes::_get_msgdata_from_permsgstatus which was called 234 times, avg 26.8ms/call: # 234 times (16.8ms+6.26s) by Mail::SpamAssassin::Plugin::Bayes::get_body_from_msg at line 1026, avg 26.8ms/call
sub _get_msgdata_from_permsgstatus {
1039234498µs my ($self, $pms) = @_;
1040
10412341.14ms my $t_src = $self->{conf}->{bayes_token_sources};
1042234679µs my $msgdata = { };
1043 $msgdata->{bayes_token_body} =
10442343.48ms234252ms $pms->{msg}->get_visible_rendered_body_text_array() if $t_src->{visible};
# spent 252ms making 234 calls to Mail::SpamAssassin::Message::get_visible_rendered_body_text_array, avg 1.08ms/call
1045 $msgdata->{bayes_token_inviz} =
10462342.69ms234133ms $pms->{msg}->get_invisible_rendered_body_text_array() if $t_src->{invisible};
# spent 133ms making 234 calls to Mail::SpamAssassin::Message::get_invisible_rendered_body_text_array, avg 567µs/call
1047 $msgdata->{bayes_mimepart_digests} =
1048234515µs $pms->{msg}->get_mimepart_digests() if $t_src->{mimepart};
1049234850µs @{$msgdata->{bayes_token_uris}} =
10502344.25ms2345.87s $pms->get_uri_list() if $t_src->{uri};
# spent 5.87s making 234 calls to Mail::SpamAssassin::PerMsgStatus::get_uri_list, avg 25.1ms/call
10512342.19ms return $msgdata;
1052}
1053
1054###########################################################################
1055
1056# The calling functions expect a uniq'ed array of tokens ...
1057
# spent 32.4s (2.94+29.4) within Mail::SpamAssassin::Plugin::Bayes::tokenize which was called 234 times, avg 138ms/call: # 234 times (2.94s+29.4s) by Mail::SpamAssassin::Plugin::Bayes::_learn_trapped at line 473, avg 138ms/call
sub tokenize {
1058234622µs my ($self, $msg, $msgdata) = @_;
1059
10602341.18ms my $t_src = $self->{conf}->{bayes_token_sources};
1061234517µs my @tokens;
1062
1063 # visible tokens from the body
10642342.17ms if ($msgdata->{bayes_token_body}) {
1065 my(@t) = map($self->_tokenize_line ($_, '', 1),
1066468122ms445612.5s @{$msgdata->{bayes_token_body}} );
# spent 12.5s making 4456 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 2.81ms/call
10672342.37ms2342.43ms dbg("bayes: tokenized body: %d tokens", scalar @t);
# spent 2.43ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 10µs/call
106823456.3ms push(@tokens, @t);
1069 }
1070 # the URI list
10712341.55ms if ($msgdata->{bayes_token_uris}) {
1072 my(@t) = map($self->_tokenize_line ($_, '', 2),
107346834.3ms27083.48s @{$msgdata->{bayes_token_uris}} );
# spent 3.48s making 2708 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 1.29ms/call
10742342.00ms2341.67ms dbg("bayes: tokenized uri: %d tokens", scalar @t);
# spent 1.67ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 7µs/call
10752347.35ms push(@tokens, @t);
1076 }
1077 # add invisible tokens
10782341.18ms if ($msgdata->{bayes_token_inviz}) {
1079234457µs my $tokprefix;
10804681.45ms if (ADD_INVIZ_TOKENS_I_PREFIX) { $tokprefix = 'I*:' }
1081 if (ADD_INVIZ_TOKENS_NO_PREFIX) { $tokprefix = '' }
10822341.05ms if (defined $tokprefix) {
1083 my(@t) = map($self->_tokenize_line ($_, $tokprefix, 1),
10844683.62ms53709ms @{$msgdata->{bayes_token_inviz}} );
# spent 709ms making 53 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 13.4ms/call
10852341.61ms2341.43ms dbg("bayes: tokenized invisible: %d tokens", scalar @t);
# spent 1.43ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 6µs/call
10862341.46ms push(@tokens, @t);
1087 }
1088 }
1089
1090 # add digests and Content-Type of all MIME parts
1091234646µs if ($msgdata->{bayes_mimepart_digests}) {
1092 my %shorthand = ( # some frequent MIME part contents for human readability
1093 'da39a3ee5e6b4b0d3255bfef95601890afd80709:text/plain'=> 'Empty-Plaintext',
1094 'da39a3ee5e6b4b0d3255bfef95601890afd80709:text/html' => 'Empty-HTML',
1095 'da39a3ee5e6b4b0d3255bfef95601890afd80709:text/xml' => 'Empty-XML',
1096 'adc83b19e793491b1c6ea0fd8b46cd9f32e592fc:text/plain'=> 'OneNL-Plaintext',
1097 'adc83b19e793491b1c6ea0fd8b46cd9f32e592fc:text/html' => 'OneNL-HTML',
1098 '71853c6197a6a7f222db0f1978c7cb232b87c5ee:text/plain'=> 'TwoNL-Plaintext',
1099 '71853c6197a6a7f222db0f1978c7cb232b87c5ee:text/html' => 'TwoNL-HTML',
1100 );
1101 my(@t) = map('MIME:' . ($shorthand{$_} || $_),
1102 @{ $msgdata->{bayes_mimepart_digests} });
1103 dbg("bayes: tokenized mime parts: %d tokens", scalar @t);
1104 dbg("bayes: mime-part token %s", $_) for @t;
1105 push(@tokens, @t);
1106 }
1107
1108 # Tokenize the headers
11092341.97ms if ($t_src->{header}) {
1110234474µs my(@t);
11112347.50ms2343.16s my %hdrs = $self->_tokenize_headers ($msg);
# spent 3.16s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers, avg 13.5ms/call
111223449.9ms while( my($prefix, $value) = each %hdrs ) {
1113560590.6ms56058.09s push(@t, $self->_tokenize_line ($value, "H$prefix:", 0));
# spent 8.09s making 5605 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 1.44ms/call
1114 }
11152342.07ms2342.13ms dbg("bayes: tokenized header: %d tokens", scalar @t);
# spent 2.13ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 9µs/call
111623440.9ms push(@tokens, @t);
1117 }
1118
1119 # Go ahead and uniq the array, skip null tokens (can happen sometimes)
1120 # generate an SHA1 hash and take the lower 40 bits as our token
1121234514µs my %tokens;
11222341.28ms foreach my $token (@tokens) {
1123 # skip empty tokens
11241598133.83s1557991.44s $tokens{substr(sha1($token), -5)} = $token if $token ne '';
# spent 1.44s making 155799 calls to Digest::SHA::sha1, avg 9µs/call
1125 }
1126
1127 # return the keys == tokens ...
112823452.7ms return \%tokens;
1129}
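
The final loop of tokenize (lines 1121-1125) can be tried in isolation. The following is a minimal sketch, assuming only the core Digest::SHA module; `hash_tokens` is a hypothetical name and not part of Bayes.pm. It shows how keying on the last 5 bytes (40 bits) of the SHA-1 digest both hashes and de-duplicates tokens.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1);

# Hypothetical helper (not in Bayes.pm): key each non-empty token by the
# last 5 bytes (40 bits) of its binary SHA-1 digest.
sub hash_tokens {
    my (@tokens) = @_;
    my %by_hash;
    for my $token (@tokens) {
        next if $token eq '';                         # skip empty tokens
        $by_hash{ substr(sha1($token), -5) } = $token;
    }
    return \%by_hash;
}

my $h = hash_tokens('viagra', '', 'viagra', 'HFrom:example');
printf "%d unique tokens\n", scalar keys %$h;
printf "%s => %s\n", unpack('H*', $_), $h->{$_} for sort keys %$h;
```

Because the hash key is the truncated digest, repeated tokens collapse into one entry, which is why the profile shows far more sha1 calls than distinct tokens are likely to survive.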
1130
1131
# spent 24.8s (17.3+7.52) within Mail::SpamAssassin::Plugin::Bayes::_tokenize_line which was called 12822 times, avg 1.94ms/call: # 5605 times (5.51s+2.58s) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1113, avg 1.44ms/call # 4456 times (8.71s+3.83s) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1066, avg 2.81ms/call # 2708 times (2.58s+906ms) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1073, avg 1.29ms/call # 53 times (506ms+203ms) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1084, avg 13.4ms/call
sub _tokenize_line {
11321282223.3ms my $self = $_[0];
11331282224.7ms my $tokprefix = $_[2];
11341282222.0ms my $region = $_[3];
113512822107ms local ($_) = $_[1];
1136
11371282220.6ms my @rettokens;
1138
1139 # include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
1140 # and ISO-8859-15 alphas. Do not split on @'s; better results keeping it.
1141 # Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
1142
1143 ### (previous:) tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
1144
1145 ### (now): see Bug 7130 for rationale (slower, but makes UTF-8 chars atomic)
1146128222.87s210534894ms s{ ( [A-Za-z0-9,@*!_'"\$. -]+ |
1147 [\xC0-\xDF][\x80-\xBF] |
1148 [\xE0-\xEF][\x80-\xBF]{2} |
1149 [\xF0-\xF4][\x80-\xBF]{3} |
1150 [\xA1-\xFF] ) | . }
1151185209833ms { defined $1 ? $1 : ' ' }xsge;
# spent 816ms making 197712 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 4µs/call # spent 78.6ms making 12822 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 6µs/call
1152 # should we also turn NBSP ( \xC2\xA0 ) into space?
1153
1154 # DO split on "..." or "--" or "---"; common formatting error resulting in
1155 # hapaxes. Keep the separator itself as a token, though, as long ones can
1156 # be good spamsigns.
115712822168ms1290847.4ms s/(\w)(\.{3,6})(\w)/$1 $2 $3/gs;
# spent 46.7ms making 12822 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 4µs/call # spent 640µs making 86 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 7µs/call
115812822140ms1286231.9ms s/(\w)(\-{2,6})(\w)/$1 $2 $3/gs;
# spent 31.7ms making 12822 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 2µs/call # spent 200µs making 40 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 5µs/call
1159
11601282245.5ms if (IGNORE_TITLE_CASE) {
11611282236.8ms if ($region == 1 || $region == 2) {
1162 # lower-case Title Case at start of a full-stop-delimited line (as would
1163 # be seen in a Western language).
116411448423ms14579230ms s/(?:^|\.\s+)([A-Z])([^A-Z]+)(?:\s|$)/ ' '. (lc $1) . $2 . ' ' /ge;
# spent 142ms making 7217 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 20µs/call # spent 87.7ms making 7362 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 12µs/call
1165 }
1166 }
1167
11681282297.2ms12822136ms my $magic_re = $self->{store}->get_magic_re();
# spent 136ms making 12822 calls to Mail::SpamAssassin::BayesStore::DBM::get_magic_re, avg 11µs/call
1169
1170 # Note that split() in scope of 'use bytes' results in words with utf8 flag
1171 # cleared, even if the source string has Perl character semantics!
1172 # Is this really still desirable?
1173
117412822341ms foreach my $token (split) {
11751585602.42s1585601.13s $token =~ s/^[-'"\.,]+//; # trim non-alphanum chars at start or end
# spent 1.13s making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
11761585602.23s158560938ms $token =~ s/[-'"\.,]+$//; # so we don't get loads of '"foo' tokens
# spent 938ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 6µs/call
1177
1178 # Skip false magic tokens
1179 # TVD: we need to do a defined() check since SQL doesn't have magic
1180 # tokens, so the SQL BayesStore returns undef. I really want a way
1181 # of optimizing that out, but I haven't come up with anything yet.
1182 #
11831585603.71s3171201.23s next if ( defined $magic_re && $token =~ /$magic_re/ );
# spent 935ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp, avg 6µs/call # spent 299ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 2µs/call
1184
1185 # *do* keep 3-byte tokens; there's some solid signs in there
1186158560395ms my $len = length($token);
1187
1188 # but extend the stop-list. These are squarely in the gray
1189 # area, and it just slows us down to record them.
1190 # See http://wiki.apache.org/spamassassin/BayesStopList for more info.
1191 #
11921585602.48s1263211.11s next if $len < 3 ||
# spent 1.11s making 126321 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 9µs/call
1193 ($token =~ /^(?:a(?:ble|l(?:ready|l)|n[dy]|re)|b(?:ecause|oth)|c(?:an|ome)|e(?:ach|mail|ven)|f(?:ew|irst|or|rom)|give|h(?:a(?:ve|s)|ttp)|i(?:n(?:formation|to)|t\'s)|just|know|l(?:ike|o(?:ng|ok))|m(?:a(?:de|il(?:(?:ing|to))?|ke|ny)|o(?:re|st)|uch)|n(?:eed|o[tw]|umber)|o(?:ff|n(?:ly|e)|ut|wn)|p(?:eople|lace)|right|s(?:ame|ee|uch)|t(?:h(?:at|is|rough|e)|ime)|using|w(?:eb|h(?:ere|y)|ith(?:out)?|or(?:ld|k))|y(?:ears?|ou(?:(?:\'re|r))?))$/i);
1194
1195 # are we in the body? If so, apply some body-specific breakouts
1196109800290ms if ($region == 1 || $region == 2) {
1197642281.56s128048297ms if (CHEW_BODY_MAILADDRS && $token =~ /\S\@\S/i) {
# spent 297ms making 128048 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 2µs/call
11984084.41ms40832.1ms push (@rettokens, $self->_tokenize_mail_addrs ($token));
# spent 32.1ms making 408 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs, avg 79µs/call
1199 }
1200 elsif (CHEW_BODY_URIS && $token =~ /\S\.[a-z]/i) {
1201524240.5ms push (@rettokens, "UD:".$token); # the full token
120210484117ms524237.8ms my $bit = $token; while ($bit =~ s/^[^\.]+\.(.+)$/$1/gs) {
# spent 37.8ms making 5242 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
12038956172ms895638.1ms push (@rettokens, "UD:".$1); # UD = URL domain
# spent 38.1ms making 8956 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 4µs/call
1204 }
1205 }
1206 }
1207
1208 # note: do not trim down overlong tokens if they contain '*'. This is
1209 # used as part of split tokens such as "HTo:D*net" indicating that
1210 # the domain ".net" appeared in the To header.
1211 #
1212109800431ms18366111ms if ($len > MAX_TOKEN_LENGTH && $token !~ /\*/) {
# spent 111ms making 18366 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 6µs/call
1213
121417309240ms1730991.7ms if (TOKENIZE_LONG_8BIT_SEQS_AS_UTF8_CHARS && $token =~ /[\x80-\xBF]{2}/) {
# spent 91.7ms making 17309 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 5µs/call
1215 # Bug 7135
1216 # collect 3- and 4-byte UTF-8 sequences, ignore 2-byte sequences
12179347µs9181µs my(@t) = $token =~ /( (?: [\xE0-\xEF] | [\xF0-\xF4][\x80-\xBF] )
# spent 181µs making 9 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 20µs/call
1218 [\x80-\xBF]{2} )/xsg;
1219921µs if (@t) {
12209200µs push (@rettokens, map('u8:'.$_, @t));
1221960µs next;
1222 }
1223 }
1224
1225 if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xa0-\xff]{2}/) {
1226 # Matt sez: "Could be asian? Autrijus suggested doing character ngrams,
1227 # but I'm doing tuples to keep the dbs small(er)." Sounds like a plan
1228 # to me! (jm)
1229 while ($token =~ s/^(..?)//) {
1230 push (@rettokens, "8:$1");
1231 }
1232 next;
1233 }
1234
12351730085.2ms if (($region == 0 && HDRS_TOKENIZE_LONG_TOKENS_AS_SKIPS)
1236 || ($region == 1 && BODY_TOKENIZE_LONG_TOKENS_AS_SKIPS)
1237 || ($region == 2 && URIS_TOKENIZE_LONG_TOKENS_AS_SKIPS))
1238 {
1239 # if (TOKENIZE_LONG_TOKENS_AS_SKIPS)
1240 # Spambayes trick via Matt: Just retain 7 chars. Do not retain the
1241 # length, it does not help; see jm's mail to -devel on Nov 20 2002 at
1242 # http://sourceforge.net/p/spamassassin/mailman/message/12977605/
1243 # "sk:" stands for "skip".
1244 # Bug 7141: retain seven UTF-8 chars (or other bytes),
1245 # if followed by at least two bytes
124611544458ms34632255ms $token =~ s{ ^ ( (?> (?: [\x00-\x7F\xF5-\xFF] |
# spent 130ms making 11544 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 11µs/call # spent 125ms making 23088 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 5µs/call
1247 [\xC0-\xDF][\x80-\xBF] |
1248 [\xE0-\xEF][\x80-\xBF]{2} |
1249 [\xF0-\xF4][\x80-\xBF]{3} | . ){7} ))
1250 .{2,} \z }{sk:$1}xs;
1251 ## (was:) $token = "sk:".substr($token, 0, 7); # seven bytes
1252 }
1253 }
1254
1255 # decompose tokens? do this after shortening long tokens
1256109791285ms if ($region == 1 || $region == 2) {
125764219211ms if (DECOMPOSE_BODY_TOKENS) {
125864219815ms64219243ms if ($token =~ /[^\w:\*]/) {
# spent 243ms making 64219 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
12591541841.6ms my $decompd = $token; # "Foo!"
126015418311ms15418180ms $decompd =~ s/[^\w:\*]//gs;
# spent 180ms making 15418 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 12µs/call
12611541897.2ms push (@rettokens, $tokprefix.$decompd); # "Foo"
1262 }
1263
1264642191.05s64219392ms if ($token =~ /[A-Z]/) {
# spent 392ms making 64219 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 6µs/call
12653432099.2ms my $decompd = $token; $decompd = lc $decompd;
126617160139ms push (@rettokens, $tokprefix.$decompd); # "foo!"
1267
126817160267ms1716080.4ms if ($token =~ /[^\w:\*]/) {
# spent 80.4ms making 17160 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 5µs/call
1269195029.4ms195017.2ms $decompd =~ s/[^\w:\*]//gs;
# spent 17.2ms making 1950 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 9µs/call
1270195013.1ms push (@rettokens, $tokprefix.$decompd); # "foo"
1271 }
1272 }
1273 }
1274 }
1275
12761097911.20s push (@rettokens, $tokprefix.$token);
1277 }
1278
127912822311ms return @rettokens;
1280}
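
The body-token decomposition at lines 1256-1274 is easier to follow with a concrete input. The following is a minimal sketch that restates that branch on its own; `decompose` is a hypothetical name, and the sketch assumes DECOMPOSE_BODY_TOKENS is enabled and that the token comes from a body or URI region.

```perl
use strict;
use warnings;

# Hypothetical stand-alone restatement of the decomposition branch.
# Returns the extra variants followed by the original token, in the
# same order the listing pushes them onto @rettokens.
sub decompose {
    my ($token, $prefix) = @_;
    my @out;
    if ($token =~ /[^\w:\*]/) {
        (my $stripped = $token) =~ s/[^\w:\*]//g;
        push @out, $prefix . $stripped;        # "Foo!" -> "Foo"
    }
    if ($token =~ /[A-Z]/) {
        my $lower = lc $token;
        push @out, $prefix . $lower;           # "Foo!" -> "foo!"
        if ($token =~ /[^\w:\*]/) {
            $lower =~ s/[^\w:\*]//g;
            push @out, $prefix . $lower;       # "Foo!" -> "foo"
        }
    }
    push @out, $prefix . $token;               # the token itself
    return @out;
}

print join(' ', decompose('Foo!', '')), "\n";
```

A mixed-case token with punctuation therefore yields up to four tokens, which is consistent with the push at line 1276 running more often than any single decomposition branch.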
1281
1282
# spent 3.16s (1.27+1.89) within Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers which was called 234 times, avg 13.5ms/call: # 234 times (1.27s+1.89s) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1111, avg 13.5ms/call
sub _tokenize_headers {
1283234573µs my ($self, $msg) = @_;
1284
1285234518µs my %parsed;
1286
1287 my %user_ignore;
1288468212ms $user_ignore{lc $_} = 1 for @{$self->{main}->{conf}->{bayes_ignore_headers}};
1289
1290 # get headers in array context
1291234457µs my @hdrs;
1292 my @rcvdlines;
129323411.6ms2341.13s for ($msg->get_all_headers()) {
# spent 1.13s making 234 calls to Mail::SpamAssassin::Message::Node::get_all_headers, avg 4.84ms/call
1294 # first, keep a copy of Received headers, so we can strip down to last 2
1295741098.2ms741025.1ms if (/^Received:/i) {
# spent 25.1ms making 7410 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 3µs/call
129611316.92ms push(@rcvdlines, $_);
129711312.28ms next;
1298 }
1299 # and now skip lines for headers we don't want (including all Received)
13006279228ms12558101ms next if /^${IGNORED_HDRS}:/i;
# spent 74.1ms making 6279 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 12µs/call # spent 26.9ms making 6279 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp, avg 4µs/call
1301 next if IGNORE_MSGID_TOKENS && /^Message-ID:/i;
1302412438.2ms push(@hdrs, $_);
1303 }
13042343.64ms23426.6ms push(@hdrs, $msg->get_all_metadata());
# spent 26.6ms making 234 calls to Mail::SpamAssassin::Message::get_all_metadata, avg 114µs/call
1305
1306 # and re-add the last 2 received lines: usually a good source of
1307 # spamware tokens and HELO names.
13084682.24ms if ($#rcvdlines >= 0) { push(@hdrs, $rcvdlines[$#rcvdlines]); }
13094681.98ms if ($#rcvdlines >= 1) { push(@hdrs, $rcvdlines[$#rcvdlines-1]); }
1310
13112342.30ms for (@hdrs) {
1312552871.5ms552824.7ms next unless /\S/;
# spent 24.7ms making 5528 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
1313552857.5ms my ($hdr, $val) = split(/:/, $_, 2);
1314
1315 # remove user-specified headers here, after Received, in case they
1316 # want to ignore that too
1317552814.9ms next if exists $user_ignore{lc $hdr};
1318
1319 # Prep the header value
132053749.32ms $val ||= '';
1321537412.6ms chomp($val);
1322
1323 # special tokenization for some headers:
13245374261ms1755194.1ms if ($hdr =~ /^(?:|X-|Resent-)Message-Id$/i) {
# spent 80.3ms making 14037 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 6µs/call # spent 13.9ms making 3514 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp, avg 4µs/call
13252252.22ms22515.7ms $val = $self->_pre_chew_message_id ($val);
# spent 15.7ms making 225 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id, avg 70µs/call
1326 }
1327 elsif (PRE_CHEW_ADDR_HEADERS && $hdr =~ /^(?:|X-|Resent-)
1328 (?:Return-Path|From|To|Cc|Reply-To|Errors-To|Mail-Followup-To|Sender)$/ix)
1329 {
13307586.40ms758214ms $val = $self->_pre_chew_addr_header ($val);
# spent 214ms making 758 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header, avg 282µs/call
1331 }
1332 elsif ($hdr eq 'Received') {
13334684.11ms468109ms $val = $self->_pre_chew_received ($val);
# spent 109ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received, avg 232µs/call
1334 }
1335 elsif ($hdr eq 'Content-Type') {
13362222.06ms22228.3ms $val = $self->_pre_chew_content_type ($val);
# spent 28.3ms making 222 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type, avg 128µs/call
1337 }
1338 elsif ($hdr eq 'MIME-Version') {
13391872.40ms1871.15ms $val =~ s/1\.0//; # totally innocuous
# spent 1.15ms making 187 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 6µs/call
1340 }
1341 elsif ($hdr =~ /^${MARK_PRESENCE_ONLY_HDRS}$/i) {
1342224597µs $val = "1"; # just mark the presence, they create lots of hapaxen
1343 }
1344
1345537417.0ms if (MAP_HEADERS_MID) {
1346537465.8ms537421.2ms if ($hdr =~ /^(?:In-Reply-To|References|Message-ID)$/i) {
# spent 21.2ms making 5374 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
1347237910µs $parsed{"*MI"} = $val;
1348 }
1349 }
1350537424.3ms if (MAP_HEADERS_FROMTOCC) {
1351537472.2ms537423.5ms if ($hdr =~ /^(?:From|To|Cc)$/i) {
# spent 23.5ms making 5374 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
13524351.52ms $parsed{"*Ad"} = $val;
1353 }
1354 }
1355537417.2ms if (MAP_HEADERS_USERAGENT) {
1356537479.5ms537419.7ms if ($hdr =~ /^(?:X-Mailer|User-Agent)$/i) {
# spent 19.7ms making 5374 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
135764264µs $parsed{"*UA"} = $val;
1358 }
1359 }
1360
1361 # replace hdr name with "compressed" version if possible
1362537430.0ms if (defined $HEADER_NAME_COMPRESSION{$hdr}) {
136320098.06ms $hdr = $HEADER_NAME_COMPRESSION{$hdr};
1364 }
1365
1366537426.5ms if (exists $parsed{$hdr}) {
13672882.30ms $parsed{$hdr} .= " ".$val;
1368 } else {
1369508641.5ms $parsed{$hdr} = $val;
1370 }
1371537451.2ms537458.2ms if (would_log('dbg', 'bayes') > 1) {
# spent 58.2ms making 5374 calls to Mail::SpamAssassin::Logger::would_log, avg 11µs/call
1372 dbg("bayes: header tokens for $hdr = \"$parsed{$hdr}\"");
1373 }
1374 }
1375
137623432.1ms return %parsed;
1377}
1378
1379
# spent 28.3ms (14.1+14.2) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type which was called 222 times, avg 128µs/call: # 222 times (14.1ms+14.2ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1336, avg 128µs/call
sub _pre_chew_content_type {
1380222927µs my ($self, $val) = @_;
1381
1382 # hopefully this will retain good bits without too many hapaxen
13832224.58ms2222.51ms if ($val =~ s/boundary=[\"\'](.*?)[\"\']/ /ig) {
# spent 2.51ms making 222 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 11µs/call
1384173646µs my $boundary = $1;
1385173382µs $boundary = '' if !defined $boundary; # avoid a warning
13861737.25ms1735.81ms $boundary =~ s/[a-fA-F0-9]/H/gs;
# spent 5.81ms making 173 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 34µs/call
1387 # break up blocks of separator chars so they become their own tokens
13881739.08ms7874.23ms $boundary =~ s/([-_\.=]+)/ $1 /gs;
# spent 3.28ms making 614 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 5µs/call # spent 949µs making 173 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 5µs/call
1389173680µs $val .= $boundary;
1390 }
1391
1392 # stop-list words for Content-Type header: these wind up totally gray
13932223.18ms2221.70ms $val =~ s/\b(?:text|charset)\b//;
# spent 1.70ms making 222 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 8µs/call
1394
13952221.94ms $val;
1396}
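
The boundary handling at lines 1386-1388 can be checked on a sample value. This is a minimal sketch; `chew_boundary` is a hypothetical name, and the sample boundary string is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the two substitutions applied to a MIME
# boundary: mask hex-looking characters, then isolate separator runs.
sub chew_boundary {
    my ($boundary) = @_;
    $boundary =~ s/[a-fA-F0-9]/H/g;       # every hex digit becomes "H"
    $boundary =~ s/([-_.=]+)/ $1 /g;      # separators become own tokens
    return $boundary;
}

print chew_boundary('----=_Part_12ab'), "\n";
```

Masking the hex digits keeps the shape of the boundary, which can identify the generating software, while discarding the per-message random part that would otherwise create hapaxes.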
1397
1398
# spent 15.7ms (8.57+7.12) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id which was called 225 times, avg 70µs/call: # 225 times (8.57ms+7.12ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1325, avg 70µs/call
sub _pre_chew_message_id {
1399225896µs my ($self, $val) = @_;
1400 # we can (a) get rid of a lot of hapaxen and (b) increase the token
1401 # specificity by pre-parsing some common formats.
1402
1403 # Outlook Express format:
14042253.25ms2251.65ms $val =~ s/<([0-9a-f]{4})[0-9a-f]{4}[0-9a-f]{4}\$
# spent 1.65ms making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
1405 ([0-9a-f]{4})[0-9a-f]{4}\$
1406 ([0-9a-f]{8})\@(\S+)>/ OEA$1 OEB$2 OEC$3 $4 /gx;
1407
1408 # Exim:
14092252.10ms225676µs $val =~ s/<[A-Za-z0-9]{7}-[A-Za-z0-9]{6}-0[A-Za-z0-9]\@//;
# spent 676µs making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 3µs/call
1410
1411 # Sendmail:
14122252.20ms225812µs $val =~ s/<20\d\d[01]\d[0123]\d[012]\d[012345]\d[012345]\d\.
# spent 812µs making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 4µs/call
1413 [A-F0-9]{10,12}\@//gx;
1414
1415 # try to split Message-ID segments on probable ID boundaries. Note that
1416 # Outlook message-ids seem to contain a server identifier ID in the last
1417 # 8 bytes before the @. Make sure this becomes its own token, it's a
1418 # great spam-sign for a learning system! Be sure to split on ".".
14192255.65ms2253.98ms $val =~ s/[^_A-Za-z0-9]/ /g;
# spent 3.98ms making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 18µs/call
142022511.3ms $val;
1421}
1422
1423
# spent 109ms (62.3+46.3) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received which was called 468 times, avg 232µs/call: # 468 times (62.3ms+46.3ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1333, avg 232µs/call
sub _pre_chew_received {
14244683.02ms my ($self, $val) = @_;
1425
1426 # Thanks to Dan for these. Trim out "useless" tokens; sendmail-ish IDs
1427 # and valid-format RFC-822/2822 dates
1428
14294686.44ms4683.38ms $val =~ s/\swith\sSMTP\sid\sg[\dA-Z]{10,12}\s/ /gs; # Sendmail
# spent 3.38ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
14304685.93ms4683.04ms $val =~ s/\swith\sESMTP\sid\s[\dA-F]{10,12}\s/ /gs; # Sendmail
# spent 3.04ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 6µs/call
14314686.57ms4683.50ms $val =~ s/\bid\s[a-zA-Z0-9]{7,20}\b/ /gs; # Sendmail
# spent 3.50ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
14324684.65ms4681.91ms $val =~ s/\bid\s[A-Za-z0-9]{7}-[A-Za-z0-9]{6}-0[A-Za-z0-9]/ /gs; # exim
# spent 1.91ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 4µs/call
1433
143446812.5ms4689.33ms $val =~ s/(?:(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s)?
# spent 9.33ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 20µs/call
1435 [0-3\s]?[0-9]\s
1436 (?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)\s
1437 (?:19|20)?[0-9]{2}\s
1438 [0-2][0-9](?:\:[0-5][0-9]){1,2}\s
1439 (?:\s*\(|\)|\s*(?:[+-][0-9]{4})|\s*(?:UT|[A-Z]{2,3}T))*
1440 //gx;
1441
1442 # IPs: break down to nearest /24, to reduce hapaxes -- EXCEPT for
1443 # IPs in the 10 and 192.168 ranges, they get lots of significant tokens
1444 # (on both sides)
1445 # also make a dup with the full IP, as fodder for
1446 # bayes_dump_to_trusted_networks: "H*r:ip*aaa.bbb.ccc.ddd"
144746830.7ms141812.6ms $val =~ s{\b(\d{1,3}\.)(\d{1,3}\.)(\d{1,3})(\.\d{1,3})\b}{
# spent 7.22ms making 950 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 8µs/call # spent 5.41ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 12µs/call
14485844.35ms if ($2 eq '10' || ($2 eq '192' && $3 eq '168')) {
1449 $1.$2.$3.$4.
1450 " ip*".$1.$2.$3.$4." ";
1451 } else {
14525847.16ms $1.$2.$3.
1453 " ip*".$1.$2.$3.$4." ";
1454 }
1455 }gex;
1456
1457 # trim these: they turn out as the most common tokens, but with a
1458 # prob of about .5. waste of space!
145946823.8ms46812.5ms $val =~ s/\b(?:with|from|for|SMTP|ESMTP)\b/ /g;
# spent 12.5ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 27µs/call
1460
14614684.12ms $val;
1462}
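
In the substitution at lines 1447-1455, the first two capture groups include the trailing dot, so a comparison such as `$2 eq '10'` cannot succeed as written, and the profile shows no executions of line 1449. The following is a minimal sketch of what the comment at lines 1442-1446 describes, under the assumption that the intent is to test the first octet against 10 and the first two against 192.168. `reduce_ips` is a hypothetical name and this is not the upstream code.

```perl
use strict;
use warnings;

# Hypothetical variant: the captures hold bare octets, so the
# private-range test compares plain strings such as '10' and '192'.
sub reduce_ips {
    my ($val) = @_;
    $val =~ s{\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b}{
        my $full = "$1.$2.$3.$4";
        my $keep = ($1 eq '10' || ($1 eq '192' && $2 eq '168'))
            ? $full
            : "$1.$2.$3";
        "$keep ip*$full ";
    }gex;
    return $val;
}

print reduce_ips('from [203.0.113.77] by 10.1.2.3'), "\n";
```

Public addresses are truncated to their /24 prefix while private ones are kept whole, and in both cases the full address is retained in a separate `ip*` token.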
1463
1464
# spent 214ms (42.1+171) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header which was called 758 times, avg 282µs/call: # 758 times (42.1ms+171ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1330, avg 282µs/call
sub _pre_chew_addr_header {
14657585.58ms my ($self, $val) = @_;
14667581.51ms local ($_);
1467
14687587.87ms75899.0ms my @addrs = $self->{main}->find_all_addrs_in_line ($val);
# spent 99.0ms making 758 calls to Mail::SpamAssassin::find_all_addrs_in_line, avg 131µs/call
14697581.41ms my @toks;
14707583.00ms foreach (@addrs) {
14717429.12ms74272.4ms push (@toks, $self->_tokenize_mail_addrs ($_));
# spent 72.4ms making 742 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs, avg 98µs/call
1472 }
147375812.4ms return join (' ', @toks);
1474}
1475
1476
# spent 105ms (77.0+27.5) within Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs which was called 1150 times, avg 91µs/call: # 742 times (52.2ms+20.2ms) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header at line 1471, avg 98µs/call # 408 times (24.8ms+7.28ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1198, avg 79µs/call
sub _tokenize_mail_addrs {
147711506.96ms my ($self, $addr) = @_;
1478
1479115017.5ms11509.35ms ($addr =~ /(.+)\@(.+)$/) or return ();
# spent 9.35ms making 1150 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 8µs/call
148011502.06ms my @toks;
148111509.72ms push(@toks, "U*".$1, "D*".$2);
1482355553.4ms240518.1ms $_ = $2; while (s/^[^\.]+\.(.+)$/$1/gs) { push(@toks, "D*".$1); }
# spent 18.1ms making 2405 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 8µs/call
1483115016.9ms return @toks;
1484}
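
The token set produced by _tokenize_mail_addrs is easiest to see on one address. This is a minimal sketch that restates the sub without `$self` or `$_`; `mail_addr_tokens` is a hypothetical name and the address is an example value.

```perl
use strict;
use warnings;

# Hypothetical stand-alone restatement: one "U*" token for the local
# part, then a "D*" token for the domain and each parent domain.
sub mail_addr_tokens {
    my ($addr) = @_;
    my ($user, $domain) = $addr =~ /(.+)\@(.+)$/ or return ();
    my @toks = ("U*$user", "D*$domain");
    my $rest = $domain;
    # Drop the leftmost label for as long as a dot remains.
    while ($rest =~ s/^[^.]+\.(.+)$/$1/s) {
        push @toks, "D*$rest";
    }
    return @toks;
}

print join(' ', mail_addr_tokens('bob@mail.example.com')), "\n";
```

Each address contributes one substitution per domain label, which matches the profile: 2405 subst calls at line 1482 for 1150 invocations.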
1485
1486
1487###########################################################################
1488
1489# compute the probability that a token is spammish for each token
1490sub _compute_prob_for_all_tokens {
1491 my ($self, $tokensdata, $ns, $nn) = @_;
1492 my @probabilities;
1493
1494 return if !$ns || !$nn;
1495
1496 my $threshold = 1; # ignore low-freq tokens below this s+n threshold
1497 if (!USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS) {
1498 $threshold = 10;
1499 }
1500 if (!$self->{use_hapaxes}) {
1501 $threshold = 2;
1502 }
1503
1504 foreach my $tokendata (@{$tokensdata}) {
1505 my $s = $tokendata->[1]; # spam count
1506 my $n = $tokendata->[2]; # ham count
1507 my $prob;
1508
# spent 110µs (39+71) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@1509 which was called: # once (39µs+71µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 1509
150922.07ms2182µs no warnings 'uninitialized'; # treat undef as zero in addition
# spent 110µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@1509 # spent 71µs making 1 call to warnings::unimport
1510 if ($s + $n >= $threshold) {
1511 # ignoring low-freq tokens, also covers the (!$s && !$n) case
1512
1513 # my $ratios = $s / $ns;
1514 # my $ration = $n / $nn;
1515 # $prob = $ratios / ($ration + $ratios);
1516 #
1517 $prob = ($s * $nn) / ($n * $ns + $s * $nn); # same thing, faster
1518
1519 if (USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS) {
1520 # use Robinson's f(x) equation for low-n tokens, instead of just
1521 # ignoring them
1522 my $robn = $s + $n;
1523 $prob =
1524 ($Mail::SpamAssassin::Bayes::Combine::FW_S_DOT_X + ($robn * $prob))
1525 /
1526 ($Mail::SpamAssassin::Bayes::Combine::FW_S_CONSTANT + $robn);
1527 }
1528 }
1529
1530 # 'log_raw_counts' is used to log the raw data for the Bayes equations
1531 # during a mass-check, allowing the S and X constants to be optimized
1532 # quickly without requiring re-tokenization of the messages for each
1533 # attempt. There's really no need for this code to be uncommented in
1534 # normal use, however. It has never been publicly documented, so
1535 # commenting it out is fine. ;)
1536 #
1537 ## if ($self->{log_raw_counts}) {
1538 ## $self->{raw_counts} .= " s=$s,n=$n ";
1539 ## }
1540
1541 push(@probabilities, $prob);
1542 }
1543 return \@probabilities;
1544}
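
The per-token probability at lines 1517-1526 can be worked through with numbers. This is a minimal sketch that assumes USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS is enabled and a threshold of 1. The two constants below are assumed values chosen for illustration only; the real ones live in Mail::SpamAssassin::Bayes::Combine and do not appear in this listing. `token_prob` is a hypothetical name.

```perl
use strict;
use warnings;

# ASSUMED values, for illustration only (not taken from SpamAssassin).
my $S_CONSTANT = 0.1;                    # stands in for FW_S_CONSTANT
my $S_DOT_X    = $S_CONSTANT * 0.5;      # stands in for FW_S_DOT_X

# $s, $n: spam/ham counts for the token; $ns, $nn: total spam/ham messages.
sub token_prob {
    my ($s, $n, $ns, $nn) = @_;
    return if !$ns || !$nn;              # no training data yet
    return if $s + $n < 1;               # below the frequency threshold
    my $prob = ($s * $nn) / ($n * $ns + $s * $nn);
    my $robn = $s + $n;
    # Robinson's f(x): pull low-count tokens towards the prior
    return ($S_DOT_X + $robn * $prob) / ($S_CONSTANT + $robn);
}

printf "%.3f\n", token_prob(3, 1, 100, 100);
```

With equal corpus sizes, a token seen in 3 spam and 1 ham message has a raw probability of 0.75; the correction moves it slightly towards the assumed prior of 0.5.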
1545
1546# compute the probability that a token is spammish
1547sub _compute_prob_for_token {
1548 my ($self, $token, $ns, $nn, $s, $n) = @_;
1549
1550 # we allow the caller to give us the token information, just
1551 # to save a potentially expensive lookup
1552 if (!defined($s) || !defined($n)) {
1553 ($s, $n, undef) = $self->{store}->tok_get($token);
1554 }
1555 return if !$s && !$n;
1556
1557 my $probabilities_ref =
1558 $self->_compute_prob_for_all_tokens([ [$token, $s, $n, 0] ], $ns, $nn);
1559
1560 return $probabilities_ref->[0];
1561}
1562
1563###########################################################################
1564# If a token is neither hammy nor spammy, return 0.
1565# For a spammy token, return the minimum number of additional ham messages
1566# it would have had to appear in to no longer be spammy. Hammy tokens
1567# are handled similarly. That's what the function does (at the time
1568# of this writing, 31 July 2003, 16:02:55 CDT). It would be slightly
1569# more useful if it returned the number of /additional/ ham messages
1570# a spammy token would have to appear in to no longer be spammy but I
1571# fear that might require the solution to a cubic equation, and I
1572# just don't have the time for that now.
1573
1574sub _compute_declassification_distance {
1575 my ($self, $Ns, $Nn, $ns, $nn, $prob) = @_;
1576
1577 return 0 if $ns == 0 && $nn == 0;
1578
1579 if (!USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS) {return 0 if ($ns + $nn < 10);}
1580 if (!$self->{use_hapaxes}) {return 0 if ($ns + $nn < 2);}
1581
1582 return 0 if $Ns == 0 || $Nn == 0;
1583 return 0 if abs( $prob - 0.5 ) <
1584 $Mail::SpamAssassin::Bayes::Combine::MIN_PROB_STRENGTH;
1585
1586 my ($Na,$na,$Nb,$nb) = $prob > 0.5 ? ($Nn,$nn,$Ns,$ns) : ($Ns,$ns,$Nn,$nn);
1587 my $p = 0.5 - $Mail::SpamAssassin::Bayes::Combine::MIN_PROB_STRENGTH;
1588
1589 return int( 1.0 - 1e-6 + $nb * $Na * $p / ($Nb * ( 1 - $p )) ) - $na
1590 unless USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS;
1591
1592 my $s = $Mail::SpamAssassin::Bayes::Combine::FW_S_CONSTANT;
1593 my $sx = $Mail::SpamAssassin::Bayes::Combine::FW_S_DOT_X;
1594 my $a = $Nb * ( 1 - $p );
1595 my $b = $Nb * ( $sx + $nb * ( 1 - $p ) - $p * $s ) - $p * $Na * $nb;
1596 my $c = $Na * $nb * ( $sx - $p * ( $s + $nb ) );
1597 my $discrim = $b * $b - 4 * $a * $c;
1598 my $disc_max_0 = $discrim < 0 ? 0 : $discrim;
1599 my $dd_exact = ( 1.0 - 1e-6 + ( -$b + sqrt( $disc_max_0 ) ) / ( 2*$a ) ) - $na;
1600
1601 # This shouldn't be necessary. Should not be < 1
1602 return $dd_exact < 1 ? 1 : int($dd_exact);
1603}
1604
1605###########################################################################
1606
1607sub _opportunistic_calls {
1608 my($self, $journal_only) = @_;
1609
1610 # If we're not already tied, abort.
1611 if (!$self->{store}->db_readable()) {
1612 dbg("bayes: opportunistic call attempt failed, DB not readable");
1613 return;
1614 }
1615
1616 # Is an expire or sync running?
1617 my $running_expire = $self->{store}->get_running_expire_tok();
1618 if ( defined $running_expire && $running_expire+$OPPORTUNISTIC_LOCK_VALID > time() ) {
1619 dbg("bayes: opportunistic call attempt skipped, found fresh running expire magic token");
1620 return;
1621 }
1622
1623 # handle expiry and syncing
1624 if (!$journal_only && $self->{store}->expiry_due()) {
1625 dbg("bayes: opportunistic call found expiry due");
1626
1627 # sync will bring the DB R/W as necessary, and the expire will remove
1628 # the running_expire token, may untie as well.
1629 $self->{main}->{bayes_scanner}->sync(1,1);
1630 }
1631 elsif ( $self->{store}->sync_due() ) {
1632 dbg("bayes: opportunistic call found journal sync due");
1633
1634 # sync will bring the DB R/W as necessary, may untie as well
1635 $self->{main}->{bayes_scanner}->sync(1,0);
1636
1637 # We can only remove the running_expire token if we're doing R/W
1638 if ($self->{store}->db_writable()) {
1639 $self->{store}->remove_running_expire_tok();
1640 }
1641 }
1642
1643 return;
1644}
1645
1646###########################################################################
1647
1648
# spent 33.8ms (21.6+12.2) within Mail::SpamAssassin::Plugin::Bayes::learner_new which was called: # once (21.6ms+12.2ms) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm
sub learner_new {
164912µs my ($self) = @_;
1650
165112µs my $store;
1652115µs162µs my $module = untaint_var($self->{conf}->{bayes_store_module});
# spent 62µs making 1 call to Mail::SpamAssassin::Util::untaint_var
165313µs $module = 'Mail::SpamAssassin::BayesStore::DBM' if !$module;
1654
1655110µs111µs dbg("bayes: learner_new self=%s, bayes_store_module=%s", $self,$module);
# spent 11µs making 1 call to Mail::SpamAssassin::Logger::dbg
165614µs undef $self->{store}; # DESTROYs previous object, if any
1657 eval '
1658 require '.$module.';
1659 $store = '.$module.'->new($self);
1660 1;
16611218µs ' or do {
# spent 402µs executing statements in string eval
1662 my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat;
1663 die "bayes: learner_new $module new() failed: $eval_stat\n";
1664 };
1665
1666110µs111µs dbg("bayes: learner_new: got store=%s", $store);
# spent 11µs making 1 call to Mail::SpamAssassin::Logger::dbg
166714µs $self->{store} = $store;
1668
1669114µs $self;
1670}
1671
1672###########################################################################
1673
1674sub bayes_report_make_list {
1675 my ($self, $pms, $info, $param) = @_;
1676 return "Tokens not available." unless defined $info;
1677
1678 my ($limit,$fmt_arg,$more) = split /,/, ($param || '5');
1679
1680 my %formats = (
1681 short => '$t',
1682 Short => 'Token: \"$t\"',
1683 compact => '$p-$D--$t',
1684 Compact => 'Probability $p -declassification distance $D (\"+\" means > 9) --token: \"$t\"',
1685 medium => '$p-$D-$N--$t',
1686 long => '$p-$d--${h}h-${s}s--${a}d--$t',
1687 Long => 'Probability $p -declassification distance $D --in ${h} ham messages -and ${s} spam messages --${a} days old--token:\"$t\"'
1688 );
1689
1690 my $raw_fmt = (!$fmt_arg ? '$p-$D--$t' : $formats{$fmt_arg});
1691
1692 return "Invalid format, must be one of: ".join(",",keys %formats)
1693 unless defined $raw_fmt;
1694
1695 my $fmt = '"'.$raw_fmt.'"';
1696 my $amt = $limit < @$info ? $limit : @$info;
1697 return "" unless $amt;
1698
1699 my $ns = $pms->{bayes_nspam};
1700 my $nh = $pms->{bayes_nham};
1701 my $digit = sub { $_[0] > 9 ? "+" : $_[0] };
1702 my $now = time;
1703
1704 join ', ', map {
1705 my($t,$prob,$s,$h,$u) = @$_;
1706 my $a = int(($now - $u)/(3600 * 24));
1707 my $d = $self->_compute_declassification_distance($ns,$nh,$s,$h,$prob);
1708 my $p = sprintf "%.3f", $prob;
1709 my $n = $s + $h;
1710 my ($c,$o) = $prob < 0.5 ? ($h,$s) : ($s,$h);
1711 my ($D,$S,$H,$C,$O,$N) = map &$digit($_), ($d,$s,$h,$c,$o,$n);
1712 eval $fmt; ## no critic
1713 } @{$info}[0..$amt-1];
1714}
1715
1716121µs1;
 
# spent 2.90s within Mail::SpamAssassin::Plugin::Bayes::CORE:match which was called 645409 times, avg 4µs/call: # 158560 times (299ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1183, avg 2µs/call # 128048 times (297ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1197, avg 2µs/call # 126321 times (1.11s+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1192, avg 9µs/call # 64219 times (392ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1264, avg 6µs/call # 64219 times (243ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1258, avg 4µs/call # 18366 times (111ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1212, avg 6µs/call # 17309 times (91.7ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1214, avg 5µs/call # 17160 times (80.4ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1268, avg 5µs/call # 14037 times (80.3ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1324, avg 6µs/call # 7410 times (25.1ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1295, avg 3µs/call # 6279 times (74.1ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1300, avg 12µs/call # 5528 times (24.7ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1312, avg 4µs/call # 5374 times (23.5ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1351, avg 4µs/call # 5374 times (21.2ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1346, avg 4µs/call # 5374 times (19.7ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1356, avg 4µs/call # 1150 times (9.35ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs at line 1479, avg 8µs/call # 672 times (5.11ms+0s) by Mail::SpamAssassin::Plugin::Bayes::get_msgid at line 977, avg 8µs/call # 9 times (181µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1217, avg 20µs/call
sub Mail::SpamAssassin::Plugin::Bayes::CORE:match; # opcode
# spent 14µs within Mail::SpamAssassin::Plugin::Bayes::CORE:qr which was called 2 times, avg 7µs/call:
# once (10µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 81
# once (4µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@209 at line 148
sub Mail::SpamAssassin::Plugin::Bayes::CORE:qr; # opcode
# spent 976ms within Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp which was called 168353 times, avg 6µs/call:
# 158560 times (935ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1183, avg 6µs/call
# 6279 times (26.9ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1300, avg 4µs/call
# 3514 times (13.9ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1324, avg 4µs/call
sub Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp; # opcode
# spent 2.89s within Mail::SpamAssassin::Plugin::Bayes::CORE:subst which was called 415517 times, avg 7µs/call:
# 158560 times (1.13s+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1175, avg 7µs/call
# 158560 times (938ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1176, avg 6µs/call
# 15418 times (180ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1260, avg 12µs/call
# 12822 times (78.6ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1146, avg 6µs/call
# 12822 times (46.7ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1157, avg 4µs/call
# 12822 times (31.7ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1158, avg 2µs/call
# 11544 times (130ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1246, avg 11µs/call
# 8956 times (38.1ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1203, avg 4µs/call
# 7217 times (142ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1164, avg 20µs/call
# 5242 times (37.8ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1202, avg 7µs/call
# 2405 times (18.1ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs at line 1482, avg 8µs/call
# 1950 times (17.2ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1269, avg 9µs/call
# 1344 times (10.6ms+0s) by Mail::SpamAssassin::Plugin::Bayes::get_msgid at line 980, avg 8µs/call
# 702 times (31.3ms+0s) by Mail::SpamAssassin::Plugin::Bayes::get_msgid at line 1007, avg 45µs/call
# 468 times (12.5ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1459, avg 27µs/call
# 468 times (9.33ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1434, avg 20µs/call
# 468 times (5.41ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1447, avg 12µs/call
# 468 times (3.50ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1431, avg 7µs/call
# 468 times (3.38ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1429, avg 7µs/call
# 468 times (3.04ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1430, avg 6µs/call
# 468 times (1.91ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1432, avg 4µs/call
# 225 times (3.98ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1419, avg 18µs/call
# 225 times (1.65ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1404, avg 7µs/call
# 225 times (812µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1412, avg 4µs/call
# 225 times (676µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1409, avg 3µs/call
# 222 times (2.51ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1383, avg 11µs/call
# 222 times (1.70ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1393, avg 8µs/call
# 187 times (1.15ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1339, avg 6µs/call
# 173 times (5.81ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1386, avg 34µs/call
# 173 times (949µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1388, avg 5µs/call
sub Mail::SpamAssassin::Plugin::Bayes::CORE:subst; # opcode
# spent 1.04s within Mail::SpamAssassin::Plugin::Bayes::CORE:substcont which was called 229852 times, avg 5µs/call:
# 197712 times (816ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1146, avg 4µs/call
# 23088 times (125ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1246, avg 5µs/call
# 7362 times (87.7ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1164, avg 12µs/call
# 950 times (7.22ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1447, avg 8µs/call
# 614 times (3.28ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1388, avg 5µs/call
# 86 times (640µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1157, avg 7µs/call
# 40 times (200µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1158, avg 5µs/call
sub Mail::SpamAssassin::Plugin::Bayes::CORE:substcont; # opcode