Profile of Mail/SpamAssassin/Plugin/Bayes.pm

Filename	/usr/local/lib/perl5/site_perl/Mail/SpamAssassin/Plugin/Bayes.pm
Statements	Executed 2376122 statements in 29.8s

Subroutines
Calls	P	F	Exclusive Time	Inclusive Time	Subroutine
12822	4	1	17.3s	23.5s	Mail::SpamAssassin::Plugin::Bayes::_tokenize_line
234	1	1	3.09s	31.0s	Mail::SpamAssassin::Plugin::Bayes::tokenize
645267	18	1	2.36s	2.36s	Mail::SpamAssassin::Plugin::Bayes::CORE:match (opcode)
415086	30	1	2.29s	2.29s	Mail::SpamAssassin::Plugin::Bayes::CORE:subst (opcode)
234	1	1	1.18s	2.99s	Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers
229852	7	1	1.02s	1.02s	Mail::SpamAssassin::Plugin::Bayes::CORE:substcont (opcode)
168353	3	1	812ms	812ms	Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp (opcode)
555	2	2	125ms	306ms	Mail::SpamAssassin::Plugin::Bayes::get_msgid
234	1	1	106ms	36.3s	Mail::SpamAssassin::Plugin::Bayes::_learn_trapped
1150	2	1	67.4ms	91.5ms	Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs
758	1	1	58.8ms	194ms	Mail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header
468	1	1	53.3ms	98.8ms	Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received
234	1	1	53.1ms	6.25s	Mail::SpamAssassin::Plugin::Bayes::get_body_from_msg
234	1	1	36.8ms	44.1s	Mail::SpamAssassin::Plugin::Bayes::learn_message
1	1	1	19.6ms	29.6ms	Mail::SpamAssassin::Plugin::Bayes::learner_new
234	1	1	16.1ms	6.07s	Mail::SpamAssassin::Plugin::Bayes::_get_msgdata_from_permsgstatus
222	1	1	14.7ms	28.1ms	Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type
225	1	1	9.18ms	16.1ms	Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id
236	1	1	2.35ms	2.35ms	Mail::SpamAssassin::Plugin::Bayes::read_db_configs
1	1	1	1.39ms	2.09ms	Mail::SpamAssassin::Plugin::Bayes::BEGIN@63
1	1	1	84µs	132µs	Mail::SpamAssassin::Plugin::Bayes::new
1	1	1	78µs	176µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@1509
1	1	1	52µs	1.31ms	Mail::SpamAssassin::Plugin::Bayes::learner_close
1	1	1	49µs	68µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@46
1	1	1	41µs	298µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@68
1	1	1	36µs	198µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@167
1	1	1	34µs	180µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@51
1	1	1	32µs	37µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@48
1	1	1	32µs	220µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@219
1	1	1	32µs	234µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@178
1	1	1	32µs	234µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@215
1	1	1	31µs	209µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@174
1	1	1	31µs	228µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@165
1	1	1	30µs	207µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@175
1	1	1	30µs	226µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@169
1	1	1	30µs	58µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@47
1	1	1	30µs	172µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@60
1	1	1	30µs	230µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@173
1	1	1	29µs	204µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@164
1	1	1	29µs	205µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@168
1	1	1	29µs	232µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@227
1	1	1	28µs	236µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@158
1	1	1	28µs	196µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@179
1	1	1	27µs	212µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@172
1	1	1	26µs	506µs	Mail::SpamAssassin::Plugin::Bayes::learner_is_scan_available
1	1	1	25µs	230µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@157
1	1	1	25µs	246µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@163
1	1	1	25µs	97µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@49
1	1	1	25µs	213µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@59
1	1	1	24µs	217µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@156
1	1	1	22µs	184µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@166
1	1	1	21µs	257µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@223
1	1	1	20µs	20µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@58
1	1	1	20µs	196µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@159
2	2	1	18µs	18µs	Mail::SpamAssassin::Plugin::Bayes::CORE:qr (opcode)
1	1	1	15µs	15µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@56
1	1	1	12µs	12µs	Mail::SpamAssassin::Plugin::Bayes::BEGIN@57
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::__ANON__[:1701]
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::__ANON__[:874]
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::__ANON__[:880]
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::__ANON__[:898]
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::_compute_declassification_distance
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::_compute_prob_for_all_tokens
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::_compute_prob_for_token
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::_forget_trapped
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::_opportunistic_calls
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::bayes_report_make_list
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::check_bayes
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::finish
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::forget_message
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::ignore_message
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::learner_dump_database
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::learner_expire_old_training
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::learner_get_implementation
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::learner_sync
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::prefork_init
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::scan
0	0	0	0s	0s	Mail::SpamAssassin::Plugin::Bayes::spamd_child_init
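
The durations in the table above mix units (s, ms, µs), which makes rows hard to compare at a glance. The following is a minimal sketch, not part of NYTProf itself, that converts a few rows copied from the table into seconds and ranks them by exclusive time. The helper names `to_seconds` and `parse_row` are hypothetical, the rows are assumed to be tab-separated exactly as shown, and the P and F columns are simply skipped.

```perl
use strict;
use warnings;
use utf8;    # the source contains a literal "µ"; save the file as UTF-8

# Multipliers for the duration suffixes that appear in the report.
my %UNIT = ('s' => 1, 'ms' => 1e-3, 'µs' => 1e-6, 'ns' => 1e-9);

# Convert a duration such as "17.3s", "812ms", "84µs" or "0s" to seconds.
sub to_seconds {
    my ($text) = @_;
    my ($value, $unit) = $text =~ /^([\d.]+)(s|ms|µs|ns)$/
        or die "unrecognised duration: $text\n";
    return $value * $UNIT{$unit};
}

# Split one tab-separated table row into the fields used below.
sub parse_row {
    my ($line) = @_;
    my ($calls, $p, $f, $excl, $incl, $name) = split /\t/, $line;
    return {
        calls     => $calls,
        exclusive => to_seconds($excl),
        inclusive => to_seconds($incl),
        name      => $name,
    };
}

# A few rows copied verbatim from the subroutine table.
my @rows = map { parse_row($_) } (
    "1\t1\t1\t84µs\t132µs\tMail::SpamAssassin::Plugin::Bayes::new",
    "168353\t3\t1\t812ms\t812ms\tMail::SpamAssassin::Plugin::Bayes::CORE:regcomp (opcode)",
    "234\t1\t1\t3.09s\t31.0s\tMail::SpamAssassin::Plugin::Bayes::tokenize",
    "12822\t4\t1\t17.3s\t23.5s\tMail::SpamAssassin::Plugin::Bayes::_tokenize_line",
);

# Rank by exclusive time, largest first.
my @ranked = sort { $b->{exclusive} <=> $a->{exclusive} } @rows;

for my $row (@ranked) {
    # Guard against the zero-call rows that the full table also contains.
    my $per_call = $row->{calls} ? $row->{exclusive} / $row->{calls} : 0;
    printf "%-62s %12.6f s total %14.9f s/call\n",
        $row->{name}, $row->{exclusive}, $per_call;
}
```

Read this way, `_tokenize_line` costs roughly 1.35 ms of exclusive time per call over 12822 calls, while `CORE:regcomp` is cheap per call but is invoked 168353 times.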

Line	Statements	Time on line	Calls	Time in subs	Code
1					# <@LICENSE>
2					# Licensed to the Apache Software Foundation (ASF) under one or more
3					# contributor license agreements. See the NOTICE file distributed with
4					# this work for additional information regarding copyright ownership.
5					# The ASF licenses this file to you under the Apache License, Version 2.0
6					# (the "License"); you may not use this file except in compliance with
7					# the License. You may obtain a copy of the License at:
8					#
9					# http://www.apache.org/licenses/LICENSE-2.0
10					#
11					# Unless required by applicable law or agreed to in writing, software
12					# distributed under the License is distributed on an "AS IS" BASIS,
13					# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14					# See the License for the specific language governing permissions and
15					# limitations under the License.
16					# </@LICENSE>
17
18					=head1 NAME
19
20					Mail::SpamAssassin::Plugin::Bayes - determine spammishness using a Bayesian classifier
21
22					=head1 DESCRIPTION
23
24					This is a Bayesian-style probabilistic classifier, using an algorithm based on
25					the one detailed in Paul Graham's I<A Plan For Spam> paper at:
26
27					http://www.paulgraham.com/spam.html
28
29					It also incorporates some other aspects taken from Gary Robinson's webpage
30					on the subject at:
31
32					http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
33
34					And the chi-square probability combiner as described here:
35
36					http://www.linuxjournal.com/print.php?sid=6467
37
38					The results are incorporated into SpamAssassin as the BAYES_* rules.
39
40					=head1 METHODS
41
42					=cut
43
44					package Mail::SpamAssassin::Plugin::Bayes;
45
46	2	76µs	2	88µs	# spent 68µs (49+19) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@46 which was called: # once (49µs+19µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 46 use strict; # spent 68µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@46 # spent 19µs making 1 call to strict::import
47	2	66µs	2	86µs	# spent 58µs (30+28) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@47 which was called: # once (30µs+28µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 47 use warnings; # spent 58µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@47 # spent 28µs making 1 call to warnings::import
48	2	84µs	2	42µs	# spent 37µs (32+5) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@48 which was called: # once (32µs+5µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 48 use bytes; # spent 37µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@48 # spent 5µs making 1 call to bytes::import
49	2	141µs	2	169µs	# spent 97µs (25+72) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@49 which was called: # once (25µs+72µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 49 use re 'taint'; # spent 97µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@49 # spent 72µs making 1 call to re::import
50
					# spent 180µs (34+146) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@51 which was called:
					# once (34µs+146µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 54
51					BEGIN {
52	3	19µs	1	146µs	eval { require Digest::SHA; import Digest::SHA qw(sha1 sha1_hex); 1 } # spent 146µs making 1 call to Exporter::import
53	1	13µs			or do { require Digest::SHA1; import Digest::SHA1 qw(sha1 sha1_hex) }
54	1	51µs	1	180µs	} # spent 180µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@51
55
56	2	59µs	1	15µs	# spent 15µs within Mail::SpamAssassin::Plugin::Bayes::BEGIN@56 which was called: # once (15µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 56 use Mail::SpamAssassin; # spent 15µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@56
57	2	70µs	1	12µs	# spent 12µs within Mail::SpamAssassin::Plugin::Bayes::BEGIN@57 which was called: # once (12µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 57 use Mail::SpamAssassin::Plugin; # spent 12µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@57
58	2	65µs	1	20µs	# spent 20µs within Mail::SpamAssassin::Plugin::Bayes::BEGIN@58 which was called: # once (20µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 58 use Mail::SpamAssassin::PerMsgStatus; # spent 20µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@58
59	2	68µs	2	401µs	# spent 213µs (25+188) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@59 which was called: # once (25µs+188µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 59 use Mail::SpamAssassin::Logger; # spent 213µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@59 # spent 188µs making 1 call to Exporter::import
60	2	85µs	2	314µs	# spent 172µs (30+142) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@60 which was called: # once (30µs+142µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 60 use Mail::SpamAssassin::Util qw(untaint_var); # spent 172µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@60 # spent 142µs making 1 call to Exporter::import
61
62					# pick ONLY ONE of these combining implementations.
63	2	354µs	1	2.09ms	# spent 2.09ms (1.39+703µs) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@63 which was called: # once (1.39ms+703µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 63 use Mail::SpamAssassin::Bayes::CombineChi; # spent 2.09ms making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@63
64					# use Mail::SpamAssassin::Bayes::CombineNaiveBayes;
65
66	1	26µs			our @ISA = qw(Mail::SpamAssassin::Plugin);
67
					# spent 298µs (41+257) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@68 which was called:
					# once (41µs+257µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 73
68	1	7µs			use vars qw{
69					$IGNORED_HDRS
70					$MARK_PRESENCE_ONLY_HDRS
71					%HEADER_NAME_COMPRESSION
72					$OPPORTUNISTIC_LOCK_VALID
73	1	1.33ms	2	555µs	}; # spent 298µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@68 # spent 257µs making 1 call to vars::import
74
75					# Which headers should we scan for tokens? Don't use all of them, as it's easy
76					# to pick up spurious clues from some. What we now do is use all of them
77					# less these well-known headers; that way we can pick up spammers' tracking
78					# headers (which are obviously not well-known in advance!).
79
80					# Received is handled specially
81	1	36µs	1	14µs	$IGNORED_HDRS = qr{(?: (?:X-)?Sender # misc noise # spent 14µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::CORE:qr
82					|Delivered-To |Delivery-Date
83					|(?:X-)?Envelope-To
84					|X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text
85
86					|Subject # not worth a tiny gain vs. to db size increase
87
88					# Date: can provide invalid cues if your spam corpus is
89					# older/newer than ham
90					|Date
91
92					# List headers: ignore. a spamfiltering mailing list will
93					# become a nonspam sign.
94					|X-List|(?:X-)?Mailing-List
95					|(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe
96					|Unsubscribe|Host|Id|Manager|Admin|Comment
97					|Name|Url)
98					|X-Unsub(?:scribe)?
99					|X-Mailman-Version |X-Been[Tt]here |X-Loop
100					|Mail-Followup-To
101					|X-eGroups-(?:Return|From)
102					|X-MDMailing-List
103					|X-XEmacs-List
104
105					# gatewayed through mailing list (thanks to Allen Smith)
106					|(?:X-)?Resent-(?:From|To|Date)
107					|(?:X-)?Original-(?:From|To|Date)
108
109					# Spamfilter/virus-scanner headers: too easy to chain from
110					# these
111					|X-MailScanner(?:-SpamCheck)?
112					|X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
113					|X-Antispam |X-RBL-Warning |X-Mailscanner
114					|X-MDaemon-Deliver-To |X-Virus-Scanned
115					|X-Mass-Check-Id
116					|X-Pyzor |X-DCC-\S{2,25}-Metrics
117					|X-Filtered-B[Yy] |X-Scanned-By |X-Scanner
118					|X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status
119					|X-SpamCop-[^:]+
120					|X-SMTPD |(?:X-)?Spam-Apparently-To
121					|SPAM |X-Perlmx-Spam
122					|X-Bogosity
123
124					# some noisy Outlook headers that add no good clues:
125					|Content-Class |Thread-(?:Index|Topic)
126					|X-Original[Aa]rrival[Tt]ime
127
128					# Annotations from IMAP, POP, and MH:
129					|(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded
130					|Lines |Content-Length
131					|X-UIDL? |X-IMAPbase
132
133					# Annotations from Bugzilla
134					|X-Bugzilla-[^:]+
135
136					# Annotations from VM: (thanks to Allen Smith)
137					|X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified
138					|Summary-Format|VHeader|v\d-Data|Message-Order)
139
140					# Annotations from Gnus:
141					| X-Gnus-Mail-Source
142					| Xref
143
144					)}x;
145
146					# Note only the presence of these headers, in order to reduce the
147					# hapaxen they generate.
148	1	12µs	1	4µs	$MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face # spent 4µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::CORE:qr
149					|X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
150					|D(?:KIM|omainKey)-Signature
151					)}ix;
152
153					# tweaks tested as of Nov 18 2002 by jm posted to -devel at
154					# http://sourceforge.net/p/spamassassin/mailman/message/12977556/
155					# for results. The winners are now the default settings.
156	2	72µs	2	411µs	# spent 217µs (24+194) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@156 which was called: # once (24µs+194µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 156 use constant IGNORE_TITLE_CASE => 1; # spent 217µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@156 # spent 194µs making 1 call to constant::import
157	2	71µs	2	435µs	# spent 230µs (25+205) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@157 which was called: # once (25µs+205µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 157 use constant TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES => 0; # spent 230µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@157 # spent 205µs making 1 call to constant::import
158	2	67µs	2	443µs	# spent 236µs (28+207) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@158 which was called: # once (28µs+207µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 158 use constant TOKENIZE_LONG_8BIT_SEQS_AS_UTF8_CHARS => 1; # spent 236µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@158 # spent 207µs making 1 call to constant::import
159	2	13.4ms	2	373µs	# spent 196µs (20+177) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@159 which was called: # once (20µs+177µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 159 use constant TOKENIZE_LONG_TOKENS_AS_SKIPS => 1; # spent 196µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@159 # spent 177µs making 1 call to constant::import
160
161					# tweaks by jm on May 12 2003, see -devel email at
162					# http://sourceforge.net/p/spamassassin/mailman/message/14844556/
163	2	81µs	2	466µs	# spent 246µs (25+221) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@163 which was called: # once (25µs+221µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 163 use constant PRE_CHEW_ADDR_HEADERS => 1; # spent 246µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@163 # spent 221µs making 1 call to constant::import
164	2	73µs	2	379µs	# spent 204µs (29+175) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@164 which was called: # once (29µs+175µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 164 use constant CHEW_BODY_URIS => 1; # spent 204µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@164 # spent 175µs making 1 call to constant::import
165	2	88µs	2	425µs	# spent 228µs (31+197) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@165 which was called: # once (31µs+197µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 165 use constant CHEW_BODY_MAILADDRS => 1; # spent 228µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@165 # spent 197µs making 1 call to constant::import
166	2	57µs	2	346µs	# spent 184µs (22+162) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@166 which was called: # once (22µs+162µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 166 use constant HDRS_TOKENIZE_LONG_TOKENS_AS_SKIPS => 1; # spent 184µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@166 # spent 162µs making 1 call to constant::import
167	2	69µs	2	360µs	# spent 198µs (36+162) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@167 which was called: # once (36µs+162µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 167 use constant BODY_TOKENIZE_LONG_TOKENS_AS_SKIPS => 1; # spent 198µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@167 # spent 162µs making 1 call to constant::import
168	2	71µs	2	381µs	# spent 205µs (29+176) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@168 which was called: # once (29µs+176µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 168 use constant URIS_TOKENIZE_LONG_TOKENS_AS_SKIPS => 0; # spent 205µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@168 # spent 176µs making 1 call to constant::import
169	2	68µs	2	422µs	# spent 226µs (30+196) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@169 which was called: # once (30µs+196µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 169 use constant IGNORE_MSGID_TOKENS => 0; # spent 226µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@169 # spent 196µs making 1 call to constant::import
170
171					# tweaks of 12 March 2004, see bug 2129.
172	2	77µs	2	397µs	# spent 212µs (27+185) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@172 which was called: # once (27µs+185µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 172 use constant DECOMPOSE_BODY_TOKENS => 1; # spent 212µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@172 # spent 185µs making 1 call to constant::import
173	2	80µs	2	431µs	# spent 230µs (30+201) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@173 which was called: # once (30µs+201µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 173 use constant MAP_HEADERS_MID => 1; # spent 230µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@173 # spent 201µs making 1 call to constant::import
174	2	68µs	2	386µs	# spent 209µs (31+178) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@174 which was called: # once (31µs+178µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 174 use constant MAP_HEADERS_FROMTOCC => 1; # spent 209µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@174 # spent 178µs making 1 call to constant::import
175	2	92µs	2	384µs	# spent 207µs (30+177) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@175 which was called: # once (30µs+177µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 175 use constant MAP_HEADERS_USERAGENT => 1; # spent 207µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@175 # spent 177µs making 1 call to constant::import
176
177					# tweaks, see http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3173#c26
178	2	68µs	2	437µs	# spent 234µs (32+202) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@178 which was called: # once (32µs+202µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 178 use constant ADD_INVIZ_TOKENS_I_PREFIX => 1; # spent 234µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@178 # spent 202µs making 1 call to constant::import
179	2	219µs	2	364µs	# spent 196µs (28+168) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@179 which was called: # once (28µs+168µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 179 use constant ADD_INVIZ_TOKENS_NO_PREFIX => 0; # spent 196µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@179 # spent 168µs making 1 call to constant::import
180
181					# We store header-mined tokens in the db with a "HHeaderName:val" format.
182					# some headers may contain lots of gibberish tokens, so allow a little basic
183					# compression by mapping the header name at least here. these are the headers
184					# which appear with the most frequency in my db. note: this doesn't have to
185					# be 2-way (ie. LHSes that map to the same RHS are not a problem), but mixing
186					# tokens from multiple different headers may impact accuracy, so might as well
187					# avoid this if possible. These are the top ones from my corpus, BTW (jm).
188	1	31µs			%HEADER_NAME_COMPRESSION = (
189					'Message-Id' => '*m',
190					'Message-ID' => '*M',
191					'Received' => '*r',
192					'User-Agent' => '*u',
193					'References' => '*f',
194					'In-Reply-To' => '*i',
195					'From' => '*F',
196					'Reply-To' => '*R',
197					'Return-Path' => '*p',
198					'Return-path' => '*rp',
199					'X-Mailer' => '*x',
200					'X-Authentication-Warning' => '*a',
201					'Organization' => '*o',
202					'Organisation' => '*o',
203					'Content-Type' => '*c',
204					'x-spam-relays-trusted' => '*RT',
205					'x-spam-relays-untrusted' => '*RU',
206					);
207
208					# How many seconds should the opportunistic_expire lock be valid?
209	1	2µs			$OPPORTUNISTIC_LOCK_VALID = 300;
210
211					# Should we use the Robinson f(w) equation from
212					# http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html ?
213					# It gives better results, in that scores are more likely to distribute
214					# into the <0.5 range for nonspam and >0.5 for spam.
215	2	72µs	2	437µs	# spent 234µs (32+203) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@215 which was called: # once (32µs+203µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 215 use constant USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS => 1; # spent 234µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@215 # spent 203µs making 1 call to constant::import
216
217					# How many of the most significant tokens should we use for the p(w)
218					# calculation?
219	2	74µs	2	409µs	# spent 220µs (32+188) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@219 which was called: # once (32µs+188µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 219 use constant N_SIGNIFICANT_TOKENS => 150; # spent 220µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@219 # spent 188µs making 1 call to constant::import
220
221					# How many significant tokens are required for a classifier score to
222					# be considered usable?
223	2	80µs	2	493µs	# spent 257µs (21+236) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@223 which was called: # once (21µs+236µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 223 use constant REQUIRE_SIGNIFICANT_TOKENS_TO_SCORE => -1; # spent 257µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@223 # spent 236µs making 1 call to constant::import
224
225					# How long a token should we hold onto? (note: German speakers typically
226					# will require a longer token than English ones.)
227	2	14.9ms	2	434µs	# spent 232µs (29+203) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@227 which was called: # once (29µs+203µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 227 use constant MAX_TOKEN_LENGTH => 15; # spent 232µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@227 # spent 203µs making 1 call to constant::import
228
229					###########################################################################
230
					# spent 132µs (84+48) within Mail::SpamAssassin::Plugin::Bayes::new which was called:
					# once (84µs+48µs) by Mail::SpamAssassin::PluginHandler::load_plugin at line 1 of (eval 89)[Mail/SpamAssassin/PluginHandler.pm:129]
231					sub new {
232	1	3µs			my $class = shift;
233	1	2µs			my ($main) = @_;
234
235	1	3µs			$class = ref($class) || $class;
236	1	13µs	1	17µs	my $self = $class->SUPER::new($main); # spent 17µs making 1 call to Mail::SpamAssassin::Plugin::new
237	1	2µs			bless ($self, $class);
238
239	1	6µs			$self->{main} = $main;
240	1	4µs			$self->{conf} = $main->{conf};
241	1	3µs			$self->{use_ignores} = 1;
242
243	1	10µs	1	31µs	$self->register_eval_rule("check_bayes"); # spent 31µs making 1 call to Mail::SpamAssassin::Plugin::register_eval_rule
244	1	10µs			$self;
245					}
246
247					sub finish {
248					my $self = shift;
249					if ($self->{store}) {
250					$self->{store}->untie_db();
251					}
252					%{$self} = ();
253					}
254
255					###########################################################################
256
257					# Plugin hook.
258					# Return this implementation object, for callers that need to know
259					# it. TODO: callers shouldn't need to know it!
260					# used only in test suite to get access to {store}, internal APIs.
261					#
262					sub learner_get_implementation { return shift; }
263
264					###########################################################################
265
266					# Plugin hook.
267					# Called in the parent process shortly before forking off child processes.
268					sub prefork_init {
269					my ($self) = @_;
270
271					if ($self->{store} && $self->{store}->UNIVERSAL::can('prefork_init')) {
272					$self->{store}->prefork_init;
273					}
274					}
275
276					###########################################################################
277
278					# Plugin hook.
279					# Called in a child process shortly after being spawned.
280					sub spamd_child_init {
281					my ($self) = @_;
282
283					if ($self->{store} && $self->{store}->UNIVERSAL::can('spamd_child_init')) {
284					$self->{store}->spamd_child_init;
285					}
286					}
287
288					###########################################################################
289
290					# Plugin hook.
291					sub check_bayes {
292					my ($self, $pms, $fulltext, $min, $max) = @_;
293
294					return 0 if (!$self->{conf}->{use_learner});
295					return 0 if (!$self->{conf}->{use_bayes} || !$self->{conf}->{use_bayes_rules});
296
297					if (!exists ($pms->{bayes_score})) {
298					my $timer = $self->{main}->time_method("check_bayes");
299					$pms->{bayes_score} = $self->scan($pms, $pms->{msg});
300					}
301
302					if (defined $pms->{bayes_score} &&
303					($min == 0 || $pms->{bayes_score} > $min) &&
304					($max eq "undef" || $pms->{bayes_score} <= $max))
305					{
306					if ($self->{conf}->{detailed_bayes_score}) {
307					$pms->test_log(sprintf ("score: %3.4f, hits: %s",
308					$pms->{bayes_score},
309					$pms->{bayes_hits}));
310					}
311					else {
312					$pms->test_log(sprintf ("score: %3.4f", $pms->{bayes_score}));
313					}
314					return 1;
315					}
316
317					return 0;
318					}
319
320					###########################################################################
321
322					# Plugin hook.
					# spent 1.31ms (52µs+1.25) within Mail::SpamAssassin::Plugin::Bayes::learner_close which was called:
					# once (52µs+1.25ms) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm
323					sub learner_close {
324	1	2µs			my ($self, $params) = @_;
325	1	4µs			my $quiet = $params->{quiet};
326
327					# do a sanity check here. Weird things happen if we remain tied
328					# after compiling; for example, spamd will never see that the
329					# number of messages has reached the bayes-scanning threshold.
330	1	26µs	1	13µs	if ($self->{store}->db_readable()) { # spent 13µs making 1 call to Mail::SpamAssassin::BayesStore::DBM::db_readable
331	1	2µs			warn "bayes: oops! still tied to bayes DBs, untying\n" unless $quiet;
332	1	11µs	1	1.24ms	$self->{store}->untie_db(); # spent 1.24ms making 1 call to Mail::SpamAssassin::BayesStore::DBM::untie_db
333					}
334					}
335
336					###########################################################################
337
338					# read configuration items to control bayes behaviour. Called by
339					# BayesStore::read_db_configs().
					# spent 2.35ms within Mail::SpamAssassin::Plugin::Bayes::read_db_configs which was called 236 times, avg 10µs/call:
					# 236 times (2.35ms+0s) by Mail::SpamAssassin::BayesStore::read_db_configs at line 117 of Mail/SpamAssassin/BayesStore.pm, avg 10µs/call
340					sub read_db_configs {
341	236	521µs			my ($self) = @_;
342
343					# use of hapaxes. Set on bayes object, since it controls prob
344					# computation.
345	236	2.49ms			$self->{use_hapaxes} = $self->{conf}->{bayes_use_hapaxes};
346					}
347					###########################################################################
348
349					sub ignore_message {
350					my ($self,$PMS) = @_;
351
352					return 0 unless $self->{use_ignores};
353
354					my $ig_from = $self->{main}->call_plugins ("check_wb_list",
355					{ permsgstatus => $PMS, type => 'from', list => 'bayes_ignore_from' });
356					my $ig_to = $self->{main}->call_plugins ("check_wb_list",
357					{ permsgstatus => $PMS, type => 'to', list => 'bayes_ignore_to' });
358
359					my $ignore = $ig_from || $ig_to;
360
361					dbg("bayes: not using bayes, bayes_ignore_from or _to rule") if $ignore;
362
363					return $ignore;
364					}
365
366					###########################################################################
367
368					# Plugin hook.
					# spent 44.1s (36.8ms+44.1) within Mail::SpamAssassin::Plugin::Bayes::learn_message which was called 234 times, avg 189ms/call:
					# 234 times (36.8ms+44.1s) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm, avg 189ms/call
369					sub learn_message {
370	234	498µs			my ($self, $params) = @_;
371	234	843µs			my $isspam = $params->{isspam};
372	234	697µs			my $msg = $params->{msg};
373	234	645µs			my $id = $params->{id};
374
375	234	949µs			if (!$self->{conf}->{use_bayes}) { return; }
376
377	234	2.37ms	234	6.25s	my $msgdata = $self->get_body_from_msg ($msg); # spent 6.25s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::get_body_from_msg, avg 26.7ms/call
378	234	476µs			my $ret;
379
380					eval {
381	234	1.58ms			local $SIG{'__DIE__'}; # do not run user die() traps in here
382	234	3.55ms	234	1.89ms	my $timer = $self->{main}->time_method("b_learn"); # spent 1.89ms making 234 calls to Mail::SpamAssassin::time_method, avg 8µs/call
383
384	234	445µs			my $ok;
385	234	1.25ms			if ($self->{main}->{learn_to_journal}) {
386					# If we're going to learn to journal, we'll try going r/o first...
387					# If that fails for some reason, let's try going r/w. This happens
388					# if the DB doesn't exist yet.
389	234	3.13ms	235	1.59s	$ok = $self->{store}->tie_db_readonly() || $self->{store}->tie_db_writable(); # spent 1.58s making 234 calls to Mail::SpamAssassin::BayesStore::DBM::tie_db_readonly, avg 6.77ms/call # spent 4.45ms making 1 call to Mail::SpamAssassin::BayesStore::DBM::tie_db_writable
390					} else {
391					$ok = $self->{store}->tie_db_writable();
392					}
393
394	234	926µs			if ($ok) {
395	234	2.80ms	234	36.3s	$ret = $self->_learn_trapped ($isspam, $msg, $msgdata, $id); # spent 36.3s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::_learn_trapped, avg 155ms/call
396
397	234	1.06ms			if (!$self->{main}->{learn_caller_will_untie}) {
398					$self->{store}->untie_db();
399					}
400					}
401	234	2.74ms			1;
402	234	1.03ms			} or do { # if we died, untie the dbs.
403					my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat;
404					$self->{store}->untie_db();
405					die "bayes: (in learn) $eval_stat\n";
406					};
407
408	234	3.54ms			return $ret;
409					}
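					# Illustration (not part of Bayes.pm): learn_message above wraps its work in
					# eval { ...; 1 } or do { ... } so the DB is untied even when the body dies.
					# A minimal standalone sketch of that pattern follows; with_cleanup and @log
					# are hypothetical names, and the push stands in for untie_db().

```perl
use strict;
use warnings;

my @log;   # records cleanup calls; hypothetical stand-in for untie_db()

sub with_cleanup {
  my ($work) = @_;
  my $ret;
  eval {
    local $SIG{'__DIE__'};   # keep caller-installed die handlers out of the block
    $ret = $work->();
    1;                       # the block evaluates true only if nothing died
  } or do {
    my $err = $@ ne '' ? $@ : "errno=$!";
    chomp $err;
    push @log, 'cleanup';    # release resources before re-throwing
    die "wrapped: $err\n";
  };
  return $ret;
}

my $ok     = with_cleanup(sub { 42 });
my $failed = !eval { with_cleanup(sub { die "boom\n" }); 1 };
print "ok=$ok failed=$failed error=$@";
```

					# The trailing 1 matters: without it a block whose last expression is false
					# would be indistinguishable from one that died.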
410
411					# this function is trapped by the wrapper above
					# spent 36.3s (106ms+36.2) within Mail::SpamAssassin::Plugin::Bayes::_learn_trapped which was called 234 times, avg 155ms/call:
					# 234 times (106ms+36.2s) by Mail::SpamAssassin::Plugin::Bayes::learn_message at line 395, avg 155ms/call
412					sub _learn_trapped {
413	234	689µs			my ($self, $isspam, $msg, $msgdata, $msgid) = @_;
414	234	896µs			my @msgid = ( $msgid );
415
416	234	1.25ms			if (!defined $msgid) {
417	234	2.69ms	234	137ms	@msgid = $self->get_msgid($msg); # spent 137ms making 234 calls to Mail::SpamAssassin::Plugin::Bayes::get_msgid, avg 584µs/call
418					}
419
420	234	1.07ms			foreach my $msgid_t ( @msgid ) {
421	458	4.79ms	458	31.2ms	my $seen = $self->{store}->seen_get ($msgid_t); # spent 31.2ms making 458 calls to Mail::SpamAssassin::BayesStore::DBM::seen_get, avg 68µs/call
422
423	458	3.28ms			if (defined ($seen)) {
424					if (($seen eq 's' && $isspam) || ($seen eq 'h' && !$isspam)) {
425					dbg("bayes: $msgid_t already learnt correctly, not learning twice");
426					return 0;
427					} elsif ($seen !~ /^[hs]$/) {
428					warn("bayes: db_seen corrupt: value='$seen' for $msgid_t, ignored");
429					} else {
430					# bug 3704: If the message was already learned, don't try learning it again.
431					# this prevents, for instance, manually learning as spam, then autolearning
432					# as ham, or vice versa.
433					if ($self->{main}->{learn_no_relearn}) {
434					dbg("bayes: $msgid_t already learnt as opposite, not re-learning");
435					return 0;
436					}
437
438					dbg("bayes: $msgid_t already learnt as opposite, forgetting first");
439
440					# kluge so that forget() won't untie the db on us ...
441					my $orig = $self->{main}->{learn_caller_will_untie};
442					$self->{main}->{learn_caller_will_untie} = 1;
443
444					my $fatal = !defined $self->{main}->{bayes_scanner}->forget ($msg);
445
446					# reset the value post-forget() ...
447					$self->{main}->{learn_caller_will_untie} = $orig;
448
449					# forget() gave us a fatal error, so propagate that up
450					if ($fatal) {
451					dbg("bayes: forget() returned a fatal error, so learn() will too");
452					return;
453					}
454					}
455
456					# we're only going to have seen this once, so stop if it's been
457					# seen already
458					last;
459					}
460					}
461
462					# Now that we're sure we haven't seen this message before ...
463	234	790µs			$msgid = $msgid[0];
464
465	234	2.83ms	234	1.40s	my $msgatime = $msg->receive_date(); # spent 1.40s making 234 calls to Mail::SpamAssassin::Message::receive_date, avg 5.97ms/call
466
467					# If the message atime comes back as being more than 1 day in the
468					# future, something's messed up and we should revert to current time as
469					# a safety measure.
470					#
471	234	1.21ms			$msgatime = time if ( $msgatime - time > 86400 );
472
473	234	2.46ms	234	31.0s	my $tokens = $self->tokenize($msg, $msgdata); # spent 31.0s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::tokenize, avg 132ms/call
474
475	468	8.70ms	234	2.61ms	{ my $timer = $self->{main}->time_method('b_count_change'); # spent 2.61ms making 234 calls to Mail::SpamAssassin::time_method, avg 11µs/call
476	234	1.03ms			if ($isspam) {
477	234	2.49ms	234	9.65ms	$self->{store}->nspam_nham_change(1, 0); # spent 9.65ms making 234 calls to Mail::SpamAssassin::BayesStore::DBM::nspam_nham_change, avg 41µs/call
478	234	2.44ms	234	3.49s	$self->{store}->multi_tok_count_change(1, 0, $tokens, $msgatime); # spent 3.49s making 234 calls to Mail::SpamAssassin::BayesStore::DBM::multi_tok_count_change, avg 14.9ms/call
479					} else {
480					$self->{store}->nspam_nham_change(0, 1);
481					$self->{store}->multi_tok_count_change(0, 1, $tokens, $msgatime);
482					}
483					}
484
485	234	3.06ms	234	11.2ms	$self->{store}->seen_put ($msgid, ($isspam ? 's' : 'h')); # spent 11.2ms making 234 calls to Mail::SpamAssassin::BayesStore::DBM::seen_put, avg 48µs/call
486	234	2.15ms	234	104ms	$self->{store}->cleanup(); # spent 104ms making 234 calls to Mail::SpamAssassin::BayesStore::DBM::cleanup, avg 443µs/call
487
488	234	5.80ms	234	0s	$self->{main}->call_plugins("bayes_learn", { toksref => $tokens, # spent 17.6ms making 234 calls to Mail::SpamAssassin::call_plugins, avg 75µs/call, recursion: max depth 1, sum of overlapping time 17.6ms
489					isspam => $isspam,
490					msgid => $msgid,
491					msgatime => $msgatime,
492					});
493
494	234	3.06ms	234	2.42ms	dbg("bayes: learned '$msgid', atime: $msgatime"); # spent 2.42ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 10µs/call
495
496	234	55.4ms			1;
497					}
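					# Illustration (not part of Bayes.pm): the seen-DB checks in _learn_trapped
					# reduce to a small decision table. The sketch below restates it as a pure
					# function; seen_action and its string results are hypothetical names, and
					# $no_relearn plays the role of learn_no_relearn.

```perl
use strict;
use warnings;

# $seen is the stored flag ('s', 'h', or undef), $isspam the requested class.
sub seen_action {
  my ($seen, $isspam, $no_relearn) = @_;
  return 'learn' unless defined $seen;                  # never seen before
  return 'skip'                                         # already learnt correctly
    if ($seen eq 's' && $isspam) || ($seen eq 'h' && !$isspam);
  return 'learn-after-warning' if $seen !~ /^[hs]$/;    # corrupt entry is ignored
  return $no_relearn ? 'skip' : 'forget-then-learn';    # learnt as the opposite class
}

for my $case ([undef, 1, 0], ['s', 1, 0], ['h', 1, 0], ['h', 1, 1], ['x', 0, 0]) {
  my ($seen, $isspam, $no_relearn) = @$case;
  printf "%-5s isspam=%d no_relearn=%d => %s\n",
    defined $seen ? $seen : 'undef', $isspam, $no_relearn,
    seen_action($seen, $isspam, $no_relearn);
}
```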
498
499					###########################################################################
500
501					# Plugin hook.
502					sub forget_message {
503					my ($self, $params) = @_;
504					my $msg = $params->{msg};
505					my $id = $params->{id};
506
507					if (!$self->{conf}->{use_bayes}) { return; }
508
509					my $msgdata = $self->get_body_from_msg ($msg);
510					my $ret;
511
512					# we still tie for writing here, since we write to the seen db
513					# synchronously
514					eval {
515					local $SIG{'__DIE__'}; # do not run user die() traps in here
516					my $timer = $self->{main}->time_method("b_learn");
517
518					my $ok;
519					if ($self->{main}->{learn_to_journal}) {
520					# If we're going to learn to journal, we'll try going r/o first...
521					# If that fails for some reason, let's try going r/w. This happens
522					# if the DB doesn't exist yet.
523					$ok = $self->{store}->tie_db_readonly() || $self->{store}->tie_db_writable();
524					} else {
525					$ok = $self->{store}->tie_db_writable();
526					}
527
528					if ($ok) {
529					$ret = $self->_forget_trapped ($msg, $msgdata, $id);
530
531					if (!$self->{main}->{learn_caller_will_untie}) {
532					$self->{store}->untie_db();
533					}
534					}
535					1;
536					} or do { # if we died, untie the dbs.
537					my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat;
538					$self->{store}->untie_db();
539					die "bayes: (in forget) $eval_stat\n";
540					};
541
542					return $ret;
543					}
544
545					# this function is trapped by the wrapper above
546					sub _forget_trapped {
547					my ($self, $msg, $msgdata, $msgid) = @_;
548					my @msgid = ( $msgid );
549					my $isspam;
550
551					if (!defined $msgid) {
552					@msgid = $self->get_msgid($msg);
553					}
554
555					while( $msgid = shift @msgid ) {
556					my $seen = $self->{store}->seen_get ($msgid);
557
558					if (defined ($seen)) {
559					if ($seen eq 's') {
560					$isspam = 1;
561					} elsif ($seen eq 'h') {
562					$isspam = 0;
563					} else {
564					dbg("bayes: forget: msgid $msgid seen entry is neither ham nor spam, ignored");
565					return 0;
566					}
567
568					# messages should only be learned once, so stop if we find a msgid
569					# which was seen before
570					last;
571					}
572					else {
573					dbg("bayes: forget: msgid $msgid not learnt, ignored");
574					}
575					}
576
577					# This message wasn't learnt before, so return
578					if (!defined $isspam) {
579					dbg("bayes: forget: no msgid from this message has been learnt, skipping message");
580					return 0;
581					}
582					elsif ($isspam) {
583					$self->{store}->nspam_nham_change (-1, 0);
584					}
585					else {
586					$self->{store}->nspam_nham_change (0, -1);
587					}
588
589					my $tokens = $self->tokenize($msg, $msgdata);
590
591					if ($isspam) {
592					$self->{store}->multi_tok_count_change (-1, 0, $tokens);
593					} else {
594					$self->{store}->multi_tok_count_change (0, -1, $tokens);
595					}
596
597					$self->{store}->seen_delete ($msgid);
598					$self->{store}->cleanup();
599
600					$self->{main}->call_plugins("bayes_forget", { toksref => $tokens,
601					isspam => $isspam,
602					msgid => $msgid,
603					});
604
605					1;
606					}
607
608					###########################################################################
609
610					# Plugin hook.
611					sub learner_sync {
612					my ($self, $params) = @_;
613					if (!$self->{conf}->{use_bayes}) { return 0; }
614					dbg("bayes: bayes journal sync starting");
615					$self->{store}->sync($params);
616					dbg("bayes: bayes journal sync completed");
617					}
618
619					###########################################################################
620
621					# Plugin hook.
622					sub learner_expire_old_training {
623					my ($self, $params) = @_;
624					if (!$self->{conf}->{use_bayes}) { return 0; }
625					dbg("bayes: expiry starting");
626					my $timer = $self->{main}->time_method("expire_bayes");
627					$self->{store}->expire_old_tokens($params);
628					dbg("bayes: expiry completed");
629					}
630
631					###########################################################################
632
633					# Plugin hook.
634					# Check to make sure we can tie() the DB, and we have enough entries to do a scan
635					# if we're told the caller will untie(), go ahead and leave the db tied.
					# spent 506µs (26+480) within Mail::SpamAssassin::Plugin::Bayes::learner_is_scan_available which was called:
					# once (26µs+480µs) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm
636					sub learner_is_scan_available {
637	1	2µs			my ($self, $params) = @_;
638
639	1	4µs			return 0 unless $self->{conf}->{use_bayes};
640	1	18µs	1	480µs	return 0 unless $self->{store}->tie_db_readonly(); # spent 480µs making 1 call to Mail::SpamAssassin::BayesStore::DBM::tie_db_readonly
641
642					# We need the DB to stay tied, so if the journal sync occurs, don't untie!
643					my $caller_untie = $self->{main}->{learn_caller_will_untie};
644					$self->{main}->{learn_caller_will_untie} = 1;
645
646					# Do a journal sync if necessary. Do this before the nspam_nham_get()
647					# call since the sync may cause an update in the number of messages
648					# learnt.
649					$self->_opportunistic_calls(1);
650
651					# Reset the variable appropriately
652					$self->{main}->{learn_caller_will_untie} = $caller_untie;
653
654					my ($ns, $nn) = $self->{store}->nspam_nham_get();
655
656					if ($ns < $self->{conf}->{bayes_min_spam_num}) {
657					dbg("bayes: not available for scanning, only $ns spam(s) in bayes DB < ".$self->{conf}->{bayes_min_spam_num});
658					if (!$self->{main}->{learn_caller_will_untie}) {
659					$self->{store}->untie_db();
660					}
661					return 0;
662					}
663					if ($nn < $self->{conf}->{bayes_min_ham_num}) {
664					dbg("bayes: not available for scanning, only $nn ham(s) in bayes DB < ".$self->{conf}->{bayes_min_ham_num});
665					if (!$self->{main}->{learn_caller_will_untie}) {
666					$self->{store}->untie_db();
667					}
668					return 0;
669					}
670
671					return 1;
672					}
673
674					###########################################################################
675
676					sub scan {
677					my ($self, $permsgstatus, $msg) = @_;
678					my $score;
679
680					return unless $self->{conf}->{use_learner};
681
682					# When we're doing a scan, we'll guarantee that we'll do the untie,
683					# so override the global setting until we're done.
684					my $caller_untie = $self->{main}->{learn_caller_will_untie};
685					$self->{main}->{learn_caller_will_untie} = 1;
686
687					goto skip if ($self->{main}->{bayes_scanner}->ignore_message($permsgstatus));
688
689					goto skip unless $self->learner_is_scan_available();
690
691					my ($ns, $nn) = $self->{store}->nspam_nham_get();
692
693					## if ($self->{log_raw_counts}) { # see _compute_prob_for_token()
694					## $self->{raw_counts} = " ns=$ns nn=$nn ";
695					## }
696
697					dbg("bayes: corpus size: nspam = $ns, nham = $nn");
698
699					my $msgtokens;
700					{ my $timer = $self->{main}->time_method('b_tokenize');
701					my $msgdata = $self->_get_msgdata_from_permsgstatus ($permsgstatus);
702					$msgtokens = $self->tokenize($msg, $msgdata);
703					}
704
705					my $tokensdata;
706					{ my $timer = $self->{main}->time_method('b_tok_get_all');
707					$tokensdata = $self->{store}->tok_get_all(keys %{$msgtokens});
708					}
709
710					my $timer_compute_prob = $self->{main}->time_method('b_comp_prob');
711
712					my $probabilities_ref =
713					$self->_compute_prob_for_all_tokens($tokensdata, $ns, $nn);
714
715					my %pw;
716					foreach my $tokendata (@{$tokensdata}) {
717					my $prob = shift(@$probabilities_ref);
718					next unless defined $prob;
719					my ($token, $tok_spam, $tok_ham, $atime) = @{$tokendata};
720					$pw{$token} = {
721					prob => $prob,
722					spam_count => $tok_spam,
723					ham_count => $tok_ham,
724					atime => $atime
725					};
726					}
727
728					my @pw_keys = keys %pw;
729
730					# If none of the tokens were found in the DB, we're going to skip
731					# this message...
732					if (!@pw_keys) {
733					dbg("bayes: cannot use bayes on this message; none of the tokens were found in the database");
734					goto skip;
735					}
736
737					my $tcount_total = keys %{$msgtokens};
738					my $tcount_learned = scalar @pw_keys;
739
740					# Figure out the message receive time (used as atime below)
741					# If the message atime comes back as being in the future, something's
742					# messed up and we should revert to current time as a safety measure.
743					#
744					my $msgatime = $msg->receive_date();
745					my $now = time;
746					$msgatime = $now if ( $msgatime > $now );
747
748					my @touch_tokens;
749					my $tinfo_spammy = $permsgstatus->{bayes_token_info_spammy} = [];
750					my $tinfo_hammy = $permsgstatus->{bayes_token_info_hammy} = [];
751
752					my %tok_strength = map( ($_, abs($pw{$_}->{prob} - 0.5)), @pw_keys);
753					my $log_each_token = (would_log('dbg', 'bayes') > 1);
754
755					# now take the most significant tokens and calculate probs using
756					# Robinson's formula.
757
758					@pw_keys = sort { $tok_strength{$b} <=> $tok_strength{$a} } @pw_keys;
759
760					if (@pw_keys > N_SIGNIFICANT_TOKENS) { $#pw_keys = N_SIGNIFICANT_TOKENS - 1 }
761
762					my @sorted;
763					foreach my $tok (@pw_keys) {
764					next if $tok_strength{$tok} <
765					$Mail::SpamAssassin::Bayes::Combine::MIN_PROB_STRENGTH;
766
767					my $pw_tok = $pw{$tok};
768					my $pw_prob = $pw_tok->{prob};
769
770					# What's more expensive, scanning headers for HAMMYTOKENS and
771					# SPAMMYTOKENS tags that aren't there or collecting data that
772					# won't be used? Just collecting the data is certainly simpler.
773					#
774					my $raw_token = $msgtokens->{$tok} || "(unknown)";
775					my $s = $pw_tok->{spam_count};
776					my $n = $pw_tok->{ham_count};
777					my $a = $pw_tok->{atime};
778
779					push( @{ $pw_prob < 0.5 ? $tinfo_hammy : $tinfo_spammy },
780					[$raw_token, $pw_prob, $s, $n, $a] );
781
782					push(@sorted, $pw_prob);
783
784					# update the atime on this token, it proved useful
785					push(@touch_tokens, $tok);
786
787					if ($log_each_token) {
788					dbg("bayes: token '$raw_token' => $pw_prob");
789					}
790					}
791
792					if (!@sorted || (REQUIRE_SIGNIFICANT_TOKENS_TO_SCORE > 0 &&
793					$#sorted <= REQUIRE_SIGNIFICANT_TOKENS_TO_SCORE))
794					{
795					dbg("bayes: cannot use bayes on this message; not enough usable tokens found");
796					goto skip;
797					}
798
799					$score = Mail::SpamAssassin::Bayes::Combine::combine($ns, $nn, \@sorted);
800					undef $timer_compute_prob; # end a timing section
801
802					# Couldn't come up with a probability?
803					goto skip unless defined $score;
804
805					dbg("bayes: score = $score");
806
807					# no need to call tok_touch_all unless there were significant
808					# tokens and a score was returned
809					# we don't really care about the return value here
810
811					{ my $timer = $self->{main}->time_method('b_tok_touch_all');
812					$self->{store}->tok_touch_all(\@touch_tokens, $msgatime);
813					}
814
815					my $timer_finish = $self->{main}->time_method('b_finish');
816
817					$permsgstatus->{bayes_nspam} = $ns;
818					$permsgstatus->{bayes_nham} = $nn;
819
820					## if ($self->{log_raw_counts}) { # see _compute_prob_for_token()
821					## print "#Bayes-Raw-Counts: $self->{raw_counts}\n";
822					## }
823
824					$self->{main}->call_plugins("bayes_scan", { toksref => $msgtokens,
825					probsref => \%pw,
826					score => $score,
827					msgatime => $msgatime,
828					significant_tokens => \@touch_tokens,
829					});
830
831					skip:
832					if (!defined $score) {
833					dbg("bayes: not scoring message, returning undef");
834					}
835
836					undef $timer_compute_prob; # end a timing section if still running
837					if (!defined $timer_finish) {
838					$timer_finish = $self->{main}->time_method('b_finish');
839					}
840
841					# Take any opportunistic actions we can take
842					if ($self->{main}->{opportunistic_expire_check_only}) {
843					# we're supposed to report on expiry only -- so do the
844					# _opportunistic_calls() run for the journal only.
845					$self->_opportunistic_calls(1);
846					$permsgstatus->{bayes_expiry_due} = $self->{store}->expiry_due();
847					}
848					else {
849					$self->_opportunistic_calls();
850					}
851
852					# Do any cleanup we need to do
853					$self->{store}->cleanup();
854
855					# Reset the value accordingly
856					$self->{main}->{learn_caller_will_untie} = $caller_untie;
857
858					# If our caller won't untie the db, we need to do it.
859					if (!$caller_untie) {
860					$self->{store}->untie_db();
861					}
862
863					$permsgstatus->set_tag ('BAYESTCHAMMY',
864					($tinfo_hammy ? scalar @{$tinfo_hammy} : 0));
865					$permsgstatus->set_tag ('BAYESTCSPAMMY',
866					($tinfo_spammy ? scalar @{$tinfo_spammy} : 0));
867					$permsgstatus->set_tag ('BAYESTCLEARNED', $tcount_learned);
868					$permsgstatus->set_tag ('BAYESTC', $tcount_total);
869
870					$permsgstatus->set_tag ('HAMMYTOKENS', sub {
871					my $pms = shift;
872					$self->bayes_report_make_list
873					($pms, $pms->{bayes_token_info_hammy}, shift);
874					});
875
876					$permsgstatus->set_tag ('SPAMMYTOKENS', sub {
877					my $pms = shift;
878					$self->bayes_report_make_list
879					($pms, $pms->{bayes_token_info_spammy}, shift);
880					});
881
882					$permsgstatus->set_tag ('TOKENSUMMARY', sub {
883					my $pms = shift;
884					if ( defined $pms->{tag_data}{BAYESTC} )
885					{
886					my $tcount_neutral = $pms->{tag_data}{BAYESTCLEARNED}
887					- $pms->{tag_data}{BAYESTCSPAMMY}
888					- $pms->{tag_data}{BAYESTCHAMMY};
889					my $tcount_new = $pms->{tag_data}{BAYESTC}
890					- $pms->{tag_data}{BAYESTCLEARNED};
891					"Tokens: new, $tcount_new; "
892					."hammy, $pms->{tag_data}{BAYESTCHAMMY}; "
893					."neutral, $tcount_neutral; "
894					."spammy, $pms->{tag_data}{BAYESTCSPAMMY}."
895					} else {
896					"Bayes not run.";
897					}
898					});
899
900
901					return $score;
902					}
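					# Illustration (not part of Bayes.pm): scan() ranks tokens by how far their
					# probability sits from 0.5, keeps at most N_SIGNIFICANT_TOKENS of them, and
					# drops any below MIN_PROB_STRENGTH. A rough standalone sketch follows;
					# significant_probs is a hypothetical name, both limits are passed in as
					# arguments rather than read from the plugin's constants, and the name-based
					# tie-break is an addition here to make the ordering deterministic.

```perl
use strict;
use warnings;

# $prob_for maps token => probability; returns the surviving probabilities,
# strongest first.
sub significant_probs {
  my ($prob_for, $max, $min_strength) = @_;
  my %strength = map { $_ => abs($prob_for->{$_} - 0.5) } keys %$prob_for;

  # Strongest evidence first; ties broken by token name (assumption of this sketch).
  my @keys = sort { $strength{$b} <=> $strength{$a} || $a cmp $b } keys %strength;

  # Truncate to the $max strongest tokens, as scan() does via $#pw_keys.
  $#keys = $max - 1 if @keys > $max;

  return [ map { $prob_for->{$_} } grep { $strength{$_} >= $min_strength } @keys ];
}

my $probs = significant_probs(
  { free => 0.99, meeting => 0.02, the => 0.51, offer => 0.90 }, 3, 0.3);
print join(', ', @$probs), "\n";
```

					# The resulting list corresponds to @sorted, which scan() hands to
					# Mail::SpamAssassin::Bayes::Combine::combine.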
903
904					###########################################################################
905
906					# Plugin hook.
907					sub learner_dump_database {
908					my ($self, $params) = @_;
909					my $magic = $params->{magic};
910					my $toks = $params->{toks};
911					my $regex = $params->{regex};
912
913					# allow dump to occur even if use_bayes disables everything else ...
914					#return 0 unless $self->{conf}->{use_bayes};
915					return 0 unless $self->{store}->tie_db_readonly();
916
917					my @vars = $self->{store}->get_storage_variables();
918
919					my($sb,$ns,$nh,$nt,$le,$oa,$bv,$js,$ad,$er,$na) = @vars;
920
921					my $template = '%3.3f %10u %10u %10u %s'."\n";
922
923					if ( $magic ) {
924					printf($template, 0.0, 0, $bv, 0, 'non-token data: bayes db version')
925					or die "Error writing: $!";
926					printf($template, 0.0, 0, $ns, 0, 'non-token data: nspam')
927					or die "Error writing: $!";
928					printf($template, 0.0, 0, $nh, 0, 'non-token data: nham')
929					or die "Error writing: $!";
930					printf($template, 0.0, 0, $nt, 0, 'non-token data: ntokens')
931					or die "Error writing: $!";
932					printf($template, 0.0, 0, $oa, 0, 'non-token data: oldest atime')
933					or die "Error writing: $!";
934					if ( $bv >= 2 ) {
935					printf($template, 0.0, 0, $na, 0, 'non-token data: newest atime')
936					or die "Error writing: $!";
937					}
938					if ( $bv < 2 ) {
939					printf($template, 0.0, 0, $sb, 0, 'non-token data: current scan-count')
940					or die "Error writing: $!";
941					}
942					if ( $bv >= 2 ) {
943					printf($template, 0.0, 0, $js, 0, 'non-token data: last journal sync atime')
944					or die "Error writing: $!";
945					}
946					printf($template, 0.0, 0, $le, 0, 'non-token data: last expiry atime')
947					or die "Error writing: $!";
948					if ( $bv >= 2 ) {
949					printf($template, 0.0, 0, $ad, 0, 'non-token data: last expire atime delta')
950					or die "Error writing: $!";
951
952					printf($template, 0.0, 0, $er, 0, 'non-token data: last expire reduction count')
953					or die "Error writing: $!";
954					}
955					}
956
957					if ( $toks ) {
958					# let the store sort out the db_toks
959					$self->{store}->dump_db_toks($template, $regex, @vars);
960					}
961
962					if (!$self->{main}->{learn_caller_will_untie}) {
963					$self->{store}->untie_db();
964					}
965					return 1;
966					}
967
968					###########################################################################
969					# TODO: these are NOT public, but the test suite needs to call them.
970
					# spent 306ms (125+181) within Mail::SpamAssassin::Plugin::Bayes::get_msgid which was called 555 times, avg 552µs/call:
					# 321 times (63.9ms+106ms) by Mail::SpamAssassin::Plugin::TxRep::check_senders_reputation at line 1241 of Mail/SpamAssassin/Plugin/TxRep.pm, avg 528µs/call
					# 234 times (61.6ms+75.0ms) by Mail::SpamAssassin::Plugin::Bayes::_learn_trapped at line 417, avg 584µs/call
971					sub get_msgid {
972	555	1.38ms			my ($self, $msg) = @_;
973
974	555	1.16ms			my @msgid;
975
976	555	5.88ms	555	58.0ms	my $msgid = $msg->get_header("Message-Id"); # spent 58.0ms making 555 calls to Mail::SpamAssassin::Message::Node::get_header, avg 105µs/call
977	555	9.85ms	530	3.90ms	if (defined $msgid && $msgid ne '' && $msgid !~ /^\s*<\s*(?:\@sa_generated)?>.*$/) { # spent 3.90ms making 530 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 7µs/call
978					# remove \r and < and > prefix/suffixes
979	530	2.38ms			chomp $msgid;
980	1060	28.8ms	1060	8.50ms	$msgid =~ s/^<//; $msgid =~ s/>.*$//g; # spent 8.50ms making 1060 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 8µs/call
981	530	2.02ms			push(@msgid, $msgid);
982					}
983
984					# Modified 2012-01-17 per bug 5185 to remove last received from msg_id calculation
985
986					# Use sha1_hex(Date: and top N bytes of body)
987					# where N is MIN(1024 bytes, 1/2 of body length)
988					#
989	555	5.23ms	555	67.1ms	my $date = $msg->get_header("Date"); # spent 67.1ms making 555 calls to Mail::SpamAssassin::Message::Node::get_header, avg 121µs/call
990	555	1.91ms			$date = "None" if (!defined $date || $date eq ''); # No Date?
991
992					#Removed per bug 5185
993					#my @rcvd = $msg->get_header("Received");
994					#my $rcvd = $rcvd[$#rcvd];
995					#$rcvd = "None" if (!defined $rcvd || $rcvd eq ''); # No Received?
996
997					# Make a copy since pristine_body is a reference ...
998	555	21.5ms	555	6.21ms	my $body = join('', $msg->get_pristine_body()); # spent 6.21ms making 555 calls to Mail::SpamAssassin::Message::get_pristine_body, avg 11µs/call
999
1000	555	2.92ms			if (length($body) > 64) { # Small Body?
1001	555	2.42ms			my $keep = ( length $body > 2048 ? 1024 : int(length($body) / 2) );
1002	555	3.08ms			substr($body, $keep) = '';
1003					}
1004
1005					#Stripping all CR and LF so that testing midstream from MTA and post delivery don't
1006					#generate different id's simply because of LF<->CR<->CRLF changes.
1007	555	51.7ms	555	24.2ms	$body =~ s/[\r\n]//g; # spent 24.2ms making 555 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 44µs/call
1008
1009	555	21.9ms	555	12.7ms	unshift(@msgid, sha1_hex($date."\000".$body).'@sa_generated'); # spent 12.7ms making 555 calls to Digest::SHA::sha1_hex, avg 23µs/call
1010
1011	555	6.86ms			return wantarray ? @msgid : $msgid[0];
1012					}
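					# Illustration (not part of Bayes.pm): the fallback ID built in get_msgid is
					# a SHA-1 over the Date header and a truncated, newline-stripped body prefix.
					# A minimal standalone sketch, assuming plain byte strings as input;
					# generated_msgid is a hypothetical name.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

sub generated_msgid {
  my ($date, $body) = @_;
  $date = "None" if !defined $date || $date eq '';

  # Bodies longer than 64 bytes are cut to half their length, capped at 1024 bytes.
  if (length($body) > 64) {
    my $keep = length($body) > 2048 ? 1024 : int(length($body) / 2);
    substr($body, $keep) = '';
  }

  # Drop CR and LF so line-ending rewrites in transit do not change the ID.
  $body =~ s/[\r\n]//g;

  return sha1_hex($date . "\000" . $body) . '@sa_generated';
}

my $date    = "Mon, 1 Jan 2024 00:00:00 +0000";
my $id_crlf = generated_msgid($date, "short body\r\n");
my $id_lf   = generated_msgid($date, "short body\n");
print "$id_crlf\n";
```

					# Truncation runs before the CR/LF strip, so for bodies above 64 bytes the
					# two line-ending variants can be cut at different points.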
1013
					# spent 6.25s (53.1ms+6.19) within Mail::SpamAssassin::Plugin::Bayes::get_body_from_msg which was called 234 times, avg 26.7ms/call:
					# 234 times (53.1ms+6.19s) by Mail::SpamAssassin::Plugin::Bayes::learn_message at line 377, avg 26.7ms/call
1014					sub get_body_from_msg {
1015	234	520µs			my ($self, $msg) = @_;
1016
1017	234	933µs			if (!ref $msg) {
1018					# I have no idea why this seems to happen. TODO
1019					warn "bayes: msg not a ref: '$msg'";
1020					return { };
1021					}
1022
1023					my $permsgstatus =
1024	234	3.15ms	234	69.0ms	Mail::SpamAssassin::PerMsgStatus->new($self->{main}, $msg); # spent 69.0ms making 234 calls to Mail::SpamAssassin::PerMsgStatus::new, avg 295µs/call
1025	234	2.59ms	234	2.41ms	$msg->extract_message_metadata ($permsgstatus); # spent 2.41ms making 234 calls to Mail::SpamAssassin::Message::extract_message_metadata, avg 10µs/call
1026	234	2.16ms	234	6.07s	my $msgdata = $self->_get_msgdata_from_permsgstatus ($permsgstatus); # spent 6.07s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::_get_msgdata_from_permsgstatus, avg 25.9ms/call
1027	234	2.12ms	234	46.7ms	$permsgstatus->finish(); # spent 46.7ms making 234 calls to Mail::SpamAssassin::PerMsgStatus::finish, avg 200µs/call
1028
1029	234	528µs			if (!defined $msgdata) {
1030					# why?!
1031					warn "bayes: failed to get body for ".scalar($self->get_msgid($msg))."\n";
1032					return { };
1033					}
1034
1035	234	4.16ms	234	8.52ms	return $msgdata; # spent 8.52ms making 234 calls to Mail::SpamAssassin::PerMsgStatus::DESTROY, avg 36µs/call
1036					}
1037
					# spent 6.07s (16.1ms+6.05) within Mail::SpamAssassin::Plugin::Bayes::_get_msgdata_from_permsgstatus which was called 234 times, avg 25.9ms/call:
					# 234 times (16.1ms+6.05s) by Mail::SpamAssassin::Plugin::Bayes::get_body_from_msg at line 1026, avg 25.9ms/call
1038					sub _get_msgdata_from_permsgstatus {
1039	234	475µs			my ($self, $pms) = @_;
1040
1041	234	922µs			my $t_src = $self->{conf}->{bayes_token_sources};
1042	234	644µs			my $msgdata = { };
1043					$msgdata->{bayes_token_body} =
1044	234	3.19ms	234	248ms	$pms->{msg}->get_visible_rendered_body_text_array() if $t_src->{visible}; # spent 248ms making 234 calls to Mail::SpamAssassin::Message::get_visible_rendered_body_text_array, avg 1.06ms/call
1045					$msgdata->{bayes_token_inviz} =
1046	234	2.72ms	234	106ms	$pms->{msg}->get_invisible_rendered_body_text_array() if $t_src->{invisible}; # spent 106ms making 234 calls to Mail::SpamAssassin::Message::get_invisible_rendered_body_text_array, avg 453µs/call
1047					$msgdata->{bayes_mimepart_digests} =
1048	234	489µs			$pms->{msg}->get_mimepart_digests() if $t_src->{mimepart};
1049	234	751µs			@{$msgdata->{bayes_token_uris}} =
1050	234	3.97ms	234	5.70s	$pms->get_uri_list() if $t_src->{uri}; # spent 5.70s making 234 calls to Mail::SpamAssassin::PerMsgStatus::get_uri_list, avg 24.3ms/call
1051	234	2.13ms			return $msgdata;
1052					}
1053
1054					###########################################################################
1055
1056					# The calling functions expect a uniq'ed array of tokens ...
					# spent 31.0s (3.09+27.9) within Mail::SpamAssassin::Plugin::Bayes::tokenize which was called 234 times, avg 132ms/call:
					# 234 times (3.09s+27.9s) by Mail::SpamAssassin::Plugin::Bayes::_learn_trapped at line 473, avg 132ms/call
1057					sub tokenize {
1058	234	607µs			my ($self, $msg, $msgdata) = @_;
1059
1060	234	1.10ms			my $t_src = $self->{conf}->{bayes_token_sources};
1061	234	503µs			my @tokens;
1062
1063					# visible tokens from the body
1064	234	2.41ms			if ($msgdata->{bayes_token_body}) {
1065					my(@t) = map($self->_tokenize_line ($_, '', 1),
1066	468	115ms	4456	12.1s	@{$msgdata->{bayes_token_body}} ); # spent 12.1s making 4456 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 2.71ms/call
1067	234	2.29ms	234	2.31ms	dbg("bayes: tokenized body: %d tokens", scalar @t); # spent 2.31ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 10µs/call
1068	234	55.4ms			push(@tokens, @t);
1069					}
1070					# the URI list
1071	234	1.64ms			if ($msgdata->{bayes_token_uris}) {
1072					my(@t) = map($self->_tokenize_line ($_, '', 2),
1073	468	33.7ms	2708	3.29s	@{$msgdata->{bayes_token_uris}} ); # spent 3.29s making 2708 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 1.21ms/call
1074	234	1.78ms	234	1.64ms	dbg("bayes: tokenized uri: %d tokens", scalar @t); # spent 1.64ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 7µs/call
1075	234	7.17ms			push(@tokens, @t);
1076					}
1077					# add invisible tokens
1078	234	1.14ms			if ($msgdata->{bayes_token_inviz}) {
1079	234	455µs			my $tokprefix;
1080	468	1.45ms			if (ADD_INVIZ_TOKENS_I_PREFIX) { $tokprefix = 'I*:' }
1081					if (ADD_INVIZ_TOKENS_NO_PREFIX) { $tokprefix = '' }
1082	234	995µs			if (defined $tokprefix) {
1083					my(@t) = map($self->_tokenize_line ($_, $tokprefix, 1),
1084	468	3.54ms	53	584ms	@{$msgdata->{bayes_token_inviz}} ); # spent 584ms making 53 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 11.0ms/call
1085	234	1.60ms	234	1.41ms	dbg("bayes: tokenized invisible: %d tokens", scalar @t); # spent 1.41ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 6µs/call
1086	234	1.36ms			push(@tokens, @t);
1087					}
1088					}
1089
1090					# add digests and Content-Type of all MIME parts
1091	234	603µs			if ($msgdata->{bayes_mimepart_digests}) {
1092					my %shorthand = ( # some frequent MIME part contents for human readability
1093					'da39a3ee5e6b4b0d3255bfef95601890afd80709:text/plain'=> 'Empty-Plaintext',
1094					'da39a3ee5e6b4b0d3255bfef95601890afd80709:text/html' => 'Empty-HTML',
1095					'da39a3ee5e6b4b0d3255bfef95601890afd80709:text/xml' => 'Empty-XML',
1096					'adc83b19e793491b1c6ea0fd8b46cd9f32e592fc:text/plain'=> 'OneNL-Plaintext',
1097					'adc83b19e793491b1c6ea0fd8b46cd9f32e592fc:text/html' => 'OneNL-HTML',
1098					'71853c6197a6a7f222db0f1978c7cb232b87c5ee:text/plain'=> 'TwoNL-Plaintext',
1099					'71853c6197a6a7f222db0f1978c7cb232b87c5ee:text/html' => 'TwoNL-HTML',
1100					);
1101					my(@t) = map('MIME:' . ($shorthand{$_} || $_),
1102					@{ $msgdata->{bayes_mimepart_digests} });
1103					dbg("bayes: tokenized mime parts: %d tokens", scalar @t);
1104					dbg("bayes: mime-part token %s", $_) for @t;
1105					push(@tokens, @t);
1106					}
1107
1108					# Tokenize the headers
1109	234	2.17ms			if ($t_src->{header}) {
1110	234	480µs			my(@t);
1111	234	7.44ms	234	2.99s	my %hdrs = $self->_tokenize_headers ($msg); # spent 2.99s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers, avg 12.8ms/call
1112	234	57.1ms			while( my($prefix, $value) = each %hdrs ) {
1113	5605	89.2ms	5605	7.60s	push(@t, $self->_tokenize_line ($value, "H$prefix:", 0)); # spent 7.60s making 5605 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 1.36ms/call
1114					}
1115	234	1.97ms	234	2.04ms	dbg("bayes: tokenized header: %d tokens", scalar @t); # spent 2.04ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 9µs/call
1116	234	39.6ms			push(@tokens, @t);
1117					}
1118
1119					# Go ahead and uniq the array, skip null tokens (can happen sometimes)
1120					# generate an SHA1 hash and take the lower 40 bits as our token
1121	234	714µs			my %tokens;
1122	234	1.08ms			foreach my $token (@tokens) {
1123					# skip empty tokens
1124	159813	3.85s	155799	1.32s	$tokens{substr(sha1($token), -5)} = $token if $token ne ''; # spent 1.32s making 155799 calls to Digest::SHA::sha1, avg 8µs/call
1125					}
1126
1127					# return the keys == tokens ...
1128	234	46.9ms			return \%tokens;
1129					}
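
The final loop of tokenize (line 1124) keys every token by the last five bytes (40 bits) of its binary SHA-1 digest, which de-duplicates the list and drops empty tokens in one pass. A minimal standalone sketch of that step follows, assuming only the core Digest::SHA module. The helper name hash_tokens and the sample tokens are hypothetical and do not appear in Bayes.pm.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1);

# Hypothetical helper (not part of Bayes.pm): map each non-empty token to
# the last 5 bytes (40 bits) of its binary SHA-1 digest.
sub hash_tokens {
    my @tokens = @_;
    my %hashed;
    for my $token (@tokens) {
        next if $token eq '';                        # skip empty tokens
        $hashed{ substr(sha1($token), -5) } = $token;
    }
    return \%hashed;
}

# Duplicates collapse onto the same key, so only unique tokens survive.
my $h = hash_tokens('viagra', 'HTo:D*net', 'viagra', '');
printf "%s => %s\n", unpack('H*', $_), $h->{$_} for sort keys %$h;
```

Because the hash is keyed on the truncated digest, two different tokens that share their last 40 bits would overwrite each other; the sketch keeps that behaviour rather than guarding against it.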
1130
					# spent 23.5s (17.3+6.23) within Mail::SpamAssassin::Plugin::Bayes::_tokenize_line which was called 12822 times, avg 1.84ms/call:
					# 5605 times (5.50s+2.10s) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1113, avg 1.36ms/call
					# 4456 times (8.86s+3.21s) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1066, avg 2.71ms/call
					# 2708 times (2.51s+782ms) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1073, avg 1.21ms/call
					# 53 times (450ms+134ms) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1084, avg 11.0ms/call
1131					sub _tokenize_line {
1132	12822	23.5ms			my $self = $_[0];
1133	12822	25.0ms			my $tokprefix = $_[2];
1134	12822	21.7ms			my $region = $_[3];
1135	12822	97.7ms			local ($_) = $_[1];
1136
1137	12822	20.5ms			my @rettokens;
1138
1139					# include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
1140					# and ISO-8859-15 alphas. Do not split on @'s; better results keeping it.
1141					# Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
1142
1143					### (previous:) tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
1144
1145					### (now): see Bug 7130 for rationale (slower, but makes UTF-8 chars atomic)
1146	12822	2.97s	210534	881ms	s{ ( [A-Za-z0-9,@*!_'"\$. -]+ | # spent 805ms making 197712 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 4µs/call # spent 75.4ms making 12822 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 6µs/call
1147
1148					[\xE0-\xEF][\x80-\xBF]{2} |
1149					[\xF0-\xF4][\x80-\xBF]{3} |
1150					[\xA1-\xFF] ) | . }
1151	185209	746ms			{ defined $1 ? $1 : ' ' }xsge;
1152					# should we also turn NBSP ( \xC2\xA0 ) into space?
1153
1154					# DO split on "..." or "--" or "---"; common formatting error resulting in
1155					# hapaxes. Keep the separator itself as a token, though, as long ones can
1156					# be good spamsigns.
1157	12822	142ms	12908	43.1ms	s/(\w)(\.{3,6})(\w)/$1 $2 $3/gs; # spent 42.5ms making 12822 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 3µs/call # spent 600µs making 86 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 7µs/call
1158	12822	164ms	12862	30.2ms	s/(\w)(\-{2,6})(\w)/$1 $2 $3/gs; # spent 30.0ms making 12822 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 2µs/call # spent 218µs making 40 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 5µs/call
1159
1160	12822	45.6ms			if (IGNORE_TITLE_CASE) {
1161	12822	36.8ms			if ($region == 1 || $region == 2) {
1162					# lower-case Title Case at start of a full-stop-delimited line (as would
1163					# be seen in a Western language).
1164	11448	424ms	14579	223ms	s/(?:^|\.\s+)([A-Z])([^A-Z]+)(?:\s|$)/ ' '. (lc $1) . $2 . ' ' /ge; # spent 137ms making 7217 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 19µs/call # spent 85.3ms making 7362 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 12µs/call
1165					}
1166					}
1167
1168	12822	95.9ms	12822	130ms	my $magic_re = $self->{store}->get_magic_re(); # spent 130ms making 12822 calls to Mail::SpamAssassin::BayesStore::DBM::get_magic_re, avg 10µs/call
1169
1170					# Note that split() in scope of 'use bytes' results in words with utf8 flag
1171					# cleared, even if the source string has perl characters semantics !!!
1172					# Is this really still desirable?
1173
1174	12822	325ms			foreach my $token (split) {
1175	158560	2.16s	158560	757ms	$token =~ s/^[-'"\.,]+//; # trim non-alphanum chars at start or end # spent 757ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 5µs/call
1176	158560	2.04s	158560	755ms	$token =~ s/[-'"\.,]+$//; # so we don't get loads of '"foo' tokens # spent 755ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 5µs/call
1177
1178					# Skip false magic tokens
1179					# TVD: we need to do a defined() check since SQL doesn't have magic
1180					# tokens, so the SQL BayesStore returns undef. I really want a way
1181					# of optimizing that out, but I haven't come up with anything yet.
1182					#
1183	158560	3.62s	317120	1.08s	next if ( defined $magic_re && $token =~ /$magic_re/ ); # spent 771ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp, avg 5µs/call # spent 306ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 2µs/call
1184
1185					# do keep 3-byte tokens; there's some solid signs in there
1186	158560	394ms			my $len = length($token);
1187
1188					# but extend the stop-list. These are squarely in the gray
1189					# area, and it just slows us down to record them.
1190					# See http://wiki.apache.org/spamassassin/BayesStopList for more info.
1191					#
1192	158560	2.20s	126321	874ms	next if $len < 3 || # spent 874ms making 126321 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 7µs/call
1193					($token =~ /^(?:a(?:ble|l(?:ready|l)|n[dy]|re)|b(?:ecause|oth)|c(?:an|ome)|e(?:ach|mail|ven)|f(?:ew|irst|or|rom)|give|h(?:a(?:ve|s)|ttp)|i(?:n(?:formation|to)|t\'s)|just|know|l(?:ike|o(?:ng|ok))|m(?:a(?:de|il(?:(?:ing|to))?|ke|ny)|o(?:re|st)|uch)|n(?:eed|o[tw]|umber)|o(?:ff|n(?:ly|e)|ut|wn)|p(?:eople|lace)|right|s(?:ame|ee|uch)|t(?:h(?:at|is|rough|e)|ime)|using|w(?:eb|h(?:ere|y)|ith(?:out)?|or(?:ld|k))|y(?:ears?|ou(?:(?:\'re|r))?))$/i);
1194
1195					# are we in the body? If so, apply some body-specific breakouts
1196	109800	292ms			if ($region == 1 || $region == 2) {
1197	64228	1.39s	128048	271ms	if (CHEW_BODY_MAILADDRS && $token =~ /\S\@\S/i) { # spent 271ms making 128048 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 2µs/call
1198	408	4.21ms	408	30.7ms	push (@rettokens, $self->_tokenize_mail_addrs ($token)); # spent 30.7ms making 408 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs, avg 75µs/call
1199					}
1200					elsif (CHEW_BODY_URIS && $token =~ /\S\.[a-z]/i) {
1201	5242	31.5ms			push (@rettokens, "UD:".$token); # the full token
1202	10484	107ms	5242	38.6ms	my $bit = $token; while ($bit =~ s/^[^\.]+\.(.+)$/$1/gs) { # spent 38.6ms making 5242 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
1203	8956	188ms	8956	37.5ms	push (@rettokens, "UD:".$1); # UD = URL domain # spent 37.5ms making 8956 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 4µs/call
1204					}
1205					}
1206					}
1207
1208					# note: do not trim down overlong tokens if they contain '*'. This is
1209					# used as part of split tokens such as "HTo:D*net" indicating that
1210					# the domain ".net" appeared in the To header.
1211					#
1212	109800	372ms	18366	42.6ms	if ($len > MAX_TOKEN_LENGTH && $token !~ /\*/) { # spent 42.6ms making 18366 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 2µs/call
1213
1214	17309	221ms	17309	75.5ms	if (TOKENIZE_LONG_8BIT_SEQS_AS_UTF8_CHARS && $token =~ /[\x80-\xBF]{2}/) { # spent 75.5ms making 17309 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
1215					# Bug 7135
1216					# collect 3- and 4-byte UTF-8 sequences, ignore 2-byte sequences
1217	9	333µs	9	174µs	my(@t) = $token =~ /( (?: [\xE0-\xEF] | [\xF0-\xF4][\x80-\xBF] ) # spent 174µs making 9 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 19µs/call
1218					[\x80-\xBF]{2} )/xsg;
1219	9	20µs			if (@t) {
1220	9	197µs			push (@rettokens, map('u8:'.$_, @t));
1221	9	47µs			next;
1222					}
1223					}
1224
1225					if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xa0-\xff]{2}/) {
1226					# Matt sez: "Could be asian? Autrijus suggested doing character ngrams,
1227					# but I'm doing tuples to keep the dbs small(er)." Sounds like a plan
1228					# to me! (jm)
1229					while ($token =~ s/^(..?)//) {
1230					push (@rettokens, "8:$1");
1231					}
1232					next;
1233					}
1234
1235	17300	73.7ms			if (($region == 0 && HDRS_TOKENIZE_LONG_TOKENS_AS_SKIPS)
1236					|| ($region == 1 && BODY_TOKENIZE_LONG_TOKENS_AS_SKIPS)
1237					|| ($region == 2 && URIS_TOKENIZE_LONG_TOKENS_AS_SKIPS))
1238					{
1239					# if (TOKENIZE_LONG_TOKENS_AS_SKIPS)
1240					# Spambayes trick via Matt: Just retain 7 chars. Do not retain the
1241					# length, it does not help; see jm's mail to -devel on Nov 20 2002 at
1242					# http://sourceforge.net/p/spamassassin/mailman/message/12977605/
1243					# "sk:" stands for "skip".
1244					# Bug 7141: retain seven UTF-8 chars (or other bytes),
1245					# if followed by at least two bytes
1246	11544	558ms	34632	240ms	$token =~ s{ ^ ( (?> (?: [\x00-\x7F\xF5-\xFF] | # spent 120ms making 23088 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 5µs/call # spent 120ms making 11544 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 10µs/call
1247					[\xC0-\xDF][\x80-\xBF] |
1248					[\xE0-\xEF][\x80-\xBF]{2} |
1249					[\xF0-\xF4][\x80-\xBF]{3} | . ){7} ))
1250					.{2,} \z }{sk:$1}xs;
1251					## (was:) $token = "sk:".substr($token, 0, 7); # seven bytes
1252					}
1253					}
1254
1255					# decompose tokens? do this after shortening long tokens
1256	109791	284ms			if ($region == 1 || $region == 2) {
1257	64219	210ms			if (DECOMPOSE_BODY_TOKENS) {
1258	64219	875ms	64219	187ms	if ($token =~ /[^\w:\*]/) { # spent 187ms making 64219 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 3µs/call
1259	15418	42.2ms			my $decompd = $token; # "Foo!"
1260	15418	318ms	15418	170ms	$decompd =~ s/[^\w:\*]//gs; # spent 170ms making 15418 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 11µs/call
1261	15418	84.5ms			push (@rettokens, $tokprefix.$decompd); # "Foo"
1262					}
1263
1264	64219	857ms	64219	279ms	if ($token =~ /[A-Z]/) { # spent 279ms making 64219 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
1265	34320	109ms			my $decompd = $token; $decompd = lc $decompd;
1266	17160	132ms			push (@rettokens, $tokprefix.$decompd); # "foo!"
1267
1268	17160	222ms	17160	67.3ms	if ($token =~ /[^\w:\*]/) { # spent 67.3ms making 17160 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
1269	1950	30.0ms	1950	18.0ms	$decompd =~ s/[^\w:\*]//gs; # spent 18.0ms making 1950 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 9µs/call
1270	1950	11.1ms			push (@rettokens, $tokprefix.$decompd); # "foo"
1271					}
1272					}
1273					}
1274					}
1275
1276	109791	1.13s			push (@rettokens, $tokprefix.$token);
1277					}
1278
1279	12822	296ms			return @rettokens;
1280					}
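
The DECOMPOSE_BODY_TOKENS branch of _tokenize_line (lines 1257 to 1273) emits up to three extra variants of a body token before the token itself is pushed: punctuation stripped, lower-cased, and both. The sketch below isolates that logic as a standalone function so the output order is easy to see. The function name decompose_token is hypothetical; the region check and the earlier trimming, stop-list, and long-token steps are deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the DECOMPOSE_BODY_TOKENS branch.
# It assumes the token already passed the trimming and stop-list steps.
sub decompose_token {
    my ($token, $tokprefix) = @_;
    $tokprefix = '' unless defined $tokprefix;
    my @out;

    if ($token =~ /[^\w:\*]/) {
        (my $stripped = $token) =~ s/[^\w:\*]//g;    # "Foo!" -> "Foo"
        push @out, $tokprefix . $stripped;
    }
    if ($token =~ /[A-Z]/) {
        my $lower = lc $token;                       # "Foo!" -> "foo!"
        push @out, $tokprefix . $lower;
        if ($token =~ /[^\w:\*]/) {
            $lower =~ s/[^\w:\*]//g;                 # "foo!" -> "foo"
            push @out, $tokprefix . $lower;
        }
    }
    push @out, $tokprefix . $token;                  # the original is always kept
    return @out;
}

print join(' ', decompose_token('Foo!')), "\n";     # prints: Foo foo! foo Foo!
```

A single mixed-case token with punctuation therefore becomes four tokens, which helps explain why the per-token lines in this loop dominate the exclusive time of _tokenize_line.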
1281
					# spent 2.99s (1.18+1.81) within Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers which was called 234 times, avg 12.8ms/call:
					# 234 times (1.18s+1.81s) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1111, avg 12.8ms/call
1282					sub _tokenize_headers {
1283	234	568µs			my ($self, $msg) = @_;
1284
1285	234	490µs			my %parsed;
1286
1287					my %user_ignore;
1288	468	204ms			$user_ignore{lc $_} = 1 for @{$self->{main}->{conf}->{bayes_ignore_headers}};
1289
1290					# get headers in array context
1291	234	465µs			my @hdrs;
1292					my @rcvdlines;
1293	234	18.6ms	234	1.10s	for ($msg->get_all_headers()) { # spent 1.10s making 234 calls to Mail::SpamAssassin::Message::Node::get_all_headers, avg 4.69ms/call
1294					# first, keep a copy of Received headers, so we can strip down to last 2
1295	7410	82.5ms	7410	21.3ms	if (/^Received:/i) { # spent 21.3ms making 7410 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 3µs/call
1296	1131	5.69ms			push(@rcvdlines, $_);
1297	1131	2.24ms			next;
1298					}
1299					# and now skip lines for headers we don't want (including all Received)
1300	6279	201ms	12558	98.8ms	next if /^${IGNORED_HDRS}:/i; # spent 72.1ms making 6279 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 11µs/call # spent 26.7ms making 6279 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp, avg 4µs/call
1301					next if IGNORE_MSGID_TOKENS && /^Message-ID:/i;
1302	4124	33.0ms			push(@hdrs, $_);
1303					}
1304	234	3.60ms	234	27.2ms	push(@hdrs, $msg->get_all_metadata()); # spent 27.2ms making 234 calls to Mail::SpamAssassin::Message::get_all_metadata, avg 116µs/call
1305
1306					# and re-add the last 2 received lines: usually a good source of
1307					# spamware tokens and HELO names.
1308	468	2.21ms			if ($#rcvdlines >= 0) { push(@hdrs, $rcvdlines[$#rcvdlines]); }
1309	468	1.97ms			if ($#rcvdlines >= 1) { push(@hdrs, $rcvdlines[$#rcvdlines-1]); }
1310
1311	234	2.27ms			for (@hdrs) {
1312	5528	70.9ms	5528	22.7ms	next unless /\S/; # spent 22.7ms making 5528 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
1313	5528	51.8ms			my ($hdr, $val) = split(/:/, $_, 2);
1314
1315					# remove user-specified headers here, after Received, in case they
1316					# want to ignore that too
1317	5528	16.7ms			next if exists $user_ignore{lc $hdr};
1318
1319					# Prep the header value
1320	5374	9.49ms			$val ||= '';
1321	5374	12.7ms			chomp($val);
1322
1323					# special tokenization for some headers:
1324	5374	213ms	17551	86.0ms	if ($hdr =~ /^(?:|X-|Resent-)Message-Id$/i) { # spent 71.6ms making 14037 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 5µs/call # spent 14.4ms making 3514 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp, avg 4µs/call
1325	225	2.14ms	225	16.1ms	$val = $self->_pre_chew_message_id ($val); # spent 16.1ms making 225 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id, avg 72µs/call
1326					}
1327					elsif (PRE_CHEW_ADDR_HEADERS && $hdr =~ /^(?:|X-|Resent-)
1328					(?:Return-Path|From|To|Cc|Reply-To|Errors-To|Mail-Followup-To|Sender)$/ix)
1329					{
1330	758	6.14ms	758	194ms	$val = $self->_pre_chew_addr_header ($val); # spent 194ms making 758 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header, avg 257µs/call
1331					}
1332					elsif ($hdr eq 'Received') {
1333	468	4.10ms	468	98.8ms	$val = $self->_pre_chew_received ($val); # spent 98.8ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received, avg 211µs/call
1334					}
1335					elsif ($hdr eq 'Content-Type') {
1336	222	2.05ms	222	28.1ms	$val = $self->_pre_chew_content_type ($val); # spent 28.1ms making 222 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type, avg 127µs/call
1337					}
1338					elsif ($hdr eq 'MIME-Version') {
1339	187	2.33ms	187	1.10ms	$val =~ s/1\.0//; # totally innocuous # spent 1.10ms making 187 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 6µs/call
1340					}
1341					elsif ($hdr =~ /^${MARK_PRESENCE_ONLY_HDRS}$/i) {
1342	224	571µs			$val = "1"; # just mark the presence, they create lots of hapaxen
1343					}
1344
1345	5374	27.7ms			if (MAP_HEADERS_MID) {
1346	5374	91.9ms	5374	20.6ms	if ($hdr =~ /^(?:In-Reply-To|References|Message-ID)$/i) { # spent 20.6ms making 5374 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
1347	237	928µs			$parsed{"*MI"} = $val;
1348					}
1349					}
1350	5374	16.8ms			if (MAP_HEADERS_FROMTOCC) {
1351	5374	70.6ms	5374	19.9ms	if ($hdr =~ /^(?:From|To|Cc)$/i) { # spent 19.9ms making 5374 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call
1352	435	1.49ms			$parsed{"*Ad"} = $val;
1353					}
1354					}
1355	5374	17.0ms			if (MAP_HEADERS_USERAGENT) {
1356	5374	70.3ms	5374	17.4ms	if ($hdr =~ /^(?:X-Mailer|User-Agent)$/i) { # spent 17.4ms making 5374 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 3µs/call
1357	64	272µs			$parsed{"*UA"} = $val;
1358					}
1359					}
1360
1361					# replace hdr name with "compressed" version if possible
1362	5374	34.8ms			if (defined $HEADER_NAME_COMPRESSION{$hdr}) {
1363	2009	8.50ms			$hdr = $HEADER_NAME_COMPRESSION{$hdr};
1364					}
1365
1366	5374	24.2ms			if (exists $parsed{$hdr}) {
1367	288	2.46ms			$parsed{$hdr} .= " ".$val;
1368					} else {
1369	5086	38.6ms			$parsed{$hdr} = $val;
1370					}
1371	5374	51.6ms	5374	59.7ms	if (would_log('dbg', 'bayes') > 1) { # spent 59.7ms making 5374 calls to Mail::SpamAssassin::Logger::would_log, avg 11µs/call
1372					dbg("bayes: header tokens for $hdr = \"$parsed{$hdr}\"");
1373					}
1374					}
1375
1376	234	32.8ms			return %parsed;
1377					}
1378
					# spent 28.1ms (14.7+13.4) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type which was called 222 times, avg 127µs/call:
					# 222 times (14.7ms+13.4ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1336, avg 127µs/call
1379					sub _pre_chew_content_type {
1380	222	908µs			my ($self, $val) = @_;
1381
1382					# hopefully this will retain good bits without too many hapaxen
1383	222	4.54ms	222	2.45ms	if ($val =~ s/boundary=[\"\'](.*?)[\"\']/ /ig) { # spent 2.45ms making 222 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 11µs/call
1384	173	631µs			my $boundary = $1;
1385	173	407µs			$boundary = '' if !defined $boundary; # avoid a warning
1386	173	7.11ms	173	5.24ms	$boundary =~ s/[a-fA-F0-9]/H/gs; # spent 5.24ms making 173 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 30µs/call
1387					# break up blocks of separator chars so they become their own tokens
1388	173	9.08ms	787	4.00ms	$boundary =~ s/([-_\.=]+)/ $1 /gs; # spent 3.10ms making 614 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 5µs/call # spent 899µs making 173 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 5µs/call
1389	173	729µs			$val .= $boundary;
1390					}
1391
1392					# stop-list words for Content-Type header: these wind up totally gray
1393	222	3.15ms	222	1.67ms	$val =~ s/\b(?:text|charset)\b//; # spent 1.67ms making 222 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 8µs/call
1394
1395	222	1.92ms			$val;
1396					}
1397
					# spent 16.1ms (9.18+6.95) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id which was called 225 times, avg 72µs/call:
					# 225 times (9.18ms+6.95ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1325, avg 72µs/call
1398					sub _pre_chew_message_id {
1399	225	877µs			my ($self, $val) = @_;
1400					# we can (a) get rid of a lot of hapaxen and (b) increase the token
1401					# specificity by pre-parsing some common formats.
1402
1403					# Outlook Express format:
1404	225	3.16ms	225	1.59ms	$val =~ s/<([0-9a-f]{4})[0-9a-f]{4}[0-9a-f]{4}\$ # spent 1.59ms making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
1405					([0-9a-f]{4})[0-9a-f]{4}\$
1406					([0-9a-f]{8})\@(\S+)>/ OEA$1 OEB$2 OEC$3 $4 /gx;
1407
1408					# Exim:
1409	225	2.16ms	225	696µs	$val =~ s/<[A-Za-z0-9]{7}-[A-Za-z0-9]{6}-0[A-Za-z0-9]\@//; # spent 696µs making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 3µs/call
1410
1411					# Sendmail:
1412	225	2.28ms	225	797µs	$val =~ s/<20\d\d[01]\d[0123]\d[012]\d[012345]\d[012345]\d\. # spent 797µs making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 4µs/call
1413					[A-F0-9]{10,12}\@//gx;
1414
1415					# try to split Message-ID segments on probable ID boundaries. Note that
1416					# Outlook message-ids seem to contain a server identifier ID in the last
1417					# 8 bytes before the @. Make sure this becomes its own token, it's a
1418					# great spam-sign for a learning system! Be sure to split on ".".
1419	225	6.03ms	225	3.86ms	$val =~ s/[^_A-Za-z0-9]/ /g; # spent 3.86ms making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 17µs/call
1420	225	2.05ms			$val;
1421					}
1422
					# spent 98.8ms (53.3+45.5) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received which was called 468 times, avg 211µs/call:
					# 468 times (53.3ms+45.5ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1333, avg 211µs/call
1423					sub _pre_chew_received {
1424	468	2.47ms			my ($self, $val) = @_;
1425
1426					# Thanks to Dan for these. Trim out "useless" tokens; sendmail-ish IDs
1427					# and valid-format RFC-822/2822 dates
1428
1429	468	6.47ms	468	3.16ms	$val =~ s/\swith\sSMTP\sid\sg[\dA-Z]{10,12}\s/ /gs; # Sendmail # spent 3.16ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
1430	468	6.05ms	468	3.09ms	$val =~ s/\swith\sESMTP\sid\s[\dA-F]{10,12}\s/ /gs; # Sendmail # spent 3.09ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
1431	468	6.72ms	468	3.43ms	$val =~ s/\bid\s[a-zA-Z0-9]{7,20}\b/ /gs; # Sendmail # spent 3.43ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
1432	468	4.72ms	468	1.87ms	$val =~ s/\bid\s[A-Za-z0-9]{7}-[A-Za-z0-9]{6}-0[A-Za-z0-9]/ /gs; # exim # spent 1.87ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 4µs/call
1433
1434	468	12.7ms	468	9.41ms	$val =~ s/(?:(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s)? # spent 9.41ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 20µs/call
1435					[0-3\s]?[0-9]\s
1436					(?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)\s
1437					(?:19|20)?[0-9]{2}\s
1438					[0-2][0-9](?:\:[0-5][0-9]){1,2}\s
1439					(?:\s$|$|\s(?:[+-][0-9]{4})|\s(?:UT|[A-Z]{2,3}T))
1440					//gx;
1441
1442					# IPs: break down to nearest /24, to reduce hapaxes -- EXCEPT for
1443					# IPs in the 10 and 192.168 ranges, they get lots of significant tokens
1444					# (on both sides)
1445					# also make a dup with the full IP, as fodder for
1446					# bayes_dump_to_trusted_networks: "Hr:ipaaa.bbb.ccc.ddd"
1447	468	30.1ms	1418	12.3ms	$val =~ s{\b(\d{1,3}\.)(\d{1,3}\.)(\d{1,3})(\.\d{1,3})\b}{ # spent 7.12ms making 950 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 7µs/call # spent 5.14ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 11µs/call
1448	584	4.04ms			if ($2 eq '10' || ($2 eq '192' && $3 eq '168')) {
1449					$1.$2.$3.$4.
1450					" ip*".$1.$2.$3.$4." ";
1451					} else {
1452	584	6.55ms			$1.$2.$3.
1453					" ip*".$1.$2.$3.$4." ";
1454					}
1455					}gex;
1456
1457					# trim these: they turn out as the most common tokens, but with a
1458					# prob of about .5. waste of space!
1459	468	15.6ms	468	12.3ms	$val =~ s/\b(?:with|from|for|SMTP|ESMTP)\b/ /g; # spent 12.3ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 26µs/call
1460
1461	468	4.01ms			$val;
1462					}
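
The IP substitution in _pre_chew_received (lines 1447 to 1455) is meant to keep private addresses whole and cut public ones down to their /24, adding an "ip*" copy of the full address in both cases. As captured here, the second capture group includes its trailing dot and holds the second octet, so the comparisons against '10' and '192' look as if they can never succeed. That reading is consistent with the counts: line 1448 and the else branch at line 1452 both ran 584 times. The sketch below is one possible variant that captures bare octets and tests the first one; it is an illustration under that assumption, not the code that was profiled, and chew_ips is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical variant (not the profiled code): capture the octets without
# their dots so the private-range test compares plain numbers.
sub chew_ips {
    my ($val) = @_;
    $val =~ s{\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b}{
        my $full    = "$1.$2.$3.$4";
        my $private = ($1 eq '10' || ($1 eq '192' && $2 eq '168'));
        # keep private IPs whole, reduce everything else to its /24 prefix
        ($private ? $full : "$1.$2.$3") . " ip*$full ";
    }gex;
    return $val;
}

print chew_ips('8.8.4.4'), "\n";        # public address, reduced to 8.8.4
print chew_ips('192.168.0.7'), "\n";    # private address, kept whole
```

Whether the never-taken branch matters in practice depends on how much mail arrives through private relays; the profile alone does not say.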
1463
					# spent 194ms (58.8+136) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header which was called 758 times, avg 257µs/call:
					# 758 times (58.8ms+136ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1330, avg 257µs/call
1464					sub _pre_chew_addr_header {
1465	758	4.42ms			my ($self, $val) = @_;
1466	758	1.44ms			local ($_);
1467
1468	758	8.09ms	758	75.0ms	my @addrs = $self->{main}->find_all_addrs_in_line ($val); # spent 75.0ms making 758 calls to Mail::SpamAssassin::find_all_addrs_in_line, avg 99µs/call
1469	758	1.35ms			my @toks;
1470	758	2.93ms			foreach (@addrs) {
1471	742	8.98ms	742	60.7ms	push (@toks, $self->_tokenize_mail_addrs ($_)); # spent 60.7ms making 742 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs, avg 82µs/call
1472					}
1473	758	11.8ms			return join (' ', @toks);
1474					}
1475
					# spent 91.5ms (67.4+24.1) within Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs which was called 1150 times, avg 80µs/call:
					# 742 times (43.6ms+17.1ms) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header at line 1471, avg 82µs/call
					# 408 times (23.8ms+6.96ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1198, avg 75µs/call
1476					sub _tokenize_mail_addrs {
1477	1150	5.86ms			my ($self, $addr) = @_;
1478
1479	1150	16.4ms	1150	7.84ms	($addr =~ /(.+)\@(.+)$/) or return (); # spent 7.84ms making 1150 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 7µs/call
1480	1150	2.02ms			my @toks;
1481	1150	8.87ms			push(@toks, "U".$1, "D".$2);
1482	3555	44.8ms	2405	16.2ms	$_ = $2; while (s/^[^\.]+\.(.+)$/$1/gs) { push(@toks, "D*".$1); } # spent 16.2ms making 2405 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call
1483	1150	26.4ms			return @toks;
1484					}
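
_tokenize_mail_addrs splits an address into a user token ("U"), a full-domain token ("D"), and one "D*" token for each parent domain obtained by stripping the leftmost label. The sketch below reproduces that behaviour with a lexical variable in place of the global $_ used at line 1482. The function name tokenize_mail_addr and the sample address are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the address tokenizer.
sub tokenize_mail_addr {
    my ($addr) = @_;
    my ($user, $domain) = $addr =~ /(.+)\@(.+)$/ or return ();

    my @toks = ("U$user", "D$domain");
    # Strip one leading label per pass: mail.example.com -> example.com -> com
    while ($domain =~ s/^[^.]+\.(.+)$/$1/s) {
        push @toks, "D*$domain";
    }
    return @toks;
}

print join(' ', tokenize_mail_addr('user@mail.example.com')), "\n";
# prints: Uuser Dmail.example.com D*example.com D*com
```

Each address yields one extra token per dot in its domain, which matches the profile: 1150 calls produced 2405 executions of the substitution on line 1482.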
1485
1486
1487					###########################################################################
1488
1489					# compute the probability that a token is spammish for each token
1490					sub _compute_prob_for_all_tokens {
1491					my ($self, $tokensdata, $ns, $nn) = @_;
1492					my @probabilities;
1493
1494					return if !$ns || !$nn;
1495
1496					my $threshold = 1; # ignore low-freq tokens below this s+n threshold
1497					if (!USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS) {
1498					$threshold = 10;
1499					}
1500					if (!$self->{use_hapaxes}) {
1501					$threshold = 2;
1502					}
1503
1504					foreach my $tokendata (@{$tokensdata}) {
1505					my $s = $tokendata->[1]; # spam count
1506					my $n = $tokendata->[2]; # ham count
1507					my $prob;
1508
					# spent 176µs (78+97) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@1509 which was called:
					# once (78µs+97µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 1509
1509	2	2.43ms	2	273µs	no warnings 'uninitialized'; # treat undef as zero in addition # spent 176µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@1509 # spent 98µs making 1 call to warnings::unimport
1510					if ($s + $n >= $threshold) {
1511					# ignoring low-freq tokens, also covers the (!$s && !$n) case
1512
1513					# my $ratios = $s / $ns;
1514					# my $ration = $n / $nn;
1515					# $prob = $ratios / ($ration + $ratios);
1516					#
1517					$prob = ($s * $nn) / ($n * $ns + $s * $nn); # same thing, faster
1518
1519					if (USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS) {
1520					# use Robinson's f(x) equation for low-n tokens, instead of just
1521					# ignoring them
1522					my $robn = $s + $n;
1523					$prob =
1524					($Mail::SpamAssassin::Bayes::Combine::FW_S_DOT_X + ($robn * $prob))
1525					/
1526					($Mail::SpamAssassin::Bayes::Combine::FW_S_CONSTANT + $robn);
1527					}
1528					}
1529
1530					# 'log_raw_counts' is used to log the raw data for the Bayes equations
1531					# during a mass-check, allowing the S and X constants to be optimized
1532					# quickly without requiring re-tokenization of the messages for each
1533					# attempt. There's really no need for this code to be uncommented in
1534					# normal use, however. It has never been publicly documented, so
1535					# commenting it out is fine. ;)
1536					#
1537					## if ($self->{log_raw_counts}) {
1538					## $self->{raw_counts} .= " s=$s,n=$n ";
1539					## }
1540
1541					push(@probabilities, $prob);
1542					}
1543					return \@probabilities;
1544					}
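
_compute_prob_for_all_tokens first computes the raw spam probability of a token from its spam and ham counts, then, when USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS is set, pulls that value toward a prior using Robinson's f(x) so that rarely seen tokens carry less weight. The sketch below works through one token. The constants $S and $X are illustrative placeholders only: the real FW_S_CONSTANT and FW_S_DOT_X live in Mail::SpamAssassin::Bayes::Combine and their values are not shown in this profile, and treating FW_S_DOT_X as the product S * X is an assumption. The function name token_prob is hypothetical.

```perl
use strict;
use warnings;

# ASSUMPTION: placeholder constants for illustration, not the values
# actually defined in Mail::SpamAssassin::Bayes::Combine.
my $S = 1.0;    # stands in for FW_S_CONSTANT (strength of the prior)
my $X = 0.5;    # assumed prior probability; S * X stands in for FW_S_DOT_X

# $s, $n: spam and ham counts for the token; $ns, $nn: messages learned.
sub token_prob {
    my ($s, $n, $ns, $nn) = @_;
    return if !$ns || !$nn;          # nothing learned yet
    return if $s + $n < 1;           # token never seen

    my $prob = ($s * $nn) / ($n * $ns + $s * $nn);    # raw probability
    my $robn = $s + $n;
    return ($S * $X + $robn * $prob) / ($S + $robn);  # Robinson f(x)
}

# Seen in 3 of 100 spam and 1 of 100 ham: raw 0.75, adjusted to 0.700.
printf "%.3f\n", token_prob(3, 1, 100, 100);
```

With only four observations the prior moves the estimate from 0.75 to 0.70; as the counts grow the adjusted value converges on the raw ratio.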
1545
1546					# compute the probability that a token is spammish
1547					sub _compute_prob_for_token {
1548					my ($self, $token, $ns, $nn, $s, $n) = @_;
1549
1550					# we allow the caller to give us the token information, just
1551					# to save a potentially expensive lookup
1552					if (!defined($s) || !defined($n)) {
1553					($s, $n, undef) = $self->{store}->tok_get($token);
1554					}
1555					return if !$s && !$n;
1556
1557					my $probabilities_ref =
1558					$self->_compute_prob_for_all_tokens([ [$token, $s, $n, 0] ], $ns, $nn);
1559
1560					return $probabilities_ref->[0];
1561					}
1562
1563					###########################################################################
1564					# If a token is neither hammy nor spammy, return 0.
1565					# For a spammy token, return the minimum number of additional ham messages
1566					# it would have had to appear in to no longer be spammy. Hammy tokens
1567					# are handled similarly. That's what the function does (at the time
1568					# of this writing, 31 July 2003, 16:02:55 CDT). It would be slightly
1569					# more useful if it returned the number of /additional/ ham messages
1570					# a spammy token would have to appear in to no longer be spammy but I
1571					# fear that might require the solution to a cubic equation, and I
1572					# just don't have the time for that now.
1573
1574					sub _compute_declassification_distance {
1575					my ($self, $Ns, $Nn, $ns, $nn, $prob) = @_;
1576
1577					return 0 if $ns == 0 && $nn == 0;
1578
1579					if (!USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS) {return 0 if ($ns + $nn < 10);}
1580					if (!$self->{use_hapaxes}) {return 0 if ($ns + $nn < 2);}
1581
1582					return 0 if $Ns == 0 || $Nn == 0;
1583					return 0 if abs( $prob - 0.5 ) <
1584					$Mail::SpamAssassin::Bayes::Combine::MIN_PROB_STRENGTH;
1585
1586					my ($Na,$na,$Nb,$nb) = $prob > 0.5 ? ($Nn,$nn,$Ns,$ns) : ($Ns,$ns,$Nn,$nn);
1587					my $p = 0.5 - $Mail::SpamAssassin::Bayes::Combine::MIN_PROB_STRENGTH;
1588
1589					return int( 1.0 - 1e-6 + $nb * $Na * $p / ($Nb * ( 1 - $p )) ) - $na
1590					unless USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS;
1591
1592					my $s = $Mail::SpamAssassin::Bayes::Combine::FW_S_CONSTANT;
1593					my $sx = $Mail::SpamAssassin::Bayes::Combine::FW_S_DOT_X;
1594					my $a = $Nb * ( 1 - $p );
1595					my $b = $Nb * ( $sx + $nb * ( 1 - $p ) - $p * $s ) - $p * $Na * $nb;
1596					my $c = $Na * $nb * ( $sx - $p * ( $s + $nb ) );
1597					my $discrim = $b * $b - 4 * $a * $c;
1598					my $disc_max_0 = $discrim < 0 ? 0 : $discrim;
1599					my $dd_exact = ( 1.0 - 1e-6 + ( -$b + sqrt( $disc_max_0 ) ) / ( 2*$a ) ) - $na;
1600
1601					# This shouldn't be necessary. Should not be < 1
1602					return $dd_exact < 1 ? 1 : int($dd_exact);
1603					}
1604
1605					###########################################################################
1606
1607					sub _opportunistic_calls {
1608					my($self, $journal_only) = @_;
1609
1610					# If we're not already tied, abort.
1611					if (!$self->{store}->db_readable()) {
1612					dbg("bayes: opportunistic call attempt failed, DB not readable");
1613					return;
1614					}
1615
1616					# Is an expire or sync running?
1617					my $running_expire = $self->{store}->get_running_expire_tok();
1618					if ( defined $running_expire && $running_expire+$OPPORTUNISTIC_LOCK_VALID > time() ) {
1619					dbg("bayes: opportunistic call attempt skipped, found fresh running expire magic token");
1620					return;
1621					}
1622
1623					# handle expiry and syncing
1624					if (!$journal_only && $self->{store}->expiry_due()) {
1625					dbg("bayes: opportunistic call found expiry due");
1626
1627					# sync will bring the DB R/W as necessary, and the expire will remove
1628					# the running_expire token, may untie as well.
1629					$self->{main}->{bayes_scanner}->sync(1,1);
1630					}
1631					elsif ( $self->{store}->sync_due() ) {
1632					dbg("bayes: opportunistic call found journal sync due");
1633
1634					# sync will bring the DB R/W as necessary, may untie as well
1635					$self->{main}->{bayes_scanner}->sync(1,0);
1636
1637					# We can only remove the running_expire token if we're doing R/W
1638					if ($self->{store}->db_writable()) {
1639					$self->{store}->remove_running_expire_tok();
1640					}
1641					}
1642
1643					return;
1644					}
1645
1646					###########################################################################
1647
					# spent 29.6ms (19.6+10.0) within Mail::SpamAssassin::Plugin::Bayes::learner_new which was called:
					# once (19.6ms+10.0ms) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm
1648					sub learner_new {
1649	1	2µs			my ($self) = @_;
1650
1651	1	2µs			my $store;
1652	1	13µs	1	44µs	my $module = untaint_var($self->{conf}->{bayes_store_module}); # spent 44µs making 1 call to Mail::SpamAssassin::Util::untaint_var
1653	1	3µs			$module = 'Mail::SpamAssassin::BayesStore::DBM' if !$module;
1654
1655	1	8µs	1	7µs	dbg("bayes: learner_new self=%s, bayes_store_module=%s", $self,$module); # spent 7µs making 1 call to Mail::SpamAssassin::Logger::dbg
1656	1	4µs			undef $self->{store}; # DESTROYs previous object, if any
1657					eval '
1658					require '.$module.';
1659					$store = '.$module.'->new($self);
1660					1;
1661	1	188µs			' or do { # spent 391µs executing statements in string eval
1662					my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat;
1663					die "bayes: learner_new $module new() failed: $eval_stat\n";
1664					};
1665
1666	1	10µs	1	12µs	dbg("bayes: learner_new: got store=%s", $store); # spent 12µs making 1 call to Mail::SpamAssassin::Logger::dbg
1667	1	4µs			$self->{store} = $store;
1668
1669	1	13µs			$self;
1670					}
1671
1672					###########################################################################
1673
1674					sub bayes_report_make_list {
1675					my ($self, $pms, $info, $param) = @_;
1676					return "Tokens not available." unless defined $info;
1677
1678					my ($limit,$fmt_arg,$more) = split /,/, ($param || '5');
1679
1680					my %formats = (
1681					short => '$t',
1682					Short => 'Token: \"$t\"',
1683					compact => '$p-$D--$t',
1684					Compact => 'Probability $p -declassification distance $D (\"+\" means > 9) --token: \"$t\"',
1685					medium => '$p-$D-$N--$t',
1686					long => '$p-$d--${h}h-${s}s--${a}d--$t',
1687					Long => 'Probability $p -declassification distance $D --in ${h} ham messages -and ${s} spam messages --${a} days old--token:\"$t\"'
1688					);
1689
1690					my $raw_fmt = (!$fmt_arg ? '$p-$D--$t' : $formats{$fmt_arg});
1691
1692					return "Invalid format, must be one of: ".join(",",keys %formats)
1693					unless defined $raw_fmt;
1694
1695					my $fmt = '"'.$raw_fmt.'"';
1696					my $amt = $limit < @$info ? $limit : @$info;
1697					return "" unless $amt;
1698
1699					my $ns = $pms->{bayes_nspam};
1700					my $nh = $pms->{bayes_nham};
1701					my $digit = sub { $_[0] > 9 ? "+" : $_[0] };
1702					my $now = time;
1703
1704					join ', ', map {
1705					my($t,$prob,$s,$h,$u) = @$_;
1706					my $a = int(($now - $u)/(3600 * 24));
1707					my $d = $self->_compute_declassification_distance($ns,$nh,$s,$h,$prob);
1708					my $p = sprintf "%.3f", $prob;
1709					my $n = $s + $h;
1710					my ($c,$o) = $prob < 0.5 ? ($h,$s) : ($s,$h);
1711					my ($D,$S,$H,$C,$O,$N) = map &$digit($_), ($d,$s,$h,$c,$o,$n);
1712					eval $fmt; ## no critic
1713					} @{$info}[0..$amt-1];
1714					}
1715
1716	1	30µs			1;

					# spent 2.36s within Mail::SpamAssassin::Plugin::Bayes::CORE:match which was called 645267 times, avg 4µs/call:
					# 158560 times (306ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1183, avg 2µs/call
					# 128048 times (271ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1197, avg 2µs/call
					# 126321 times (874ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1192, avg 7µs/call
					# 64219 times (279ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1264, avg 4µs/call
					# 64219 times (187ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1258, avg 3µs/call
					# 18366 times (42.6ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1212, avg 2µs/call
					# 17309 times (75.5ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1214, avg 4µs/call
					# 17160 times (67.3ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1268, avg 4µs/call
					# 14037 times (71.6ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1324, avg 5µs/call
					# 7410 times (21.3ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1295, avg 3µs/call
					# 6279 times (72.1ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1300, avg 11µs/call
					# 5528 times (22.7ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1312, avg 4µs/call
					# 5374 times (20.6ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1346, avg 4µs/call
					# 5374 times (19.9ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1351, avg 4µs/call
					# 5374 times (17.4ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1356, avg 3µs/call
					# 1150 times (7.84ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs at line 1479, avg 7µs/call
					# 530 times (3.90ms+0s) by Mail::SpamAssassin::Plugin::Bayes::get_msgid at line 977, avg 7µs/call
					# 9 times (174µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1217, avg 19µs/call
					sub Mail::SpamAssassin::Plugin::Bayes::CORE:match; # opcode
					# spent 18µs within Mail::SpamAssassin::Plugin::Bayes::CORE:qr which was called 2 times, avg 9µs/call:
					# once (14µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 81
					# once (4µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 148
					sub Mail::SpamAssassin::Plugin::Bayes::CORE:qr; # opcode
					# spent 812ms within Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp which was called 168353 times, avg 5µs/call:
					# 158560 times (771ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1183, avg 5µs/call
					# 6279 times (26.7ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1300, avg 4µs/call
					# 3514 times (14.4ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1324, avg 4µs/call
					sub Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp; # opcode
					# spent 2.29s within Mail::SpamAssassin::Plugin::Bayes::CORE:subst which was called 415086 times, avg 6µs/call:
					# 158560 times (757ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1175, avg 5µs/call
					# 158560 times (755ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1176, avg 5µs/call
					# 15418 times (170ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1260, avg 11µs/call
					# 12822 times (75.4ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1146, avg 6µs/call
					# 12822 times (42.5ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1157, avg 3µs/call
					# 12822 times (30.0ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1158, avg 2µs/call
					# 11544 times (120ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1246, avg 10µs/call
					# 8956 times (37.5ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1203, avg 4µs/call
					# 7217 times (137ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1164, avg 19µs/call
					# 5242 times (38.6ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1202, avg 7µs/call
					# 2405 times (16.2ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs at line 1482, avg 7µs/call
					# 1950 times (18.0ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1269, avg 9µs/call
					# 1060 times (8.50ms+0s) by Mail::SpamAssassin::Plugin::Bayes::get_msgid at line 980, avg 8µs/call
					# 555 times (24.2ms+0s) by Mail::SpamAssassin::Plugin::Bayes::get_msgid at line 1007, avg 44µs/call
					# 468 times (12.3ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1459, avg 26µs/call
					# 468 times (9.41ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1434, avg 20µs/call
					# 468 times (5.14ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1447, avg 11µs/call
					# 468 times (3.43ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1431, avg 7µs/call
					# 468 times (3.16ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1429, avg 7µs/call
					# 468 times (3.09ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1430, avg 7µs/call
					# 468 times (1.87ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1432, avg 4µs/call
					# 225 times (3.86ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1419, avg 17µs/call
					# 225 times (1.59ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1404, avg 7µs/call
					# 225 times (797µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1412, avg 4µs/call
					# 225 times (696µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1409, avg 3µs/call
					# 222 times (2.45ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1383, avg 11µs/call
					# 222 times (1.67ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1393, avg 8µs/call
					# 187 times (1.10ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1339, avg 6µs/call
					# 173 times (5.24ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1386, avg 30µs/call
					# 173 times (899µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1388, avg 5µs/call
					sub Mail::SpamAssassin::Plugin::Bayes::CORE:subst; # opcode
					# spent 1.02s within Mail::SpamAssassin::Plugin::Bayes::CORE:substcont which was called 229852 times, avg 4µs/call:
					# 197712 times (805ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1146, avg 4µs/call
					# 23088 times (120ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1246, avg 5µs/call
					# 7362 times (85.3ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1164, avg 12µs/call
					# 950 times (7.12ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1447, avg 7µs/call
					# 614 times (3.10ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1388, avg 5µs/call
					# 86 times (600µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1157, avg 7µs/call
					# 40 times (218µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1158, avg 5µs/call
					sub Mail::SpamAssassin::Plugin::Bayes::CORE:substcont; # opcode