Filename | /usr/local/lib/perl5/site_perl/Mail/SpamAssassin/Plugin/Bayes.pm |
Statements | Executed 2376122 statements in 29.8s |
Calls | P | F | Exclusive Time | Inclusive Time | Subroutine | Package
---|---|---|---|---|---|---
12822 | 4 | 1 | 17.3s | 23.5s | _tokenize_line | Mail::SpamAssassin::Plugin::Bayes::
234 | 1 | 1 | 3.09s | 31.0s | tokenize | Mail::SpamAssassin::Plugin::Bayes::
645267 | 18 | 1 | 2.36s | 2.36s | CORE:match (opcode) | Mail::SpamAssassin::Plugin::Bayes::
415086 | 30 | 1 | 2.29s | 2.29s | CORE:subst (opcode) | Mail::SpamAssassin::Plugin::Bayes::
234 | 1 | 1 | 1.18s | 2.99s | _tokenize_headers | Mail::SpamAssassin::Plugin::Bayes::
229852 | 7 | 1 | 1.02s | 1.02s | CORE:substcont (opcode) | Mail::SpamAssassin::Plugin::Bayes::
168353 | 3 | 1 | 812ms | 812ms | CORE:regcomp (opcode) | Mail::SpamAssassin::Plugin::Bayes::
555 | 2 | 2 | 125ms | 306ms | get_msgid | Mail::SpamAssassin::Plugin::Bayes::
234 | 1 | 1 | 106ms | 36.3s | _learn_trapped | Mail::SpamAssassin::Plugin::Bayes::
1150 | 2 | 1 | 67.4ms | 91.5ms | _tokenize_mail_addrs | Mail::SpamAssassin::Plugin::Bayes::
758 | 1 | 1 | 58.8ms | 194ms | _pre_chew_addr_header | Mail::SpamAssassin::Plugin::Bayes::
468 | 1 | 1 | 53.3ms | 98.8ms | _pre_chew_received | Mail::SpamAssassin::Plugin::Bayes::
234 | 1 | 1 | 53.1ms | 6.25s | get_body_from_msg | Mail::SpamAssassin::Plugin::Bayes::
234 | 1 | 1 | 36.8ms | 44.1s | learn_message | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 19.6ms | 29.6ms | learner_new | Mail::SpamAssassin::Plugin::Bayes::
234 | 1 | 1 | 16.1ms | 6.07s | _get_msgdata_from_permsgstatus | Mail::SpamAssassin::Plugin::Bayes::
222 | 1 | 1 | 14.7ms | 28.1ms | _pre_chew_content_type | Mail::SpamAssassin::Plugin::Bayes::
225 | 1 | 1 | 9.18ms | 16.1ms | _pre_chew_message_id | Mail::SpamAssassin::Plugin::Bayes::
236 | 1 | 1 | 2.35ms | 2.35ms | read_db_configs | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 1.39ms | 2.09ms | BEGIN@63 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 84µs | 132µs | new | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 78µs | 176µs | BEGIN@1509 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 52µs | 1.31ms | learner_close | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 49µs | 68µs | BEGIN@46 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 41µs | 298µs | BEGIN@68 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 36µs | 198µs | BEGIN@167 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 34µs | 180µs | BEGIN@51 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 32µs | 37µs | BEGIN@48 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 32µs | 220µs | BEGIN@219 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 32µs | 234µs | BEGIN@178 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 32µs | 234µs | BEGIN@215 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 31µs | 209µs | BEGIN@174 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 31µs | 228µs | BEGIN@165 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 30µs | 207µs | BEGIN@175 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 30µs | 226µs | BEGIN@169 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 30µs | 58µs | BEGIN@47 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 30µs | 172µs | BEGIN@60 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 30µs | 230µs | BEGIN@173 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 29µs | 204µs | BEGIN@164 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 29µs | 205µs | BEGIN@168 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 29µs | 232µs | BEGIN@227 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 28µs | 236µs | BEGIN@158 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 28µs | 196µs | BEGIN@179 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 27µs | 212µs | BEGIN@172 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 26µs | 506µs | learner_is_scan_available | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 25µs | 230µs | BEGIN@157 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 25µs | 246µs | BEGIN@163 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 25µs | 97µs | BEGIN@49 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 25µs | 213µs | BEGIN@59 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 24µs | 217µs | BEGIN@156 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 22µs | 184µs | BEGIN@166 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 21µs | 257µs | BEGIN@223 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 20µs | 20µs | BEGIN@58 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 20µs | 196µs | BEGIN@159 | Mail::SpamAssassin::Plugin::Bayes::
2 | 2 | 1 | 18µs | 18µs | CORE:qr (opcode) | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 15µs | 15µs | BEGIN@56 | Mail::SpamAssassin::Plugin::Bayes::
1 | 1 | 1 | 12µs | 12µs | BEGIN@57 | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | __ANON__[:1701] | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | __ANON__[:874] | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | __ANON__[:880] | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | __ANON__[:898] | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | _compute_declassification_distance | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | _compute_prob_for_all_tokens | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | _compute_prob_for_token | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | _forget_trapped | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | _opportunistic_calls | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | bayes_report_make_list | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | check_bayes | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | finish | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | forget_message | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | ignore_message | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | learner_dump_database | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | learner_expire_old_training | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | learner_get_implementation | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | learner_sync | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | prefork_init | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | scan | Mail::SpamAssassin::Plugin::Bayes::
0 | 0 | 0 | 0s | 0s | spamd_child_init | Mail::SpamAssassin::Plugin::Bayes::
Line | Statements | Time on line | Calls | Time in subs | Code |
---|---|---|---|---|---|
1 | # <@LICENSE> | ||||
2 | # Licensed to the Apache Software Foundation (ASF) under one or more | ||||
3 | # contributor license agreements. See the NOTICE file distributed with | ||||
4 | # this work for additional information regarding copyright ownership. | ||||
5 | # The ASF licenses this file to you under the Apache License, Version 2.0 | ||||
6 | # (the "License"); you may not use this file except in compliance with | ||||
7 | # the License. You may obtain a copy of the License at: | ||||
8 | # | ||||
9 | # http://www.apache.org/licenses/LICENSE-2.0 | ||||
10 | # | ||||
11 | # Unless required by applicable law or agreed to in writing, software | ||||
12 | # distributed under the License is distributed on an "AS IS" BASIS, | ||||
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
14 | # See the License for the specific language governing permissions and | ||||
15 | # limitations under the License. | ||||
16 | # </@LICENSE> | ||||
17 | |||||
18 | =head1 NAME | ||||
19 | |||||
20 | Mail::SpamAssassin::Plugin::Bayes - determine spammishness using a Bayesian classifier | ||||
21 | |||||
22 | =head1 DESCRIPTION | ||||
23 | |||||
24 | This is a Bayesian-style probabilistic classifier, using an algorithm based on | ||||
25 | the one detailed in Paul Graham's I<A Plan For Spam> paper at: | ||||
26 | |||||
27 | http://www.paulgraham.com/spam.html | ||||
28 | |||||
29 | It also incorporates some other aspects taken from Gary Robinson's webpage | ||||
30 | on the subject at: | ||||
31 | |||||
32 | http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html | ||||
33 | |||||
34 | And the chi-square probability combiner as described here: | ||||
35 | |||||
36 | http://www.linuxjournal.com/print.php?sid=6467 | ||||
37 | |||||
38 | The results are incorporated into SpamAssassin as the BAYES_* rules. | ||||
39 | |||||
40 | =head1 METHODS | ||||
41 | |||||
42 | =cut | ||||
43 | |||||
44 | package Mail::SpamAssassin::Plugin::Bayes; | ||||
45 | |||||
46 | 2 | 76µs | 2 | 88µs | # spent 68µs (49+19) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@46 which was called:
# once (49µs+19µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 46 # spent 68µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@46
# spent 19µs making 1 call to strict::import |
47 | 2 | 66µs | 2 | 86µs | # spent 58µs (30+28) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@47 which was called:
# once (30µs+28µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 47 # spent 58µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@47
# spent 28µs making 1 call to warnings::import |
48 | 2 | 84µs | 2 | 42µs | # spent 37µs (32+5) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@48 which was called:
# once (32µs+5µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 48 # spent 37µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@48
# spent 5µs making 1 call to bytes::import |
49 | 2 | 141µs | 2 | 169µs | # spent 97µs (25+72) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@49 which was called:
# once (25µs+72µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 49 # spent 97µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@49
# spent 72µs making 1 call to re::import |
50 | |||||
51 | # spent 180µs (34+146) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@51 which was called:
# once (34µs+146µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 54 | ||||
52 | 3 | 19µs | 1 | 146µs | eval { require Digest::SHA; import Digest::SHA qw(sha1 sha1_hex); 1 } # spent 146µs making 1 call to Exporter::import |
53 | 1 | 13µs | or do { require Digest::SHA1; import Digest::SHA1 qw(sha1 sha1_hex) } | ||
54 | 1 | 51µs | 1 | 180µs | } # spent 180µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@51 |
55 | |||||
56 | 2 | 59µs | 1 | 15µs | # spent 15µs within Mail::SpamAssassin::Plugin::Bayes::BEGIN@56 which was called:
# once (15µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 56 # spent 15µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@56 |
57 | 2 | 70µs | 1 | 12µs | # spent 12µs within Mail::SpamAssassin::Plugin::Bayes::BEGIN@57 which was called:
# once (12µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 57 # spent 12µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@57 |
58 | 2 | 65µs | 1 | 20µs | # spent 20µs within Mail::SpamAssassin::Plugin::Bayes::BEGIN@58 which was called:
# once (20µs+0s) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 58 # spent 20µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@58 |
59 | 2 | 68µs | 2 | 401µs | # spent 213µs (25+188) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@59 which was called:
# once (25µs+188µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 59 # spent 213µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@59
# spent 188µs making 1 call to Exporter::import |
60 | 2 | 85µs | 2 | 314µs | # spent 172µs (30+142) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@60 which was called:
# once (30µs+142µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 60 # spent 172µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@60
# spent 142µs making 1 call to Exporter::import |
61 | |||||
62 | # pick ONLY ONE of these combining implementations. | ||||
63 | 2 | 354µs | 1 | 2.09ms | # spent 2.09ms (1.39+703µs) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@63 which was called:
# once (1.39ms+703µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 63 # spent 2.09ms making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@63 |
64 | # use Mail::SpamAssassin::Bayes::CombineNaiveBayes; | ||||
65 | |||||
66 | 1 | 26µs | our @ISA = qw(Mail::SpamAssassin::Plugin); | ||
67 | |||||
68 | 1 | 7µs | # spent 298µs (41+257) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@68 which was called:
# once (41µs+257µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 73 | ||
69 | $IGNORED_HDRS | ||||
70 | $MARK_PRESENCE_ONLY_HDRS | ||||
71 | %HEADER_NAME_COMPRESSION | ||||
72 | $OPPORTUNISTIC_LOCK_VALID | ||||
73 | 1 | 1.33ms | 2 | 555µs | }; # spent 298µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@68
# spent 257µs making 1 call to vars::import |
74 | |||||
75 | # Which headers should we scan for tokens? Don't use all of them, as it's easy | ||||
76 | # to pick up spurious clues from some. What we now do is use all of them | ||||
77 | # *less* these well-known headers; that way we can pick up spammers' tracking | ||||
78 | # headers (which are obviously not well-known in advance!). | ||||
79 | |||||
80 | # Received is handled specially | ||||
81 | 1 | 36µs | 1 | 14µs | $IGNORED_HDRS = qr{(?: (?:X-)?Sender # misc noise # spent 14µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::CORE:qr |
82 | |Delivered-To |Delivery-Date | ||||
83 | |(?:X-)?Envelope-To | ||||
84 | |X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text | ||||
85 | |||||
86 | |Subject # not worth a tiny gain vs. the db size increase | ||||
87 | |||||
88 | # Date: can provide invalid cues if your spam corpus is | ||||
89 | # older/newer than ham | ||||
90 | |Date | ||||
91 | |||||
92 | # List headers: ignore. a spamfiltering mailing list will | ||||
93 | # become a nonspam sign. | ||||
94 | |X-List|(?:X-)?Mailing-List | ||||
95 | |(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe | ||||
96 | |Unsubscribe|Host|Id|Manager|Admin|Comment | ||||
97 | |Name|Url) | ||||
98 | |X-Unsub(?:scribe)? | ||||
99 | |X-Mailman-Version |X-Been[Tt]here |X-Loop | ||||
100 | |Mail-Followup-To | ||||
101 | |X-eGroups-(?:Return|From) | ||||
102 | |X-MDMailing-List | ||||
103 | |X-XEmacs-List | ||||
104 | |||||
105 | # gatewayed through mailing list (thanks to Allen Smith) | ||||
106 | |(?:X-)?Resent-(?:From|To|Date) | ||||
107 | |(?:X-)?Original-(?:From|To|Date) | ||||
108 | |||||
109 | # Spamfilter/virus-scanner headers: too easy to chain from | ||||
110 | # these | ||||
111 | |X-MailScanner(?:-SpamCheck)? | ||||
112 | |X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))? | ||||
113 | |X-Antispam |X-RBL-Warning |X-Mailscanner | ||||
114 | |X-MDaemon-Deliver-To |X-Virus-Scanned | ||||
115 | |X-Mass-Check-Id | ||||
116 | |X-Pyzor |X-DCC-\S{2,25}-Metrics | ||||
117 | |X-Filtered-B[Yy] |X-Scanned-By |X-Scanner | ||||
118 | |X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status | ||||
119 | |X-SpamCop-[^:]+ | ||||
120 | |X-SMTPD |(?:X-)?Spam-Apparently-To | ||||
121 | |SPAM |X-Perlmx-Spam | ||||
122 | |X-Bogosity | ||||
123 | |||||
124 | # some noisy Outlook headers that add no good clues: | ||||
125 | |Content-Class |Thread-(?:Index|Topic) | ||||
126 | |X-Original[Aa]rrival[Tt]ime | ||||
127 | |||||
128 | # Annotations from IMAP, POP, and MH: | ||||
129 | |(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded | ||||
130 | |Lines |Content-Length | ||||
131 | |X-UIDL? |X-IMAPbase | ||||
132 | |||||
133 | # Annotations from Bugzilla | ||||
134 | |X-Bugzilla-[^:]+ | ||||
135 | |||||
136 | # Annotations from VM: (thanks to Allen Smith) | ||||
137 | |X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified | ||||
138 | |Summary-Format|VHeader|v\d-Data|Message-Order) | ||||
139 | |||||
140 | # Annotations from Gnus: | ||||
141 | | X-Gnus-Mail-Source | ||||
142 | | Xref | ||||
143 | |||||
144 | )}x; | ||||
145 | |||||
146 | # Note only the presence of these headers, in order to reduce the | ||||
147 | # hapaxen they generate. | ||||
148 | 1 | 12µs | 1 | 4µs | $MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face # spent 4µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::CORE:qr |
149 | |X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint | ||||
150 | |D(?:KIM|omainKey)-Signature | ||||
151 | )}ix; | ||||
152 | |||||
153 | # tweaks tested as of Nov 18 2002 by jm posted to -devel at | ||||
154 | # http://sourceforge.net/p/spamassassin/mailman/message/12977556/ | ||||
155 | # for results. The winners are now the default settings. | ||||
156 | 2 | 72µs | 2 | 411µs | # spent 217µs (24+194) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@156 which was called:
# once (24µs+194µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 156 # spent 217µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@156
# spent 194µs making 1 call to constant::import |
157 | 2 | 71µs | 2 | 435µs | # spent 230µs (25+205) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@157 which was called:
# once (25µs+205µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 157 # spent 230µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@157
# spent 205µs making 1 call to constant::import |
158 | 2 | 67µs | 2 | 443µs | # spent 236µs (28+207) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@158 which was called:
# once (28µs+207µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 158 # spent 236µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@158
# spent 207µs making 1 call to constant::import |
159 | 2 | 13.4ms | 2 | 373µs | # spent 196µs (20+177) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@159 which was called:
# once (20µs+177µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 159 # spent 196µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@159
# spent 177µs making 1 call to constant::import |
160 | |||||
161 | # tweaks by jm on May 12 2003, see -devel email at | ||||
162 | # http://sourceforge.net/p/spamassassin/mailman/message/14844556/ | ||||
163 | 2 | 81µs | 2 | 466µs | # spent 246µs (25+221) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@163 which was called:
# once (25µs+221µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 163 # spent 246µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@163
# spent 221µs making 1 call to constant::import |
164 | 2 | 73µs | 2 | 379µs | # spent 204µs (29+175) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@164 which was called:
# once (29µs+175µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 164 # spent 204µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@164
# spent 175µs making 1 call to constant::import |
165 | 2 | 88µs | 2 | 425µs | # spent 228µs (31+197) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@165 which was called:
# once (31µs+197µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 165 # spent 228µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@165
# spent 197µs making 1 call to constant::import |
166 | 2 | 57µs | 2 | 346µs | # spent 184µs (22+162) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@166 which was called:
# once (22µs+162µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 166 # spent 184µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@166
# spent 162µs making 1 call to constant::import |
167 | 2 | 69µs | 2 | 360µs | # spent 198µs (36+162) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@167 which was called:
# once (36µs+162µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 167 # spent 198µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@167
# spent 162µs making 1 call to constant::import |
168 | 2 | 71µs | 2 | 381µs | # spent 205µs (29+176) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@168 which was called:
# once (29µs+176µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 168 # spent 205µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@168
# spent 176µs making 1 call to constant::import |
169 | 2 | 68µs | 2 | 422µs | # spent 226µs (30+196) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@169 which was called:
# once (30µs+196µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 169 # spent 226µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@169
# spent 196µs making 1 call to constant::import |
170 | |||||
171 | # tweaks of 12 March 2004, see bug 2129. | ||||
172 | 2 | 77µs | 2 | 397µs | # spent 212µs (27+185) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@172 which was called:
# once (27µs+185µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 172 # spent 212µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@172
# spent 185µs making 1 call to constant::import |
173 | 2 | 80µs | 2 | 431µs | # spent 230µs (30+201) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@173 which was called:
# once (30µs+201µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 173 # spent 230µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@173
# spent 201µs making 1 call to constant::import |
174 | 2 | 68µs | 2 | 386µs | # spent 209µs (31+178) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@174 which was called:
# once (31µs+178µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 174 # spent 209µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@174
# spent 178µs making 1 call to constant::import |
175 | 2 | 92µs | 2 | 384µs | # spent 207µs (30+177) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@175 which was called:
# once (30µs+177µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 175 # spent 207µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@175
# spent 177µs making 1 call to constant::import |
176 | |||||
177 | # tweaks, see http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3173#c26 | ||||
178 | 2 | 68µs | 2 | 437µs | # spent 234µs (32+202) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@178 which was called:
# once (32µs+202µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 178 # spent 234µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@178
# spent 202µs making 1 call to constant::import |
179 | 2 | 219µs | 2 | 364µs | # spent 196µs (28+168) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@179 which was called:
# once (28µs+168µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 179 # spent 196µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@179
# spent 168µs making 1 call to constant::import |
180 | |||||
181 | # We store header-mined tokens in the db with a "HHeaderName:val" format. | ||||
182 | # some headers may contain lots of gibberish tokens, so allow a little basic | ||||
183 | # compression by mapping the header name at least here. these are the headers | ||||
184 | # which appear with the most frequency in my db. note: this doesn't have to | ||||
185 | # be 2-way (ie. LHSes that map to the same RHS are not a problem), but mixing | ||||
186 | # tokens from multiple different headers may impact accuracy, so might as well | ||||
187 | # avoid this if possible. These are the top ones from my corpus, BTW (jm). | ||||
188 | 1 | 31µs | %HEADER_NAME_COMPRESSION = ( | ||
189 | 'Message-Id' => '*m', | ||||
190 | 'Message-ID' => '*M', | ||||
191 | 'Received' => '*r', | ||||
192 | 'User-Agent' => '*u', | ||||
193 | 'References' => '*f', | ||||
194 | 'In-Reply-To' => '*i', | ||||
195 | 'From' => '*F', | ||||
196 | 'Reply-To' => '*R', | ||||
197 | 'Return-Path' => '*p', | ||||
198 | 'Return-path' => '*rp', | ||||
199 | 'X-Mailer' => '*x', | ||||
200 | 'X-Authentication-Warning' => '*a', | ||||
201 | 'Organization' => '*o', | ||||
202 | 'Organisation' => '*o', | ||||
203 | 'Content-Type' => '*c', | ||||
204 | 'x-spam-relays-trusted' => '*RT', | ||||
205 | 'x-spam-relays-untrusted' => '*RU', | ||||
206 | ); | ||||
207 | |||||
208 | # How many seconds should the opportunistic_expire lock be valid? | ||||
209 | 1 | 2µs | $OPPORTUNISTIC_LOCK_VALID = 300; | ||
210 | |||||
211 | # Should we use the Robinson f(w) equation from | ||||
212 | # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html ? | ||||
213 | # It gives better results, in that scores are more likely to distribute | ||||
214 | # into the <0.5 range for nonspam and >0.5 for spam. | ||||
215 | 2 | 72µs | 2 | 437µs | # spent 234µs (32+203) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@215 which was called:
# once (32µs+203µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 215 # spent 234µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@215
# spent 203µs making 1 call to constant::import |
216 | |||||
217 | # How many of the most significant tokens should we use for the p(w) | ||||
218 | # calculation? | ||||
219 | 2 | 74µs | 2 | 409µs | # spent 220µs (32+188) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@219 which was called:
# once (32µs+188µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 219 # spent 220µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@219
# spent 188µs making 1 call to constant::import |
220 | |||||
221 | # How many significant tokens are required for a classifier score to | ||||
222 | # be considered usable? | ||||
223 | 2 | 80µs | 2 | 493µs | # spent 257µs (21+236) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@223 which was called:
# once (21µs+236µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 223 # spent 257µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@223
# spent 236µs making 1 call to constant::import |
224 | |||||
225 | # How long a token should we hold onto? (note: German speakers typically | ||||
226 | # will require a longer token than English ones.) | ||||
227 | 2 | 14.9ms | 2 | 434µs | # spent 232µs (29+203) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@227 which was called:
# once (29µs+203µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 227 # spent 232µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@227
# spent 203µs making 1 call to constant::import |
228 | |||||
229 | ########################################################################### | ||||
230 | |||||
231 | # spent 132µs (84+48) within Mail::SpamAssassin::Plugin::Bayes::new which was called:
# once (84µs+48µs) by Mail::SpamAssassin::PluginHandler::load_plugin at line 1 of (eval 89)[Mail/SpamAssassin/PluginHandler.pm:129] | ||||
232 | 1 | 3µs | my $class = shift; | ||
233 | 1 | 2µs | my ($main) = @_; | ||
234 | |||||
235 | 1 | 3µs | $class = ref($class) || $class; | ||
236 | 1 | 13µs | 1 | 17µs | my $self = $class->SUPER::new($main); # spent 17µs making 1 call to Mail::SpamAssassin::Plugin::new |
237 | 1 | 2µs | bless ($self, $class); | ||
238 | |||||
239 | 1 | 6µs | $self->{main} = $main; | ||
240 | 1 | 4µs | $self->{conf} = $main->{conf}; | ||
241 | 1 | 3µs | $self->{use_ignores} = 1; | ||
242 | |||||
243 | 1 | 10µs | 1 | 31µs | $self->register_eval_rule("check_bayes"); # spent 31µs making 1 call to Mail::SpamAssassin::Plugin::register_eval_rule |
244 | 1 | 10µs | $self; | ||
245 | } | ||||
246 | |||||
247 | sub finish { | ||||
248 | my $self = shift; | ||||
249 | if ($self->{store}) { | ||||
250 | $self->{store}->untie_db(); | ||||
251 | } | ||||
252 | %{$self} = (); | ||||
253 | } | ||||
254 | |||||
255 | ########################################################################### | ||||
256 | |||||
257 | # Plugin hook. | ||||
258 | # Return this implementation object, for callers that need to know | ||||
259 | # it. TODO: callers shouldn't *need* to know it! | ||||
260 | # used only in test suite to get access to {store}, internal APIs. | ||||
261 | # | ||||
262 | sub learner_get_implementation { return shift; } | ||||
263 | |||||
264 | ########################################################################### | ||||
265 | |||||
266 | # Plugin hook. | ||||
267 | # Called in the parent process shortly before forking off child processes. | ||||
268 | sub prefork_init { | ||||
269 | my ($self) = @_; | ||||
270 | |||||
271 | if ($self->{store} && $self->{store}->UNIVERSAL::can('prefork_init')) { | ||||
272 | $self->{store}->prefork_init; | ||||
273 | } | ||||
274 | } | ||||
275 | |||||
276 | ########################################################################### | ||||
277 | |||||
278 | # Plugin hook. | ||||
279 | # Called in a child process shortly after being spawned. | ||||
280 | sub spamd_child_init { | ||||
281 | my ($self) = @_; | ||||
282 | |||||
283 | if ($self->{store} && $self->{store}->UNIVERSAL::can('spamd_child_init')) { | ||||
284 | $self->{store}->spamd_child_init; | ||||
285 | } | ||||
286 | } | ||||
287 | |||||
288 | ########################################################################### | ||||
289 | |||||
290 | # Plugin hook. | ||||
291 | sub check_bayes { | ||||
292 | my ($self, $pms, $fulltext, $min, $max) = @_; | ||||
293 | |||||
294 | return 0 if (!$self->{conf}->{use_learner}); | ||||
295 | return 0 if (!$self->{conf}->{use_bayes} || !$self->{conf}->{use_bayes_rules}); | ||||
296 | |||||
297 | if (!exists ($pms->{bayes_score})) { | ||||
298 | my $timer = $self->{main}->time_method("check_bayes"); | ||||
299 | $pms->{bayes_score} = $self->scan($pms, $pms->{msg}); | ||||
300 | } | ||||
301 | |||||
302 | if (defined $pms->{bayes_score} && | ||||
303 | ($min == 0 || $pms->{bayes_score} > $min) && | ||||
304 | ($max eq "undef" || $pms->{bayes_score} <= $max)) | ||||
305 | { | ||||
306 | if ($self->{conf}->{detailed_bayes_score}) { | ||||
307 | $pms->test_log(sprintf ("score: %3.4f, hits: %s", | ||||
308 | $pms->{bayes_score}, | ||||
309 | $pms->{bayes_hits})); | ||||
310 | } | ||||
311 | else { | ||||
312 | $pms->test_log(sprintf ("score: %3.4f", $pms->{bayes_score})); | ||||
313 | } | ||||
314 | return 1; | ||||
315 | } | ||||
316 | |||||
317 | return 0; | ||||
318 | } | ||||
319 | |||||
320 | ########################################################################### | ||||
321 | |||||
322 | # Plugin hook. | ||||
323 | # spent 1.31ms (52µs+1.25) within Mail::SpamAssassin::Plugin::Bayes::learner_close which was called:
# once (52µs+1.25ms) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm | ||||
324 | 1 | 2µs | my ($self, $params) = @_; | ||
325 | 1 | 4µs | my $quiet = $params->{quiet}; | ||
326 | |||||
327 | # do a sanity check here. Weird things happen if we remain tied | ||||
328 | # after compiling; for example, spamd will never see that the | ||||
329 | # number of messages has reached the bayes-scanning threshold. | ||||
330 | 1 | 26µs | 1 | 13µs | if ($self->{store}->db_readable()) { # spent 13µs making 1 call to Mail::SpamAssassin::BayesStore::DBM::db_readable |
331 | 1 | 2µs | warn "bayes: oops! still tied to bayes DBs, untying\n" unless $quiet; | ||
332 | 1 | 11µs | 1 | 1.24ms | $self->{store}->untie_db(); # spent 1.24ms making 1 call to Mail::SpamAssassin::BayesStore::DBM::untie_db |
333 | } | ||||
334 | } | ||||
335 | |||||
336 | ########################################################################### | ||||
337 | |||||
338 | # read configuration items to control bayes behaviour. Called by | ||||
339 | # BayesStore::read_db_configs(). | ||||
340 | # spent 2.35ms within Mail::SpamAssassin::Plugin::Bayes::read_db_configs which was called 236 times, avg 10µs/call:
# 236 times (2.35ms+0s) by Mail::SpamAssassin::BayesStore::read_db_configs at line 117 of Mail/SpamAssassin/BayesStore.pm, avg 10µs/call | ||||
341 | 236 | 521µs | my ($self) = @_; | ||
342 | |||||
343 | # use of hapaxes. Set on bayes object, since it controls prob | ||||
344 | # computation. | ||||
345 | 236 | 2.49ms | $self->{use_hapaxes} = $self->{conf}->{bayes_use_hapaxes}; | ||
346 | } | ||||
347 | ########################################################################### | ||||
348 | |||||
349 | sub ignore_message { | ||||
350 | my ($self,$PMS) = @_; | ||||
351 | |||||
352 | return 0 unless $self->{use_ignores}; | ||||
353 | |||||
354 | my $ig_from = $self->{main}->call_plugins ("check_wb_list", | ||||
355 | { permsgstatus => $PMS, type => 'from', list => 'bayes_ignore_from' }); | ||||
356 | my $ig_to = $self->{main}->call_plugins ("check_wb_list", | ||||
357 | { permsgstatus => $PMS, type => 'to', list => 'bayes_ignore_to' }); | ||||
358 | |||||
359 | my $ignore = $ig_from || $ig_to; | ||||
360 | |||||
361 | dbg("bayes: not using bayes, bayes_ignore_from or _to rule") if $ignore; | ||||
362 | |||||
363 | return $ignore; | ||||
364 | } | ||||
365 | |||||
366 | ########################################################################### | ||||
367 | |||||
368 | # Plugin hook. | ||||
369 | # spent 44.1s (36.8ms+44.1) within Mail::SpamAssassin::Plugin::Bayes::learn_message which was called 234 times, avg 189ms/call:
# 234 times (36.8ms+44.1s) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm, avg 189ms/call
sub learn_message { | ||||
370 | 234 | 498µs | my ($self, $params) = @_; | ||
371 | 234 | 843µs | my $isspam = $params->{isspam}; | ||
372 | 234 | 697µs | my $msg = $params->{msg}; | ||
373 | 234 | 645µs | my $id = $params->{id}; | ||
374 | |||||
375 | 234 | 949µs | if (!$self->{conf}->{use_bayes}) { return; } | ||
376 | |||||
377 | 234 | 2.37ms | 234 | 6.25s | my $msgdata = $self->get_body_from_msg ($msg); # spent 6.25s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::get_body_from_msg, avg 26.7ms/call |
378 | 234 | 476µs | my $ret; | ||
379 | |||||
380 | eval { | ||||
381 | 234 | 1.58ms | local $SIG{'__DIE__'}; # do not run user die() traps in here | ||
382 | 234 | 3.55ms | 234 | 1.89ms | my $timer = $self->{main}->time_method("b_learn"); # spent 1.89ms making 234 calls to Mail::SpamAssassin::time_method, avg 8µs/call |
383 | |||||
384 | 234 | 445µs | my $ok; | ||
385 | 234 | 1.25ms | if ($self->{main}->{learn_to_journal}) { | ||
386 | # If we're going to learn to journal, we'll try going r/o first... | ||||
387 | # If that fails for some reason, let's try going r/w. This happens | ||||
388 | # if the DB doesn't exist yet. | ||||
389 | 234 | 3.13ms | 235 | 1.59s | $ok = $self->{store}->tie_db_readonly() || $self->{store}->tie_db_writable(); # spent 1.58s making 234 calls to Mail::SpamAssassin::BayesStore::DBM::tie_db_readonly, avg 6.77ms/call
# spent 4.45ms making 1 call to Mail::SpamAssassin::BayesStore::DBM::tie_db_writable |
390 | } else { | ||||
391 | $ok = $self->{store}->tie_db_writable(); | ||||
392 | } | ||||
393 | |||||
394 | 234 | 926µs | if ($ok) { | ||
395 | 234 | 2.80ms | 234 | 36.3s | $ret = $self->_learn_trapped ($isspam, $msg, $msgdata, $id); # spent 36.3s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::_learn_trapped, avg 155ms/call |
396 | |||||
397 | 234 | 1.06ms | if (!$self->{main}->{learn_caller_will_untie}) { | ||
398 | $self->{store}->untie_db(); | ||||
399 | } | ||||
400 | } | ||||
401 | 234 | 2.74ms | 1; | ||
402 | 234 | 1.03ms | } or do { # if we died, untie the dbs. | ||
403 | my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat; | ||||
404 | $self->{store}->untie_db(); | ||||
405 | die "bayes: (in learn) $eval_stat\n"; | ||||
406 | }; | ||||
407 | |||||
408 | 234 | 3.54ms | return $ret; | ||
409 | } | ||||
410 | |||||
411 | # this function is trapped by the wrapper above | ||||
412 | # spent 36.3s (106ms+36.2) within Mail::SpamAssassin::Plugin::Bayes::_learn_trapped which was called 234 times, avg 155ms/call:
# 234 times (106ms+36.2s) by Mail::SpamAssassin::Plugin::Bayes::learn_message at line 395, avg 155ms/call
sub _learn_trapped { | ||||
413 | 234 | 689µs | my ($self, $isspam, $msg, $msgdata, $msgid) = @_; | ||
414 | 234 | 896µs | my @msgid = ( $msgid ); | ||
415 | |||||
416 | 234 | 1.25ms | if (!defined $msgid) { | ||
417 | 234 | 2.69ms | 234 | 137ms | @msgid = $self->get_msgid($msg); # spent 137ms making 234 calls to Mail::SpamAssassin::Plugin::Bayes::get_msgid, avg 584µs/call |
418 | } | ||||
419 | |||||
420 | 234 | 1.07ms | foreach my $msgid_t ( @msgid ) { | ||
421 | 458 | 4.79ms | 458 | 31.2ms | my $seen = $self->{store}->seen_get ($msgid_t); # spent 31.2ms making 458 calls to Mail::SpamAssassin::BayesStore::DBM::seen_get, avg 68µs/call |
422 | |||||
423 | 458 | 3.28ms | if (defined ($seen)) { | ||
424 | if (($seen eq 's' && $isspam) || ($seen eq 'h' && !$isspam)) { | ||||
425 | dbg("bayes: $msgid_t already learnt correctly, not learning twice"); | ||||
426 | return 0; | ||||
427 | } elsif ($seen !~ /^[hs]$/) { | ||||
428 | warn("bayes: db_seen corrupt: value='$seen' for $msgid_t, ignored"); | ||||
429 | } else { | ||||
430 | # bug 3704: If the message was already learned, don't try learning it again. | ||||
431 | # This prevents, for instance, manually learning as spam, then autolearning | ||||
432 | # as ham, or vice versa. | ||||
433 | if ($self->{main}->{learn_no_relearn}) { | ||||
434 | dbg("bayes: $msgid_t already learnt as opposite, not re-learning"); | ||||
435 | return 0; | ||||
436 | } | ||||
437 | |||||
438 | dbg("bayes: $msgid_t already learnt as opposite, forgetting first"); | ||||
439 | |||||
440 | # kluge so that forget() won't untie the db on us ... | ||||
441 | my $orig = $self->{main}->{learn_caller_will_untie}; | ||||
442 | $self->{main}->{learn_caller_will_untie} = 1; | ||||
443 | |||||
444 | my $fatal = !defined $self->{main}->{bayes_scanner}->forget ($msg); | ||||
445 | |||||
446 | # reset the value post-forget() ... | ||||
447 | $self->{main}->{learn_caller_will_untie} = $orig; | ||||
448 | |||||
449 | # forget() gave us a fatal error, so propagate that up | ||||
450 | if ($fatal) { | ||||
451 | dbg("bayes: forget() returned a fatal error, so learn() will too"); | ||||
452 | return; | ||||
453 | } | ||||
454 | } | ||||
455 | |||||
456 | # we're only going to have seen this once, so stop if it's been | ||||
457 | # seen already | ||||
458 | last; | ||||
459 | } | ||||
460 | } | ||||
461 | |||||
462 | # Now that we're sure we haven't seen this message before ... | ||||
463 | 234 | 790µs | $msgid = $msgid[0]; | ||
464 | |||||
465 | 234 | 2.83ms | 234 | 1.40s | my $msgatime = $msg->receive_date(); # spent 1.40s making 234 calls to Mail::SpamAssassin::Message::receive_date, avg 5.97ms/call |
466 | |||||
467 | # If the message atime comes back as being more than 1 day in the | ||||
468 | # future, something's messed up and we should revert to current time as | ||||
469 | # a safety measure. | ||||
470 | # | ||||
471 | 234 | 1.21ms | $msgatime = time if ( $msgatime - time > 86400 ); | ||
472 | |||||
473 | 234 | 2.46ms | 234 | 31.0s | my $tokens = $self->tokenize($msg, $msgdata); # spent 31.0s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::tokenize, avg 132ms/call |
474 | |||||
475 | 468 | 8.70ms | 234 | 2.61ms | { my $timer = $self->{main}->time_method('b_count_change'); # spent 2.61ms making 234 calls to Mail::SpamAssassin::time_method, avg 11µs/call |
476 | 234 | 1.03ms | if ($isspam) { | ||
477 | 234 | 2.49ms | 234 | 9.65ms | $self->{store}->nspam_nham_change(1, 0); # spent 9.65ms making 234 calls to Mail::SpamAssassin::BayesStore::DBM::nspam_nham_change, avg 41µs/call |
478 | 234 | 2.44ms | 234 | 3.49s | $self->{store}->multi_tok_count_change(1, 0, $tokens, $msgatime); # spent 3.49s making 234 calls to Mail::SpamAssassin::BayesStore::DBM::multi_tok_count_change, avg 14.9ms/call |
479 | } else { | ||||
480 | $self->{store}->nspam_nham_change(0, 1); | ||||
481 | $self->{store}->multi_tok_count_change(0, 1, $tokens, $msgatime); | ||||
482 | } | ||||
483 | } | ||||
484 | |||||
485 | 234 | 3.06ms | 234 | 11.2ms | $self->{store}->seen_put ($msgid, ($isspam ? 's' : 'h')); # spent 11.2ms making 234 calls to Mail::SpamAssassin::BayesStore::DBM::seen_put, avg 48µs/call |
486 | 234 | 2.15ms | 234 | 104ms | $self->{store}->cleanup(); # spent 104ms making 234 calls to Mail::SpamAssassin::BayesStore::DBM::cleanup, avg 443µs/call |
487 | |||||
488 | 234 | 5.80ms | 234 | 0s | $self->{main}->call_plugins("bayes_learn", { toksref => $tokens, # spent 17.6ms making 234 calls to Mail::SpamAssassin::call_plugins, avg 75µs/call, recursion: max depth 1, sum of overlapping time 17.6ms |
489 | isspam => $isspam, | ||||
490 | msgid => $msgid, | ||||
491 | msgatime => $msgatime, | ||||
492 | }); | ||||
493 | |||||
494 | 234 | 3.06ms | 234 | 2.42ms | dbg("bayes: learned '$msgid', atime: $msgatime"); # spent 2.42ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 10µs/call |
495 | |||||
496 | 234 | 55.4ms | 1; | ||
497 | } | ||||
498 | |||||
499 | ########################################################################### | ||||
500 | |||||
501 | # Plugin hook. | ||||
502 | sub forget_message { | ||||
503 | my ($self, $params) = @_; | ||||
504 | my $msg = $params->{msg}; | ||||
505 | my $id = $params->{id}; | ||||
506 | |||||
507 | if (!$self->{conf}->{use_bayes}) { return; } | ||||
508 | |||||
509 | my $msgdata = $self->get_body_from_msg ($msg); | ||||
510 | my $ret; | ||||
511 | |||||
512 | # we still tie for writing here, since we write to the seen db | ||||
513 | # synchronously | ||||
514 | eval { | ||||
515 | local $SIG{'__DIE__'}; # do not run user die() traps in here | ||||
516 | my $timer = $self->{main}->time_method("b_learn"); | ||||
517 | |||||
518 | my $ok; | ||||
519 | if ($self->{main}->{learn_to_journal}) { | ||||
520 | # If we're going to learn to journal, we'll try going r/o first... | ||||
521 | # If that fails for some reason, let's try going r/w. This happens | ||||
522 | # if the DB doesn't exist yet. | ||||
523 | $ok = $self->{store}->tie_db_readonly() || $self->{store}->tie_db_writable(); | ||||
524 | } else { | ||||
525 | $ok = $self->{store}->tie_db_writable(); | ||||
526 | } | ||||
527 | |||||
528 | if ($ok) { | ||||
529 | $ret = $self->_forget_trapped ($msg, $msgdata, $id); | ||||
530 | |||||
531 | if (!$self->{main}->{learn_caller_will_untie}) { | ||||
532 | $self->{store}->untie_db(); | ||||
533 | } | ||||
534 | } | ||||
535 | 1; | ||||
536 | } or do { # if we died, untie the dbs. | ||||
537 | my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat; | ||||
538 | $self->{store}->untie_db(); | ||||
539 | die "bayes: (in forget) $eval_stat\n"; | ||||
540 | }; | ||||
541 | |||||
542 | return $ret; | ||||
543 | } | ||||
544 | |||||
545 | # this function is trapped by the wrapper above | ||||
546 | sub _forget_trapped { | ||||
547 | my ($self, $msg, $msgdata, $msgid) = @_; | ||||
548 | my @msgid = ( $msgid ); | ||||
549 | my $isspam; | ||||
550 | |||||
551 | if (!defined $msgid) { | ||||
552 | @msgid = $self->get_msgid($msg); | ||||
553 | } | ||||
554 | |||||
555 | while( $msgid = shift @msgid ) { | ||||
556 | my $seen = $self->{store}->seen_get ($msgid); | ||||
557 | |||||
558 | if (defined ($seen)) { | ||||
559 | if ($seen eq 's') { | ||||
560 | $isspam = 1; | ||||
561 | } elsif ($seen eq 'h') { | ||||
562 | $isspam = 0; | ||||
563 | } else { | ||||
564 | dbg("bayes: forget: msgid $msgid seen entry is neither ham nor spam, ignored"); | ||||
565 | return 0; | ||||
566 | } | ||||
567 | |||||
568 | # messages should only be learned once, so stop if we find a msgid | ||||
569 | # which was seen before | ||||
570 | last; | ||||
571 | } | ||||
572 | else { | ||||
573 | dbg("bayes: forget: msgid $msgid not learnt, ignored"); | ||||
574 | } | ||||
575 | } | ||||
576 | |||||
577 | # This message wasn't learnt before, so return | ||||
578 | if (!defined $isspam) { | ||||
579 | dbg("bayes: forget: no msgid from this message has been learnt, skipping message"); | ||||
580 | return 0; | ||||
581 | } | ||||
582 | elsif ($isspam) { | ||||
583 | $self->{store}->nspam_nham_change (-1, 0); | ||||
584 | } | ||||
585 | else { | ||||
586 | $self->{store}->nspam_nham_change (0, -1); | ||||
587 | } | ||||
588 | |||||
589 | my $tokens = $self->tokenize($msg, $msgdata); | ||||
590 | |||||
591 | if ($isspam) { | ||||
592 | $self->{store}->multi_tok_count_change (-1, 0, $tokens); | ||||
593 | } else { | ||||
594 | $self->{store}->multi_tok_count_change (0, -1, $tokens); | ||||
595 | } | ||||
596 | |||||
597 | $self->{store}->seen_delete ($msgid); | ||||
598 | $self->{store}->cleanup(); | ||||
599 | |||||
600 | $self->{main}->call_plugins("bayes_forget", { toksref => $tokens, | ||||
601 | isspam => $isspam, | ||||
602 | msgid => $msgid, | ||||
603 | }); | ||||
604 | |||||
605 | 1; | ||||
606 | } | ||||
607 | |||||
608 | ########################################################################### | ||||
609 | |||||
610 | # Plugin hook. | ||||
611 | sub learner_sync { | ||||
612 | my ($self, $params) = @_; | ||||
613 | if (!$self->{conf}->{use_bayes}) { return 0; } | ||||
614 | dbg("bayes: bayes journal sync starting"); | ||||
615 | $self->{store}->sync($params); | ||||
616 | dbg("bayes: bayes journal sync completed"); | ||||
617 | } | ||||
618 | |||||
619 | ########################################################################### | ||||
620 | |||||
621 | # Plugin hook. | ||||
622 | sub learner_expire_old_training { | ||||
623 | my ($self, $params) = @_; | ||||
624 | if (!$self->{conf}->{use_bayes}) { return 0; } | ||||
625 | dbg("bayes: expiry starting"); | ||||
626 | my $timer = $self->{main}->time_method("expire_bayes"); | ||||
627 | $self->{store}->expire_old_tokens($params); | ||||
628 | dbg("bayes: expiry completed"); | ||||
629 | } | ||||
630 | |||||
631 | ########################################################################### | ||||
632 | |||||
633 | # Plugin hook. | ||||
634 | # Check to make sure we can tie() the DB and that we have enough entries to do a scan. | ||||
635 | # If we're told the caller will untie(), go ahead and leave the DB tied. | ||||
636 | # spent 506µs (26+480) within Mail::SpamAssassin::Plugin::Bayes::learner_is_scan_available which was called:
# once (26µs+480µs) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm
sub learner_is_scan_available { | ||||
637 | 1 | 2µs | my ($self, $params) = @_; | ||
638 | |||||
639 | 1 | 4µs | return 0 unless $self->{conf}->{use_bayes}; | ||
640 | 1 | 18µs | 1 | 480µs | return 0 unless $self->{store}->tie_db_readonly(); # spent 480µs making 1 call to Mail::SpamAssassin::BayesStore::DBM::tie_db_readonly |
641 | |||||
642 | # We need the DB to stay tied, so if the journal sync occurs, don't untie! | ||||
643 | my $caller_untie = $self->{main}->{learn_caller_will_untie}; | ||||
644 | $self->{main}->{learn_caller_will_untie} = 1; | ||||
645 | |||||
646 | # Do a journal sync if necessary. Do this before the nspam_nham_get() | ||||
647 | # call since the sync may cause an update in the number of messages | ||||
648 | # learnt. | ||||
649 | $self->_opportunistic_calls(1); | ||||
650 | |||||
651 | # Reset the variable appropriately | ||||
652 | $self->{main}->{learn_caller_will_untie} = $caller_untie; | ||||
653 | |||||
654 | my ($ns, $nn) = $self->{store}->nspam_nham_get(); | ||||
655 | |||||
656 | if ($ns < $self->{conf}->{bayes_min_spam_num}) { | ||||
657 | dbg("bayes: not available for scanning, only $ns spam(s) in bayes DB < ".$self->{conf}->{bayes_min_spam_num}); | ||||
658 | if (!$self->{main}->{learn_caller_will_untie}) { | ||||
659 | $self->{store}->untie_db(); | ||||
660 | } | ||||
661 | return 0; | ||||
662 | } | ||||
663 | if ($nn < $self->{conf}->{bayes_min_ham_num}) { | ||||
664 | dbg("bayes: not available for scanning, only $nn ham(s) in bayes DB < ".$self->{conf}->{bayes_min_ham_num}); | ||||
665 | if (!$self->{main}->{learn_caller_will_untie}) { | ||||
666 | $self->{store}->untie_db(); | ||||
667 | } | ||||
668 | return 0; | ||||
669 | } | ||||
670 | |||||
671 | return 1; | ||||
672 | } | ||||
673 | |||||
674 | ########################################################################### | ||||
675 | |||||
676 | sub scan { | ||||
677 | my ($self, $permsgstatus, $msg) = @_; | ||||
678 | my $score; | ||||
679 | |||||
680 | return unless $self->{conf}->{use_learner}; | ||||
681 | |||||
682 | # When we're doing a scan, we'll guarantee that we'll do the untie, | ||||
683 | # so override the global setting until we're done. | ||||
684 | my $caller_untie = $self->{main}->{learn_caller_will_untie}; | ||||
685 | $self->{main}->{learn_caller_will_untie} = 1; | ||||
686 | |||||
687 | goto skip if ($self->{main}->{bayes_scanner}->ignore_message($permsgstatus)); | ||||
688 | |||||
689 | goto skip unless $self->learner_is_scan_available(); | ||||
690 | |||||
691 | my ($ns, $nn) = $self->{store}->nspam_nham_get(); | ||||
692 | |||||
693 | ## if ($self->{log_raw_counts}) { # see _compute_prob_for_token() | ||||
694 | ## $self->{raw_counts} = " ns=$ns nn=$nn "; | ||||
695 | ## } | ||||
696 | |||||
697 | dbg("bayes: corpus size: nspam = $ns, nham = $nn"); | ||||
698 | |||||
699 | my $msgtokens; | ||||
700 | { my $timer = $self->{main}->time_method('b_tokenize'); | ||||
701 | my $msgdata = $self->_get_msgdata_from_permsgstatus ($permsgstatus); | ||||
702 | $msgtokens = $self->tokenize($msg, $msgdata); | ||||
703 | } | ||||
704 | |||||
705 | my $tokensdata; | ||||
706 | { my $timer = $self->{main}->time_method('b_tok_get_all'); | ||||
707 | $tokensdata = $self->{store}->tok_get_all(keys %{$msgtokens}); | ||||
708 | } | ||||
709 | |||||
710 | my $timer_compute_prob = $self->{main}->time_method('b_comp_prob'); | ||||
711 | |||||
712 | my $probabilities_ref = | ||||
713 | $self->_compute_prob_for_all_tokens($tokensdata, $ns, $nn); | ||||
714 | |||||
715 | my %pw; | ||||
716 | foreach my $tokendata (@{$tokensdata}) { | ||||
717 | my $prob = shift(@$probabilities_ref); | ||||
718 | next unless defined $prob; | ||||
719 | my ($token, $tok_spam, $tok_ham, $atime) = @{$tokendata}; | ||||
720 | $pw{$token} = { | ||||
721 | prob => $prob, | ||||
722 | spam_count => $tok_spam, | ||||
723 | ham_count => $tok_ham, | ||||
724 | atime => $atime | ||||
725 | }; | ||||
726 | } | ||||
727 | |||||
728 | my @pw_keys = keys %pw; | ||||
729 | |||||
730 | # If none of the tokens were found in the DB, we're going to skip | ||||
731 | # this message... | ||||
732 | if (!@pw_keys) { | ||||
733 | dbg("bayes: cannot use bayes on this message; none of the tokens were found in the database"); | ||||
734 | goto skip; | ||||
735 | } | ||||
736 | |||||
737 | my $tcount_total = keys %{$msgtokens}; | ||||
738 | my $tcount_learned = scalar @pw_keys; | ||||
739 | |||||
740 | # Figure out the message receive time (used as atime below) | ||||
741 | # If the message atime comes back as being in the future, something's | ||||
742 | # messed up and we should revert to current time as a safety measure. | ||||
743 | # | ||||
744 | my $msgatime = $msg->receive_date(); | ||||
745 | my $now = time; | ||||
746 | $msgatime = $now if ( $msgatime > $now ); | ||||
747 | |||||
748 | my @touch_tokens; | ||||
749 | my $tinfo_spammy = $permsgstatus->{bayes_token_info_spammy} = []; | ||||
750 | my $tinfo_hammy = $permsgstatus->{bayes_token_info_hammy} = []; | ||||
751 | |||||
752 | my %tok_strength = map( ($_, abs($pw{$_}->{prob} - 0.5)), @pw_keys); | ||||
753 | my $log_each_token = (would_log('dbg', 'bayes') > 1); | ||||
754 | |||||
755 | # now take the most significant tokens and calculate probs using | ||||
756 | # Robinson's formula. | ||||
757 | |||||
758 | @pw_keys = sort { $tok_strength{$b} <=> $tok_strength{$a} } @pw_keys; | ||||
759 | |||||
760 | if (@pw_keys > N_SIGNIFICANT_TOKENS) { $#pw_keys = N_SIGNIFICANT_TOKENS - 1 } | ||||
761 | |||||
762 | my @sorted; | ||||
763 | foreach my $tok (@pw_keys) { | ||||
764 | next if $tok_strength{$tok} < | ||||
765 | $Mail::SpamAssassin::Bayes::Combine::MIN_PROB_STRENGTH; | ||||
766 | |||||
767 | my $pw_tok = $pw{$tok}; | ||||
768 | my $pw_prob = $pw_tok->{prob}; | ||||
769 | |||||
770 | # What's more expensive, scanning headers for HAMMYTOKENS and | ||||
771 | # SPAMMYTOKENS tags that aren't there or collecting data that | ||||
772 | # won't be used? Just collecting the data is certainly simpler. | ||||
773 | # | ||||
774 | my $raw_token = $msgtokens->{$tok} || "(unknown)"; | ||||
775 | my $s = $pw_tok->{spam_count}; | ||||
776 | my $n = $pw_tok->{ham_count}; | ||||
777 | my $a = $pw_tok->{atime}; | ||||
778 | |||||
779 | push( @{ $pw_prob < 0.5 ? $tinfo_hammy : $tinfo_spammy }, | ||||
780 | [$raw_token, $pw_prob, $s, $n, $a] ); | ||||
781 | |||||
782 | push(@sorted, $pw_prob); | ||||
783 | |||||
784 | # update the atime on this token, it proved useful | ||||
785 | push(@touch_tokens, $tok); | ||||
786 | |||||
787 | if ($log_each_token) { | ||||
788 | dbg("bayes: token '$raw_token' => $pw_prob"); | ||||
789 | } | ||||
790 | } | ||||
791 | |||||
792 | if (!@sorted || (REQUIRE_SIGNIFICANT_TOKENS_TO_SCORE > 0 && | ||||
793 | $#sorted <= REQUIRE_SIGNIFICANT_TOKENS_TO_SCORE)) | ||||
794 | { | ||||
795 | dbg("bayes: cannot use bayes on this message; not enough usable tokens found"); | ||||
796 | goto skip; | ||||
797 | } | ||||
798 | |||||
799 | $score = Mail::SpamAssassin::Bayes::Combine::combine($ns, $nn, \@sorted); | ||||
800 | undef $timer_compute_prob; # end a timing section | ||||
801 | |||||
802 | # Couldn't come up with a probability? | ||||
803 | goto skip unless defined $score; | ||||
804 | |||||
805 | dbg("bayes: score = $score"); | ||||
806 | |||||
807 | # no need to call tok_touch_all unless there were significant | ||||
808 | # tokens and a score was returned | ||||
809 | # we don't really care about the return value here | ||||
810 | |||||
811 | { my $timer = $self->{main}->time_method('b_tok_touch_all'); | ||||
812 | $self->{store}->tok_touch_all(\@touch_tokens, $msgatime); | ||||
813 | } | ||||
814 | |||||
815 | my $timer_finish = $self->{main}->time_method('b_finish'); | ||||
816 | |||||
817 | $permsgstatus->{bayes_nspam} = $ns; | ||||
818 | $permsgstatus->{bayes_nham} = $nn; | ||||
819 | |||||
820 | ## if ($self->{log_raw_counts}) { # see _compute_prob_for_token() | ||||
821 | ## print "#Bayes-Raw-Counts: $self->{raw_counts}\n"; | ||||
822 | ## } | ||||
823 | |||||
824 | $self->{main}->call_plugins("bayes_scan", { toksref => $msgtokens, | ||||
825 | probsref => \%pw, | ||||
826 | score => $score, | ||||
827 | msgatime => $msgatime, | ||||
828 | significant_tokens => \@touch_tokens, | ||||
829 | }); | ||||
830 | |||||
831 | skip: | ||||
832 | if (!defined $score) { | ||||
833 | dbg("bayes: not scoring message, returning undef"); | ||||
834 | } | ||||
835 | |||||
836 | undef $timer_compute_prob; # end a timing section if still running | ||||
837 | if (!defined $timer_finish) { | ||||
838 | $timer_finish = $self->{main}->time_method('b_finish'); | ||||
839 | } | ||||
840 | |||||
841 | # Take any opportunistic actions we can take | ||||
842 | if ($self->{main}->{opportunistic_expire_check_only}) { | ||||
843 | # we're supposed to report on expiry only -- so do the | ||||
844 | # _opportunistic_calls() run for the journal only. | ||||
845 | $self->_opportunistic_calls(1); | ||||
846 | $permsgstatus->{bayes_expiry_due} = $self->{store}->expiry_due(); | ||||
847 | } | ||||
848 | else { | ||||
849 | $self->_opportunistic_calls(); | ||||
850 | } | ||||
851 | |||||
852 | # Do any cleanup we need to do | ||||
853 | $self->{store}->cleanup(); | ||||
854 | |||||
855 | # Reset the value accordingly | ||||
856 | $self->{main}->{learn_caller_will_untie} = $caller_untie; | ||||
857 | |||||
858 | # If our caller won't untie the db, we need to do it. | ||||
859 | if (!$caller_untie) { | ||||
860 | $self->{store}->untie_db(); | ||||
861 | } | ||||
862 | |||||
863 | $permsgstatus->set_tag ('BAYESTCHAMMY', | ||||
864 | ($tinfo_hammy ? scalar @{$tinfo_hammy} : 0)); | ||||
865 | $permsgstatus->set_tag ('BAYESTCSPAMMY', | ||||
866 | ($tinfo_spammy ? scalar @{$tinfo_spammy} : 0)); | ||||
867 | $permsgstatus->set_tag ('BAYESTCLEARNED', $tcount_learned); | ||||
868 | $permsgstatus->set_tag ('BAYESTC', $tcount_total); | ||||
869 | |||||
870 | $permsgstatus->set_tag ('HAMMYTOKENS', sub { | ||||
871 | my $pms = shift; | ||||
872 | $self->bayes_report_make_list | ||||
873 | ($pms, $pms->{bayes_token_info_hammy}, shift); | ||||
874 | }); | ||||
875 | |||||
876 | $permsgstatus->set_tag ('SPAMMYTOKENS', sub { | ||||
877 | my $pms = shift; | ||||
878 | $self->bayes_report_make_list | ||||
879 | ($pms, $pms->{bayes_token_info_spammy}, shift); | ||||
880 | }); | ||||
881 | |||||
882 | $permsgstatus->set_tag ('TOKENSUMMARY', sub { | ||||
883 | my $pms = shift; | ||||
884 | if ( defined $pms->{tag_data}{BAYESTC} ) | ||||
885 | { | ||||
886 | my $tcount_neutral = $pms->{tag_data}{BAYESTCLEARNED} | ||||
887 | - $pms->{tag_data}{BAYESTCSPAMMY} | ||||
888 | - $pms->{tag_data}{BAYESTCHAMMY}; | ||||
889 | my $tcount_new = $pms->{tag_data}{BAYESTC} | ||||
890 | - $pms->{tag_data}{BAYESTCLEARNED}; | ||||
891 | "Tokens: new, $tcount_new; " | ||||
892 | ."hammy, $pms->{tag_data}{BAYESTCHAMMY}; " | ||||
893 | ."neutral, $tcount_neutral; " | ||||
894 | ."spammy, $pms->{tag_data}{BAYESTCSPAMMY}." | ||||
895 | } else { | ||||
896 | "Bayes not run."; | ||||
897 | } | ||||
898 | }); | ||||
899 | |||||
900 | |||||
901 | return $score; | ||||
902 | } | ||||
903 | |||||
904 | ########################################################################### | ||||
905 | |||||
906 | # Plugin hook. | ||||
907 | sub learner_dump_database { | ||||
908 | my ($self, $params) = @_; | ||||
909 | my $magic = $params->{magic}; | ||||
910 | my $toks = $params->{toks}; | ||||
911 | my $regex = $params->{regex}; | ||||
912 | |||||
913 | # allow dump to occur even if use_bayes disables everything else ... | ||||
914 | #return 0 unless $self->{conf}->{use_bayes}; | ||||
915 | return 0 unless $self->{store}->tie_db_readonly(); | ||||
916 | |||||
917 | my @vars = $self->{store}->get_storage_variables(); | ||||
918 | |||||
919 | my($sb,$ns,$nh,$nt,$le,$oa,$bv,$js,$ad,$er,$na) = @vars; | ||||
920 | |||||
921 | my $template = '%3.3f %10u %10u %10u %s'."\n"; | ||||
922 | |||||
923 | if ( $magic ) { | ||||
924 | printf($template, 0.0, 0, $bv, 0, 'non-token data: bayes db version') | ||||
925 | or die "Error writing: $!"; | ||||
926 | printf($template, 0.0, 0, $ns, 0, 'non-token data: nspam') | ||||
927 | or die "Error writing: $!"; | ||||
928 | printf($template, 0.0, 0, $nh, 0, 'non-token data: nham') | ||||
929 | or die "Error writing: $!"; | ||||
930 | printf($template, 0.0, 0, $nt, 0, 'non-token data: ntokens') | ||||
931 | or die "Error writing: $!"; | ||||
932 | printf($template, 0.0, 0, $oa, 0, 'non-token data: oldest atime') | ||||
933 | or die "Error writing: $!"; | ||||
934 | if ( $bv >= 2 ) { | ||||
935 | printf($template, 0.0, 0, $na, 0, 'non-token data: newest atime') | ||||
936 | or die "Error writing: $!"; | ||||
937 | } | ||||
938 | if ( $bv < 2 ) { | ||||
939 | printf($template, 0.0, 0, $sb, 0, 'non-token data: current scan-count') | ||||
940 | or die "Error writing: $!"; | ||||
941 | } | ||||
942 | if ( $bv >= 2 ) { | ||||
943 | printf($template, 0.0, 0, $js, 0, 'non-token data: last journal sync atime') | ||||
944 | or die "Error writing: $!"; | ||||
945 | } | ||||
946 | printf($template, 0.0, 0, $le, 0, 'non-token data: last expiry atime') | ||||
947 | or die "Error writing: $!"; | ||||
948 | if ( $bv >= 2 ) { | ||||
949 | printf($template, 0.0, 0, $ad, 0, 'non-token data: last expire atime delta') | ||||
950 | or die "Error writing: $!"; | ||||
951 | |||||
952 | printf($template, 0.0, 0, $er, 0, 'non-token data: last expire reduction count') | ||||
953 | or die "Error writing: $!"; | ||||
954 | } | ||||
955 | } | ||||
956 | |||||
957 | if ( $toks ) { | ||||
958 | # let the store sort out the db_toks | ||||
959 | $self->{store}->dump_db_toks($template, $regex, @vars); | ||||
960 | } | ||||
961 | |||||
962 | if (!$self->{main}->{learn_caller_will_untie}) { | ||||
963 | $self->{store}->untie_db(); | ||||
964 | } | ||||
965 | return 1; | ||||
966 | } | ||||
967 | |||||
968 | ########################################################################### | ||||
969 | # TODO: these are NOT public, but the test suite needs to call them. | ||||
970 | |||||
971 | # spent 306ms (125+181) within Mail::SpamAssassin::Plugin::Bayes::get_msgid which was called 555 times, avg 552µs/call:
# 321 times (63.9ms+106ms) by Mail::SpamAssassin::Plugin::TxRep::check_senders_reputation at line 1241 of Mail/SpamAssassin/Plugin/TxRep.pm, avg 528µs/call
# 234 times (61.6ms+75.0ms) by Mail::SpamAssassin::Plugin::Bayes::_learn_trapped at line 417, avg 584µs/call
sub get_msgid { | ||||
972 | 555 | 1.38ms | my ($self, $msg) = @_; | ||
973 | |||||
974 | 555 | 1.16ms | my @msgid; | ||
975 | |||||
976 | 555 | 5.88ms | 555 | 58.0ms | my $msgid = $msg->get_header("Message-Id"); # spent 58.0ms making 555 calls to Mail::SpamAssassin::Message::Node::get_header, avg 105µs/call |
977 | 555 | 9.85ms | 530 | 3.90ms | if (defined $msgid && $msgid ne '' && $msgid !~ /^\s*<\s*(?:\@sa_generated)?>.*$/) { # spent 3.90ms making 530 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 7µs/call |
978 | # remove \r and < and > prefix/suffixes | ||||
979 | 530 | 2.38ms | chomp $msgid; | ||
980 | 1060 | 28.8ms | 1060 | 8.50ms | $msgid =~ s/^<//; $msgid =~ s/>.*$//g; # spent 8.50ms making 1060 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 8µs/call |
981 | 530 | 2.02ms | push(@msgid, $msgid); | ||
982 | } | ||||
983 | |||||
984 | # Modified 2012-01-17 per bug 5185 to remove last received from msg_id calculation | ||||
985 | |||||
986 | # Use sha1_hex(Date: and top N bytes of body) | ||||
987 | # where N is MIN(1024 bytes, 1/2 of body length) | ||||
988 | # | ||||
989 | 555 | 5.23ms | 555 | 67.1ms | my $date = $msg->get_header("Date"); # spent 67.1ms making 555 calls to Mail::SpamAssassin::Message::Node::get_header, avg 121µs/call |
990 | 555 | 1.91ms | $date = "None" if (!defined $date || $date eq ''); # No Date? | ||
991 | |||||
992 | #Removed per bug 5185 | ||||
993 | #my @rcvd = $msg->get_header("Received"); | ||||
994 | #my $rcvd = $rcvd[$#rcvd]; | ||||
995 | #$rcvd = "None" if (!defined $rcvd || $rcvd eq ''); # No Received? | ||||
996 | |||||
997 | # Make a copy since pristine_body is a reference ... | ||||
998 | 555 | 21.5ms | 555 | 6.21ms | my $body = join('', $msg->get_pristine_body()); # spent 6.21ms making 555 calls to Mail::SpamAssassin::Message::get_pristine_body, avg 11µs/call |
999 | |||||
1000 | 555 | 2.92ms | if (length($body) > 64) { # Small Body? | ||
1001 | 555 | 2.42ms | my $keep = ( length $body > 2048 ? 1024 : int(length($body) / 2) ); | ||
1002 | 555 | 3.08ms | substr($body, $keep) = ''; | ||
1003 | } | ||||
1004 | |||||
1005 | # Strip all CR and LF so that testing midstream from the MTA and after delivery doesn't | ||||
1006 | # generate different IDs simply because of LF<->CR<->CRLF changes. | ||||
1007 | 555 | 51.7ms | 555 | 24.2ms | $body =~ s/[\r\n]//g; # spent 24.2ms making 555 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 44µs/call |
1008 | |||||
1009 | 555 | 21.9ms | 555 | 12.7ms | unshift(@msgid, sha1_hex($date."\000".$body).'@sa_generated'); # spent 12.7ms making 555 calls to Digest::SHA::sha1_hex, avg 23µs/call |
1010 | |||||
1011 | 555 | 6.86ms | return wantarray ? @msgid : $msgid[0]; | ||
1012 | } | ||||
1013 | |||||
1014 | # spent 6.25s (53.1ms+6.19) within Mail::SpamAssassin::Plugin::Bayes::get_body_from_msg which was called 234 times, avg 26.7ms/call:
# 234 times (53.1ms+6.19s) by Mail::SpamAssassin::Plugin::Bayes::learn_message at line 377, avg 26.7ms/call
sub get_body_from_msg { | ||||
1015 | 234 | 520µs | my ($self, $msg) = @_; | ||
1016 | |||||
1017 | 234 | 933µs | if (!ref $msg) { | ||
1018 | # I have no idea why this seems to happen. TODO | ||||
1019 | warn "bayes: msg not a ref: '$msg'"; | ||||
1020 | return { }; | ||||
1021 | } | ||||
1022 | |||||
1023 | my $permsgstatus = | ||||
1024 | 234 | 3.15ms | 234 | 69.0ms | Mail::SpamAssassin::PerMsgStatus->new($self->{main}, $msg); # spent 69.0ms making 234 calls to Mail::SpamAssassin::PerMsgStatus::new, avg 295µs/call |
1025 | 234 | 2.59ms | 234 | 2.41ms | $msg->extract_message_metadata ($permsgstatus); # spent 2.41ms making 234 calls to Mail::SpamAssassin::Message::extract_message_metadata, avg 10µs/call |
1026 | 234 | 2.16ms | 234 | 6.07s | my $msgdata = $self->_get_msgdata_from_permsgstatus ($permsgstatus); # spent 6.07s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::_get_msgdata_from_permsgstatus, avg 25.9ms/call |
1027 | 234 | 2.12ms | 234 | 46.7ms | $permsgstatus->finish(); # spent 46.7ms making 234 calls to Mail::SpamAssassin::PerMsgStatus::finish, avg 200µs/call |
1028 | |||||
1029 | 234 | 528µs | if (!defined $msgdata) { | ||
1030 | # why?! | ||||
1031 | warn "bayes: failed to get body for ".scalar($self->get_msgid($self->{msg}))."\n"; | ||||
1032 | return { }; | ||||
1033 | } | ||||
1034 | |||||
1035 | 234 | 4.16ms | 234 | 8.52ms | return $msgdata; # spent 8.52ms making 234 calls to Mail::SpamAssassin::PerMsgStatus::DESTROY, avg 36µs/call |
1036 | } | ||||
1037 | |||||
1038 | # spent 6.07s (16.1ms+6.05) within Mail::SpamAssassin::Plugin::Bayes::_get_msgdata_from_permsgstatus which was called 234 times, avg 25.9ms/call:
# 234 times (16.1ms+6.05s) by Mail::SpamAssassin::Plugin::Bayes::get_body_from_msg at line 1026, avg 25.9ms/call
sub _get_msgdata_from_permsgstatus { | ||||
1039 | 234 | 475µs | my ($self, $pms) = @_; | ||
1040 | |||||
1041 | 234 | 922µs | my $t_src = $self->{conf}->{bayes_token_sources}; | ||
1042 | 234 | 644µs | my $msgdata = { }; | ||
1043 | $msgdata->{bayes_token_body} = | ||||
1044 | 234 | 3.19ms | 234 | 248ms | $pms->{msg}->get_visible_rendered_body_text_array() if $t_src->{visible}; # spent 248ms making 234 calls to Mail::SpamAssassin::Message::get_visible_rendered_body_text_array, avg 1.06ms/call |
1045 | $msgdata->{bayes_token_inviz} = | ||||
1046 | 234 | 2.72ms | 234 | 106ms | $pms->{msg}->get_invisible_rendered_body_text_array() if $t_src->{invisible}; # spent 106ms making 234 calls to Mail::SpamAssassin::Message::get_invisible_rendered_body_text_array, avg 453µs/call |
1047 | $msgdata->{bayes_mimepart_digests} = | ||||
1048 | 234 | 489µs | $pms->{msg}->get_mimepart_digests() if $t_src->{mimepart}; | ||
1049 | 234 | 751µs | @{$msgdata->{bayes_token_uris}} = | ||
1050 | 234 | 3.97ms | 234 | 5.70s | $pms->get_uri_list() if $t_src->{uri}; # spent 5.70s making 234 calls to Mail::SpamAssassin::PerMsgStatus::get_uri_list, avg 24.3ms/call |
1051 | 234 | 2.13ms | return $msgdata; | ||
1052 | } | ||||
1053 | |||||
1054 | ########################################################################### | ||||
1055 | |||||
1056 | # The calling functions expect a uniq'ed array of tokens ... | ||||
1057 | # spent 31.0s (3.09+27.9) within Mail::SpamAssassin::Plugin::Bayes::tokenize which was called 234 times, avg 132ms/call:
# 234 times (3.09s+27.9s) by Mail::SpamAssassin::Plugin::Bayes::_learn_trapped at line 473, avg 132ms/call
sub tokenize { | ||||
1058 | 234 | 607µs | my ($self, $msg, $msgdata) = @_; | ||
1059 | |||||
1060 | 234 | 1.10ms | my $t_src = $self->{conf}->{bayes_token_sources}; | ||
1061 | 234 | 503µs | my @tokens; | ||
1062 | |||||
1063 | # visible tokens from the body | ||||
1064 | 234 | 2.41ms | if ($msgdata->{bayes_token_body}) { | ||
1065 | my(@t) = map($self->_tokenize_line ($_, '', 1), | ||||
1066 | 468 | 115ms | 4456 | 12.1s | @{$msgdata->{bayes_token_body}} ); # spent 12.1s making 4456 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 2.71ms/call |
1067 | 234 | 2.29ms | 234 | 2.31ms | dbg("bayes: tokenized body: %d tokens", scalar @t); # spent 2.31ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 10µs/call |
1068 | 234 | 55.4ms | push(@tokens, @t); | ||
1069 | } | ||||
1070 | # the URI list | ||||
1071 | 234 | 1.64ms | if ($msgdata->{bayes_token_uris}) { | ||
1072 | my(@t) = map($self->_tokenize_line ($_, '', 2), | ||||
1073 | 468 | 33.7ms | 2708 | 3.29s | @{$msgdata->{bayes_token_uris}} ); # spent 3.29s making 2708 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 1.21ms/call |
1074 | 234 | 1.78ms | 234 | 1.64ms | dbg("bayes: tokenized uri: %d tokens", scalar @t); # spent 1.64ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 7µs/call |
1075 | 234 | 7.17ms | push(@tokens, @t); | ||
1076 | } | ||||
1077 | # add invisible tokens | ||||
1078 | 234 | 1.14ms | if ($msgdata->{bayes_token_inviz}) { | ||
1079 | 234 | 455µs | my $tokprefix; | ||
1080 | 468 | 1.45ms | if (ADD_INVIZ_TOKENS_I_PREFIX) { $tokprefix = 'I*:' } | ||
1081 | if (ADD_INVIZ_TOKENS_NO_PREFIX) { $tokprefix = '' } | ||||
1082 | 234 | 995µs | if (defined $tokprefix) { | ||
1083 | my(@t) = map($self->_tokenize_line ($_, $tokprefix, 1), | ||||
1084 | 468 | 3.54ms | 53 | 584ms | @{$msgdata->{bayes_token_inviz}} ); # spent 584ms making 53 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 11.0ms/call |
1085 | 234 | 1.60ms | 234 | 1.41ms | dbg("bayes: tokenized invisible: %d tokens", scalar @t); # spent 1.41ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 6µs/call |
1086 | 234 | 1.36ms | push(@tokens, @t); | ||
1087 | } | ||||
1088 | } | ||||
1089 | |||||
1090 | # add digests and Content-Type of all MIME parts | ||||
1091 | 234 | 603µs | if ($msgdata->{bayes_mimepart_digests}) { | ||
1092 | my %shorthand = ( # some frequent MIME part contents for human readability | ||||
1093 | 'da39a3ee5e6b4b0d3255bfef95601890afd80709:text/plain'=> 'Empty-Plaintext', | ||||
1094 | 'da39a3ee5e6b4b0d3255bfef95601890afd80709:text/html' => 'Empty-HTML', | ||||
1095 | 'da39a3ee5e6b4b0d3255bfef95601890afd80709:text/xml' => 'Empty-XML', | ||||
1096 | 'adc83b19e793491b1c6ea0fd8b46cd9f32e592fc:text/plain'=> 'OneNL-Plaintext', | ||||
1097 | 'adc83b19e793491b1c6ea0fd8b46cd9f32e592fc:text/html' => 'OneNL-HTML', | ||||
1098 | '71853c6197a6a7f222db0f1978c7cb232b87c5ee:text/plain'=> 'TwoNL-Plaintext', | ||||
1099 | '71853c6197a6a7f222db0f1978c7cb232b87c5ee:text/html' => 'TwoNL-HTML', | ||||
1100 | ); | ||||
1101 | my(@t) = map('MIME:' . ($shorthand{$_} || $_), | ||||
1102 | @{ $msgdata->{bayes_mimepart_digests} }); | ||||
1103 | dbg("bayes: tokenized mime parts: %d tokens", scalar @t); | ||||
1104 | dbg("bayes: mime-part token %s", $_) for @t; | ||||
1105 | push(@tokens, @t); | ||||
1106 | } | ||||
1107 | |||||
1108 | # Tokenize the headers | ||||
1109 | 234 | 2.17ms | if ($t_src->{header}) { | ||
1110 | 234 | 480µs | my(@t); | ||
1111 | 234 | 7.44ms | 234 | 2.99s | my %hdrs = $self->_tokenize_headers ($msg); # spent 2.99s making 234 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers, avg 12.8ms/call |
1112 | 234 | 57.1ms | while( my($prefix, $value) = each %hdrs ) { | ||
1113 | 5605 | 89.2ms | 5605 | 7.60s | push(@t, $self->_tokenize_line ($value, "H$prefix:", 0)); # spent 7.60s making 5605 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_line, avg 1.36ms/call |
1114 | } | ||||
1115 | 234 | 1.97ms | 234 | 2.04ms | dbg("bayes: tokenized header: %d tokens", scalar @t); # spent 2.04ms making 234 calls to Mail::SpamAssassin::Logger::dbg, avg 9µs/call |
1116 | 234 | 39.6ms | push(@tokens, @t); | ||
1117 | } | ||||
1118 | |||||
1119 | # Go ahead and uniq the array, skip null tokens (can happen sometimes) | ||||
1120 | # generate an SHA1 hash and take the lower 40 bits as our token | ||||
1121 | 234 | 714µs | my %tokens; | ||
1122 | 234 | 1.08ms | foreach my $token (@tokens) { | ||
1123 | # skip empty tokens | ||||
1124 | 159813 | 3.85s | 155799 | 1.32s | $tokens{substr(sha1($token), -5)} = $token if $token ne ''; # spent 1.32s making 155799 calls to Digest::SHA::sha1, avg 8µs/call |
1125 | } | ||||
1126 | |||||
1127 | # return the keys == tokens ... | ||||
1128 | 234 | 46.9ms | return \%tokens; | ||
1129 | } | ||||
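The final loop of `tokenize` (source lines 1121-1128) can be tried in isolation. The following is a minimal sketch rather than the module's own code: `hash_tokens` is a hypothetical helper name, and only `Digest::SHA::sha1` is taken from the listing. It shows how keying a hash on the last five bytes (40 bits) of each SHA-1 digest deduplicates the token list while skipping empty tokens.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1);

# Hypothetical helper (not part of Bayes.pm): map each non-empty token to
# the last five bytes (40 bits) of its binary SHA-1 digest.
sub hash_tokens {
    my @raw = @_;
    my %tokens;
    for my $token (@raw) {
        next if $token eq '';                        # skip empty tokens
        $tokens{ substr(sha1($token), -5) } = $token;
    }
    return \%tokens;
}

my $t = hash_tokens('foo', 'bar', '', 'foo');
printf "%d unique tokens\n", scalar keys %$t;
for my $key (sort keys %$t) {
    # keys are raw bytes, so print them as hex
    printf "%s => %s\n", unpack('H*', $key), $t->{$key};
}
```

Because the key is only 40 bits wide, two distinct tokens could in principle share a key, in which case the later token overwrites the earlier one.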
1130 | |||||
1131 | # spent 23.5s (17.3+6.23) within Mail::SpamAssassin::Plugin::Bayes::_tokenize_line which was called 12822 times, avg 1.84ms/call:
# 5605 times (5.50s+2.10s) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1113, avg 1.36ms/call
# 4456 times (8.86s+3.21s) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1066, avg 2.71ms/call
# 2708 times (2.51s+782ms) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1073, avg 1.21ms/call
# 53 times (450ms+134ms) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1084, avg 11.0ms/call
sub _tokenize_line { | ||||
1132 | 12822 | 23.5ms | my $self = $_[0]; | ||
1133 | 12822 | 25.0ms | my $tokprefix = $_[2]; | ||
1134 | 12822 | 21.7ms | my $region = $_[3]; | ||
1135 | 12822 | 97.7ms | local ($_) = $_[1]; | ||
1136 | |||||
1137 | 12822 | 20.5ms | my @rettokens; | ||
1138 | |||||
1139 | # include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings, | ||||
1140 | # and ISO-8859-15 alphas. Do not split on @'s; better results keeping it. | ||||
1141 | # Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!" | ||||
1142 | |||||
1143 | ### (previous:) tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs; | ||||
1144 | |||||
1145 | ### (now): see Bug 7130 for rationale (slower, but makes UTF-8 chars atomic) | ||||
1146 | 12822 | 2.97s | 210534 | 881ms | s{ ( [A-Za-z0-9,@*!_'"\$. -]+ | # spent 805ms making 197712 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 4µs/call
# spent 75.4ms making 12822 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 6µs/call |
1147 | [\xC0-\xDF][\x80-\xBF] | | ||||
1148 | [\xE0-\xEF][\x80-\xBF]{2} | | ||||
1149 | [\xF0-\xF4][\x80-\xBF]{3} | | ||||
1150 | [\xA1-\xFF] ) | . } | ||||
1151 | 185209 | 746ms | { defined $1 ? $1 : ' ' }xsge; | ||
1152 | # should we also turn NBSP ( \xC2\xA0 ) into space? | ||||
1153 | |||||
1154 | # DO split on "..." or "--" or "---"; common formatting error resulting in | ||||
1155 | # hapaxes. Keep the separator itself as a token, though, as long ones can | ||||
1156 | # be good spamsigns. | ||||
1157 | 12822 | 142ms | 12908 | 43.1ms | s/(\w)(\.{3,6})(\w)/$1 $2 $3/gs; # spent 42.5ms making 12822 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 3µs/call
# spent 600µs making 86 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 7µs/call |
1158 | 12822 | 164ms | 12862 | 30.2ms | s/(\w)(\-{2,6})(\w)/$1 $2 $3/gs; # spent 30.0ms making 12822 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 2µs/call
# spent 218µs making 40 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 5µs/call |
1159 | |||||
1160 | 12822 | 45.6ms | if (IGNORE_TITLE_CASE) { | ||
1161 | 12822 | 36.8ms | if ($region == 1 || $region == 2) { | ||
1162 | # lower-case Title Case at start of a full-stop-delimited line (as would | ||||
1163 | # be seen in a Western language). | ||||
1164 | 11448 | 424ms | 14579 | 223ms | s/(?:^|\.\s+)([A-Z])([^A-Z]+)(?:\s|$)/ ' '. (lc $1) . $2 . ' ' /ge; # spent 137ms making 7217 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 19µs/call
# spent 85.3ms making 7362 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 12µs/call |
1165 | } | ||||
1166 | } | ||||
1167 | |||||
1168 | 12822 | 95.9ms | 12822 | 130ms | my $magic_re = $self->{store}->get_magic_re(); # spent 130ms making 12822 calls to Mail::SpamAssassin::BayesStore::DBM::get_magic_re, avg 10µs/call |
1169 | |||||
1170 | # Note that split() in scope of 'use bytes' results in words with utf8 flag | ||||
1171 | # cleared, even if the source string has perl characters semantics !!! | ||||
1172 | # Is this really still desirable? | ||||
1173 | |||||
1174 | 12822 | 325ms | foreach my $token (split) { | ||
1175 | 158560 | 2.16s | 158560 | 757ms | $token =~ s/^[-'"\.,]+//; # trim non-alphanum chars at start or end # spent 757ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 5µs/call |
1176 | 158560 | 2.04s | 158560 | 755ms | $token =~ s/[-'"\.,]+$//; # so we don't get loads of '"foo' tokens # spent 755ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 5µs/call |
1177 | |||||
1178 | # Skip false magic tokens | ||||
1179 | # TVD: we need to do a defined() check since SQL doesn't have magic | ||||
1180 | # tokens, so the SQL BayesStore returns undef. I really want a way | ||||
1181 | # of optimizing that out, but I haven't come up with anything yet. | ||||
1182 | # | ||||
1183 | 158560 | 3.62s | 317120 | 1.08s | next if ( defined $magic_re && $token =~ /$magic_re/ ); # spent 771ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp, avg 5µs/call
# spent 306ms making 158560 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 2µs/call |
1184 | |||||
1185 | # *do* keep 3-byte tokens; there's some solid signs in there | ||||
1186 | 158560 | 394ms | my $len = length($token); | ||
1187 | |||||
1188 | # but extend the stop-list. These are squarely in the gray | ||||
1189 | # area, and it just slows us down to record them. | ||||
1190 | # See http://wiki.apache.org/spamassassin/BayesStopList for more info. | ||||
1191 | # | ||||
1192 | 158560 | 2.20s | 126321 | 874ms | next if $len < 3 || # spent 874ms making 126321 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 7µs/call |
1193 | ($token =~ /^(?:a(?:ble|l(?:ready|l)|n[dy]|re)|b(?:ecause|oth)|c(?:an|ome)|e(?:ach|mail|ven)|f(?:ew|irst|or|rom)|give|h(?:a(?:ve|s)|ttp)|i(?:n(?:formation|to)|t\'s)|just|know|l(?:ike|o(?:ng|ok))|m(?:a(?:de|il(?:(?:ing|to))?|ke|ny)|o(?:re|st)|uch)|n(?:eed|o[tw]|umber)|o(?:ff|n(?:ly|e)|ut|wn)|p(?:eople|lace)|right|s(?:ame|ee|uch)|t(?:h(?:at|is|rough|e)|ime)|using|w(?:eb|h(?:ere|y)|ith(?:out)?|or(?:ld|k))|y(?:ears?|ou(?:(?:\'re|r))?))$/i); | ||||
1194 | |||||
1195 | # are we in the body? If so, apply some body-specific breakouts | ||||
1196 | 109800 | 292ms | if ($region == 1 || $region == 2) { | ||
1197 | 64228 | 1.39s | 128048 | 271ms | if (CHEW_BODY_MAILADDRS && $token =~ /\S\@\S/i) { # spent 271ms making 128048 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 2µs/call |
1198 | 408 | 4.21ms | 408 | 30.7ms | push (@rettokens, $self->_tokenize_mail_addrs ($token)); # spent 30.7ms making 408 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs, avg 75µs/call |
1199 | } | ||||
1200 | elsif (CHEW_BODY_URIS && $token =~ /\S\.[a-z]/i) { | ||||
1201 | 5242 | 31.5ms | push (@rettokens, "UD:".$token); # the full token | ||
1202 | 10484 | 107ms | 5242 | 38.6ms | my $bit = $token; while ($bit =~ s/^[^\.]+\.(.+)$/$1/gs) { # spent 38.6ms making 5242 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call |
1203 | 8956 | 188ms | 8956 | 37.5ms | push (@rettokens, "UD:".$1); # UD = URL domain # spent 37.5ms making 8956 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 4µs/call |
1204 | } | ||||
1205 | } | ||||
1206 | } | ||||
1207 | |||||
1208 | # note: do not trim down overlong tokens if they contain '*'. This is | ||||
1209 | # used as part of split tokens such as "HTo:D*net" indicating that | ||||
1210 | # the domain ".net" appeared in the To header. | ||||
1211 | # | ||||
1212 | 109800 | 372ms | 18366 | 42.6ms | if ($len > MAX_TOKEN_LENGTH && $token !~ /\*/) { # spent 42.6ms making 18366 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 2µs/call |
1213 | |||||
1214 | 17309 | 221ms | 17309 | 75.5ms | if (TOKENIZE_LONG_8BIT_SEQS_AS_UTF8_CHARS && $token =~ /[\x80-\xBF]{2}/) { # spent 75.5ms making 17309 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call |
1215 | # Bug 7135 | ||||
1216 | # collect 3- and 4-byte UTF-8 sequences, ignore 2-byte sequences | ||||
1217 | 9 | 333µs | 9 | 174µs | my(@t) = $token =~ /( (?: [\xE0-\xEF] | [\xF0-\xF4][\x80-\xBF] ) # spent 174µs making 9 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 19µs/call |
1218 | [\x80-\xBF]{2} )/xsg; | ||||
1219 | 9 | 20µs | if (@t) { | ||
1220 | 9 | 197µs | push (@rettokens, map('u8:'.$_, @t)); | ||
1221 | 9 | 47µs | next; | ||
1222 | } | ||||
1223 | } | ||||
1224 | |||||
1225 | if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xa0-\xff]{2}/) { | ||||
1226 | # Matt sez: "Could be asian? Autrijus suggested doing character ngrams, | ||||
1227 | # but I'm doing tuples to keep the dbs small(er)." Sounds like a plan | ||||
1228 | # to me! (jm) | ||||
1229 | while ($token =~ s/^(..?)//) { | ||||
1230 | push (@rettokens, "8:$1"); | ||||
1231 | } | ||||
1232 | next; | ||||
1233 | } | ||||
1234 | |||||
1235 | 17300 | 73.7ms | if (($region == 0 && HDRS_TOKENIZE_LONG_TOKENS_AS_SKIPS) | ||
1236 | || ($region == 1 && BODY_TOKENIZE_LONG_TOKENS_AS_SKIPS) | ||||
1237 | || ($region == 2 && URIS_TOKENIZE_LONG_TOKENS_AS_SKIPS)) | ||||
1238 | { | ||||
1239 | # if (TOKENIZE_LONG_TOKENS_AS_SKIPS) | ||||
1240 | # Spambayes trick via Matt: Just retain 7 chars. Do not retain the | ||||
1241 | # length, it does not help; see jm's mail to -devel on Nov 20 2002 at | ||||
1242 | # http://sourceforge.net/p/spamassassin/mailman/message/12977605/ | ||||
1243 | # "sk:" stands for "skip". | ||||
1244 | # Bug 7141: retain seven UTF-8 chars (or other bytes), | ||||
1245 | # if followed by at least two bytes | ||||
1246 | 11544 | 558ms | 34632 | 240ms | $token =~ s{ ^ ( (?> (?: [\x00-\x7F\xF5-\xFF] | # spent 120ms making 23088 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 5µs/call
# spent 120ms making 11544 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 10µs/call |
1247 | [\xC0-\xDF][\x80-\xBF] | | ||||
1248 | [\xE0-\xEF][\x80-\xBF]{2} | | ||||
1249 | [\xF0-\xF4][\x80-\xBF]{3} | . ){7} )) | ||||
1250 | .{2,} \z }{sk:$1}xs; | ||||
1251 | ## (was:) $token = "sk:".substr($token, 0, 7); # seven bytes | ||||
1252 | } | ||||
1253 | } | ||||
1254 | |||||
1255 | # decompose tokens? do this after shortening long tokens | ||||
1256 | 109791 | 284ms | if ($region == 1 || $region == 2) { | ||
1257 | 64219 | 210ms | if (DECOMPOSE_BODY_TOKENS) { | ||
1258 | 64219 | 875ms | 64219 | 187ms | if ($token =~ /[^\w:\*]/) { # spent 187ms making 64219 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 3µs/call |
1259 | 15418 | 42.2ms | my $decompd = $token; # "Foo!" | ||
1260 | 15418 | 318ms | 15418 | 170ms | $decompd =~ s/[^\w:\*]//gs; # spent 170ms making 15418 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 11µs/call |
1261 | 15418 | 84.5ms | push (@rettokens, $tokprefix.$decompd); # "Foo" | ||
1262 | } | ||||
1263 | |||||
1264 | 64219 | 857ms | 64219 | 279ms | if ($token =~ /[A-Z]/) { # spent 279ms making 64219 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call |
1265 | 34320 | 109ms | my $decompd = $token; $decompd = lc $decompd; | ||
1266 | 17160 | 132ms | push (@rettokens, $tokprefix.$decompd); # "foo!" | ||
1267 | |||||
1268 | 17160 | 222ms | 17160 | 67.3ms | if ($token =~ /[^\w:\*]/) { # spent 67.3ms making 17160 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call |
1269 | 1950 | 30.0ms | 1950 | 18.0ms | $decompd =~ s/[^\w:\*]//gs; # spent 18.0ms making 1950 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 9µs/call |
1270 | 1950 | 11.1ms | push (@rettokens, $tokprefix.$decompd); # "foo" | ||
1271 | } | ||||
1272 | } | ||||
1273 | } | ||||
1274 | } | ||||
1275 | |||||
1276 | 109791 | 1.13s | push (@rettokens, $tokprefix.$token); | ||
1277 | } | ||||
1278 | |||||
1279 | 12822 | 296ms | return @rettokens; | ||
1280 | } | ||||
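The long-token shortening at source lines 1246-1250 can be exercised on its own. This is a minimal sketch under stated assumptions: the value of `MAX_TOKEN_LENGTH` does not appear in this listing, so `$MAX_LEN = 15` below is a placeholder, and `skip_token` is a hypothetical helper name. The substitution pattern itself is copied from the listing.

```perl
use strict;
use warnings;

my $MAX_LEN = 15;   # assumption: stands in for MAX_TOKEN_LENGTH

# Hypothetical helper (not part of Bayes.pm): shorten an overlong token to
# "sk:" plus its first seven UTF-8 characters (or other bytes), leaving
# short tokens and tokens containing '*' untouched.
sub skip_token {
    my ($token) = @_;
    return $token if length($token) <= $MAX_LEN || $token =~ /\*/;
    $token =~ s{ ^ ( (?> (?: [\x00-\x7F\xF5-\xFF]      |
                             [\xC0-\xDF][\x80-\xBF]    |
                             [\xE0-\xEF][\x80-\xBF]{2} |
                             [\xF0-\xF4][\x80-\xBF]{3} | . ){7} ))
                 .{2,} \z }{sk:$1}xs;
    return $token;
}

print skip_token('ABCDEFGHIJKLMNOPQRST'), "\n";   # 20 ASCII bytes
print skip_token('short'), "\n";                  # under the limit
```

For a 20-byte ASCII token this keeps the first seven characters; for a run of two-byte UTF-8 sequences it keeps seven whole sequences (14 bytes) rather than cutting one in half, which is the point of the Bug 7141 change noted in the source comments.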
1281 | |||||
1282 | # spent 2.99s (1.18+1.81) within Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers which was called 234 times, avg 12.8ms/call:
# 234 times (1.18s+1.81s) by Mail::SpamAssassin::Plugin::Bayes::tokenize at line 1111, avg 12.8ms/call
sub _tokenize_headers { | ||||
1283 | 234 | 568µs | my ($self, $msg) = @_; | ||
1284 | |||||
1285 | 234 | 490µs | my %parsed; | ||
1286 | |||||
1287 | my %user_ignore; | ||||
1288 | 468 | 204ms | $user_ignore{lc $_} = 1 for @{$self->{main}->{conf}->{bayes_ignore_headers}}; | ||
1289 | |||||
1290 | # get headers in array context | ||||
1291 | 234 | 465µs | my @hdrs; | ||
1292 | my @rcvdlines; | ||||
1293 | 234 | 18.6ms | 234 | 1.10s | for ($msg->get_all_headers()) { # spent 1.10s making 234 calls to Mail::SpamAssassin::Message::Node::get_all_headers, avg 4.69ms/call |
1294 | # first, keep a copy of Received headers, so we can strip down to last 2 | ||||
1295 | 7410 | 82.5ms | 7410 | 21.3ms | if (/^Received:/i) { # spent 21.3ms making 7410 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 3µs/call |
1296 | 1131 | 5.69ms | push(@rcvdlines, $_); | ||
1297 | 1131 | 2.24ms | next; | ||
1298 | } | ||||
1299 | # and now skip lines for headers we don't want (including all Received) | ||||
1300 | 6279 | 201ms | 12558 | 98.8ms | next if /^${IGNORED_HDRS}:/i; # spent 72.1ms making 6279 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 11µs/call
# spent 26.7ms making 6279 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp, avg 4µs/call |
1301 | next if IGNORE_MSGID_TOKENS && /^Message-ID:/i; | ||||
1302 | 4124 | 33.0ms | push(@hdrs, $_); | ||
1303 | } | ||||
1304 | 234 | 3.60ms | 234 | 27.2ms | push(@hdrs, $msg->get_all_metadata()); # spent 27.2ms making 234 calls to Mail::SpamAssassin::Message::get_all_metadata, avg 116µs/call |
1305 | |||||
1306 | # and re-add the last 2 received lines: usually a good source of | ||||
1307 | # spamware tokens and HELO names. | ||||
1308 | 468 | 2.21ms | if ($#rcvdlines >= 0) { push(@hdrs, $rcvdlines[$#rcvdlines]); } | ||
1309 | 468 | 1.97ms | if ($#rcvdlines >= 1) { push(@hdrs, $rcvdlines[$#rcvdlines-1]); } | ||
1310 | |||||
1311 | 234 | 2.27ms | for (@hdrs) { | ||
1312 | 5528 | 70.9ms | 5528 | 22.7ms | next unless /\S/; # spent 22.7ms making 5528 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call |
1313 | 5528 | 51.8ms | my ($hdr, $val) = split(/:/, $_, 2); | ||
1314 | |||||
1315 | # remove user-specified headers here, after Received, in case they | ||||
1316 | # want to ignore that too | ||||
1317 | 5528 | 16.7ms | next if exists $user_ignore{lc $hdr}; | ||
1318 | |||||
1319 | # Prep the header value | ||||
1320 | 5374 | 9.49ms | $val ||= ''; | ||
1321 | 5374 | 12.7ms | chomp($val); | ||
1322 | |||||
1323 | # special tokenization for some headers: | ||||
1324 | 5374 | 213ms | 17551 | 86.0ms | if ($hdr =~ /^(?:|X-|Resent-)Message-Id$/i) { # spent 71.6ms making 14037 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 5µs/call
# spent 14.4ms making 3514 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp, avg 4µs/call |
1325 | 225 | 2.14ms | 225 | 16.1ms | $val = $self->_pre_chew_message_id ($val); # spent 16.1ms making 225 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id, avg 72µs/call |
1326 | } | ||||
1327 | elsif (PRE_CHEW_ADDR_HEADERS && $hdr =~ /^(?:|X-|Resent-) | ||||
1328 | (?:Return-Path|From|To|Cc|Reply-To|Errors-To|Mail-Followup-To|Sender)$/ix) | ||||
1329 | { | ||||
1330 | 758 | 6.14ms | 758 | 194ms | $val = $self->_pre_chew_addr_header ($val); # spent 194ms making 758 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header, avg 257µs/call |
1331 | } | ||||
1332 | elsif ($hdr eq 'Received') { | ||||
1333 | 468 | 4.10ms | 468 | 98.8ms | $val = $self->_pre_chew_received ($val); # spent 98.8ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received, avg 211µs/call |
1334 | } | ||||
1335 | elsif ($hdr eq 'Content-Type') { | ||||
1336 | 222 | 2.05ms | 222 | 28.1ms | $val = $self->_pre_chew_content_type ($val); # spent 28.1ms making 222 calls to Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type, avg 127µs/call |
1337 | } | ||||
1338 | elsif ($hdr eq 'MIME-Version') { | ||||
1339 | 187 | 2.33ms | 187 | 1.10ms | $val =~ s/1\.0//; # totally innocuous # spent 1.10ms making 187 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 6µs/call |
1340 | } | ||||
1341 | elsif ($hdr =~ /^${MARK_PRESENCE_ONLY_HDRS}$/i) { | ||||
1342 | 224 | 571µs | $val = "1"; # just mark the presence, they create lots of hapaxen | ||
1343 | } | ||||
1344 | |||||
1345 | 5374 | 27.7ms | if (MAP_HEADERS_MID) { | ||
1346 | 5374 | 91.9ms | 5374 | 20.6ms | if ($hdr =~ /^(?:In-Reply-To|References|Message-ID)$/i) { # spent 20.6ms making 5374 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call |
1347 | 237 | 928µs | $parsed{"*MI"} = $val; | ||
1348 | } | ||||
1349 | } | ||||
1350 | 5374 | 16.8ms | if (MAP_HEADERS_FROMTOCC) { | ||
1351 | 5374 | 70.6ms | 5374 | 19.9ms | if ($hdr =~ /^(?:From|To|Cc)$/i) { # spent 19.9ms making 5374 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 4µs/call |
1352 | 435 | 1.49ms | $parsed{"*Ad"} = $val; | ||
1353 | } | ||||
1354 | } | ||||
1355 | 5374 | 17.0ms | if (MAP_HEADERS_USERAGENT) { | ||
1356 | 5374 | 70.3ms | 5374 | 17.4ms | if ($hdr =~ /^(?:X-Mailer|User-Agent)$/i) { # spent 17.4ms making 5374 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 3µs/call |
1357 | 64 | 272µs | $parsed{"*UA"} = $val; | ||
1358 | } | ||||
1359 | } | ||||
1360 | |||||
1361 | # replace hdr name with "compressed" version if possible | ||||
1362 | 5374 | 34.8ms | if (defined $HEADER_NAME_COMPRESSION{$hdr}) { | ||
1363 | 2009 | 8.50ms | $hdr = $HEADER_NAME_COMPRESSION{$hdr}; | ||
1364 | } | ||||
1365 | |||||
1366 | 5374 | 24.2ms | if (exists $parsed{$hdr}) { | ||
1367 | 288 | 2.46ms | $parsed{$hdr} .= " ".$val; | ||
1368 | } else { | ||||
1369 | 5086 | 38.6ms | $parsed{$hdr} = $val; | ||
1370 | } | ||||
1371 | 5374 | 51.6ms | 5374 | 59.7ms | if (would_log('dbg', 'bayes') > 1) { # spent 59.7ms making 5374 calls to Mail::SpamAssassin::Logger::would_log, avg 11µs/call |
1372 | dbg("bayes: header tokens for $hdr = \"$parsed{$hdr}\""); | ||||
1373 | } | ||||
1374 | } | ||||
1375 | |||||
1376 | 234 | 32.8ms | return %parsed; | ||
1377 | } | ||||
1378 | |||||
1379 | # spent 28.1ms (14.7+13.4) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type which was called 222 times, avg 127µs/call:
# 222 times (14.7ms+13.4ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1336, avg 127µs/call
sub _pre_chew_content_type { | ||||
1380 | 222 | 908µs | my ($self, $val) = @_; | ||
1381 | |||||
1382 | # hopefully this will retain good bits without too many hapaxen | ||||
1383 | 222 | 4.54ms | 222 | 2.45ms | if ($val =~ s/boundary=[\"\'](.*?)[\"\']/ /ig) { # spent 2.45ms making 222 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 11µs/call |
1384 | 173 | 631µs | my $boundary = $1; | ||
1385 | 173 | 407µs | $boundary = '' if !defined $boundary; # avoid a warning | ||
1386 | 173 | 7.11ms | 173 | 5.24ms | $boundary =~ s/[a-fA-F0-9]/H/gs; # spent 5.24ms making 173 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 30µs/call |
1387 | # break up blocks of separator chars so they become their own tokens | ||||
1388 | 173 | 9.08ms | 787 | 4.00ms | $boundary =~ s/([-_\.=]+)/ $1 /gs; # spent 3.10ms making 614 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 5µs/call
# spent 899µs making 173 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 5µs/call |
1389 | 173 | 729µs | $val .= $boundary; | ||
1390 | } | ||||
1391 | |||||
1392 | # stop-list words for Content-Type header: these wind up totally gray | ||||
1393 | 222 | 3.15ms | 222 | 1.67ms | $val =~ s/\b(?:text|charset)\b//; # spent 1.67ms making 222 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 8µs/call |
1394 | |||||
1395 | 222 | 1.92ms | $val; | ||
1396 | } | ||||
1397 | |||||
1398 | # spent 16.1ms (9.18+6.95) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id which was called 225 times, avg 72µs/call:
# 225 times (9.18ms+6.95ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1325, avg 72µs/call
sub _pre_chew_message_id { | ||||
1399 | 225 | 877µs | my ($self, $val) = @_; | ||
1400 | # we can (a) get rid of a lot of hapaxen and (b) increase the token | ||||
1401 | # specificity by pre-parsing some common formats. | ||||
1402 | |||||
1403 | # Outlook Express format: | ||||
1404 | 225 | 3.16ms | 225 | 1.59ms | $val =~ s/<([0-9a-f]{4})[0-9a-f]{4}[0-9a-f]{4}\$ # spent 1.59ms making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call |
1405 | ([0-9a-f]{4})[0-9a-f]{4}\$ | ||||
1406 | ([0-9a-f]{8})\@(\S+)>/ OEA$1 OEB$2 OEC$3 $4 /gx; | ||||
1407 | |||||
1408 | # Exim: | ||||
1409 | 225 | 2.16ms | 225 | 696µs | $val =~ s/<[A-Za-z0-9]{7}-[A-Za-z0-9]{6}-0[A-Za-z0-9]\@//; # spent 696µs making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 3µs/call |
1410 | |||||
1411 | # Sendmail: | ||||
1412 | 225 | 2.28ms | 225 | 797µs | $val =~ s/<20\d\d[01]\d[0123]\d[012]\d[012345]\d[012345]\d\. # spent 797µs making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 4µs/call |
1413 | [A-F0-9]{10,12}\@//gx; | ||||
1414 | |||||
1415 | # try to split Message-ID segments on probable ID boundaries. Note that | ||||
1416 | # Outlook message-ids seem to contain a server identifier ID in the last | ||||
1417 | # 8 bytes before the @. Make sure this becomes its own token, it's a | ||||
1418 | # great spam-sign for a learning system! Be sure to split on ".". | ||||
1419 | 225 | 6.03ms | 225 | 3.86ms | $val =~ s/[^_A-Za-z0-9]/ /g; # spent 3.86ms making 225 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 17µs/call |
1420 | 225 | 2.05ms | $val; | ||
1421 | } | ||||
1422 | |||||
1423 | # spent 98.8ms (53.3+45.5) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received which was called 468 times, avg 211µs/call:
# 468 times (53.3ms+45.5ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1333, avg 211µs/call
sub _pre_chew_received { | ||||
1424 | 468 | 2.47ms | my ($self, $val) = @_; | ||
1425 | |||||
1426 | # Thanks to Dan for these. Trim out "useless" tokens; sendmail-ish IDs | ||||
1427 | # and valid-format RFC-822/2822 dates | ||||
1428 | |||||
1429 | 468 | 6.47ms | 468 | 3.16ms | $val =~ s/\swith\sSMTP\sid\sg[\dA-Z]{10,12}\s/ /gs; # Sendmail # spent 3.16ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call |
1430 | 468 | 6.05ms | 468 | 3.09ms | $val =~ s/\swith\sESMTP\sid\s[\dA-F]{10,12}\s/ /gs; # Sendmail # spent 3.09ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call |
1431 | 468 | 6.72ms | 468 | 3.43ms | $val =~ s/\bid\s[a-zA-Z0-9]{7,20}\b/ /gs; # Sendmail # spent 3.43ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call |
1432 | 468 | 4.72ms | 468 | 1.87ms | $val =~ s/\bid\s[A-Za-z0-9]{7}-[A-Za-z0-9]{6}-0[A-Za-z0-9]/ /gs; # exim # spent 1.87ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 4µs/call |
1433 | |||||
1434 | 468 | 12.7ms | 468 | 9.41ms | $val =~ s/(?:(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s)? # spent 9.41ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 20µs/call |
1435 | [0-3\s]?[0-9]\s | ||||
1436 | (?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)\s | ||||
1437 | (?:19|20)?[0-9]{2}\s | ||||
1438 | [0-2][0-9](?:\:[0-5][0-9]){1,2}\s | ||||
1439 | (?:\s*\(|\)|\s*(?:[+-][0-9]{4})|\s*(?:UT|[A-Z]{2,3}T))* | ||||
1440 | //gx; | ||||
1441 | |||||
1442 | # IPs: break down to nearest /24, to reduce hapaxes -- EXCEPT for | ||||
1443 | # IPs in the 10 and 192.168 ranges, which get lots of significant tokens | ||||
1444 | # (on both sides) | ||||
1445 | # also make a dup with the full IP, as fodder for | ||||
1446 | # bayes_dump_to_trusted_networks: "H*r:ip*aaa.bbb.ccc.ddd" | ||||
1447 | 468 | 30.1ms | 1418 | 12.3ms | $val =~ s{\b(\d{1,3}\.)(\d{1,3}\.)(\d{1,3})(\.\d{1,3})\b}{ # spent 7.12ms making 950 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:substcont, avg 7µs/call
# spent 5.14ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 11µs/call |
1448 | 584 | 4.04ms | if ($2 eq '10' || ($2 eq '192' && $3 eq '168')) { | ||
1449 | $1.$2.$3.$4. | ||||
1450 | " ip*".$1.$2.$3.$4." "; | ||||
1451 | } else { | ||||
1452 | 584 | 6.55ms | $1.$2.$3. | ||
1453 | " ip*".$1.$2.$3.$4." "; | ||||
1454 | } | ||||
1455 | }gex; | ||||
1456 | |||||
1457 | # trim these: they turn out as the most common tokens, but with a | ||||
1458 | # prob of about .5. waste of space! | ||||
1459 | 468 | 15.6ms | 468 | 12.3ms | $val =~ s/\b(?:with|from|for|SMTP|ESMTP)\b/ /g; # spent 12.3ms making 468 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 26µs/call |
1460 | |||||
1461 | 468 | 4.01ms | $val; | ||
1462 | } | ||||
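The IP handling at source lines 1447-1455 can be sketched separately. In the listing, `$1` and `$2` capture their trailing dot (for example `192.`), so comparisons such as `$2 eq '10'` cannot be true as written; this is consistent with the profile, which shows 584 executions of the `else` branch and none of the first. The sketch below is an assumption about the intent described in the source comments, not the module's code: it captures bare octets and tests the first octet for `10` and the first two for `192.168`. `chew_ips` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bayes.pm): reduce public IPs to their
# /24 prefix, keep 10.x and 192.168.x addresses whole, and always append a
# full-address "ip*" token.
sub chew_ips {
    my ($val) = @_;
    $val =~ s{\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b}{
        my $full    = "$1.$2.$3.$4";
        my $private = ($1 eq '10' || ($1 eq '192' && $2 eq '168'));
        my $kept    = $private ? $full : "$1.$2.$3";
        "$kept ip*$full ";
    }gex;
    return $val;
}

print chew_ips('from [203.0.113.7] by 192.168.1.20'), "\n";
# prints "from [203.0.113 ip*203.0.113.7 ] by 192.168.1.20 ip*192.168.1.20 "
```

Whether the private-range exception should apply at all is a tuning decision for the classifier; the sketch only illustrates the transformation the comments describe.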
1463 | |||||
1464 | # spent 194ms (58.8+136) within Mail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header which was called 758 times, avg 257µs/call:
# 758 times (58.8ms+136ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1330, avg 257µs/call
sub _pre_chew_addr_header { | ||||
1465 | 758 | 4.42ms | my ($self, $val) = @_; | ||
1466 | 758 | 1.44ms | local ($_); | ||
1467 | |||||
1468 | 758 | 8.09ms | 758 | 75.0ms | my @addrs = $self->{main}->find_all_addrs_in_line ($val); # spent 75.0ms making 758 calls to Mail::SpamAssassin::find_all_addrs_in_line, avg 99µs/call |
1469 | 758 | 1.35ms | my @toks; | ||
1470 | 758 | 2.93ms | foreach (@addrs) { | ||
1471 | 742 | 8.98ms | 742 | 60.7ms | push (@toks, $self->_tokenize_mail_addrs ($_)); # spent 60.7ms making 742 calls to Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs, avg 82µs/call |
1472 | } | ||||
1473 | 758 | 11.8ms | return join (' ', @toks); | ||
1474 | } | ||||
1475 | |||||
1476 | # spent 91.5ms (67.4+24.1) within Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs which was called 1150 times, avg 80µs/call:
# 742 times (43.6ms+17.1ms) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_addr_header at line 1471, avg 82µs/call
# 408 times (23.8ms+6.96ms) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1198, avg 75µs/call
sub _tokenize_mail_addrs { | ||||
1477 | 1150 | 5.86ms | my ($self, $addr) = @_; | ||
1478 | |||||
1479 | 1150 | 16.4ms | 1150 | 7.84ms | ($addr =~ /(.+)\@(.+)$/) or return (); # spent 7.84ms making 1150 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:match, avg 7µs/call |
1480 | 1150 | 2.02ms | my @toks; | ||
1481 | 1150 | 8.87ms | push(@toks, "U*".$1, "D*".$2); | ||
1482 | 3555 | 44.8ms | 2405 | 16.2ms | $_ = $2; while (s/^[^\.]+\.(.+)$/$1/gs) { push(@toks, "D*".$1); } # spent 16.2ms making 2405 calls to Mail::SpamAssassin::Plugin::Bayes::CORE:subst, avg 7µs/call |
1483 | 1150 | 26.4ms | return @toks; | ||
1484 | } | ||||
1485 | |||||
1486 | |||||
1487 | ########################################################################### | ||||
1488 | |||||
1489 | # compute the probability that a token is spammish for each token | ||||
1490 | sub _compute_prob_for_all_tokens { | ||||
1491 | my ($self, $tokensdata, $ns, $nn) = @_; | ||||
1492 | my @probabilities; | ||||
1493 | |||||
1494 | return if !$ns || !$nn; | ||||
1495 | |||||
1496 | my $threshold = 1; # ignore low-freq tokens below this s+n threshold | ||||
1497 | if (!USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS) { | ||||
1498 | $threshold = 10; | ||||
1499 | } | ||||
1500 | if (!$self->{use_hapaxes}) { | ||||
1501 | $threshold = 2; | ||||
1502 | } | ||||
1503 | |||||
1504 | foreach my $tokendata (@{$tokensdata}) { | ||||
1505 | my $s = $tokendata->[1]; # spam count | ||||
1506 | my $n = $tokendata->[2]; # ham count | ||||
1507 | my $prob; | ||||
1508 | |||||
1509 | 2 | 2.43ms | 2 | 273µs | # spent 176µs (78+97) within Mail::SpamAssassin::Plugin::Bayes::BEGIN@1509 which was called:
# once (78µs+97µs) by Mail::SpamAssassin::Plugin::TxRep::BEGIN@205 at line 1509 # spent 176µs making 1 call to Mail::SpamAssassin::Plugin::Bayes::BEGIN@1509
# spent 98µs making 1 call to warnings::unimport |
1510 | if ($s + $n >= $threshold) { | ||||
1511 | # ignoring low-freq tokens, also covers the (!$s && !$n) case | ||||
1512 | |||||
1513 | # my $ratios = $s / $ns; | ||||
1514 | # my $ration = $n / $nn; | ||||
1515 | # $prob = $ratios / ($ration + $ratios); | ||||
1516 | # | ||||
1517 | $prob = ($s * $nn) / ($n * $ns + $s * $nn); # same thing, faster | ||||
1518 | |||||
1519 | if (USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS) { | ||||
1520 | # use Robinson's f(x) equation for low-n tokens, instead of just | ||||
1521 | # ignoring them | ||||
1522 | my $robn = $s + $n; | ||||
1523 | $prob = | ||||
1524 | ($Mail::SpamAssassin::Bayes::Combine::FW_S_DOT_X + ($robn * $prob)) | ||||
1525 | / | ||||
1526 | ($Mail::SpamAssassin::Bayes::Combine::FW_S_CONSTANT + $robn); | ||||
1527 | } | ||||
1528 | } | ||||
1529 | |||||
1530 | # 'log_raw_counts' is used to log the raw data for the Bayes equations | ||||
1531 | # during a mass-check, allowing the S and X constants to be optimized | ||||
1532 | # quickly without requiring re-tokenization of the messages for each | ||||
1533 | # attempt. There's really no need for this code to be uncommented in | ||||
1534 | # normal use, however. It has never been publicly documented, so | ||||
1535 | # commenting it out is fine. ;) | ||||
1536 | # | ||||
1537 | ## if ($self->{log_raw_counts}) { | ||||
1538 | ## $self->{raw_counts} .= " s=$s,n=$n "; | ||||
1539 | ## } | ||||
1540 | |||||
1541 | push(@probabilities, $prob); | ||||
1542 | } | ||||
1543 | return \@probabilities; | ||||
1544 | } | ||||
1545 | |||||
1546 | # compute the probability that a token is spammish | ||||
1547 | sub _compute_prob_for_token { | ||||
1548 | my ($self, $token, $ns, $nn, $s, $n) = @_; | ||||
1549 | |||||
1550 | # we allow the caller to give us the token information, just | ||||
1551 | # to save a potentially expensive lookup | ||||
1552 | if (!defined($s) || !defined($n)) { | ||||
1553 | ($s, $n, undef) = $self->{store}->tok_get($token); | ||||
1554 | } | ||||
1555 | return if !$s && !$n; | ||||
1556 | |||||
1557 | my $probabilities_ref = | ||||
1558 | $self->_compute_prob_for_all_tokens([ [$token, $s, $n, 0] ], $ns, $nn); | ||||
1559 | |||||
1560 | return $probabilities_ref->[0]; | ||||
1561 | } | ||||
1562 | |||||
1563 | ########################################################################### | ||||
1564 | # If a token is neither hammy nor spammy, return 0. | ||||
1565 | # For a spammy token, return the minimum number of additional ham messages | ||||
1566 | # it would have had to appear in to no longer be spammy. Hammy tokens | ||||
1567 | # are handled similarly. That's what the function does (at the time | ||||
1568 | # of this writing, 31 July 2003, 16:02:55 CDT). It would be slightly | ||||
1569 | # more useful if it returned the number of /additional/ ham messages | ||||
1570 | # a spammy token would have to appear in to no longer be spammy but I | ||||
1571 | # fear that might require the solution to a cubic equation, and I | ||||
1572 | # just don't have the time for that now. | ||||
1573 | |||||
1574 | sub _compute_declassification_distance { | ||||
1575 | my ($self, $Ns, $Nn, $ns, $nn, $prob) = @_; | ||||
1576 | |||||
1577 | return 0 if $ns == 0 && $nn == 0; | ||||
1578 | |||||
1579 | if (!USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS) {return 0 if ($ns + $nn < 10);} | ||||
1580 | if (!$self->{use_hapaxes}) {return 0 if ($ns + $nn < 2);} | ||||
1581 | |||||
1582 | return 0 if $Ns == 0 || $Nn == 0; | ||||
1583 | return 0 if abs( $prob - 0.5 ) < | ||||
1584 | $Mail::SpamAssassin::Bayes::Combine::MIN_PROB_STRENGTH; | ||||
1585 | |||||
1586 | my ($Na,$na,$Nb,$nb) = $prob > 0.5 ? ($Nn,$nn,$Ns,$ns) : ($Ns,$ns,$Nn,$nn); | ||||
1587 | my $p = 0.5 - $Mail::SpamAssassin::Bayes::Combine::MIN_PROB_STRENGTH; | ||||
1588 | |||||
1589 | return int( 1.0 - 1e-6 + $nb * $Na * $p / ($Nb * ( 1 - $p )) ) - $na | ||||
1590 | unless USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS; | ||||
1591 | |||||
1592 | my $s = $Mail::SpamAssassin::Bayes::Combine::FW_S_CONSTANT; | ||||
1593 | my $sx = $Mail::SpamAssassin::Bayes::Combine::FW_S_DOT_X; | ||||
1594 | my $a = $Nb * ( 1 - $p ); | ||||
1595 | my $b = $Nb * ( $sx + $nb * ( 1 - $p ) - $p * $s ) - $p * $Na * $nb; | ||||
1596 | my $c = $Na * $nb * ( $sx - $p * ( $s + $nb ) ); | ||||
1597 | my $discrim = $b * $b - 4 * $a * $c; | ||||
1598 | my $disc_max_0 = $discrim < 0 ? 0 : $discrim; | ||||
1599 | my $dd_exact = ( 1.0 - 1e-6 + ( -$b + sqrt( $disc_max_0 ) ) / ( 2*$a ) ) - $na; | ||||
1600 | |||||
1601 | # This shouldn't be necessary. Should not be < 1 | ||||
1602 | return $dd_exact < 1 ? 1 : int($dd_exact); | ||||
1603 | } | ||||
1604 | |||||
1605 | ########################################################################### | ||||
1606 | |||||
1607 | sub _opportunistic_calls { | ||||
1608 | my($self, $journal_only) = @_; | ||||
1609 | |||||
1610 | # If we're not already tied, abort. | ||||
1611 | if (!$self->{store}->db_readable()) { | ||||
1612 | dbg("bayes: opportunistic call attempt failed, DB not readable"); | ||||
1613 | return; | ||||
1614 | } | ||||
1615 | |||||
1616 | # Is an expire or sync running? | ||||
1617 | my $running_expire = $self->{store}->get_running_expire_tok(); | ||||
1618 | if ( defined $running_expire && $running_expire+$OPPORTUNISTIC_LOCK_VALID > time() ) { | ||||
1619 | dbg("bayes: opportunistic call attempt skipped, found fresh running expire magic token"); | ||||
1620 | return; | ||||
1621 | } | ||||
1622 | |||||
1623 | # handle expiry and syncing | ||||
1624 | if (!$journal_only && $self->{store}->expiry_due()) { | ||||
1625 | dbg("bayes: opportunistic call found expiry due"); | ||||
1626 | |||||
1627 | # sync will bring the DB R/W as necessary, and the expire will remove | ||||
1628 | # the running_expire token, may untie as well. | ||||
1629 | $self->{main}->{bayes_scanner}->sync(1,1); | ||||
1630 | } | ||||
1631 | elsif ( $self->{store}->sync_due() ) { | ||||
1632 | dbg("bayes: opportunistic call found journal sync due"); | ||||
1633 | |||||
1634 | # sync will bring the DB R/W as necessary, may untie as well | ||||
1635 | $self->{main}->{bayes_scanner}->sync(1,0); | ||||
1636 | |||||
1637 | # We can only remove the running_expire token if we're doing R/W | ||||
1638 | if ($self->{store}->db_writable()) { | ||||
1639 | $self->{store}->remove_running_expire_tok(); | ||||
1640 | } | ||||
1641 | } | ||||
1642 | |||||
1643 | return; | ||||
1644 | } | ||||
1645 | |||||
1646 | ########################################################################### | ||||
1647 | |||||
1648 | # spent 29.6ms (19.6+10.0) within Mail::SpamAssassin::Plugin::Bayes::learner_new which was called:
# once (19.6ms+10.0ms) by Mail::SpamAssassin::PluginHandler::callback at line 204 of Mail/SpamAssassin/PluginHandler.pm
sub learner_new { | ||||
1649 | 1 | 2µs | my ($self) = @_; | ||
1650 | |||||
1651 | 1 | 2µs | my $store; | ||
1652 | 1 | 13µs | 1 | 44µs | my $module = untaint_var($self->{conf}->{bayes_store_module}); # spent 44µs making 1 call to Mail::SpamAssassin::Util::untaint_var |
1653 | 1 | 3µs | $module = 'Mail::SpamAssassin::BayesStore::DBM' if !$module; | ||
1654 | |||||
1655 | 1 | 8µs | 1 | 7µs | dbg("bayes: learner_new self=%s, bayes_store_module=%s", $self,$module); # spent 7µs making 1 call to Mail::SpamAssassin::Logger::dbg |
1656 | 1 | 4µs | undef $self->{store}; # DESTROYs previous object, if any | ||
1657 | eval ' | ||||
1658 | require '.$module.'; | ||||
1659 | $store = '.$module.'->new($self); | ||||
1660 | 1; | ||||
1661 | 1 | 188µs | ' or do { # spent 391µs executing statements in string eval | ||
1662 | my $eval_stat = $@ ne '' ? $@ : "errno=$!"; chomp $eval_stat; | ||||
1663 | die "bayes: learner_new $module new() failed: $eval_stat\n"; | ||||
1664 | }; | ||||
1665 | |||||
1666 | 1 | 10µs | 1 | 12µs | dbg("bayes: learner_new: got store=%s", $store); # spent 12µs making 1 call to Mail::SpamAssassin::Logger::dbg |
1667 | 1 | 4µs | $self->{store} = $store; | ||
1668 | |||||
1669 | 1 | 13µs | $self; | ||
1670 | } | ||||
1671 | |||||
1672 | ########################################################################### | ||||
1673 | |||||
1674 | sub bayes_report_make_list { | ||||
1675 | my ($self, $pms, $info, $param) = @_; | ||||
1676 | return "Tokens not available." unless defined $info; | ||||
1677 | |||||
1678 | my ($limit,$fmt_arg,$more) = split /,/, ($param || '5'); | ||||
1679 | |||||
1680 | my %formats = ( | ||||
1681 | short => '$t', | ||||
1682 | Short => 'Token: \"$t\"', | ||||
1683 | compact => '$p-$D--$t', | ||||
1684 | Compact => 'Probability $p -declassification distance $D (\"+\" means > 9) --token: \"$t\"', | ||||
1685 | medium => '$p-$D-$N--$t', | ||||
1686 | long => '$p-$d--${h}h-${s}s--${a}d--$t', | ||||
1687 | Long => 'Probability $p -declassification distance $D --in ${h} ham messages -and ${s} spam messages --${a} days old--token:\"$t\"' | ||||
1688 | ); | ||||
1689 | |||||
1690 | my $raw_fmt = (!$fmt_arg ? '$p-$D--$t' : $formats{$fmt_arg}); | ||||
1691 | |||||
1692 | return "Invalid format, must be one of: ".join(",",keys %formats) | ||||
1693 | unless defined $raw_fmt; | ||||
1694 | |||||
1695 | my $fmt = '"'.$raw_fmt.'"'; | ||||
1696 | my $amt = $limit < @$info ? $limit : @$info; | ||||
1697 | return "" unless $amt; | ||||
1698 | |||||
1699 | my $ns = $pms->{bayes_nspam}; | ||||
1700 | my $nh = $pms->{bayes_nham}; | ||||
1701 | my $digit = sub { $_[0] > 9 ? "+" : $_[0] }; | ||||
1702 | my $now = time; | ||||
1703 | |||||
1704 | join ', ', map { | ||||
1705 | my($t,$prob,$s,$h,$u) = @$_; | ||||
1706 | my $a = int(($now - $u)/(3600 * 24)); | ||||
1707 | my $d = $self->_compute_declassification_distance($ns,$nh,$s,$h,$prob); | ||||
1708 | my $p = sprintf "%.3f", $prob; | ||||
1709 | my $n = $s + $h; | ||||
1710 | my ($c,$o) = $prob < 0.5 ? ($h,$s) : ($s,$h); | ||||
1711 | my ($D,$S,$H,$C,$O,$N) = map &$digit($_), ($d,$s,$h,$c,$o,$n); | ||||
1712 | eval $fmt; ## no critic | ||||
1713 | } @{$info}[0..$amt-1]; | ||||
1714 | } | ||||
1715 | |||||
1716 | 1 | 30µs | 1; | ||
# spent 2.36s within Mail::SpamAssassin::Plugin::Bayes::CORE:match which was called 645267 times, avg 4µs/call:
# 158560 times (306ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1183, avg 2µs/call
# 128048 times (271ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1197, avg 2µs/call
# 126321 times (874ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1192, avg 7µs/call
# 64219 times (279ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1264, avg 4µs/call
# 64219 times (187ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1258, avg 3µs/call
# 18366 times (42.6ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1212, avg 2µs/call
# 17309 times (75.5ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1214, avg 4µs/call
# 17160 times (67.3ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1268, avg 4µs/call
# 14037 times (71.6ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1324, avg 5µs/call
# 7410 times (21.3ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1295, avg 3µs/call
# 6279 times (72.1ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1300, avg 11µs/call
# 5528 times (22.7ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1312, avg 4µs/call
# 5374 times (20.6ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1346, avg 4µs/call
# 5374 times (19.9ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1351, avg 4µs/call
# 5374 times (17.4ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1356, avg 3µs/call
# 1150 times (7.84ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs at line 1479, avg 7µs/call
# 530 times (3.90ms+0s) by Mail::SpamAssassin::Plugin::Bayes::get_msgid at line 977, avg 7µs/call
# 9 times (174µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1217, avg 19µs/call | |||||
sub Mail::SpamAssassin::Plugin::Bayes::CORE:qr; # opcode | |||||
# spent 812ms within Mail::SpamAssassin::Plugin::Bayes::CORE:regcomp which was called 168353 times, avg 5µs/call:
# 158560 times (771ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1183, avg 5µs/call
# 6279 times (26.7ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1300, avg 4µs/call
# 3514 times (14.4ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1324, avg 4µs/call | |||||
# spent 2.29s within Mail::SpamAssassin::Plugin::Bayes::CORE:subst which was called 415086 times, avg 6µs/call:
# 158560 times (757ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1175, avg 5µs/call
# 158560 times (755ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1176, avg 5µs/call
# 15418 times (170ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1260, avg 11µs/call
# 12822 times (75.4ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1146, avg 6µs/call
# 12822 times (42.5ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1157, avg 3µs/call
# 12822 times (30.0ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1158, avg 2µs/call
# 11544 times (120ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1246, avg 10µs/call
# 8956 times (37.5ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1203, avg 4µs/call
# 7217 times (137ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1164, avg 19µs/call
# 5242 times (38.6ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1202, avg 7µs/call
# 2405 times (16.2ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_mail_addrs at line 1482, avg 7µs/call
# 1950 times (18.0ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1269, avg 9µs/call
# 1060 times (8.50ms+0s) by Mail::SpamAssassin::Plugin::Bayes::get_msgid at line 980, avg 8µs/call
# 555 times (24.2ms+0s) by Mail::SpamAssassin::Plugin::Bayes::get_msgid at line 1007, avg 44µs/call
# 468 times (12.3ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1459, avg 26µs/call
# 468 times (9.41ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1434, avg 20µs/call
# 468 times (5.14ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1447, avg 11µs/call
# 468 times (3.43ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1431, avg 7µs/call
# 468 times (3.16ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1429, avg 7µs/call
# 468 times (3.09ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1430, avg 7µs/call
# 468 times (1.87ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1432, avg 4µs/call
# 225 times (3.86ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1419, avg 17µs/call
# 225 times (1.59ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1404, avg 7µs/call
# 225 times (797µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1412, avg 4µs/call
# 225 times (696µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_message_id at line 1409, avg 3µs/call
# 222 times (2.45ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1383, avg 11µs/call
# 222 times (1.67ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1393, avg 8µs/call
# 187 times (1.10ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_headers at line 1339, avg 6µs/call
# 173 times (5.24ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1386, avg 30µs/call
# 173 times (899µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1388, avg 5µs/call | |||||
# spent 1.02s within Mail::SpamAssassin::Plugin::Bayes::CORE:substcont which was called 229852 times, avg 4µs/call:
# 197712 times (805ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1146, avg 4µs/call
# 23088 times (120ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1246, avg 5µs/call
# 7362 times (85.3ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1164, avg 12µs/call
# 950 times (7.12ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_received at line 1447, avg 7µs/call
# 614 times (3.10ms+0s) by Mail::SpamAssassin::Plugin::Bayes::_pre_chew_content_type at line 1388, avg 5µs/call
# 86 times (600µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1157, avg 7µs/call
# 40 times (218µs+0s) by Mail::SpamAssassin::Plugin::Bayes::_tokenize_line at line 1158, avg 5µs/call |
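
The per-token arithmetic in `_compute_prob_for_all_tokens` (source lines 1510 to 1527) can be tried in isolation. The following is a minimal sketch, not the plugin's code: `token_prob` is a hypothetical helper, and the constant values are assumptions chosen for illustration, since the listing references `FW_S_CONSTANT` and `FW_S_DOT_X` from `Mail::SpamAssassin::Bayes::Combine` without showing what they are set to. It models only the path where `USE_ROBINSON_FX_EQUATION_FOR_LOW_FREQS` is true.

```perl
use strict;
use warnings;

# Assumed values for illustration only; the real constants live in
# Mail::SpamAssassin::Bayes::Combine and are not shown in this listing.
my $FW_S_CONSTANT = 0.030;                            # hypothetical strength "s"
my $FW_X_CONSTANT = 0.538;                            # hypothetical prior "x"
my $FW_S_DOT_X    = $FW_S_CONSTANT * $FW_X_CONSTANT;  # assumed to equal s * x

# token_prob is a hypothetical helper, not part of Bayes.pm.
#   $s,  $n  : number of spam / ham messages containing the token
#   $ns, $nn : total number of spam / ham messages learned
#   $threshold : minimum $s + $n for the token to be scored (default 1)
sub token_prob {
    my ($s, $n, $ns, $nn, $threshold) = @_;
    $threshold = 1 unless defined $threshold;

    # No estimate is possible without both corpora.
    return undef if !$ns || !$nn;

    # Skip low-frequency tokens; this also avoids dividing by zero
    # when the token has never been seen.
    return undef if $s + $n < $threshold || $s + $n == 0;

    # Equivalent to ($s/$ns) / ($s/$ns + $n/$nn), with fewer divisions.
    my $prob = ($s * $nn) / ($n * $ns + $s * $nn);

    # Robinson's f(x): pull the estimate toward the prior x when the
    # token has been seen only a few times.
    my $robn = $s + $n;
    return ($FW_S_DOT_X + $robn * $prob) / ($FW_S_CONSTANT + $robn);
}

printf "spam-only token, 10 hits: %.4f\n", token_prob(10, 0, 100, 100);
printf "spam-only token, 1 hit:   %.4f\n", token_prob(1,  0, 100, 100);
```

Under these assumed constants, a token seen once in spam scores slightly lower than one seen ten times, because the smoothing term carries more weight when `$s + $n` is small.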