Commit f9ac537f authored by Lyudmila Vaseva

Document most triggered filters over the years

parent b03b5d02
@@ -225,7 +225,191 @@ According to the Edit filter Notice board:
EN: There are currently 201 enabled filters, and 12 stale filters with no hits in the past 30 days (Purge). from https://en.wikipedia.org/wiki/Special:AbuseFilter (29.11.2018)
--> upward tendency^^; not particularly significant
thanks to Quarry, we have all the filters that were triggered according to the filter log, per year, from 2009 (when filters were first introduced/the MediaWiki extension was enabled) until the end of 2018, together with the number of times each was triggered:
$ wc quarry-32489-en-wp-all-log-entries-before-20100101-run318768.csv
220 220 1879 quarry-32489-en-wp-all-log-entries-before-20100101-run318768.csv
$ wc quarry-32492-en-wp_-all-abuse-filter-log-entries-in-2010-run318774.csv
163 163 1476 quarry-32492-en-wp_-all-abuse-filter-log-entries-in-2010-run318774.csv
$ wc quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2011-run318775.csv
161 161 1480 quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2011-run318775.csv
$ wc quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2012-run318778.csv
170 170 1576 quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2012-run318778.csv
$ wc quarry-32495-en-wp_-all-abuse-filter-log-entries-in-2013-run318779.csv
178 178 1632 quarry-32495-en-wp_-all-abuse-filter-log-entries-in-2013-run318779.csv
$ wc quarry-32496-en-wp_-all-abuse-filter-log-entries-in-2014-run318780.csv
154 154 1434 quarry-32496-en-wp_-all-abuse-filter-log-entries-in-2014-run318780.csv
$ wc quarry-32497-en-wp_-all-abuse-filter-log-entries-in-2015-run318782.csv
200 200 1845 quarry-32497-en-wp_-all-abuse-filter-log-entries-in-2015-run318782.csv
$ wc quarry-32499-en-wp_-all-abuse-filter-log-entries-in-2016-run318789.csv
204 204 1902 quarry-32499-en-wp_-all-abuse-filter-log-entries-in-2016-run318789.csv
$ wc quarry-32500-en-wp_-all-abuse-filter-log-entries-in-2017-run318797.csv
231 231 2135 quarry-32500-en-wp_-all-abuse-filter-log-entries-in-2017-run318797.csv
$ wc quarry-32503-en-wp_-all-abuse-filter-log-entries-in-2018-run318831.csv
254 254 2353 quarry-32503-en-wp_-all-abuse-filter-log-entries-in-2018-run318831.csv
the data is still not enough for us to speak of a trend towards introducing more filters (after the initial dip)
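note: the line counts reported by wc presumably include the CSV header line, so the number of distinct filters triggered per year would be one less. A minimal sketch, assuming each file has exactly one header line ("afl_filter,count(*)") and ends with a newline; the names wc_lines and HEADER_LINES are made up for illustration, the counts are copied from the wc output above:

```python
# Line counts as reported by `wc` above, keyed by year
# (2009 stands for the "before 2010-01-01" file).
wc_lines = {
    2009: 220, 2010: 163, 2011: 161, 2012: 170, 2013: 178,
    2014: 154, 2015: 200, 2016: 204, 2017: 231, 2018: 254,
}

HEADER_LINES = 1  # assumption: one "afl_filter,count(*)" header per file

# Number of distinct filters with at least one hit in each year.
triggered_filters = {year: n - HEADER_LINES for year, n in wc_lines.items()}

for year, n in sorted(triggered_filters.items()):
    print(year, n)
```

Under that assumption the low point is 2014 and the high point is 2018.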
10 most active filters per year:
==> quarry-32489-en-wp-all-log-entries-before-20100101-run318768.csv <==
afl_filter,count(*)
135,175455
30,160302
61,147377
18,133640
3,95916
172,89710
50,88827
98,80434
65,74098
132,68607
==> quarry-32492-en-wp_-all-abuse-filter-log-entries-in-2010-run318774.csv <==
afl_filter,count(*)
61,245179
135,242018
172,148053
30,119226
225,109912
3,105376
50,101542
132,78633
189,74528
98,54805
==> quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2011-run318775.csv <==
afl_filter,count(*)
61,218493
135,185304
172,119532
402,109347
30,89151
3,75761
384,71911
225,68318
50,67425
432,66480
==> quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2012-run318778.csv <==
afl_filter,count(*)
135,173830
384,144202
432,126156
172,105082
30,93718
3,90724
380,67814
351,59226
279,58853
225,58352
==> quarry-32495-en-wp_-all-abuse-filter-log-entries-in-2013-run318779.csv <==
afl_filter,count(*)
135,133309
384,129807
432,94017
172,92871
30,85722
279,76738
3,70067
380,58668
491,55454
225,48390
==> quarry-32496-en-wp_-all-abuse-filter-log-entries-in-2014-run318780.csv <==
afl_filter,count(*)
384,111570
135,111173
279,97204
172,82042
432,75839
30,62495
3,60656
636,52639
231,39693
380,39624
==> quarry-32497-en-wp_-all-abuse-filter-log-entries-in-2015-run318782.csv <==
afl_filter,count(*)
650,226460
61,196986
636,191320
527,189911
633,162319
384,141534
279,110137
135,99057
686,95356
172,82874
==> quarry-32499-en-wp_-all-abuse-filter-log-entries-in-2016-run318789.csv <==
afl_filter,count(*)
527,437099
61,274945
650,229083
633,218696
636,179948
384,179871
279,106699
135,95131
172,79843
30,68968
==> quarry-32500-en-wp_-all-abuse-filter-log-entries-in-2017-run318797.csv <==
afl_filter,count(*)
61,250394
633,218146
384,200748
527,192441
636,156409
650,151604
135,80056
172,70837
712,59537
833,58133
==> quarry-32503-en-wp_-all-abuse-filter-log-entries-in-2018-run318831.csv <==
afl_filter,count(*)
527,358210
61,234867
633,201400
384,177543
833,161030
636,144674
650,79381
135,75348
686,70550
172,64266
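to see which filters stay among the most active over time, one can count in how many yearly top-10 lists each filter ID appears. A rough sketch, using only the IDs from the listings above (the variable names are hypothetical, and 2009 stands for the "before 2010-01-01" file):

```python
from collections import Counter

# Top-10 filter IDs per year, copied from the listings above.
top10 = {
    2009: [135, 30, 61, 18, 3, 172, 50, 98, 65, 132],
    2010: [61, 135, 172, 30, 225, 3, 50, 132, 189, 98],
    2011: [61, 135, 172, 402, 30, 3, 384, 225, 50, 432],
    2012: [135, 384, 432, 172, 30, 3, 380, 351, 279, 225],
    2013: [135, 384, 432, 172, 30, 279, 3, 380, 491, 225],
    2014: [384, 135, 279, 172, 432, 30, 3, 636, 231, 380],
    2015: [650, 61, 636, 527, 633, 384, 279, 135, 686, 172],
    2016: [527, 61, 650, 633, 636, 384, 279, 135, 172, 30],
    2017: [61, 633, 384, 527, 636, 650, 135, 172, 712, 833],
    2018: [527, 61, 633, 384, 833, 636, 650, 135, 686, 172],
}

# In how many yearly top-10 lists does each filter appear?
years_in_top10 = Counter(f for ids in top10.values() for f in ids)

# Filters that made the top 10 in every year covered.
always_top10 = sorted(f for f, n in years_in_top10.items() if n == len(top10))

print(always_top10)
print(len(years_in_top10), "distinct filters overall")
```

With these lists, only filters 135 and 172 are in the top 10 in every year, and 27 distinct filters appear overall.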
what do the most active filters do?
135 publicly available description: "repeating characters"; tag, warn
30 "large deletion from article by new editors"; tag, warn
61 "new user removing references" ("new user" is handled by "!("confirmed" in user_groups)"); tag
18 "test type edits from clicking on edit bar" (people don't replace Example texts when click-editing); filter seems to have been deleted in Feb 2012
3 "new user blanking articles"; tag, warn
172 "section blanking"; tag
50 "shouting" (contribution consists of all caps, numbers and punctuation); tag, warn
98 "creating very short new article"; tag
65 "excessive whitespace" (note: "associated with ascii art and some types of vandalism"); seems to have been deleted in Jan 2010
132 "removal of all categories"; tag, warn
225 "vandalism in all caps" (difference to 50? seems to be swear words, but shouldn't they be caught by 50 anyway?); OK, action is "disallow"
189 "BLP vandalism or libel" (uh.. what? seems to be about insulting living people); tag
402 "new article without references"; seems to have been deleted in Apr 2013, before that disabled with comment "disabling, no real use"
384 "addition of bad words or other vandalism" (seems to be a blacklist); disallow
432 "starting new line with lower case letters"; tag, warn //I recall there was a rule of thumb recommending not to use filters for style issues? although that's not really style, but rather wrong grammar..
380 hidden; public comment "multiple obscenities"; disallow
351 "text added after categories and interwiki"; tag, warn
279 "repeated attempts to vandalise"; tag, throttle (triggered when someone hits "edit" repeatedly in a short amount of time)
491 "edits ending with emoticons or !"; tag, warn
636 "unexplained removal of sourced content"; warn (that, together with 634 and 635 refutes my theory that warn always goes together with tag)
231 "long string of characters containing no spaces" (that's surely English though^^); tag, warn
650 "creation of a new article without any categories"; weird, it's marked as enabled here https://en.wikipedia.org/wiki/Special:AbuseFilter/650 , but does not appear in the actions data set; ah, OK, that is because there are no actions (other than logging, probably)
527 hidden; public comments "T34234: log/throttle possible sleeper account creations"; throttle
633 "possible canned edit summary" (I think that's an edit summary that does not reflect the real edit; pre-filled on mobile though); tag
686 "IP adding possible unreferenced material to BLP" (BLP = biographies of living persons? I thought it was forbidden to edit them without a registered account); no actions
712 "possibly changing date of birth in infobox" ("possibly"? and I thought infoboxes were pre-generated from wikidata?); no actions
833 "newer user possibly adding a unreferenced or improperly referenced material"; no actions
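the "warn always goes together with tag" theory can be checked mechanically against the list above. A sketch only: the actions are the ones noted here at the time of writing (they may change), an empty set stands for "no actions", and the deleted filters 18, 65 and 402 are left out because their actions are not recorded in these notes; all names are hypothetical:

```python
# Actions per filter as noted above; an empty set means "no actions"
# (probably logging only).
actions = {
    135: {"tag", "warn"}, 30: {"tag", "warn"}, 61: {"tag"},
    3: {"tag", "warn"}, 172: {"tag"}, 50: {"tag", "warn"},
    98: {"tag"}, 132: {"tag", "warn"}, 225: {"disallow"},
    189: {"tag"}, 384: {"disallow"}, 432: {"tag", "warn"},
    380: {"disallow"}, 351: {"tag", "warn"}, 279: {"tag", "throttle"},
    491: {"tag", "warn"}, 636: {"warn"}, 231: {"tag", "warn"},
    650: set(), 527: {"throttle"}, 633: {"tag"},
    686: set(), 712: set(), 833: set(),
}

# Filters that warn without tagging (counterexamples to the theory).
warn_without_tag = sorted(
    f for f, a in actions.items() if "warn" in a and "tag" not in a
)

# Filters with no actions at all.
log_only = sorted(f for f, a in actions.items() if not a)

print("warn without tag:", warn_without_tag)
print("no actions:", log_only)
```

Among the filters listed here, 636 is the only one that warns without tagging; 634 and 635 are not in this list and would have to be added separately.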
* how often were (which) filters triggered
https://tools.wmflabs.org/ptwikis/Filters:enwiki
@@ -247,10 +431,13 @@ links to single filters, e.g. --> https://en.wikipedia.org/wiki/Special:AbuseFil
"Visibility" is: private | public
"Hit count": which period is counted? total number of hits since the filter was enabled? (for all enabled periods, in case it was enabled/disabled multiple times?)
Filter with most hits (altogether):
Filter ID | Public description | Actions | Status | Last modified | Visibility | Hit count
61 | New user removing references | Tag | Enabled | 12:43, 14 May 2017 by Zzuuzz | Public | 1,593,851 hits
see also quarry-32518;
the thing is, we can't really classify hits by filter actions since actions triggered by the filters change
https://en.wikipedia.org/wiki/Special:AbuseFilter/61
statistics are info such as "Of the last 1,728 actions, this filter has matched 10 (0.58%). On average, its run time is 0.34 ms, and it consumes 3 conditions of the condition limit." // not sure what the condition limit is
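the percentage in that statistics line is presumably just matches divided by the number of recent actions. A quick check with the two numbers quoted above (variable names are made up):

```python
# Numbers quoted in the statistics line for filter 61.
matched = 10
last_actions = 1728

# Share of recent actions that matched the filter, in percent.
match_rate = 100 * matched / last_actions

print(f"{match_rate:.2f}%")
```

This reproduces the quoted 0.58%.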
@@ -267,6 +454,9 @@ https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=abuselog&aflu
* percentage of triggered filters/all edits
* break down triggered filters according to typology
* percentage filters of different types over the years
the thing is, we can't really classify hits by filter actions since actions triggered by the filters change
we only know what the actions of a given filter are right now...
maybe the dumps can answer this question?
We can try to map some of the descriptive statistics with the WM quarry service:
https://quarry.wmflabs.org/query/32483
@@ -274,4 +464,4 @@ https://quarry.wmflabs.org/query/32489
https://quarry.wmflabs.org/query/32487
give us an idea of what data the abuse filter related tables contain
Results: see above
@@ -403,6 +403,8 @@ mysql> describe abuse_filter_history; (from https://www.mediawiki.org/wiki/Exten
+---------------------+---------------------+------+-----+---------+----------------+
13 rows in set (0.00 sec)
Note! table abuse_filter_history seems to not exist anymore
mysql> describe abuse_filter_action; (from https://www.mediawiki.org/wiki/Extension:AbuseFilter/abuse_filter_action_table)
+-----------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
@@ -413,6 +415,9 @@ mysql> describe abuse_filter_action; (from https://www.mediawiki.org/wiki/Extens
+-----------------+---------------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
Seems to contain data for currently enabled filters only;
Question: how do we find data for disabled filters?
# API calls
## List information about filters: