From f9ac537ff7a3898df172149863abdf06fbbe712c Mon Sep 17 00:00:00 2001
From: Lyudmila Vaseva <vaseva@mi.fu-berlin.de>
Date: Sun, 6 Jan 2019 13:02:25 +0100
Subject: [PATCH] Document most triggered filters over the years

---
 EN-state-of-the-art | 196 +++++++++++++++++++++++++++++++++++++++++++-
 notes               |   5 ++
 2 files changed, 198 insertions(+), 3 deletions(-)

diff --git a/EN-state-of-the-art b/EN-state-of-the-art
index f2ed3f2..17257ef 100644
--- a/EN-state-of-the-art
+++ b/EN-state-of-the-art
@@ -225,7 +225,191 @@ According to the Edit filter Notice board:
 
 EN: There are currently 201 enabled filters, and 12 stale filters with no hits in the past 30 days (Purge). from https://en.wikipedia.org/wiki/Special:AbuseFilter (29.11.2018)
 
---> upward tendency
+--> upward tendency^^; not particularly significant
+
+Thanks to Quarry queries, we have all filters that were triggered according to the filter log, per year, from 2009 (when filters were first introduced/the MediaWiki extension was enabled) until the end of 2018, each with the number of times it was triggered:
+$ wc quarry-32489-en-wp-all-log-entries-before-20100101-run318768.csv
+ 220  220 1879 quarry-32489-en-wp-all-log-entries-before-20100101-run318768.csv
+$ wc quarry-32492-en-wp_-all-abuse-filter-log-entries-in-2010-run318774.csv
+ 163  163 1476 quarry-32492-en-wp_-all-abuse-filter-log-entries-in-2010-run318774.csv
+$ wc quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2011-run318775.csv
+ 161  161 1480 quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2011-run318775.csv
+$ wc quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2012-run318778.csv
+ 170  170 1576 quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2012-run318778.csv
+$ wc quarry-32495-en-wp_-all-abuse-filter-log-entries-in-2013-run318779.csv
+ 178  178 1632 quarry-32495-en-wp_-all-abuse-filter-log-entries-in-2013-run318779.csv
+$ wc quarry-32496-en-wp_-all-abuse-filter-log-entries-in-2014-run318780.csv
+ 154  154 1434 quarry-32496-en-wp_-all-abuse-filter-log-entries-in-2014-run318780.csv
+$ wc quarry-32497-en-wp_-all-abuse-filter-log-entries-in-2015-run318782.csv
+ 200  200 1845 quarry-32497-en-wp_-all-abuse-filter-log-entries-in-2015-run318782.csv
+$ wc quarry-32499-en-wp_-all-abuse-filter-log-entries-in-2016-run318789.csv
+ 204  204 1902 quarry-32499-en-wp_-all-abuse-filter-log-entries-in-2016-run318789.csv
+$ wc quarry-32500-en-wp_-all-abuse-filter-log-entries-in-2017-run318797.csv
+ 231  231 2135 quarry-32500-en-wp_-all-abuse-filter-log-entries-in-2017-run318797.csv
+$ wc quarry-32503-en-wp_-all-abuse-filter-log-entries-in-2018-run318831.csv
+ 254  254 2353 quarry-32503-en-wp_-all-abuse-filter-log-entries-in-2018-run318831.csv
+
+the data is still not sufficient for us to speak of a tendency towards introducing more filters (after the initial dip)
+
+10 most active filters per year:
+==> quarry-32489-en-wp-all-log-entries-before-20100101-run318768.csv <==
+afl_filter,count(*)
+135,175455
+30,160302
+61,147377
+18,133640
+3,95916
+172,89710
+50,88827
+98,80434
+65,74098
+132,68607
+
+==> quarry-32492-en-wp_-all-abuse-filter-log-entries-in-2010-run318774.csv <==
+afl_filter,count(*)
+61,245179
+135,242018
+172,148053
+30,119226
+225,109912
+3,105376
+50,101542
+132,78633
+189,74528
+98,54805
+
+==> quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2011-run318775.csv <==
+afl_filter,count(*)
+61,218493
+135,185304
+172,119532
+402,109347
+30,89151
+3,75761
+384,71911
+225,68318
+50,67425
+432,66480
+
+==> quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2012-run318778.csv <==
+afl_filter,count(*)
+135,173830
+384,144202
+432,126156
+172,105082
+30,93718
+3,90724
+380,67814
+351,59226
+279,58853
+225,58352
+
+==> quarry-32495-en-wp_-all-abuse-filter-log-entries-in-2013-run318779.csv <==
+afl_filter,count(*)
+135,133309
+384,129807
+432,94017
+172,92871
+30,85722
+279,76738
+3,70067
+380,58668
+491,55454
+225,48390
+
+==> quarry-32496-en-wp_-all-abuse-filter-log-entries-in-2014-run318780.csv <==
+afl_filter,count(*)
+384,111570
+135,111173
+279,97204
+172,82042
+432,75839
+30,62495
+3,60656
+636,52639
+231,39693
+380,39624
+
+==> quarry-32497-en-wp_-all-abuse-filter-log-entries-in-2015-run318782.csv <==
+afl_filter,count(*)
+650,226460
+61,196986
+636,191320
+527,189911
+633,162319
+384,141534
+279,110137
+135,99057
+686,95356
+172,82874
+
+==> quarry-32499-en-wp_-all-abuse-filter-log-entries-in-2016-run318789.csv <==
+afl_filter,count(*)
+527,437099
+61,274945
+650,229083
+633,218696
+636,179948
+384,179871
+279,106699
+135,95131
+172,79843
+30,68968
+
+==> quarry-32500-en-wp_-all-abuse-filter-log-entries-in-2017-run318797.csv <==
+afl_filter,count(*)
+61,250394
+633,218146
+384,200748
+527,192441
+636,156409
+650,151604
+135,80056
+172,70837
+712,59537
+833,58133
+
+==> quarry-32503-en-wp_-all-abuse-filter-log-entries-in-2018-run318831.csv <==
+afl_filter,count(*)
+527,358210
+61,234867
+633,201400
+384,177543
+833,161030
+636,144674
+650,79381
+135,75348
+686,70550
+172,64266
+
+what do the most active filters do?
+135 publicly available description: "repeating characters"; tag, warn
+30 "large deletion from article by new editors"; tag, warn
+61 "new user removing references" ("new user" is handled by "!("confirmed" in user_groups)"); tag
+18 "test type edits from clicking on edit bar" (people don't replace Example texts when click-editing); filter seems to have been deleted in Feb 2012
+3 "new user blanking articles"; tag, warn
+172 "section blanking"; tag
+50 "shouting" (contribution consists of all caps, numbers and punctuation); tag, warn
+98 "creating very short new article"; tag
+65 "excessive whitespace" (note: "associated with ascii art and some types of vandalism"); seems to have been deleted in Jan 2010
+132 "removal of all categories"; tag, warn
+225 "vandalism in all caps" (difference to 50? seems to be swear words, but shouldn't they be caught by 50 anyway?); ok, action is "disallow"
+189 "BLP vandalism or libel" (uh.. what? seems to be insulting living people); tag
+402 "new article without references"; seems to have been deleted in Apr 2013, before that disabled with comment "disabling, no real use"
+384 "addition of bad words or other vandalism" (seems to be a blacklist); disallow
+432 "starting new line with lower case letters"; tag, warn //I recall there was a rule of thumb recommending not to use filters for style things? although that's not really style, but rather wrong grammar..
+380 hidden; public comment "multiple obscenities"; disallow
+351 "text added after categories and interwiki"; tag, warn
+279 "repeated attempts to vandalise"; tag, throttle (triggered when someone hits "edit" repeatedly in a short amount of time)
+491 "edits ending with emoticons or !"; tag, warn
+636 "unexplained removal of sourced content"; warn (that, together with 634 and 635 refutes my theory that warn always goes together with tag)
+231 "long string of characters containing no spaces" (that's surely english though^^); tag, warn
+650 "creation of a new article without any categories"; weird, it's marked as enabled here https://en.wikipedia.org/wiki/Special:AbuseFilter/650 , but does not appear in the actions data set; ah, ok, that is because there are no actions (other than logging probably)
+527 hidden; public comments "T34234: log/throttle possible sleeper account creations"; throttle
+633 "possible canned edit summary" (I think that's an edit summary that does not reflect the real edit; pre-filled on mobile though); tag
+686 "IP adding possible unreferenced material to BLP" (BLP = biography of living people? I thought it was forbidden to edit them without a registered account); no actions
+712 "possibly changing date of birth in infobox" ("possibly"? and I thought infoboxes were pre-generated from wikidata?); no actions
+833 "newer user possibly adding a unreferenced or improperly referenced material"; no actions
 
 * how often were (which) filters triggered
 https://tools.wmflabs.org/ptwikis/Filters:enwiki
@@ -247,10 +431,13 @@ links to single filters, e.g. --> https://en.wikipedia.org/wiki/Special:AbuseFil
 "Visibility" is: private | public
 "Hit count": which period is counted? total number of hits since the filter was enabled? (for all enabled periods, in case it was enabled/disabled multiple times?)
 
-Filter with most hits:
+Filter with most hits (altogether):
 Filter ID 	Public description 	Actions 	Status 	Last modified 	Visibility 	Hit count
 61 	New user removing references 	Tag 	Enabled 	12:43, 14 May 2017 by Zzuuzz (talk | contribs) 	Public 	1,593,851 hits
 
+see also quarry-32518;
+the thing is, we can't really classify hits by filter actions, since the actions triggered by a filter change over time
+
 https://en.wikipedia.org/wiki/Special:AbuseFilter/61
 statistics are info such as "Of the last 1,728 actions, this filter has matched 10 (0.58%). On average, its run time is 0.34 ms, and it consumes 3 conditions of the condition limit." // not sure what the condition limit is
 
@@ -267,6 +454,9 @@ https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=abuselog&aflu
 * percentage of triggered filters/all edits
   * break down triggered filters according to typology
 * percentage filters of different types over the years
+the thing is, we can't really classify hits by filter actions, since the actions triggered by a filter change over time
+we only know what the actions of a given filter are right now...
+maybe the dumps can answer this question?
 
 We can try to map some of the descriptive statistics with the WM quarry service:
 https://quarry.wmflabs.org/query/32483
@@ -274,4 +464,4 @@ https://quarry.wmflabs.org/query/32489
 https://quarry.wmflabs.org/query/32487
 give us an idea of what data the abuse filter related tables contain
 
-Results:
+Results: see above
diff --git a/notes b/notes
index 7b3d137..6693f61 100644
--- a/notes
+++ b/notes
@@ -403,6 +403,8 @@ mysql> describe abuse_filter_history; (from https://www.mediawiki.org/wiki/Exten
 +---------------------+---------------------+------+-----+---------+----------------+
 13 rows in set (0.00 sec)
 
+Note! the table abuse_filter_history no longer seems to exist
+
 mysql> describe abuse_filter_action; (from https://www.mediawiki.org/wiki/Extension:AbuseFilter/abuse_filter_action_table)
 +-----------------+---------------------+------+-----+---------+-------+
 | Field           | Type                | Null | Key | Default | Extra |
@@ -413,6 +415,9 @@ mysql> describe abuse_filter_action; (from https://www.mediawiki.org/wiki/Extens
 +-----------------+---------------------+------+-----+---------+-------+
 3 rows in set (0.00 sec)
 
+Seems to contain data for currently enabled filters only;
+Question: how do we find data for disabled filters?
+
 # API calls
 
 ## List information about filters:
-- 
GitLab