From f9ac537ff7a3898df172149863abdf06fbbe712c Mon Sep 17 00:00:00 2001
From: Lyudmila Vaseva <vaseva@mi.fu-berlin.de>
Date: Sun, 6 Jan 2019 13:02:25 +0100
Subject: [PATCH] Document most triggered filters over the years

---
 EN-state-of-the-art | 196 +++++++++++++++++++++++++++++++++++++++++++-
 notes               |   5 ++
 2 files changed, 198 insertions(+), 3 deletions(-)

diff --git a/EN-state-of-the-art b/EN-state-of-the-art
index f2ed3f2..17257ef 100644
--- a/EN-state-of-the-art
+++ b/EN-state-of-the-art
@@ -225,7 +225,191 @@ According to the Edit filter Notice board:
 EN: There are currently 201 enabled filters, and 12 stale filters with no hits in the past 30 days (Purge).
 from https://en.wikipedia.org/wiki/Special:AbuseFilter (29.11.2018)

---> upward tendency
+--> upward tendency^^; not particularly significant
+
+thanks to Quarry queries, we have all the filters that were triggered according to the filter log per year, from 2009 (when filters were first introduced/the MediaWiki extension was enabled) till the end of 2018, with the number of times each was triggered:
+$ wc quarry-32489-en-wp-all-log-entries-before-20100101-run318768.csv
+ 220 220 1879 quarry-32489-en-wp-all-log-entries-before-20100101-run318768.csv
+$ wc quarry-32492-en-wp_-all-abuse-filter-log-entries-in-2010-run318774.csv
+ 163 163 1476 quarry-32492-en-wp_-all-abuse-filter-log-entries-in-2010-run318774.csv
+$ wc quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2011-run318775.csv
+ 161 161 1480 quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2011-run318775.csv
+$ wc quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2012-run318778.csv
+ 170 170 1576 quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2012-run318778.csv
+$ wc quarry-32495-en-wp_-all-abuse-filter-log-entries-in-2013-run318779.csv
+ 178 178 1632 quarry-32495-en-wp_-all-abuse-filter-log-entries-in-2013-run318779.csv
+$ wc quarry-32496-en-wp_-all-abuse-filter-log-entries-in-2014-run318780.csv
+ 154 154 1434 quarry-32496-en-wp_-all-abuse-filter-log-entries-in-2014-run318780.csv
+$ wc quarry-32497-en-wp_-all-abuse-filter-log-entries-in-2015-run318782.csv
+ 200 200 1845 quarry-32497-en-wp_-all-abuse-filter-log-entries-in-2015-run318782.csv
+$ wc quarry-32499-en-wp_-all-abuse-filter-log-entries-in-2016-run318789.csv
+ 204 204 1902 quarry-32499-en-wp_-all-abuse-filter-log-entries-in-2016-run318789.csv
+$ wc quarry-32500-en-wp_-all-abuse-filter-log-entries-in-2017-run318797.csv
+ 231 231 2135 quarry-32500-en-wp_-all-abuse-filter-log-entries-in-2017-run318797.csv
+$ wc quarry-32503-en-wp_-all-abuse-filter-log-entries-in-2018-run318831.csv
+ 254 254 2353 quarry-32503-en-wp_-all-abuse-filter-log-entries-in-2018-run318831.csv
+
+the data is still not enough for us to talk about a tendency towards introducing more filters (after the initial dip)
+
+10 most active filters per year:
+==> quarry-32489-en-wp-all-log-entries-before-20100101-run318768.csv <==
+afl_filter,count(*)
+135,175455
+30,160302
+61,147377
+18,133640
+3,95916
+172,89710
+50,88827
+98,80434
+65,74098
+132,68607
+
+==> quarry-32492-en-wp_-all-abuse-filter-log-entries-in-2010-run318774.csv <==
+afl_filter,count(*)
+61,245179
+135,242018
+172,148053
+30,119226
+225,109912
+3,105376
+50,101542
+132,78633
+189,74528
+98,54805
+
+==> quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2011-run318775.csv <==
+afl_filter,count(*)
+61,218493
+135,185304
+172,119532
+402,109347
+30,89151
+3,75761
+384,71911
+225,68318
+50,67425
+432,66480
+
+==> quarry-32493-en-wp_-all-abuse-filter-log-entries-in-2012-run318778.csv <==
+afl_filter,count(*)
+135,173830
+384,144202
+432,126156
+172,105082
+30,93718
+3,90724
+380,67814
+351,59226
+279,58853
+225,58352
+
+==> quarry-32495-en-wp_-all-abuse-filter-log-entries-in-2013-run318779.csv <==
+afl_filter,count(*)
+135,133309
+384,129807
+432,94017
+172,92871
+30,85722
+279,76738
+3,70067
+380,58668
+491,55454
+225,48390
+
+==> quarry-32496-en-wp_-all-abuse-filter-log-entries-in-2014-run318780.csv <==
+afl_filter,count(*)
+384,111570
+135,111173
+279,97204
+172,82042
+432,75839
+30,62495
+3,60656
+636,52639
+231,39693
+380,39624
+
+==> quarry-32497-en-wp_-all-abuse-filter-log-entries-in-2015-run318782.csv <==
+afl_filter,count(*)
+650,226460
+61,196986
+636,191320
+527,189911
+633,162319
+384,141534
+279,110137
+135,99057
+686,95356
+172,82874
+
+==> quarry-32499-en-wp_-all-abuse-filter-log-entries-in-2016-run318789.csv <==
+afl_filter,count(*)
+527,437099
+61,274945
+650,229083
+633,218696
+636,179948
+384,179871
+279,106699
+135,95131
+172,79843
+30,68968
+
+==> quarry-32500-en-wp_-all-abuse-filter-log-entries-in-2017-run318797.csv <==
+afl_filter,count(*)
+61,250394
+633,218146
+384,200748
+527,192441
+636,156409
+650,151604
+135,80056
+172,70837
+712,59537
+833,58133
+
+==> quarry-32503-en-wp_-all-abuse-filter-log-entries-in-2018-run318831.csv <==
+afl_filter,count(*)
+527,358210
+61,234867
+633,201400
+384,177543
+833,161030
+636,144674
+650,79381
+135,75348
+686,70550
+172,64266
+
+what do the most active filters do?
+135 publicly available description: "repeating characters"; tag, warn
+30 "large deletion from article by new editors"; tag, warn
+61 "new user removing references" ("new user" is handled by "!("confirmed" in user_groups)"); tag
+18 "test type edits from clicking on edit bar" (people don't replace Example texts when click-editing); filter seems to have been deleted in Feb 2012
+3 "new user blanking articles"; tag, warn
+172 "section blanking"; tag
+50 "shouting" (contribution consists of all caps, numbers and punctuation); tag, warn
+98 "creating very short new article"; tag
+65 "excessive whitespace" (note: "associated with ascii art and some types of vandalism"); seems to have been deleted in Jan 2010
+132 "removal of all categories"; tag, warn
+225 "vandalism in all caps" (difference to 50? seems to be swear words, but shouldn't they be caught by 50 anyway?); ok, action is "disallow"
+189 "BLP vandalism or libel" (uh.. what? seems to be insulting living people); tag
+402 "new article without references"; seems to have been deleted in Apr 2013, before that disabled with comment "disabling, no real use"
+384 "addition of bad words or other vandalism" (seems to be a blacklist); disallow
+432 "starting new line with lower case letters"; tag, warn //I recall there was a rule of thumb recommending not to use filters for style things? although that's not really style, but rather wrong grammar..
+380 hidden; public comment "multiple obscenities"; disallow
+351 "text added after categories and interwiki"; tag, warn
+279 "repeated attempts to vandalise"; tag, throttle (triggered when someone hits "edit" repeatedly in a short amount of time)
+491 "edits ending with emoticons or !"; tag, warn
+636 "unexplained removal of sourced content"; warn (that, together with 634 and 635, refutes my theory that warn always goes together with tag)
+231 "long string of characters containing no spaces" (that's surely English though^^); tag, warn
+650 "creation of a new article without any categories"; weird, it's marked as enabled here https://en.wikipedia.org/wiki/Special:AbuseFilter/650 , but does not appear in the actions data set; ah, ok, that is because there are no actions (other than logging probably)
+527 hidden; public comments "T34234: log/throttle possible sleeper account creations"; throttle
+633 "possible canned edit summary" (I think that's an edit summary that does not reflect the real edit; pre-filled on mobile though); tag
+686 "IP adding possible unreferenced material to BLP" (BLP = biography of living people? I thought it was forbidden to edit them without a registered account); no actions
+712 "possibly changing date of birth in infobox" ("possibly"? and I thought infoboxes were pre-generated from wikidata?); no actions
+833 "newer user possibly adding a unreferenced or improperly referenced material"; no actions

 * how often were (which) filters triggered
 https://tools.wmflabs.org/ptwikis/Filters:enwiki
@@ -247,10 +431,13 @@ links to single filters, e.g. --> https://en.wikipedia.org/wiki/Special:AbuseFil
 "Visibility" is: private | public
 "Hit count": which period is counted? total number of hits since the filter was enabled? (for all enabled periods, in case it was enabled/disabled multiple times?)

-Filter with most hits:
+Filter with most hits (altogether):
 Filter ID Public description Actions Status Last modified Visibility Hit count
 61 New user removing references Tag Enabled 12:43, 14 May 2017 by Zzuuzz (talk | contribs) Public 1,593,851 hits

+see also quarry-32518;
+the thing is, we can't really classify hits by filter actions since actions triggered by the filters change
+
 https://en.wikipedia.org/wiki/Special:AbuseFilter/61
 statistics are info such as "Of the last 1,728 actions, this filter has matched 10 (0.58%). On average, its run time is 0.34 ms, and it consumes 3 conditions of the condition limit." // not sure what the condition limit is

@@ -267,6 +454,9 @@ https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=abuselog&aflu
 * percentage of triggered filters/all edits
 * break down triggered filters according to typology
 * percentage filters of different types over the years
+the thing is, we can't really classify hits by filter actions since actions triggered by the filters change
+we know what the actions of a given filter are just for now...
+maybe the dumps can answer this question?

 We can try to map some of the descriptive statistics with the WM quarry service:
 https://quarry.wmflabs.org/query/32483
@@ -274,4 +464,4 @@ https://quarry.wmflabs.org/query/32489
 https://quarry.wmflabs.org/query/32487
 give us an idea of what data the abuse filter related tables contain

-Results:
+Results: see above
diff --git a/notes b/notes
index 7b3d137..6693f61 100644
--- a/notes
+++ b/notes
@@ -403,6 +403,8 @@ mysql> describe abuse_filter_history; (from https://www.mediawiki.org/wiki/Exten
 +---------------------+---------------------+------+-----+---------+----------------+
 13 rows in set (0.00 sec)

+Note! table abuse_filter_history seems to not exist anymore
+
 mysql> describe abuse_filter_action; (from https://www.mediawiki.org/wiki/Extension:AbuseFilter/abuse_filter_action_table)
 +-----------------+---------------------+------+-----+---------+-------+
 | Field | Type | Null | Key | Default | Extra |
@@ -413,6 +415,9 @@ mysql> describe abuse_filter_action; (from https://www.mediawiki.org/wiki/Extens
 +-----------------+---------------------+------+-----+---------+-------+
 3 rows in set (0.00 sec)

+Seems to contain data for currently enabled filters only;
+Question: how do we find data for disabled filters?
+
 # API calls

 ## List information about filters:
-- 
GitLab