From afc5557dd71bccd8b9bcdd4fe2bfd64c6916462a Mon Sep 17 00:00:00 2001
From: Lyudmila Vaseva <vaseva@mi.fu-berlin.de>
Date: Sat, 20 Jul 2019 15:08:50 +0200
Subject: [PATCH] Continue refactoring filter hits analysis

---
 thesis/5-Overview-EN-Wiki.tex | 41 ++++++++++++++++++++---------------
 thesis/references.bib         |  9 ++++++++
 2 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/thesis/5-Overview-EN-Wiki.tex b/thesis/5-Overview-EN-Wiki.tex
index 5fe4e7c..a837cbf 100644
--- a/thesis/5-Overview-EN-Wiki.tex
+++ b/thesis/5-Overview-EN-Wiki.tex
@@ -384,39 +384,44 @@ or that there wasn't a general surge in vandalism around this time.
 \end{figure}
 
 \textbf{There was a change in the edit filter software that allowed more filters to be activated, or a bug that caused false positives}\\
-Since so far, neither of the other hypothesis could be verifies, this explanation sounds likely.
+Since so far neither of the other hypotheses could be verified, this explanation sounds likely.
 Another piece of data that seems to support it is the breakdown of the filter hits according to triggered filter action.
 As demonstrated on figure~\ref{fig:filter-hits-actions}, there was above all a significant hits peak caused by ``log only'' filters.
-As discussed in section~\ref{sec:introduce-a-filter}, it is an established praxis to introduce new filters in ``log only'' mode and only switch on additional filter actions after a monitoring period that demonstrated that the filters function as desired/intended.
-Hence, it sounds plausible that new filters in logging mode were introduced, which were then switched off after a significant number of false positives occured.
+As discussed in section~\ref{sec:introduce-a-filter}, it is an established practice to introduce new filters in ``log only'' mode and only switch on additional filter actions after a monitoring period has shown that the filters function as intended.
+Hence, it is plausible that new filters in logging mode were introduced, which were then switched off after a significant number of false positives occurred.
 However, upon closer scritiny, this could not be confirmed.
-The most frequently triggered filters in the period Jan-March 2016 are mainly the most triggered filters of all times and nearly all of them have been around for a while in 2016.
-Also, no bug or a comparable incident with the software was found upon an inspection of the extension's issue tracker~\cite{phab-abusefilter-2015}, or commit messages of the commits to the software done during this period (May 2015–May 2016)~\cite{gerrit-abusefilter-source}.
+The most frequently triggered filters in the period January–March 2016 are mainly the most triggered filters of all time, and nearly all of them had already been around for a while in 2016.
+Also, no bug or comparable incident with the software was found upon an inspection of the extension's issue tracker~\cite{phab-abusefilter-2015} or of the messages of the commits made to the software during May 2015–May 2016~\cite{gerrit-abusefilter-source}.
 Moreover, no mention of the hits surge was found in the noticeboard~\cite{Wikipedia:EditFilterNoticeboard} and edit filter talk page archives~\cite{Wikipedia:EditFilterTalkArchive2016}.
 The in section~\ref{sec:filter-activity} mentioned condition limit has not changed either, as far as I can tell from the issue tracker, the commits and discussion archives, so the possible explanation that simply more filters have been at work since 2016 seems to be refuted as well.
-The only somewhat telling/interesting patterns/phenomena that seem to shed some light on the matter are the breakdown of hits according to the editor's action which triggered them: there is an obvious surge in the attempted account creations in this period (see figure~\ref{fig:filter-hits-editors-actions}).
-As a matter of fact, they could also be the explanation for the peak of log only hits–the most frequently tripped filter for the period January–March 2016 is filter 527 ``T34234: log/throttle possible sleeper account creations''.
+
+The only somewhat interesting pattern that seems to shed some light on the matter is the breakdown of hits according to the editor's action which triggered them:
+there is an obvious surge in attempted account creations in this period (see figure~\ref{fig:filter-hits-editors-actions}).
+As a matter of fact, this could also explain the peak of ``log only'' hits: the most frequently tripped filter for the period January–March 2016 is filter 527 ``T34234: log/throttle possible sleeper account creations''.
 It is a throttle (only) filter, so everytime an edit matches its regex pattern, a ``log only'' entry is created in the abuse log.
 %it disallows every X attempt, only logging the rest of the account creations.
 %I think in its current form, it does not actually disallow anything, a ``disallow'' action should be enabled for this and the filter action is only 'throttle'; so in this form, it seems to simply log account creations
+And the third most active filter is a ``log only'' filter as well: 650 ``Creation of a new article without any categories'' (it was neither introduced at the time, nor was there any major change in its filter pattern).
+Together, filters 527 and 650 are responsible for over 60\% of the ``log only'' hits in each of the months January, February and March 2016.
 
-Their edits however constitue some 1-3\% of all hits from the period, so the explanation ``it was viagra spam coming from Russian IPs'' is somewhat insufficient. +Another idea that seemed worth persuing was to look into the editors who tripped filters and their corresponding edits. +For the period January-March 2016 there are some very active IP editors, the top of whom (with over $1.000$ hits) seemed to be engaging exclusively in the (probably automated) posting of spam links. +Their edits however constitute some 1-3\% of all hits from the period, so the explanation ``it was viagra spam coming from Russian IPs'' is somewhat unsatisfactory. (Yes, it was viagra spam, and yes, a ``whois'' lookup proved them to really be Russian IPs. And, yes, whoever was editing could've also used a VPN, so I'm not opening a Russian bot fake news conspiracy theory just yet.) +A closer/more systematic scrutiny (syn!) of the editors causing the hits may be insightful though. +Right now, all the data analysed on the matter stems from the \emph{abuse\_filter\_log} table and the checks of the content of the edits were done manually on a sample basis via the web frontend of the AbuseLog~\cite{Wikipedia:AbuseLog} where one can click on the diff of the edit for edits that triggered public filters. +No simple automated check of what the offending editors were contributing was possible since the \emph{abuse\_filter\_log} table does not store the text of the edit which triggered a filter directly, but rather contains a reference to the \emph{text} table where the wikitext of all individual page revisions is stored~\cite{Wikipedia:TextTable}. +One needs to join the hit data from \emph{abuse\_filter\_log} with the \emph{text} table to obtain the content of the edits. 
-
-Significant Geo/Socio-political events from the time, which triggered a lot of media (and Internet) attention and desinformation campaigns
-- 2016 US elections
-- Brexit referendum
-- the so-called ``refugee crisis'' in Europe
-
+Last but not least, I took a step back and considered the significant geo- and socio-political events from the time, which triggered a lot of media (and Internet) attention and disinformation campaigns.
+The following came to mind: the 2016 US elections, the Brexit referendum and the so-called ``refugee crisis'' in Europe.
 There was also a severe organisational crisis in Wikimedia at the time during which a lot of staff left and eventually the executive director stepped down.
-
 However, I couldn't draw a direct relationship between any of these political events and the edits which triggered edit filters.
 An investigation into the pages on which the filters were triggered proved them (the pages) to be quite innocuous:
-one of the pages where most filter hits were logged in January 2016 was skateboard and the ~660 filter hits here seem like a drop in the ocean compared to the 37X.000 hits for the whole month.
+the page where most filter hits were logged in January 2016 (besides the login page, on which all account creations are logged) was ``Skateboard'', and the $660$ filter hits there seem like a drop in the ocean compared to the $372.907$ hits for the whole month.
+And the page with the most hits in March (apart from the user login page) was the user page of user 209.236.119.231, who was also the editor with the second most hits and who was apparently trying to post spam links on their own user page (after posting twice to ``Skateboard'').
+%Should I even mention this at all?
 
 \begin{figure}
 \centering
diff --git a/thesis/references.bib b/thesis/references.bib
index 8359cc1..e8536c4 100644
--- a/thesis/references.bib
+++ b/thesis/references.bib
@@ -765,6 +765,15 @@
 \url{https://en.wikipedia.org/w/index.php?title=Wikipedia:STiki&oldid=879253675}}
 }
 
+@misc{Wikipedia:TextTable,
+ key = "Wikipedia Text Table",
+ author = {},
+ title = {Wikipedia: Text Table},
+ year = 2019,
+ note = {Retrieved July 20, 2019 from
+ \url{https://www.mediawiki.org/w/index.php?title=Manual:Text_table&oldid=3287673}}
+}
+
 @misc{Wikipedia:Twinkle,
 key = "Wikipedia Twinkle Tool",
 author = {},
-- 
GitLab