From 96451f0f0d8b621c64a2ee8ee9df67e1f2cf9d78 Mon Sep 17 00:00:00 2001 From: Lyudmila Vaseva <vaseva@mi.fu-berlin.de> Date: Tue, 23 Jul 2019 17:41:51 +0200 Subject: [PATCH] Refactor filter activity section --- thesis/5-Overview-EN-Wiki.tex | 204 ++++++++++++--------------- thesis/appendix.tex | 212 ---------------------------- thesis/misc.tex | 257 ++++++++++++++++++++++++++++++++++ 3 files changed, 347 insertions(+), 326 deletions(-) diff --git a/thesis/5-Overview-EN-Wiki.tex b/thesis/5-Overview-EN-Wiki.tex index 982941f..1c41c48 100644 --- a/thesis/5-Overview-EN-Wiki.tex +++ b/thesis/5-Overview-EN-Wiki.tex @@ -270,55 +270,36 @@ Although apparently there are determined trolls who ``work accounts up'' to admi \section{Filter activity} \label{sec:filter-activity} -\begin{comment} -\subsection{Distinct filters over the years} -Thanks to quarry~\footnote{\url{https://quarry.wmflabs.org/}}, we have the numbers of all distinct filters triggered per year -from 2009 (when filters were first introduced/the MediaWiki extension was enabled) until the end of 2018: see table~\ref{tab:active-filters-count}. -This figure varies between $154$ in year 2014 and $254$ in 2018. -The explanation for this not particularly wide range of active filters lies probably in the so-called condition limit. -According to the edit filters' documentation~\cite{Wikipedia:EditFilterDocumentation}, the condition limit is a hard-coded treshold of total available conditions that can be evaluated by all active filters per incoming edit. -Currently, it is set to $1,000$. -The motivation for this heuristic is to avoid performance issues since every incoming edit is checked against all currently enabled filters which means that the more filters are active the longer the checks take. 
-However, the page also warns that counting conditions is not the ideal metric of filter performance, since there are simple comparisons that take significantly less time than a check against the \emph{all\_links} variable for example (which needs to query the database)~\cite{Wikipedia:EditFilterDocumentation}. -Nevertheless, the condition limit seems to still be the heuristic used for filter performance optimisation today. - -\begin{table} - \centering - \begin{tabular}{l r } - % \toprule - Year & Number of distinct filters \\ - \hline - 2009 & 220 \\ - 2010 & 163 \\ - 2011 & 161 \\ - 2012 & 170 \\ - 2013 & 178 \\ - 2014 & 154 \\ - 2015 & 200 \\ - 2016 & 204 \\ - 2017 & 231 \\ - 2018 & 254 \\ - % \bottomrule - \end{tabular} - \caption{Count of distinct filters triggered each year}~\label{tab:active-filters-count} -\end{table} -\end{comment} - -\subsection{Filter hits per month} - -We can backtrack the number of filter hits over the years on figure~\ref{fig:filter-hits}. +The number of filter hits per month over the years can be traced in figure~\ref{fig:filter-hits}. There is a dip in the number of hits in late 2014 and quite a surge in the beginnings of 2016, after which the overall number of filter hits stayed higher. There is also a certain periodicity to the graph, with smaller dips in the northern hemisphere's summer months (June, July, August) and smaller peaks in autumn/winter (mostly October/November). -This tendency cannot be really observed for the overall number of edits though (see figure~\ref{fig:edits-development}). -It seems that above all editors tripping filters are on vacation. +This tendency is not observed for the overall number of edits (see figure~\ref{fig:edits-development}). +Apparently, it is above all the editors tripping filters who are on vacation in June, July and August.
+ +Further, it is interesting to break down filter activity according to the types determined via the manual tagging (see section~\ref{sec:manual-classification}). +The corresponding distribution is shown in figure~\ref{fig:filter-hits-manual-tags}. +On the one hand, it demonstrates above all a surge in the hits of filters targeting vandalism in 2016. +On the other hand, another, somewhat subtler trend emerges: +In the first years following the introduction of the mechanism, good faith filters were matched most frequently. +This changed around the end of 2012, and since then most hits have been produced by vandalism filters. -\begin{landscape} \begin{figure} \centering - \includegraphics[width=0.9\columnwidth]{pics/filter-hits-zoomed.png} + \includegraphics[width=1\columnwidth]{pics/filter-hits-zoomed.png} \caption{EN Wikipedia edit filters: Hits per month}~\label{fig:filter-hits} \end{figure} -\end{landscape} + +\begin{figure} +\centering + \includegraphics[width=1\columnwidth]{pics/filter-hits-manual-tags.png} + \caption{EN Wikipedia edit filters: Hits per month according to manual tags}~\label{fig:filter-hits-manual-tags} +\end{figure} + +\begin{figure} +\centering + \includegraphics[width=0.9\columnwidth]{pics/reverts.png} + \caption{EN Wikipedia: Reverts for July 2001–April 2017}~\label{fig:reverts} +\end{figure} Regarding the hits surge and subsequent higher hit numbers, three possible explanations come to mind: \begin{enumerate} @@ -328,34 +309,29 @@ Regarding the hits surge and subsequent higher hit numbers, three possible expla \end{enumerate} I've undertaken following steps in an attempt to verify or refute each of these speculations:\\ - +\\ \textbf{The filter hits mirror the overall edits pattern from this time} \\ I've compared the filter hits pattern with the overall number of edits of the time (May 2015–May 2016). No correspondance could be determined (see figure~\ref{fig:edits-development}).
\\ \\ \textbf{There was a general rise in vandalism in this period}\\ -In order to verify this assumption, it would be great to compare the filters hits patterns with revert patterns of other quality control mechanisms. +This assumption is supported by the peak in the hits of vandalism-related filters at the end of 2015 and the beginning of 2016 observed in figure~\ref{fig:filter-hits-manual-tags}. +In order to verify it, comparing the filters' hit patterns with the revert patterns of other quality control mechanisms seems logical. Unfortunately, computing these numbers is time-consuming and not completely trivial. One needs a dump of English Wikipedia's edit history data for the period in question; -then one has to determine the reverts in this data set; +then one has to determine the reverts in this data set (e.g. by using the \emph{mwreverts} Python library); and then, more specifically, one needs to extract reverts done by quality control actors. Last step is crucial, since not every revert signifies a malicious edit is being reverted. -This point is aptly illustrated by~\cite{GeiHal2017} who have demonstrated that reverts can mean productive collaborative work between different actors(syn!). +This point is aptly illustrated by~\cite{GeiHal2017}, who have demonstrated that reverts can mean productive collaborative work between different agents. The dumps are large and it takes time and computing power to obtain them and extract reverts. According to Geiger and Halfaker who have done this for their replication study~\cite{GeiHal2017}, the April 2017 database dump offered by the Wikimedia Foundation was 93GB compressed and it took a week to extract reverts out of it on a 16 core Xeon workstation. They also list the challenges they faced in determining bot accounts and their reverts. -If one is to verify the current assumption (syn) properly, following steps are necessary: -\begin{enumerate} - \item a fresh dump should be obtained - \item reverts should be extracted from it (e.g.
by using the \emph{mwreverts} python library, used also by Geiger and Halfaker - \item reverts should be narrowed down to accounts known for doing quality-control work (for example by pre-compiling a list of anti-vandal bots); reverts (or respectively edits in general) done via Huggle and Twinkle are somewhat easy to identify since both tools leave a small code in the edit summary of their edits ("HG" for Huggle and "TW" for Twinkle)%TODO verify that's still the case -\end{enumerate} - Since time was scarce, I have run a first check of this assumption using the 2017 reverts dataset compiled by Geiger and Halfaker's for their study \footnote{Both researchers have placed a great value on reproducibility and have published their complete datasets, as well as scripts they used for their analyses for others to use and verify: \url{https://github.com/halfak/are-the-bots-really-fighting}.}. +The dataset is old, but still sufficient for scrutinising events at the beginning of 2016. Figure~\ref{fig:reverts} shows the total number of reverts, as well as reverts done by bots over time computed by Geiger and Halfaker. The filter hits pattern of 2015–2016 with the peak in filter hits and subsequent higher number of overall hits is not mirrored by the revert numbers \footnote{Just for completenes, the spike in March 2013 is the batch action by AddBot removing interwiki links, since these were handled by Wikidata discussed in the introduction of Geiger and Halfaker's paper. It didn't have anything to do with vandalism.} @@ -365,12 +341,6 @@ or that there wasn't a general surge in vandalism around this time. (Or that only vandalism caught by filters peaked, which sounds somewhat improbable.) 
\\ \\ -\begin{figure} -\centering - \includegraphics[width=0.9\columnwidth]{pics/reverts.png} - \caption{EN Wikipedia: Reverts for July 2001–April 2017}~\label{fig:reverts} -\end{figure} - \textbf{There was a change in the edit filter software that allowed more filters to be activated, or a bug that caused false positives}\\ Since so far neither of the other hypothesis could be verified, this explanation sounds likely. Another piece of data that seems to support it is the breakdown of the filter hits according to triggered filter action. @@ -384,33 +354,41 @@ Moreover, no mention of the hits surge was found in the noticeboard~\cite{Wikipe The in section~\ref{sec:filter-activity} mentioned condition limit has not changed either, as far as I can tell from the issue tracker, the commits and discussion archives, so the possible explanation that simply more filters have been at work since 2016 seems to be refuted as well. The only somewhat interesting pattern that seems to shed some light on the matter is the breakdown of hits according to the editor's action which triggered them: -There is an obvious surge in the attempted account creations in this period (see figure~\ref{fig:filter-hits-editors-actions}). +There is an obvious surge in the attempted account creations in the period November 2015–May 2016 (see figure~\ref{fig:filter-hits-editors-actions}). As a matter of fact, this could also be the explanation for the peak of log only hits—the most frequently tripped filter for the period January–March 2016 is filter 527 ``T34234: log/throttle possible sleeper account creations''. -It is a throttle (only) filter, so everytime an edit matches its pattern, a ``log only'' entry is created in the abuse log. +It is a throttle filter, with no further actions enabled, so every time an edit matches its pattern, a ``log only'' entry is created in the abuse log. %it disallows every X attempt, only logging the rest of the account creations.
%I think in its current form, it does not actually disallow anything, a ``disallow'' action should be enabled for this and the filter action is only 'throttle'; so in this form, it seems to simply log account creations -And the 3rd most active filter is a ``log only'' filter as well: 650 ``Creation of a new article without any categories'' (it was neither introduced at the time, nor was there any major change in the filter pattern). +And the 3rd most active filter is a ``log only'' filter as well: 650 ``Creation of a new article without any categories''. +(It was neither introduced at the time, nor was there any major change in the filter pattern.) Together, filters 527 and 650 are responsible for over 60\% of the ``log only'' hits in every of the months January, February and March 2016. Another idea that seemed worth persuing was to look into the editors who tripped filters and their corresponding edits. -For the period January-March 2016 there are some very active IP editors, the top of whom (with over $1.000$ hits) seemed to be engaging exclusively in the (probably automated) posting of spam links. -Their edits however constitute some 1-3\% of all hits from the period, so the explanation ``it was viagra spam coming from Russian IPs'' is somewhat unsatisfactory. +For the period January–March 2016 there are some very active IP editors, the top of whom (with over $1.000$ hits) seemed to be engaging exclusively in the (probably automated) posting of spam links. +Their edits, however, constitute only some 1-3\% of all hits from the period, which is insufficient to explain the peak +\footnote{Upon closer examination, these edits all seemed to contain spam links about erectile dysfunction medication and their IP records pertained to a Russian registry. +It is however possible that the offending editors were using a VPN or another proxy technology. +Speculation about the intent of the edits remains outside the scope of the present work.}.
+\begin{comment} +so the explanation ``it was viagra spam coming from Russian IPs'' is somewhat unsatisfactory. (Yes, it was viagra spam, and yes, a ``whois'' lookup proved them to really be Russian IPs. And, yes, whoever was editing could've also used a VPN, so I'm not opening a Russian bot fake news conspiracy theory just yet.) -A closer/more systematic scrutiny (syn!) of the editors causing the hits may be insightful though. +\end{comment} +A more systematic scrutiny of the editors causing the hits was not possible due to time constraints, but may contribute more insights. Right now, all the data analysed on the matter stems from the \emph{abuse\_filter\_log} table and the checks of the content of the edits were done manually on a sample basis via the web frontend of the AbuseLog~\cite{Wikipedia:AbuseLog} where one can click on the diff of the edit for edits that matched public filters. -No simple automated check of what the offending editors were contributing was possible since the \emph{abuse\_filter\_log} table does not store the text of the edit which matches a filter's pattern directly, but rather contains a reference to the \emph{text} table where the wikitext of all individual page revisions is stored~\cite{Wikipedia:TextTable}. +No simple automated check of what the offending editors were trying to publish was possible since the \emph{abuse\_filter\_log} table does not store the text of the edit which matches a filter's pattern directly, but rather contains a reference to the \emph{text} table where the wikitext of all individual page revisions is stored~\cite{Wikipedia:TextTable}. One needs to join the hit data from \emph{abuse\_filter\_log} with the \emph{text} table to obtain the content of the edits. +\begin{comment} Last but not least, I took a step back and contemplated the significant geo/socio-political events from the time, which triggered a lot of media (and Internet) attention and desinformation campaigns. 
Following things came to mind: 2016 US elections, the Brexit referendum and the so-called ``refugee crisis'' in Europe. There was also a severe organisational crisis in Wikimedia at the time during which a lot of staff left and eventually the executive director stepped down. - However, I couldn't draw a direct relationship between any of these political events and the edits caught by edit filters. -An investigation into the pages on which the filters were triggered proved them (the pages) to be quite innocuous: -the page where most filter hits were logged in January 2016 (beside the login page, on which all account creations are logged) was ``Skateboard'' and the $660$ filter hits here seem like a drop in the ocean compared to the $372.907$ hits for the whole month. +\end{comment} + +Last but not least, an investigation into the pages on which the filters were triggered proved them (the pages) to be quite innocuous: +The page where most filter hits were logged in January 2016 (beside the login page, on which all account creations are logged) was ``Skateboard'' and the $660$ filter hits here are rather insignificant compared to the $372.907$ hits for the whole month. And the page in March (apart from the user login page) on which most filter hits took place was the user page for user 209.236.119.231 who was also the editor with second most hits and who was apparently trying to post spam links on his own user page (after posting twice to ``Skateboard''). -In general, the pages on which filters match seem more like a randomly selected platform (syn) on which the disrupting editors unload their spam. -%Should I even mention this at all? +In general, the pages on which filters match seem more like a randomly selected platform on which the disrupting editors unload their spam. 
\begin{figure} \centering @@ -424,9 +402,47 @@ In general, the pages on which filters match seem more like a randomly selected \caption{EN Wikipedia edit filters: Hits per month according to triggering editor's action}~\label{fig:filter-hits-editors-actions} \end{figure} -\subsection{Most active filters of all times} -\label{sec:most-active-all-times} +\section{Conclusions} +This chapter explored the edit filters on the English Wikipedia in an attempt to determine what types of tasks these filters take over, +and how these tasks have evolved over time. + +%filters match distinctly more frequently than initially anticipated! <-- is this mentioned anywhere already + +Different characteristics of the edit filters, as well as their activity through the years, were scrutinised. +Three main types of filter tasks were identified: preventing/tracking vandalism, guiding good faith but nonetheless disruptive edits towards a more constructive contribution, and various maintenance jobs such as tracking bugs or other conspicuous behaviour. +Filters aimed at particularly malicious users or behaviours are as a general rule hidden, whereas filters targeting general patterns are viewable by anyone interested. +We've determined that hidden filters seem to fluctuate more, which makes sense given their main area of application. +Public filters often target silly vandalism or test type edits, as well as spam. +The latter, above all when implemented in an automated fashion, together with disallowing edits by very determined vandals handled by hidden filters are in accord with the initial aim with which the filters were introduced (compare section~\ref{section:4-history}). +Interestingly, the mechanism also ended up being quite active in preventing silly (e.g. inserting series of repeating characters) or profanity vandalism, which the community initially didn't think of as part of the filters' assignment (see section~\ref{sec:most-active-all-times}).
+The third area in which filters are quite active comprises various types of blankings (mostly by new users), where the filters issue warnings pointing towards possible alternatives the editor may want to achieve or the proper procedure for deleting articles for instance. + +The number of active filters stayed somewhat stable over time, which is most probably to be attributed to the condition limit (see section~\ref{sec:filter-activity}). +However, this doesn't seem to be further disturbing the operation of the mechanism as a whole, +and %TODO better word +the edit filter managers use it as a performance heuristic to optimise conditions on individual filters, or routinely clean up (and disable) stale filters. + +Regarding the temporal filter activity trends, it was ascertained that a sudden peak in filter activity took place at the end of 2015 and the beginning of 2016, after which the overall filter hit numbers stayed higher than they used to be before this occurrence. +Although there were some pointers towards what happened there: +a surge in account creation attempts and possibly a big spam wave (the latter has to be verified in a systematic fashion), +no really satisfying explanation of the phenomenon could be established. +This remains one of the possible directions for future studies. + +%Historical trends +%TODO moved from section, revise so that it points to future work +The present section explores qualitatively/highlights patterns in the creation and usage of edit filters. +Unfortunately, no extensive quantitative analysis of these patterns was possible, since this requires access to the \emph{abuse\_filter\_history} table of the AbuseFilter plugin (compare section~\ref{sec:mediawiki-ext}). +Unlike the other tables of the extension, the \emph{abuse\_filter\_history} table is currently not replicated and no public dump is accessible via Wikimedia's cloud service Toolforge~\cite{Wikimedia:ToolforgeDatabases}.
+The table seems to have been replicated in the past; however, due to security concerns the dumps were discontinued. +A short-term solution to renew the public replicas was attempted but unfortunately has not been successful yet. +That is why the present chapter only shows some tendencies observed via manual browsing of different filters' history via the exposed API endpoint which allows querying the \emph{abuse\_filter\_history} table for public filters~\cite{Wikipedia:AbuseFilterHistory}. +The discussions surrounding this issue and its progress can be viewed in the following ticket on Wikimedia's issue tracker:~\cite{phabricator}. +Hence, exploring historical patterns in detail remains one of the directions for future studies. + +%TODO VERY IMPORTANT: come back to the verification whether the filters have achieved their proclaimed end + +%TODO Sort moved from most active filters of all times The ten most active filters of all times (with number of hits, public description, enabled filter actions, and the manual tag and parent category assigned during the coding described in section~\ref{sec:manual-classification}) are displayed in table~\ref{tab:most-active-actions}. For a more detailed reference, the ten most active filters of each year are listed in the appendix. %TODO are there some historical trends we can read out of it? @@ -480,47 +496,7 @@ As a matter of fact, a quick glance at the AbuseLog~\cite{Wikipedia:AbuseLog} co \item is it new filters that get triggered most frequently? or are there also very active old ones? -- we have the most active filters per year, where we can observe this. It's a mixture of older and newer filter IDs (they get an incremental ID, so it is somewhat obvious what's older and what's newer); is there a tendency to split and refine older filters? \end{comment} -% Most active filters per year -%TODO compare with table and with most active filters per year: is it old or new filters that get triggered most often?
(I'd say it's a mixture of both and we can now actually answer this question with the history API, it shows us when a filter was first created) - - -\section{Conclusions} - -This chapter explored the edit filters on the Englisch Wikipedia in an attempt to determine what types of tasks these filters take over, -and how these tasks have evolved over time. - -Different characteristics of the edit filters, as well as their activity through the years were scrutinised. -Three main types of filter tasks were identified: preventing/tracking vandalism, guiding good faith but nonetheless disruptive edits towards a more constructive contribution, and various maintenance jobs such as tracking bugs or other conspicuous behaviour. -Filters aimed at particularly malicious users or behaviours are as a general rule hidden, whereas filters targeting general patterns are viewable by anyone interested. -We've determined that hidden filters seem to fluctuate more, which makes sense given their main area of application. -Public filters often target (syn) silly vandalism or test type edits, as well as spam. -The latter, above all when implemented (syn) in an automated fashion, together with disallowing edits by very determined vandals handled by hidden filters are in accord with the initial aim with which the filters were introduced (compare section~\ref{section:4-history}). -Interestingly, the mechanism also ended up being quite active in preventing silly (e.g. inserting series of repeating characters) or profanity vandalism, which the community initially didn't think of as part of the filters' assignment (see section~\ref{sec:most-active-all-times}). -The third area in which filters are quite active are various types of blankings (mostly by new users) where the filters issue warnings pointing towards possible alternatives the editor may want to achieve or the proper procedure for deleting articles for instance. 
- -The number of active filters stayed somewhat stable over time which is most probably to be attributed to the condition limit (see section~\ref{sec:filter-activity}). -However, this doesn't seem to be further disturbing the operation of the mechanism as a whole, -and %TODO better word -the edit filter managers use it as a performance heuristic to optimise conditions on individual filter, or routinely clean up (and disable) stale filters. - -Regarding the temporal filter activity trends, it was ascertained that a sudden peak in filter activity (syn) took place in the end of 2015–beginnings of 2016, after which the overall filter hit numbers stayed higher than they used to be before this occurence. -Although there were some pointers towards what happened there: -a surge in account creation attempts and possibly a big spam wave (the latter has to be verified in a systematic fashion), -no really satisfying explanation of the phenomenon could be established. -This remains one of the possible direction for future studies. - -%Historical trends -%TODO moved from section, revise so that it points to future work -The present section explores qualitatively/highlights patterns in the creation and usage of edit filters. -Unfortunately, no extensive quantitative analysis of these patterns was possible, since for it, an access to the \emph{abuse\_filter\_history} table of the AbuseFilter plugin (compare section~\ref{sec:mediawiki-ext}) is needed. -Unlike the other tables of the extension, the \emph{abuse\_filter\_history} table is currently not replicated and no public dump is accessible via Wikimedia's cloud service Toolforge~\cite{Wikimedia:ToolforgeDatabases}. -This seems to have been the case in the past, however, due to security concerns the dumps were discontinued. -A short term solution to renew the public replicas was attempted but unfortunately haven't been successful yet. 
-That is why the present chapter only shows some tendencies observed via manual browsing of different filters' history via the exposed API endpoint which allows querying the \emph{abuse\_filter\_history} table for public filters~\cite{Wikipedia:AbuseFilterHistory}. -The discussions surrounding this issue and its progress can be viewed in the following ticket on Wikimedia's issue tracker:~\cite{phabricator}. -Hence, exploring historical patterns in detail remains one of the directions for future studies. - -%TODO VERY IMPORTANT: come back to the verification whether the filters have achieved their proclaimed end +%************************************************************************** %TODO is it really important to have this here? diff --git a/thesis/appendix.tex b/thesis/appendix.tex index e8df603..78b73ea 100644 --- a/thesis/appendix.tex +++ b/thesis/appendix.tex @@ -364,215 +364,3 @@ abuse_filter_action \caption{abuse\_filter\_action schema}~\label{fig:app-db-schemas-afa} \end{figure*} -%TODO add column "manual tags" (see jupyter NB) -\begin{table} - \centering - \begin{tabular}{r p{9cm} r } - % \toprule - Filter ID & Publicly available description & Hitcount \\ % is the hitcount for the year or altogether till now?-- for the year, of course - \hline - 135 & repeating characters & 175455 \\ - 30 & "large deletion from article by new editors" & 160302 \\ - 61 & "new user removing references" & 147377 \\ - 18 & Test type edits from clicking on edit bar & 133640 \\ - 3 & "new user blanking articles" & 95916 \\ - 172 & "section blanking" & 89710 \\ - 50 & "shouting" (contribution consists of all caps, numbers and punctuation) & 88827 \\ - 98 & "creating very short new article" & 80434 \\ - 65 & "excessive whitespace" & 74098 \\ - 132 & "removal of all categories" & 68607 \\ - % \bottomrule - \end{tabular} - \caption{10 most active filters in 2009}~\label{tab:app-most-active-2009} -\end{table} - -\begin{table} - \centering - \begin{tabular}{r p{9cm} r } - % 
\toprule - Filter ID & Publicly available description & Hitcount \\ - \hline - 61 & "new user removing references" & 245179 \\ - 135 & repeating characters & 242018 \\ - 172 & "section blanking" & 148053 \\ - 30 & "large deletion from article by new editors" & 119226 \\ - 225 & Vandalism in all caps & 109912 \\ - 3 & "new user blanking articles" & 105376 \\ - 50 & "shouting" & 101542 \\ - 132 & "removal of all categories" & 78633 \\ - 189 & BLP vandalism or libel & 74528 \\ - 98 & "creating very short new article" & 54805 \\ - % \bottomrule - \end{tabular} - \caption{10 most active filters in 2010}~\label{tab:app-most-active-2010} -\end{table} - -\begin{table} - \centering - \begin{tabular}{r p{9cm} r } - % \toprule - Filter ID & Publicly available description & Hitcount \\ - \hline - 61 & "new user removing references"& 218493 \\ - 135 & repeating characters & 185304 \\ - 172 & "section blanking" & 119532 \\ - 402 & New article without references & 109347 \\ - 30 & Large deletion from article by new editors & 89151 \\ - 3 & "new user blanking articles" & 75761 \\ - 384 & Addition of bad words or other vandalism & 71911 \\ - 225 & Vandalism in all caps & 68318 \\ - 50 & "shouting" & 67425 \\ - 432 & Starting new line with lowercase letters & 66480 \\ - % \bottomrule - \end{tabular} - \caption{10 most active filters in 2011}~\label{tab:app-most-active-2011} -\end{table} - -\begin{table} - \centering - \begin{tabular}{r p{9cm} r } - % \toprule - Filter ID & Publicly available description & Hitcount \\ - \hline - 135 & repeating characters & 173830 \\ - 384 & Addition of bad words or other vandalism & 144202 \\ - 432 & Starting new line with lowercase letters & 126156 \\ - 172 & "section blanking" & 105082 \\ - 30 & Large deletion from article by new editors & 93718 \\ - 3 & "new user blanking articles" & 90724 \\ - 380 & Multiple obscenities & 67814 \\ - 351 & Text added after categories and interwiki & 59226 \\ - 279 & Repeated attempts to vandalize & 58853 \\ - 225 
& Vandalism in all caps & 58352 \\ - % \bottomrule - \end{tabular} - \caption{10 most active filters in 2012}~\label{tab:app-most-active-2012} -\end{table} - -\begin{table} - \centering - \begin{tabular}{r p{9cm} r } - % \toprule - Filter ID & Publicly available description & Hitcount \\ - \hline - 135 & repeating characters & 133309 \\ - 384 & Addition of bad words or other vandalism & 129807 \\ - 432 & Starting new line with lowercase letters & 94017 \\ - 172 & "section blanking" & 92871 \\ - 30 & Large deletion from article by new editors & 85722 \\ - 279 & Repeated attempts to vandalize & 76738 \\ - 3 & "new user blanking articles" & 70067 \\ - 380 & Multiple obscenities & 58668 \\ - 491 & Edits ending with emoticons or ! & 55454 \\ - 225 & Vandalism in all caps & 48390 \\ - % \bottomrule - \end{tabular} - \caption{10 most active filters in 2013}~\label{tab:app-most-active-2013} -\end{table} - -\begin{table} - \centering - \begin{tabular}{r p{9cm} r } - % \toprule - Filter ID & Publicly available description & Hitcount \\ - \hline - 384 & Addition of bad words or other vandalism & 111570 \\ - 135 & repeating characters & 111173 \\ - 279 & Repeated attempts to vandalize & 97204 \\ - 172 & "section blanking" & 82042 \\ - 432 & Starting new line with lowercase letters & 75839 \\ - 30 & Large deletion from article by new editors & 62495 \\ - 3 & "new user blanking articles" & 60656 \\ - 636 & Unexplained removal of sourced content & 52639 \\ - 231 & Long string of characters containing no spaces & 39693 \\ - 380 & Multiple obscenities & 39624 \\ - % \bottomrule - \end{tabular} - \caption{10 most active filters in 2014}~\label{tab:app-most-active-2014} -\end{table} - -\begin{table} - \centering - \begin{tabular}{r p{9cm} r } - % \toprule - Filter ID & Publicly available description & Hitcount \\ - \hline - 650 & Creation of a new article without any categories & 226460 \\ - 61 & New user removing references & 196986 \\ - 636 & Unexplained removal of sourced content 
& 191320 \\ - 527 & T34234: log/throttle possible sleeper account creations & 189911 \\ - 633 & Possible canned edit summary & 162319 \\ - 384 & Addition of bad words or other vandalism & 141534 \\ - 279 & Repeated attempts to vandalize & 110137 \\ - 135 & repeating characters & 99057 \\ - 686 & IP adding possibly unreferenced material to BLP & 95356 \\ - 172 & "section blanking" & 82874 \\ - % \bottomrule - \end{tabular} - \caption{10 most active filters in 2015}~\label{tab:app-most-active-2015} -\end{table} - -\begin{table} - \centering - \begin{tabular}{r p{9cm} r } - % \toprule - % \toprule - Filter ID & Publicly available description & Hitcount \\ - \hline - 527 & T34234: log/throttle possible sleeper account creations & 437099 \\ - 61 & New user removing references & 274945 \\ - 650 & Creation of a new article without any categories & 229083 \\ - 633 & Possible canned edit summary & 218696 \\ - 636 & Unexplained removal of sourced content & 179948 \\ - 384 & Addition of bad words or other vandalism & 179871 \\ - 279 & Repeated attempts to vandalize & 106699 \\ - 135 & repeating characters & 95131 \\ - 172 & "section blanking" & 79843 \\ - 30 & Large deletion from article by new editors & 68968 \\ - % \bottomrule - \end{tabular} - \caption{10 most active filters in 2016}~\label{tab:app-most-active-2016} -\end{table} - -\begin{table} - \centering - \begin{tabular}{r p{9cm} r } - % \toprule - Filter ID & Publicly available description & Hitcount \\ - \hline - 61 & New user removing references & 250394 \\ - 633 & Possible canned edit summary & 218146 \\ - 384 & Addition of bad words or other vandalism & 200748 \\ - 527 & T34234: log/throttle possible sleeper account creations & 192441 \\ - 636 & Unexplained removal of sourced content & 156409 \\ - 650 & Creation of a new article without any categories & 151604 \\ - 135 & repeating characters & 80056 \\ - 172 & "section blanking" & 70837 \\ - 712 & Possibly changing date of birth in infobox & 59537 \\ - 833 & 
Newer user possibly adding unreferenced or improperly referenced material & 58133 \\ - % \bottomrule - \end{tabular} - \caption{10 most active filters in 2017}~\label{tab:app-most-active-2017} -\end{table} - -\begin{table} - \centering - \begin{tabular}{r p{9cm} r } - % \toprule - Filter ID & Publicly available description & Hitcount \\ - \hline - 527 & T34234: log/throttle possible sleeper account creations & 358210 \\ - 61 & New user removing references & 234867 \\ - 633 & Possible canned edit summary & 201400 \\ - 384 & Addition of bad words or other vandalism & 177543 \\ - 833 & Newer user possibly adding unreferenced or improperly referenced material & 161030 \\ - 636 & Unexplained removal of sourced content & 144674 \\ - 650 & Creation of a new article without any categories & 79381 \\ - 135 & repeating characters & 75348 \\ - 686 & IP adding possibly unreferenced material to BLP & 70550 \\ - 172 & "section blanking" & 64266 \\ - % \bottomrule - \end{tabular} - \caption{10 most active filters in 2018}~\label{tab:app-most-active-2018} -\end{table} - diff --git a/thesis/misc.tex b/thesis/misc.tex index 84818d1..267a2ed 100644 --- a/thesis/misc.tex +++ b/thesis/misc.tex @@ -133,4 +133,261 @@ There seems to be a tendency that all actions but logging (which cannot be switc ** "in addition to filter 148, let's see what we get - Cen" (https://en.wikipedia.org/wiki/Special:AbuseFilter/188) // this illustrates the point that edit filter managers do introduce stuff they feel like introducing just to see if it catches something +%*************************************************** + +\subsection{Distinct filters over the years} +Thanks to quarry~\footnote{\url{https://quarry.wmflabs.org/}}, we have the numbers of all distinct filters triggered per year +from 2009 (when filters were first introduced/the MediaWiki extension was enabled) until the end of 2018: see table~\ref{tab:active-filters-count}. +This figure varies between $154$ in year 2014 and $254$ in 2018. 
+The explanation for this fairly narrow range of active filters probably lies in the so-called condition limit. +According to the edit filters' documentation~\cite{Wikipedia:EditFilterDocumentation}, the condition limit is a hard-coded threshold of total available conditions that can be evaluated by all active filters per incoming edit. +Currently, it is set to $1{,}000$. +The motivation for this heuristic is to avoid performance issues, since every incoming edit is checked against all currently enabled filters, which means that the more filters are active, the longer the checks take. +However, the page also warns that counting conditions is not the ideal metric of filter performance, since there are simple comparisons that take significantly less time than, for example, a check against the \emph{all\_links} variable (which needs to query the database)~\cite{Wikipedia:EditFilterDocumentation}. +Nevertheless, the condition limit still seems to be the heuristic used for filter performance optimisation today. + +\begin{table} + \centering + \begin{tabular}{l r } + % \toprule + Year & Number of distinct filters \\ + \hline + 2009 & 220 \\ + 2010 & 163 \\ + 2011 & 161 \\ + 2012 & 170 \\ + 2013 & 178 \\ + 2014 & 154 \\ + 2015 & 200 \\ + 2016 & 204 \\ + 2017 & 231 \\ + 2018 & 254 \\ + % \bottomrule + \end{tabular} + \caption{Count of distinct filters triggered each year}~\label{tab:active-filters-count} +\end{table} + + + +If one is to verify the current assumption (syn) properly, the following steps are necessary: +\begin{enumerate} + \item a fresh dump should be obtained + \item reverts should be extracted from it (e.g.
by using the \emph{mwreverts} Python library, also used by Geiger and Halfaker) + \item reverts should be narrowed down to accounts known for doing quality-control work (for example by pre-compiling a list of anti-vandal bots); reverts (or edits in general) done via Huggle and Twinkle are fairly easy to identify since both tools leave a short code in the edit summary of their edits ("HG" for Huggle and "TW" for Twinkle)%TODO verify that's still the case +\end{enumerate} + +%********************************************************* + +\section{Most active filters per year} +%TODO add column "manual tags" (see jupyter NB) +\begin{table} + \centering + \begin{tabular}{r p{9cm} r } + % \toprule + Filter ID & Publicly available description & Hitcount \\ % is the hitcount for the year or altogether till now?-- for the year, of course + \hline + 135 & repeating characters & 175455 \\ + 30 & "large deletion from article by new editors" & 160302 \\ + 61 & "new user removing references" & 147377 \\ + 18 & Test type edits from clicking on edit bar & 133640 \\ + 3 & "new user blanking articles" & 95916 \\ + 172 & "section blanking" & 89710 \\ + 50 & "shouting" (contribution consists of all caps, numbers and punctuation) & 88827 \\ + 98 & "creating very short new article" & 80434 \\ + 65 & "excessive whitespace" & 74098 \\ + 132 & "removal of all categories" & 68607 \\ + % \bottomrule + \end{tabular} + \caption{10 most active filters in 2009}~\label{tab:app-most-active-2009} +\end{table} + +\begin{table} + \centering + \begin{tabular}{r p{9cm} r } + % \toprule + Filter ID & Publicly available description & Hitcount \\ + \hline + 61 & "new user removing references" & 245179 \\ + 135 & repeating characters & 242018 \\ + 172 & "section blanking" & 148053 \\ + 30 & "large deletion from article by new editors" & 119226 \\ + 225 & Vandalism in all caps & 109912 \\ + 3 & "new user blanking articles" & 105376 \\ + 50 & "shouting" & 101542 \\ + 132 & "removal of all
categories" & 78633 \\ + 189 & BLP vandalism or libel & 74528 \\ + 98 & "creating very short new article" & 54805 \\ + % \bottomrule + \end{tabular} + \caption{10 most active filters in 2010}~\label{tab:app-most-active-2010} +\end{table} + +\begin{table} + \centering + \begin{tabular}{r p{9cm} r } + % \toprule + Filter ID & Publicly available description & Hitcount \\ + \hline + 61 & "new user removing references"& 218493 \\ + 135 & repeating characters & 185304 \\ + 172 & "section blanking" & 119532 \\ + 402 & New article without references & 109347 \\ + 30 & Large deletion from article by new editors & 89151 \\ + 3 & "new user blanking articles" & 75761 \\ + 384 & Addition of bad words or other vandalism & 71911 \\ + 225 & Vandalism in all caps & 68318 \\ + 50 & "shouting" & 67425 \\ + 432 & Starting new line with lowercase letters & 66480 \\ + % \bottomrule + \end{tabular} + \caption{10 most active filters in 2011}~\label{tab:app-most-active-2011} +\end{table} + +\begin{table} + \centering + \begin{tabular}{r p{9cm} r } + % \toprule + Filter ID & Publicly available description & Hitcount \\ + \hline + 135 & repeating characters & 173830 \\ + 384 & Addition of bad words or other vandalism & 144202 \\ + 432 & Starting new line with lowercase letters & 126156 \\ + 172 & "section blanking" & 105082 \\ + 30 & Large deletion from article by new editors & 93718 \\ + 3 & "new user blanking articles" & 90724 \\ + 380 & Multiple obscenities & 67814 \\ + 351 & Text added after categories and interwiki & 59226 \\ + 279 & Repeated attempts to vandalize & 58853 \\ + 225 & Vandalism in all caps & 58352 \\ + % \bottomrule + \end{tabular} + \caption{10 most active filters in 2012}~\label{tab:app-most-active-2012} +\end{table} + +\begin{table} + \centering + \begin{tabular}{r p{9cm} r } + % \toprule + Filter ID & Publicly available description & Hitcount \\ + \hline + 135 & repeating characters & 133309 \\ + 384 & Addition of bad words or other vandalism & 129807 \\ + 432 & 
Starting new line with lowercase letters & 94017 \\ + 172 & "section blanking" & 92871 \\ + 30 & Large deletion from article by new editors & 85722 \\ + 279 & Repeated attempts to vandalize & 76738 \\ + 3 & "new user blanking articles" & 70067 \\ + 380 & Multiple obscenities & 58668 \\ + 491 & Edits ending with emoticons or ! & 55454 \\ + 225 & Vandalism in all caps & 48390 \\ + % \bottomrule + \end{tabular} + \caption{10 most active filters in 2013}~\label{tab:app-most-active-2013} +\end{table} + +\begin{table} + \centering + \begin{tabular}{r p{9cm} r } + % \toprule + Filter ID & Publicly available description & Hitcount \\ + \hline + 384 & Addition of bad words or other vandalism & 111570 \\ + 135 & repeating characters & 111173 \\ + 279 & Repeated attempts to vandalize & 97204 \\ + 172 & "section blanking" & 82042 \\ + 432 & Starting new line with lowercase letters & 75839 \\ + 30 & Large deletion from article by new editors & 62495 \\ + 3 & "new user blanking articles" & 60656 \\ + 636 & Unexplained removal of sourced content & 52639 \\ + 231 & Long string of characters containing no spaces & 39693 \\ + 380 & Multiple obscenities & 39624 \\ + % \bottomrule + \end{tabular} + \caption{10 most active filters in 2014}~\label{tab:app-most-active-2014} +\end{table} + +\begin{table} + \centering + \begin{tabular}{r p{9cm} r } + % \toprule + Filter ID & Publicly available description & Hitcount \\ + \hline + 650 & Creation of a new article without any categories & 226460 \\ + 61 & New user removing references & 196986 \\ + 636 & Unexplained removal of sourced content & 191320 \\ + 527 & T34234: log/throttle possible sleeper account creations & 189911 \\ + 633 & Possible canned edit summary & 162319 \\ + 384 & Addition of bad words or other vandalism & 141534 \\ + 279 & Repeated attempts to vandalize & 110137 \\ + 135 & repeating characters & 99057 \\ + 686 & IP adding possibly unreferenced material to BLP & 95356 \\ + 172 & "section blanking" & 82874 \\ + % 
\bottomrule + \end{tabular} + \caption{10 most active filters in 2015}~\label{tab:app-most-active-2015} +\end{table} + +\begin{table} + \centering + \begin{tabular}{r p{9cm} r } + % \toprule + % \toprule + Filter ID & Publicly available description & Hitcount \\ + \hline + 527 & T34234: log/throttle possible sleeper account creations & 437099 \\ + 61 & New user removing references & 274945 \\ + 650 & Creation of a new article without any categories & 229083 \\ + 633 & Possible canned edit summary & 218696 \\ + 636 & Unexplained removal of sourced content & 179948 \\ + 384 & Addition of bad words or other vandalism & 179871 \\ + 279 & Repeated attempts to vandalize & 106699 \\ + 135 & repeating characters & 95131 \\ + 172 & "section blanking" & 79843 \\ + 30 & Large deletion from article by new editors & 68968 \\ + % \bottomrule + \end{tabular} + \caption{10 most active filters in 2016}~\label{tab:app-most-active-2016} +\end{table} + +\begin{table} + \centering + \begin{tabular}{r p{9cm} r } + % \toprule + Filter ID & Publicly available description & Hitcount \\ + \hline + 61 & New user removing references & 250394 \\ + 633 & Possible canned edit summary & 218146 \\ + 384 & Addition of bad words or other vandalism & 200748 \\ + 527 & T34234: log/throttle possible sleeper account creations & 192441 \\ + 636 & Unexplained removal of sourced content & 156409 \\ + 650 & Creation of a new article without any categories & 151604 \\ + 135 & repeating characters & 80056 \\ + 172 & "section blanking" & 70837 \\ + 712 & Possibly changing date of birth in infobox & 59537 \\ + 833 & Newer user possibly adding unreferenced or improperly referenced material & 58133 \\ + % \bottomrule + \end{tabular} + \caption{10 most active filters in 2017}~\label{tab:app-most-active-2017} +\end{table} + +\begin{table} + \centering + \begin{tabular}{r p{9cm} r } + % \toprule + Filter ID & Publicly available description & Hitcount \\ + \hline + 527 & T34234: log/throttle possible sleeper account 
creations & 358210 \\ + 61 & New user removing references & 234867 \\ + 633 & Possible canned edit summary & 201400 \\ + 384 & Addition of bad words or other vandalism & 177543 \\ + 833 & Newer user possibly adding unreferenced or improperly referenced material & 161030 \\ + 636 & Unexplained removal of sourced content & 144674 \\ + 650 & Creation of a new article without any categories & 79381 \\ + 135 & repeating characters & 75348 \\ + 686 & IP adding possibly unreferenced material to BLP & 70550 \\ + 172 & "section blanking" & 64266 \\ + % \bottomrule + \end{tabular} + \caption{10 most active filters in 2018}~\label{tab:app-most-active-2018} +\end{table} -- GitLab