\section{Vandalism: Original Research}
According to Wikipedia's newspaper, the Signpost, edit filters were initially introduced as a vandalism prevention mechanism~\cite{Signpost2009}.
The aim of this section is to provide a better understanding of vandalism on Wikipedia: what counts as vandalism and what does not, who engages in it, and who strives to prevent it and by what means.

\subsection{What is vandalism}

According to EN Wikipedia's policy~\cite{Wikipedia:Vandalism}, vandalism means ``intentionally making abusive edits to Wikipedia'' or, more specifically, ``editing (or other behavior) deliberately intended to obstruct or defeat the project's purpose, which is to create a free encyclopedia''.
Vandalism includes ``malicious removal of encyclopedic content, or the changing of such content beyond all recognition, without any regard to our core content policies of neutral point of view (which does not mean no point of view), verifiability and no original research''
as well as ``adding irrelevant obscenities or crude humor to a page, illegitimately blanking pages, and inserting obvious nonsense into a page''
and ``[a]busive creation or usage of user accounts and IP addresses''.

Wikipedians have elaborated the following typology of vandalism~\cite{Wikipedia:Vandalism}:
\begin{itemize}
    \item  Abuse of tags
    \item  Account creation, malicious
    \item  Avoidant vandalism
    \item  Blanking, illegitimate
    \item  Copyrighted material, repeated uploading of
    \item  Edit summary vandalism
    \item  Format vandalism
    \item  Gaming the system
    \item  Hidden vandalism
    \item  Hoaxing vandalism
    \item  Image vandalism
    \item  Link vandalism
    \item  Page creation, illegitimate
    \item  Page lengthening
    \item  Page-move vandalism
    \item  Silly vandalism
    \item  Sneaky vandalism
    \item  Spam external linking
    \item  Stockbroking vandalism
    \item  Talk page vandalism
    \item  Template vandalism
    \item  User and user talk page vandalism
    \item  Vandalbots
\end{itemize}

\subsection{What is not vandalism}

Additionally, there are other types of edits that the Wikipedia community views as disruptive.
Edit warring, pushing a single point of view, and disregarding community feedback are examples thereof. %TODO what are other examples?
Nevertheless, the guidelines warn that ``[d]isruptive editing is not vandalism, though vandalism is disruptive''~\cite{Wikipedia:DisruptiveEditing},
and that editors should adopt different procedures in each case.

The vandalism policy also cautions against using the ``vandalism'' label unless absolutely necessary since it tends to drive contributors away and prevent constructive discussions~\cite{Wikipedia:Vandalism}.
%TODO cf. good faith memo

\begin{comment}
\url{https://en.wikipedia.org/wiki/Wikipedia:Vandalism}
"Careful consideration may be required to differentiate between edits that are beneficial, edits that are detrimental but well-intentioned, and edits that are vandalism."
%TODO vgl with memo-good-faith

\url{https://en.wikipedia.org/wiki/Wikipedia:Disruptive_editing}

"Disruptive editing is not always intentional. Editors may be accidentally disruptive because they don't understand how to correctly edit, or because they lack the social skills or competence necessary to work collaboratively "
Okay what are disruptive edits that are not vandalism? (apart from edit wars)

"Engages in "disruptive cite-tagging"; adds unjustified {{citation needed}} tags to an article when the content tagged is already sourced, uses such tags to suggest that properly sourced article content is questionable."
\end{comment}

\subsection{Who engages in vandalism (and why?)}

The policy clearly signals that editors who repeatedly engage in vandalism are subject to banning.
Furthermore, it explains that although warnings for vandalism are generally issued, they are not a prerequisite for banning~\cite{Wikipedia:Vandalism}.
%TODO: still not explained who and why

\subsection{Who is striving to prevent vandalism? How do they go about it?}

Since Wikipedia is a ``do-it-yourself'' project, every editor who notices vandalism is called upon to help fix it.
There is a formal process for reporting users who persistently continue to engage in vandalism despite warnings~\cite{Wikipedia:AIV}, %TODO go into more detail?
as well as for requesting page protection for frequently vandalised pages~\cite{Wikipedia:PageProtection}.
There are also users who dedicate a substantial share of their Wikipedia contributions specifically to fighting vandalism.

These dedicated vandal fighters mostly work with the aid of (semi- or fully) automated tools, which not only significantly speed up the process (see below)
but, according to research, fundamentally change the nature of the encyclopedia and its collaboration ecosystem~\cite{GeiRib2010}.

%***************************************************

\section{Historical development}
\subsection{What filters were implemented immediately after the launch + manual tags}
%TODO What were the first filters to be implemented immediately after the launch of the extension?
The extension was launched on March 17th, 2009.
Filter 1 was implemented in the late hours of that day.
Filters with IDs 1--80 (IDs are auto-incremented) were implemented within the first five days after the extension was turned on (17--22 March 2009).
Apparently, the most urgent problems perceived by the initial edit filter managers were:
\begin{itemize}
    \item page-move vandalism (what filter 1 initially targeted; it was later converted to a general test filter);
    \item blanking of articles (filter 3);
    \item personal attacks (filters 9 and 11) and obscenities (filter 12);
    \item specific users or cases (hidden filters, e.g. 4 and 21) and sockpuppetry (filters 16 and 17).
\end{itemize}

\subsection{Filter Usage/Activity}
%TODO decide how this fits into the overall narrative; write some kind of conclusion from these observations; also decide whether this is the best representation or whether they should rather form a list

The following general filter operation practices were observed.
Some filters were switched on for a while, then deactivated and never activated again.
Some of them had been active only very briefly before they were switched off and deleted.
There are several reasons for this.
In some cases, the edit filter managers decided against the filter because edit filters were deemed an inappropriate tool for the issue at hand (e.g. filter 308 ``Malformed Mediation Cabal Requests'', 199 ``Unflagged Bots'', or 484 ``Shutdown of ClueBot by non-admin user'').
In others, the problem was solved by different means: filter 290 ``172 Filter'' (catching edits about a Canadian politician coming from a certain IP range) was disabled since the relevant pages were protected.
In yet others, there were hardly any hits, so there was no real problem to begin with (e.g. filter 304 ``Rayman vandalism'', 122 ``Changing Username malformed requests'', or 401 ``\,`Red hair' vandalism'').
This last group is possibly a result of edit filter managers implementing a filter ``just to see if it catches anything''.
It also happens that filter managers implement filters targeting the same phenomenon in parallel, without knowing of each other's work.
Such duplicates are eventually merged, or all but one of them are switched off: filter 893 was switched off in favour of 891.
Sometimes, vandalism trends are only temporary, and after a period of activity the filters become stale.
Stale filters are eventually switched off in order to free up conditions under the condition limit.
Examples are filters 81 ``Badcharts'', 20 ``Saying `The abuse filter will block this'\,'', and 663 ``Techno genre warrior''.
Other filters were switched off because they did not do what they were supposed to and only generated a large number of false positives, e.g. filter 14 ``Test to detect new pages by new users''.
Finally, there are filters testing a pattern that was eventually merged into another filter (e.g. filter 440 ``intextual.com markup'' was merged into filter 345 ``Extraneous formatting from browser extension'').

\begin{comment}
%TODO This is a duplicate of a paragraph in 4.5.1. Does it fit better here?
% this actually fits also in the patterns of new filters in chap.5; these are the filters introduced for couple of days/hours, then switched off to never be enabled again
Edit filter managers often introduce filters based on some phenomena they have observed caught by other filters, other algorithmic quality control mechanisms or general experience.
As all newly implemented filters, these are initially enabled in logging only mode until enough log entries are generated to evaluate whether the incident is severe and frequent enough to need a filter.
\end{comment}

Then, there are filters that were switched on for a while, deactivated for a while, and then activated again.
Sometimes this happens because a pattern of vandalism recurs, and sometimes in order to fix technical issues with the filter: 61, 98 (deactivated briefly since an editor found the ``warn'' action unfounded; re-enabled to tag), 148 (``20160213 - disabled - possible technical issue - see edit filter noticeboard - xaosflux'').

Another group consists of filters that have been enabled continuously since their introduction:
164, 642 (if we ignore the two-minute period during which it was disabled on 13.4.2018), 733 (2.11.2015--present), 29 (18.3.2009--present), 30 (18.3.2009--present), 33 (18.3.2009--present), 39 (18.3.2009--present), 50 (18.3.2009--present), 59 (19.3.2009--present), 80 (22.3.2009--present).
There are also filters that have always been enabled except for brief periods when they were deactivated and then activated again, probably in order to update their conditions: 79, 135. %TODO there were a couple of others in Shirik's list; go back and look
There seems to be a tendency to remove all actions but logging (which cannot be switched off) while edit filter managers are updating the pattern of a filter.

\subsection{How do filters emerge?}
Several patterns of filter emergence were observed:
\begin{itemize}
    \item An older filter is split: 79 was apparently split out of 61; 285 was split between ``380, 384, 614 and others''; 174 was split from 29.
    \item Possibly, several older filters are merged into one.
    \item The functionality of an older filter is taken and extended in a newer one (479 $\rightarrow$ 631; 82 $\rightarrow$ 278; 358 $\rightarrow$ 633).
    \item New conditions are tested and then merged into an existing filter: conditions from 292 were merged into 135 (\url{https://en.wikipedia.org/wiki/Special:AbuseFilter/history/135/diff/prev/4408}, also from 366; judging by the comments at \url{https://en.wikipedia.org/wiki/Special:AbuseFilter/292}, filter 292 was not conceived as a test filter, but was rather merged into 135 after the fact to save conditions); 440 was merged into 345; apparently 912 was merged into 11 (although 11 still appears to check for ``they suck'' only); in 460: ``Merging from 461, 472, 473, 474, and 475. -{}-Reaper 2012-08-17''.
    \item An incident caught repeatedly by a filter motivates the creation of a dedicated filter (994).
    \item A filter is shut down because editors notice that two or more filters perform nearly identical checks: 344 was shut down because of 3.
    \item A filter is introduced experimentally: ``in addition to filter 148, let's see what we get - Cen'' (\url{https://en.wikipedia.org/wiki/Special:AbuseFilter/188}). This illustrates that edit filter managers do introduce filters just to see whether they catch something.
\end{itemize}

%***************************************************

\subsection{Distinct filters over the years}
Thanks to Quarry\footnote{\url{https://quarry.wmflabs.org/}}, we have the number of distinct filters triggered per year
from 2009 (when the MediaWiki extension was enabled and filters were first introduced) until the end of 2018: see Table~\ref{tab:active-filters-count}.
This figure varies between $154$ in 2014 and $254$ in 2018.
The explanation for this fairly narrow range of active filters probably lies in the so-called condition limit.
According to the edit filters' documentation~\cite{Wikipedia:EditFilterDocumentation}, the condition limit is a hard-coded threshold on the total number of conditions that can be evaluated by all active filters per incoming edit.
Currently, it is set to $1{,}000$.
The motivation for this limit is to avoid performance issues: every incoming edit is checked against all currently enabled filters, so the more filters are active, the longer the checks take.
However, the page also warns that counting conditions is not the ideal metric of filter performance, since there are simple comparisons that take significantly less time than, for example, a check against the \emph{all\_links} variable (which needs to query the database)~\cite{Wikipedia:EditFilterDocumentation}.
Nevertheless, the condition limit still seems to be the heuristic used for filter performance optimisation today.

\begin{table}
  \centering
  \begin{tabular}{l r }
    % \toprule
    Year & Number of distinct filters \\
    \hline
    2009 & 220 \\
    2010 & 163 \\
    2011 & 161 \\
    2012 & 170 \\
    2013 & 178 \\
    2014 & 154 \\
    2015 & 200 \\
    2016 & 204 \\
    2017 & 231 \\
    2018 & 254 \\
    % \bottomrule
  \end{tabular}
  \caption{Count of distinct filters triggered each year}~\label{tab:active-filters-count}
\end{table}
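
The per-year counts in Table~\ref{tab:active-filters-count} boil down to a ``count distinct filter IDs, grouped by year'' aggregation over the filter log.
The following is only a minimal sketch of that aggregation on toy data, using an in-memory SQLite database; it is not the query that was actually run on Quarry.
The table and column names (\texttt{abuse\_filter\_log}, \texttt{afl\_filter}, \texttt{afl\_timestamp}) and the 14-digit timestamp format are assumptions modelled on the MediaWiki schema and should be checked against the live database.

```python
import sqlite3

# Toy stand-in for the filter log. Table and column names are assumptions
# modelled on the MediaWiki schema, not verified against the live database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abuse_filter_log (afl_filter TEXT, afl_timestamp TEXT)")
conn.executemany(
    "INSERT INTO abuse_filter_log VALUES (?, ?)",
    [
        ("1", "20090317231500"),    # timestamps assumed to be YYYYMMDDHHMMSS
        ("3", "20090318101010"),
        ("3", "20090401000000"),
        ("61", "20100105120000"),
        ("61", "20100210120000"),
        ("135", "20100301120000"),
        ("135", "20110301120000"),
    ],
)


def distinct_filters_per_year(connection):
    """Return {year: number of distinct filters with at least one hit}."""
    rows = connection.execute(
        """
        SELECT SUBSTR(afl_timestamp, 1, 4) AS year,
               COUNT(DISTINCT afl_filter)  AS n_filters
        FROM abuse_filter_log
        GROUP BY year
        ORDER BY year
        """
    )
    return {int(year): n_filters for year, n_filters in rows}


counts = distinct_filters_per_year(conn)
print(counts)
```

Note that this counts filters that were \emph{triggered} at least once in a given year; a filter that was enabled but never matched an edit does not appear in the count.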



To verify the current assumption (syn) properly, the following steps are necessary:
\begin{enumerate}
    \item a fresh dump should be obtained
    \item reverts should be extracted from it (e.g. by using the \emph{mwreverts} Python library, also used by Geiger and Halfaker);
    \item reverts should be narrowed down to accounts known for doing quality-control work (for example by pre-compiling a list of anti-vandal bots); reverts (or edits in general) made via Huggle and Twinkle are fairly easy to identify, since both tools leave a short code in the edit summary of their edits (``HG'' for Huggle and ``TW'' for Twinkle). %TODO verify that's still the case
\end{enumerate}
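
The last two steps can be sketched as follows.
The sketch deliberately does not call \emph{mwreverts}; instead it re-implements the underlying idea of identity reverts (a revision whose content checksum equals that of a recent earlier revision) with the standard library only, so that it runs without a dump.
The revision dictionaries, the bot list, the radius of 15 revisions, and the exact form of the summary markers are illustrative assumptions.

```python
import re
from collections import deque

# Assumed marker patterns: the text only states that Huggle leaves "HG" and
# Twinkle "TW" in the edit summary; the exact form should be verified.
TOOL_MARKERS = {
    "Huggle": re.compile(r"\bHG\b"),
    "Twinkle": re.compile(r"\bTW\b"),
}


def detect_tool(comment):
    """Return the name of the tool whose marker occurs in the summary, if any."""
    for tool, pattern in TOOL_MARKERS.items():
        if pattern.search(comment or ""):
            return tool
    return None


def find_reverts(revisions, radius=15):
    """Yield (reverting revision, reverted rev_ids) for identity reverts.

    `revisions` are the revisions of ONE page in chronological order, each a
    dict with at least 'rev_id' and 'sha1' (checksum of the page content).
    """
    window = deque(maxlen=radius)  # the most recent `radius` revisions
    for rev in revisions:
        recent = list(window)
        # Look for the newest earlier revision with identical content.
        for i in range(len(recent) - 1, -1, -1):
            if recent[i]["sha1"] == rev["sha1"]:
                reverted = [r["rev_id"] for r in recent[i + 1:]]
                if reverted:  # skip null edits that revert nothing
                    yield rev, reverted
                break
        window.append(rev)


def quality_control_reverts(revisions, known_bots, radius=15):
    """Keep only reverts made by listed bots or via Huggle/Twinkle."""
    results = []
    for rev, reverted in find_reverts(revisions, radius):
        if rev["user"] in known_bots:
            results.append((rev["rev_id"], "bot", reverted))
        else:
            tool = detect_tool(rev.get("comment"))
            if tool:
                results.append((rev["rev_id"], tool, reverted))
    return results


# Hypothetical page history, for illustration only.
history = [
    {"rev_id": 1, "sha1": "a", "user": "Alice", "comment": "start article"},
    {"rev_id": 2, "sha1": "b", "user": "203.0.113.7", "comment": ""},
    {"rev_id": 3, "sha1": "a", "user": "ExampleBot", "comment": "Reverting possible vandalism"},
    {"rev_id": 4, "sha1": "c", "user": "203.0.113.7", "comment": ""},
    {"rev_id": 5, "sha1": "d", "user": "203.0.113.7", "comment": ""},
    {"rev_id": 6, "sha1": "a", "user": "Bob", "comment": "Reverted edits by 203.0.113.7 (HG)"},
    {"rev_id": 7, "sha1": "e", "user": "Carol", "comment": "copyedit"},
    {"rev_id": 8, "sha1": "a", "user": "Dave", "comment": "undo"},
]

qc = quality_control_reverts(history, known_bots={"ExampleBot"})
print(qc)
```

In this toy history, the manual revert by the hypothetical user Dave is detected as a revert but dropped by the quality-control filter, since it comes neither from a listed bot nor carries a tool marker.
Matching two-letter markers in free-text summaries is a heuristic and may produce false positives.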

%*********************************************************

\section{Most active filters per year}
%TODO add column "manual tags" (see jupyter NB)
\begin{table}
  \centering
  \begin{tabular}{r p{9cm} r }
    % \toprule
    Filter ID & Publicly available description & Hitcount \\ % is the hitcount for the year or altogether till now?-- for the year, of course
    \hline
    135 & repeating characters & 175455 \\
    30 & ``large deletion from article by new editors'' & 160302 \\
    61 & ``new user removing references'' & 147377 \\
    18 & Test type edits from clicking on edit bar & 133640 \\
    3 & ``new user blanking articles'' & 95916 \\
    172 & ``section blanking'' & 89710 \\
    50 & ``shouting'' (contribution consists of all caps, numbers and punctuation) & 88827 \\
    98 & ``creating very short new article'' & 80434 \\
    65 & ``excessive whitespace'' & 74098 \\
    132 & ``removal of all categories'' & 68607 \\
    % \bottomrule
  \end{tabular}
  \caption{10 most active filters in 2009}~\label{tab:app-most-active-2009}
\end{table}

\begin{table}
  \centering
  \begin{tabular}{r p{9cm} r }
    % \toprule
    Filter ID & Publicly available description & Hitcount \\
    \hline
    61 & ``new user removing references'' & 245179 \\
    135 & repeating characters & 242018 \\
    172 & ``section blanking'' & 148053 \\
    30 & ``large deletion from article by new editors'' & 119226 \\
    225 & Vandalism in all caps & 109912 \\
    3 & ``new user blanking articles'' & 105376 \\
    50 & ``shouting'' & 101542 \\
    132 & ``removal of all categories'' & 78633 \\
    189 & BLP vandalism or libel & 74528 \\
    98 & ``creating very short new article'' & 54805 \\
    % \bottomrule
  \end{tabular}
  \caption{10 most active filters in 2010}~\label{tab:app-most-active-2010}
\end{table}

\begin{table}
  \centering
  \begin{tabular}{r p{9cm} r }
    % \toprule
    Filter ID & Publicly available description & Hitcount \\
    \hline
    61 & ``new user removing references'' & 218493 \\
    135 & repeating characters & 185304 \\
    172 & ``section blanking'' & 119532 \\
    402 & New article without references & 109347 \\
    30 & Large deletion from article by new editors & 89151 \\
    3 & ``new user blanking articles'' & 75761 \\
    384 & Addition of bad words or other vandalism & 71911 \\
    225 & Vandalism in all caps & 68318 \\
    50 & ``shouting'' & 67425 \\
    432 & Starting new line with lowercase letters & 66480 \\
    % \bottomrule
  \end{tabular}
  \caption{10 most active filters in 2011}~\label{tab:app-most-active-2011}
\end{table}

\begin{table}
  \centering
  \begin{tabular}{r p{9cm} r }
    % \toprule
    Filter ID & Publicly available description & Hitcount \\
    \hline
    135 & repeating characters & 173830 \\
    384 & Addition of bad words or other vandalism & 144202 \\
    432 & Starting new line with lowercase letters & 126156 \\
    172 & ``section blanking'' & 105082 \\
    30 & Large deletion from article by new editors & 93718 \\
    3 & ``new user blanking articles'' & 90724 \\
    380 & Multiple obscenities & 67814 \\
    351 & Text added after categories and interwiki & 59226 \\
    279 & Repeated attempts to vandalize & 58853 \\
    225 & Vandalism in all caps & 58352 \\
    % \bottomrule
  \end{tabular}
  \caption{10 most active filters in 2012}~\label{tab:app-most-active-2012}
\end{table}

\begin{table}
  \centering
  \begin{tabular}{r p{9cm} r }
    % \toprule
    Filter ID & Publicly available description & Hitcount \\
    \hline
    135 & repeating characters & 133309 \\
    384 & Addition of bad words or other vandalism & 129807 \\
    432 & Starting new line with lowercase letters & 94017 \\
    172 & ``section blanking'' & 92871 \\
    30 & Large deletion from article by new editors & 85722 \\
    279 & Repeated attempts to vandalize & 76738 \\
    3 & ``new user blanking articles'' & 70067 \\
    380 & Multiple obscenities & 58668 \\
    491 & Edits ending with emoticons or ! & 55454 \\
    225 & Vandalism in all caps & 48390 \\
    % \bottomrule
  \end{tabular}
  \caption{10 most active filters in 2013}~\label{tab:app-most-active-2013}
\end{table}

\begin{table}
  \centering
  \begin{tabular}{r p{9cm} r }
    % \toprule
    Filter ID & Publicly available description & Hitcount \\
    \hline
    384 & Addition of bad words or other vandalism & 111570 \\
    135 & repeating characters & 111173 \\
    279 & Repeated attempts to vandalize & 97204 \\
    172 & ``section blanking'' & 82042 \\
    432 & Starting new line with lowercase letters & 75839 \\
    30 & Large deletion from article by new editors & 62495 \\
    3 & ``new user blanking articles'' & 60656 \\
    636 & Unexplained removal of sourced content & 52639 \\
    231 & Long string of characters containing no spaces & 39693 \\
    380 & Multiple obscenities & 39624 \\
    % \bottomrule
  \end{tabular}
  \caption{10 most active filters in 2014}~\label{tab:app-most-active-2014}
\end{table}

\begin{table}
  \centering
  \begin{tabular}{r p{9cm} r }
    % \toprule
    Filter ID & Publicly available description & Hitcount \\
    \hline
    650 & Creation of a new article without any categories & 226460 \\
    61 & New user removing references & 196986 \\
    636 & Unexplained removal of sourced content & 191320 \\
    527 & T34234: log/throttle possible sleeper account creations & 189911 \\
    633 & Possible canned edit summary & 162319 \\
    384 & Addition of bad words or other vandalism & 141534 \\
    279 & Repeated attempts to vandalize & 110137 \\
    135 & repeating characters & 99057 \\
    686 & IP adding possibly unreferenced material to BLP & 95356 \\
    172 & ``section blanking'' & 82874 \\
    % \bottomrule
  \end{tabular}
  \caption{10 most active filters in 2015}~\label{tab:app-most-active-2015}
\end{table}

\begin{table}
  \centering
  \begin{tabular}{r p{9cm} r }
    % \toprule
    % \toprule
    Filter ID & Publicly available description & Hitcount \\
    \hline
    527 & T34234: log/throttle possible sleeper account creations & 437099 \\
    61 & New user removing references & 274945 \\
    650 & Creation of a new article without any categories & 229083 \\
    633 & Possible canned edit summary & 218696 \\
    636 & Unexplained removal of sourced content & 179948 \\
    384 & Addition of bad words or other vandalism & 179871 \\
    279 & Repeated attempts to vandalize & 106699 \\
    135 & repeating characters & 95131 \\
    172 & ``section blanking'' & 79843 \\
    30 & Large deletion from article by new editors & 68968 \\
    % \bottomrule
  \end{tabular}
  \caption{10 most active filters in 2016}~\label{tab:app-most-active-2016}
\end{table}

\begin{table}
  \centering
  \begin{tabular}{r p{9cm} r }
    % \toprule
    Filter ID & Publicly available description & Hitcount \\
    \hline
    61 & New user removing references & 250394 \\
    633 & Possible canned edit summary & 218146 \\
    384 & Addition of bad words or other vandalism & 200748 \\
    527 & T34234: log/throttle possible sleeper account creations & 192441 \\
    636 & Unexplained removal of sourced content & 156409 \\
    650 & Creation of a new article without any categories & 151604 \\
    135 & repeating characters & 80056 \\
    172 & ``section blanking'' & 70837 \\
    712 & Possibly changing date of birth in infobox & 59537 \\
    833 & Newer user possibly adding unreferenced or improperly referenced material & 58133 \\
    % \bottomrule
  \end{tabular}
  \caption{10 most active filters in 2017}~\label{tab:app-most-active-2017}
\end{table}

\begin{table}
  \centering
    \begin{tabular}{r p{9cm} r }
    % \toprule
    Filter ID & Publicly available description & Hitcount \\
    \hline
    527 & T34234: log/throttle possible sleeper account creations & 358210 \\
    61 & New user removing references & 234867 \\
    633 & Possible canned edit summary & 201400 \\
    384 & Addition of bad words or other vandalism & 177543 \\
    833 & Newer user possibly adding unreferenced or improperly referenced material & 161030 \\
    636 & Unexplained removal of sourced content & 144674 \\
    650 & Creation of a new article without any categories & 79381 \\
    135 & repeating characters & 75348 \\
    686 & IP adding possibly unreferenced material to BLP & 70550 \\
    172 & ``section blanking'' & 64266 \\
    % \bottomrule
  \end{tabular}
  \caption{10 most active filters in 2018}~\label{tab:app-most-active-2018}
\end{table}
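
The hit counts in the tables above are per-year figures.
As a rough illustration of how such rankings can be derived from raw log entries, the following sketch counts hits per filter within each year and keeps the $n$ most frequent ones.
The input format, a sequence of (year, filter ID) pairs with one pair per log entry, and the sample data are assumptions made for the sake of the example.

```python
from collections import Counter, defaultdict


def most_active_filters(hits, top_n=10):
    """Rank filters by hit count within each year.

    `hits` is an iterable of (year, filter_id) pairs, one per log entry.
    Returns {year: [(filter_id, hit_count), ...]} with at most `top_n`
    entries per year, most active filter first.
    """
    per_year = defaultdict(Counter)
    for year, filter_id in hits:
        per_year[year][filter_id] += 1
    return {
        year: counter.most_common(top_n)
        for year, counter in sorted(per_year.items())
    }


# Hypothetical log entries, for illustration only.
sample_hits = [
    (2009, 135), (2009, 135), (2009, 30),
    (2010, 61), (2010, 61), (2010, 61),
    (2010, 135), (2010, 172), (2010, 172),
]

ranking = most_active_filters(sample_hits, top_n=2)
print(ranking)
```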