\chapter{Discussion}
\label{chap:discussion}
I started this inquiry with the following questions:\\ %TODO either number the questions with Qx from the beginning and use it consistently or leave it be altogether
Q1: What is the role of edit filters among existing algorithmic quality-control mechanisms on Wikipedia (bots, semi-automated tools, ORES, humans)?\\
%-- chapter 4 (and 2)
Q2: Edit filters are a classical rule-based system. Why are they still active today when more sophisticated ML approaches exist?\\
%-- chapter 6 (discussion)
Q3: Which type of tasks do filters take over?\\ %-- chapter 5
Q4: How have these tasks evolved over time (are there changes in type, number, etc.)? %-- chapter 5 (can be significantly expanded)
In what follows, I go over each of them and summarise the findings.
\section{Q1: What is the role of edit filters among existing quality-control mechanisms on Wikipedia (bots, semi-automated tools, ORES, humans)?}
When edit filters were introduced in 2009, various other mechanisms that took care of quality control on Wikipedia had already been in place for some time.
However, the community felt the need for an instrument for preventing easy-to-recognise but pervasive and difficult-to-clean-up vandalism as early as possible.
This was supposed to take workload off the other mechanisms along the quality control process (see figure~\ref{fig:funnel-with-filters}), especially off human editors who could then use their time more productively elsewhere, namely to check less obvious cases.
%TODO is there another important findind from chapter 4's conclusion that is missing here?
Both filters and bots are completely automated mechanisms, thus a comparison between the two seems reasonable.
What did the filters accomplish differently?
% before vs after
A key distinction is that while bots check already published edits which they may decide to eventually revert, filters are triggered before an edit is ever published.
One may argue that nowadays this is not a significant difference.
Whether a disruptive edit is outright disallowed or caught and reverted two seconds after its publication by ClueBot NG doesn't have a tremendous impact on the readers:
The vast majority of them will never see the edit either way.
Still, there are various examples of vandalism that didn't survive long on Wikipedia, but where the brief time before reversion was sufficient for hundreds of media outlets to report it as news~\cite{Elder2016}, which severely undermines the project's credibility.
% The infrastructure question: Part of the software vs externally run --> compared to admin bots, better communication!
Another difference between bots and filters underlined several times in community discussions was that as a MediaWiki extension edit filters are part of the core software whereas bots are running on external infrastructure which makes them both slower and generally less reliable.
(Compare Geiger's account about running a bot on a workstation in his apartment which he simply pulled the plug on when he was moving out~\cite{Geiger2014}.)
Nowadays, we can ask ourselves whether this is still of significance:
A lot of bots are run on Toolforge~\cite{Wikimedia:Toolforge}, a cloud service providing a hosting environment for a variety of applications (bots, analytics, etc.) run by volunteers who work on Wikimedia projects.
The service is maintained by the Wikimedia Foundation the same way the Wikipedia servers are, so it is in consequence just as reliable and available as the encyclopedia itself.
The argument that someone powered off the basement computer on which they were running bot X is just not as relevant anymore.
% more on bots vs filters
% collaboration possible on filters?
% who edits filters (edit filter managers, above all trusted admins) and who edits bots (in theory anyone approved by the BAG)o
When comparing the tasks of bots described in related work (chapter~\ref{chap:background}) with the content analysis of filters' tasks conducted in chapter~\ref{chap:overview-en-wiki} (see also the discussion for Q3 in section~\ref{sec:discussion-q3}), the results show great overlap between the task descriptions for both tools.
From an end result perspective it doesn't seem to make a big difference, whether a problem is taken care of by an edit filter or a bot.
As mentioned in the paragraph above, whether malicious content is directly disallowed or reverted two seconds later (in which time probably a total of three users have seen it if any) is hardly a qualitative difference for Wikipedia's readers. %TODO (although I'm making a slightly different point in the paragraph above, clean up!)
I would argue though that there are other stakeholders for whom the choice of mechanism makes a bigger difference:
the operators of the quality control mechanisms and the users whose edits are being targeted.
The significant distinction for operators is that the architecture of the edit filter plugin supposedly fosters collaboration which results in a better system (compare with the famous ``given enough eyeballs, all bugs are shallow''~\cite{Raymond1999}).
Any edit filter manager can modify a filter causing problems and the development of a single filter is usually a collaborative process.
A glance at the history of most filters reveals that they have been updated multiple times by various users.
In contrast, bots' source code is often not publicly available and they are mostly run by one operator only, so no real peer review of the code is practiced and the community has time and again complained of unresponsive bot operators in emergency cases~\cite{Wikipedia:EditFilterTalkArchive1}.
(On the other hand, more and more bots are based on code from various bot development frameworks such as pywikibot~\cite{pywikibot}, so this is not completely valid either.)
At the same time, it seems far more difficult to become an edit filter manager:
There are only very few of them, the vast majority of whom are admins or, in exceptional cases, highly trusted users.
By contrast, a bot operator only needs an approval for their bot by the Bot Approvals Group and can get going.
The choice of mechanism also makes a difference for the editor whose edits have been deemed disruptive.
Filters which assume good faith seek communication with the offending user by issuing warnings that provide some feedback and allow the user to modify their edit (hopefully in a constructive fashion) and publish it again.
Bots on the other hand revert everything their algorithms find malicious directly.
They also leave warning messages on the user's talk page informing them that their edits have been reverted because the bot's heuristic was matched and point them to a false positives page where they can make a report.
It is still a revert-first-ask-questions-later approach which is rather discouraging for good faith newcomers.
In the case of good faith edits, an editor wishing to dispute the decision would have to raise the issue on the bot's talk page, and research has shown that attempts to initiate discussions with (semi-)automated quality control agents generally have quite poor response rates~\cite{HalGeiMorRied2013}.
Compared to MediaWiki's page protection mechanism, edit filters allow for fine-grained control at the user level:
One can implement a filter targeting specific malicious users directly instead of restricting edit access for everyone.
%TODO Fazit?
\section{Q2: Edit filters are a classical rule-based system. Why are they still active today when more sophisticated ML approaches exist?}
\label{sec:discussion-q2}
Research has long demonstrated higher precision and recall of machine learning methods~\cite{PotSteGer2008}.
With this premise in mind, one has to ask:
Why are rule based mechanisms such as the edit filters still widely in use?
Several explanations of this phenomenon sound plausible.
For one, Wikipedia's edit filters are an established system which does its job reasonably well, so there is no need to change it (``never touch a running system'').
Secondly, it has been organically woven in Wikipedia's quality control ecosystem.
There were historical necessities to which it responded and people at the time believed the mechanism to be the right solution to the problem they had.
We could ask why it was introduced in the first place when other mechanisms were already in place.
Besides the specific instances of disruptive behaviour stated by the community as motivation to implement the extension,
a very plausible explanation is that since Wikipedia is a volunteer project, a lot of things happen because at a particular moment particular people are familiar with particular technologies, so they construct a solution using the technologies they are good at (or want to use).
Another interesting reflection is that rule-based systems are arguably easier to implement and, above all, easier for humans to understand, which is why they still enjoy popularity today.
On the one hand, overall less technical knowledge is required in order to implement a single filter:
An edit filter manager has to ``merely'' understand regular expressions.
Bot development, by contrast, is somewhat more challenging:
A developer needs reasonable knowledge of at least one programming language and, on top of that, has to familiarise themselves with artefacts like the Wikimedia API.
Moreover, since regular expressions are still somewhat human-readable, unlike a lot of popular machine learning algorithms, it is easier to hold rule-based systems and their developers accountable.
Filters are a simple mechanism (simple to implement) that swiftly takes care of cases that are easily recognisable as undesirable.
Machine learning, by contrast, needs training data (which is expensive to obtain) and is not simple to implement.
What is more, rule-based mechanisms allow for a finer granularity of control:
An edit filter can define a rule to explicitly exclude particular malicious users from publishing, which cannot be straightforwardly implemented in a machine learning algorithm.
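To make this concrete, the logic of such a narrowly targeted filter essentially boils down to a couple of boolean checks around a regular expression. The following Python sketch is purely illustrative: the account name and the pattern are invented, and actual filters are written in the AbuseFilter rule language rather than in Python.
\begin{verbatim}
import re

# Invented pattern; real filter regexes are maintained by edit filter managers.
SPAM_PATTERN = re.compile(r"buy cheap|online casino", re.IGNORECASE)

def filter_would_trip(user_name, user_groups, added_text):
    """Hypothetical filter: trip only for one specific unconfirmed account."""
    targets_user = (user_name == "ExampleVandal")     # single malicious account
    is_unconfirmed = "confirmed" not in user_groups   # new, not yet confirmed user
    return (targets_user and is_unconfirmed
            and bool(SPAM_PATTERN.search(added_text)))
\end{verbatim}
Even readers without a programming background can tell at a glance whom and what such a rule targets, which is considerably harder to claim for the learned parameters of a machine learning classifier.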
%Fazit?
\section{Q3: Which type of tasks do filters take over?}
\label{sec:discussion-q3}
% TODO comment on: so what's the role of the filters, why were they introduced (to get over with obvious persistent vandalism which was difficult to clean up, most probably automated) -- are they fulfilling this purpose?
Chapter~\ref{chap:overview-en-wiki} shows that edit filters target juvenile and grave vandalism, spam, good faith disruptive edits (e.g. blanking an article instead of moving it because of unfamiliarity with the software and proper procedure), and maintenance tasks.
In total, $2/3$ of the filters ever implemented are still hidden, and since according to the guidelines filters are supposed to be hidden when aimed at egregious vandalism by specific malicious users~\cite{Wikipedia:EditFilter}, the AbuseFilter extension appears to be used in accordance with its declared purpose.
At the same time, the January 2019 snapshot of the \emph{abuse\_filter} database table revealed nearly equal numbers of enabled public and private filters.
This means that at the time, filters were targeting specific vandals as much as general disruptive behaviour.
It also leads to the conclusion that hidden filters fluctuate more, which seems reasonable given their application area: specific users and behaviours.
As demonstrated by the bot taxonomy proposed by Halfaker and Riedl~\cite{HalRied2012} referred to in section~\ref{sec:5-conclusions}, bots are also doing a lot of these or similar tasks.
So, when a new problem arises, how does the community decide whether to implement a bot or a filter to handle it?
As discussed in the previous section~\ref{sec:discussion-q2}, this probably partially depends on who discovers/takes care of the problem and what technology they are familiar with and have access to.
There are also some guidelines (compare section~\ref{sec:introduce-a-filter}) which underline that filters are most suitable for problems concerning all pages, and which point out different approaches for other issues: using page protection for problems with a single page; using the title and spam blacklists for persistent spam waves or attempts to create abusive titles; and using bots for in-depth checks or for problems with a single page.
Moreover, it is stated that trivial formatting mistakes should not trip filters~\cite{Wikipedia:EditFilterRequested}:
This seems like a waste of computing power and unnecessary irritation to the user.
For what it is worth, I also think that bots are more suitable to take care of such cases.
However, the community does not always stick consistently to these guidelines and occasionally implements filters that contradict them.
(Examples of this are filters that target non-disruptive or non-problematic behaviour, such as filter 308 ``Malformed Mediation Cabal Requests'', or the fairly frequent hiding of filters tracking general behaviour.)
Such filters are mostly switched off again (or, in the case of hidden general patterns, made public) relatively fast, but there are also examples such as filter 432 ``Starting new line with lowercase letters'' (still active as of 24 July 2019) which, in my opinion, violates the above-mentioned trivial mistakes rule.
As a matter of fact, multiple edit filter managers also run bots.
Therefore, it seems relevant to consider how they decide which mechanism to apply when faced with a particular issue.
Preliminary results have shown that some of the users concerned seem to be primarily bot operators who implement auxiliary filters, while others are primarily edit filter managers who implement auxiliary bots.
As mentioned in section~\ref{sec:further-studies}, future work could further explore the relationships of filters and bots implemented by the same user, especially by taking the (currently unavailable) \emph{abuse\_filter\_history} table into account, or by conducting interviews with the users in question.
\begin{comment}
Some of the edit filter managers are actually also bot operators.
I've compiled a list of edit filter managers who are simultaneously also bot operators;
I've further assembled the bots they run and made notes on the bots that seem to be relevant to vandalism prevention/quality assurance.
I'm currently trying to determine from document traces what filter contributions the corresponding edit filter managers had and whether they are working on filters similar to the bots they operate.
Insight is currently minimal, since abuse\_filter\_history table is not available and we can only determine what filters an edit filter manager has worked on from limited traces such as: last modifier of the filter from abuse\_filter table; editors who signed their comments from abuse\_filter table; probably some noticeboards or talk page archives, but I haven't looked into these so far; the history web API for public filters.
It is to be said at this point that I eventually took these lists out of the public repository.
This clashes with my overall open science aspiration, however I determined that precisely such individual-related records are an example of ethically and privacy critical thickening of traces (see section~\ref{})
and that I'm not willing to offer potential trolls ready-made lists.
Finally, it stands to reason that if we are interested in the question when do people (who have access to both) implement a bot and when a filter, all we have to do is ask (see directions for future research in section~\ref{}).
\end{comment}
Finally, closer scrutiny and critical evaluation of the filter patterns are required.
%or At the end, particular aspects of the filter patterns were critically scrutinised:
It can be discussed whether it is fair and justified that 20\% of the enabled filters target only new (not confirmed) editors.
Why is it all right for an established editor to use swear words (see filter 384 ``Addition of bad words or other vandalism'') or insert longer strings of all caps (filter 50 ``Shouting'') whereas it is not for newbies?
\section{Q4: How have these tasks evolved over time (are there changes in type, number, etc.)?}
The following insights about temporal trends were uncovered:
%Firstly, the overall number of active edit filters stays the same over time due to the condition limit.
Firstly, edit filters have been much more active than the initially anticipated few hits per hour: a consultation of the AbuseLog shows several entries per minute, and the hit numbers are of the same order of magnitude as the revert counts.
Secondly, the list of most active filters of all time reveals above all older filters which continue to be matched very frequently.
Moreover, it is mostly the same old filters which have been highly active through the years:
The list of the ten most active filters for each year since the introduction of the AbuseFilter extension is fairly stable.
Although, as pointed out in section~\ref{sec:4-def}, filter patterns can be changed, they are mostly only optimised for efficiency, so it can be assumed filters have been catching the same troublesome behaviour over the years.
Thirdly, the overall number of filter hits has risen since 2016, when a somewhat puzzling spike in filter hits took place which needs further investigation.
Additionally, the general tendency is that over time fewer good faith filter hits and more vandalism-related ones occurred.
All in all, besides the peak in hits from 2016, additional temporal patterns in the filters' characteristics and activity can be explored.
These include practices of filters' creation, configuration, and refactoring.
\begin{comment}
Claudia: * A focus on the Good faith policies/guidelines is a historical development. After the huge surge in edits Wikipedia experienced starting 2005 the community needed a means to handle these (and the proportional amount of vandalism). They opted for automatisation. Automated system branded a lot of good faith edits as vandalism, which drove new comers away. A policy focus on good faith is part of the intentions to fix this.
\end{comment}
%***************************************
\section{Limitations}
\label{sec:limitations}
This work presents an initial attempt at analysing Wikipedia's edit filter system;
as such, it has several limitations.
\subsection{Limitations of the Data}
%EN Wiki only
Firstly, the thesis focuses on English Wikipedia only.
This offers an excellent starting point for the analysis of edit filters:
After all, EN Wikipedia was the first language version to which the mechanism was introduced.
However, valuable lessons can be learnt—about the communities, models of governance, usefulness of filters, etc.—from comparing edit filters' usage and activity across different language versions.
Recall, for instance, that the edit filter managers group doesn't exist in certain language versions (section~\ref{section:who-can-edit}); there, it is instead administrators who hold the \emph{abusefilter-modify} permission next to their other rights.
Effectively, in these language versions of Wikipedia (the Spanish, German, and Russian ones), a much bigger group of users has access to the mechanism.
It is expected that this shapes its governance and usage patterns.
% Missing abuse_filter_history
Moreover, the \emph{abuse\_filter\_history} table was not available, so no systematic analysis of the filters' development over time could be realised (see section~\ref{sec:overview-data}).
% No access to hidden filters
Finally, I had no access to the details of hidden filters, so no investigation of their patterns (for instance verifying whether they really target specific users) was possible.
\subsection{Limitations in the Research Process}
% Limitations of found data, missing IVs
Unfortunately, conducting a classic ethnographic analysis was not possible.
It would have been particularly insightful to talk to edit filter managers (above all those who are simultaneously also bot operators) and developers of the extension, as well as regular editors who have tripped a filter, about their experiences.
Essentially, only ``found data'' was used, and as pointed out in section~\ref{sec:trace-ethnography}, this has the shortcoming of observing only what was discussed in the documentation archives and recorded by the logs.
As Geiger and Halfaker maintain, Wikipedia's databases have the purpose of allowing the Wikipedian community to build an encyclopedia, not to facilitate scientific research~\cite{GeiHal2017}.
Future studies can and should use further data sources and for instance utilise the first insights of the current research as interview prompts.
% Trace Ethnography: misinterpretation of data, due to insufficient experience with in this community of practice.
Another limitation that comes to mind is related to the applied methodology of trace ethnography.
The data of the present study do not speak for themselves:
Instead, domain knowledge of the Wikipedian ecosystem is necessary in order to be able to accurately invert traces.
Prior to this research, I had had a Wikipedia account for several years but had only used it to make occasional (rather minor) edits.
I have learnt a lot since the beginning of the project, but it is still very much possible that I have misinterpreted data due to insufficient experience and lack of background knowledge.
% Coding by 1 person
Thirdly, as signalled in section~\ref{sec:manual-classification}, the manual filter classification was undertaken by one person only (me), so my biases have certainly shaped the labels.
To increase reliability, the coding should be repeated by at least one more researcher and both sets of labelled data compared.
%************************************************************************
\section{Directions for Future Studies}
\label{sec:further-studies}
Throughout the thesis, a variety of intriguing questions arose which couldn't be addressed for various reasons, above all insufficient time.
Here, a comprehensive list of all these pointers for possible future research is provided.
\begin{enumerate}
\item \textbf{How have edit filters' tasks evolved over time?} Unfortunately, no detailed historical analysis of the filters could be realised, since the database table storing changes to individual filters (\emph{abuse\_filter\_history}) is not currently replicated (see section~\ref{sec:overview-data}).
As mentioned in section~\ref{sec:overview-data}, a patch aiming to renew the replication of the table is currently under review~\cite{gerrit-tables-replication}.
When a dump becomes available, an extensive investigation of filters' actions, creation and activation patterns, as well as patterns they have targeted over time will be possible.
\item \textbf{What proportion of quality control work do filters take over?} Filter hits can be systematically compared with the overall number of edits and with the reverts carried out by other quality control mechanisms (a minimal sketch of such a comparison follows after this list).
\item \textbf{Is it possible to study the filter patterns in a more systematic fashion? What can be learnt from this?} For example, it was noticed that $1/5$ of all active filters discriminate against new users via the \verb|!("confirmed" in user_groups)| pattern.
Are there other tendencies of interest?
\item \textbf{Is there a qualitative difference between the tasks/patterns of public and hidden filters?} According to the guidelines for filter creation, general filters should be public while filters targeting particular users should be hidden. Is there something more to be learnt from an examination of hidden filters' patterns? Do they actually conform to the guidelines? %One will have to request access to them for research purposes, sign an NDA, etc.
\item \textbf{How are false positives handled?} Have filters regularly been shut down because they matched more false positives than they provided real value? Are there large numbers of false positives that distort the filter hit data and thus the interpretations offered by the current work?
\item \textbf{To implement a bot or to implement a filter?} An ethnographic inquiry into how an editor who is simultaneously an edit filter manager and a bot operator decides, when faced with a new problem, which mechanism to employ for the solution.
\item \textbf{What are the repercussions on affected editors?} An ethnographic study of the consequences of edit filters for editors whose edits are filtered. Do they experience frustration or alienation? Do they understand what is going on? Or do they, for example, experience edit filters' warnings as helpful, appreciate the hints they have been given, and use them to improve their collaboration?
\item \textbf{What are the differences between how filters are governed on EN Wikipedia compared to other language versions?} Different Wikipedia language versions each have a local community behind them.
These communities vary, sometimes significantly, in their modes of organisation and values.
It would be very insightful to explore disparities between filter governance and the types of filters implemented between different language versions.
\item \textbf{Are edit filters a suitable mechanism for fighting harassment?} A disturbing rise in online personal attacks and harassment is observed in a variety of online spaces, including Wikipedia~\cite{Duggan2014}.
The Wikimedia Foundation sought to better understand harassment in their projects via a Harassment Survey conducted in 2015~\cite{Wikimedia:HarassmentSurvey}.
According to the edit filter noticeboard archives~\cite{Wikipedia:EditFilterNoticeboardHarassment}, there have been some attempts to combat harassment by means of filters.
The tool is also mentioned repeatedly in the timeline of Wikipedia's Community Health Initiative~\cite{Wikipedia:CommunityHealthInitiative} which seeks to reduce harassment and disruptive behaviour on Wikipedia.
An evaluation of its usefulness and success at this task would be really interesting.
\item \textbf{(How) has the notion of ``vandalism'' on Wikipedia evolved over time?} By comparing older and newer filters, or respectively updates in filter patterns, it could be investigated whether there has been a qualitative change in the interpretation of the ``vandalism'' notion on Wikipedia.
\item \textbf{What are the urgent situations in which edit filter managers are given the freedom to act as they see fit and ignore best practices of filter adoption?} (That is, to skip switching on a filter in log-only mode first and announcing it on the noticeboard so others can have a look.) Who determines that a situation is urgent? These cases should be scrutinised extra carefully, since ``urgent situations'' have historically always been an excuse for cuts in civil liberties.
%* is there a qualitative difference between complaints of bots and complaints of filters?
%\item \textbf{Do edit filter managers specialize on particular types of filters (e.g. vandalism vs. good faith?)} \emph{abuse\_filter\_history } table is needed for this
%\item \textbf{Do edit filter managers stick to the edit filter guidelines?} e.g. filters should't be implemented for trivial problems (such as spelling mistakes); problems with specific pages are generally better taken care of by protecting the page and problematic title by the title blacklist; general filters shouldn't be hidden
\end{enumerate}
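As a starting point for the second item above (the proportion of quality control work taken over by filters), hit counts could be set against overall edit counts via the public MediaWiki API. The following Python snippet is only a rough sketch under the assumption that the standard \texttt{list=abuselog} and \texttt{list=recentchanges} API modules are available; it ignores pagination (each call is capped at 500 entries), only looks at a one-hour window, and makes no attempt to identify reverts, which would require additional heuristics.
\begin{verbatim}
from datetime import datetime, timedelta, timezone
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "edit-filter-study-sketch/0.1"}

def count_batch(list_module, prefix, start, end, extra=None):
    """Count one batch (at most 500 entries) of a list module
    between two timestamps."""
    params = {
        "action": "query", "format": "json", "list": list_module,
        prefix + "start": start, prefix + "end": end,
        prefix + "dir": "newer", prefix + "limit": 500,
    }
    params.update(extra or {})
    data = requests.get(API, params=params, headers=HEADERS).json()
    return len(data["query"][list_module])

if __name__ == "__main__":
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    hits = count_batch("abuselog", "afl",
                       start.strftime(fmt), end.strftime(fmt))
    edits = count_batch("recentchanges", "rc",
                        start.strftime(fmt), end.strftime(fmt),
                        extra={"rctype": "edit"})
    print("filter hits in the last hour (capped at 500):", hits)
    print("edits in the last hour (capped at 500):", edits)
\end{verbatim}
A proper comparison would, of course, have to iterate over the full time range with API continuation (or use the database replicas directly) and normalise the counts per month, but the basic juxtaposition of AbuseLog entries and edit counts would follow this pattern.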
%TODO further points for future study
\begin{comment}
\subsection{What filters were implemented immediately after the launch + manual tags}
\subsection{Filter Usage/Activity}
There are filters that have been switched on for a while, then deactivated and never activated again. (phenomenon was over; they never caught anything in the first place; ..)
Switched on and stayed on;
switched off very fast;...
\subsection{How do filters emerge?}
** an older filter is split? 79 was split out of 61, apparently; 285 is split between "380, 384, 614 and others"; 174 is split from 29
** several older filters are merged?
** or functionality of an older filter is took and extended in a newer one (479->631); (82->278); (358->633);
\end{comment}