In May 2014 the American magazine \textit{The New Yorker} published a story called ``How a Raccoon Became an Aardvark'' in its column ``Annals of Technology''~\cite{Randall2014}.
It tells the anecdote of a New York student who, some six years earlier, had edited the Wikipedia article on ``coati'' (a member of the raccoon family native to South and Central America) to state that the coati is ``also known as a Brazilian aardvark''.
Now, this is exactly how Wikipedia works, right?
Anyone can edit, and contribution by contribution the world's largest knowledge base is compiled.
Except that the student had made the claim up: it was an inside joke he shared with his brother on a holiday trip to Brazil.
Unsourced pieces of information are not supposed to survive long on Wikipedia, and he thought that the edit would be swiftly deleted.
Fast-forward to 2010: not only had the entry on ``coati'' not changed, but it now cited a 2010 article by the newspaper the \textit{Telegraph} as evidence.
In the meantime several newspapers, a YouTube video and a book published by the University of Chicago~\cite{} had claimed that the coati was known as a Brazilian aardvark.
It proved anything but trivial to remove the snippet from Wikipedia, since there were all these other sources affirming the statement.
By then, it was not exactly false either: the coati \emph{was} known as ``Brazilian aardvark'', at least on the Internet.
Now, even though on the whole Wikipedia may contain roughly as many errors as the \textit{Encyclopaedia Britannica}, %TODO quote!
stories of hoaxes like the one above are precisely why it is still maintained that information on Wikipedia cannot be trusted and cannot be used as a serious bibliographic reference.
%TODO transition is somewhat jumpy
The Wikipedia community is well aware of its project's poor reputation for reliability and has a long-standing history of quality control processes.
Not only hoaxes but also profanity and malicious vandalism have been around since the very beginning and have increased as the project rose to prominence.
%Since its conception in 2001, when nobody believed it was ever going to be a serious encyclopedia, the project has grown steadily.
At the latest with the exponential surge in the number of users and edits around 2006, the community began realising that it needed more automated means of quality control.
The same year, the first anti-vandal bots were introduced, followed by semi-automated patrolling tools such as Twinkle (in 2007) and Huggle (at the beginning of 2008).
In 2009, yet another mechanism was introduced.
Its core developer, Andrew Garrett, known on Wikipedia as User:Werdna, called it the ``abuse filter'', and according to the English Wikipedia's newspaper, \textit{The Signpost}, its purpose was to ``allow[...] all edits to be checked against automatic filters and heuristics, which can be set up to look for patterns of vandalism including page move vandalism and juvenile-type vandalism, as well as common newbie mistakes''.
%TODO decide whether to cite the Signpost here already, since it appears again in chapter4
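To give a rough impression of what such a rule-based check amounts to, the following Python sketch mimics the kind of heuristic an edit filter encodes; the pattern, the threshold and all names are invented for illustration and deliberately do not reproduce the AbuseFilter rule language or any actual filter.
\begin{verbatim}
import re

# Invented heuristic in the spirit of an edit filter: flag edits by very new
# accounts that add juvenile-vandalism-like strings to an article page.
JUVENILE = re.compile(r"\b(poop|butt|dumb|lol)\w*\b", re.IGNORECASE)

def matches_filter(added_text: str, user_editcount: int,
                   page_namespace: int) -> bool:
    """Return True if the edit would trip this (hypothetical) filter."""
    is_new_user = user_editcount < 50    # very inexperienced account
    in_article = page_namespace == 0     # main/article namespace
    return is_new_user and in_article and bool(JUVENILE.search(added_text))

# Example: a brand-new account adding juvenile vandalism to an article
print(matches_filter("the coati is a dumb animal lol",
                     user_editcount=3, page_namespace=0))  # True
\end{verbatim}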
Quality control.
%TODO right now, an abrupt end
%TODO check Aaron Swartz' blog for opening quotes
%TODO reuse this?
\begin{comment}
Idea: have opening quotes per chapter
check Aaron Swartz' blog for opening quotes
Another candidate for an opening quote:
\cite{Geiger2014}
"Bots aren’t usually part of some master plan – if they were, they probably wouldn’t be bots."
Another candidate for an opening quote:
\cite{Charmaz2006}
"At each phase of the research journey, \textit{your} reasings of your work guide your next moves."(p.xi)
"In short, the finished work is a construction–yours." (p.xi)
Nice quote:
The Wikipedia Revolution: How A Bunch of Nobodies Created The World's Greatest Encyclopedia is a 2009 popular history book by new media researcher and writer Andrew Lih.
\end{comment}
\begin{comment}
- code is law
\end{comment}
``Code 2.0 TO WIKIPEDIA, THE ONE SURPRISE THAT TEACHES MORE THAN EVERYTHING HERE.'' reads one of the inscriptions of Lawrence Lessig's ``Code Version 2.0'' (p.v)~\cite{Lessig2006}.
And although I'm not quite sure what exactly Lessig meant by this regarding the update of his famous book, I readily agree that Wikipedia is important because it teaches us stuff.
Not only in the literal sense, because it is, well, an encyclopedia.
Since it is an open encyclopedia that has grown to be one of the biggest collaborative projects in the world, studying its complex governance, community building and algorithmic systems can teach us a lot about other, less open systems.
%TODO verify all these claims and numbers
Since its conception in 2001, when nobody believed it was ever going to be a serious encyclopedia, the project has grown steadily.
%TODO decide what to do with this paragraph; most of it should be mentioned already
\begin{comment}
As I recently learned, this guideline apparently did not occupy such a central position from the very beginning of the collaborative encyclopedia's existence.
It rather rose to prominence at a time when, after significant growth, it was no longer manageable to govern the project manually (and, most importantly, to fight vandalism, which grew proportionally with the project).
To counteract vandalism, a number of automated measures were applied.
These, however, also had unforeseen negative consequences: they drove newcomers away~\cite{HalKitRied2011} (quote literature), since their edits were often classified as "vandalism" because they were not familiar with guidelines, wiki syntax, etc.
In an attempt to fix this issue, "Assume good faith" rose to a prominent position among Wikipedia's guidelines.
(Specifically, the page was created on March 3rd, 2004 and originally referred to good faith during edit wars.
An expansion of the page from December 29th, 2004 starts referring to vandalism. https://en.wikipedia.org/w/index.php?title=Wikipedia:Assume_good_faith&oldid=8915036)
\end{comment}
\begin{comment}
\cite{HalGeiMorRied2013}
"formalization of implicit norms into rules, and the embedding of these rules in technologies
%TODO should this be its own section? Or rather a part of next one?
\begin{itemize}
\item Where do I start? (problem statement / initial situation)
\item Identification of the significant problems in the research area under consideration
\item A brief overview of the current state of research in the area, including existing solutions (covered in more detail in the following sections)
\end{itemize}
The aim of the present work is to provide a comprehensive overview of Wikipedia's abuse filter extension, which was later renamed so that its end user facing part is nowadays known as Edit Filter.
\end{comment}
The present work can be embedded in the context of (algorithmic) quality control mechanisms on Wikipedia and, more generally, in the context of algorithmic governance.
%TODO go into algorithmic governance!
There is a whole ecosystem of actors struggling to keep the anyone-can-edit encyclopedia as accurate and as free of malicious, spam and hoax content as possible.
We want to better understand the role of edit filters in the vandal-fighting network of humans, bots, semi-automated tools, and the machine learning framework ORES.
%After all, edit filters were introduced to Wikipedia at a time when bots and semi-automated tools already existed and were involved in quality control: in 2009 (compare timeline, Twinkle's page is from Jan 2007, Huggle's from beginning of 2008; bot's have been around longer, but first records, at least by me so far, of vandal fighting bots come from 2006 ). %TODO: when was the other stuff introduced
Why were filters introduced when other mechanisms already existed?
Moreover, there seems to be a gap in the scientific literature on the subject.
\begin{comment}
\section{Algorithmic Governance}
should be mentioned here;
it's important for framing along with Lessig's "Code is law".
\end{comment}
After all, edit filters were introduced to Wikipedia in 2009, at a time when the majority of the aforementioned mechanisms already existed and were involved in quality control (compare the timeline: Twinkle's page dates from January 2007, Huggle's from the beginning of 2008; bots have been around longer, but the first records of vandal-fighting bots I have found so far date from 2006).
%Why were filters introduced, when other mechanisms existed already?
The aim of this work is to find out why edit filters were introduced on Wikipedia and what role they assume in Wikipedia's quality control ecosystem.
We want to unearth the tasks taken over by filters %in contrast to other quality control mechanisms
and track how these tasks have evolved over time (are there changes in type, number, etc.?).
%and understand how different users of Wikipedia (admins/sysops, regular editors, readers) interact with these and what repercussions the filters have on them.
Last but not least, it is discussed why a classic rule-based system such as the filters is still operational today when more sophisticated machine learning approaches exist.
Since this is only an initial exploration of the features, tasks and repercussions of edit filters, a framework for future research is also offered.
\begin{comment}
%TODO put this somewhere, fun fact
This year the filters have a 10 year anniversary^^
# Motivation
* What is the role of filters among existing (algorithmic) quality-control mechanisms (bots, semi-automated tools, ORES, humans)? Which type of tasks do filters take over? - chapter 4
* How have these tasks evolved over time (are there changes in the type, number, etc.)? - chapter 5
* What are suitable areas of application for rule-based systems such as filters in contrast to the other ML-based approaches? - discussion
Revise Questions from Confluence:
Q1 We wanted to improve our understanding of the role of filters in existing algorithmic quality-control mechanisms (bots, ORES, humans). -- chapter 4 (and 2)
Subquestion (for discussion): Since filters are classical rule-based systems, what are suitable areas of application for such rule-based system in contrast to the other ML-based approaches. -- chapter 6 (discussion)
Q2 Which type of tasks do these filters take over? -- chapter 5
Subquestion: How do these tasks evolve over time (are there changes in the type, number, etc.)? -- chapter 5 (can be significantly expanded)
Important to differentiate between parts that
* report results
* interpret results
Note:
* to answer the question about evolution over time, I really do need the abuse_filter_history table
* modify 3rd question to: why are regexes still there when we have ML; answering it most probably involves talking to people
* check what questions the first bot papers asked, may serve as inspiration
\begin{itemize}
\item What are the goals pursued with this work? Which problem is to be solved?
\item A description of the first ideas, the proposed approach and the results achieved so far
\item A discussion of how the proposed solution differs from existing ones: what is new or better?
\end{itemize}
* Think about: what's the computer science take on the field? How can we design a "better"/more efficient/more user-friendly system? A system that reflects particular values (cf. Code 2.0, Chapter 3, p. 34)?
* GT is good for tackling controversial questions: e.g. is a filter with a disallow action too severe an interference with the editing process, one with far too many negative consequences (e.g. driving away newcomers)?
\section{Algorithmic Governance}
* framing question: why does filter legacy system still exist in times of fancier machine learning tools?
\end{comment}
* should there be some recommended guidelines based on the insights?
* or some design recommendations?
* or maybe just a framework for future research: which questions have we just opened, still cannot answer, and should be addressed by future research?
\cite{GeiHal2017}
Claudia's paper:
"“In both cases of algorithmic governance
– software features and bots – making rules part of the infrastructure, to a certain extent, makes
them harder to change and easier to enforce” (p. 87)"
%TODO rename or get rid of the section (there is a "Methods" chapter with a slightly different focus, but keep these things:
* methodology: what are the sources of knowledge
* literature: what insights have we won from it?
* documentation (Wikipedia, MediaWiki pages): what have we learnt here
* data (filters stats, REGEX patterns): what do the filters actually do?
To this end, we review the academic contributions on Wikipedia's quality control in order to gain a better understanding of the different quality control mechanisms, their tasks, and the challenges they face.
Moreover, we study the documentation of the MediaWiki AbuseFilter extension, as well as guidelines for its use and discussion archives from before its introduction, in order to understand how and why filters were introduced.
Last but not least, we analyse the filters themselves and their log data in order to determine what they actually do.
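For illustration, filter metadata of this kind can be retrieved programmatically: the following Python sketch queries the \texttt{list=abusefilters} module that the AbuseFilter extension adds to the MediaWiki API. This is a minimal example assuming the public English Wikipedia endpoint, not necessarily the exact collection procedure used in this thesis.
\begin{verbatim}
import requests

# Minimal sketch: fetch metadata for the first 50 filters on the English
# Wikipedia via the AbuseFilter API module (parameters as documented for
# list=abusefilters; "pattern" is only returned for public filters).
API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "abusefilters",
    "abfprop": "id|description|pattern|actions|hits",
    "abflimit": 50,
    "format": "json",
}

response = requests.get(API, params=params, timeout=30)
response.raise_for_status()

for f in response.json()["query"]["abusefilters"]:
    print(f["id"], f.get("description"), f.get("hits"))
\end{verbatim}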
\begin{comment}
\begin{itemize}
\item How do I want to achieve my goals? (methodological considerations)
\item Presentation of the research design.
\item Especially for a Master's thesis: how can the achievement of the goals be ``measured''?
\end{itemize}
\end{comment}
%TODO mention the most important findings several times: intro, conclusion, etc.; so that they do not get lost because I cannot see the forest for the trees
The remainder of this thesis is organised as follows:
Chapter~\ref{chap:background} situates the topic in the academic discourse and examines some key notions relevant for the subsequent analysis.
In chapter~\ref{chap:methods}, I discuss the scientific methods employed for the analysis.
Next, I describe the edit filter mechanism in general: how and why it was conceived, how it works and how it can be embedded in Wikipedia's quality control framework.