% ---------------------------------------------------
% ----- Introduction of the template
% ----- for Bachelor-, Master thesis and class papers
% ---------------------------------------------------
% Created by C. Müller-Birn on 2012-08-17, CC-BY-SA 3.0.
% Last update: C. Müller-Birn 2015-11-27
% Freie Universität Berlin, Institute of Computer Science, Human Centered Computing.
%
\chapter{Introduction}
\label{chap:introduction}
``Code 2.0 TO WIKIPEDIA, THE ONE SURPRISE THAT TEACHES MORE THAN EVERYTHING HERE.'' reads one of the inscriptions of Lawrence Lessig's ``Code Version 2.0'' (p.v)~\cite{Lessig2006}.
And although I'm not quite sure what exactly Lessig meant by this regarding the update of his famous book, I readily agree that Wikipedia is important because it teaches us stuff.
Not only in the literal sense, because it is, well, an encyclopedia.
Because Wikipedia is an open encyclopedia that has grown into one of the biggest open collaborative projects in the world, studying its complex governance, community building and algorithmic systems can teach us a lot about other, less open systems.
%TODO verify all these claims and numbers
Since its conception in 2001, when nobody believed it was ever going to be a serious encyclopedia, the project has grown steadily.
By around 2006 at the latest, with the exponential surge in the number of users and edits, the community began to realise that it needed a more automated quality control process.
The same year, the first anti-vandal bots were introduced, followed by semi-automated tools such as Twinkle (in 2007) and Huggle (at the beginning of 2008).
In 2009, yet another mechanism was introduced.
Its core developer, Andrew Garrett, known on Wikipedia as User:Werdna, called it ``abuse filter''. According to The Signpost, English Wikipedia's newspaper, its purpose was to ``allow[] all edits to be checked against automatic filters and heuristics, which can be set up to look for patterns of vandalism including page move vandalism and juvenile-type vandalism, as well as common newbie mistakes''.
%TODO decide whether to cite the Signpost here already, since it appears again in chapter4
\begin{comment}
Don't make it a separate subsection, but use it to introduce the topic with a story, the way Geiger does.
If the genesis doesn't make sense here, move it to Edit filters
Nice quote:
The Wikipedia Revolution: How A Bunch of Nobodies Created The World's Greatest Encyclopedia is a 2009 popular history book by new media researcher and writer Andrew Lih.
\cite{Tkacz2014}
``As historical artifacts, encyclopedias have regularly offered great insight into the periods in which they were written. They tell us about what constitutes knowledge at a particular time as well as how the various bodies of knowledge were thought to relate to one another.'' (p.4)
%************************************************************************
\section{Subject and Context}
%TODO should this be its own section? Or rather a part of next one?
\begin{itemize}
\item Where do I start? (Problem statement / initial situation)
\item Identification of the significant problems in the research area under consideration
\item A short overview of the current state of research in the area, including existing solutions (in more detail in the following sections)
\end{itemize}
%************************************************************************
The aim of the present work is to provide a comprehensive overview of Wikipedia's abuse filter extension, whose user-facing part was later renamed and is nowadays known as ``Edit Filter''.
This inquiry can be embedded in the context of (algorithmic) quality-control mechanisms on Wikipedia.
There is a whole ecosystem of actors struggling to keep the anyone-can-edit encyclopedia as good and as free of malicious content, spam and hoaxes as possible.
We want to better understand the role of edit filters in the vandal-fighting network of humans, bots, semi-automated tools, and the machine learning framework ORES.
%After all, edit filters were introduced to Wikipedia at a time when bots and semi-automated tools already existed and were involved in quality control: in 2009 (compare timeline, Twinkle's page is from Jan 2007, Huggle's from beginning of 2008; bot's have been around longer, but first records, at least by me so far, of vandal fighting bots come from 2006 ). %TODO: when was the other stuff introduced
Why were filters introduced when other mechanisms already existed?
Moreover, there seems to be a gap in the scientific literature on the subject.
\section{Aims of this work}
%alt title: \section{Intended Contributions}
The aim of this work is to find out why edit filters were introduced on Wikipedia and how they fit into Wikipedia's quality control ecosystem.
More precisely, we want to unearth the tasks taken over by filters in contrast to other quality control mechanisms, and to understand how different users of Wikipedia (admins/sysops, regular editors, readers) interact with the filters and what repercussions the filters have on them.
To this end, we study the academic contributions on Wikipedia's quality control mechanisms and give a descriptive overview of the adoption process as well as the current state of edit filters on English Wikipedia.
\begin{itemize}
    \item What is the role of filters among existing (algorithmic) quality-control mechanisms (bots, semi-automated tools, ORES, humans)? Which types of tasks do filters take over? (chapter 4)
    \item How have these tasks evolved over time (are there changes in the type, number, etc.)? (chapter 5)
    \item What are suitable areas of application for rule-based systems such as filters in contrast to the other ML-based approaches? (discussion)
\end{itemize}
Questions from Confluence
Q1 We want to improve our understanding of the role of filters among existing algorithmic quality-control mechanisms (bots, ORES, humans).
Q2 Which types of tasks do these filters take over in comparison to the other mechanisms? How do these tasks evolve over time (are there changes in the type, number, etc.)?
Q3 Since filters are classical rule-based systems, what are suitable areas of application for such rule-based systems in contrast to the other ML-based approaches?
Note:
* to answer the question about evolution over time, I really do need the \texttt{abuse\_filter\_history} table
* modify 3rd question to: why are regexes still there when we have ML; answering it most probably involves talking to people
* check what questions the first bot papers asked, may serve as inspiration
\begin{itemize}
\item What are the goals pursued with this work? Which problem is to be solved?
\item A description of the first ideas, the proposed approach and the results achieved so far
\item A description of the contribution this work makes towards solving the problem presented
\item A discussion of how the proposed solution differs from existing ones: what is new or better?
\end{itemize}
* Think about: what's the computer science take on the field? How can we design a ``better''/more efficient/more user-friendly system? A system that reflects particular values (cf. Code 2.0, Chapter 3, p.34)?
* GT is good for tackling controversial questions: e.g. are filters with the disallow action too severe an interference with the editing process, with too many negative consequences (e.g. driving away newcomers)?
* framing question: why does the legacy filter system still exist in times of fancier machine learning tools?
%TODO mention the most important findings several times (intro, conclusion, etc.) so that they don't get lost because I can't see the forest for the trees
* where is the thesis going?
* should there be some recommended guidelines based on the insights?
* or some design recommendations?
* or maybe just a framework for future research: which questions have we opened up that we still cannot answer and that future research should address?
%************************************************************
\section{Methods}
\begin{itemize}
\item How do I intend to reach my goals? (Methodological considerations)
\item Presentation of the research design.
\item In particular for a Master's thesis: how can the achievement of the goals be ``measured''?
\end{itemize}
* methodology: what are the sources of knowledge
* literature: what insights have we gained from it?
* documentation (Wikipedia, MediaWiki pages): what have we learnt here?
* data (filters stats, REGEX patterns): what do the filters actually do?
The remaining part of this thesis is organised in the following manner:
Chapter~\ref{chap:background} situates the topic in the academic discourse and examines some key notions relevant for the subsequent analysis.
In chapter~\ref{chap:methods}, I discuss the scientific methods that helped me to accomplish the analysis.
Next, I describe the edit filter mechanism in general: how and why it was conceived, how it works and how it can be embedded in Wikipedia's quality control framework.
A detailed analysis of the current state of all implemented edit filters on English Wikipedia is presented in chapter~\ref{chap:overview-en-wiki}.
We discuss the findings and the limitations of the inquiry in chapter~\ref{chap:discussion}.
Finally, the conclusion wraps up the analysis and gives directions for possible future investigations.