Move code books to appendix

93b1a5a1 · Lyudmila Vaseva · bf6aee65 · 93b1a5a1 · 93b1a5a1 · 93b1a5a1
Commit 93b1a5a1 authored 5 years ago by Lyudmila Vaseva
--- a/memos/code-book
+++ b/memos/code-book
@@ -4,7 +4,7 @@ The purpose of this document/section is to provide an overview of the labels\foo
 ## A few notes on the labels/labeling process
-I started coding strongly influenced by the coding methodologies applied by Grounded Theory scholars~\cite{Charmaz2006} %TODO add page nums
+I started coding strongly influenced by the coding methodologies applied by Grounded Theory scholars~\cite[42-71]{Charmaz2006} %TODO add page nums
 and mostly let the labels emerge during the process.
 In addition to that, for vandalism related labels, I used some of the vandalism types identified by the community in https://en.wikipedia.org/wiki/Wikipedia:Vandalism_types . %TODO: I need arguments why I haven't taken this 1:1

--- a/thesis/appendix.tex
+++ b/thesis/appendix.tex
@@ -9,8 +9,291 @@
 \chapter{Appendix}
 \label{ch:Appendix}
-\section{Erster Teil Appendix}
+\section{Code book}
-\label{app:first_appendix}
+\label{app:code_book}
+The purpose of this document/section is to provide an overview of the labels\footnote{Here, I use the words "codes"/"tags"/"labels" interchangeably.} used for the manual tagging of edit filters.
+\subsection{A few notes on the labels/labeling process}
+I started coding strongly influenced by the coding methodologies applied by Grounded Theory scholars~\cite[42-71]{Charmaz2006} and mostly let the labels emerge during the process.
+In addition to that, for vandalism related labels, I used some of the vandalism types identified by the community in \url{https://en.wikipedia.org/wiki/Wikipedia:Vandalism_types}. %TODO: I need arguments why I haven't taken this 1:1
+I labeled the dataset twice.
+One motivation therefor was to return to it once I've gained better insight into the data and more detailed understanding of it and use this newly gained knowledge to re-evaluate ambiguous cases, i.e. re-label some data with codes that emerged later in the process.
+Another motivation for this second round of labeling was to ensure at least some intra-coder integrity, since, unfortunately, multiple coders were not available~\cite{LazFenHo2017}. %TODO add page num; I also need to elaborata on methodoly here
+The first time, I looked through the data paying special attention to the name of the filters, the comments, as well as the regular expression pattern constituting the filter and identified one or several possible labels. %TODO reword? I also looked at the comments, name and regex the second time..
+In ambiguous cases, I either labeled the filter with the code which I deemed most appropriate and a question mark, or assigned all possible labels (or both).
+There were also cases for which I could not gather any insight relying on the name, comments and pattern, since the filters were hidden from public view and the name was not descriptive enough.
+However, upon some further reflection, I think it is safe to assume that all hidden filters target a form of (more or less grave) vandalism, since the guidelines suggest that filters should not be hidden unless dealing with cases of persistent and specific vandalism where it could be expected that the vandalising editors will actively look for the filter pattern in their attempts to circumvent the filter. %TODO: quote needed
+Therefore, during the second round of labeling I intend to label all such cases as 'hidden\_vandalism'.
+And then again, there were also cases where I could not determine any suitable label, since I didn't understand the regex pattern or could not press it into a category.
+During the 1st labeling, these were labeled 'unknown', 'unclear' or 'not sure'.
+For the second round, I intend to unify them under 'unknown'.
+For a number of filters, it was particularly difficult to determine whether they were targeting vandalism or good faith edits.
+The only thing that would have distinguished between the two would be the editor's motivation, which we have no way of knowing.
+During the first labeling session, I tended to label such filters with 'vandalism?', 'good\_faith?'.
+For the cross-validation labeling (2nd time), I intend to stick to the "assume good faith" guideline myself %TODO: quote
+and only label as vandalism cases where good faith can definitely be no longer assumed.
+%TODO compare also with revising codes as the analysis goes along according to Grounded Theory
+One characteristic/feature which guided me here is the filter action which represents the judgement of the edit filter manager(s).
+Since communication is crucial when assuming good faith, all ambiguous cases which have "warn" as a filter action, will receive a 'good\_faith' label.
+On the other hand, I will label all filters set to "disallow" as 'vandalism' or a particular type thereof, since the filter action is a clear sign that at least the edit filter managers have decided that seeking a dialog with the offending editor is no longer an option.
+The second time, I labeled the whole data set again, this time using the here quoted compiled code book and assigned to every filter every label I deemed appropriate, without looking at the labels I assigned the first time around.
+I then compared the labels from both coding sessions. %TODO And did what?; how big was the divergence between both coding sessions?; should I select one, most specific label possible? or allow for multiple labels? what should I do with the cases where it's not clear whether it's vandalism or good faith?
+%TODO quote M's methodology book
+%TODO maybe take this out here
+Some of the hidden filters deal with cases of personal attacks.
+These are hidden to protect the persons involved. %TODO quote needed
+%TODO disclose links to 1st and 2nd labelling
+First round of labeling is available under \url{https://github.com/lusy/wikifilters/blob/master/filter-lists/20190106115600_filters-sorted-by-hits-manual-tags.csv}.
+%TODO I actually need a final document where I compare both and decide on final (at least for this work) labeling I rely upon
+Def
+Example <-- examples so far come from the 1st round of labeling
+\subsection{Cluster Vandalism}
+\subsubsection{Structure related}
+'bot\_vandalism'
+  Def: vandalism caused by an automated agent; we know that's what's being targeted because of description in name or notes of the filter
+  Examples: 277 "possible vandalbot"; 276 "scripted anomtalk/spoofed IP vandalism"
+'page\_move\_vandalism'
+  Def: vandalism involving moving a page, mostly to some nonsensical name (Wikipedia typology: "Renaming pages (referred to as "page-moving") to disruptive, irrelevant, or otherwise inappropriate terms.")
+  Examples: 883 "Page moves to bad words or other vandalism"; 334 "Grawp page move vandalism"
+'image\_vandalism'
+  Def: "Uploading shock images that do not belong at all on Wikipedia; Inappropriately placing explicit images legitimately used on Wikipedia on pages where they do not belong"
+  Examples: 952 "Image vandalism IV"; 428 "Image abuse";
+'talk\_page\_vandalism'
+  Def: Malicious activity taking place at talk pages: e.g. modifiyng or removing other users' comments from discussions
+  Examples: 842 "Talk page abuse";
+'template\_vandalism'
+  Def: "Modifying a template in a harmful or disruptive manner. This is especially serious, because it'll negatively impact the appearance of multiple pages. Some templates appear on hundreds of pages." (From Wikipedia Vandalism Typology)
+  Examples: 203 "Template spam from 88.105.0.0/16";
+'link\_vandalism'
+  Def: According to Wikipedia Vandalism Typology: "Modifying internal or external links within a page so that they appear the same in the finished version but link to a page/site that they are not intended to (e.g. spam, self-promotion, an explicit image, a shock site, or some other irrelevant page)
+    Adding external links to non-notable or irrelevant sites
+    Adding spam links
+    Adding external links that may belong on another Wikipedia page, but have no relevance to the subject matter of the page to which they are added"
+  Examples: none sofar, I do have explicit categories for seo and self promotion.. %TODO: do I need this cat? delete?
+'abuse\_of\_tags\_vandalism'
+  Def: not quite sure whether I need the tag; also not quite sure what it is.
+  Only example: 747 "Removal or addition of [[WP:PP-30-500|pp-30-500]] by non-admin"
+'avoidant\_vandalism'
+  Def: According to Wikipedia Vandalism Typology: "Removal of tags such as {{afd}} and {{copyvio}} in order to conceal deletion candidates or avert deletion of such content. (This does NOT avert deletion. This actually increases the chance that the article will be deleted.); Removal of a {{speedy deletion}} tag from an article one created him/herself. Only the {{hangon}} tag can be placed there by the creator to avert deletion.; Removal of recent warnings from one's own user talk page of vandalism or other serious violations"
+  Examples: not satisfied with the one thing a dubbed "avoidant\_vandalism?" so far.
+'username\_vandalism'
+  Def: According to Wikipedia Vandalism Typology: "Creating accounts with usernames that contain deliberately offensive or disruptive terms is considered vandalism, whether the account is used or not. For Wikipedia's policy on what is considered inappropriate for a username, see Wikipedia:Username policy. See also Wikipedia:Sock puppet." (although they call this "Malicious account creation "); in theory there shouldn't be very many filters of that sort, since there is a username blacklist which would be the more appropriate mechanism to take care of this.
+  Examples: 827 "Abusive username activity" (unfortunately hidden, so we don't know what the activity is)
+\subsubsection{Content related vandalism}
+'silly\_vandalism'
+  Def: blatant, immediately obvious vandalism, such as inserting repeating characters or other intentional nonsence, such as "Baby carrots are yummy in my tummy." (Edit on the Veganism-Page); obscenities? %TODO where do we put obscenities? and stuff like ALL CAPS?
+  Examples: 338 "Vuvuzela vandalism", 135 "Repeating characters"
+'trolling'
+  Def: "Trolling" is explicitely referenced in the filter name %TODO look for additional def
+  According to \url{https://en.wikipedia.org/w/index.php?title=Internet_troll&oldid=902578463} :
+  "In Internet slang, a troll is a person who starts quarrels or upsets people on the Internet to distract and sow discord by posting inflammatory and digressive,[1] extraneous, or off-topic messages in an online community (such as a newsgroup, forum, chat room, or blog) with the intent of provoking readers into displaying emotional responses[2] and normalizing tangential discussion,[3] whether for the troll's amusement or a specific gain. "
+  Examples: 896 "ANI trolling", 615 "Reference desk trolling"
+'hoaxing'
+  Def: deliberately inserting false information (From Wikipedia typology: "Adding plausible misinformation to articles; Use of fictitious references")
+  Examples: ?
+'prank'
+  Def: We probably don't need this, see below for the only filter in this category
+  Examples: 396 "Don't delete the main page" (which was never tripped by the way^^)
+'profanity\_vandalism'
+  Def: included during 2nd labeling for marking filters dealing with inserting profanities into articles in general, without them being targeted at a person (that is the difference to 'personal\_attacks')
+\subsubsection{Politically motivated vandalism}
+'religious\_vandalism'
+Def: Disruptions on topics related to religion
+Examples: 131 "Removal of controversial images" (see content; however this could fall under "image\_vandalism" as well)
+'politically\_motivated'
+Def: Disruptions on explicitely politic matters
+Examples: 154 "Macedonia naming conflict 2"; 19 "Replacement of "partition of India" with "independence of Pakistan""
+\subsubsection{Hardcore vandalism (the really malicious cases)}
+'sockpuppetry'
+  Def: Filter contains "sock", "sockpuppets", "sockpuppetry" or similar in their name ('af\_public\_comments') or maybe notes ("af\_comments"); expected to be mostly hidden filters (which may have been made public upon deletion or being disabled for example)
+  Sockpuppetry is often long term abuse, aber not necessarily all long term abuse involves sock puppetry
+  Examples: 16 "Prolific socker I"; 114 "sleeper socks";
+'long\_term\_abuse'
+  Def: Filters that had "Long term abuse" or "LTA" or similar in their name ('af\_public\_comments'); expected to be mostly hidden filters
+  Example: 51 "LTA Username / LTA IP hopping disruption (Oshwah)"; 937 "Qwertywander long-term abuse";
+'abuse'
+  Def: Filter contains "abuse", "abusive" or similar in its name; <-- do we really need the category
+'harassment'
+  Def: Filter contains "harassment" in their name/comments
+  Examples: 792 "Harassment"; 330 "Attacks on editors";
+'doxxing'
+  Def: Disclosing private information of other people (e.g. address, contact details, details about their life not know to the public) without their consent; Often with the purpose to facilitate organised harassment
+  (According to \url{https://en.wikipedia.org/w/index.php?title=Doxing&oldid=902687406} : "Doxing (from dox, abbreviation of documents)[1] or doxxing[2][3] is the Internet-based practice of researching and broadcasting private or identifying information (especially personally identifying information) about an individual or organization")
+  Examples: 120 "Real life info" (not quite sure though, since filter is hidden)
+Note: according to Wikipedia this behaviour constitutes harassment: "Posting another editor's personal information is harassment, unless that person has voluntarily posted their own information, or links to such information, on Wikipedia. Personal information includes legal name, date of birth, identification numbers, home or workplace address, job title and work organisation, telephone number, email address, other contact information, or photograph, whether such information is accurate or not. " (\url{https://en.wikipedia.org/w/index.php?title=Wikipedia:Harassment&oldid=902881999})
+'personal\_attacks'
+  Def: what is the difference between this and harassment? Maybe use harassment only for cases explicitely worded as such? If we cannot find sufficient justification for having both labels, merge! %maybe put all the obscenities and swear words here; but what if it is not targeted towards a user?
+  Examples: 299 "Personal attacks"; 693 "Drake Bell attack";
+'impersonation'
+  Def: Labels filters that target cases where an editor is trying to pose as another editor. Mostly "impersonation" is metioned in the filter name/comments
+  Examples: 568 "SPI Clerk impersonation";
+'not\_polite'
+  Def: Interaction with others turning non-civil without becoming directly a personal attack? Do we really need this tag if we'll only label one filter with it?
+  Examples: 521 "Feedback: All caps" (single example)
+\subsubsection{General vandalism}
+'general vandalism'
+ Def: vandalism for which none of the more specific tags applied
+ Example:
+'hidden\_vandalism'
+ Def: Tag for hidden filters where a more specific tag could not be determined
+ Example:
+\subsection{Spam/malware/etc.}
+'spam'
+  Def: There is a "Spam" type of vandalism in the Wikipedia Vandalism Typology. However, I've got the feeling that I'm mostly labeling the cases listed there as "self promotion" or similar (although maybe not; This is the def: "    Adding text to any page that promotes an interest that benefits the user, except in user space in a manner allowable under Wikipedia's guidelines
+    Adding external links to site(s) that promote an interest from which the user benefits
+    Adding external links to site(s) that have ads from which the user benefits, even if the site has information relevant to the article");
+  I've so far labeled "spam" foremost filters which contain the word in their name
+  Examples: 862 "Arabic string spam";  523 "Page creation spammer";
+'phishing'
+  Def: Probably stuff that had "phishing" in their name
+  Examples: 870 "nowiki phishing" <- only instance
+'malware'
+  Def: Malware is explicitely mentioned in the filter's name
+  Examples: 243 "WikiMedia Viewer possible malware"; 429 "Possible malware attack" <-- only two instances
+\subsection{Disruptive editing which is not outright vandalism}
+'copyright violation'
+  Def: Filters targeting potential copyright violations: e.g. images without license information, ..
+  Examples: 798 "Possible copyvio for image upload"; 278 "Possible copyright violations"
+'bad\_style'
+  Def: Filters targeting edits deviating from what is percieved a good encyclopedic style (def?)
+  Examples: 899 "Adding "The Sun" or "Dailystar" to BLPs" (presumably, bc they are unreliable sources;); 491 "Edits ending with emoticons or !"; 253 "Signing a non-discussion page"
+'lazyness'
+  Def: Slacking on orthography norms (this is something that shouldn't be handled by filters, according to guidelines); smth else?
+  Examples: 18 "Test type edits from clicking on edit bar" (not quite sure why I have labeled this 'lazyness'); 292 "Repeating characters in edit summary"; 432 "Starting new line with lowercase letters"
+'edit\_warring'
+  Def: Filters targeting edits that revert each other
+  Examples: 622 "Genre edit-warring"; 419 "User removing himself from AIV" (first labeling, I would actually simply label this 'vandalism' upon second inspection)
+'wiki\_policy'
+  Def: Filters target edits violating Wikipedia's policies
+  Examples: 930 "Prevent indexing userspaces by newer users"; 272 "Page author removing CSD tags"
+'guideline\_vio'
+  Def: Filters target edits violating Wikipedia's guidelines %TODO do we need this one and the previous? I would rather merge them.
+  Examples: 55 "Signing articles" (which is also labeled 'bad\_style'); it is also the only filter with a 'guideline\_vio' label from the 1st round of labeling
+\subsubsection{SEO/COI/POV problems}
+'biased\_pov'
+  Def: Hm.. I have the feeling all the filters here should be relabeled..
+  Examples: 148 "Users creating autobiographies" (which should be rather "self\_promotion", I think); 894 "Possible Self-Published Sources" (maybe 'self\_promotion' is also more suitable here?); 878 "New user removing COI template"
+'conflict\_of\_interest'
+  Def: Filter targets people editing articles about themselves or organisations they are affilitated to or receive money from. (Compare 'stockbrocker\_vandalism', not sure whether we need both)
+  Examples: same as "stockbrocker\_vandalism", so we are merging them
+'stockbrocker\_vandalism'
+  Def: not quite sure how this label emerged, it does not seem to be one of the Vandalism Types in the Wikipedia Vandalism Typology %TODO: merge with 'conflict of interest'
+  Examples: 302 "Possible COI"; 588 "Promotional usernames"
+'self\_promotion'
+  Def: specifically promoting one-self, it is kind of part of the 'conflict\_of\_interest'
+  Examples: 214 "Creating articles with title contained in username" (this is actually one of the 3 filters with this as a label candidate, so I think we can savely merge it with 'conflict\_of\_interest' without significantly losing facettes)
+'seo'
+  Def: Filters targeting SEO edits (mostly, explicitely mentioned in the filter name)
+  Examples: 36 "SEO push University of Atlanta"; 682 "SEO/Attack page"; 554 "top100 blog charts" (bc of this and the Daily Mail sources, I am contemplating creating a 'unreliable\_sources' label)
+\subsection{Good faith}
+'good\_faith'
+  Def: In ambigous cases, e.g. editor blanking sections, we assume good faith as long as there are not any indicators to the contrary. One such indicator would be the filter action: filters set to "warn" try to communicate with the editors, point out potential pitfalls to them and give them the opportunity to update and publish the edit (or publish the edit regardles, if they think all is good). Filters set to "disallow" on the other hand, do not seek to guide an editor but rather protect the encyclopedia from harmful content.
+  Examples: 180	"Large unwikified new article"; 98 "Creating very short new article"; 657 "Adding an external link to a disambiguation page" (used to be labeled 'good\_faith?', but since actions are "warn,tag", according to my newly defined guidelines, this is a good\_faith filter)
+TODO: label cases additionally with scope/area the filter is targeting
+'good\_faith\_refs'
+'good\_faith\_deletion'
+'good\_faith\_orthography'
+'good\_faith\_article\_creation'
+'good\_faith\_external\_resources'
+'good\_faith\_template'
+'good\_faith\_wiki\_syntax'
+'good\_faith\_test\_edits'
+Introducing because of filter 18 "Test type edits from clicking on edit bar"
+'good\_faith\_edit\_summary'
+'good\_faith\_revert'
+'good\_faith\_wiki\_ĺinks'
+'good\_faith\_user\_page'
+'good\_faith\_redirect'
+\subsection{Maintenance}
+'bug'
+  Def: Filters targeting software bugs from MediaWiki, browser extensions, etc which sometimes cause eroneou syntax
+  Examples: 577 "VisualEditor bugs: Strange icons"; 606 "ANI restoration bug";
+'test'
+  Def: Various test filters (of single edit filter managers or jointly used)
+  Examples: 398 "Test filter 398"; 358 "Od Mishehu's test filter"; 424 "Repeatedly blocked user --  testing-only rule for filter 425"
+'general\_maintenance' (used to be 'maintenance' upon 1st labeling)
+  Def: Filters taking care of other maintenance tasks (It looks like, I will have problems to distinguish between this one and 'general\_tracking')
+  Examples: 728 "Huggle"; 942 "Log edits to protected pages"; 199 "Unflagged Bots"
+\subsection{Unknown}
+'unknown'
+  Def: Cannot determine at all what the filter is doing (but hidden filters with no clear names should be labeled "hidden\_vandalism", since it's pretty clear they target vandalism)
+  Examples: various; as far as I can see though, all of them are getting re-labeled to "hidden\_vandalism"
+'misc'
+  Def: Cannot fit the filter into any category (it is not that functionality is unclear but I couldn't think of a suitable label)
+  Examples: 388 "Unusual redirect III" (which is hidden, so according to new def, should be re-labeled "hidden\_vandalism"), 688 "Beals" (same); 708 "SPI page moves"; 152 "External links with referal tags"
+'unclear'
+  Def: I'd say that is similar to misc and both should be merged
+  Examples: 362 "New user creating page", 300 "Cross-posting"
+\subsection{Contemplating to introduce}
+'general\_tracking'
+  Def: There are various filters introduced with the aim to track certain behaviour in order to determin whether it occurs frequently and how problematic it is
+  Examples: 362 "New user creating page" would fit better in here I think
 \section{Zweiter Teil Appendix}
 \label{app:second_appendix}

--- a/thesis/thesis_main.tex
+++ b/thesis/thesis_main.tex
@@ -136,7 +136,7 @@ introduction,
 5-Overview-EN-Wiki,
 6-Discussion,
 conclusion,
-%appendix
+appendix
 }
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -243,6 +243,6 @@ conclusion,
 %----- Appendix
 %---------------------------------------------------
 \backmatter
-%\include{appendix}
+\include{appendix}
 \end{document}