Reflect on open questions

Commit 43447d9e authored 6 years ago by Lyudmila Vaseva
--- a/article/proceedings.tex
+++ b/article/proceedings.tex
@@ -657,9 +657,9 @@ So far, I haven't managed to trigger a filter with a different action.
 \begin{itemize}
    \item how many filters are there (were there over the years): 954 filters (stand: 06.01.2019); TODO: historically?
    \item what do the most active filters do?: see~\ref{tab:most-active-actions}
-    \item get a sense of what gets filtered (more qualitative): TODO: refine after sorting through manual categories; preliminary: vandalism; unintentional suboptimal behavior from new users who don't know better ("good faith edits") such as blanking an article/section; creating an article without categories; adding larger texts without references; large unwikified new article (180); or from users who are too lazy (to write proper edit summaries; editing behaviours and styles not suitable for an encyclopedia (poor grammar/not commiting to orthography norms; use of emoticons and !; ascii art?); "unexplained removal of sourced content" (636) may be an attempt to silence a view point the editor doesn't like; self-promotion(adding unreferenced material to BLP; "users creating autobiographies" 148;); harassment; sockpuppetry; potential copyright violations
+    \item get a sense of what gets filtered (more qualitative): TODO: refine after sorting through manual categories; preliminary: vandalism; unintentional suboptimal behavior from new users who don't know better ("good faith edits") such as blanking an article/section; creating an article without categories; adding larger texts without references; large unwikified new article (180); or from users who are too lazy (to write proper edit summaries; editing behaviours and styles not suitable for an encyclopedia (poor grammar/not commiting to orthography norms; use of emoticons and !; ascii art?); "unexplained removal of sourced content" (636) may be an attempt to silence a view point the editor doesn't like; self-promotion(adding unreferenced material to BLP; "users creating autobiographies" 148;); harassment; sockpuppetry; potential copyright violations; that's more or less it actually. There's a third bigger cluster of maintenance stuff, such as tracking bugs or other problems, trying to sort through bot edits and such. For further details see the jupyter notebook.
    \item has the willingness of the community to use filters increased over time?: looking at aggregated values of number of triggered filters per year, the answer is rather it's quite constant; TODO: plot it at a finer granularity
-    \item how often were (which) filters triggered: see \url{filter-lists/20190106115600_filters-sorted-by-hits.csv} and~\ref{tab:most-active-actions}; TODO aggregate hitcounts over tagged categories after finished tagging
+    \item how often were (which) filters triggered: see \url{filter-lists/20190106115600_filters-sorted-by-hits.csv} and~\ref{tab:most-active-actions}; see also jupyter notebook for aggregated hitcounts over tagged categories
    \item percentage of triggered filters/all edits; break down triggered filters according to typology: TODO still need the complete abuse\_filter\_log table!; and probably further dumps in order to know total number of edits
    \item percentage filters of different types over the years: TODO according to actions (I need a complete abuse\_filter\_log table for this!); according to self-assigned tags (finish tagging!)
    \item what gets classified as vandalism? has this changed over time? TODO: (look at words and patterns triggered by the vandalism filters; read vandalism policy page); pay special attention to filters labeled as vandalism by the edit filter editors (i.e. in the public description) vs these I labeled as vandalism
@@ -678,7 +678,7 @@ So far, I haven't managed to trigger a filter with a different action.
    \item what are the values in the "group" column? what do they mean?
    \item which are the most frequently triggered filters of all time?
    \item is it new filters that get triggered most frequently? or are there also very active old ones?
-    \item how many different edit filter editros are there (af\_user)?
+    \item how many different edit filter editors are there (af\_user)?
    \item categorise filters according to which name spaces they apply to; pay special attention to edits in user/talks name spaces (may be indication of filtering harassment)
 \end{itemize}


--- a/src/explore.ipynb
+++ b/src/explore.ipynb
@@ -558,6 +558,31 @@
    "print (raw_df['af_user_text'].value_counts())"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Vandalism\n",
+    "\n",
+    "We may be interested in how the notion of vandalism changed over the years. For this an inquiry into which filters have \"vandalism\" in their public description (and were tagged as \"vandalism\") and what they do may be interesting."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Potential harassment\n",
+    "\n",
+    "Another idea would be to classify filters according to the namespaces they cover. A filter targeting the talk/user name spaces may be indicative of dealing with personal attacks or harassment."
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},

 %% Cell type:markdown id: tags:

 # An explorative inquiry into EN Wikipedia's edit filter system

 This notebook serves to explore EN Wikipedia's edit filters

 %% Cell type:code id: tags:

 ``` python
 import pandas as pd
 import itertools
 import collections
 ```

 %% Cell type:markdown id: tags:

 We import a cleaned version of manually annotated edit filters:

 %% Cell type:code id: tags:

 ``` python
 df = pd.read_csv("20190106115600_filters-sorted-by-hits-manual-tags.csv", sep='\t')
 ```

 %% Cell type:markdown id: tags:

 ## General stats

 %% Cell type:code id: tags:

 ``` python
 # Number of filters
 len(df)
 ```

 %% Output

    954

 %% Cell type:code id: tags:

 ``` python
 # Active (enabled) filters
 print (len(df.query('af_enabled==1')))

 # Disabled filters
 print (len(df.query('af_enabled==0')))

 # Deleted filters
 print (len(df.query('af_deleted==1')))

 # Active public filters
 print (len(df.query('af_hidden==0 and af_enabled==1')))

 # Deleted and enabled
 print (len(df.query('af_deleted==1 and af_enabled==1')))
 ```

 %% Output

    201
    753
    600
    110
    0

 %% Cell type:code id: tags:

 ``` python
 # hidden filters
 print (len(df.query('af_hidden==1')))

 # active hidden filters
 print (len(df.query('af_hidden==1 and af_enabled==1')))
 ```

 %% Output

    593
    91

 %% Cell type:code id: tags:

 ``` python
 # global filters
 print (len(df.query('af_global==0')))
 ```

 %% Output

    954

 %% Cell type:code id: tags:

 ``` python
 # throttled
 print (len(df.query('af_throttled==0')))

 print (len(df.query('af_throttled==1')))
 ```

 %% Output

    948
    6

 %% Cell type:code id: tags:

 ``` python
 # group
 print (len(df.query('af_group=="default"')))
 print (df.query('af_group!="default"'))

 # --> so available groups are "default" and "feedback"
 ```

 %% Output

    947
         Unnamed: 0  af_id  af_hidden  af_global  af_enabled  af_deleted  \
    168         168    497          0          0           0           1
    173         173    494          0          0           0           1
    174         174    502          0          0           0           1
    187         187    495          0          0           0           1
    190         190    496          0          0           0           1
    227         227    475          0          0           0           1
    349         349    461          0          0           0           1
    
         af_throttled  af_group    af_timestamp af_actions  af_hit_count  \
    168             0  feedback  20130108151106   disallow          3660
    173             0  feedback  20130108151035   disallow          3325
    174             0  feedback  20130424011002   disallow          3280
    187             0  feedback  20130108151045   disallow          2697
    190             0  feedback  20130108151054   disallow          2658
    227             0  feedback  20131003210159        NaN          1390
    349             0  feedback  20130411173111   disallow           283
    
                      af_public_comments                          manual_tags  \
    168     Feedback: Common Vandalism 5               vandalism, harassment?
    173     Feedback: Common Vandalism 2              vandalism?, harassment?
    174   Feedback: Extremely long words  vandalism?, good_faith?, bad_style?
    187     Feedback: Common Vandalism 3               vandalism, harassment?
    190     Feedback: Common Vandalism 4               vandalism, harassment?
    227     Feedback: Vandalism or libel                vandalism, harassment
    349  Feedback: Vandalism in all caps               vandalism, harassment?
    
                                            notes
    168  deleted; “Merged back into 460. --mlitn”
    173  deleted; “Merged back into 460. --mlitn”
    174                                   deleted
    187  deleted; “Merged back into 460. --mlitn”
    190  deleted; “Merged back into 460. --mlitn”
    227                                   deleted
    349                                       NaN

 %% Cell type:markdown id: tags:

 ## Helper functions

 %% Cell type:code id: tags:

 ``` python
 flatten = lambda x: list(itertools.chain.from_iterable(x))
 ```

 %% Cell type:markdown id: tags:

 ## Edit filter actions

 %% Cell type:code id: tags:

 ``` python
 actions = df['af_actions'].fillna('')
 actions_list = [x.split(",") for x in list(actions)]
 all_actions = flatten(actions_list)

 print(collections.Counter(all_actions).most_common())
 ```

 %% Output

    [('', 413), ('disallow', 406), ('warn', 122), ('tag', 70), ('throttle', 52), ('blockautopromote', 4)]

 %% Cell type:code id: tags:

 ``` python
 # What are the actions of active hidden filters
 active_hidden = df.query('af_hidden==1 and af_enabled==1')
 print(collections.Counter(list(active_hidden['af_actions'].fillna(''))).most_common())
 ```

 %% Output

    [('disallow', 51), ('', 19), ('throttle,disallow', 7), ('throttle', 4), ('tag', 3), ('warn,tag', 2), ('throttle,warn', 2), ('warn', 1), ('disallow,tag', 1), ('warn,disallow', 1)]

 %% Cell type:code id: tags:

 ``` python
 # What are the actions of active public filters
 active_public = df.query('af_hidden==0 and af_enabled==1')
 print(collections.Counter(list(active_public['af_actions'].fillna(''))).most_common())
 ```

 %% Output

    [('tag', 25), ('warn,tag', 25), ('disallow', 22), ('', 20), ('warn', 12), ('throttle,tag', 2), ('warn,disallow', 2), ('throttle,warn,tag', 1), ('throttle,disallow', 1)]

 %% Cell type:markdown id: tags:

 ## Explore Manual Tags

 %% Cell type:code id: tags:

 ``` python
 manual_tags = df['manual_tags']
 manual_tags_list = [x.split(", ") for x in list(manual_tags)]
 all_tags = flatten(manual_tags_list)

 print(collections.Counter(all_tags).most_common())
 ```

 %% Output

    [('vandalism', 263), ('vandalism?', 162), ('unknown', 71), ('good_faith?', 63), ('misc', 59), ('sockpuppetry', 59), ('good_faith', 48), ('test', 43), ('spam?', 41), ('long_term_abuse', 35), ('sockpuppetry?', 35), ('harassment?', 31), ('harassment', 24), ('abuse?', 21), ('biased_pov', 17), ('spam', 17), ('biased_pov?', 15), ('unclear', 14), ('bad_style', 13), ('bad_style?', 12), ('bug?', 10), ('wiki_policy?', 9), ('long_term_abuse?', 9), ('misc?', 8), ('seo', 8), ('politically_motivated?', 8), ('maintenance', 7), ('trolling?', 7), ('maintenance?', 6), ('personal_attacks', 6), ('bug', 5), ('vandalbot', 5), ('page_move_vandalism', 5), ('silly_vandalism', 5), ('lazyness', 4), ('seo?', 4), ('test?', 4), ('hoaxing?', 4), ('personal_attacks?', 4), ('edit_warring?', 3), ('copyright', 3), ('image_vandalism', 3), ('talk_page_vandalism', 3), ('page_move_vandalism?', 3), ('conflict_of_interest', 3), ('stockbrocker_vandalism', 3), ('copyright?', 2), ('vandalbot?', 2), ('religious_vandalism?', 2), ('politically_motivated', 2), ('self_promotion?', 2), ('template_spam', 2), ('hoaxing', 2), ('silly_vandalism?', 2), ('doxxing?', 2), ('not_polite', 1), ('template_vandalism', 1), ('religious_vandalism', 1), ('self_promotion', 1), ('abuse', 1), ('template_vandalism?', 1), ('link_vandalism?', 1), ('abuse_of_tags_vandalism?', 1), ('avoidant_vandalism', 1), ('guideline_vio?', 1), ('username_vandalism?', 1), ('phishing?', 1), ('avoidant_vandalism?', 1), ('malware?', 1), ('malware', 1), ('conflict_of_interest?', 1), ('impersonation', 1), ('prank', 1)]

 %% Cell type:markdown id: tags:

 ('vandalism', 263),
 ('vandalism?', 162),
  ('spam?', 41),
  ('spam', 17),
  ('vandalbot', 5),
  ('vandalbot?', 2),
  ('page_move_vandalism', 5),
  ('page_move_vandalism?', 3),
  ('silly_vandalism', 5),
  ('silly_vandalism?', 2),
  ('trolling?', 7),
  ('hoaxing?', 4),
  ('hoaxing', 2),
  ('copyright', 3),
  ('copyright?', 2),
  ('image_vandalism', 3),
  ('talk_page_vandalism', 3),
  ('template_vandalism?', 1),
  ('template_vandalism', 1),
  ('template_spam', 2),
  ('link_vandalism?', 1),
  ('abuse_of_tags_vandalism?', 1),
  ('avoidant_vandalism', 1),
  ('avoidant_vandalism?', 1),
  ('username_vandalism?', 1),

 ('prank', 1)

 ('phishing?', 1),
 ('malware?', 1),
 ('malware', 1),

 ('guideline_vio?', 1),

 ('religious_vandalism?', 3),
 ('politically_motivated?', 8),
 ('politically_motivated', 2),

 ('sockpuppetry', 59),
 ('sockpuppetry?', 35),
 ('long_term_abuse', 35),
 ('long_term_abuse?', 9),
 ('abuse', 1),
 ('abuse?', 21),
 ('harassment?', 31),
 ('harassment', 24),
 ('doxxing?', 2),
 ('personal_attacks', 6),
 ('personal_attacks?', 4),
 ('impersonation', 1),
 ('not_polite', 1),

 ('biased_pov', 17),
 ('biased_pov?', 15),

 ('conflict_of_interest', 3),
 ('stockbrocker_vandalism', 3),
 ('self_promotion?', 2),
 ('conflict_of_interest?', 1),
 ('self_promotion', 1),

 ('seo', 8),
 ('seo?', 4),

 ('bad_style', 13),
 ('bad_style?', 12),
 ('edit_warring?', 3),

 ('good_faith?', 63),
 ('good_faith', 48),

 ('lazyness', 4),

 ('maintenance', 7),
 ('maintenance?', 5),
 ('maintenance? ', 1),

 ('bug', 5),
 ('bug?', 10),
 ('wiki_policy?', 9),

 ('test', 43),
 ('test?', 4),

 ('unknown', 71),
 ('misc', 59),
 ('misc?', 8),
 ('unclear', 14),
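The grouping above is done by hand. A minimal sketch of how hit counts could be aggregated over tags is given below. It assumes that a trailing `?` only marks uncertainty and can be folded into the base tag, and that stray whitespace (as in `'maintenance? '`) should be stripped. The four rows are copied from the "feedback" group output earlier in the notebook; `normalise_tag` and `hits_per_tag` are hypothetical helper names, not part of the notebook.

```python
import collections

import pandas as pd

# Toy stand-in for the annotated filter table; the notebook loads the full table from CSV.
df = pd.DataFrame({
    "af_id": [497, 494, 502, 461],
    "af_hit_count": [3660, 3325, 3280, 283],
    "manual_tags": [
        "vandalism, harassment?",
        "vandalism?, harassment?",
        "vandalism?, good_faith?, bad_style?",
        "vandalism, harassment?",
    ],
})


def normalise_tag(tag):
    """Strip surrounding whitespace and the trailing '?' uncertainty marker."""
    return tag.strip().rstrip("?").strip()


def hits_per_tag(frame):
    """Sum af_hit_count over every normalised tag a filter carries."""
    totals = collections.Counter()
    for tags, hits in zip(frame["manual_tags"].fillna(""), frame["af_hit_count"]):
        # A set avoids double counting when 'x' and 'x?' appear on the same filter.
        for tag in {normalise_tag(t) for t in tags.split(",") if t.strip()}:
            totals[tag] += int(hits)
    return totals


tag_hits = hits_per_tag(df)
print(tag_hits.most_common())
```

A filter with several tags contributes its hits to each of them, so the per-tag totals overlap and do not sum to the overall hit count.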

 %% Cell type:markdown id: tags:

 ## Hit count

 %% Cell type:code id: tags:

 ``` python
 df['af_hit_count'].describe()
 ```

 %% Output

    count    9.540000e+02
    mean     2.401892e+04
    std      1.205649e+05
    min      0.000000e+00
    25%      7.000000e+00
    50%      9.050000e+01
    75%      1.185250e+03
    max      1.611956e+06
    Name: af_hit_count, dtype: float64

 %% Cell type:markdown id: tags:

 ## Edit filter editors

 %% Cell type:code id: tags:

 ``` python
 raw_df = pd.read_csv("quarry-32518-all-filters-sorted-num-hits.csv", sep=',')
 editors = raw_df['af_user_text']
 print (editors.unique())
 print (len(editors.unique()))
 print (raw_df['af_user_text'].value_counts())
 ```

 %% Output

    ['Zzuuzz' 'Dragons flight' 'This, that and the other' 'MusikAnimal' 'Crow'
     'Samtar' 'Xaosflux' 'King of Hearts' 'Amorymeltzer' 'Samwalton9'
     'Biblioworm' 'NawlinWiki' 'MER-C' 'Rich Farmbrough' 'Galobtter'
     'Cenarium' 'Ruslik0' 'Legoktm' 'Od Mishehu' 'BU Rob13' 'Prodego'
     'Timotheus Canens' 'Oshwah' 'The Earwig' 'The Anome' 'Kww' 'Beetstra'
     'Reaper Eternal' 'BethNaught' 'Mlitn' 'Cyp' "There'sNoTime" 'Kuru'
     'Shirik' 'Xeno' 'Kaldari' 'Kingpin13' 'DoRD' 'Elockid' 'Ritchie333'
     'Maxim' 'Ryan Kaldari (WMF)' 'Cyberpower678' 'GB fan' 'Jackmcbarn' 'L235'
     'Smalljim' 'Materialscientist' 'Someguy1221' 'Billinghurst' 'Tedder'
     'Gogo Dodo' 'Triplestop' 'Darkwind' 'Amalthea' 'Slakr' 'Scottywong'
     'Mr.Z-man' 'SQL' 'Avraham' 'NuclearWarfare' 'OverlordQ' 'Nihiltres'
     'Hersfold' 'Mifter' 'Chris G' 'EdoDodo' 'Nakon' 'Werdna' 'Wknight94'
     'DMacks' 'East718' 'Georgewilliamherbert' 'Mindmatrix' 'Rschen7754'
     'Lustiger seth' "Chris G's Test Account"]
    77
    MusikAnimal                 249
    King of Hearts               91
    Zzuuzz                       81
    Rich Farmbrough              61
    Ruslik0                      59
    Prodego                      45
    Samwalton9                   34
    Cenarium                     32
    NawlinWiki                   28
    Xaosflux                     27
    Reaper Eternal               25
    Shirik                       23
    Beetstra                     16
    Dragons flight               15
    Crow                         13
    Legoktm                      11
    Samtar                        9
    The Anome                     9
    Cyp                           7
    BethNaught                    6
    Ryan Kaldari (WMF)            5
    BU Rob13                      5
    Oshwah                        5
    Kww                           5
    Od Mishehu                    5
    There'sNoTime                 5
    Elockid                       4
    Kuru                          4
    Materialscientist             4
    Mlitn                         4
                               ...
    This, that and the other      1
    Nihiltres                     1
    Ritchie333                    1
    East718                       1
    Lustiger seth                 1
    Chris G                       1
    Tedder                        1
    Amalthea                      1
    Mindmatrix                    1
    DoRD                          1
    Avraham                       1
    Georgewilliamherbert          1
    EdoDodo                       1
    Darkwind                      1
    Rschen7754                    1
    Jackmcbarn                    1
    Nakon                         1
    Slakr                         1
    Smalljim                      1
    Chris G's Test Account        1
    Scottywong                    1
    Timotheus Canens              1
    Mifter                        1
    L235                          1
    Triplestop                    1
    Hersfold                      1
    Billinghurst                  1
    SQL                           1
    DMacks                        1
    Xeno                          1
    Name: af_user_text, Length: 77, dtype: int64

 %% Cell type:markdown id: tags:

+## Vandalism
+
+We may be interested in how the notion of vandalism changed over the years. For this an inquiry into which filters have "vandalism" in their public description (and were tagged as "vandalism") and what they do may be interesting.
+
+%% Cell type:code id: tags:
+
+``` python
+```
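One possible way to fill the empty cell, sketched under assumptions: "tagged as vandalism" is read here as carrying the exact tag `vandalism` (the uncertain `vandalism?` is excluded), and the description match is a case-insensitive substring search. The three rows are copied from the "feedback" group output above; the variable names are illustrative only.

```python
import collections

import pandas as pd

# Rows copied from the "feedback" group output; the notebook would use the full df.
df = pd.DataFrame({
    "af_id": [497, 502, 475],
    "af_actions": ["disallow", "disallow", None],
    "af_public_comments": [
        "Feedback: Common Vandalism 5",
        "Feedback: Extremely long words",
        "Feedback: Vandalism or libel",
    ],
    "manual_tags": [
        "vandalism, harassment?",
        "vandalism?, good_faith?, bad_style?",
        "vandalism, harassment",
    ],
})

# Filters whose public description mentions vandalism (case-insensitive).
described = df["af_public_comments"].fillna("").str.contains("vandalism", case=False)

# Filters manually tagged exactly "vandalism" ("vandalism?" does not count here).
tagged = df["manual_tags"].fillna("").apply(
    lambda tags: "vandalism" in [t.strip() for t in tags.split(",")]
)

both = df[described & tagged]
only_tagged = df[tagged & ~described]
only_described = df[described & ~tagged]

# What do the filters in the intersection do? An empty string means no action is set.
action_counts = collections.Counter(
    action
    for actions in both["af_actions"].fillna("")
    for action in actions.split(",")
)
print(both["af_id"].tolist())
print(action_counts.most_common())
```

The two difference sets (`only_tagged`, `only_described`) are the interesting ones for comparing the edit filter editors' labels with the manual ones.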
+
+%% Cell type:markdown id: tags:
+
+## Potential harassment
+
+Another idea would be to classify filters according to the namespaces they cover. A filter targeting the talk/user name spaces may be indicative of dealing with personal attacks or harassment.
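A rough sketch of such a classification follows. It assumes the filter rules are available in an `af_pattern` column (the annotated CSV shown in this notebook does not include one) and that namespaces are checked with a plain `page_namespace == N` comparison (or the older `article_namespace`); other forms of namespace check would be missed by this crude regex. The three patterns are made up for illustration.

```python
import re

import pandas as pd

# MediaWiki namespace numbers for talk and user pages.
TALK_USER_NAMESPACES = {1: "Talk", 2: "User", 3: "User talk"}

# Hypothetical filter patterns, invented for this sketch.
filters = pd.DataFrame({
    "af_id": [1, 2, 3],
    "af_pattern": [
        'page_namespace == 0 & added_lines rlike "foo"',
        'page_namespace == 3 & !("confirmed" in user_groups)',
        'article_namespace == 2 | article_namespace == 3',
    ],
})

NAMESPACE_RE = re.compile(r"\b(?:page|article)_namespace\s*==\s*(\d+)")


def namespaces_in(pattern):
    """Return the sorted namespace numbers a pattern compares against with ==."""
    return sorted({int(n) for n in NAMESPACE_RE.findall(pattern)})


filters["namespaces"] = filters["af_pattern"].fillna("").apply(namespaces_in)
filters["targets_talk_or_user"] = filters["namespaces"].apply(
    lambda numbers: any(n in TALK_USER_NAMESPACES for n in numbers)
)
print(filters[["af_id", "namespaces", "targets_talk_or_user"]])
```

Filters flagged this way are only candidates; whether they actually deal with personal attacks or harassment still needs a manual look at the rule.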
+
+%% Cell type:markdown id: tags:
+
 ## Code snippets that may come in handy

 %% Cell type:code id: tags:

 ``` python
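 # split the first ten tag strings into a data frame with one column per tag
 # (n=1 splits once, so there are at most two columns; expand=True already returns a DataFrame)
 ten_tags = manual_tags.head(10).str.split(", ", n=1, expand=True)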
 ten_tags = ten_tags.rename(columns = lambda x : 'tag_' + str(x))
 ten_tags
 ```

 %% Output

    0               good_faith
    1                vandalism
    2                vandalism
    3                vandalism
    4               good_faith
    5               good_faith
    6     good_faith, lazyness
    7    vandalism, good_faith
    8               good_faith
    9               good_faith
    Name: manual_tags, dtype: object

            tag_0       tag_1
    0  good_faith        None
    1   vandalism        None
    2   vandalism        None
    3   vandalism        None
    4  good_faith        None
    5  good_faith        None
    6  good_faith    lazyness
    7   vandalism  good_faith
    8  good_faith        None
    9  good_faith        None

 %% Cell type:code id: tags:

 ``` python
 raw_df.groupby('af_user_text').count()
 ```

 %% Output

    MusikAnimal                 249
    King of Hearts               91
    Zzuuzz                       81
    Rich Farmbrough              61
    Ruslik0                      59
    Prodego                      45
    Samwalton9                   34
    Cenarium                     32
    NawlinWiki                   28
    Xaosflux                     27
    Reaper Eternal               25
    Shirik                       23
    Beetstra                     16
    Dragons flight               15
    Crow                         13
    Legoktm                      11
    Samtar                        9
    The Anome                     9
    Cyp                           7
    BethNaught                    6
    Ryan Kaldari (WMF)            5
    BU Rob13                      5
    Oshwah                        5
    Kww                           5
    Od Mishehu                    5
    There'sNoTime                 5
    Elockid                       4
    Kuru                          4
    Materialscientist             4
    Mlitn                         4
                               ...
    This, that and the other      1
    Nihiltres                     1
    Ritchie333                    1
    East718                       1
    Lustiger seth                 1
    Chris G                       1
    Tedder                        1
    Amalthea                      1
    Mindmatrix                    1
    DoRD                          1
    Avraham                       1
    Georgewilliamherbert          1
    EdoDodo                       1
    Darkwind                      1
    Rschen7754                    1
    Jackmcbarn                    1
    Nakon                         1
    Slakr                         1
    Smalljim                      1
    Chris G's Test Account        1
    Scottywong                    1
    Timotheus Canens              1
    Mifter                        1
    L235                          1
    Triplestop                    1
    Hersfold                      1
    Billinghurst                  1
    SQL                           1
    DMacks                        1
    Xeno                          1
    Name: af_user_text, Length: 77, dtype: int64