Commit 43447d9e authored by Lyudmila Vaseva

Reflect on open questions

parent d2cfea6b
......@@ -657,9 +657,9 @@ So far, I haven't managed to trigger a filter with a different action.
\begin{itemize}
\item how many filters are there (were there over the years): 954 filters (as of 06.01.2019); TODO: historically?
\item what do the most active filters do?: see~\ref{tab:most-active-actions}
\item get a sense of what gets filtered (more qualitative): TODO: refine after sorting through manual categories; preliminary: vandalism; unintentional suboptimal behavior from new users who don't know better ("good faith edits") such as blanking an article/section; creating an article without categories; adding larger texts without references; large unwikified new article (180); or from users who are too lazy (to write proper edit summaries); editing behaviours and styles not suitable for an encyclopedia (poor grammar/not committing to orthography norms; use of emoticons and !; ascii art?); "unexplained removal of sourced content" (636) may be an attempt to silence a viewpoint the editor doesn't like; self-promotion (adding unreferenced material to BLP; "users creating autobiographies" 148); harassment; sockpuppetry; potential copyright violations
\item get a sense of what gets filtered (more qualitative): TODO: refine after sorting through manual categories; preliminary: vandalism; unintentional suboptimal behavior from new users who don't know better ("good faith edits") such as blanking an article/section; creating an article without categories; adding larger texts without references; large unwikified new article (180); or from users who are too lazy (to write proper edit summaries); editing behaviours and styles not suitable for an encyclopedia (poor grammar/not committing to orthography norms; use of emoticons and !; ascii art?); "unexplained removal of sourced content" (636) may be an attempt to silence a viewpoint the editor doesn't like; self-promotion (adding unreferenced material to BLP; "users creating autobiographies" 148); harassment; sockpuppetry; potential copyright violations; that's more or less it actually. There's a third bigger cluster of maintenance stuff, such as tracking bugs or other problems, trying to sort through bot edits and such. For further details see the jupyter notebook.
\item has the willingness of the community to use filters increased over time?: judging by the aggregated number of triggered filters per year, it has stayed fairly constant; TODO: plot it at a finer granularity
\item how often were (which) filters triggered: see \url{filter-lists/20190106115600_filters-sorted-by-hits.csv} and~\ref{tab:most-active-actions}; TODO aggregate hitcounts over tagged categories after finished tagging
\item how often were (which) filters triggered: see \url{filter-lists/20190106115600_filters-sorted-by-hits.csv} and~\ref{tab:most-active-actions}; see also jupyter notebook for aggregated hitcounts over tagged categories
\item percentage of triggered filters/all edits; break down triggered filters according to typology: TODO still need the complete abuse\_filter\_log table!; and probably further dumps in order to know total number of edits
\item percentage filters of different types over the years: TODO according to actions (I need a complete abuse\_filter\_log table for this!); according to self-assigned tags (finish tagging!)
\item what gets classified as vandalism? has this changed over time? TODO: look at words and patterns triggered by the vandalism filters; read vandalism policy page; pay special attention to filters labeled as vandalism by the edit filter editors (i.e. in the public description) vs. those I labeled as vandalism
......@@ -678,7 +678,7 @@ So far, I haven't managed to trigger a filter with a different action.
\item what are the values in the "group" column? what do they mean?
\item which are the most frequently triggered filters of all time?
\item is it new filters that get triggered most frequently? or are there also very active old ones?
\item how many different edit filter editros are there (af\_user)?
\item how many different edit filter editors are there (af\_user)?
\item categorise filters according to which namespaces they apply to; pay special attention to edits in user/talk namespaces (may be an indication of filtering harassment)
\end{itemize}
......
%% Cell type:markdown id: tags:
# An explorative inquiry into EN Wikipedia's edit filter system
This notebook explores EN Wikipedia's edit filters.
%% Cell type:code id: tags:
``` python
import pandas as pd
import itertools
import collections
```
%% Cell type:markdown id: tags:
We import a cleaned version of manually annotated edit filters:
%% Cell type:code id: tags:
``` python
df = pd.read_csv("20190106115600_filters-sorted-by-hits-manual-tags.csv", sep='\t')
```
%% Cell type:markdown id: tags:
## General stats
%% Cell type:code id: tags:
``` python
# Number of filters
len(df)
```
%% Output
954
%% Cell type:code id: tags:
``` python
# Active (enabled) filters
print (len(df.query('af_enabled==1')))
# Disabled filters
print (len(df.query('af_enabled==0')))
# Deleted filters
print (len(df.query('af_deleted==1')))
# Active public filters
print (len(df.query('af_hidden==0 and af_enabled==1')))
# Deleted and enabled
print (len(df.query('af_deleted==1 and af_enabled==1')))
```
%% Output
201
753
600
110
0
%% Cell type:code id: tags:
``` python
# hidden filters
print (len(df.query('af_hidden==1')))
# active hidden filters
print (len(df.query('af_hidden==1 and af_enabled==1')))
```
%% Output
593
91
%% Cell type:code id: tags:
``` python
# non-global filters (af_global==0); a count equal to len(df) means no filter is global
print (len(df.query('af_global==0')))
```
%% Output
954
%% Cell type:code id: tags:
``` python
# throttled
print (len(df.query('af_throttled==0')))
print (len(df.query('af_throttled==1')))
```
%% Output
948
6
%% Cell type:code id: tags:
``` python
# group
print (len(df.query('af_group=="default"')))
print (df.query('af_group!="default"'))
# --> so available groups are "default" and "feedback"
```
%% Output
947
Unnamed: 0 af_id af_hidden af_global af_enabled af_deleted \
168 168 497 0 0 0 1
173 173 494 0 0 0 1
174 174 502 0 0 0 1
187 187 495 0 0 0 1
190 190 496 0 0 0 1
227 227 475 0 0 0 1
349 349 461 0 0 0 1
af_throttled af_group af_timestamp af_actions af_hit_count \
168 0 feedback 20130108151106 disallow 3660
173 0 feedback 20130108151035 disallow 3325
174 0 feedback 20130424011002 disallow 3280
187 0 feedback 20130108151045 disallow 2697
190 0 feedback 20130108151054 disallow 2658
227 0 feedback 20131003210159 NaN 1390
349 0 feedback 20130411173111 disallow 283
af_public_comments manual_tags \
168 Feedback: Common Vandalism 5 vandalism, harassment?
173 Feedback: Common Vandalism 2 vandalism?, harassment?
174 Feedback: Extremely long words vandalism?, good_faith?, bad_style?
187 Feedback: Common Vandalism 3 vandalism, harassment?
190 Feedback: Common Vandalism 4 vandalism, harassment?
227 Feedback: Vandalism or libel vandalism, harassment
349 Feedback: Vandalism in all caps vandalism, harassment?
notes
168 deleted; “Merged back into 460. --mlitn”
173 deleted; “Merged back into 460. --mlitn”
174 deleted
187 deleted; “Merged back into 460. --mlitn”
190 deleted; “Merged back into 460. --mlitn”
227 deleted
349 NaN
%% Cell type:markdown id: tags:
## Helper functions
%% Cell type:code id: tags:
``` python
flatten = lambda x: list(itertools.chain.from_iterable(x))
```
%% Cell type:markdown id: tags:
## Edit filter actions
%% Cell type:code id: tags:
``` python
actions = df['af_actions'].fillna('')
actions_list = [x.split(",") for x in list(actions)]
all_actions = flatten(actions_list)
print(collections.Counter(all_actions).most_common())
```
%% Output
[('', 413), ('disallow', 406), ('warn', 122), ('tag', 70), ('throttle', 52), ('blockautopromote', 4)]
%% Cell type:code id: tags:
``` python
# What are the actions of active hidden filters
active_hidden = df.query('af_hidden==1 and af_enabled==1')
print(collections.Counter(list(active_hidden['af_actions'].fillna(''))).most_common())
```
%% Output
[('disallow', 51), ('', 19), ('throttle,disallow', 7), ('throttle', 4), ('tag', 3), ('warn,tag', 2), ('throttle,warn', 2), ('warn', 1), ('disallow,tag', 1), ('warn,disallow', 1)]
%% Cell type:code id: tags:
``` python
# What are the actions of active public filters
active_public = df.query('af_hidden==0 and af_enabled==1')
print(collections.Counter(list(active_public['af_actions'].fillna(''))).most_common())
```
%% Output
[('tag', 25), ('warn,tag', 25), ('disallow', 22), ('', 20), ('warn', 12), ('throttle,tag', 2), ('warn,disallow', 2), ('throttle,warn,tag', 1), ('throttle,disallow', 1)]
%% Cell type:markdown id: tags:
## Explore Manual Tags
%% Cell type:code id: tags:
``` python
manual_tags = df['manual_tags']
manual_tags_list = [x.split(", ") for x in list(manual_tags)]
all_tags = flatten(manual_tags_list)
print(collections.Counter(all_tags).most_common())
```
%% Output
[('vandalism', 263), ('vandalism?', 162), ('unknown', 71), ('good_faith?', 63), ('misc', 59), ('sockpuppetry', 59), ('good_faith', 48), ('test', 43), ('spam?', 41), ('long_term_abuse', 35), ('sockpuppetry?', 35), ('harassment?', 31), ('harassment', 24), ('abuse?', 21), ('biased_pov', 17), ('spam', 17), ('biased_pov?', 15), ('unclear', 14), ('bad_style', 13), ('bad_style?', 12), ('bug?', 10), ('wiki_policy?', 9), ('long_term_abuse?', 9), ('misc?', 8), ('seo', 8), ('politically_motivated?', 8), ('maintenance', 7), ('trolling?', 7), ('maintenance?', 6), ('personal_attacks', 6), ('bug', 5), ('vandalbot', 5), ('page_move_vandalism', 5), ('silly_vandalism', 5), ('lazyness', 4), ('seo?', 4), ('test?', 4), ('hoaxing?', 4), ('personal_attacks?', 4), ('edit_warring?', 3), ('copyright', 3), ('image_vandalism', 3), ('talk_page_vandalism', 3), ('page_move_vandalism?', 3), ('conflict_of_interest', 3), ('stockbrocker_vandalism', 3), ('copyright?', 2), ('vandalbot?', 2), ('religious_vandalism?', 2), ('politically_motivated', 2), ('self_promotion?', 2), ('template_spam', 2), ('hoaxing', 2), ('silly_vandalism?', 2), ('doxxing?', 2), ('not_polite', 1), ('template_vandalism', 1), ('religious_vandalism', 1), ('self_promotion', 1), ('abuse', 1), ('template_vandalism?', 1), ('link_vandalism?', 1), ('abuse_of_tags_vandalism?', 1), ('avoidant_vandalism', 1), ('guideline_vio?', 1), ('username_vandalism?', 1), ('phishing?', 1), ('avoidant_vandalism?', 1), ('malware?', 1), ('malware', 1), ('conflict_of_interest?', 1), ('impersonation', 1), ('prank', 1)]
%% Cell type:markdown id: tags:
('vandalism', 263),
('vandalism?', 162),
('spam?', 41),
('spam', 17),
('vandalbot', 5),
('vandalbot?', 2),
('page_move_vandalism', 5),
('page_move_vandalism?', 3),
('silly_vandalism', 5),
('silly_vandalism?', 2),
('trolling?', 7),
('hoaxing?', 4),
('hoaxing', 2),
('copyright', 3),
('copyright?', 2),
('image_vandalism', 3),
('talk_page_vandalism', 3),
('template_vandalism?', 1),
('template_vandalism', 1),
('template_spam', 2),
('link_vandalism?', 1),
('abuse_of_tags_vandalism?', 1),
('avoidant_vandalism', 1),
('avoidant_vandalism?', 1),
('username_vandalism?', 1),
('prank', 1),
('phishing?', 1),
('malware?', 1),
('malware', 1),
('guideline_vio?', 1),
('religious_vandalism?', 3),
('politically_motivated?', 8),
('politically_motivated', 2),
('sockpuppetry', 59),
('sockpuppetry?', 35),
('long_term_abuse', 35),
('long_term_abuse?', 9),
('abuse', 1),
('abuse?', 21),
('harassment?', 31),
('harassment', 24),
('doxxing?', 2),
('personal_attacks', 6),
('personal_attacks?', 4),
('impersonation', 1),
('not_polite', 1),
('biased_pov', 17),
('biased_pov?', 15),
('conflict_of_interest', 3),
('stockbrocker_vandalism', 3),
('self_promotion?', 2),
('conflict_of_interest?', 1),
('self_promotion', 1),
('seo', 8),
('seo?', 4),
('bad_style', 13),
('bad_style?', 12),
('edit_warring?', 3),
('good_faith?', 63),
('good_faith', 48),
('lazyness', 4),
('maintenance', 7),
('maintenance?', 5),
('maintenance? ', 1),
('bug', 5),
('bug?', 10),
('wiki_policy?', 9),
('test', 43),
('test?', 4),
('unknown', 71),
('misc', 59),
('misc?', 8),
('unclear', 14),
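%% Cell type:markdown id: tags:
The grouped list above places each tag next to its uncertain `?` variant, and one entry (`'maintenance? '`) differs from its neighbour only by a trailing space. The following is a minimal sketch of how such a grouping could be computed instead of assembled by hand. The helpers `normalise_tag` and `count_base_tags` are hypothetical names, not part of the original notebook, and the sample strings only illustrate the format of the `manual_tags` column. Since the function accepts any iterable of strings, passing `df['manual_tags']` should presumably work as well.
%% Cell type:code id: tags:
```python
import collections


def normalise_tag(raw):
    """Return (base_tag, is_uncertain) for one raw tag such as 'maintenance? '."""
    tag = raw.strip()
    is_uncertain = tag.endswith("?")
    return tag.rstrip("?").strip(), is_uncertain


def count_base_tags(tag_strings):
    """Count tags per base name, merging 'x' and 'x?' into one bucket."""
    totals = collections.Counter()
    uncertain = collections.Counter()
    for entry in tag_strings:
        if not isinstance(entry, str):  # skip missing values such as NaN
            continue
        for raw in entry.split(","):
            base, is_uncertain = normalise_tag(raw)
            if not base:
                continue
            totals[base] += 1
            if is_uncertain:
                uncertain[base] += 1
    return totals, uncertain


# Illustrative input in the format of the manual_tags column (assumption)
sample = ["vandalism, harassment?", "vandalism?", "maintenance? ", "good_faith"]
totals, uncertain = count_base_tags(sample)
print(totals.most_common())
print(uncertain.most_common())
```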
%% Cell type:markdown id: tags:
## Hit count
%% Cell type:code id: tags:
``` python
df['af_hit_count'].describe()
```
%% Output
count 9.540000e+02
mean 2.401892e+04
std 1.205649e+05
min 0.000000e+00
25% 7.000000e+00
50% 9.050000e+01
75% 1.185250e+03
max 1.611956e+06
Name: af_hit_count, dtype: float64
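%% Cell type:markdown id: tags:
The hit counts could also be summed per manual tag. The sketch below shows one way to do this; `hits_per_tag` is a hypothetical helper, and the three sample rows reuse the tags and hit counts of the feedback filters 497, 502 and 495 printed earlier. Note that a filter carrying several tags contributes its full hit count to each of them, so the per-tag totals do not add up to the overall number of hits. On the real data one might call it as `hits_per_tag(zip(df['manual_tags'], df['af_hit_count']))`.
%% Cell type:code id: tags:
```python
import collections


def hits_per_tag(rows):
    """Sum hit counts per tag for an iterable of (tag_string, hit_count) pairs."""
    totals = collections.Counter()
    for tags, hits in rows:
        if not isinstance(tags, str):  # skip missing tag strings such as NaN
            continue
        for tag in tags.split(","):
            tag = tag.strip()
            if tag:
                totals[tag] += int(hits)
    return totals


# Tags and hit counts of filters 497, 502 and 495 from the output above
sample_rows = [
    ("vandalism, harassment?", 3660),
    ("vandalism?, good_faith?, bad_style?", 3280),
    ("vandalism, harassment?", 2697),
]
tag_hits = hits_per_tag(sample_rows)
for tag, hits in tag_hits.most_common():
    print(tag, hits)
```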
%% Cell type:markdown id: tags:
## Edit filter editors
%% Cell type:code id: tags:
``` python
raw_df = pd.read_csv("quarry-32518-all-filters-sorted-num-hits.csv", sep=',')
editors = raw_df['af_user_text']
print (editors.unique())
print (len(editors.unique()))
print (raw_df['af_user_text'].value_counts())
```
%% Output
['Zzuuzz' 'Dragons flight' 'This, that and the other' 'MusikAnimal' 'Crow'
'Samtar' 'Xaosflux' 'King of Hearts' 'Amorymeltzer' 'Samwalton9'
'Biblioworm' 'NawlinWiki' 'MER-C' 'Rich Farmbrough' 'Galobtter'
'Cenarium' 'Ruslik0' 'Legoktm' 'Od Mishehu' 'BU Rob13' 'Prodego'
'Timotheus Canens' 'Oshwah' 'The Earwig' 'The Anome' 'Kww' 'Beetstra'
'Reaper Eternal' 'BethNaught' 'Mlitn' 'Cyp' "There'sNoTime" 'Kuru'
'Shirik' 'Xeno' 'Kaldari' 'Kingpin13' 'DoRD' 'Elockid' 'Ritchie333'
'Maxim' 'Ryan Kaldari (WMF)' 'Cyberpower678' 'GB fan' 'Jackmcbarn' 'L235'
'Smalljim' 'Materialscientist' 'Someguy1221' 'Billinghurst' 'Tedder'
'Gogo Dodo' 'Triplestop' 'Darkwind' 'Amalthea' 'Slakr' 'Scottywong'
'Mr.Z-man' 'SQL' 'Avraham' 'NuclearWarfare' 'OverlordQ' 'Nihiltres'
'Hersfold' 'Mifter' 'Chris G' 'EdoDodo' 'Nakon' 'Werdna' 'Wknight94'
'DMacks' 'East718' 'Georgewilliamherbert' 'Mindmatrix' 'Rschen7754'
'Lustiger seth' "Chris G's Test Account"]
77
MusikAnimal 249
King of Hearts 91
Zzuuzz 81
Rich Farmbrough 61
Ruslik0 59
Prodego 45
Samwalton9 34
Cenarium 32
NawlinWiki 28
Xaosflux 27
Reaper Eternal 25
Shirik 23
Beetstra 16
Dragons flight 15
Crow 13
Legoktm 11
Samtar 9
The Anome 9
Cyp 7
BethNaught 6
Ryan Kaldari (WMF) 5
BU Rob13 5
Oshwah 5
Kww 5
Od Mishehu 5
There'sNoTime 5
Elockid 4
Kuru 4
Materialscientist 4
Mlitn 4
...
This, that and the other 1
Nihiltres 1
Ritchie333 1
East718 1
Lustiger seth 1
Chris G 1
Tedder 1
Amalthea 1
Mindmatrix 1
DoRD 1
Avraham 1
Georgewilliamherbert 1
EdoDodo 1
Darkwind 1
Rschen7754 1
Jackmcbarn 1
Nakon 1
Slakr 1
Smalljim 1
Chris G's Test Account 1
Scottywong 1
Timotheus Canens 1
Mifter 1
L235 1
Triplestop 1
Hersfold 1
Billinghurst 1
SQL 1
DMacks 1
Xeno 1
Name: af_user_text, Length: 77, dtype: int64
%% Cell type:markdown id: tags:
## Vandalism
We may be interested in how the notion of vandalism changed over the years. To that end, it may be worth examining which filters have "vandalism" in their public description (and were tagged as "vandalism") and what they do.
%% Cell type:code id: tags:
``` python
```
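%% Cell type:markdown id: tags:
A possible first step for this comparison is sketched below: for each filter, check whether "vandalism" appears in the public description, in the manual tags, in both, or in neither. The function names are hypothetical. The first three sample rows are taken from the feedback filters printed earlier, the fourth is made up for illustration. The sketch matches only the plain `vandalism` tag (with or without `?`), so subtypes such as `page_move_vandalism` would need to be added if they should count too.
%% Cell type:code id: tags:
```python
import collections


def has_vandalism_tag(tags):
    """True if the tag string contains 'vandalism' or 'vandalism?' as a tag."""
    if not isinstance(tags, str):
        return False
    base_tags = [t.strip().rstrip("?") for t in tags.split(",")]
    return "vandalism" in base_tags


def vandalism_agreement(rows):
    """Count how public descriptions and manual tags agree on 'vandalism'."""
    counts = collections.Counter()
    for description, tags in rows:
        in_description = (
            isinstance(description, str) and "vandalism" in description.lower()
        )
        in_tags = has_vandalism_tag(tags)
        if in_description and in_tags:
            counts["both"] += 1
        elif in_description:
            counts["description_only"] += 1
        elif in_tags:
            counts["tag_only"] += 1
        else:
            counts["neither"] += 1
    return counts


sample_rows = [
    ("Feedback: Common Vandalism 5", "vandalism, harassment?"),
    ("Feedback: Extremely long words", "vandalism?, good_faith?, bad_style?"),
    ("Feedback: Vandalism or libel", "vandalism, harassment"),
    ("Hypothetical description", "good_faith"),  # made-up row for illustration
]
agreement = vandalism_agreement(sample_rows)
print(agreement)
```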
%% Cell type:markdown id: tags:
## Potential harassment
Another idea would be to classify filters according to the namespaces they cover. A filter targeting the talk/user namespaces may indicate that it deals with personal attacks or harassment.
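%% Cell type:markdown id: tags:
A rough sketch of such a classification follows. It rests on several assumptions: that the filter patterns are available as text (for instance from an `af_pattern` column, which is not part of the data frame used in this notebook), and that a filter tests its namespace with a simple equality check on `page_namespace` (or the older `article_namespace`). Checks written differently, e.g. with `in`, would be missed. The function names and the sample patterns are hypothetical; namespaces 1, 2 and 3 are MediaWiki's talk, user and user talk namespaces.
%% Cell type:code id: tags:
```python
import re

# MediaWiki namespace numbers: 1 = Talk, 2 = User, 3 = User talk
USER_OR_TALK = {1, 2, 3}

# Matches e.g. "page_namespace == 3" or "article_namespace = 0"
NAMESPACE_RE = re.compile(r"(?:page|article)_namespace\s*={1,3}\s*(\d+)")


def namespaces_in_pattern(pattern):
    """Return the sorted namespace numbers a pattern compares against."""
    return sorted({int(n) for n in NAMESPACE_RE.findall(pattern)})


def targets_user_or_talk(pattern):
    """True if the pattern checks for a talk, user or user talk namespace."""
    return any(n in USER_OR_TALK for n in namespaces_in_pattern(pattern))


# Made-up patterns for illustration only
sample_patterns = [
    'page_namespace == 3 & action == "edit"',
    "article_namespace == 0",
    "page_namespace == 2 | page_namespace == 3",
]
for pattern in sample_patterns:
    print(namespaces_in_pattern(pattern), targets_user_or_talk(pattern))
```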
%% Cell type:markdown id: tags:
## Code snippets that may come in handy
%% Cell type:code id: tags:
``` python
# Split the first ten tag strings into columns; n=1 splits only at the first ", ",
# so any further tags stay together in the second column
ten_tags = manual_tags.head(10).str.split(", ", n=1, expand=True)
ten_tags = ten_tags.rename(columns=lambda x: 'tag_' + str(x))
ten_tags
```
%% Output
0 good_faith
1 vandalism
2 vandalism
3 vandalism
4 good_faith
5 good_faith
6 good_faith, lazyness
7 vandalism, good_faith
8 good_faith
9 good_faith
Name: manual_tags, dtype: object
tag_0 tag_1
0 good_faith None
1 vandalism None
2 vandalism None
3 vandalism None
4 good_faith None
5 good_faith None
6 good_faith lazyness
7 vandalism good_faith
8 good_faith None
9 good_faith None
%% Cell type:code id: tags:
``` python
raw_df.groupby('af_user_text').count()
```
%% Output
MusikAnimal 249
King of Hearts 91
Zzuuzz 81
Rich Farmbrough 61
Ruslik0 59
Prodego 45
Samwalton9 34
Cenarium 32
NawlinWiki 28
Xaosflux 27
Reaper Eternal 25
Shirik 23
Beetstra 16
Dragons flight 15
Crow 13
Legoktm 11
Samtar 9
The Anome 9
Cyp 7
BethNaught 6
Ryan Kaldari (WMF) 5
BU Rob13 5
Oshwah 5
Kww 5
Od Mishehu 5
There'sNoTime 5
Elockid 4
Kuru 4
Materialscientist 4
Mlitn 4
...
This, that and the other 1
Nihiltres 1
Ritchie333 1
East718 1
Lustiger seth 1
Chris G 1
Tedder 1
Amalthea 1
Mindmatrix 1
DoRD 1
Avraham 1
Georgewilliamherbert 1
EdoDodo 1
Darkwind 1
Rschen7754 1
Jackmcbarn 1
Nakon 1
Slakr 1
Smalljim 1
Chris G's Test Account 1
Scottywong 1
Timotheus Canens 1
Mifter 1
L235 1
Triplestop 1
Hersfold 1
Billinghurst 1
SQL 1
DMacks 1
Xeno 1
Name: af_user_text, Length: 77, dtype: int64
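%% Cell type:markdown id: tags:
Another snippet that may come in handy for looking at filter activity at a finer granularity than per year: bucketing MediaWiki timestamps (format `YYYYMMDDHHMMSS`) by month. This is only a sketch with hypothetical helper names. The sample values are `af_timestamp` entries from the output above and merely illustrate the format; for hit counts over time, the timestamps of the filter log (e.g. `afl_timestamp` in `abuse_filter_log`) would be needed, which this notebook does not have yet.
%% Cell type:code id: tags:
```python
import collections
from datetime import datetime


def month_of(mw_timestamp):
    """Convert a MediaWiki timestamp (YYYYMMDDHHMMSS) to a 'YYYY-MM' label."""
    parsed = datetime.strptime(str(mw_timestamp), "%Y%m%d%H%M%S")
    return parsed.strftime("%Y-%m")


def counts_per_month(timestamps):
    """Count how many timestamps fall into each calendar month."""
    return collections.Counter(month_of(ts) for ts in timestamps)


# af_timestamp values from the output above, used only to show the format
sample = [20130108151106, 20130108151035, 20130424011002, 20131003210159]
monthly = counts_per_month(sample)
for month in sorted(monthly):
    print(month, monthly[month])
```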
......