Commit 38f43ecd authored by Lyudmila Vaseva

Add random notes

parent e21e80a0
@@ -899,3 +899,33 @@ TODO: Questions to ask of a text (p.39-40):
* What kinds of comparisons can you make between texts? Between different texts on the same topic? Similar texts at different times such as organizational annual reports? Between different authors who address the same questions?
* Who benefits from the text? Why?
"
================================================================
https://en.wikipedia.org/w/api.php?action=help&modules=main
action
Which action to perform.
abusefiltercheckmatch
Check to see if an AbuseFilter matches a set of variables, an edit, or a logged AbuseFilter event.
abusefilterchecksyntax
Check syntax of an AbuseFilter filter.
abusefilterevalexpression
Evaluates an AbuseFilter expression.
abusefilterunblockautopromote
Unblocks a user from receiving autopromotions due to an abusefilter consequence.
================================================================
https://en.wikipedia.org/wiki/Wikipedia:Database_download
================================================================
https://stats.wikimedia.org/v2
To generate stats for different wiki projects
=====================================================================
Claudia: * The focus on the good-faith policies/guidelines is a historical development. After the huge surge in edits Wikipedia experienced starting in 2005, the community needed a means to handle these edits (and the proportional amount of vandalism). They opted for automation. The automated systems branded a lot of good-faith edits as vandalism, which drove newcomers away. A policy focus on good faith is part of the attempt to fix this.
%% Cell type:markdown id: tags:
# An explorative inquiry into EN Wikipedia's edit filter system
This notebook serves to explore EN Wikipedia's edit filters
%% Cell type:code id: tags:
``` python
import pandas as pd
import itertools
import collections
```
%% Cell type:markdown id: tags:
We import a cleaned version of manually annotated edit filters:
%% Cell type:code id: tags:
``` python
df = pd.read_csv("20190106115600_filters-sorted-by-hits-manual-tags.csv", sep='\t')
```
%% Cell type:markdown id: tags:
## General stats
%% Cell type:code id: tags:
``` python
# Number of filters
len(df)
```
%% Output
954
%% Cell type:code id: tags:
``` python
# Active (enabled) filters
print(len(df.query('af_enabled==1')))
# Disabled filters
print(len(df.query('af_enabled==0')))
# Deleted filters
print(len(df.query('af_deleted==1')))
# Active public filters
print(len(df.query('af_hidden==0 and af_enabled==1')))
# Deleted and enabled
print(len(df.query('af_deleted==1 and af_enabled==1')))
```
%% Output
201
753
600
110
0
%% Cell type:code id: tags:
``` python
# hidden filters
print(len(df.query('af_hidden==1')))
# active hidden filters
print(len(df.query('af_hidden==1 and af_enabled==1')))
```
%% Output
593
91
%% Cell type:code id: tags:
``` python
# non-global filters (af_global==0); the count equals len(df), so no filter is global
print(len(df.query('af_global==0')))
```
%% Output
954
%% Cell type:code id: tags:
``` python
# throttled: first the filters that are not throttled, then the throttled ones
print(len(df.query('af_throttled==0')))
print(len(df.query('af_throttled==1')))
```
%% Output
948
6
%% Cell type:code id: tags:
``` python
# group
print(len(df.query('af_group=="default"')))
print(df.query('af_group!="default"'))
# --> so available groups are "default" and "feedback"
```
%% Output
947
Unnamed: 0 af_id af_hidden af_global af_enabled af_deleted \
168 168 497 0 0 0 1
173 173 494 0 0 0 1
174 174 502 0 0 0 1
187 187 495 0 0 0 1
190 190 496 0 0 0 1
227 227 475 0 0 0 1
349 349 461 0 0 0 1
af_throttled af_group af_timestamp af_actions af_hit_count \
168 0 feedback 20130108151106 disallow 3660
173 0 feedback 20130108151035 disallow 3325
174 0 feedback 20130424011002 disallow 3280
187 0 feedback 20130108151045 disallow 2697
190 0 feedback 20130108151054 disallow 2658
227 0 feedback 20131003210159 NaN 1390
349 0 feedback 20130411173111 disallow 283
af_public_comments manual_tags \
168 Feedback: Common Vandalism 5 vandalism, harassment?
173 Feedback: Common Vandalism 2 vandalism?, harassment?
174 Feedback: Extremely long words vandalism?, good_faith?, bad_style?
187 Feedback: Common Vandalism 3 vandalism, harassment?
190 Feedback: Common Vandalism 4 vandalism, harassment?
227 Feedback: Vandalism or libel vandalism, harassment
349 Feedback: Vandalism in all caps vandalism, harassment?
notes
168 deleted; “Merged back into 460. --mlitn”
173 deleted; “Merged back into 460. --mlitn”
174 deleted
187 deleted; “Merged back into 460. --mlitn”
190 deleted; “Merged back into 460. --mlitn”
227 deleted
349 NaN
%% Cell type:markdown id: tags:
## Helper functions
%% Cell type:code id: tags:
``` python
# Flatten a list of lists into a single list
flatten = lambda x: list(itertools.chain.from_iterable(x))
```
%% Cell type:markdown id: tags:
## Edit filter actions
%% Cell type:code id: tags:
``` python
# Count every single action; a filter with several actions contributes to each of them
actions = df['af_actions'].fillna('')
actions_list = [x.split(",") for x in list(actions)]
all_actions = flatten(actions_list)
print(collections.Counter(all_actions).most_common())
```
%% Output
[('', 413), ('disallow', 406), ('warn', 122), ('tag', 70), ('throttle', 52), ('blockautopromote', 4)]
%% Cell type:code id: tags:
``` python
# What are the actions of active hidden filters
active_hidden = df.query('af_hidden==1 and af_enabled==1')
print(collections.Counter(list(active_hidden['af_actions'].fillna(''))).most_common())
```
%% Output
[('disallow', 51), ('', 19), ('throttle,disallow', 7), ('throttle', 4), ('tag', 3), ('warn,tag', 2), ('throttle,warn', 2), ('warn', 1), ('disallow,tag', 1), ('warn,disallow', 1)]
%% Cell type:code id: tags:
``` python
# What are the actions of active public filters
active_public = df.query('af_hidden==0 and af_enabled==1')
print(collections.Counter(list(active_public['af_actions'].fillna(''))).most_common())
```
%% Output
[('tag', 25), ('warn,tag', 25), ('disallow', 22), ('', 20), ('warn', 12), ('throttle,tag', 2), ('warn,disallow', 2), ('throttle,warn,tag', 1), ('throttle,disallow', 1)]
%% Cell type:markdown id: tags:
## Explore Manual Tags
%% Cell type:code id: tags:
``` python
manual_tags = df['manual_tags']
manual_tags_list = [x.split(", ") for x in list(manual_tags)]
all_tags = flatten(manual_tags_list)
print(collections.Counter(all_tags).most_common())
```
%% Output
[('vandalism', 263), ('vandalism?', 162), ('unknown', 71), ('good_faith?', 63), ('misc', 59), ('sockpuppetry', 59), ('good_faith', 48), ('test', 43), ('spam?', 41), ('long_term_abuse', 35), ('sockpuppetry?', 35), ('harassment?', 31), ('harassment', 24), ('abuse?', 21), ('biased_pov', 17), ('spam', 17), ('biased_pov?', 15), ('unclear', 14), ('bad_style', 13), ('bad_style?', 12), ('bug?', 10), ('wiki_policy?', 9), ('long_term_abuse?', 9), ('misc?', 8), ('seo', 8), ('politically_motivated?', 8), ('maintenance', 7), ('trolling?', 7), ('maintenance?', 6), ('personal_attacks', 6), ('bug', 5), ('vandalbot', 5), ('page_move_vandalism', 5), ('silly_vandalism', 5), ('lazyness', 4), ('seo?', 4), ('test?', 4), ('hoaxing?', 4), ('personal_attacks?', 4), ('edit_warring?', 3), ('copyright', 3), ('image_vandalism', 3), ('talk_page_vandalism', 3), ('page_move_vandalism?', 3), ('conflict_of_interest', 3), ('stockbrocker_vandalism', 3), ('copyright?', 2), ('vandalbot?', 2), ('religious_vandalism?', 2), ('politically_motivated', 2), ('self_promotion?', 2), ('template_spam', 2), ('hoaxing', 2), ('silly_vandalism?', 2), ('doxxing?', 2), ('not_polite', 1), ('template_vandalism', 1), ('religious_vandalism', 1), ('self_promotion', 1), ('abuse', 1), ('template_vandalism?', 1), ('link_vandalism?', 1), ('abuse_of_tags_vandalism?', 1), ('avoidant_vandalism', 1), ('guideline_vio?', 1), ('username_vandalism?', 1), ('phishing?', 1), ('avoidant_vandalism?', 1), ('malware?', 1), ('malware', 1), ('conflict_of_interest?', 1), ('impersonation', 1), ('prank', 1)]
%% Cell type:markdown id: tags:
('vandalism', 263),
('vandalism?', 162),
('spam?', 41),
('spam', 17),
('vandalbot', 5),
('vandalbot?', 2),
('page_move_vandalism', 5),
('page_move_vandalism?', 3),
('silly_vandalism', 5),
('silly_vandalism?', 2),
('trolling?', 7),
('hoaxing?', 4),
('hoaxing', 2),
('copyright', 3),
('copyright?', 2),
('image_vandalism', 3),
('talk_page_vandalism', 3),
('template_vandalism?', 1),
('template_vandalism', 1),
('template_spam', 2),
('link_vandalism?', 1),
('abuse_of_tags_vandalism?', 1),
('avoidant_vandalism', 1),
('avoidant_vandalism?', 1),
('username_vandalism?', 1),
('prank', 1),
('phishing?', 1),
('malware?', 1),
('malware', 1),
('guideline_vio?', 1),
('religious_vandalism?', 3),
('politically_motivated?', 8),
('politically_motivated', 2),
('sockpuppetry', 59),
('sockpuppetry?', 35),
('long_term_abuse', 35),
('long_term_abuse?', 9),
('abuse', 1),
('abuse?', 21),
('harassment?', 31),
('harassment', 24),
('doxxing?', 2),
('personal_attacks', 6),
('personal_attacks?', 4),
('impersonation', 1),
('not_polite', 1),
('biased_pov', 17),
('biased_pov?', 15),
('conflict_of_interest', 3),
('stockbrocker_vandalism', 3),
('self_promotion?', 2),
('conflict_of_interest?', 1),
('self_promotion', 1),
('seo', 8),
('seo?', 4),
('bad_style', 13),
('bad_style?', 12),
('edit_warring?', 3),
('good_faith?', 63),
('good_faith', 48),
('lazyness', 4),
('maintenance', 7),
('maintenance?', 5),
('maintenance? ', 1),
('bug', 5),
('bug?', 10),
('wiki_policy?', 9),
('test', 43),
('test?', 4),
('unknown', 71),
('misc', 59),
('misc?', 8),
('unclear', 14),
%% Cell type:markdown id: tags:
## Combine manual tags with filter actions
%% Cell type:code id: tags:
``` python
# What are the actions and tags of active public filters
active_public = df.query('af_hidden==0 and af_enabled==1').sort_values(by=['af_actions'])
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(active_public[['af_id', 'af_actions', 'manual_tags']].fillna(''))
```
%% Output
af_id af_actions manual_tags
653 897 disallow spam, vandalbot
67 803 disallow vandalism, good_faith
41 12 disallow vandalism
37 320 disallow vandalism
499 694 disallow good_faith
99 782 disallow misc
22 260 disallow vandalism
54 365 disallow vandalism
130 784 disallow vandalism
19 46 disallow vandalism
171 860 disallow vandalism
110 554 disallow seo?, vandalism?, spam?
47 680 disallow good_faith
470 843 disallow vandalism, sockpuppetry
3 384 disallow vandalism
234 892 disallow bad_style?, misc?
239 930 disallow wiki_policy?
268 812 disallow vandalism?
328 788 disallow vandalism
271 642 disallow good_faith?
12 225 disallow vandalism
302 828 disallow vandalism
68 117 tag good_faith?
75 753 tag vandalism
78 164 tag good_faith
155 632 tag good_faith, spam
85 627 tag biased_pov
94 59 tag good_faith?
100 655 tag good_faith?, bad_style?
106 224 tag misc?, copyright?
226 921 tag vandalism?, harassment?
131 735 tag vandalism
134 878 tag biased_pov, good_faith
82 323 tag vandalism
86 846 tag vandalism?
0 61 tag good_faith
33 180 tag good_faith
14 189 tag vandalism, harassment
20 98 tag good_faith
40 631 tag good_faith
29 550 tag misc
6 633 tag good_faith, lazyness
35 391 tag vandalism
53 339 tag vandalism, harassment
24 148 tag biased_pov
31 29 tag good_faith
4 172 tag good_faith
107 420 throttle,disallow vandalism?
10 279 throttle,tag vandalism
71 249 throttle,tag good_faith?, vandalism?
43 80 throttle,warn,tag vandalism, biased_pov, seo
149 869 warn biased_pov
151 702 warn biased_pov?, spam?
157 894 warn biased_pov
189 783 warn good_faith
81 167 warn good_faith
248 879 warn good_faith?
7 636 warn vandalism, good_faith
88 664 warn good_faith
375 901 warn vandalism
391 928 warn good_faith
449 838 warn vandalism?, maintenance?
177 850 warn good_faith?
158 887 warn,disallow vandalism?, spam?, sockpuppetry?
125 890 warn,disallow vandalism?
233 891 warn,tag spam?, biased_pov?
11 432 warn,tag good_faith, lazyness
5 30 warn,tag good_faith
211 766 warn,tag vandalism
1 135 warn,tag vandalism
8 3 warn,tag good_faith
45 11 warn,tag vandalism, harassment
160 5 warn,tag good_faith?
61 33 warn,tag vandalism, good_faith?
64 346 warn,tag good_faith, vandalism?
38 39 warn,tag vandalism
34 351 warn,tag good_faith?
30 149 warn,tag misc
90 657 warn,tag good_faith?
91 113 warn,tag good_faith
95 174 warn,tag good_faith?, vandalism?
28 79 warn,tag good_faith
25 491 warn,tag bad_style
101 602 warn,tag misc
108 345 warn,tag bug?
21 220 warn,tag misc
15 132 warn,tag vandalism, good_faith
138 912 warn,tag vandalism
13 50 warn,tag vandalism, good_faith
17 231 warn,tag vandalism
9 650 good_faith
23 686 vandalism, harassment, biased_pov
26 833 good_faith
27 712 good_faith, misc
58 126 misc?
63 867 good_faith
79 716 good_faith?, vandalism?
92 711 seo, vandalism?
109 733 vandalism, good_faith
115 837 good_faith?
175 777 vandalism
197 861 test
218 942 maintenance
257 899 wiki_policy?, biased_pov?, bad_style?
273 856 good_faith?, vandalism?
315 862 spam?
414 798 copyright
640 883 page_move_vandalism
666 929 long_term_abuse, bad_style?
704 932 seo?, vandalism?, spam?
%% Cell type:markdown id: tags:
**TODO** It would be interesting to check all those filters whose actions are set to "disallow" but which I've labeled as "good_faith", for example.
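One way to approach the check described in this TODO is sketched below. It is only an illustration: the helper name `disallowing_good_faith` is made up, the toy rows stand in for the annotated CSV, and it assumes that both the certain (`good_faith`) and the tentative (`good_faith?`) label should count.

```python
import pandas as pd


def disallowing_good_faith(filters: pd.DataFrame) -> pd.DataFrame:
    """Return filters that disallow edits but carry a good_faith label."""
    # af_actions is comma-separated, manual_tags is separated by ", "
    actions = filters["af_actions"].fillna("").str.split(",")
    tags = filters["manual_tags"].fillna("").str.split(", ")
    has_disallow = actions.apply(lambda acts: "disallow" in acts)
    # Assumption: tentative labels ("good_faith?") are of interest as well
    has_good_faith = tags.apply(
        lambda ts: any(t.strip() in ("good_faith", "good_faith?") for t in ts)
    )
    return filters[has_disallow & has_good_faith]


# Toy rows for illustration only; the real input would be `df` from above
sample = pd.DataFrame(
    {
        "af_id": [1, 2, 3, 4],
        "af_actions": ["disallow", "warn,tag", "throttle,disallow", None],
        "manual_tags": ["good_faith", "good_faith", "vandalism, good_faith?", "good_faith"],
    }
)
result = disallowing_good_faith(sample)
print(result["af_id"].tolist())
```

Splitting the action and tag strings before matching avoids accidental substring hits, which a plain `str.contains` on the raw columns could produce.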
%% Cell type:markdown id: tags:
## Hit count
%% Cell type:code id: tags:
``` python
df['af_hit_count'].describe()
```
%% Output
count 9.540000e+02
mean 2.401892e+04
std 1.205649e+05
min 0.000000e+00
25% 7.000000e+00
50% 9.050000e+01
75% 1.185250e+03
max 1.611956e+06
Name: af_hit_count, dtype: float64
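The gap between the median (90.5) and the mean (about 24,019) in the output above suggests that a small number of filters accounts for most hits. A minimal sketch of how this could be quantified follows; the helper name `top_share` and the toy numbers are made up for illustration, and on the real data one would pass `df['af_hit_count']`.

```python
import pandas as pd


def top_share(hit_counts: pd.Series, n: int) -> float:
    """Fraction of all hits produced by the n most frequently triggered filters."""
    total = hit_counts.sum()
    if total == 0:
        return 0.0
    return float(hit_counts.nlargest(n).sum() / total)


# Toy hit counts for illustration only
toy_hits = pd.Series([1000, 50, 30, 20])
share = top_share(toy_hits, 1)
print(f"{share:.1%}")
```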
%% Cell type:markdown id: tags:
## Edit filter editors
%% Cell type:code id: tags:
``` python
raw_df = pd.read_csv("quarry-32518-all-filters-sorted-num-hits.csv", sep=',')
editors = raw_df['af_user_text']
print(editors.unique())
print(len(editors.unique()))
print(raw_df['af_user_text'].value_counts())
```
['Zzuuzz' 'Dragons flight' 'This, that and the other' 'MusikAnimal' 'Crow'
'Samtar' 'Xaosflux' 'King of Hearts' 'Amorymeltzer' 'Samwalton9'
'Biblioworm' 'NawlinWiki' 'MER-C' 'Rich Farmbrough' 'Galobtter'
'Cenarium' 'Ruslik0' 'Legoktm' 'Od Mishehu' 'BU Rob13' 'Prodego'
'Timotheus Canens' 'Oshwah' 'The Earwig' 'The Anome' 'Kww' 'Beetstra'
'Reaper Eternal' 'BethNaught' 'Mlitn' 'Cyp' "There'sNoTime" 'Kuru'
'Shirik' 'Xeno' 'Kaldari' 'Kingpin13' 'DoRD' 'Elockid' 'Ritchie333'
'Maxim' 'Ryan Kaldari (WMF)' 'Cyberpower678' 'GB fan' 'Jackmcbarn' 'L235'
'Smalljim' 'Materialscientist' 'Someguy1221' 'Billinghurst' 'Tedder'
'Gogo Dodo' 'Triplestop' 'Darkwind' 'Amalthea' 'Slakr' 'Scottywong'
'Mr.Z-man' 'SQL' 'Avraham' 'NuclearWarfare' 'OverlordQ' 'Nihiltres'
'Hersfold' 'Mifter' 'Chris G' 'EdoDodo' 'Nakon' 'Werdna' 'Wknight94'
'DMacks' 'East718' 'Georgewilliamherbert' 'Mindmatrix' 'Rschen7754'
'Lustiger seth' "Chris G's Test Account"]
77
MusikAnimal 249
King of Hearts 91
Zzuuzz 81
Rich Farmbrough 61
Ruslik0 59
Prodego 45
Samwalton9 34
Cenarium 32
NawlinWiki 28
Xaosflux 27
Reaper Eternal 25
Shirik 23
Beetstra 16
Dragons flight 15
Crow 13
Legoktm 11
Samtar 9
The Anome 9
Cyp 7
BethNaught 6
Ryan Kaldari (WMF) 5
BU Rob13 5
Oshwah 5
Kww 5
Od Mishehu 5
There'sNoTime 5
Elockid 4
Kuru 4
Materialscientist 4
Mlitn 4
...
This, that and the other 1
Nihiltres 1
Ritchie333 1
East718 1
Lustiger seth 1
Chris G 1
Tedder 1
Amalthea 1
Mindmatrix 1
DoRD 1
Avraham 1
Georgewilliamherbert 1
EdoDodo 1
Darkwind 1
Rschen7754 1
Jackmcbarn 1
Nakon 1
Slakr 1
Smalljim 1
Chris G's Test Account 1
Scottywong 1
Timotheus Canens 1
Mifter 1
L235 1
Triplestop 1
Hersfold 1
Billinghurst 1
SQL 1
DMacks 1
Xeno 1
Name: af_user_text, Length: 77, dtype: int64
%% Cell type:markdown id: tags:
## Vandalism
We may be interested in how the notion of vandalism has changed over the years. To that end, it may be worth looking at which filters have "vandalism" in their public description (and were tagged as "vandalism") and what they do.
%% Cell type:code id: tags:
``` python
```
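A possible starting point for this inquiry is sketched below. It is an assumption-laden illustration, not the analysis itself: the helper name `vandalism_filters_by_year` and the toy rows are made up, and `af_timestamp` is, as far as I know, the time a filter was last modified rather than created, so the yearly counts are only a rough proxy.

```python
import pandas as pd


def vandalism_filters_by_year(filters: pd.DataFrame) -> pd.Series:
    """Count filters that mention and are tagged as vandalism, per year."""
    described = (
        filters["af_public_comments"].fillna("").str.contains("vandalism", case=False)
    )
    # Substring match: also catches "vandalism?" and tags like "page_move_vandalism"
    tagged = filters["manual_tags"].fillna("").str.contains("vandalism", regex=False)
    subset = filters[described & tagged]
    # af_timestamp has the form YYYYMMDDHHMMSS; the first four digits are the year
    years = subset["af_timestamp"].astype(str).str[:4]
    return years.value_counts().sort_index()


# Toy rows for illustration only; the real input would be `df` from above
sample = pd.DataFrame(
    {
        "af_public_comments": [
            "Feedback: Common Vandalism 5",
            "Extremely long words",
            "Vandalism in all caps",
            "Page move vandalism",
        ],
        "manual_tags": [
            "vandalism, harassment?",
            "vandalism?",
            "vandalism",
            "page_move_vandalism",
        ],
        "af_timestamp": [20130108151106, 20130424011002, 20130411173111, 20180101000000],
    }
)
counts = vandalism_filters_by_year(sample)
print(counts)
```

Grouping the same subset by `af_actions` instead of by year would show what these filters do.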
%% Cell type:markdown id: tags:
## Potential harassment
Another idea would be to classify filters according to the namespaces they cover. A filter targeting the talk or user namespaces may indicate that it deals with personal attacks or harassment.
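The namespace idea could be sketched roughly as follows. Several things here are assumptions: the annotated CSV shown above has no pattern column, so the filter patterns would have to come from elsewhere (for example an `af_pattern` column in the raw export); the helper names and example patterns are made up; and only simple equality checks on the AbuseFilter variables `page_namespace` and its older name `article_namespace` are recognised.

```python
import re

# Standard MediaWiki namespace numbers for discussion and user pages
DISCUSSION_NAMESPACES = {1: "Talk", 2: "User", 3: "User talk"}

# Matches checks such as `page_namespace == 3` or `article_namespace = 1`
NAMESPACE_CHECK = re.compile(r"\b(?:page|article)_namespace\s*={1,3}\s*(\d+)")


def namespaces_in_pattern(pattern: str) -> set:
    """Collect the namespace numbers a filter pattern compares against."""
    return {int(number) for number in NAMESPACE_CHECK.findall(pattern)}


def targets_discussion_spaces(pattern: str) -> bool:
    """True if the pattern tests for a talk, user or user talk namespace."""
    return bool(namespaces_in_pattern(pattern) & set(DISCUSSION_NAMESPACES))


# Hypothetical filter patterns, written for this example only
talk_filter = 'page_namespace == 3 & added_lines irlike "example"'
article_filter = "article_namespace == 0 & edit_delta < -5000"
print(targets_discussion_spaces(talk_filter))
print(targets_discussion_spaces(article_filter))
```

A regular expression is a blunt instrument here: it misses negated or list-based namespace conditions, so the result would be a first rough classification to check by hand.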
%% Cell type:markdown id: tags:
## Code snippets that may come in handy
%% Cell type:code id: tags:
``` python
# make a data frame out of list
ten_tags = manual_tags.head(10).str.split(", ", n = 1, expand = True).apply(pd.Series)
ten_tags = ten_tags.rename(columns = lambda x : 'tag_' + str(x))
ten_tags
```
%% Output
0 good_faith
1 vandalism
2 vandalism
3 vandalism
4 good_faith
5 good_faith
6 good_faith, lazyness
7 vandalism, good_faith
8 good_faith
9 good_faith
Name: manual_tags, dtype: object
tag_0 tag_1
0 good_faith None
1 vandalism None
2 vandalism None
3 vandalism None
4 good_faith None
5 good_faith None
6 good_faith lazyness
7 vandalism good_faith
8 good_faith None
9 good_faith None
%% Cell type:code id: tags:
``` python
raw_df.groupby('af_user_text').count()
```
%% Output
MusikAnimal 249
King of Hearts 91
Zzuuzz 81
Rich Farmbrough 61
Ruslik0 59
Prodego 45
Samwalton9 34
Cenarium 32
NawlinWiki 28
Xaosflux 27
Reaper Eternal 25
Shirik 23
Beetstra 16
Dragons flight 15
Crow 13
Legoktm 11
Samtar 9
The Anome 9
Cyp 7
BethNaught 6
Ryan Kaldari (WMF) 5
BU Rob13 5
Oshwah 5
Kww 5
Od Mishehu 5
There'sNoTime 5
Elockid 4
Kuru 4
Materialscientist 4
Mlitn 4
...
This, that and the other 1
Nihiltres 1
Ritchie333 1
East718 1
Lustiger seth 1
Chris G 1
Tedder 1
Amalthea 1
Mindmatrix 1
DoRD 1
Avraham 1
Georgewilliamherbert 1
EdoDodo 1
Darkwind 1
Rschen7754 1
Jackmcbarn 1
Nakon 1
Slakr 1
Smalljim 1
Chris G's Test Account 1
Scottywong 1
Timotheus Canens 1
Mifter 1
L235 1
Triplestop 1
Hersfold 1
Billinghurst 1
SQL 1
DMacks 1
Xeno 1
Name: af_user_text, Length: 77, dtype: int64
......
import sys
import pandas as pd
from mw import database
def read_filters(filepath):
@@ -21,6 +22,25 @@ def get_filters_actions(in_file, out_file):
    df_update.to_csv(out_file, sep='\t')
    #print(df[['af_id', 'af_hidden', 'af_actions', 'af_hit_count', 'af_public_comments']])
def download_db_table():
    db = database.DB.from_params(
        host="analytics-store.eqiad.wmnet",
        read_default_file="~/.my.cnf",
        user="research",
        db="enwiki"
    )
    users = db.users.query(
        registered_after="20140101000000",
        direction="newer",
        limit=10
    )
    for user in users:
        print("{user_id}:{user_name} -- {user_editcount} edits".format(**user))
'''
main
'''
......