diff --git a/article/proceedings.tex b/article/proceedings.tex
index 5dd59e79c28141fc24f2becc9babe074310b3db3..4167ad8d30f5241f2a0a3cdb6166ac25e273eecf 100644
--- a/article/proceedings.tex
+++ b/article/proceedings.tex
@@ -725,13 +725,16 @@ The edit is not saved.
  \textbf{Interesting questions}
  \begin{itemize}
- \item how many filters are there (were there over the years): 954 filters (stand: 06.01.2019); TODO: historically?
+ \item how many filters are there (were there over the years): 954 filters (stand: 06.01.2019); TODO: historically?; This includes deleted filters
  \item what do the most active filters do?: see~\ref{tab:most-active-actions}
  \item get a sense of what gets filtered (more qualitative): TODO: refine after sorting through manual categories; preliminary: vandalism; unintentional suboptimal behavior from new users who don't know better ("good faith edits") such as blanking an article/section; creating an article without categories; adding larger texts without references; large unwikified new article (180); or from users who are too lazy (to write proper edit summaries; editing behaviours and styles not suitable for an encyclopedia (poor grammar/not commiting to orthography norms; use of emoticons and !; ascii art?); "unexplained removal of sourced content" (636) may be an attempt to silence a view point the editor doesn't like; self-promotion(adding unreferenced material to BLP; "users creating autobiographies" 148;); harassment; sockpuppetry; potential copyright violations; that's more or less it actually. There's a third bigger cluster of maintenance stuff, such as tracking bugs or other problems, trying to sort through bot edits and such. For further details see the jupyter notebook.
+ Interestingly, there was a guideline somewhere stating that no trivial behaviour should trip filters (e.g. starting every paragraph with a small letter;) I actually think, a bot fixing this would be more appropriate.
  \item has the willingness of the community to use filters increased over time?: looking at aggregated values of number of triggered filters per year, the answer is rather it's quite constant; TODO: plot it at a finer granularity
+ when aggregating filter triggers per month, one notices that there's an overall slight upward tendency.
+ Also, there is a dip in the middle of 2014 and a notable peak at the beginning of 2016, that should be investigated further.
  \item how often were (which) filters triggered: see \url{filter-lists/20190106115600_filters-sorted-by-hits.csv} and~\ref{tab:most-active-actions}; see also jupyter notebook for aggregated hitcounts over tagged categories
  \item percentage of triggered filters/all edits; break down triggered filters according to typology: TODO still need the complete abuse\_filter\_log table!; and probably further dumps in order to know total number of edits
- \item percentage filters of different types over the years: TODO according to actions (I need a complete abuse\_filter\_log table for this!); according to self-assigned tags (finish tagging!)
+ \item percentage filters of different types over the years: according to actions (I need a complete abuse\_filter\_log table for this!); according to self-assigned tags %TODO plot!
  \item what gets classified as vandalism? has this changed over time? TODO: (look at words and patterns triggered by the vandalism filters; read vandalism policy page); pay special attention to filters labeled as vandalism by the edit filter editors (i.e. in the public description) vs these I labeled as vandalism
  \end{itemize}
@@ -755,7 +758,7 @@ The edit is not saved.
  \textbf{Questions on abuse\_filter\_log table}
  \begin{itemize}
  \item how often were filters with different actions triggered? (afl\_actions)
- \item what types of users trigger the filters (IPs? registered?)
+ \item what types of users trigger the filters (IPs? registered?): IPs: 16,489,266, logged in users: 6,984,897 (Stand 15.03.2019);
  \item on what articles filters get triggered most frequently (afl\_title)
  \item what types of user actions trigger filters most frequently? (afl\_action) (edit, delete, createaccount, move, upload, autocreateaccount, stashupload)
  \item in which namespaces get filters triggered most frequently?
diff --git a/src/explore.ipynb b/src/explore.ipynb
index b5330554cb455cd0f5df72221d5f3665275f0ecf..ba599cf0d143130bfbdfd3a099501febf00b094c 100644
--- a/src/explore.ipynb
+++ b/src/explore.ipynb
@@ -42,7 +42,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 3,
+ "execution_count": 11,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -311,6 +311,192 @@
  "# \"The group this filter belongs to, as defined in $wgAbuseFilterValidGroups.\" still don't get it"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Timestamp\n",
+ "\n",
+ "Have a lot of filters been modified for the last time recently?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "af_timestamp\n",
+ "2009-03-31 29\n",
+ "2009-04-30 19\n",
+ "2009-05-31 11\n",
+ "2009-06-30 8\n",
+ "2009-07-31 8\n",
+ "2009-08-31 21\n",
+ "2009-09-30 21\n",
+ "2009-10-31 6\n",
+ "2009-11-30 13\n",
+ "2009-12-31 3\n",
+ "2010-01-31 9\n",
+ "2010-02-28 9\n",
+ "2010-03-31 10\n",
+ "2010-04-30 5\n",
+ "2010-05-31 5\n",
+ "2010-06-30 1\n",
+ "2010-07-31 0\n",
+ "2010-08-31 10\n",
+ "2010-09-30 5\n",
+ "2010-10-31 0\n",
+ "2010-11-30 4\n",
+ "2010-12-31 1\n",
+ "2011-01-31 2\n",
+ "2011-02-28 0\n",
+ "2011-03-31 23\n",
+ "2011-04-30 2\n",
+ "2011-05-31 3\n",
+ "2011-06-30 1\n",
+ "2011-07-31 0\n",
+ "2011-08-31 2\n",
+ "2011-09-30 1\n",
+ "2011-10-31 0\n",
+ "2011-11-30 1\n",
+ "2011-12-31 0\n",
+ "2012-01-31 0\n",
+ "2012-02-29 10\n",
+ "2012-03-31 0\n",
+ "2012-04-30 0\n",
+ "2012-05-31 0\n",
+ "2012-06-30 1\n",
+ "2012-07-31 1\n",
+ "2012-08-31 36\n",
+ "2012-09-30 2\n",
+ "2012-10-31 1\n",
+ "2012-11-30 1\n",
+ "2012-12-31 2\n",
+ "2013-01-31 4\n",
+ "2013-02-28 1\n",
+ "2013-03-31 2\n",
+ "2013-04-30 21\n",
+ "2013-05-31 0\n",
+ "2013-06-30 1\n",
+ "2013-07-31 2\n",
+ "2013-08-31 0\n",
+ "2013-09-30 0\n",
+ "2013-10-31 2\n",
+ "2013-11-30 0\n",
+ "2013-12-31 0\n",
+ "2014-01-31 13\n",
+ "2014-02-28 0\n",
+ "2014-03-31 9\n",
+ "2014-04-30 3\n",
+ "2014-05-31 0\n",
+ "2014-06-30 1\n",
+ "2014-07-31 0\n",
+ "2014-08-31 3\n",
+ "2014-09-30 1\n",
+ "2014-10-31 0\n",
+ "2014-11-30 0\n",
+ "2014-12-31 3\n",
+ "2015-01-31 4\n",
+ "2015-02-28 33\n",
+ "2015-03-31 2\n",
+ "2015-04-30 3\n",
+ "2015-05-31 2\n",
+ "2015-06-30 12\n",
+ "2015-07-31 9\n",
+ "2015-08-31 8\n",
+ "2015-09-30 2\n",
+ "2015-10-31 1\n",
+ "2015-11-30 3\n",
+ "2015-12-31 8\n",
+ "2016-01-31 5\n",
+ "2016-02-29 3\n",
+ "2016-03-31 2\n",
+ "2016-04-30 6\n",
+ "2016-05-31 2\n",
+ "2016-06-30 93\n",
+ "2016-07-31 8\n",
+ "2016-08-31 43\n",
+ "2016-09-30 21\n",
+ "2016-10-31 7\n",
+ "2016-11-30 5\n",
+ "2016-12-31 7\n",
+ "2017-01-31 9\n",
+ "2017-02-28 7\n",
+ "2017-03-31 6\n",
+ "2017-04-30 25\n",
+ "2017-05-31 30\n",
+ "2017-06-30 4\n",
+ "2017-07-31 3\n",
+ "2017-08-31 1\n",
+ "2017-09-30 4\n",
+ "2017-10-31 10\n",
+ "2017-11-30 3\n",
+ "2017-12-31 6\n",
+ "2018-01-31 6\n",
+ "2018-02-28 2\n",
+ "2018-03-31 7\n",
+ "2018-04-30 21\n",
+ "2018-05-31 3\n",
+ "2018-06-30 3\n",
+ "2018-07-31 9\n",
+ "2018-08-31 9\n",
+ "2018-09-30 11\n",
+ "2018-10-31 42\n",
+ "2018-11-30 34\n",
+ "2018-12-31 32\n",
+ "2019-01-31 25\n",
+ "Freq: M, dtype: int64\n"
+ ]
+ }
+ ],
+ "source": [
+ "df_origin['af_timestamp'] = pd.to_datetime(df_origin['af_timestamp'], format=\"%Y%m%d%H%M%S\")\n",
+ "\n",
+ "#df_modified = df_origin['af_timestamp'].groupby([df_origin.af_timestamp.dt.to_period(\"M\")]).agg('count')\n",
+ "#df_modified\n",
+ "df_modified = df_origin.set_index('af_timestamp').resample(\"M\").size()\n",
+ "\n",
+ "with pd.option_context('display.max_rows', None, 'display.max_columns', None):\n",
+ " print (df_modified)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.xlabel('Month')\n",
+ "plt.ylabel('Num filters')\n",
+ "plt.plot(df_modified, 'bo')\n",
+ "plt.show()\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Filter hits over the years"
+ ]
+ },
  {
  "cell_type": "code",
  "execution_count": 5,