\documentclass[sigconf,review=False,anonymous=False,natbib=True,screen=True,balance=true]{acmart} <>= f <- function (x) {formatC(x, format="d", big.mark=',')} r <- readRDS("remember_sample_quality_labels.RDS") r2 <- readRDS("ordinal.quality.model.RDS") r3 <- readRDS("ordinal.quality.analysis.RDS") format.percent <- function(x) {paste0(formatC(x*100,format="d",big.mark=','),"\\%")} r.articles <- r3 r4 <- readRDS("ordinal.quality.analysis.noweights.RDS") r.noweights <- r4 r.revisions <- readRDS("ordinal.quality.analysis_revisions.RDS") wp10dict <- rev(c("fa"="Featured","ga"="Good","b"="B",'c'='C','start'="Start",'stub'="Stub")) options(scipen=999) library(xtable) options(xtable.floating = FALSE, xtable.timestamp = '', xtable.include.rownames=FALSE, math.style.negative=TRUE, booktabs = TRUE, xtable.format.args=list(big.mark=','), xtable.sanitize.text.function=identity # tikzDefaultEngine='xetex' ) library(data.table) library(knitr) library(tikzDevice) library(ggplot2) theme_set(theme_bw()) library(ggdist) knit_hooks$set(document = function(x) { sub('\\usepackage[]{color}', '\\usepackage[]{color}', x, fixed = TRUE) }) opts_chunk$set(fig.='pdf') opts_chunk$set(dev='pdf') opts_chunk$set(external=TRUE) # knitr::clean_cache() opts_chunk$set(cache=FALSE) overwrite <- FALSE @ \definecolor{c77a1d2}{RGB}{119,161,210} \definecolor{bf9837}{RGB}{191,152,55} \definecolor{cc0c0c0}{RGB}{192,192,192} \def \globalscale {0.2} \usepackage[american]{babel} %\usepackage{wrapfig} \usepackage{tikz} \usepackage{booktabs} \usepackage{multicol} \usepackage{subcaption} %% %% \BibTeX command to typeset BibTeX logo in the docs \AtBeginDocument{% \providecommand\BibTeX{{% \normalfont B\kern-0.5em{\scshape i\kern-0.25em b}\kern-0.8em\TeX}}} %% Rights management information. This information is sent to you %% when you complete the rights form. These commands have SAMPLE %% values in them; it is your responsibility as an author to replace %% the commands and values with those provided to you when you %% complete the rights form. \copyrightyear{2021} \acmYear{2021} \setcopyright{acmlicensed}\acmConference[OpenSym 2021]{17th International Symposium on Open Collaboration}{September 15--17, 2021}{Online, Spain} \acmBooktitle{17th International Symposium on Open Collaboration (OpenSym 2021), September 15--17, 2021, Online, Spain} \acmPrice{15.00} \acmDOI{10.1145/3479986.3479991} \acmISBN{978-1-4503-8500-8/21/09} %% Submission ID. %% Use this when submitting an article to a sponsored event. You'll %% receive a unique submission ID from the organizers %% of the event, and this ID should be used as the parameter to this command. %%\acmSubmissionID{123-A56-BU3} %% %% The majority of ACM publications use numbered citations and %% references. The command \citestyle{authoryear} switches to the %% "author year" style. %% %% If you are preparing content for an event %% sponsored by ACM SIGGRAPH, you must use the "author year" style of %% citations and references. %% Uncommenting %% the next command will enable that style. %%\citestyle{acmauthoryear} %\usepackage{subfig} \usepackage{xcolor} \usepackage{colortbl} \definecolor{mygreen}{HTML}{43bf71} \usepackage{subcaption} \def\citepos#1{{\hypersetup{citecolor=black}\citeauthor{#1}}'s \citep{#1}} \def\citespos#1{{\hypersetup{citecolor=black}\citeauthor{#1}}' \citep{#1}} \let\oldciteauthor=\citeauthor \def\citeauthor#1{{\hypersetup{citecolor=black}\oldciteauthor{#1}}} %% \usepackage[htt]{hyphenat} \usepackage{amsmath} %% end of the preamble, start of the body of the document source. 
\hyphenation{Wi-ki-pe-di-a}
\begin{document}
% \baselineskip 24pt
%%
%% The "title" command has an optional parameter,
%% allowing the author to define a "short title" to be used in page headers.
%% Sneha suggests changing the title so that it makes reference to ORES.
\title[Measuring Article Quality]{Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression}
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
%% Of note is the shared affiliation of the first two authors, and the
%% "authornote" and "authornotemark" commands
%% used to denote shared contribution to the research.
\author{Nathan TeBlunthuis}
\email{nathante@uw.edu}
\orcid{0000-0002-3333-5013}
\affiliation{%
  \institution{University of Washington}
  \streetaddress{Box 353740}
  \city{Seattle}
  \state{Washington}
  \country{USA}
  \postcode{98195}
}
%%
%% By default, the full list of authors will be used in the page
%% headers. Often, this list is too long, and will overlap
%% other information printed in the page headers. This command allows
%% the author to define a more concise list
%% of authors' names for this purpose.
\renewcommand{\shortauthors}{TeBlunthuis}
%%
%% The abstract is a short summary of the work to be presented in the
%% article.
%% Abstract 150 words
\begin{abstract}
% Most explanations of changes in online group size focus on internal factors like social structures or design decisions.
% do not make the , and render critical questions like “which other groups are a given group's strongest competitors or mutualists?” unanswerable.
Organizing complex peer production projects and advancing scientific knowledge of open collaboration each depend on the ability to measure quality. Wikipedia community members and academic researchers have used article quality ratings for purposes like tracking knowledge gaps and studying how political polarization shapes collaboration. Even so, measuring quality presents many methodological challenges. The most widely used systems use quality assessments on discrete ordinal scales, but such labels can be inconvenient for statistics and machine learning. Prior work handles this by assuming that different levels of quality are ``evenly spaced'' from one another. This assumption runs counter to intuitions about the degrees of effort needed to raise Wikipedia articles to different quality levels.
Furthermore, models from prior work are fit to datasets that oversample high-quality articles. This limits their accuracy for representative samples of articles or revisions. I describe a technique extending the Wikimedia Foundation's ORES article quality model to address these limitations. My method uses weighted ordinal regression models to construct one-dimensional continuous measures of quality. While scores from my technique and from prior approaches are correlated, my approach improves accuracy for research datasets and provides evidence that the ``evenly spaced'' assumption is unfounded in practice on English Wikipedia. I conclude with recommendations for using quality scores in future research and include the full code, data, and models.
\end{abstract}

\begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10003120.10003130.10003233.10003301</concept_id>
<concept_desc>Human-centered computing~Wikis</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10003120.10003130.10011762</concept_id>
<concept_desc>Human-centered computing~Empirical studies in collaborative and social computing</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10003120.10003130.10003131.10003234</concept_id>
<concept_desc>Human-centered computing~Social content sharing</concept_desc>
<concept_significance>400</concept_significance>
</concept>
</ccs2012>
\end{CCSXML}

\ccsdesc[500]{Human-centered computing~Collaborative and social computing theory, concepts and paradigms}
\ccsdesc[400]{Human-centered computing~Social content sharing}
\ccsdesc[500]{Human-centered computing~Computer supported cooperative work}

\keywords{sociotechnical systems, measurement, statistics, quality, machine learning, peer production, Wikipedia, online communities, methods, datasets}

%% The code below is generated by the tool at http://dl.acm.org/ccs.cfm.
%% Please copy and paste the code instead of the example below.
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
% \keywords{datasets, neural networks, gaze detection, text tagging}
%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.
\maketitle

% UNCOMMENT BELOW FOR PROOF READING
% \fontsize{12pt}{24pt}
% \selectfont
% \baselineskip 24pt

%% We're going for a "known puzzle" + "clarifying confusion" framing
%% Remember to frame around the depvar
\section{Introduction}
\label{sec:introduction}

% LATEX NOTE: This alphabet below is here so we can measure the line-length of
% different layouts. Typesetters suggest that an average line-length of
% between 45-90 characters and a rule of thumb for typesetting is that you
% should be able to fit between 2-3 alphabets on one line. Generally speaking,
% the shorter the line length, the better -- and the smaller the linespacing
% can become. The following line is 3 alphabets (73 characters).
% Kaylea suggests adding "support learning" to the motivation in reference to how wikiedu uses the ORES quality measures.
% This first paragraph is very Wikipedia-centric.
Measuring content quality in peer production projects like Wikipedia is important so projects can learn about themselves and track progress. Measuring quality also helps build confidence that information is accurate and supports monitoring how well an encyclopedia includes diverse subject areas to identify gaps needing attention \cite{redi_taxonomy_2021}. Measuring quality enables tracking and evaluating the progress of subprojects and initiatives organized to fill the gaps \cite{halfaker_interpolating_2017, warncke-wang_success_2015}. Raising an article to a high standard of quality is a recognized achievement among contributors, so assessing quality can help motivate contributions \cite{ayers_how_2008,forte_why_2005}. In these ways, measuring quality can be of key importance to advancing the priorities of the Wikimedia movement and is also important to other kinds of open collaboration \cite{champion_underproduction_2021}.

Measuring quality also presents methodological and ontological challenges. How can ``quality'' be conceptualized so that measurement of the goals of a project and the value it produces can be precise and accurate? Language editions of Wikipedia, including English, peer produce quality labels that have been useful both for motivating and coordinating project work and for enabling research.
Epistemic virtues of this approach stem from the community-constructed criteria for assessment and from formalized procedures for third-party evaluation organized by WikiProjects. These systems also have two important limitations: (1) ratings are likely to lag behind changes in article quality, and (2) quality is assessed on a discrete ordinal scale, which violates typical assumptions in statistical analysis. Both limitations are surmountable. The machine learning framework introduced by \citeauthor{warncke-wang_tell_2013} \cite{warncke-wang_tell_2013}, further developed by \citeauthor{halfaker_interpolating_2017} \cite{halfaker_interpolating_2017}, implemented by the Objective Revision Evaluation Service\footnote{\url{https://www.mediawiki.org/wiki/ORES} (\url{https://perma.cc/TH6L-KFT6})} (ORES) article quality models and adopted by several research studies of Wikipedia article quality \cite[e.g.][]{halfaker_ores_2020, kocielnik_reciprocity_2018, shi_wisdom_2019, warncke-wang_success_2015} was designed to address the first limitation by using article assessments at the time they were made as ``ground truth.'' Article quality might drift in the periods between assessments, but it seems safe to assume that new quality assessments are accurate at the time they are made. A model trained on recent assessments can predict what quality label an article would receive if assessed in its current state. %In this paper, I build on these models to address the second limitation by developing a one-dimensional measurement of article quality that does not assume that the quality levels are evenly spaced. This paper introduces a method for constructing interpretable one-dimensional measures of article quality from Wikipedia quality assessments and the ORES article quality model. The method improves upon prior approaches in two important ways. First, by using inverse probability weighting to calibrate the model, it is more accurate for typical research applications, and second, it does not depend on the assumption that quality levels are ``evenly spaced,'' which threatens the validity of prior research \cite{halfaker_interpolating_2017, arazy_evolutionary_2019}. In addition, this paper helps us understand the validity of previous work by analyzing the performance of the ORES quality model and testing the ``evenly spaced'' assumption. In §\ref{sec:background}, I provide a brief overview of quality measurement in peer production research in which I foreground the importance of the assumptions needed to use machine learning predictions in downstream analysis---particularly the ``evenly spaced'' assumption used by \citeauthor{halfaker_interpolating_2017} \cite{halfaker_interpolating_2017} to justify the use of a handpicked weighted sum to combine article class probabilities. Next, in §\ref{sec:methods}, I describe how to build accurate ordinal quality models that are appropriately calibrated for analyses of representative samples of Wikipedia articles or revisions. I also briefly explain how ordinal regression provides an interpretable one-dimensional measure of quality and how it relaxes the ``evenly spaced'' assumption. Finally, in §\ref{sec:results} I present the results of my analysis to (1) show how the precision of the measurement depends on proper calibration and (2) demonstrate that the ``evenly spaced'' assumption is violated. 
Despite this, I find that scores from the ordinal models are highly correlated with those from prior work, so the ``evenly spaced'' assumption may be acceptable in some applications. I conclude in §\ref{sec:discussion} with recommendations for measuring article quality in future research.

\section{Background}
\label{sec:background}

% first point: measuring quality can help peer production projects
% second point: measuring quality can help science
% Mako thinks this is cute and it's fine to keep it but the bit about freezing mercury in the discussion takes it a bit far.
Measurement is important to science as available knowledge often constrains the development of improved tools for advancing knowledge. For example, in the book \textit{Inventing Temperature}, Hasok \citeauthor{chang_inventing_2004} \cite{chang_inventing_2004}, the philosopher and historian of science, documents how extending theories of heat beyond the range of human sense perception required scientists to develop new types of thermometers. This in turn required better knowledge of heat and of thermometric materials such as the freezing point of mercury. Part of the challenge of scientific advancement is that measurement devices developed under certain conditions may give unexpected results outside of the range in which they are calibrated: a thermometer will give impossibly low temperature readings when its mercury unexpectedly freezes. Today, machine learning models are used to extend the range of quality measurements in peer production research, but state-of-the-art machine learning models can be quite sensitive to the nuances of how their training data are selected \cite{recht_imagenet_2019}.
% This project introduces a new measurement device for measuring article quality and provides assurance that the measurement is reasonably accurate over the range of a given dataset.

\subsection{Measuring Quality in Peer Production}

As described in §\ref{sec:introduction}, measuring quality has been of great importance to peer production projects like Wikipedia and to the construction of knowledge about how such projects work. The foundation of article quality measurement in Wikipedia has been the peer production of article quality assessment organized by WikiProjects, which develop criteria for articles in their domain \cite{phoebe_ayers_how_2008}. This enables quality assessment to be consistent across different subject areas, but the procedures for assessing quality are tailored to the values of each WikiProject. Yet, like human sense perception of temperature, these quality assessments are limited in that they require human time and attention. In addition, humans' limited ability to discriminate between levels on a scale limits the sensitivity of quality assessments. Articles are assessed irregularly and infrequently at the discretion of volunteer editors. Therefore, for most article revisions, it is not known what quality class the article would be assigned if it were newly assessed.

% This paragraph is a bit lit reviewy and nonessential to the argument. Cut or rework.
Researchers have proposed many ideas to extend the range of quality measurement beyond the direct perception of Wikipedians, such as page length \cite{blumenstock_size_2008}, persistent word revisions \cite{adler_content-driven_2007, biancani_measuring_2014}, collaboration network structures \cite{raman_classifying_2020}, and template-based flaw detection \cite{anderka_predicting_2012}.
Carefully constructed indexes benchmarked against English language Wikipedia quality assessments might allow quality measurement of articles that have not been assessed or in projects that have underproduced article assessments \cite{lewoniewski_relative_2017}. However, such indexes may lack emic validity if they fail to capture important aspects of quality or if notions of quality vary between linguistic communities, and they might even shape editing activity in unexpected ways that could ultimately defeat their purpose \cite{goodhart_problems_1984,strathern_improving_1997}. Peer-produced quality labels depend on the limited capacity of volunteer communities to coordinate quality assessment, but they also provide impressive validity for evaluating projects on their own terms.

\subsection{Article Quality Models Extend Measurement to Unassessed Articles}

Perhaps the most successful approaches to extending the range of quality measurements use machine learning models trained on available article quality assessments to predict the quality of revisions that have not been assessed. The ORES article quality model (henceforth ORES) implements this approach, but other similar article quality predictors have been developed \cite{anderka_breakdown_2012,dang_quality_2016,zhang_history-based_2018,druck_learning_2008,sarkar_stre_2019,raman_classifying_2020}, and additional features, including those based on language models, can substantially improve classification performance compared to ORES \cite{schmidt_article_2019}.

The ORES model is a tree-based classifier that predicts the quality class of a Wikipedia article at the time it is assessed.\footnote{The system uses cross-validation to select among candidates that include random-forest and boosted decision tree models.} These tree-based models perform reasonably well for practical purposes, with a reported ability to predict within one level of the true quality class with 90\% accuracy (although in §\ref{sec:accuracy} I find a decline in accuracy in a more recent dataset). Yet, since these models do not account for the ordering of quality labels, the use of these predictions in downstream analysis introduces complicated methodological challenges.

The ORES classifiers are fit using \texttt{scikit-learn}\footnote{\url{https://scikit-learn.org/stable/} (\url{https://perma.cc/5Y8B-W8T5})} through minimization of the multinomial deviance shown in Eq. \ref{eq:multinomial.loglik} \cite{pedregosa_scikit-learn_2011,hastie_elements_2018}:

% = -\sum_{k=1}^{K}I(y=\mathcal{G}_k)f_k(x) + log(sum_{l=1}^K(e^{f_l(x)}))
\begin{equation}
L(y_i,p(x_i)) = -\sum_{k=1}^K{I(y_i=\mathcal{G}_{k})\mathrm{log}~p_k(x_i)}
\label{eq:multinomial.loglik}
\end{equation}

\noindent For each article $i$ with predictors $x_i$ that has been labeled with a quality class $y_i$, the ORES model outputs an estimated probability $p_k(x_i)$ that the article belongs to each quality class $k \in \{\mathrm{\textit{stub}}, \mathrm{\textit{start}}, \mathrm{\textit{C-class}}, \mathrm{\textit{B-class}}, \mathrm{\textit{Good article (GA)}}, \mathrm{\textit{Featured article (FA)}}\}$. The predicted probabilities $p(x_i)$ sum to one, so the ORES model outputs a probability vector for each article. The indicator function $I(y_i=\mathcal{G}_{k})$ equals $1$ only for the quality class $k$ that matches the true label $y_i$ and equals $0$ otherwise, so the loss $L(y_i,p(x_i))$ is the negative log of the predicted probability of the correct class.
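To make Eq.~\ref{eq:multinomial.loglik} concrete, the following minimal sketch computes the loss for a single labeled article; the probability vector and label are hypothetical values chosen for illustration, not actual ORES output.

<<multinomial-loss-sketch, eval=FALSE, echo=TRUE>>=
## Hypothetical ORES-style probability vector for one article.
## The values are illustrative only and sum to one.
p <- c(stub = 0.05, start = 0.15, c = 0.40, b = 0.25, ga = 0.10, fa = 0.05)
y <- "c"  # quality class assigned by a Wikipedian

## Eq. (1): only the predicted probability of the true class contributes,
## because the indicator I(y_i = k) is zero for every other class.
loss <- -log(p[[y]])
loss
@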
Note that this model does not use the fact that article quality classes are ordered. If it did, then it would have to penalize an incorrect classification of a \textit{Good article} as \textit{C-class} more than a classification of a \textit{Good article} as \textit{B-class}. In this model, different quality classes have no intrinsic rank or ordering and thus are akin to different categories of article subjects like animals, vegetables, or minerals.

The most probable quality class (MPQC) is perhaps the most natural way to use the ORES output to measure quality. It has been used in several studies, including to provide evidence that politically polarized collaboration on Wikipedia leads to high-quality articles \cite{shi_wisdom_2019} and to understand the relationship between article quality and donation \cite{kocielnik_reciprocity_2018}. However, the MPQC is limited in that it does not measure quality differences between articles that have the same MPQC. Consider two hypothetical articles; the first has the multinomial prediction $(0.1,0.3,0.4,0.075,0.075,0)$ and the second has the prediction $(0.075,0.075,0.4,0.3,0.1,0)$. The MPQC will assign both the \textit{C-class} label even though the first article has the same chance of being a \textit{Stub} or \textit{Start-class} as the second article's chance of being a \textit{B-class} or even a \textit{Good article}. At best, the MPQC has limited sensitivity to subtle variations or gradual changes in quality \cite{halfaker_interpolating_2017}.

\subsection{Combining Scores for Granular Measurement}

To further extend the range of article quality measurement within article quality classes, \citeauthor{halfaker_interpolating_2017} \cite{halfaker_interpolating_2017} constructed a numerical quality score using a linear combination (a weighted sum) of the elements of the multinomial prediction $p(x_i)$. This is advantageous from a statistical perspective as it naturally provides a continuous measure of quality, which can typically justify a normal or log-normal statistical model. It can also support higher-order aggregations for measuring the quality of a set of articles \cite{halfaker_interpolating_2017}. \citeauthor{halfaker_interpolating_2017} handpicks the coefficients $[0,1,2,3,4,5]$ to make a linear combination of the predictions under the assumption ``that the ordinal quality scale developed by Wikipedia editors is roughly cardinal and evenly spaced,'' which I refer to as the ``evenly spaced'' assumption. It essentially says that a \textit{Start-class} article has one more unit of quality than a \textit{Stub-class} article, that a \textit{C-class} article has one more unit of quality than a \textit{Start-class} article, and so on. This approach is being adopted by other researchers, including \citeauthor{arazy_evolutionary_2019} \cite{arazy_evolutionary_2019}.

The considerable degree of effort and expertise required to raise articles to higher levels of quality casts doubt on this assumption \cite{jemielniak_common_2014}. Higher quality levels correspond to increasing completeness, encyclopedic character, usefulness to wider audiences, incorporation of multimedia, polished citations, and adherence to Wikipedia's policies.
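Returning to the two hypothetical articles above, a short sketch contrasts the MPQC with the weighted sum built from \citeauthor{halfaker_interpolating_2017}'s handpicked coefficients; the probability vectors are those from the example, and the computation is only illustrative.

<<mpqc-vs-weighted-sum-sketch, eval=FALSE, echo=TRUE>>=
classes <- c("stub", "start", "c", "b", "ga", "fa")
## The two hypothetical multinomial predictions from the example above.
p1 <- c(0.1, 0.3, 0.4, 0.075, 0.075, 0)
p2 <- c(0.075, 0.075, 0.4, 0.3, 0.1, 0)

## The MPQC assigns both articles the C-class label.
classes[which.max(p1)]  # "c"
classes[which.max(p2)]  # "c"

## The weighted sum under the "evenly spaced" assumption separates them.
coefs <- 0:5
sum(coefs * p1)  # 1.625
sum(coefs * p2)  # 2.175
@

Both articles receive the same MPQC, but the weighted sum separates them, which is exactly the additional granularity the ``evenly spaced'' score aims to provide.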
The English language Wikipedia editing guideline on content assessment\footnote{\url{https://en.wikipedia.org/w/index.php?title=Wikipedia:Content_assessment&oldid=1023695750} (\url{https://perma.cc/2JUV-6SD})} defines a \textit{Good article} as ``useful to nearly all readers, with no obvious problems'' and a \textit{Featured article} as ``professional, outstanding and thorough.'' According to Wikipedians, it can take ``three to six months of full time work'' to write a \emph{Featured article}.\footnote{Public statement by Stuart Yeates, an expert Wikipedian; quoted with permission. \url{https://lists.wikimedia.org/hyperkitty/list/wiki-research-l@lists.wikimedia.org/message/7U35LHAXRWEPABN75DOTPOIEA2VYCTQQ/} (\url{https://perma.cc/9V4P-WRXR})} Are we to assume that the difference in quality between a \textit{Good article} and a \textit{Featured article} is measurably the same as that between a \textit{Stub}, defined as ``little more than a dictionary definition,'' and a \textit{Start-class} article that is ``a very basic description of the topic?'' How could we even answer this question?
%This paper provides a methodology to answer it, but the answer depends on how quality is measured.

If the ``evenly spaced'' assumption is reasonable, then \citeauthor{halfaker_interpolating_2017}'s \cite{halfaker_interpolating_2017} weighted sum approach is too. But if increasing Wikipedia article classes do not represent roughly equal improvements in quality, this may threaten the accuracy of analyses dependent on the assumption. Suppose that a \textit{B-class} article has not one but two units of quality more than a \textit{C-class} article; then \citeauthor{halfaker_interpolating_2017} could have underestimated the reduction of the knowledge gap about women scientists, which was driven considerably by improvements to \textit{B-class} articles. In the next section, I provide a straightforward extension of the ORES article quality model based on ordinal regression that both relaxes the ``evenly spaced'' assumption and provides a better calibrated and more accurate one-dimensional measure of quality.
%I now describe my implementation of the approach.
I will then evaluate my model in terms of predictive accuracy, the spacing of quality levels, and comparison with prior approaches.

\section{Data, Methods and Measures}
\label{sec:methods}

%\citeauthor{halfaker_interpolating_2017} \cite{halfaker_interpolating_2017} constructed a one-dimensional measure of article quality using handpicked linear combination of the ORES category predictions assuming that quality levels are evenly spaced.
I use Bayesian ordinal regression models that use the ORES predicted probabilities to predict the quality class labels and to quantify the distance between quality classes. I now provide a brief overview of ordinal regression as needed to explain my approach to measuring quality. Understanding ordinal regression depends on background knowledge of odds and generalized linear models. I recommend \citeauthor{mcelreath_statistical_2018} \cite{mcelreath_statistical_2018} for reference.

\subsection{Bayesian Ordinal Regression}

Ordinal regression predicts quality class membership using a single linear model for all classes and identifies boundaries between classes using the log cumulative odds link function shown below in Eq. \ref{eq:ordinal.regression}. The log cumulative odds is not the only possible choice of link function, but it is the most common, is the easiest to interpret, and is appropriate here.
\begin{table*}
\caption{Numbers of articles and revisions, sample sizes, and regression weights for each quality level.}
<>=
## tab <- data.table(r$label_sample_counts)
## tab <- tab[,'label.raw':= factor(names(r$label_sample_counts),levels=c("WP10.stub","WP10.start","WP10.c","WP10.b","WP10.a","WP10.ga","WP10.fa"),ordered=TRUE)]
## tab <- tab[order(label.raw)]
## setnames(tab,old='V1',new='Population size')
## tab <- tab[,Label:=c("Stub","Start","C", "B","A","Good","Featured")]
## tab <- tab[,key:=c(1,2,3,4,5,6,7)]
## sample_counts <- data.table(r$sample_counts)[,'key':=wp10]
## tab <- tab[sample_counts,on='key']
## tab[,'Sample size':=talk_page_id]
tab <- r2$sample.weights
tab[,"Label":=c("Stub","Start","C","B","GA","FA")]
setnames(tab,old=c("n_articles","n_revisions","N","article_weight","revision_weight"),new=c("No. of articles","No. of revisions", "Sample size", "Article weights","Revision weights"))
print(xtable(tab[,c("Label","No. of articles","No. of revisions", "Sample size", "Article weights","Revision weights"),with=F],digits=2))
@
\label{tab:sample}
\end{table*}

\begin{align}
\mathrm{log}&~\frac{\mathrm{Pr}(y_i \le k)}{1 - \mathrm{Pr}(y_i \le k)} = \alpha_k - \phi_i \label{eq:ordinal.regression} \\
\phi_i &= B x_i \nonumber
\end{align}

\noindent As in Eq. \ref{eq:multinomial.loglik}, $y_i$ is the quality label for article $i$. The left-hand side of Eq. \ref{eq:ordinal.regression} gives the log odds that $y_i$ is less than or equal to quality level $k$. The ordinal quality measure is given by a linear model $\phi_i = B x_i$ ($x_i$ is a vector of transformed ORES scores for article $i$). Key to interpreting $\phi_i$ as a quality measure are the intercept parameters $\alpha_k$ for each quality level $k$. The log cumulative odds (the log odds that article $i$ has quality less than or equal to $k$) are given by the difference between the intercept and the linear model, $\alpha_k - \phi_i$. Therefore, if $\phi_i = \alpha_k$, then the chances that $y_i \le k$ equal the chances that $y_i > k$. When $\phi_i$ is less than $\alpha_k$, the quality of article $i$ is probably less than or equal to quality level $k$. As $\phi_i - \alpha_k$ increases, so do the chances that article $i$ is of quality better than $k$. In this way, the threshold parameters $\alpha_k$ define quantitative article quality levels on the scale of the ordinal quality measure $\phi_i$.
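A minimal sketch shows how Eq.~\ref{eq:ordinal.regression} maps a latent quality score $\phi_i$ and the thresholds $\alpha_k$ to class probabilities; the threshold values are hypothetical, not estimates from the fitted models.

<<ordinal-link-sketch, eval=FALSE, echo=TRUE>>=
## Hypothetical thresholds alpha_k, one for each quality level except the
## highest, in increasing order on the latent quality scale.
alpha <- c(stub = -2.0, start = -0.5, c = 0.8, b = 2.0, ga = 3.5)
phi <- 1.2  # latent quality score phi_i for one article

## Eq. (2): Pr(y_i <= k) = inverse-logit(alpha_k - phi_i).
cum.prob <- plogis(alpha - phi)

## Class probabilities are differences of adjacent cumulative probabilities.
class.prob <- diff(c(0, cum.prob, 1))
names(class.prob) <- c("stub", "start", "c", "b", "ga", "fa")
round(class.prob, 3)
@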
\begin{figure*} <>= testdf <- r.noweights$test.d library(wCorr) library(purrr) library(modi) levls <- c("stub","start","c","b","ga","fa") ores.noweight <- data.table(prob.predicted=apply(testdf[,.(Stub,Start,C,B,GA,FA)],2,mean), var.predicted=apply(testdf[,.(Stub,Start,C,B,GA,FA)],2,var), wp10=levls, weighttype="No weight") ores.article <- data.table(prob.predicted=apply(testdf[,.(Stub,Start,C,B,GA,FA)],2,partial(weighted.mean, w=testdf$article_weight)), var.predicted=apply(testdf[,.(Stub,Start,C,B,GA,FA)],2,partial(weighted.var, w=testdf$article_weight)), wp10=levls, weighttype="Article weight") ores.revision <- data.table(prob.predicted=apply(testdf[,.(Stub,Start,C,B,GA,FA)],2,partial(weighted.mean, w=testdf$revision_weight)), var.predicted=apply(testdf[,.(Stub,Start,C,B,GA,FA)],2,partial(weighted.var, w=testdf$revision_weight)), wp10=levls, weighttype="Revision weight") ores <- rbindlist(list(ores.noweight,ores.article,ores.revision)) ores <- ores[order(weighttype,wp10)] article.weighted <- r.articles$calibration.stats article.weighted <- article.weighted[order(weighttype,wp10)] ores <- ores[,':='(prob.data=article.weighted$prob.data, model.type='ORES' )] ores <- ores[,':='(calibration=prob.data-prob.predicted)] article.weighted$model.type <- "Article model" revision.weighted <- r.revisions$calibration.stats revision.weighted$model.type <- "Revision model" unweighted <- r.noweights$calibration.stats unweighted$model.type <- "Quality class model" df <- rbind(article.weighted, revision.weighted, unweighted, ores,fill=T) N <- nrow(r.articles$test.df)/6 df[,wp10:=wp10dict[wp10]] df[wp10=="Featured",wp10:='FA'] df[wp10=="Good",wp10:='GA'] df[,wp10:=factor(wp10,levels=c("Stub","Start","C","B","GA","FA"),ordered=T)] ## weighttypedict <- list("Article weight"="Articles", "Revision weight"="Revisions", "No weight"="Quality Classes") ## df[,Level.of.analysis:=factor(weighttypedict[weighttype],levels=c("Articles","Revisions","Quality Classes"))] weighttypedict <- list("Article weight"="Article unit of analysis", "Revision weight"="Revision unit of analysis", "No weight"="Quality class unit of analysis") df[,Level.of.analysis:=factor(weighttypedict[weighttype],levels=c("Article unit of analysis","Revision unit of analysis","Quality class unit of analysis"))] df[,model.type:=factor(model.type,levels=c("Article model","Revision model","Quality class model","ORES"),ordered=T)] #df[,model.type:=factor(model.type,levels=c("Quality Class Model","Revision Model","Article Model"))] p <- ggplot(df, aes(x=wp10, y=calibration, ymax=calibration + 1.96*sqrt(var.predicted)/sqrt(N), ymin=calibration - 1.96*sqrt(var.predicted)/sqrt(N))) + geom_point(size=0.5) + geom_errorbar(size=0.5,width=0.5) + facet_grid(Level.of.analysis~model.type,labeller=label_wrap_gen(width=20)) p <- p + geom_hline(yintercept=0,linetype='solid',color='grey50') p <- p + scale_y_continuous("P(Data) - P(Predicted)") p <- p + scale_x_discrete("Ordinal quality score") p <- p + theme(strip.text.y = element_text(angle = 0),axis.text.x=element_text(angle=45,vjust=1,hjust=1)) print(p) @ \caption{Calibration of each predictive quality model on datasets representative of each unit of analysis (article, revision, quality class). Each chart shows, for each quality class, the miscalibration of a model (columns) with respect to a dataset weighted to represent a unit of analysis (rows). The y-axis shows difference between the true probability of the quality class and the average predicted probability of that class, given a chosen unit of analysis. 
Points close to zero indicate good calibration. For example, the top-left chart shows that the article model is well-calibrated to the dataset on which it was fit, and the middle-left chart shows that the article model predicts that articles are \textit{Stubs} with probability greater than the frequency of \textit{Stubs} in a random sample of revisions. Error bars show 95\% confidence intervals. \label{fig:calibration}}
\end{figure*}

Informally, an ordinal regression model maps a linear regression model to the ordinal scale using the log cumulative odds link function. It does this by inferring thresholds that partition the range of linear predictions. When the linear predictor for an article crosses a threshold, the probability that the article's quality exceeds the level corresponding to that threshold begins to increase. Bayesian inference allows interpreting model parameters like $\phi_i$ and $\alpha_k$ as random variables and provides accurate quantification of uncertainty in thresholds and predictions. I fit models using the R package \texttt{brms} (Bayesian regression models using Stan) \cite{burkner_brms_2017}, version 2.15.0. I use the default priors for ordinal regression, which are weakly informative. Due to the large sample size, the data overwhelm the priors and the priors have little influence over results. I confirmed this by fitting equivalent frequentist models using the \texttt{polr} function in the \texttt{MASS} R package \cite{venables_modern_2002} and found that the estimates of intercepts and coefficients were very close.

% "all useful information" not strictly true
The six quality scores output by the ORES article quality classifier are perfectly collinear by construction because they sum to one. This means they cannot all be included in the same regression model. Since interpreting the coefficients is not important, I take a linear transformation of the ORES scores using appropriately weighted principal component analysis and use the first five principal components as the independent variables. This is simpler and more statistically efficient than a model selection procedure.

I fit three ordinal regression models, one for each of the units of analysis, using weights as described below in §\ref{sec:data}. The use of different weights is important to ensure that the model, and therefore the resulting quality scale, is well calibrated to the chosen unit of analysis as shown in Figure \ref{fig:calibration}. To further demonstrate the importance of calibrating the models to the correct unit of analysis, I report the accuracy of each model (and of the MPQC) on each weighted dataset in §\ref{sec:accuracy}.
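The following sketch outlines this fitting step under stated assumptions: the data table \texttt{dat}, its ORES probability columns, and the \texttt{article\_weight} column are placeholders, and a plain \texttt{prcomp} stands in for the weighted principal component analysis described above.

<<ordinal-fit-sketch, eval=FALSE, echo=TRUE>>=
library(data.table)
library(brms)

## `dat` is assumed to hold one row per labeled article with the six ORES
## probabilities, the Wikipedian label `wp10` (an ordered factor), and an
## inverse probability weight column `article_weight`.
ores.cols <- c("Stub", "Start", "C", "B", "GA", "FA")

## The six probabilities sum to one, so project them onto principal
## components and keep the first five to remove the collinearity.
pc <- prcomp(dat[, ..ores.cols], center = TRUE)
dat[, (paste0("PC", 1:5)) := as.data.table(pc$x[, 1:5])]

## Weighted Bayesian ordinal regression with a logit cumulative link.
m <- brm(wp10 | weights(article_weight) ~ PC1 + PC2 + PC3 + PC4 + PC5,
         family = cumulative(link = "logit"), data = dat)
summary(m)
@

The \texttt{weights()} term in the model formula is what calibrates the fit to the chosen unit of analysis.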
\subsection{Dataset and Model Calibration} \label{sec:data} \begin{figure*} \centering <>= rescale_scores <- function(scores,benchmark){ infinum <- min(benchmark) supremum <- max(benchmark) return( (scores - infinum)/(supremum-infinum)) } df <- r3$test.df df <- df[,ordinal.pred.correct := ordinal.pred == wp10] best.model <- r3$best.model draws <- as.data.table(best.model$fit) intercepts <- draws[,apply(draws[,c('b_Intercept[1]','b_Intercept[2]','b_Intercept[3]','b_Intercept[4]','b_Intercept[5]'),with=F],2,mean)] intercept.sd <- draws[,apply(draws[,c('b_Intercept[1]','b_Intercept[2]','b_Intercept[3]','b_Intercept[4]','b_Intercept[5]'),with=F],2,sd)] df <- df[,type := "Article model"] intercepts <- data.table(intercepts,intercept.sd) #intercepts[,name:= c("Start","C","B","GA","FA")] intercepts[,name:= c("Stub","Start","C","B","GA")] intercepts[,type:='Article model'] df2 <- r4$test.df df2 <- df2[,ordinal.pred.correct := ordinal.pred == wp10] df2 <- df2[,type:="Quality class model"] best.model <- r4$best.model draws <- as.data.table(best.model$fit) intercepts2 <- draws[,apply(draws[,c('b_Intercept[1]','b_Intercept[2]','b_Intercept[3]','b_Intercept[4]','b_Intercept[5]'),with=F],2,mean)] intercept2.sd <- draws[,apply(draws[,c('b_Intercept[1]','b_Intercept[2]','b_Intercept[3]','b_Intercept[4]','b_Intercept[5]'),with=F],2,sd)] intercepts2 <- data.table(intercepts2,intercept2.sd) setnames(intercepts2,old=c("intercepts2",'intercept2.sd'),new=c("intercepts","intercept.sd")) #intercepts2[,name:= c("Start","C","B","GA","FA")] intercepts2[,name:= c("Stub","Start","C","B","GA")] intercepts2[,type:= "Quality class model"] df3 <- r.revisions$test.df df3 <- df3[,ordinal.pred.correct := ordinal.pred == wp10] best.model <- r.revisions$best.model draws <- as.data.table(best.model$fit) intercepts3 <- draws[,apply(draws[,c('b_Intercept[1]','b_Intercept[2]','b_Intercept[3]','b_Intercept[4]','b_Intercept[5]'),with=F],2,mean)] intercept3.sd <- draws[,apply(draws[,c('b_Intercept[1]','b_Intercept[2]','b_Intercept[3]','b_Intercept[4]','b_Intercept[5]'),with=F],2,sd)] df3 <- df3[,type := "Revision model"] intercepts3 <- data.table(intercepts3,intercept3.sd) setnames(intercepts3,old=c("intercepts3",'intercept3.sd'),new=c("intercepts",'intercept.sd')) #intercepts3[,name:= c("Start","C","B","GA","FA")] intercepts3[,name:= c("Stub","Start","C","B","GA")] intercepts3[,type:='Revision model'] intercepts <- rbindlist(list(intercepts, intercepts2, intercepts3),fill=T) intercepts[,type:=factor(type,levels=c("Quality class model", "Revision model", "Article model"),ordered=T)] df <- df[,.(wp10,quality.ordinal,type,ordinal.pred.correct)] df2 <- df2[,.(wp10,quality.ordinal,type,ordinal.pred.correct)] df3 <- df3[,.(wp10,quality.ordinal,type,ordinal.pred.correct)] df <- rbind(df,df2,df3,fill=T) df[,quality.ordinal.rescaled:=rescale_scores(quality.ordinal,quality.ordinal),by=.(type)] df[,wp10:=wp10dict[wp10]] df[,wp10:=factor(wp10,levels=wp10dict,ordered=T)] hists <- df[,.(list(hist(quality.ordinal.rescaled,plot=F,breaks=70))),by=.(type,wp10)] ## densities <- df[,.(list(density(quality.ordinal))),by=.(type,wp10)] chunks <- list() for(i in 1:nrow(hists)){ dens <- hists[i] x <- dens[,V1][[1]]$mids y <- dens[,V1][[1]]$count type <- dens$type wp10 <- dens$wp10 part <- data.table(x=x,y=y,type=type,wp10=wp10) chunks <- append(chunks,list(part)) } dfdens <- rbindlist(chunks) df <- df[order(quality.ordinal)] df[,first.true:= .SD[min(which(ordinal.pred.correct==T)),quality.ordinal.rescaled],by=.(type,wp10)] df[,first.false:= 
.SD[min(which(ordinal.pred.correct==F)),quality.ordinal.rescaled],by=.(type,wp10)] df[,last.true:= .SD[max(which(ordinal.pred.correct==T)),quality.ordinal.rescaled],by=.(type,wp10)] df[,last.false:= .SD[max(which(ordinal.pred.correct==F)),quality.ordinal.rescaled],by=.(type,wp10)] dfdens <- dfdens[unique(df[,.(type,wp10,first.true,first.false,last.true,last.false)]), on=c("type","wp10")] dfdens[(x>=first.true)&(x<=last.true),region:="Correct"] dfdens[(x<=first.true)&(first.true <= first.false),region:="Correct"] dfdens[(x<=first.true)&(first.true >= first.false),region:="Incorrect"] dfdens[(x>=last.true)&(last.false >= last.true),region:="Incorrect"] dfdens[(x>=last.true)&(last.true >= last.false),region:="Correct"] dfdens[is.na(region),region:="Incorrect"] width <- mean(dfdens[,.(dx=mean(diff(x))),by=.(wp10,type)]$dx) intercepts[,model.type:=type] intercepts[,intercept_max := intercepts + 1.96*intercept.sd] intercepts[,intercept_min := intercepts - 1.96*intercept.sd] intercepts[,intercept_rescaled := rescale_scores(intercepts,df[model.type==type]$quality.ordinal),by=.(type)] intercepts[,intercept_max_rescaled := rescale_scores(intercept_max,df[model.type==type]$quality.ordinal),by=.(type)] intercepts[,intercept_min_rescaled := rescale_scores(intercept_min,df[model.type==type]$quality.ordinal),by=.(type)] scores.df <- df p <- ggplot(dfdens, aes(x=x,y=y,fill=region,color=region)) p <- p + facet_grid(wp10~type,as.table=F) ## intercepts[name=='A',xpos:=xpos-0.1] ## intercepts[name=='GA',xpos:=xpos+0.1] intercepts[,y:=125] intercepts[((model.type=='Article model')|(model.type=='Revision model')) & (name=='Start' ), y:=98] dfdens[,type:=factor(type,levels=c("Quality class model", "Revision model", "Article model"),ordered=T)] p <- p + xlab("Ordinal quality score") p <- p + ylab("Article quality class") p <- p + scale_fill_viridis_d("Model prediction", labels=list("Correct","Incorrect"),direction=1,begin=0.7,end=0.2) p <- p + scale_color_viridis_d("Model prediction", labels=list("Correct","Incorrect"),direction=1,begin=0.7,end=0.2) p <- p + geom_rect(aes(xmin=intercept_min_rescaled,xmax=intercept_max_rescaled),intercepts,inherit.aes=F,ymin=0,ymax=max(dfdens$y),fill='black',alpha=0.2) p <- p + geom_vline(data=intercepts,aes(xintercept=intercept_rescaled),linetype='dashed') #p <- p + geom_label(aes(x=intercept_rescaled, y=y,label=name),intercepts,inherit.aes=F,size=1.9) p <- p + geom_histogram(stat='identity',width=width) #p <- p + guides(color=F) p <- p + theme(legend.position='right') p <- p + theme(strip.text.y = element_text(angle = 0),axis.text.x=element_text(angle=45,vjust=1,hjust=1)) print(p) @ \caption{Quality scores and predictions of the ordinal regression models. Columns in the grid of charts correspond to the ordinal quality model calibrated to the indicated unit of analysis and rows correspond to sampled articles having the indicated level of quality as assessed by Wikipedians. Each chart shows the histogram of scores, thresholds inferred by the ordinal model with 95\% credible intervals colored in gray, and colors indicating when the model makes correct or incorrect predictions. The thresholds are not evenly spaced, especially in \textit{revision model} and \textit{article model} that has more weight on lower quality classes. These two models infer that the gaps between \textit{Stub} and \textit{Start} and between \textit{Start} and \textit{C-class} articles are considerably wider than the gap between \textit{C-class} and \textit{B-class} articles. 
\label{fig:spacing}}
\end{figure*}

I draw a new random sample of 5,000 articles from each quality class to develop my models. I first reuse code from the \texttt{articlequality}\footnote{\url{https://pypi.org/project/articlequality} (\url{https://perma.cc/8R4H-MAZ9})} Python package to process the March 2020 XML dumps for English Wikipedia and extract up-to-date article quality labels. I then select pages that have been assessed by a member of at least one WikiProject. Following prior work, if an article is assessed at different levels according to more than one WikiProject, I assign it to the highest such level, and I drop articles having the rarely used \emph{A-class} quality level \cite{halfaker_interpolating_2017,warncke-wang_success_2015,warncke-wang_tell_2013}. Next, I use the \texttt{revscoring}\footnote{\url{https://pypi.org/project/revscoring} (\url{https://perma.cc/3HFN-V23Z})} Python package to obtain the ORES scores of the labeled article versions. Some of these versions have been deleted, leading to missing observations at each quality level. Table \ref{tab:sample} shows the number of articles sampled in each quality class. I reserve a random sample of 2,000 articles, which I use in reporting my results, and fit my ordinal regression models on the remainder.
%For a fair comparison of predictive accuracy, I holdout a random sample of \Sexpr{r2[['n.holdout']]} articles.
%From these labeled articles I draw a new stratified sample to enable the use of a smaller sample that is ``balanced,'' meaning that it has equal sample sizes for all article classes as shown in Table \ref{tab:sample}.

The ORES article quality classifiers are fit on a ``balanced'' dataset having an equal number of articles in each quality class. Thus, an ORES score is the probability that an article is a member of a quality class under the assumption that the article was drawn from a population where each quality class contains an equal number of articles. Simply put, the model has learned from its training data that each quality class is about the same size. This is not representative of the overall distribution of article quality on Wikipedia, which is highly skewed, with over 3 million \textit{Stubs} but only around 7,000 \textit{Featured articles} as shown in Table \ref{tab:sample}. Although using a balanced dataset likely improves the accuracy of the ORES models, for the ordinal regression models the choice of unit of analysis presents a trade-off between accuracy in a representative sample of articles or revisions and accuracy within each quality class.

Constructing a balanced dataset by oversampling is a common practice in machine learning because it can improve predictive performance. However, oversampling can also lead to badly calibrated predictive probabilities as shown in Fig. \ref{fig:calibration}. Calibration means that, on average, the predicted probability of a quality class equals the average true probability of that class for the unit of analysis. The ``balanced'' dataset on which ORES is trained has the \textit{quality class} unit of analysis because each quality class has equal representation. However, researchers are more interested in analyzing representative samples of \textit{articles} or \textit{revisions}. For example, the article unit of analysis would be used to estimate the average quality of a random sample of articles, and the revision unit of analysis might be used to model the change in the quality of an encyclopedia over time.
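Before turning to the weights themselves, the kind of calibration check summarized in Fig.~\ref{fig:calibration} can be sketched as follows; \texttt{test} and its columns are placeholders for a held-out dataset with the six predicted class probabilities, the true label \texttt{wp10}, and a weight column for the chosen unit of analysis.

<<calibration-check-sketch, eval=FALSE, echo=TRUE>>=
library(data.table)

## `test` is assumed to be a held-out data.table with one row per article,
## predicted probability columns named after the classes, the true label
## `wp10` (lower-case class names), and a column `article_weight`.
classes <- c("Stub", "Start", "C", "B", "GA", "FA")

calibration <- rbindlist(lapply(classes, function(k) {
  data.table(
    wp10 = k,
    ## Average predicted probability of class k under the chosen weights.
    prob.predicted = weighted.mean(test[[k]], test$article_weight),
    ## Weighted frequency with which class k actually occurs.
    prob.data = weighted.mean(test$wp10 == tolower(k), test$article_weight)
  )
}))

## Well-calibrated predictions put these differences near zero.
calibration[, miscalibration := prob.data - prob.predicted]
calibration
@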
Weighting allows the use of the balanced dataset to estimate a model as if the dataset were a uniform random sample of a different unit of analysis. My method uses a balanced dataset to fit ordinal regression models with inverse probability weighting to calibrate each model to the unit of analysis of a research project. For example, each article in the model calibrated to the article unit of analysis is weighted by the probability of its quality class in the population of articles divided by the probability of its quality class in the sample. The size of the sample and the weights for the article and revision levels of analysis are also shown in Table \ref{tab:sample}.
% It turns out that the ``evenly spaced'' assumption is sensitive to the unit of analysis.
% This requires dropping one of the scores, but it is not obvious which one should be dropped.

For both the weighted and unweighted models, I fit six models, each dropping a different score, and then use approximate leave-one-out cross-validation (LOO-CV) implemented in the \textsc{loo} R package to choose among them \cite{vehtari_practical_2017}. LOO-CV takes advantage of the Bayesian model to accurately and reliably calculate the expected log pointwise predictive density (ELPD) for out-of-sample data using Pareto-smoothed importance sampling. The choice does not matter much as the standard errors of the ELPD differences are not much smaller than the differences themselves. As shown in Table \ref{tab:loo.comparison}, the best models according to the ELPD have the \textit{Start-class} score removed for the weighted models and the \textit{Stub-class} score removed for the unweighted models. I therefore use these models from here on.

\section{Results}
\label{sec:results}

I first report my findings about the spacing of the quality classes in each of the models in §\ref{sec:spacing}. Quality classes are not evenly spaced, especially when articles or revisions are the unit of analysis. Next, in §\ref{sec:accuracy}, I report the accuracy of each of the models and the uncertainty of the ordinal quality scale. All models perform similarly to or better than the MPQC within the pertinent unit of analysis. The unweighted model provides the best accuracy and lowest uncertainty across the entire range of quality levels, but is poorly calibrated for other units of analysis. Finally, in §\ref{sec:correlation}, I show that all quality measures are highly correlated, but the ordinal quality measures agree with one another more than with the ``evenly spaced'' measure.
\begin{figure} <>= article.weighted <- r.articles$ordinal.fitted article.weighted$type <- 'Article model' revision.weighted <- r.revisions$ordinal.fitted revision.weighted$type <- "Revision model" unweighted <- r.noweights$ordinal.fitted unweighted$type <- "Quality class model" df <- rbind(article.weighted, revision.weighted, unweighted) df[,Estimate.rescaled := rescale_scores(Estimate,scores.df[type==type]$quality.ordinal),by=.(type)] df[,Est.Error.rescaled := rescale_scores(Est.Error,scores.df[type==type]$quality.ordinal),by=.(type)] df[,Q97.5.rescaled := rescale_scores(Q97.5,scores.df[type==type]$quality.ordinal),by=.(type)] df[,Q2.5.rescaled := rescale_scores(Q2.5,scores.df[type==type]$quality.ordinal),by=.(type)] df[,Error.rescaled := Q97.5.rescaled-Q2.5.rescaled] df[,type:=factor(type,levels=c('Quality class model','Revision model','Article model'),ordered=T)] p <- ggplot(df, aes(x=Estimate.rescaled,y=Error.rescaled)) + geom_point(alpha=0.5,size=0.5) + facet_wrap(type~.,scales='free_x',ncol=1) p <- p + geom_vline(data=intercepts,aes(xintercept=intercept_rescaled),linetype='dashed') p <- p + ylab("Size of 95% credible interval") p <- p + xlab("Ordinal quality score") #p <- p + geom_label(aes(x=intercept_rescaled, y=0.1,label=name),intercepts,inherit.aes=F,size=1.9) p <- p + geom_rect(aes(xmin=intercept_min_rescaled,xmax=intercept_max_rescaled),intercepts,inherit.aes=F,ymin=0,ymax=max(dfdens$y),fill='black',alpha=0.2) print(p) qe6 <- r.articles$test.df$quality.even6 article <- r.articles$test.df$quality.ordinal revision <- r.revisions$test.df$quality.ordinal unweighted <- r.noweights$test.df$quality.ordinal @ \caption{Uncertainty in ordinal quality scores for models calibrated at each unit of analysis. Points show the size of the 95\% credible interval for the ordinal quality score for each article in the dataset. The quality class model has low uncertainty across the range of quality. Models calibrated to the revision and article levels of analysis have less uncertainty at the low end of the quality scale, but greater uncertainty at the higher end of the scale. \label{fig:uncertainty}} \end{figure} \subsection{Spacing of Quality Classes} \label{sec:spacing} The grid of charts in Fig. \ref{fig:spacing} shows quality scores and thresholds for each model (columns) and article quality level (rows). Each chart shows the histogram of quality scores $\phi_i$ given to articles having the true quality label corresponding to the row of the grid. The histograms are colored to indicate regions where the model correctly predicts that articles belong to their true class. Vertical dashed lines show the thresholds inferred by the model with 95\% credible intervals colored in gray. Different models have different ranges of scores, so Fig. \ref{fig:spacing} shows results normalized between 0 and 1. 
<>=
quality.class.featured.min <- unique(dfdens[(type=='Quality class model') & (wp10=='Featured'),first.true])
quality.class.C.min <- unique(dfdens[(type=='Quality class model') & (wp10=='C'),first.true])
quality.class.C.max <- unique(dfdens[(type=='Quality class model') & (wp10=='C'),last.true])
revision.stub.min <- unique(dfdens[(type=='Revision model') & (wp10=='Stub'),first.true])
revision.stub.max <- unique(dfdens[(type=='Revision model') & (wp10=='Stub'),last.true])
revision.C.min <- unique(dfdens[(type=='Revision model') & (wp10=='C'),first.true])
revision.C.max <- unique(dfdens[(type=='Revision model') & (wp10=='C'),last.true])
revision.GA.min <- unique(dfdens[(type=='Revision model') & (wp10=='Good'),first.true])
revision.GA.max <- unique(dfdens[(type=='Revision model') & (wp10=='Good'),last.true])
revision.stub.range <- revision.stub.max - revision.stub.min
revision.C.range <- revision.C.max - revision.C.min
article.stub.min <- unique(dfdens[(type=='Article model') & (wp10=='Stub'),first.true])
article.stub.max <- unique(dfdens[(type=='Article model') & (wp10=='Stub'),last.true])
article.C.min <- unique(dfdens[(type=='Article model') & (wp10=='C'),first.true])
article.C.max <- unique(dfdens[(type=='Article model') & (wp10=='C'),last.true])
article.stub.range <- article.stub.max - article.stub.min
## use the article model's own C-class minimum (not the revision model's) for this range
article.C.range <- article.C.max - article.C.min
@

No matter the unit of analysis, article quality classes are not evenly spaced. The quality class model provides a quality scale in which \textit{Featured} articles take up $\Sexpr{format.percent(round(1-quality.class.featured.min,2))}$ of the scale and are expected to score in the range of $[\Sexpr{round(quality.class.featured.min,2)}, 1]$, but probable \textit{C-class} articles only span $\Sexpr{format.percent(round(quality.class.C.max-quality.class.C.min,2))}$ of the scale in the range $[\Sexpr{round(quality.class.C.min,2)}, \Sexpr{round(quality.class.C.max,2)}]$. Researchers are likely to be interested in models calibrated to the article or revision units of analysis, and in these cases, the quality classes are far from evenly spaced. The \textit{revision model} assigns $\Sexpr{format.percent(round(revision.stub.range,2))}$ of the scale to \textit{Stubs}, from $0$ to $\Sexpr{round(revision.stub.max,2)}$. It assigns \textit{C-class} articles the smallest part of the scale, only $\Sexpr{format.percent(round(revision.C.range,2))}$ of it, from $\Sexpr{round(revision.C.min,2)}$ to $\Sexpr{round(revision.C.max,2)}$. The \textit{article model} is even more extreme. It assigns \textit{Stubs} to the interval $[\Sexpr{round(article.stub.min,2)}, \Sexpr{round(article.stub.max,2)}]$, $\Sexpr{format.percent(round(article.stub.range,2))}$ of the scale, and the space between thresholds defining the range of \textit{C-class} articles is so narrow that it virtually never predicts that an article will be C-class.

In general terms, the \textit{quality class model} gives relatively equal amounts of space to each quality class compared to the other models, while reserving nearly the top half of the scale for the top two quality classes. The \textit{revision model} and \textit{article model} do the opposite: they use the bottom half of the scale to account for differences within the bottom two quality classes, leave some room for \textit{B-class} articles, but squeeze the top end of the scale and \textit{C-class} articles into relatively small intervals.
%spacing between the levels is relatively even compared to the other units of analysis.
A greater range of the ordinal quality scale is given to \textit{Featured} articles than to \textit{Good} articles, and a smaller range is given to \textit{C-class} and \textit{B-class} articles. Things are quite different in circumstances more likely to be of interest to researchers: when the units of analysis are revisions or articles. In both cases, a large range of the scale is taken by \textit{Stub} and \textit{Start-class} articles at the bottom of the scale; \textit{C-class} articles have quite a small range of the scale, perhaps due to the difficulty of distinguishing them from \textit{B} or \textit{Start-class} articles; and \textit{Good} and \textit{Featured} articles are given some part of the scale, but substantially less than when the unit of analysis is the quality class.

\subsection{Accuracy and Uncertainty}
\label{sec:accuracy}

I evaluate predictive performance in terms of \textit{accuracy}, the proportion of predictions of article quality that are correct. To allow comparison with the reported accuracy of the ORES quality models, I also report \textit{off-by-one accuracy}, which includes predictions within one level of the true quality class among correct predictions. As shown in Table \ref{tab:accuracy}, the ordinal regression models have better predictive ability than the MPQC except when the unit of analysis is the quality class. In this case, the best ordinal quality model has worse accuracy than the MPQC but slightly better off-by-one accuracy. Table \ref{tab:accuracy} shows accuracy and off-by-one accuracy weighted for each unit of analysis. Accuracy for a given unit of analysis depends on having a model fit to data representative of that unit of analysis. Accuracy scores are higher when greater weight is placed on lower article quality classes, suggesting that it is easier to discriminate between these classes.

\begin{table*}
\caption{Accuracy of quality prediction models depends on the unit of analysis. The greatest accuracy and off-by-one accuracy scores are highlighted. Models are more accurate when calibrated on the same unit of analysis on which they are evaluated. Compared to the MPQC, the ordinal quality models have better accuracy when revisions or articles are the unit of analysis. When the quality class is the unit of analysis, the ordinal quality model has worse accuracy, but predicts within one quality class with slightly better accuracy.
\label{tab:accuracy}} <>= model.type <- c("Ores predicted", "Article model","Revision model","Quality class model") accuracy.type <- c("Article weighted", "Revision weighted", "Unweighted") df <- data.table(expand.grid(model.type, accuracy.type)) df1 <- r.articles$test.df df1$model.type <- "Article level" df2 <- r.revisions$test.df df2$model.type <- "Revision level" df3 <- r.noweights$test.df df3$model.type <- "Quality class level" df <- rbind(df1, df2, df3,fill=T) df <- df[,ordinal.pred := factor(ordinal.pred,levels=c("stub","start",'c','b','ga','fa'),ordered=T)] df <- df[,MPQC := factor(MPQC,levels=c("stub","start",'c','b','ga','fa'),ordered=T)] df3 <- df3[,MPQC := factor(MPQC,levels=c("stub","start",'c','b','ga','fa'),ordered=T)] acc.unweighted <- df[,.(exact=mean(ordinal.pred == wp10), within1=mean(abs(as.numeric(ordinal.pred) - as.numeric(wp10))<=1)),by=.(model.type)] acc.unweighted[,accuracy.type:='Unweighted'] acc.article <- df[,.(exact=weighted.mean(ordinal.pred == wp10,article_weight), within1=weighted.mean(abs(as.numeric(ordinal.pred) - as.numeric(wp10))<=1,article_weight)),by=.(model.type)] acc.article[,accuracy.type:='Article weighted'] acc.revision <- df[,.(exact=weighted.mean(ordinal.pred == wp10,revision_weight), within1=weighted.mean(abs(as.numeric(ordinal.pred) - as.numeric(wp10))<=1,revision_weight)),by=.(model.type)] acc.revision[,accuracy.type:='Revision weighted'] acc.ores <- data.table(expand.grid(c('Ores predicted'),accuracy.type)) names(acc.ores) <- c("model.type","accuracy.type") acc.ores[accuracy.type=='Unweighted',":="(exact=mean(df3$MPQC==df3$wp10),within1=mean(abs(as.numeric(df3$MPQC) - as.numeric(df3$wp10)) <= 1))] acc.ores[accuracy.type=='Article weighted',":="(exact=weighted.mean(df3$MPQC==df3$wp10,df3$article_weight),within1=weighted.mean(abs(as.numeric(df3$MPQC) - as.numeric(df3$wp10)) <= 1,df3$article_weight))] acc.ores[accuracy.type=='Revision weighted',":="(exact=weighted.mean(df3$MPQC==df3$wp10,df3$revision_weight),within1=weighted.mean(abs(as.numeric(df3$MPQC) - as.numeric(df3$wp10)) <= 1,df3$revision_weight))] tab <- rbind(acc.unweighted,acc.revision,acc.ores,acc.article) tab <- tab[order(accuracy.type,model.type)] tab <- tab[order(accuracy.type,model.type,exact,within1)] tab <- tab[, best.acc := max(exact),by=.(accuracy.type)] tab <- tab[, best.within1 := max(within1),by=.(accuracy.type)] tab <- tab[,exact.str:=paste0(round(exact,2))] tab[exact==best.acc,exact.str:=paste0("\\cellcolor{mygreen}",round(exact,2))] tab[,exact:=exact.str] tab <- tab[,w1.str:=paste0(round(within1,2))] tab[within1==best.within1,w1.str:=paste0("\\cellcolor{mygreen}",round(within1,2))] tab[,within1:=w1.str] tab[model.type=='Ores predicted',model.type:="ORES MPQC"] tab[model.type=='ORES MPQC',isord:="No"] tab[model.type=='Article level',model.type:="Article"] tab[model.type=='Article',isord:="Yes"] tab[model.type=='Revision level',model.type:="Revision"] tab[model.type=='Revision',isord:="Yes"] tab[model.type=='Quality class level',model.type:="Quality class"] tab[model.type=='Quality class',isord:="Yes"] tab[accuracy.type=='Unweighted',accuracy.type:="Quality class"] tab[accuracy.type=='Revision weighted',accuracy.type:="Revision"] tab[accuracy.type=='Article weighted',accuracy.type:="Article"] setnames(tab,old=c("model.type","exact","within1","accuracy.type","isord"),new= c("Model", "Accuracy", "Off-by-one accuracy", "Unit of analysis", "Ordinal model?")) print(xtable(tab[,c(4,1,9,2,3),with=F]),hline.after=c(-1,0,4,8,nrow(tab))) @ \end{table*} The ORES article quality model 
has been quickly adopted by researchers, but its accuracy is limited. While off-by-one accuracy is above 90\% when the article is the unit of analysis, the MPQC only predicts the correct quality class 55\% of the time when the quality class is the unit of analysis.

The trade-offs in selecting a unit of analysis on which to calibrate the models are further illustrated by Fig. \ref{fig:uncertainty}, which plots the size of the 95\% credible intervals as a function of the quality scores for each model. As in Fig. \ref{fig:spacing}, quality scores in this plot are rescaled between 0 and 1. The models calibrated to articles or revisions have more certainty in the lower range of the quality scale compared to the model that places equal weight on all quality classes. This comes with a trade-off for the higher range of quality. While the \textit{quality class model} has relatively low uncertainty across the entire range of quality, the \textit{revision model} and \textit{article model} have greater uncertainty at higher levels of quality.

\subsection{Correlation Between Scores}
\label{sec:correlation}

Although the models have different predictive performances and uncertainties, as measures of quality they are nearly perfectly correlated with one another, as shown in Fig. \ref{fig:correlation}. For each quality score, including the ``evenly spaced'' weighted sum, Fig. \ref{fig:correlation} shows a scatter plot and two correlation statistics: Kendall's $\tau$ and Pearson's $r$. Pearson's $r$ is the standard linear correlation coefficient, and Kendall's $\tau$ is a nonparametric rank-based correlation defined as the probability that the quality scores will agree about which of any two articles has higher quality minus the probability that they will disagree. According to Pearson's $r$, all the quality scores are highly correlated, with correlation coefficients of about $0.98$ or higher. Because Kendall's $\tau$ depends only on rank order, it reveals discrepancies between the ordinal models and the ``evenly spaced'' measure that the linear correlation obscures. The Pearson correlation between scores from the \textit{revision model} and scores from the \textit{quality class model} is about the same as the correlation between the \textit{revision model} scores and the ``evenly spaced'' scores ($r=\Sexpr{round(cor(revision,qe6),2)}$). However, according to Kendall's $\tau$, scores from the \textit{revision model} are more similar to those from the \textit{quality class model} ($\tau=\Sexpr{round(cor(revision,unweighted,method='kendall'),2)}$) than to the scores from the ``evenly spaced'' approach ($\tau=\Sexpr{round(cor(revision,qe6,method='kendall'),2)}$). The ``evenly spaced'' scores are more likely to disagree with the model-based scores about which of two articles has higher quality than any of the model-based scores are to disagree with one another, as the scatter plots in Fig. \ref{fig:correlation} show. Disagreement between the ``evenly spaced'' method and the ordinal models is greatest among articles in the middle of the quality range.
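
The pairwise interpretation of Kendall's $\tau$ can be made concrete in a few lines of code: enumerating every pair of articles and counting how often two measures agree about which article in the pair has higher quality reproduces the statistic that \texttt{cor} reports when there are no ties. The scores below are hypothetical and serve only to illustrate the calculation.
<<eval=FALSE>>=
## Two hypothetical quality scores for the same five articles.
score.a <- c(0.10, 0.35, 0.40, 0.70, 0.90)
score.b <- c(0.15, 0.55, 0.30, 0.60, 0.95)

## Enumerate all pairs of articles and check whether the two measures agree
## about which article in each pair has higher quality.
article.pairs <- combn(length(score.a), 2)
agree <- sign(score.a[article.pairs[1,]] - score.a[article.pairs[2,]]) ==
         sign(score.b[article.pairs[1,]] - score.b[article.pairs[2,]])

## P(agree) - P(disagree) equals Kendall's tau when there are no ties.
mean(agree) - mean(!agree)
cor(score.a, score.b, method='kendall')
@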
\begin{figure*}
<>=
df <- data.table(qe6, article, revision, unweighted)
names(df) <- c("Evenly spaced", "Article model","Revision model", "Quality class model")
cortable <- cor(df,method='kendall')
cortable[lower.tri(cortable)] <- ''
library(rlang)
library(GGally)
# library(latex2exp)
mycor <- function(data,mapping){
    x <- data[[quo_get_expr(mapping[['x']])]]
    y <- data[[quo_get_expr(mapping[['y']])]]
    pearson <- cor(x,y)
    kendall <- cor(x,y,method='kendall')
    lab <- as.expression(bquote(atop(tau:~.(round(kendall,3)),r:~.(round(pearson,3)))))
    p <- ggally_text(label=lab,color='black',size=4,parse=F)
    p <- p + theme(panel.grid.minor=element_blank(),
                   panel.grid.major=element_blank())
    return(p)
}
p <- ggpairs(df, upper = list(continuous=wrap(mycor)),diag=list(continuous='blank'),lower=list(continuous=wrap('points',alpha=0.3,size=0.5)))
p <- p + theme(strip.text.y = element_text(angle = 0))
print(p)
@
\caption{Correlations between quality measures show that the different approaches to measuring quality are quite similar. ``Evenly spaced'' uses a weighted sum of the ORES scores with handpicked coefficients \cite{halfaker_interpolating_2017}. Lower values of Kendall's $\tau$, a nonparametric rank correlation statistic, compared to Pearson's $r$ suggest nonlinear differences between the weighted sum and the other measures.
\label{fig:correlation}}
\end{figure*}

\section{Discussion}
\label{sec:discussion}

Past efforts to extend the measurement of Wikipedia article quality from peer-produced article quality assessments to unassessed versions of articles, and from the discrete to the continuous domain, have relied upon machine learning and on expedient but untested assumptions such as the assumption that quality levels are ``evenly spaced.''
% I argued in §\ref{sec:background} that using machine learning to extend the article quality measurement from the direct observation of human article assessment to unobserved articles and from the discrete levels to a continuous scale might be analogous to how thermometry extended into new extremes of hot and cold where assumptions like the liquidity of mercury break down.
The history of thermometry offers a cautionary analogy: scientists, unaware that mercury has a solid state, were baffled and misled by impossibly low temperature readings from thermometers in which the mercury had unexpectedly frozen \cite{chang_inventing_2004}. While I suggest technical improvements to statistical models for measuring quality, I also find that scores from my models are highly correlated with those obtained under the ``evenly spaced'' assumption.

I set out to provide a better way to convert the probability vector output by the ORES article quality model into a continuous scale and to test the assumption that the quality levels are evenly spaced. I used ordinal regression models to infer the spacing between quality levels and used the linear predictor of these models as a continuous measure of quality. While I found in §\ref{sec:spacing} that the quality levels are not evenly spaced and that the spacing depends on the unit of analysis to which the models are calibrated, I also showed in §\ref{sec:correlation} that the model-based quality measures are highly, although not perfectly, correlated with the ``evenly spaced'' measure. This provides some assurance that past results built on this measure are unlikely to mislead.
That said, I recommend that future work adopt appropriately calibrated model-based quality measures instead of the ``evenly spaced'' approach, and I argue that it is important to improve the accuracy of article quality predictors to enable more precise article quality measurement.

\subsection{Recommendations for Measuring Article Quality}

How should future researchers approach the question of how to measure Wikipedia article quality? While I cannot provide a final or complete answer to the question, I believe the exercise reported in this paper provides some insights on which to base recommendations. It is important to note that I consider here only approaches to measuring quality that assume the use of a good predictor of article quality assessment, such as the ORES quality model. I do not consider other approaches, such as those based on indexes \cite{lewoniewski_relative_2017}, described in §\ref{sec:background}.

\subsubsection{Use the principal components of ORES scores for statistical control of article quality}

In many statistical analyses, the only purpose of measuring quality will be as a statistical control or adjustment. For example, \citeauthor{zhang_crowd_2017} \cite{zhang_crowd_2017} used the MPQC as a control variable in a propensity score matching analysis of promotion to \textit{Featured article} status, but as argued in §\ref{sec:methods}, the MPQC provides less information than the vector of ORES scores. Using the principal components is simpler than using an ordinal quality model. I recommend obtaining ORES scores for your dataset, taking the principal components, and dropping the least significant component to remove collinearity.

\subsubsection{Use ordinal quality scores when article quality is an independent variable}
\label{sec:qciv}

In other cases, research questions will ask how article quality is related to an outcome of interest, like how \citeauthor{kocielnik_reciprocity_2018} \cite{kocielnik_reciprocity_2018} set out to explore factors associated with donations to the Wikimedia Foundation. They used the MPQC as an independent variable, which complicated their analysis. Although they conclude that ``pages with higher quality attract more donations,'' this is not strictly true. They actually found a nonlinear relationship where readers of \textit{B-class} articles were more likely to donate than readers of \textit{Featured articles}. Using a continuous measure of quality is more convenient when the average linear relationship is the target of inference. I recommend using an ordinal regression model appropriate to the downstream unit of analysis because this will justify the interpretation of the measure. If the downstream unit of analysis differs substantively from those used here, such as if different selection criteria are applied, I recommend reusing my code to calibrate a new ordinal regression model to a new dataset. Otherwise, reusing one of my models should be adequate. Finally, in the Bayesian framework, the scores are interpretable as random variables. This provides a justification for incorporating the variance of these scores as measurement errors to improve estimation in downstream analysis \cite{mcelreath_statistical_2018}.
% Although the ``evenly spaced'' scores and the scores based on ordinal regression are highly correlated, there are a number of reasons to prefer my approach.
% The most important is simply that it requires no strong assumptions about the relationships between levels of article quality.
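
To illustrate the general recipe, the non-executed sketch below fits a weighted proportional-odds model to ORES class probabilities and uses its linear predictor as a one-dimensional quality score. It is only a simplified, frequentist stand-in for the weighted Bayesian ordinal models described here, and the column names (\texttt{prob.stub}, \ldots, \texttt{article\_weight}) are hypothetical placeholders for whatever a given research dataset contains.
<<eval=FALSE>>=
library(MASS)        # polr: proportional-odds ordinal regression
library(data.table)

## Hypothetical research dataset: one row per observation with ORES class
## probabilities, a hand-assessed quality class, and a calibration weight
## matching the downstream unit of analysis (column names are placeholders).
dat <- fread("my_sample_with_ores_scores.csv")
dat[, wp10 := factor(wp10, levels=c("stub","start","c","b","ga","fa"), ordered=TRUE)]

## Weighted proportional-odds model of quality class on the ORES
## probabilities. One class probability is dropped because the six
## probabilities sum to one and would otherwise be collinear.
fit <- polr(wp10 ~ prob.stub + prob.start + prob.c + prob.b + prob.ga,
            data=dat, weights=article_weight, Hess=TRUE)

## The linear predictor is the continuous quality score; the estimated
## thresholds (fit$zeta) relate it back to the quality classes.
X <- as.matrix(dat[, .(prob.stub, prob.start, prob.c, prob.b, prob.ga)])
dat[, quality.score := drop(X %*% coef(fit))]
@
Under this parameterization, scores for new observations can be computed with the same matrix product, and the gaps between adjacent elements of \texttt{fit\$zeta} are the inferred (uneven) spacing between quality classes.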
An additional advantage of the ordinal regression approach is that it learns both the spacing between quality levels and the best combination of ORES scores for predicting article quality assessment.
% Second, the scores have grounded statistical interpretations as the linear predictor in an ordinal quality model.
Given the intercepts from the model, the scores are directly interpretable as a probability distribution over article quality classes.

\subsubsection{Use the MPQC or ordinal quality scores when article quality is the dependent variable}

Using the MPQC as the outcome in an ordinal regression model, as is done by \citeauthor{shi_wisdom_2019} \cite{shi_wisdom_2019} in their analysis of Wikipedia articles with politically polarized editors, is a reasonable choice as long as it provides sufficient variation and a more granular quality measure is not needed. Using the MPQC might, in theory, introduce statistical bias because it is less accurate than ordinal quality scores for units of analysis other than the quality class and because it omits variation within quality classes, but these threats to validity do not seem more significant than the threat posed by inaccurate predictions. If the MPQC does not provide sufficient granularity and a continuous measure is desired, as in \citeauthor{halfaker_interpolating_2017} \cite{halfaker_interpolating_2017} or \citeauthor{arazy_evolutionary_2019} \cite{arazy_evolutionary_2019}, I recommend using a measure based on ordinal regression as described in §\ref{sec:qciv}.

\subsection{Limitations}

Although intuitions about the varying degrees of effort required to develop articles with different levels of quality led me to question the ``evenly spaced'' assumption, my findings that quality classes are not evenly spaced do not necessarily reflect relative degrees of effort. Rather, spaces between levels are chosen to link a linear model to ordinal data. The spacing of intervals depends on the ability of the ORES scores to predict quality classes. The ORES article quality model has relative difficulty classifying \textit{C-class} and \textit{B-class} articles \cite{halfaker_interpolating_2017}. Perhaps the differences between these quality classes are minor compared to the other classes. Alternatively, ORES may lack the features or flexibility to model these differences, in which case the space between these classes will grow if its predictive performance improves.

The usefulness of article quality scores depends on the accuracy of the model. The ORES quality models are accurate enough to be useful for researchers, but they still only predict the correct quality class 55\% of the time on a balanced dataset. Of course, this limits the accuracy of the ordinal regression models reported here. Furthermore, while the ORES quality models were designed with carefully chosen features intended to limit biases \cite{halfaker_ores_2020}, it is still quite plausible that the accuracy of predictive quality models may vary depending on characteristics of the article \cite{kleinberg_inherent_2016}. Such inaccuracies may introduce bias, threaten downstream analyses, or lead to unanticipated consequences of collaboration tools built upon the models \cite{teblunthuis_effects_2021}. Therefore, improving the accuracy of article quality prediction models is important to the validity of future article quality research. Adopting machine learning models that can incorporate ordinal loss functions is a promising direction and can reduce the need for auxiliary ordinal regression models \cite{cardoso_learning_2007}.
This paper only considers measuring article quality for English-language Wikipedia, but expanding knowledge of collaborative encyclopedia production depends on studying other language editions, since audiences and collaborative dynamics can vary greatly between projects \cite{hecht_tower_2010,lemmerich_why_2019,teblunthuis_dwelling_2019}. Other language editions carry out quality assessments \cite{lewoniewski_relative_2017}, and some of these have been used to build ORES article quality models. Future work should extend this project to provide multilingual article quality measures in one continuous dimension.

An additional limitation stems from the likelihood that peer-produced quality labels are biased. For instance, the English Wikipedia community has a well-documented pattern of discrimination against content associated with marginalized groups, such as biographies of women \cite{tripodi_ms_2021, menking_people_2019} and indigenous knowledge \cite{van_der_velden_decentering_2013}. Although demonstrating biases in article quality assessment is a task for future research, if Wikipedians' assessments of article quality are biased, then model predictions of quality will almost certainly be as well.

\section{Conclusion}

Measuring article quality in one continuous dimension is a valuable tool for studying the peer production of information goods because it provides granularity and is amenable to statistical analysis. Prior approaches extended ORES article quality prediction into a continuous measure under the ``evenly spaced'' assumption. I showed how to use ordinal regression models to transform the ORES predictions into a continuous measure of quality that is interpretable as a probability distribution over article quality levels, provides an account of its own uncertainty, and does not assume that quality levels are ``evenly spaced.'' Calibrating the models to the chosen unit of analysis improves accuracy for research applications. I recommend that future work adopt this approach when article quality is an independent variable in a statistical analysis.

\section{Code and Data Availability}

Code, data, and instructions for replicating or reusing this analysis are available in the Harvard Dataverse at \url{https://doi.org/10.7910/DVN/U5V0G1}.

\begin{acks}
I am grateful to the members of the Community Data Science Collective for their feedback on early drafts of this work, including Kaylea Champion, Sneha Narayan, Jeremy Foote, and Benjamin Mako Hill. I would also like to thank Aaron Halfaker for encouraging me to write this after seeing a preliminary version. Thanks to Stuart Yeates and other participants in the \texttt{wiki-research-l} mailing list \url{wiki-research-l@lists.wikimedia.org} for answering my questions about measuring article quality and effort. Finally, thank you to the anonymous OpenSym reviewers whose careful and constructive feedback improved the paper.
\end{acks}

% bibliography here
\bibliographystyle{ACM-Reference-Format}
\bibliography{refs}

\end{document}