Feb 18, 2013

What can be learned from repeated misprints in journal citations

Mistakes, misprints, and typos are easy: a transposition of characters can happen with a slip of the fingers; a word can be added or dropped from a quote with a shift of the eye. These things happen.

The way these things happen, their frequency and their patterns, may also be important.

As M.V. Simkin and V.P. Roychowdhury argue in a 2002 paper, "misprints in scientific citations should not be discarded as a mere happenstance, but, similar to Freudian slips, analyzed."

That paper, pointed out recently by demographer Conrad Hackett and historian Yoni Applebaum, claims that mistakes made in scientific citations can tell us something about academics' work. Simkin and Roychowdhury, of the University of California, Los Angeles, look at the way certain mistakes in citations are repeated, copied, and spread. They identified one paper, for example, that had been mis-cited 196 times. Of those 196 citations that did not correctly reproduce the information of the original paper, 78 made exactly the same mistake: a single identical misprint accounted for roughly 40 percent of all the cases in which there was a misprint. That would be statistically improbable, according to Simkin and Roychowdhury, if everyone were reading the original paper. It's just not plausible that 78 different authors independently made exactly the same mistake.

They argue that these 78 identical misprints of a single source are, rather, reproductions of a mistake.
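
To see why independent errors are such an implausible explanation, here is a rough back-of-the-envelope simulation. This is only my illustration of the intuition, not Simkin and Roychowdhury's actual analysis: the figures 196 and 78 come from the study, while the size of the pool of plausible typos is a guess.

    import random
    from collections import Counter

    # Back-of-the-envelope check, not the authors' analysis: if 196 mis-citing
    # authors each made an independent typo, how big would the largest cluster
    # of identical misprints typically be?

    N_MISPRINTS = 196   # mis-citations of the one paper, a figure from the study
    POOL_SIZE = 50      # assumed number of equally likely wrong values (a guess)
    TRIALS = 10_000

    def largest_cluster():
        """Size of the biggest group of identical misprints among independent typos."""
        typos = [random.randrange(POOL_SIZE) for _ in range(N_MISPRINTS)]
        return Counter(typos).most_common(1)[0][1]

    results = [largest_cluster() for _ in range(TRIALS)]
    print("average largest cluster:", sum(results) / TRIALS)
    print("largest cluster in any trial:", max(results))
    # Under these assumptions the largest cluster comes out around ten misprints
    # and never gets anywhere near the observed 78.

Even with a much smaller pool of plausible typos, the largest cluster of identical but independent errors falls far short of 78; a cluster that size only makes sense if the misprint itself is being copied.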

Simkin and Roychowdhury report that they found a dozen cases of this, and from these sets of identical mistakes they attempt to work out a method for mathematically estimating the percentage of people who have not read the work that they cite. Their conclusion: only about 20 percent of those who cite these heavily cited papers appear to have actually read them.

There are some questions raised by this conclusion, and the paper is ambiguous on some points. Is it reasonable, for example, to think that accurate citations are being reproduced at the same rate as inaccurate ones? Does it follow that, because one oft-repeated citation was re-cited without examination of the original, the same is true for less common citations? If 80 percent of those who cited a particularly popular paper didn't actually read it, does that mean that 80 percent of all citations are second- or third-hand? Or that 80 percent of authors always work this way? It's not clear to me that these particular cases are representative of citations in general. There are good reasons to be skeptical of a strong version of this claim, such as the claim that eight out of 10 authors haven't read eight out of 10 of the works they cite.

This is a very bold interpretation of the evidence being offered. It is not a misrepresentation of the authors' arguments, but it is nevertheless a very strong claim that may not be completely supported by the study. A more cautious interpretation of Simkin and Roychowdhury's argument might be that there is a class of citations that are copied and reproduced by scholars who don't take the time to search out the original source.
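
To make that copying story concrete, here is a toy read-or-copy process. It is my own illustration rather than the model in the paper: each citer either goes back to the original, very occasionally introducing a fresh typo, or copies an earlier citation, typo and all. The parameters are assumptions, with the reading rate set at the study's headline figure of roughly 20 percent.

    import random
    from collections import Counter

    # Toy read-or-copy model (my illustration, not Simkin and Roychowdhury's):
    # each citer either consults the original, very occasionally introducing a
    # brand-new misprint, or copies an earlier citation verbatim, misprint and all.

    random.seed(1)

    P_READ = 0.2     # assumed share of citers who consult the original
    P_TYPO = 0.05    # assumed chance that a reader introduces a fresh misprint
    N_CITERS = 4000  # arbitrary number of citing papers

    citations = []   # each entry is "correct" or a misprint label
    next_typo = 0

    for _ in range(N_CITERS):
        if not citations or random.random() < P_READ:
            # A reader: cites correctly unless a new typo slips in.
            if random.random() < P_TYPO:
                citations.append("misprint-%d" % next_typo)
                next_typo += 1
            else:
                citations.append("correct")
        else:
            # A copier: reproduces a randomly chosen earlier citation.
            citations.append(random.choice(citations))

    clusters = Counter(c for c in citations if c != "correct")
    print("total misprinted citations:", sum(clusters.values()))
    print("distinct misprints:", len(clusters))
    print("copies of the single most repeated misprint:",
          max(clusters.values(), default=0))
    # Copying concentrates the errors: a few early typos pick up many copies
    # while most later ones stay rare.

The point of the sketch is simply that copying, not carelessness at the keyboard, is what produces large clusters of identical misprints.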

If one sets aside the question of whether or not these citations were copied without consulting the original, and thinks about this research as just examining how citations are propagated, a stronger and more important conclusion emerges. This is the conclusion that Simkin and Roychowdhury reach in a later paper on the same subject. There, they claim "Our analysis of misprint propagation provides the evidence that citation copying dominates the dynamics of the network of scientific papers" (emphasis original).

There may or may not be evidence that eight out of 10 journal article authors are not doing due diligence with as many as eight out of 10 of their sources, but the way misprints are reproduced does demonstrate something important about the way information spreads.

A more apt metaphor for these reproduced misprints might be genetic mutations. Simkin and Roychowdhury describe the errors as Freudian slips, but that's not quite right: a slip of the tongue may reveal much, but it isn't repeated and reproduced.

The suggestion implicit in these studies is that these mistakes are important: not necessarily in and of themselves, but in how they demonstrate the functioning of information networks. What becomes visible, with these repeated mistakes and misprints, is that academic information is not distributed in a simply linear, unidirectional way. Units of information, represented by citations, are mostly passed not from the original source to a single recipient but from a source to multiple recipients, who then become sources for other recipients, in a multidirectional, webbed process.

As Simkin has elsewhere argued: "a paper that already was cited is likely to be cited again, and after it is cited again it is even more likely to be cited in the future."
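
That is a rich-get-richer, or preferential-attachment, dynamic. Here is a minimal sketch of it, with made-up parameters, showing how such a process produces a few heavily cited papers and a long tail of rarely cited ones:

    import random
    from collections import Counter

    # Minimal preferential-attachment sketch with made-up parameters: each new
    # paper cites earlier papers chosen mostly in proportion to how often they
    # have already been cited, plus a small chance of citing one at random.

    random.seed(7)

    N_PAPERS = 5000
    CITES_PER_PAPER = 5
    P_RANDOM = 0.1   # chance of citing a uniformly chosen earlier paper instead

    all_citations = []   # flat list of cited paper indices; sampling from it is
                         # proportional to how often each paper has been cited

    for paper in range(1, N_PAPERS):
        for _ in range(CITES_PER_PAPER):
            if all_citations and random.random() > P_RANDOM:
                target = random.choice(all_citations)   # the rich get richer
            else:
                target = random.randrange(paper)        # any earlier paper
            all_citations.append(target)

    counts = sorted(Counter(all_citations).values(), reverse=True)
    print("most cited paper:", counts[0], "citations")
    print("median cited paper:", counts[len(counts) // 2], "citations")
    # The result is heavily skewed: a handful of early papers pile up hundreds
    # of citations while the typical paper collects only a few.

In a pure version of this model, the papers that end up most cited are largely the ones that happened to be cited early; the network does the rest.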

To a certain extent, this is obvious. After all, in our own era of techno-social networks, the operations of some networks are readily apparent. It's easy to see that Conrad Hackett's tweet about this paper was retweeted more than 30 times, and one can trace a transmission line quite simply: from, say, this blog post back to Yoni Applebaum's tweet back to Hackett's. This is how the data from a paper published more than a decade ago spreads. It's less apparent whether the 32 journal articles that cite this 2002 paper about citation also came by the information in this fashion, but it's plausible to expect that similar sorts of networks were at work.

Despite how common such networks are, though, and how apparent it is that information gets distributed by being reproduced, this is commonly forgotten.

One way this is ignored in my own area of study is that readers, in my case specifically readers of contemporary Christian fiction, are thought of as singular. There are assumptions of isolation: readers reading what they read all by themselves, as individuals completely detached from any social processes. This isn't always the case, but the ways in which readers are networked and connected are often ignored. There's still a persistent practice of thinking of texts as transmissions of information from author to reader in a simple, linear, unidirectional process.

A fuller understanding of what has to happen for an individual reader to read a text can reveal a lot of detail that would otherwise be mystified.

As Robert Darnton has written, there is a "communications circuit," a cycle through which books, and texts more generally, "come into being and spread through society." Darnton argues that the history of a text should be broadened to include not just the author and the reader but the author, the publisher, the printer, the shipper, the bookseller, and the readers. None of these agents acts in isolation, as a kind of pure individual. When a reader reads, all of these other agents are also at work. The reader is always part of these social contexts and processes, these networks -- though, Darnton notes, "Reading remains the most difficult stage to study in the circuit that books follow" -- and there are also networks of readers.

The dynamics of such networks, as Simkin and Roychowdhury say, often dominate, which is to say they go a long way in determining who reads and how.

This is true, too, for academic readers, and can be traced in the way academic information is distributed. The reproduction and propagation of misprinted citations is one interesting way to see that.

2 comments:

  1. Thanks, Daniel, for this sophisticated critique of the original paper. I was very deliberate in the way I rephrased Hackett's tweet, which seemed to make a claim not supported by Simkin and Roychowdhury. What they're actually looking at - and this is crucial - is the small subset of the very most commonly cited papers.

    This isn't, in other words, a study that's terribly illuminating about the research practices of scholars, even within the scientific networks it examines. But it does, I suspect, shed light on the practice of name-checking - the ritual citation of the best known works on particular topics, whether or not they bear more than a tangential relationship to the subject of the paper itself. I think that Simkin and Roychowdhury make a reasonable case that, most of the time, scholars who are citing these papers have read the relevant scholarly literature, seen the same one or two papers cited in every work they consult, and so simply copy those particular citations into their own papers.

    I'm less comfortable, I should say, with the supposition that this means that they've never read the works they're citing. Think, for example, of Joan Scott's 'Gender: A Useful Category of Historical Analysis' - by an enormous margin, the most frequently read and cited article in history over the past few decades. It's virtually impossible to get a doctorate in history without reading it. But when you're writing an article and feel obliged to name-check it, you might quite reasonably copy the citation out of a footnote of another article you happen to be reading, instead of going back to look it up yourself.

    The other variable here is bad meta-data in library catalogs, citation indexes, journal indexes, and bibliographies. I've frequently read an article and, instead of painstakingly copying over the detailed citation, simply downloaded it directly into my citation software. To be perfectly honest, I don't generally check whether that citation has the page numbers exactly right, or whether it uses the right year for copyright. I just assume it does. This paper uses an older dataset, so it probably doesn't include EndNote or Zotero. But these problems were at least as endemic to older card catalogs, library databases, and bibliographies. What shows up as copying may, in many cases, actually be reliance on the same bad sources.

    So the 'copying without reading' allegation, while sensational, is probably less compelling than the analysis you offer - that this shows the self-reinforcing cycle by which frequently cited papers become cited even more frequently.

    1. Thanks, Yoni.

      I don't know that I made it clear enough that I'm not critical of your reading of the paper. Actually, I think you're more cautious about interpreting the evidence in the paper than the paper is itself.
