I’m a taxpayer, I want my data!
A ruling by the Information Commissioner has ordered scientists at Queen’s University in Belfast to hand over copies of 40 years of research data on tree rings after a long battle with a climate sceptic (PDF of the ruling). This is an important precedent for scientists, who now have to comply with the strictest interpretation of the Freedom of Information (FoI) Act. According to the Times: "Phil Willis, a Liberal Democrat MP and chairman of the Science and Technology Select Committee, said that scientists now needed to work on the presumption that if research is publicly funded, the data ought to be made publicly available." More and more, there are demands for public releases of research data.
Were the scientists right in trying to withhold data, or is the public interest stronger? Is there a moral obligation to publish not just the results of publicly funded research, but the underlying data?
The Belfast researchers cited several reasons for not complying: copying the data from hundreds of floppy disks (!) and paper files would be unreasonably laborious and expensive; the data itself would not be meaningful to the independent researcher making the FoI request; the data was unfinished and in use; releasing it would weaken intellectual property rights; and it was commercially confidential. These mirror many of the complaints from other researchers (and non-researchers) when forced to release data.
The first issue is whether it would be unreasonably hard to release the data. In this era of multiterabyte hard drives and Internet communications, this objection might seem completely obsolete. But research data rarely resides in a nice package, and even figuring out where all the raw pieces of a major dataset reside may be complicated. Yet having data in a collected, copyable format is also deeply important for safeguarding it for the future. One of the more worrying revelations in the climategate affair was that crucial climate data on which current models are partially based had been erased. NASA has lost many tapes of early space exploration data that could be important today – some due to a whale oil shortage! If important data cannot easily be copied, then the research group may have a bigger methodological and curatorial problem than just pesky FoI requests. Only if the total dataset is measured in petabytes or resides on parchment may this objection hold water.
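One practical safeguard against exactly this kind of loss is a checksum manifest for the dataset: a list of every file together with a cryptographic digest, so that copies can be verified after migration to new media. As a minimal sketch (the function name and layout are my own, not anything from the ruling):

```python
import hashlib
import os

def checksum_manifest(root):
    """Walk a directory tree and return {relative path: SHA-256 hex digest}.

    A manifest like this makes a dataset copyable and verifiable: after
    moving the files to new media, recompute the digests and compare.
    """
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in chunks so even very large files fit in memory.
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            manifest[os.path.relpath(path, root)] = h.hexdigest()
    return manifest
```

A group whose data already lives under such a manifest would find an FoI request a matter of copying files, not of weeks of archaeology among floppy disks.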
The second argument is that the data is not meaningful in itself. Raw measurement data rarely means anything on its own: it needs to be linked to where and when it was measured, how it was measured, the local conditions, and the quirks of the apparatus. A deep understanding of what the measurement process involves is often needed to make sense of it, as noted by a physicist commenter. Add to this the several layers of data processing, statistics and interpretation needed before the data becomes useful. One of the beneficial effects of the climategate affair was the final release of some land weather station data from the Met Office. This triggered a brief flurry of amateur exploration, which I participated in. In the end I managed to replicate the official curves, but the most enduring result was an appreciation of how complex even fairly well-behaved data can be to process. But the fact that data needs much context and processing in order to become information does not mean it should not be released; on the contrary, it suggests that there are so many possible sources of bias and mistake that more eyes are needed.
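To give a flavour of the processing involved, here is a toy Python sketch of the kind of pipeline needed to turn raw monthly station readings into an annual anomaly curve. The data layout and thresholds are my own assumptions for illustration, not the Met Office's actual procedure; the point is that even this trivial version forces judgement calls that shape the result.

```python
import statistics

def annual_anomalies(readings, baseline_years):
    """Turn monthly station temperatures into annual anomalies.

    readings: {(year, month): temperature in deg C}, possibly with gaps.
    baseline_years: the years defining the reference climate.
    """
    def annual_mean(year):
        months = [t for (y, _m), t in readings.items() if y == year]
        # Judgement call: require at least 10 of 12 months, else drop the year.
        return statistics.mean(months) if len(months) >= 10 else None

    # Judgement call: which baseline period to average over.
    baseline_means = [annual_mean(y) for y in baseline_years]
    baseline = statistics.mean([b for b in baseline_means if b is not None])

    years = sorted({y for (y, _m) in readings})
    return {y: annual_mean(y) - baseline
            for y in years if annual_mean(y) is not None}
```

Change the missing-month threshold or the baseline period and the curve shifts, which is precisely why the context around a dataset matters as much as the numbers themselves.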
A common theme in scientist comments on the ruling is that this kind of legislation will support cranks and ideologically biased researchers at the expense of proper researchers. I find this argument doubtful: cranks will make wild theories regardless of data, and access to real data will also make it harder for the biased researchers to hide their bias.
The argument that the data was unfinished and important for upcoming publications is somewhat amusing given that the Belfast dataset has a 40-year history. I do not doubt that it is under constant revision and updating, nor that researchers are finding new things to publish about it. But being under revision does not mean the current data is meaningless. The real issue here, which I will discuss more below, is to what extent outsiders have a right to access data that is being used to produce scientific papers before those papers are finished.
The issue of intellectual property rights and commercial secrets gets closer to the real moral issues. Don't researchers who work hard to create data have a right to use it, control to whom they give it and, in particular, benefit from it?
The commercial angle might be a side-track, although I believe it may have been a strong factor in this case: using dendrochronological data for dating can be profitable, and having outsiders gain access reduces that income. This also appears to be a motivation in attempts to keep maps, postcodes and even meteorological data private. But there are public benefits to releasing such data (e.g. see Climate crank inadvertently does archaeology a favour), and they might overrule the profit motive, at least for publicly funded research. The goal of the funding and regulation should not be to provide a benefit to a small group of recipients, but to society as a whole. This is particularly important in the case of data of wide importance, such as climate data. In this particular case dendrochronology might not be the right tool to check the reality of the Medieval Warm Period, but the benefits to archaeology might still be regarded as worthwhile. We should not undervalue the benefits of giving amateur scientists the chance to use real data, as evidenced by the growth of "internet astronomy". New uses, previously not even conceivable, may also emerge from access to the data.
A real fear among researchers is that being forced to release data will give competitors the chance to exploit it for their own career ends, leaving the data-producers at a disadvantage. This might be related to the speed of release: in genomics, mandated release occurs at a very early stage, which has produced problems. Releasing data at a slightly later stage may be beneficial, as found in protein crystallography, where the mandated release of coordinates and reflections appears to work well. The key seems to be that the originators get enough time to publish based on their data, and that appropriate credit is given for uses of the data. That way free riding can be reduced.
In many ways this is exactly the same issue as surrounds how to properly protect and reward the creators of good new ideas. Patents and other intellectual property laws attempt to do this by granting temporary monopolies. They are rarely motivated by the idea of true intellectual property, but rather by a utilitarian weighing of the benefits to society against the benefits to the creator. In the case of scientific data I think one can make a case that until it is used to make publications it is not public: it is the act of making a public claim that "X is/isn't the case" that makes the data and reasoning leading to the claim a matter of public interest. Before that, it is just private knowledge. Afterwards, given the collective nature of science and knowledge, withholding the data will hurt the further development of science, which was the whole point of publicly funding the research in the first place. There might be some use for an embargo period to enable publication of multiple papers, but it needs to be kept short to keep legitimate investigations of the data possible.
It might be argued that public funding implies public ownership of the data before publication, but public ownership does not imply unrestricted access. As a taxpayer I in a sense own some of my country's fighter jets, but I do not think I have a right to fly one, or to individually demand that one flies a mission I happen to like. Data access can also be restricted on grounds of the privacy of test subjects, and perhaps on security grounds. But to stay consistent with the idea of publicly funded science, the data produced should be disseminated. Even privately funded science is often based on the idea of promoting general advancement rather than benefiting a certain group, so it would be counterproductive to lock away the data.
If private or commercial research is done for the purpose of gaining an individual advantage I am fine with the idea of it being closed, but the results should also not be taken as seriously as open science: we know that funding bias can be a major factor, and we know that mistakes, biases and misconduct are more common than we would like in science. Open, peer-reviewed publications are likely not enough to guard against this (hence the requirements of publicly registering clinical trials), so data and method sharing might be necessary to make us confident in the results.
I think the key question here is whether science is seen as a fundamentally cooperative or competitive enterprise. Certainly individual researchers compete for money, posts and recognition, and theories compete in "the marketplace of ideas". But I think most agree that this competition is at most a means towards a desired end: the growth of systematic, universal knowledge. Keeping data private does not seem to improve that growth as much as allowing it to spread fairly freely. The climategate revelations showed that data sharing often happens between "friendly" research groups while data is withheld from opposing researchers: this goes against the aim of neutrality, and might contribute to groupthink. Sharing information with people we disagree with is central to the scientific process. The way to "win" should be to come up with better ways of getting and understanding the information. That is what we should set up rules to favour, and that is what we as scientists ought to strive for.