Open data, data formats, and the CIS

Every time someone talks about open data, I think about the CIS, a public research institution in Spain known mostly for its political surveys. Its case nicely illustrates some of the challenges involved in releasing data to a public that may include occasional consumers but that consists largely of sophisticated, perhaps professional, users. Let me start by saying that social researchers in Spain cannot be thankful enough to an institution that runs excellent surveys and shares the results for free. However, it is also fair to admit that the CIS takes a surprisingly outdated approach to data sharing.

In 2009 the CIS started making all its data available to download free of charge. Before that, researchers had to buy each survey separately, and the data was distributed on disk. That was a huge step. But the CIS modernized the distribution channel and nothing else. To this day, the data is still provided in fixed-width format (FWF), a non-proprietary format designed to minimize disk storage. For SAS or SPSS users, the CIS provides a map from each column of the microdata file to a column of the final data matrix. Users of any other program must build their own crosswalk (a truly thankless task), ask a favor of friends who own an SPSS or SAS license, or depend on free clones such as PSPP.
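To make the crosswalk problem concrete, here is a minimal Python sketch of what a user without SPSS or SAS ends up writing by hand. The variable names, column positions, and sample records are all made up for illustration; they do not come from any actual CIS file, and a real survey would need hundreds of entries in the layout.

```python
import csv
import io

# Hypothetical layout: (variable name, first column, last column),
# 1-based and inclusive, as a dictionary file would list them.
# These names and positions are invented, not taken from a CIS survey.
LAYOUT = [
    ("study", 1, 4),
    ("respondent", 5, 9),
    ("region", 10, 11),
    ("age", 12, 13),
]


def parse_fwf(lines, layout):
    """Slice each fixed-width record into a dict keyed by variable name."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Convert 1-based inclusive positions into a Python slice.
        rows.append(
            {name: line[start - 1:end].strip() for name, start, end in layout}
        )
    return rows


def to_csv(rows, layout):
    """Serialize the parsed rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[name for name, _, _ in layout],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two invented records that follow the hypothetical layout above.
SAMPLE = ["2800000011334", "2800000020857"]

rows = parse_fwf(SAMPLE, LAYOUT)
csv_text = to_csv(rows, LAYOUT)
print(csv_text)
```

The parsing itself is a few lines. The tedious part is typing out the layout, which is exactly the information that is only published in SPSS and SAS syntax.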

That implies that the CIS is favoring, somewhat arbitrarily, users of particular software packages. One could make a favorable case based on the popularity of SPSS among social scientists,[1] but it is also true that Stata has become the de facto standard among the top schools over the past 10 years, and that R is now enjoying a golden age. However, the problem is not about software (the CIS may legitimately decide not to provide dictionaries for any program); it is that the CIS distributes the data in a way that optimizes one dimension, storage space, that is no longer relevant, at the expense of usability. Or, to put it differently, the CIS is not up to speed with current standards.

It is a detail of seemingly little importance that speaks to a broader problem related to bureaucracy, technology, and open data. The discussion is not about revamping the technology or the workflow that the CIS uses to build the data files. We are not talking about making the data available through an API or about cutting down the turnaround time for each survey, dimensions in which there is certainly room for improvement. The CIS publicly shares a very small number of datasets, and exporting them to a new format is trivial. It is uncontroversial that the cost of transitioning to a new data format is, in practical terms, zero. And yet that transition has not occurred.

The broader point is that the CIS has not adapted to a new technological environment or to changes in the habits of its users, in spite of the fact that it is supposed to interact mostly with researchers (a small, tightly connected, highly qualified community). The reason may well be misaligned incentives for the technical personnel, a lack of communication with the user base, or maybe sheer inertia. None of those explanations would be surprising. Be that as it may, it is important to remember that when we talk about open data we have to bring in the institutional side of it: someone with their own private preferences has to collect, distribute, and maintain the data. Those operations require both fluid vertical interaction between the data provider and the users and the flexibility to adapt to standards that may change very quickly. And sometimes, perhaps unfairly, we have trouble associating both attributes with the bureaucracy.

  1. I have yet to meet a social scientist using SAS. 
