Big data in social science research: access and replication

This is a blog I wrote for Nesta in two parts (part 1, part 2). I include both parts here for convenience.

Academic researchers are increasingly recognising the value in proprietary big data. The huge, networked, personal datasets associated with web giants such as Facebook could be a boon for social science research. But without a way for interested researchers to consistently access this data, such research fails the basic test of replicability.

In 1967, psychologist Stanley Milgram published the results of an experiment designed to explore the ‘small world problem’: how long is the chain of acquaintances connecting two randomly selected people? Milgram gave a card to each of around 300 volunteers, and asked them to deliver it to a target person in a distant city. They could pass it directly only if they knew the target personally; otherwise, they were instructed to pass it to a personal acquaintance whom they judged more likely to know the target. Only around a quarter of the cards made it, and of those, the average number of intermediaries was around six. So arose the famous ‘six degrees of separation’.

In 2013, the idea of a quantified social network no longer seems novel. From Facebook, I know that 55% of my friends are male. Thanks to LinkedIn I know that David Cameron is a 3rd degree connection, as is Barack Obama. (I’m not especially well-connected: this is the small-world effect at work.) Such datasets are beginning to revolutionise the kind of research Milgram pioneered. Last year researchers from Facebook and the University of Milan published an updated version of Milgram’s work. They examined the entire Facebook network - at that time 721 million active users - and concluded that a pair of random individuals had an average of 4, rather than 6, degrees of separation.
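For readers who want to see what an ‘average degrees of separation’ calculation involves, here is a minimal sketch in Python. It assumes a tiny, made-up friendship network (the names and the `friends` dictionary are hypothetical) and counts hops along the shortest chain between each pair of people using a breadth-first search. It is an illustration of the idea only, not the method used in the Facebook study: an exhaustive search like this would not be practical on a network of hundreds of millions of users.

```python
from collections import deque


def degrees_of_separation(graph, source):
    """Breadth-first search: number of hops from `source` to everyone reachable."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for friend in graph[person]:
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return distance


def average_separation(graph):
    """Mean shortest-path length (in hops) over all connected ordered pairs."""
    total, pairs = 0, 0
    for person in graph:
        for other, hops in degrees_of_separation(graph, person).items():
            if other != person:
                total += hops
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy network: each person maps to a list of their friends.
friends = {
    "Ann": ["Bob"],
    "Bob": ["Ann", "Cat", "Eve"],
    "Cat": ["Bob", "Dan"],
    "Dan": ["Cat", "Eve"],
    "Eve": ["Bob", "Dan"],
}

print(degrees_of_separation(friends, "Ann"))
print(average_separation(friends))
```

Note that this counts hops (links in the chain), which is one more than the number of intermediaries, so the two figures are easy to confuse when comparing studies.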

It’s not just social psychology that is benefiting. Across the social sciences, big data is providing insights. Previously, economists wanting to work on inflation were limited to official figures collected by government agencies such as the Office for National Statistics (ONS), and released with a delay. Now, the Billion Prices Project at MIT, in collaboration with spinout company Pricestats, is offering an alternative: measures of inflation calculated from billions of price observations, scraped from ecommerce sites across the web, available in real time. This has been so successful that the Economist now includes Pricestats data for Argentina, rather than official estimates, citing the unreliability of the latter.

The world of big data promises exciting new avenues of social research, but there is reason to be wary. Such research introduces new difficulties compared with research using traditional data sources. The biggest of these is that the data are usually proprietary: even when favoured researchers are invited to work inside these companies, and even when they publish in respected journals, the raw data are rarely made public. As a consequence, replication - a core part of the scientific method - is rarely possible. This is a serious drawback, and it will need to be resolved before we can realise the value of big data in social science research.

In the first part of this two-part post, I argued that big data has the potential to open up exciting new avenues in social research. But much of the world’s data is commercial, and private. Access, and in particular access for the purpose of replicating published results, remains difficult. In this second part, I illustrate the problem, and suggest the direction possible solutions will take.

The increasing use of big data in social research is exciting. But this trend is on a collision course with another emerging research trend: full and open publication of data and code.

In the early days of academic publishing, it was common to include full data tables in articles. These were often no more than a page, and few other methods existed for dissemination. Then, as datasets grew larger and journal pages more crowded, convention shifted to ‘data available on request from the author’ - an extra step that discouraged careful review. This is still too often the case, and it leads to persistent errors. Only recently, a paper by economists Carmen Reinhart and Kenneth Rogoff, which had been highly influential in the stimulus-vs-austerity debate, was found to contain just such an error. (The mistake came to light only after a graduate student obtained the original spreadsheet calculations used by Rogoff and Reinhart. There’s no doubt it would have been detected earlier had the spreadsheet been published as a matter of course.)

Thankfully, the advent of the web has made it once again possible for all data and calculations supporting academic papers to be published, and leading journals have required this for some time now.

Unfortunately, datasets from internet giants like Facebook, LinkedIn and Google do not fit this paradigm. Not all such companies cooperate with academic researchers. But even when they do, detailed supporting data are almost never made public, which prevents careful review and replication by other researchers. Many companies offer public APIs to access selected data, but these are usually designed with the commercial, rather than research, ecosystem in mind, and they are often not useful for replication.

Three barriers prevent publication of more data. The first is sheer size: Facebook’s data is measured in tens of petabytes. The second is commercial confidentiality: for many of these businesses, data is a key competitive advantage. The third is user privacy: users would not stand for all their data being made available to the (research) world at large, even with the best of intentions.

These are difficult challenges, but partial solutions exist. The authors of the Facebook small-world experiment, to their credit, made available intermediate data, allowing some degree of replication by other researchers. There is a field of research, ‘statistical disclosure control’, dedicated to examining the problem of maintaining individual privacy when disseminating statistical data. Public institutions such as the UK Data Service and the ONS have a great deal of experience with the parallel problem that arises in official survey data. There are already secure facilities with both the technological capacity to deal with large datasets and the controls needed to minimise risks to privacy: these offer a model to imitate.
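To give a flavour of what statistical disclosure control can look like in practice, here is a minimal sketch of one common technique, small-cell suppression: counts are published for each combination of characteristics, but any cell containing fewer than some minimum number of people is withheld. The records, the field names and the threshold of 5 are all assumptions chosen for illustration, not rules taken from any particular institution.

```python
from collections import Counter


def tabulate_with_suppression(records, fields, threshold=5):
    """Count records per combination of `fields`, hiding small cells.

    Cells with fewer than `threshold` records are returned as None,
    so that rare combinations cannot single out an individual.
    The default threshold of 5 is an illustrative assumption.
    """
    counts = Counter(tuple(record[f] for f in fields) for record in records)
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


# Hypothetical microdata: one row per user.
users = (
    [{"region": "London", "age_band": "18-24"}] * 7
    + [{"region": "London", "age_band": "65+"}] * 2
    + [{"region": "Leeds", "age_band": "18-24"}] * 5
)

table = tabulate_with_suppression(users, ["region", "age_band"])
for cell, n in sorted(table.items()):
    print(cell, "suppressed" if n is None else n)
```

Suppressing small cells is only a first line of defence: if row or column totals are published alongside the table, a hidden value can sometimes be recovered by subtraction, which is one reason real disclosure control needs more care than this sketch shows.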

The academy is beginning to recognise the extent of the problem, and develop solutions, both institutional and technological, for safely disseminating large proprietary datasets. The Economic and Social Research Council recently announced £14 million in funding for Business Datasafe, a repository designed for just such data. This is a good first step, but alone it will not be enough. Business leaders must be persuaded to sign up to the broader ‘data philanthropy’ agenda: even without the Datasafe, they could directly expose data in a way more useful to researchers. Users need to have confidence that personal data will remain private.

The social payoff to greater research using big data could be very large. But if we fail to address the access issue, some of the most fascinating and influential results in the next few decades of social science will remain unreviewed, untested and unreplicated. If that happens, then we will be squandering much of the value of this new world of research.
