What is big data?

People often ask me, “What is big data?”

The answer I usually give is: “Any data too large to process using your normal tools & techniques.”

That’s a very context-dependent answer: last week, in one case it meant one million records (too big for Stata on a laptop), and in another it meant a dataset growing at about 1 TB per day. To put it another way, big data is anything where you have to think about the engineering side of data science: where you can’t just open up R and run lm(), because that would take a day and need a terabyte of memory.
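To make the "engineering side" a little more concrete, here is a minimal sketch (in Python rather than R, and not taken from any real project) of fitting a one-variable linear regression in a single pass over the data, keeping only a handful of running sums in memory instead of loading everything at once. The function names and the synthetic data generator are hypothetical, made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def streaming_simple_regression(pairs: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Fit y = intercept + slope * x in one pass, storing only running sums."""
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y

    # Closed-form ordinary least squares for a single predictor.
    denom = n * sum_xx - sum_x ** 2
    if n < 2 or denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


def synthetic_rows(n_rows: int) -> Iterator[Tuple[float, float]]:
    """Hypothetical stand-in for rows streamed from disk: y = 3 + 2x exactly."""
    for i in range(n_rows):
        x = float(i)
        yield x, 3.0 + 2.0 * x


intercept, slope = streaming_simple_regression(synthetic_rows(1000))
print(intercept, slope)
```

Memory use stays constant however many rows stream through, which is the kind of trade-off that starts to matter once lm() no longer fits. Plain running sums like these can lose precision on very large or badly scaled data, so treat this as an illustration of the idea rather than something to run in production.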


Multi-color text in ggplot2

Occasionally, when producing charts, it’s helpful to plot a single text element in multiple colors. Here’s an example from the SDG Atlas, where we used multiple colors to good effect in the labels for “Bangladesh” and “United States” (I’ve darkened some parts of the chart to make the relevant labels stand out):



Introduction to Wikidata using SPARQL

You’re certainly familiar with Wikipedia, but you may not be aware of Wikidata, which is an ongoing effort to structure some of the data underlying Wikipedia. Traditionally, facts (e.g. the population of New York City) are embedded in the text of a wiki, and there’s no easy way to automatically extract them. Wikipedia has a little more structure than this, but it’s still really designed for humans rather than machines.

Wikidata is the opposite – designed for machines, not humans.

It’s part of the broader semantic web movement, which aims to make the web more machine-readable. Most of the time you don’t notice this, but when you run a query like “spouse of George Washington” and see a direct answer like the one in the screenshot below, rather than just a collection of links, that’s Google taking advantage of semantic web data (probably – they might also be using machine learning to infer it from unstructured text).

[Screenshot 2017-09-11 16.45.07: Google search result for “spouse of George Washington”]
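As a rough illustration of what "designed for machines" means in practice, the sketch below builds a SPARQL query asking Wikidata the same "spouse of George Washington" question, plus a URL for the public Wikidata Query Service. It makes some assumptions: Q23 and P26 are the identifiers I believe Wikidata uses for George Washington and the "spouse" property (worth checking on wikidata.org before relying on them), and the helper names spouse_query and query_url are hypothetical. The code only constructs the query and URL; it does not contact the network.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def spouse_query(entity_id: str) -> str:
    """Build a SPARQL query listing the spouse(s) of a Wikidata entity.

    Assumes P26 is the 'spouse' property.
    """
    return (
        "SELECT ?spouse ?spouseLabel WHERE {\n"
        f"  wd:{entity_id} wdt:P26 ?spouse .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def query_url(sparql: str) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


# Assumption: Q23 is the Wikidata item for George Washington.
sparql = spouse_query("Q23")
url = query_url(sparql)
print(sparql)
print(url)
```

Pasting the printed query into the query service's web interface, or fetching the printed URL with any HTTP client, should return the answer as structured data rather than as prose to be read.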