How the west was really won: by manipulative data visualization.

One of the most well-known historical theories about America is the “frontier thesis,” advanced by Frederick Jackson Turner in 1893. It states that the long existence of a “frontier,” a zone between the settled and unsettled regions of the United States, is responsible for distinctive aspects of the American character: independence, self-reliance, egalitarianism, a certain disdain for high culture and learning, etc. The United States was, at independence, a fairly narrow strip of land east of the Mississippi, but the frontier was pushed continually west until eventually it reached the Pacific and vanished altogether.

You can see this movement on a sequence of population density maps, based on decennial census results, published in the 1898 Statistical Atlas of the United States. Here are the maps for 1790 to 1860, when they didn’t even bother showing the west:



What is big data?

People often ask me

What is big data?

The answer I usually give is

Any data too large to process using your normal tools & techniques.

That’s a very context-dependent answer: last week in one case it meant one million records, too big for Stata on a laptop, and in another it meant a dataset growing at about 1 TB per day. To put it another way, big data is anything where you have to think about the engineering side of data science: where you can’t just open up R and run lm(), because that would take a day and need a terabyte of memory.
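
To make the "engineering side" concrete, here is a minimal sketch of one way around the memory problem for the simplest case, a regression of y on a single x. It is an illustration under stated assumptions, not a replacement for lm(): the function name `streaming_simple_ols` and the (x, y) row layout are hypothetical, and it keeps only five running sums in memory instead of the whole dataset.

```python
from typing import Iterable, Tuple

Row = Tuple[float, float]  # hypothetical record layout: one (x, y) pair per row


def streaming_simple_ols(rows: Iterable[Row]) -> Tuple[float, float]:
    """Fit y = intercept + slope * x in a single pass over the rows.

    Only running totals are kept, so memory use stays constant no matter
    how many rows the iterable yields.
    """
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for x, y in rows:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y

    denom = n * sum_xx - sum_x * sum_x
    if n < 2 or denom == 0:
        raise ValueError("need at least two rows with distinct x values")

    # Closed-form least-squares solution for one predictor.
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


if __name__ == "__main__":
    # A generator stands in for a file or database cursor too big to load.
    data = ((float(i), 2.0 * i + 1.0) for i in range(1000))
    print(streaming_simple_ols(data))
```

The rows can come from any iterator, such as a file read line by line, so the data never has to fit in memory at once. The raw-sums formula can lose precision when the values are very large or nearly constant, so treat this as a sketch of the idea and not a numerically robust fit.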