Google’s data supremacy: should we be worried?

The beginnings of a debate have emerged, in the Financial Times, between Evgeny Morozov and Hal Varian, regarding Google’s alleged monopoly – not in search, but in data.

Morozov contends [unpaywalled] that Google’s damaging market power is not so much its dominance of search as its ability to successfully compete in other domains using the data acquired through search. This is an important argument: the same argument that was used to justify the EU’s antitrust action against Microsoft (that it leveraged an operating system monopoly to unfairly compete in web browsers).

In reply, Varian claims that data is a nonrival good, and that moreover, Google is remarkably open with its data. That data is nonrival – that is, that one person’s “consumption” of data does not preclude another’s consumption of the same data – is indisputable. But I’m not sure that matters, because it certainly is excludable – that is, it is very easy for Google to keep its data proprietary. In many domains, Google’s accumulation of data now appears unassailable. Considering the physical capital alone needed to store such data (millions upon millions of computers in dozens of data centres), there are few competitors with pockets deep enough to mount a challenge – and that is without even considering the skills and technologies required. Microsoft, Apple and Amazon perhaps, but the first has made little headway competing in search, and the other two seem more interested in preserving their own profitable corners of the internet, even if they occasionally bump up against Google’s areas of dominance.

As for Varian’s argument that Google makes its data and tools freely available, I don’t think that is inconsistent with Morozov’s accusation that Google ‘hoards’ data. Varian cites, amongst other examples, Google Maps. And indeed Maps is a perfect illustration of how data can be at once somewhat open and yet hoarded. Maps data is open to consumers: you can search, annotate and share maps. But it is only semi-open to producers, potential competitors of Google. Google offers an API to access map data, but it is extremely restrictive. You can use it if you are going to display the data on a Google map (which promotes Google’s own service), but not otherwise. Unsurprisingly for a profitmaking entity, Google is most open where it profits to be open.

I recently encountered this “must display map” policy myself. I have been working on a small, innovative event search service, the kind of ‘nimble enterprise’ Morozov wants to encourage. To make it useful I have to geocode locations – that is, convert text like “White House” into precise latitudes and longitudes. Google, through its Maps and Places service, has unquestionably the best, most accurate geocoding service with worldwide coverage. It was able to build this in part because every day hundreds or thousands of people search for “white house” and then click on the most relevant search result, thus letting Google know what they really meant. But for my service I can’t use Google geocoding, because I need to geocode each event in advance, without showing a Google map to any user. This is a service Google will not sell me for any price (surely the mark of a monopolist). Instead I’m using the genuinely open OpenStreetMap geocoder, which (though impressive) is orders of magnitude worse. (Moreover since it is open, Google can extract useful data from OSM, while aggressively competing with it. Google has not always been on the best of terms with the open mapping community.)

Lest this seem like whining (I’m not as good as Google, that is hardly grounds for government intervention on my behalf), remember that we’re talking about encouraging innovation. It’s a peculiar feeling working on something for a couple of months and knowing that somebody within Google could produce a product 10x better using a couple of weeks of “20% time”. I guarantee that, every day, dozens of startup ideas are rejected because of this fear. The only comfort I have is that Google looks increasingly like a sprawling, unfocused bureaucracy: despite Google’s vast resources, some products seem to have been neglected for years. (Which is good for competitors, but bad for consumers.)

I would argue that geocoding is precisely the kind of service that should be considered for inclusion in Morozov’s “common infrastructure”. Right now, there is still competition for these infrastructure services – Microsoft Bing does geocoding, and Amazon could plausibly introduce an AWS service to do the same. But as AI becomes ever stronger (and remember Google is at the very forefront of this research), Google’s data pile will become ever more of a competitive – and difficult to replicate – advantage.

Moreover, there is a positive feedback loop at work: bigger data means better services, which means more users, which means bigger data.

Varian is right that Google beat Yahoo (or better, AltaVista) by having better algorithms, but they were algorithms (principally, PageRank) that depended on data for input – and that was a very different era, when nobody had very much data, making it relatively easy for a venture-funded competitor to leapfrog ahead. Today, I think you could plausibly claim that Google systems already store some significant fraction of all recorded (public) information (something Varian himself once worked on). Leaping ahead with a better algorithm may not be that simple today.

What happens to the startup ecosystem when every idea would be more efficiently implemented within Google than outside it?

Google Maps’ new !-style embed format

A while ago Google changed the structure of embedded map URLs. The old format used the web-standard key1=value1&key2=value2 style, and you can find a reasonably good description of these parameters here. Unfortunately the new style is less verbose and much less intelligible, which seems like a step backwards even if these links are mostly hidden under the surface. Either Google wanted them to be less human-readable, or they care enough about saving a few bytes here and there to do this. I can’t find any good explanation of how to parse these links, so here’s my morning’s attempt. Be warned that this is all guesswork based on a limited sample and some experimentation.
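For contrast, here is a minimal sketch of why the old key=value style was so easy to work with: a standard query-string parser handles it with no special knowledge. The host name, the function name parse_old_style_embed and the specific parameters in the example URL (q, ll, z, output) are illustrative assumptions for this sketch, not a documented list of what Google accepted.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit


def parse_old_style_embed(url: str) -> Dict[str, str]:
    """Return the key=value query parameters of a URL as a plain dict.

    Hypothetical helper: it relies only on generic query-string rules,
    not on anything specific to Google Maps.
    """
    query = urlsplit(url).query
    # parse_qs maps each key to a list of values and decodes '+' and %XX.
    pairs = parse_qs(query, keep_blank_values=True)
    # If a key is repeated, keep the last value seen.
    return {key: values[-1] for key, values in pairs.items()}


# Illustrative URL only; the host and parameter names are assumptions.
example = (
    "https://maps.example.com/maps"
    "?q=White+House&ll=38.8977,-77.0365&z=16&output=embed"
)
params = parse_old_style_embed(example)
lat, lng = (float(part) for part in params["ll"].split(","))
```

Nothing comparable works out of the box for the new style, since it does not follow the key=value convention that generic parsers expect.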