
UK Hansard Archive Bulk Download URL File (or When is Open Data Not)

I am currently working on a project that involves large-scale analysis of various countries’ Hansards (that is, transcripts of parliamentary debate). In general, this is messy data. Recent transcripts have been produced, possibly natively, in a variety of XML or SGML formats. Earlier transcripts have, where available, been digitised from printed archives.

The UK Parliament has such a digitised archive, here.

Frustratingly, although these zipped XML files are available, there is no bulk download option or simple FTP archive of them. Instead, the files are listed across a series of pages. Worse, each page is generated by a form submission triggered by client-side JavaScript, so simply following links with a standard spidering tool like curl won’t work.

So, to save anyone else the pain, here is a link to a file I built that contains links to every file in this archive. I used the handy FormRequest feature of Scrapy, my favourite, heavily used, scraping tool.
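For anyone curious about the general approach, here is a minimal sketch of the idea. It is not my actual spider: it uses only the Python standard library rather than Scrapy, and the listing URL and the form field name are placeholders I have made up for illustration. The real values would have to be read from the archive page's HTML or the browser's network inspector. The point is that the JavaScript only submits a form, so the same POST can be sent directly for each page, and the zip links pulled out of the response.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen

# Hypothetical values, not the real archive's: check the page source
# for the actual form action URL and paging field name.
LISTING_URL = "https://example.org/hansard/archive"
PAGE_FIELD = "page"


class ZipLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a .zip file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".zip"):
            # Resolve relative links against the listing page's URL.
            self.links.append(urljoin(self.base_url, href))


def build_page_request(page):
    """Build the POST that the page's JavaScript would otherwise submit."""
    body = urlencode({PAGE_FIELD: page}).encode("ascii")
    return Request(LISTING_URL, data=body, method="POST")


def extract_zip_links(html, base_url=LISTING_URL):
    """Return absolute URLs for all .zip links found in an HTML string."""
    parser = ZipLinkParser(base_url)
    parser.feed(html)
    return parser.links


def crawl(last_page):
    """Fetch pages 1..last_page and return every zip URL found."""
    urls = []
    for page in range(1, last_page + 1):
        with urlopen(build_page_request(page)) as response:
            html = response.read().decode("utf-8", errors="replace")
        urls.extend(extract_zip_links(html))
    return urls
```

Writing the result of `crawl(...)` to a text file, one URL per line, gives a list in the same shape as the one linked above. Scrapy's FormRequest does the form-submission part for you and adds scheduling and retries on top, which is why I reach for it in practice.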

You can use this directly with wget -i urls.txt, although be warned: it lists nearly 3,000 files of just over 1 MB each, so expect roughly 3 GB in total. You’re welcome.
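If you would rather drive the download from Python, for example to pause between requests or to restart an interrupted run, something along these lines should do. It is a sketch under a few assumptions of mine: the URL file has one URL per line, the last path segment of each URL is a usable file name, and a one-second delay is polite enough. The function names are my own.

```python
import os
import shutil
import time
from urllib.parse import urlsplit
from urllib.request import urlopen


def read_urls(path):
    """Return the non-blank lines of the URL file, stripped of whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def local_name(url):
    """Use the last path segment of the URL as the local file name."""
    return os.path.basename(urlsplit(url).path)


def download_all(url_file, dest_dir, delay_seconds=1.0):
    """Download each listed file, skipping any that already exist locally."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in read_urls(url_file):
        target = os.path.join(dest_dir, local_name(url))
        if os.path.exists(target):
            continue  # already fetched on an earlier run
        partial = target + ".part"
        with urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        # Rename only after a complete download, so a file cut off
        # mid-transfer is never mistaken for a finished one.
        os.replace(partial, target)
        time.sleep(delay_seconds)  # go easy on the server
```

Calling `download_all("urls.txt", "hansard_zips")` would then fetch everything into a `hansard_zips` directory, and running it again after an interruption picks up where it left off.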