UK Hansard Archive Bulk Download URL File (or When is Open Data Not)

I am currently working on a project that involves large-scale analysis of various countries’ Hansards (that is, transcripts of parliamentary debate). In general, this is messy data. Recent transcripts have been produced, possibly natively, in a variety of XML or SGML formats. Earlier transcripts have, where available, been digitised from printed archives.

The UK Parliament has such a digitised archive, here.

Frustratingly, although these zipped XML files are available, there is no bulk download option or simple FTP archive of them. Instead, the files are listed in a paged format. Worse, the pages are generated by a form submit using client-side JavaScript, so standard spidering tools like curl won’t work.

So, to save anyone else the pain, here is a link to a file I built that contains links to every file in this archive. I used the handy FormRequest feature of Scrapy, my favourite, heavily used, scraping tool.

https://github.com/econandrew/uk-hansard-archive-urls/blob/master/urls.txt
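For anyone curious about the general idea: a form-driven pager can usually be walked by submitting the form's fields directly as a POST, which is roughly what a form request does. The following is a minimal sketch of that idea using only Python's standard library. It is not the Scrapy code used to build the list, and LISTING_URL and PAGE_FIELD are hypothetical placeholders; the real values would have to be read from the archive page's source.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen

# Hypothetical values: the real listing URL and form field name are not
# given in this post and would need to be read from the page source.
LISTING_URL = "https://example.org/archive/list"
PAGE_FIELD = "page"


class ZipLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at .zip files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".zip"):
            # Resolve relative links against the listing page's URL.
            self.links.append(urljoin(self.base_url, href))


def build_page_request(page_number):
    """Build the POST that the page's JavaScript would otherwise submit."""
    body = urlencode({PAGE_FIELD: str(page_number)}).encode("ascii")
    return Request(LISTING_URL, data=body, method="POST")


def extract_zip_links(html, base_url=LISTING_URL):
    """Return absolute URLs of all .zip links found in an HTML string."""
    parser = ZipLinkParser(base_url)
    parser.feed(html)
    return parser.links


def fetch_page_links(page_number):
    """Fetch one listing page and return the zip links found on it."""
    with urlopen(build_page_request(page_number)) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_zip_links(html)
```

Looping fetch_page_links over the page numbers and writing the results one per line would give a file in the same shape as urls.txt. A real form may also need hidden fields or cookies, which this sketch does not handle.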

You can use the file directly with wget -i urls.txt, although be warned: it lists nearly 3000 files of just over 1MB each. You’re welcome.
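If wget is not to hand, or you want a download that can be restarted, the same job can be sketched in Python with the standard library alone. This assumes urls.txt holds one URL per line; the function names, the one-second pause, and the destination directory are my own illustrative choices rather than anything required by the archive.

```python
import time
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def read_urls(list_path):
    """Return the non-blank lines of the URL list, stripped."""
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def local_name(url):
    """Use the last path component of the URL as the local file name."""
    return Path(urlparse(url).path).name


def download_all(list_path, dest_dir, pause_seconds=1.0):
    """Download every listed file into dest_dir, skipping ones already there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in read_urls(list_path):
        target = dest / local_name(url)
        if target.exists():
            continue  # already fetched on an earlier run
        # Download to a temporary name first so that an interrupted
        # transfer is never mistaken for a complete file.
        partial = target.with_name(target.name + ".part")
        urlretrieve(url, str(partial))
        partial.replace(target)
        time.sleep(pause_seconds)  # be gentle with the server
```

Calling download_all("urls.txt", "hansard_zips") would fetch everything into a hansard_zips directory, and running it again after an interruption picks up where it left off.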
