UK Hansard archive bulk download URL file (or when is open data not)

I am currently working on a project that involves large-scale analysis of various countries’ Hansards (that is, transcripts of parliamentary debate). In general, this is messy data. Recent transcripts have been produced, possibly natively, in a variety of XML or SGML formats. Earlier transcripts have, where available, been digitised from printed archives.

The UK Parliament has such a digitised archive, here.

Frustratingly, although these zipped XML files are available, there is no bulk download option or simple FTP archive of them. Instead, the files are listed in a paged format. Worse, the pages are generated by a form submit using client-side JavaScript, so standard spidering options like curl won’t work.
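What a scraper has to do instead is replay the form submission itself, one page at a time, and pull the zip links out of each response. Here is a minimal sketch of that idea using only the Python standard library. It is not the code I used, and the listing URL and the form field name are made up for illustration; the real ones would have to be read from the page's HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Hypothetical values: the real listing URL and form field name are not
# given in this post and would have to be read from the page's HTML.
LISTING_URL = "https://example.org/hansard/archive"
PAGE_FIELD = "page"


def build_page_request(page_number):
    """Build the POST that the page's JavaScript would otherwise submit."""
    body = urlencode({PAGE_FIELD: str(page_number)}).encode("ascii")
    return Request(LISTING_URL, data=body, method="POST")


class ZipLinkParser(HTMLParser):
    """Collect the absolute URL of every <a> whose href ends in .zip."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".zip"):
            # Resolve relative links against the listing page's URL.
            self.links.append(urljoin(self.base_url, href))


def extract_zip_links(html, base_url=LISTING_URL):
    """Return all zip links found in one page of results."""
    parser = ZipLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Passing each request to urllib.request.urlopen, decoding the response body and feeding it to extract_zip_links, page after page until no new links turn up, would give the full list.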

So, to save anyone else the pain, here is a link to a file I built that contains links to every file in this archive. I used the handy FormRequest feature of Scrapy, my favourite, heavily used scraping tool.

https://github.com/econandrew/uk-hansard-archive-urls/blob/master/urls.txt

Save the raw file as urls.txt and you can use it directly with wget -i urls.txt, although be warned: it lists nearly 3,000 files of just over 1 MB each, so roughly 3 GB in total. You’re welcome.
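If you would rather stay in Python than use wget, something along these lines should do the same job. It is only a sketch: the output directory name and the one-second pause between requests are my own assumptions, and it assumes the list has been saved locally as urls.txt.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def read_urls(list_path):
    """Return the non-blank lines of the URL list, stripped of whitespace."""
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def local_name(url):
    """Use the last path segment of the URL as the local file name."""
    return Path(urlsplit(url).path).name


def download_all(list_path="urls.txt", out_dir="hansard", delay_seconds=1.0):
    """Fetch each listed file once, skipping any that already exist."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url in read_urls(list_path):
        target = out / local_name(url)
        if target.exists():
            continue  # already fetched on an earlier run
        # Download to a temporary name first, so an interrupted run never
        # leaves a truncated file that later looks complete.
        partial = target.with_name(target.name + ".part")
        urllib.request.urlretrieve(url, str(partial))
        partial.replace(target)
        time.sleep(delay_seconds)  # pause so the server is not hammered
```

Calling download_all() from the directory containing urls.txt starts the download, and because existing files are skipped it can simply be re-run after an interruption.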

Comments (3)

beautiful, thanks! Exactly what I was looking for! :D

ihtoitim 2014-11-30 18:43:25 -0500

beautiful, thanks! Exactly what I’m looking for! :D

ihtoit 2014-11-30 18:45:15 -0500

Pingback: Clustering debates from UK politicians - Lateral

2015-10-16 12:18:09 -0400
