UK Hansard archive bulk download URL file (or when is open data not)
I am currently working on a project that involves large-scale analysis of various countries’ Hansards (that is, transcripts of parliamentary debate). In general, this is messy data. Recent transcripts have been produced, possibly natively, in a variety of XML or SGML formats. Earlier transcripts have, where available, been digitised from printed archives.
The UK Parliament has such a digitised archive, here.
Frustratingly, although these zipped XML files are available, there is no bulk download option or simple FTP archive of them. Instead, the files are listed in a paged format. Worse, the pages are generated by a form submit using client-side JavaScript, so standard spidering options like curl won’t work.
So, to save anyone else the pain, here is a link to a file I built that contains links to every file in this archive. I used the handy FormRequest feature of Scrapy, my favourite, heavily used, scraping tool.
https://github.com/econandrew/uk-hansard-archive-urls/blob/master/urls.txt
You can use this directly with wget -i urls.txt, although be warned, it lists nearly 3000 files of just over 1MB each. You’re welcome.
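If you would rather stay in Python than shell out to wget, something like the following should do roughly the same job. Treat it as a minimal sketch rather than anything I ran against the archive: the function names, the hansard output directory and the one-second pause between requests are all my own assumptions, and it uses only the standard library.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def read_urls(path):
    """Return the non-blank lines of a URL list file, stripped of whitespace."""
    urls = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                urls.append(line)
    return urls


def local_name(url):
    """Use the last path segment of the URL as the local filename."""
    return Path(urlparse(url).path).name


def download_all(url_file, dest_dir="hansard", delay=1.0):
    """Fetch every URL in url_file into dest_dir, skipping files already present.

    dest_dir and delay are illustrative defaults (assumptions), not values
    taken from the archive or from wget.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in read_urls(url_file):
        target = dest / local_name(url)
        if target.exists():
            continue  # lets an interrupted run be resumed
        urllib.request.urlretrieve(url, target)
        time.sleep(delay)  # be polite to the server between requests
```

Calling download_all("urls.txt") from the directory holding the list would then pull everything down. Skipping files that already exist means you can simply rerun it if the connection drops partway through.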
Comments (3)
beautiful, thanks! Exactly what I was looking for! :D
beautiful, thanks! Exactly what I’m looking for! :D