UK Hansard Archive Bulk Download URL File (or When is Open Data Not)

I am currently working on a project that involves large-scale analysis of various countries’ Hansards (that is, transcripts of parliamentary debate). In general, this is messy data. Recent transcripts have been produced, possibly natively, in a variety of XML or SGML formats. Earlier transcripts have, where available, been digitised from printed archives.

The UK Parliament has such a digitised archive, here.

Frustratingly, although these zipped XML files are available, there is no bulk download option or simple FTP archive of them. Instead, the files are listed in a paged format. Worse, the pages are generated by a form submit using client-side JavaScript, so there are no plain links between pages to follow, and simple spidering with tools like curl or wget won’t work.

So, to save anyone else the pain, here is a link to a file I built that contains links to every file in this archive. I used the handy FormRequest feature of Scrapy, my favourite and most heavily used scraping tool.
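For anyone curious about the general idea, here is a minimal sketch using only the Python standard library rather than Scrapy. It is not the code I used to build the file. It just shows the two steps involved: sending the same POST that the page's JavaScript would submit (which is roughly what Scrapy's FormRequest does for you), and pulling the .zip links out of the HTML that comes back. The listing URL and the `page` form field are placeholders I made up for illustration; the real values have to be read off the archive page's form.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Hypothetical values: the real listing URL and form field name must be
# read from the archive page's form and JavaScript.
LISTING_URL = "https://example.org/hansard/archive"
PAGE_FIELD = "page"


def page_request(page_number):
    """Build the POST request that the page's JavaScript would submit."""
    body = urlencode({PAGE_FIELD: str(page_number)}).encode("ascii")
    return Request(LISTING_URL, data=body, method="POST")


class ZipLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a .zip file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".zip"):
            # Resolve relative links against the listing page's URL.
            self.links.append(urljoin(self.base_url, href))


def zip_links(html, base_url=LISTING_URL):
    """Return absolute URLs for all .zip links found in an HTML page."""
    parser = ZipLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

To build a URL file from this, you would pass each `page_request(n)` to `urllib.request.urlopen`, decode the response, and append the output of `zip_links` to urls.txt until a page comes back with no links.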

You can use this directly with wget -i urls.txt, although be warned: it lists nearly 3,000 files of just over 1 MB each, so roughly 3 GB in total. You’re welcome.
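If you would rather not hammer the server, or want to be able to resume after an interruption, something like the following sketch may help. It assumes urls.txt holds one URL per line; the `hansard` output directory, the one-second delay and the `.part` suffix are my own arbitrary choices, not anything the archive requires.

```python
import time
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def read_urls(text):
    """Return the non-empty, de-duplicated URLs from the list, in order."""
    seen = dict.fromkeys(line.strip() for line in text.splitlines())
    return [url for url in seen if url]


def local_name(url):
    """Use the last path segment of the URL as the local file name."""
    return Path(urlparse(url).path).name


def download_all(urls, dest_dir="hansard", delay=1.0, fetch=urlretrieve):
    """Download each URL into dest_dir, skipping files already present."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    fetched = []
    for url in urls:
        target = dest / local_name(url)
        if target.exists():
            continue  # finished in an earlier run, so skip it
        # Download to a temporary name first so that an interrupted
        # transfer is never mistaken for a complete file.
        partial = target.with_name(target.name + ".part")
        fetch(url, str(partial))
        partial.replace(target)
        fetched.append(target)
        time.sleep(delay)  # pause between requests to go easy on the server
    return fetched
```

Calling `download_all(read_urls(Path("urls.txt").read_text()))` fetches everything in the list. With wget itself, adding `--wait=1 -nc` to the command gives a similar pause and skips files that already exist.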

