I am currently working on a project that involves large-scale analysis of various countries' Hansards (that is, transcripts of parliamentary debate). In general, this is messy data. Recent transcripts have been produced, possibly natively, in a variety of XML or SGML formats; earlier transcripts have, where available, been digitised from printed archives.
The UK Parliament has such a digitised archive, here.
Unfortunately, curl won't work: the files sit behind a search form rather than at simple static URLs.
So, to save anyone else the pain, here is a link to a file I built that contains links to every file in this archive. I built it using the handy FormRequest feature of Scrapy, my favourite (and heavily used) scraping tool.
You can use this directly with wget -i urls.txt, although be warned: it lists nearly 3,000 files of just over 1MB each. You're welcome.