I was lucky enough to attend Digital Humanities 2014 in Lausanne, Switzerland this summer. One of the topics I was most interested in learning about was web archiving and scraping. Automating the gathering of information from the internet, and saving websites for offline viewing and analysis, is something I need for my own research, and it seemed quite popular among the other academics in attendance as well.
I took a workshop led by Scott Reed from the Internet Archive on their Archive-It web-archiving service. At the workshop I ran into Ian Milligan, a professor in the Department of History at the University of Waterloo, who suggested I check out the web scraping tutorials on the Programming Historian website. The Programming Historian has a great variety of tutorials, not just on web scraping but also on geographic information systems, data management, and APIs, all aimed at non-computer scientists.
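To give a flavor of the kind of scraping those tutorials teach, here is a minimal sketch using only Python's standard library. The `LinkExtractor` class and the inline sample HTML are hypothetical illustrations, not code from the Programming Historian itself; a real scraper would fetch the page with `urllib.request.urlopen()` instead of a hard-coded string.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A small inline document keeps the example self-contained;
# in practice this HTML would come from a live web page.
sample = '<p>See <a href="https://programminghistorian.org">the tutorials</a>.</p>'
parser = LinkExtractor()
parser.feed(sample)
print(parser.links)  # → ['https://programminghistorian.org']
```

The same pattern — subclass the parser, override the handler for the tags you care about — scales to extracting tables, dates, or citations from archived pages.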
For more on web archiving, check out Ian Milligan and Nick Ruest’s presentation The Great WARC Adventure: WARCs from creation to use: “This presentation will cover a historical overview of web archiving, how best to both capture and preserve websites, and make them discoverable and usable using open source tools that can be easily replicated by other organizations, the interplay of the archivist and historian with respect to web archives, and finally ways to access web archives”.
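The WARC files the presentation refers to have a simple on-disk shape, defined in the WARC specification (ISO 28500): a version line, named header fields, a blank line, the content block, and a trailing blank line. As a rough sketch of that structure — the URI and payload below are hypothetical, and real archiving tools like those the presentation covers handle far more than this:

```python
from datetime import datetime, timezone
from uuid import uuid4

def make_warc_record(uri: str, payload: bytes) -> bytes:
    """Build a single WARC 'resource' record for the given URI and payload."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {uri}",
        "Content-Type: text/html",
        f"Content-Length: {len(payload)}",
    ]
    # Headers and block are separated by a blank line; each record
    # ends with two CRLFs so records can be concatenated into one file.
    return "\r\n".join(headers).encode("ascii") + b"\r\n\r\n" + payload + b"\r\n\r\n"

record = make_warc_record("http://example.org/", b"<html>hello</html>")
print(record.decode("ascii", errors="replace"))
```

Because records simply concatenate, a whole crawl can live in one `.warc` file, which is what makes the format convenient for both preservation and later analysis.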
The International Internet Preservation Consortium has also provided a small directory of open source web archiving tools.