Investigate Companies By Scraping Data Off The Web

To put together its awesome “Dollars for Docs” database, which let readers search to see whether their doctor had received pharma company payments, ProPublica had to convert data from all sorts of websites, PDFs, Excel docs and even Flash sites into one system. Not an easy task, but that kind of data wrasslin’ is key for modern investigative journalism, and ProPublica has put together tutorials to show you how you can do it too.

There may be a bit of a learning curve, but if you have decent computer nerd skills, appetite and passion, you can get cracking on the next big data-driven muckrake.

Check out these in particular:

Scraping for Journalism: A Guide for Collecting Data
Chapter 1. Using Google Refine to Clean Messy Data
Chapter 4. Scraping HTML