Investigate Companies By Scraping Data Off The Web

To put together its awesome “Dollars for Docs” database, which lets readers search to see whether their doctor has received pharma company payments, ProPublica had to convert data from all sorts of websites, PDFs, Excel docs and even Flash sites into one system. Not an easy task, but that kind of data wrasslin’ is key for modern investigative journalism, and ProPublica has put together tutorials to show you how you can do it too.

There may be a bit of a learning curve, but if you have decent computer nerd skills, appetite and passion, you can get cracking on the next big data-driven muckrake.

Check these ones out in particular:

Scraping for Journalism: A Guide for Collecting Data
Chapter 1. Using Google Refine to Clean Messy Data
Chapter 4. Scraping HTML



  1. Blueskylaw says:

    I found that Gawker Media (former owner of Consumerist), founded by Nick Denton, announced (in 2008) the suspension of a bonus payment scheme based on pageviews, under which Gawker had paid its staff $50,000 a month on average, citing a need to generate actual advertising revenue as opposed to just increasing traffic.

  2. lastingsmilledge says:

    …or you can just wait until PPSA comes into effect next year and get all of this information from HHS’s website (2011 data is required to be reported early next year).

  3. Dan says:

    lastingsmilledge> Not exactly true…the companies are required to report their data in 2012, but it’s only being collected by the fed agency. 2013 is when the database actually goes live:

  4. smartmuffin says:

    Heh, this is the exact thing they warn us about in OPSEC briefings in the military, that terrorists can use a few scattered pieces of information on the Internet to “piece together the puzzle” for devastating results.

    • Loias supports harsher punishments against corporations says:

      By devastating results, I assume they mean revealing the ethically defunct actions of large corporations?

    • jessjj347 says:

      There’s nothing wrong with obtaining freely available information. I’m not sure about the ethics of reverse engineering, but that’s what happens especially with Flash sites.

    • pop top says:

      Wait. Are you trying to say that people might abuse this for selfish or nefarious purposes, and that since it’s obviously never happened before, we should live in constant fear of new inventions in case someone might use them for evil?

  5. jessjj347 says:

    Woooo web scraping. The worst part is when the website you’re scraping gets updated and breaks your stuff.

  6. There's room to move as a fry cook says:

    In my experience, the HTML approach can work when the website passes variables with GET (puts variables in the URL string) but is more difficult when the website passes variables through sessions or uses Ajax/JavaScript. I have also encountered hit limits & temporary banning.

  7. AllanG54 says:

    I find out about a lot of crap companies and their scandals just by reading the website “”. The stories there read like a terrific novel.

  8. kcarlson says:

    For Python developers, BeautifulSoup is a good web scraping tool.
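
For anyone curious what the tips in comments 6 and 8 look like in practice, here is a minimal sketch, not ProPublica's actual method. It builds a search URL with the variables in the GET query string, then pulls table rows out of a page with BeautifulSoup. The endpoint URL, the parameter names (state, page), the helper names and the sample table are all made up for illustration, and BeautifulSoup is a third-party package you would need to install first.

```python
from urllib.parse import urlencode

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical search endpoint, used only for illustration.
BASE_URL = "https://example.com/payments/search"


def build_search_url(base_url, **params):
    """Return a GET URL with the search variables in the query string."""
    return f"{base_url}?{urlencode(params)}"


def parse_rows(html):
    """Extract the cell text of every data row in the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows use <th>, so they yield no <td> cells
            rows.append(cells)
    return rows


# Made-up sample page standing in for a fetched response body.
SAMPLE_HTML = """
<table>
  <tr><th>Doctor</th><th>Company</th><th>Amount</th></tr>
  <tr><td>Jane Doe</td><td>Acme Pharma</td><td>$1,200</td></tr>
  <tr><td>John Roe</td><td>Acme Pharma</td><td>$850</td></tr>
</table>
"""

if __name__ == "__main__":
    print(build_search_url(BASE_URL, state="NY", page=2))
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```

The sketch deliberately makes no network requests. If you do loop over many pages of a real site, pausing between requests (for example with time.sleep) is one way to reduce the odds of running into the hit limits and temporary bans mentioned in comment 6.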