Investigate Companies By Scraping Data Off The Web

1.5.11 11:00 AM EDT By Ben Popken


To put together its awesome “Dollars for Docs” database, which lets readers search to see if their doctor has received pharma company payments, ProPublica had to convert data from all sorts of websites, PDFs, Excel docs and even Flash sites into one system. Not an easy task, but that kind of data wrasslin’ is key for modern investigative journalism, and ProPublica has put together tutorials to show you how you can do it too.

There may be a bit of a learning curve, but if you have decent computer nerd skills, appetite and passion, you can get cracking on the next big data-driven muckrake.

Check these out in particular:

Scraping for Journalism: A Guide for Collecting Data
Chapter 1. Using Google Refine to Clean Messy Data
Chapter 4. Scraping HTML
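
To give a taste of what HTML scraping involves, here is a minimal sketch in Python using only the standard library. It is not ProPublica's code or method; the sample markup, the names in it, and the `TableScraper` and `scrape_table` helpers are all made up for illustration. The idea is simply to walk the tags of a page and pull the table cells out into rows you can work with.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text pieces of the cell being read, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Doctor</th><th>Company</th><th>Amount</th></tr>
  <tr><td>Jane Example</td><td>Acme Pharma</td><td>$1,200</td></tr>
  <tr><td>John Sample</td><td>Acme Pharma</td><td>$350</td></tr>
</table>
"""


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
for record in records:
    # Pair each cell with its column name.
    print(dict(zip(header, record)))
```

Real pages are messier than this sample, so treat it as a starting point. A dedicated parsing library and a cleanup pass in a tool like Google Refine will usually be needed once the raw rows are in hand.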
