Scott Murray

Tools for Extracting Data From PDFs

Last updated 2015 September 10

It used to be that once data was published in PDF form (on a government website, for example), it was as good as dead. Fortunately, lots of smart people have been developing new tools to help us extract tables of data from PDFs and export them in structured, usable formats (like CSV).

Here are the tools I’ve found to be useful. Results may vary as each tool has its own strengths and weaknesses; try them all to see what works best for your document. (If you know of others, please let me know.)

If you’re curious why it’s so difficult to pull data out of PDFs, you might enjoy this read from ProPublica.

ScraperWiki
scraperwiki.com
Free

CometDocs
cometdocs.com
Free

PDF Converter
freepdfconvert.com/pdf-excel
Free, but limited to 2 pages and 10 files total, with a 30-minute delay for processing

Nitro Cloud
nitrocloud.com/pricing
Convert 5 documents for free

Tabula
github.com/tabulapdf/tabula
Free download for Windows, Mac, Linux; also see “Introducing Tabula