Data Extraction and Migration With spaCy and Tesseract

How AI tools can help transfer and organize content in a Drupal migration
Adam Kempler

Software Architect

August 31, 2021

At GovWebworks, we often address unique issues for our public sector clients by coming up with creative solutions. Sometimes, the solutions are worth sharing so others can use them. This is one of those cases.

Recently we needed to transfer thousands of PDFs of fiscal impact statements for the South Carolina Revenue and Fiscal Affairs Office as part of their migration to a Drupal content management system, and make the data in the PDFs searchable at the same time. Here’s what we did.

Our data extraction solution

In the legacy site, the impact statements could be browsed as links to files, with the list being manually updated as new impact statements were added, or existing ones updated.

We wanted to provide a more elegant solution for the new site. The goal was to provide users with a faceted search of key data within the impact statements. Terms included things like:

  • Legislative Session
  • Legislative Body
  • Bill Number
  • Bill Status
  • Bill Author

This data existed in each of the separate PDFs; we just needed a way to extract it consistently. That’s when we found the solution. spaCy and Tesseract to the rescue!

spaCy is an open source Python library for Natural Language Processing that excels at extracting information from unstructured content.

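Tesseract is an open source OCR (optical character recognition) engine, originally developed at Hewlett-Packard and later maintained with Google’s sponsorship, that can recognize and process text in more than 100 languages out of the box.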

How we did it

Our first step was to create an Impact Statement content type in Drupal that provided the discrete fields that we wanted, as well as a media field for storing the actual PDF version of the impact statement.

We then created a migration (using Drupal’s Migrate API) to create a new Impact Statement entity for each impact statement and populate its media field with the PDF.

The next step was to figure out how to consistently retrieve the data from each PDF to populate the relevant fields in the content type during the migration.

To make this easy for other developers on the project to install and use, I wrapped Tesseract and spaCy in a Flask app that exposed the key functionality as a REST API. I then Dockerized the app so anyone on the team could clone the repo and get started without installing all the dependencies locally.
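As a rough illustration of this kind of wrapper, here is a minimal Flask sketch. The route name /extract, the raw-PDF request body, and the process_pdf callable are assumptions made for this example, not details of the actual app.

```python
import os
import tempfile
from typing import Callable, Dict


def create_app(process_pdf: Callable[[str], Dict[str, str]]):
    """Build a Flask app exposing one hypothetical endpoint, POST /extract.

    `process_pdf` is any callable that takes a PDF path and returns a dict
    of field names to values (for example, an OCR + NLP pipeline).
    """
    # Imported here so the module can be loaded without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/extract", methods=["POST"])
    def extract():
        # Assumption: the client sends the raw PDF bytes as the request body.
        pdf_bytes = request.get_data()
        if not pdf_bytes:
            return jsonify({"error": "empty request body"}), 400

        # Write the upload to disk so file-based OCR tools can read it.
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
            tmp.write(pdf_bytes)
            tmp_path = tmp.name
        try:
            fields = process_pdf(tmp_path)
        finally:
            os.unlink(tmp_path)  # always clean up the temporary file
        return jsonify(fields)

    return app
```

Inside a container, something like create_app(my_pipeline).run(host="0.0.0.0", port=5000) would start Flask's development server, which is adequate for an internal migration tool but not for production traffic.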

We then created a custom process plugin for our Migrate API implementation that fed the PDF to the REST endpoint and received back JSON containing the field data needed for our content type.
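The actual process plugin is PHP code inside Drupal, but the request and response contract can be sketched from any client. The Python example below is an illustration only: the endpoint URL, the raw-PDF request body, and the JSON field names are hypothetical stand-ins, and the validation shown is just one plausible way to check the response before saving it.

```python
import json
import re
import urllib.request
from typing import Dict

# Hypothetical field names mirroring the facets listed earlier in the post.
REQUIRED_FIELDS = (
    "legislative_session",
    "legislative_body",
    "bill_number",
    "bill_status",
    "bill_author",
)


def fetch_fields(pdf_path: str,
                 endpoint: str = "http://localhost:5000/extract",
                 timeout: float = 60.0) -> Dict[str, str]:
    """POST a PDF to the (assumed) extraction endpoint and decode the JSON reply."""
    with open(pdf_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def clean_fields(raw: Dict[str, object]) -> Dict[str, str]:
    """Validate the response and normalize whitespace left over from OCR."""
    cleaned = {}
    for name in REQUIRED_FIELDS:
        value = raw.get(name)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {name}")
        # Collapse runs of spaces and line breaks into single spaces.
        cleaned[name] = re.sub(r"\s+", " ", value).strip()
    return cleaned
```

Failing loudly on a missing field lets the migration flag a problem PDF for manual review rather than saving a half-populated entity.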

The API first passed each PDF to Tesseract to extract the text, which was then passed to spaCy to extract the discrete data needed for the content type. This data was returned to the Migrate plugin for validation and cleanup before being stored in the content type.
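The two-stage flow can be sketched roughly as follows. This is not the project's code: it assumes the pytesseract and pdf2image packages for OCR (the post does not say which bindings were used), and the spaCy Matcher pattern for a bill number such as "H. 3456" is a hypothetical rule chosen only to show the idea.

```python
from typing import Callable, Dict, Optional


def ocr_pdf(pdf_path: str) -> str:
    """Render each PDF page to an image and OCR it with Tesseract."""
    # Assumed dependencies, imported lazily: pdf2image and pytesseract.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_bill_number(text: str) -> Optional[str]:
    """Find the first bill-number-like span using a rule-based spaCy Matcher."""
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")  # tokenizer only; no trained model required
    matcher = Matcher(nlp.vocab)
    # Hypothetical pattern: "H" or "S", an optional period, then a number.
    matcher.add("BILL_NUMBER", [[
        {"TEXT": {"REGEX": r"^[HS]\.?$"}},
        {"TEXT": ".", "OP": "?"},
        {"LIKE_NUM": True},
    ]])
    doc = nlp(text)
    matches = matcher(doc)
    if not matches:
        return None
    # Take the match that starts earliest in the document.
    _, start, end = min(matches, key=lambda match: match[1])
    return doc[start:end].text


def process_pdf(
    pdf_path: str,
    ocr: Callable[[str], str] = ocr_pdf,
    extractors: Optional[Dict[str, Callable[[str], Optional[str]]]] = None,
) -> Dict[str, Optional[str]]:
    """Run OCR once, then apply one extractor per output field."""
    if extractors is None:
        extractors = {"bill_number": extract_bill_number}
    text = ocr(pdf_path)
    return {name: extract(text) for name, extract in extractors.items()}
```

Keeping the OCR step and the per-field extractors as swappable callables makes it possible to test the orchestration without Tesseract installed, and to add one extractor per field as new rules are worked out.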

Try it on your next project

If you want to take a look, I’ve put the implementation up on GitHub. Feel free to use it as a starting point for your own explorations. Keep in mind that some parts of the implementation were hard-coded for this particular use case, and a lot of best practices were skipped, as the goal was just to create an internal tool to experiment with and facilitate the migration.

In the future, we would take this to the next level by extracting other relevant information from documents. We could then create a knowledge graph of relationships and recurring entities such as companies, locations, and people. This would provide more ways for users to find and explore documents and data.
