We’re getting ready to launch a brand new search engine for PACER content. When it launches, one of the big features it will have is full-text search for the millions of documents that people have submitted using our RECAP system. To our knowledge, this will be the first free system for searching PACER content in this way, allowing you to look up documents by any word they might contain.
The big problem with this goal? We have about a million PDFs that consist only of images. Some of these are actually quite beautiful:
A beautiful handwritten motion. It goes on like this for 46 pages.
But others are hideous:
An 84-page log from 1957. It’s come a long way just to appear on this blog today.
But no matter how a document looks, we want to extract the text so that we can make it searchable. This is done using a system called Optical Character Recognition (OCR),
Original URL: https://free.law/2016/09/26/extracting-text-from-our-collection-of-pacer-documents/