
Pdftabextract – A set of tools for data mining OCR-processed PDFs


July 2016 / Feb. 2017, Markus Konrad markus.konrad@wzb.eu / Berlin Social Science Center

Introduction

This repository contains a set of tools written in Python 3 for extracting tabular data from (OCR-processed)
PDF files. Before these files can be processed, they need to be converted to XML files in
pdf2xml format. The conversion is simple; see the section below for instructions.
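
To give a feel for what the converted files contain, here is a minimal sketch that reads text boxes from a pdf2xml document using only the Python standard library. It assumes the layout typically produced by Poppler's pdftohtml in XML mode: `page` elements containing `text` elements with `top`, `left`, `width` and `height` attributes. The sample XML and the function name `parse_text_boxes` are made up for illustration; this is not the loader from the common submodule.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the assumed pdf2xml layout.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1262" width="892">
<text top="100" left="50" width="80" height="14" font="0">Name</text>
<text top="100" left="300" width="40" height="14" font="0"><b>Age</b></text>
</page>
</pdf2xml>"""


def parse_text_boxes(xml_string):
    """Return {page number: [text box dicts]} for a pdf2xml string."""
    root = ET.fromstring(xml_string)
    pages = {}
    for page in root.iter("page"):
        boxes = []
        for t in page.iter("text"):
            boxes.append({
                "left": float(t.get("left")),
                "top": float(t.get("top")),
                "width": float(t.get("width")),
                "height": float(t.get("height")),
                # itertext() also collects text nested in <b> or <i> tags
                "text": "".join(t.itertext()).strip(),
            })
        pages[int(page.get("number"))] = boxes
    return pages


pages = parse_text_boxes(SAMPLE_XML)
for box in pages[1]:
    print(box["left"], box["top"], box["text"])
```

Each text box carries its position and size on the page, which is the information the later steps work with.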

Module overview

After the conversion, you can view the extracted text boxes with the
pdf2xml-viewer tool if you like. The pdf2xml format can be loaded and parsed with functions in the common submodule. Lines can be detected in the scanned images using the imgproc module. If the pages are skewed or rotated, this can be detected and fixed with methods from imgproc and functions in textboxes. Line or text box positions can be clustered using the clustering module in order to detect table columns and rows. When columns and rows were
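
The idea of clustering positions to find columns, as described above, can be illustrated with a small standalone sketch. It groups sorted one-dimensional positions (for example the left edges of text boxes) and starts a new cluster whenever the gap to the previous value exceeds a threshold. The function names, the `max_gap` parameter and the sample values are assumptions made for this example; the clustering module may use different methods and signatures.

```python
def cluster_positions(values, max_gap):
    """Group 1D positions; a gap larger than max_gap starts a new cluster."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close to the previous value: same cluster
        else:
            clusters.append([v])     # first value or large gap: new cluster
    return clusters


def cluster_centers(clusters):
    """Use the mean of each cluster as an estimated column position."""
    return [sum(c) / len(c) for c in clusters]


# Hypothetical left edges of text boxes on one page (in pixels).
lefts = [50, 52, 49, 300, 303, 298, 610, 612]
clusters = cluster_positions(lefts, max_gap=20)
print(len(clusters))               # number of detected column candidates
print(cluster_centers(clusters))   # estimated column positions
```

The same grouping applied to the top coordinates would yield row candidates. A suitable `max_gap` depends on the scan resolution and the table layout, so it has to be chosen per document.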


 
