Ask HN: Open source OCR library?


As others pointed out, Tesseract with OpenCV (for identifying and cropping the text region) is quite effective. On top of that, Tesseract is fully trainable with custom fonts.
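
To illustrate the cropping step mentioned above, here is a minimal, dependency-free sketch. It assumes the image is a grayscale grid (a list of rows of 0-255 values) with dark text on a light background. The names `text_bounding_box`, `crop`, and `ink_threshold` are made up for this example; in practice you would do the equivalent with OpenCV before handing the crop to Tesseract.

```python
def text_bounding_box(image, ink_threshold=128):
    """Return (top, left, bottom, right) around all dark pixels, or None.

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    `bottom` and `right` are exclusive, so they can be used directly in slices.
    """
    rows = [r for r, row in enumerate(image)
            if any(p < ink_threshold for p in row)]
    if not rows:
        return None  # no ink found, nothing to crop
    cols = [c for c in range(len(image[0]))
            if any(row[c] < ink_threshold for row in image)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(image, box):
    """Cut the region described by `box` out of `image`."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Tiny synthetic "page": white background with two dark pixels.
page = [[255] * 6 for _ in range(5)]
page[1][2] = 0
page[2][3] = 0

box = text_bounding_box(page)
region = crop(page, box)
```

A real pipeline would usually add some padding around the box and denoise first, since a single stray dark pixel is enough to enlarge the crop.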

In our use case, we’ve mostly had to deal with handwritten text, and that’s where none of the OCR engines really did well. Your next best bet would be to use HoG (Histogram of Oriented Gradients) features along with SVMs. OpenCV has really good implementations of both.
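
For anyone unfamiliar with HoG, the core idea can be sketched in a few lines: compute the gradient at each pixel, then accumulate gradient magnitudes into a histogram of orientations. This is a simplified, pure-Python illustration for a single cell, assuming a grayscale grid and 9 unsigned orientation bins; `hog_cell` is a hypothetical helper, not the OpenCV API, and a full descriptor also adds block normalization over many cells.

```python
import math


def hog_cell(cell, bins=9):
    """Return an L2-normalized orientation histogram for one grayscale cell.

    `cell` is a list of rows of numeric pixel values. Border pixels are
    skipped because central differences need a neighbour on both sides.
    """
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins  # unsigned gradients: 0 to 180 degrees
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0  # avoid dividing by zero
    return [v / norm for v in hist]


# A vertical edge (dark left, bright right) has purely horizontal gradients.
vertical_edge = [[0, 0, 255, 255] for _ in range(4)]
# A horizontal edge (dark top, bright bottom) has purely vertical gradients.
horizontal_edge = [[0] * 4, [0] * 4, [255] * 4, [255] * 4]

v_hist = hog_cell(vertical_edge)
h_hist = hog_cell(horizontal_edge)
```

The resulting feature vectors are what you would feed to the SVM: one vector per character image, with the character label as the class.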

Even then, we’ve had to write extra heuristics to disambiguate between lookalike pairs such as 2 and z, or s and 5. That was too much work and a lot of if-else. We’re currently focusing our efforts on CNNs (Convolutional Neural Networks). As a start, you can look at Torch or Caffe.
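
To give a flavour of the kind of heuristic meant here, below is a hypothetical sketch (not the actual rules we used): look at the rest of the token and assume an ambiguous character belongs to whichever class, digit or letter, dominates. The lookup tables and the `disambiguate` function are invented for illustration.

```python
# Lookalike pairs, mapped in both directions.
DIGIT_FOR = {"z": "2", "Z": "2", "s": "5", "S": "5"}
LETTER_FOR = {"2": "z", "5": "s"}


def disambiguate(token):
    """Rewrite lookalike characters to match the dominant class in `token`."""
    digits = sum(ch.isdigit() for ch in token)
    letters = sum(ch.isalpha() for ch in token)
    if digits > letters:
        table = DIGIT_FOR   # mostly digits: treat z/s as 2/5
    elif letters > digits:
        table = LETTER_FOR  # mostly letters: treat 2/5 as z/s
    else:
        return token        # no clear majority, leave it alone
    return "".join(table.get(ch, ch) for ch in token)


year = disambiguate("19s4")
word = disambiguate("pi22a")
tie = disambiguate("a5")
```

It breaks down quickly on mixed tokens like part numbers, which is exactly why this approach turns into a pile of special cases.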



Tesseract is OK, but I gather that a lot of the good work on it in the last few years has remained closed source within Google.

If you want to do text extraction, look at things like Stroke Width Transform to extract regions of text before passing them to Tesseract.



I’ve used Tesseract to great effect. I don’t know what your images look like, but if only part of the image has text in it, you should send only that part to the OCR engine. If you send the entire image and only a portion of it has text, the chances of the OCR engine extracting that text are slim. There are pre-processing techniques [1] you can use to crop out the part of the image that has text.

[1]: https://en.wikipedia.org/?title=Hough_transform



Tesseract does no layout analysis.

So if the source image contains text columns or pull quotes or similar, the output text will just be each row of text, from the far left to the far right.
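
One common workaround is to split the page into columns yourself and OCR each one separately. A minimal sketch, assuming a grayscale grid with dark text on a light background and columns separated by a clean vertical gutter; `split_columns`, `ink_threshold`, and `min_gap` are names made up for this example.

```python
def split_columns(image, ink_threshold=128, min_gap=2):
    """Return (start, end) pixel-column ranges that contain ink.

    Two ranges are treated as separate text columns when at least `min_gap`
    consecutive blank pixel columns lie between them. `end` is exclusive.
    """
    width = len(image[0])
    # Vertical projection: does pixel column c contain any dark pixel?
    has_ink = [any(row[c] < ink_threshold for row in image)
               for c in range(width)]
    spans, start, gap = [], None, 0
    for c, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = c
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:  # gutter found, close the current span
                spans.append((start, c - gap + 1))
                start, gap = None, 0
    if start is not None:       # span still open at the right edge
        spans.append((start, width - gap))
    return spans


# Two blocks of "text" separated by a three-pixel blank gutter.
page = [[0, 0, 255, 255, 255, 0, 0, 255] for _ in range(2)]
columns = split_columns(page)
```

Each range can then be cropped and passed to the OCR engine on its own, so the output reads down one column before moving to the next. Skewed scans or pull quotes that span the gutter will defeat this simple projection.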



 
