In this blog post I would like to share some of my experience in training models for the outstanding OCR program OCRopus.
Keep training and test data separated.
Testing is inseparable from training a model. I put my training and test data into two directories, training and testing. That makes it easy to add more test data later, and it also simplifies scripts like the one below.
Choose good training data.
I mostly work with sources that contain alphabetic lists. In that case it is a bad idea to use complete pages for training (and testing): a single page covers only a few initial letters, so many capitals in particular would be missing. It is better to select lines from every page. For example, you could select the 10th and 15th line of each page.
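Picking the same line numbers from every page can be scripted. The following is only a minimal sketch: it assumes that the line images have already been segmented into one subdirectory per page (for example book/0001/010001.bin.png) and that sorting the file names gives the reading order. The function name pick_lines and the directory name book are made up for this illustration.

```shell
#!/bin/sh
# Sketch: copy the 10th and 15th line image of every page directory below
# "$1" into the target directory "$2". The one-directory-per-page layout
# with *.bin.png line images is an assumption about your data.
pick_lines() {
    src=$1     # directory containing one subdirectory per page
    dest=$2    # target directory, e.g. training
    mkdir -p "$dest"
    for page in "$src"/*/; do
        [ -d "$page" ] || continue
        n=0
        # Glob results are sorted by name, so n counts lines in order.
        for img in "$page"*.bin.png; do
            [ -e "$img" ] || continue
            n=$((n + 1))
            case $n in
                # Prefix with the page name so that lines from different
                # pages cannot overwrite each other.
                10|15) cp "$img" "$dest/$(basename "$page")-$(basename "$img")" ;;
            esac
        done
    done
}
```

Calling `pick_lines book training` would then fill the training directory. Pages with fewer than 10 lines contribute nothing. Different line numbers in the case pattern give you a disjoint set for the testing directory.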
Let OCR help.
Creating ground truth data is time-consuming. If there is a model that more or less works for your scans, use it to prepare ground truth data. In most cases it is easier to proofread and correct text than to type it from scratch.
ocropus-rpred -q -m INITIAL_MODEL_FILE training/*.png testing/*.png
ocropus-gtedit html training/*.png testing/*.png
Have enough samples for each character.
In many cases it will be difficult to have enough samples for every character. At least in German texts the letters J, Y, X, and Q are rare. To check which characters occur in your training data you can use this command:
cat training/*.gt.txt | sed 's/\(.\)/\1\n/g'| sort | uniq -c
Now you can look through your scanned text for more examples of the rare or even missing letters. Our directory layout with separate training and testing directories makes it easy to add more examples.
And don’t forget to check your test data as well:
cat testing/*.gt.txt | sed 's/\(.\)/\1\n/g'| sort | uniq -c
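If the lists get long, it may help to let the shell point out the problem cases. This is a rough sketch, not part of OCRopus: the function names chars, rare_chars and untrained_chars are made up here, the threshold of 5 is arbitrary, and it assumes a grep that supports -o (GNU and BSD grep do) and a UTF-8 locale.

```shell
#!/bin/sh
# Print every non-whitespace character of the ground truth files in
# directory $1, one character per line.
chars() {
    cat "$1"/*.gt.txt | grep -o '[^[:space:]]'
}

# List the characters that occur fewer than $2 times (default: 5) in the
# ground truth of directory $1, together with their counts.
rare_chars() {
    chars "$1" | sort | uniq -c | awk -v min="${2:-5}" '$1 < min'
}

# List the characters that occur in the ground truth of directory $2 but
# never in directory $1.
untrained_chars() {
    train_set=$(mktemp)
    chars "$1" | sort -u > "$train_set"
    # comm -13 keeps only the lines that are unique to the second input.
    chars "$2" | sort -u | comm -13 "$train_set" -
    rm -f "$train_set"
}
```

For example, `rare_chars training 10` shows the characters with fewer than 10 samples, and `untrained_chars training testing` shows characters the test data asks for although the model never saw them during training.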
Find typos in ground truth.
If you have not created your ground truth with double-keying or something similar, there is a good chance that it will contain typos. Especially if you followed my previous hint and corrected OCR output, you might find it difficult to spot mixed-up characters such as the digit 1, lower case l and upper case I.
As soon as you have a suitably trained model, you can use it to spot errors in the ground truth data. Run a prediction on the training data and check all lines where ground truth and prediction differ.
ocropus-rpred -q -m YOUR_MODEL_FILE training/*.png
ocropus-gtedit html `ocropus-errs training/*.gt.txt 2>/dev/null |grep -v "^ 0" | cu