Testing OCRopus character models

After you have trained an OCRopus character model or selected an existing one, you will want to measure its character recognition accuracy. To measure it, you need ground truth data (images and text) that has not been used to train the model. If you used images from the training process, you might only be measuring an overfitting of the model (i.e., the character model ‘knows’ the solution for exactly this image).
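One possible way to set such data aside is sketched below. It is only an illustration under a few assumptions: the helper name holdout_split and the directory name training are made up for this example, GNU shuf is available, every NAME.bin.png has a matching NAME.gt.txt in the same directory, and file names contain no newlines.

```shell
# Move a random share of the ground truth line images (plus their
# transcriptions) from one directory into another.
# Hypothetical helper, not part of OCRopus.
holdout_split() {
    src=$1              # directory holding all ground truth
    dst=$2              # directory for the held-out test set
    percent=${3:-10}    # share to hold out, in percent

    # Collect the line images; stop if the glob matched nothing.
    set -- "$src"/*.bin.png
    [ -e "$1" ] || { echo "no .bin.png files in $src" >&2; return 1; }
    count=$(( $# * percent / 100 ))

    mkdir -p "$dst"
    # Pick $count images at random and move each with its .gt.txt file.
    printf '%s\n' "$@" | shuf -n "$count" | while IFS= read -r img; do
        mv "$img" "${img%.bin.png}.gt.txt" "$dst"/
    done
}
```

Calling holdout_split training testing 10 before the training starts would move about 10% of the lines into testing, so the model never sees them.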

When training my character models I have already excluded ca. 10% of the ground truth data from the training and put it into a directory named testing. Run a recognition for the images contained in that folder:

ocropus-rpred -q -m YOURMODEL "testing/*.bin.png"

If your computer has several CPUs you can speed up the recognition step by using the -Q option. In this example OCRopus will use six cores:

ocropus-rpred -q -Q6 -m YOURMODEL "testing/*.bin.png"
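If you do not want to hard-code the number of cores, the value could be looked up at run time. The following is a small sketch, assuming getconf reports the number of online processors on your system; the helper name cpu_count is made up for this example, and it falls back to 1 when the lookup fails.

```shell
# Print the number of CPU cores that are currently online, or 1 if that
# cannot be determined. Hypothetical helper, not part of OCRopus.
cpu_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null)
    case $n in
        ''|*[!0-9]*) n=1 ;;   # empty or non-numeric answer: use one core
    esac
    echo "$n"
}

# Intended use (not executed here; YOURMODEL is the placeholder used above):
#   ocropus-rpred -q -Q"$(cpu_count)" -m YOURMODEL "testing/*.bin.png"
```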

OCRopus contains a program that compares the .gt.txt files (containing the correct ground truth text) with the .txt files generated by the OCR:

ocropus-errs "testing/*.gt.txt"

The output of the program will end with lines like this:

errors 49
missing 0
total 14051
err 0.349 %
errnomiss 0.349 %

The testing folder contained 14,051 characters. Only 49 of these characters are not the ones expected. In other words, in this sample 99.65% of the characters were recognized correctly.
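The percentages can be checked by hand from the errors and total lines. Here is a minimal sketch using awk; the function names error_rate and accuracy are made up for this example.

```shell
# Character error rate in percent with three decimals, as in "err 0.349 %".
error_rate() {
    awk -v errors="$1" -v total="$2" \
        'BEGIN { printf "%.3f\n", 100 * errors / total }'
}

# Share of correctly recognized characters in percent with two decimals.
accuracy() {
    awk -v errors="$1" -v total="$2" \
        'BEGIN { printf "%.2f\n", 100 * (1 - errors / total) }'
}

error_rate 49 14051   # prints 0.349
accuracy 49 14051     # prints 99.65
```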

You can also analyse the most common problems using a program included in OCRopus:

ocropus-econf "testing/*.gt.txt"

In my example the result looks like this:

errors 49
missing 0
total 14051
err 0.349 %
errnomiss 0.349 %
25 _
8 , .
7 _ '
5 _ -
3 _ .
1 . ,
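As a quick sanity check on output like this, you could add up the counts in the first column and compare the sum with the errors line. The sketch below assumes the output has the layout shown above and is written to standard output; the helper name sum_confusions is made up for this example. If the program lists only the most frequent confusions, the sum will be smaller than the number of errors.

```shell
# Sum the counts in the first column of all lines that start with a number.
# Header lines such as "errors 49" start with a word and are skipped.
# Hypothetical helper, not part of OCRopus.
sum_confusions() {
    awk '$1 ~ /^[0-9]+$/ { sum += $1 } END { print sum + 0 }'
}

# Intended use (not executed here):
#   ocropus-econf "testing/*.gt.txt" | sum_confusions
```

For the example above the sum is 25 + 8 + 7 + 5 + 3 + 1 = 49, which matches the errors line.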

25 whitespace characters are missing, 15 others have been recognized as punctuation. If you are only interested in wrong letters you can choose a different comparison by adding an option to ocropus-errs or ocropus-econf:

ocropus-errs -k letters "testing/*.gt.txt"

The possible options for the comparison are:

  • exact – count every character
  • nospace – ignore whitespace characters (space, tab)
  • letdig – count only letters (A-Z) and digits (0-9)
  • letters – count only letters (A-Z)
  • spletdig – count letters, digits and space
  • digits – count only digits (0-9)
  • lnc – count only letters (A-Z) and ignore the case
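
To get a feeling for what these modes keep and drop, the descriptions in the list can be imitated with tr. This is only a rough sketch that follows the descriptions, not the code OCRopus itself uses; the helper name filter_kind is made up for this example, and it handles ASCII letters only.

```shell
# Reduce the text on standard input to the characters a given comparison
# mode would count. Newlines are kept so the text stays line-oriented.
# Hypothetical helper, not part of OCRopus.
filter_kind() {
    case $1 in
        exact)    cat ;;
        nospace)  LC_ALL=C tr -d ' \t' ;;
        letdig)   LC_ALL=C tr -cd 'A-Za-z0-9\n' ;;
        letters)  LC_ALL=C tr -cd 'A-Za-z\n' ;;
        spletdig) LC_ALL=C tr -cd 'A-Za-z0-9 \n' ;;
        digits)   LC_ALL=C tr -cd '0-9\n' ;;
        lnc)      LC_ALL=C tr -cd 'A-Za-z\n' | LC_ALL=C tr 'a-z' 'A-Z' ;;
        *)        echo "unknown kind: $1" >&2; return 1 ;;
    esac
}

printf 'Hello, World 42!\n' | filter_kind letters   # prints HelloWorld
printf 'Hello, World 42!\n' | filter_kind lnc       # prints HELLOWORLD
```

With letters, a missing space or a comma recognized as a full stop no longer counts as an error, because both texts lose those characters before they are compared.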