Published: Jun 9, 2017 by Jesper Zedlitz
When training a character model for OCRopus you need a good selection of ground truth data for training and testing. For the model to recognize a character, that character must be included in the training data. Otherwise the neural network has no chance to learn what the character looks like. Although not strictly required, it is also a good idea to include every possible character at least once in the ground truth data used for testing.
With several thousand characters it is sometimes difficult to keep an overview. Therefore, I have written a small script that checks which characters are contained in the ground truth data. You can find it in my fork of the OCRopus code: https://github.com/jze/ocropy/blob/master/ground-truth-tests.sh
Here is the result of the script applied to the current version of the Gothic print model (https://github.com/jze/ocropus-model_fraktur):
WARNING: Missing training data for these characters: <=>°«»*%+
WARNING: Missing test data for these characters: /§¼½¾âřXY
I will immediately begin the search for the missing characters…
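For readers who want to run a similar check without the shell script, here is a minimal Python sketch of the idea. It is not the script linked above and makes no claim about how that script works. It assumes that the ground truth transcriptions are stored in `*.gt.txt` files and that you supply the set of expected characters yourself. The function names `collect_chars`, `read_ground_truth` and `missing_chars` are made up for this illustration.

```python
from pathlib import Path


def collect_chars(lines):
    """Return the set of non-whitespace characters found in the given strings."""
    return {ch for line in lines for ch in line if not ch.isspace()}


def read_ground_truth(directory, pattern="*.gt.txt"):
    """Yield the text of every ground truth file below `directory`.

    The `*.gt.txt` pattern is an assumption about the file layout.
    """
    for path in sorted(Path(directory).rglob(pattern)):
        yield path.read_text(encoding="utf-8")


def missing_chars(expected, lines):
    """Return the expected characters that never occur in `lines`, sorted."""
    return "".join(sorted(set(expected) - collect_chars(lines)))


if __name__ == "__main__":
    # Hypothetical toy data; replace with read_ground_truth("your/directory").
    expected = "abc<=>"
    training = ["abc", "a = b"]
    testing = ["ab <>"]

    for name, lines in (("training", training), ("test", testing)):
        missing = missing_chars(expected, lines)
        if missing:
            print(f"WARNING: Missing {name} data for these characters: {missing}")
```

With real data you would pass `read_ground_truth(...)` for the training and the test directory instead of the toy lists. Using sets keeps the check cheap even with many thousands of transcribed lines.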