When training a character model for OCRopus, you need a good
selection of ground truth data for training and testing. A character
can only be recognized if it is included in the training data.
Otherwise the neural network has no chance to learn the character’s
appearance. Although not strictly required, it is also a good idea to
include every possible character at least once in the ground truth
data used for testing.
With several thousand characters, it is sometimes difficult to keep
track of them all. Therefore, I have written a small script that checks
which characters are contained in the ground truth data. You can find
it in my fork of the OCRopus code: https://github.com/jze/ocropy/blob/master/ground-truth-tests.sh
Here is the result of the script applied to the current version of
the Gothic print model (https://github.com/jze/ocropus-model_fraktur):
WARNING: Missing training data for these characters:
<=>°«»[]*%+
WARNING: Missing test data for these characters:
/§¼½¾âřXY
I will immediately begin the search for the missing characters…
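
For readers who want to try a similar check themselves, here is a rough Python sketch of the idea. It is not the shell script linked above, and I am making a few assumptions: the ground truth text is stored in files ending in .gt.txt, the files are UTF-8 encoded, and you supply the set of characters you expect the model to handle yourself (the "expected" argument). All function names are made up for this illustration.

```python
from pathlib import Path


def collect_chars(texts):
    """Return the set of non-whitespace characters seen in the given strings."""
    return {ch for text in texts for ch in text if not ch.isspace()}


def read_ground_truth(directory):
    """Yield the contents of every *.gt.txt file below `directory`.

    Assumption: ground truth lines are stored as UTF-8 text files
    with the suffix .gt.txt.
    """
    for path in sorted(Path(directory).rglob("*.gt.txt")):
        yield path.read_text(encoding="utf-8")


def missing_chars(expected, texts):
    """Return the expected characters that never occur in `texts`, sorted."""
    return "".join(sorted(set(expected) - collect_chars(texts)))


def report(expected, train_texts, test_texts):
    """Build warning lines in the style of the output shown above.

    `expected` is a caller-supplied string of characters the model
    should cover (an assumption of this sketch).
    """
    lines = []
    for label, texts in (("training", train_texts), ("test", test_texts)):
        missing = missing_chars(expected, texts)
        if missing:
            lines.append(f"WARNING: Missing {label} data for these characters:")
            lines.append(missing)
    return lines
```

To use it, you would call something like report(expected, read_ground_truth("training"), read_ground_truth("testing")) and print the returned lines, where "training" and "testing" are placeholder directory names. If you have no fixed list of expected characters, one option is to use the union of the characters found in both directories, so that each set is checked against the other.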