Dataset Formats
File Extensions
File extensions are used to map ground truth and image files (e.g., a .png and its corresponding .gt.txt).
Everything after the first dot (.) in a filename is treated as the file's extension.
Therefore, do not use dots within the base names of files.
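The first-dot rule can be illustrated with a short Python sketch. The helper names split_at_first_dot and ground_truth_path are hypothetical and are not taken from Calamari itself; the sketch only assumes the rule as stated above.

```python
import os


def split_at_first_dot(path):
    """Split a path into (base, extension) at the first dot of the file name.

    Hypothetical helper for illustration; dots in directory names are ignored.
    """
    directory, name = os.path.split(path)
    base, dot, ext = name.partition(".")
    return os.path.join(directory, base), dot + ext


def ground_truth_path(image_path, gt_extension=".gt.txt"):
    """Derive the ground truth path that shares the image's base name."""
    base, _ = split_at_first_dot(image_path)
    return base + gt_extension


# A dot inside the base name shifts the split point, so "page.1.png"
# would be paired with "page.gt.txt" rather than "page.1.gt.txt".
print(split_at_first_dot("0001.gt.txt"))
print(ground_truth_path("page.1.png"))
```

This is why a name such as page.1.png is problematic: under the assumed rule, its base name is only page.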
Most dataset formats provide a .gt_extension and a .pred_extension parameter, which can be used to change the extensions of the files that are loaded and written.
Plain files
Each file is a single image of one text line (e.g., png or jpg). The ground truth is provided as plain text files in UTF-8 encoding, using the same base name as the corresponding image and .gt.txt as the extension.
Example:
+ train
  - 0001.png
  - 0001.gt.txt
  - 0002.png
  - 0002.gt.txt
  - ...
+ test
  - 1001.png
  - 1001.gt.txt
  - 1002.png
  - 1002.gt.txt
  - ...
Command-Line Call:
calamari-train --train.images train/*.png
calamari-predict --data.images test/*.png
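Before training, it can help to check that every line image has a matching ground truth file. The following Python sketch does this for the layout shown above; the function names are hypothetical, and the default glob and extension are assumptions that mirror the example, not Calamari defaults confirmed here.

```python
from pathlib import Path


def ground_truth_for(image_path, gt_extension=".gt.txt"):
    """Return the expected ground truth path for a line image."""
    image_path = Path(image_path)
    # The base name is everything before the first dot of the file name.
    base = image_path.name.split(".", 1)[0]
    return image_path.with_name(base + gt_extension)


def write_ground_truth(image_path, text, gt_extension=".gt.txt"):
    """Write the transcription as a UTF-8 plain text file next to the image."""
    gt_path = ground_truth_for(image_path, gt_extension)
    gt_path.write_text(text, encoding="utf-8")
    return gt_path


def find_images_without_ground_truth(directory, image_glob="*.png",
                                     gt_extension=".gt.txt"):
    """List the names of images in a directory that lack a ground truth file."""
    missing = []
    for image in sorted(Path(directory).glob(image_glob)):
        if not ground_truth_for(image, gt_extension).is_file():
            missing.append(image.name)
    return missing
```

Calling find_images_without_ground_truth("train") on a complete dataset should return an empty list.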
PageXML
Use --train PageXML to switch mode.
Provide page images as --train.images and the corresponding xml files as --xml_files.
Example structure:
+ train
  - 0001.png
  - 0001.xml
  - 0002.png
  - 0002.xml
  - ...
+ test
  - 1001.png
  - 1001.xml
  - 1002.png
  - 1002.xml
  - ...
Call:
calamari-train --train PageXML --train.images train/*.png
calamari-predict --data PageXML --data.images test/*.png
Abbyy
Use --train Abbyy to switch mode.
Provide page images as --train.images; the corresponding xml files must end with .abbyy.xml.
Example structure:
+ train
  - 0001.png
  - 0001.abbyy.xml
  - 0002.png
  - 0002.abbyy.xml
  - ...
+ test
  - 1001.png
  - 1001.abbyy.xml
  - 1002.png
  - 1002.abbyy.xml
  - ...
Call:
calamari-train --train Abbyy --train.images train/*.png
calamari-predict --data Abbyy --data.images test/*.png
HDF5
Use --train Hdf5 to switch mode.
The content of an h5 file is:
images: list of raw images
images_dims: the shape of the images (numpy arrays)
codec: integer mapping to decode the transcriptions (ASCII)
transcripts: list of encoded transcriptions using the codec
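The relationship between codec and transcripts can be sketched in plain Python. This is only an illustration under an assumption: the codec is taken to be a list of integer character codes, and each transcript a list of indices into that list. The exact on-disk layout that Calamari expects is not specified here, and the function names are hypothetical.

```python
def build_codec(texts):
    """Return a sorted list of character codes covering all given texts."""
    return sorted({ord(ch) for text in texts for ch in text})


def encode(text, codec):
    """Encode a text as a list of indices into the codec (assumed scheme)."""
    index = {code: i for i, code in enumerate(codec)}
    return [index[ord(ch)] for ch in text]


def decode(labels, codec):
    """Map a list of indices back to the text via the codec."""
    return "".join(chr(codec[i]) for i in labels)


texts = ["ab", "ba c"]
codec = build_codec(texts)
transcripts = [encode(t, codec) for t in texts]

# Round trip: decoding every encoded transcript restores the original text.
assert [decode(t, codec) for t in transcripts] == texts
```

Storing integer lists instead of strings keeps the transcripts in the same numeric form as the images, at the cost of needing the codec to read them back.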
Call:
calamari-train --train Hdf5 --train.files train.h5
calamari-predict --data Hdf5 --data.files test.h5