comparison mupdf-source/thirdparty/tesseract/doc/combine_tessdata.1.asc @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:43:07 +0200
COMBINE_TESSDATA(1)
===================

NAME
----
combine_tessdata - combine/extract/overwrite/list/compact Tesseract data

SYNOPSIS
--------
*combine_tessdata* ['OPTION'] 'FILE'...

DESCRIPTION
-----------
combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact
tessdata components in [lang].traineddata files.

To combine all the individual tessdata components (unicharset, DAWGs,
classifier templates, ambiguities, language configs) located at, say,
/home/$USER/temp/eng.* run:

  combine_tessdata /home/$USER/temp/eng.

The result will be a combined tessdata file /home/$USER/temp/eng.traineddata.

Specify option -e if you would like to extract individual components
from a combined traineddata file. For example, to extract the language config
file and the unicharset from tessdata/eng.traineddata run:

  combine_tessdata -e tessdata/eng.traineddata \
    /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

The desired config file and unicharset will be written to
/home/$USER/temp/eng.config and /home/$USER/temp/eng.unicharset.

Specify option -o to overwrite individual components of the given
[lang].traineddata file. For example, to overwrite the language config
and unichar ambiguities files in tessdata/eng.traineddata use:

  combine_tessdata -o tessdata/eng.traineddata \
    /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

As a result, tessdata/eng.traineddata will contain the new language config
and unichar ambigs, plus all the original DAWGs, classifier templates, etc.

Note: the file names of the files to extract to and to overwrite from should
have the appropriate file suffixes (extensions) indicating their tessdata
component type (.unicharset for the unicharset, .unicharambigs for unichar
ambigs, etc.). See the k*FileSuffix variables in ccutil/tessdatamanager.h.

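The suffix convention above means the file names passed to -e and -o are plain
concatenations of a prefix (ending in a period) and a component suffix. A small
illustrative sketch, not part of combine_tessdata itself (the prefix and the
suffix list here are example values):

```shell
# Illustrative only: derive per-component file names from a prefix that
# ends with a period, following the suffix convention described above.
prefix="/tmp/eng."                       # hypothetical prefix
for suffix in config unicharset unicharambigs; do
    printf '%s%s\n' "$prefix" "$suffix"  # e.g. /tmp/eng.config
done
```
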
Specify option -u to unpack all the components to the specified path:

  combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

This will create /home/$USER/temp/eng.* files with individual tessdata
components from tessdata/eng.traineddata.

OPTIONS
-------

*-c* '.traineddata' 'FILE'...::
    Compacts the LSTM component in the .traineddata file to int.

*-d* '.traineddata' 'FILE'...::
    Lists the directory of components from the .traineddata file.

*-e* '.traineddata' 'FILE'...::
    Extracts the specified components from the .traineddata file.

*-l* '.traineddata' 'FILE'...::
    Lists the network information.

*-o* '.traineddata' 'FILE'...::
    Overwrites the specified components of the .traineddata file
    with those provided on the command line.

*-u* '.traineddata' 'PATHPREFIX'::
    Unpacks the .traineddata using the provided prefix.

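The read-only listing modes can be tried as follows. This is a hedged sketch,
not part of the manual's normative text: it assumes combine_tessdata is on
PATH and that tessdata/eng.traineddata exists, and does nothing otherwise:

```shell
# Illustrative, guarded invocations of the read-only modes above; the
# sketch is a no-op when the tool or the data file is not present.
if command -v combine_tessdata >/dev/null 2>&1 && [ -f tessdata/eng.traineddata ]; then
    combine_tessdata -d tessdata/eng.traineddata   # directory of components
    combine_tessdata -l tessdata/eng.traineddata   # LSTM network information
fi
```
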
CAVEATS
-------
'Prefix' refers to the full file prefix, including the period (.).

COMPONENTS
----------
The components in a Tesseract lang.traineddata file as of
Tesseract 4.0 are briefly described below. For more information on
many of these files, see
<https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
and
<https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html>

lang.config::
(Optional) Language-specific overrides to default config variables.
For 4.0 traineddata files, lang.config provides control parameters which
can affect layout analysis, and sub-languages.

lang.unicharset::
(Required - 3.0x legacy tesseract) The list of symbols that Tesseract recognizes, with properties.
See unicharset(5).

lang.unicharambigs::
(Optional - 3.0x legacy tesseract) This file contains information on pairs of recognized symbols
which are often confused. For example, 'rn' and 'm'.

lang.inttemp::
(Required - 3.0x legacy tesseract) Character shape templates for each unichar. Produced by
mftraining(1).

lang.pffmtable::
(Required - 3.0x legacy tesseract) The number of features expected for each unichar.
Produced by mftraining(1) from *.tr* files.

lang.normproto::
(Required - 3.0x legacy tesseract) Character normalization prototypes generated by cntraining(1)
from *.tr* files.

lang.punc-dawg::
(Optional - 3.0x legacy tesseract) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space.

lang.word-dawg::
(Optional - 3.0x legacy tesseract) A dawg made from dictionary words from the language.

lang.number-dawg::
(Optional - 3.0x legacy tesseract) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character.

lang.freq-dawg::
(Optional - 3.0x legacy tesseract) A dawg made from the most frequent words which would have
gone into word-dawg.

lang.fixed-length-dawgs::
(Optional - 3.0x legacy tesseract) Several dawgs of different fixed lengths -- useful for
languages like Chinese.

lang.shapetable::
(Optional - 3.0x legacy tesseract) When present, a shapetable is an extra layer between the character
classifier and the word recognizer that allows the character classifier to
return a collection of unichar ids and fonts instead of a single unichar-id
and font.

lang.bigram-dawg::
(Optional - 3.0x legacy tesseract) A dawg of word bigrams where the words are separated by a space
and each digit is replaced by a '?'.

lang.unambig-dawg::
(Optional - 3.0x legacy tesseract) .

lang.params-model::
(Optional - 3.0x legacy tesseract) .

lang.lstm::
(Required - 4.0 LSTM) Neural net trained recognition model generated by lstmtraining.

lang.lstm-punc-dawg::
(Optional - 4.0 LSTM) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space. Uses lang.lstm-unicharset.

lang.lstm-word-dawg::
(Optional - 4.0 LSTM) A dawg made from dictionary words from the language.
Uses lang.lstm-unicharset.

lang.lstm-number-dawg::
(Optional - 4.0 LSTM) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character. Uses lang.lstm-unicharset.

lang.lstm-unicharset::
(Required - 4.0 LSTM) The unicode character set that Tesseract recognizes, with properties.
The same unicharset must be used to train the LSTM and build the lstm-*-dawgs files.

lang.lstm-recoder::
(Required - 4.0 LSTM) Unicharcompress, aka the recoder, which maps the unicharset
further to the codes actually used by the neural network recognizer. This is created as
part of the starter traineddata by combine_lang_model.

lang.version::
(Optional) Version string for the traineddata file.
First appeared in version 4.0 of Tesseract.
Old versions of traineddata files will report Version:Pre-4.0.0.
4.0 versions of traineddata files may include the network spec
used for LSTM training as part of the version string.

HISTORY
-------
combine_tessdata(1) first appeared in version 3.00 of Tesseract.

SEE ALSO
--------
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5),
unicharambigs(5)

COPYING
-------
Copyright \(C) 2009, Google Inc.
Licensed under the Apache License, Version 2.0.

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-2018).