comparison: mupdf-source/thirdparty/tesseract/doc/combine_tessdata.1.asc @ 2:b50eed0cc0ef (upstream)
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
author: Franz Glasner <fzglas.hg@dom66.de>
date: Mon, 15 Sep 2025 11:43:07 +0200
comparing revision 1:1d09e1dec1d9 with 2:b50eed0cc0ef
COMBINE_TESSDATA(1)
===================

NAME
----
combine_tessdata - combine/extract/overwrite/list/compact Tesseract data

SYNOPSIS
--------
*combine_tessdata* ['OPTION'] 'FILE'...

DESCRIPTION
-----------
combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact
tessdata components in [lang].traineddata files.
To combine all the individual tessdata components (unicharset, DAWGs,
classifier templates, ambiguities, language configs) located at, say,
/home/$USER/temp/eng.* run:

  combine_tessdata /home/$USER/temp/eng.

The result will be a combined tessdata file /home/$USER/temp/eng.traineddata.
Specify option -e if you would like to extract individual components
from a combined traineddata file. For example, to extract the language
config file and the unicharset from tessdata/eng.traineddata run:

  combine_tessdata -e tessdata/eng.traineddata \
    /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

The desired config file and unicharset will be written to
/home/$USER/temp/eng.config and /home/$USER/temp/eng.unicharset.
Specify option -o to overwrite individual components of the given
[lang].traineddata file. For example, to overwrite the language config
and unichar ambiguities files in tessdata/eng.traineddata use:

  combine_tessdata -o tessdata/eng.traineddata \
    /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

As a result, tessdata/eng.traineddata will contain the new language config
and unichar ambigs, plus all the original DAWGs, classifier templates, etc.
Note: the files to extract to and to overwrite from must have the
appropriate file suffixes (extensions) indicating their tessdata
component type (.unicharset for the unicharset, .unicharambigs for
unichar ambigs, etc.). See the k*FileSuffix variables in
ccutil/tessdatamanager.h.
Specify option -u to unpack all the components to the specified path:

  combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

This will create /home/$USER/temp/eng.* files with individual tessdata
components from tessdata/eng.traineddata.
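The unpack and combine modes are inverses of each other, which can be sketched as the following hedged shell session. The paths are illustrative only, and the commands run only if combine_tessdata and the input file are actually present:

```shell
# Hypothetical round trip: unpack all components, then recombine them.
# tessdata/eng.traineddata and /tmp/tess-demo are illustrative paths.
prefix=/tmp/tess-demo/eng.
if command -v combine_tessdata >/dev/null 2>&1 && [ -f tessdata/eng.traineddata ]; then
    mkdir -p "$(dirname "$prefix")"
    combine_tessdata -u tessdata/eng.traineddata "$prefix"   # unpack to /tmp/tess-demo/eng.*
    combine_tessdata "$prefix"                               # recombine into eng.traineddata
    result="rebuilt ${prefix}traineddata"
else
    result="prerequisites missing; round trip skipped"
fi
echo "$result"
```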
OPTIONS
-------

*-c* '.traineddata' 'FILE'...:
Compacts the LSTM component in the .traineddata file to int.

*-d* '.traineddata' 'FILE'...:
Lists the directory of components from the .traineddata file.

*-e* '.traineddata' 'FILE'...:
Extracts the specified components from the .traineddata file.

*-l* '.traineddata' 'FILE'...:
Lists the network information.

*-o* '.traineddata' 'FILE'...:
Overwrites the specified components of the .traineddata file
with those provided on the command line.

*-u* '.traineddata' 'PATHPREFIX':
Unpacks the .traineddata using the provided prefix.
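The inspection and compaction options above can be combined into a short maintenance session. This is a hedged sketch: tessdata/eng.traineddata is an illustrative path, and the commands run only if the tool is installed:

```shell
# Hypothetical maintenance session; tessdata/eng.traineddata is an
# illustrative path, not guaranteed to exist on your system.
if command -v combine_tessdata >/dev/null 2>&1; then
    combine_tessdata -d tessdata/eng.traineddata   # list the component directory
    combine_tessdata -l tessdata/eng.traineddata   # print the network information
    combine_tessdata -c tessdata/eng.traineddata   # compact the LSTM component to int
    status="ran combine_tessdata"
else
    status="combine_tessdata not installed; commands shown for illustration"
fi
echo "$status"
```

Note that -c rewrites the given .traineddata in place, so keep a copy of the original float model if you may need it again.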
CAVEATS
-------
'Prefix' refers to the full file prefix, including the period (.).

COMPONENTS
----------
The components in a Tesseract lang.traineddata file as of
Tesseract 4.0 are briefly described below. For more information on
many of these files, see
<https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
and
<https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html>
lang.config::
(Optional) Language-specific overrides to default config variables.
For 4.0 traineddata files, lang.config provides control parameters which
can affect layout analysis and sub-languages.

lang.unicharset::
(Required - 3.0x legacy tesseract) The list of symbols that Tesseract recognizes, with their properties.
See unicharset(5).

lang.unicharambigs::
(Optional - 3.0x legacy tesseract) This file contains information on pairs of recognized symbols
which are often confused, for example 'rn' and 'm'.

lang.inttemp::
(Required - 3.0x legacy tesseract) Character shape templates for each unichar. Produced by
mftraining(1).

lang.pffmtable::
(Required - 3.0x legacy tesseract) The number of features expected for each unichar.
Produced by mftraining(1) from *.tr* files.

lang.normproto::
(Required - 3.0x legacy tesseract) Character normalization prototypes generated by cntraining(1)
from *.tr* files.

lang.punc-dawg::
(Optional - 3.0x legacy tesseract) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space.

lang.word-dawg::
(Optional - 3.0x legacy tesseract) A dawg made from dictionary words from the language.

lang.number-dawg::
(Optional - 3.0x legacy tesseract) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character.

lang.freq-dawg::
(Optional - 3.0x legacy tesseract) A dawg made from the most frequent words which would have
gone into word-dawg.

lang.fixed-length-dawgs::
(Optional - 3.0x legacy tesseract) Several dawgs of different fixed lengths -- useful for
languages like Chinese.

lang.shapetable::
(Optional - 3.0x legacy tesseract) When present, a shapetable is an extra layer between the character
classifier and the word recognizer that allows the character classifier to
return a collection of unichar ids and fonts instead of a single unichar id
and font.

lang.bigram-dawg::
(Optional - 3.0x legacy tesseract) A dawg of word bigrams where the words are separated by a space
and each digit is replaced by a '?'.

lang.unambig-dawg::
(Optional - 3.0x legacy tesseract) .

lang.params-model::
(Optional - 3.0x legacy tesseract) .

lang.lstm::
(Required - 4.0 LSTM) Neural-network recognition model generated by lstmtraining.

lang.lstm-punc-dawg::
(Optional - 4.0 LSTM) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space. Uses lang.lstm-unicharset.

lang.lstm-word-dawg::
(Optional - 4.0 LSTM) A dawg made from dictionary words from the language.
Uses lang.lstm-unicharset.

lang.lstm-number-dawg::
(Optional - 4.0 LSTM) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character. Uses lang.lstm-unicharset.

lang.lstm-unicharset::
(Required - 4.0 LSTM) The Unicode character set that Tesseract recognizes, with properties.
The same unicharset must be used to train the LSTM and to build the lstm-*-dawgs files.

lang.lstm-recoder::
(Required - 4.0 LSTM) Unicharcompress, aka the recoder, which maps the unicharset
further to the codes actually used by the neural network recognizer. This is created as
part of the starter traineddata by combine_lang_model.

lang.version::
(Optional) Version string for the traineddata file.
First appeared in version 4.0 of Tesseract.
Older traineddata files will report Version:Pre-4.0.0.
4.0 traineddata files may include the network spec
used for LSTM training as part of the version string.

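Because legacy (3.0x) and LSTM (4.0) components can coexist in one file, a quick way to see which model families a traineddata file carries is to inspect the -d listing. This is a hedged sketch: grepping the output for "lstm" assumes the listing prints component names, and the model path is illustrative:

```shell
# Hypothetical check for 4.0 LSTM components via the -d listing.
model=tessdata/eng.traineddata   # illustrative path
if command -v combine_tessdata >/dev/null 2>&1 && [ -f "$model" ]; then
    if combine_tessdata -d "$model" | grep -qi lstm; then
        kind="4.0 LSTM components present"
    else
        kind="legacy-only (3.0x) components"
    fi
else
    kind="combine_tessdata or $model unavailable; check skipped"
fi
echo "$kind"
```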
HISTORY
-------
combine_tessdata(1) first appeared in version 3.00 of Tesseract.

SEE ALSO
--------
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5),
unicharambigs(5)

COPYING
-------
Copyright \(C) 2009, Google Inc.
Licensed under the Apache License, Version 2.0.

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-2018).
