comparison mupdf-source/thirdparty/tesseract/doc/tesseract.1.asc @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:43:07 +0200
1:1d09e1dec1d9 2:b50eed0cc0ef
1 TESSERACT(1)
2 ============
3 :doctype: manpage
4
5 NAME
6 ----
7 tesseract - command-line OCR engine
8
9 SYNOPSIS
10 --------
11 *tesseract* 'FILE' 'OUTPUTBASE' ['OPTIONS']... ['CONFIGFILE']...
12
13 DESCRIPTION
14 -----------
15 tesseract(1) is a commercial-quality OCR engine originally developed at HP
16 between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
17 UNLV. It was open-sourced by HP and UNLV in 2005, and was developed
18 at Google until 2018.
19
20
21 IN/OUT ARGUMENTS
22 ----------------
23 'FILE'::
24 The name of the input file.
25 This can either be an image file or a text file. +
26 Most image file formats (anything readable by Leptonica) are supported. +
27 A text file lists the names of all input images (one image name per line).
28 The results will be combined in a single file for each output file format
29 (txt, pdf, hocr, xml). +
30 If 'FILE' is `stdin` or `-` then the standard input is used.
31
32 'OUTPUTBASE'::
33 The basename of the output file (to which the appropriate extension
34 will be appended). By default the output will be a text file
35 with `.txt` added to the basename unless there are one or more
36 parameters set which explicitly specify the desired output. +
37 If 'OUTPUTBASE' is `stdout` or `-` then the standard output is used.
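
As a minimal sketch of how the two positional arguments fit together, the Python snippet below writes a text 'FILE' that lists input images and assembles the matching argument list. The helper names `write_image_list` and `basic_command` and the file names are illustrative assumptions, not part of Tesseract, and nothing is executed.

```python
import tempfile
from pathlib import Path


def write_image_list(image_names, list_path):
    """Write one image name per line, the layout described for a text 'FILE'."""
    Path(list_path).write_text("".join(f"{name}\n" for name in image_names))
    return str(list_path)


def basic_command(input_file, output_base):
    """Return the argument list for `tesseract FILE OUTPUTBASE`."""
    return ["tesseract", str(input_file), str(output_base)]


with tempfile.TemporaryDirectory() as tmp:
    list_file = write_image_list(["page1.png", "page2.png"], Path(tmp) / "pages.txt")
    listed = Path(list_file).read_text().splitlines()
    cmd = basic_command(list_file, "-")  # "-" sends the result to standard output
```

The resulting list could be handed to `subprocess.run` on a system where the `tesseract` executable is installed.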
38
39
40 [[TESSDATADIR]]
41 OPTIONS
42 -------
43 *-c* 'CONFIGVAR=VALUE'::
44 Set the value of parameter 'CONFIGVAR' to 'VALUE'. Multiple *-c* arguments are allowed.
45
46 *--dpi* 'N'::
47 Specify the resolution 'N' in DPI for the input image(s).
48 A typical value for 'N' is `300`. Without this option,
49 the resolution is read from the metadata included in the image.
50 If an image does not include that information, Tesseract tries to guess it.
51
52 *-l* 'LANG'::
53 *-l* 'SCRIPT'::
54 The language or script to use.
55 If none is specified, `eng` (English) is assumed.
56 Multiple languages may be specified, separated by plus characters.
57 Tesseract uses 3-character ISO 639-2 language codes
58 (see <<LANGUAGES,*LANGUAGES AND SCRIPTS*>>).
59
60 *--psm* 'N'::
61 Set Tesseract to only run a subset of layout analysis and assume
62 a certain form of image. The options for 'N' are:
63
64 0 = Orientation and script detection (OSD) only.
65 1 = Automatic page segmentation with OSD.
66 2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
67 3 = Fully automatic page segmentation, but no OSD. (Default)
68 4 = Assume a single column of text of variable sizes.
69 5 = Assume a single uniform block of vertically aligned text.
70 6 = Assume a single uniform block of text.
71 7 = Treat the image as a single text line.
72 8 = Treat the image as a single word.
73 9 = Treat the image as a single word in a circle.
74 10 = Treat the image as a single character.
75 11 = Sparse text. Find as much text as possible in no particular order.
76 12 = Sparse text with OSD.
77 13 = Raw line. Treat the image as a single text line,
78 bypassing hacks that are Tesseract-specific.
79
80 *--oem* 'N'::
81 Specify OCR Engine mode. The options for 'N' are:
82
83 0 = Original Tesseract only.
84 1 = Neural nets LSTM only.
85 2 = Tesseract + LSTM.
86 3 = Default, based on what is available.
87
88 *--tessdata-dir* 'PATH'::
89 Specify the location of the tessdata directory.
90
91 *--user-patterns* 'FILE'::
92 Specify the location of the user patterns file.
93
94 *--user-words* 'FILE'::
95 Specify the location of the user words file.
96
97 [[CONFIGFILE]]
98 'CONFIGFILE'::
99 The name of a config to use. The name can be a file in `tessdata/configs`
100 or `tessdata/tessconfigs`, or an absolute or relative file path.
101 A config is a plain text file which contains a list of parameters and
102 their values, one per line, with a space separating parameter from value. +
103 Interesting config files include:
104
105 * *alto* -- Output in ALTO format ('OUTPUTBASE'`.xml`).
106 * *hocr* -- Output in hOCR format ('OUTPUTBASE'`.hocr`).
107 * *page* -- Output in PAGE format ('OUTPUTBASE'`.page.xml`).
108 The output can be customized with the flags:
109 page_xml_polygon -- Create polygons instead of bounding boxes (default: true)
110 page_xml_level -- Create the PAGE file on 0=linelevel or 1=wordlevel (default: 0)
111 * *pdf* -- Output PDF ('OUTPUTBASE'`.pdf`).
112 * *tsv* -- Output TSV ('OUTPUTBASE'`.tsv`).
113 * *txt* -- Output plain text ('OUTPUTBASE'`.txt`).
114 * *get.images* -- Write processed input images to file ('OUTPUTBASE'`.processedPAGENUMBER.tif`).
115 * *logfile* -- Redirect debug messages to file (`tesseract.log`).
116 * *lstm.train* -- Output files used by LSTM training ('OUTPUTBASE'`.lstmf`).
117 * *makebox* -- Write box file ('OUTPUTBASE'`.box`).
118 * *quiet* -- Redirect debug messages to '/dev/null'.
119
120 It is possible to select several config files, for example
121 `tesseract image.png demo alto hocr pdf txt` will create four output files
122 `demo.xml`, `demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
123
124 *Nota bene:* The options *-l* 'LANG', *-l* 'SCRIPT' and *--psm* 'N'
125 must occur before any 'CONFIGFILE'.
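
The ordering rule above can be sketched in Python as follows. The functions `build_command` and `expected_outputs` are hypothetical helpers written for this illustration; the extension table only repeats a subset of the config list above, and the snippet builds argument lists without running anything.

```python
# Extensions copied from the config list above; only a subset is included.
CONFIG_EXTENSIONS = {
    "alto": ".xml",
    "hocr": ".hocr",
    "page": ".page.xml",
    "pdf": ".pdf",
    "tsv": ".tsv",
    "txt": ".txt",
}


def build_command(image, output_base, lang=None, psm=None, configs=()):
    """Assemble an argument list with options placed before config names."""
    cmd = ["tesseract", image, output_base]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    cmd += list(configs)  # config names go last, after -l and --psm
    return cmd


def expected_outputs(output_base, configs):
    """List the output files the named configs are documented to produce."""
    return [output_base + CONFIG_EXTENSIONS[name] for name in configs]


cmd = build_command("image.png", "demo", lang="eng", psm=6, configs=["hocr", "pdf"])
outputs = expected_outputs("demo", ["hocr", "pdf", "txt"])
```

Keeping the config names in a separate parameter makes it hard to place an option after a 'CONFIGFILE' by accident.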
126
127
128 SINGLE OPTIONS
129 --------------
130 *-h, --help*::
131 Show help message.
132
133 *--help-extra*::
134 Show extra help for advanced users.
135
136 *--help-psm*::
137 Show page segmentation modes.
138
139 *--help-oem*::
140 Show OCR Engine modes.
141
142 *-v, --version*::
143 Show the current version of the tesseract(1) executable.
144
145 *--list-langs*::
146 List available languages for tesseract engine.
147 Can be used with *--tessdata-dir* 'PATH'.
148
149 *--print-parameters*::
150 Print tesseract parameters.
151
152
153 [[LANGUAGES]]
154 LANGUAGES AND SCRIPTS
155 ---------------------
156
157 To recognize some text with Tesseract, it is normally necessary to specify
158 the language(s) or script(s) of the text (unless it is English text which is
159 supported by default) using *-l* 'LANG' or *-l* 'SCRIPT'.
160
161 Selecting a language automatically also selects the language-specific
162 character set and dictionary (word list).
163
164 Selecting a script typically selects all characters of that script
165 which can be from different languages. The dictionary which is included
166 also contains a mix from different languages.
167 In most cases, a script also supports English.
168 So it is possible to recognize a language that has not been specifically
169 trained for by using traineddata for the script it is written in.
170
171 More than one language or script may be specified by using `+`.
172 Example: `tesseract myimage.png myimage -l eng+deu+fra`.
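
A small Python sketch of the same idea, assuming a hypothetical helper `language_argument` that joins the names before they are passed to *-l*:

```python
def language_argument(codes):
    """Join language or script names with '+' for use after -l."""
    if not codes:
        raise ValueError("at least one language or script is required")
    return "+".join(codes)


cmd = ["tesseract", "myimage.png", "myimage", "-l",
       language_argument(["eng", "deu", "fra"])]
```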
173
174 https://github.com/tesseract-ocr/tessdata_fast provides fast language and
175 script models which are also part of Linux distributions.
176
177 For Tesseract 4, `tessdata_fast` includes traineddata files for the
178 following languages:
179
180 *afr* (Afrikaans),
181 *amh* (Amharic),
182 *ara* (Arabic),
183 *asm* (Assamese),
184 *aze* (Azerbaijani),
185 *aze_cyrl* (Azerbaijani - Cyrillic),
186 *bel* (Belarusian),
187 *ben* (Bengali),
188 *bod* (Tibetan),
189 *bos* (Bosnian),
190 *bre* (Breton),
191 *bul* (Bulgarian),
192 *cat* (Catalan; Valencian),
193 *ceb* (Cebuano),
194 *ces* (Czech),
195 *chi_sim* (Chinese simplified),
196 *chi_tra* (Chinese traditional),
197 *chr* (Cherokee),
198 *cos* (Corsican),
199 *cym* (Welsh),
200 *dan* (Danish),
201 *deu* (German),
202 *deu_latf* (German Fraktur Latin),
203 *div* (Dhivehi),
204 *dzo* (Dzongkha),
205 *ell* (Greek, Modern, 1453-),
206 *eng* (English),
207 *enm* (English, Middle, 1100-1500),
208 *epo* (Esperanto),
209 *equ* (Math / equation detection module),
210 *est* (Estonian),
211 *eus* (Basque),
212 *fas* (Persian),
213 *fao* (Faroese),
214 *fil* (Filipino),
215 *fin* (Finnish),
216 *fra* (French),
217 *frm* (French, Middle, ca.1400-1600),
218 *fry* (West Frisian),
219 *gla* (Scottish Gaelic),
220 *gle* (Irish),
221 *glg* (Galician),
222 *grc* (Greek, Ancient, to 1453),
223 *guj* (Gujarati),
224 *hat* (Haitian; Haitian Creole),
225 *heb* (Hebrew),
226 *hin* (Hindi),
227 *hrv* (Croatian),
228 *hun* (Hungarian),
229 *hye* (Armenian),
230 *iku* (Inuktitut),
231 *ind* (Indonesian),
232 *isl* (Icelandic),
233 *ita* (Italian),
234 *ita_old* (Italian - Old),
235 *jav* (Javanese),
236 *jpn* (Japanese),
237 *kan* (Kannada),
238 *kat* (Georgian),
239 *kat_old* (Georgian - Old),
240 *kaz* (Kazakh),
241 *khm* (Central Khmer),
242 *kir* (Kirghiz; Kyrgyz),
243 *kmr* (Kurdish Kurmanji),
244 *kor* (Korean),
245 *kor_vert* (Korean vertical),
246 *lao* (Lao),
247 *lat* (Latin),
248 *lav* (Latvian),
249 *lit* (Lithuanian),
250 *ltz* (Luxembourgish),
251 *mal* (Malayalam),
252 *mar* (Marathi),
253 *mkd* (Macedonian),
254 *mlt* (Maltese),
255 *mon* (Mongolian),
256 *mri* (Maori),
257 *msa* (Malay),
258 *mya* (Burmese),
259 *nep* (Nepali),
260 *nld* (Dutch; Flemish),
261 *nor* (Norwegian),
262 *oci* (Occitan post 1500),
263 *ori* (Oriya),
264 *osd* (Orientation and script detection module),
265 *pan* (Panjabi; Punjabi),
266 *pol* (Polish),
267 *por* (Portuguese),
268 *pus* (Pushto; Pashto),
269 *que* (Quechua),
270 *ron* (Romanian; Moldavian; Moldovan),
271 *rus* (Russian),
272 *san* (Sanskrit),
273 *sin* (Sinhala; Sinhalese),
274 *slk* (Slovak),
275 *slv* (Slovenian),
276 *snd* (Sindhi),
277 *spa* (Spanish; Castilian),
278 *spa_old* (Spanish; Castilian - Old),
279 *sqi* (Albanian),
280 *srp* (Serbian),
281 *srp_latn* (Serbian - Latin),
282 *sun* (Sundanese),
283 *swa* (Swahili),
284 *swe* (Swedish),
285 *syr* (Syriac),
286 *tam* (Tamil),
287 *tat* (Tatar),
288 *tel* (Telugu),
289 *tgk* (Tajik),
290 *tha* (Thai),
291 *tir* (Tigrinya),
292 *ton* (Tonga),
293 *tur* (Turkish),
294 *uig* (Uighur; Uyghur),
295 *ukr* (Ukrainian),
296 *urd* (Urdu),
297 *uzb* (Uzbek),
298 *uzb_cyrl* (Uzbek - Cyrillic),
299 *vie* (Vietnamese),
300 *yid* (Yiddish),
301 *yor* (Yoruba)
302
303 To use a non-standard language pack named `foo.traineddata`, set the
304 `TESSDATA_PREFIX` environment variable so the file can be found at
305 `TESSDATA_PREFIX/tessdata/foo.traineddata` and give Tesseract the
306 argument *-l* `foo`.
307
308 For Tesseract 4, `tessdata_fast` includes traineddata files for the
309 following scripts:
310
311 *Arabic*,
312 *Armenian*,
313 *Bengali*,
314 *Canadian_Aboriginal*,
315 *Cherokee*,
316 *Cyrillic*,
317 *Devanagari*,
318 *Ethiopic*,
319 *Fraktur*,
320 *Georgian*,
321 *Greek*,
322 *Gujarati*,
323 *Gurmukhi*,
324 *HanS* (Han simplified),
325 *HanS_vert* (Han simplified, vertical),
326 *HanT* (Han traditional),
327 *HanT_vert* (Han traditional, vertical),
328 *Hangul*,
329 *Hangul_vert* (Hangul vertical),
330 *Hebrew*,
331 *Japanese*,
332 *Japanese_vert* (Japanese vertical),
333 *Kannada*,
334 *Khmer*,
335 *Lao*,
336 *Latin*,
337 *Malayalam*,
338 *Myanmar*,
339 *Oriya* (Odia),
340 *Sinhala*,
341 *Syriac*,
342 *Tamil*,
343 *Telugu*,
344 *Thaana*,
345 *Thai*,
346 *Tibetan*,
347 *Vietnamese*.
348
349 The same languages and scripts are available from
350 https://github.com/tesseract-ocr/tessdata_best.
351 `tessdata_best` provides slow language and script models.
352 These models are needed for training. They also can give better OCR results,
353 but the recognition takes much more time.
354
355 Both `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine.
356
357 There is a third repository, https://github.com/tesseract-ocr/tessdata,
358 with models which support both the Tesseract 3 legacy OCR engine and the
359 Tesseract 4 LSTM OCR engine.
360
361
362 CONFIG FILES AND AUGMENTING WITH USER DATA
363 ------------------------------------------
364
365 Tesseract config files consist of lines with parameter-value pairs (space
366 separated). The parameters are documented as flags in the source code like
367 the following one in tesseractclass.h:
368
369 `STRING_VAR_H(tessedit_char_blacklist, "",
370 "Blacklist of chars not to recognize");`
371
372 These parameters may enable or disable various features of the engine, and
373 may cause it to load (or not load) various data. For instance, let's suppose
374 you want to OCR in English, but suppress the normal dictionary and load an
375 alternative word list and an alternative list of patterns -- these two files
376 are the most commonly used extra data files.
377
378 If your language pack is in '/path/to/eng.traineddata' and the hocr config
379 is in '/path/to/configs/hocr' then create three new files:
380
381 '/path/to/eng.user-words':
382 [verse]
383 the
384 quick
385 brown
386 fox
387 jumped
388
389 '/path/to/eng.user-patterns':
390 [verse]
391 1-\d\d\d-GOOG-411
392 www.\n\\\*.com
393
394 '/path/to/configs/bazaar':
395 [verse]
396 load_system_dawg F
397 load_freq_dawg F
398 user_words_suffix user-words
399 user_patterns_suffix user-patterns
400
401 Now, if you pass the word 'bazaar' as a <<CONFIGFILE,'CONFIGFILE'>> to
402 Tesseract, Tesseract will not bother loading the system dictionary nor
403 the dictionary of frequent words and will load and use the 'eng.user-words'
404 and 'eng.user-patterns' files you provided. The former is a simple word list,
405 one word per line. The format of the latter is documented in 'dict/trie.h'
406 in the comments on 'read_pattern_list()'.
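
The word list and the 'bazaar' config described above could be generated with a short Python script such as the sketch below. The function names `write_user_data` and `read_config` are assumptions made for this example, a temporary directory stands in for '/path/to', and the patterns file is left out because its format is documented separately.

```python
import tempfile
from pathlib import Path

USER_WORDS = ["the", "quick", "brown", "fox", "jumped"]
BAZAAR_CONFIG = {
    "load_system_dawg": "F",
    "load_freq_dawg": "F",
    "user_words_suffix": "user-words",
    "user_patterns_suffix": "user-patterns",
}


def write_user_data(tessdata_dir):
    """Create eng.user-words and configs/bazaar under tessdata_dir."""
    tessdata_dir = Path(tessdata_dir)
    (tessdata_dir / "configs").mkdir(parents=True, exist_ok=True)
    words_path = tessdata_dir / "eng.user-words"
    words_path.write_text("\n".join(USER_WORDS) + "\n")
    config_path = tessdata_dir / "configs" / "bazaar"
    # One parameter per line, with a space between parameter and value.
    config_path.write_text(
        "".join(f"{key} {value}\n" for key, value in BAZAAR_CONFIG.items())
    )
    return words_path, config_path


def read_config(path):
    """Parse a config file back into a dict of parameter -> value."""
    pairs = {}
    for line in Path(path).read_text().splitlines():
        if line.strip():
            key, value = line.split(None, 1)
            pairs[key] = value
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    words_path, config_path = write_user_data(tmp)
    words = words_path.read_text().split()
    parsed = read_config(config_path)
```

Reading the config back with `read_config` is only a sanity check that each line holds exactly one parameter and one value.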
407
408
409 ENVIRONMENT VARIABLES
410 ---------------------
411 *`TESSDATA_PREFIX`*::
412 If `TESSDATA_PREFIX` is set to a path, then that path is used to
413 find the `tessdata` directory with language and script recognition
414 models and config files.
415 Using <<TESSDATADIR,*--tessdata-dir* 'PATH'>> is the recommended alternative.
416 *`OMP_THREAD_LIMIT`*::
417 If the `tesseract` executable was built with multithreading support,
418 it will normally use four CPU cores for the OCR process. While this
419 can be faster for a single image, it gives bad performance if the host
420 computer provides fewer than four CPU cores or if OCR is run on many images.
421 Only a single CPU core is used with `OMP_THREAD_LIMIT=1`.
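
For batch jobs, one possible approach is to limit each process to a single thread and run several processes side by side. The Python sketch below only prepares the environment and the argument lists. The helper names `single_thread_env` and `batch_commands` and the image names are hypothetical, and the snippet does not start any process.

```python
import os


def single_thread_env(base_env=None):
    """Copy an environment and limit Tesseract's OpenMP threads to one."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def batch_commands(images):
    """One `tesseract IMAGE OUTPUTBASE` argument list per input image."""
    return [["tesseract", img, os.path.splitext(img)[0]] for img in images]


env = single_thread_env({"PATH": "/usr/bin"})
commands = batch_commands(["a.png", "b.png"])
# Each command could then be run with subprocess.run(cmd, env=env, check=True),
# several at a time, so that parallelism comes from separate processes.
```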
422
423
424 HISTORY
425 -------
426 The engine was developed at Hewlett Packard Laboratories Bristol and at
427 Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
428 changes made in 1996 to port to Windows, and some $$C++$$izing in 1998. A
429 lot of the code was written in C, and then some more was written in $$C++$$.
430 The $$C++$$ code makes heavy use of a list system using macros. This predates
431 STL, was portable before STL, and is more efficient than STL lists, but has
432 the big negative that if you do get a segmentation violation, it is hard to
433 debug.
434
435 Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
436 to train Tesseract.
437
438 Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
439 See <https://github.com/tesseract-ocr/docs/blob/main/AT-1995.pdf>.
440 Since Tesseract 2.00,
441 scripts have been included to allow anyone to reproduce some of these tests.
442 See <https://tesseract-ocr.github.io/tessdoc/TestingTesseract.html> for more
443 details.
444
445 Tesseract 3.00 added a number of new languages, including Chinese, Japanese,
446 and Korean. It also introduced a new, single-file based system of managing
447 language data.
448
449 Tesseract 3.02 added BiDirectional text support, the ability to recognize
450 multiple languages in a single image, and improved layout analysis.
451
452 Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
453 on line recognition, but also still supports the legacy Tesseract OCR engine of
454 Tesseract 3 which works by recognizing character patterns. Compatibility with
455 Tesseract 3 is enabled by `--oem 0`. This also needs traineddata files which
456 support the legacy engine, for example those from the tessdata repository
457 (https://github.com/tesseract-ocr/tessdata).
458
459 For further details, see the release notes in the Tesseract documentation
460 (<https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html>).
461
462
463 RESOURCES
464 ---------
465 Main web site: <https://github.com/tesseract-ocr> +
466 User forum: <https://groups.google.com/g/tesseract-ocr> +
467 Documentation: <https://tesseract-ocr.github.io/> +
468 Information on training: <https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
469
470 SEE ALSO
471 --------
472 ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1),
473 shape_training(1), mftraining(1), unicharambigs(5), unicharset(5),
474 unicharset_extractor(1), wordlist2dawg(1)
475
476 AUTHOR
477 ------
478 Tesseract development was led at Hewlett-Packard and Google by Ray Smith.
479 The development team has included:
480
481 Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger,
482 Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke,
483 Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle,
484 Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
485 Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
486 Lloyd, Shobhit Saxena, and Thomas Kielbus.
487
488 For a list of contributors see
489 <https://github.com/tesseract-ocr/tesseract/blob/main/AUTHORS>.
490
491 COPYING
492 -------
493 Licensed under the Apache License, Version 2.0