diff mupdf-source/thirdparty/tesseract/doc/tesseract.1.asc @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author: Franz Glasner <fzglas.hg@dom66.de>
date: Mon, 15 Sep 2025 11:43:07 +0200
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/mupdf-source/thirdparty/tesseract/doc/tesseract.1.asc	Mon Sep 15 11:43:07 2025 +0200
@@ -0,0 +1,493 @@
+TESSERACT(1)
+============
+:doctype: manpage
+
+NAME
+----
+tesseract - command-line OCR engine
+
+SYNOPSIS
+--------
+*tesseract* 'FILE' 'OUTPUTBASE' ['OPTIONS']... ['CONFIGFILE']...
+
+DESCRIPTION
+-----------
+tesseract(1) is a commercial quality OCR engine originally developed at HP
+between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
+UNLV. It was open-sourced by HP and UNLV in 2005, and was developed
+at Google until 2018.
+
+
+IN/OUT ARGUMENTS
+----------------
+'FILE'::
+  The name of the input file.
+  This can either be an image file or a text file. +
+  Most image file formats (anything readable by Leptonica) are supported. +
+  A text file lists the names of all input images (one image name per line).
+  The results will be combined in a single file for each output file format
+  (txt, pdf, hocr, xml). +
+  If 'FILE' is `stdin` or `-` then the standard input is used.
+
+'OUTPUTBASE'::
+  The basename of the output file (to which the appropriate extension
+  will be appended).  By default the output will be a text file
+  with `.txt` added to the basename unless there are one or more
+  parameters set which explicitly specify the desired output. +
+  If 'OUTPUTBASE' is `stdout` or `-` then the standard output is used.
+
+
+[[TESSDATADIR]]
+OPTIONS
+-------
+*-c* 'CONFIGVAR=VALUE'::
+  Set the value of parameter 'CONFIGVAR' to 'VALUE'. Multiple *-c* arguments are allowed.
+
+*--dpi* 'N'::
+  Specify the resolution 'N' in DPI for the input image(s).
+  A typical value for 'N' is `300`. Without this option,
+  the resolution is read from the metadata included in the image.
+  If an image does not include that information, Tesseract tries to guess it.
+
+*-l* 'LANG'::
+*-l* 'SCRIPT'::
+  The language or script to use.
+  If none is specified, `eng` (English) is assumed.
+  Multiple languages may be specified, separated by plus characters.
+  Tesseract uses 3-character ISO 639-2 language codes
+  (see <<LANGUAGES,*LANGUAGES AND SCRIPTS*>>).
+
+*--psm* 'N'::
+  Set Tesseract to only run a subset of layout analysis and assume
+  a certain form of image. The options for 'N' are:
+
+  0 = Orientation and script detection (OSD) only.
+  1 = Automatic page segmentation with OSD.
+  2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
+  3 = Fully automatic page segmentation, but no OSD. (Default)
+  4 = Assume a single column of text of variable sizes.
+  5 = Assume a single uniform block of vertically aligned text.
+  6 = Assume a single uniform block of text.
+  7 = Treat the image as a single text line.
+  8 = Treat the image as a single word.
+  9 = Treat the image as a single word in a circle.
+  10 = Treat the image as a single character.
+  11 = Sparse text. Find as much text as possible in no particular order.
+  12 = Sparse text with OSD.
+  13 = Raw line. Treat the image as a single text line,
+       bypassing hacks that are Tesseract-specific.
+
+*--oem* 'N'::
+  Specify OCR Engine mode. The options for 'N' are:
+
+  0 = Original Tesseract only.
+  1 = Neural nets LSTM only.
+  2 = Tesseract + LSTM.
+  3 = Default, based on what is available.
+
+*--tessdata-dir* 'PATH'::
+  Specify the location of the tessdata directory.
+
+*--user-patterns* 'FILE'::
+  Specify the location of the user patterns file.
+
+*--user-words* 'FILE'::
+  Specify the location of the user words file.
+
+[[CONFIGFILE]]
+'CONFIGFILE'::
+  The name of a config to use. The name can be a file in `tessdata/configs`
+  or `tessdata/tessconfigs`, or an absolute or relative file path.
+  A config is a plain text file which contains a list of parameters and
+  their values, one per line, with a space separating parameter from value. +
+  Interesting config files include:
+
+  * *alto* -- Output in ALTO format ('OUTPUTBASE'`.xml`).
+  * *hocr* -- Output in hOCR format ('OUTPUTBASE'`.hocr`).
+  * *page* -- Output in PAGE format ('OUTPUTBASE'`.page.xml`).
+              The output can be customized with the flags:
+              page_xml_polygon -- Create polygons instead of bounding boxes (default: true)
+              page_xml_level -- Create the PAGE file at 0=line level or 1=word level (default: 0)
+  * *pdf* -- Output PDF ('OUTPUTBASE'`.pdf`).
+  * *tsv* -- Output TSV ('OUTPUTBASE'`.tsv`).
+  * *txt* -- Output plain text ('OUTPUTBASE'`.txt`).
+  * *get.images* -- Write processed input images to file ('OUTPUTBASE'`.processedPAGENUMBER.tif`).
+  * *logfile* -- Redirect debug messages to file (`tesseract.log`).
+  * *lstm.train* -- Output files used by LSTM training ('OUTPUTBASE'`.lstmf`).
+  * *makebox* -- Write box file ('OUTPUTBASE'`.box`).
+  * *quiet* -- Redirect debug messages to '/dev/null'.
+
+It is possible to select several config files, for example
+`tesseract image.png demo alto hocr pdf txt` will create four output files
+`demo.xml`, `demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
+
+*Nota bene:*   The options *-l* 'LANG', *-l* 'SCRIPT' and *--psm* 'N'
+must occur before any 'CONFIGFILE'.
+
+
+SINGLE OPTIONS
+--------------
+*-h, --help*::
+  Show help message.
+
+*--help-extra*::
+  Show extra help for advanced users.
+
+*--help-psm*::
+  Show page segmentation modes.
+
+*--help-oem*::
+  Show OCR Engine modes.
+
+*-v, --version*::
+  Returns the current version of the tesseract(1) executable.
+
+*--list-langs*::
+  List available languages for the tesseract engine.
+  Can be used with *--tessdata-dir* 'PATH'.
+
+*--print-parameters*::
+  Print tesseract parameters.
+
+
+[[LANGUAGES]]
+LANGUAGES AND SCRIPTS
+---------------------
+
+To recognize some text with Tesseract, it is normally necessary to specify
+the language(s) or script(s) of the text (unless it is English text which is
+supported by default) using *-l* 'LANG' or *-l* 'SCRIPT'.
+
+Selecting a language automatically also selects the language-specific
+character set and dictionary (word list).
+
+Selecting a script typically selects all characters of that script
+which can be from different languages. The dictionary which is included
+also contains a mix from different languages.
+In most cases, a script also supports English.
+So it is possible to recognize a language that has not been specifically
+trained for by using traineddata for the script it is written in.
+
+More than one language or script may be specified by using `+`.
+Example: `tesseract myimage.png myimage -l eng+deu+fra`.
+
+https://github.com/tesseract-ocr/tessdata_fast provides fast language and
+script models which are also part of Linux distributions.
+
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following languages:
+
+*afr* (Afrikaans),
+*amh* (Amharic),
+*ara* (Arabic),
+*asm* (Assamese),
+*aze* (Azerbaijani),
+*aze_cyrl* (Azerbaijani - Cyrillic),
+*bel* (Belarusian),
+*ben* (Bengali),
+*bod* (Tibetan),
+*bos* (Bosnian),
+*bre* (Breton),
+*bul* (Bulgarian),
+*cat* (Catalan; Valencian),
+*ceb* (Cebuano),
+*ces* (Czech),
+*chi_sim* (Chinese simplified),
+*chi_tra* (Chinese traditional),
+*chr* (Cherokee),
+*cos* (Corsican),
+*cym* (Welsh),
+*dan* (Danish),
+*deu* (German),
+*deu_latf* (German Fraktur Latin),
+*div* (Dhivehi),
+*dzo* (Dzongkha),
+*ell* (Greek, Modern, 1453-),
+*eng* (English),
+*enm* (English, Middle, 1100-1500),
+*epo* (Esperanto),
+*equ* (Math / equation detection module),
+*est* (Estonian),
+*eus* (Basque),
+*fas* (Persian),
+*fao* (Faroese),
+*fil* (Filipino),
+*fin* (Finnish),
+*fra* (French),
+*frm* (French, Middle, ca.1400-1600),
+*fry* (West Frisian),
+*gla* (Scottish Gaelic),
+*gle* (Irish),
+*glg* (Galician),
+*grc* (Greek, Ancient, to 1453),
+*guj* (Gujarati),
+*hat* (Haitian; Haitian Creole),
+*heb* (Hebrew),
+*hin* (Hindi),
+*hrv* (Croatian),
+*hun* (Hungarian),
+*hye* (Armenian),
+*iku* (Inuktitut),
+*ind* (Indonesian),
+*isl* (Icelandic),
+*ita* (Italian),
+*ita_old* (Italian - Old),
+*jav* (Javanese),
+*jpn* (Japanese),
+*kan* (Kannada),
+*kat* (Georgian),
+*kat_old* (Georgian - Old),
+*kaz* (Kazakh),
+*khm* (Central Khmer),
+*kir* (Kirghiz; Kyrgyz),
+*kmr* (Kurdish Kurmanji),
+*kor* (Korean),
+*kor_vert* (Korean vertical),
+*lao* (Lao),
+*lat* (Latin),
+*lav* (Latvian),
+*lit* (Lithuanian),
+*ltz* (Luxembourgish),
+*mal* (Malayalam),
+*mar* (Marathi),
+*mkd* (Macedonian),
+*mlt* (Maltese),
+*mon* (Mongolian),
+*mri* (Maori),
+*msa* (Malay),
+*mya* (Burmese),
+*nep* (Nepali),
+*nld* (Dutch; Flemish),
+*nor* (Norwegian),
+*oci* (Occitan post 1500),
+*ori* (Oriya),
+*osd* (Orientation and script detection module),
+*pan* (Panjabi; Punjabi),
+*pol* (Polish),
+*por* (Portuguese),
+*pus* (Pushto; Pashto),
+*que* (Quechua),
+*ron* (Romanian; Moldavian; Moldovan),
+*rus* (Russian),
+*san* (Sanskrit),
+*sin* (Sinhala; Sinhalese),
+*slk* (Slovak),
+*slv* (Slovenian),
+*snd* (Sindhi),
+*spa* (Spanish; Castilian),
+*spa_old* (Spanish; Castilian - Old),
+*sqi* (Albanian),
+*srp* (Serbian),
+*srp_latn* (Serbian - Latin),
+*sun* (Sundanese),
+*swa* (Swahili),
+*swe* (Swedish),
+*syr* (Syriac),
+*tam* (Tamil),
+*tat* (Tatar),
+*tel* (Telugu),
+*tgk* (Tajik),
+*tha* (Thai),
+*tir* (Tigrinya),
+*ton* (Tonga),
+*tur* (Turkish),
+*uig* (Uighur; Uyghur),
+*ukr* (Ukrainian),
+*urd* (Urdu),
+*uzb* (Uzbek),
+*uzb_cyrl* (Uzbek - Cyrillic),
+*vie* (Vietnamese),
+*yid* (Yiddish),
+*yor* (Yoruba)
+
+To use a non-standard language pack named `foo.traineddata`, set the
+`TESSDATA_PREFIX` environment variable so the file can be found at
+`TESSDATA_PREFIX/tessdata/foo.traineddata` and give Tesseract the
+argument *-l* `foo`.
+
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following scripts:
+
+*Arabic*,
+*Armenian*,
+*Bengali*,
+*Canadian_Aboriginal*,
+*Cherokee*,
+*Cyrillic*,
+*Devanagari*,
+*Ethiopic*,
+*Fraktur*,
+*Georgian*,
+*Greek*,
+*Gujarati*,
+*Gurmukhi*,
+*HanS* (Han simplified),
+*HanS_vert* (Han simplified, vertical),
+*HanT* (Han traditional),
+*HanT_vert* (Han traditional, vertical),
+*Hangul*,
+*Hangul_vert* (Hangul vertical),
+*Hebrew*,
+*Japanese*,
+*Japanese_vert* (Japanese vertical),
+*Kannada*,
+*Khmer*,
+*Lao*,
+*Latin*,
+*Malayalam*,
+*Myanmar*,
+*Oriya* (Odia),
+*Sinhala*,
+*Syriac*,
+*Tamil*,
+*Telugu*,
+*Thaana*,
+*Thai*,
+*Tibetan*,
+*Vietnamese*.
+
+The same languages and scripts are available from
+https://github.com/tesseract-ocr/tessdata_best.
+`tessdata_best` provides slow language and script models.
+These models are needed for training. They also can give better OCR results,
+but the recognition takes much more time.
+
+Both `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine.
+
+There is a third repository, https://github.com/tesseract-ocr/tessdata,
+with models which support both the Tesseract 3 legacy OCR engine and the
+Tesseract 4 LSTM OCR engine.
+
+
+CONFIG FILES AND AUGMENTING WITH USER DATA
+------------------------------------------
+
+Tesseract config files consist of lines with parameter-value pairs (space
+separated).  The parameters are documented as flags in the source code like
+the following one in tesseractclass.h:
+
+`STRING_VAR_H(tessedit_char_blacklist, "",
+             "Blacklist of chars not to recognize");`
+
+These parameters may enable or disable various features of the engine, and
+may cause it to load (or not load) various data.  For instance, let's suppose
+you want to OCR in English, but suppress the normal dictionary and load an
+alternative word list and an alternative list of patterns -- these two files
+are the most commonly used extra data files.
+
+If your language pack is in '/path/to/eng.traineddata' and the hocr config
+is in '/path/to/configs/hocr' then create three new files:
+
+'/path/to/eng.user-words':
+[verse]
+the
+quick
+brown
+fox
+jumped
+
+'/path/to/eng.user-patterns':
+[verse]
+1-\d\d\d-GOOG-411
+www.\n\\\*.com
+
+'/path/to/configs/bazaar':
+[verse]
+load_system_dawg     F
+load_freq_dawg       F
+user_words_suffix    user-words
+user_patterns_suffix user-patterns
+
+Now, if you pass the word 'bazaar' as a <<CONFIGFILE,'CONFIGFILE'>> to
+Tesseract, Tesseract will not bother loading the system dictionary nor
+the dictionary of frequent words and will load and use the 'eng.user-words'
+and 'eng.user-patterns' files you provided.  The former is a simple word list,
+one per line.  The format of the latter is documented in 'dict/trie.h'
+on 'read_pattern_list()'.
+
+
+ENVIRONMENT VARIABLES
+---------------------
+*`TESSDATA_PREFIX`*::
+  If the `TESSDATA_PREFIX` is set to a path, then that path is used to
+  find the `tessdata` directory with language and script recognition
+  models and config files.
+  Using <<TESSDATADIR,*--tessdata-dir* 'PATH'>> is the recommended alternative.
+*`OMP_THREAD_LIMIT`*::
+  If the `tesseract` executable was built with multithreading support,
+  it will normally use four CPU cores for the OCR process. While this
+  can be faster for a single image, it gives bad performance if the host
+  computer provides fewer than four CPU cores or if OCR is run on many images.
+  Only a single CPU core is used with `OMP_THREAD_LIMIT=1`.
+
+
+HISTORY
+-------
+The engine was developed at Hewlett Packard Laboratories Bristol and at
+Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
+changes made in 1996 to port to Windows, and some $$C++$$izing in 1998. A
+lot of the code was written in C, and then some more was written in $$C++$$.
+The $$C++$$ code makes heavy use of a list system using macros. This predates
+STL, was portable before STL, and is more efficient than STL lists, but has
+the big negative that if you do get a segmentation violation, it is hard to
+debug.
+
+Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
+to train Tesseract.
+
+Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
+See <https://github.com/tesseract-ocr/docs/blob/main/AT-1995.pdf>.
+Since Tesseract 2.00,
+scripts have been included to allow anyone to reproduce some of these tests.
+See <https://tesseract-ocr.github.io/tessdoc/TestingTesseract.html> for more
+details.
+
+Tesseract 3.00 added a number of new languages, including Chinese, Japanese,
+and Korean. It also introduced a new, single-file based system of managing
+language data.
+
+Tesseract 3.02 added BiDirectional text support, the ability to recognize
+multiple languages in a single image, and improved layout analysis.
+
+Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
+on line recognition, but also still supports the legacy Tesseract OCR engine of
+Tesseract 3 which works by recognizing character patterns. Compatibility with
+Tesseract 3 is enabled by `--oem 0`. This also needs traineddata files which
+support the legacy engine, for example those from the tessdata repository
+(https://github.com/tesseract-ocr/tessdata).
+
+For further details, see the release notes in the Tesseract documentation
+(<https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html>).
+
+
+RESOURCES
+---------
+Main web site: <https://github.com/tesseract-ocr> +
+User forum: <https://groups.google.com/g/tesseract-ocr> +
+Documentation: <https://tesseract-ocr.github.io/> +
+Information on training: <https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
+
+SEE ALSO
+--------
+ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1),
+shape_training(1), mftraining(1), unicharambigs(5), unicharset(5),
+unicharset_extractor(1), wordlist2dawg(1)
+
+AUTHOR
+------
+Tesseract development was led at Hewlett-Packard and Google by Ray Smith.
+The development team has included:
+
+Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger,
+Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke,
+Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle,
+Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
+Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
+Lloyd, Shobhit Saxena, and Thomas Kielbus.
+
+For a list of contributors see
+<https://github.com/tesseract-ocr/tesseract/blob/main/AUTHORS>.
+
+COPYING
+-------
+Licensed under the Apache License, Version 2.0