diff mupdf-source/thirdparty/tesseract/doc/unicharambigs.5.asc @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| parents | |
| children | |
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/mupdf-source/thirdparty/tesseract/doc/unicharambigs.5.asc	Mon Sep 15 11:43:07 2025 +0200
@@ -0,0 +1,89 @@
+UNICHARAMBIGS(5)
+================
+
+NAME
+----
+unicharambigs - Tesseract unicharset ambiguities
+
+DESCRIPTION
+-----------
+The unicharambigs file (a component of traineddata, see combine_tessdata(1) )
+is used by Tesseract to represent possible ambiguities between characters,
+or groups of characters.
+
+The file contains a number of lines, laid out as follow:
+
+...........................
+[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
+...........................
+
+[horizontal]
+Field one:: the number of characters contained in field two
+Field two:: the character sequence to be replaced
+Field three:: the number of characters contained in field four
+Field four:: the character sequence used to replace field two
+Field five:: contains either 1 or 0. 1 denotes a mandatory
+replacement, 0 denotes an optional replacement.
+
+Characters appearing in fields two and four should appear in
+unicharset. The numbers in fields one and three refer to the
+number of unichars (not bytes).
+
+EXAMPLE
+-------
+
+...............................
+v1
+2 ' ' 1 " 1
+1 m 2 r n 0
+3 i i i 1 m 0
+...............................
+
+The first line is a version identifier.
+In this example, all instances of the '2' character sequence '''' will
+*always* be replaced by the '1' character sequence '"'; a '1' character
+sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
+the '3' character sequence *may* be replaced by the '1' character
+sequence 'm'.
+
+Version 3.03 and on supports a new, simpler format for the unicharambigs
+file:
+
+...............................
+v2
+'' " 1
+m rn 0
+iii m 0
+...............................
+
+In this format, the "error" and "correction" are simple UTF-8 strings
+separated by a space, and, after another space, the same type specifier
+as v1 (0 for optional and 1 for mandatory substitution). Note the downside
+of this simpler format is that Tesseract has to encode the UTF-8 strings
+into the components of the unicharset. In complex scripts, this encoding
+may be ambiguous. In this case, the encoding is chosen such as to use the
+least UTF-8 characters for each component, ie the shortest unicharset
+components will make up the encoding.
+
+HISTORY
+-------
+The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
+similar format, called DangAmbigs ('dangerous ambiguities') was used: the
+format was almost identical, except only mandatory replacements could be
+specified, and field 5 was absent.
+
+BUGS
+----
+This is a documentation "bug": it's not currently clear what should be done
+in the case of ligatures (such as 'fi') which may also appear as regular
+letters in the unicharset.
+
+SEE ALSO
+--------
+tesseract(1), unicharset(5)
+https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05.html#the-unicharambigs-file
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-2018).
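
The two layouts described in the added man page can be illustrated with a minimal Python sketch. This is not Tesseract code: `Ambig` and `parse_unicharambigs` are hypothetical names introduced here. The sketch assumes that v1 fields are separated by tabs with the unichars inside fields two and four separated by single spaces, and that v2 fields are separated by single spaces. It does not attempt the v2 step of encoding the UTF-8 strings into unicharset components.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ambig:
    """One ambiguity rule (hypothetical container, not a Tesseract type)."""
    error: List[str]       # v1: list of unichars; v2: one raw UTF-8 string
    correction: List[str]  # same convention as `error`
    mandatory: bool        # True for type 1, False for type 0


def parse_unicharambigs(text: str) -> List[Ambig]:
    """Parse unicharambigs content whose first line is 'v1' or 'v2'."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    version = lines[0].strip()
    if version not in ("v1", "v2"):
        raise ValueError(f"unknown version identifier: {version!r}")

    rules = []
    for ln in lines[1:]:
        if version == "v1":
            # Assumed layout: num <TAB> chars <TAB> num <TAB> chars <TAB> type
            n_err, err, n_cor, cor, kind = ln.split("\t")
            error, correction = err.split(" "), cor.split(" ")
            # Fields one and three count unichars, not bytes.
            if len(error) != int(n_err) or len(correction) != int(n_cor):
                raise ValueError(f"unichar count mismatch in line: {ln!r}")
        else:
            # Assumed layout: error <SPACE> correction <SPACE> type
            err, cor, kind = ln.split(" ")
            error, correction = [err], [cor]
        if kind not in ("0", "1"):
            raise ValueError(f"type must be 0 or 1 in line: {ln!r}")
        rules.append(Ambig(error, correction, kind == "1"))
    return rules


# Usage: the v1 example from the man page, written with explicit tabs.
SAMPLE_V1 = "v1\n2\t' '\t1\t\"\t1\n1\tm\t2\tr n\t0\n3\ti i i\t1\tm\t0\n"
for rule in parse_unicharambigs(SAMPLE_V1):
    print(rule)
```

Keeping v2 entries as single raw strings is a deliberate simplification: as the man page notes, splitting them into unichars needs the unicharset and may be ambiguous in complex scripts.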
