comparison mupdf-source/thirdparty/tesseract/doc/unicharambigs.5.asc @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| 1:1d09e1dec1d9 | 2:b50eed0cc0ef |
|---|---|
| 1 UNICHARAMBIGS(5) | |
| 2 ================ | |
| 3 | |
| 4 NAME | |
| 5 ---- | |
| 6 unicharambigs - Tesseract unicharset ambiguities | |
| 7 | |
| 8 DESCRIPTION | |
| 9 ----------- | |
| 10 The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) | |
| 11 is used by Tesseract to represent possible ambiguities between characters, | |
| 12 or groups of characters. | |
| 13 | |
| 14 The file contains a number of lines, laid out as follows: | |
| 15 | |
| 16 ........................... | |
| 17 [num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num] | |
| 18 ........................... | |
| 19 | |
| 20 [horizontal] | |
| 21 Field one:: the number of characters contained in field two | |
| 22 Field two:: the character sequence to be replaced | |
| 23 Field three:: the number of characters contained in field four | |
| 24 Field four:: the character sequence used to replace field two | |
| 25 Field five:: contains either 1 or 0. 1 denotes a mandatory | |
| 26 replacement, 0 denotes an optional replacement. | |
| 27 | |
| 28 Characters appearing in fields two and four should appear in | |
| 29 unicharset. The numbers in fields one and three refer to the | |
| 30 number of unichars (not bytes). | |
| 31 | |
| 32 EXAMPLE | |
| 33 ------- | |
| 34 | |
| 35 ............................... | |
| 36 v1 | |
| 37 2 ' ' 1 " 1 | |
| 38 1 m 2 r n 0 | |
| 39 3 i i i 1 m 0 | |
| 40 ............................... | |
| 41 | |
| 42 The first line is a version identifier. | |
| 43 In this example, all instances of the '2' character sequence '''' will | |
| 44 *always* be replaced by the '1' character sequence '"'; a '1' character | |
| 45 sequence 'm' *may* be replaced by the '2' character sequence 'rn', and | |
| 46 the '3' character sequence 'iii' *may* be replaced by the '1' character | |
| 47 sequence 'm'. | |
| 48 | |
| 49 Versions 3.03 and later support a new, simpler format for the unicharambigs | |
| 50 file: | |
| 51 | |
| 52 ............................... | |
| 53 v2 | |
| 54 '' " 1 | |
| 55 m rn 0 | |
| 56 iii m 0 | |
| 57 ............................... | |
| 58 | |
| 59 In this format, the "error" and "correction" are simple UTF-8 strings | |
| 60 separated by a space, and, after another space, the same type specifier | |
| 61 as v1 (0 for optional and 1 for mandatory substitution). Note the downside | |
| 62 of this simpler format is that Tesseract has to encode the UTF-8 strings | |
| 63 into the components of the unicharset. In complex scripts, this encoding | |
| 64 may be ambiguous. In this case, the encoding is chosen so as to use the | |
| 65 fewest UTF-8 characters for each component, i.e. the shortest unicharset | |
| 66 components will make up the encoding. | |
| 67 | |
| 68 HISTORY | |
| 69 ------- | |
| 70 The unicharambigs file first appeared in Tesseract 3.00; prior to that, a | |
| 71 similar format, called DangAmbigs ('dangerous ambiguities'), was used: the | |
| 72 format was almost identical, except only mandatory replacements could be | |
| 73 specified, and field 5 was absent. | |
| 74 | |
| 75 BUGS | |
| 76 ---- | |
| 77 This is a documentation "bug": it's not currently clear what should be done | |
| 78 in the case of ligatures (such as 'fi') which may also appear as regular | |
| 79 letters in the unicharset. | |
| 80 | |
| 81 SEE ALSO | |
| 82 -------- | |
| 83 tesseract(1), unicharset(5) | |
| 84 https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05.html#the-unicharambigs-file | |
| 85 | |
| 86 AUTHOR | |
| 87 ------ | |
| 88 The Tesseract OCR engine was written by Ray Smith and his research groups | |
| 89 at Hewlett Packard (1985-1995) and Google (2006-2018). |
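
The v1 and v2 line layouts described in the DESCRIPTION and EXAMPLE sections above can be illustrated with a small reader. This is a minimal sketch, not Tesseract's own parser: the names `Ambig`, `parse_v1_line` and `parse_v2_line` are hypothetical, and it assumes that v1 fields are separated by a single TAB with the unichars inside a field separated by single spaces, and that v2 fields contain no internal whitespace. It does not consult a unicharset, so v2 strings are returned undivided.

```python
from typing import NamedTuple, Tuple


class Ambig(NamedTuple):
    """One v1 ambiguity entry (hypothetical helper type)."""
    source: Tuple[str, ...]   # unichars to be replaced (field two)
    target: Tuple[str, ...]   # replacement unichars (field four)
    mandatory: bool           # True if field five is 1


def _parse_type(kind: str) -> bool:
    # Field five / the v2 type specifier: 1 = mandatory, 0 = optional.
    if kind not in ("0", "1"):
        raise ValueError(f"type must be 0 or 1, got {kind!r}")
    return kind == "1"


def parse_v1_line(line: str) -> Ambig:
    """Parse '[num] TAB [char(s)] TAB [num] TAB [char(s)] TAB [num]'."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 TAB-separated fields, got {len(fields)}")
    n_src, src, n_dst, dst, kind = fields
    # Assumption: unichars within a field are separated by single spaces.
    source = tuple(src.split(" "))
    target = tuple(dst.split(" "))
    # Fields one and three count unichars, not bytes.
    if len(source) != int(n_src) or len(target) != int(n_dst):
        raise ValueError("unichar count does not match the declared number")
    return Ambig(source, target, _parse_type(kind))


def parse_v2_line(line: str) -> Tuple[str, str, bool]:
    """Parse 'error correction type'; the strings are left unsegmented."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}")
    error, correction, kind = parts
    return error, correction, _parse_type(kind)


if __name__ == "__main__":
    # The three v1 entries from the EXAMPLE section, with explicit TABs.
    for raw in ("2\t' '\t1\t\"\t1", "1\tm\t2\tr n\t0", "3\ti i i\t1\tm\t0"):
        print(parse_v1_line(raw))
    # One v2 entry from the same section.
    print(parse_v2_line("iii m 0"))
```

A real reader would also skip the leading version line (`v1` or `v2`) and, for v2, map each string onto unicharset components, which this sketch leaves out.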
