comparison mupdf-source/thirdparty/tesseract/doc/unicharambigs.5.asc @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:43:07 +0200
1:1d09e1dec1d9 2:b50eed0cc0ef
1 UNICHARAMBIGS(5)
2 ================
3
4 NAME
5 ----
6 unicharambigs - Tesseract unicharset ambiguities
7
8 DESCRIPTION
9 -----------
10 The unicharambigs file (a component of traineddata, see combine_tessdata(1) )
11 is used by Tesseract to represent possible ambiguities between characters,
12 or groups of characters.
13
14 The file contains a number of lines, laid out as follows:
15
16 ...........................
17 [num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
18 ...........................
19
20 [horizontal]
21 Field one:: the number of characters contained in field two
22 Field two:: the character sequence to be replaced
23 Field three:: the number of characters contained in field four
24 Field four:: the character sequence used to replace field two
25 Field five:: contains either 1 or 0. 1 denotes a mandatory
26 replacement, 0 denotes an optional replacement.
27
28 Characters appearing in fields two and four should appear in
29 unicharset. The numbers in fields one and three refer to the
30 number of unichars (not bytes).
31
32 EXAMPLE
33 -------
34
35 ...............................
36 v1
37 2 ' ' 1 " 1
38 1 m 2 r n 0
39 3 i i i 1 m 0
40 ...............................
41
42 The first line is a version identifier.
43 In this example, all instances of the '2' character sequence '''' will
44 *always* be replaced by the '1' character sequence '"'; a '1' character
45 sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
46 the '3' character sequence 'iii' *may* be replaced by the '1' character
47 sequence 'm'.
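
A minimal sketch of how a single v1 line could be split into its five fields. It assumes, as the example above suggests, that the unichars inside fields two and four are separated by single spaces. The names `Ambig` and `parse_v1_line` are hypothetical and are not part of Tesseract.

```python
from typing import NamedTuple, Tuple


class Ambig(NamedTuple):
    """One ambiguity entry (hypothetical container, not a Tesseract type)."""
    wrong: Tuple[str, ...]   # field two, split into unichars
    right: Tuple[str, ...]   # field four, split into unichars
    mandatory: bool          # field five: True for 1, False for 0


def parse_v1_line(line: str) -> Ambig:
    """Parse one tab-separated v1 line (not the leading 'v1' version line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    n_wrong, wrong, n_right, right, kind = fields

    # Assumption: unichars within a field are separated by single spaces.
    wrong_units = tuple(wrong.split(" "))
    right_units = tuple(right.split(" "))

    # Fields one and three count unichars, not bytes.
    if len(wrong_units) != int(n_wrong):
        raise ValueError("field one does not match the length of field two")
    if len(right_units) != int(n_right):
        raise ValueError("field three does not match the length of field four")
    if kind not in ("0", "1"):
        raise ValueError("field five must be 0 or 1")

    return Ambig(wrong_units, right_units, kind == "1")
```

Checking the counts against the split fields catches lines whose tabs were converted to spaces, which is an easy mistake to make when editing the file by hand.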
48
49 Versions 3.03 and later support a new, simpler format for the unicharambigs
50 file:
51
52 ...............................
53 v2
54 '' " 1
55 m rn 0
56 iii m 0
57 ...............................
58
59 In this format, the "error" and "correction" are simple UTF-8 strings
60 separated by a space, and, after another space, the same type specifier
61 as v1 (0 for optional and 1 for mandatory substitution). Note the downside
62 of this simpler format is that Tesseract has to encode the UTF-8 strings
63 into the components of the unicharset. In complex scripts, this encoding
64 may be ambiguous. In this case, the encoding is chosen so as to use the
65 fewest UTF-8 characters for each component, i.e. the shortest unicharset
66 components will make up the encoding.
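
A minimal sketch of reading the v2 layout, assuming the "error" and "correction" strings themselves contain no whitespace. It only splits the lines; it does not attempt the encoding into unicharset components described above. The function name `parse_v2` is hypothetical.

```python
from typing import List, Tuple


def parse_v2(text: str) -> List[Tuple[str, str, bool]]:
    """Return (error, correction, mandatory) triples from v2 file contents."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].strip() != "v2":
        raise ValueError("missing 'v2' version identifier on the first line")

    entries = []
    for ln in lines[1:]:
        # Assumption: exactly three whitespace-separated tokens per line.
        parts = ln.split()
        if len(parts) != 3:
            raise ValueError(f"expected 3 fields, got {len(parts)}: {ln!r}")
        error, correction, kind = parts
        if kind not in ("0", "1"):
            raise ValueError(f"type specifier must be 0 or 1: {ln!r}")
        entries.append((error, correction, kind == "1"))
    return entries
```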
67
68 HISTORY
69 -------
70 The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
71 similar format, called DangAmbigs ('dangerous ambiguities') was used: the
72 format was almost identical, except only mandatory replacements could be
73 specified, and field 5 was absent.
74
75 BUGS
76 ----
77 This is a documentation "bug": it's not currently clear what should be done
78 in the case of ligatures (such as 'fi') which may also appear as regular
79 letters in the unicharset.
80
81 SEE ALSO
82 --------
83 tesseract(1), unicharset(5)
84 https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05.html#the-unicharambigs-file
85
86 AUTHOR
87 ------
88 The Tesseract OCR engine was written by Ray Smith and his research groups
89 at Hewlett Packard (1985-1995) and Google (2006-2018).