comparison mupdf-source/thirdparty/tesseract/doc/unicharset.5.asc @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:43:07 +0200
1:1d09e1dec1d9 2:b50eed0cc0ef
1 UNICHARSET(5)
2 =============
3 :doctype: manpage
4
5 NAME
6 ----
7 unicharset - character properties file used by tesseract(1)
8
9 DESCRIPTION
10 -----------
11 Tesseract's unicharset file contains information on each symbol
12 (unichar) the Tesseract OCR engine is trained to recognize.
13
14 A unicharset file (e.g. 'eng.unicharset') is distributed as part of a
15 Tesseract language pack (e.g. 'eng.traineddata'). For information on
16 extracting the unicharset file, see combine_tessdata(1).
17
18 The first line of a unicharset file contains the number of unichars in
19 the file. After this line, each subsequent line provides information for
20 a single unichar. The first such line contains a placeholder reserved for
21 the space character. Each unichar is referred to within Tesseract by its
22 Unichar ID, which is its line number within the unicharset file minus 1,
23 counting the first (count) line as line 0. Therefore, space gets Unichar ID 0.
24
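The layout described above (a count line followed by one line per unichar) can be sketched in Python. This is only an illustration, not Tesseract's own reader: the function name `read_unichar_ids` is hypothetical, and the sample text is a shortened copy of the v3.02 example in this page with the count adjusted to match.

```python
def read_unichar_ids(text):
    """Map each unichar string to its Unichar ID (illustrative sketch)."""
    lines = text.splitlines()
    count = int(lines[0].strip())      # first line: number of unichars
    entries = lines[1:1 + count]       # following lines: one unichar each
    ids = {}
    for unichar_id, line in enumerate(entries):
        character = line.split()[0]    # first field is the UTF-8 string
        ids[character] = unichar_id
    return ids


# Shortened sample: the count is set to 3 to match the lines kept here.
sample = (
    "3\n"
    "NULL 0 NULL 0\n"
    "N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N\n"
    "Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y\n"
)
ids = read_unichar_ids(sample)
```

Here the placeholder on the first unichar line receives ID 0, and the entries after it are numbered in file order.
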
25 Each unichar line in the unicharset file (v2+) may have four space-separated fields:
26
27 'character' 'properties' 'script' 'id'
28
29 Starting with Tesseract v3.02, more information may be given for each unichar:
30
31 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
32
33 Entries:
34
35 'character':: The UTF-8 encoded string to be produced for this unichar.
36 'properties':: An integer mask of character properties, one per bit, written
37 in hexadecimal. From least to most significant bit, these are: isalpha,
38 islower, isupper, isdigit, ispunctuation.
39 'glyph_metrics':: Ten comma-separated integers giving the expected ranges
40 for where this glyph is found within a baseline-normalized coordinate
41 system in which the x-height is normalized to 128.
42 * min_bottom, max_bottom: the range in which the bottom of the character can
43 be found.
44 * min_top, max_top: the range in which the top of the character can be found.
45 * min_width, max_width: the range of the character's horizontal width.
46 * min_bearing, max_bearing: the range of distances from the usual start
47 position to the leftmost part of the character.
48 * min_advance, max_advance: the range of distances from the left of the
49 printer's cell to the start of the next character.
50 'script':: Name of the script (Latin, Common, Greek, Cyrillic, Han, null).
51 'other_case':: The Unichar ID of the other case version of this character
52 (upper or lower).
53 'direction':: The Unicode BiDi direction of this character, as defined by
54 ICU's enum UCharDirection. (0 = Left to Right, 1 = Right to Left,
55 2 = European Number...)
56 'mirror':: The Unichar ID of the BiDirectional mirror of this character.
57 For example the mirror of open paren is close paren, but Latin Capital C
58 has no mirror, so it remains a Latin Capital C.
59 'normed_form':: The UTF-8 representation of a "normalized form" of this unichar
60 for the purpose of blaming a module for errors given ground truth text.
61 For instance, a left or right single quote may normalize to an ASCII quote.
62
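As a rough illustration of the eight-field layout listed above, the following Python sketch splits one v3.02 line into named fields. The names `Unichar`, `METRIC_NAMES`, and `parse_unichar_line` are hypothetical and not part of Tesseract. The sketch assumes that fields are separated by whitespace and that 'properties' is written in hexadecimal, as the v2 example in this page describes. The input line is taken from the v3.02 example in this page.

```python
from dataclasses import dataclass

# Order of the ten glyph_metrics values, as listed in the field description.
METRIC_NAMES = (
    "min_bottom", "max_bottom", "min_top", "max_top", "min_width",
    "max_width", "min_bearing", "max_bearing", "min_advance", "max_advance",
)


@dataclass
class Unichar:
    character: str
    properties: int
    glyph_metrics: dict
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: str


def parse_unichar_line(line):
    """Parse one eight-field unichar line (illustrative sketch)."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    character, props, metrics, script, other_case, direction, mirror, normed = fields
    values = [int(v) for v in metrics.split(",")]
    if len(values) != len(METRIC_NAMES):
        raise ValueError(f"expected 10 glyph metrics, got {len(values)}")
    return Unichar(
        character=character,
        properties=int(props, 16),     # assumption: hexadecimal bit mask
        glyph_metrics=dict(zip(METRIC_NAMES, values)),
        script=script,
        other_case=int(other_case),    # Unichar ID of the other-case form
        direction=int(direction),      # ICU UCharDirection value
        mirror=int(mirror),            # Unichar ID of the BiDi mirror
        normed_form=normed,
    )


entry = parse_unichar_line("N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N")
```

For this line, `entry.mirror` equals the character's own Unichar ID (1), which matches the rule that a character without a mirror maps to itself.
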
63
64 EXAMPLE (v2)
65 ------------
66 ..............
67 ; 10 Common 46
68 b 3 Latin 59
69 W 5 Latin 40
70 7 8 Common 66
71 = 0 Common 93
72 ..............
73
74 ";" is a punctuation character. Its properties are thus represented by the
75 binary number 10000 (10 in hexadecimal).
76
77 "b" is an alphabetic character and a lower case character. Its properties are
78 thus represented by the binary number 00011 (3 in hexadecimal).
79
80 "W" is an alphabetic character and an upper case character. Its properties are
81 thus represented by the binary number 00101 (5 in hexadecimal).
82
83 "7" is just a digit. Its properties are thus represented by the binary number
84 01000 (8 in hexadecimal).
85
86 "=" is neither punctuation, nor a digit, nor an alphabetic character. Its properties
87 are thus represented by the binary number 00000 (0 in hexadecimal).
88
89 Japanese or Chinese alphabetic character properties are represented by the
90 binary number 00001 (1 in hexadecimal): they are alphabetic, but neither
91 upper nor lower case.
92
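The bit arithmetic in the explanations above can be checked with a short Python sketch. The names `PROPERTY_BITS` and `decode_properties` are hypothetical. The bit order follows the 'properties' field description, and the field is read as hexadecimal, as stated in the explanations.

```python
# Bit names from least to most significant, per the 'properties' description.
PROPERTY_BITS = ("isalpha", "islower", "isupper", "isdigit", "ispunctuation")


def decode_properties(field):
    """Return the set of property names encoded in a hexadecimal mask string."""
    mask = int(field, 16)
    return {name for bit, name in enumerate(PROPERTY_BITS) if mask & (1 << bit)}


semicolon = decode_properties("10")   # the ";" line
lower_b = decode_properties("3")      # the "b" line
upper_w = decode_properties("5")      # the "W" line
seven = decode_properties("8")        # the "7" line
equals = decode_properties("0")       # the "=" line
```

Decoding "10" yields only ispunctuation, while "3" yields isalpha and islower, in line with the prose above.
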
93 EXAMPLE (v3.02)
94 ---------------
95 ..................................................................
96 110
97 NULL 0 NULL 0
98 N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
99 Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
100 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
101 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
102 a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
103 . . .
104 ..................................................................
105
106 CAVEATS
107 -------
108 Although the unicharset reader maintains the ability to read unicharsets
109 of older formats and will assign default values to missing fields,
110 recognition accuracy will be degraded.
111
112 Further, most other data files are indexed by the unicharset file,
113 so changing it without re-generating the others is likely to have dire
114 consequences.
115
116 HISTORY
117 -------
118 The unicharset format first appeared with Tesseract 2.00, which was the
119 first version to support languages other than English. The unicharset file
120 contained only the first two fields, and the "ispunctuation" property was
121 absent (punctuation was regarded as "0", as "=" is in the above example).
122
123 SEE ALSO
124 --------
125 tesseract(1), combine_tessdata(1), unicharset_extractor(1)
126
127 <https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
128
129
130 AUTHOR
131 ------
132 The Tesseract OCR engine was written by Ray Smith and his research groups
133 at Hewlett Packard (1985-1995) and Google (2006-2018).