comparison mupdf-source/thirdparty/tesseract/doc/unicharset.5.asc @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| 1:1d09e1dec1d9 | 2:b50eed0cc0ef |
|---|---|
| 1 UNICHARSET(5) | |
| 2 ============= | |
| 3 :doctype: manpage | |
| 4 | |
| 5 NAME | |
| 6 ---- | |
| 7 unicharset - character properties file used by tesseract(1) | |
| 8 | |
| 9 DESCRIPTION | |
| 10 ----------- | |
| 11 Tesseract's unicharset file contains information on each symbol | |
| 12 (unichar) the Tesseract OCR engine is trained to recognize. | |
| 13 | |
| 14 A unicharset file (e.g. 'eng.unicharset') is distributed as part of a | |
| 15 Tesseract language pack (e.g. 'eng.traineddata'). For information on | |
| 16 extracting the unicharset file, see combine_tessdata(1). | |
| 17 | |
| 18 The first line of a unicharset file contains the number of unichars in | |
| 19 the file. After this line, each subsequent line provides information for | |
| 20 a single unichar. The first such line contains a placeholder reserved for | |
| 21 the space character. Each unichar is referred to within Tesseract by its | |
| 22 Unichar ID, which is the line number (minus 1) within the unicharset file. | |
| 23 Therefore, space gets unichar 0. | |
| 24 | |
| 25 Each unichar line in the unicharset file (v2+) may have four space-separated fields: | |
| 26 | |
| 27 'character' 'properties' 'script' 'id' | |
| 28 | |
| 29 Starting with Tesseract v3.02, more information may be given for each unichar: | |
| 30 | |
| 31 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form' | |
| 32 | |
| 33 Entries: | |
| 34 | |
| 35 'character':: The UTF-8 encoded string to be produced for this unichar. | |
| 36 'properties':: An integer mask of character properties, one per bit. | |
| 37 From least to most significant bit, these are: isalpha, islower, isupper, | |
| 38 isdigit, ispunctuation. | |
| 39 'glyph_metrics':: Ten comma-separated integers representing various standards | |
| 40 for where this glyph is to be found within a baseline-normalized coordinate | |
| 41 system where 128 is normalized to x-height. | |
| 42 * min_bottom, max_bottom: the range in which the bottom of the character | |
| 43 may be found. | |
| 44 * min_top, max_top: the range in which the top of the character may be found. | |
| 45 * min_width, max_width: horizontal width of the character. | |
| 46 * min_bearing, max_bearing: how far from the usual start position the | |
| 47 leftmost part of the character begins. | |
| 48 * min_advance, max_advance: how far from the left of the printer's cell we | |
| 49 advance to begin the next character. | |
| 50 'script':: Name of the script (Latin, Common, Greek, Cyrillic, Han, null). | |
| 51 'other_case':: The Unichar ID of the other case version of this character | |
| 52 (upper or lower). | |
| 53 'direction':: The Unicode BiDi direction of this character, as defined by | |
| 54 ICU's enum UCharDirection. (0 = Left to Right, 1 = Right to Left, | |
| 55 2 = European Number...) | |
| 56 'mirror':: The Unichar ID of the BiDirectional mirror of this character. | |
| 57 For example, the mirror of open paren is close paren, but Latin Capital C | |
| 58 has no mirror, so it remains a Latin Capital C. | |
| 59 'normed_form':: The UTF-8 representation of a "normalized form" of this unichar | |
| 60 for the purpose of blaming a module for errors given ground truth text. | |
| 61 For instance, a left or right single quote may normalize to an ASCII quote. | |
| 62 | |
| 63 | |
| 64 EXAMPLE (v2) | |
| 65 ------------ | |
| 66 .............. | |
| 67 ; 10 Common 46 | |
| 68 b 3 Latin 59 | |
| 69 W 5 Latin 40 | |
| 70 7 8 Common 66 | |
| 71 = 0 Common 93 | |
| 72 .............. | |
| 73 | |
| 74 ";" is a punctuation character. Its properties are thus represented by the | |
| 75 binary number 10000 (10 in hexadecimal). | |
| 76 | |
| 77 "b" is an alphabetic character and a lower case character. Its properties are | |
| 78 thus represented by the binary number 00011 (3 in hexadecimal). | |
| 79 | |
| 80 "W" is an alphabetic character and an upper case character. Its properties are | |
| 81 thus represented by the binary number 00101 (5 in hexadecimal). | |
| 82 | |
| 83 "7" is just a digit. Its properties are thus represented by the binary number | |
| 84 01000 (8 in hexadecimal). | |
| 85 | |
| 86 "=" is neither punctuation, nor a digit, nor an alphabetic character. Its properties | |
| 87 are thus represented by the binary number 00000 (0 in hexadecimal). | |
| 88 | |
| 89 Japanese or Chinese alphabetic character properties are represented by the | |
| 90 binary number 00001 (1 in hexadecimal): they are alphabetic, but neither | |
| 91 upper nor lower case. | |
| 92 | |
| 93 EXAMPLE (v3.02) | |
| 94 --------------- | |
| 95 .................................................................. | |
| 96 110 | |
| 97 NULL 0 NULL 0 | |
| 98 N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N | |
| 99 Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y | |
| 100 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1 | |
| 101 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9 | |
| 102 a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a | |
| 103 . . . | |
| 104 .................................................................. | |
| 105 | |
| 106 CAVEATS | |
| 107 ------- | |
| 108 Although the unicharset reader maintains the ability to read unicharsets | |
| 109 of older formats and will assign default values to missing fields, | |
| 110 the accuracy will be degraded. | |
| 111 | |
| 112 Further, most other data files are indexed by the unicharset file, | |
| 113 so changing it without re-generating the others is likely to have dire | |
| 114 consequences. | |
| 115 | |
| 116 HISTORY | |
| 117 ------- | |
| 118 The unicharset format first appeared with Tesseract 2.00, which was the | |
| 119 first version to support languages other than English. The unicharset file | |
| 120 contained only the first two fields, and the "ispunctuation" property was | |
| 121 absent (punctuation was regarded as "0", as "=" is in the above example). | |
| 122 | |
| 123 SEE ALSO | |
| 124 -------- | |
| 125 tesseract(1), combine_tessdata(1), unicharset_extractor(1) | |
| 126 | |
| 127 <https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html> | |
| 128 | |
| 129 | |
| 130 AUTHOR | |
| 131 ------ | |
| 132 The Tesseract OCR engine was written by Ray Smith and his research groups | |
| 133 at Hewlett Packard (1985-1995) and Google (2006-2018). |
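
The field layout described in the file above can be illustrated with a short Python sketch. This is not Tesseract's own reader, only a minimal illustration under stated assumptions: the function names `decode_properties` and `parse_unichar_line` are hypothetical, the properties mask is read as hexadecimal (as the v2 examples suggest), and only lines with exactly four fields (v2) or eight fields (v3.02) are accepted.

```python
from typing import Dict, List

# Bit names in the order given by the 'properties' entry,
# from least to most significant bit.
PROPERTY_BITS = ["isalpha", "islower", "isupper", "isdigit", "ispunctuation"]


def decode_properties(mask_text: str) -> List[str]:
    """Return the names of the property bits set in a hexadecimal mask."""
    mask = int(mask_text, 16)
    return [name for bit, name in enumerate(PROPERTY_BITS) if mask & (1 << bit)]


def parse_unichar_line(line: str) -> Dict[str, object]:
    """Parse one unichar line (hypothetical helper, not a Tesseract API).

    Assumes whitespace-separated fields: four for the v2 layout,
    eight for the v3.02 layout.
    """
    fields = line.split()
    if len(fields) == 4:
        # v2 layout: 'character' 'properties' 'script' 'id'
        character, properties, script, last = fields
        return {
            "character": character,
            "properties": decode_properties(properties),
            "script": script,
            "id": int(last),
        }
    if len(fields) == 8:
        # v3.02 layout: 'character' 'properties' 'glyph_metrics' 'script'
        # 'other_case' 'direction' 'mirror' 'normed_form'
        metrics = [int(value) for value in fields[2].split(",")]
        if len(metrics) != 10:
            raise ValueError("expected ten glyph metrics, got %d" % len(metrics))
        return {
            "character": fields[0],
            "properties": decode_properties(fields[1]),
            "glyph_metrics": metrics,
            "script": fields[3],
            "other_case": int(fields[4]),
            "direction": int(fields[5]),
            "mirror": int(fields[6]),
            "normed_form": fields[7],
        }
    raise ValueError("unsupported number of fields: %d" % len(fields))
```

For instance, `decode_properties("3")` yields `["isalpha", "islower"]`, which matches the description of "b" in the v2 example, and `parse_unichar_line` applied to the "N" line of the v3.02 example reports `other_case` 11 and `mirror` 1.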
