Mercurial > hgrepos > Python2 > PyMuPDF
diff mupdf-source/thirdparty/tesseract/doc/unicharset.5.asc @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| parents | |
| children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/mupdf-source/thirdparty/tesseract/doc/unicharset.5.asc Mon Sep 15 11:43:07 2025 +0200 @@ -0,0 +1,133 @@ +UNICHARSET(5) +============= +:doctype: manpage + +NAME +---- +unicharset - character properties file used by tesseract(1) + +DESCRIPTION +----------- +Tesseract's unicharset file contains information on each symbol +(unichar) the Tesseract OCR engine is trained to recognize. + +A unicharset file (i.e. 'eng.unicharset') is distributed as part of a +Tesseract language pack (i.e. 'eng.traineddata'). For information on +extracting the unicharset file, see combine_tessdata(1). + +The first line of a unicharset file contains the number of unichars in +the file. After this line, each subsequent line provides information for +a single unichar. The first such line contains a placeholder reserved for +the space character. Each unichar is referred to within Tesseract by its +Unichar ID, which is the line number (minus 1) within the unicharset file. +Therefore, space gets unichar 0. + +Each unichar line in the unicharset file (v2+) may have four space-separated fields: + + 'character' 'properties' 'script' 'id' + +Starting with Tesseract v3.02, more information may be given for each unichar: + + 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form' + +Entries: + +'character':: The UTF-8 encoded string to be produced for this unichar. +'properties':: An integer mask of character properties, one per bit. + From least to most significant bit, these are: isalpha, islower, isupper, + isdigit, ispunctuation. +'glyph_metrics':: Ten comma-separated integers representing various standards + for where this glyph is to be found within a baseline-normalized coordinate + system where 128 is normalized to x-height. + * min_bottom, max_bottom: the ranges where the bottom of the character can + be found. + * min_top, max_top: the ranges where the top of the character may be found. + * min_width, max_width: horizontal width of the character. + * min_bearing, max_bearing: how far from the usual start position does the + leftmost part of the character begin. + * min_advance, max_advance: how far from the printer's cell left do we + advance to begin the next character. +'script':: Name of the script (Latin, Common, Greek, Cyrillic, Han, null). +'other_case':: The Unichar ID of the other case version of this character + (upper or lower). +'direction':: The Unicode BiDi direction of this character, as defined by + ICU's enum UCharDirection. (0 = Left to Right, 1 = Right to Left, + 2 = European Number...) +'mirror':: The Unichar ID of the BiDirectional mirror of this character. + For example the mirror of open paren is close paren, but Latin Capital C + has no mirror, so it remains a Latin Capital C. +'normed_form':: The UTF-8 representation of a "normalized form" of this unichar + for the purpose of blaming a module for errors given ground truth text. + For instance, a left or right single quote may normalize to an ASCII quote. + + +EXAMPLE (v2) +------------ +.............. +; 10 Common 46 +b 3 Latin 59 +W 5 Latin 40 +7 8 Common 66 += 0 Common 93 +.............. + +";" is a punctuation character. Its properties are thus represented by the +binary number 10000 (10 in hexadecimal). + +"b" is an alphabetic character and a lower case character. Its properties are +thus represented by the binary number 00011 (3 in hexadecimal). + +"W" is an alphabetic character and an upper case character. Its properties are +thus represented by the binary number 00101 (5 in hexadecimal). + +"7" is just a digit. Its properties are thus represented by the binary number +01000 (8 in hexadecimal). + +"=" is not punctuation nor a digit nor an alphabetic character. Its properties +are thus represented by the binary number 00000 (0 in hexadecimal). + +Japanese or Chinese alphabetic character properties are represented by the +binary number 00001 (1 in hexadecimal): they are alphabetic, but neither +upper nor lower case. + +EXAMPLE (v3.02) +--------------- +.................................................................. +110 +NULL 0 NULL 0 +N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N +Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y +1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1 +9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9 +a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a +. . . +.................................................................. + +CAVEATS +------- +Although the unicharset reader maintains the ability to read unicharsets +of older formats and will assign default values to missing fields, +the accuracy will be degraded. + +Further, most other data files are indexed by the unicharset file, +so changing it without re-generating the others is likely to have dire +consequences. + +HISTORY +------- +The unicharset format first appeared with Tesseract 2.00, which was the +first version to support languages other than English. The unicharset file +contained only the first two fields, and the "ispunctuation" property was +absent (punctuation was regarded as "0", as "=" is in the above example. + +SEE ALSO +-------- +tesseract(1), combine_tessdata(1), unicharset_extractor(1) + +<https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html> + + +AUTHOR +------ +The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-2018).
