diff mupdf-source/thirdparty/tesseract/doc/unicharambigs.5.asc @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:43:07 +0200
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/mupdf-source/thirdparty/tesseract/doc/unicharambigs.5.asc	Mon Sep 15 11:43:07 2025 +0200
@@ -0,0 +1,89 @@
+UNICHARAMBIGS(5)
+================
+
+NAME
+----
+unicharambigs - Tesseract unicharset ambiguities
+
+DESCRIPTION
+-----------
+The unicharambigs file (a component of traineddata, see combine_tessdata(1) )
+is used by Tesseract to represent possible ambiguities between characters,
+or groups of characters.
+
+The file contains a number of lines, laid out as follow:
+
+...........................
+[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
+...........................
+
+[horizontal]
+Field one:: the number of characters contained in field two
+Field two:: the character sequence to be replaced
+Field three:: the number of characters contained in field four
+Field four:: the character sequence used to replace field two
+Field five:: contains either 1 or 0. 1 denotes a mandatory
+replacement, 0 denotes an optional replacement.
+
+Characters appearing in fields two and four should appear in
+unicharset. The numbers in fields one and three refer to the
+number of unichars (not bytes).
+
+EXAMPLE
+-------
+
+...............................
+v1
+2       ' '     1       "     1
+1       m       2       r n   0
+3       i i i   1       m     0
+...............................
+
+The first line is a version identifier.
+In this example, all instances of the '2' character sequence '''' will
+*always* be replaced by the '1' character sequence '"'; a '1' character
+sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
+the '3' character sequence *may* be replaced by the '1' character
+sequence 'm'.
+
+Version 3.03 and on supports a new, simpler format for the unicharambigs
+file:
+
+...............................
+v2
+'' " 1
+m rn 0
+iii m 0
+...............................
+
+In this format, the "error" and "correction" are simple UTF-8 strings
+separated by a space, and, after another space, the same type specifier
+as v1 (0 for optional and 1 for mandatory substitution). Note the downside
+of this simpler format is that Tesseract has to encode the UTF-8 strings
+into the components of the unicharset. In complex scripts, this encoding
+may be ambiguous. In this case, the encoding is chosen such as to use the
+least UTF-8 characters for each component, ie the shortest unicharset
+components will make up the encoding.
+
+HISTORY
+-------
+The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
+similar format, called DangAmbigs ('dangerous ambiguities') was used: the
+format was almost identical, except only mandatory replacements could be
+specified, and field 5 was absent.
+
+BUGS
+----
+This is a documentation "bug": it's not currently clear what should be done
+in the case of ligatures (such as 'fi') which may also appear as regular
+letters in the unicharset.
+
+SEE ALSO
+--------
+tesseract(1), unicharset(5)
+https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05.html#the-unicharambigs-file
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-2018).