comparison mupdf-source/thirdparty/tesseract/doc/wordlist2dawg.1.asc @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:43:07 +0200
1:1d09e1dec1d9 2:b50eed0cc0ef
1 WORDLIST2DAWG(1)
2 ================
3 :doctype: manpage
4
5 NAME
6 ----
7 wordlist2dawg - convert a wordlist to a DAWG for Tesseract
8
9 SYNOPSIS
10 --------
11 *wordlist2dawg* 'WORDLIST' 'DAWG' 'lang.unicharset'
12
13 *wordlist2dawg* -t 'WORDLIST' 'DAWG' 'lang.unicharset'
14
15 *wordlist2dawg* -r 1 'WORDLIST' 'DAWG' 'lang.unicharset'
16
17 *wordlist2dawg* -r 2 'WORDLIST' 'DAWG' 'lang.unicharset'
18
19 *wordlist2dawg* -l <short> <long> 'WORDLIST' 'DAWG' 'lang.unicharset'
20
21 DESCRIPTION
22 -----------
23 wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
24 (DAWG) for use with Tesseract. A DAWG is a compressed, space- and
25 time-efficient representation of a word list.
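
To illustrate why a DAWG is more compact than a plain word list, the following Python sketch builds a trie and then merges identical subtrees so that shared suffixes are stored only once. It is a minimal illustration, not Tesseract's implementation or file format; the names `Node`, `build_trie`, `minimize`, `count_nodes`, and `contains` are hypothetical and exist only in this example.

```python
from typing import Dict, Iterable, Tuple


class Node:
    """One state of the word graph (hypothetical helper for this sketch)."""

    def __init__(self) -> None:
        self.edges: Dict[str, "Node"] = {}  # character -> next state
        self.is_word = False                # True if a word ends here


def build_trie(words: Iterable[str]) -> Node:
    """Insert every word character by character; prefixes are shared."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.is_word = True
    return root


def minimize(node: Node, registry: Dict[Tuple, Node]) -> Node:
    """Merge identical subtrees bottom-up so suffixes are shared as well."""
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    # Children are already canonical, so their identities describe the subtree.
    key = (
        node.is_word,
        tuple(sorted((ch, id(child)) for ch, child in node.edges.items())),
    )
    return registry.setdefault(key, node)


def count_nodes(root: Node) -> int:
    """Count distinct states reachable from the root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.edges.values())
    return len(seen)


def contains(root: Node, word: str) -> bool:
    """Look a word up by following one edge per character."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.is_word
```

For the words tap, taps, top and tops, the trie has 8 states while the merged graph has 5, because the endings "p" and "ps" are stored once and reached from both "ta" and "to". Lookup still takes one step per character.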
26
27 OPTIONS
28 -------
29 -t
30 Verify that a given dawg file is equivalent to a given wordlist.
31
32 -r 1
33 Reverse a word if it contains an RTL character.
34
35 -r 2
36 Reverse all words.
37
38 -l <short> <long>
39 Produce a file with several dawgs in it, one each for words
40 of length <short>, <short+1>, ..., <long>.
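
The effect of the -r and -l options on the input words can be sketched in Python as follows. This is an approximation under stated assumptions: it detects RTL characters with the Unicode bidirectional classes `R` and `AL` and it reverses and measures words by code point, whereas the tool works with the entries of the unicharset. The function names and the policy value 0 (meaning "no -r given") are hypothetical.

```python
import unicodedata
from typing import Dict, Iterable, List

# Assumption: strong right-to-left characters are those in bidi class R or AL.
RTL_CLASSES = {"R", "AL"}


def has_rtl(word: str) -> bool:
    """True if any character of the word is a strong RTL character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in word)


def apply_reverse_policy(word: str, policy: int) -> str:
    """Mimic -r: 1 reverses words containing an RTL character, 2 reverses all.

    Policy 0 is a hypothetical value meaning that -r was not given.
    """
    if policy == 2 or (policy == 1 and has_rtl(word)):
        return word[::-1]  # code point reversal, a simplification
    return word


def group_by_length(words: Iterable[str], short: int, long: int) -> Dict[int, List[str]]:
    """Mimic -l: one bucket per word length from short to long inclusive."""
    groups: Dict[int, List[str]] = {n: [] for n in range(short, long + 1)}
    for word in words:
        if short <= len(word) <= long:
            groups[len(word)].append(word)
    return groups
```

With policy 1, a Latin word such as "abc" is left unchanged while a Hebrew or Arabic word is reversed; with policy 2, every word is reversed. Each bucket returned by `group_by_length` corresponds to the word set of one of the per-length dawgs.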
41
42 ARGUMENTS
43 ---------
44
45 'WORDLIST'
46 A plain text file in UTF-8, one word per line.
47
48 'DAWG'
49 The output DAWG to write. With -t, the existing DAWG to verify against 'WORDLIST'.
50
51 'lang.unicharset'
52 The unicharset of the language. This is the unicharset
53 generated by mftraining(1).
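
A small Python sketch of preparing the arguments described above: it produces a UTF-8 word list with one word per line and assembles the argument vector for the forms shown in the SYNOPSIS. The helper names and the file names used in the usage note are hypothetical. Dropping blank lines and duplicates is a convenience of this sketch, not a documented requirement of the tool.

```python
import subprocess
from typing import Iterable, List, Optional


def wordlist_text(words: Iterable[str]) -> str:
    """Return the word list content: one word per line, no blanks or repeats."""
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            lines.append(word)
    return "".join(line + "\n" for line in lines)


def write_wordlist(words: Iterable[str], path: str) -> None:
    """Write the word list as plain UTF-8 text with LF line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(wordlist_text(words))


def build_command(wordlist: str, dawg: str, unicharset: str,
                  verify: bool = False,
                  reverse: Optional[int] = None) -> List[str]:
    """Assemble an argument vector matching one of the SYNOPSIS forms."""
    if verify and reverse is not None:
        # The SYNOPSIS lists -t and -r as separate forms, so this sketch
        # does not combine them.
        raise ValueError("use either verify or reverse, not both")
    cmd = ["wordlist2dawg"]
    if verify:
        cmd.append("-t")
    if reverse is not None:
        if reverse not in (1, 2):
            raise ValueError("reverse must be 1 or 2")
        cmd += ["-r", str(reverse)]
    cmd += [wordlist, dawg, unicharset]
    return cmd


def run(cmd: List[str]) -> int:
    """Run the command and return its exit status (requires the tool on PATH)."""
    return subprocess.run(cmd).returncode
```

For example, `build_command("words.txt", "eng.word-dawg", "eng.unicharset")` yields the plain conversion form, and passing `verify=True` yields the -t form that checks an existing DAWG against the word list.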
54
55 SEE ALSO
56 --------
57 tesseract(1), combine_tessdata(1), dawg2wordlist(1)
58
59 <https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
60
61 COPYING
62 -------
63 Copyright \(C) 2006 Google, Inc.
64 Licensed under the Apache License, Version 2.0
65
66 AUTHOR
67 ------
68 The Tesseract OCR engine was written by Ray Smith and his research groups
69 at Hewlett Packard (1985-1995) and Google (2006-2018).