comparison mupdf-source/thirdparty/tesseract/doc/wordlist2dawg.1.asc @ 3:2c135c81b16c

MERGE: upstream PyMuPDF 1.26.4 with MuPDF 1.26.7
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:44:09 +0200
parents b50eed0cc0ef
0:6015a75abc2d 3:2c135c81b16c
1 WORDLIST2DAWG(1)
2 ================
3 :doctype: manpage
4
5 NAME
6 ----
7 wordlist2dawg - convert a wordlist to a DAWG for Tesseract
8
9 SYNOPSIS
10 --------
11 *wordlist2dawg* 'WORDLIST' 'DAWG' 'lang.unicharset'
12
13 *wordlist2dawg* -t 'WORDLIST' 'DAWG' 'lang.unicharset'
14
15 *wordlist2dawg* -r 1 'WORDLIST' 'DAWG' 'lang.unicharset'
16
17 *wordlist2dawg* -r 2 'WORDLIST' 'DAWG' 'lang.unicharset'
18
19 *wordlist2dawg* -l <short> <long> 'WORDLIST' 'DAWG' 'lang.unicharset'
20
21 DESCRIPTION
22 -----------
23 wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
24 (DAWG) for use with Tesseract. A DAWG is a compressed, space- and
25 time-efficient representation of a word list.
26
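To make "compressed representation of a word list" concrete, here is a minimal Python sketch of the general DAWG idea: build a trie, then merge nodes that accept the same set of suffixes. It is an illustration only and assumes nothing about Tesseract's actual in-memory or on-disk format; the names `build_trie`, `minimize`, and `dawg_contains` are hypothetical.

```python
def build_trie(words):
    """Build a plain trie; each node is {'edges': {char: node}, 'final': bool}."""
    root = {"edges": {}, "final": False}
    for word in words:
        node = root
        for ch in word:
            node = node["edges"].setdefault(ch, {"edges": {}, "final": False})
        node["final"] = True
    return root


def count_trie_nodes(node):
    """Count every node in the (unshared) trie."""
    return 1 + sum(count_trie_nodes(child) for child in node["edges"].values())


def minimize(node, registry):
    """Assign one id per distinct suffix set, bottom-up, and return this node's id.

    Two trie nodes get the same id when they have the same 'final' flag and the
    same labelled edges to the same ids, i.e. they accept the same suffixes.
    """
    edges = tuple(sorted((ch, minimize(child, registry))
                         for ch, child in node["edges"].items()))
    signature = (node["final"], edges)
    return registry.setdefault(signature, len(registry))


def dawg_contains(nodes, root_id, word):
    """Walk the merged graph; the word is present if we end on a final node."""
    node_id = root_id
    for ch in word:
        edges = dict(nodes[node_id][1])
        if ch not in edges:
            return False
        node_id = edges[ch]
    return nodes[node_id][0]


words = ["cat", "cats", "hat", "hats"]
trie = build_trie(words)
registry = {}
root_id = minimize(trie, registry)
nodes = {node_id: signature for signature, node_id in registry.items()}
print(count_trie_nodes(trie), "trie nodes ->", len(registry), "DAWG nodes")
```

For this four-word list, the shared suffixes "at" and "ats" are stored once, so the graph needs fewer nodes than the trie while still answering the same membership queries.
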
27 OPTIONS
28 -------
29 -t
30 Verify that a given DAWG file is equivalent to a given wordlist.
31
32 -r 1
33 Reverse a word if it contains an RTL character.
34
35 -r 2
36 Reverse all words.
37
38 -l <short> <long>
39 Produce a file with several DAWGs in it, one for each word
40 length from <short> to <long>, inclusive.
41
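The effect of the `-r` and `-l` options on the input words can be sketched in Python as follows. This is an approximation under stated assumptions: RTL detection here uses the Unicode bidirectional classes `R` and `AL` from the standard `unicodedata` module, and reversal is done per code point, whereas the real tool works with the supplied unicharset. The function names and the policy value `0` (meaning "no reversal") are hypothetical.

```python
import unicodedata

# Assumption for this sketch: a character counts as RTL if its Unicode
# bidirectional class is R (e.g. Hebrew) or AL (e.g. Arabic letters).
RTL_BIDI_CLASSES = {"R", "AL"}


def has_rtl_char(word):
    """Return True if any character of the word is right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_BIDI_CLASSES for ch in word)


def apply_reverse_policy(words, policy):
    """Mimic '-r 1' (reverse words containing RTL) and '-r 2' (reverse all).

    policy 0 leaves every word untouched (hypothetical value for this sketch).
    """
    if policy == 2:
        return [word[::-1] for word in words]
    if policy == 1:
        return [word[::-1] if has_rtl_char(word) else word for word in words]
    return list(words)


def group_by_length(words, short, long):
    """Mimic the grouping behind '-l <short> <long>': one bucket per word length."""
    return {n: [w for w in words if len(w) == n] for n in range(short, long + 1)}


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # a four-letter Hebrew word
print(apply_reverse_policy(["abc", hebrew], 1))
print(group_by_length(["a", "ab", "abc", "abcd", "xy"], 2, 3))
```

Words outside the `<short>`..`<long>` range simply fall into no bucket in this sketch; what the real tool does with them is not stated above.
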
42 ARGUMENTS
43 ---------
44
45 'WORDLIST'
46 A plain text file in UTF-8, one word per line.
47
48 'DAWG'
49 The output DAWG to write.
50
51 'lang.unicharset'
52 The unicharset of the language. This is the unicharset
53 generated by mftraining(1).
54
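A small Python sketch of preparing the three arguments: it writes a UTF-8 wordlist with one word per line and assembles the argument vector in the order shown in the SYNOPSIS. Deduplicating and sorting the words is a choice made for this sketch, not a documented requirement, and the file names `eng.wordlist`, `eng.word-dawg`, and `eng.unicharset` are hypothetical. The command is only built, not executed.

```python
import tempfile
from pathlib import Path


def write_wordlist(words, path):
    """Write unique, non-empty words as UTF-8 text, one word per line."""
    cleaned = sorted({w.strip() for w in words if w.strip()})
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in cleaned:
            handle.write(word + "\n")
    return cleaned


def build_command(wordlist, dawg, unicharset, reverse=None):
    """Assemble argv as in the SYNOPSIS: [-r N] WORDLIST DAWG lang.unicharset."""
    command = ["wordlist2dawg"]
    if reverse is not None:
        command += ["-r", str(reverse)]
    command += [str(wordlist), str(dawg), str(unicharset)]
    return command


with tempfile.TemporaryDirectory() as tmp:
    wordlist_path = Path(tmp) / "eng.wordlist"
    written = write_wordlist(["hats", "cat", "", "cat ", "hat"], wordlist_path)
    on_disk = wordlist_path.read_text(encoding="utf-8")

command = build_command("eng.wordlist", "eng.word-dawg", "eng.unicharset", reverse=1)
print(command)
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a system where `wordlist2dawg` is installed.
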
55 SEE ALSO
56 --------
57 tesseract(1), combine_tessdata(1), dawg2wordlist(1)
58
59 <https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
60
61 COPYING
62 -------
63 Copyright \(C) 2006 Google, Inc.
64 Licensed under the Apache License, Version 2.0
65
66 AUTHOR
67 ------
68 The Tesseract OCR engine was written by Ray Smith and his research groups
69 at Hewlett Packard (1985-1995) and Google (2006-2018).