comparison mupdf-source/thirdparty/tesseract/README.md @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| parents | |
| children | |
| 1:1d09e1dec1d9 | 2:b50eed0cc0ef |
|---|---|
| 1 # Tesseract OCR | |
| 2 | |
| 3 [](https://scan.coverity.com/projects/tesseract-ocr) | |
| 4 [](https://github.com/tesseract-ocr/tesseract/security/code-scanning) | |
| 5 [](https://issues.oss-fuzz.com/issues?q=is:open%20title:tesseract-ocr) | |
| 6 \ | |
| 7 [](https://raw.githubusercontent.com/tesseract-ocr/tesseract/main/LICENSE) | |
| 8 [](https://github.com/tesseract-ocr/tesseract/releases/) | |
| 9 | |
| 10 ## Table of Contents | |
| 11 | |
| 12 * [Tesseract OCR](#tesseract-ocr) | |
| 13 * [About](#about) | |
| 14 * [Brief history](#brief-history) | |
| 15 * [Installing Tesseract](#installing-tesseract) | |
| 16 * [Running Tesseract](#running-tesseract) | |
| 17 * [For developers](#for-developers) | |
| 18 * [Support](#support) | |
| 19 * [License](#license) | |
| 20 * [Dependencies](#dependencies) | |
| 21 * [Latest Version of README](#latest-version-of-readme) | |
| 22 | |
| 23 ## About | |
| 24 | |
| 25 This package contains an **OCR engine** - `libtesseract` and a **command line program** - `tesseract`. | |
| 26 | |
| 27 Tesseract 4 adds a new neural net (LSTM) based [OCR engine](https://en.wikipedia.org/wiki/Optical_character_recognition) which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). | |
| 28 It also needs [traineddata](https://tesseract-ocr.github.io/tessdoc/Data-Files.html) files which support the legacy engine, for example those from the [tessdata](https://github.com/tesseract-ocr/tessdata) repository. | |
| 29 | |
| 30 Stefan Weil is the current lead developer. Ray Smith was the lead developer until 2018. The maintainer is Zdenko Podobny. For a list of contributors see [AUTHORS](https://github.com/tesseract-ocr/tesseract/blob/main/AUTHORS) | |
| 31 and GitHub's log of [contributors](https://github.com/tesseract-ocr/tesseract/graphs/contributors). | |
| 32 | |
| 33 Tesseract has **unicode (UTF-8) support**, and can **recognize [more than 100 languages](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html)** "out of the box". | |
| 34 | |
| 35 Tesseract supports **[various image formats](https://tesseract-ocr.github.io/tessdoc/InputFormats)** including PNG, JPEG and TIFF. | |
| 36 | |
| 37 Tesseract supports **various output formats**: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV, ALTO and PAGE. | |
| 38 | |
| 39 You should note that in many cases, in order to get better OCR results, you'll need to **[improve the quality](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html) of the image** you are giving Tesseract. | |
| 40 | |
| 41 This project **does not include a GUI application**. If you need one, please see the [3rdParty](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html) documentation. | |
| 42 | |
| 43 Tesseract **can be trained to recognize other languages**. | |
| 44 See [Tesseract Training](https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html) for more information. | |
| 45 | |
| 46 ## Brief history | |
| 47 | |
| 48 Tesseract was originally developed at Hewlett-Packard Laboratories Bristol UK and at Hewlett-Packard Co, Greeley Colorado USA between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google. | |
| 49 | |
| 50 Major version 5 is the current stable version and started with release | |
| 51 [5.0.0](https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0) on November 30, 2021. Newer minor versions and bugfix versions are available from | |
| 52 [GitHub](https://github.com/tesseract-ocr/tesseract/releases/). | |
| 53 | |
| 54 Latest source code is available from [main branch on GitHub](https://github.com/tesseract-ocr/tesseract/tree/main). | |
| 55 Open issues can be found in [issue tracker](https://github.com/tesseract-ocr/tesseract/issues), | |
| 56 and [planning documentation](https://tesseract-ocr.github.io/tessdoc/Planning.html). | |
| 57 | |
| 58 See **[Release Notes](https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html)** | |
| 59 and **[Change Log](https://github.com/tesseract-ocr/tesseract/blob/main/ChangeLog)** for more details of the releases. | |
| 60 | |
| 61 ## Installing Tesseract | |
| 62 | |
| 63 You can either [Install Tesseract via pre-built binary package](https://tesseract-ocr.github.io/tessdoc/Installation.html) | |
| 64 or [build it from source](https://tesseract-ocr.github.io/tessdoc/Compiling.html). | |
| 65 | |
| 66 Before building Tesseract from source, please check that your system has a compiler which is one of the [supported compilers](https://tesseract-ocr.github.io/tessdoc/supported-compilers.html). | |
| 67 | |
| 68 ## Running Tesseract | |
| 69 | |
| 70 Basic **[command line usage](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html)**: | |
| 71 | |
| 72 tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...] | |
| 73 | |
| 74 For more information about the various command line options use `tesseract --help` or `man tesseract`. | |
| 75 | |
| 76 Examples can be found in the [documentation](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html#simplest-invocation-to-ocr-an-image). | |
| 77 | |
| 78 ## For developers | |
| 79 | |
| 80 Developers can use `libtesseract` [C](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/capi.h) or | |
| 81 [C++](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h) API to build their own application. If you need bindings to `libtesseract` for other programming languages, please see the | |
| 82 [wrapper](https://tesseract-ocr.github.io/tessdoc/AddOns.html#tesseract-wrappers) section in the AddOns documentation. | |
| 83 | |
| 84 Documentation of Tesseract generated from source code by doxygen can be found on [tesseract-ocr.github.io](https://tesseract-ocr.github.io/). | |
| 85 | |
| 86 ## Support | |
| 87 | |
| 88 Before you submit an issue, please review **[the guidelines for this repository](https://github.com/tesseract-ocr/tesseract/blob/main/CONTRIBUTING.md)**. | |
| 89 | |
| 90 For support, first read the [documentation](https://tesseract-ocr.github.io/tessdoc/), | |
| 91 particularly the [FAQ](https://tesseract-ocr.github.io/tessdoc/FAQ.html) to see if your problem is addressed there. | |
| 92 If not, search the [Tesseract user forum](https://groups.google.com/g/tesseract-ocr), the [Tesseract developer forum](https://groups.google.com/g/tesseract-dev) and [past issues](https://github.com/tesseract-ocr/tesseract/issues), and if you still can't find what you need, ask for support in the mailing-lists. | |
| 93 | |
| 94 Mailing-lists: | |
| 95 | |
| 96 * [tesseract-ocr](https://groups.google.com/g/tesseract-ocr) - For tesseract users. | |
| 97 * [tesseract-dev](https://groups.google.com/g/tesseract-dev) - For tesseract developers. | |
| 98 | |
| 99 Please report an issue only for a **bug**, not for asking questions. | |
| 100 | |
| 101 ## License | |
| 102 | |
| 103 The code in this repository is licensed under the Apache License, Version 2.0 (the "License"); | |
| 104 you may not use this file except in compliance with the License. | |
| 105 You may obtain a copy of the License at | |
| 106 | |
| 107 http://www.apache.org/licenses/LICENSE-2.0 | |
| 108 | |
| 109 Unless required by applicable law or agreed to in writing, software | |
| 110 distributed under the License is distributed on an "AS IS" BASIS, | |
| 111 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| 112 See the License for the specific language governing permissions and | |
| 113 limitations under the License. | |
| 114 | |
| 115 **NOTE**: This software depends on other packages that may be licensed under different open source licenses. | |
| 116 | |
| 117 Tesseract uses [Leptonica library](http://leptonica.com/) which essentially | |
| 118 uses a [BSD 2-clause license](http://leptonica.com/about-the-license.html). | |
| 119 | |
| 120 ## Dependencies | |
| 121 | |
| 122 Tesseract uses [Leptonica library](https://github.com/DanBloomberg/leptonica) | |
| 123 for opening input images (e.g. not documents like pdf). | |
| 124 It is suggested to use leptonica with built-in support for [zlib](https://zlib.net), | |
| 125 [png](https://sourceforge.net/projects/libpng) and | |
| 126 [tiff](http://www.simplesystems.org/libtiff) (for multipage tiff). | |
| 127 | |
| 128 ## Latest Version of README | |
| 129 | |
| 130 For the latest online version of the README.md see: | |
| 131 | |
| 132 <https://github.com/tesseract-ocr/tesseract/blob/main/README.md> |
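
The basic command line shown in the "Running Tesseract" section of the README can be assembled programmatically. The following is a minimal sketch, not part of the Tesseract distribution: the helper name `build_tesseract_command` and the file names in the usage comment are hypothetical, and only the options that appear in the README's usage line (`-l`, `--oem`, `--psm`, trailing config files) are handled. It builds the argument list only and leaves execution to the caller.

```python
from typing import List, Optional, Sequence


def build_tesseract_command(
    imagename: str,
    outputbase: str,
    lang: Optional[str] = None,
    oem: Optional[int] = None,
    psm: Optional[int] = None,
    configfiles: Sequence[str] = (),
) -> List[str]:
    """Build an argument list following the README usage line:

    tesseract imagename outputbase [-l lang] [--oem ocrenginemode]
              [--psm pagesegmode] [configfiles...]

    Hypothetical helper: it only assembles arguments and does not check
    that the values are accepted by the installed tesseract binary.
    """
    cmd = ["tesseract", imagename, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if oem is not None:
        # The README states that --oem 0 selects the Legacy OCR Engine mode.
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Config files come last, as in the usage line.
    cmd += list(configfiles)
    return cmd


# Example (file names are placeholders); to run it, pass the list to
# subprocess.run(cmd, check=True) on a system where tesseract is installed.
cmd = build_tesseract_command("page.png", "page_out", lang="eng", oem=0)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.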
