Python2/PyMuPDF: mupdf-source/thirdparty/tesseract/ChangeLog comparison

comparison mupdf-source/thirdparty/tesseract/ChangeLog @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.

author	Franz Glasner <fzglas.hg@dom66.de>
date	Mon, 15 Sep 2025 11:43:07 +0200
parents
children

comparison

equal deleted inserted replaced

-:1d09e1dec1d9
+:b50eed0cc0ef
+2024-11-10 - V5.5.0
+* Set hOCR capabilities ocrp_dir and ocrp_lang unconditionally.
+* Calculate row bounding box in single-word mode per (issue #4304).
+* Reduce clock syscalls (#4303).
+* Several small performance and other code fixes.
+* Modernized code.
+* Print time for tessedit_timing_debug in milliseconds.
+* Print time for ErrorCounter::ComputeErrorRate in milliseconds.
+* cmake: Correctly set the soversion based on SemVer properties.
+* Do not export PDBs for static libraries (issue #4279).
+* Several other small fixes and improvements for builds and CI.
+* Modernize code for renderers and remove filename conversion for Windows (#4330).
+* Add build rule for Windows installer.
+* Support symbolic values for --oem and --psm options.
+* Remove Tensorflow support.
+* Add RISC-V V support (#4346).
+* Remove broken GitHub action msys2-4.1.1.
+2024-06-11 - V5.4.1
+* Avoid FP overflow in NormEvidenceOf (fixes issue #4257) (#4259)
+* Small build fixes and code improvements (#4262, #4263, #4266, #4267)
+2024-06-06 - V5.4.0
+* Small build fixes and code improvements
+(#4241, #4243, #4244, #4245, #4246, #4248, #4249, #4250, #4253)
+2024-05-19 - V5.4.0-rc2
+* Fix setup of datadir on installations with Conda (issue #4230) (#4240)
+* Fix FP exception in Wordrec::angle_change (issue #4242) (#4243)
+2024-05-12 - V5.4.0-rc1
+* Build fixes, code refactoring and other smaller changes.
+* Fix grey result of indexed PNG in pdfrenderer.
+* Rename frk -> deu_latf (ISO 639-3, ISO 15924).
+* Remove broken Dockerfile.
+* Fixes for several issues reported by Coverity Scan.
+* Remove unsupported OpenCL code and related API functions (#4220).
+* Facilitate vectorization for generic build (#4223).
+* Add PAGE XML renderer / export (#4214).
+* Support training without lstmf files.
+* Improve CCUtil::main_setup (fixes issue #4230 related to Coda).
+* Allow for text angle/gradient to be retrieved (#4070).
+2024-01-18 - V5.3.4
+* Fixes for scrollview
+* Fixes for autoconf, clang and sw builds
+* Improve OCR for an image URL
+* Fail on curl download errors
+* New parameter curl_cookiefile
+* Set User-Agent: header field in HTTP request for curl downloads
+* Output directory list from "combine_tessdata -d" to stdout
+* Other small improvements for code and documentation.
+2023-10-05 - V5.3.3
+* Small code fixes and improvements to fix Coverity Scan issues.
+* Disable -mfpu=neon for aarch64.
+* Fix build without git clone in cloned directory (required for FreeBSD).
+* Other build fixes for autotools, cmake and sw.
+* Fix regression in layout detection which was introduced in release 5.0.0.
+* Fix regression which prevented loading of submodels, introduced in release 5.0.0-rc2.
+* Other small improvements for code and documentation.
+2023-07-11 - V5.3.2
+* Updates for snap package building.
+* Support for Sgaw and W Pwo Karen languages in the Myanmar validator (#4065).
+* Improve format of logging from lstmtraining.
+* Use less digits in filenames of checkpoints written by lstmtraining.
+* Replace deprecated sprintf.
+* Remove unused code in function fix_rep_char.
+* Avoid 32 bit overflow in multiplication (fixes 3 CodeQL CI alerts).
+* Avoid conversions from std::string to char* to std::string.
+* Abort with error message if OSD is requested with LSTM-only model.
+* cmake: allow to disable tiff (-DDISABLE_TIFF=ON).
+* cmake: provide info about disabled LibArchive and CURL.
+* cmake: check if leptonica was build with tiff support.
+* Remove old broken GitHub action vcpkg-4.1.1 (fixes issue #4078).
+* Create config.yml.
+* Fix typos.
+2023-04-01 - V5.3.1
+* Bug fixes for some special scenarios:
+* Fix issue #4010.
+* textord: Catch empty rows in block iterator (fixes #4039).
+* Fix FP division by zero (issue #3995).
+* Improve documentation and log messages.
+* Build fixes and improvements (mainly for cmake).
+2022-12-22 - V5.3.0
+* Minor updates for documentation and cmake builds.
+2022-12-13 - V5.3.0-rc1
+* Fix the training tools for the legacy OCR engine (fix issue #3925).
+* PDF renderer: Ignore non-text blocks (fix issue #3957).
+* Remove colormap before thresholding (fix issue #3940).
+* Fix a number of performance issues reported by Coverity Scan.
+* Training tools: Replace call of exit function by return statement in main function.
+* Fix double free in function vigorous_noise_removal (fix issue #3876).
+* Create to_win if needed in Textord::make_spline_rows (fix issue #3875).
+* Bug fixes for ScrollView viewer:
+* Fix memory issues in ScrollView::MessageReceiver.
+* Catch potential nullptr in SVNetwork::SVNetwork.
+* Move svpaint.cpp from src/viewer to src/.
+* Add rule for svpaint executable in Autotools.
+* Bug fixes and improvements for build tools:
+* Fix AMD64 detection with autobuild on FreeBSD (fix issue #3964).
+* Fix tesseract.pc generated from CMake to match Autotools.
+* Detect availability of AVX512-VNNI.
+* configure.ac: fix build on aarch64_be.
+* Drop CI for old versions of macOS and Ubuntu.
+2022-07-06 - V5.2.0
+* Improvements and fixes for continuous integration,
+autoconf and cmake builds.
+* Set /Os for some 32 bit MS compilers (fixes #3769).
+* Improve comments and other documentation.
+* Add initial support for Intel AVX512F.
+* Fix for very large PDF files on 32 bit hosts (fixes #3805).
+* Fix NEON detection on FreeBSD.
+* Fix regression with UZN files (fixes #3837).
+* Fix calling delete[] for memory allocated by malloc in C API.
+* Add an API function to init tesseract with traineddata from memory
+(fixes #3691).
+* Replace direct access to Leptonica internal data structures by
+function calls and support latest releases of Leptonica.
+* Replace std::regex by std::string functions (fixes issue #3830).
+* Use compiled-in TESSDATA_PREFIX also on Windows (fixes #3767).
+* Add new parameter 'invert_threshold', change the default threshold
+from 0.5 to 0.7 and mark parameter 'tessedit_do_invert' as deprecated.
+2022-03-01 - V5.1.0
+* Handle image and line regions in output formats ALTO, hOCR and text.
+* New parameter curl_timeout for curl_easy_setop.
+* Build fixes and improvements.
+* Catch nullptr in PageIterator::Orientation to improve robustness.
+* Remove unused code.
+2022-01-06 - V5.0.1
+* Add SPDX-License-Identifier to public include files.
+* Support redirections when running OCR on a URL.
+* Lots of fixes and improvements for cmake builds.
+Distributions should use the autoconf build.
+* Fix broken msys2 build with gcc 11.
+* Fix parameter certainty_scale (was duplicated).
+* Fix some compiler warnings and clean code.
+* Correctly detect amd64 and i386 on FreeBSD.
+* Add libarchive and libcurl in continuous integration actions.
+* Update submodule googletest to release v1.11.0.
+2021-11-22 - V5.0.0
+* Faster training and recognition by default (float instead of
+double calculations)
+* More options for binarization
+* Improved support for ARM NEON
+* Modernized code
+* Removed proprietary data types like GenericVector and STRING
+from public API
+* pdf.ttf no longer needed, now integrated into the code
+* Faster flat build with automake
+* New options for combine_tessdata to show details of traineddata files
+* Improved training messages
+* Improved unit tests and fuzzing tests
+* Lots of bug fixes
+2021-11-15 - V4.1.3
+* Fix build regression for autoconf build
+2021-11-14 - V4.1.2
+* Add RowAttributes getter to PageIterator
+* Allow line images with larger width for training
+* Fix memory leaks
+* Improve build process
+* Don't output empty ALTO sourceImageInformation (issue #2700)
+* Extend URI support for Tesseract with libcurl
+* Abort LSTM training with integer model (fixes issue #1573)
+* Update documentation
+* Make automake builds less noisy by default
+* Don't use -march=native in automake builds
+2019-12-26 - V4.1.1
+* Implemented sw build (cppan is depreciated)
+* Improved cmake build
+* Code cleanup and optimization
+* A lot of bug fixes...
+2019-07-07 - V4.1.0
+* Added new renders Alto, LSTMBox, WordStrBox.
+* Added character boxes in hOCR output.
+* Added python training scripts (experimental) as alternative shell scripts.
+* Better support AVX / AVX2 / SSE.
+* Disable OpenMP support by default (see e.g. #1171, #1081).
+* Fix for bounding box problem.
+* Implemented support for whitelist/blacklist in LSTM engine.
+* Improved cmake configuration.
+* Code modernization and improvements.
+* A lot of bug fixes...
+2018-10-29 - V4.0.0
+* Added new neural network system based on LSTMs, with major accuracy gains.
+* Improvements to PDF rendering.
+* Fixes to trainingdata rendering.
+* Added LSTM models+lang models to 101 languages. (tessdata repository)
+* Improved multi-page TIFF handling.
+* Fixed damage to binary images when processing PDFs.
+* Fixes to training process to allow incremental training from a recognition model.
+* Made LSTM the default engine, pushed cube out.
+* Deleted cube code.
+* Changed OEModes --oem 0 for legacy tesseract engine, --oem 1 for LSTM, --oem 2 for both, --oem 3 for default.
+* Avoid use of Leptonica debug parameters or functions.
+* Fixed multi-language mode.
+* Removed support for VS2010.
+* Added Support for VS2015 and VS2017 with CPPAN.
+* Implemented invisible text only for PDF.
+* Added AVX / SSE support for Windows.
+* Enabled OpenMP support.
+* Parameter unlv_tilde_crunching change to false.
+* Miscellaneous Fixes.
+* Detailed Changelog can be found at https://tesseract-ocr.github.io/tessdoc/4.0x-Changelog.html and https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html#tesseract-release-notes-oct-29-2018---v400
+2017-02-16 - V3.05.00
+* Made some fine tuning to the hOCR output.
+* Added TSV as another optional output format.
+* Fixed ABI break introduced in 3.04.00 with the AnalyseLayout() method.
+* text2image tool - Enable all OpenType ligatures available in a font. This feature requires Pango 1.38 or newer.
+* Training tools - Replaced asserts with tprintf() and exit(1).
+* Fixed Cygwin compatibility.
+* Improved multipage tiff processing.
+* Improved the embedded pdf font (pdf.ttf).
+* Enable selection of OCR engine mode from command line.
+* Changed tesseract command line parameter '-psm' to '--psm'.
+* Write output of tesseract --help, --version and --list-langs to stdout instead of stderr.
+* Added new C API for orientation and script detection, removed the old one.
+* Increased minimum autoconf version to 2.59.
+* Removed dead code.
+* Require Leptonica 1.74 or higher.
+* Fixed many compiler warning.
+* Fixed memory and resource leaks.
+* Fixed some issues with the 'Cube' OCR engine.
+* Fixed some openCL issues.
+* Added option to build Tesseract with CMake build system.
+* Implemented CPPAN support for easy Windows building.
+2016-02-17 - V3.04.01
+* Added OSD renderer for psm 0. Works for single page and multi-page images.
+* Improve tesstrain.sh script.
+* Simplify build and run of ScrollView.
+* Improved PDF output for OS X Preview utility.
+* INCOMPATIBLE fix to hOCR line height information - commit 134ebc3.
+* Added option to build Tesseract without Cube OCR engine (-DNO_CUBE_BUILD).
+* Enable OpenMP support.
+* Many bug fixes.
+2015-07-11 - V3.04.00
+* Tesseract development is now done with Git and hosted at github.com (Previously we used Subversion as a VCS and code.google.com for hosting).
+* Tesseract now requires leptonica 1.71 or a higher version.
+* Removed official support for VS 2008.
+* Added support for 39 additional scripts/languages, including: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid
+* Major updates to training system as a result of extensive testing on 100 languages.
+* New training data for over 100 languages
+* Improved performance with PIC compilation option.
+* Significant change to invisible font system in pdf output to improve correctness and compatibility with external programs, particularly ghostscript.
+* Improved font identification.
+* Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc.
+* Fixed problems with shifted baselines so recognition can recover from layout analysis errors.
+* Major refactor to improve speed on difficult images, especially when running a heap checker.
+* Moved params from global in page layout to tesseractclass.
+* Improved single column layout analysis.
+* Allow ocr output to multiple formats using tesseract command line executable.
+* Fixed issues with mixed eng+ara scripts.
+* Improved script consistency in numbers.
+* Major refactor of control.cpp to enable line recognition.
+* Added tesstrain.sh - a master training script.
+* Added ability to text2image training tool to just list available fonts.
+* Added ability to text2image to underline words.
+* Improved efficiency of image processing for PDF output.
+* Added parameter description for each parameter listed with 'print-parameters' command line option.
+* Added font info to hOCR output.
+* Enabled streaming input and output of multi-page documents.
+* Many bug fixes.
+2014-02-04 - V3.03(rc1)
+* Added new training tool text2image to generate box/tif file pairs from
+text and truetype fonts.
+* Added support for PDF output with searchable text.
+* Removed entire IMAGE class and all code in image directory.
+* Tesseract executable: support for output to stdout; limited support for one
+page images from stdin  (especially on Windows)
+* Added Renderer to API to allow document-level processing and output
+of document formats, like hOCR, PDF.
+* Major refactor of word-level recognition, beam search, eliminating dead code.
+* Refactored classifier to make it easier to add new ones.
+* Generalized feature extractor to allow feature extraction from greyscale.
+* Improved sub/superscript treatment.
+* Improved baseline fit.
+* Added set_unicharset_properties to training tools.
+* Many bug fixes.
+* More training source data included.
+2012-02-01 - V3.02
+* Moved ResultIterator/PageIterator to ccmain.
+* Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
+* Added paragraph detection in layout analysis/post OCR.
+* Fixed inconsistent xheight during training and over-chopping.
+* Added simultaneous multi-language capability.
+* Refactored top-level word recognition module.
+* Added experimental equation detector.
+* Improved handling of resolution from input images.
+* Blamer module added for error analysis.
+* Cleaned up externally used namespace by removing includes from baseapi.h.
+* Removed dead memory mangagement code.
+* Tidied up constraints on control parameters.
+* Added support for ShapeTable in classifier and training.
+* Refactored class pruner.
+* Fixed training leaks and randomness.
+* Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
+* Improved line detection and removal.
+* Added fixed pitch chopper for CJK.
+* Added UNICHARSET to WERD_CHOICE to make mult-language handling easier.
+* Fixed problems with internally scaled images.
+* Added page and bbox to string in tr files to identify source of training data better.
+* Fixes to Hindi Shiroreka splitter.
+* Added word bigram correction.
+* Reduced stack memory consumption and eliminated some ugly typedefs.
+* Added new uniform classifier API.
+* Added new training error counter.
+* Fixed endian bug in dawg reader.
+* Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so.
+2010-11-29 - V3.01
+* Removed old/dead serialise/deserialize methods on *LISTIZED classes.
+* Total rewrite of DENORM to better encapsulate operation and make
+for potential to extract features from images.
+* Thread-safety! Moved all critical global and static variables to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads.) with the minor exception that some control parameters are still global and affect all threads.
+* Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages with an improvement in accuracy at the cost of (much) lower speed. *There is no training module for Cube yet.*
+* `OcrEngineMode` in `Init` replaces `AccuracyVSpeed` to control cube.
+* Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese.
+* Added `PageIterator` and `ResultIterator` as cleaner ways to get the full results out of Tesseract, that are not currently provided by any of the `TessBaseAPI::Get*` methods. All other methods, such as the `ETEXT_STRUCT` in particular are deprecated and will be deleted in the future.
+* ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, BUT to use that you have to have already bootstrapped the language with character boxes. "Cyclic dependency" on traineddata.
+* Auto orientation and script detection added to page layout analysis.
+* Deleted *lots* of dead code.
+* Fixxht module replaced with scalable data-driven module.
+* Output font characteristics accuracy improved.
+* Removed the double conversion at each classification.
+* Upgraded oldest structs to be classes and deprecated PBLOB.
+* Removed non-deterministic baseline fit.
+* Added fixed length dawgs for Chinese.
+* Handling of vertical text improved.
+* Handling of leader dots improved.
+* Table detection greatly improved.
+* Fixed a couple of memory leaks.
+* Fixed font labels on output text. (Not perfect, but a lot better than before.)
+* Cleanup and more bug fixes
+* Special treatments for Hindi.
+* Support for build in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz)
+2010-09-21 - V3.00
+* Preparations for thread safety:
+* Changed TessBaseAPI methods to be non-static
+* Created a class hierarchy for the directories to hold instance data,
+and began moving code into the classes.
+* Moved thresholding code to a separate class.
+* Added major new page layout analysis module.
+* Added HOCR output (issues 221, 263: thanks to amkryukov).
+* Added Leptonica as main image I/O and handling. Currently optional,
+but in future releases linking with Leptonica will be mandatory.
+* Ambiguity table rewritten to allow definite replacements in place
+of fix_quotes.
+* Added TessdataManager to combine data files into a single file.
+* Some dead code deleted.
+* VC++6 no longer supported. It can't cope with the use of templates.
+* Many more languages added.
+* Doxygenation of most of the function header comments.
+* Added man pages.
+* Added bash completion script (issue 247: thanks to neskiem)
+* Fix integer overview in thresholding (issue 366: thanks to Cyanide.Drake)
+* Add Danish Fraktur support (issues 300, 360: thanks to
+dsl602230@vip.cybercity.dk)
+* Fix file pointer leak (issue 359, thanks to yukihiro.nakadaira)
+* Fix an error using user-words (Issue 345: thanks to max.markin)
+* Fix a memory leak in tablefind.cpp (Issue 342, thanks to zdravco)
+* Fix a segfault due to double fclose (Issue 320, thanks to souther)
+* Fix an automake error (Issue 318, thanks to ichanjz)
+* Fix a Win32 crash on fileFormatIsTiff() (Issues 304, 316, 317, 330, 347,
+349, 352: thanks to nguyenq87, max.markin, zdenop)
+* Fixed a number of errors in newer (stricter) versions of VC++ (Issues
+301, among others)
+2009-06-30 - V2.04
+* Integrated bug fixes and patches and misc changes for portability.
+* Integrated a patch to remove some of the "access" macros.
+* Removed dependence on lua from the viewer, speeding it up
+dramatically.
+* Fixed the viewer so it compiles and runs properly!
+* Specifically fixing issues: 1, 63, 67, 71, 76, 81, 82, 106, 111,
+112, 128, 129, 130, 133, 135, 142, 143, 145, 147, 153, 154, 160,
+165, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209, 108, 169
+2008-04-22 - V2.03
+* Fixed crash introduced in 2.02.
+* Fixed lack of tessembedded.cpp in distribution.
+* Added test for leptonica header files and conditional test for lib.
+2008-04-21 - V2.02 (again)
+* Fixed namespace collisions with jpeg library (INT32).
+* Portability fixes for Windows for new code.
+* Updates to autoconf system for new code.
+2008-01-23 - V2.02
+* Improvements to clustering, training and classifier.
+* Major internationalization improvements for large-character-set
+* languages, eg Kannada.
+* Removed some compiler warnings.
+* Added multipage tiff support for training and running.
+* Updated graphics output to talk to new java-based viewer.
+* Added ability to save n-best lists.
+* Added leptonica support for more file types.
+* Improved Init/End to make them safe.
+* Reduced memory use of dictionaries.
+* Added some new APIs to TessBaseAPI.
+2007-08-27 - V2.01
+* Fixed UTF8 input problems with box file reader.
+* Fixed various infinite loops and crashes in dawg code.
+* Removed include of config_auto.h from host.h.
+* Added automatic wctype encoding to unicharset_extractor.
+* Fixed dawg table too full error.
+* Removed svn files from tarball.
+* Added new functions to tessdll.
+* Increased maximum utf8 string in a classification result to 8.
+2007-07-02 - V2.00
+* Converted internal character handling to UTF8.
+* Trained with 6 languages.
+* Added unicharset_extractor, wordlist2dawg.
+* Added boxfile creation mode.
+* Added UNLV regression test capability.
+* Fixed problems with copyright and registered symbols.
+* Fixed extern "C" declarations problem.
+2007-05-15 - V1.04
+* Added dll exports for Windows.
+* Fixed name collisions with stl etc.
+* Made some preliminary changes ready for unicodeization.
+* Several bug fixes discovered during unicodeization.
+2007-02-02 - V1.03
+* Added mftraining and cntraining.
+* Added baseapi with adaptive thresholding for grey and color.
+* Fixed many memory leaks.
+* Fixed several bugs including lack of use of adaptive classifier.
+* Added ifdefs to eliminate graphics code and add embedded platform support.
+* Incorporated several patches, including 64-bit builds, Mac builds.
+* Minor accuracy improvements.
+2006-10-04 - V1.02
+* Removed dependency on Aspirin.
+* Fixed a few missing Apache license headers.
+* Removed $log.
+2006-09-07 - V1.01.
+* Added mfcpch.cpp and getopt.cpp for VC++.
+* Fixed problem with greyscale images and no libtiff.
+* Stopped debug window from being used for the usage output.
+* Fixed load of inttemp for big-endian architectures.
+* Fixed some Mac compilation issues.
+2006-06-16 - V1.0 of open source Tesseract checked-in.

Mercurial > hgrepos > Python2 > PyMuPDF

comparison mupdf-source/thirdparty/tesseract/ChangeLog @ 2:b50eed0cc0ef upstream