comparison mupdf-source/thirdparty/tesseract/ChangeLog @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:43:07 +0200
1:1d09e1dec1d9 2:b50eed0cc0ef
1 2024-11-10 - V5.5.0
2 * Set hOCR capabilities ocrp_dir and ocrp_lang unconditionally.
3 * Calculate row bounding box in single-word mode (issue #4304).
4 * Reduce clock syscalls (#4303).
5 * Several small performance and other code fixes.
6 * Modernized code.
7 * Print time for tessedit_timing_debug in milliseconds.
8 * Print time for ErrorCounter::ComputeErrorRate in milliseconds.
9 * cmake: Correctly set the soversion based on SemVer properties.
10 * Do not export PDBs for static libraries (issue #4279).
11 * Several other small fixes and improvements for builds and CI.
12 * Modernize code for renderers and remove filename conversion for Windows (#4330).
13 * Add build rule for Windows installer.
14 * Support symbolic values for --oem and --psm options.
15 * Remove Tensorflow support.
16 * Add RISC-V V support (#4346).
17 * Remove broken GitHub action msys2-4.1.1.
18
19 2024-06-11 - V5.4.1
20 * Avoid FP overflow in NormEvidenceOf (fixes issue #4257) (#4259)
21 * Small build fixes and code improvements (#4262, #4263, #4266, #4267)
22
23 2024-06-06 - V5.4.0
24 * Small build fixes and code improvements
25 (#4241, #4243, #4244, #4245, #4246, #4248, #4249, #4250, #4253)
26
27 2024-05-19 - V5.4.0-rc2
28 * Fix setup of datadir on installations with Conda (issue #4230) (#4240)
29 * Fix FP exception in Wordrec::angle_change (issue #4242) (#4243)
30
31 2024-05-12 - V5.4.0-rc1
32 * Build fixes, code refactoring and other smaller changes.
33 * Fix grey result of indexed PNG in pdfrenderer.
34 * Rename frk -> deu_latf (ISO 639-3, ISO 15924).
35 * Remove broken Dockerfile.
36 * Fixes for several issues reported by Coverity Scan.
37 * Remove unsupported OpenCL code and related API functions (#4220).
38 * Facilitate vectorization for generic build (#4223).
39 * Add PAGE XML renderer / export (#4214).
40 * Support training without lstmf files.
41 * Improve CCUtil::main_setup (fixes issue #4230 related to Conda).
42 * Allow for text angle/gradient to be retrieved (#4070).
43
44 2024-01-18 - V5.3.4
45 * Fixes for scrollview
46 * Fixes for autoconf, clang and sw builds
47 * Improve OCR for an image URL
48 * Fail on curl download errors
49 * New parameter curl_cookiefile
50 * Set User-Agent: header field in HTTP request for curl downloads
51 * Output directory list from "combine_tessdata -d" to stdout
52 * Other small improvements for code and documentation.
53
54 2023-10-05 - V5.3.3
55 * Small code fixes and improvements to fix Coverity Scan issues.
56 * Disable -mfpu=neon for aarch64.
57 * Fix build without git clone in cloned directory (required for FreeBSD).
58 * Other build fixes for autotools, cmake and sw.
59 * Fix regression in layout detection which was introduced in release 5.0.0.
60 * Fix regression which prevented loading of submodels, introduced in release 5.0.0-rc2.
61 * Other small improvements for code and documentation.
62
63 2023-07-11 - V5.3.2
64 * Updates for snap package building.
65 * Support for Sgaw and W Pwo Karen languages in the Myanmar validator (#4065).
66 * Improve format of logging from lstmtraining.
67 * Use fewer digits in filenames of checkpoints written by lstmtraining.
68 * Replace deprecated sprintf.
69 * Remove unused code in function fix_rep_char.
70 * Avoid 32 bit overflow in multiplication (fixes 3 CodeQL CI alerts).
71 * Avoid conversions from std::string to char* to std::string.
72 * Abort with error message if OSD is requested with LSTM-only model.
73 * cmake: allow to disable tiff (-DDISABLE_TIFF=ON).
74 * cmake: provide info about disabled LibArchive and CURL.
75 * cmake: check if leptonica was built with tiff support.
76 * Remove old broken GitHub action vcpkg-4.1.1 (fixes issue #4078).
77 * Create config.yml.
78 * Fix typos.
79
80 2023-04-01 - V5.3.1
81 * Bug fixes for some special scenarios:
82 * Fix issue #4010.
83 * textord: Catch empty rows in block iterator (fixes #4039).
84 * Fix FP division by zero (issue #3995).
85 * Improve documentation and log messages.
86 * Build fixes and improvements (mainly for cmake).
87
88 2022-12-22 - V5.3.0
89 * Minor updates for documentation and cmake builds.
90
91 2022-12-13 - V5.3.0-rc1
92 * Fix the training tools for the legacy OCR engine (fix issue #3925).
93 * PDF renderer: Ignore non-text blocks (fix issue #3957).
94 * Remove colormap before thresholding (fix issue #3940).
95 * Fix a number of performance issues reported by Coverity Scan.
96 * Training tools: Replace call of exit function by return statement in main function.
97 * Fix double free in function vigorous_noise_removal (fix issue #3876).
98 * Create to_win if needed in Textord::make_spline_rows (fix issue #3875).
99 * Bug fixes for ScrollView viewer:
100 * Fix memory issues in ScrollView::MessageReceiver.
101 * Catch potential nullptr in SVNetwork::SVNetwork.
102 * Move svpaint.cpp from src/viewer to src/.
103 * Add rule for svpaint executable in Autotools.
104 * Bug fixes and improvements for build tools:
105 * Fix AMD64 detection with autobuild on FreeBSD (fix issue #3964).
106 * Fix tesseract.pc generated from CMake to match Autotools.
107 * Detect availability of AVX512-VNNI.
108 * configure.ac: fix build on aarch64_be.
109 * Drop CI for old versions of macOS and Ubuntu.
110
111 2022-07-06 - V5.2.0
112 * Improvements and fixes for continuous integration,
113 autoconf and cmake builds.
114 * Set /Os for some 32 bit MS compilers (fixes #3769).
115 * Improve comments and other documentation.
116 * Add initial support for Intel AVX512F.
117 * Fix for very large PDF files on 32 bit hosts (fixes #3805).
118 * Fix NEON detection on FreeBSD.
119 * Fix regression with UZN files (fixes #3837).
120 * Fix calling delete[] for memory allocated by malloc in C API.
121 * Add an API function to init tesseract with traineddata from memory
122 (fixes #3691).
123 * Replace direct access to Leptonica internal data structures by
124 function calls and support latest releases of Leptonica.
125 * Replace std::regex by std::string functions (fixes issue #3830).
126 * Use compiled-in TESSDATA_PREFIX also on Windows (fixes #3767).
127 * Add new parameter 'invert_threshold', change the default threshold
128 from 0.5 to 0.7 and mark parameter 'tessedit_do_invert' as deprecated.
129
130 2022-03-01 - V5.1.0
131 * Handle image and line regions in output formats ALTO, hOCR and text.
132 * New parameter curl_timeout for curl_easy_setopt.
133 * Build fixes and improvements.
134 * Catch nullptr in PageIterator::Orientation to improve robustness.
135 * Remove unused code.
136
137 2022-01-06 - V5.0.1
138 * Add SPDX-License-Identifier to public include files.
139 * Support redirections when running OCR on a URL.
140 * Lots of fixes and improvements for cmake builds.
141 Distributions should use the autoconf build.
142 * Fix broken msys2 build with gcc 11.
143 * Fix parameter certainty_scale (was duplicated).
144 * Fix some compiler warnings and clean code.
145 * Correctly detect amd64 and i386 on FreeBSD.
146 * Add libarchive and libcurl in continuous integration actions.
147 * Update submodule googletest to release v1.11.0.
148
149 2021-11-22 - V5.0.0
150 * Faster training and recognition by default (float instead of
151 double calculations)
152 * More options for binarization
153 * Improved support for ARM NEON
154 * Modernized code
155 * Removed proprietary data types like GenericVector and STRING
156 from public API
157 * pdf.ttf no longer needed, now integrated into the code
158 * Faster flat build with automake
159 * New options for combine_tessdata to show details of traineddata files
160 * Improved training messages
161 * Improved unit tests and fuzzing tests
162 * Lots of bug fixes
163
164 2021-11-15 - V4.1.3
165 * Fix build regression for autoconf build
166
167 2021-11-14 - V4.1.2
168 * Add RowAttributes getter to PageIterator
169 * Allow line images with larger width for training
170 * Fix memory leaks
171 * Improve build process
172 * Don't output empty ALTO sourceImageInformation (issue #2700)
173 * Extend URI support for Tesseract with libcurl
174 * Abort LSTM training with integer model (fixes issue #1573)
175 * Update documentation
176 * Make automake builds less noisy by default
177 * Don't use -march=native in automake builds
178
179 2019-12-26 - V4.1.1
180 * Implemented sw build (cppan is deprecated)
181 * Improved cmake build
182 * Code cleanup and optimization
183 * A lot of bug fixes...
184
185 2019-07-07 - V4.1.0
186 * Added new renderers Alto, LSTMBox, WordStrBox.
187 * Added character boxes in hOCR output.
188 * Added python training scripts (experimental) as an alternative to shell scripts.
189 * Better support AVX / AVX2 / SSE.
190 * Disable OpenMP support by default (see e.g. #1171, #1081).
191 * Fix for bounding box problem.
192 * Implemented support for whitelist/blacklist in LSTM engine.
193 * Improved cmake configuration.
194 * Code modernization and improvements.
195 * A lot of bug fixes...
196
197 2018-10-29 - V4.0.0
198 * Added new neural network system based on LSTMs, with major accuracy gains.
199 * Improvements to PDF rendering.
200 * Fixes to trainingdata rendering.
201 * Added LSTM models+lang models to 101 languages. (tessdata repository)
202 * Improved multi-page TIFF handling.
203 * Fixed damage to binary images when processing PDFs.
204 * Fixes to training process to allow incremental training from a recognition model.
205 * Made LSTM the default engine, pushed cube out.
206 * Deleted cube code.
207 * Changed OEModes --oem 0 for legacy tesseract engine, --oem 1 for LSTM, --oem 2 for both, --oem 3 for default.
208 * Avoid use of Leptonica debug parameters or functions.
209 * Fixed multi-language mode.
210 * Removed support for VS2010.
211 * Added Support for VS2015 and VS2017 with CPPAN.
212 * Implemented invisible text only for PDF.
213 * Added AVX / SSE support for Windows.
214 * Enabled OpenMP support.
215 * Parameter unlv_tilde_crunching changed to false.
216 * Miscellaneous Fixes.
217 * Detailed Changelog can be found at https://tesseract-ocr.github.io/tessdoc/4.0x-Changelog.html and https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html#tesseract-release-notes-oct-29-2018---v400
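
The engine-mode numbering in the V4.0.0 entry above can be sketched as a small lookup. This is an illustrative helper only, not part of Tesseract: the labels in OEM_MODES, the function name build_tesseract_args and the file names are hypothetical; only the --oem option and its numeric values 0 to 3 come from the entry itself.

```python
# Engine-mode numbers as listed in the V4.0.0 changelog entry.
# The dictionary keys are hypothetical labels chosen for this sketch.
OEM_MODES = {
    "legacy": 0,   # legacy tesseract engine
    "lstm": 1,     # LSTM engine
    "both": 2,     # legacy and LSTM combined
    "default": 3,  # whatever the build treats as default
}


def build_tesseract_args(image, output_base, engine="default"):
    """Return an argv list that selects an OCR engine mode via --oem."""
    if engine not in OEM_MODES:
        raise ValueError(f"unknown engine label: {engine!r}")
    return ["tesseract", image, output_base, "--oem", str(OEM_MODES[engine])]


if __name__ == "__main__":
    # Only builds the argument list; it does not run tesseract.
    print(build_tesseract_args("page.png", "page", engine="lstm"))
```

The list form is meant to be passed to something like subprocess.run without a shell, which avoids quoting problems with file names.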
218
219 2017-02-16 - V3.05.00
220 * Made some fine tuning to the hOCR output.
221 * Added TSV as another optional output format.
222 * Fixed ABI break introduced in 3.04.00 with the AnalyseLayout() method.
223 * text2image tool - Enable all OpenType ligatures available in a font. This feature requires Pango 1.38 or newer.
224 * Training tools - Replaced asserts with tprintf() and exit(1).
225 * Fixed Cygwin compatibility.
226 * Improved multipage tiff processing.
227 * Improved the embedded pdf font (pdf.ttf).
228 * Enable selection of OCR engine mode from command line.
229 * Changed tesseract command line parameter '-psm' to '--psm'.
230 * Write output of tesseract --help, --version and --list-langs to stdout instead of stderr.
231 * Added new C API for orientation and script detection, removed the old one.
232 * Increased minimum autoconf version to 2.59.
233 * Removed dead code.
234 * Require Leptonica 1.74 or higher.
235 * Fixed many compiler warnings.
236 * Fixed memory and resource leaks.
237 * Fixed some issues with the 'Cube' OCR engine.
238 * Fixed some openCL issues.
239 * Added option to build Tesseract with CMake build system.
240 * Implemented CPPAN support for easy Windows building.
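
Scripts written for releases before V3.05.00 may still pass the old single-dash '-psm' spelling noted in the entry above. A minimal sketch of a wrapper-side fix, assuming the caller keeps its arguments as a list; the function name modernize_psm_flag is hypothetical and not part of Tesseract:

```python
def modernize_psm_flag(argv):
    """Return a copy of argv with the legacy '-psm' flag spelled '--psm'.

    Only exact matches of '-psm' are rewritten; every other argument,
    including the value that follows the flag, is passed through unchanged.
    """
    return ["--psm" if arg == "-psm" else arg for arg in argv]


if __name__ == "__main__":
    legacy = ["tesseract", "scan.tif", "scan", "-psm", "6"]
    print(modernize_psm_flag(legacy))
```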
241
242 2016-02-17 - V3.04.01
243 * Added OSD renderer for psm 0. Works for single page and multi-page images.
244 * Improve tesstrain.sh script.
245 * Simplify build and run of ScrollView.
246 * Improved PDF output for OS X Preview utility.
247 * INCOMPATIBLE fix to hOCR line height information - commit 134ebc3.
248 * Added option to build Tesseract without Cube OCR engine (-DNO_CUBE_BUILD).
249 * Enable OpenMP support.
250 * Many bug fixes.
251
252 2015-07-11 - V3.04.00
253 * Tesseract development is now done with Git and hosted at github.com (Previously we used Subversion as a VCS and code.google.com for hosting).
254 * Tesseract now requires leptonica 1.71 or a higher version.
255 * Removed official support for VS 2008.
256 * Added support for 39 additional scripts/languages, including: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid
257 * Major updates to training system as a result of extensive testing on 100 languages.
258 * New training data for over 100 languages
259 * Improved performance with PIC compilation option.
260 * Significant change to invisible font system in pdf output to improve correctness and compatibility with external programs, particularly ghostscript.
261 * Improved font identification.
262 * Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc.
263 * Fixed problems with shifted baselines so recognition can recover from layout analysis errors.
264 * Major refactor to improve speed on difficult images, especially when running a heap checker.
265 * Moved params from global in page layout to tesseractclass.
266 * Improved single column layout analysis.
267 * Allow ocr output to multiple formats using tesseract command line executable.
268 * Fixed issues with mixed eng+ara scripts.
269 * Improved script consistency in numbers.
270 * Major refactor of control.cpp to enable line recognition.
271 * Added tesstrain.sh - a master training script.
272 * Added ability to text2image training tool to just list available fonts.
273 * Added ability to text2image to underline words.
274 * Improved efficiency of image processing for PDF output.
275 * Added parameter description for each parameter listed with 'print-parameters' command line option.
276 * Added font info to hOCR output.
277 * Enabled streaming input and output of multi-page documents.
278 * Many bug fixes.
279
280 2014-02-04 - V3.03(rc1)
281 * Added new training tool text2image to generate box/tif file pairs from
282 text and truetype fonts.
283 * Added support for PDF output with searchable text.
284 * Removed entire IMAGE class and all code in image directory.
285 * Tesseract executable: support for output to stdout; limited support for one
286 page images from stdin (especially on Windows)
287 * Added Renderer to API to allow document-level processing and output
288 of document formats, like hOCR, PDF.
289 * Major refactor of word-level recognition, beam search, eliminating dead code.
290 * Refactored classifier to make it easier to add new ones.
291 * Generalized feature extractor to allow feature extraction from greyscale.
292 * Improved sub/superscript treatment.
293 * Improved baseline fit.
294 * Added set_unicharset_properties to training tools.
295 * Many bug fixes.
296 * More training source data included.
297
298 2012-02-01 - V3.02
299 * Moved ResultIterator/PageIterator to ccmain.
300 * Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
301 * Added paragraph detection in layout analysis/post OCR.
302 * Fixed inconsistent xheight during training and over-chopping.
303 * Added simultaneous multi-language capability.
304 * Refactored top-level word recognition module.
305 * Added experimental equation detector.
306 * Improved handling of resolution from input images.
307 * Blamer module added for error analysis.
308 * Cleaned up externally used namespace by removing includes from baseapi.h.
309 * Removed dead memory management code.
310 * Tidied up constraints on control parameters.
311 * Added support for ShapeTable in classifier and training.
312 * Refactored class pruner.
313 * Fixed training leaks and randomness.
314 * Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
315 * Improved line detection and removal.
316 * Added fixed pitch chopper for CJK.
317 * Added UNICHARSET to WERD_CHOICE to make multi-language handling easier.
318 * Fixed problems with internally scaled images.
319 * Added page and bbox to string in tr files to identify source of training data better.
320 * Fixes to Hindi Shirorekha splitter.
321 * Added word bigram correction.
322 * Reduced stack memory consumption and eliminated some ugly typedefs.
323 * Added new uniform classifier API.
324 * Added new training error counter.
325 * Fixed endian bug in dawg reader.
326 * Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so.
327
328 2010-11-29 - V3.01
329 * Removed old/dead serialise/deserialize methods on *LISTIZED classes.
330 * Total rewrite of DENORM to better encapsulate operation and make
331 for potential to extract features from images.
332 * Thread-safety! Moved all critical global and static variables to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads), with the minor exception that some control parameters are still global and affect all threads.
333 * Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages with an improvement in accuracy at the cost of (much) lower speed. *There is no training module for Cube yet.*
334 * `OcrEngineMode` in `Init` replaces `AccuracyVSpeed` to control cube.
335 * Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese.
336 * Added `PageIterator` and `ResultIterator` as cleaner ways to get the full results out of Tesseract, that are not currently provided by any of the `TessBaseAPI::Get*` methods. All other methods, such as the `ETEXT_STRUCT` in particular are deprecated and will be deleted in the future.
337 * ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, BUT to use that you have to have already bootstrapped the language with character boxes. "Cyclic dependency" on traineddata.
338 * Auto orientation and script detection added to page layout analysis.
339 * Deleted *lots* of dead code.
340 * Fixxht module replaced with scalable data-driven module.
341 * Output font characteristics accuracy improved.
342 * Removed the double conversion at each classification.
343 * Upgraded oldest structs to be classes and deprecated PBLOB.
344 * Removed non-deterministic baseline fit.
345 * Added fixed length dawgs for Chinese.
346 * Handling of vertical text improved.
347 * Handling of leader dots improved.
348 * Table detection greatly improved.
349 * Fixed a couple of memory leaks.
350 * Fixed font labels on output text. (Not perfect, but a lot better than before.)
351 * Cleanup and more bug fixes
352 * Special treatments for Hindi.
353 * Support for build in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz)
354
355 2010-09-21 - V3.00
356 * Preparations for thread safety:
357 * Changed TessBaseAPI methods to be non-static
358 * Created a class hierarchy for the directories to hold instance data,
359 and began moving code into the classes.
360 * Moved thresholding code to a separate class.
361 * Added major new page layout analysis module.
362 * Added HOCR output (issues 221, 263: thanks to amkryukov).
363 * Added Leptonica as main image I/O and handling. Currently optional,
364 but in future releases linking with Leptonica will be mandatory.
365 * Ambiguity table rewritten to allow definite replacements in place
366 of fix_quotes.
367 * Added TessdataManager to combine data files into a single file.
368 * Some dead code deleted.
369 * VC++6 no longer supported. It can't cope with the use of templates.
370 * Many more languages added.
371 * Doxygenation of most of the function header comments.
372 * Added man pages.
373 * Added bash completion script (issue 247: thanks to neskiem)
374 * Fix integer overflow in thresholding (issue 366: thanks to Cyanide.Drake)
375 * Add Danish Fraktur support (issues 300, 360: thanks to
376 dsl602230@vip.cybercity.dk)
377 * Fix file pointer leak (issue 359, thanks to yukihiro.nakadaira)
378 * Fix an error using user-words (Issue 345: thanks to max.markin)
379 * Fix a memory leak in tablefind.cpp (Issue 342, thanks to zdravco)
380 * Fix a segfault due to double fclose (Issue 320, thanks to souther)
381 * Fix an automake error (Issue 318, thanks to ichanjz)
382 * Fix a Win32 crash on fileFormatIsTiff() (Issues 304, 316, 317, 330, 347,
383 349, 352: thanks to nguyenq87, max.markin, zdenop)
384 * Fixed a number of errors in newer (stricter) versions of VC++ (Issues
385 301, among others)
386
387 2009-06-30 - V2.04
388 * Integrated bug fixes and patches and misc changes for portability.
389 * Integrated a patch to remove some of the "access" macros.
390 * Removed dependence on lua from the viewer, speeding it up
391 dramatically.
392 * Fixed the viewer so it compiles and runs properly!
393 * Specifically fixing issues: 1, 63, 67, 71, 76, 81, 82, 106, 111,
394 112, 128, 129, 130, 133, 135, 142, 143, 145, 147, 153, 154, 160,
395 165, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209, 108, 169
396
397 2008-04-22 - V2.03
398 * Fixed crash introduced in 2.02.
399 * Fixed lack of tessembedded.cpp in distribution.
400 * Added test for leptonica header files and conditional test for lib.
401
402 2008-04-21 - V2.02 (again)
403 * Fixed namespace collisions with jpeg library (INT32).
404 * Portability fixes for Windows for new code.
405 * Updates to autoconf system for new code.
406
407 2008-01-23 - V2.02
408 * Improvements to clustering, training and classifier.
409 * Major internationalization improvements for large-character-set
410 languages, e.g. Kannada.
411 * Removed some compiler warnings.
412 * Added multipage tiff support for training and running.
413 * Updated graphics output to talk to new java-based viewer.
414 * Added ability to save n-best lists.
415 * Added leptonica support for more file types.
416 * Improved Init/End to make them safe.
417 * Reduced memory use of dictionaries.
418 * Added some new APIs to TessBaseAPI.
419
420 2007-08-27 - V2.01
421 * Fixed UTF8 input problems with box file reader.
422 * Fixed various infinite loops and crashes in dawg code.
423 * Removed include of config_auto.h from host.h.
424 * Added automatic wctype encoding to unicharset_extractor.
425 * Fixed dawg table too full error.
426 * Removed svn files from tarball.
427 * Added new functions to tessdll.
428 * Increased maximum utf8 string in a classification result to 8.
429
430 2007-07-02 - V2.00
431 * Converted internal character handling to UTF8.
432 * Trained with 6 languages.
433 * Added unicharset_extractor, wordlist2dawg.
434 * Added boxfile creation mode.
435 * Added UNLV regression test capability.
436 * Fixed problems with copyright and registered symbols.
437 * Fixed extern "C" declarations problem.
438
439 2007-05-15 - V1.04
440 * Added dll exports for Windows.
441 * Fixed name collisions with stl etc.
442 * Made some preliminary changes ready for unicodeization.
443 * Several bug fixes discovered during unicodeization.
444
445 2007-02-02 - V1.03
446 * Added mftraining and cntraining.
447 * Added baseapi with adaptive thresholding for grey and color.
448 * Fixed many memory leaks.
449 * Fixed several bugs including lack of use of adaptive classifier.
450 * Added ifdefs to eliminate graphics code and add embedded platform support.
451 * Incorporated several patches, including 64-bit builds, Mac builds.
452 * Minor accuracy improvements.
453
454 2006-10-04 - V1.02
455 * Removed dependency on Aspirin.
456 * Fixed a few missing Apache license headers.
457 * Removed $log.
458
459 2006-09-07 - V1.01
460 * Added mfcpch.cpp and getopt.cpp for VC++.
461 * Fixed problem with greyscale images and no libtiff.
462 * Stopped debug window from being used for the usage output.
463 * Fixed load of inttemp for big-endian architectures.
464 * Fixed some Mac compilation issues.
465
466 2006-06-16 - V1.0 of open source Tesseract checked-in.