# HG changeset patch # User Franz Glasner # Date 1757929071 -7200 # Node ID 1d09e1dec1d90bfb6fdad3f5f10ed08942b5e92b ADD: PyMuPDF v1.26.4: the original sdist. It does not yet contain MuPDF. This normally will be downloaded when building PyMuPDF. diff -r 000000000000 -r 1d09e1dec1d9 COPYING --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/COPYING Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,661 @@ + GNU AFFERO GENERAL PUBLIC LICENSE + Version 3, 19 November 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The GNU Affero General Public License is a free, copyleft license for +software and other kinds of works, specifically designed to ensure +cooperation with the community in the case of network server software. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +our General Public Licenses are intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + Developers that use our General Public Licenses protect your rights +with two steps: (1) assert copyright on the software, and (2) offer +you this License which gives you legal permission to copy, distribute +and/or modify the software. + + A secondary benefit of defending all users' freedom is that +improvements made in alternate versions of the program, if they +receive widespread use, become available for other developers to +incorporate. Many developers of free software are heartened and +encouraged by the resulting cooperation. However, in the case of +software used on network servers, this result may fail to come about. +The GNU General Public License permits making a modified version and +letting the public access it on a server without ever releasing its +source code to the public. + + The GNU Affero General Public License is designed specifically to +ensure that, in such cases, the modified source code becomes available +to the community. It requires the operator of a network server to +provide the source code of the modified version running there to the +users of that server. Therefore, public use of a modified version, on +a publicly accessible server, gives the public access to the source +code of the modified version. + + An older license, called the Affero General Public License and +published by Affero, was designed to accomplish similar goals. This is +a different license, not a version of the Affero GPL, but Affero has +released a new version of the Affero GPL which permits relicensing under +this license. + + The precise terms and conditions for copying, distribution and +modification follow. + + TERMS AND CONDITIONS + + 0. Definitions. + + "This License" refers to version 3 of the GNU Affero General Public License. + + "Copyright" also means copyright-like laws that apply to other kinds of +works, such as semiconductor masks. 
+ + "The Program" refers to any copyrightable work licensed under this +License. Each licensee is addressed as "you". "Licensees" and +"recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the work +in a fashion requiring copyright permission, other than the making of an +exact copy. The resulting work is called a "modified version" of the +earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work based +on the Program. + + To "propagate" a work means to do anything with it that, without +permission, would make you directly or secondarily liable for +infringement under applicable copyright law, except executing it on a +computer or modifying a private copy. Propagation includes copying, +distribution (with or without modification), making available to the +public, and in some countries other activities as well. + + To "convey" a work means any kind of propagation that enables other +parties to make or receive copies. Mere interaction with a user through +a computer network, with no transfer of a copy, is not conveying. + + An interactive user interface displays "Appropriate Legal Notices" +to the extent that it includes a convenient and prominently visible +feature that (1) displays an appropriate copyright notice, and (2) +tells the user that there is no warranty for the work (except to the +extent that warranties are provided), that licensees may convey the +work under this License, and how to view a copy of this License. If +the interface presents a list of user commands or options, such as a +menu, a prominent item in the list meets this criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work +for making modifications to it. "Object code" means any non-source +form of a work. + + A "Standard Interface" means an interface that either is an official +standard defined by a recognized standards body, or, in the case of +interfaces specified for a particular programming language, one that +is widely used among developers working in that language. + + The "System Libraries" of an executable work include anything, other +than the work as a whole, that (a) is included in the normal form of +packaging a Major Component, but which is not part of that Major +Component, and (b) serves only to enable use of the work with that +Major Component, or to implement a Standard Interface for which an +implementation is available to the public in source code form. A +"Major Component", in this context, means a major essential component +(kernel, window system, and so on) of the specific operating system +(if any) on which the executable work runs, or a compiler used to +produce the work, or an object code interpreter used to run it. + + The "Corresponding Source" for a work in object code form means all +the source code needed to generate, install, and (for an executable +work) run the object code and to modify the work, including scripts to +control those activities. However, it does not include the work's +System Libraries, or general-purpose tools or generally available free +programs which are used unmodified in performing those activities but +which are not part of the work. 
For example, Corresponding Source +includes interface definition files associated with source files for +the work, and the source code for shared libraries and dynamically +linked subprograms that the work is specifically designed to require, +such as by intimate data communication or control flow between those +subprograms and other parts of the work. + + The Corresponding Source need not include anything that users +can regenerate automatically from other parts of the Corresponding +Source. + + The Corresponding Source for a work in source code form is that +same work. + + 2. Basic Permissions. + + All rights granted under this License are granted for the term of +copyright on the Program, and are irrevocable provided the stated +conditions are met. This License explicitly affirms your unlimited +permission to run the unmodified Program. The output from running a +covered work is covered by this License only if the output, given its +content, constitutes a covered work. This License acknowledges your +rights of fair use or other equivalent, as provided by copyright law. + + You may make, run and propagate covered works that you do not +convey, without conditions so long as your license otherwise remains +in force. You may convey covered works to others for the sole purpose +of having them make modifications exclusively for you, or provide you +with facilities for running those works, provided that you comply with +the terms of this License in conveying all material for which you do +not control copyright. Those thus making or running the covered works +for you must do so exclusively on your behalf, under your direction +and control, on terms that prohibit them from making any copies of +your copyrighted material outside their relationship with you. + + Conveying under any other circumstances is permitted solely under +the conditions stated below. Sublicensing is not allowed; section 10 +makes it unnecessary. + + 3. Protecting Users' Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological +measure under any applicable law fulfilling obligations under article +11 of the WIPO copyright treaty adopted on 20 December 1996, or +similar laws prohibiting or restricting circumvention of such +measures. + + When you convey a covered work, you waive any legal power to forbid +circumvention of technological measures to the extent such circumvention +is effected by exercising rights under this License with respect to +the covered work, and you disclaim any intention to limit operation or +modification of the work as a means of enforcing, against the work's +users, your or third parties' legal rights to forbid circumvention of +technological measures. + + 4. Conveying Verbatim Copies. + + You may convey verbatim copies of the Program's source code as you +receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy an appropriate copyright notice; +keep intact all notices stating that this License and any +non-permissive terms added in accord with section 7 apply to the code; +keep intact all notices of the absence of any warranty; and give all +recipients a copy of this License along with the Program. + + You may charge any price or no price for each copy that you convey, +and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. 
+ + You may convey a work based on the Program, or the modifications to +produce it from the Program, in the form of source code under the +terms of section 4, provided that you also meet all of these conditions: + + a) The work must carry prominent notices stating that you modified + it, and giving a relevant date. + + b) The work must carry prominent notices stating that it is + released under this License and any conditions added under section + 7. This requirement modifies the requirement in section 4 to + "keep intact all notices". + + c) You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable section 7 + additional terms, to the whole of the work, and all its parts, + regardless of how they are packaged. This License gives no + permission to license the work in any other way, but it does not + invalidate such permission if you have separately received it. + + d) If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has interactive + interfaces that do not display Appropriate Legal Notices, your + work need not make them do so. + + A compilation of a covered work with other separate and independent +works, which are not by their nature extensions of the covered work, +and which are not combined with it such as to form a larger program, +in or on a volume of a storage or distribution medium, is called an +"aggregate" if the compilation and its resulting copyright are not +used to limit the access or legal rights of the compilation's users +beyond what the individual works permit. Inclusion of a covered work +in an aggregate does not cause this License to apply to the other +parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms +of sections 4 and 5, provided that you also convey the +machine-readable Corresponding Source under the terms of this License, +in one of these ways: + + a) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. + + b) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that product + model, to give anyone who possesses the object code either (1) a + copy of the Corresponding Source for all the software in the + product that is covered by this License, on a durable physical + medium customarily used for software interchange, for a price no + more than your reasonable cost of physically performing this + conveying of source, or (2) access to copy the + Corresponding Source from a network server at no charge. + + c) Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, and + only if you received the object code with such an offer, in accord + with subsection 6b. + + d) Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to the + Corresponding Source in the same way through the same place at no + further charge. 
You need not require recipients to copy the + Corresponding Source along with the object code. If the place to + copy the object code is a network server, the Corresponding Source + may be on a different server (operated by you or a third party) + that supports equivalent copying facilities, provided you maintain + clear directions next to the object code saying where to find the + Corresponding Source. Regardless of what server hosts the + Corresponding Source, you remain obligated to ensure that it is + available for as long as needed to satisfy these requirements. + + e) Convey the object code using peer-to-peer transmission, provided + you inform other peers where the object code and Corresponding + Source of the work are being offered to the general public at no + charge under subsection 6d. + + A separable portion of the object code, whose source code is excluded +from the Corresponding Source as a System Library, need not be +included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means any +tangible personal property which is normally used for personal, family, +or household purposes, or (2) anything designed or sold for incorporation +into a dwelling. In determining whether a product is a consumer product, +doubtful cases shall be resolved in favor of coverage. For a particular +product received by a particular user, "normally used" refers to a +typical or common use of that class of product, regardless of the status +of the particular user or of the way in which the particular user +actually uses, or expects or is expected to use, the product. A product +is a consumer product regardless of whether the product has substantial +commercial, industrial or non-consumer uses, unless such uses represent +the only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, +procedures, authorization keys, or other information required to install +and execute modified versions of a covered work in that User Product from +a modified version of its Corresponding Source. The information must +suffice to ensure that the continued functioning of the modified object +code is in no case prevented or interfered with solely because +modification has been made. + + If you convey an object code work under this section in, or with, or +specifically for use in, a User Product, and the conveying occurs as +part of a transaction in which the right of possession and use of the +User Product is transferred to the recipient in perpetuity or for a +fixed term (regardless of how the transaction is characterized), the +Corresponding Source conveyed under this section must be accompanied +by the Installation Information. But this requirement does not apply +if neither you nor any third party retains the ability to install +modified object code on the User Product (for example, the work has +been installed in ROM). + + The requirement to provide Installation Information does not include a +requirement to continue to provide support service, warranty, or updates +for a work that has been modified or installed by the recipient, or for +the User Product in which it has been modified or installed. Access to a +network may be denied when the modification itself materially and +adversely affects the operation of the network or violates the rules and +protocols for communication across the network. 
+ + Corresponding Source conveyed, and Installation Information provided, +in accord with this section must be in a format that is publicly +documented (and with an implementation available to the public in +source code form), and must require no special password or key for +unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of this +License by making exceptions from one or more of its conditions. +Additional permissions that are applicable to the entire Program shall +be treated as though they were included in this License, to the extent +that they are valid under applicable law. If additional permissions +apply only to part of the Program, that part may be used separately +under those permissions, but the entire Program remains governed by +this License without regard to the additional permissions. + + When you convey a copy of a covered work, you may at your option +remove any additional permissions from that copy, or from any part of +it. (Additional permissions may be written to require their own +removal in certain cases when you modify the work.) You may place +additional permissions on material, added by you to a covered work, +for which you have or can give appropriate copyright permission. + + Notwithstanding any other provision of this License, for material you +add to a covered work, you may (if authorized by the copyright holders of +that material) supplement the terms of this License with terms: + + a) Disclaiming warranty or limiting liability differently from the + terms of sections 15 and 16 of this License; or + + b) Requiring preservation of specified reasonable legal notices or + author attributions in that material or in the Appropriate Legal + Notices displayed by works containing it; or + + c) Prohibiting misrepresentation of the origin of that material, or + requiring that modified versions of such material be marked in + reasonable ways as different from the original version; or + + d) Limiting the use for publicity purposes of names of licensors or + authors of the material; or + + e) Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f) Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified versions of + it) with contractual assumptions of liability to the recipient, for + any liability that these contractual assumptions directly impose on + those licensors and authors. + + All other non-permissive additional terms are considered "further +restrictions" within the meaning of section 10. If the Program as you +received it, or any part of it, contains a notice stating that it is +governed by this License along with a term that is a further +restriction, you may remove that term. If a license document contains +a further restriction but permits relicensing or conveying under this +License, you may add to a covered work material governed by the terms +of that license document, provided that the further restriction does +not survive such relicensing or conveying. + + If you add terms to a covered work in accord with this section, you +must place, in the relevant source files, a statement of the +additional terms that apply to those files, or a notice indicating +where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in the +form of a separately written license, or stated as exceptions; +the above requirements apply either way. 
+ + 8. Termination. + + You may not propagate or modify a covered work except as expressly +provided under this License. Any attempt otherwise to propagate or +modify it is void, and will automatically terminate your rights under +this License (including any patent licenses granted under the third +paragraph of section 11). + + However, if you cease all violation of this License, then your +license from a particular copyright holder is reinstated (a) +provisionally, unless and until the copyright holder explicitly and +finally terminates your license, and (b) permanently, if the copyright +holder fails to notify you of the violation by some reasonable means +prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is +reinstated permanently if the copyright holder notifies you of the +violation by some reasonable means, this is the first time you have +received notice of violation of this License (for any work) from that +copyright holder, and you cure the violation prior to 30 days after +your receipt of the notice. + + Termination of your rights under this section does not terminate the +licenses of parties who have received copies or rights from you under +this License. If your rights have been terminated and not permanently +reinstated, you do not qualify to receive new licenses for the same +material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or +run a copy of the Program. Ancillary propagation of a covered work +occurring solely as a consequence of using peer-to-peer transmission +to receive a copy likewise does not require acceptance. However, +nothing other than this License grants you permission to propagate or +modify any covered work. These actions infringe copyright if you do +not accept this License. Therefore, by modifying or propagating a +covered work, you indicate your acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically +receives a license from the original licensors, to run, modify and +propagate that work, subject to this License. You are not responsible +for enforcing compliance by third parties with this License. + + An "entity transaction" is a transaction transferring control of an +organization, or substantially all assets of one, or subdividing an +organization, or merging organizations. If propagation of a covered +work results from an entity transaction, each party to that +transaction who receives a copy of the work also receives whatever +licenses to the work the party's predecessor in interest had or could +give under the previous paragraph, plus a right to possession of the +Corresponding Source of the work from the predecessor in interest, if +the predecessor has it or can get it with reasonable efforts. + + You may not impose any further restrictions on the exercise of the +rights granted or affirmed under this License. For example, you may +not impose a license fee, royalty, or other charge for exercise of +rights granted under this License, and you may not initiate litigation +(including a cross-claim or counterclaim in a lawsuit) alleging that +any patent claim is infringed by making, using, selling, offering for +sale, or importing the Program or any portion of it. + + 11. Patents. 
+ + A "contributor" is a copyright holder who authorizes use under this +License of the Program or a work on which the Program is based. The +work thus licensed is called the contributor's "contributor version". + + A contributor's "essential patent claims" are all patent claims +owned or controlled by the contributor, whether already acquired or +hereafter acquired, that would be infringed by some manner, permitted +by this License, of making, using, or selling its contributor version, +but do not include claims that would be infringed only as a +consequence of further modification of the contributor version. For +purposes of this definition, "control" includes the right to grant +patent sublicenses in a manner consistent with the requirements of +this License. + + Each contributor grants you a non-exclusive, worldwide, royalty-free +patent license under the contributor's essential patent claims, to +make, use, sell, offer for sale, import and otherwise run, modify and +propagate the contents of its contributor version. + + In the following three paragraphs, a "patent license" is any express +agreement or commitment, however denominated, not to enforce a patent +(such as an express permission to practice a patent or covenant not to +sue for patent infringement). To "grant" such a patent license to a +party means to make such an agreement or commitment not to enforce a +patent against the party. + + If you convey a covered work, knowingly relying on a patent license, +and the Corresponding Source of the work is not available for anyone +to copy, free of charge and under the terms of this License, through a +publicly available network server or other readily accessible means, +then you must either (1) cause the Corresponding Source to be so +available, or (2) arrange to deprive yourself of the benefit of the +patent license for this particular work, or (3) arrange, in a manner +consistent with the requirements of this License, to extend the patent +license to downstream recipients. "Knowingly relying" means you have +actual knowledge that, but for the patent license, your conveying the +covered work in a country, or your recipient's use of the covered work +in a country, would infringe one or more identifiable patents in that +country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or +arrangement, you convey, or propagate by procuring conveyance of, a +covered work, and grant a patent license to some of the parties +receiving the covered work authorizing them to use, propagate, modify +or convey a specific copy of the covered work, then the patent license +you grant is automatically extended to all recipients of the covered +work and works based on it. + + A patent license is "discriminatory" if it does not include within +the scope of its coverage, prohibits the exercise of, or is +conditioned on the non-exercise of one or more of the rights that are +specifically granted under this License. 
You may not convey a covered +work if you are a party to an arrangement with a third party that is +in the business of distributing software, under which you make payment +to the third party based on the extent of your activity of conveying +the work, and under which the third party grants, to any of the +parties who would receive the covered work from you, a discriminatory +patent license (a) in connection with copies of the covered work +conveyed by you (or copies made from those copies), or (b) primarily +for and in connection with specific products or compilations that +contain the covered work, unless you entered into that arrangement, +or that patent license was granted, prior to 28 March 2007. + + Nothing in this License shall be construed as excluding or limiting +any implied license or other defenses to infringement that may +otherwise be available to you under applicable patent law. + + 12. No Surrender of Others' Freedom. + + If conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot convey a +covered work so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you may +not convey it at all. For example, if you agree to terms that obligate you +to collect a royalty for further conveying from those to whom you convey +the Program, the only way you could satisfy both those terms and this +License would be to refrain entirely from conveying the Program. + + 13. Remote Network Interaction; Use with the GNU General Public License. + + Notwithstanding any other provision of this License, if you modify the +Program, your modified version must prominently offer all users +interacting with it remotely through a computer network (if your version +supports such interaction) an opportunity to receive the Corresponding +Source of your version by providing access to the Corresponding Source +from a network server at no charge, through some standard or customary +means of facilitating copying of software. This Corresponding Source +shall include the Corresponding Source for any work covered by version 3 +of the GNU General Public License that is incorporated pursuant to the +following paragraph. + + Notwithstanding any other provision of this License, you have +permission to link or combine any covered work with a work licensed +under version 3 of the GNU General Public License into a single +combined work, and to convey the resulting work. The terms of this +License will continue to apply to the part which is the covered work, +but the work with which it is combined will remain governed by version +3 of the GNU General Public License. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new versions of +the GNU Affero General Public License from time to time. Such new versions +will be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + + Each version is given a distinguishing version number. If the +Program specifies that a certain numbered version of the GNU Affero General +Public License "or any later version" applies to it, you have the +option of following the terms and conditions either of that numbered +version or of any later version published by the Free Software +Foundation. 
If the Program does not specify a version number of the +GNU Affero General Public License, you may choose any version ever published +by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future +versions of the GNU Affero General Public License can be used, that proxy's +public statement of acceptance of a version permanently authorizes you +to choose that version for the Program. + + Later license versions may give you additional or different +permissions. However, no additional obligations are imposed on any +author or copyright holder as a result of your choosing to follow a +later version. + + 15. Disclaimer of Warranty. + + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY +APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT +HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY +OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, +THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM +IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF +ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. + + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS +THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY +GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF +DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD +PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), +EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF +SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided +above cannot be given local legal effect according to their terms, +reviewing courts shall apply local law that most closely approximates +an absolute waiver of all civil liability in connection with the +Program, unless a warranty or assumption of liability accompanies a +copy of the Program in return for a fee. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU Affero General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU Affero General Public License for more details. + + You should have received a copy of the GNU Affero General Public License + along with this program. If not, see . 
+ +Also add information on how to contact you by electronic and paper mail. + + If your software can interact with users remotely through a computer +network, you should also make sure that it provides a way for users to +get its source. For example, if your program is a web application, its +interface could display a "Source" link that leads users to an archive +of the code. There are many ways you could offer source, and different +solutions will be better for different programs; see section 13 for the +specific requirements. + + You should also get your employer (if you work as a programmer) or school, +if any, to sign a "copyright disclaimer" for the program, if necessary. +For more information on this, and how to apply and follow the GNU AGPL, see +. diff -r 000000000000 -r 1d09e1dec1d9 PKG-INFO --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/PKG-INFO Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,83 @@ +Metadata-Version: 2.1 +Name: PyMuPDF +Version: 1.26.4 +Summary: A high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents. +Description-Content-Type: text/markdown +Author: Artifex +Author-email: support@artifex.com +License: Dual Licensed - GNU AFFERO GPL 3.0 or Artifex Commercial License +Classifier: Development Status :: 5 - Production/Stable +Classifier: Intended Audience :: Developers +Classifier: Intended Audience :: Information Technology +Classifier: Operating System :: MacOS +Classifier: Operating System :: Microsoft :: Windows +Classifier: Operating System :: POSIX :: Linux +Classifier: Programming Language :: C +Classifier: Programming Language :: C++ +Classifier: Programming Language :: Python :: 3 :: Only +Classifier: Programming Language :: Python :: Implementation :: CPython +Classifier: Topic :: Utilities +Classifier: Topic :: Multimedia :: Graphics +Classifier: Topic :: Software Development :: Libraries +Requires-Python: >=3.9 +Project-URL: Documentation, https://pymupdf.readthedocs.io/ +Project-URL: Source, https://github.com/pymupdf/pymupdf +Project-URL: Tracker, https://github.com/pymupdf/PyMuPDF/issues +Project-URL: Changelog, https://pymupdf.readthedocs.io/en/latest/changes.html + +# PyMuPDF + +**PyMuPDF** is a high performance **Python** library for data extraction, analysis, conversion & manipulation of [PDF (and other) documents](https://pymupdf.readthedocs.io/en/latest/the-basics.html#supported-file-types). + +# Community +Join us on **Discord** here: [#pymupdf](https://discord.gg/TSpYGBW4eq) + + +# Installation + +**PyMuPDF** requires **Python 3.9 or later**, install using **pip** with: + +`pip install PyMuPDF` + +There are **no mandatory** external dependencies. However, some [optional features](#pymupdf-optional-features) become available only if additional packages are installed. + +You can also try without installing by visiting [PyMuPDF.io](https://pymupdf.io/#examples). + + +# Usage + +Basic usage is as follows: + +```python +import pymupdf # imports the pymupdf library +doc = pymupdf.open("example.pdf") # open a document +for page in doc: # iterate the document pages + text = page.get_text() # get plain text encoded as UTF-8 + +``` + + +# Documentation + +Full documentation can be found on [pymupdf.readthedocs.io](https://pymupdf.readthedocs.io). + + + +# Optional Features + +* [fontTools](https://pypi.org/project/fonttools/) for creating font subsets. +* [pymupdf-fonts](https://pypi.org/project/pymupdf-fonts/) contains some nice fonts for your text output. 
+* [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract) for optical character recognition in images and document pages. + + + +# About + +**PyMuPDF** adds **Python** bindings and abstractions to [MuPDF](https://mupdf.com/), a lightweight **PDF**, **XPS**, and **eBook** viewer, renderer, and toolkit. Both **PyMuPDF** and **MuPDF** are maintained and developed by [Artifex Software, Inc](https://artifex.com). + +**PyMuPDF** was originally written by [Jorj X. McKie](mailto:jorj.x.mckie@outlook.de). + + +# License and Copyright + +**PyMuPDF** is available under [open-source AGPL](https://www.gnu.org/licenses/agpl-3.0.html) and commercial license agreements. If you determine you cannot meet the requirements of the **AGPL**, please contact [Artifex](https://artifex.com/contact/pymupdf-inquiry.php) for more information regarding a commercial license. diff -r 000000000000 -r 1d09e1dec1d9 README.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/README.md Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,60 @@ +# PyMuPDF + +**PyMuPDF** is a high performance **Python** library for data extraction, analysis, conversion & manipulation of [PDF (and other) documents](https://pymupdf.readthedocs.io/en/latest/the-basics.html#supported-file-types). + +# Community +Join us on **Discord** here: [#pymupdf](https://discord.gg/TSpYGBW4eq) + + +# Installation + +**PyMuPDF** requires **Python 3.9 or later**, install using **pip** with: + +`pip install PyMuPDF` + +There are **no mandatory** external dependencies. However, some [optional features](#pymupdf-optional-features) become available only if additional packages are installed. + +You can also try without installing by visiting [PyMuPDF.io](https://pymupdf.io/#examples). + + +# Usage + +Basic usage is as follows: + +```python +import pymupdf # imports the pymupdf library +doc = pymupdf.open("example.pdf") # open a document +for page in doc: # iterate the document pages + text = page.get_text() # get plain text encoded as UTF-8 + +``` + + +# Documentation + +Full documentation can be found on [pymupdf.readthedocs.io](https://pymupdf.readthedocs.io). + + + +# Optional Features + +* [fontTools](https://pypi.org/project/fonttools/) for creating font subsets. +* [pymupdf-fonts](https://pypi.org/project/pymupdf-fonts/) contains some nice fonts for your text output. +* [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract) for optical character recognition in images and document pages. + + + +# About + +**PyMuPDF** adds **Python** bindings and abstractions to [MuPDF](https://mupdf.com/), a lightweight **PDF**, **XPS**, and **eBook** viewer, renderer, and toolkit. Both **PyMuPDF** and **MuPDF** are maintained and developed by [Artifex Software, Inc](https://artifex.com). + +**PyMuPDF** was originally written by [Jorj X. McKie](mailto:jorj.x.mckie@outlook.de). + + +# License and Copyright + +**PyMuPDF** is available under [open-source AGPL](https://www.gnu.org/licenses/agpl-3.0.html) and commercial license agreements. If you determine you cannot meet the requirements of the **AGPL**, please contact [Artifex](https://artifex.com/contact/pymupdf-inquiry.php) for more information regarding a commercial license. + + + + diff -r 000000000000 -r 1d09e1dec1d9 READMEb.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/READMEb.md Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,7 @@ +# PyMuPDFb + +This wheel contains [MuPDF](https://mupdf.readthedocs.io/) shared libraries for +use by [PyMuPDF](https://pymupdf.readthedocs.io/). 
+ +This wheel is shared by PyMuPDF wheels that are specific to different Python +versions, significantly reducing the total size of a release. diff -r 000000000000 -r 1d09e1dec1d9 READMEd.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/READMEd.md Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,4 @@ +# PyMuPDFd + +This wheel contains [MuPDF](https://mupdf.readthedocs.io/) build-time files +that were used to build [PyMuPDF](https://pymupdf.readthedocs.io/). diff -r 000000000000 -r 1d09e1dec1d9 changes.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/changes.txt Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,2776 @@ +Change Log +========== + + +**Changes in version 1.26.4** + +* Use MuPDF-1.26.7. + +* Fixed issues: + + * **Fixed** `3806 `_: pdf to image rendering ignore optional content offs + * **Fixed** `4388 `_: Incorrect PixMap from page due to cached data from other PDF + * **Fixed** `4457 `_: Wrong characters displayed after font subsetting (w/ native method) + * **Fixed** `4462 `_: delete_pages() does not accept a single int + * **Fixed** `4533 `_: Open PDF error segmentation fault + * **Fixed** `4565 `_: MacOS uses Tesseract and not Tesseract-OCR + * **Fixed** `4571 `_: Broken merged pdfs. + * **Fixed** `4590 `_: TypeError in utils.py scrub(): annot.update_file(buffer=...) is invalid + * **Fixed** `4614 `_: Intercept bad widgets when inserting to another PDF + * **Fixed** `4639 `_: pymupdf.mupdf.FzErrorGeneric: code=1: Director error: : 'JM_new_bbox_device_Device' object has no attribute 'layer_name' + +* Other: + + * Check that #4392 `Segfault when running with pytest and -Werror` is fixed if PyMuPDF is built with swig>=4.4. + * Add `Page.clip_to_rect()`. + * Improved search for Tesseract data. + * Retrospectively mark #4496 as fixed in 1.26.1. + * Retrospectively mark #4503 as fixed in 1.26.3. + * Added experimental support for Graal. + + +**Changes in version 1.26.3 (2025-07-02)** + +* Use MuPDF-1.26.3. + +* Fixed issues: + + * **Fixed** `4462 `_: delete_pages() does not accept a single int + * **Fixed** `4503 `_: Undetected character styles + * **Fixed** `4527 `_: Rect.intersects() is much slower than necessary + * **Fixed** `4564 `_: Possible encoding issue in PDF metadata + * **Fixed** `4575 `_: Bug with IRect contains method + +* Other: + + * Class Shape is now available as pymupdf.Shape. + * Added table cell markdown support. + + +**Changes in version 1.26.2** + +[Skipped.] + + +**Changes in version 1.26.1 (2025-06-11)** + +* Use MuPDF-1.26.2. + +* Fixed issues: + + * **Fixed** `4520 `_: show_pdf_page does not like empty pages created by new_page + * **Fixed** `4524 `_: fitz.get_text ignores 'pages' kwarg + * **Fixed** `4412 `_: Regression? Spurious error? in insert_pdf in v1.25.4 + * **Fixed** `4496 `_: pymupdf4llm with pymupdfpro + +* Other: + + * Partial fix for `4503 `_: Undetected character styles + * New method `Document.rewrite_images()`, useful for reducing file size, changing image formats, or converting color spaces. + * `Page.get_text()`: restrict positional args to match docs. + * Removed bogus definition of class `Shape`. + * Removed release date from module, docs and changelog. + * `pymupdf.pymupdf_date` and `pymupdf.VersionDate` are now both None. + * They will be removed in a future release. + + +**Changes in version 1.26.0 (2025-05-22)** + +* Use MuPDF-1.26.1. 
+ +* Fixed issues: + + * **Fixed** `4324 `_: cluster_drawings() fails to cluster horizontal and vertical thin lines + * **Fixed** `4363 `_: Trouble with searching + * **Fixed** `4404 `_: IndexError in page.get_links() + * **Fixed** `4412 `_: Regression? Spurious error? in insert_pdf in v1.25.4 + * **Fixed** `4423 `_: pymupdf.mupdf.FzErrorFormat: code=7: cannot find object in xref error encountered after version 1.25.3 + * **Fixed** `4435 `_: get_pixmap method stuck on one page + * **Fixed** `4439 `_: New Xml class from data does not work - bug in code + * **Fixed** `4445 `_: Broken XREF table incorrectly repaired + * **Fixed** `4447 `_: Stroke color of annotations cannot be correctly set + * **Fixed** `4479 `_: set_layer_ui_config() toggles all layers rather than just one + * **Fixed** `4505 `_: Follow Widget flag values up its parent structure + +* Other: + + * Partial fixed for `4457 `_: Wrong characters displayed after font subsetting (w/ native method) + * Support image stamp annotations. + * Support recoloring pages. + * Added example of using Django's file storage API to open files with pymupdf. + * Clarified FreeText annotation color options. + We now raise an exception if an attempt is made to set attributes that can not be supported. + * Fixed potential segv in Pixmap.is_unicolor(). + * Added runtime assert that that PyMuPDF and MuPDF were built with compatible + NDEBUG settings (related to `4390 `_). + * Simplified handling of filename/filetype when opening documents. + * Removed PDF linearization support. + * Calls to `Document.save()` with `linear` set to true will now raise an exception. + * See https://artifex.com/blog/mupdf-removes-linearisation for more information. + +**Changes in version 1.25.5 (2025-03-31)** + +* Fixed issues: + + * **Fixed** `4372 `_: Text insertion fails due to missing /Resources object + * **Fixed** `4400 `_: Infinite loop in fill_textbox + * **Fixed** `4403 `_: Unable to get_text() - layer/clip nesting too deep + * **Fixed** `4415 `_: PDF page is mirrored, origin is at bottom-left + +* Other: + + * Use MuPDF-1.25.6. + * Fixed MuPDF SEGV on MacOS with particular fonts. + * Fixed `Annot.get_textpage()`'s `clip` arg. + * Fixed Python-3.14 (pre-release) build error. + + +**Changes in version 1.25.4 (2025-03-14)** + +* Use MuPDF-1.25.5. + +* Fixed issues: + + * **Fixed** `4079 `_: Unexpected result for apply_redactions() + * **Fixed** `4224 `_: MuPDF error: format error: negative code in 1d faxd + * **Fixed** `4303 `_: page.get_image_info() returns outdated cached results after replacing image + * **Fixed** `4309 `_: FzErrorFormat Error When Deleting First Page + * **Fixed** `4336 `_: Major Performance Regression: pix.color_count is 150x slower in version 1.25.3 compared to 1.23.8 + * **Fixed** `4341 `_: Invalid label retrieval when /Kids is an array of multiple /Nums + +* Other: + + * Fixed handling of duplicate widget names when joining PDFs (PR #4347). + * Improved Pyodide build. + * Avoid SWIG-related build errors with Python-3.13 by disabling PY_LIMITED_API. + + +**Changes in version 1.25.3 (2025-02-06)** + +* Use MuPDF-1.25.4. 
+ +* Fixed issues: + + * **Fixed** `4139 `_: Text color numbers change between 1.24.14 and 1.25.0 + * **Fixed** `4141 `_: Some insertion methods fails for pages without a /Resources object + * **Fixed** `4180 `_: Search problems + * **Fixed** `4182 `_: Text coordinate extraction error + * **Fixed** `4245 `_: Highlighting issue distorted on recent versions + * **Fixed** `4254 `_: add_freetext_annot is drawing text outside the annotation box + +* Other: + + * In annotations: + * Added support for subtype FreeTextCallout. + * Added support for rich text. + * Added miter_limit arg to insert_text*() to allow suppression of spikes caused by long miters. + * Add Widget Support to `Document.insert_pdf()`. + * Add `bibi` to span dicts. + * Add `synthetic' to char dict. + * Fixed Pyodide builds. + + +**Changes in version 1.25.2 (2025-01-17)** + +* Fixed issues: + + * **Fixed** `4055 `_: "Yes" for all checkboxes does not work for all PDF rendering engines. + * **Fixed** `4155 `_: samples_mv is unsafe + * **Fixed** `4162 `_: Got AttributeError, when tried to add Signature field + * **Fixed** `4186 `_: Incorrect handling of JPEG with color space CMYK image extraction + * **Fixed** `4195 `_: Pixmaps that are inverted and have an alpha channel are not rendered properly + * **Fixed** `4225 `_: pixmap.pil_save() fails due to colorspace definition + * **Fixed** `4232 `_: Incorrect Font style and Size + +* Other: + + * Use Python's built-in glyphname <> unicode conversion. + * Improve speed of pixmap color inversion. + * Add new `char_flags` member to span dictionary, for example allows detection of invisible text. + * Detect image masks in TextPage output. + * Added `Pixmap.pil_image()`. + + +**Changes in version 1.25.1 (2024-12-11)** + +* Use MuPDF-1.25.2. + +* Fixed issues: + + * **Fixed** `4125 `_: memory leak while convert Pixmap's colorspace + * **Fixed** `4034 `_: Possible regression in pdf cleaning during save. + + +**Changes in version 1.25.0 (2024-12-05)** + +* Use MuPDF-1.25.1. + +* Fixed issues: + + * **Fixed** `4026 `_: page.get_text('blocks') output two piece of very similar text with different bbox + * **Fixed** `4004 `_: Segmentation Fault When Updating PDF Form Field Value + * **Fixed** `3887 `_: Subset Fonts problem using Fallback Font + * **Fixed** `3886 `_: Another issue with destroying PDF when inserting html + * **Fixed** `3751 `_: apply_redactions causes part of the page content to be hidden / transparent + + +.. codespell:ignore-begin + +**Changes in version 1.24.14 (2024-11-19)** + +* Use MuPDF-1.24.11. + +* Fixed issues: + + * **Fixed** `3448 `_: get_pixmap function removes the table and leaves just the content behind + * **Fixed** `3758 `_: Got "malloc(): unaligned tcache chunk detected Aborted (core dumped)" while using add_redact_annot/apply_redactions + * **Fixed** `3813 `_: Stories: Ordered list count broken with nested unordered list + * **Fixed** `3933 `_: font.valid_codepoints() - malfunction + * **Fixed** `4018 `_: PyMuPDF hangs when iterating over zero page PDF pages backwards + * **Fixed** `4043 `_: fullcopypage bug + * **Fixed** `4047 `_: Segmentation Fault in add_redact_annot + * **Fixed** `4050 `_: Content of dict returned by doc.embfile_info() does not fit to documentation + +* Other: + + * Ensure that words from `Page.get_text()` never contain RTL/LTR char mixtures. + * Fix building with system MuPDF. + * Add dot product for points and vectors. 
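
The word-level extraction touched by the RTL/LTR note in the 1.24.14 entry above can be exercised with `Page.get_text("words")`. The following is a minimal sketch, not part of the original changelog; the file name is a placeholder:

```python
import pymupdf  # the module is also importable as "fitz" for backwards compatibility

# "sample.pdf" is a placeholder path, not a file shipped with this sdist.
doc = pymupdf.open("sample.pdf")
page = doc[0]

# Each tuple is (x0, y0, x1, y1, word, block_no, line_no, word_no);
# per the 1.24.14 note, a single word never mixes RTL and LTR characters.
for x0, y0, x1, y1, word, block_no, line_no, word_no in page.get_text("words"):
    print(word, (round(x0), round(y0), round(x1), round(y1)))

doc.close()
```
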
+ + +**Changes in version 1.24.13 (2024-10-29)** + +* Fixed issues: + + * **Fixed** `3848 `_: Piximap program crash + * **Fixed** `3950 `_: Unable to consistently extract field labels from PDFs + * **Fixed** `3981 `_: PyMuPDF 1.24.12 with pyinstaller throws error. + * **Fixed** `3994 `_: pix.color_topusage raise Segmentation fault (core dumped) + + +**Changes in version 1.24.12 (2024-10-21)** + +* Fixed issues: + + * **Fixed** `3914 `_: Ability to print MuPDF errors to logging instead of stdout + * **Fixed** `3916 `_: insert_htmlbox error: int too large to convert to float + * **Fixed** `3950 `_: Unable to consistently extract field labels from PDFs + +* Supported Python versions are now 3.9-3.13. + + * Dropped support for Python-3.8 because end-of-life. + * Added support for Python-3.13 because now released. + * See: https://devguide.python.org/versions/ + + +**Changes in version 1.24.11 (2024-10-03)** + +* Use MuPDF-1.24.10. + +* Fixed issues: + + * **Fixed** `3624 `_: Pdf file transform to image have a black block + * **Fixed** `3859 `_: doc.need_appearances() fails with "AttributeError: module 'pymupdf.mupdf' has no attribute 'PDF_TRUE' " + * **Fixed** `3863 `_: apply_redactions() does not work as expected + * **Fixed** `3905 `_: open stream can raise a FzErrorFormat error instead of FileDataError + +* Wheels now use the Python Stable ABI: + + * There is one PyMuPDF wheel for each platform. + * Each wheel works with all supported Python versions. + * Each wheel is built using the oldest supported Python version (currently 3.8). + * There is no PyMuPDFb wheel. + +* Other: + + * Improvements to get_text_words() with sort=True. + * Tests now always get the latest versions of required Python packages. + * Removed dependency on setuptools. + * Added item to PyMuPDF-1.24.10 changes below - fix of #3630. + + +**Changes in version 1.24.10 (2024-09-02)** + +* Use MuPDF-1.24.9. + +* Fixed issues: + + * **Fixed** `3450 `_: get_pixmap function takes too long to process + * **Fixed** `3569 `_: Invalid OCGs not ignored by SVG image creation + * **Fixed** `3603 `_: ObjStm compression and PDF linearization doesn't work together + * **Fixed** `3650 `_: Linebreak inserted between each letter + * **Fixed** `3661 `_: Update Document to check the /XYZ len + * **Fixed** `3698 `_: documentation issue - old code in the annotations documentation + * **Fixed** `3705 `_: Document.select() behaves weirdly in some particular kind of pdf files + * **Fixed** `3706 `_: extend Document.__getitem__ type annotation to reflect that the method also accepts slices + * **Fixed** `3727 `_: Method get_pixmap() make the program exit without any exceptions or messages + * **Fixed** `3767 `_: Cannot get Tessdata with Tesseract-OCR 5 + * **Fixed** `3773 `_: Link.set_border gives TypeError: '<' not supported between instances of 'NoneType' and 'int' + * **Fixed** `3774 `_: fitz.__version__` does not work anymore + * **Fixed** `3789 `_: ValueError: not enough values to unpack (expected 3, got 2) is thrown when call insert_pdf + * **Fixed** `3820 `_: class improves namedDest handling + + * **Fixed** `3630 `_: page.apply_redactions gives unwanted black rectangle + +* Other: + + * Object streams and linearization cannot be used together; attempting to do + so will raise an exception. (#3603) + * Fixed handling of non-existing /Contents object. + + +**Changes in version 1.24.9 (2024-07-24)** + +* Use MuPDF-1.24.8. 
+ + +**Changes in version 1.24.8 (2024-07-22)** + +* Fixed issues: + + * **Fixed** `3636 `_: API documentation for the open function is not obvious to find. + * **Fixed** `3654 `_: docx parsing was broken in 1.24.7 + * **Fixed** `3677 `_: Unable to extract subset font name using the newer versions of PyMuPDF : 1.24.6 and 1.24.7. + * **Fixed** `3687 `_: Page.get_text results in AssertionError for epub files + +Other: + +* Fixed various spelling mistakes spotted by codespell. +* Improved how we modify MuPDF's default configuration on Windows. +* Make text search to work with ligatures. + + +**Changes in version 1.24.7 (2024-06-26)** + +* Fixed issues: + + * **Fixed** `3615 `_: Document.pagemode or Document.pagelayout crashes for epub files + * **Fixed** `3616 `_: not last version reported + + +**Changes in version 1.24.6 (2024-06-25)** + +* Use MuPDF-1.24.4 + +* Fixed issues: + + * **Fixed** `3599 `_: Story.fit_width() has a weird line + * **Fixed** `3594 `_: Garbled extraction for Amazon Sustainability Report + * **Fixed** `3591 `_: 'width' in Page.get_drawings() returns width equal as 0 + * **Fixed** `3561 `_: ZeroDivisionError: float division by zero with page.apply_redactions() + * **Fixed** `3559 `_: SegFault 11 when empty H1 H2 H3 H4 etc element is used in insert_htmlbox + * **Fixed** `3539 `_: Add dotted gridline detection to table recognition + * **Fixed** `3519 `_: get_toc(simple=False) AttributeError: 'Outline' object has no attribute 'rect' + * **Fixed** `3510 `_: page.get_label() gets wrong label on the first page of doc + * **Fixed** `3494 `_: 1.24.2/1.24.3: spurious characters introduced when using subset_fonts and insert_pdf + * **Fixed** `3470 `_: subset_fonts error exit without exception/warning + * **Fixed** `3400 `_: set_toc alters link coordinates for some rotated pages on pymupdf 1.24.2 + * **Fixed** `3347 `_: Incorrect links to points on pages having different heights + * **Fixed** `3237 `_: Set_metadata() does not work + * **Fixed** `3493 `_: Isolate PyMuPDF from other libraries; issues when PyMuPDF is loaded with other libraries like GdkPixbuf + +* Other: + + * Fixed concurrent use of PyMuPDF caused by use of constant temporary filenames. + + * Add musllinux x86_64 wheels to release. + + * Added clearer version information: + + * `pymupdf.pymupdf_version`. + * `pymupdf.mupdf_version`. + * `pymupdf.pymupdf_date`. + + +**Changes in version 1.24.5 (2024-05-30)** + +* Fixed issues: + + * **Fixed** `3479 `_: regression: fill_textbox: IndexError: pop from empty list + * **Fixed** `3488 `_: set_toc method error + +* Other: + + * Some more fixes to use MuPDF floating formatting. + * Removed/disabled some unnecessary diagnostics. + * Fixed utils.do_links() crash. + * Experimental new functions `pymupdf.apply_pages()` and `pymupdf.get_text()`. + * Addresses wrong label generation for label styles "a" and "A". + + +**Changes in version 1.24.4 (2024-05-16)** + + * **Fixed** `3418 `_: Re-introduced bug, text align add_redact_annot + * **Fixed** `3472 `_: insert_pdf gives SystemError + +* Other: + + * Fixed sysinstall test failing to remove all of prior installation before + new install. + * Fixed `utils.do_links()` crash. + * Correct `TextPage` creation Code. + * Unified various diagnostics. + * Fix bug in `page_merge()`. + + +**Changes in version 1.24.3 (2024-05-09)** + +* + The Python module is now called `pymupdf`. `fitz` is still supported for + backwards compatibility. + +* Use MuPDF-1.24.2. 
+ +* Fixed issues: + + * **Fixed** `3357 `_: PyMuPDF==1.24.0 will hanging when using page.get_text("text") + * **Fixed** `3376 `_: Redacting results are not as expected in 1.24.x. + * **Fixed** `3379 `_: Documentation mismatch for get_text_blocks return value order. + * **Fixed** `3381 `_: Contents stream contains floats in scientific notation + * **Fixed** `3402 `_: Cannot add Widgets containing inter-field-calculation JavaScript + * **Fixed** `3414 `_: missing attribute set_dpi() + * **Fixed** `3430 `_: page.get_text() cause process freeze with certain pdf on v1.24.2 + +* Other: + + * New/modified methods: + + * `Page.remove_rotation()`: new, set page rotation to zero while keeping appearance. + + * Fixed some problems when checking for PDF properties. + * Fixed pip builds from sdist + (see discussion `3360 `_: + Alpine linux docker build failing "No matching distribution found for pymupdfb==1.24.1"). + + +**Changes in version 1.24.2 (2024-04-17)** + +* Removed obsolete classic implementation from releases + (previously available as module `fitz_old`). + +* Fixed issues: + + * **Fixed** `3331 `_: Document.pages() is incorrectly type-hinted + * **Fixed** `3354 `_: PyMuPDF==1.24.1: AttributeError: property 'metadata' of 'Document' object has no setter + +* Other: + + * New/modified methods: + + * `Document.bake()`: new, make annotations / fields permanent content. + * `Page.cluster_drawings()`: new, identifies drawing items + (i.e. vector graphics or line-art) + that belong together based on their geometrical vicinity. + * `Page.apply_redactions()`: added new parameter `text`. + * `Document.subset_fonts()`: use MuPDF's `pdf_subset_fonts()` instead of PyMuPDF code. + + * The `Document` class now supports page numbers specified as slices. + * Avoid causing MuPDF warnings. + + +**Changes in version 1.24.1 (2024-04-02)** + +* Fixed issues: + + * **Fixed** `3278 `_: apply_redactions moves some unredacted text + * **Fixed** `3301 `_: Be more permissive when classifying links as kind LINK_URI + * **Fixed** `3306 `_: Text containing capital 'ET' not appearing as annotation + +* Other: + + * Use MuPDF-1.24.1. + * Support ObjStm Compression. + Methods `Document.save()`, `Document.ez_save()` and `Document.write()` + now support new parameters `use_objstm`, compression_effort` and + `preserve_metadata`. + + +**Changes in version 1.24.0 (2024-03-21)** + +* Fixed issues: + + * **Fixed** `3281 `_: Preparing metadata (pyproject.toml) did not run successfully + * **Fixed** `3279 `_: PyMuPDF no longer builds in Alpine Linux + * **Fixed** `3257 `_: apply_redactions() deleting text outside of annoted box + * **Fixed** `3216 `_: AttributeError: 'Annot' object has no attribute '__del__' + * **Fixed** `3207 `_: get_drawings's items is missing line from h path operator + * **Fixed** `3201 `_: Memory leaks when merging PDFs + * **Fixed** `3197 `_: page.get_text() returns hexadecimal text for some characters + * **Fixed** `3196 `_: Remove text not working in 1.23.25 version vs 1.20.2 + * **Fixed** `3172 `_: PDF's 45º lines dissapearing in png conversion + * **Fixed** `3135 `_: Do not log warnings to stdout + * **Fixed** `3125 `_: get_pixmap method stuck on one page and runs forever + * **Fixed** `2964 `_: There is an issue with the image generated by the page.get_pixmap() function + +* Other: + + * Use MuPDF-1.24.0. + * Add support for redacting vector graphics. + * Several fixes for table module + + * Add new method for outputting the table as a markdown string. 
+
+    * Address errors in computing the table header object:
+
+      We now allow None as the cell value, because this will be resolved where
+      needed (e.g. in the pandas DataFrame).
+
+      We previously tried to enforce rect-like tuples in all header cell
+      bboxes; however, this fails for tables with all-None columns. This fix
+      accepts such tables and puts an empty string in the corresponding cell.
+
+      We now correctly include start / stop points of lines in the bbox of the
+      clustered graphic. We previously joined the line's rectangle - which had
+      no effect because this is always empty.
+
+  * Improved exception text if we fail to open a document.
+  * Fixed build with new libclang 18.
+
+
+**Changes in version 1.23.26 (2024-02-29)**
+
+* Fixed issues:
+
+  * **Fixed** `3199 `_: Add entry_points to setuptools configuration to provide command-line console scripts
+  * **Fixed** `3209 `_: Empty vertices in ink annotation
+
+* Other:
+
+  * Improvements to table detection:
+
+    * Improved check for empty tables, fixing bugs when determining table headers.
+    * Improved computation of enveloping vector graphic rectangles.
+    * Ignore more meaningless "pseudo" tables.
+
+  * Install command-line 'pymupdf' command that runs fitz/__main__.py.
+  * Don't overwrite MuPDF's config.h when building on non-Windows.
+  * Fix `Story` constructor's `archive` arg to match docs - now accepts a single `Archive` constructor arg.
+  * Do not include MuPDF source in sdist; it will be downloaded automatically when building.
+
+
+**Changes in version 1.23.25 (2024-02-20)**
+
+* Fixed issues:
+
+  * **Fixed** `3182 `_: Pixmap.invert_irect argument type error
+  * **Fixed** `3186 `_: extractText() extracts broken text from pdf
+  * **Fixed** `3191 `_: Error on .find_tables()
+
+* Other:
+
+  * When building, the python-config command can be specified directly with the
+    environment variable `PIPCL_PYTHON_CONFIG`.
+
+
+**Changes in version 1.23.24 (2024-02-19)**
+
+* Fixed issues:
+
+  * **Fixed** `3148 `_: Table extraction - vertical text not handled correctly
+  * **Fixed** `3179 `_: Table Detection: Incorrect Separation of Vector Graphics Clusters
+  * **Fixed** `3180 `_: Cannot show optional content group: AttributeError: module 'fitz.mupdf' has no attribute 'pdf_array_push_drop'
+
+* Other:
+
+  * Be able to test a system install using `sudo pip install` instead of a venv.
+
+
+**Changes in version 1.23.23 (2024-02-18)**
+
+* Fixed issues:
+
+  * **Fixed** `3126 `_: Initialising Archive with a pathlib.Path fails.
+  * **Fixed** `3131 `_: Calling the next attribute of an Annot raises a "No attribute .parent" warning
+  * **Fixed** `3134 `_: Using an IRect as clip parameter in Page.get_pixmap no longer works since 1.23.9
+  * **Fixed** `3140 `_: PDF document stays in use after closing
+  * **Fixed** `3150 `_: doc.select() hangs on this doc.
+  * **Fixed** `3163 `_: AssertionError on using fitz.IRect
+  * **Fixed** `3177 `_: fitz.Pixmap(None, pix) Unrecognised args for constructing Pixmap
+
+* Other:
+
+  * Improved `Document.select()` by using the new MuPDF function
+    `pdf_rearrange_pages()`. This is a more complete (and faster)
+    implementation: not only are pages rearranged, but consequential changes
+    are also made to the table of contents, to links to removed pages and to
+    affected entries in the Optional Content definitions.
+  * `TextWriter.appendv()`: added `small_caps` arg.
+  * Fixed some valgrind errors with MuPDF master.
+  * Fixed `Document.insert_image()` when built with MuPDF master.
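+
+The improved `Document.select()` described above can be exercised like this; a
+minimal sketch, with illustrative file names::
+
+    import fitz  # PyMuPDF (1.23.x module name)
+
+    doc = fitz.open("report.pdf")  # illustrative input file
+    # Keep only the first three pages, in reverse order.  The table of contents
+    # and links to removed pages are adjusted as described above.
+    doc.select([2, 1, 0])
+    doc.save("reordered.pdf")
+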
+
+
+**Changes in version 1.23.22 (2024-02-12)**
+
+* Fixed issues:
+
+  * **Fixed** `3143 `_: Difference in decoding of OCGs names between doc.get_ocgs() and page.get_drawings()
+
+  * **Fixed** `3139 `_: Pixmap resizing needs positional arg "clip" - even if None.
+
+* Other:
+
+  * Removed the use of MuPDF function `fz_image_size()` from PyMuPDF.
+
+
+**Changes in version 1.23.21 (2024-02-01)**
+
+* Fixed issues:
+
+* Other:
+
+  * Fixed bug in set_xml_metadata(), PR `3112 <https://github.com/pymupdf/PyMuPDF/pull/3112>`_: Fix pdf_add_stream metadata error
+  * Fixed lack of `.parent` member in `TextPage` from `Annot.get_textpage()`.
+  * Fixed bug in `Page.add_widget()`.
+
+
+**Changes in version 1.23.20 (2024-01-29)**
+
+* Bug fixes:
+
+  * **Fixed** `3100 `_: Wrong internal property accessed in get_xml_metadata
+
+* Other:
+
+  * Significantly improved speed of `Document.get_toc()`.
+
+
+**Changes in version 1.23.19 (2024-01-25)**
+
+* Bug fixes:
+
+  * **Fixed** `3087 `_: Exception in insert_image with mask specified
+  * **Fixed** `3094 `_: TypeError: '<' not supported between instances of 'FzLocation' and 'int' in doc.delete_pages
+
+* Other:
+
+  * When finding tables:
+
+    * Allow addition of user-defined "virtual" vector graphics when finding tables.
+    * Confirm that the enveloping bboxes of vector graphics are inside the clip rectangle.
+    * Avoid slow finding of rectangle intersections.
+
+  * Added `Font.bbox` property.
+
+
+**Changes in version 1.23.18 (2024-01-23)**
+
+* Bug fixes:
+
+  * **Fixed** `3081 `_: doc.close() not closing the document
+
+* Other:
+
+  * Reduced size of sdist to fit on pypi.org (by reducing size of two test files).
+  * Fix `Annot.file_info()` if no `Desc` item.
+
+
+**Changes in version 1.23.17 (2024-01-22)**
+
+* Bug fixes:
+
+  * **Fixed** `3062 `_: page_rotation_reset does not return page to original rotation
+  * **Fixed** `3070 `_: update_link(): AttributeError: 'Page' object has no attribute 'super'
+
+* Other:
+
+  * Fixed bug in `Page.links()` (PR #3075).
+  * Fixed bug in `Page.get_bboxlog()` with layers.
+  * Add support for timeouts in scripts/ and tests/run_compound.py.
+
+
+**Changes in version 1.23.16 (2024-01-18)**
+
+* Bug fixes:
+
+  * **Fixed** `3058 `_: Pixmap created from CMYK JPEG delivers RGB format
+
+* Other:
+
+  * In table detection strategy "lines_strict", exclude fill-only vector graphics.
+  * Fixed sysinstall test failure.
+  * In documentation, update feature matrix with item about text writing.
+
+
+**Changes in version 1.23.15 (2024-01-16)**
+
+* Bug fixes:
+
+  * **Fixed** `3050 `_: python3.9 pix.set_pixel has something wrong in c.append( ord(i))
+
+* Other:
+
+  * Improved docs for Page.find_tables().
+
+
+**Changes in version 1.23.14 (2024-01-15)**
+
+* Bug fixes:
+
+  * **Fixed** `3038 `_: JM_pixmap_from_display_list > Assertion Error : Checking for wrong type
+  * **Fixed** `3039 `_: Issue with doc.close() not closing the document in PyMuPDF
+
+* Other:
+
+  * Ensure valid "re" rectangles in `Page.get_drawings()` with derotated pages.
+
+
+**Changes in version 1.23.13 (2024-01-15)**
+
+* Bug fixes:
+
+  * **Fixed** `2979 `_: list index out of range in to_pandas()
+  * **Fixed** `3001 `_: Calling find_tables() on one document alters the bounding boxes of a subsequent document
+
+* Other:
+
+  * Fixed `Rect.height` and `Rect.width` to never return negative values.
+  * Fixed `TextPage.extractIMGINFO()`'s returned `dictkey_yres` value.
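+
+Several of the 1.23.x entries above touch the table module (`find_tables()`,
+`to_pandas()`).  A minimal usage sketch, assuming a PDF with at least one table
+on its first page and pandas installed; file names are illustrative::
+
+    import fitz  # PyMuPDF
+
+    doc = fitz.open("tables.pdf")      # illustrative input file
+    page = doc[0]
+
+    tabs = page.find_tables()          # TableFinder object
+    if tabs.tables:                    # list of detected Table objects
+        table = tabs.tables[0]
+        print(table.extract())         # rows as lists of cell strings (or None)
+        print(table.to_pandas())       # DataFrame; requires pandas
+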
+ + +**Changes in version 1.23.12 (2024-01-12)** + +* * **Fixed** `3027 `_: Page.get_text throws Attribute Error for 'parent' + + +**Changes in version 1.23.11 (2024-01-12)** + +* Fixed some Pixmap construction bugs. +* Fixed Pixmap.yres(). + + +**Changes in version 1.23.10 (2024-01-12)** + +* Bug fixes: + + * **Fixed** `3020 `_: Can't resize a PixMap + +* Other: + + * Fixed Page.delete_image(). + + +**Changes in version 1.23.9 (2024-01-11)** + +* Default to new "rebased" implementation. + + * The old "classic" implementation is available with `import fitz_old as fitz`. + * For more information about why we are changing to the rebased implementation, + see: https://github.com/pymupdf/PyMuPDF/discussions/2680 + +* Use MuPDF-1.23.9. + +* Bug fixes (rebased implementation only): + + * **Fixed** `2911 `_: Page.derotation_matrix returns a tuple instead of a Matrix with rebased implementation + * **Fixed** `2919 `_: Rebased version: KeyError in resolve_names when merging pdfs + * **Fixed** `2922 `_: New feature that allows inserting named-destination links doesn't work + * **Fixed** `2943 `_: ZeroDivisionError: float division by zero when use apply_redactions() + * **Fixed** `2950 `_: Shelling out to pip during tests is problematic + * **Fixed** `2954 `_: Replacement unicode character in text extraction + * **Fixed** `2957 `_: apply_redactions() moving text + * **Fixed** `2961 `_: Passing a string as a page number raises IndexError instead of TypeError. + * **Fixed** `2969 `_: annot.next throws AttributeError + * **Fixed** `2978 `_: 1.23.9rc1: module 'fitz.mupdf' has no attribute 'fz_copy_pixmap_rect' + + * **Fixed** `2907 `_: segfault trying to call clean_contents on certain pdfs with python 3.12 + * **Fixed** `2905 `_: SystemError: returned a result with an exception set + * **Fixed** `2742 `_: Segmentation Fault when inserting three (but not two) copies of the same source page into one destination page + +* Other: + + * Add optional setting of opacity to `Page.insert_htmlbox()`. + * Fixed issue with add_redact_annot() mentioned in #2934. + * Fixed `Page.rotation()` to return 0 for non-PDF documents instead of raising an exception. + * Fixed internal quad detection to cope with any Python sequence. + * Fixed rebased `fitz.pymupdf_version_tuple` - was previously set to mupdf version. + * Improved support for Linux system installs, including adding regular testing on Github. + * Add missing `flake8` to `scripts/gh_release.py:test_packages`. + * Use newly public functions in MuPDF-1.23.8. + * Improved `scripts/test.py` to help investigation of MuPDF issues. + + +**Changes in version 1.23.8 (2023-12-19)** + +* Bug fixes (rebased implementation only): + + * **Fixed** `2634 `_: get_toc and set_toc do not behave consistently for rotated pages + * **Fixed** `2861 `_: AttributeError in getLinkDict during PDF Merge + * **Fixed** `2871 `_: KeyError in getLinkDict during PDF merge + * **Fixed** `2886 `_: Error in Skeleton for Named Link Destinations + +* Bug fixes (rebased and classic implementations): + + * **Fixed** `2885 `_: pymupdf find tables too slow + +* Other: + + * Rebased implementation: + + * `Page.insert_htmlbox()`: new, much more powerful alternative to `Page.insert_textbox()` or `TextWriter.fill_textbox()`, using `Story`. + * `Story.fit*()`: new methods for fitting a Story into an expanded rect. + * `Story.write_with_links()`: add support for external links. + * `Document.language()`: fixed to use MuPDF's new `mupdf.fz_string_from_text_language2()`. + * `Document.subset_fonts()` - fixed. 
+ * Fixed internal `Archive._add_treeitem()` method. + * Fixed `fitz_new.__doc__` to contain PyMuPDF and Python version information, and OS name. + * Removed use of `(*args, **kwargs)` in API, we now specify keyword args explicitly. + * Work with new MuPDF Python exception classes. + + * Fixed bug where `button_states()` returns None when `/AP` points to an indirect object. + * Fixed pillow test to not ignore all errors, and install pillow when testing. + * Added test for `fitz.css_for_pymupdf_font()` (uses package `pymupdf-fonts`). + * Simplified Github Actions test specifications. + * Updated `tests/README.md`. + + +**Changes in version 1.23.7 (2023-11-30)** + +* Bug fixes in rebased implementation, not fixed in classic implementation: + + * **Fixed** `2232 `_: Geometry helper classes should support keyword arguments + * **Fixed** `2788 `_: Problem with get_toc in pymupdf 1.23.6 + * **Fixed** `2791 `_: Experiencing small memory leak in save() + +* Bug fixes (rebased and classic implementations): + + * **Fixed** `2736 `_: Failure when set cropbox with mediabox negative value + * **Fixed** `2749 `_: RuntimeError: cycle in structure tree + * **Fixed** `2753 `_: Story.write_with_links will ignore everything after the first "page break" in the HTML. + * **Fixed** `2812 `_: find_tables on landscape page generates reversed text + * **Fixed** `2829 `_: [cannot create /Annot for kind] is still printed despite #2345 is closed. + * **Fixed** `2841 `_: Unexpected KeyError when using scrub with fitz_new + +* Use MuPDF-1.23.7. + +* Other: + + * Rebased implementation: + + * Added flake8 code checking to test suite, and made various fixes. + * Disable diagnostics during Document constructor to match classic implementation. + + * Additional fix to `2553 `_: Invalid characters in versions >= 1.22 + * Fixed `MuPDF Bug 707324 `_: Story: HTML table row background color repeated incorrectly + * Added `scripts/test.py`, for simple build+test of PyMuPDF git checkout. + * Added `fitz.pymupdf_version_tuple`, e.g. `(1, 23, 6)`. + * Restored mistakenly-reverted fix for `2345 `_: Turn off print statements in utils.py + * Include any trailing `... repeated times...` text in warnings returned by `mupdf_warnings()` (rebased only). + + + +**Changes in version 1.23.6 (2023-11-06)** + +* Bug fixes: + + * **Fixed** `2553 `_: Invalid characters in versions >= 1.22 + * **Fixed** `2608 `_: Incorrect utf32 text extraction (high & low surrogates are split) + * **Fixed** `2710 `_: page.rect and text location wrong / differing from older version + * **Fixed** `2774 `_: wrong encoding for "\?" character when sort=True + * **Fixed** `2775 `_: fitz_new does not work with python3.10 or earlier + * **Fixed** `2777 `_: With fitz_new, wrong type for Page.mediabox + +* Other: + + * Use MuPDF-1.23.5. + * Added Document.resolve_names() (rebased implementation only). + + +**Changes in version 1.23.5 (2023-10-11)** + +* Bug fixes: + + * **Fixed** `2341 `_: Handling negative values in the zoom section for LINK_GOTO in linkDest + * **Fixed** `2522 `_: Typo in set_layer() - NameError: name 'f' is not defined + * **Fixed** `2548 `_: Fitz freezes on some PDFs when calling the fitz.Page.get_text_blocks method. 
+ * **Fixed** `2596 `_: save(garbage=3) breaks get_pixmap() with side effect + * **Fixed** `2635 `_: "clean=True" makes objects invisible in the pdf + * **Fixed** `2637 `_: Page.insert_textbox incorrectly handles the last word if it starts a new line + * **Fixed** `2699 `_: extract paragraph with below table + * **Fixed** `2703 `_: Wrong fontsize calculation in corner cases ("page.get_texttrace()") + * **Fixed** `2710 `_: page.rect and text location wrong / differing from older version + * **Fixed** `2723 `_: When will a Python 3.12 wheel be available? + * **Fixed** `2730 `_: persistent get_text() formatting + +* Other: + + * Use MuPDF-1.23.4. + * Fix optimisation flags with system installs. + * Fixed the problem that the clip parameter does not take effect during table recognition + * Support Pillow mode "RGBa" + * Support extra word delimiters + * Support checking valid PDF name objects + + +**Changes in version 1.23.4 (2023-09-26)** + +* Improved build instructions. +* Fixed Tesseract in rebased implementation. +* Improvements to build/install with system MuPDF. +* Fixed Pyodide builds. +* Fixed rebased bug in _insert_image(). + +* Bug fixes: + + * **Fixed** `2556 `_: Segmentation fault at caling get_cdrawings(extended=True) + * **Fixed** `2637 `_: Page.insert_textbox incorrectly handles the last word if it starts a new line + * **Fixed** `2683 `_: Windows sdist build failure - non-quoting of path and using UNIX which command + * **Fixed** `2691 `_: Page.get_textpage_ocr() bug in rebased fitz_new version + * **Fixed** `2692 `_: Page.get_pixmap(clip=Rect()) bug in rebased fitz_new version + + +**Changes in version 1.23.3 (2023-08-31)** + +* Fixed use of Tesseract for OCR. + + +**Changes in version 1.23.2 (2023-08-28)** + +* **Fixed** `#2613 `_: release 1.23.0 not MacOS-arm64 compatible + + +**Changes in version 1.23.1 (2023-08-24)** + +* Updated README and package summary description. + +* + Fixed a problem on some Linux installations with Python-3.10 + (and possibly earlier versions) where `import fitz` failed with + `ImportError: libcrypt.so.2: cannot open shared object file: No such + file or directory`. + +* + Fixed `incompatible architecture` error on MacOS arm64. + +* + Fixed installation warning from Poetry about missing entry in wheels' + RECORD files. + + +**Changes in version 1.23.0 (2023-08-22)** + +* Add method `find_tables()` to the `Page` object. + + This allows locating tables on any supported document page, and + extracting table content by cell. + +* New "rebased" implementation of PyMuPDF. + + The rebased implementation is available as Python module + `fitz_new`. It can be used as a drop-in replacement with `import + fitz_new as fitz`. + +* + Python-independent MuPDF libraries are now in a second wheel called + `PyMuPDFb` that will be automatically installed by pip. + + This is to save space on pypi.org - a full release only needs one + `PyMuPDFb` wheel for each OS. + +* Bug fixes: + + * **Fixed** `#2542 `_: fitz.utils.scrub AttributeError Annot object has no attribute fileUpd inside + * **Fixed** `#2533 `_: get_texttrace returned a incorrect character bbox + * **Fixed** `#2537 `_: Validation when setting a grouped RadioButton throws a RuntimeError: path to 'V' has indirects + +* Other changes: + + * Dropped support for Python-3.7. + + * Fix for wrong page / annot `/Contents` cleaning. + + We need to set `pdf_filter_options::no_update` to zero. + + * Added new function get_tessdata(). + + * Cope with problem `/Annot` arrays. 
+ + When copying page annotations in method Document.insert_pdf we + previously did not check the validity of members of the `/Annots` + array. For faulty members (like null or non-dictionary items) this + could cause unnecessary exceptions. This fix implements more checks + and skips such array items. + + * Additional annotation type checks. + + We did not previously check for annotation type when getting / + setting annotation border properties. This is now checked in + accordance with MuPDF. + + * Increase fault tolerance. + + Avoid exceptions in method `insert_pdf()` when source pages contains + invalid items in the `/Annots` array. + + * Return empty border dict for applicable annots. + + We previously were returning a non-empty border dictionary even for + non-applicable annotation types. We now return the empty dictionary + `{}` in these cases. This requires some corresponding changes in the + annotation `.update()` method, namely for dashes and border width. + + * Restrict `set_rect` to applicable annot types. + + We were insufficiently excluding non-applicable annotation types + from `set_rect()` method. We now let MuPDF catch unsupported + annotations and return `False` in these cases. + + * Wrong fontsize computation in `page.get_texttrace()`. + + When computing the font size we were using the final text + transformation matrix, where we should have taken `span->trm` + instead. This is corrected here. + + * Updates to cope with changes to latest MuPDF. + + `pdf_lookup_anchor()` has been removed. + + * Update fill_textbox to better respect rect.width + + The function norm_words in fill_textbox had a bug in its last + loop, appending n+1 characters when actually measuring width of n + characters. It led to a bug in fill_texbox when you tried to write + a single word mostly composed of "wide" letters (M,m, W, w...), + causing the written text to exceed the given rect. + + The fix was just to replace n+1 by n. + + * Add `script_focus` and `script_blur` options to widget. + + + +**Changes in version 1.22.5 (2023-06-21)** + +* This release uses ``MuPDF-1.22.2``. + +* Bug fixes: + + * **Fixed** `#2365 `_: Incorrect dictionary values for type "fs" drawings. + * **Fixed** `#2391 `_: Check box automatically uncheck when we update same checkbox more than 1 times. + * **Fixed** `#2400 `_: Gaps within text of same line not filled with spaces. + * **Fixed** `#2404 `_: Blacklining an image in PDF won't remove underlying content in version 1.22.X. + * **Fixed** `#2430 `_: Incorrectly reducing ref count of Py_None. + * **Fixed** `#2450 `_: Empty fill color and fill opacity for paths with fill and stroke operations with 1.22.* + * **Fixed** `#2462 `_: Error at "get_drawing(extended=True )" + * **Fixed** `#2468 `_: Decode error when trying to get drawings + * **Fixed** `#2710 `_: page.rect and text location wrong / differing from older version + * **Fixed** `#2723 `_: When will a Python 3.12 wheel be available? + +* New features: + + * **Changed** Annotations now support "cloudy" borders. + The :attr:`Annot.border` property has the new item `clouds`, + and method :meth:`Annot.set_border` supports the corresponding `clouds` argument. + + * **Changed** Radio button widgets in the same RB group + are now consistently updated **if the group is defined in the standard way**. + + * **Added** Support for the `/Locked` key in PDF Optional Content. + This array inside the catalog entry `/OCProperties` can now be extracted and set. + + * **Added** Support for new parameter `tessdata` in OCR functions. 
+ New function :meth:`get_tessdata` locates the language support folder if Tesseract is installed. + + + +**Changes in version 1.22.3 (2023-05-10)** + +* This release uses ``MuPDF-1.22.0``. + +* Bug fixes: + + * **Fixed** `#2333 `_: Unable to set any of button radio group in form + + +**Changes in version 1.22.2 (2023-04-26)** + +* This release uses ``MuPDF-1.22.0``. + +* Bug fixes: + + * **Fixed** `#2369 `_: Image extraction bugs with newer versions + + +**Changes in version 1.22.1 (2023-04-18)** + +* This release uses ``MuPDF-1.22.0``. + +* Bug fixes: + + * **Fixed** `#2345 `_: Turn off print statements in utils.py + * **Fixed** `#2348 `_: extract_image returns an extension "flate" instead of "png" + * **Fixed** `#2350 `_: Can not make widget (checkbox) to read-only by adding flags PDF_FIELD_IS_READ_ONLY + * **Fixed** `#2355 `_: 1.22.0 error when using get_toc (AttributeError: 'SwigPyObject' object has no attribute) + + +**Changes in version 1.22.0 (2023-04-14)** + +* This release uses ``MuPDF-1.22.0``. + +* Behavioural changes: + + * Text extraction now includes glyphs that overlap with clip rect; previously + they were included only if they were entirely contained within the clip + rect. + +* Bug fixes: + + * **Fixed** `#1763 `_: Interactive(smartform) form PDF calculation not working in pymupdf + * **Fixed** `#1995 `_: RuntimeError: image is too high for a long paged pdf file when trying + * **Fixed** `#2093 `_: Image in pdf changes color after applying redactions + * **Fixed** `#2108 `_: Redaction removing more text than expected + * **Fixed** `#2141 `_: Failed to read JPX header when trying to get blocks + * **Fixed** `#2144 `_: Replace image throws an error + * **Fixed** `#2146 `_: Wrong Handling of Reference Count of "None" Object + * **Fixed** `#2161 `_: Support adding images as pages directly + * **Fixed** `#2168 `_: ``page.add_highlight_annot(start=pointa, stop=pointb)`` not working + * **Fixed** `#2173 `_: Double free of ``Colorspace`` used in ``Pixmap`` + * **Fixed** `#2179 `_: Incorrect documentation for ``pixmap.tint_with()`` + * **Fixed** `#2208 `_: Pushbutton widget appears as check box + * **Fixed** `#2210 `_: ``apply_redactions()`` move pdf text to right after redaction + * **Fixed** `#2220 `_: ``Page.delete_image()`` | object has no attribute ``is_image`` + * **Fixed** `#2228 `_: open some pdf cost too much time + * **Fixed** `#2238 `_: Bug - can not extract data from file in the newest version 1.21.1 + * **Fixed** `#2242 `_: Python quits silently in ``Story.element_positions()`` if callback function prototype is wrong + * **Fixed** `#2246 `_: TextWriter write text in a wrong position + * **Fixed** `#2248 `_: After redacting the content, the position of the remaining text changes + * **Fixed** `#2250 `_: docs: unclear or broken link in page.rst + * **Fixed** `#2251 `_: mupdf_display_errors does not apply to Pixmap when loading broken image + * **Fixed** `#2270 `_: ``Annot.get_text("words")`` - doesn't return the first line of words + * **Fixed** `#2275 `_: insert_image: document that rotations are counterclockwise + * **Fixed** `#2278 `_: Can not make widget (checkbox) to read-only by adding flags PDF_FIELD_IS_READ_ONLY + * **Fixed** `#2290 `_: Different image format/data from Page.get_text("dict") and Fitz.get_page_images() + * **Fixed** `#2293 `_: 68 failed tests when installing from sdist on my box + * **Fixed** `#2300 `_: Too much recursion in tree (parents), makes program terminate + * **Fixed** `#2322 `_: add_highlight_annot using clip generates "A Number is Out 
of Range" error in PDF + +* Other: + + * Add key "/AS (Yes)" to the underlying annot object of a selected button form field. + + * Remove unused ``Document`` methods ``has_xref_streams()`` and + ``has_old_style_xrefs()`` as MuPDF equivalents have been removed. + + * Add new ``Document`` methods and properties for getting/setting + ``/PageMode``, ``/PageLayout`` and ``/MarkInfo``. + + * New ``Document`` property ``version_count``, which contains the number of + incremental saves plus one. + + * New ``Document`` property ``is_fast_webaccess`` which tells whether the + document is linearized. + + * ``DocumentWriter`` is now a context manager. + + * Add support for ``Pixmap`` JPEG output. + + * Add support for drawing rectangles with rounded corners. + + * ``get_drawings()``: added optional ``extended`` arg. + + * Fixed issue where trace devices' state was not being initialised + correctly; data returned from things like ``fitz.Page.get_texttrace()`` + might be slightly altered, e.g. ``linewidth`` values. + + * Output warning to ``stderr`` if it looks like we are being used with + current directory containing an invalid ``fitz/`` directory, because + this can break import of ``fitz`` module. For example this happens + if one attempts to use ``fitz`` when current directory is a PyMuPDF + checkout. + +* Documentation: + + * General rework: + + * Introduces a new home page and new table of contents. + * Structural update to include new About section. + * Comparison & performance graphing. + * Includes performance methodology in appendix. + * Updates conf.py to understand single back-ticks as code. + * Converts double back-ticks to single back-ticks. + * Removes redundant files. + + * Improve ``insert_file()`` documentation. + + * ``get_bboxlog()``: aded optional ``layers`` to ``get_bboxlog()``. + * ``Page.get_texttrace()``: add new dictionary key ``layer``, name of Optional Content Group. + + * Mention use of Python venv in installation documentation. + + * Added missing fix for #2057 to release 1.21.1's changelog. + + * Fixes many links to the PyMuPDF-Utilities repo scripts. + + * Avoid duplication of ``changes.txt`` and ``docs/changes.rst``. + +* Build + + * Added ``pyproject.toml`` file to improve builds using pip etc. + + + +**Changes in Version 1.21.1 (2022-12-13)** + +* This release uses ``MuPDF-1.21.1``. + +* Bug fixes: + + * **Fixed** `#2110 `_: Fully embedded font is extracted only partially if it occupies more than one object + * **Fixed** `#2094 `_: Rectangle Detection Logic + * **Fixed** `#2088 `_: Destination point not set for named links in toc + * **Fixed** `#2087 `_: Image with Filter "[/FlateDecode/JPXDecode]" not extracted + * **Fixed** `#2086 `_: Document.save() owner_pw & user_pw has buffer overflow bug + * **Fixed** `#2076 `_: Segfault in fitz.py + * **Fixed** `#2057 `_: Document.save garbage parameter not working in PyMuPDF 1.21.0 + * **Fixed** `#2051 `_: Missing DPI Parameter + * **Fixed** `#2048 `_: Invalid size of TextPage and bbox with newest version 1.21.0 + * **Fixed** `#2045 `_: SystemError: returned a result with an error set + * **Fixed** `#2039 `_: 1.21.0 fails to build against system libmupdf + * **Fixed** `#2036 `_: Archive::Archive defined twice + +* Other + + * Swallow "&zoom=nan" in link uri strings. + * Add new Page utility methods ``Page.replace_image()`` and ``Page.delete_image()``. + +* Documentation: + + * `#2040 `_: Added note about test failure with non-default build of MuPDF, to ``tests/README.md``. 
+ * `#2037 `_: In ``docs/installation.rst``, mention incompatibility with chocolatey.org on Windows. + * `#2061 `_: Fixed description of ``Annot.file_info``. + * `#2065 `_: Show how to insert internal PDF link. + * Improved description of building from source without an sdist. + * Added information about running tests. + * `#2084 `_: Fixed broken link to PyMuPDF-Utilities. + + +**Changes in Version 1.21.0 (2022-11-8)** + +* This release uses ``MuPDF-1.21.0``. + +* New feature: Stories. + +* Added wheels for Python-3.11. + +* Bug fixes: + + * **Fixed** `#1701 `_: Broken custom image insertion. + * **Fixed** `#1854 `_: `Document.delete_pages()` declines keyword arguments. + * **Fixed** `#1868 `_: Access Violation Error at `page.apply_redactions()`. + * **Fixed** `#1909 `_: Adding text with `fontname="Helvetica"` can silently fail. + * **Fixed** `#1913 `_: `draw_rect()`: does not respect width if color is not specified. + * **Fixed** `#1917 `_: `subset_fonts()`: make it possible to silence the stdout. + * **Fixed** `#1936 `_: Rectangle detection can be incorrect producing wrong output. + * **Fixed** `#1945 `_: Segmentation fault when saving with `clean=True`. + * **Fixed** `#1965 `_: `pdfocr_save()` Hard Crash. + * **Fixed** `#1971 `_: Segmentation fault when using `get_drawings()`. + * **Fixed** `#1946 `_: `block_no` and `block_type` switched in `get_text()` docs. + * **Fixed** `#2013 `_: AttributeError: 'Widget' object has no attribute '_annot' in delete widget. + +* Misc changes to core code: + + * Fixed various compiler warnings and a sequence-point bug. + * Added support for Memento builds. + * Fixed leaks detected by Memento in test suite. + * Fixed handling of exceptions in set_name() and set_rect(). + * Allow build with latest MuPDF, for regular testing of PyMuPDF master. + * Cope with new MuPDF exceptions when setting rect for some Annot types. + * Reduced cosmetic differences between MuPDF's config.h and PyMuPDF's _config.h. + * Cope with various changes to MuPDF API. + +* Other: + + * Fixed various broken links and typos in docs. + * Mention install of `swig-python` on MacOS for #875. + * Added (untested) wheels for macos-arm64. + + + + +**Changes in Version 1.20.2** + +* This release uses ``MuPDF-1.20.3``. + +* **Fixed** `#1787 `_. + Fix linking issues on Unix systems. + +* **Fixed** `#1824 `_. + SegFault when applying redactions overlapping a transparent image. (Fixed + in ``MuPDF-1.20.3``.) + +* Improvements to documentation: + + * Improved information about building from source in ``docs/installation.rst``. + * Clarified memory allocation setting ``JM_MEMORY` in ``docs/tools.rst``. + * Fixed link to PDF Reference manual in ``docs/app3.rst``. + * Fixed building of html documentation on OpenBSD. + * Moved old ``docs/faq.rst`` into separate ``docs/recipes-*`` files. + +* Removed some unused files and directories: + + * ``installation/`` + * ``docs/wheelnames.txt`` + + +**Changes in Version 1.20.1** + +* **Fixed** `#1724 `_. + Fix for building on FreeBSD. + +* **Fixed** `#1771 `_. + `linkDest()` had a broken call to `re.match()`, introduced in 1.20.0. + +* **Fixed** `#1751 `_. + `get_drawings()` and `get_cdrawings()` previously always returned with `closePath=False`. + +* **Fixed** `#1645 `_. + Default FreeText annotation text color is now black. + +* Improvements to sphinx-generated documentation: + + * Use readthedocs theme with enhancements. + * Renamed the `.txt` files to have `.rst` suffixes. 
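+
+The `get_drawings()` fix above (``closePath``) concerns the per-path dictionaries
+returned by that method.  A minimal sketch of inspecting them; the file name is
+illustrative::
+
+    import fitz  # PyMuPDF
+
+    doc = fitz.open("drawing.pdf")     # illustrative input file
+    page = doc[0]
+
+    for path in page.get_drawings():   # one dict per vector graphics path
+        # enveloping rectangle and the close-path flag fixed above
+        print(path["rect"], path["closePath"])
+        for item in path["items"]:     # e.g. ("l", p1, p2) or ("re", rect, ...)
+            print("   ", item[0])
+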
+
+------
+
+**Changes in Version 1.20.0**
+
+This release uses ``MuPDF-1.20.0``, released 2022-06-15.
+
+* Cope with new MuPDF link uri format, changed from ``#,,`` to ``#page=&zoom=,,``.
+
+  * In ``tests/test_insertpdf.py``, use new reference output ``joined-1.20.pdf``. We also check that new output values are approximately the same as the old ones.
+
+* **Fixed** `#1738 `_. Leak of `pdf_graft_map`.
+  Also fixed a SEGV issue that this seemed to expose, caused by incorrect freeing of the underlying fz_document.
+
+* **Fixed** `#1733 `_. Fixed ownership of `Annotation.get_pixmap()`.
+
+Changes to build/release process:
+
+* If pip builds from source because an appropriate wheel is not available, we no longer require MuPDF to be pre-installed. Instead the required MuPDF source is embedded in the sdist and automatically built into PyMuPDF.
+
+* Various changes to ``setup.py`` to download the required MuPDF release as required. See comments at the start of setup.py for details.
+
+* Added ``.github/workflows/build_wheels.yml`` to control building of wheels on Github.
+
+------
+
+**Changes in Version 1.19.6**
+
+* **Fixed** `#1620 `_. The :ref:`TextPage` created by :meth:`Page.get_textpage` will now be freed correctly (removed memory leak).
+* **Fixed** `#1601 `_. Document open errors should now be more concise and easier to interpret. In the course of this, two PyMuPDF-specific Python exceptions have been **added:**
+
+  - ``EmptyFileError`` -- raised when trying to create a :ref:`Document` (``fitz.open()``) from an empty file or zero-length memory.
+  - ``FileDataError`` -- raised when MuPDF encounters irrecoverable document structure issues.
+
+* **Added** :meth:`Page.load_widget` to load a widget given a PDF field's xref.
+
+* **Added** dictionary :attr:`pdfcolor`, which provides the roughly 500 colors defined as PDF color values, keyed by the lower-case color name.
+
+* **Added** algebra functionality to the :ref:`Quad` class. These objects can now also be added and subtracted among themselves, and be multiplied by numbers and matrices.
+
+* **Added** new constants defining the default text extraction flags for more comfortable handling. Their naming convention is like :data:`TEXTFLAGS_WORDS` for ``page.get_text("words")``. See :ref:`text_extraction_flags`.
+
+* **Changed** :meth:`Page.annots` and :meth:`Page.widgets` to detect and prevent (illegally) reloading the page inside the iterator loops via :meth:`Document.reload_page`, which would bring down the interpreter. Documented clean ways to do annotation and widget mass updates within properly designed loops.
+
+* **Changed** several internal utility functions to become standalone ("SWIG inline") as opposed to being part of the :ref:`Tools` class. This, among other things, increases the performance of geometry object creation.
+
+* **Changed** :meth:`Document.update_stream` to always accept stream updates - whether or not the dictionary object behind the xref already is a stream. Thus the former ``new`` parameter is now ignored and will be removed in v1.20.0.
+
+
+------
+
+**Changes in Version 1.19.5**
+
+* **Fixed** `#1518 `_. A limited "fix": in some cases, rectangles and quadruples were not correctly encoded to support re-drawing by :ref:`Shape`.
+
+* **Fixed** `#1521 `_. This had the same ultimate cause as issue #1510.
+
+* **Fixed** `#1513 `_. Some Optional Content functions did not support non-ASCII characters.
+
+* **Fixed** `#1510 `_. Support more soft-mask image subtypes.
+
+* **Fixed** `#1507 `_.
+  Immunize against items in the outlines chain that are ``"null"`` objects.
+
+* **Fixed** re-opened `#1417 `_ ("too many open files"). This was due to insufficient calls to MuPDF's ``fz_drop_document()``. This also fixes `#1550 `_.
+
+* **Fixed** several undocumented issues in relation to incorrectly setting the text span origin :data:`point_like`.
+
+* **Fixed** undocumented error computing the character bbox in method :meth:`Page.get_texttrace` when text is **flipped** (as opposed to just rotated).
+
+* **Added** items to the dictionary returned by :meth:`image_properties`: ``orientation`` and ``transform`` report the natural image orientation (EXIF data).
+
+* **Added** method :meth:`Document.xref_copy`. It will make a given target PDF object an exact copy of a source object.
+
+
+------
+
+**Changes in Version 1.19.4**
+
+
+* **Fixed** `#1505 `_. Immunize against circular outline items.
+
+* **Fixed** `#1484 `_. Correct CropBox coordinates are now returned in all situations.
+
+* **Fixed** `#1479 `_.
+
+* **Fixed** `#1474 `_. TextPage objects are now properly deleted again.
+
+* **Added** :ref:`Page` methods and attributes for PDF ``/ArtBox``, ``/BleedBox``, ``/TrimBox``.
+
+* **Added** global attribute :attr:`TESSDATA_PREFIX` for easy checking of OCR support.
+
+* **Changed** :meth:`Document.xref_set_key` such that dictionary keys will physically be removed if set to value ``"null"``.
+
+* **Changed** :meth:`Document.extract_font` to optionally return a dictionary (instead of a tuple).
+
+------
+
+**Changes in Version 1.19.3**
+
+This patch version implements minor improvements for :ref:`Pixmap` and also some important fixes.
+
+* **Fixed** `#1351 `_. Reverted code that introduced the memory growth in v1.18.15.
+
+* **Fixed** `#1417 `_. Developed a workaround for the growth of open file handles when using :meth:`Document.insert_pdf`.
+
+* **Fixed** `#1418 `_. Developed a workaround for memory growth when using :meth:`Document.insert_pdf`.
+
+* **Fixed** `#1430 `_. Developed a workaround for issues with mass pixmap generation of document pages.
+
+* **Fixed** `#1433 `_. Solves a bbox error for some Type 3 fonts in PyMuPDF text processing.
+
+* **Added** :meth:`Pixmap.color_topusage` to determine the share of the most frequently used color. Solves `#1397 `_.
+
+* **Added** :meth:`Pixmap.warp`, which makes a new pixmap from a given arbitrary convex quad inside the pixmap.
+
+* **Added** :attr:`Annot.irt_xref` and :meth:`Annot.set_irt_xref` to inquire or set the `/IRT` ("In Response To") property of an annotation. Implements `#1450 `_.
+
+* **Added** :meth:`Rect.torect` and :meth:`IRect.torect`, which compute a matrix that transforms the rectangle to a given other rectangle.
+
+* **Changed** :meth:`Pixmap.color_count` to also return the count of each color.
+* **Changed** :meth:`Page.get_texttrace` to also return correct span and character bboxes if ``span["dir"] != (1, 0)``.
+
+------
+
+**Changes in Version 1.19.2**
+
+This patch version implements minor improvements for :meth:`Page.get_drawings` and also some important fixes.
+
+* **Fixed** `#1388 `_. Fixed intermittent memory corruption when inserting or updating annotations.
+
+* **Fixed** `#1375 `_. Inconsistencies between line numbers as returned by the "words" and the "dict" options of :meth:`Page.get_text` have been corrected.
+
+* **Fixed** `#1364 `_. The check for being a ``"rawdict"`` span in :meth:`recover_span_quad` now works correctly.
+
+* **Fixed** `#1342 `_. Corrected the check for rectangle infiniteness in :meth:`Page.show_pdf_page`.
+ +* **Changed** :meth:`Page.get_drawings`, :meth:`Page.get_cdrawings` to return an indicator on the area orientation covered by a rectangle. This implements `#1355 `_. Also, the recognition rate for rectangles and quads has been significantly improved. + +* **Changed** all text search and extraction methods to set the new ``flags`` option ``TEXT_MEDIABOX_CLIP`` to ON by default. That bit causes the automatic suppression of all characters that are completely outside a page's mediabox (in as far as that notion is supported for a document type). This eliminates the need for using ``clip=page.rect`` or similar for omitting text outside the visible area. + +* **Added** parameter ``"dpi"`` to :meth:`Page.get_pixmap` and :meth:`Annot.get_pixmap`. When given, parameter ``"matrix"`` is ignored, and a :ref:`Pixmap` with the desired dots per inch is created. + +* **Added** attributes :attr:`Pixmap.is_monochrome` and :attr:`Pixmap.is_unicolor` allowing fast checks of pixmap properties. Addresses `#1397 `_. + +* **Added** method :meth:`Pixmap.color_count` to determine the unique colors in the pixmap. + +* **Added** boolean parameter ``"compress"`` to PDF document method :meth:`Document.update_stream`. Addresses / enables solution for `#1408 `_. + +------ + +**Changes in Version 1.19.1** + +This is the first patch version to support MuPDF v1.19.0. Apart from one bug fix, it includes important improvements for OCR support and the option to **sort extracted text** to the standard reading order "from top-left to bottom-right". + +* **Fixed** `#1328 `_. "words" text extraction again returns correct ``(x0, y0)`` coordinates. + +* **Changed** :meth:`Page.get_textpage_ocr`: it now supports parameter ``dpi`` to control OCR quality. It is also possible to choose whether the **full page** should be OCRed or **only the images displayed** by the page. + +* **Changed** :meth:`Page.get_drawings` and :meth:`Page.get_cdrawings` to automatically convert colors to RGB color tuples. Implements `#1332 `_. Similar change was applied to :meth:`Page.get_texttrace`. + +* **Changed** :meth:`Page.get_text` to support a parameter ``sort``. If set to ``True`` the output is conveniently sorted. + + +------ + +**Changes in Version 1.19.0** + +This is the first version supporting MuPDF 1.19.*, published 2021-10-05. It introduces many new features compared to the previous version 1.18.*. + +PyMuPDF has now picked up integrated Tesseract OCR support, which was already present in MuPDF v1.18.0. + +* Supported images can be OCRed via their :ref:`Pixmap` which results in a 1-page PDF with a text layer. +* All supported document pages (i.e. not only PDFs), can be OCRed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCRing) that can be searched and extracted without restrictions. +* All this requires an independent installation of Tesseract. MuPDF actually (only) needs the location of Tesseract's ``"tessdata"`` folder, where its language support data are stored. This location must be available as environment variable ``TESSDATA_PREFIX``. + +A new MuPDF feature is **journalling PDF updates**, which is also supported by this PyMuPDF version. Changes may be logged, rolled back or replayed, allowing to implement a whole new level of control over PDF document integrity -- similar to functions present in modern database systems. 
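+
+A minimal sketch of the journalling facility described above; the text and
+coordinates are illustrative::
+
+    import fitz  # PyMuPDF
+
+    doc = fitz.open()                    # new, empty PDF
+    doc.journal_enable()                 # journalling must be switched on first
+
+    doc.journal_start_op("insert page")  # changes must happen inside an operation
+    page = doc.new_page()
+    page.insert_text((72, 72), "Hello")
+    doc.journal_stop_op()
+
+    print(doc.journal_position())        # (current step, total steps)
+    doc.journal_undo()                   # roll the operation back ...
+    doc.journal_redo()                   # ... or replay it again
+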
+ +A third feature (unrelated to the new MuPDF version) includes the ability to detect when page **objects cover or hide each other**. It is now e.g. possible to see that text is covered by a drawing or an image. + +* **Changed** terminology and meaning of important geometry concepts: Rectangles are now characterized as *finite*, *valid* or *empty*, while the definitions of these terms have also changed. Rectangles specifically are now thought of being "open": not all corners and sides are considered part of the retangle. Please do read the :ref:`Rect` section for details. + +* **Added** new parameter `"no_new_id"` to :meth:`Document.save` / :meth:`Document.tobytes` methods. Use it to suppress updating the second item of the document ``/ID`` which in PDF indicates that the original file has been updated. If the PDF has no ``/ID`` at all yet, then no new one will be created either. + +* **Added** a **journalling facility** for PDF updates. This allows logging changes, undoing or redoing them, or saving the journal for later use. Refer to :meth:`Document.journal_enable` and friends. + +* **Added** new :ref:`Pixmap` methods :meth:`Pixmap.pdfocr_save` and :meth:`Pixmap.pdfocr_tobytes`, which generate a 1-page PDF containing the pixmap as PNG image with OCR text layer. + +* **Added** :meth:`Page.get_textpage_ocr` which executes optical character recognition for the page, then extracts the results and stores them together with "normal" page content in a :ref:`TextPage`. Use or reuse this object in subsequent text extractions and text searches to avoid multiple efforts. The existing text search and text extraction methods have been extended to support a separately created textpage -- see next item. + +* **Added** a new parameter ``textpage`` to text extraction and text search methods. This allows reuse of a previously created :ref:`TextPage` and thus achieves significant runtime benefits -- which is especially important for the new OCR features. But "normal" text extractions can definitely also benefit. + +* **Added** :meth:`Page.get_texttrace`, a technical method delivering low-level text character properties. It was present before as a private method, but the author felt it now is mature enough to be officially available. It specifically includes a "sequence number" which indicates the page appearance build operation that painted the text. + +* **Added** :meth:`Page.get_bboxlog` which delivers the list of rectangles of page objects like text, images or drawings. Its significance lies in its sequence: rectangles intersecting areas with a lower index are covering or hiding them. + +* **Changed** methods :meth:`Page.get_drawings` and :meth:`Page.get_cdrawings` to include a "sequence number" indicating the page appearance build operation that created the drawing. + +* **Fixed** `#1311 `_. Field values in comboboxes should now be handled correctly. +* **Fixed** `#1290 `_. Error was caused by incorrect rectangle emptiness check, which is fixed due to new geometry logic of this version. +* **Fixed** `#1286 `_. Text alignment for redact annotations is working again. +* **Fixed** `#1287 `_. Infinite loop issue for non-Windows systems when applying some redactions has been resolved. +* **Fixed** `#1284 `_. Text layout destruction after applying redactions in some cases has been resolved. + +------ + +**Changes in Version 1.18.18 / 1.18.19** + +* **Fixed** issue `#1266 `_. Failure to set :attr:`Pixmap.samples` in important cases, was hotfixed in a new version 1.18.19. + +* **Fixed** issue `#1257 `_. 
Removing the read-only flag from PDF fields is now possible. + +* **Fixed** issue `#1252 `_. Now correctly specifying the ``zoom`` value for PDF link annotations. + +* **Fixed** issue `#1244 `_. Now correctly computing the transform matrix in :meth:`Page.get_image__bbox`. + +* **Fixed** issue `#1241 `_. Prevent returning artifact characters in :meth:`Page.get_textbox`, which happened in certain constellations. + +* **Fixed** issue `#1234 `_. Avoid creating infinite rectangles in corner cases -- :meth:`Page.get_drawings`, :meth:`Page.get_cdrawings`. + +* **Added** test data and test scripts to the source PyPI source distribution. + +------ + +**Changes in Version 1.18.17** + +Focus of this version are major performance improvements of selected functions. + +* **Fixed** issue `#1199 `_. Using a non-existing page number in :meth:`Document.get_page_images` and friends will no longer lead to segfaults. + +* **Changed** :meth:`Page.get_drawings` to now differentiate between "stroke", "fill" and combined paths. Paths containing more than one rectangle (i.e. "re" items) are now supported. Extracting "clipped" paths is now available as an option. + +* **Added** :meth:`Page.get_cdrawings`, performance-optimized version of :meth:`Page.get_drawings`. + +* **Added** :attr:`Pixmap.samples_mv`, *memoryview* of a pixmap's pixel area. Does not copy and thus always accesses the current state of that area. + +* **Added** :attr:`Pixmap.samples_ptr`, Python "pointer" to a pixmap's pixel area. Allows much faster creation (factor 800+) of Qt images. + + + +------ + +**Changes in Version 1.18.16** + +* **Fixed** issue `#1184 `_. Existing PDF widget fonts in a PDF are now accepted (i.e. not forcedly changed to a Base-14 font). + +* **Fixed** issue `#1154 `_. Text search hits should now be correct when ``clip`` is specified. + +* **Fixed** issue `#1152 `_. + +* **Fixed** issue `#1146 `_. + +* **Added** :attr:`Link.flags` and :meth:`Link.set_flags` to the :ref:`Link` class. Implements enhancement requests `#1187 `_. + +* **Added** option to *simulate* :meth:`TextWriter.fill_textbox` output for predicting the number of lines, that a given text would occupy in the textbox. + +* **Added** text output support as subcommand `gettext` to the ``fitz`` CLI module. Most importantly, original **physical text layout** reproduction is now supported. + + +------ + +**Changes in Version 1.18.15** + +* **Fixed** issue `#1088 `_. Removing an annotation's fill color should now work again both ways, using the ``fill_color=[]`` argument in :meth:`Annot.update` as well as ``fill=[]`` in :meth:`Annot.set_colors`. + +* **Fixed** issue `#1081 `_. :meth:`Document.subset_fonts`: fixed an error which created wrong character widths for some fonts. + +* **Fixed** issue `#1078 `_. :meth:`Page.get_text` and other methods related to text extraction: changed the default value of the :ref:`TextPage` ``flags`` parameter. All whitespace and :data:`ligatures` are now preserved. + +* **Fixed** issue `#1085 `_. The old *snake_cased* alias of ``fitz.detTextlength`` is now defined correctly. + +* **Changed** :meth:`Document.subset_fonts` will now correctly prefix font subsets with an appropriate six letter uppercase tag, complying with the PDF specification. + +* **Added** new method :meth:`Widget.button_states` which returns the possible values that a button-type field can have when being set to "on" or "off". + +* **Added** support of text with **Small Capital** letters to the :ref:`Font` and :ref:`TextWriter` classes. 
This is reflected by an additional bool parameter ``small_caps`` in various of their methods. + + +------ + +**Changes in Version 1.18.14** + +* **Finished** implementing new, "snake_cased" names for methods and properties, that were "camelCased" and awkward in many aspects. At the end of this documentation, there is section :ref:`Deprecated` with more background and a mapping of old to new names. + +* **Fixed** issue `#1053 `_. :meth:`Page.insert_image`: when given, include image mask in the hash computation. + +* **Fixed** issue `#1043 `_. Added ``Pixmap.getPNGdata`` to the aliases of :meth:`Pixmap.tobytes`. + +* **Fixed** an internal error when computing the enveloping rectangle of drawn paths as returned by :meth:`Page.get_drawings`. + +* **Fixed** an internal error occasionally causing loops when outputting text via :meth:`TextWriter.fill_textbox`. + +* **Added** :meth:`Font.char_lengths`, which returns a tuple of character widths of a string. + +* **Added** more ways to specify pages in :meth:`Document.delete_pages`. Now a sequence (list, tuple or range) can be specified, and the Python ``del`` statement can be used. In the latter case, Python ``slices`` are also accepted. + +* **Changed** :meth:`Document.del_toc_item`, which disables a single item of the TOC: previously, the title text was removed. Instead, now the complete item will be shown grayed-out by supporting viewers. + + +------ + +**Changes in Version 1.18.13** + +* **Fixed** issue `#1014 `_. +* **Fixed** an internal memory leak when computing image bboxes -- :meth:`Page.get_image_bbox`. +* **Added** support for low-level access and modification of the PDF trailer. Applies to :meth:`Document.xref_get_keys`, :meth:`Document.xref_get_key`, and :meth:`Document.xref_set_key`. +* **Added** documentation for maintaining private entries in PDF metadata. +* **Added** documentation for handling transparent image insertions, :meth:`Page.insert_image`. +* **Added** :meth:`Page.get_image_rects`, an improved version of :meth:`Page.get_image_bbox`. +* **Changed** :meth:`Document.delete_pages` to support various ways of specifying pages to delete. Implements `#1042 `_. +* **Changed** :meth:`Page.insert_image` to also accept the xref of an existing image in the file. This allows "copying" images between pages, and extremely fast mutiple insertions. +* **Changed** :meth:`Page.insert_image` to also accept the integer parameter ``alpha``. To be used for performance improvements. +* **Changed** :meth:`Pixmap.set_alpha` to support new parameters for pre-multiplying colors with their alpha values and setting a specific color to fully transparent (e.g. white). +* **Changed** :meth:`Document.embfile_add` to automatically set creation and modification date-time. Correspondingly, :meth:`Document.embfile_upd` automatically maintains modification date-time (``/ModDate`` PDF key), and :meth:`Document.embfile_info` correspondingly reports these data. In addition, the embedded file's associated "collection item" is included via its :data:`xref`. This supports the development of PDF portfolio applications. + +------ + +**Changes in Version 1.18.11 / 1.18.12** + +* **Fixed** issue `#972 `_. Improved layout of source distribution material. +* **Fixed** issue `#962 `_. Stabilized Linux distribution detection for generating PyMuPDF from sources. +* **Added:** :meth:`Page.get_xobjects` delivers the result of :meth:`Document.get_page_xobjects`. +* **Added:** :meth:`Page.get_image_info` delivers meta information for all images shown on the page. 
+* **Added:** :meth:`Tools.mupdf_display_warnings` allows setting on / off the display of MuPDF-generated warnings. The default is off. +* **Added:** :meth:`Document.ez_save` convenience alias of :meth:`Document.save` with some different defaults. +* **Changed:** Image extractions of document pages now also contain the image's **transformation matrix**. This concerns :meth:`Page.get_image_bbox` and the DICT, JSON, RAWDICT, and RAWJSON variants of :meth:`Page.get_text`. + + +------ + +**Changes in Version 1.18.10** + +* **Fixed** issue `#941 `_. Added old aliases for :meth:`DisplayList.get_pixmap` and :meth:`DisplayList.get_textpage`. +* **Fixed** issue `#929 `_. Stabilized removal of JavaScript objects with :meth:`Document.scrub`. +* **Fixed** issue `#927 `_. Removed a loop in the reworked :meth:`TextWriter.fill_textbox`. +* **Changed** :meth:`Document.xref_get_keys` and :meth:`Document.xref_get_key` to also allow accessing the PDF trailer dictionary. This can be done by using `-1` as the xref number argument. +* **Added** a number of functions for reconstructing the quads for text lines, spans and characters extracted by :meth:`Page.get_text` options "dict" and "rawdict". See :meth:`recover_quad` and friends. +* **Added** :meth:`Tools.unset_quad_corrections` to suppress character quad corrections (occasionally required for erroneous fonts). + +------ + +**Changes in Version 1.18.9** + + +* **Fixed** issue `#888 `_. Removed ambiguous statements concerning PyMuPDF's license, which is now clearly stated to be GNU AGPL V3. +* **Fixed** issue `#895 `_. +* **Fixed** issue `#896 `_. Since v1.17.6 PyMuPDF suppresses the font subset tags and only reports the base fontname in text extraction outputs "dict" / "json" / "rawdict" / "rawjson". Now a new global parameter can request the old behaviour, :meth:`Tools.set_subset_fontnames`. +* **Fixed** issue `#885 `_. Pixmap creation now also works with filenames given as ``pathlib.Paths``. +* **Changed** :meth:`Document.subset_fonts`: Text is **not rewritten** any more and should therefore **retain all its origial properties** -- like being hidden or being controlled by Optional Content mechanisms. +* **Changed** :ref:`TextWriter` output to also accept text in right to left mode (Arabian, Hebrew): :meth:`TextWriter.fill_textbox`, :meth:`TextWriter.append`. These methods now accept a new boolean parameter `right_to_left`, which is *False* by default. Implements `#897 `_. +* **Changed** :meth:`TextWriter.fill_textbox` to return all lines of text, that did not fit in the given rectangle. Also changed the default of the ``warn`` parameter to no longer print a warning message in overflow situations. +* **Added** a utility function :meth:`recover_quad`, which computes the quadrilateral of a span. This function can be used for correctly marking text extracted with the "dict" or "rawdict" options of :meth:`Page.get_text`. + +------ + +**Changes in Version 1.18.8** + + +This is a bug fix version only. We are publishing early because of the potentially widely used functions. + +* **Fixed** issue `#881 `_. Fixed a memory leak in :meth:`Page.insert_image` when inserting images from files or memory. +* **Fixed** issue `#878 `_. ``pathlib.Path`` objects should now correctly handle file path hierarchies. + + +------ + +**Changes in Version 1.18.7** + + +* **Added** an experimental :meth:`Document.subset_fonts` which reduces the size of eligible fonts based on their use by text in the PDF. Implements `#855 `_. 
+* **Implemented** request `#870 `_: :meth:`Document.convert_to_pdf` now also supports PDF documents.
+* **Renamed** ``Document.write`` to :meth:`Document.tobytes` for greater clarity. But the deprecated name remains available for some time.
+* **Implemented** request `#843 `_: :meth:`Document.tobytes` now supports linearized PDF output. :meth:`Document.save` now also supports writing to Python **file objects**. In addition, the open function now also supports Python file objects.
+* **Fixed** issue `#844 `_.
+* **Fixed** issue `#838 `_.
+* **Fixed** issue `#823 `_. More logic for better support of OCRed text output (Tesseract, ABBYY).
+* **Fixed** issue `#818 `_.
+* **Fixed** issue `#814 `_.
+* **Added** :meth:`Document.get_page_labels` which returns a list of page label definitions of a PDF.
+* **Added** :meth:`Document.has_annots` and :meth:`Document.has_links` to check whether these object types are present anywhere in a PDF.
+* **Added** expert low-level functions to simplify inquiry and modification of PDF object sources: :meth:`Document.xref_get_keys` lists the keys of object :data:`xref`, :meth:`Document.xref_get_key` returns type and content of a key, and :meth:`Document.xref_set_key` modifies the key's value.
+* **Added** parameter ``thumbnails`` to :meth:`Document.scrub` to also allow removing page thumbnail images.
+* **Improved** documentation for how to add valid text marker annotations for non-horizontal text.
+
+We continued the process of renaming methods and properties from *"mixedCase"* to *"snake_case"*. Documentation usually mentions the new names only, but old, deprecated names remain available for some time.
+
+
+
+------
+
+**Changes in Version 1.18.6**
+
+* **Fixed** issue `#812 `_.
+* **Fixed** issue `#793 `_. Invalid document metadata previously prevented opening some documents at all. This error has been removed.
+* **Fixed** issue `#792 `_. Text search and text extraction will make no rectangle containment checks at all if the default ``clip=None`` is used.
+* **Fixed** issue `#785 `_.
+* **Fixed** issue `#780 `_. Corrected a parameter check error.
+* **Fixed** issue `#779 `_. Fixed a typo.
+* **Added** an option to set the desired line height for text boxes. Implements `#804 `_.
+* **Changed** text position retrieval to better cope with Tesseract's glyphless font. Implements `#803 `_.
+* **Added** an option to choose the prefix of new annotations, fields and links for providing unique annotation ids. Implements request `#807 `_.
+* **Added** getting and setting color and text properties for Table of Contents items for PDFs. Implements `#779 `_.
+* **Added** PDF page label handling: :meth:`Page.get_label()` returns the page label, :meth:`Document.get_page_numbers` returns all page numbers having a specified label, and :meth:`Document.set_page_labels` adds or updates a PDF's page label definition.
+
+
+
+.. note::
+    This version introduces **Python type hinting**. The goal is to provide each parameter and the return value of all functions and methods with type information. This is still work in progress, although the majority of functions have already been handled.
+
+
+------
+
+**Changes in Version 1.18.5**
+
+Apart from several fixes, this version also focuses on several minor but important feature improvements. Among the latter is a more precise computation of proper line heights and insertion points for writing / inserting text. As opposed to using font-agnostic constants, these values are now taken from the font's properties.
+
+Also note that this is the first version which no longer provides pregenerated wheels for Python versions older than 3.6. pip also discontinues support for these versions by the end of 2020.
+
+* **Fixed** issue `#771 `_. By using the "small glyph heights" option, the full page text can be extracted.
+* **Fixed** issue `#768 `_.
+* **Fixed** issue `#750 `_.
+* **Fixed** issue `#739 `_. The "dict", "rawdict" and corresponding JSON output variants now have two new *span* keys: ``"ascender"`` and ``"descender"``. These floats represent special font properties which can be used to compute bboxes of spans or characters of **exactly fontsize height** (as opposed to the default line height). An example algorithm is shown in section "Span Dictionary" `here `_. Also improved the detection and correction of ill-specified ascender / descender values encountered in some fonts.
+* **Added** a new, experimental :meth:`Tools.set_small_glyph_heights` -- also in response to issue `#739 `_. This method sets or unsets a global parameter to **always compute bboxes with fontsize height**. If "on", text searching and all text extractions will return rectangles, bboxes and quads with a smaller height.
+* **Fixed** issue `#728 `_.
+* **Changed** fill color logic of 'Polyline' annotations: this parameter now only pertains to line end symbols -- the annotation itself can no longer have a fill color. Also addresses issue `#727 `_.
+* **Changed** :meth:`Page.getImageBbox` to also compute the bbox if the image is contained in an XObject.
+* **Changed** :meth:`Shape.insertTextbox`, resp. :meth:`Page.insertTextbox`, resp. :meth:`TextWriter.fillTextbox` to respect the font's properties "ascender" / "descender" when computing line height and insertion point. This should no longer lead to line overlaps for multi-line output. These methods used to ignore font specifics and used constant values instead.
+
+
+------
+
+**Changes in Version 1.18.4**
+
+This version adds several features to support PDF Optional Content. Among other things, this includes OCMDs (Optional Content Membership Dictionaries) with the full scope of *"visibility expressions"* (PDF key ``/VE``), text insertions (including the :ref:`TextWriter` class) and drawings.
+
+* **Fixed** issue `#727 `_. Freetext annotations now support an uncolored rectangle when ``fill_color=None``.
+* **Fixed** issue `#726 `_. UTF-8 encoding errors are now handled for HTML / XML :meth:`Page.getText` output.
+* **Fixed** issue `#724 `_. Empty values are no longer stored in the PDF /Info metadata dictionary.
+* **Added** new methods :meth:`Document.set_oc` and :meth:`Document.get_oc` to set or get optional content references for **existing** image and form XObjects. These methods are similar to the same-named methods of :ref:`Annot`.
+* **Added** :meth:`Document.set_ocmd`, :meth:`Document.get_ocmd` for handling OCMDs (see the sketch after this list).
+* **Added** **Optional Content** support for text insertion and drawing.
+* **Added** new method :meth:`Page.deleteWidget`, which deletes a form field from a page. This is analogous to deleting annotations.
+* **Added** support for Popup annotations. This includes defining the Popup rectangle and setting the Popup to open or closed. Methods / attributes :meth:`Annot.set_popup`, :meth:`Annot.set_open`, :attr:`Annot.has_popup`, :attr:`Annot.is_open`, :attr:`Annot.popup_rect`, :attr:`Annot.popup_xref`.
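+
+The following minimal sketch (not part of the original change notes) illustrates how these optional content pieces can play together. It uses today's snake_case method names, which may differ from the spellings available back in v1.18.4, and the layer name and output file name are made up for illustration::
+
+    import fitz  # PyMuPDF
+
+    doc = fitz.open()                        # new, empty PDF, for illustration only
+    page = doc.new_page()
+    ocg = doc.add_ocg("Layer 1", on=True)    # create an optional content group, returns an xref
+    # bind a text insertion to the OCG, so viewers can switch the layer on / off
+    page.insert_text((72, 72), "only visible while 'Layer 1' is on", oc=ocg)
+    # an OCMD can combine one or more OCGs via a policy (or a /VE visibility expression)
+    ocmd = doc.set_ocmd(ocgs=[ocg], policy="AnyOn")
+    doc.save("oc-example.pdf")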
+
+Other changes:
+
+* The **naming of methods and attributes** in PyMuPDF is far from being satisfactory: we have *CamelCases*, *mixedCases* and *lower_case_with_underscores* all over the place. With the :ref:`Annot` class as the first candidate, we have started an activity to clean this up step by step, converting to lower case with underscores for methods and attributes while keeping UPPERCASE for the constants.
+
+ - Old names will remain available to prevent code breaks, but they will no longer be mentioned in the documentation.
+ - New methods and attributes of all classes will be named according to the new standard.
+
+------
+
+**Changes in Version 1.18.3**
+
+As a major new feature, this version introduces support for PDF's **Optional Content** concept.
+
+* **Fixed** issue `#714 `_.
+* **Fixed** issue `#711 `_.
+* **Fixed** issue `#707 `_: if a PDF user password is supplied, but no owner password is given or present, then the user password is also used as the owner password.
+* **Fixed** ``expand`` and ``deflate`` parameters of methods :meth:`Document.save` and :meth:`Document.write`. Individual image and font compression should now finally work. Addresses issue `#713 `_.
+* **Added** support for PDF optional content. This includes several new :ref:`Document` methods for inquiring and setting optional content status and adding optional content configurations and groups. In addition, images, form XObjects and annotations can now be bound to optional content specifications. **Resolved** issue `#709 `_.
+
+
+
+------
+
+**Changes in Version 1.18.2**
+
+This version contains some interesting improvements for text searching: any number of search hits is now returned and the **hit_max** parameter was removed. In addition, the new **clip** parameter allows restricting the search area. Searching now detects hyphenations at line breaks and accordingly finds hyphenated words.
+
+* **Fixed** issue `#575 `_: if using ``quads=False`` in text searching, then overlapping rectangles on the same line are joined. Previously, parts of the search string, which belonged to different "marked content" items, each generated their own rectangle -- just as if occurring on separate lines.
+* **Added** :attr:`Document.isRepaired`, which is true if the PDF was repaired on open.
+* **Added** :meth:`Document.setXmlMetadata` which either updates or creates PDF XML metadata. Implements issue `#691 `_.
+* **Added** :meth:`Document.getXmlMetadata`, which returns PDF XML metadata.
+* **Changed** creation of PDF documents: they will now always carry a PDF identification (``/ID`` field) in the document trailer. Implements issue `#691 `_.
+* **Changed** :meth:`Page.searchFor`: a new parameter ``clip`` is accepted to restrict the search to this rectangle. Correspondingly, the attribute :attr:`TextPage.rect` is now respected by :meth:`TextPage.search`.
+* **Changed:** the parameter ``hit_max`` in :meth:`Page.searchFor` and :meth:`TextPage.search` is now obsolete: the methods return all hits.
+* **Changed** character **selection criteria** in :meth:`Page.getText`: a character is now considered to be part of a ``clip`` if its bbox is fully contained. Before this, a non-empty intersection was sufficient.
+* **Changed** :meth:`Document.scrub` to support a new option `redact_images`. This addresses issue `#697 `_.
+
+
+------
+
+**Changes in Version 1.18.1**
+
+* **Fixed** issue `#692 `_. PyMuPDF now detects and recovers from more cyclic resource dependencies in PDF pages and for the first time reports them in the MuPDF warnings store.
+* **Fixed** issue `#686 `_.
+* **Added** opacity options for the :ref:`Shape` class: Stroke and fill colors can now be set to some transparency value. This means that all :ref:`Page` draw methods, methods :meth:`Page.insertText`, :meth:`Page.insertTextbox`, :meth:`Shape.finish`, :meth:`Shape.insertText`, and :meth:`Shape.insertTextbox` support two new parameters: *stroke_opacity* and *fill_opacity*.
+* **Added** new parameter ``mask`` to :meth:`Page.insertImage` for optionally providing an external image mask. Resolves issue `#685 `_.
+* **Added** :meth:`Annot.soundGet` for extracting the sound of an audio annotation.
+
+------
+
+**Changes in Version 1.18.0**
+
+This is the first PyMuPDF version supporting MuPDF v1.18. The focus here is on extending PyMuPDF's own functionality -- apart from bug fixing. Subsequent PyMuPDF patches may address features new in MuPDF.
+
+* **Fixed** issue `#519 `_. This upstream bug occurred occasionally for some pages only and seems to be fixed now: page layout should no longer be ruined in these cases.
+
+* **Fixed** issue `#675 `_.
+
+ - Unsuccessful storage allocations should now always lead to exceptions (circumvention of an upstream bug intermittently crashing the interpreter).
+ - :ref:`Pixmap` size is now based on ``size_t`` instead of ``int`` in C and should be correct even for extremely large pixmaps.
+
+* **Fixed** issue `#668 `_. Specification of dashes for PDF drawing insertion should now correctly reflect the PDF spec.
+* **Fixed** issue `#669 `_. A major source of memory leakage in :meth:`Page.insert_pdf` has been removed.
+* **Added** keyword *"images"* to :meth:`Page.apply_redactions` for fine-controlling the handling of images.
+* **Added** :meth:`Annot.getText` and :meth:`Annot.getTextbox`, which offer the same functionality as the :ref:`Page` versions.
+* **Added** key *"number"* to the block dictionaries of :meth:`Page.getText` / :meth:`Annot.getText` for options "dict" and "rawdict".
+* **Added** :meth:`glyph_name_to_unicode` and :meth:`unicode_to_glyph_name`. These functions do not depend on a specific font and are now also available independently. The data are now based on the `Adobe Glyph List `_.
+* **Added** convenience functions :meth:`adobe_glyph_names` and :meth:`adobe_glyph_unicodes` which return the respective available data.
+* **Added** :meth:`Page.getDrawings` which returns details of drawing operations on a document page. Works for all document types.
+* Improved performance of :meth:`Document.insert_pdf`. Multiple object copies are now also suppressed across multiple separate insertions from the same source. This saves time, memory and target file size. Previously this mechanism was only active within each single method execution. The feature can also be suppressed with the new bool method parameter *final=1*, which is the default.
+* For PNG images created from pixmaps, the resolution (dpi) is now automatically set from the respective :attr:`Pixmap.xres` and :attr:`Pixmap.yres` values.
+
+
+------
+
+**Changes in Version 1.17.7**
+
+* **Fixed** issue `#651 `_. An upstream bug causing interpreter crashes in corner-case redaction processing was fixed by backporting MuPDF changes from their development repo.
+* **Fixed** issue `#645 `_. Pixmap top-left coordinates can be set (again) by their own method, :meth:`Pixmap.set_origin`.
+* **Fixed** issue `#622 `_. :meth:`Page.insertImage` again accepts a :data:`rect_like` parameter.
+* **Added** several new methods to improve and speed up table of contents (TOC) handling.
Among other things, TOC items can now be changed or deleted individually -- without always replacing the complete TOC. Furthermore, access to some PDF page attributes is now possible without first **loading** the page. This has a very significant impact on the performance of TOC manipulation.
+* **Added** an option to :meth:`Document.insert_pdf` which allows displaying progress messages. Addresses `#640 `_.
+* **Added** :meth:`Page.getTextbox` which extracts text contained in a rectangle. In many cases, this should make writing your own script for this kind of task unnecessary.
+* **Added** new ``clip`` parameter to :meth:`Page.getText` to simplify and speed up text extraction of page sub-areas.
+* **Added** :meth:`TextWriter.appendv` to add text in **vertical write mode**. Addresses issue `#653 `_.
+
+
+------
+
+**Changes in Version 1.17.6**
+
+* **Fixed** issue `#605 `_
+* **Fixed** issue `#600 `_ -- text should now be correctly positioned also for pages with a CropBox smaller than MediaBox.
+* **Added** text span dictionary key ``origin`` which contains the lower left coordinate of the first character in that span.
+* **Added** attribute :attr:`Font.buffer`, a *bytes* copy of the font file.
+* **Added** parameter *sanitize* to :meth:`Page.cleanContents`. Allows switching off sanitization, so that only syntax cleaning will be done.
+
+------
+
+**Changes in Version 1.17.5**
+
+* **Fixed** issue `#561 `_ -- second go: certain :ref:`TextWriter` usages with many alternating fonts did not work correctly.
+* **Fixed** issue `#566 `_.
+* **Fixed** issue `#568 `_.
+* **Fixed** -- opacity is now correctly taken from the :ref:`TextWriter` object, if not given in :meth:`TextWriter.writeText`.
+* **Added** a new global attribute :attr:`fitz_fontdescriptors`. Contains information about usable fonts from repository `pymupdf-fonts `_.
+* **Added** :meth:`Font.valid_codepoints` which returns an array of unicode codepoints for which the font has a glyph.
+* **Added** option ``text_as_path`` to :meth:`Page.getSVGimage`. This implements `#580 `_. Generates much smaller SVG files with parseable text if set to *False*.
+
+
+------
+
+**Changes in Version 1.17.4**
+
+* **Fixed** issue `#561 `_. Handling of more than 10 :ref:`Font` objects on one page should now work correctly.
+* **Fixed** issue `#562 `_. Annotation pixmaps are no longer derived from the page pixmap, thus avoiding unintended inclusion of page content.
+* **Fixed** issue `#559 `_. This **MuPDF** bug is being temporarily fixed with a pre-version of MuPDF's next release.
+* **Added** utility function :meth:`repair_mono_font` for correcting displayed character spacing for some mono-spaced fonts.
+* **Added** utility method :meth:`Document.need_appearances` for fine-controlling Form PDF behavior. Addresses issue `#563 `_.
+* **Added** utility function :meth:`sRGB_to_pdf` to recover the PDF color triple for a given color integer in sRGB format.
+* **Added** utility function :meth:`sRGB_to_rgb` to recover the (R, G, B) color triple for a given color integer in sRGB format.
+* **Added** utility function :meth:`make_table` which delivers table cells for a given rectangle and desired numbers of columns and rows.
+* **Added** support for optional fonts in repository `pymupdf-fonts `_.
+
+------
+
+**Changes in Version 1.17.3**
+
+* **Fixed** an undocumented issue which prevented fully cleaning a PDF page when using :meth:`Page.cleanContents`.
+* **Fixed** issue `#540 `_. Text extraction for EPUB should again work correctly.
+* **Fixed** issue `#548 `_.
Documentation now includes ``LINK_NAMED``.
+* **Added** a new parameter to control the start of text in :meth:`TextWriter.fillTextbox`. Implements `#549 `_.
+* **Changed** documentation of :meth:`Page.add_redact_annot` to explain the usage of non-builtin fonts.
+
+------
+
+**Changes in Version 1.17.2**
+
+* **Fixed** issue `#533 `_.
+* **Added** options to modify 'Redact' annotation appearance. Implements `#535 `_.
+
+
+------
+
+**Changes in Version 1.17.1**
+
+* **Fixed** issue `#520 `_.
+* **Fixed** issue `#525 `_. Vertices for 'Ink' annots should now be correct.
+* **Fixed** issue `#524 `_. It is now possible to query and set rotation for applicable annotation types.
+
+Also significantly improved inline documentation for better support of interactive help.
+
+------
+
+**Changes in Version 1.17.0**
+
+This version is based on MuPDF v1.17. Following are highlights of new and changed features:
+
+* **Added** extended language support for annotations and widgets: a mixture of Latin, Greek, Russian, Chinese, Japanese and Korean characters can now be used in 'FreeText' annotations and text widgets. No special arrangement is required to use it.
+
+* Faster page access is implemented for documents supporting a "chapter" structure. This applies to EPUB documents currently. This comes with several new :ref:`Document` methods and changes for :meth:`Document.loadPage` and the "indexed" page access *doc[n]*: In addition to specifying a page number as before, a tuple *(chapter, pno)* can be specified to identify the desired page.
+
+* **Changed:** Improved support of redaction annotations: images overlapped by redactions are **permanently modified** by erasing the overlap areas. Also links are removed if overlapped by redactions. This is now fully in sync with PDF specifications.
+
+Other changes:
+
+* **Changed** :meth:`TextWriter.writeText` to support the *"morph"* parameter.
+* **Added** methods :meth:`Rect.morph`, :meth:`IRect.morph`, and :meth:`Quad.morph`, which return a new :ref:`Quad`.
+* **Changed** :meth:`Page.add_freetext_annot` to support text alignment via a new *"align"* parameter.
+* **Fixed** issue `#508 `_. Improved image rectangle calculation to hopefully deliver correct values in most if not all cases.
+* **Fixed** issue `#502 `_.
+* **Fixed** issue `#500 `_. :meth:`Document.convertToPDF` should no longer cause memory leaks.
+* **Fixed** issue `#496 `_. Annotations and widgets / fields are now added or modified using the coordinates of the **unrotated page**. This behavior is now in sync with other methods modifying PDF pages.
+* **Added** :attr:`Page.rotationMatrix` and :attr:`Page.derotationMatrix` to support coordinate transformations between the rotated and the original versions of a PDF page.
+
+Potential code breaking changes:
+
+* The private method ``Page._getTransformation()`` has been removed. Use the public :attr:`Page.transformationMatrix` instead.
+
+
+------
+
+**Changes in Version 1.16.18**
+
+This version introduces several new features around PDF text output. The motivation is to simplify this task, while at the same time offering extended features.
+
+One major achievement is using MuPDF's capabilities to dynamically choose fallback fonts whenever a character cannot be found in the current one. This seamlessly works for Base-14 fonts in combination with CJK fonts (China, Japan, Korea). So a text may contain **any combination of characters** from the Latin, Greek, Russian, Chinese, Japanese and Korean languages.
+
+* **Fixed** issue `#493 `_.
``Pixmap(doc, xref)`` should now again correctly resemble the loaded image object.
+* **Fixed** issue `#488 `_. Widget names are now modifiable.
+* **Added** new class :ref:`Font` which represents a font.
+* **Added** new class :ref:`TextWriter` which serves as a container for text to be written on a page.
+* **Added** :meth:`Page.writeText` to write one or more :ref:`TextWriter` objects to the page.
+
+
+------
+
+**Changes in Version 1.16.17**
+
+
+* **Fixed** issue `#479 `_. PyMuPDF should now more correctly report image resolutions. This applies both to images (either from image files or extracted from PDF documents) and to pixmaps created from images.
+* **Added** :meth:`Pixmap.set_dpi` which sets the image resolution in x and y directions.
+
+------
+
+**Changes in Version 1.16.16**
+
+
+* **Fixed** issue `#477 `_.
+* **Fixed** issue `#476 `_.
+* **Changed** annotation line end symbol coloring and fixed an error coloring the interior of 'Polyline' / 'Polygon' annotations.
+
+------
+
+**Changes in Version 1.16.14**
+
+
+* **Changed** text marker annotations to accept parameters beyond just quadrilaterals such that now **text lines between two given points can be marked**.
+
+* **Added** :meth:`Document.scrub` which **removes potentially sensitive data** from a PDF. Implements `#453 `_.
+
+* **Added** :meth:`Annot.blendMode` which returns the **blend mode** of annotations.
+
+* **Added** :meth:`Annot.setBlendMode` to set the annotation's blend mode. This resolves issue `#416 `_.
+* **Changed** :meth:`Annot.update` to accept additional parameters for setting blend mode and opacity.
+* **Added** advanced graphics features to **control the anti-aliasing values**, :meth:`Tools.set_aa_level`. Resolves `#467 `_.
+
+* **Fixed** issue `#474 `_.
+* **Fixed** issue `#466 `_.
+
+
+
+------
+
+**Changes in Version 1.16.13**
+
+
+* **Added** :meth:`Document.getPageXObjectList` which returns a list of **Form XObjects** of the page.
+* **Added** :meth:`Page.setMediaBox` for changing the physical PDF page size.
+* **Added** :ref:`Page` methods which were internal before: :meth:`Page.cleanContents` (= :meth:`Page._cleanContents`), :meth:`Page.getContents` (= :meth:`Page._getContents`), :meth:`Page.getTransformation` (= :meth:`Page._getTransformation`).
+
+
+
+------
+
+**Changes in Version 1.16.12**
+
+* **Fixed** issue `#447 `_
+* **Fixed** issue `#461 `_.
+* **Fixed** issue `#397 `_.
+* **Fixed** issue `#463 `_.
+* **Added** JavaScript support to PDF form fields, thereby fixing `#454 `_.
+* **Added** a new annotation method :meth:`Annot.delete_responses`, which removes 'Popup' and response annotations referring to the current one. Mainly serves data protection purposes.
+* **Added** a new form field method :meth:`Widget.reset`, which resets the field value to its default.
+* **Changed** and extended handling of redactions: images and XObjects are removed if *contained* in a redaction rectangle. Partial overlaps will just be covered by the redaction background color. Now an *overlay* text can be specified to be inserted in the rectangle area to **take the place of the deleted original** text. This resolves `#434 `_.
+
+------
+
+**Changes in Version 1.16.11**
+
+* **Added** support for redaction annotations via methods :meth:`Page.add_redact_annot` and :meth:`Page.apply_redactions` (see the sketch below).
+* **Fixed** issue #426 ("PolygonAnnotation in 1.16.10 version").
+* **Fixed** documentation-only issues `#443 `_ and `#444 `_.
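+
+The following minimal sketch (not part of the original change notes) shows how the two redaction methods are meant to work together. The file names and the search string are made up, and the snake_case spellings are today's method names, which may differ from the names available in this old release::
+
+    import fitz  # PyMuPDF
+
+    doc = fitz.open("input.pdf")             # hypothetical input file
+    page = doc[0]
+    # mark every hit of a sample string for redaction, then apply all marks at once
+    for rect in page.search_for("confidential"):
+        page.add_redact_annot(rect, fill=(0, 0, 0))
+    page.apply_redactions()                  # permanently removes the covered content
+    doc.save("redacted.pdf")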
+
+------
+
+**Changes in Version 1.16.10**
+
+* **Fixed** issue #421 ("annot.set_rect(rect) has no effect on text Annotation")
+* **Fixed** issue #417 ("Strange behavior for page.deleteAnnot on 1.16.9 compare to 1.13.20")
+* **Fixed** issue #415 ("Annot.setOpacity throws mupdf warnings")
+* **Changed** all "add annotation / widget" methods to store a unique name in the */NM* PDF key.
+* **Changed** :meth:`Annot.setInfo` to also accept direct parameters in addition to a dictionary.
+* **Changed** :attr:`Annot.info` to now also show the annotation's unique id (*/NM* PDF key) if present.
+* **Added** :meth:`Page.annot_names` which returns a list of all annotation names (*/NM* keys).
+* **Added** :meth:`Page.load_annot` which loads an annotation given its unique id (*/NM* key).
+* **Added** :meth:`Document.reload_page` which provides a new copy of a page after finishing any pending updates to it.
+
+
+------
+
+**Changes in Version 1.16.9**
+
+* **Fixed** #412 ("Feature Request: Allow controlling whether TOC entries should be collapsed")
+* **Fixed** #411 ("Seg Fault with page.firstWidget")
+* **Fixed** #407 ("Annot.setOpacity trouble")
+* **Changed** methods :meth:`Annot.setBorder`, :meth:`Annot.setColors`, :meth:`Link.setBorder`, and :meth:`Link.setColors` to also accept direct parameters, and not just cumbersome dictionaries.
+
+------
+
+**Changes in Version 1.16.8**
+
+* **Added** several new methods to the :ref:`Document` class, which make dealing with PDF low-level structures easier. I also decided to provide them as "normal" methods (as opposed to private ones starting with an underscore "_"). These are :meth:`Document.xrefObject`, :meth:`Document.xrefStream`, :meth:`Document.xrefStreamRaw`, :meth:`Document.PDFTrailer`, :meth:`Document.PDFCatalog`, :meth:`Document.metadataXML`, :meth:`Document.updateObject`, :meth:`Document.updateStream`.
+* **Added** :meth:`Tools.mupdf_display_errors` which controls the display of MuPDF errors on *sys.stderr*.
+* **Added** a commandline facility. This is a major new feature: you can now invoke several utility functions via *"python -m fitz ..."*. It should remove the need for many of the most trivial scripts. Please refer to :ref:`Module`.
+
+
+------
+
+**Changes in Version 1.16.7**
+
+Minor changes to better synchronize the binary image streams of :ref:`TextPage` image blocks and :meth:`Document.extractImage` images.
+
+* **Fixed** issue #394 ("PyMuPDF Segfaults when using TOOLS.mupdf_warnings()").
+* **Changed** redirection of MuPDF error messages: apart from writing them to Python *sys.stderr*, they are now also stored with the MuPDF warnings.
+* **Changed** :meth:`Tools.mupdf_warnings` to automatically empty the store (if not deactivated via a parameter).
+* **Changed** :meth:`Page.getImageBbox` to return an **infinite rectangle** if the image could not be located on the page -- instead of raising an exception.
+
+
+------
+
+**Changes in Version 1.16.6**
+
+* **Fixed** issue #390 ("Incomplete deletion of annotations").
+* **Changed** :meth:`Page.searchFor` / :meth:`Document.searchPageFor` to also support the *flags* parameter, which controls the data included in a :ref:`TextPage`.
+* **Changed** :meth:`Document.getPageImageList`, :meth:`Document.getPageFontList` and their :ref:`Page` counterparts to support a new parameter *full*. If true, the returned items will contain the :data:`xref` of the *Form XObject* where the font or image is referenced.
+
+------
+
+**Changes in Version 1.16.5**
+
+More performance improvements for text extraction.
+
+* **Fixed** second part of issue #381 (see item in v1.16.4).
+* **Added** :meth:`Page.getTextPage`, so it is no longer required to create an intermediate display list for text extractions. Page level wrappers for text extraction and text searching are now based on this, which should improve performance by ca. 5%.
+
+------
+
+**Changes in Version 1.16.4**
+
+
+* **Fixed** issue #381 ("TextPage.extractDICT ... failed ... after upgrading ... to 1.16.3")
+* **Added** method :meth:`Document.pages` which delivers a generator iterator over a page range.
+* **Added** method :meth:`Page.links` which delivers a generator iterator over the links of a page.
+* **Added** method :meth:`Page.annots` which delivers a generator iterator over the annotations of a page.
+* **Added** method :meth:`Page.widgets` which delivers a generator iterator over the form fields of a page.
+* **Changed** :attr:`Document.is_form_pdf` to now contain the number of widgets, and *False* if not a PDF or this number is zero.
+
+
+------
+
+**Changes in Version 1.16.3**
+
+Minor changes compared to version 1.16.2. The code of the "dict" and "rawdict" variants of :meth:`Page.getText` has been ported to C which has greatly improved their performance. This improvement is mostly noticeable with text-oriented documents, where they should now execute almost twice as fast.
+
+* **Fixed** issue #369 ("mupdf: cmsCreateTransform failed") by removing ICC colorspace support.
+* **Changed** :meth:`Page.getText` to accept additional keywords "blocks" and "words". These will deliver the results of :meth:`Page.getTextBlocks` and :meth:`Page.getTextWords`, respectively. So all text extraction methods are now available via a uniform API. Correspondingly, there are now new methods :meth:`TextPage.extractBLOCKS` and :meth:`TextPage.extractWords`.
+* **Changed** :meth:`Page.getText` to default bit indicator *TEXT_INHIBIT_SPACES* to **off**. Insertion of additional spaces is **not suppressed** by default.
+
+------
+
+**Changes in Version 1.16.2**
+
+* **Changed** text extraction methods of :ref:`Page` to allow detailed control of the amount of extracted data.
+* **Added** :meth:`planish_line` which maps a given line (defined as a pair of points) to the x-axis.
+* **Fixed** an issue (w/o Github number) which brought down the interpreter when encountering certain non-UTF-8 encodable characters while using :meth:`Page.getText` with the "dict" option.
+* **Fixed** issue #362 ("Memory Leak with getText('rawDICT')").
+
+------
+
+**Changes in Version 1.16.1**
+
+* **Added** property :attr:`Quad.is_convex` which checks whether a line connecting any two points of the quad is contained in the quad.
+* **Changed** :meth:`Document.insert_pdf` to now allow dropping or including links and annotations independently during the copy. Fixes issue #352 ("Corrupt PDF data and ..."), which seemed to intermittently occur when using the method for some problematic PDF files.
+* **Fixed** a bug which, in matrix division using the syntax *"m1/m2"*, caused matrix *"m1"* to be **replaced** by the result instead of delivering a new matrix.
+* **Fixed** issue #354 ("SyntaxWarning with Python 3.8"). We now always use *"=="* for literals (instead of the *"is"* Python keyword).
+* **Fixed** issue #353 ("mupdf version check"), to no longer refuse the import when there are only patch level deviations from MuPDF.
+
+
+
+------
+
+**Changes in Version 1.16.0**
+
+This major new version of MuPDF comes with several nice new or changed features.
Some of them imply programming API changes, however. This is a synopsis of what has changed:
+
+* PDF document encryption and decryption is now **fully supported**. This includes setting **permissions**, **passwords** (user and owner passwords) and the desired encryption method.
+* In response to the new encryption features, PyMuPDF returns an integer (i.e. a combination of bits) for document permissions, and no longer a dictionary.
+* Redirection of MuPDF errors and warnings is now natively supported. PyMuPDF redirects error messages from MuPDF to *sys.stderr* and no longer buffers them. Warnings continue to be buffered and will not be displayed. Functions exist to access and reset the warnings buffer.
+* Annotations are now **only supported for PDF**.
+* Annotations and widgets (form fields) are now **separate object chains** on a page (although widgets technically still **are** PDF annotations). This means that you will **never encounter widgets** when using :attr:`Page.firstAnnot` or :meth:`Annot.next`. You must use :attr:`Page.firstWidget` and :meth:`Widget.next` to access form fields.
+* As part of MuPDF's changes regarding widgets, only the following four fonts are supported when **adding** or **changing** form fields: **Courier, Helvetica, Times-Roman** and **ZapfDingBats**.
+
+List of change details:
+
+* **Added** :meth:`Document.can_save_incrementally` which checks conditions that prevent the use of option *incremental=True* of :meth:`Document.save`.
+* **Added** :attr:`Page.firstWidget` which points to the first field on a page.
+* **Added** :meth:`Page.getImageBbox` which returns the rectangle occupied by an image shown on the page.
+* **Added** :meth:`Annot.setName` which lets you change the (icon) name field.
+* **Added** outputting the text color in :meth:`Page.getText`: the *"dict"*, *"rawdict"* and *"xml"* options now also show the color in sRGB format.
+* **Changed** :attr:`Document.permissions` to now contain an integer of bool indicators -- it was a dictionary before.
+* **Changed** :meth:`Document.save`, :meth:`Document.write`, which now fully support password-based decryption and encryption of PDF files.
+* **Changed the names of all Python constants** related to annotations and widgets. Please make sure to consult the **Constants and Enumerations** chapter if your script is dealing with these two classes. This decision goes back to the dropped support for non-PDF annotations. The **old names** (starting with "ANNOT_*" or "WIDGET_*") will be available as deprecated synonyms.
+* **Changed** font support for widgets: only *Cour* (Courier), *Helv* (Helvetica, default), *TiRo* (Times-Roman) and *ZaDb* (ZapfDingBats) are accepted when **adding or changing** form fields. Only the plain versions are possible -- not their italic or bold variations. **Reading** widgets, however, will show their original font.
+* **Changed** the name of the warnings buffer to :meth:`Tools.mupdf_warnings` and the function to empty this buffer is now called :meth:`Tools.reset_mupdf_warnings`.
+* **Changed** :meth:`Page.getPixmap`, :meth:`Document.get_page_pixmap`: a new bool argument *annots* can now be used to **suppress the rendering of annotations** on the page.
+* **Changed** :meth:`Page.add_file_annot` and :meth:`Page.add_text_annot` to enable setting an icon.
+* **Removed** widget-related methods and attributes from the :ref:`Annot` object.
+* **Removed** :ref:`Document` attributes *openErrCode*, *openErrMsg*, and :ref:`Tools` attributes / methods *stderr*, *reset_stderr*, *stdout*, and *reset_stdout*.
+* **Removed** the **third-party zlib** dependency in PyMuPDF: there are now compression functions available in MuPDF. Source installers of PyMuPDF may now omit this extra installation step.
+
+**No version published for MuPDF v1.15.0**
+
+
+------
+
+**Changes in Version 1.14.20 / 1.14.21**
+
+* **Changed** text marker annotations to support multiple rectangles / quadrilaterals. This fixes issue #341 ("Question : How to addhighlight so that a string spread across more than a line is covered by one highlight?") and similar (#285).
+* **Fixed** issue #331 ("Importing PyMuPDF changes warning filtering behaviour globally").
+
+
+------
+
+**Changes in Version 1.14.19**
+
+* **Fixed** issue #319 ("InsertText function error when use custom font").
+* **Added** new method :meth:`Document.get_sigflags` which returns information on whether a PDF is signed. Resolves issue #326 ("How to detect signature in a form pdf?").
+
+
+------
+
+**Changes in Version 1.14.17**
+
+* **Added** :meth:`Document.fullcopyPage` to make full page copies within a PDF (not just copied references as :meth:`Document.copyPage` does).
+* **Changed** :meth:`Page.getPixmap`, :meth:`Document.get_page_pixmap` to now use *alpha=False* as default.
+* **Changed** text extraction: the span dictionary now (again) contains its rectangle under the *bbox* key.
+* **Changed** :meth:`Document.movePage` and :meth:`Document.copyPage` to use direct functions instead of wrapping :meth:`Document.select` -- similar to :meth:`Document.delete_page` in v1.14.16.
+
+------
+
+**Changes in Version 1.14.16**
+
+* **Changed** :ref:`Document` methods around PDF */EmbeddedFiles* to no longer use MuPDF's "portfolio" functions. That support will be dropped in MuPDF v1.15 -- therefore another solution was required.
+* **Changed** :meth:`Document.embfile_Count` to be a function (was an attribute).
+* **Added** new method :meth:`Document.embfile_Names` which returns a list of names of embedded files.
+* **Changed** :meth:`Document.delete_page` and :meth:`Document.delete_pages` to internally no longer use :meth:`Document.select`, but instead use functions to perform the deletion directly. As it has turned out, the :meth:`Document.select` method yields invalid outline trees (tables of content) for very complex PDFs and sophisticated use of annotations.
+
+
+------
+
+**Changes in Version 1.14.15**
+
+* **Fixed** issues #301 ("Line cap and Line join"), #300 ("How to draw a shape without outlines") and #298 ("utils.updateRect exception"). These bugs pertain to drawing shapes with PyMuPDF. Drawing shapes without any border is fully supported. Line cap style and line join style are now differentiated and support all possible PDF values (0, 1, 2) instead of just being a bool. The previous parameter *roundCap* is deprecated in favor of *lineCap* and *lineJoin* and will be deleted in the next release.
+* **Fixed** issue #290 ("Memory Leak with getText('rawDICT')"). This bug caused memory not to be (completely) freed after invoking the "dict", "rawdict" and "json" versions of :meth:`Page.getText`.
+
+
+------
+
+**Changes in Version 1.14.14**
+
+* **Added** new low-level function :meth:`ImageProperties` to determine a number of characteristics for an image.
+* **Added** new low-level function :meth:`Document.is_stream`, which checks whether an object is of stream type.
+* **Changed** low-level functions :meth:`Document._getXrefString` and :meth:`Document._getTrailerString` now by default return object definitions in a formatted form which makes parsing easy. + +------ + +**Changes in Version 1.14.13** + +* **Changed** methods working with binary input: while ever supporting bytes and bytearray objects, they now also accept *io.BytesIO* input, using their *getvalue()* method. This pertains to document creation, embedded files, FileAttachment annotations, pixmap creation and others. Fixes issue #274 ("Segfault when using BytesIO as a stream for insertImage"). +* **Fixed** issue #278 ("Is insertImage(keep_proportion=True) broken?"). Images are now correctly presented when keeping aspect ratio. + + +------ + +**Changes in Version 1.14.12** + +* **Changed** the draw methods of :ref:`Page` and :ref:`Shape` to support not only RGB, but also GRAY and CMYK colorspaces. This solves issue #270 ("Is there a way to use CMYK color to draw shapes?"). This change also applies to text insertion methods of :ref:`Shape`, resp. :ref:`Page`. +* **Fixed** issue #269 ("AttributeError in Document.insert_page()"), which occurred when using :meth:`Document.insert_page` with text insertion. + + +------ + +**Changes in Version 1.14.11** + +* **Changed** :meth:`Page.show_pdf_page` to always position the source rectangle centered in the target. This method now also supports **rotation by arbitrary angles**. The argument *reuse_xref* has been deprecated: prevention of duplicates is now **handled internally**. +* **Changed** :meth:`Page.insertImage` to support rotated display of the image and keeping the aspect ratio. Only rotations by multiples of 90 degrees are supported here. +* **Fixed** issue #265 ("TypeError: insertText() got an unexpected keyword argument 'idx'"). This issue only occurred when using :meth:`Document.insert_page` with also inserting text. + +------ + +**Changes in Version 1.14.10** + +* **Changed** :meth:`Page.show_pdf_page` to support rotation of the source rectangle. Fixes #261 ("Cannot rotate insterted pages"). +* **Fixed** a bug in :meth:`Page.insertImage` which prevented insertion of multiple images provided as streams. + + +------ + +**Changes in Version 1.14.9** + +* **Added** new low-level method :meth:`Document._getTrailerString`, which returns the trailer object of a PDF. This is much like :meth:`Document._getXrefString` except that the PDF trailer has no / needs no :data:`xref` to identify it. +* **Added** new parameters for text insertion methods. You can now set stroke and fill colors of glyphs (text characters) independently, as well as the thickness of the glyph border. A new parameter *render_mode* controls the use of these colors, and whether the text should be visible at all. +* **Fixed** issue #258 ("Copying image streams to new PDF without size increase"): For JPX images embedded in a PDF, :meth:`Document.extractImage` will now return them in their original format. Previously, the MuPDF base library was used, which returns them in PNG format (entailing a massive size increase). +* **Fixed** issue #259 ("Morphing text to fit inside rect"). Clarified use of :meth:`get_text_length` and removed extra line breaks for long words. + +------ + +**Changes in Version 1.14.8** + +* **Added** :meth:`Pixmap.set_rect` to change the pixel values in a rectangle. This is also an alternative to setting the color of a complete pixmap (:meth:`Pixmap.clear_with`). +* **Fixed** an image extraction issue with JBIG2 (monochrome) encoded PDF images. 
The issue occurred in :meth:`Page.getText` (parameters "dict" and "rawdict") and in :meth:`Document.extractImage` methods. +* **Fixed** an issue with not correctly clearing a non-alpha :ref:`Pixmap` (:meth:`Pixmap.clear_with`). +* **Fixed** an issue with not correctly inverting colors of a non-alpha :ref:`Pixmap` (:meth:`Pixmap.invert_irect`). + +------ + +**Changes in Version 1.14.7** + +* **Added** :meth:`Pixmap.set_pixel` to change one pixel value. +* **Added** documentation for image conversion in the :ref:`FAQ`. +* **Added** new function :meth:`get_text_length` to determine the string length for a given font. +* **Added** Postscript image output (changed :meth:`Pixmap.save` and :meth:`Pixmap.tobytes`). +* **Changed** :meth:`Pixmap.save` and :meth:`Pixmap.tobytes` to ensure valid combinations of colorspace, alpha and output format. +* **Changed** :meth:`Pixmap.save`: the desired format is now inferred from the filename. +* **Changed** FreeText annotations can now have a transparent background - see :meth:`Annot.update`. + +------ + +**Changes in Version 1.14.5** + +* **Changed:** :ref:`Shape` methods now strictly use the transformation matrix of the :ref:`Page` -- instead of "manually" calculating locations. +* **Added** method :meth:`Pixmap.pixel` which returns the pixel value (a list) for given pixel coordinates. +* **Added** method :meth:`Pixmap.tobytes` which returns a bytes object representing the pixmap in a variety of formats. Previously, this could be done for PNG outputs only (:meth:`Pixmap.tobytes`). +* **Changed:** output of methods :meth:`Pixmap.save` and (the new) :meth:`Pixmap.tobytes` may now also be PSD (Adobe Photoshop Document). +* **Added** method :meth:`Shape.drawQuad` which draws a :ref:`Quad`. This actually is a shorthand for a :meth:`Shape.drawPolyline` with the edges of the quad. +* **Changed** method :meth:`Shape.drawOval`: the argument can now be **either** a rectangle (:data:`rect_like`) **or** a quadrilateral (:data:`quad_like`). + +------ + +**Changes in Version 1.14.4** + +* **Fixes** issue #239 "Annotation coordinate consistency". + + +------ + +**Changes in Version 1.14.3** + +This patch version contains minor bug fixes and CJK font output support. + +* **Added** support for the four CJK fonts as PyMuPDF generated text output. This pertains to methods :meth:`Page.insertFont`, :meth:`Shape.insertText`, :meth:`Shape.insertTextbox`, and corresponding :ref:`Page` methods. The new fonts are available under "reserved" fontnames "china-t" (traditional Chinese), "china-s" (simplified Chinese), "japan" (Japanese), and "korea" (Korean). +* **Added** full support for the built-in fonts 'Symbol' and 'Zapfdingbats'. +* **Changed:** The 14 standard fonts can now each be referenced by a 4-letter abbreviation. + +------ + +**Changes in Version 1.14.1** + +This patch version contains minor performance improvements. + +* **Added** support for :ref:`Document` filenames given as *pathlib* object by using the Python *str()* function. + + +------ + +**Changes in Version 1.14.0** + +To support MuPDF v1.14.0, massive changes were required in PyMuPDF -- most of them purely technical, with little visibility to developers. But there are also quite a lot of interesting new and improved features. Following are the details: + +* **Added** "ink" annotation. +* **Added** "rubber stamp" annotation. +* **Added** "squiggly" text marker annotation. +* **Added** new class :ref:`Quad` (quadrilateral or tetragon) -- which represents a general four-sided shape in the plane. 
The special subtype of rectangular, non-empty tetragons is used in text marker annotations and as returned objects in text search methods.
+* **Added** a new option "decrypt" to :meth:`Document.save` and :meth:`Document.write`. Now you can **keep encryption** when saving a password protected PDF.
+* **Added** suppression and redirection of unsolicited messages issued by the underlying C-library MuPDF. Consult :ref:`RedirectMessages` for details.
+* **Changed:** Changes to annotations now **always require** :meth:`Annot.update` to become effective.
+* **Changed** free text annotations to support the full Latin character set and range of appearance options.
+* **Changed** text searching, :meth:`Page.searchFor`, to optionally return :ref:`Quad` instead of :ref:`Rect` objects surrounding each search hit.
+* **Changed** plain text output: we now add a *\n* to each line if it does not itself end with this character.
+* **Fixed** issue 211 ("Something wrong in the doc").
+* **Fixed** issue 213 ("Rewritten outline is displayed only by mupdf-based applications").
+* **Fixed** issue 214 ("PDF decryption GONE!").
+* **Fixed** issue 215 ("Formatting of links added with pyMuPDF").
+* **Fixed** issue 217 ("extraction through json is failing for my pdf").
+
+Behind the curtain, we have changed the implementation of geometry objects: they now purely exist in Python and no longer have "shadow" twins on the C-level (in MuPDF). This has improved processing speed in that area by more than a factor of two.
+
+For the same reason, most methods involving geometry parameters now also accept the corresponding Python sequence. For example, in method *"page.show_pdf_page(rect, ...)"* parameter *rect* may now be any :data:`rect_like` sequence.
+
+We also invested considerable effort to further extend and improve the :ref:`FAQ` chapter.
+
+
+------
+
+**Changes in Version 1.13.19**
+
+This version contains some technical / performance improvements and bug fixes.
+
+* **Changed** memory management: for Python 3 builds, Python memory management is exclusively used across all C-level code (i.e. no more native *malloc()* in MuPDF code or PyMuPDF interface code). This leads to improved memory usage profiles and also some runtime improvements: we have seen > 2% shorter runtimes for text extractions and pixmap creations (on Windows machines only to date).
+* **Fixed** an error occurring in Python 2.7 which crashed the interpreter when using :meth:`TextPage.extractRAWDICT` (= *Page.getText("rawdict")*).
+* **Fixed** an error occurring in Python 2.7 when creating link destinations.
+* **Extended** the :ref:`FAQ` chapter with more examples.
+
+------
+
+**Changes in Version 1.13.18**
+
+* **Added** method :meth:`TextPage.extractRAWDICT`, and a corresponding new string parameter "rawdict" to method :meth:`Page.getText`. It extracts text and images from a page in Python *dict* form like :meth:`TextPage.extractDICT`, but with the detail level of :meth:`TextPage.extractXML`, which is position information down to each single character.
+
+------
+
+**Changes in Version 1.13.17**
+
+* **Fixed** an error that intermittently caused an exception in :meth:`Page.show_pdf_page`, when pages from many different source PDFs were shown.
+* **Changed** method :meth:`Document.extractImage` to now return more meta information about the extracted image. Also, its performance has been greatly improved. Several demo scripts have been changed to make use of this method.
+* **Changed** method :meth:`Document._getXrefStream` to now return *None* if the object is not a stream, instead of raising an exception.
+* **Added** method :meth:`Document._deleteObject` which deletes a PDF object identified by its :data:`xref`. Only to be used by the experienced PDF expert.
+* **Added** a method :meth:`paper_rect` which returns a :ref:`Rect` for a supplied paper format string. Example: *fitz.paper_rect("letter") = fitz.Rect(0.0, 0.0, 612.0, 792.0)*.
+* **Added** a :ref:`FAQ` chapter to this document.
+
+------
+
+**Changes in Version 1.13.16**
+
+* **Added** support for correctly setting transparency (opacity) for certain annotation types.
+* **Added** a tool property (:attr:`Tools.fitz_config`) showing the configuration of this PyMuPDF version.
+* **Fixed** issue #193 ('insertText(overlay=False) gives "cannot resize a buffer with shared storage" error') by avoiding read-only buffers.
+
+------
+
+**Changes in Version 1.13.15**
+
+* **Fixed** issue #189 ("cannot find builtin CJK font"), so we are supporting builtin CJK fonts now (CJK = China, Japan, Korea). This should lead to correctly generated pixmaps for documents using these languages. This change has consequences for our binary file size: it will now range between 8 and 10 MB, depending on the OS.
+* **Fixed** issue #191 ("Jupyter notebook kernel dies after ca. 40 pages"), which occurred when modifying the contents of an annotation.
+
+------
+
+**Changes in Version 1.13.14**
+
+This patch version contains several improvements, mainly for annotations.
+
+* **Changed:** :attr:`Annot.lineEnds` is now a list of two integers representing the line end symbols. Previously, it was a *dict* of strings.
+* **Added** support of line end symbols for applicable annotations. PyMuPDF can now generate these annotations including the line end symbols.
+* **Added** :meth:`Annot.setLineEnds`, which adds line end symbols to applicable annotation types ('Line', 'PolyLine', 'Polygon').
+* **Changed** technical implementation of :meth:`Page.insertImage` and :meth:`Page.show_pdf_page`: they now create their own contents objects, thereby avoiding changes of potentially large streams with consequential compression / decompression efforts and high change volumes with incremental updates.
+
+------
+
+**Changes in Version 1.13.13**
+
+This patch version contains several improvements for embedded files and file attachment annotations.
+
+* **Added** :meth:`Document.embfile_Upd` which allows changing **file content and metadata** of an embedded file. It supersedes the old method :meth:`Document.embfile_SetInfo` (which will be deleted in a future version). Content is automatically compressed and metadata may be unicode.
+* **Changed** :meth:`Document.embfile_Add` to now automatically compress file content. Accompanying metadata can now be unicode (had to be ASCII in the past).
+* **Changed** :meth:`Document.embfile_Del` to now automatically delete **all entries** having the supplied identifying name. The return code is now an integer count of the removed entries (was *None* previously).
+* **Changed** embedded file methods to now also accept or show the PDF unicode filename as additional parameter *ufilename*.
+* **Added** :meth:`Page.add_file_annot` which adds a new file attachment annotation.
+* **Changed** :meth:`Annot.fileUpd` (file attachment annot) to now also accept the PDF unicode *ufilename* parameter. The description parameter *desc* correctly works with unicode.
Furthermore, **all** parameters are optional, so metadata may be changed without also replacing the file content.
+* **Changed** :meth:`Annot.fileInfo` (file attachment annot) to now also show the PDF unicode filename as parameter *ufilename*.
+* **Fixed** issue #180 ("page.getText(output='dict') return invalid bbox") to now also work for vertical text.
+* **Fixed** issue #185 ("Can't render the annotations created by PyMuPDF"). The issue's cause was the minimalistic MuPDF approach when creating annotations. Several annotation types have no */AP* ("appearance") object when created by MuPDF functions. MuPDF, SumatraPDF and hence also PyMuPDF cannot render annotations without such an object. This fix now ensures that an appearance object is always created together with the annotation itself. We still do not support line end styles.
+
+------
+
+**Changes in Version 1.13.12**
+
+* **Fixed** issue #180 ("page.getText(output='dict') return invalid bbox"). Note that this is a circumvention of a MuPDF error, which generates zero-height character rectangles in some cases. When this happens, this fix ensures a bbox height of at least fontsize.
+* **Changed:** for ListBox and ComboBox widgets, the attribute list of selectable values has been renamed to :attr:`Widget.choice_values`.
+* **Changed:** when adding widgets, any missing one of the :ref:`Base-14-Fonts` is automatically added to the PDF. Widget text fonts can now also be chosen from existing widget fonts. Any specified field values are now honored and lead to a field with a preset value.
+* **Added** :meth:`Annot.updateWidget` which allows changing existing form fields -- including the field value.
+
+------
+
+**Changes in Version 1.13.11**
+
+While the preceding patch subversions only contained various fixes, this version again introduces major new features:
+
+* **Added** basic support for PDF widget annotations. You can now add PDF form fields of types Text, CheckBox, ListBox and ComboBox. Where necessary, the PDF is transformed to a Form PDF with the first added widget.
+* **Fixed** issues #176 ("wrong file embedding"), #177 ("segment fault when invoking page.getText()") and #179 ("Segmentation fault using page.getLinks() on encrypted PDF").
+
+
+------
+
+**Changes in Version 1.13.7**
+
+* **Added** support of variable page sizes for reflowable documents (e-books, HTML, etc.): new parameters *rect* and *fontsize* in :ref:`Document` creation (open), and as a separate method :meth:`Document.layout`.
+* **Added** :ref:`Annot` creation of many annotation types: sticky notes, free text, circle, rectangle, line, polygon, polyline and text markers.
+* **Added** support of annotation transparency (:attr:`Annot.opacity`, :meth:`Annot.setOpacity`).
+* **Changed** :attr:`Annot.vertices`: point coordinates are now grouped as pairs of floats (no longer as separate floats).
+* **Changed** annotation colors dictionary: the two keys are now named *"stroke"* (formerly *"common"*) and *"fill"*.
+* **Added** :attr:`Document.isDirty` which is *True* if a PDF has been changed in this session. Reset to *False* on each :meth:`Document.save` or :meth:`Document.write`.
+
+------
+
+**Changes in Version 1.13.6**
+
+* Fix #173: for memory-resident documents, ensure the stream object will not be garbage-collected by Python before the document is closed.
+
+------
+
+**Changes in Version 1.13.5**
+
+* New low-level method :meth:`Page._setContents` defines an object given by its :data:`xref` to serve as the :data:`contents` object.
+* Changed and extended PDF form field support: the attribute *widget_text* has been renamed to :attr:`Annot.widget_value`. Values of all form field types (except signatures) are now supported. A new attribute :attr:`Annot.widget_choices` contains the selectable values of listboxes and comboboxes. All these attributes now contain *None* if no value is present.
+
+------
+
+**Changes in Version 1.13.4**
+
+* :meth:`Document.convertToPDF` now supports page ranges, reversed page sequences and page rotation. If the document already is a PDF, an exception is raised.
+* Fixed a bug (introduced with v1.13.0) that prevented :meth:`Page.insertImage` for transparent images.
+
+------
+
+**Changes in Version 1.13.3**
+
+Introduces a way to convert **any MuPDF supported document** to a PDF. If you ever wanted PDF versions of your XPS, EPUB, CBZ or FB2 files -- here is a way to do this.
+
+* :meth:`Document.convertToPDF` returns a Python *bytes* object in PDF format. It can be opened like a normal PDF in PyMuPDF, or be written to disk with the *".pdf"* extension.
+
+------
+
+**Changes in Version 1.13.2**
+
+The major enhancement is PDF form field support. Form fields are annotations of type *(19, 'Widget')*. There is a new document method to check whether a PDF is a form. The :ref:`Annot` class has new properties describing field details.
+
+* :attr:`Document.is_form_pdf` is true if an */AcroForm* object and at least one form field exists.
+* :attr:`Annot.widget_type`, :attr:`Annot.widget_text` and :attr:`Annot.widget_name` contain the details of a form field (i.e. a "Widget" annotation).
+
+------
+
+**Changes in Version 1.13.1**
+
+* :meth:`TextPage.extractDICT` is a new method to extract the contents of a document page (text and images). All document types are supported as with the other :ref:`TextPage` *extract*()* methods. The returned object is a dictionary of nested lists and other dictionaries, and **exactly equal** to the JSON-deserialization of the old :meth:`TextPage.extractJSON`. The difference is that the result is created directly -- no JSON module is used. Because the user needs no JSON module to interpret the information, it should be easier to use, and it should also perform better, because it contains images in their original **binary format** -- they need not be base64-decoded.
+* :meth:`Page.getText` correspondingly supports the new parameter value *"dict"* to invoke the above method.
+* :meth:`TextPage.extractJSON` (resp. *Page.getText("json")*) is still supported for convenience, but its use is expected to decline.
+
+------
+
+**Changes in Version 1.13.0**
+
+This version is based on MuPDF v1.13.0. This release is "primarily a bug fix release".
+
+In PyMuPDF, we are also doing some bug fixes while introducing minor enhancements. There are only very minimal changes to the user's API.
+
+* :ref:`Document` construction is more flexible: the new *filetype* parameter allows setting the document type. If specified, any extension in the filename will be ignored. More completely addresses `issue #156 `_. As part of this, the documentation has been reworked.
+
+* Changes to :ref:`Pixmap` constructors:
+ - Colorspace conversion no longer allows dropping the alpha channel: source and target **alpha will now always be the same**. We have seen exceptions and even interpreter crashes when using *alpha = 0*.
+ - As a replacement, the simple pixmap copy lets you choose the target alpha.
+
+* :meth:`Document.save` again offers the full garbage collection range 0 thru 4.
+
+* :meth:`Document.save` now offers to "prettify" PDF source via an additional argument.
+* :meth:`Page.insertImage` has the additional *stream* parameter, specifying a memory area holding an image.
+
+* The issue with garbled PNGs on Linux systems has been resolved (`"Problem writing PNG" #133) `_.
+
+
+------
+
+**Changes in Version 1.12.4**
+
+This is an extension of 1.12.3.
+
+* Fix of `issue #147 `_: methods :meth:`Document.getPageFontlist` and :meth:`Document.getPageImagelist` now also show fonts and images contained in :data:`resources` nested via "Form XObjects".
+* Temporary fix of `issue #148 `_: Saving to new PDF files will now automatically use *garbage = 2* if a lower value is given. The final fix is expected with MuPDF's next version. At that point we will remove this circumvention.
+* Preventive fix of illegally using stencil / image mask pixmaps in some methods.
+* Method :meth:`Document.getPageFontlist` now includes the encoding name for each font in the list.
+* Method :meth:`Document.getPageImagelist` now includes the decode method name for each image in the list.
+
+------
+
+**Changes in Version 1.12.3**
+
+This is an extension of 1.12.2.
+
+* Many functions now return *None* instead of *0* if the result has no other meaning than just indicating successful execution (:meth:`Document.close`, :meth:`Document.save`, :meth:`Document.select`, :meth:`Pixmap.save` and many others).
+
+------
+
+**Changes in Version 1.12.2**
+
+This is an extension of 1.12.1.
+
+* Method :meth:`Page.show_pdf_page` now accepts the new *clip* argument. This specifies an area of the source page to which the display should be restricted.
+
+* New :attr:`Page.CropBox` and :attr:`Page.MediaBox` have been included for convenience.
+
+
+------
+
+**Changes in Version 1.12.1**
+
+This is an extension of version 1.12.0.
+
+* New method :meth:`Page.show_pdf_page` displays a page of another PDF. This is a **vector** image and therefore remains precise across zooming. Both documents involved must be PDFs.
+
+* New method :meth:`Page.getSVGimage` creates an SVG image from the page. In contrast to the raster image of a pixmap, this is a vector image format. The return is a unicode text string, which can be saved in a *.svg* file.
+
+* Method :meth:`Page.getTextBlocks` now accepts an additional bool parameter "images". If set to true (default is false), image blocks (metadata only) are included in the produced list, which allows detecting areas with rendered images.
+
+* Minor bug fixes.
+
+* The "text" result of :meth:`Page.getText` concatenates all lines within a block using a single space character. MuPDF's original output uses "\\n" instead, producing a rather ragged result.
+
+* New properties of :ref:`Page` objects, :attr:`Page.MediaBoxSize` and :attr:`Page.CropBoxPosition`, provide more information about a page's dimensions. For non-PDF files (and for most PDF files, too) these will be equal to :attr:`Page.rect.bottom_right` and :attr:`Page.rect.top_left`, respectively. For example, class :ref:`Shape` makes use of them to correctly position its items.
+
+------
+
+**Changes in Version 1.12.0**
+
+This version is based on and requires MuPDF v1.12.0. The new MuPDF version contains quite a number of changes -- most of them around text extraction. Some of the changes impact the programmer's API.
+
+* :meth:`Outline.saveText` and :meth:`Outline.saveXML` have been deleted without replacement. You probably haven't used them much anyway. But if you are looking for a replacement: the output of :meth:`Document.get_toc` can easily be used to produce something equivalent.
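+
+A minimal sketch of such a replacement (assuming an open document *doc*; with default arguments, each :meth:`Document.get_toc` item is a list *[level, title, page]*)::
+
+    for level, title, page in doc.get_toc():
+        print("    " * (level - 1) + f"{title}  ....  page {page}")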
+
+* Class *TextSheet* no longer exists.
+
+* Text "spans" (one of the hierarchy levels of :ref:`TextPage`) no longer contain positioning information (i.e. no "bbox" key). Instead, spans now provide the font information for their text. This impacts our JSON output variant.
+
+* HTML output has improved considerably: it now creates valid documents which browsers can display to produce a view similar to the original document.
+
+* There is a new output format XHTML, which provides text and images in a browser-readable format. The difference from HTML output is that no effort is made to reproduce the original layout.
+
+* All output formats of :meth:`Page.getText` now support creating complete, valid documents by wrapping them with appropriate header and trailer information. If you are interested in using the HTML output, please make sure to read :ref:`HTMLQuality`.
+
+* To support finding text positions, we have added special methods that don't need detours like :meth:`TextPage.extractJSON` or :meth:`TextPage.extractXML`: use :meth:`Page.getTextBlocks` or :meth:`Page.getTextWords` to create lists of text blocks or words, respectively, accompanied by their rectangles. This should be much faster than the standard text extraction methods and also avoids using additional packages for interpreting their output.
+
+
+------
+
+**Changes in Version 1.11.2**
+
+This is an extension of v1.11.1.
+
+* New :meth:`Page.insertFont` creates a PDF */Font* object and returns its object number.
+
+* New :meth:`Document.extractFont` extracts the content of an embedded font given its object number.
+
+* Items returned by the **FontList(...)** methods no longer contain the PDF generation number. This value never had any significance. Instead, the font file extension is included (e.g. "pfa" for a "PostScript Font for ASCII"), which is more valuable information.
+
+* Fonts other than "simple fonts" (Type1) are now also supported.
+
+* New options to change :ref:`Pixmap` size:
+
+  * Method :meth:`Pixmap.shrink` reduces the pixmap proportionally in place.
+
+  * A new :ref:`Pixmap` copy constructor allows scaling via setting target width and height.
+
+
+------
+
+**Changes in Version 1.11.1**
+
+This is an extension of v1.11.0.
+
+* New class *Shape*. It facilitates and extends the creation of image shapes on PDF pages. It contains multiple methods for creating elementary shapes like lines, rectangles or circles, which can be combined into more complex ones and can e.g. be "morphed" together. Combined shapes are handled as a unit and can be given common properties like line width or colors. The class can accumulate multiple complex shapes and put them all in the page's foreground or background -- thus also reducing the number of updates to the page's :data:`contents` object.
+
+* All *Page* draw methods now use the new *Shape* class.
+
+* Text insertion methods *insertText()* and *insertTextBox()* now support morphing in addition to text rotation. They have become part of the *Shape* class and thus allow text to be freely combined with graphics.
+
+* A new *Pixmap* constructor allows creating pixmap copies with an added alpha channel. A new method also allows directly manipulating alpha values.
+
+* Binary algebraic operations with geometry objects (matrices, rectangles and points) now generally also support lists or tuples as the second operand. You can add a tuple *(x, y)* of numbers to a :ref:`Point`. In this context, such sequences are called ":data:`point_like`" (resp. :data:`matrix_like`, :data:`rect_like`).
+
+* Geometry objects now fully support in-place operators. For example, *p /= m* replaces point *p* with *p * 1/m* for a number, or *p * ~m* for a :data:`matrix_like` object *m*. Similarly, if *r* is a rectangle, then *r |= (3, 4)* is the new rectangle that also includes *fitz.Point(3, 4)*, and *r &= (1, 2, 3, 4)* is its intersection with *fitz.Rect(1, 2, 3, 4)*.
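+
+A small sketch of both features (assuming *import fitz*; the values are arbitrary)::
+
+    p = fitz.Point(1, 2)
+    p += (3, 4)            # a tuple acts as a point_like operand
+    r = fitz.Rect(0, 0, 1, 1)
+    r |= (3, 4)            # enlarged to also include fitz.Point(3, 4)
+    r &= (1, 2, 3, 4)      # intersected with fitz.Rect(1, 2, 3, 4)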
+
+------
+
+**Changes in Version 1.11.0**
+
+This version is based on and requires MuPDF v1.11.
+
+Though MuPDF has declared it to be mostly a bug fix version, one major new feature is included: support of embedded files -- also called portfolios or collections. We have extended PyMuPDF's functionality to cover this to an extent just a little beyond the *mutool* utility, as follows.
+
+* The *Document* class now supports embedded files with several new methods and one new property:
+
+  - *embfile_Info()* returns metadata information about an entry in the list of embedded files. This is more than *mutool* currently provides: it shows all the information that was used to embed the file (not just the entry's name).
+  - *embfile_Get()* retrieves the (decompressed) content of an entry into a *bytes* buffer.
+  - *embfile_Add(...)* inserts new content into the PDF portfolio. We (in contrast to *mutool*) **restrict** this to entries with a **new name** (no duplicate names allowed).
+  - *embfile_Del(...)* deletes an entry from the portfolio (function not offered in MuPDF).
+  - *embfile_SetInfo()* -- changes filename or description of an embedded file.
+  - *embfile_Count* -- contains the number of embedded files.
+
+* Several enhancements deal with streamlining geometry objects. These are not connected to the new MuPDF version and most of them are also reflected in PyMuPDF v1.10.0. Among them are new properties to identify the corners of rectangles by name (e.g. *Rect.bottom_right*) and new methods to deal with set-theoretic questions like *Rect.contains(x)* or *IRect.intersects(x)*. Special effort was focused on supporting more "Pythonic" language constructs: *if x in rect ...* is equivalent to *rect.contains(x)*.
+
+* The :ref:`Rect` chapter now has more background on empty and infinite rectangles and how we handle them. The handling itself was also updated for more consistency in this area.
+
+* We have started basic support for **generation** of PDF content:
+
+  - *Document.insert_page()* adds a new page into a PDF, optionally containing some text.
+  - *Page.insertImage()* places a new image on a PDF page.
+  - *Page.insertText()* puts new text on an existing page.
+
+* For **FileAttachment** annotations, content and name of the attached file can be extracted and changed.
+
+------
+
+**Changes in Version 1.10.0**
+
+**MuPDF v1.10 Impact**
+
+MuPDF version 1.10 has a significant impact on our bindings. Some of the changes also affect the API -- in other words, **you** as a PyMuPDF user.
+
+* Link destination information has been reduced. Several properties of the *linkDest* class no longer contain valuable information. In fact, this class as a whole has been deleted from MuPDF's library and we in PyMuPDF only maintain it to provide compatibility with existing code.
+
+* In an effort to minimize memory requirements, several improvements have been built into MuPDF v1.10:
+
+  - A new *config.h* file can be used to de-select unwanted features in the C base code. Using this feature we have been able to reduce the size of our binary *_fitz.o* / *_fitz.pyd* by about 50% (from 9 MB to 4.5 MB). When UPX-ing this, the size goes even further down to a very handy 2.3 MB.
+
+  - The alpha (transparency) channel for pixmaps is now optional. Letting alpha default to *False* significantly reduces pixmap sizes (by 20% -- CMYK, 25% -- RGB, 50% -- GRAY). Many *Pixmap* constructors therefore now accept an *alpha* boolean to control inclusion of this channel. Other pixmap constructors (e.g. those for file and image input) create pixmaps with no alpha altogether. On the downside, save methods for pixmaps no longer accept a *savealpha* option: this channel will always be saved when present. To minimize code breaks, we have left this parameter in the call patterns -- it will just be ignored.
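+
+For example, a pixmap without transparency can be requested directly when rendering a page (a minimal sketch, assuming a loaded *page*; the *alpha* argument of :meth:`Page.getPixmap` here mirrors the boolean accepted by the *Pixmap* constructors described above)::
+
+    pix = page.getPixmap(alpha=False)   # no alpha samples in the pixmap
+    pix.writePNG("page.png")            # no savealpha option needed any more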
+
+* *DisplayList* and *TextPage* class constructors now **require the mediabox** of the page they are referring to (i.e. the *page.bound()* rectangle). There is no way to construct this information from other sources, therefore a source code change cannot be avoided in these cases. We assume, however, that not many users are actually employing these rather low-level classes explicitly. So the impact of that change should be minor.
+
+**Other Changes compared to Version 1.9.3**
+
+* The new :ref:`Document` method *write()* writes an opened PDF to memory (as opposed to a file, like *save()* does).
+* An annotation can now be scaled and moved around on its page. This is done by modifying its rectangle.
+* Annotations can now be deleted. :ref:`Page` contains the new method *deleteAnnot()*.
+* Various annotation attributes can now be modified, e.g. content, dates, title (= author), border, colors.
+* Method *Document.insert_pdf()* now also copies annotations of source pages.
+* The *Pages* class has been deleted. As documents can now be accessed with page numbers as indices (like *doc[n] == doc.loadPage(n)*), and document objects can be used as iterators, the benefit of this class was too low to maintain it. See the following comments.
+* *loadPage(n)* / *doc[n]* now accept arbitrary integers to specify a page number, as long as *n < pageCount*. So, e.g. *doc[-500]* is always valid and will load page *(-500) % pageCount*.
+* A document can now also be used as an iterator like this: *for page in doc: ... ...*. This will yield all pages of *doc* as *page*.
+* The :ref:`Pixmap` method *getSize()* has been replaced with property *size*. As before, *Pixmap.size == len(Pixmap)* is true.
+* In response to transparency (alpha) being optional, several new parameters and properties have been added to :ref:`Pixmap` and :ref:`Colorspace` classes to support determining their characteristics.
+* The :ref:`Page` class now contains new properties *firstAnnot* and *firstLink* to provide starting points to the respective class chains, where *firstLink* is just a mnemonic synonym to method *loadLinks()*, which continues to exist. Similarly, the new property *rect* is a synonym for method *bound()*, which also continues to exist.
+* :ref:`Pixmap` methods *samplesRGB()* and *samplesAlpha()* have been deleted because pixmaps can now be created without transparency.
+* :ref:`Rect` now has a property *irect* which is a synonym of method *round()*. Likewise, :ref:`IRect` now has property *rect* to deliver a :ref:`Rect` which has the same coordinates as float values.
+* Document has the new method *searchPageFor()* to search for a text string. It works exactly like the corresponding *Page.searchFor()* with the page number as an additional parameter.
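+
+A short sketch of both calls (assuming an open document *doc*; each returns a list of hit rectangles)::
+
+    hits_doc = doc.searchPageFor(0, "mupdf")    # search via the document, page 0
+    hits_page = doc[0].searchFor("mupdf")       # the corresponding page method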
+
+
+------
+
+**Changes in Version 1.9.3**
+
+This version is also based on MuPDF v1.9a. Changes compared to version 1.9.2:
+
+* As a major enhancement, annotations are now supported in a similar way to links. Annotations can be displayed (as pixmaps) and their properties can be accessed.
+* In addition to the document *select()* method, some simpler methods can now be used to manipulate a PDF:
+
+  - *copyPage()* copies a page within a document.
+  - *movePage()* is similar, but deletes the original.
+  - *delete_page()* deletes a page.
+  - *delete_pages()* deletes a page range.
+
+* *rotation* or *setRotation()* access or change a PDF page's rotation, respectively.
+* Available but undocumented before, :ref:`IRect`, :ref:`Rect`, :ref:`Point` and :ref:`Matrix` support the *len()* method and their coordinate properties can be accessed via indices, e.g. *IRect.x1 == IRect[2]*.
+* For convenience, documents now support simple indexing: *doc.loadPage(n) == doc[n]*. The index may, however, be in the range *-pageCount < n < pageCount*, such that *doc[-1]* is the last page of the document.
+
+------
+
+**Changes in Version 1.9.2**
+
+This version is also based on MuPDF v1.9a. Changes compared to version 1.9.1:
+
+* *fitz.open()* (no parameters) creates a new empty **PDF** document, i.e. if saved afterwards, it must be given a *.pdf* extension.
+* :ref:`Document` now accepts all of the following formats (*Document* and *open* are synonyms):
+
+  - *open()*,
+  - *open(filename)* (equivalent to *open(filename, None)*),
+  - *open(filetype, area)* (equivalent to *open(filetype, stream = area)*).
+
+  The type of the memory area *stream* may be *bytes* or *bytearray*. Thus, e.g. *area = open("file.pdf", "rb").read()* may be used directly (without first converting it to bytearray).
+* New method *Document.insert_pdf()* (PDFs only) inserts a range of pages from another PDF.
+* *Document* objects *doc* now support the *len()* function: ``len(doc) == doc.pageCount``.
+* New method *Document.getPageImageList()* creates a list of images used on a page.
+* New method *Document.getPageFontList()* creates a list of fonts referenced by a page.
+* New pixmap constructor *fitz.Pixmap(doc, xref)* creates a pixmap based on an opened PDF document and an :data:`xref` number of the image.
+* New pixmap constructor *fitz.Pixmap(cspace, spix)* creates a pixmap as a copy of another one *spix* with the colorspace converted to *cspace*. This works for all colorspace combinations.
+* Pixmap constructor *fitz.Pixmap(colorspace, width, height, samples)* now allows *samples* to also be *bytes*, not only *bytearray*.
+
+
+------
+
+**Changes in Version 1.9.1**
+
+This version of PyMuPDF is based on MuPDF library source code version 1.9a published on April 21, 2016.
+
+Please have a look at MuPDF's website to see which changes and enhancements are contained herein.
+
+Changes in version 1.9.1 compared to version 1.8.0 are the following:
+
+* New methods *get_area()* for both *fitz.Rect* and *fitz.IRect*.
+* Pixmaps can now be created directly from files using the new constructor *fitz.Pixmap(filename)*.
+* The Pixmap constructor *fitz.Pixmap(image)* has been extended accordingly.
+* *fitz.Rect* can now be created with all possible combinations of points and coordinates.
+* PyMuPDF classes and methods now all contain __doc__ strings, most of them created by SWIG automatically.
While the PyMuPDF documentation certainly is more detailed, this feature should help a lot when programming in Python-aware IDEs. +* A new document method of *getPermits()* returns the permissions associated with the current access to the document (print, edit, annotate, copy), as a Python dictionary. +* The identity matrix *fitz.Identity* is now **immutable**. +* The new document method *select(list)* removes all pages from a document that are not contained in the list. Pages can also be duplicated and re-arranged. +* Various improvements and new members in our demo and examples collections. Perhaps most prominently: *PDF_display* now supports scrolling with the mouse wheel, and there is a new example program *wxTableExtract* which allows to graphically identify and extract table data in documents. +* *fitz.open()* is now an alias of *fitz.Document()*. +* New pixmap method *tobytes()* which will return a bytearray formatted as a PNG image of the pixmap. +* New pixmap method *samplesRGB()* providing a *samples* version with alpha bytes stripped off (RGB colorspaces only). +* New pixmap method *samplesAlpha()* providing the alpha bytes only of the *samples* area. +* New iterator *fitz.Pages(doc)* over a document's set of pages. +* New matrix methods *invert()* (calculate inverted matrix), *concat()* (calculate matrix product), *pretranslate()* (perform a shift operation). +* New *IRect* methods *intersect()* (intersection with another rectangle), *translate()* (perform a shift operation). +* New *Rect* methods *intersect()* (intersection with another rectangle), *transform()* (transformation with a matrix), *include_point()* (enlarge rectangle to also contain a point), *include_rect()* (enlarge rectangle to also contain another one). +* Documented *Point.transform()* (transform a point with a matrix). +* *Matrix*, *IRect*, *Rect* and *Point* classes now support compact, algebraic formulations for manipulating such objects. +* Incremental saves for changes are possible now using the call pattern *doc.save(doc.name, incremental=True)*. +* A PDF's metadata can now be deleted, set or changed by document method *set_metadata()*. Supports incremental saves. +* A PDF's bookmarks (or table of contents) can now be deleted, set or changed with the entries of a list using document method *set_toc(list)*. Supports incremental saves. + +.. codespell:ignore-end diff -r 000000000000 -r 1d09e1dec1d9 pipcl.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/pipcl.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,3132 @@ +''' +Python packaging operations, including PEP-517 support, for use by a `setup.py` +script. + +The intention is to take care of as many packaging details as possible so that +setup.py contains only project-specific information, while also giving as much +flexibility as possible. + +For example we provide a function `build_extension()` that can be used to build +a SWIG extension, but we also give access to the located compiler/linker so +that a `setup.py` script can take over the details itself. + +Run doctests with: `python -m doctest pipcl.py` + +For Graal we require that PIPCL_GRAAL_PYTHON is set to non-graal Python (we +build for non-graal except with Graal Python's include paths and library +directory). 
+''' + +import base64 +import codecs +import glob +import hashlib +import inspect +import io +import os +import platform +import re +import shlex +import shutil +import site +import subprocess +import sys +import sysconfig +import tarfile +import textwrap +import time +import zipfile + +import wdev + + +class Package: + ''' + Our constructor takes a definition of a Python package similar to that + passed to `distutils.core.setup()` or `setuptools.setup()` (name, version, + summary etc) plus callbacks for building, getting a list of sdist + filenames, and cleaning. + + We provide methods that can be used to implement a Python package's + `setup.py` supporting PEP-517. + + We also support basic command line handling for use + with a legacy (pre-PEP-517) pip, as implemented + by legacy distutils/setuptools and described in: + https://pip.pypa.io/en/stable/reference/build-system/setup-py/ + + Here is a `doctest` example of using pipcl to create a SWIG extension + module. Requires `swig`. + + Create an empty test directory: + + >>> import os + >>> import shutil + >>> shutil.rmtree('pipcl_test', ignore_errors=1) + >>> os.mkdir('pipcl_test') + + Create a `setup.py` which uses `pipcl` to define an extension module. + + >>> import textwrap + >>> with open('pipcl_test/setup.py', 'w') as f: + ... _ = f.write(textwrap.dedent(""" + ... import sys + ... import pipcl + ... + ... def build(): + ... so_leaf = pipcl.build_extension( + ... name = 'foo', + ... path_i = 'foo.i', + ... outdir = 'build', + ... ) + ... return [ + ... ('build/foo.py', 'foo/__init__.py'), + ... ('cli.py', 'foo/__main__.py'), + ... (f'build/{so_leaf}', f'foo/'), + ... ('README', '$dist-info/'), + ... (b'Hello world', 'foo/hw.txt'), + ... ] + ... + ... def sdist(): + ... return [ + ... 'foo.i', + ... 'bar.i', + ... 'setup.py', + ... 'pipcl.py', + ... 'wdev.py', + ... 'README', + ... (b'Hello word2', 'hw2.txt'), + ... ] + ... + ... p = pipcl.Package( + ... name = 'foo', + ... version = '1.2.3', + ... fn_build = build, + ... fn_sdist = sdist, + ... entry_points = ( + ... { 'console_scripts': [ + ... 'foo_cli = foo.__main__:main', + ... ], + ... }), + ... ) + ... + ... build_wheel = p.build_wheel + ... build_sdist = p.build_sdist + ... + ... # Handle old-style setup.py command-line usage: + ... if __name__ == '__main__': + ... p.handle_argv(sys.argv) + ... """)) + + Create the files required by the above `setup.py` - the SWIG `.i` input + file, the README file, and copies of `pipcl.py` and `wdev.py`. + + >>> with open('pipcl_test/foo.i', 'w') as f: + ... _ = f.write(textwrap.dedent(""" + ... %include bar.i + ... %{ + ... #include + ... #include + ... int bar(const char* text) + ... { + ... printf("bar(): text: %s\\\\n", text); + ... int len = (int) strlen(text); + ... printf("bar(): len=%i\\\\n", len); + ... fflush(stdout); + ... return len; + ... } + ... %} + ... int bar(const char* text); + ... """)) + + >>> with open('pipcl_test/bar.i', 'w') as f: + ... _ = f.write( '\\n') + + >>> with open('pipcl_test/README', 'w') as f: + ... _ = f.write(textwrap.dedent(""" + ... This is Foo. + ... """)) + + >>> with open('pipcl_test/cli.py', 'w') as f: + ... _ = f.write(textwrap.dedent(""" + ... def main(): + ... print('pipcl_test:main().') + ... if __name__ == '__main__': + ... main() + ... 
""")) + + >>> root = os.path.dirname(__file__) + >>> _ = shutil.copy2(f'{root}/pipcl.py', 'pipcl_test/pipcl.py') + >>> _ = shutil.copy2(f'{root}/wdev.py', 'pipcl_test/wdev.py') + + Use `setup.py`'s command-line interface to build and install the extension + module into root `pipcl_test/install`. + + >>> _ = subprocess.run( + ... f'cd pipcl_test && {sys.executable} setup.py --root install install', + ... shell=1, check=1) + + The actual install directory depends on `sysconfig.get_path('platlib')`: + + >>> if windows(): + ... install_dir = 'pipcl_test/install' + ... else: + ... install_dir = f'pipcl_test/install/{sysconfig.get_path("platlib").lstrip(os.sep)}' + >>> assert os.path.isfile( f'{install_dir}/foo/__init__.py') + + Create a test script which asserts that Python function call `foo.bar(s)` + returns the length of `s`, and run it with `PYTHONPATH` set to the install + directory: + + >>> with open('pipcl_test/test.py', 'w') as f: + ... _ = f.write(textwrap.dedent(""" + ... import sys + ... import foo + ... text = 'hello' + ... print(f'test.py: calling foo.bar() with text={text!r}') + ... sys.stdout.flush() + ... l = foo.bar(text) + ... print(f'test.py: foo.bar() returned: {l}') + ... assert l == len(text) + ... """)) + >>> r = subprocess.run( + ... f'{sys.executable} pipcl_test/test.py', + ... shell=1, check=1, text=1, + ... stdout=subprocess.PIPE, stderr=subprocess.STDOUT, + ... env=os.environ | dict(PYTHONPATH=install_dir), + ... ) + >>> print(r.stdout) + test.py: calling foo.bar() with text='hello' + bar(): text: hello + bar(): len=5 + test.py: foo.bar() returned: 5 + + + Check that building sdist and wheel succeeds. For now we don't attempt to + check that the sdist and wheel actually work. + + >>> _ = subprocess.run( + ... f'cd pipcl_test && {sys.executable} setup.py sdist', + ... shell=1, check=1) + + >>> _ = subprocess.run( + ... f'cd pipcl_test && {sys.executable} setup.py bdist_wheel', + ... shell=1, check=1) + + Check that rebuild does nothing. + + >>> t0 = os.path.getmtime('pipcl_test/build/foo.py') + >>> _ = subprocess.run( + ... f'cd pipcl_test && {sys.executable} setup.py bdist_wheel', + ... shell=1, check=1) + >>> t = os.path.getmtime('pipcl_test/build/foo.py') + >>> assert t == t0 + + Check that touching bar.i forces rebuild. + + >>> os.utime('pipcl_test/bar.i') + >>> _ = subprocess.run( + ... f'cd pipcl_test && {sys.executable} setup.py bdist_wheel', + ... shell=1, check=1) + >>> t = os.path.getmtime('pipcl_test/build/foo.py') + >>> assert t > t0 + + Check that touching foo.i.cpp does not run swig, but does recompile/link. + + >>> t0 = time.time() + >>> os.utime('pipcl_test/build/foo.i.cpp') + >>> _ = subprocess.run( + ... f'cd pipcl_test && {sys.executable} setup.py bdist_wheel', + ... shell=1, check=1) + >>> assert os.path.getmtime('pipcl_test/build/foo.py') <= t0 + >>> so = glob.glob('pipcl_test/build/*.so') + >>> assert len(so) == 1 + >>> so = so[0] + >>> assert os.path.getmtime(so) > t0 + + Check `entry_points` causes creation of command `foo_cli` when we install + from our wheel using pip. [As of 2024-02-24 using pipcl's CLI interface + directly with `setup.py install` does not support entry points.] + + >>> print('Creating venv.', file=sys.stderr) + >>> _ = subprocess.run( + ... f'cd pipcl_test && {sys.executable} -m venv pylocal', + ... shell=1, check=1) + + >>> print('Installing from wheel into venv using pip.', file=sys.stderr) + >>> _ = subprocess.run( + ... f'. pipcl_test/pylocal/bin/activate && pip install pipcl_test/dist/*.whl', + ... 
shell=1, check=1) + + >>> print('Running foo_cli.', file=sys.stderr) + >>> _ = subprocess.run( + ... f'. pipcl_test/pylocal/bin/activate && foo_cli', + ... shell=1, check=1) + + Wheels and sdists + + Wheels: + We generate wheels according to: + https://packaging.python.org/specifications/binary-distribution-format/ + + * `{name}-{version}.dist-info/RECORD` uses sha256 hashes. + * We do not generate other `RECORD*` files such as + `RECORD.jws` or `RECORD.p7s`. + * `{name}-{version}.dist-info/WHEEL` has: + + * `Wheel-Version: 1.0` + * `Root-Is-Purelib: false` + * No support for signed wheels. + + Sdists: + We generate sdist's according to: + https://packaging.python.org/specifications/source-distribution-format/ + ''' + def __init__(self, + name, + version, + *, + platform = None, + supported_platform = None, + summary = None, + description = None, + description_content_type = None, + keywords = None, + home_page = None, + download_url = None, + author = None, + author_email = None, + maintainer = None, + maintainer_email = None, + license = None, + classifier = None, + requires_dist = None, + requires_python = None, + requires_external = None, + project_url = None, + provides_extra = None, + + entry_points = None, + + root = None, + fn_build = None, + fn_clean = None, + fn_sdist = None, + tag_python = None, + tag_abi = None, + tag_platform = None, + py_limited_api = None, + + wheel_compression = zipfile.ZIP_DEFLATED, + wheel_compresslevel = None, + ): + ''' + The initial args before `root` define the package + metadata and closely follow the definitions in: + https://packaging.python.org/specifications/core-metadata/ + + Args: + + name: + A string, the name of the Python package. + version: + A string, the version of the Python package. Also see PEP-440 + `Version Identification and Dependency Specification`. + platform: + A string or list of strings. + supported_platform: + A string or list of strings. + summary: + A string, short description of the package. + description: + A string. If contains newlines, a detailed description of the + package. Otherwise the path of a file containing the detailed + description of the package. + description_content_type: + A string describing markup of `description` arg. For example + `text/markdown; variant=GFM`. + keywords: + A string containing comma-separated keywords. + home_page: + URL of home page. + download_url: + Where this version can be downloaded from. + author: + Author. + author_email: + Author email. + maintainer: + Maintainer. + maintainer_email: + Maintainer email. + license: + A string containing the license text. Written into metadata + file `COPYING`. Is also written into metadata itself if not + multi-line. + classifier: + A string or list of strings. Also see: + + * https://pypi.org/pypi?%3Aaction=list_classifiers + * https://pypi.org/classifiers/ + + requires_dist: + A string or list of strings. None items are ignored. Also see PEP-508. + requires_python: + A string or list of strings. + requires_external: + A string or list of strings. + project_url: + A string or list of strings, each of the form: `{name}, {url}`. + provides_extra: + A string or list of strings. + + entry_points: + String or dict specifying *.dist-info/entry_points.txt, for + example: + + ``` + [console_scripts] + foo_cli = foo.__main__:main + ``` + + or: + + { 'console_scripts': [ + 'foo_cli = foo.__main__:main', + ], + } + + See: https://packaging.python.org/en/latest/specifications/entry-points/ + + root: + Root of package, defaults to current directory. 
+ + fn_build: + A function taking no args, or a single `config_settings` dict + arg (as described in PEP-517), that builds the package. + + Should return a list of items; each item should be a tuple + `(from_, to_)`, or a single string `path` which is treated as + the tuple `(path, path)`. + + `from_` can be a string or a `bytes`. If a string it should + be the path to a file; a relative path is treated as relative + to `root`. If a `bytes` it is the contents of the file to be + added. + + `to_` identifies what the file should be called within a wheel + or when installing. If `to_` ends with `/`, the leaf of `from_` + is appended to it (and `from_` must not be a `bytes`). + + Initial `$dist-info/` in `_to` is replaced by + `{name}-{version}.dist-info/`; this is useful for license files + etc. + + Initial `$data/` in `_to` is replaced by + `{name}-{version}.data/`. We do not enforce particular + subdirectories, instead it is up to `fn_build()` to specify + specific subdirectories such as `purelib`, `headers`, + `scripts`, `data` etc. + + If we are building a wheel (e.g. `python setup.py bdist_wheel`, + or PEP-517 pip calls `self.build_wheel()`), we add file `from_` + to the wheel archive with name `to_`. + + If we are installing (e.g. `install` command in + the argv passed to `self.handle_argv()`), then + we copy `from_` to `{sitepackages}/{to_}`, where + `sitepackages` is the installation directory, the + default being `sysconfig.get_path('platlib')` e.g. + `myvenv/lib/python3.9/site-packages/`. + + fn_clean: + A function taking a single arg `all_` that cleans generated + files. `all_` is true iff `--all` is in argv. + + For safety and convenience, can also returns a list of + files/directory paths to be deleted. Relative paths are + interpreted as relative to `root`. All paths are asserted to be + within `root`. + + fn_sdist: + A function taking no args, or a single `config_settings` dict + arg (as described in PEP517), that returns a list of items to + be copied into the sdist. The list should be in the same format + as returned by `fn_build`. + + It can be convenient to use `pipcl.git_items()`. + + The specification for sdists requires that the list contains + `pyproject.toml`; we enforce this with a diagnostic rather than + raising an exception, to allow legacy command-line usage. + + tag_python: + First element of wheel tag defined in PEP-425. If None we use + `cp{version}`. + + For example if code works with any Python version, one can use + 'py3'. + + tag_abi: + Second element of wheel tag defined in PEP-425. If None we use + `none`. + + tag_platform: + Third element of wheel tag defined in PEP-425. Default + is `os.environ('AUDITWHEEL_PLAT')` if set, otherwise + derived from `sysconfig.get_platform()` (was + `setuptools.distutils.util.get_platform(), before that + `distutils.util.get_platform()` as specified in the PEP), e.g. + `openbsd_7_0_amd64`. + + For pure python packages use: `tag_platform=any` + + py_limited_api: + If true we build wheels that use the Python Limited API. We use + the version of `sys.executable` to define `Py_LIMITED_API` when + compiling extensions, and use ABI tag `abi3` in the wheel name + if argument `tag_abi` is None. + + wheel_compression: + Used as `zipfile.ZipFile()`'s `compression` parameter when + creating wheels. + + wheel_compresslevel: + Used as `zipfile.ZipFile()`'s `compresslevel` parameter when + creating wheels. + + Occurrences of `None` in lists are ignored. 
+ ''' + assert name + assert version + + def assert_str( v): + if v is not None: + assert isinstance( v, str), f'Not a string: {v!r}' + def assert_str_or_multi( v): + if v is not None: + assert isinstance( v, (str, tuple, list)), f'Not a string, tuple or list: {v!r}' + + assert_str( name) + assert_str( version) + assert_str_or_multi( platform) + assert_str_or_multi( supported_platform) + assert_str( summary) + assert_str( description) + assert_str( description_content_type) + assert_str( keywords) + assert_str( home_page) + assert_str( download_url) + assert_str( author) + assert_str( author_email) + assert_str( maintainer) + assert_str( maintainer_email) + assert_str( license) + assert_str_or_multi( classifier) + assert_str_or_multi( requires_dist) + assert_str( requires_python) + assert_str_or_multi( requires_external) + assert_str_or_multi( project_url) + assert_str_or_multi( provides_extra) + + # https://packaging.python.org/en/latest/specifications/core-metadata/. + assert re.match('([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$', name, re.IGNORECASE), \ + f'Bad name: {name!r}' + + _assert_version_pep_440(version) + + # https://packaging.python.org/en/latest/specifications/binary-distribution-format/ + if tag_python: + assert '-' not in tag_python + if tag_abi: + assert '-' not in tag_abi + if tag_platform: + assert '-' not in tag_platform + + self.name = name + self.version = version + self.platform = platform + self.supported_platform = supported_platform + self.summary = summary + self.description = description + self.description_content_type = description_content_type + self.keywords = keywords + self.home_page = home_page + self.download_url = download_url + self.author = author + self.author_email = author_email + self.maintainer = maintainer + self.maintainer_email = maintainer_email + self.license = license + self.classifier = classifier + self.requires_dist = requires_dist + self.requires_python = requires_python + self.requires_external = requires_external + self.project_url = project_url + self.provides_extra = provides_extra + self.entry_points = entry_points + + self.root = os.path.abspath(root if root else os.getcwd()) + self.fn_build = fn_build + self.fn_clean = fn_clean + self.fn_sdist = fn_sdist + self.tag_python_ = tag_python + self.tag_abi_ = tag_abi + self.tag_platform_ = tag_platform + self.py_limited_api = py_limited_api + + self.wheel_compression = wheel_compression + self.wheel_compresslevel = wheel_compresslevel + + # If true and we are building for graal, we set PIPCL_PYTHON_CONFIG to + # a command that will print includes/libs from graal_py's sysconfig. + # + self.graal_legacy_python_config = True + + + def build_wheel(self, + wheel_directory, + config_settings=None, + metadata_directory=None, + ): + ''' + A PEP-517 `build_wheel()` function. + + Also called by `handle_argv()` to handle the `bdist_wheel` command. + + Returns leafname of generated wheel within `wheel_directory`. + ''' + log2( + f' wheel_directory={wheel_directory!r}' + f' config_settings={config_settings!r}' + f' metadata_directory={metadata_directory!r}' + ) + + if sys.implementation.name == 'graalpy': + # We build for Graal by building a native Python wheel with Graal + # Python's include paths and library directory. We then rename the + # wheel to contain graal's tag etc. + # + log0(f'### Graal build: deferring to cpython.') + python_native = os.environ.get('PIPCL_GRAAL_PYTHON') + assert python_native, f'Graal build requires that PIPCL_GRAAL_PYTHON is set.' 
+ env_extra = dict( + PIPCL_SYSCONFIG_PATH_include = sysconfig.get_path('include'), + PIPCL_SYSCONFIG_PATH_platinclude = sysconfig.get_path('platinclude'), + PIPCL_SYSCONFIG_CONFIG_VAR_LIBDIR = sysconfig.get_config_var('LIBDIR'), + ) + # Tell native build to run pipcl.py itself to get python-config + # information about include paths etc. + if self.graal_legacy_python_config: + env_extra['PIPCL_PYTHON_CONFIG'] = f'{python_native} {os.path.abspath(__file__)} --graal-legacy-python-config' + + # Create venv. + venv_name = os.environ.get('PIPCL_GRAAL_NATIVE_VENV') + if venv_name: + log1(f'Graal using pre-existing {venv_name=}') + else: + venv_name = 'venv-pipcl-graal-native' + run(f'{shlex.quote(python_native)} -m venv {venv_name}') + log1(f'Graal using {venv_name=}') + + newfiles = NewFiles(f'{wheel_directory}/*.whl') + run( + f'. {venv_name}/bin/activate && python setup.py --dist-dir {shlex.quote(wheel_directory)} bdist_wheel', + env_extra = env_extra, + prefix = f'pipcl.py graal {python_native}: ', + ) + wheel = newfiles.get_one() + wheel_leaf = os.path.basename(wheel) + python_major_minor = run(f'{shlex.quote(python_native)} -c "import platform; import sys; sys.stdout.write(str().join(platform.python_version_tuple()[:2]))"', capture=1) + cpabi = f'cp{python_major_minor}-abi3' + assert cpabi in wheel_leaf, f'Expected wheel to be for {cpabi=}, but {wheel=}.' + graalpy_ext_suffix = sysconfig.get_config_var('EXT_SUFFIX') + log1(f'{graalpy_ext_suffix=}') + m = re.match(r'\.graalpy(\d+[^\-]*)-(\d+)', graalpy_ext_suffix) + gpver = m[1] + cpver = m[2] + graalpy_wheel_tag = f'graalpy{cpver}-graalpy{gpver}_{cpver}_native' + name = wheel_leaf.replace(cpabi, graalpy_wheel_tag) + destination = f'{wheel_directory}/{name}' + log0(f'### Graal build: copying {wheel=} to {destination=}') + # Copying results in two wheels which appears to confuse pip, showing: + # Found multiple .whl files; unspecified behaviour. Will call build_wheel. + os.rename(wheel, destination) + log1(f'Returning {name=}.') + return name + + wheel_name = self.wheel_name() + path = f'{wheel_directory}/{wheel_name}' + + # Do a build and get list of files to copy into the wheel. + # + items = list() + if self.fn_build: + items = self._call_fn_build(config_settings) + + log2(f'Creating wheel: {path}') + os.makedirs(wheel_directory, exist_ok=True) + record = _Record() + with zipfile.ZipFile(path, 'w', self.wheel_compression, self.wheel_compresslevel) as z: + + def add(from_, to_): + if isinstance(from_, str): + z.write(from_, to_) + record.add_file(from_, to_) + elif isinstance(from_, bytes): + z.writestr(to_, from_) + record.add_content(from_, to_) + else: + assert 0 + + def add_str(content, to_): + add(content.encode('utf8'), to_) + + dist_info_dir = self._dist_info_dir() + + # Add the files returned by fn_build(). + # + for item in items: + from_, (to_abs, to_rel) = self._fromto(item) + add(from_, to_rel) + + # Add -.dist-info/WHEEL. + # + add_str( + f'Wheel-Version: 1.0\n' + f'Generator: pipcl\n' + f'Root-Is-Purelib: false\n' + f'Tag: {self.wheel_tag_string()}\n' + , + f'{dist_info_dir}/WHEEL', + ) + # Add -.dist-info/METADATA. + # + add_str(self._metainfo(), f'{dist_info_dir}/METADATA') + + # Add -.dist-info/COPYING. + if self.license: + add_str(self.license, f'{dist_info_dir}/COPYING') + + # Add -.dist-info/entry_points.txt. + entry_points_text = self._entry_points_text() + if entry_points_text: + add_str(entry_points_text, f'{dist_info_dir}/entry_points.txt') + + # Update -.dist-info/RECORD. This must be last. 
+ # + z.writestr(f'{dist_info_dir}/RECORD', record.get(f'{dist_info_dir}/RECORD')) + + st = os.stat(path) + log1( f'Have created wheel size={st.st_size:,}: {path}') + if g_verbose >= 2: + with zipfile.ZipFile(path, compression=self.wheel_compression) as z: + log2(f'Contents are:') + for zi in sorted(z.infolist(), key=lambda z: z.filename): + log2(f' {zi.file_size: 10,d} {zi.filename}') + + return os.path.basename(path) + + + def build_sdist(self, + sdist_directory, + formats, + config_settings=None, + ): + ''' + A PEP-517 `build_sdist()` function. + + Also called by `handle_argv()` to handle the `sdist` command. + + Returns leafname of generated archive within `sdist_directory`. + ''' + log2( + f' sdist_directory={sdist_directory!r}' + f' formats={formats!r}' + f' config_settings={config_settings!r}' + ) + if formats and formats != 'gztar': + raise Exception( f'Unsupported: formats={formats}') + items = list() + if self.fn_sdist: + if inspect.signature(self.fn_sdist).parameters: + items = self.fn_sdist(config_settings) + else: + items = self.fn_sdist() + + prefix = f'{_normalise(self.name)}-{self.version}' + os.makedirs(sdist_directory, exist_ok=True) + tarpath = f'{sdist_directory}/{prefix}.tar.gz' + log2(f'Creating sdist: {tarpath}') + + with tarfile.open(tarpath, 'w:gz') as tar: + + names_in_tar = list() + def check_name(name): + if name in names_in_tar: + raise Exception(f'Name specified twice: {name}') + names_in_tar.append(name) + + def add(from_, name): + check_name(name) + if isinstance(from_, str): + log2( f'Adding file: {os.path.relpath(from_)} => {name}') + tar.add( from_, f'{prefix}/{name}', recursive=False) + elif isinstance(from_, bytes): + log2( f'Adding: {name}') + ti = tarfile.TarInfo(f'{prefix}/{name}') + ti.size = len(from_) + ti.mtime = time.time() + tar.addfile(ti, io.BytesIO(from_)) + else: + assert 0 + + def add_string(text, name): + textb = text.encode('utf8') + return add(textb, name) + + found_pyproject_toml = False + for item in items: + from_, (to_abs, to_rel) = self._fromto(item) + if isinstance(from_, bytes): + add(from_, to_rel) + else: + if from_.startswith(f'{os.path.abspath(sdist_directory)}/'): + # Source files should not be inside . + assert 0, f'Path is inside sdist_directory={sdist_directory}: {from_!r}' + assert os.path.exists(from_), f'Path does not exist: {from_!r}' + assert os.path.isfile(from_), f'Path is not a file: {from_!r}' + if to_rel == 'pyproject.toml': + found_pyproject_toml = True + add(from_, to_rel) + + if not found_pyproject_toml: + log0(f'Warning: no pyproject.toml specified.') + + # Always add a PKG-INFO file. + add_string(self._metainfo(), 'PKG-INFO') + + if self.license: + if 'COPYING' in names_in_tar: + log2(f'Not writing .license because file already in sdist: COPYING') + else: + add_string(self.license, 'COPYING') + + log1( f'Have created sdist: {tarpath}') + return os.path.basename(tarpath) + + def wheel_tag_string(self): + ''' + Returns --. + ''' + return f'{self.tag_python()}-{self.tag_abi()}-{self.tag_platform()}' + + def tag_python(self): + ''' + Get two-digit python version, e.g. 'cp3.8' for python-3.8.6. + ''' + if self.tag_python_: + return self.tag_python_ + else: + return 'cp' + ''.join(platform.python_version().split('.')[:2]) + + def tag_abi(self): + ''' + ABI tag. + ''' + if self.tag_abi_: + return self.tag_abi_ + elif self.py_limited_api: + return 'abi3' + else: + return 'none' + + def tag_platform(self): + ''' + Find platform tag used in wheel filename. 
+ ''' + ret = self.tag_platform_ + log0(f'From self.tag_platform_: {ret=}.') + + if not ret: + # Prefer this to PEP-425. Appears to be undocumented, + # but set in manylinux docker images and appears + # to be used by cibuildwheel and auditwheel, e.g. + # https://github.com/rapidsai/shared-action-workflows/issues/80 + ret = os.environ.get( 'AUDITWHEEL_PLAT') + log0(f'From AUDITWHEEL_PLAT: {ret=}.') + + if not ret: + # Notes: + # + # PEP-425. On Linux gives `linux_x86_64` which is rejected by + # pypi.org. + # + # On local MacOS/arm64 mac-mini have seen sysconfig.get_platform() + # unhelpfully return `macosx-10.9-universal2` if `python3` is the + # system Python /usr/bin/python3; this happens if we source `. + # /etc/profile`. + # + ret = sysconfig.get_platform() + ret = ret.replace('-', '_').replace('.', '_').lower() + log0(f'From sysconfig.get_platform(): {ret=}.') + + # We need to patch things on MacOS. + # + # E.g. `foo-1.2.3-cp311-none-macosx_13_x86_64.whl` + # causes `pip` to fail with: `not a supported wheel on this + # platform`. We seem to need to add `_0` to the OS version. + # + m = re.match( '^(macosx_[0-9]+)(_[^0-9].+)$', ret) + if m: + ret2 = f'{m.group(1)}_0{m.group(2)}' + log0(f'After macos patch, changing from {ret!r} to {ret2!r}.') + ret = ret2 + + log0( f'tag_platform(): returning {ret=}.') + return ret + + def wheel_name(self): + return f'{_normalise(self.name)}-{self.version}-{self.tag_python()}-{self.tag_abi()}-{self.tag_platform()}.whl' + + def wheel_name_match(self, wheel): + ''' + Returns true if `wheel` matches our wheel. We basically require the + name to be the same, except that we accept platform tags that contain + extra items (see pep-0600/), for example we return true with: + + self: foo-cp38-none-manylinux2014_x86_64.whl + wheel: foo-cp38-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl + ''' + log2(f'{wheel=}') + assert wheel.endswith('.whl') + wheel2 = wheel[:-len('.whl')] + name, version, tag_python, tag_abi, tag_platform = wheel2.split('-') + + py_limited_api_compatible = False + if self.py_limited_api and tag_abi == 'abi3': + # Allow lower tag_python number. + m = re.match('cp([0-9]+)', tag_python) + tag_python_int = int(m.group(1)) + m = re.match('cp([0-9]+)', self.tag_python()) + tag_python_int_self = int(m.group(1)) + if tag_python_int <= tag_python_int_self: + # This wheel uses Python stable ABI same or older than ours, so + # we can use it. 
+ log2(f'py_limited_api; {tag_python=} compatible with {self.tag_python()=}.') + py_limited_api_compatible = True + + log2(f'{_normalise(self.name) == name=}') + log2(f'{self.version == version=}') + log2(f'{self.tag_python() == tag_python=} {self.tag_python()=} {tag_python=}') + log2(f'{py_limited_api_compatible=}') + log2(f'{self.tag_abi() == tag_abi=}') + log2(f'{self.tag_platform() in tag_platform.split(".")=}') + log2(f'{self.tag_platform()=}') + log2(f'{tag_platform.split(".")=}') + ret = (1 + and _normalise(self.name) == name + and self.version == version + and (self.tag_python() == tag_python or py_limited_api_compatible) + and self.tag_abi() == tag_abi + and self.tag_platform() in tag_platform.split('.') + ) + log2(f'Returning {ret=}.') + return ret + + def _entry_points_text(self): + if self.entry_points: + if isinstance(self.entry_points, str): + return self.entry_points + ret = '' + for key, values in self.entry_points.items(): + ret += f'[{key}]\n' + for value in values: + ret += f'{value}\n' + return ret + + def _call_fn_build( self, config_settings=None): + assert self.fn_build + log2(f'calling self.fn_build={self.fn_build}') + if inspect.signature(self.fn_build).parameters: + ret = self.fn_build(config_settings) + else: + ret = self.fn_build() + assert isinstance( ret, (list, tuple)), \ + f'Expected list/tuple from {self.fn_build} but got: {ret!r}' + return ret + + + def _argv_clean(self, all_): + ''' + Called by `handle_argv()`. + ''' + if not self.fn_clean: + return + paths = self.fn_clean(all_) + if paths: + if isinstance(paths, str): + paths = paths, + for path in paths: + if not os.path.isabs(path): + path = ps.path.join(self.root, path) + path = os.path.abspath(path) + assert path.startswith(self.root+os.sep), \ + f'path={path!r} does not start with root={self.root+os.sep!r}' + log2(f'Removing: {path}') + shutil.rmtree(path, ignore_errors=True) + + + def install(self, record_path=None, root=None): + ''' + Called by `handle_argv()` to handle `install` command.. + ''' + log2( f'{record_path=} {root=}') + + # Do a build and get list of files to install. 
+ # + items = list() + if self.fn_build: + items = self._call_fn_build( dict()) + + root2 = install_dir(root) + log2( f'{root2=}') + + log1( f'Installing into: {root2!r}') + dist_info_dir = self._dist_info_dir() + + if not record_path: + record_path = f'{root2}/{dist_info_dir}/RECORD' + record = _Record() + + def add_file(from_, to_abs, to_rel): + os.makedirs( os.path.dirname( to_abs), exist_ok=True) + if isinstance(from_, bytes): + log2(f'Copying content into {to_abs}.') + with open(to_abs, 'wb') as f: + f.write(from_) + record.add_content(from_, to_rel) + else: + log0(f'{from_=}') + log2(f'Copying from {os.path.relpath(from_, self.root)} to {to_abs}') + shutil.copy2( from_, to_abs) + record.add_file(from_, to_rel) + + def add_str(content, to_abs, to_rel): + log2( f'Writing to: {to_abs}') + os.makedirs( os.path.dirname( to_abs), exist_ok=True) + with open( to_abs, 'w') as f: + f.write( content) + record.add_content(content, to_rel) + + for item in items: + from_, (to_abs, to_rel) = self._fromto(item) + log0(f'{from_=} {to_abs=} {to_rel=}') + to_abs2 = f'{root2}/{to_rel}' + add_file( from_, to_abs2, to_rel) + + add_str( self._metainfo(), f'{root2}/{dist_info_dir}/METADATA', f'{dist_info_dir}/METADATA') + + if self.license: + add_str( self.license, f'{root2}/{dist_info_dir}/COPYING', f'{dist_info_dir}/COPYING') + + entry_points_text = self._entry_points_text() + if entry_points_text: + add_str( + entry_points_text, + f'{root2}/{dist_info_dir}/entry_points.txt', + f'{dist_info_dir}/entry_points.txt', + ) + + log2( f'Writing to: {record_path}') + with open(record_path, 'w') as f: + f.write(record.get()) + + log2(f'Finished.') + + + def _argv_dist_info(self, root): + ''' + Called by `handle_argv()`. There doesn't seem to be any documentation + for `setup.py dist_info`, but it appears to be like `egg_info` except + it writes to a slightly different directory. + ''' + if root is None: + root = f'{self.name}-{self.version}.dist-info' + self._write_info(f'{root}/METADATA') + if self.license: + with open( f'{root}/COPYING', 'w') as f: + f.write( self.license) + + + def _argv_egg_info(self, egg_base): + ''' + Called by `handle_argv()`. + ''' + if egg_base is None: + egg_base = '.' + self._write_info(f'{egg_base}/.egg-info') + + + def _write_info(self, dirpath=None): + ''' + Writes egg/dist info to files in directory `dirpath` or `self.root` if + `None`. + ''' + if dirpath is None: + dirpath = self.root + log2(f'Creating files in directory {dirpath}') + os.makedirs(dirpath, exist_ok=True) + with open(os.path.join(dirpath, 'PKG-INFO'), 'w') as f: + f.write(self._metainfo()) + + # These don't seem to be required? + # + #with open(os.path.join(dirpath, 'SOURCES.txt', 'w') as f: + # pass + #with open(os.path.join(dirpath, 'dependency_links.txt', 'w') as f: + # pass + #with open(os.path.join(dirpath, 'top_level.txt', 'w') as f: + # f.write(f'{self.name}\n') + #with open(os.path.join(dirpath, 'METADATA', 'w') as f: + # f.write(self._metainfo()) + + + def handle_argv(self, argv): + ''' + Attempt to handles old-style (pre PEP-517) command line passed by + old releases of pip to a `setup.py` script, and manual running of + `setup.py`. + + This is partial support at best. + ''' + global g_verbose + #log2(f'argv: {argv}') + + class ArgsRaise: + pass + + class Args: + ''' + Iterates over argv items. + ''' + def __init__( self, argv): + self.items = iter( argv) + def next( self, eof=ArgsRaise): + ''' + Returns next arg. If no more args, we return or raise an + exception if is ArgsRaise. 
+ ''' + try: + return next( self.items) + except StopIteration: + if eof is ArgsRaise: + raise Exception('Not enough args') + return eof + + command = None + opt_all = None + opt_dist_dir = 'dist' + opt_egg_base = None + opt_formats = None + opt_install_headers = None + opt_record = None + opt_root = None + + args = Args(argv[1:]) + + while 1: + arg = args.next(None) + if arg is None: + break + + elif arg in ('-h', '--help', '--help-commands'): + log0(textwrap.dedent(''' + Usage: + [...] [...] + Commands: + bdist_wheel + Creates a wheel called + /--
.whl, where + is "dist" or as specified by --dist-dir, + and
encodes ABI and platform etc. + clean + Cleans build files. + dist_info + Creates files in -.dist-info/ or + directory specified by --egg-base. + egg_info + Creates files in .egg-info/ or directory + directory specified by --egg-base. + install + Builds and installs. Writes installation + information to if --record was + specified. + sdist + Make a source distribution: + /-.tar.gz + Options: + --all + Used by "clean". + --compile + Ignored. + --dist-dir | -d + Default is "dist". + --egg-base + Used by "egg_info". + --formats + Used by "sdist". + --install-headers + Ignored. + --python-tag + Ignored. + --record + Used by "install". + --root + Used by "install". + --single-version-externally-managed + Ignored. + --verbose -v + Extra diagnostics. + Other: + windows-vs [-y ] [-v ] [-g ] [--verbose] + Windows only; looks for matching Python. + ''')) + return + + elif arg in ('bdist_wheel', 'clean', 'dist_info', 'egg_info', 'install', 'sdist'): + assert command is None, 'Two commands specified: {command} and {arg}.' + command = arg + + elif arg in ('windows-vs', 'windows-python', 'show-sysconfig'): + assert command is None, 'Two commands specified: {command} and {arg}.' + command = arg + + elif arg == '--all': opt_all = True + elif arg == '--compile': pass + elif arg == '--dist-dir' or arg == '-d': opt_dist_dir = args.next() + elif arg == '--egg-base': opt_egg_base = args.next() + elif arg == '--formats': opt_formats = args.next() + elif arg == '--install-headers': opt_install_headers = args.next() + elif arg == '--python-tag': pass + elif arg == '--record': opt_record = args.next() + elif arg == '--root': opt_root = args.next() + elif arg == '--single-version-externally-managed': pass + elif arg == '--verbose' or arg == '-v': g_verbose += 1 + + else: + raise Exception(f'Unrecognised arg: {arg}') + + assert command, 'No command specified' + + log1(f'Handling command={command}') + if 0: pass + elif command == 'bdist_wheel': self.build_wheel(opt_dist_dir) + elif command == 'clean': self._argv_clean(opt_all) + elif command == 'dist_info': self._argv_dist_info(opt_egg_base) + elif command == 'egg_info': self._argv_egg_info(opt_egg_base) + elif command == 'install': self.install(opt_record, opt_root) + elif command == 'sdist': self.build_sdist(opt_dist_dir, opt_formats) + + elif command == 'windows-python': + version = None + while 1: + arg = args.next(None) + if arg is None: + break + elif arg == '-v': + version = args.next() + elif arg == '--verbose': + g_verbose += 1 + else: + assert 0, f'Unrecognised {arg=}' + python = wdev.WindowsPython(version=version) + print(f'Python is:\n{python.description_ml(" ")}') + + elif command == 'windows-vs': + grade = None + version = None + year = None + while 1: + arg = args.next(None) + if arg is None: + break + elif arg == '-g': + grade = args.next() + elif arg == '-v': + version = args.next() + elif arg == '-y': + year = args.next() + elif arg == '--verbose': + g_verbose += 1 + else: + assert 0, f'Unrecognised {arg=}' + vs = wdev.WindowsVS(year=year, grade=grade, version=version) + print(f'Visual Studio is:\n{vs.description_ml(" ")}') + + elif command == 'show-sysconfig': + show_sysconfig() + for mod in platform, sys: + log0(f'{mod.__name__}:') + for n in dir(mod): + if n.startswith('_'): + continue + log0(f'{mod.__name__}.{n}') + if mod is platform and n == 'uname': + continue + if mod is platform and n == 'pdb': + continue + if mod is sys and n in ('breakpointhook', 'exit'): + # We don't want to call these. 
+ continue + v = getattr(mod, n) + if callable(v): + try: + v = v() + except Exception: + pass + else: + #print(f'{n=}', flush=1) + try: + print(f' {mod.__name__}.{n}()={v!r}') + except Exception: + print(f' Failed to print value of {mod.__name__}.{n}().') + else: + try: + print(f' {mod.__name__}.{n}={v!r}') + except Exception: + print(f' Failed to print value of {mod.__name__}.{n}.') + + else: + assert 0, f'Unrecognised command: {command}' + + log2(f'Finished handling command: {command}') + + + def __str__(self): + return ('{' + f'name={self.name!r}' + f' version={self.version!r}' + f' platform={self.platform!r}' + f' supported_platform={self.supported_platform!r}' + f' summary={self.summary!r}' + f' description={self.description!r}' + f' description_content_type={self.description_content_type!r}' + f' keywords={self.keywords!r}' + f' home_page={self.home_page!r}' + f' download_url={self.download_url!r}' + f' author={self.author!r}' + f' author_email={self.author_email!r}' + f' maintainer={self.maintainer!r}' + f' maintainer_email={self.maintainer_email!r}' + f' license={self.license!r}' + f' classifier={self.classifier!r}' + f' requires_dist={self.requires_dist!r}' + f' requires_python={self.requires_python!r}' + f' requires_external={self.requires_external!r}' + f' project_url={self.project_url!r}' + f' provides_extra={self.provides_extra!r}' + + f' root={self.root!r}' + f' fn_build={self.fn_build!r}' + f' fn_sdist={self.fn_sdist!r}' + f' fn_clean={self.fn_clean!r}' + f' tag_python={self.tag_python_!r}' + f' tag_abi={self.tag_abi_!r}' + f' tag_platform={self.tag_platform_!r}' + '}' + ) + + def _dist_info_dir( self): + return f'{_normalise(self.name)}-{self.version}.dist-info' + + def _metainfo(self): + ''' + Returns text for `.egg-info/PKG-INFO` file, or `PKG-INFO` in an sdist + `.tar.gz` file, or `...dist-info/METADATA` in a wheel. + ''' + # 2021-04-30: Have been unable to get multiline content working on + # test.pypi.org so we currently put the description as the body after + # all the other headers. + # + ret = [''] + def add(key, value): + if value is None: + return + if isinstance( value, (tuple, list)): + for v in value: + if v is not None: + add( key, v) + return + if key == 'License' and '\n' in value: + # This is ok because we write `self.license` into + # *.dist-info/COPYING. + # + log1( f'Omitting license because contains newline(s).') + return + assert '\n' not in value, f'key={key} value contains newline: {value!r}' + if key == 'Project-URL': + assert value.count(',') == 1, f'For {key=}, should have one comma in {value!r}.' + ret[0] += f'{key}: {value}\n' + #add('Description', self.description) + add('Metadata-Version', '2.1') + + # These names are from: + # https://packaging.python.org/specifications/core-metadata/ + # + for name in ( + 'Name', + 'Version', + 'Platform', + 'Supported-Platform', + 'Summary', + 'Description-Content-Type', + 'Keywords', + 'Home-page', + 'Download-URL', + 'Author', + 'Author-email', + 'Maintainer', + 'Maintainer-email', + 'License', + 'Classifier', + 'Requires-Dist', + 'Requires-Python', + 'Requires-External', + 'Project-URL', + 'Provides-Extra', + ): + identifier = name.lower().replace( '-', '_') + add( name, getattr( self, identifier)) + + ret = ret[0] + + # Append description as the body + if self.description: + if '\n' in self.description: + description_text = self.description.strip() + else: + with open(self.description) as f: + description_text = f.read() + ret += '\n' # Empty line separates headers from body. 
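+            # (PKG-INFO is an email-style format, so the long description is
+            # allowed to appear as the message body.)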
+ ret += description_text + ret += '\n' + return ret + + def _path_relative_to_root(self, path, assert_within_root=True): + ''' + Returns `(path_abs, path_rel)`, where `path_abs` is absolute path and + `path_rel` is relative to `self.root`. + + Interprets `path` as relative to `self.root` if not absolute. + + We use `os.path.realpath()` to resolve any links. + + if `assert_within_root` is true, assert-fails if `path` is not within + `self.root`. + ''' + if os.path.isabs(path): + p = path + else: + p = os.path.join(self.root, path) + p = os.path.realpath(os.path.abspath(p)) + if assert_within_root: + assert p.startswith(self.root+os.sep) or p == self.root, \ + f'Path not within root={self.root+os.sep!r}: {path=} {p=}' + p_rel = os.path.relpath(p, self.root) + return p, p_rel + + def _fromto(self, p): + ''' + Returns `(from_, (to_abs, to_rel))`. + + If `p` is a string we convert to `(p, p)`. Otherwise we assert that + `p` is a tuple `(from_, to_)` where `from_` is str/bytes and `to_` is + str. If `from_` is a bytes it is contents of file to add, otherwise the + path of an existing file; non-absolute paths are assumed to be relative + to `self.root`. If `to_` is empty or ends with `/`, we append the leaf + of `from_` (which must be a str). + + If `to_` starts with `$dist-info/`, we replace this with + `self._dist_info_dir()`. + + If `to_` starts with `$data/`, we replace this with + `{self.name}-{self.version}.data/`. + + We assert that `to_abs` is `within self.root`. + + `to_rel` is derived from the `to_abs` and is relative to self.root`. + ''' + ret = None + if isinstance(p, str): + p = p, p + assert isinstance(p, tuple) and len(p) == 2 + + from_, to_ = p + assert isinstance(from_, (str, bytes)) + assert isinstance(to_, str) + if to_.endswith('/') or to_=='': + to_ += os.path.basename(from_) + prefix = '$dist-info/' + if to_.startswith( prefix): + to_ = f'{self._dist_info_dir()}/{to_[ len(prefix):]}' + prefix = '$data/' + if to_.startswith( prefix): + to_ = f'{self.name}-{self.version}.data/{to_[ len(prefix):]}' + if isinstance(from_, str): + from_, _ = self._path_relative_to_root( from_, assert_within_root=False) + to_ = self._path_relative_to_root(to_) + assert isinstance(from_, (str, bytes)) + log2(f'returning {from_=} {to_=}') + return from_, to_ + + +def build_extension( + name, + path_i, + outdir, + builddir=None, + includes=None, + defines=None, + libpaths=None, + libs=None, + optimise=True, + debug=False, + compiler_extra='', + linker_extra='', + swig=None, + cpp=True, + prerequisites_swig=None, + prerequisites_compile=None, + prerequisites_link=None, + infer_swig_includes=True, + py_limited_api=False, + ): + ''' + Builds a Python extension module using SWIG. Works on Windows, Linux, MacOS + and OpenBSD. + + On Unix, sets rpath when linking shared libraries. + + Args: + name: + Name of generated extension module. + path_i: + Path of input SWIG `.i` file. Internally we use swig to generate a + corresponding `.c` or `.cpp` file. + outdir: + Output directory for generated files: + + * `{outdir}/{name}.py` + * `{outdir}/_{name}.so` # Unix + * `{outdir}/_{name}.*.pyd` # Windows + We return the leafname of the `.so` or `.pyd` file. + builddir: + Where to put intermediate files, for example the .cpp file + generated by swig and `.d` dependency files. Default is `outdir`. + includes: + A string, or a sequence of extra include directories to be prefixed + with `-I`. + defines: + A string, or a sequence of extra preprocessor defines to be + prefixed with `-D`. 
+ libpaths + A string, or a sequence of library paths to be prefixed with + `/LIBPATH:` on Windows or `-L` on Unix. + libs + A string, or a sequence of library names. Each item is prefixed + with `-l` on non-Windows. + optimise: + Whether to use compiler optimisations. + debug: + Whether to build with debug symbols. + compiler_extra: + Extra compiler flags. Can be None. + linker_extra: + Extra linker flags. Can be None. + swig: + Swig command; if false we use 'swig'. + cpp: + If true we tell SWIG to generate C++ code instead of C. + prerequisites_swig: + prerequisites_compile: + prerequisites_link: + + [These are mainly for use on Windows. On other systems we + automatically generate dynamic dependencies using swig/compile/link + commands' `-MD` and `-MF` args.] + + Sequences of extra input files/directories that should force + running of swig, compile or link commands if they are newer than + any existing generated SWIG `.i` file, compiled object file or + shared library file. + + If present, the first occurrence of `True` or `False` forces re-run + or no re-run. Any occurrence of None is ignored. If an item is a + directory path we look for newest file within the directory tree. + + If not a sequence, we convert into a single-item list. + + prerequisites_swig + + We use swig's -MD and -MF args to generate dynamic dependencies + automatically, so this is not usually required. + + prerequisites_compile + prerequisites_link + + On non-Windows we use cc's -MF and -MF args to generate dynamic + dependencies so this is not usually required. + infer_swig_includes: + If true, we extract `-I` and `-I ` args from + `compile_extra` (also `/I` on windows) and use them with swig so + that it can see the same header files as C/C++. This is useful + when using enviromment variables such as `CC` and `CXX` to set + `compile_extra. + py_limited_api: + If true we build for current Python's limited API / stable ABI. + + Returns the leafname of the generated library file within `outdir`, e.g. + `_{name}.so` on Unix or `_{name}.cp311-win_amd64.pyd` on Windows. + ''' + if compiler_extra is None: + compiler_extra = '' + if linker_extra is None: + linker_extra = '' + if builddir is None: + builddir = outdir + if not swig: + swig = 'swig' + includes_text = _flags( includes, '-I') + defines_text = _flags( defines, '-D') + libpaths_text = _flags( libpaths, '/LIBPATH:', '"') if windows() else _flags( libpaths, '-L') + libs_text = _flags( libs, '' if windows() else '-l') + path_cpp = f'{builddir}/{os.path.basename(path_i)}' + path_cpp += '.cpp' if cpp else '.c' + os.makedirs( outdir, exist_ok=True) + + # Run SWIG. + + if infer_swig_includes: + # Extract include flags from `compiler_extra`. + swig_includes_extra = '' + compiler_extra_items = compiler_extra.split() + i = 0 + while i < len(compiler_extra_items): + item = compiler_extra_items[i] + # Swig doesn't seem to like a space after `I`. 
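+            # e.g. both `-I /usr/include/foo` and `-I/usr/include/foo` are
+            # passed to swig as `-I/usr/include/foo`.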
+ if item == '-I' or (windows() and item == '/I'): + swig_includes_extra += f' -I{compiler_extra_items[i+1]}' + i += 1 + elif item.startswith('-I') or (windows() and item.startswith('/I')): + swig_includes_extra += f' -I{compiler_extra_items[i][2:]}' + i += 1 + swig_includes_extra = swig_includes_extra.strip() + deps_path = f'{path_cpp}.d' + prerequisites_swig2 = _get_prerequisites( deps_path) + run_if( + f''' + {swig} + -Wall + {"-c++" if cpp else ""} + -python + -module {name} + -outdir {outdir} + -o {path_cpp} + -MD -MF {deps_path} + {includes_text} + {swig_includes_extra} + {path_i} + ''' + , + path_cpp, + path_i, + prerequisites_swig, + prerequisites_swig2, + ) + + so_suffix = _so_suffix(use_so_versioning = not py_limited_api) + path_so_leaf = f'_{name}{so_suffix}' + path_so = f'{outdir}/{path_so_leaf}' + + py_limited_api2 = current_py_limited_api() if py_limited_api else None + + if windows(): + path_obj = f'{path_so}.obj' + + permissive = '/permissive-' + EHsc = '/EHsc' + T = '/Tp' if cpp else '/Tc' + optimise2 = '/DNDEBUG /O2' if optimise else '/D_DEBUG' + debug2 = '' + if debug: + debug2 = '/Zi' # Generate .pdb. + # debug2 = '/Z7' # Embed debug info in .obj files. + + py_limited_api3 = f'/DPy_LIMITED_API={py_limited_api2}' if py_limited_api2 else '' + + # As of 2023-08-23, it looks like VS tools create slightly + # .dll's each time, even with identical inputs. + # + # Some info about this is at: + # https://nikhilism.com/post/2020/windows-deterministic-builds/. + # E.g. an undocumented linker flag `/Brepro`. + # + + command, pythonflags = base_compiler(cpp=cpp) + command = f''' + {command} + # General: + /c # Compiles without linking. + {EHsc} # Enable "Standard C++ exception handling". + + #/MD # Creates a multithreaded DLL using MSVCRT.lib. + {'/MDd' if debug else '/MD'} + + # Input/output files: + {T}{path_cpp} # /Tp specifies C++ source file. + /Fo{path_obj} # Output file. codespell:ignore + + # Include paths: + {includes_text} + {pythonflags.includes} # Include path for Python headers. + + # Code generation: + {optimise2} + {debug2} + {permissive} # Set standard-conformance mode. + + # Diagnostics: + #/FC # Display full path of source code files passed to cl.exe in diagnostic text. + /W3 # Sets which warning level to output. /W3 is IDE default. + /diagnostics:caret # Controls the format of diagnostic messages. + /nologo # + + {defines_text} + {compiler_extra} + + {py_limited_api3} + ''' + run_if( command, path_obj, path_cpp, prerequisites_compile) + + command, pythonflags = base_linker(cpp=cpp) + debug2 = '/DEBUG' if debug else '' + base, _ = os.path.splitext(path_so_leaf) + command = f''' + {command} + /DLL # Builds a DLL. + /EXPORT:PyInit__{name} # Exports a function. + /IMPLIB:{base}.lib # Overrides the default import library name. + {libpaths_text} + {pythonflags.ldflags} + /OUT:{path_so} # Specifies the output file name. + {debug2} + /nologo + {libs_text} + {path_obj} + {linker_extra} + ''' + run_if( command, path_so, path_obj, prerequisites_link) + + else: + + # Not Windows. + # + command, pythonflags = base_compiler(cpp=cpp) + + # setuptools on Linux seems to use slightly different compile flags: + # + # -fwrapv -O3 -Wall -O2 -g0 -DPY_CALL_TRAMPOLINE + # + + general_flags = '' + if debug: + general_flags += ' -g' + if optimise: + general_flags += ' -O2 -DNDEBUG' + + py_limited_api3 = f'-DPy_LIMITED_API={py_limited_api2}' if py_limited_api2 else '' + + if darwin(): + # MacOS's linker does not like `-z origin`. 
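+            # `@loader_path` is MacOS's analogue of `$ORIGIN`: the runtime
+            # search path is resolved relative to the binary doing the loading.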
+ rpath_flag = "-Wl,-rpath,@loader_path/" + + # Avoid `Undefined symbols for ... "_PyArg_UnpackTuple" ...'. + general_flags += ' -undefined dynamic_lookup' + elif pyodide(): + # Setting `-Wl,-rpath,'$ORIGIN',-z,origin` gives: + # emcc: warning: ignoring unsupported linker flag: `-rpath` [-Wlinkflags] + # wasm-ld: error: unknown -z value: origin + # + log0(f'pyodide: PEP-3149 suffix untested, so omitting. {_so_suffix()=}.') + path_so_leaf = f'_{name}.so' + path_so = f'{outdir}/{path_so_leaf}' + + rpath_flag = '' + else: + rpath_flag = "-Wl,-rpath,'$ORIGIN',-z,origin" + path_so = f'{outdir}/{path_so_leaf}' + # Fun fact - on Linux, if the -L and -l options are before '{path_cpp}' + # they seem to be ignored... + # + prerequisites = list() + + if pyodide(): + # Looks like pyodide's `cc` can't compile and link in one invocation. + prerequisites_compile_path = f'{path_cpp}.o.d' + prerequisites += _get_prerequisites( prerequisites_compile_path) + command = f''' + {command} + -fPIC + {general_flags.strip()} + {pythonflags.includes} + {includes_text} + {defines_text} + -MD -MF {prerequisites_compile_path} + -c {path_cpp} + -o {path_cpp}.o + {compiler_extra} + {py_limited_api3} + ''' + prerequisites_link_path = f'{path_cpp}.o.d' + prerequisites += _get_prerequisites( prerequisites_link_path) + ld, _ = base_linker(cpp=cpp) + command += f''' + && {ld} + {path_cpp}.o + -o {path_so} + -MD -MF {prerequisites_link_path} + {rpath_flag} + {libpaths_text} + {libs_text} + {linker_extra} + {pythonflags.ldflags} + ''' + else: + # We use compiler to compile and link in one command. + prerequisites_path = f'{path_so}.d' + prerequisites = _get_prerequisites(prerequisites_path) + + command = f''' + {command} + -fPIC + -shared + {general_flags.strip()} + {pythonflags.includes} + {includes_text} + {defines_text} + {path_cpp} + -MD -MF {prerequisites_path} + -o {path_so} + {compiler_extra} + {libpaths_text} + {linker_extra} + {pythonflags.ldflags} + {libs_text} + {rpath_flag} + {py_limited_api3} + ''' + command_was_run = run_if( + command, + path_so, + path_cpp, + prerequisites_compile, + prerequisites_link, + prerequisites, + ) + + if command_was_run and darwin(): + # We need to patch up references to shared libraries in `libs`. + sublibraries = list() + for lib in () if libs is None else libs: + for libpath in libpaths: + found = list() + for suffix in '.so', '.dylib': + path = f'{libpath}/lib{os.path.basename(lib)}{suffix}' + if os.path.exists( path): + found.append( path) + if found: + assert len(found) == 1, f'More than one file matches lib={lib!r}: {found}' + sublibraries.append( found[0]) + break + else: + log2(f'Warning: can not find path of lib={lib!r} in libpaths={libpaths}') + macos_patch( path_so, *sublibraries) + + #run(f'ls -l {path_so}', check=0) + #run(f'file {path_so}', check=0) + + return path_so_leaf + + +# Functions that might be useful. +# + + +def base_compiler(vs=None, pythonflags=None, cpp=False, use_env=True): + ''' + Returns basic compiler command and PythonFlags. + + Args: + vs: + Windows only. A `wdev.WindowsVS` instance or None to use default + `wdev.WindowsVS` instance. + pythonflags: + A `pipcl.PythonFlags` instance or None to use default + `pipcl.PythonFlags` instance. + cpp: + If true we return C++ compiler command instead of C. On Windows + this has no effect - we always return `cl.exe`. + use_env: + If true we return '$CC' or '$CXX' if the corresponding + environmental variable is set (without evaluating with `getenv()` + or `os.environ`). 
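+
+    For illustration: on Linux with neither $CC nor $CXX set,
+    `base_compiler(cpp=True)` returns `('c++', pythonflags)`, where
+    `pythonflags` is a new `pipcl.PythonFlags` instance.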
+ + Returns `(cc, pythonflags)`: + cc: + C or C++ command. On Windows this is of the form + `{vs.vcvars}&&{vs.cl}`; otherwise it is typically `cc` or `c++`. + pythonflags: + The `pythonflags` arg or a new `pipcl.PythonFlags` instance. + ''' + if not pythonflags: + pythonflags = PythonFlags() + cc = None + if use_env: + if cpp: + if os.environ.get( 'CXX'): + cc = '$CXX' + else: + if os.environ.get( 'CC'): + cc = '$CC' + if cc: + pass + elif windows(): + if not vs: + vs = wdev.WindowsVS() + cc = f'"{vs.vcvars}"&&"{vs.cl}"' + elif wasm(): + cc = 'em++' if cpp else 'emcc' + else: + cc = 'c++' if cpp else 'cc' + cc = macos_add_cross_flags( cc) + return cc, pythonflags + + +def base_linker(vs=None, pythonflags=None, cpp=False, use_env=True): + ''' + Returns basic linker command. + + Args: + vs: + Windows only. A `wdev.WindowsVS` instance or None to use default + `wdev.WindowsVS` instance. + pythonflags: + A `pipcl.PythonFlags` instance or None to use default + `pipcl.PythonFlags` instance. + cpp: + If true we return C++ linker command instead of C. On Windows this + has no effect - we always return `link.exe`. + use_env: + If true we use `os.environ['LD']` if set. + + Returns `(linker, pythonflags)`: + linker: + Linker command. On Windows this is of the form + `{vs.vcvars}&&{vs.link}`; otherwise it is typically `cc` or `c++`. + pythonflags: + The `pythonflags` arg or a new `pipcl.PythonFlags` instance. + ''' + if not pythonflags: + pythonflags = PythonFlags() + linker = None + if use_env: + if os.environ.get( 'LD'): + linker = '$LD' + if linker: + pass + elif windows(): + if not vs: + vs = wdev.WindowsVS() + linker = f'"{vs.vcvars}"&&"{vs.link}"' + elif wasm(): + linker = 'em++' if cpp else 'emcc' + else: + linker = 'c++' if cpp else 'cc' + linker = macos_add_cross_flags( linker) + return linker, pythonflags + + +def git_info( directory): + ''' + Returns `(sha, comment, diff, branch)`, all items are str or None if not + available. + + directory: + Root of git checkout. + ''' + sha, comment, diff, branch = None, None, None, None + e, out = run( + f'cd {directory} && (PAGER= git show --pretty=oneline|head -n 1 && git diff)', + capture=1, + check=0 + ) + if not e: + sha, _ = out.split(' ', 1) + comment, diff = _.split('\n', 1) + e, out = run( + f'cd {directory} && git rev-parse --abbrev-ref HEAD', + capture=1, + check=0 + ) + if not e: + branch = out.strip() + log(f'git_info(): directory={directory!r} returning branch={branch!r} sha={sha!r} comment={comment!r}') + return sha, comment, diff, branch + + +def git_items( directory, submodules=False): + ''' + Returns list of paths for all files known to git within a `directory`. + + Args: + directory: + Must be somewhere within a git checkout. + submodules: + If true we also include git submodules. + + Returns: + A list of paths for all files known to git within `directory`. Each + path is relative to `directory`. `directory` must be somewhere within a + git checkout. + + We run a `git ls-files` command internally. + + This function can be useful for the `fn_sdist()` callback. + ''' + command = 'cd ' + directory + ' && git ls-files' + if submodules: + command += ' --recurse-submodules' + log1(f'Running {command=}') + text = subprocess.check_output( command, shell=True) + ret = [] + for path in text.decode('utf8').strip().split( '\n'): + path2 = os.path.join(directory, path) + # Sometimes git ls-files seems to list empty/non-existent directories + # within submodules. 
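+        # (We therefore skip any listed path that does not exist or is a
+        # directory.)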
+ # + if not os.path.exists(path2): + log2(f'Ignoring git ls-files item that does not exist: {path2}') + elif os.path.isdir(path2): + log2(f'Ignoring git ls-files item that is actually a directory: {path2}') + else: + ret.append(path) + return ret + + +def git_get( + remote, + local, + *, + branch=None, + depth=1, + env_extra=None, + tag=None, + update=True, + submodules=True, + default_remote=None, + ): + ''' + Ensures that is a git checkout (at either , or HEAD) + of a remote repository. + + Exactly one of and must be specified, or must start + with 'git:' and match the syntax described below. + + Args: + remote: + Remote git repostitory, for example + 'https://github.com/ArtifexSoftware/mupdf.git'. + + If starts with 'git:', the remaining text should be a command-line + style string containing some or all of these args: + --branch + --tag + + These overrides , and . + + For example these all clone/update/branch master of https://foo.bar/qwerty.git to local + checkout 'foo-local': + + git_get('https://foo.bar/qwerty.git', 'foo-local', branch='master') + git_get('git:--branch master https://foo.bar/qwerty.git', 'foo-local') + git_get('git:--branch master', 'foo-local', default_remote='https://foo.bar/qwerty.git') + git_get('git:', 'foo-local', branch='master', default_remote='https://foo.bar/qwerty.git') + + local: + Local directory. If /.git exists, we attempt to run `git + update` in it. + branch: + Branch to use. Is used as default if remote starts with 'git:'. + depth: + Depth of local checkout when cloning and fetching, or None. + env_extra: + Dict of extra name=value environment variables to use whenever we + run git. + tag: + Tag to use. Is used as default if remote starts with 'git:'. + update: + If false we do not update existing repository. Might be useful if + testing without network access. + submodules: + If true, we clone with `--recursive --shallow-submodules` and run + `git submodule update --init --recursive` before returning. + default_remote: + The remote URL if starts with 'git:' but does not specify + the remote URL. + ''' + log0(f'{remote=} {local=} {branch=} {tag=}') + if remote.startswith('git:'): + remote0 = remote + args = iter(shlex.split(remote0[len('git:'):])) + remote = default_remote + while 1: + try: + arg = next(args) + except StopIteration: + break + if arg == '--branch': + branch = next(args) + tag = None + elif arg == '--tag': + tag == next(args) + branch = None + else: + remote = arg + assert remote, f'{default_remote=} and no remote specified in remote={remote0!r}.' + assert branch or tag, f'{branch=} {tag=} and no branch/tag specified in remote={remote0!r}.' + + assert (branch and not tag) or (not branch and tag), f'Must specify exactly one of and .' + + depth_arg = f' --depth {depth}' if depth else '' + + def do_update(): + # This seems to pull in the entire repository. + log0(f'do_update(): attempting to update {local=}.') + # Remove any local changes. + run(f'cd {local} && git checkout .', env_extra=env_extra) + if tag: + # `-u` avoids `fatal: Refusing to fetch into current branch`. 
+ # Using '+' and `revs/tags/` prefix seems to avoid errors like: + # error: cannot update ref 'refs/heads/v3.16.44': + # trying to write non-commit object + # 06c4ae5fe39a03b37a25a8b95214d9f8f8a867b8 to branch + # 'refs/heads/v3.16.44' + # + run(f'cd {local} && git fetch -fuv{depth_arg} {remote} +refs/tags/{tag}:refs/tags/{tag}', env_extra=env_extra) + run(f'cd {local} && git checkout {tag}', env_extra=env_extra) + if branch: + # `-u` avoids `fatal: Refusing to fetch into current branch`. + run(f'cd {local} && git fetch -fuv{depth_arg} {remote} {branch}:{branch}', env_extra=env_extra) + run(f'cd {local} && git checkout {branch}', env_extra=env_extra) + + do_clone = True + if os.path.isdir(f'{local}/.git'): + if update: + # Try to update existing checkout. + try: + do_update() + do_clone = False + except Exception as e: + log0(f'Failed to update existing checkout {local}: {e}') + else: + do_clone = False + + if do_clone: + # No existing git checkout, so do a fresh clone. + #_fs_remove(local) + log0(f'Cloning to: {local}') + command = f'git clone --config core.longpaths=true{depth_arg}' + if submodules: + command += f' --recursive --shallow-submodules' + if branch: + command += f' -b {branch}' + if tag: + command += f' -b {tag}' + command += f' {remote} {local}' + run(command, env_extra=env_extra) + do_update() + + if submodules: + run(f'cd {local} && git submodule update --init --recursive', env_extra=env_extra) + + # Show sha of checkout. + run( f'cd {local} && git show --pretty=oneline|head -n 1', check=False) + + +def run( + command, + *, + capture=False, + check=1, + verbose=1, + env=None, + env_extra=None, + timeout=None, + caller=1, + prefix=None, + ): + ''' + Runs a command using `subprocess.run()`. + + Args: + command: + A string, the command to run. + + Multiple lines in `command` are treated as a single command. + + * If a line starts with `#` it is discarded. + * If a line contains ` #`, the trailing text is discarded. + + When running the command on Windows, newlines are replaced by + spaces; otherwise each line is terminated by a backslash character. + capture: + If true, we include the command's output in our return value. + check: + If true we raise an exception on error; otherwise we include the + command's returncode in our return value. + verbose: + If true we show the command. + env: + None or dict to use instead of . + env_extra: + None or dict to add to or . + timeout: + If not None, timeout in seconds; passed directly to + subprocess.run(). Note that on MacOS subprocess.run() seems to + leave processes running if timeout expires. + prefix: + String prefix for each line of output. + + If true: + * We run command with stdout=subprocess.PIPE and + stderr=subprocess.STDOUT, repetaedly reading the command's output + and writing it to stdout with . + * We do not support , which must be None. 
+ Returns: + check capture Return + -------------------------- + false false returncode + false true (returncode, output) + true false None or raise exception + true true output or raise exception + ''' + if env is None: + env = os.environ + if env_extra: + env = env.copy() + if env_extra: + env.update(env_extra) + lines = _command_lines( command) + if verbose: + text = f'Running:' + if env_extra: + for k in sorted(env_extra.keys()): + text += f' {k}={shlex.quote(env_extra[k])}' + nl = '\n' + text += f' {nl.join(lines)}' + log1(text, caller=caller+1) + sep = ' ' if windows() else ' \\\n' + command2 = sep.join( lines) + + if prefix: + assert not timeout, f'Timeout not supported with prefix.' + child = subprocess.Popen( + command2, + shell=True, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + encoding='utf8', + env=env, + ) + if capture: + capture_text = '' + decoder = codecs.getincrementaldecoder('utf8')('replace') + line_start = True + while 1: + raw = os.read( child.stdout.fileno(), 10000) + text = decoder.decode(raw, final=not raw) + if text: + if capture: + capture_text += text + lines = text.split('\n') + for i, line in enumerate(lines): + if line_start: + sys.stdout.write(prefix) + line_start = False + sys.stdout.write(line) + if i < len(lines) - 1: + sys.stdout.write('\n') + line_start = True + sys.stdout.flush() + if not raw: + break + if not line_start: + sys.stdout.write('\n') + e = child.wait() + if check and e: + raise subprocess.CalledProcessError(e, command2, capture_text if capture else None) + if check: + return capture_text if capture else None + else: + return (e, capture_text) if capture else e + else: + cp = subprocess.run( + command2, + shell=True, + stdout=subprocess.PIPE if capture else None, + stderr=subprocess.STDOUT if capture else None, + check=check, + encoding='utf8', + env=env, + timeout=timeout, + ) + if check: + return cp.stdout if capture else None + else: + return (cp.returncode, cp.stdout) if capture else cp.returncode + + +def darwin(): + return sys.platform.startswith( 'darwin') + +def windows(): + return platform.system() == 'Windows' + +def wasm(): + return os.environ.get( 'OS') in ('wasm', 'wasm-mt') + +def pyodide(): + return os.environ.get( 'PYODIDE') == '1' + +def linux(): + return platform.system() == 'Linux' + +def openbsd(): + return platform.system() == 'OpenBSD' + + +def show_system(): + ''' + Show useful information about the system plus argv and environ. + ''' + def log(text): + log0(text, caller=3) + + #log(f'{__file__=}') + #log(f'{__name__=}') + log(f'{os.getcwd()=}') + log(f'{platform.machine()=}') + log(f'{platform.platform()=}') + log(f'{platform.python_implementation()=}') + log(f'{platform.python_version()=}') + log(f'{platform.system()=}') + if sys.implementation.name != 'graalpy': + log(f'{platform.uname()=}') + log(f'{sys.executable=}') + log(f'{sys.version=}') + log(f'{sys.version_info=}') + log(f'{list(sys.version_info)=}') + + log(f'CPU bits: {cpu_bits()}') + + log(f'sys.argv ({len(sys.argv)}):') + for i, arg in enumerate(sys.argv): + log(f' {i}: {arg!r}') + + log(f'os.environ ({len(os.environ)}):') + for k in sorted( os.environ.keys()): + v = os.environ[ k] + log( f' {k}: {v!r}') + + +class PythonFlags: + ''' + Compile/link flags for the current python, for example the include path + needed to get `Python.h`. + + The 'PIPCL_PYTHON_CONFIG' environment variable allows to override + the location of the python-config executable. + + Members: + .includes: + String containing compiler flags for include paths. 
+ .ldflags: + String containing linker flags for library paths. + ''' + def __init__(self): + + # Experimental detection of python flags from sysconfig.*() instead of + # python-config command. + includes_, ldflags_ = sysconfig_python_flags() + + if pyodide(): + _include_dir = os.environ[ 'PYO3_CROSS_INCLUDE_DIR'] + _lib_dir = os.environ[ 'PYO3_CROSS_LIB_DIR'] + self.includes = f'-I {_include_dir}' + self.ldflags = f'-L {_lib_dir}' + + elif 0: + + self.includes = includes_ + self.ldflags = ldflags_ + + elif windows(): + wp = wdev.WindowsPython() + self.includes = f'/I"{wp.include}"' + self.ldflags = f'/LIBPATH:"{wp.libs}"' + + elif pyodide(): + _include_dir = os.environ[ 'PYO3_CROSS_INCLUDE_DIR'] + _lib_dir = os.environ[ 'PYO3_CROSS_LIB_DIR'] + self.includes = f'-I {_include_dir}' + self.ldflags = f'-L {_lib_dir}' + + else: + python_config = os.environ.get("PIPCL_PYTHON_CONFIG") + if not python_config: + # We use python-config which appears to work better than pkg-config + # because it copes with multiple installed python's, e.g. + # manylinux_2014's /opt/python/cp*-cp*/bin/python*. + # + # But... on non-macos it seems that we should not attempt to specify + # libpython on the link command. The manylinux docker containers + # don't actually contain libpython.so, and it seems that this + # deliberate. And the link command runs ok. + # + python_exe = os.path.realpath( sys.executable) + if darwin(): + # Basic install of dev tools with `xcode-select --install` doesn't + # seem to provide a `python3-config` or similar, but there is a + # `python-config.py` accessible via sysconfig. + # + # We try different possibilities and use the last one that + # works. + # + python_config = None + for pc in ( + f'python3-config', + f'{sys.executable} {sysconfig.get_config_var("srcdir")}/python-config.py', + f'{python_exe}-config', + ): + e = subprocess.run( + f'{pc} --includes', + shell=1, + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + check=0, + ).returncode + log2(f'{e=} from {pc!r}.') + if e == 0: + python_config = pc + assert python_config, f'Cannot find python-config' + else: + python_config = f'{python_exe}-config' + log2(f'Using {python_config=}.') + try: + self.includes = run( f'{python_config} --includes', capture=1, verbose=0).strip() + except Exception as e: + raise Exception('We require python development tools to be installed.') from e + self.ldflags = run( f'{python_config} --ldflags', capture=1, verbose=0).strip() + if linux(): + # It seems that with python-3.10 on Linux, we can get an + # incorrect -lcrypt flag that on some systems (e.g. WSL) + # causes: + # + # ImportError: libcrypt.so.2: cannot open shared object file: No such file or directory + # + ldflags2 = self.ldflags.replace(' -lcrypt ', ' ') + if ldflags2 != self.ldflags: + log2(f'### Have removed `-lcrypt` from ldflags: {self.ldflags!r} -> {ldflags2!r}') + self.ldflags = ldflags2 + + log1(f'{self.includes=}') + log1(f' {includes_=}') + log1(f'{self.ldflags=}') + log1(f' {ldflags_=}') + + +def macos_add_cross_flags(command): + ''' + If running on MacOS and environment variables ARCHFLAGS is set + (indicating we are cross-building, e.g. for arm64), returns + `command` with extra flags appended. Otherwise returns unchanged + `command`. 
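+
+    For example, if ARCHFLAGS is '-arch arm64' on MacOS, 'cc' becomes
+    'cc -arch arm64'; on other platforms the command is returned unchanged.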
+ ''' + if darwin(): + archflags = os.environ.get( 'ARCHFLAGS') + if archflags: + command = f'{command} {archflags}' + log2(f'Appending ARCHFLAGS to command: {command}') + return command + return command + + +def macos_patch( library, *sublibraries): + ''' + If running on MacOS, patches `library` so that all references to items in + `sublibraries` are changed to `@rpath/{leafname}`. Does nothing on other + platforms. + + library: + Path of shared library. + sublibraries: + List of paths of shared libraries; these have typically been + specified with `-l` when `library` was created. + ''' + log2( f'macos_patch(): library={library} sublibraries={sublibraries}') + if not darwin(): + return + if not sublibraries: + return + subprocess.run( f'otool -L {library}', shell=1, check=1) + command = 'install_name_tool' + names = [] + for sublibrary in sublibraries: + name = subprocess.run( + f'otool -D {sublibrary}', + shell=1, + check=1, + capture_output=1, + encoding='utf8', + ).stdout.strip() + name = name.split('\n') + assert len(name) == 2 and name[0] == f'{sublibrary}:', f'{name=}' + name = name[1] + # strip trailing so_name. + leaf = os.path.basename(name) + m = re.match('^(.+[.]((so)|(dylib)))[0-9.]*$', leaf) + assert m + log2(f'Changing {leaf=} to {m.group(1)}') + leaf = m.group(1) + command += f' -change {name} @rpath/{leaf}' + command += f' {library}' + log2( f'Running: {command}') + subprocess.run( command, shell=1, check=1) + subprocess.run( f'otool -L {library}', shell=1, check=1) + + +# Internal helpers. +# + +def _command_lines( command): + ''' + Process multiline command by running through `textwrap.dedent()`, removes + comments (lines starting with `#` or ` #` until end of line), removes + entirely blank lines. + + Returns list of lines. + ''' + command = textwrap.dedent( command) + lines = [] + for line in command.split( '\n'): + if line.startswith( '#'): + h = 0 + else: + h = line.find( ' #') + if h >= 0: + line = line[:h] + if line.strip(): + lines.append(line.rstrip()) + return lines + + +def cpu_bits(): + return int.bit_length(sys.maxsize+1) + + +def _cpu_name(): + ''' + Returns `x32` or `x64` depending on Python build. + ''' + #log(f'sys.maxsize={hex(sys.maxsize)}') + return f'x{32 if sys.maxsize == 2**31 - 1 else 64}' + + +def run_if( command, out, *prerequisites): + ''' + Runs a command only if the output file is not up to date. + + Args: + command: + The command to run. We write this into a file .cmd so that we + know to run a command if the command itself has changed. + out: + Path of the output file. + + prerequisites: + List of prerequisite paths or true/false/None items. If an item + is None it is ignored, otherwise if an item is not a string we + immediately return it cast to a bool. + + Returns: + True if we ran the command, otherwise None. + + + If the output file does not exist, the command is run: + + >>> verbose(1) + 1 + >>> log_line_numbers(0) + >>> out = 'run_if_test_out' + >>> if os.path.exists( out): + ... os.remove( out) + >>> if os.path.exists( f'{out}.cmd'): + ... 
os.remove( f'{out}.cmd') + >>> run_if( f'touch {out}', out) + pipcl.py:run_if(): Running command because: File does not exist: 'run_if_test_out' + pipcl.py:run_if(): Running: touch run_if_test_out + True + + If we repeat, the output file will be up to date so the command is not run: + + >>> run_if( f'touch {out}', out) + pipcl.py:run_if(): Not running command because up to date: 'run_if_test_out' + + If we change the command, the command is run: + + >>> run_if( f'touch {out}', out) + pipcl.py:run_if(): Running command because: Command has changed + pipcl.py:run_if(): Running: touch run_if_test_out + True + + If we add a prerequisite that is newer than the output, the command is run: + + >>> time.sleep(1) + >>> prerequisite = 'run_if_test_prerequisite' + >>> run( f'touch {prerequisite}', caller=0) + pipcl.py:run(): Running: touch run_if_test_prerequisite + >>> run_if( f'touch {out}', out, prerequisite) + pipcl.py:run_if(): Running command because: Prerequisite is new: 'run_if_test_prerequisite' + pipcl.py:run_if(): Running: touch run_if_test_out + True + + If we repeat, the output will be newer than the prerequisite, so the + command is not run: + + >>> run_if( f'touch {out}', out, prerequisite) + pipcl.py:run_if(): Not running command because up to date: 'run_if_test_out' + ''' + doit = False + cmd_path = f'{out}.cmd' + + if not doit: + out_mtime = _fs_mtime( out) + if out_mtime == 0: + doit = f'File does not exist: {out!r}' + + if not doit: + if os.path.isfile( cmd_path): + with open( cmd_path) as f: + cmd = f.read() + else: + cmd = None + if command != cmd: + if cmd is None: + doit = 'No previous command stored' + else: + doit = f'Command has changed' + if 0: + doit += f': {cmd!r} => {command!r}' + + if not doit: + # See whether any prerequisites are newer than target. + def _make_prerequisites(p): + if isinstance( p, (list, tuple)): + return list(p) + else: + return [p] + prerequisites_all = list() + for p in prerequisites: + prerequisites_all += _make_prerequisites( p) + if 0: + log2( 'prerequisites_all:') + for i in prerequisites_all: + log2( f' {i!r}') + pre_mtime = 0 + pre_path = None + for prerequisite in prerequisites_all: + if isinstance( prerequisite, str): + mtime = _fs_mtime_newest( prerequisite) + if mtime >= pre_mtime: + pre_mtime = mtime + pre_path = prerequisite + elif prerequisite is None: + pass + elif prerequisite: + doit = str(prerequisite) + break + if not doit: + if pre_mtime > out_mtime: + doit = f'Prerequisite is new: {pre_path!r}' + + if doit: + # Remove `cmd_path` before we run the command, so any failure + # will force rerun next time. + # + try: + os.remove( cmd_path) + except Exception: + pass + log1( f'Running command because: {doit}') + + run( command) + + # Write the command we ran, into `cmd_path`. + with open( cmd_path, 'w') as f: + f.write( command) + return True + else: + log1( f'Not running command because up to date: {out!r}') + + if 0: + log2( f'out_mtime={time.ctime(out_mtime)} pre_mtime={time.ctime(pre_mtime)}.' + f' pre_path={pre_path!r}: returning {ret!r}.' + ) + + +def _get_prerequisites(path): + ''' + Returns list of prerequisites from Makefile-style dependency file, e.g. + created by `cc -MD -MF `. + ''' + ret = list() + if os.path.isfile(path): + with open(path) as f: + for line in f: + for item in line.split(): + if item.endswith( (':', '\\')): + continue + ret.append( item) + return ret + + +def _fs_mtime_newest( path): + ''' + path: + If a file, returns mtime of the file. 
If a directory, returns mtime of + newest file anywhere within directory tree. Otherwise returns 0. + ''' + ret = 0 + if os.path.isdir( path): + for dirpath, dirnames, filenames in os.walk( path): + for filename in filenames: + path = os.path.join( dirpath, filename) + ret = max( ret, _fs_mtime( path)) + else: + ret = _fs_mtime( path) + return ret + + +def _flags( items, prefix='', quote=''): + ''' + Turns sequence into string, prefixing/quoting each item. + ''' + if not items: + return '' + if isinstance( items, str): + items = items, + ret = '' + for item in items: + if ret: + ret += ' ' + ret += f'{prefix}{quote}{item}{quote}' + return ret.strip() + + +def _fs_mtime( filename, default=0): + ''' + Returns mtime of file, or `default` if error - e.g. doesn't exist. + ''' + try: + return os.path.getmtime( filename) + except OSError: + return default + + +def _normalise(name): + # https://packaging.python.org/en/latest/specifications/name-normalization/#name-normalization + return re.sub(r"[-_.]+", "-", name).lower() + + +def _assert_version_pep_440(version): + assert re.match( + r'^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?(\.dev(0|[1-9][0-9]*))?$', + version, + ), \ + f'Bad version: {version!r}.' + + +g_verbose = int(os.environ.get('PIPCL_VERBOSE', '1')) + +def verbose(level=None): + ''' + Sets verbose level if `level` is not None. + Returns verbose level. + ''' + global g_verbose + if level is not None: + g_verbose = level + return g_verbose + +g_log_line_numbers = True + +def log_line_numbers(yes): + ''' + Sets whether to include line numbers; helps with doctest. + ''' + global g_log_line_numbers + g_log_line_numbers = bool(yes) + +def log0(text='', caller=1): + _log(text, 0, caller+1) + +def log1(text='', caller=1): + _log(text, 1, caller+1) + +def log2(text='', caller=1): + _log(text, 2, caller+1) + +def _log(text, level, caller): + ''' + Logs lines with prefix, if is lower than . + ''' + if level <= g_verbose: + fr = inspect.stack(context=0)[caller] + filename = relpath(fr.filename) + for line in text.split('\n'): + if g_log_line_numbers: + print(f'{filename}:{fr.lineno}:{fr.function}(): {line}', file=sys.stdout, flush=1) + else: + print(f'{filename}:{fr.function}(): {line}', file=sys.stdout, flush=1) + + +def relpath(path, start=None): + ''' + A safe alternative to os.path.relpath(), avoiding an exception on Windows + if the drive needs to change - in this case we use os.path.abspath(). + ''' + if windows(): + try: + return os.path.relpath(path, start) + except ValueError: + # os.path.relpath() fails if trying to change drives. + return os.path.abspath(path) + else: + return os.path.relpath(path, start) + + +def _so_suffix(use_so_versioning=True): + ''' + Filename suffix for shared libraries is defined in pep-3149. The + pep claims to only address posix systems, but the recommended + sysconfig.get_config_var('EXT_SUFFIX') also seems to give the + right string on Windows. + + If use_so_versioning is false, we return only the last component of + the suffix, which removes any version number, for example changing + `.cp312-win_amd64.pyd` to `.pyd`. + ''' + # Example values: + # linux: .cpython-311-x86_64-linux-gnu.so + # macos: .cpython-311-darwin.so + # openbsd: .cpython-310.so + # windows .cp311-win_amd64.pyd + # + # Only Linux and Windows seem to identify the cpu. For example shared + # libraries in numpy-1.25.2-cp311-cp311-macosx_11_0_arm64.whl are called + # things like `numpy/core/_simd.cpython-311-darwin.so`. 
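+    # build_extension() passes use_so_versioning=False when building for the
+    # Limited API, so such modules get just `.so` or `.pyd`.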
+ # + ret = sysconfig.get_config_var('EXT_SUFFIX') + if not use_so_versioning: + # Use last component only. + ret = os.path.splitext(ret)[1] + return ret + + +def get_soname(path): + ''' + If we are on Linux and `path` is softlink and points to a shared library + for which `objdump -p` contains 'SONAME', return the pointee. Otherwise + return `path`. Useful if Linux shared libraries have been created with + `-Wl,-soname,...`, where we need to embed the versioned library. + ''' + if linux() and os.path.islink(path): + path2 = os.path.realpath(path) + if subprocess.run(f'objdump -p {path2}|grep SONAME', shell=1, check=0).returncode == 0: + return path2 + elif openbsd(): + # Return newest .so with version suffix. + sos = glob.glob(f'{path}.*') + log1(f'{sos=}') + sos2 = list() + for so in sos: + suffix = so[len(path):] + if not suffix or re.match('^[.][0-9.]*[0-9]$', suffix): + sos2.append(so) + sos2.sort(key=lambda p: os.path.getmtime(p)) + log1(f'{sos2=}') + return sos2[-1] + return path + + +def current_py_limited_api(): + ''' + Returns value of PyLIMITED_API to build for current Python. + ''' + a, b = map(int, platform.python_version().split('.')[:2]) + return f'0x{a:02x}{b:02x}0000' + + +def install_dir(root=None): + ''' + Returns install directory used by `install()`. + + This will be `sysconfig.get_path('platlib')`, modified by `root` if not + None. + ''' + # todo: for pure-python we should use sysconfig.get_path('purelib') ? + root2 = sysconfig.get_path('platlib') + if root: + if windows(): + # If we are in a venv, `sysconfig.get_path('platlib')` + # can be absolute, e.g. + # `C:\\...\\venv-pypackage-3.11.1-64\\Lib\\site-packages`, so it's + # not clear how to append it to `root`. So we just use `root`. + return root + else: + # E.g. if `root` is `install' and `sysconfig.get_path('platlib')` + # is `/usr/local/lib/python3.9/site-packages`, we set `root2` to + # `install/usr/local/lib/python3.9/site-packages`. + # + return os.path.join( root, root2.lstrip( os.sep)) + else: + return root2 + + +class _Record: + ''' + Internal - builds up text suitable for writing to a RECORD item, e.g. + within a wheel. + ''' + def __init__(self): + self.text = '' + + def add_content(self, content, to_, verbose=True): + if isinstance(content, str): + content = content.encode('utf8') + + # Specification for the line we write is supposed to be in + # https://packaging.python.org/en/latest/specifications/binary-distribution-format + # but it's not very clear. + # + h = hashlib.sha256(content) + digest = h.digest() + digest = base64.urlsafe_b64encode(digest) + digest = digest.rstrip(b'=') + digest = digest.decode('utf8') + + self.text += f'{to_},sha256={digest},{len(content)}\n' + if verbose: + log2(f'Adding {to_}') + + def add_file(self, from_, to_): + log1(f'Adding file: {os.path.relpath(from_)} => {to_}') + with open(from_, 'rb') as f: + content = f.read() + self.add_content(content, to_, verbose=False) + + def get(self, record_path=None): + ''' + Returns contents of the RECORD file. If `record_path` is + specified we append a final line `,,`; this can be + used to include the RECORD file itself in the contents, with + empty hash and size fields. + ''' + ret = self.text + if record_path: + ret += f'{record_path},,\n' + return ret + + +class NewFiles: + ''' + Detects new/modified/updated files matching a glob pattern. Useful for + detecting wheels created by pip or cubuildwheel etc. + ''' + def __init__(self, glob_pattern): + # Find current matches of . 
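+        # (get() later reports paths whose (mtime, hash) id differs from this
+        # snapshot.)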
+ self.glob_pattern = glob_pattern + self.items0 = self._items() + def get(self): + ''' + Returns list of new matches of - paths of files that + were not present previously, or have different mtimes or have different + contents. + ''' + ret = list() + items = self._items() + for path, id_ in items.items(): + id0 = self.items0.get(path) + if id0 != id_: + #mtime0, hash0 = id0 + #mtime1, hash1 = id_ + #log0(f'New/modified file {path=}.') + #log0(f' {mtime0=} {"==" if mtime0==mtime1 else "!="} {mtime1=}.') + #log0(f' {hash0=} {"==" if hash0==hash1 else "!="} {hash1=}.') + ret.append(path) + return ret + def get_one(self): + ''' + Returns new match of , asserting that there is exactly + one. + ''' + ret = self.get() + assert len(ret) == 1, f'{len(ret)=}' + return ret[0] + def _file_id(self, path): + mtime = os.stat(path).st_mtime + with open(path, 'rb') as f: + content = f.read() + hash_ = hashlib.md5(content).digest() + # With python >= 3.11 we can do: + #hash_ = hashlib.file_digest(f, hashlib.md5).digest() + return mtime, hash_ + def _items(self): + ret = dict() + for path in glob.glob(self.glob_pattern): + if os.path.isfile(path): + ret[path] = self._file_id(path) + return ret + + +def swig_get(swig, quick, swig_local='pipcl-swig-git'): + ''' + Returns or a new swig binary. + + If is true and starts with 'git:' (not Windows), the remaining text + is passed to git_get() and we clone/update/build swig, and return the built + binary. We default to the main swig repository, branch master, so for + example 'git:' will return the latest swig from branch master. + + Otherwise we simply return . + + Args: + swig: + If starts with 'git:', passed as arg to git_remote(). + quick: + If true, we do not update/build local checkout if the binary is + already present. + swig_local: + path to use for checkout. + ''' + if swig and swig.startswith('git:'): + assert platform.system() != 'Windows' + swig_local = os.path.abspath(swig_local) + # Note that {swig_local}/install/bin/swig doesn't work on MacoS because + # {swig_local}/INSTALL is a file and the fs is case-insensitive. + swig_binary = f'{swig_local}/install-dir/bin/swig' + if quick and os.path.isfile(swig_binary): + log1(f'{quick=} and {swig_binary=} already exists, so not downloading/building.') + else: + # Clone swig. + swig_env_extra = None + git_get( + swig, + swig_local, + default_remote='https://github.com/swig/swig.git', + branch='master', + ) + if darwin(): + run(f'brew install automake') + run(f'brew install pcre2') + # Default bison doesn't work, and Brew's bison is not added to $PATH. + # + # > bison is keg-only, which means it was not symlinked into /opt/homebrew, + # > because macOS already provides this software and installing another version in + # > parallel can cause all kinds of trouble. + # > + # > If you need to have bison first in your PATH, run: + # > echo 'export PATH="/opt/homebrew/opt/bison/bin:$PATH"' >> ~/.zshrc + # + run(f'brew install bison') + PATH = os.environ['PATH'] + PATH = f'/opt/homebrew/opt/bison/bin:{PATH}' + swig_env_extra = dict(PATH=PATH) + # Build swig. 
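+            # Standard autotools sequence; the resulting binary is
+            # {swig_local}/install-dir/bin/swig.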
+ run(f'cd {swig_local} && ./autogen.sh', env_extra=swig_env_extra) + run(f'cd {swig_local} && ./configure --prefix={swig_local}/install-dir', env_extra=swig_env_extra) + run(f'cd {swig_local} && make', env_extra=swig_env_extra) + run(f'cd {swig_local} && make install', env_extra=swig_env_extra) + assert os.path.isfile(swig_binary) + return swig_binary + else: + return swig + + +def _show_dict(d): + ret = '' + for n in sorted(d.keys()): + v = d[n] + ret += f' {n}: {v!r}\n' + return ret + +def show_sysconfig(): + ''' + Shows contents of sysconfig.get_paths() and sysconfig.get_config_vars() dicts. + ''' + import sysconfig + paths = sysconfig.get_paths() + log0(f'show_sysconfig().') + log0(f'sysconfig.get_paths():\n{_show_dict(sysconfig.get_paths())}') + log0(f'sysconfig.get_config_vars():\n{_show_dict(sysconfig.get_config_vars())}') + + +def sysconfig_python_flags(): + ''' + Returns include paths and library directory for Python. + + Uses sysconfig.*(), overridden by environment variables + PIPCL_SYSCONFIG_PATH_include, PIPCL_SYSCONFIG_PATH_platinclude and + PIPCL_SYSCONFIG_CONFIG_VAR_LIBDIR if set. + ''' + include1_ = os.environ.get('PIPCL_SYSCONFIG_PATH_include') or sysconfig.get_path('include') + include2_ = os.environ.get('PIPCL_SYSCONFIG_PATH_platinclude') or sysconfig.get_path('platinclude') + ldflags_ = os.environ.get('PIPCL_SYSCONFIG_CONFIG_VAR_LIBDIR') or sysconfig.get_config_var('LIBDIR') + + includes_ = [include1_] + if include2_ != include1_: + includes_.append(include2_) + if windows(): + includes_ = [f'/I"{i}"' for i in includes_] + ldflags_ = f'/LIBPATH:"{ldflags_}"' + else: + includes_ = [f'-I {i}' for i in includes_] + ldflags_ = f'-L {ldflags_}' + includes_ = ' '.join(includes_) + return includes_, ldflags_ + + +if __name__ == '__main__': + # Internal-only limited command line support, used if + # graal_legacy_python_config is true. + # + includes, ldflags = sysconfig_python_flags() + if sys.argv[1:] == ['--graal-legacy-python-config', '--includes']: + print(includes) + elif sys.argv[1:] == ['--graal-legacy-python-config', '--ldflags']: + print(ldflags) + else: + assert 0, f'Expected `--graal-legacy-python-config --includes|--ldflags` but {sys.argv=}' diff -r 000000000000 -r 1d09e1dec1d9 pyproject.toml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/pyproject.toml Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,8 @@ +[build-system] +# We define required packages in setup.py:get_requires_for_build_wheel(). +requires = [] + +# See pep-517. +# +build-backend = "setup" +backend-path = ["."] diff -r 000000000000 -r 1d09e1dec1d9 pytest.ini --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/pytest.ini Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,3 @@ +[pytest] +python_files = + tests/test_*.py diff -r 000000000000 -r 1d09e1dec1d9 scripts/gh_release.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/scripts/gh_release.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,625 @@ +#! /usr/bin/env python3 + +''' +Build+test script for PyMuPDF using cibuildwheel. Mostly for use with github +builds. + +We run cibuild manually, in order to build and test PyMuPDF wheels. + +As of 2024-10-08 we also support the old two wheel flavours that make up +PyMuPDF: + + PyMuPDFb + Not specific to particular versions of Python. Contains shared + libraries for the MuPDF C and C++ bindings. + PyMuPDF + Specific to particular versions of Python. Contains the rest of + the PyMuPDF implementation. + +Args: + build + Build using cibuildwheel. + build-devel + Build using cibuild with `--platform` set. 
+ pip_install + For internal use. Runs `pip install -*.whl`, + where `platform_tag` will be things like 'win32', 'win_amd64', + 'x86_64`, depending on the python we are running on. + venv + Run with remaining args inside a venv. + test + Internal. + +We also look at specific items in the environment. This allows use with Github +action inputs, which can't be easily translated into command-line arguments. + + inputs_flavours + If '0' or unset, build complete PyMuPDF wheels. + If '1', build separate PyMuPDF and PyMuPDFb wheels. + inputs_sdist + inputs_skeleton + Build minimal wheel; for testing only. + inputs_wheels_cps: + Python versions to build for. E.g. 'cp39* cp313*'. + inputs_wheels_default + Default value for other inputs_wheels_* if unset. + inputs_wheels_linux_aarch64 + inputs_wheels_linux_auto + inputs_wheels_linux_pyodide + inputs_wheels_macos_arm64 + inputs_wheels_macos_auto + inputs_wheels_windows_auto + If '1' we build the relevant wheels. + inputs_PYMUPDF_SETUP_MUPDF_BUILD + Used to directly set PYMUPDF_SETUP_MUPDF_BUILD. + E.g. 'git:--recursive --depth 1 --shallow-submodules --branch master https://github.com/ArtifexSoftware/mupdf.git' + inputs_PYMUPDF_SETUP_MUPDF_BUILD_TYPE + Used to directly set PYMUPDF_SETUP_MUPDF_BUILD_TYPE. Note that as of + 2024-09-10 .github/workflows/build_wheels.yml does not set this. + PYMUPDF_SETUP_PY_LIMITED_API + If not '0' we build a single wheel for all python versions using the + Python Limited API. + +Building for Pyodide + + If `inputs_wheels_linux_pyodide` is true and we are on Linux, we build a + Pyodide wheel, using scripts/test.py. + +Set up for use outside Github + + sudo apt install docker.io + sudo usermod -aG docker $USER + +Example usage: + + PYMUPDF_SETUP_MUPDF_BUILD=../mupdf py -3.9-32 PyMuPDF/scripts/gh_release.py venv build-devel +''' + +import glob +import inspect +import os +import platform +import re +import shlex +import subprocess +import sys +import textwrap + +import test as test_py + +pymupdf_dir = os.path.abspath( f'{__file__}/../..') + +sys.path.insert(0, pymupdf_dir) +import pipcl +del sys.path[0] + +log = pipcl.log0 +run = pipcl.run + + +def main(): + + log( '### main():') + log(f'{platform.platform()=}') + log(f'{platform.python_version()=}') + log(f'{platform.architecture()=}') + log(f'{platform.machine()=}') + log(f'{platform.processor()=}') + log(f'{platform.release()=}') + log(f'{platform.system()=}') + log(f'{platform.version()=}') + log(f'{platform.uname()=}') + log(f'{sys.executable=}') + log(f'{sys.maxsize=}') + log(f'sys.argv ({len(sys.argv)}):') + for i, arg in enumerate(sys.argv): + log(f' {i}: {arg!r}') + log(f'os.environ ({len(os.environ)}):') + for k in sorted( os.environ.keys()): + v = os.environ[ k] + log( f' {k}: {v!r}') + + if test_py.github_workflow_unimportant(): + return + + valgrind = False + if len( sys.argv) == 1: + args = iter( ['build']) + else: + args = iter( sys.argv[1:]) + while 1: + try: + arg = next(args) + except StopIteration: + break + if arg == 'build': + build(valgrind=valgrind) + elif arg == 'build-devel': + if platform.system() == 'Linux': + p = 'linux' + elif platform.system() == 'Windows': + p = 'windows' + elif platform.system() == 'Darwin': + p = 'macos' + else: + assert 0, f'Unrecognised {platform.system()=}' + build(platform_=p) + elif arg == 'pip_install': + prefix = next(args) + d = os.path.dirname(prefix) + log( f'{prefix=}') + log( f'{d=}') + for leaf in os.listdir(d): + log( f' {d}/{leaf}') + pattern = f'{prefix}-*{platform_tag()}.whl' + paths = glob.glob( pattern) + 
log( f'{pattern=} {paths=}') + # Follow pipcl.py and look at AUDITWHEEL_PLAT. This allows us to + # cope if building for both musl and normal linux. + awp = os.environ.get('AUDITWHEEL_PLAT') + if awp: + paths = [i for i in paths if awp in i] + log(f'After selecting AUDITWHEEL_PLAT={awp!r}, {paths=}.') + paths = ' '.join( paths) + run( f'pip install {paths}') + elif arg == 'venv': + command = ['python', sys.argv[0]] + for arg in args: + command.append( arg) + venv( command, packages = 'cibuildwheel') + elif arg == 'test': + project = next(args) + package = next(args) + test( project, package, valgrind=valgrind) + elif arg == '--valgrind': + valgrind = int(next(args)) + else: + assert 0, f'Unrecognised {arg=}' + + +def build( platform_=None, valgrind=False): + log( '### build():') + + platform_arg = f' --platform {platform_}' if platform_ else '' + + # Parameters are in os.environ, as that seems to be the only way that + # Github workflow .yml files can encode them. + # + def get_bool(name, default=0): + v = os.environ.get(name) + if v in ('1', 'true'): + return 1 + elif v in ('0', 'false'): + return 0 + elif v is None: + return default + else: + assert 0, f'Bad environ {name=} {v=}' + inputs_flavours = get_bool('inputs_flavours', 1) + inputs_sdist = get_bool('inputs_sdist') + inputs_skeleton = os.environ.get('inputs_skeleton') + inputs_wheels_default = get_bool('inputs_wheels_default', 1) + inputs_wheels_linux_aarch64 = get_bool('inputs_wheels_linux_aarch64', inputs_wheels_default) + inputs_wheels_linux_auto = get_bool('inputs_wheels_linux_auto', inputs_wheels_default) + inputs_wheels_linux_pyodide = get_bool('inputs_wheels_linux_pyodide', 0) + inputs_wheels_macos_arm64 = get_bool('inputs_wheels_macos_arm64', 0) + inputs_wheels_macos_auto = get_bool('inputs_wheels_macos_auto', inputs_wheels_default) + inputs_wheels_windows_auto = get_bool('inputs_wheels_windows_auto', inputs_wheels_default) + inputs_wheels_cps = os.environ.get('inputs_wheels_cps') + inputs_PYMUPDF_SETUP_MUPDF_BUILD = os.environ.get('inputs_PYMUPDF_SETUP_MUPDF_BUILD') + inputs_PYMUPDF_SETUP_MUPDF_BUILD_TYPE = os.environ.get('inputs_PYMUPDF_SETUP_MUPDF_BUILD_TYPE') + + PYMUPDF_SETUP_PY_LIMITED_API = os.environ.get('PYMUPDF_SETUP_PY_LIMITED_API') + + log( f'{inputs_flavours=}') + log( f'{inputs_sdist=}') + log( f'{inputs_skeleton=}') + log( f'{inputs_wheels_default=}') + log( f'{inputs_wheels_linux_aarch64=}') + log( f'{inputs_wheels_linux_auto=}') + log( f'{inputs_wheels_linux_pyodide=}') + log( f'{inputs_wheels_macos_arm64=}') + log( f'{inputs_wheels_macos_auto=}') + log( f'{inputs_wheels_windows_auto=}') + log( f'{inputs_wheels_cps=}') + log( f'{inputs_PYMUPDF_SETUP_MUPDF_BUILD=}') + log( f'{inputs_PYMUPDF_SETUP_MUPDF_BUILD_TYPE=}') + log( f'{PYMUPDF_SETUP_PY_LIMITED_API=}') + + # Build Pyodide wheel if specified. + # + if platform.system() == 'Linux' and inputs_wheels_linux_pyodide: + # Pyodide wheels are built by running scripts/test.py, not + # cibuildwheel. + command = f'{sys.executable} scripts/test.py -P 1' + if inputs_PYMUPDF_SETUP_MUPDF_BUILD: + command += f' -m {shlex.quote(inputs_PYMUPDF_SETUP_MUPDF_BUILD)}' + command += ' pyodide_wheel' + run(command) + + # Build sdist(s). + # + if inputs_sdist: + if pymupdf_dir != os.path.abspath( os.getcwd()): + log( f'Changing dir to {pymupdf_dir=}') + os.chdir( pymupdf_dir) + # Create PyMuPDF sdist. + run(f'{sys.executable} setup.py sdist') + assert glob.glob('dist/pymupdf-*.tar.gz') + if inputs_flavours: + # Create PyMuPDFb sdist. 
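+ # (PYMUPDF_SETUP_FLAVOUR='b' makes setup.py build the separate PyMuPDFb
+ # flavour described under `inputs_flavours` in the docstring above.)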
+ run( + f'{sys.executable} setup.py sdist', + env_extra=dict(PYMUPDF_SETUP_FLAVOUR='b'), + ) + assert glob.glob('dist/pymupdfb-*.tar.gz') + + # Build wheels. + # + if (0 + or inputs_wheels_linux_aarch64 + or inputs_wheels_linux_auto + or inputs_wheels_macos_arm64 + or inputs_wheels_macos_auto + or inputs_wheels_windows_auto + ): + env_extra = dict() + + def set_if_unset(name, value): + v = os.environ.get(name) + if v is None: + log( f'Setting environment {name=} to {value=}') + env_extra[ name] = value + else: + log( f'Not changing {name}={v!r} to {value!r}') + set_if_unset( 'CIBW_BUILD_VERBOSITY', '1') + # We exclude pp* because of `fitz_wrap.obj : error LNK2001: unresolved + # external symbol PyUnicode_DecodeRawUnicodeEscape`. + # 2024-06-05: musllinux on aarch64 fails because libclang cannot find + # libclang.so. + # + # Note that we had to disable cp313-win32 when 3.13 was experimental + # because there was no 64-bit Python-3.13 available via `py + # -3.13`. (Win32 builds need to use win64 Python because win32 + # libclang is broken.) + # + set_if_unset( 'CIBW_SKIP', 'pp* *i686 cp36* cp37* *musllinux*aarch64*') + + def make_string(*items): + ret = list() + for item in items: + if item: + ret.append(item) + return ' '.join(ret) + + cps = inputs_wheels_cps if inputs_wheels_cps else 'cp39* cp310* cp311* cp312* cp313*' + set_if_unset( 'CIBW_BUILD', cps) + for cp in cps.split(): + m = re.match('cp([0-9]+)[*]', cp) + assert m, f'{cps=} {cp=}' + v = int(m.group(1)) + if v == 314: + # Need to set CIBW_PRERELEASE_PYTHONS, otherwise cibuildwheel + # will refuse. + log(f'Setting CIBW_PRERELEASE_PYTHONS for Python version {cp=}.') + set_if_unset( 'CIBW_PRERELEASE_PYTHONS', '1') + + if platform.system() == 'Linux': + set_if_unset( + 'CIBW_ARCHS_LINUX', + make_string( + 'auto64' * inputs_wheels_linux_auto, + 'aarch64' * inputs_wheels_linux_aarch64, + ), + ) + if env_extra.get('CIBW_ARCHS_LINUX') == '': + log(f'Not running cibuildwheel because CIBW_ARCHS_LINUX is empty string.') + return + + if platform.system() == 'Windows': + set_if_unset( + 'CIBW_ARCHS_WINDOWS', + make_string( + 'auto' * inputs_wheels_windows_auto, + ), + ) + if env_extra.get('CIBW_ARCHS_WINDOWS') == '': + log(f'Not running cibuildwheel because CIBW_ARCHS_WINDOWS is empty string.') + return + + if platform.system() == 'Darwin': + set_if_unset( + 'CIBW_ARCHS_MACOS', + make_string( + 'auto' * inputs_wheels_macos_auto, + 'arm64' * inputs_wheels_macos_arm64, + ), + ) + if env_extra.get('CIBW_ARCHS_MACOS') == '': + log(f'Not running cibuildwheel because CIBW_ARCHS_MACOS is empty string.') + return + + def env_pass(name): + ''' + Adds `name` to CIBW_ENVIRONMENT_PASS_LINUX if required to be available + when building wheel with cibuildwheel. + ''' + if platform.system() == 'Linux': + v = env_extra.get('CIBW_ENVIRONMENT_PASS_LINUX', '') + if v: + v += ' ' + v += name + env_extra['CIBW_ENVIRONMENT_PASS_LINUX'] = v + + def env_set(name, value, pass_=False): + assert isinstance( value, str) + if not name.startswith('CIBW'): + assert pass_, f'Non-CIBW* name requires `pass_` to be true. {name=} {value=}.' 
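+ # (Non-CIBW* variables are only visible inside cibuildwheel's Linux
+ # docker build if they are also listed in CIBW_ENVIRONMENT_PASS_LINUX,
+ # which is what env_pass() arranges; hence the `pass_` requirement.)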
+ env_extra[ name] = value + if pass_: + env_pass(name) + + env_pass('PYMUPDF_SETUP_PY_LIMITED_API') + + if os.environ.get('PYMUPDF_SETUP_LIBCLANG'): + env_pass('PYMUPDF_SETUP_LIBCLANG') + + if inputs_skeleton: + env_set('PYMUPDF_SETUP_SKELETON', inputs_skeleton, pass_=1) + + if inputs_PYMUPDF_SETUP_MUPDF_BUILD not in ('-', None): + log(f'Setting PYMUPDF_SETUP_MUPDF_BUILD to {inputs_PYMUPDF_SETUP_MUPDF_BUILD!r}.') + env_set('PYMUPDF_SETUP_MUPDF_BUILD', inputs_PYMUPDF_SETUP_MUPDF_BUILD, pass_=True) + env_set('PYMUPDF_SETUP_MUPDF_TGZ', '', pass_=True) # Don't put mupdf in sdist. + + if inputs_PYMUPDF_SETUP_MUPDF_BUILD_TYPE not in ('-', None): + log(f'Setting PYMUPDF_SETUP_MUPDF_BUILD_TYPE to {inputs_PYMUPDF_SETUP_MUPDF_BUILD_TYPE!r}.') + env_set('PYMUPDF_SETUP_MUPDF_BUILD_TYPE', inputs_PYMUPDF_SETUP_MUPDF_BUILD_TYPE, pass_=True) + + def set_cibuild_test(): + log( f'set_cibuild_test(): {inputs_skeleton=}') + valgrind_text = '' + if valgrind: + valgrind_text = ' --valgrind 1' + env_set('CIBW_TEST_COMMAND', f'python {{project}}/scripts/gh_release.py{valgrind_text} test {{project}} {{package}}') + + if pymupdf_dir != os.path.abspath( os.getcwd()): + log( f'Changing dir to {pymupdf_dir=}') + os.chdir( pymupdf_dir) + + run('pip install cibuildwheel') + + # We include MuPDF build-time files. + flavour_d = True + + if PYMUPDF_SETUP_PY_LIMITED_API != '0': + # Build one wheel with oldest python, then fake build with other python + # versions so we test everything. + log(f'{PYMUPDF_SETUP_PY_LIMITED_API=}') + env_pass('PYMUPDF_SETUP_PY_LIMITED_API') + CIBW_BUILD_old = env_extra.get('CIBW_BUILD') + assert CIBW_BUILD_old is not None + cp = cps.split()[0] + env_set('CIBW_BUILD', cp) + log(f'Building single wheel.') + run( f'cibuildwheel{platform_arg}', env_extra=env_extra) + + # Fake-build with all python versions, using the wheel we have + # just created. This works by setting PYMUPDF_SETUP_URL_WHEEL + # which makes PyMuPDF's setup.py copy an existing wheel instead + # of building a wheel itself; it also copes with existing + # wheels having extra platform tags (from cibuildwheel's use of + # auditwheel). + # + env_set('PYMUPDF_SETUP_URL_WHEEL', f'file://wheelhouse/', pass_=True) + + set_cibuild_test() + env_set('CIBW_BUILD', CIBW_BUILD_old) + + # Disable cibuildwheels use of auditwheel. The wheel was repaired + # when it was created above so we don't need to do so again. This + # also avoids problems with musl wheels on a Linux glibc host where + # auditwheel fails with: `ValueError: Cannot repair wheel, because + # required library "libgcc_s-a3a07607.so.1" could not be located`. + # + env_set('CIBW_REPAIR_WHEEL_COMMAND', '') + + if platform.system() == 'Linux' and env_extra.get('CIBW_ARCHS_LINUX') == 'aarch64': + log(f'Testing all Python versions on linux-aarch64 is too slow and is killed by github after 6h.') + log(f'Testing on restricted python versions using wheels in wheelhouse/.') + # Testing only on first and last python versions. + cp1 = cps.split()[0] + cp2 = cps.split()[-1] + cp = cp1 if cp1 == cp2 else f'{cp1} {cp2}' + env_set('CIBW_BUILD', cp) + else: + log(f'Testing on all python versions using wheels in wheelhouse/.') + run( f'cibuildwheel{platform_arg}', env_extra=env_extra) + + elif inputs_flavours: + # Build and test PyMuPDF and PyMuPDFb wheels. + # + + # First build PyMuPDFb wheel. cibuildwheel will build a single wheel + # here, which will work with any python version on current OS. + # + flavour = 'b' + if flavour_d: + # Include MuPDF build-time files. 
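+ # ('d' is the flavour letter for MuPDF build-time files; see the
+ # description of PYMUPDF_SETUP_FLAVOUR in setup.py and --build-flavour
+ # in scripts/test.py.)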
+ flavour += 'd' + env_set( 'PYMUPDF_SETUP_FLAVOUR', flavour, pass_=1) + run( f'cibuildwheel{platform_arg}', env_extra=env_extra) + run( 'echo after {flavour=}') + run( 'ls -l wheelhouse') + + # Now set environment to build PyMuPDF wheels. cibuildwheel will build + # one for each Python version. + # + + # Tell cibuildwheel not to use `auditwheel`, because it cannot cope + # with us deliberately putting required libraries into a different + # wheel. + # + # Also, `auditwheel addtag` says `No tags to be added` and terminates + # with non-zero. See: https://github.com/pypa/auditwheel/issues/439. + # + env_set('CIBW_REPAIR_WHEEL_COMMAND_LINUX', '') + env_set('CIBW_REPAIR_WHEEL_COMMAND_MACOS', '') + + # We tell cibuildwheel to test these wheels, but also set + # CIBW_BEFORE_TEST to make it first run ourselves with the + # `pip_install` arg to install the PyMuPDFb wheel. Otherwise + # installation of PyMuPDF would fail because it lists the + # PyMuPDFb wheel as a prerequisite. We need to use `pip_install` + # because wildcards do not work on Windows, and we want to be + # careful to avoid incompatible wheels, e.g. 32 vs 64-bit wheels + # coexist during Windows builds. + # + env_set('CIBW_BEFORE_TEST', f'python scripts/gh_release.py pip_install wheelhouse/pymupdfb') + + set_cibuild_test() + + # Build main PyMuPDF wheel. + flavour = 'p' + env_set( 'PYMUPDF_SETUP_FLAVOUR', flavour, pass_=1) + run( f'cibuildwheel{platform_arg}', env_extra=env_extra) + + else: + # Build and test wheels which contain everything. + # + flavour = 'pb' + if flavour_d: + flavour += 'd' + set_cibuild_test() + env_set( 'PYMUPDF_SETUP_FLAVOUR', flavour, pass_=1) + + run( f'cibuildwheel{platform_arg}', env_extra=env_extra) + + run( 'ls -lt wheelhouse') + + +def cpu_bits(): + return 32 if sys.maxsize == 2**31 - 1 else 64 + + +# Name of venv used by `venv()`. +# +venv_name = f'venv-pymupdf-{platform.python_version()}-{cpu_bits()}' + +def venv( command=None, packages=None, quick=False, system_site_packages=False): + ''' + Runs remaining args, or the specified command if present, in a venv. + + command: + Command as string or list of args. Should usually start with 'python' + to run the venv's python. + packages: + List of packages (or comma-separated string) to install. + quick: + If true and venv directory already exists, we don't recreate venv or + install Python packages in it. + ''' + command2 = '' + if platform.system() == 'OpenBSD': + # libclang not available from pypi.org, but system py3-llvm package + # works. `pip install` should be run with --no-build-isolation and + # explicit `pip install swig psutil`. + system_site_packages = True + #ssp = ' --system-site-packages' + log(f'OpenBSD: libclang not available from pypi.org.') + log(f'OpenBSD: system package `py3-llvm` must be installed.') + log(f'OpenBSD: creating venv with --system-site-packages.') + log(f'OpenBSD: `pip install .../PyMuPDF` must be preceded by install of swig etc.') + ssp = ' --system-site-packages' if system_site_packages else '' + if quick and os.path.isdir(venv_name): + log(f'{quick=}: Not creating venv because directory already exists: {venv_name}') + command2 += 'true' + else: + quick = False + command2 += f'{sys.executable} -m venv{ssp} {venv_name}' + if platform.system() == 'Windows': + command2 += f' && {venv_name}\\Scripts\\activate' + else: + command2 += f' && . 
{venv_name}/bin/activate' + if quick: + log(f'{quick=}: Not upgrading pip or installing packages.') + else: + command2 += ' && python -m pip install --upgrade pip' + if packages: + if isinstance(packages, str): + packages = packages.split(',') + command2 += ' && pip install ' + ' '.join(packages) + command2 += ' &&' + if isinstance( command, str): + command2 += ' ' + command + else: + for arg in command: + command2 += ' ' + shlex.quote(arg) + + run( command2) + + +def test( project, package, valgrind): + + run(f'pip install {test_packages}') + if valgrind: + log('Installing valgrind.') + run(f'sudo apt update') + run(f'sudo apt install valgrind') + run(f'valgrind --version') + + log('Running PyMuPDF tests under valgrind.') + # We ignore memory leaks. + run( + f'{sys.executable} {project}/tests/run_compound.py' + f' valgrind --suppressions={project}/valgrind.supp --error-exitcode=100 --errors-for-leak-kinds=none --fullpath-after=' + f' pytest {project}/tests' + , + env_extra=dict( + PYTHONMALLOC='malloc', + PYMUPDF_RUNNING_ON_VALGRIND='1', + ), + ) + else: + run(f'{sys.executable} {project}/tests/run_compound.py pytest {project}/tests') + + +if platform.system() == 'Windows': + def relpath(path, start=None): + try: + return os.path.relpath(path, start) + except ValueError: + # os.path.relpath() fails if trying to change drives. + return os.path.abspath(path) +else: + def relpath(path, start=None): + return os.path.relpath(path, start) + + +def platform_tag(): + bits = cpu_bits() + if platform.system() == 'Windows': + return 'win32' if bits==32 else 'win_amd64' + elif platform.system() in ('Linux', 'Darwin'): + assert bits == 64 + return platform.machine() + #return 'x86_64' + else: + assert 0, f'Unrecognised: {platform.system()=}' + + +test_packages = 'pytest fontTools pymupdf-fonts flake8 pylint codespell' +if platform.system() == 'Windows' and cpu_bits() == 32: + # No pillow wheel available, and doesn't build easily. + pass +else: + test_packages += ' pillow' +if platform.system().startswith('MSYS_NT-'): + # psutil not available on msys2. + pass +else: + test_packages += ' psutil' + + +if __name__ == '__main__': + main() diff -r 000000000000 -r 1d09e1dec1d9 scripts/sysinstall.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/scripts/sysinstall.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,430 @@ +#! /usr/bin/env python3 + +''' +Test for Linux system install of MuPDF and PyMuPDF. + +We build and install MuPDF and PyMuPDF into a root directory, then use +scripts/test.py to run PyMuPDF's pytest tests with LD_PRELOAD_PATH and +PYTHONPATH set. + +PyMuPDF itself is installed using `python -m install` with a wheel created with +`pip wheel`. + +We run install commands with `sudo` if `--root /` is used. + +Note that we run some commands with sudo; it's important that these use the +same python as non-sudo, otherwise things can be build and installed for +different python versions. For example when we are run from a github action, it +should not do `- uses: actions/setup-python@v5` but instead use whatever system +python is already defined. + +Args: + + --gdb 0|1 + --mupdf-dir + Path of MuPDF checkout; default is 'mupdf'. + --mupdf-do 0|1 + Whether to build and install mupdf. + --mupdf-git + Get or update `mupdf_dir` using git. If `mupdf_dir` already + exists we run `git pull` in it; otherwise we run `git + clone` with ` `. For example: + --mupdf-git "--branch master https://github.com/ArtifexSoftware/mupdf.git" + --mupdf-so-mode + Used with `install -m ...` when installing MuPDF. 
For example + `--mupdf-so-mode 744`. + --packages 0|1 + If 1 (the default) we install required system packages such as + `libfreetype-dev`. + --pip 0|venv|sudo + Whether/how to install Python packages. + If '0' we assume required packages are already available. + If 'sudo' we install required Python packages using `sudo pip install + ...`. + If 'venv' (the default) we install Python packages and run installer + and test commands inside venv's. + --prefix: + Directory within `root`; default is `/usr/local`. Must start with `/`. + --pymupdf-dir + Path of PyMuPDF checkout; default is 'PyMuPDF'. + --pymupdf-do 0|1 + Whether to build and install pymupdf. + --root + Root of install directory; default is 'pymupdf-sysinstall-test-root'. + --tesseract5 0|1 + If 1 (the default), we force installation of libtesseract-dev version + 5 (which is not available as a default package in Ubuntu-22.04) from + package repository ppa:alex-p/tesseract-ocr-devel. + --test-venv + Set the name of the venv in which we run tests (only with `--pip + venv`); the default is a hard-coded venv name. The venv will be + created, and required packages installed using `pip`. + --use-installer 0|1 + If 1 (the default), we use `python -m installer` to install PyMuPDF + from a generated wheel. [Otherwise we use `pip install`, which refuses + to do a system install with `--root /`, referencing PEP-668.] + -i + Passed through to scripts/test.py. Default is 'rR'. + -f + Passed through to scripts/test.py. Default is '1'. + -p + Passed through to scripts/test.py. + -t + Passed through to scripts/test.py. + +To only show what commands would be run, but not actually run them, specify `-m +0 -p 0 -t 0`. +''' + +import glob +import multiprocessing +import os +import platform +import shlex +import subprocess +import sys +import sysconfig + +import test as test_py + +pymupdf_dir = os.path.abspath( f'{__file__}/../..') + +sys.path.insert(0, pymupdf_dir) +import pipcl +del sys.path[0] + +log = pipcl.log0 + +# Requirements for a system build and install: +# +# system packages (Debian names): +# +g_sys_packages = [ + 'libfreetype-dev', + 'libgumbo-dev', + 'libharfbuzz-dev', + 'libjbig2dec-dev', + 'libjpeg-dev', + 'libleptonica-dev', + 'libopenjp2-7-dev', + ] +# We also need libtesseract-dev version 5. +# + + +def main(): + + if 1: + log(f'## {__file__}: Starting.') + log(f'{sys.executable=}') + log(f'{platform.python_version()=}') + log(f'{__file__=}') + log(f'{os.environ.get("PYMUDF_SCRIPTS_SYSINSTALL_ARGS_PRE")=}') + log(f'{os.environ.get("PYMUDF_SCRIPTS_SYSINSTALL_ARGS_POST")=}') + log(f'{sys.argv=}') + log(f'{sysconfig.get_path("platlib")=}') + run_command(f'python -V', check=0) + run_command(f'python3 -V', check=0) + run_command(f'sudo python -V', check=0) + run_command(f'sudo python3 -V', check=0) + run_command(f'sudo PATH={os.environ["PATH"]} python -V', check=0) + run_command(f'sudo PATH={os.environ["PATH"]} python3 -V', check=0) + + if test_py.github_workflow_unimportant(): + return + + # Set default behaviour. + # + gdb = False + use_installer = True + mupdf_do = True + mupdf_dir = 'mupdf' + mupdf_git = None + mupdf_so_mode = None + packages = True + prefix = '/usr/local' + pymupdf_do = True + root = 'pymupdf-sysinstall-test-root' + tesseract5 = True + pytest_args = None + pytest_do = True + pytest_name = None + test_venv = 'venv-pymupdf-sysinstall-test' + pip = 'venv' + test_fitz = '1' + test_implementations = 'rR' + + # Parse command-line. 
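+ # Args from $PYMUDF_SCRIPTS_SYSINSTALL_ARGS_PRE are prepended and args
+ # from $PYMUDF_SCRIPTS_SYSINSTALL_ARGS_POST are appended to sys.argv[1:].
+ # E.g. PYMUDF_SCRIPTS_SYSINSTALL_ARGS_POST='--pytest-do 0' (illustrative
+ # value) would skip running the tests.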
+ # + env_args_pre = shlex.split(os.environ.get('PYMUDF_SCRIPTS_SYSINSTALL_ARGS_PRE', '')) + env_args_post = shlex.split(os.environ.get('PYMUDF_SCRIPTS_SYSINSTALL_ARGS_POST', '')) + args = iter(env_args_pre + sys.argv[1:] + env_args_post) + while 1: + try: + arg = next(args) + except StopIteration: + break + if arg in ('-h', '--help'): + log(__doc__) + return + elif arg == '--gdb': gdb = int(next(args)) + elif arg == '--mupdf-do': mupdf_do = int(next(args)) + elif arg == '--mupdf-dir': mupdf_dir = next(args) + elif arg == '--mupdf-git': mupdf_git = next(args) + elif arg == '--mupdf-so-mode': mupdf_so_mode = next(args) + elif arg == '--packages': packages = int(next(args)) + elif arg == '--prefix': prefix = next(args) + elif arg == '--pymupdf-do': pymupdf_do = int(next(args)) + elif arg == '--root': root = next(args) + elif arg == '--tesseract5': tesseract5 = int(next(args)) + elif arg == '--pytest-do': pytest_do = int(next(args)) + elif arg == '--test-venv': test_venv = next(args) + elif arg == '--use-installer': use_installer = int(next(args)) + elif arg == '--pip': pip = next(args) + elif arg == '-f': test_fitz = next(args) + elif arg == '-i': test_implementations = next(args) + elif arg == '-p': pytest_args = next(args) + elif arg == '-t': pytest_name = next(args) + else: + assert 0, f'Unrecognised arg: {arg!r}' + + assert prefix.startswith('/') + pip_values = ('0', 'sudo', 'venv') + assert pip in pip_values, f'Unrecognised --pip value {pip!r} should be one of: {pip_values!r}' + root = os.path.abspath(root) + root_prefix = f'{root}{prefix}'.replace('//', '/') + + sudo = '' + if root == '/': + sudo = f'sudo PATH={os.environ["PATH"]} ' + def run(command, env_extra=None): + return run_command(command, doit=mupdf_do, env_extra=env_extra) + # Get MuPDF from git if specified. + # + if mupdf_git: + # Update existing checkout or do `git clone`. + if os.path.exists(mupdf_dir): + log(f'## Update MuPDF checkout {mupdf_dir}.') + run(f'cd {mupdf_dir} && git pull && git submodule update --init') + else: + # No existing git checkout, so do a fresh clone. + log(f'## Clone MuPDF into {mupdf_dir}.') + run(f'git clone --recursive --depth 1 --shallow-submodules {mupdf_git} {mupdf_dir}') + + if packages: + # Install required system packages. We assume a Debian package system. + # + log('## Install system packages required by MuPDF.') + run(f'sudo apt update') + run(f'sudo apt install {" ".join(g_sys_packages)}') + # Ubuntu-22.04 has freeglut3-dev, not libglut-dev. + run(f'sudo apt install libglut-dev | sudo apt install freeglut3-dev') + if tesseract5: + log(f'## Force installation of libtesseract-dev version 5.') + # https://stackoverflow.com/questions/76834972/how-can-i-run-pytesseract-python-library-in-ubuntu-22-04 + # + run('sudo apt install -y software-properties-common') + run('sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel') + run('sudo apt update') + run('sudo apt install -y libtesseract-dev') + else: + run('sudo apt install libtesseract-dev') + + # Build+install MuPDF. We use mupd:Makefile's install-shared-python target. + # + if pip == 'sudo': + log('## Installing Python packages required for building MuPDF and PyMuPDF.') + #run(f'sudo pip install --upgrade pip') # Breaks on Github see: https://github.com/pypa/get-pip/issues/226. 
+ # We need to install psutil and pillow as system packages, otherwise things like `import psutil` + # fail, seemingly because of pip warning: + # + # WARNING: Running pip as the 'root' user can result in broken + # permissions and conflicting behaviour with the system package + # manager. It is recommended to use a virtual environment instead: + # https://pip.pypa.io/warnings/venv + # + names = test_py.wrap_get_requires_for_build_wheel(f'{__file__}/../..') + names = names.split(' ') + names = [n for n in names if n not in ('psutil', 'pillow')] + names = ' '.join(names) + run(f'sudo pip install {names}') + run(f'sudo apt install python3-psutil python3-pillow') + + log('## Build and install MuPDF.') + command = f'cd {mupdf_dir}' + command += f' && {sudo}make' + command += f' -j {multiprocessing.cpu_count()}' + #command += f' EXE_LDFLAGS=-Wl,--trace' # Makes linker generate diagnostics as it runs. + command += f' DESTDIR={root}' + command += f' HAVE_LEPTONICA=yes' + command += f' HAVE_TESSERACT=yes' + command += f' USE_SYSTEM_LIBS=yes' + # We need latest zxingcpp so system version not ok. + command += f' USE_SYSTEM_ZXINGCPP=no' + command += f' barcode=yes' + command += f' VENV_FLAG={"--venv" if pip == "venv" else ""}' + if mupdf_so_mode: + command += f' SO_INSTALL_MODE={mupdf_so_mode}' + command += f' build_prefix=system-libs-' + command += f' prefix={prefix}' + command += f' verbose=yes' + command += f' install-shared-python' + command += f' INSTALL_MODE=755' + run( command) + + # Build+install PyMuPDF. + # + log('## Build and install PyMuPDF.') + def run(command): + return run_command(command, doit=pymupdf_do) + flags_freetype2 = run_command('pkg-config --cflags freetype2', capture=1) + compile_flags = f'-I {root_prefix}/include {flags_freetype2}' + link_flags = f'-L {root_prefix}/lib' + env = '' + env += f'CFLAGS="{compile_flags}" ' + env += f'CXXFLAGS="{compile_flags}" ' + env += f'LDFLAGS="-L {root}/{prefix}/lib" ' + env += f'PYMUPDF_SETUP_MUPDF_BUILD= ' # Use system MuPDF. + if use_installer: + log(f'## Building wheel.') + if pip == 'venv': + venv_name = 'venv-pymupdf-sysinstall' + run(f'pwd') + run(f'rm dist/* || true') + if pip == 'venv': + run(f'{sys.executable} -m venv {venv_name}') + run(f'. {venv_name}/bin/activate && pip install --upgrade pip') + run(f'. {venv_name}/bin/activate && pip install --upgrade installer') + run(f'{env} {venv_name}/bin/python -m pip wheel -vv -w dist {os.path.abspath(pymupdf_dir)}') + elif pip == 'sudo': + #run(f'sudo pip install --upgrade pip') # Breaks on Github see: https://github.com/pypa/get-pip/issues/226. + run(f'sudo pip install installer') + run(f'{env} pip wheel -vv -w dist {os.path.abspath(pymupdf_dir)}') + else: + log(f'Not installing "installer" because {pip=}.') + wheel = glob.glob(f'dist/*') + assert len(wheel) == 1, f'{wheel=}' + wheel = wheel[0] + log(f'## Installing wheel using `installer`.') + pv = '.'.join(platform.python_version_tuple()[:2]) + p = f'{root_prefix}/lib/python{pv}' + # `python -m installer` fails to overwrite existing files. 
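+ # (so any previously-installed pymupdf/fitz files are removed first; the
+ # trailing `|| true` makes each removal a no-op if nothing is there.)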
+ run(f'{sudo}rm -r {p}/site-packages/pymupdf || true') + run(f'{sudo}rm -r {p}/site-packages/pymupdf.py || true') + run(f'{sudo}rm -r {p}/site-packages/fitz || true') + run(f'{sudo}rm -r {p}/site-packages/fitz.py || true') + run(f'{sudo}rm -r {p}/site-packages/pymupdf-*.dist-info || true') + run(f'{sudo}rm -r {root_prefix}/bin/pymupdf || true') + if pip == 'venv': + run(f'{sudo}{venv_name}/bin/python -m installer --destdir {root} --prefix {prefix} {wheel}') + else: + run(f'{sudo}{sys.executable} -m installer --destdir {root} --prefix {prefix} {wheel}') + # It seems that MuPDF Python bindings are installed into + # `.../dist-packages` (from mupdf:Mafile's call of `$(shell python3 + # -c "import sysconfig; print(sysconfig.get_path('platlib'))")` while + # `python -m installer` installs PyMuPDF into `.../site-packages`. + # + # This might be because `sysconfig.get_path('platlib')` returns + # `.../site-packages` if run in a venv, otherwise `.../dist-packages`. + # + # And on github ubuntu-latest, sysconfig.get_path("platlib") is + # /opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages + # + # So we set pythonpath (used later) to import from all + # `pythonX.Y/site-packages/` and `pythonX.Y/dist-packages` directories + # within `root_prefix`: + # + pv = platform.python_version().split('.') + pv = f'python{pv[0]}.{pv[1]}' + pythonpath = list() + for dirpath, dirnames, filenames in os.walk(root_prefix): + if os.path.basename(dirpath) == pv: + for leaf in 'site-packages', 'dist-packages': + if leaf in dirnames: + pythonpath.append(os.path.join(dirpath, leaf)) + pythonpath = ':'.join(pythonpath) + log(f'{pythonpath=}') + else: + command = f'{env} pip install -vv --root {root} {os.path.abspath(pymupdf_dir)}' + run( command) + pythonpath = pipcl.install_dir(root) + + # Show contents of installation directory. This is very slow on github, + # where /usr/local contains lots of things. + #run(f'find {root_prefix}|sort') + + # Run pytest tests. + # + log('## Run PyMuPDF pytest tests.') + def run(command, env_extra=None): + return run_command(command, doit=pytest_do, env_extra=env_extra, caller=1) + import gh_release + if pip == 'venv': + # Create venv. + run(f'{sys.executable} -m venv {test_venv}') + # Install required packages. + command = f'. {test_venv}/bin/activate' + command += f' && pip install --upgrade pip' + command += f' && pip install --upgrade {gh_release.test_packages}' + run(command) + elif pip == 'sudo': + names = gh_release.test_packages + names = names.split(' ') + names = [n for n in names if n not in ('psutil', 'pillow')] + names = ' '.join(names) + run(f'sudo pip install --upgrade {names}') + else: + log(f'Not installing packages for testing because {pip=}.') + # Run pytest. + # + # We need to set PYTHONPATH and LD_LIBRARY_PATH. In particular we + # use pipcl.install_dir() to find where pipcl will have installed + # PyMuPDF. + command = '' + if pip == 'venv': + command += f'. {test_venv}/bin/activate &&' + command += f' LD_LIBRARY_PATH={root_prefix}/lib PYTHONPATH={pythonpath} PATH=$PATH:{root_prefix}/bin' + run(f'ls -l {root_prefix}/bin/') + # 2024-03-20: Not sure whether/where `pymupdf` binary is installed, so we + # disable the test_cli* tests. 
+ command += f' {pymupdf_dir}/scripts/test.py' + if gdb: + command += ' --gdb 1' + command += f' -v 0' + if pytest_name is None: + excluded_tests = ( + 'test_color_count', + 'test_3050', + 'test_cli', + 'test_cli_out', + 'test_pylint', + 'test_textbox3', + 'test_3493', + 'test_4180', + ) + excluded_tests = ' and not '.join(excluded_tests) + if not pytest_args: + pytest_args = '' + pytest_args += f' -k \'not {excluded_tests}\'' + else: + command += f' -t {pytest_name}' + if test_fitz: + command += f' -f {test_fitz}' + if test_implementations: + command += f' -i {test_implementations}' + if pytest_args: + command += f' -p {shlex.quote(pytest_args)}' + if pytest_do: + command += ' test' + run(command, env_extra=dict(PYMUPDF_SYSINSTALL_TEST='1')) + + +def run_command(command, capture=False, check=True, doit=True, env_extra=None, caller=0): + if doit: + return pipcl.run(command, capture=capture, check=check, caller=caller+2, env_extra=env_extra) + else: + log(f'## Would have run: {command}', caller=2) + + +if __name__ == '__main__': + main() diff -r 000000000000 -r 1d09e1dec1d9 scripts/test.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/scripts/test.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1300 @@ +#! /usr/bin/env python3 + +'''Developer build/test script for PyMuPDF. + +Examples: + + ./PyMuPDF/scripts/test.py --m mupdf build test + Build and test with pre-existing local mupdf/ checkout. + + ./PyMuPDF/scripts/test.py build test + Build and test with default internal download of mupdf. + + ./PyMuPDF/scripts/test.py -m 'git:https://git.ghostscript.com/mupdf.git' build test + Build and test with internal checkout of MuPDF master. + + ./PyMuPDF/scripts/test.py -m 'git:--branch 1.26.x https://github.com/ArtifexSoftware/mupdf.git' build test + Build and test using internal checkout of mupdf 1.26.x branch from + Github. + +Usage: + +* Command line arguments are called parameters if they start with `-`, + otherwise they are called commands. +* Parameters are evaluated first in the order that they were specified. +* Then commands are run in the order in which they were specified. +* Usually command `test` would be specified after a `build`, `install` or + `wheel` command. +* Parameters and commands can be interleaved but it may be clearer to separate + them on the command line. + +Other: + +* If we are not already running inside a Python venv, we automatically create a + venv and re-run ourselves inside it. +* Build/wheel/install commands always install into the venv. +* Tests use whatever PyMuPDF/MuPDF is currently installed in the venv. +* We run tests with pytest. + +* One can generate call traces by setting environment variables in debug + builds. For details see: + https://mupdf.readthedocs.io/en/latest/language-bindings.html#environmental-variables + +Command line args: + + -a + Read next space-separated argument(s) from environmental variable + . + * Does nothing if is unset. + * Useful when running via Github action. + + -b + Set build type for `build` commands. `` should be one of + 'release', 'debug', 'memento'. [This makes `build` set environment + variable `PYMUPDF_SETUP_MUPDF_BUILD_TYPE`, which is used by PyMuPDF's + `setup.py`.] + + --build-flavour + Combination of 'p', 'b', 'd'. See ../setup.py's description of + PYMUPDF_SETUP_FLAVOUR. Default is 'pbd', i.e. self-contained PyMuPDF + wheels including MuPDF build-time files. + + --build-isolation 0|1 + If true (the default on non-OpenBSD systems), we let pip create and use + its own new venv to build PyMuPDF. 
Otherwise we force pip to use the + current venv. + + --cibw-archs-linux + Set CIBW_ARCHS_LINUX, e.g. to `auto64 aarch64`. Default is `auto64` so + this allows control over whether to build linux-aarch64 wheels. + + --cibw-name + Name to use when installing cibuildwheel, e.g.: + --cibw-name cibuildwheel==3.0.0b1 + Default is `cibuildwheel`, i.e. the current release. + + --cibw-pyodide 0|1 + Experimental, make `cibuild` command build a pyodide wheel. + 2025-05-27: this fails when building mupdf C API - `ld -r -b binary + ...` fails with: + emcc: error: binary: No such file or directory ("binary" was expected to be an input file, based on the commandline arguments provided) + + --cibw-pyodide-version + Override default Pyodide version to use with `cibuildwheel` command. If + empty string we use cibuildwheel's default. + + --cibw-release-1 + Set up so that `cibw` builds all wheels except linux-aarch64, and sdist + if on Linux. + + --cibw-release-2 + Set up so that `cibw` builds only linux-aarch64 wheel. + + -d + Equivalent to `-b debug`. + + --dummy + Sets PYMUPDF_SETUP_DUMMY=1 which makes setup.py build a dummy wheel + with no content. For internal testing only. + + -e = + Add to environment used in build and test commands. Can be specified + multiple times. + + -f 0|1 + If 1 we also test alias `fitz` as well as `pymupdf`. Default is '0'. + + --gdb 0|1 + Run tests under gdb. Requires user interaction. + + --graal + Use graal - run inside a Graal VM instead of a Python venv. + + As of 2025-08-04 we: + * Clone the latest pyenv and build it. + * Use pyenv to install graalpy. + * Use graalpy to create venv. + + [After the first time, suggest `-v 1` to avoid delay from + updating/building pyenv and recreating the graal venv.] + + --help + -h + Show help. + + -I + Set PyMuPDF implementations to test. + must contain only these individual characters: + 'r' - rebased. + 'R' - rebased without optimisations. + Default is 'r'. Also see `PyMuPDF:tests/run_compound.py`. + + -i + Set version installed by the 'install' command. + + -k + Specify which test(s) to run; passed straight through to pytest's `-k`. + For example `-k test_3354`. + + -m | --mupdf + Location of local mupdf/ directory or 'git:...' to be used + when building PyMuPDF. + + This sets environment variable PYMUPDF_SETUP_MUPDF_BUILD, which is used + by PyMuPDF/setup.py. If not specified PyMuPDF will download its default + mupdf .tgz. + + Additionally if starts with ':' we use the remaining text as + the branch name and add https://github.com/ArtifexSoftware/mupdf.git. + + For example: + + -m "git:--branch master https://github.com/ArtifexSoftware/mupdf.git" + -m :master + + -m "git:--branch 1.26.x https://github.com/ArtifexSoftware/mupdf.git" + -m :1.26.x + + --mupdf-clean 0|1 + If 1 we do a clean MuPDF build. + + -M 0|1 + --build-mupdf 0|1 + Whether to rebuild mupdf when we build PyMuPDF. Default is 1. + + -o + Control whether we do nothing on the current platform. + * is a comma-separated list of names. + * If is empty (the default), we always run normally. + * Otherwise we only run if an item in matches (case + insensitive) platform.system(). + * For example `-o linux,darwin` will do nothing unless on Linux or + MacOS. + + -p + Set pytest options; default is ''. + + -P 0|1 + If 1, automatically install required system packages such as + Valgrind. Default is 0. + + --pybind 0|1 + Experimental, for investigating + https://github.com/pymupdf/PyMuPDF/issues/3869. Runs run basic code + inside C++ pybind. 
Requires `sudo apt install pybind11-dev` or similar. + + --pyodide-build-version + Version of Python package pyodide-build to use with `pyodide` command. + + If None (the default) `pyodide` uses the latest available version. + 2025-02-13: pyodide_build_version='0.29.3' works. + + -s 0 | 1 + If 1 (the default), build with Python Limited API/Stable ABI. + [This simply sSets $PYMUPDF_SETUP_PY_LIMITED_API, which is used by + PyMuPDF/setup.py.] + + --show-args: + Show sys.argv and exit. For debugging. + + --sync-paths + Do not run anything, instead write required files/directories/checkouts + to stdout, one per line. This is to help with automated running on + remote machines. + + --system-site-packages 0|1 + If 1, use `--system-site-packages` when creating venv. Defaults is 0. + + --swig + Use instead of the `swig` command. + + Unix only: + Clone/update/build swig from a git repository using 'git:' prefix. + + We default to https://github.com/swig/swig.git branch master, so these + are all equivalent: + + --swig 'git:--branch master https://github.com/swig/swig.git' + --swig 'git:--branch master' + --swig git: + + 2025-08-18: This fixes building with py_limited_api on python-3.13. + + --swig-quick 0|1 + If 1 and `--swig` starts with 'git:', we do not update/build swig if + already present. + + See description of PYMUPDF_SETUP_SWIG_QUICK in setup.py. + + -t + Pytest test names, comma-separated. Should be relative to PyMuPDF + directory. For example: + -t tests/test_general.py + -t tests/test_general.py::test_subset_fonts + To specify multiple tests, use comma-separated list and/or multiple `-t + ` args. + + --timeout + Sets timeout when running tests. + + -T + Use specified prefix when running pytest, must be one of: + gdb + helgrind + vagrind + + -v + venv is: + 0 - do not use a venv. + 1 - Use venv. If it already exists, we assume the existing directory + was created by us earlier and is a valid venv containing all + necessary packages; this saves a little time. + 2 - Use venv. + 3 - Use venv but delete it first if it already exists. + The default is 2. + +Commands: + + build + Builds and installs PyMuPDF into venv, using `pip install .../PyMuPDF`. + + buildtest + Same as 'build test'. + + cibw + Build and test PyMuPDF wheel(s) using cibuildwheel. Wheels are placed + in directory `wheelhouse`. + * We do not attempt to install wheels. + * So it is generally not useful to do `cibw test`. + + If CIBW_BUILD is unset, we set it as follows: + * On Github we build and test all supported Python versions. + * Otherwise we build and test the current Python version only. + + If CIBW_ARCHS is unset we set $CIBW_ARCHS_WINDOWS, $CIBW_ARCHS_MACOS + and $CIBW_ARCHS_LINUX to auto64 if they are unset. + + install + Install with `pip install --force-reinstall `. + + pyodide + Build Pyodide wheel. We clone `emsdk.git`, set it up, and run + `pyodide build`. This runs our setup.py with CC etc set up + to create Pyodide binaries in a wheel called, for example, + `PyMuPDF-1.23.2-cp311-none-emscripten_3_1_32_wasm32.whl`. + + It seems that sys.version must match the Python version inside emsdk; + as of 2025-02-14 this is 3.12. Otherwise we get build errors such as: + [wasm-validator error in function 723] unexpected false: all used features should be allowed, on ... + + test + Runs PyMuPDF's pytest tests. Default is to test rebased and unoptimised + rebased; use `-i` to change this. + + wheel + Build and install wheel. + + +Environment: + PYMUDF_SCRIPTS_TEST_options + Is prepended to command line args. 
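+ For example (illustrative values):
+ PYMUDF_SCRIPTS_TEST_options='-b debug -v 1' ./PyMuPDF/scripts/test.py build test
+ behaves like:
+ ./PyMuPDF/scripts/test.py -b debug -v 1 build test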
+''' + +import glob +import os +import platform +import re +import shlex +import shutil +import subprocess +import sys +import textwrap + + +pymupdf_dir_abs = os.path.abspath( f'{__file__}/../..') + +try: + sys.path.insert(0, pymupdf_dir_abs) + import pipcl +finally: + del sys.path[0] + +try: + sys.path.insert(0, f'{pymupdf_dir_abs}/scripts') + import gh_release +finally: + del sys.path[0] + + +pymupdf_dir = pipcl.relpath(pymupdf_dir_abs) + +log = pipcl.log0 +run = pipcl.run + + +def main(argv): + + if github_workflow_unimportant(): + return + + build_isolation = None + cibw_name = None + cibw_pyodide = None + cibw_pyodide_version = None + commands = list() + env_extra = dict() + graal = False + implementations = 'r' + install_version = None + mupdf_sync = None + os_names = list() + system_packages = False + pybind = False + pyodide_build_version = None + pytest_options = '' + pytest_prefix = None + cibw_sdist = None + show_args = False + show_help = False + sync_paths = False + system_site_packages = False + swig = None + swig_quick = None + test_fitz = False + test_names = list() + test_timeout = None + valgrind = False + warnings = list() + venv = 2 + + options = os.environ.get('PYMUDF_SCRIPTS_TEST_options', '') + options = shlex.split(options) + + # Parse args and update the above state. We do this before moving into a + # venv, partly so we can return errors immediately. + # + args = iter(options + argv[1:]) + i = 0 + while 1: + try: + arg = next(args) + except StopIteration: + arg = None + break + + if 0: + pass + + elif arg == '-a': + _name = next(args) + _value = os.environ.get(_name, '') + _args = shlex.split(_value) + list(args) + args = iter(_args) + + elif arg == '-b': + env_extra['PYMUPDF_SETUP_MUPDF_BUILD_TYPE'] = next(args) + + elif arg == '--build-flavour': + env_extra['PYMUPDF_SETUP_FLAVOUR'] = next(args) + + elif arg == '--build-isolation': + build_isolation = int(next(args)) + + elif arg == '--cibw-pyodide-version': + cibw_pyodide_version = next(args) + + elif arg == '--cibw-release-1': + cibw_sdist = True + env_extra['CIBW_ARCHS_LINUX'] = 'auto64' + env_extra['CIBW_ARCHS_MACOS'] = 'auto64' + env_extra['CIBW_ARCHS_WINDOWS'] = 'auto' # win32 and win64. + env_extra['CIBW_SKIP'] = 'pp* *i686 cp36* cp37* *musllinux*aarch64*' + + elif arg == '--cibw-release-2': + env_extra['CIBW_ARCHS_LINUX'] = 'aarch64' + # Testing only first and last python versions because otherwise + # Github times out after 6h. 
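+ # (cp39* and cp313* are the first and last entries of the default
+ # CIBW_BUILD list used elsewhere in this script.)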
+ env_extra['CIBW_BUILD'] = 'cp39* cp313*' + os_names = ['linux'] + + elif arg == '--cibw-archs-linux': + env_extra['CIBW_ARCHS_LINUX'] = next(args) + + elif arg == '--cibw-name': + cibw_name = next(args) + + elif arg == '--cibw-pyodide': + cibw_pyodide = next(args) + + elif arg == '-d': + env_extra['PYMUPDF_SETUP_MUPDF_BUILD_TYPE'] = 'debug' + + elif arg == '--dummy': + env_extra['PYMUPDF_SETUP_DUMMY'] = '1' + env_extra['CIBW_TEST_COMMAND'] = '' + + elif arg == '-e': + _nv = next(args) + assert '=' in _nv, f'-e = does not contain "=": {_nv!r}' + _name, _value = _nv.split('=', 1) + env_extra[_name] = _value + + elif arg == '-f': + test_fitz = int(next(args)) + + elif arg == '--graal': + graal = True + + elif arg in ('-h', '--help'): + show_help = True + + elif arg == '-i': + install_version = next(args) + + elif arg == '-I': + implementations = next(args) + + elif arg == '-k': + pytest_options += f' -k {shlex.quote(next(args))}' + + elif arg in ('-m', '--mupdf'): + _mupdf = next(args) + if _mupdf == '-': + _mupdf = None + elif _mupdf.startswith(':'): + _branch = _mupdf[1:] + _mupdf = 'git:--branch {_branch} https://github.com/ArtifexSoftware/mupdf.git' + os.environ['PYMUPDF_SETUP_MUPDF_BUILD'] = _mupdf + elif _mupdf.startswith('git:') or '://' in _mupdf: + os.environ['PYMUPDF_SETUP_MUPDF_BUILD'] = _mupdf + else: + assert os.path.isdir(_mupdf), f'Not a directory: {_mupdf=}' + os.environ['PYMUPDF_SETUP_MUPDF_BUILD'] = os.path.abspath(_mupdf) + mupdf_sync = _mupdf + + elif arg == '--mupdf-clean': + env_extra['PYMUPDF_SETUP_MUPDF_CLEAN']=next(args) + + elif arg in ('-M', '--build-mupdf'): + env_extra['PYMUPDF_SETUP_MUPDF_REBUILD'] = next(args) + + elif arg == '-o': + os_names += next(args).split(',') + + elif arg == '-p': + pytest_options += f' {next(args)}' + + elif arg == '-P': + system_packages = int(next(args)) + + elif arg == '--pybind': + pybind = int(next(args)) + + elif arg == '--pyodide-build-version': + pyodide_build_version = next(args) + + elif arg == '-s': + _value = next(args) + assert _value in ('0', '1'), f'`-s` must be followed by `0` or `1`, not {_value=}.' + env_extra['PYMUPDF_SETUP_PY_LIMITED_API'] = _value + + elif arg == '--show-args': + show_args = 1 + elif arg == '--sync-paths': + sync_paths = True + + elif arg == '--system-site-packages': + system_site_packages = int(next(args)) + + elif arg == '--swig': + swig = next(args) + + elif arg == '--swig-quick': + swig_quick = int(next(args)) + + elif arg == '-t': + test_names += next(args).split(',') + + elif arg == '--timeout': + test_timeout = float(next(args)) + + elif arg == '-T': + pytest_prefix = next(args) + assert pytest_prefix in ('gdb', 'helgrind', 'valgrind'), \ + f'Unrecognised {pytest_prefix=}, should be one of: gdb valgrind helgrind.' + + elif arg == '-v': + venv = int(next(args)) + assert venv in (0, 1, 2, 3), f'Invalid {venv=} should be 0, 1, 2 or 3.' + + elif arg in ('build', 'cibw', 'install', 'pyodide', 'test', 'wheel'): + commands.append(arg) + + elif arg == 'buildtest': + commands += ['build', 'test'] + + else: + assert 0, f'Unrecognised option/command: {arg=}.' + + # Handle special args --sync-paths, -h, -v, -o first. + # + if sync_paths: + # Just print required files, directories and checkouts. 
+ print(pymupdf_dir) + if mupdf_sync: + print(mupdf_sync) + return + + if show_help: + print(__doc__) + return + + if show_args: + print(f'sys.argv ({len(sys.argv)}):') + for arg in sys.argv: + print(f' {arg!r}') + return + + if os_names: + if platform.system().lower() not in os_names: + log(f'Not running because {platform.system().lower()=} not in {os_names=}') + return + + if commands: + if venv: + # Rerun ourselves inside a venv if not already in a venv. + if not venv_in(): + if graal: + # 2025-07-24: We need the latest pyenv. + graalpy = 'graalpy-24.2.1' + venv_name = f'venv-pymupdf-{graalpy}' + pyenv_dir = f'{pymupdf_dir_abs}/pyenv-git' + os.environ['PYENV_ROOT'] = pyenv_dir + os.environ['PATH'] = f'{pyenv_dir}/bin:{os.environ["PATH"]}' + os.environ['PIPCL_GRAAL_PYTHON'] = sys.executable + + if venv >= 3: + shutil.rmtree(venv_name, ignore_errors=1) + if venv == 1 and os.path.exists(pyenv_dir) and os.path.exists(venv_name): + log(f'{venv=} and {venv_name=} already exists so not building pyenv or creating venv.') + else: + pipcl.git_get('https://github.com/pyenv/pyenv.git', pyenv_dir, branch='master') + run(f'cd {pyenv_dir} && src/configure && make -C src') + run(f'which pyenv') + run(f'pyenv install -v -s {graalpy}') + run(f'{pyenv_dir}/versions/{graalpy}/bin/graalpy -m venv {venv_name}') + e = run(f'. {venv_name}/bin/activate && python {shlex.join(sys.argv)}', + check=False, + ) + else: + venv_name = f'venv-pymupdf-{platform.python_version()}-{int.bit_length(sys.maxsize+1)}' + e = venv_run( + sys.argv, + venv_name, + recreate=(venv>=2), + clean=(venv>=3), + ) + sys.exit(e) + else: + log(f'Warning, no commands specified so nothing to do.') + + # Clone/update/build swig if specified. + swig_binary = pipcl.swig_get(swig, swig_quick) + if swig_binary: + os.environ['PYMUPDF_SETUP_SWIG'] = swig_binary + + # Handle commands. + # + have_installed = False + for command in commands: + log(f'### {command=}.') + if 0: + pass + + elif command in ('build', 'wheel'): + build( + env_extra, + build_isolation=build_isolation, + venv=venv, + wheel=(command=='wheel'), + ) + have_installed = True + + elif command == 'cibw': + # Build wheel(s) with cibuildwheel. + if cibw_pyodide and env_extra.get('CIBW_BUILD') is None: + assert 0, f'Need a Python version for Pyodide.' 
+ CIBW_BUILD = 'cp312*' + env_extra['CIBW_BUILD'] = CIBW_BUILD + log(f'Defaulting to {CIBW_BUILD=} for Pyodide.') + #if cibw_pyodide_version == None: + # cibw_pyodide_version = '0.28.0' + cibuildwheel( + env_extra, + cibw_name or 'cibuildwheel', + cibw_pyodide, + cibw_pyodide_version, + cibw_sdist, + ) + + elif command == 'install': + p = 'pymupdf' + if install_version: + if not install_version.startswith(('==', '>=', '>')): + p = f'{p}==' + p = f'{p}{install_version}' + run(f'pip install --force-reinstall {p}') + have_installed = True + + elif command == 'test': + if not have_installed: + log(f'## Warning: have not built/installed PyMuPDF; testing whatever is already installed.') + test( + env_extra=env_extra, + implementations=implementations, + test_names=test_names, + pytest_options=pytest_options, + test_timeout=test_timeout, + pytest_prefix=pytest_prefix, + test_fitz=test_fitz, + pybind=pybind, + system_packages=system_packages, + venv=venv, + ) + + elif command == 'pyodide': + build_pyodide_wheel(pyodide_build_version=pyodide_build_version) + + else: + assert 0, f'{command=}' + + +def get_env_bool(name, default=0): + v = os.environ.get(name) + if v in ('1', 'true'): + return 1 + elif v in ('0', 'false'): + return 0 + elif v is None: + return default + else: + assert 0, f'Bad environ {name=} {v=}' + +def show_help(): + print(__doc__) + print(venv_info()) + + +def github_workflow_unimportant(): + ''' + Returns true if we are running a Github scheduled workflow but in a + repository not called 'PyMuPDF'. This can be used to avoid consuming + unnecessary Github minutes running workflows on non-main repositories such + as ArtifexSoftware/PyMuPDF-julian. + ''' + GITHUB_EVENT_NAME = os.environ.get('GITHUB_EVENT_NAME') + GITHUB_REPOSITORY = os.environ.get('GITHUB_REPOSITORY') + if GITHUB_EVENT_NAME == 'schedule' and GITHUB_REPOSITORY != 'pymupdf/PyMuPDF': + log(f'## This is an unimportant Github workflow: a scheduled event, not in the main repository `pymupdf/PyMuPDF`.') + log(f'## {GITHUB_EVENT_NAME=}.') + log(f'## {GITHUB_REPOSITORY=}.') + return True + +def venv_info(pytest_args=None): + ''' + Returns string containing information about the venv we use and how to + run tests manually. If specified, `pytest_args` contains the pytest args, + otherwise we use an example. + ''' + pymupdf_dir_rel = gh_release.relpath(pymupdf_dir) + ret = f'Name of venv: {gh_release.venv_name}\n' + if pytest_args is None: + pytest_args = f'{pymupdf_dir_rel}/tests/test_general.py::test_subset_fonts' + if platform.system() == 'Windows': + ret += textwrap.dedent(f''' + Rerun tests manually with rebased implementation: + Enter venv: + {gh_release.venv_name}\\Scripts\\activate + Run specific test in venv: + {gh_release.venv_name}\\Scripts\\python -m pytest {pytest_args} + ''') + else: + ret += textwrap.dedent(f''' + Rerun tests manually with rebased implementation: + Enter venv and run specific test, also under gdb: + . 
{gh_release.venv_name}/bin/activate + python -m pytest {pytest_args} + gdb --args python -m pytest {pytest_args} + Run without explicitly entering venv, also under gdb: + ./{gh_release.venv_name}/bin/python -m pytest {pytest_args} + gdb --args ./{gh_release.venv_name}/bin/python -m pytest {pytest_args} + ''') + return ret + + +def build( + env_extra, + *, + build_isolation, + venv, + wheel, + ): + print(f'{build_isolation=}') + + if build_isolation is None: + # On OpenBSD libclang is not available on pypi.org, so we need to force + # use of system package py3-llvm with --no-build-isolation, manually + # installing other required packages. + build_isolation = False if platform.system() == 'OpenBSD' else True + + if build_isolation: + # This is the default on non-OpenBSD. + build_isolation_text = '' + else: + # Not using build isolation - i.e. pip will not be using its own clean + # venv, so we need to explicitly install required packages. Manually + # install required packages from pyproject.toml. + sys.path.insert(0, os.path.abspath(f'{__file__}/../..')) + import setup + names = setup.get_requires_for_build_wheel() + del sys.path[0] + if names: + names = ' '.join(names) + if venv == 2: + run( f'python -m pip install --upgrade {names}') + else: + log(f'{venv=}: Not installing packages with pip: {names}') + build_isolation_text = ' --no-build-isolation' + + if wheel: + new_files = pipcl.NewFiles(f'wheelhouse/*.whl') + run(f'pip wheel{build_isolation_text} -w wheelhouse -v {pymupdf_dir_abs}', env_extra=env_extra) + wheel = new_files.get_one() + run(f'pip install --force-reinstall {wheel}') + else: + run(f'pip install{build_isolation_text} -v --force-reinstall {pymupdf_dir_abs}', env_extra=env_extra) + + +def cibuildwheel(env_extra, cibw_name, cibw_pyodide, cibw_pyodide_version, cibw_sdist): + + if cibw_sdist and platform.system() == 'Linux': + log(f'Building sdist.') + run(f'cd {pymupdf_dir_abs} && {sys.executable} setup.py -d wheelhouse sdist', env_extra=env_extra) + sdists = glob.glob(f'{pymupdf_dir_abs}/wheelhouse/pymupdf-*.tar.gz') + log(f'{sdists=}') + assert sdists + + run(f'pip install --upgrade --force-reinstall {cibw_name}') + + # Some general flags. + if 'CIBW_BUILD_VERBOSITY' not in env_extra: + env_extra['CIBW_BUILD_VERBOSITY'] = '1' + if 'CIBW_SKIP' not in env_extra: + env_extra['CIBW_SKIP'] = 'pp* *i686 cp36* cp37* *musllinux* *-win32 *-aarch64' + + # Set what wheels to build, if not already specified. + if 'CIBW_ARCHS' not in env_extra: + if 'CIBW_ARCHS_WINDOWS' not in env_extra: + env_extra['CIBW_ARCHS_WINDOWS'] = 'auto64' + + if 'CIBW_ARCHS_MACOS' not in env_extra: + env_extra['CIBW_ARCHS_MACOS'] = 'auto64' + + if 'CIBW_ARCHS_LINUX' not in env_extra: + env_extra['CIBW_ARCHS_LINUX'] = 'auto64' + + # Tell cibuildwheel not to use `auditwheel` on Linux and MacOS, + # because it cannot cope with us deliberately having required + # libraries in different wheel - specifically in the PyMuPDF wheel. + # + # We cannot use a subset of auditwheel's functionality + # with `auditwheel addtag` because it says `No tags + # to be added` and terminates with non-zero. See: + # https://github.com/pypa/auditwheel/issues/439. + # + env_extra['CIBW_REPAIR_WHEEL_COMMAND_LINUX'] = '' + env_extra['CIBW_REPAIR_WHEEL_COMMAND_MACOS'] = '' + + # Tell cibuildwheel how to test PyMuPDF. + if 'CIBW_TEST_COMMAND' not in env_extra: + env_extra['CIBW_TEST_COMMAND'] = f'python {{project}}/scripts/test.py test' + + # Specify python versions. 
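+ # CIBW_BUILD uses cibuildwheel's build-selector syntax; e.g. 'cp312*'
+ # matches CPython 3.12 wheels.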
+ CIBW_BUILD = env_extra.get('CIBW_BUILD') + log(f'{CIBW_BUILD=}') + if CIBW_BUILD is None: + if os.environ.get('GITHUB_ACTIONS') == 'true': + # Build/test all supported Python versions. + CIBW_BUILD = 'cp39* cp310* cp311* cp312* cp313*' + else: + # Build/test current Python only. + v = platform.python_version_tuple()[:2] + log(f'{v=}') + CIBW_BUILD = f'cp{"".join(v)}*' + + cibw_pyodide_args = '' + if cibw_pyodide: + cibw_pyodide_args = ' --platform pyodide' + env_extra['HAVE_LIBCRYPTO'] = 'no' + env_extra['PYMUPDF_SETUP_MUPDF_TESSERACT'] = '0' + if cibw_pyodide_version: + # 2025-07-21: there is no --pyodide-version option so we set + # CIBW_PYODIDE_VERSION. + env_extra['CIBW_PYODIDE_VERSION'] = cibw_pyodide_version + env_extra['CIBW_ENABLE'] = 'pyodide-prerelease' + + # Pass all the environment variables we have set, to Linux + # docker. Note that this will miss any settings in the original + # environment. + env_extra['CIBW_ENVIRONMENT_PASS_LINUX'] = ' '.join(sorted(env_extra.keys())) + + # Build for lowest (assumed first) Python version. + # + CIBW_BUILD_0 = CIBW_BUILD.split()[0] + log(f'Building for first Python version {CIBW_BUILD_0}.') + env_extra['CIBW_BUILD'] = CIBW_BUILD_0 + run(f'cd {pymupdf_dir} && cibuildwheel{cibw_pyodide_args}', env_extra=env_extra) + + # Tell cibuildwheel to build and test all specified Python versions; it + # will notice that the wheel we built above supports all versions of + # Python, so will not actually do any builds here. + # + env_extra['CIBW_BUILD'] = CIBW_BUILD + run(f'cd {pymupdf_dir} && cibuildwheel{cibw_pyodide_args}', env_extra=env_extra) + run(f'ls -ld {pymupdf_dir}/wheelhouse/*') + + +def build_pyodide_wheel(pyodide_build_version=None): + ''' + Build Pyodide wheel. + + This runs `pyodide build` inside the PyMuPDF directory, which in turn runs + setup.py in a Pyodide build environment. + ''' + log(f'## Building Pyodide wheel.') + + # Our setup.py does not know anything about Pyodide; we set a few + # required environmental variables here. + # + env_extra = dict() + + # Disable libcrypto because not available in Pyodide. + env_extra['HAVE_LIBCRYPTO'] = 'no' + + # Tell MuPDF to build for Pyodide. + env_extra['OS'] = 'pyodide' + + # Build a single wheel without a separate PyMuPDFb wheel. + env_extra['PYMUPDF_SETUP_FLAVOUR'] = 'pb' + + # 2023-08-30: We set PYMUPDF_SETUP_MUPDF_BUILD_TESSERACT=0 because + # otherwise mupdf thirdparty/tesseract/src/ccstruct/dppoint.cpp fails to + # build because `#include "errcode.h"` finds a header inside emsdk. This is + # pyodide bug https://github.com/pyodide/pyodide/issues/3839. It's fixed in + # https://github.com/pyodide/pyodide/pull/3866 but the fix has not reached + # pypi.org's pyodide-build package. E.g. currently in tag 0.23.4, but + # current devuan pyodide-build is pyodide_build-0.23.4. + # + env_extra['PYMUPDF_SETUP_MUPDF_TESSERACT'] = '0' + setup = pyodide_setup(pymupdf_dir, pyodide_build_version=pyodide_build_version) + command = f'{setup} && echo "### Running pyodide build" && pyodide build --exports whole_archive' + + command = command.replace(' && ', '\n && ') + + run(command, env_extra=env_extra) + + # Copy wheel into `wheelhouse/` so it is picked up as a workflow + # artifact. 
+ # + run(f'ls -l {pymupdf_dir}/dist/') + run(f'mkdir -p {pymupdf_dir}/wheelhouse && cp -p {pymupdf_dir}/dist/* {pymupdf_dir}/wheelhouse/') + run(f'ls -l {pymupdf_dir}/wheelhouse/') + + +def pyodide_setup( + directory, + clean=False, + pyodide_build_version=None, + ): + ''' + Returns a command that will set things up for a pyodide build. + + Args: + directory: + Our command cd's into this directory. + clean: + If true we create an entirely new environment. Otherwise + we reuse any existing emsdk repository and venv. + pyodide_build_version: + Version of Python package pyodide-build; if None we use latest + available version. + 2025-02-13: pyodide_build_version='0.29.3' works. + + The returned command does the following: + + * Checkout latest emsdk from https://github.com/emscripten-core/emsdk.git: + * Clone emsdk repository to `emsdk` if not already present. + * Run `git pull -r` inside emsdk checkout. + * Create venv `venv_pyodide_` if not already present. + * Activate venv `venv_pyodide_`. + * Install/upgrade package `pyodide-build`. + * Run emsdk install scripts and enter emsdk environment. + + Example usage in a build function: + + command = pyodide_setup() + command += ' && pyodide build --exports pyinit' + subprocess.run(command, shell=1, check=1) + ''' + + pv = platform.python_version_tuple()[:2] + assert pv == ('3', '12'), f'Pyodide builds need to be run with Python-3.12 but current Python is {platform.python_version()}.' + command = f'cd {directory}' + + # Clone/update emsdk. We always use the latest emsdk with `git pull`. + # + # 2025-02-13: this works: 2514ec738de72cebbba7f4fdba0cf2fabcb779a5 + # + dir_emsdk = 'emsdk' + if clean: + shutil.rmtree(dir_emsdk, ignore_errors=1) + # 2024-06-25: old `.pyodide-xbuildenv` directory was breaking build, so + # important to remove it here. + shutil.rmtree('.pyodide-xbuildenv', ignore_errors=1) + if not os.path.exists(f'{directory}/{dir_emsdk}'): + command += f' && echo "### Cloning emsdk.git"' + command += f' && git clone https://github.com/emscripten-core/emsdk.git {dir_emsdk}' + command += f' && echo "### Updating checkout {dir_emsdk}"' + command += f' && (cd {dir_emsdk} && git pull -r)' + command += f' && echo "### Checkout {dir_emsdk} is:"' + command += f' && (cd {dir_emsdk} && git show -s --oneline)' + + # Create and enter Python venv. + # + python = sys.executable + venv_pyodide = f'venv_pyodide_{sys.version_info[0]}.{sys.version_info[1]}' + + if not os.path.exists( f'{directory}/{venv_pyodide}'): + command += f' && echo "### Creating venv {venv_pyodide}"' + command += f' && {python} -m venv {venv_pyodide}' + command += f' && . {venv_pyodide}/bin/activate' + command += f' && echo "### Installing Python packages."' + command += f' && python -m pip install --upgrade pip wheel pyodide-build' + if pyodide_build_version: + command += f'=={pyodide_build_version}' + + # Run emsdk install scripts and enter emsdk environment. + # + command += f' && cd {dir_emsdk}' + command += ' && PYODIDE_EMSCRIPTEN_VERSION=$(pyodide config get emscripten_version)' + command += ' && echo "### PYODIDE_EMSCRIPTEN_VERSION is: $PYODIDE_EMSCRIPTEN_VERSION"' + command += ' && echo "### Running ./emsdk install"' + command += ' && ./emsdk install ${PYODIDE_EMSCRIPTEN_VERSION}' + command += ' && echo "### Running ./emsdk activate"' + command += ' && ./emsdk activate ${PYODIDE_EMSCRIPTEN_VERSION}' + command += ' && echo "### Running ./emsdk_env.sh"' + command += ' && . ./emsdk_env.sh' # Need leading `./` otherwise weird 'Not found' error. + + command += ' && cd ..' 
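+    # At this point `command` is a single `&&`-chained shell string, roughly
+    # (illustrative): "cd <dir> && ... && . venv_pyodide_3.12/bin/activate
+    # && ... && . ./emsdk_env.sh && cd ..". Callers append their own build
+    # step, e.g. `pyodide build`.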
+ return command + + +def test( + *, + env_extra, + implementations, + venv=False, + test_names=None, + pytest_options=None, + test_timeout=None, + pytest_prefix=None, + test_fitz=True, + pytest_k=None, + pybind=False, + system_packages=False, + ): + if pybind: + cpp_path = 'pymupdf_test_pybind.cpp' + cpp_exe = 'pymupdf_test_pybind.exe' + cpp = textwrap.dedent(''' + #include + + int main() + { + pybind11::scoped_interpreter guard{}; + pybind11::exec(R"( + print('Hello world', flush=1) + import pymupdf + pymupdf.JM_mupdf_show_warnings = 1 + print(f'{pymupdf.version=}', flush=1) + doc = pymupdf.Document() + pymupdf.mupdf.fz_warn('Dummy warning.') + pymupdf.mupdf.fz_warn('Dummy warning.') + pymupdf.mupdf.fz_warn('Dummy warning.') + print(f'{doc=}', flush=1) + )"); + } + ''') + def fs_read(path): + try: + with open(path) as f: + return f.read() + except Exception: + return + def fs_remove(path): + try: + os.remove(path) + except Exception: + pass + cpp_existing = fs_read(cpp_path) + if cpp == cpp_existing: + log(f'Not creating {cpp_exe} because unchanged: {cpp_path}') + else: + with open(cpp_path, 'w') as f: + f.write(cpp) + def getmtime(path): + try: + return os.path.getmtime(path) + except Exception: + return 0 + python_config = f'{os.path.realpath(sys.executable)}-config' + # `--embed` adds `-lpython3.11` to the link command, which appears to + # be necessary when building an executable. + flags = run(f'{python_config} --cflags --ldflags --embed', capture=1) + build_command = f'c++ {cpp_path} -o {cpp_exe} -g -W -Wall {flags}' + build_path = f'{cpp_exe}.cmd' + build_command_prev = fs_read(build_path) + if build_command != build_command_prev or getmtime(cpp_path) >= getmtime(cpp_exe): + fs_remove(build_path) + run(build_command) + with open(build_path, 'w') as f: + f.write(build_command) + run(f'./{cpp_exe}') + return + + pymupdf_dir_rel = gh_release.relpath(pymupdf_dir) + if not pytest_options and pytest_prefix == 'valgrind': + pytest_options = '-sv' + if pytest_k: + pytest_options += f' -k {shlex.quote(pytest_k)}' + pytest_arg = '' + if test_names: + for test_name in test_names: + pytest_arg += f' {pymupdf_dir_rel}/{test_name}' + else: + pytest_arg += f' {pymupdf_dir_rel}/tests' + python = gh_release.relpath(sys.executable) + log('Running tests with tests/run_compound.py and pytest.') + + PYODIDE_ROOT = os.environ.get('PYODIDE_ROOT') + if PYODIDE_ROOT is not None: + log(f'Not installing test packages because {PYODIDE_ROOT=}.') + command = f'{pytest_options} {pytest_arg} -s' + args = shlex.split(command) + print(f'{PYODIDE_ROOT=} so calling pytest.main(args).') + print(f'{command=}') + print(f'args are ({len(args)}):') + for arg in args: + print(f' {arg!r}') + import pytest + pytest.main(args) + return + + if venv >= 2: + run(f'pip install --upgrade {gh_release.test_packages}') + else: + log(f'{venv=}: Not installing test packages: {gh_release.test_packages}') + run_compound_args = '' + + if implementations: + run_compound_args += f' -i {implementations}' + + if test_timeout: + run_compound_args += f' -t {test_timeout}' + + if pytest_prefix in ('valgrind', 'helgrind'): + if system_packages: + log('Installing valgrind.') + run(f'sudo apt update') + run(f'sudo apt install --upgrade valgrind') + run(f'valgrind --version') + + command = f'{python} {pymupdf_dir_rel}/tests/run_compound.py{run_compound_args}' + + if pytest_prefix is None: + pass + elif pytest_prefix == 'gdb': + command += ' gdb --args' + elif pytest_prefix == 'valgrind': + env_extra['PYMUPDF_RUNNING_ON_VALGRIND'] = '1' + 
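+        # PYTHONMALLOC=malloc (set below) makes CPython use the raw system
+        # allocator instead of pymalloc, so Valgrind can track allocations
+        # without generating spurious reports.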
env_extra['PYTHONMALLOC'] = 'malloc' + command += ( + f' valgrind' + f' --suppressions={pymupdf_dir_abs}/valgrind.supp' + f' --trace-children=no' + f' --num-callers=20' + f' --error-exitcode=100' + f' --errors-for-leak-kinds=none' + f' --fullpath-after=' + ) + elif pytest_prefix == 'helgrind': + env_extra['PYMUPDF_RUNNING_ON_VALGRIND'] = '1' + env_extra['PYTHONMALLOC'] = 'malloc' + command = ( + f' valgrind' + f' --tool=helgrind' + f' --trace-children=no' + f' --num-callers=20' + f' --error-exitcode=100' + f' --fullpath-after=' + ) + else: + assert 0, f'Unrecognised {pytest_prefix=}' + + if platform.system() == 'Windows': + # `python -m pytest` doesn't seem to work. + command += ' pytest' + else: + # On OpenBSD `pip install pytest` doesn't seem to install the pytest + # command, so we use `python -m pytest ...`. + command += f' {python} -m pytest' + + command += f' {pytest_options} {pytest_arg}' + + # Always start by removing any test_*_fitz.py files. + for p in glob.glob(f'{pymupdf_dir_rel}/tests/test_*_fitz.py'): + print(f'Removing {p=}') + os.remove(p) + if test_fitz: + # Create copies of each test file, modified to use `pymupdf` + # instead of `fitz`. + for p in glob.glob(f'{pymupdf_dir_rel}/tests/test_*.py'): + if os.path.basename(p).startswith('test_fitz_'): + # Don't recursively generate test_fitz_fitz_foo.py, + # test_fitz_fitz_fitz_foo.py, ... etc. + continue + branch, leaf = os.path.split(p) + p2 = f'{branch}/{leaf[:5]}fitz_{leaf[5:]}' + print(f'Converting {p=} to {p2=}.') + with open(p, encoding='utf8') as f: + text = f.read() + text2 = re.sub("([^\'])\\bpymupdf\\b", '\\1fitz', text) + if p.replace(os.sep, '/') == f'{pymupdf_dir_rel}/tests/test_docs_samples.py'.replace(os.sep, '/'): + assert text2 == text + else: + assert text2 != text, f'Unexpectedly unchanged when creating {p!r} => {p2!r}' + with open(p2, 'w', encoding='utf8') as f: + f.write(text2) + try: + log(f'Running tests with tests/run_compound.py and pytest.') + run(command, env_extra=env_extra, timeout=test_timeout) + + except subprocess.TimeoutExpired as e: + log(f'Timeout when running tests.') + raise + finally: + log(f'\n' + f'[As of 2024-10-10 we get warnings from pytest/Python such as:\n' + f' DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute\n' + f'This seems to be due to Swig\'s handling of Py_LIMITED_API.\n' + f'For details see https://github.com/swig/swig/issues/2881.\n' + f']' + ) + log('\n' + venv_info(pytest_args=f'{pytest_options} {pytest_arg}')) + + +def get_pyproject_required(ppt=None): + ''' + Returns space-separated names of required packages in pyproject.toml. We + do not do a proper parse and rely on the packages being in a single line. + ''' + if ppt is None: + ppt = os.path.abspath(f'{__file__}/../../pyproject.toml') + with open(ppt) as f: + for line in f: + m = re.match('^requires = \\[(.*)\\]$', line) + if m: + names = m.group(1).replace(',', ' ').replace('"', '') + return names + else: + assert 0, f'Failed to find "requires" line in {ppt}' + +def wrap_get_requires_for_build_wheel(dir_): + ''' + Returns space-separated list of required + packages. Looks at `dir_`/pyproject.toml and calls + `dir_`/setup.py:get_requires_for_build_wheel(). 
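+
+    Example (illustrative): for the PyMuPDF checkout itself this might
+    return something like 'setuptools libclang swig', depending on the
+    platform-specific logic in setup.get_requires_for_build_wheel().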
+ ''' + dir_abs = os.path.abspath(dir_) + ret = list() + ppt = os.path.join(dir_abs, 'pyproject.toml') + if os.path.exists(ppt): + ret += get_pyproject_required(ppt) + if os.path.exists(os.path.join(dir_abs, 'setup.py')): + sys.path.insert(0, dir_abs) + try: + from setup import get_requires_for_build_wheel as foo + for i in foo(): + ret.append(i) + finally: + del sys.path[0] + return ' '.join(ret) + + +def venv_in(path=None): + ''' + If path is None, returns true if we are in a venv. Otherwise returns true + only if we are in venv . + ''' + if path: + return os.path.abspath(sys.prefix) == os.path.abspath(path) + else: + return sys.prefix != sys.base_prefix + + +def venv_run(args, path, recreate=True, clean=False): + ''' + Runs command inside venv and returns termination code. + + Args: + args: + List of args. + path: + Name of venv. + recreate: + If false we do not run ` -m venv ` if + already exists. This avoids a delay in the common case where + is already set up, but fails if exists but does not contain + a valid venv. + clean: + If true we first delete . + ''' + if clean: + log(f'Removing any existing venv {path}.') + assert path.startswith('venv-') + shutil.rmtree(path, ignore_errors=1) + if recreate or not os.path.isdir(path): + run(f'{sys.executable} -m venv {path}') + if platform.system() == 'Windows': + command = f'{path}\\Scripts\\activate && python' + # shlex not reliable on Windows. + # Use crude quoting with "...". Seems to work. + for arg in args: + assert '"' not in arg + command += f' "{arg}"' + else: + command = f'. {path}/bin/activate && python {shlex.join(args)}' + e = run(command, check=0) + return e + + +if __name__ == '__main__': + try: + sys.exit(main(sys.argv)) + except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e: + # Terminate relatively quietly, failed commands will usually have + # generated diagnostics. + log(f'{e}') + sys.exit(1) + # Other exceptions should not happen, and will generate a full Python + # backtrace etc here. diff -r 000000000000 -r 1d09e1dec1d9 setup.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/setup.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1493 @@ +#! /usr/bin/env python3 + +''' +Overview: + + Build script for PyMuPDF, supporting PEP-517 and simple command-line usage. + + We hard-code the URL of the MuPDF .tar.gz file that we require. This + generally points to a particular source release on mupdf.com. + + Default behaviour: + + Building an sdist: + As of 2024-002-28 we no longer download the MuPDF .tar.gz file and + embed it within the sdist. Instead it will be downloaded at build + time. + + Building PyMuPDF: + We first download the hard-coded mupdf .tar.gz file. + + Then we extract and build MuPDF locally, before building PyMuPDF + itself. So PyMuPDF will always be built with the exact MuPDF + release that we require. + + +Environmental variables: + + If building with system MuPDF (PYMUPDF_SETUP_MUPDF_BUILD is empty string): + + CFLAGS + CXXFLAGS + LDFLAGS + Added to c, c++, and link commands. + + PYMUPDF_INCLUDES + Colon-separated extra include paths. + + PYMUPDF_MUPDF_LIB + Directory containing MuPDF libraries, (libmupdf.so, + libmupdfcpp.so). + + PYMUPDF_SETUP_DEVENV + Location of devenv.com on Windows. If unset we search for it - see + wdev.py. if that fails we use just 'devenv.com'. + + PYMUPDF_SETUP_DUMMY + If 1, we build dummy sdist and wheel with no files. + + PYMUPDF_SETUP_FLAVOUR + Control building of separate wheels for PyMuPDF. + + Must be unset or a combination of 'p', 'b' and 'd'. 
+ + Default is 'pbd'. + + 'p': + Generated wheel contains PyMuPDF code. + 'b': + Generated wheel contains MuPDF libraries; these are independent of + the Python version. + 'd': + Generated wheel contains includes and libraries for MuPDF. + + If 'p' is included, the generated wheel is called PyMuPDF. + Otherwise if 'b' is included the generated wheel is called PyMuPDFb. + Otherwise if 'd' is included the generated wheel is called PyMuPDFd. + + For example: + + 'pb': a `PyMuPDF` wheel with PyMuPDF runtime files and MuPDF + runtime shared libraries. + + 'b': a `PyMuPDFb` wheel containing MuPDF runtime shared libraries. + + 'pbd' a `PyMuPDF` wheel with PyMuPDF runtime files and MuPDF + runtime shared libraries, plus MuPDF build-time files (includes, + *.lib files on Windows). + + 'd': a `PyMuPDFd` wheel containing MuPDF build-time files + (includes, *.lib files on Windows). + + PYMUPDF_SETUP_LIBCLANG + For internal testing. + + PYMUPDF_SETUP_MUPDF_BUILD + If unset or '-', use internal hard-coded default MuPDF location. + Otherwise overrides location of MuPDF when building PyMuPDF: + Empty string: + Build PyMuPDF with the system MuPDF. + A string starting with 'git:': + Use `git clone` to get a MuPDF checkout. We use the + string in the git clone command; it must contain the git + URL from which to clone, and can also contain other `git + clone` args, for example: + PYMUPDF_SETUP_MUPDF_BUILD="git:--branch master https://github.com/ArtifexSoftware/mupdf.git" + Otherwise: + Location of mupdf directory. + + PYMUPDF_SETUP_MUPDF_BSYMBOLIC + If '0' we do not link libmupdf.so with -Bsymbolic. + + PYMUPDF_SETUP_MUPDF_TESSERACT + If '0' we build MuPDF without Tesseract. + + PYMUPDF_SETUP_MUPDF_BUILD_TYPE + Unix only. Controls build type of MuPDF. Supported values are: + debug + memento + release (default) + + PYMUPDF_SETUP_MUPDF_CLEAN + Unix only. If '1', we do a clean MuPDF build. + + PYMUPDF_SETUP_MUPDF_REFCHECK_IF + Should be preprocessor statement to enable MuPDF reference count + checking. + + As of 2024-09-27, MuPDF default is `#ifndef NDEBUG`. + + PYMUPDF_SETUP_MUPDF_TRACE_IF + Should be preprocessor statement to enable MuPDF runtime diagnostics in + response to environment variables such as MUPDF_trace. + + As of 2024-09-27, MuPDF default is `#ifndef NDEBUG`. + + PYMUPDF_SETUP_MUPDF_THIRD + If '0' and we are building on Linux with the system MuPDF + (i.e. PYMUPDF_SETUP_MUPDF_BUILD=''), then don't link with + `-lmupdf-third`. + + PYMUPDF_SETUP_MUPDF_VS_UPGRADE + If '1' we run mupdf `scripts/mupdfwrap.py` with `--vs-upgrade 1` to + help Windows builds work with Visual Studio versions newer than 2019. + + PYMUPDF_SETUP_MUPDF_TGZ + If set, overrides location of MuPDF .tar.gz file: + Empty string: + Do not download MuPDF .tar.gz file. Sdist's will not contain + MuPDF. + + A string containing '://': + The URL from which to download the MuPDF .tar.gz file. Leaf + must match mupdf-*.tar.gz. + + Otherwise: + The path of local mupdf git checkout. We put all files in this + checkout known to git into a local tar archive. + + PYMUPDF_SETUP_MUPDF_OVERWRITE_CONFIG + If '0' we do not overwrite MuPDF's include/mupdf/fitz/config.h with + PyMuPDF's own configuration file, before building MuPDF. + + PYMUPDF_SETUP_MUPDF_REBUILD + If 0 we do not (re)build mupdf. + + PYMUPDF_SETUP_PY_LIMITED_API + If not '0', we build for current Python's stable ABI. 
+ + However if unset and we are on Python-3.13 or later, we do + not build for the stable ABI because as of 2025-03-04 SWIG + generates incorrect stable ABI code with Python-3.13 - see: + https://github.com/swig/swig/issues/3059 + + PYMUPDF_SETUP_URL_WHEEL + If set, we use an existing wheel instead of building a new wheel. + + If starts with `http://` or `https://`: + If ends with '/', we append our wheel name and download. Otherwise + we download directly. + + If starts with `file://`: + If ends with '/' we look for a matching wheel name, `using + pipcl.wheel_name_match()` to cope with differing platform tags, + for example our `manylinux2014_x86_64` will match with an existing + wheel with `manylinux2014_x86_64.manylinux_2_17_x86_64`. + + Any other prefix is an error. + + PYMUPDF_SETUP_SWIG + If set, we use this instead of `swig`. + + WDEV_VS_YEAR + If set, we use as Visual Studio year, for example '2019' or '2022'. + + WDEV_VS_GRADE + If set, we use as Visual Studio grade, for example 'Community' or + 'Professional' or 'Enterprise'. +''' + +import glob +import io +import os +import textwrap +import time +import platform +import re +import shlex +import shutil +import stat +import subprocess +import sys +import tarfile +import traceback +import urllib.request +import zipfile + +import pipcl + + +log = pipcl.log0 + +run = pipcl.run + + +if 1: + # For debugging. + log(f'### Starting.') + pipcl.show_system() + + +PYMUPDF_SETUP_FLAVOUR = os.environ.get( 'PYMUPDF_SETUP_FLAVOUR', 'pbd') +for i in PYMUPDF_SETUP_FLAVOUR: + assert i in 'pbd', f'Unrecognised flag "{i} in {PYMUPDF_SETUP_FLAVOUR=}. Should be one of "p", "b", "d"' + +g_root = os.path.abspath( f'{__file__}/..') + +# Name of file that identifies that we are in a PyMuPDF sdist. +g_pymupdfb_sdist_marker = 'pymupdfb_sdist' + +python_version_tuple = tuple(int(x) for x in platform.python_version_tuple()[:2]) + +PYMUPDF_SETUP_PY_LIMITED_API = os.environ.get('PYMUPDF_SETUP_PY_LIMITED_API') +assert PYMUPDF_SETUP_PY_LIMITED_API in (None, '', '0', '1'), \ + f'Should be "", "0", "1" or undefined: {PYMUPDF_SETUP_PY_LIMITED_API=}.' +if PYMUPDF_SETUP_PY_LIMITED_API is None and python_version_tuple >= (3, 13): + log(f'Not defaulting to Python limited api because {platform.python_version_tuple()=}.') + g_py_limited_api = False +else: + g_py_limited_api = (PYMUPDF_SETUP_PY_LIMITED_API != '0') + +PYMUPDF_SETUP_URL_WHEEL = os.environ.get('PYMUPDF_SETUP_URL_WHEEL') +log(f'{PYMUPDF_SETUP_URL_WHEEL=}') + +PYMUPDF_SETUP_DUMMY = os.environ.get('PYMUPDF_SETUP_DUMMY') +log(f'{PYMUPDF_SETUP_DUMMY=}') + +PYMUPDF_SETUP_SWIG = os.environ.get('PYMUPDF_SETUP_SWIG') + +def _fs_remove(path): + ''' + Removes file or directory, without raising exception if it doesn't exist. + + We assert-fail if the path still exists when we return, in case of + permission problems etc. + ''' + # First try deleting `path` as a file. + try: + os.remove( path) + except Exception as e: + pass + + if os.path.exists(path): + # Try deleting `path` as a directory. Need to use + # shutil.rmtree() callback to handle permission problems; see: + # https://docs.python.org/3/library/shutil.html#rmtree-example + # + def error_fn(fn, path, excinfo): + # Clear the readonly bit and reattempt the removal. 
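+            # Note: shutil.rmtree()'s `onerror` callback is deprecated
+            # since Python 3.12 in favour of `onexc`; both receive the
+            # failing function and path.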
+ os.chmod(path, stat.S_IWRITE) + fn(path) + shutil.rmtree( path, onerror=error_fn) + + assert not os.path.exists( path) + + +def _git_get_branch( directory): + command = f'cd {directory} && git branch --show-current' + log( f'Running: {command}') + p = subprocess.run( + command, + shell=True, + check=False, + text=True, + stdout=subprocess.PIPE, + ) + ret = None + if p.returncode == 0: + ret = p.stdout.strip() + log( f'Have found MuPDF git branch: ret={ret!r}') + return ret + + +def tar_check(path, mode='r:gz', prefix=None, remove=False): + ''' + Checks items in tar file have same , or if not None. + + We fail if items in tar file have different top-level directory names. + + path: + The tar file. + mode: + As tarfile.open(). + prefix: + If not None, we fail if tar file's is not . + + Returns the directory name (which will be if not None). + ''' + with tarfile.open( path, mode) as t: + items = t.getnames() + assert items + item = items[0] + assert not item.startswith('./') and not item.startswith('../') + s = item.find('/') + if s == -1: + prefix_actual = item + '/' + else: + prefix_actual = item[:s+1] + if prefix: + assert prefix == prefix_actual, f'{path=} {prefix=} {prefix_actual=}' + for item in items[1:]: + assert item.startswith( prefix_actual), f'prefix_actual={prefix_actual!r} != item={item!r}' + return prefix_actual + + +def tar_extract(path, mode='r:gz', prefix=None, exists='raise'): + ''' + Extracts tar file into single local directory. + + We fail if items in tar file have different . + + path: + The tar file. + mode: + As tarfile.open(). + prefix: + If not None, we fail if tar file's is not . + exists: + What to do if already exists: + 'raise': raise exception. + 'remove': remove existing file/directory before extracting. + 'return': return without extracting. + + Returns the directory name (which will be if not None, with '/' + appended if not already present). + ''' + prefix_actual = tar_check( path, mode, prefix) + if os.path.exists( prefix_actual): + if exists == 'raise': + raise Exception( f'Path already exists: {prefix_actual!r}') + elif exists == 'remove': + remove( prefix_actual) + elif exists == 'return': + log( f'Not extracting {path} because already exists: {prefix_actual}') + return prefix_actual + else: + assert 0, f'Unrecognised exists={exists!r}' + assert not os.path.exists( prefix_actual), f'Path already exists: {prefix_actual}' + log( f'Extracting {path}') + with tarfile.open( path, mode) as t: + t.extractall() + return prefix_actual + + +def git_info( directory): + ''' + Returns `(sha, comment, diff, branch)`, all items are str or None if not + available. + + directory: + Root of git checkout. + ''' + sha, comment, diff, branch = '', '', '', '' + cp = subprocess.run( + f'cd {directory} && (PAGER= git show --pretty=oneline|head -n 1 && git diff)', + capture_output=1, + shell=1, + text=1, + ) + if cp.returncode == 0: + sha, _ = cp.stdout.split(' ', 1) + comment, diff = _.split('\n', 1) + cp = subprocess.run( + f'cd {directory} && git rev-parse --abbrev-ref HEAD', + capture_output=1, + shell=1, + text=1, + ) + if cp.returncode == 0: + branch = cp.stdout.strip() + log(f'git_info(): directory={directory!r} returning branch={branch!r} sha={sha!r} comment={comment!r}') + return sha, comment, diff, branch + + +def git_patch(directory, patch, hard=False): + ''' + Applies string with `git patch` in . + + If is true we clean the tree with `git checkout .` and then apply + the patch. 
+ + Otherwise we apply patch only if it is not already applied; this might fail + if there are conflicting changes in the tree. + ''' + log(f'Applying patch in {directory}:\n{textwrap.indent(patch, " ")}') + if not patch: + return + # Carriage returns break `git apply` so we use `newline='\n'` in open(). + path = os.path.abspath(f'{directory}/pymupdf_patch.txt') + with open(path, 'w', newline='\n') as f: + f.write(patch) + log(f'Using patch file: {path}') + if hard: + run(f'cd {directory} && git checkout .') + run(f'cd {directory} && git apply {path}') + log(f'Have applied patch in {directory}.') + else: + e = run( f'cd {directory} && git apply --check --reverse {path}', check=0) + if e == 0: + log(f'Not patching {directory} because already patched.') + else: + run(f'cd {directory} && git apply {path}') + log(f'Have applied patch in {directory}.') + run(f'cd {directory} && git diff') + + +mupdf_tgz = os.path.abspath( f'{__file__}/../mupdf.tgz') + +def get_mupdf_internal(out, location=None, sha=None, local_tgz=None): + ''' + Gets MuPDF as either a .tgz or a local directory. + + Args: + out: + Either 'dir' (we return name of local directory containing mupdf) or 'tgz' (we return + name of local .tgz file containing mupdf). + location: + First, if None we set to hard-coded default URL or git location. + If starts with 'git:', should be remote git location. + Otherwise if containing '://' should be URL for .tgz. + Otherwise should path of local mupdf checkout. + sha: + If not None and we use git clone, we checkout this sha. + local_tgz: + If not None, must be local .tgz file. + Returns: + (path, location): + `path` is absolute path of local directory or .tgz containing + MuPDF, or None if we are to use system MuPDF. + + `location_out` is `location` if not None, else the hard-coded + default location. + + ''' + log(f'get_mupdf_internal(): {out=} {location=} {sha=}') + assert out in ('dir', 'tgz') + if location is None: + location = f'https://mupdf.com/downloads/archive/mupdf-{version_mupdf}-source.tar.gz' + #location = 'git:--branch master https://github.com/ArtifexSoftware/mupdf.git' + + if location == '': + # Use system mupdf. + return None, location + + local_dir = None + if local_tgz: + assert os.path.isfile(local_tgz) + elif location.startswith( 'git:'): + location_git = location[4:] + local_dir = 'mupdf-git' + + # Try to update existing checkout. + e = run(f'cd {local_dir} && git pull && git submodule update --init', check=False) + if e: + # No existing git checkout, so do a fresh clone. + _fs_remove(local_dir) + gitargs = location[4:] + run(f'git clone --recursive --depth 1 --shallow-submodules {gitargs} {local_dir}') + + # Show sha of checkout. + run( f'cd {local_dir} && git show --pretty=oneline|head -n 1', check=False) + if sha: + run( f'cd {local_dir} && git checkout {sha}') + elif '://' in location: + # Download .tgz. + local_tgz = os.path.basename( location) + suffix = '.tar.gz' + assert location.endswith(suffix), f'Unrecognised suffix in remote URL {location=}.' 
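+        # E.g. (illustrative) location=
+        # 'https://mupdf.com/downloads/archive/mupdf-1.26.7-source.tar.gz'
+        # gives local_tgz='mupdf-1.26.7-source.tar.gz' and, below,
+        # name='mupdf-1.26.7-source'.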
+ name = local_tgz[:-len(suffix)] + log( f'Download {location=} {local_tgz=} {name=}') + if os.path.exists(local_tgz): + try: + tar_check(local_tgz, 'r:gz', prefix=f'{name}/') + except Exception as e: + log(f'Not using existing file {local_tgz} because invalid tar data: {e}') + _fs_remove( local_tgz) + if os.path.exists(local_tgz): + log(f'Not downloading from {location} because already present: {local_tgz!r}') + else: + log(f'Downloading from {location=} to {local_tgz=}.') + urllib.request.urlretrieve( location, local_tgz + '-') + os.rename(local_tgz + '-', local_tgz) + assert os.path.exists( local_tgz) + tar_check( local_tgz, 'r:gz', prefix=f'{name}/') + else: + assert os.path.isdir(location), f'Local MuPDF does not exist: {location=}' + local_dir = location + + assert bool(local_dir) != bool(local_tgz) + if out == 'dir': + if not local_dir: + assert local_tgz + local_dir = tar_extract( local_tgz, exists='return') + return os.path.abspath( local_dir), location + elif out == 'tgz': + if not local_tgz: + # Create .tgz containing git files in `local_dir`. + assert local_dir + if local_dir.endswith( '/'): + local_dir = local_dir[:-1] + top = os.path.basename(local_dir) + local_tgz = f'{local_dir}.tgz' + log( f'Creating .tgz from git files. {top=} {local_dir=} {local_tgz=}') + _fs_remove( local_tgz) + with tarfile.open( local_tgz, 'w:gz') as f: + for name in pipcl.git_items( local_dir, submodules=True): + path = os.path.join( local_dir, name) + if os.path.isfile( path): + path2 = f'{top}/{name}' + log(f'Adding {path=} {path2=}.') + f.add( path, path2, recursive=False) + return os.path.abspath( local_tgz), location + else: + assert 0, f'Unrecognised {out=}' + + + +def get_mupdf_tgz(): + ''' + Creates .tgz file called containing MuPDF source, for inclusion in an + sdist. + + What we do depends on environmental variable PYMUPDF_SETUP_MUPDF_TGZ; see + docs at start of this file for details. + + Returns name of top-level directory within the .tgz file. + ''' + name, location = get_mupdf_internal( 'tgz', os.environ.get('PYMUPDF_SETUP_MUPDF_TGZ')) + return name, location + + +def get_mupdf(path=None, sha=None): + ''' + Downloads and/or extracts mupdf and returns (path, location) where `path` + is the local mupdf directory and `location` is where it came from. + + Exact behaviour depends on environmental variable + PYMUPDF_SETUP_MUPDF_BUILD; see docs at start of this file for details. + ''' + m = os.environ.get('PYMUPDF_SETUP_MUPDF_BUILD') + if m == '-': + # This allows easy specification in Github actions. + m = None + if m is None and os.path.isfile(mupdf_tgz): + # This makes us use tgz inside sdist. + log(f'Using local tgz: {mupdf_tgz=}') + return get_mupdf_internal('dir', local_tgz=mupdf_tgz) + return get_mupdf_internal('dir', m) + + +linux = sys.platform.startswith( 'linux') or 'gnu' in sys.platform +openbsd = sys.platform.startswith( 'openbsd') +freebsd = sys.platform.startswith( 'freebsd') +darwin = sys.platform.startswith( 'darwin') +windows = platform.system() == 'Windows' or platform.system().startswith('CYGWIN') +msys2 = platform.system().startswith('MSYS_NT-') + +pyodide_flags = '-fwasm-exceptions' + +if os.environ.get('PYODIDE') == '1': + if os.environ.get('OS') != 'pyodide': + log('PYODIDE=1, setting OS=pyodide.') + os.environ['OS'] = 'pyodide' + os.environ['XCFLAGS'] = pyodide_flags + os.environ['XCXXFLAGS'] = pyodide_flags + +pyodide = os.environ.get('OS') == 'pyodide' + +def build(): + ''' + pipcl.py `build_fn()` callback. 
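+
+    Returns a list of (from_, to_) items for pipcl, where `from_` is a local
+    path (or bytes content) and `to_` is the destination path within the
+    generated wheel; what is included depends on PYMUPDF_SETUP_FLAVOUR.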
+ ''' + #pipcl.show_sysconfig() + + if PYMUPDF_SETUP_DUMMY == '1': + log(f'{PYMUPDF_SETUP_DUMMY=} Building dummy wheel with no files.') + return list() + + # Download MuPDF. + # + mupdf_local, mupdf_location = get_mupdf() + if mupdf_local: + mupdf_version_tuple = get_mupdf_version(mupdf_local) + # else we cannot determine version this way and do not use it + + build_type = os.environ.get( 'PYMUPDF_SETUP_MUPDF_BUILD_TYPE', 'release') + assert build_type in ('debug', 'memento', 'release'), \ + f'Unrecognised build_type={build_type!r}' + + overwrite_config = os.environ.get('PYMUPDF_SETUP_MUPDF_OVERWRITE_CONFIG', '1') == '1' + + PYMUPDF_SETUP_MUPDF_REFCHECK_IF = os.environ.get('PYMUPDF_SETUP_MUPDF_REFCHECK_IF') + PYMUPDF_SETUP_MUPDF_TRACE_IF = os.environ.get('PYMUPDF_SETUP_MUPDF_TRACE_IF') + + # Build MuPDF shared libraries. + # + if windows: + mupdf_build_dir = build_mupdf_windows( + mupdf_local, + build_type, + overwrite_config, + g_py_limited_api, + PYMUPDF_SETUP_MUPDF_REFCHECK_IF, + PYMUPDF_SETUP_MUPDF_TRACE_IF, + ) + else: + if 'p' not in PYMUPDF_SETUP_FLAVOUR and 'b' not in PYMUPDF_SETUP_FLAVOUR: + # We only need MuPDF headers, so no point building MuPDF. + log(f'Not building MuPDF because not Windows and {PYMUPDF_SETUP_FLAVOUR=}.') + mupdf_build_dir = None + else: + mupdf_build_dir = build_mupdf_unix( + mupdf_local, + build_type, + overwrite_config, + g_py_limited_api, + PYMUPDF_SETUP_MUPDF_REFCHECK_IF, + PYMUPDF_SETUP_MUPDF_TRACE_IF, + PYMUPDF_SETUP_SWIG, + ) + log( f'build(): mupdf_build_dir={mupdf_build_dir!r}') + + # Build rebased `extra` module. + # + if 'p' in PYMUPDF_SETUP_FLAVOUR: + path_so_leaf = _build_extension( + mupdf_local, + mupdf_build_dir, + build_type, + g_py_limited_api, + ) + else: + log(f'Not building extension.') + path_so_leaf = None + + # Generate list of (from, to) items to return to pipcl. What we add depends + # on PYMUPDF_SETUP_FLAVOUR. + # + ret = list() + def add(flavour, from_, to_): + assert flavour in 'pbd' + if flavour in PYMUPDF_SETUP_FLAVOUR: + ret.append((from_, to_)) + + to_dir = 'pymupdf/' + to_dir_d = f'{to_dir}/mupdf-devel' + + # Add implementation files. + add('p', f'{g_root}/src/__init__.py', to_dir) + add('p', f'{g_root}/src/__main__.py', to_dir) + add('p', f'{g_root}/src/pymupdf.py', to_dir) + add('p', f'{g_root}/src/table.py', to_dir) + add('p', f'{g_root}/src/utils.py', to_dir) + add('p', f'{g_root}/src/_wxcolors.py', to_dir) + add('p', f'{g_root}/src/_apply_pages.py', to_dir) + add('p', f'{g_root}/src/build/extra.py', to_dir) + if path_so_leaf: + add('p', f'{g_root}/src/build/{path_so_leaf}', to_dir) + + # Add support for `fitz` backwards compatibility. + add('p', f'{g_root}/src/fitz___init__.py', 'fitz/__init__.py') + add('p', f'{g_root}/src/fitz_table.py', 'fitz/table.py') + add('p', f'{g_root}/src/fitz_utils.py', 'fitz/utils.py') + + if mupdf_local: + # Add MuPDF Python API. + add('p', f'{mupdf_build_dir}/mupdf.py', to_dir) + + # Add MuPDF shared libraries. + if windows: + wp = pipcl.wdev.WindowsPython() + add('p', f'{mupdf_build_dir}/_mupdf.pyd', to_dir) + add('b', f'{mupdf_build_dir}/mupdfcpp{wp.cpu.windows_suffix}.dll', to_dir) + + # Add Windows .lib files. + mupdf_build_dir2 = _windows_lib_directory(mupdf_local, build_type) + add('d', f'{mupdf_build_dir2}/mupdfcpp{wp.cpu.windows_suffix}.lib', f'{to_dir_d}/lib/') + if mupdf_version_tuple >= (1, 26): + # MuPDF-1.25+ language bindings build also builds libmuthreads. 
+ add('d', f'{mupdf_build_dir2}/libmuthreads.lib', f'{to_dir_d}/lib/') + elif darwin: + add('p', f'{mupdf_build_dir}/_mupdf.so', to_dir) + add('b', f'{mupdf_build_dir}/libmupdfcpp.so', to_dir) + add('b', f'{mupdf_build_dir}/libmupdf.dylib', to_dir) + add('d', f'{mupdf_build_dir}/libmupdf-threads.a', f'{to_dir_d}/lib/') + elif pyodide: + add('p', f'{mupdf_build_dir}/_mupdf.so', to_dir) + add('b', f'{mupdf_build_dir}/libmupdfcpp.so', 'PyMuPDF.libs/') + add('b', f'{mupdf_build_dir}/libmupdf.so', 'PyMuPDF.libs/') + else: + add('p', f'{mupdf_build_dir}/_mupdf.so', to_dir) + add('b', pipcl.get_soname(f'{mupdf_build_dir}/libmupdfcpp.so'), to_dir) + add('b', pipcl.get_soname(f'{mupdf_build_dir}/libmupdf.so'), to_dir) + add('d', f'{mupdf_build_dir}/libmupdf-threads.a', f'{to_dir_d}/lib/') + + if 'd' in PYMUPDF_SETUP_FLAVOUR: + # Add MuPDF C and C++ headers to `ret_d`. Would prefer to use + # pipcl.git_items() but hard-coded mupdf tree is not a git + # checkout. + # + for root in ( + f'{mupdf_local}/include', + f'{mupdf_local}/platform/c++/include', + ): + for dirpath, dirnames, filenames in os.walk(root): + for filename in filenames: + if not filename.endswith('.h'): + continue + header_abs = os.path.join(dirpath, filename) + assert header_abs.startswith(root) + header_rel = header_abs[len(root)+1:] + add('d', f'{header_abs}', f'{to_dir_d}/include/{header_rel}') + + # Add a .py file containing location of MuPDF. + try: + sha, comment, diff, branch = git_info(g_root) + except Exception as e: + log(f'Failed to get git information: {e}') + sha, comment, diff, branch = (None, None, None, None) + swig = PYMUPDF_SETUP_SWIG or 'swig' + swig_version_text = run(f'{swig} --version', capture=1) + m = re.search('\nSWIG Version ([^\n]+)', swig_version_text) + log(f'{swig_version_text=}') + assert m, f'Unrecognised {swig_version_text=}' + swig_version = m.group(1) + def int_or_0(text): + try: + return int(text) + except Exception: + return 0 + swig_version_tuple = tuple(int_or_0(i) for i in swig_version.split('.')) + log(f'{swig_version=}') + text = '' + text += f'mupdf_location = {mupdf_location!r}\n' + text += f'pymupdf_version = {version_p!r}\n' + text += f'pymupdf_git_sha = {sha!r}\n' + text += f'pymupdf_git_diff = {diff!r}\n' + text += f'pymupdf_git_branch = {branch!r}\n' + text += f'swig_version = {swig_version!r}\n' + text += f'swig_version_tuple = {swig_version_tuple!r}\n' + add('p', text.encode(), f'{to_dir}/_build.py') + + # Add single README file. + if 'p' in PYMUPDF_SETUP_FLAVOUR: + add('p', f'{g_root}/README.md', '$dist-info/README.md') + elif 'b' in PYMUPDF_SETUP_FLAVOUR: + add('b', f'{g_root}/READMEb.md', '$dist-info/README.md') + elif 'd' in PYMUPDF_SETUP_FLAVOUR: + add('d', f'{g_root}/READMEd.md', '$dist-info/README.md') + + return ret + + +def env_add(env, name, value, sep=' ', prepend=False, verbose=False): + ''' + Appends/prepends `` to `env[name]`. + + If `name` is not in `env`, we use os.environ[name] if it exists. 
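+
+    Example (a minimal sketch):
+
+        env = dict()
+        env_add(env, 'XCFLAGS', '-DTOFU_CJK_EXT')
+        # If os.environ had XCFLAGS='-O2', env['XCFLAGS'] is now
+        # '-O2 -DTOFU_CJK_EXT'; with prepend=True it would be
+        # '-DTOFU_CJK_EXT -O2'.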
+ ''' + v = env.get(name) + if verbose: + log(f'Initally: {name}={v!r}') + if v is None: + v = os.environ.get(name) + if v is None: + env[ name] = value + else: + if prepend: + env[ name] = f'{value}{sep}{v}' + else: + env[ name] = f'{v}{sep}{value}' + if verbose: + log(f'Returning with {name}={env[name]!r}') + + +def build_mupdf_windows( + mupdf_local, + build_type, + overwrite_config, + g_py_limited_api, + PYMUPDF_SETUP_MUPDF_REFCHECK_IF, + PYMUPDF_SETUP_MUPDF_TRACE_IF, + ): + + assert mupdf_local + + if overwrite_config: + mupdf_config_h = f'{mupdf_local}/include/mupdf/fitz/config.h' + prefix = '#define TOFU_CJK_EXT 1 /* PyMuPDF override. */\n' + with open(mupdf_config_h) as f: + text = f.read() + if text.startswith(prefix): + print(f'Not modifying {mupdf_config_h} because already has prefix {prefix!r}.') + else: + print(f'Prefixing {mupdf_config_h} with {prefix!r}.') + text = prefix + text + st = os.stat(mupdf_config_h) + with open(mupdf_config_h, 'w') as f: + f.write(text) + os.utime(mupdf_config_h, (st.st_atime, st.st_mtime)) + + wp = pipcl.wdev.WindowsPython() + tesseract = '' if os.environ.get('PYMUPDF_SETUP_MUPDF_TESSERACT') == '0' else 'tesseract-' + windows_build_tail = f'build\\shared-{tesseract}{build_type}' + if g_py_limited_api: + windows_build_tail += f'-Py_LIMITED_API_{pipcl.current_py_limited_api()}' + windows_build_tail += f'-x{wp.cpu.bits}-py{wp.version}' + windows_build_dir = f'{mupdf_local}\\{windows_build_tail}' + #log( f'Building mupdf.') + devenv = os.environ.get('PYMUPDF_SETUP_DEVENV') + if not devenv: + try: + # Prefer VS-2022 as that is what Github provide in windows-2022. + log(f'Looking for Visual Studio 2022.') + vs = pipcl.wdev.WindowsVS(year=2022) + except Exception as e: + log(f'Failed to find VS-2022:\n' + f'{textwrap.indent(traceback.format_exc(), " ")}' + ) + log(f'Looking for any Visual Studio.') + vs = pipcl.wdev.WindowsVS() + log(f'vs:\n{vs.description_ml(" ")}') + devenv = vs.devenv + if not devenv: + devenv = 'devenv.com' + log( f'Cannot find devenv.com in default locations, using: {devenv!r}') + command = f'cd "{mupdf_local}" && "{sys.executable}" ./scripts/mupdfwrap.py' + if os.environ.get('PYMUPDF_SETUP_MUPDF_VS_UPGRADE') == '1': + command += ' --vs-upgrade 1' + + # Would like to simply do f'... --devenv {shutil.quote(devenv)}', but + # it looks like if `devenv` has spaces then `shutil.quote()` puts it + # inside single quotes, which then appear to be ignored when run by + # subprocess.run(). + # + # So instead we strip any enclosing quotes and the enclose with + # double-quotes. 
+ # + if len(devenv) >= 2: + for q in '"', "'": + if devenv.startswith( q) and devenv.endswith( q): + devenv = devenv[1:-1] + command += f' -d {windows_build_tail}' + command += f' -b' + if PYMUPDF_SETUP_MUPDF_REFCHECK_IF: + command += f' --refcheck-if "{PYMUPDF_SETUP_MUPDF_REFCHECK_IF}"' + if PYMUPDF_SETUP_MUPDF_TRACE_IF: + command += f' --trace-if "{PYMUPDF_SETUP_MUPDF_TRACE_IF}"' + command += f' --devenv "{devenv}"' + command += f' all' + if os.environ.get( 'PYMUPDF_SETUP_MUPDF_REBUILD') == '0': + log( f'PYMUPDF_SETUP_MUPDF_REBUILD is "0" so not building MuPDF; would have run: {command}') + else: + log( f'Building MuPDF by running: {command}') + subprocess.run( command, shell=True, check=True) + log( f'Finished building mupdf.') + + return windows_build_dir + + +def _windows_lib_directory(mupdf_local, build_type): + ret = f'{mupdf_local}/platform/win32/' + if _cpu_bits() == 64: + ret += 'x64/' + if build_type == 'release': + ret += 'Release/' + elif build_type == 'debug': + ret += 'Debug/' + else: + assert 0, f'Unrecognised {build_type=}.' + return ret + + +def _cpu_bits(): + if sys.maxsize == 2**31 - 1: + return 32 + return 64 + + +def build_mupdf_unix( + mupdf_local, + build_type, + overwrite_config, + g_py_limited_api, + PYMUPDF_SETUP_MUPDF_REFCHECK_IF, + PYMUPDF_SETUP_MUPDF_TRACE_IF, + PYMUPDF_SETUP_SWIG, + ): + ''' + Builds MuPDF. + + Args: + mupdf_local: + Path of MuPDF directory or None if we are using system MuPDF. + + Returns the absolute path of build directory within MuPDF, e.g. + `.../mupdf/build/pymupdf-shared-release`, or `None` if we are using the + system MuPDF. + ''' + if not mupdf_local: + log( f'Using system mupdf.') + return None + + env = dict() + if overwrite_config: + # By predefining TOFU_CJK_EXT here, we don't need to modify + # MuPDF's include/mupdf/fitz/config.h. + log( f'Setting XCFLAGS and XCXXFLAGS to predefine TOFU_CJK_EXT.') + env_add(env, 'XCFLAGS', '-DTOFU_CJK_EXT') + env_add(env, 'XCXXFLAGS', '-DTOFU_CJK_EXT') + + if openbsd or freebsd: + env_add(env, 'CXX', 'c++', ' ') + + if darwin and os.environ.get('GITHUB_ACTIONS') == 'true': + if os.environ.get('ImageOS') == 'macos13': + # On Github macos13 we need to use Clang/LLVM (Homebrew) 15.0.7, + # otherwise mupdf:thirdparty/tesseract/src/api/baseapi.cpp fails to + # compile with: + # + # thirdparty/tesseract/src/api/baseapi.cpp:150:25: error: 'recursive_directory_iterator' is unavailable: introduced in macOS 10.15 + # + # See: + # https://github.com/actions/runner-images/blob/main/images/macos/macos-13-Readme.md + # + log(f'Using llvm@15 clang and clang++') + cl15 = pipcl.run(f'brew --prefix llvm@15', capture=1) + log(f'{cl15=}') + cl15 = cl15.strip() + pipcl.run(f'ls -lL {cl15}') + pipcl.run(f'ls -lL {cl15}/bin') + cc = f'{cl15}/bin/clang' + cxx = f'{cl15}/bin/clang++' + env['CC'] = cc + env['CXX'] = cxx + + # Show compiler versions. + cc = env.get('CC', 'cc') + cxx = env.get('CXX', 'c++') + pipcl.run(f'{cc} --version') + pipcl.run(f'{cxx} --version') + + # Add extra flags for MacOS cross-compilation, where ARCHFLAGS can be + # '-arch arm64'. + # + archflags = os.environ.get( 'ARCHFLAGS') + if archflags: + env_add(env, 'XCFLAGS', archflags) + env_add(env, 'XLIBS', archflags) + + mupdf_version_tuple = get_mupdf_version(mupdf_local) + + # We specify a build directory path containing 'pymupdf' so that we + # coexist with non-PyMuPDF builds (because PyMuPDF builds have a + # different config.h). 
+ # + # We also append further text to try to allow different builds to + # work if they reuse the mupdf directory. + # + # Using platform.machine() (e.g. 'amd64') ensures that different + # builds of mupdf on a shared filesystem can coexist. Using + # $_PYTHON_HOST_PLATFORM allows cross-compiled cibuildwheel builds + # to coexist, e.g. on github. + # + # Have experimented with looking at getconf_ARG_MAX to decide whether to + # omit `PyMuPDF-` from the build directory, to avoid command-too-long + # errors with mupdf-1.26. But it seems that `getconf ARG_MAX` returns + # a system limit, not the actual limit of the current shell, and there + # doesn't seem to be a way to find the current shell's limit. + # + build_prefix = f'PyMuPDF-' + if mupdf_version_tuple >= (1, 26): + # Avoid link command length problems seen on musllinux. + build_prefix = '' + if pyodide: + build_prefix += 'pyodide-' + else: + build_prefix += f'{platform.machine()}-' + build_prefix_extra = os.environ.get( '_PYTHON_HOST_PLATFORM') + if build_prefix_extra: + build_prefix += f'{build_prefix_extra}-' + build_prefix += 'shared-' + if msys2: + # Error in mupdf/scripts/tesseract/endianness.h: + # #error "I don't know what architecture this is!" + log(f'msys2: building MuPDF without tesseract.') + elif os.environ.get('PYMUPDF_SETUP_MUPDF_TESSERACT') == '0': + log(f'PYMUPDF_SETUP_MUPDF_TESSERACT=0 so building mupdf without tesseract.') + else: + build_prefix += 'tesseract-' + if ( + linux + and os.environ.get('PYMUPDF_SETUP_MUPDF_BSYMBOLIC', '1') == '1' + ): + log(f'Appending `bsymbolic-` to MuPDF build path.') + build_prefix += 'bsymbolic-' + log(f'{g_py_limited_api=}') + if g_py_limited_api: + build_prefix += f'Py_LIMITED_API_{pipcl.current_py_limited_api()}-' + unix_build_dir = f'{mupdf_local}/build/{build_prefix}{build_type}' + PYMUPDF_SETUP_MUPDF_CLEAN = os.environ.get('PYMUPDF_SETUP_MUPDF_CLEAN') + if PYMUPDF_SETUP_MUPDF_CLEAN == '1': + log(f'{PYMUPDF_SETUP_MUPDF_CLEAN=}, deleting {unix_build_dir=}.') + shutil.rmtree(unix_build_dir, ignore_errors=1) + # We need MuPDF's Python bindings, so we build MuPDF with + # `mupdf/scripts/mupdfwrap.py` instead of running `make`. + # + command = f'cd {mupdf_local} &&' + for n, v in env.items(): + command += f' {n}={shlex.quote(v)}' + command += f' {sys.executable} ./scripts/mupdfwrap.py' + if PYMUPDF_SETUP_SWIG: + command += f' --swig {shlex.quote(PYMUPDF_SETUP_SWIG)}' + command += f' -d build/{build_prefix}{build_type} -b' + #command += f' --m-target libs' + if PYMUPDF_SETUP_MUPDF_REFCHECK_IF: + command += f' --refcheck-if "{PYMUPDF_SETUP_MUPDF_REFCHECK_IF}"' + if PYMUPDF_SETUP_MUPDF_TRACE_IF: + command += f' --trace-if "{PYMUPDF_SETUP_MUPDF_TRACE_IF}"' + if 'p' in PYMUPDF_SETUP_FLAVOUR: + command += ' all' + else: + command += ' m01' # No need for C++/Python bindings. 
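+    # The assembled command is roughly (illustrative):
+    #   cd <mupdf> && XCFLAGS=... XCXXFLAGS=... python ./scripts/mupdfwrap.py
+    #       -d build/<prefix><build_type> -b all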
+ command += f' && echo {unix_build_dir}:' + command += f' && ls -l {unix_build_dir}' + + if os.environ.get( 'PYMUPDF_SETUP_MUPDF_REBUILD') == '0': + log( f'PYMUPDF_SETUP_MUPDF_REBUILD is "0" so not building MuPDF; would have run: {command}') + else: + log( f'Building MuPDF by running: {command}') + subprocess.run( command, shell=True, check=True) + log( f'Finished building mupdf.') + + return unix_build_dir + + +def get_mupdf_version(mupdf_dir): + path = f'{mupdf_dir}/include/mupdf/fitz/version.h' + with open(path) as f: + text = f.read() + v0 = re.search('#define FZ_VERSION_MAJOR ([0-9]+)', text) + v1 = re.search('#define FZ_VERSION_MINOR ([0-9]+)', text) + v2 = re.search('#define FZ_VERSION_PATCH ([0-9]+)', text) + assert v0 and v1 and v2, f'Cannot find MuPDF version numbers in {path=}.' + v0 = int(v0.group(1)) + v1 = int(v1.group(1)) + v2 = int(v2.group(1)) + return v0, v1, v2 + +def _fs_update(text, path): + try: + with open( path) as f: + text0 = f.read() + except OSError: + text0 = None + print(f'path={path!r} text==text0={text==text0!r}') + if text != text0: + with open( path, 'w') as f: + f.write( text) + + +def _build_extension( mupdf_local, mupdf_build_dir, build_type, g_py_limited_api): + ''' + Builds Python extension module `_extra`. + + Returns leafname of the generated shared libraries within mupdf_build_dir. + ''' + (compiler_extra, linker_extra, includes, defines, optimise, debug, libpaths, libs, libraries) \ + = _extension_flags( mupdf_local, mupdf_build_dir, build_type) + log(f'_build_extension(): {g_py_limited_api=} {defines=}') + if mupdf_local: + includes = ( + f'{mupdf_local}/platform/c++/include', + f'{mupdf_local}/include', + ) + + # Build rebased extension module. + log('Building PyMuPDF rebased.') + compile_extra_cpp = '' + if darwin: + # Avoids `error: cannot pass object of non-POD type + # 'std::nullptr_t' through variadic function; call will abort at + # runtime` when compiling `mupdf::pdf_dict_getl(..., nullptr)`. + compile_extra_cpp += ' -Wno-non-pod-varargs' + # Avoid errors caused by mupdf's C++ bindings' exception classes + # not having `nothrow` to match the base exception class. + compile_extra_cpp += ' -std=c++14' + if windows: + wp = pipcl.wdev.WindowsPython() + libs = f'mupdfcpp{wp.cpu.windows_suffix}.lib' + else: + libs = ('mupdf', 'mupdfcpp') + libraries = [ + f'{mupdf_build_dir}/libmupdf.so' + f'{mupdf_build_dir}/libmupdfcpp.so' + ] + + path_so_leaf = pipcl.build_extension( + name = 'extra', + path_i = f'{g_root}/src/extra.i', + outdir = f'{g_root}/src/build', + includes = includes, + defines = defines, + libpaths = libpaths, + libs = libs, + compiler_extra = compiler_extra + compile_extra_cpp, + linker_extra = linker_extra, + optimise = optimise, + debug = debug, + prerequisites_swig = None, + prerequisites_compile = f'{mupdf_local}/include', + prerequisites_link = libraries, + py_limited_api = g_py_limited_api, + swig = PYMUPDF_SETUP_SWIG, + ) + + return path_so_leaf + + +def _extension_flags( mupdf_local, mupdf_build_dir, build_type): + ''' + Returns various flags to pass to pipcl.build_extension(). 
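+
+    The returned tuple is (compiler_extra, linker_extra, includes, defines,
+    optimise, debug, libpaths, libs, libraries).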
+ ''' + compiler_extra = '' + linker_extra = '' + if build_type == 'memento': + compiler_extra += ' -DMEMENTO' + if mupdf_build_dir: + mupdf_build_dir_flags = os.path.basename( mupdf_build_dir).split( '-') + else: + mupdf_build_dir_flags = [build_type] + optimise = 'release' in mupdf_build_dir_flags + debug = 'debug' in mupdf_build_dir_flags + r_extra = '' + defines = list() + if windows: + defines.append('FZ_DLL_CLIENT') + wp = pipcl.wdev.WindowsPython() + if os.environ.get('PYMUPDF_SETUP_MUPDF_VS_UPGRADE') == '1': + # MuPDF C++ build uses a parallel build tree with updated VS files. + infix = 'win32-vs-upgrade' + else: + infix = 'win32' + build_type_infix = 'Debug' if debug else 'Release' + libpaths = ( + f'{mupdf_local}\\platform\\{infix}\\{wp.cpu.windows_subdir}{build_type_infix}', + f'{mupdf_local}\\platform\\{infix}\\{wp.cpu.windows_subdir}{build_type_infix}Tesseract', + ) + libs = f'mupdfcpp{wp.cpu.windows_suffix}.lib' + libraries = f'{mupdf_local}\\platform\\{infix}\\{wp.cpu.windows_subdir}{build_type_infix}\\{libs}' + compiler_extra = '' + else: + libs = ['mupdf'] + compiler_extra += ( + ' -Wall' + ' -Wno-deprecated-declarations' + ' -Wno-unused-const-variable' + ) + if mupdf_local: + libpaths = (mupdf_build_dir,) + libraries = f'{mupdf_build_dir}/{libs[0]}' + if openbsd: + compiler_extra += ' -Wno-deprecated-declarations' + else: + libpaths = os.environ.get('PYMUPDF_MUPDF_LIB') + libraries = None + if libpaths: + libpaths = libpaths.split(':') + + if mupdf_local: + includes = ( + f'{mupdf_local}/include', + f'{mupdf_local}/include/mupdf', + f'{mupdf_local}/thirdparty/freetype/include', + ) + else: + # Use system MuPDF. + includes = list() + pi = os.environ.get('PYMUPDF_INCLUDES') + if pi: + includes += pi.split(':') + pmi = os.environ.get('PYMUPDF_MUPDF_INCLUDE') + if pmi: + includes.append(pmi) + ldflags = os.environ.get('LDFLAGS') + if ldflags: + linker_extra += f' {ldflags}' + cflags = os.environ.get('CFLAGS') + if cflags: + compiler_extra += f' {cflags}' + cxxflags = os.environ.get('CXXFLAGS') + if cxxflags: + compiler_extra += f' {cxxflags}' + + if pyodide: + compiler_extra += f' {pyodide_flags}' + linker_extra += f' {pyodide_flags}' + + return compiler_extra, linker_extra, includes, defines, optimise, debug, libpaths, libs, libraries, + + +def sdist(): + ret = list() + if PYMUPDF_SETUP_DUMMY == '1': + return ret + + if PYMUPDF_SETUP_FLAVOUR == 'b': + # Create a minimal sdist that will build/install a dummy PyMuPDFb. 
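+        # Only the build scripts and pyproject.toml are included; the marker
+        # file added below tells setup.py (when building from this sdist) to
+        # produce the dummy PyMuPDFb package.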
+ for p in ( + 'setup.py', + 'pipcl.py', + 'wdev.py', + 'pyproject.toml', + ): + ret.append(p) + ret.append( + ( + b'This file indicates that we are a PyMuPDFb sdist and should build/install a dummy PyMuPDFb package.\n', + g_pymupdfb_sdist_marker, + ) + ) + return ret + + for p in pipcl.git_items( g_root): + if p.startswith( + ( + 'docs/', + 'signatures/', + '.', + ) + ): + pass + else: + ret.append(p) + if 0: + tgz, mupdf_location = get_mupdf_tgz() + if tgz: + ret.append((tgz, mupdf_tgz)) + else: + log(f'Not including MuPDF .tgz in sdist.') + return ret + + +classifier = [ + 'Development Status :: 5 - Production/Stable', + 'Intended Audience :: Developers', + 'Intended Audience :: Information Technology', + 'Operating System :: MacOS', + 'Operating System :: Microsoft :: Windows', + 'Operating System :: POSIX :: Linux', + 'Programming Language :: C', + 'Programming Language :: C++', + 'Programming Language :: Python :: 3 :: Only', + 'Programming Language :: Python :: Implementation :: CPython', + 'Topic :: Utilities', + 'Topic :: Multimedia :: Graphics', + 'Topic :: Software Development :: Libraries', + ] + +# We generate different wheels depending on PYMUPDF_SETUP_FLAVOUR. +# + +# PyMuPDF version. +version_p = '1.26.4' + +version_mupdf = '1.26.7' + +# PyMuPDFb version. This is the PyMuPDF version whose PyMuPDFb wheels we will +# (re)use if generating separate PyMuPDFb wheels. Though as of PyMuPDF-1.24.11 +# (2024-10-03) we no longer use PyMuPDFb wheels so this is actually unused. +# +version_b = '1.26.3' + +if os.path.exists(f'{g_root}/{g_pymupdfb_sdist_marker}'): + + # We are in a PyMuPDFb sdist. We specify a dummy package so that pip builds + # from sdists work - pip's build using PyMuPDF's sdist will already create + # the required binaries, but pip will still see `requires_dist` set to + # 'PyMuPDFb', so will also download and build PyMuPDFb's sdist. + # + log(f'Specifying dummy PyMuPDFb wheel.') + + def get_requires_for_build_wheel(config_settings=None): + return list() + + p = pipcl.Package( + 'PyMuPDFb', + version_b, + summary = 'Dummy PyMuPDFb wheel', + description = '', + author = 'Artifex', + author_email = 'support@artifex.com', + license = 'GNU AFFERO GPL 3.0', + tag_python = 'py3', + ) + +else: + # A normal PyMuPDF package. + + with open( f'{g_root}/README.md', encoding='utf-8') as f: + readme_p = f.read() + + with open( f'{g_root}/READMEb.md', encoding='utf-8') as f: + readme_b = f.read() + + with open( f'{g_root}/READMEd.md', encoding='utf-8') as f: + readme_d = f.read() + + tag_python = None + requires_dist = list() + entry_points = None + + if 'p' in PYMUPDF_SETUP_FLAVOUR: + version = version_p + name = 'PyMuPDF' + readme = readme_p + summary = 'A high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.' + if 'b' not in PYMUPDF_SETUP_FLAVOUR: + requires_dist.append(f'PyMuPDFb =={version_b}') + # Create a `pymupdf` command. + entry_points = textwrap.dedent(''' + [console_scripts] + pymupdf = pymupdf.__main__:main + ''') + elif 'b' in PYMUPDF_SETUP_FLAVOUR: + version = version_b + name = 'PyMuPDFb' + readme = readme_b + summary = 'MuPDF shared libraries for PyMuPDF.' + tag_python = 'py3' + elif 'd' in PYMUPDF_SETUP_FLAVOUR: + version = version_b + name = 'PyMuPDFd' + readme = readme_d + summary = 'MuPDF build-time files for PyMuPDF.' + tag_python = 'py3' + else: + assert 0, f'Unrecognised {PYMUPDF_SETUP_FLAVOUR=}.' 
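+
+    # Summary of the branches above (flavour -> generated package):
+    #   'p' present             -> PyMuPDF  (version_p)
+    #   'b' present, no 'p'     -> PyMuPDFb (version_b, tag 'py3')
+    #   'd' present, no 'p'/'b' -> PyMuPDFd (version_b, tag 'py3')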
+ + if os.environ.get('PYODIDE_ROOT'): + # We can't pip install pytest on pyodide, so specify it here. + requires_dist.append('pytest') + + p = pipcl.Package( + name, + version, + summary = summary, + description = readme, + description_content_type = 'text/markdown', + classifier = classifier, + author = 'Artifex', + author_email = 'support@artifex.com', + requires_dist = requires_dist, + requires_python = '>=3.9', + license = 'Dual Licensed - GNU AFFERO GPL 3.0 or Artifex Commercial License', + project_url = [ + ('Documentation, https://pymupdf.readthedocs.io/'), + ('Source, https://github.com/pymupdf/pymupdf'), + ('Tracker, https://github.com/pymupdf/PyMuPDF/issues'), + ('Changelog, https://pymupdf.readthedocs.io/en/latest/changes.html'), + ], + + entry_points = entry_points, + + fn_build=build, + fn_sdist=sdist, + + tag_python=tag_python, + py_limited_api=g_py_limited_api, + + # 30MB: 9 ZIP_DEFLATED + # 28MB: 9 ZIP_BZIP2 + # 23MB: 9 ZIP_LZMA + #wheel_compression = zipfile.ZIP_DEFLATED if (darwin or pyodide) else zipfile.ZIP_LZMA, + wheel_compresslevel = 9, + ) + + def get_requires_for_build_wheel(config_settings=None): + ''' + Adds to pyproject.toml:[build-system]:requires, allowing programmatic + control over what packages we require. + ''' + def platform_release_tuple(): + r = platform.release() + r = r.split('.') + r = tuple(int(i) for i in r) + log(f'platform_release_tuple() returning {r=}.') + return r + + ret = list() + libclang = os.environ.get('PYMUPDF_SETUP_LIBCLANG') + if libclang: + print(f'Overriding to use {libclang=}.') + ret.append(libclang) + elif openbsd: + print(f'OpenBSD: libclang not available via pip; assuming `pkg_add py3-llvm`.') + elif darwin and platform.machine() == 'arm64': + print(f'MacOS/arm64: forcing use of libclang 16.0.6 because 18.1.1 known to fail with `clang.cindex.TranslationUnitLoadError: Error parsing translation unit.`') + ret.append('libclang==16.0.6') + elif darwin and platform_release_tuple() < (18,): + # There are still of problems when building on old macos. + ret.append('libclang==14.0.6') + else: + ret.append('libclang') + if msys2: + print(f'msys2: pip install of swig does not build; assuming `pacman -S swig`.') + elif openbsd: + print(f'OpenBSD: pip install of swig does not build; assuming `pkg_add swig`.') + else: + ret.append( 'swig') + return ret + + +if PYMUPDF_SETUP_URL_WHEEL: + def build_wheel( + wheel_directory, + config_settings=None, + metadata_directory=None, + p=p, + ): + ''' + Instead of building wheel, we look for and copy a wheel from location + specified by PYMUPDF_SETUP_URL_WHEEL. + ''' + log(f'{PYMUPDF_SETUP_URL_WHEEL=}') + log(f'{p.wheel_name()=}') + url = PYMUPDF_SETUP_URL_WHEEL + if url.startswith(('http://', 'https://')): + leaf = p.wheel_name() + out_path = f'{wheel_directory}{leaf}' + out_path_temp = out_path + '-' + if url.endswith('/'): + url += leaf + log(f'Downloading from {url=} to {out_path_temp=}.') + urllib.request.urlretrieve(url, out_path_temp) + elif url.startswith(f'file://'): + in_path = url[len('file://'):] + log(f'{in_path=}') + if in_path.endswith('/'): + # Look for matching wheel within this directory. 
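+                # pipcl.wheel_name_match() tolerates extra platform tags,
+                # e.g. (illustrative) a wheel tagged
+                # manylinux2014_x86_64.manylinux_2_17_x86_64 matches our
+                # manylinux2014_x86_64 wheel name; see the
+                # PYMUPDF_SETUP_URL_WHEEL notes at the top of this file.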
+ wheels = glob.glob(f'{in_path}*.whl') + log(f'{len(wheels)=}') + for in_path in wheels: + log(f'{in_path=}') + leaf = os.path.basename(in_path) + if p.wheel_name_match(leaf): + log(f'Match: {in_path=}') + break + else: + message = f'Cannot find matching for {p.wheel_name()=} in ({len(wheels)=}):\n' + wheels_text = '' + for wheel in wheels: + wheels_text += f' {wheel}\n' + assert 0, f'Cannot find matching for {p.wheel_name()=} in:\n{wheels_text}' + else: + leaf = os.path.basename(in_path) + out_path = os.path.join(wheel_directory, leaf) + out_path_temp = out_path + '-' + log(f'Copying from {in_path=} to {out_path_temp=}.') + shutil.copy2(in_path, out_path_temp) + else: + assert 0, f'Unrecognised prefix in {PYMUPDF_SETUP_URL_WHEEL=}.' + + log(f'Renaming from:\n {out_path_temp}\nto:\n {out_path}.') + os.rename(out_path_temp, out_path) + return os.path.basename(out_path) +else: + build_wheel = p.build_wheel + +build_sdist = p.build_sdist + + +if __name__ == '__main__': + p.handle_argv(sys.argv) diff -r 000000000000 -r 1d09e1dec1d9 src/__init__.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/__init__.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,21308 @@ +''' +PyMuPDF implemented on top of MuPDF Python bindings. + +License: + + SPDX-License-Identifier: GPL-3.0-only +''' + +# To reduce startup times, we don't import everything we require here. +# +import atexit +import binascii +import collections +import inspect +import io +import math +import os +import pathlib +import glob +import re +import string +import sys +import tarfile +import time +import typing +import warnings +import weakref +import zipfile + +from . import extra + + +# Set up g_out_log and g_out_message from environment variables. +# +# PYMUPDF_MESSAGE controls the destination of user messages (from function +# `pymupdf.message()`). +# +# PYMUPDF_LOG controls the destination of internal development logging (from +# function `pymupdf.log()`). +# +# For syntax, see _make_output()'s `text` arg. +# + +def _make_output( + *, + text=None, + fd=None, + stream=None, + path=None, + path_append=None, + pylogging=None, + pylogging_logger=None, + pylogging_level=None, + pylogging_name=None, + default=None, + ): + ''' + Returns a stream that writes to a specified destination, which can be a + file descriptor, a file, an existing stream or Python's `logging' system. + + Args: + text: text specification of destination. + fd: - write to file descriptor. + path: - write to file. + path+: - append to file. + logging: - write to Python `logging` module. + items: comma-separated pairs. + level= + name=. + Other names are ignored. + + fd: an int file descriptor. + stream: something with methods .write(text) and .flush(). + If specified we simply return . + path: a file path. + If specified we return a stream that writes to this file. + path_append: a file path. + If specified we return a stream that appends to this file. + pylogging*: + if any of these args is not None, we return a stream that writes to + Python's `logging` module. + + pylogging: + Unused other than to activate use of logging module. + pylogging_logger: + A logging.Logger; If None, set from . + pylogging_level: + An int log level, if None we use + pylogging_logger.getEffectiveLevel(). + pylogging_name: + Only used if is None: + If is None, we set it to 'pymupdf'. + Then we do: pylogging_logger = logging.getLogger(pylogging_name) + ''' + if text is not None: + # Textual specification, for example from from environment variable. 
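+        # Some illustrative values (the file name is an example only):
+        #   PYMUPDF_MESSAGE='fd:2'                          write to file descriptor 2 (usually stderr).
+        #   PYMUPDF_MESSAGE='path:/tmp/pymupdf.txt'         (over)write this file.
+        #   PYMUPDF_LOG='logging:level=10,name=pymupdf'     use Python's `logging` module.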
+ if text.startswith('fd:'): + fd = int(text[3:]) + elif text.startswith('path:'): + path = text[5:] + elif text.startswith('path+'): + path_append = text[5:] + elif text.startswith('logging:'): + pylogging = True + items_d = dict() + items = text[8:].split(',') + #items_d = {n: v for (n, v) in [item.split('=', 1) for item in items]} + for item in items: + if not item: + continue + nv = item.split('=', 1) + assert len(nv) == 2, f'Need `=` in {item=}.' + n, v = nv + items_d[n] = v + pylogging_level = items_d.get('level') + if pylogging_level is not None: + pylogging_level = int(pylogging_level) + pylogging_name = items_d.get('name', 'pymupdf') + else: + assert 0, f'Expected prefix `fd:`, `path:`. `path+:` or `logging:` in {text=}.' + + if fd is not None: + ret = open(fd, mode='w', closefd=False) + elif stream is not None: + assert hasattr(stream, 'write') + assert hasattr(stream, 'flush') + ret = stream + elif path is not None: + ret = open(path, 'w') + elif path_append is not None: + ret = open(path_append, 'a') + elif (0 + or pylogging is not None + or pylogging_logger is not None + or pylogging_level is not None + or pylogging_name is not None + ): + import logging + if pylogging_logger is None: + if pylogging_name is None: + pylogging_name = 'pymupdf' + pylogging_logger = logging.getLogger(pylogging_name) + assert isinstance(pylogging_logger, logging.Logger) + if pylogging_level is None: + pylogging_level = pylogging_logger.getEffectiveLevel() + class Out: + def write(self, text): + # `logging` module appends newlines, but so does the `print()` + # functions in our caller message() and log() fns, so we need to + # remove them here. + text = text.rstrip('\n') + if text: + pylogging_logger.log(pylogging_level, text) + def flush(self): + pass + ret = Out() + else: + ret = default + return ret + +# Set steam used by PyMuPDF messaging. +_g_out_message = _make_output(text=os.environ.get('PYMUPDF_MESSAGE'), default=sys.stdout) + +# Set steam used by PyMuPDF development/debugging logging. +_g_out_log = _make_output(text=os.environ.get('PYMUPDF_LOG'), default=sys.stdout) + +# Things for testing logging. +_g_log_items = list() +_g_log_items_active = False + +def _log_items(): + return _g_log_items + +def _log_items_active(active): + global _g_log_items_active + _g_log_items_active = active + +def _log_items_clear(): + del _g_log_items[:] + + +def set_messages( + *, + text=None, + fd=None, + stream=None, + path=None, + path_append=None, + pylogging=None, + pylogging_logger=None, + pylogging_level=None, + pylogging_name=None, + ): + ''' + Sets destination of PyMuPDF messages. See _make_output() for details. + ''' + global _g_out_message + _g_out_message = _make_output( + text=text, + fd=fd, + stream=stream, + path=path, + path_append=path_append, + pylogging=pylogging, + pylogging_logger=pylogging_logger, + pylogging_level=pylogging_level, + pylogging_name=pylogging_name, + default=_g_out_message, + ) + +def set_log( + *, + text=None, + fd=None, + stream=None, + path=None, + path_append=None, + pylogging=None, + pylogging_logger=None, + pylogging_level=None, + pylogging_name=None, + ): + ''' + Sets destination of PyMuPDF development/debugging logging. See + _make_output() for details. 
+ ''' + global _g_out_log + _g_out_log = _make_output( + text=text, + fd=fd, + stream=stream, + path=path, + path_append=path_append, + pylogging=pylogging, + pylogging_logger=pylogging_logger, + pylogging_level=pylogging_level, + pylogging_name=pylogging_name, + default=_g_out_log, + ) + +def log( text='', caller=1): + ''' + For development/debugging diagnostics. + ''' + try: + stack = inspect.stack(context=0) + except StopIteration: + pass + else: + frame_record = stack[caller] + try: + filename = os.path.relpath(frame_record.filename) + except Exception: # Can fail on windows. + filename = frame_record.filename + line = frame_record.lineno + function = frame_record.function + text = f'{filename}:{line}:{function}(): {text}' + if _g_log_items_active: + _g_log_items.append(text) + if _g_out_log: + print(text, file=_g_out_log, flush=1) + + +def message(text=''): + ''' + For user messages. + ''' + # It looks like `print()` does nothing if sys.stdout is None (without + # raising an exception), but we don't rely on this. + if _g_out_message: + print(text, file=_g_out_message, flush=1) + + +def exception_info(): + import traceback + log(f'exception_info:') + log(traceback.format_exc()) + + +# PDF names must not contain these characters: +INVALID_NAME_CHARS = set(string.whitespace + "()<>[]{}/%" + chr(0)) + +def get_env_bool( name, default): + ''' + Returns `True`, `False` or `default` depending on whether $ is '1', + '0' or unset. Otherwise assert-fails. + ''' + v = os.environ.get( name) + if v is None: + ret = default + elif v == '1': + ret = True + elif v == '0': + ret = False + else: + assert 0, f'Unrecognised value for {name}: {v!r}' + if ret != default: + log(f'Using non-default setting from {name}: {v!r}') + return ret + +def get_env_int( name, default): + ''' + Returns `True`, `False` or `default` depending on whether $ is '1', + '0' or unset. Otherwise assert-fails. + ''' + v = os.environ.get( name) + if v is None: + ret = default + else: + ret = int(v) + if ret != default: + log(f'Using non-default setting from {name}: {v}') + return ret + +# All our `except ...` blocks output diagnostics if `g_exceptions_verbose` is +# true. +g_exceptions_verbose = get_env_int( 'PYMUPDF_EXCEPTIONS_VERBOSE', 1) + +# $PYMUPDF_USE_EXTRA overrides whether to use optimised C fns in `extra`. +# +g_use_extra = get_env_bool( 'PYMUPDF_USE_EXTRA', True) + + +# Global switches +# + +class _Globals: + def __init__(self): + self.no_device_caching = 0 + self.small_glyph_heights = 0 + self.subset_fontnames = 0 + self.skip_quad_corrections = 0 + +_globals = _Globals() + + +# Optionally use MuPDF via cppyy bindings; experimental and not tested recently +# as of 2023-01-20 11:51:40 +# +mupdf_cppyy = os.environ.get( 'MUPDF_CPPYY') +if mupdf_cppyy is not None: + # pylint: disable=all + log( f'{__file__}: $MUPDF_CPPYY={mupdf_cppyy!r} so attempting to import mupdf_cppyy.') + log( f'{__file__}: $PYTHONPATH={os.environ["PYTHONPATH"]}') + if mupdf_cppyy == '': + import mupdf_cppyy + else: + import importlib + mupdf_cppyy = importlib.machinery.SourceFileLoader( + 'mupdf_cppyy', + mupdf_cppyy + ).load_module() + mupdf = mupdf_cppyy.cppyy.gbl.mupdf +else: + # Use MuPDF Python SWIG bindings. We allow import from either our own + # directory for conventional wheel installs, or from separate place in case + # we are using a separately-installed system installation of mupdf. + # + try: + from . 
import mupdf + except Exception: + import mupdf + if hasattr(mupdf, 'internal_check_ndebug'): + mupdf.internal_check_ndebug() + mupdf.reinit_singlethreaded() + +def _int_rc(text): + ''' + Converts string to int, ignoring trailing 'rc...'. + ''' + rc = text.find('rc') + if rc >= 0: + text = text[:rc] + return int(text) + +# Basic version information. +# +# (We use `noqa F401` to avoid flake8 errors such as `F401 +# '._build.mupdf_location' imported but unused`. +# +from ._build import mupdf_location # noqa F401 +from ._build import pymupdf_git_branch # noqa F401 +from ._build import pymupdf_git_diff # noqa F401 +from ._build import pymupdf_git_sha # noqa F401 +from ._build import pymupdf_version # noqa F401 +from ._build import swig_version # noqa F401 +from ._build import swig_version_tuple # noqa F401 + +mupdf_version = mupdf.FZ_VERSION + +# Removed in PyMuPDF-1.26.1. +pymupdf_date = None + +# Versions as tuples; useful when comparing versions. +# +pymupdf_version_tuple = tuple( [_int_rc(i) for i in pymupdf_version.split('.')]) +mupdf_version_tuple = tuple( [_int_rc(i) for i in mupdf_version.split('.')]) + +assert mupdf_version_tuple == (mupdf.FZ_VERSION_MAJOR, mupdf.FZ_VERSION_MINOR, mupdf.FZ_VERSION_PATCH), \ + f'Inconsistent MuPDF version numbers: {mupdf_version_tuple=} != {(mupdf.FZ_VERSION_MAJOR, mupdf.FZ_VERSION_MINOR, mupdf.FZ_VERSION_PATCH)=}' + +# Legacy version information. +# +version = (pymupdf_version, mupdf_version, None) +VersionFitz = mupdf_version +VersionBind = pymupdf_version +VersionDate = None + + +# String formatting. + +def _format_g(value, *, fmt='%g'): + ''' + Returns `value` formatted with mupdf.fz_format_double() if available, + otherwise with Python's `%`. + + If `value` is a list or tuple, we return a space-separated string of + formatted values. + ''' + if isinstance(value, (list, tuple)): + ret = '' + for v in value: + if ret: + ret += ' ' + ret += _format_g(v, fmt=fmt) + return ret + else: + return mupdf.fz_format_double(fmt, value) + +format_g = _format_g + +# ByteString is gone from typing in 3.14. +# collections.abc.Buffer available from 3.12 only +try: + ByteString = typing.ByteString +except AttributeError: + ByteString = bytes | bytearray | memoryview + +# Names required by class method typing annotations. +OptBytes = typing.Optional[ByteString] +OptDict = typing.Optional[dict] +OptFloat = typing.Optional[float] +OptInt = typing.Union[int, None] +OptSeq = typing.Optional[typing.Sequence] +OptStr = typing.Optional[str] + +Page = 'Page_forward_decl' +Point = 'Point_forward_decl' + +matrix_like = 'matrix_like' +point_like = 'point_like' +quad_like = 'quad_like' +rect_like = 'rect_like' + + +def _as_fz_document(document): + ''' + Returns document as a mupdf.FzDocument, upcasting as required. Raises + 'document closed' exception if closed. + ''' + if isinstance(document, Document): + if document.is_closed: + raise ValueError('document closed') + document = document.this + if isinstance(document, mupdf.FzDocument): + return document + elif isinstance(document, mupdf.PdfDocument): + return document.super() + elif document is None: + assert 0, f'document is None' + else: + assert 0, f'Unrecognised {type(document)=}' + +def _as_pdf_document(document, required=True): + ''' + Returns `document` downcast to a mupdf.PdfDocument. If downcast fails (i.e. + `document` is not actually a `PdfDocument`) then we assert-fail if `required` + is true (the default) else return a `mupdf.PdfDocument` with `.m_internal` + false. 
+ ''' + if isinstance(document, Document): + if document.is_closed: + raise ValueError('document closed') + document = document.this + if isinstance(document, mupdf.PdfDocument): + return document + elif isinstance(document, mupdf.FzDocument): + ret = mupdf.PdfDocument(document) + if required: + assert ret.m_internal + return ret + elif document is None: + assert 0, f'document is None' + else: + assert 0, f'Unrecognised {type(document)=}' + +def _as_fz_page(page): + ''' + Returns page as a mupdf.FzPage, upcasting as required. + ''' + if isinstance(page, Page): + page = page.this + if isinstance(page, mupdf.PdfPage): + return page.super() + elif isinstance(page, mupdf.FzPage): + return page + elif page is None: + assert 0, f'page is None' + else: + assert 0, f'Unrecognised {type(page)=}' + +def _as_pdf_page(page, required=True): + ''' + Returns `page` downcast to a mupdf.PdfPage. If downcast fails (i.e. `page` + is not actually a `PdfPage`) then we assert-fail if `required` is true (the + default) else return a `mupdf.PdfPage` with `.m_internal` false. + ''' + if isinstance(page, Page): + page = page.this + if isinstance(page, mupdf.PdfPage): + return page + elif isinstance(page, mupdf.FzPage): + ret = mupdf.pdf_page_from_fz_page(page) + if required: + assert ret.m_internal + return ret + elif page is None: + assert 0, f'page is None' + else: + assert 0, f'Unrecognised {type(page)=}' + + +def _pdf_annot_page(annot): + ''' + Wrapper for mupdf.pdf_annot_page() which raises an exception if + is not bound to a page instead of returning a mupdf.PdfPage with + `.m_internal=None`. + + [Some other MuPDF functions such as pdf_update_annot()` already raise a + similar exception if a pdf_annot's .page field is null.] + ''' + page = mupdf.pdf_annot_page(annot) + if not page.m_internal: + raise RuntimeError('Annot is not bound to a page') + return page + + +# Fixme: we don't support JM_MEMORY=1. 
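+# (Illustrative note on the _as_*()/_pdf_annot_page() helpers above; the
+# variable names are examples.) They let the rest of this module accept either
+# the wrapper classes or raw MuPDF objects, e.g.:
+#
+#   pdf_doc = _as_pdf_document(doc)     # Document/FzDocument -> mupdf.PdfDocument
+#   fz_page = _as_fz_page(page)         # Page/PdfPage -> mupdf.FzPage
+#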
+JM_MEMORY = 0 + +# Classes +# + +class Annot: + + def __init__(self, annot): + assert isinstance( annot, mupdf.PdfAnnot) + self.this = annot + + def __repr__(self): + parent = getattr(self, 'parent', '<>') + return "'%s' annotation on %s" % (self.type[1], str(parent)) + + def __str__(self): + return self.__repr__() + + def _erase(self): + if getattr(self, "thisown", False): + self.thisown = False + + def _get_redact_values(self): + annot = self.this + if mupdf.pdf_annot_type(annot) != mupdf.PDF_ANNOT_REDACT: + return + + values = dict() + try: + obj = mupdf.pdf_dict_gets(mupdf.pdf_annot_obj(annot), "RO") + if obj.m_internal: + message_warning("Ignoring redaction key '/RO'.") + xref = mupdf.pdf_to_num(obj) + values[dictkey_xref] = xref + obj = mupdf.pdf_dict_gets(mupdf.pdf_annot_obj(annot), "OverlayText") + if obj.m_internal: + text = mupdf.pdf_to_text_string(obj) + values[dictkey_text] = JM_UnicodeFromStr(text) + else: + values[dictkey_text] = '' + obj = mupdf.pdf_dict_get(mupdf.pdf_annot_obj(annot), PDF_NAME('Q')) + align = 0 + if obj.m_internal: + align = mupdf.pdf_to_int(obj) + values[dictkey_align] = align + except Exception: + if g_exceptions_verbose: exception_info() + return + val = values + + if not val: + return val + val["rect"] = self.rect + text_color, fontname, fontsize = TOOLS._parse_da(self) + val["text_color"] = text_color + val["fontname"] = fontname + val["fontsize"] = fontsize + fill = self.colors["fill"] + val["fill"] = fill + return val + + def _getAP(self): + if g_use_extra: + assert isinstance( self.this, mupdf.PdfAnnot) + ret = extra.Annot_getAP(self.this) + assert isinstance( ret, bytes) + return ret + else: + r = None + res = None + annot = self.this + assert isinstance( annot, mupdf.PdfAnnot) + annot_obj = mupdf.pdf_annot_obj( annot) + ap = mupdf.pdf_dict_getl( annot_obj, PDF_NAME('AP'), PDF_NAME('N')) + if mupdf.pdf_is_stream( ap): + res = mupdf.pdf_load_stream( ap) + if res and res.m_internal: + r = JM_BinFromBuffer(res) + return r + + def _setAP(self, buffer_, rect=0): + try: + annot = self.this + annot_obj = mupdf.pdf_annot_obj( annot) + page = _pdf_annot_page(annot) + apobj = mupdf.pdf_dict_getl( annot_obj, PDF_NAME('AP'), PDF_NAME('N')) + if not apobj.m_internal: + raise RuntimeError( MSG_BAD_APN) + if not mupdf.pdf_is_stream( apobj): + raise RuntimeError( MSG_BAD_APN) + res = JM_BufferFromBytes( buffer_) + if not res.m_internal: + raise ValueError( MSG_BAD_BUFFER) + JM_update_stream( page.doc(), apobj, res, 1) + if rect: + bbox = mupdf.pdf_dict_get_rect( annot_obj, PDF_NAME('Rect')) + mupdf.pdf_dict_put_rect( apobj, PDF_NAME('BBox'), bbox) + except Exception: + if g_exceptions_verbose: exception_info() + + def _update_appearance(self, opacity=-1, blend_mode=None, fill_color=None, rotate=-1): + annot = self.this + assert annot.m_internal + annot_obj = mupdf.pdf_annot_obj( annot) + page = _pdf_annot_page(annot) + pdf = page.doc() + type_ = mupdf.pdf_annot_type( annot) + nfcol, fcol = JM_color_FromSequence(fill_color) + + try: + # remove fill color from unsupported annots + # or if so requested + if nfcol == 0 or type_ not in ( + mupdf.PDF_ANNOT_SQUARE, + mupdf.PDF_ANNOT_CIRCLE, + mupdf.PDF_ANNOT_LINE, + mupdf.PDF_ANNOT_POLY_LINE, + mupdf.PDF_ANNOT_POLYGON + ): + mupdf.pdf_dict_del( annot_obj, PDF_NAME('IC')) + elif nfcol > 0: + mupdf.pdf_set_annot_interior_color( annot, fcol[:nfcol]) + + insert_rot = 1 if rotate >= 0 else 0 + if type_ not in ( + mupdf.PDF_ANNOT_CARET, + mupdf.PDF_ANNOT_CIRCLE, + mupdf.PDF_ANNOT_FREE_TEXT, + mupdf.PDF_ANNOT_FILE_ATTACHMENT, 
+ mupdf.PDF_ANNOT_INK, + mupdf.PDF_ANNOT_LINE, + mupdf.PDF_ANNOT_POLY_LINE, + mupdf.PDF_ANNOT_POLYGON, + mupdf.PDF_ANNOT_SQUARE, + mupdf.PDF_ANNOT_STAMP, + mupdf.PDF_ANNOT_TEXT, + ): + insert_rot = 0 + + if insert_rot: + mupdf.pdf_dict_put_int(annot_obj, PDF_NAME('Rotate'), rotate) + + # insert fill color + if type_ == mupdf.PDF_ANNOT_FREE_TEXT: + if nfcol > 0: + mupdf.pdf_set_annot_color(annot, fcol[:nfcol]) + elif nfcol > 0: + col = mupdf.pdf_new_array(page.doc(), nfcol) + for i in range( nfcol): + mupdf.pdf_array_push_real(col, fcol[i]) + mupdf.pdf_dict_put(annot_obj, PDF_NAME('IC'), col) + mupdf.pdf_dirty_annot(annot) + mupdf.pdf_update_annot(annot) # let MuPDF update + pdf.resynth_required = 0 + except Exception as e: + if g_exceptions_verbose: + exception_info() + message( f'cannot update annot: {e}') + raise + + if (opacity < 0 or opacity >= 1) and not blend_mode: # no opacity, no blend_mode + return True + + try: # create or update /ExtGState + ap = mupdf.pdf_dict_getl( + mupdf.pdf_annot_obj(annot), + PDF_NAME('AP'), + PDF_NAME('N') + ) + if not ap.m_internal: # should never happen + raise RuntimeError( MSG_BAD_APN) + + resources = mupdf.pdf_dict_get( ap, PDF_NAME('Resources')) + if not resources.m_internal: # no Resources yet: make one + resources = mupdf.pdf_dict_put_dict( ap, PDF_NAME('Resources'), 2) + + alp0 = mupdf.pdf_new_dict( page.doc(), 3) + if opacity >= 0 and opacity < 1: + mupdf.pdf_dict_put_real( alp0, PDF_NAME('CA'), opacity) + mupdf.pdf_dict_put_real( alp0, PDF_NAME('ca'), opacity) + mupdf.pdf_dict_put_real( annot_obj, PDF_NAME('CA'), opacity) + + if blend_mode: + mupdf.pdf_dict_put_name( alp0, PDF_NAME('BM'), blend_mode) + mupdf.pdf_dict_put_name( annot_obj, PDF_NAME('BM'), blend_mode) + + extg = mupdf.pdf_dict_get( resources, PDF_NAME('ExtGState')) + if not extg.m_internal: # no ExtGState yet: make one + extg = mupdf.pdf_dict_put_dict( resources, PDF_NAME('ExtGState'), 2) + + mupdf.pdf_dict_put( extg, PDF_NAME('H'), alp0) + + except Exception as e: + if g_exceptions_verbose: exception_info() + message( f'cannot set opacity or blend mode\n: {e}') + raise + + return True + + @property + def apn_bbox(self): + """annotation appearance bbox""" + CheckParent(self) + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + ap = mupdf.pdf_dict_getl(annot_obj, PDF_NAME('AP'), PDF_NAME('N')) + if not ap.m_internal: + val = JM_py_from_rect(mupdf.FzRect(mupdf.FzRect.Fixed_INFINITE)) + else: + rect = mupdf.pdf_dict_get_rect(ap, PDF_NAME('BBox')) + val = JM_py_from_rect(rect) + + val = Rect(val) * self.get_parent().transformation_matrix + val *= self.get_parent().derotation_matrix + return val + + @property + def apn_matrix(self): + """annotation appearance matrix""" + try: + CheckParent(self) + annot = self.this + assert isinstance(annot, mupdf.PdfAnnot) + ap = mupdf.pdf_dict_getl( + mupdf.pdf_annot_obj(annot), + mupdf.PDF_ENUM_NAME_AP, + mupdf.PDF_ENUM_NAME_N + ) + if not ap.m_internal: + return JM_py_from_matrix(mupdf.FzMatrix()) + mat = mupdf.pdf_dict_get_matrix(ap, mupdf.PDF_ENUM_NAME_Matrix) + val = JM_py_from_matrix(mat) + + val = Matrix(val) + + return val + except Exception: + if g_exceptions_verbose: exception_info() + raise + + @property + def blendmode(self): + """annotation BlendMode""" + CheckParent(self) + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + obj = mupdf.pdf_dict_get(annot_obj, PDF_NAME('BM')) + blend_mode = None + if obj.m_internal: + blend_mode = JM_UnicodeFromStr(mupdf.pdf_to_name(obj)) + return blend_mode + # loop through the 
/AP/N/Resources/ExtGState objects + obj = mupdf.pdf_dict_getl( + annot_obj, + PDF_NAME('AP'), + PDF_NAME('N'), + PDF_NAME('Resources'), + PDF_NAME('ExtGState'), + ) + if mupdf.pdf_is_dict(obj): + n = mupdf.pdf_dict_len(obj) + for i in range(n): + obj1 = mupdf.pdf_dict_get_val(obj, i) + if mupdf.pdf_is_dict(obj1): + m = mupdf.pdf_dict_len(obj1) + for j in range(m): + obj2 = mupdf.pdf_dict_get_key(obj1, j) + if mupdf.pdf_objcmp(obj2, PDF_NAME('BM')) == 0: + blend_mode = JM_UnicodeFromStr(mupdf.pdf_to_name(mupdf.pdf_dict_get_val(obj1, j))) + return blend_mode + return blend_mode + + @property + def border(self): + """Border information.""" + CheckParent(self) + atype = self.type[0] + if atype not in ( + mupdf.PDF_ANNOT_CIRCLE, + mupdf.PDF_ANNOT_FREE_TEXT, + mupdf.PDF_ANNOT_INK, + mupdf.PDF_ANNOT_LINE, + mupdf.PDF_ANNOT_POLY_LINE, + mupdf.PDF_ANNOT_POLYGON, + mupdf.PDF_ANNOT_SQUARE, + ): + return dict() + ao = mupdf.pdf_annot_obj(self.this) + ret = JM_annot_border(ao) + return ret + + def clean_contents(self, sanitize=1): + """Clean appearance contents stream.""" + CheckParent(self) + annot = self.this + pdf = mupdf.pdf_get_bound_document(mupdf.pdf_annot_obj(annot)) + filter_ = _make_PdfFilterOptions(recurse=1, instance_forms=0, ascii=0, sanitize=sanitize) + mupdf.pdf_filter_annot_contents(pdf, annot, filter_) + + @property + def colors(self): + """Color definitions.""" + try: + CheckParent(self) + annot = self.this + assert isinstance(annot, mupdf.PdfAnnot) + return JM_annot_colors(mupdf.pdf_annot_obj(annot)) + except Exception: + if g_exceptions_verbose: exception_info() + raise + + def delete_responses(self): + """Delete 'Popup' and responding annotations.""" + CheckParent(self) + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + page = _pdf_annot_page(annot) + while 1: + irt_annot = JM_find_annot_irt(annot) + if not irt_annot: + break + mupdf.pdf_delete_annot(page, irt_annot) + mupdf.pdf_dict_del(annot_obj, PDF_NAME('Popup')) + + annots = mupdf.pdf_dict_get(page.obj(), PDF_NAME('Annots')) + n = mupdf.pdf_array_len(annots) + found = 0 + for i in range(n-1, -1, -1): + o = mupdf.pdf_array_get(annots, i) + p = mupdf.pdf_dict_get(o, PDF_NAME('Parent')) + if not o.m_internal: + continue + if not mupdf.pdf_objcmp(p, annot_obj): + mupdf.pdf_array_delete(annots, i) + found = 1 + if found: + mupdf.pdf_dict_put(page.obj(), PDF_NAME('Annots'), annots) + + @property + def file_info(self): + """Attached file information.""" + CheckParent(self) + res = dict() + length = -1 + size = -1 + desc = None + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + type_ = mupdf.pdf_annot_type(annot) + if type_ != mupdf.PDF_ANNOT_FILE_ATTACHMENT: + raise TypeError( MSG_BAD_ANNOT_TYPE) + stream = mupdf.pdf_dict_getl( + annot_obj, + PDF_NAME('FS'), + PDF_NAME('EF'), + PDF_NAME('F'), + ) + if not stream.m_internal: + RAISEPY( "bad PDF: file entry not found", JM_Exc_FileDataError) + + fs = mupdf.pdf_dict_get(annot_obj, PDF_NAME('FS')) + + o = mupdf.pdf_dict_get(fs, PDF_NAME('UF')) + if o.m_internal: + filename = mupdf.pdf_to_text_string(o) + else: + o = mupdf.pdf_dict_get(fs, PDF_NAME('F')) + if o.m_internal: + filename = mupdf.pdf_to_text_string(o) + + o = mupdf.pdf_dict_get(fs, PDF_NAME('Desc')) + if o.m_internal: + desc = mupdf.pdf_to_text_string(o) + + o = mupdf.pdf_dict_get(stream, PDF_NAME('Length')) + if o.m_internal: + length = mupdf.pdf_to_int(o) + + o = mupdf.pdf_dict_getl(stream, PDF_NAME('Params'), PDF_NAME('Size')) + if o.m_internal: + size = mupdf.pdf_to_int(o) + + res[ dictkey_filename] 
= JM_EscapeStrFromStr(filename) + res[ dictkey_descr] = JM_UnicodeFromStr(desc) + res[ dictkey_length] = length + res[ dictkey_size] = size + return res + + @property + def flags(self): + """Flags field.""" + CheckParent(self) + annot = self.this + return mupdf.pdf_annot_flags(annot) + + def get_file(self): + """Retrieve attached file content.""" + CheckParent(self) + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + type = mupdf.pdf_annot_type(annot) + if type != mupdf.PDF_ANNOT_FILE_ATTACHMENT: + raise TypeError( MSG_BAD_ANNOT_TYPE) + stream = mupdf.pdf_dict_getl(annot_obj, PDF_NAME('FS'), PDF_NAME('EF'), PDF_NAME('F')) + if not stream.m_internal: + RAISEPY( "bad PDF: file entry not found", JM_Exc_FileDataError) + buf = mupdf.pdf_load_stream(stream) + res = JM_BinFromBuffer(buf) + return res + + def get_oc(self): + """Get annotation optional content reference.""" + CheckParent(self) + oc = 0 + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + obj = mupdf.pdf_dict_get(annot_obj, PDF_NAME('OC')) + if obj.m_internal: + oc = mupdf.pdf_to_num(obj) + return oc + + # PyMuPDF doesn't seem to have this .parent member, but removing it breaks + # 11 tests...? + #@property + def get_parent(self): + try: + ret = getattr( self, 'parent') + except AttributeError: + page = _pdf_annot_page(self.this) + assert isinstance( page, mupdf.PdfPage) + document = Document( page.doc()) if page.m_internal else None + ret = Page(page, document) + #self.parent = weakref.proxy( ret) + self.parent = ret + #log(f'No attribute .parent: {type(self)=} {id(self)=}: have set {id(self.parent)=}.') + #log( f'Have set self.parent') + return ret + + def get_pixmap(self, matrix=None, dpi=None, colorspace=None, alpha=0): + """annotation Pixmap""" + + CheckParent(self) + cspaces = {"gray": csGRAY, "rgb": csRGB, "cmyk": csCMYK} + if type(colorspace) is str: + colorspace = cspaces.get(colorspace.lower(), None) + if dpi: + matrix = Matrix(dpi / 72, dpi / 72) + ctm = JM_matrix_from_py(matrix) + cs = colorspace + if not cs: + cs = mupdf.fz_device_rgb() + + pix = mupdf.pdf_new_pixmap_from_annot(self.this, ctm, cs, mupdf.FzSeparations(0), alpha) + ret = Pixmap(pix) + if dpi: + ret.set_dpi(dpi, dpi) + return ret + + def get_sound(self): + """Retrieve sound stream.""" + CheckParent(self) + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + type = mupdf.pdf_annot_type(annot) + sound = mupdf.pdf_dict_get(annot_obj, PDF_NAME('Sound')) + if type != mupdf.PDF_ANNOT_SOUND or not sound.m_internal: + raise TypeError( MSG_BAD_ANNOT_TYPE) + if mupdf.pdf_dict_get(sound, PDF_NAME('F')).m_internal: + RAISEPY( "unsupported sound stream", JM_Exc_FileDataError) + res = dict() + obj = mupdf.pdf_dict_get(sound, PDF_NAME('R')) + if obj.m_internal: + res['rate'] = mupdf.pdf_to_real(obj) + obj = mupdf.pdf_dict_get(sound, PDF_NAME('C')) + if obj.m_internal: + res['channels'] = mupdf.pdf_to_int(obj) + obj = mupdf.pdf_dict_get(sound, PDF_NAME('B')) + if obj.m_internal: + res['bps'] = mupdf.pdf_to_int(obj) + obj = mupdf.pdf_dict_get(sound, PDF_NAME('E')) + if obj.m_internal: + res['encoding'] = mupdf.pdf_to_name(obj) + obj = mupdf.pdf_dict_gets(sound, "CO") + if obj.m_internal: + res['compression'] = mupdf.pdf_to_name(obj) + buf = mupdf.pdf_load_stream(sound) + stream = JM_BinFromBuffer(buf) + res['stream'] = stream + return res + + def get_textpage(self, clip=None, flags=0): + """Make annotation TextPage.""" + CheckParent(self) + options = mupdf.FzStextOptions(flags) + if clip: + assert hasattr(mupdf, 'FZ_STEXT_CLIP_RECT'), 
f'MuPDF-{mupdf_version} does not support FZ_STEXT_CLIP_RECT.' + clip2 = JM_rect_from_py(clip) + options.clip = clip2.internal() + options.flags |= mupdf.FZ_STEXT_CLIP_RECT + annot = self.this + stextpage = mupdf.FzStextPage(annot, options) + ret = TextPage(stextpage) + p = self.get_parent() + if isinstance(p, weakref.ProxyType): + ret.parent = p + else: + ret.parent = weakref.proxy(p) + return ret + + @property + def has_popup(self): + """Check if annotation has a Popup.""" + CheckParent(self) + annot = self.this + obj = mupdf.pdf_dict_get(mupdf.pdf_annot_obj(annot), PDF_NAME('Popup')) + return True if obj.m_internal else False + + @property + def info(self): + """Various information details.""" + CheckParent(self) + annot = self.this + res = dict() + + res[dictkey_content] = JM_UnicodeFromStr(mupdf.pdf_annot_contents(annot)) + + o = mupdf.pdf_dict_get(mupdf.pdf_annot_obj(annot), PDF_NAME('Name')) + res[dictkey_name] = JM_UnicodeFromStr(mupdf.pdf_to_name(o)) + + # Title (= author) + o = mupdf.pdf_dict_get(mupdf.pdf_annot_obj(annot), PDF_NAME('T')) + res[dictkey_title] = JM_UnicodeFromStr(mupdf.pdf_to_text_string(o)) + + # CreationDate + o = mupdf.pdf_dict_gets(mupdf.pdf_annot_obj(annot), "CreationDate") + res[dictkey_creationDate] = JM_UnicodeFromStr(mupdf.pdf_to_text_string(o)) + + # ModDate + o = mupdf.pdf_dict_get(mupdf.pdf_annot_obj(annot), PDF_NAME('M')) + res[dictkey_modDate] = JM_UnicodeFromStr(mupdf.pdf_to_text_string(o)) + + # Subj + o = mupdf.pdf_dict_gets(mupdf.pdf_annot_obj(annot), "Subj") + res[dictkey_subject] = mupdf.pdf_to_text_string(o) + + # Identification (PDF key /NM) + o = mupdf.pdf_dict_gets(mupdf.pdf_annot_obj(annot), "NM") + res[dictkey_id] = JM_UnicodeFromStr(mupdf.pdf_to_text_string(o)) + + return res + + @property + def irt_xref(self): + ''' + annotation IRT xref + ''' + annot = self.this + annot_obj = mupdf.pdf_annot_obj( annot) + irt = mupdf.pdf_dict_get( annot_obj, PDF_NAME('IRT')) + if not irt.m_internal: + return 0 + return mupdf.pdf_to_num( irt) + + @property + def is_open(self): + """Get 'open' status of annotation or its Popup.""" + CheckParent(self) + return mupdf.pdf_annot_is_open(self.this) + + @property + def language(self): + """annotation language""" + this_annot = self.this + lang = mupdf.pdf_annot_language(this_annot) + if lang == mupdf.FZ_LANG_UNSET: + return + assert hasattr(mupdf, 'fz_string_from_text_language2') + return mupdf.fz_string_from_text_language2(lang) + + @property + def line_ends(self): + """Line end codes.""" + CheckParent(self) + annot = self.this + # return nothing for invalid annot types + if not mupdf.pdf_annot_has_line_ending_styles(annot): + return + lstart = mupdf.pdf_annot_line_start_style(annot) + lend = mupdf.pdf_annot_line_end_style(annot) + return lstart, lend + + @property + def next(self): + """Next annotation.""" + CheckParent(self) + this_annot = self.this + assert isinstance(this_annot, mupdf.PdfAnnot) + assert this_annot.m_internal + type_ = mupdf.pdf_annot_type(this_annot) + if type_ != mupdf.PDF_ANNOT_WIDGET: + annot = mupdf.pdf_next_annot(this_annot) + else: + annot = mupdf.pdf_next_widget(this_annot) + + val = Annot(annot) if annot.m_internal else None + if not val: + return None + val.thisown = True + assert val.get_parent().this.m_internal_value() == self.get_parent().this.m_internal_value() + val.parent._annot_refs[id(val)] = val + + if val.type[0] == mupdf.PDF_ANNOT_WIDGET: + widget = Widget() + TOOLS._fill_widget(val, widget) + val = widget + return val + + @property + def opacity(self): + """Opacity.""" 
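+        # A small usage sketch (illustrative; assumes `page` is a pymupdf.Page
+        # with at least one annotation):
+        #
+        #   annot = page.first_annot
+        #   if annot.opacity < 0:         # no /CA entry, i.e. fully opaque
+        #       annot.set_opacity(0.5)    # set_opacity() is defined further below
+        #       annot.update()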
+ CheckParent(self) + annot = self.this + opy = -1 + ca = mupdf.pdf_dict_get( mupdf.pdf_annot_obj(annot), mupdf.PDF_ENUM_NAME_CA) + if mupdf.pdf_is_number(ca): + opy = mupdf.pdf_to_real(ca) + return opy + + @property + def popup_rect(self): + """annotation 'Popup' rectangle""" + CheckParent(self) + rect = mupdf.FzRect(mupdf.FzRect.Fixed_INFINITE) + annot = self.this + annot_obj = mupdf.pdf_annot_obj( annot) + obj = mupdf.pdf_dict_get( annot_obj, PDF_NAME('Popup')) + if obj.m_internal: + rect = mupdf.pdf_dict_get_rect(obj, PDF_NAME('Rect')) + #log( '{rect=}') + val = JM_py_from_rect(rect) + #log( '{val=}') + + val = Rect(val) * self.get_parent().transformation_matrix + val *= self.get_parent().derotation_matrix + + return val + + @property + def popup_xref(self): + """annotation 'Popup' xref""" + CheckParent(self) + xref = 0 + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + obj = mupdf.pdf_dict_get(annot_obj, PDF_NAME('Popup')) + if obj.m_internal: + xref = mupdf.pdf_to_num(obj) + return xref + + @property + def rect(self): + """annotation rectangle""" + if g_use_extra: + val = extra.Annot_rect3( self.this) + else: + val = mupdf.pdf_bound_annot(self.this) + val = Rect(val) + + # Caching self.parent_() reduces 1000x from 0.07 to 0.04. + # + p = self.get_parent() + #p = getattr( self, 'parent', None) + #if p is None: + # p = self.parent + # self.parent = p + #p = self.parent_() + val *= p.derotation_matrix + return val + + @property + def rect_delta(self): + ''' + annotation delta values to rectangle + ''' + annot_obj = mupdf.pdf_annot_obj(self.this) + arr = mupdf.pdf_dict_get( annot_obj, PDF_NAME('RD')) + if mupdf.pdf_array_len( arr) == 4: + return ( + mupdf.pdf_to_real( mupdf.pdf_array_get( arr, 0)), + mupdf.pdf_to_real( mupdf.pdf_array_get( arr, 1)), + -mupdf.pdf_to_real( mupdf.pdf_array_get( arr, 2)), + -mupdf.pdf_to_real( mupdf.pdf_array_get( arr, 3)), + ) + + @property + def rotation(self): + """annotation rotation""" + CheckParent(self) + annot = self.this + rotation = mupdf.pdf_dict_get( mupdf.pdf_annot_obj(annot), mupdf.PDF_ENUM_NAME_Rotate) + if not rotation.m_internal: + return -1 + return mupdf.pdf_to_int( rotation) + + def set_apn_bbox(self, bbox): + """ + Set annotation appearance bbox. + """ + CheckParent(self) + page = self.get_parent() + rot = page.rotation_matrix + mat = page.transformation_matrix + bbox *= rot * ~mat + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + ap = mupdf.pdf_dict_getl(annot_obj, PDF_NAME('AP'), PDF_NAME('N')) + if not ap.m_internal: + raise RuntimeError( MSG_BAD_APN) + rect = JM_rect_from_py(bbox) + mupdf.pdf_dict_put_rect(ap, PDF_NAME('BBox'), rect) + + def set_apn_matrix(self, matrix): + """Set annotation appearance matrix.""" + CheckParent(self) + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + ap = mupdf.pdf_dict_getl(annot_obj, PDF_NAME('AP'), PDF_NAME('N')) + if not ap.m_internal: + raise RuntimeError( MSG_BAD_APN) + mat = JM_matrix_from_py(matrix) + mupdf.pdf_dict_put_matrix(ap, PDF_NAME('Matrix'), mat) + + def set_blendmode(self, blend_mode): + """Set annotation BlendMode.""" + CheckParent(self) + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + mupdf.pdf_dict_put_name(annot_obj, PDF_NAME('BM'), blend_mode) + + def set_border(self, border=None, width=-1, style=None, dashes=None, clouds=-1): + """Set border properties. 
+ + Either a dict, or direct arguments width, style, dashes or clouds.""" + CheckParent(self) + atype, atname = self.type[:2] # annotation type + if atype not in ( + mupdf.PDF_ANNOT_CIRCLE, + mupdf.PDF_ANNOT_FREE_TEXT, + mupdf.PDF_ANNOT_INK, + mupdf.PDF_ANNOT_LINE, + mupdf.PDF_ANNOT_POLY_LINE, + mupdf.PDF_ANNOT_POLYGON, + mupdf.PDF_ANNOT_SQUARE, + ): + message(f"Cannot set border for '{atname}'.") + return None + if atype not in ( + mupdf.PDF_ANNOT_CIRCLE, + mupdf.PDF_ANNOT_FREE_TEXT, + mupdf.PDF_ANNOT_POLYGON, + mupdf.PDF_ANNOT_SQUARE, + ): + if clouds > 0: + message(f"Cannot set cloudy border for '{atname}'.") + clouds = -1 # do not set border effect + if type(border) is not dict: + border = {"width": width, "style": style, "dashes": dashes, "clouds": clouds} + border.setdefault("width", -1) + border.setdefault("style", None) + border.setdefault("dashes", None) + border.setdefault("clouds", -1) + if border["width"] is None: + border["width"] = -1 + if border["clouds"] is None: + border["clouds"] = -1 + if hasattr(border["dashes"], "__getitem__"): # ensure sequence items are integers + border["dashes"] = tuple(border["dashes"]) + for item in border["dashes"]: + if not isinstance(item, int): + border["dashes"] = None + break + annot = self.this + annot_obj = mupdf.pdf_annot_obj( annot) + pdf = mupdf.pdf_get_bound_document( annot_obj) + return JM_annot_set_border( border, pdf, annot_obj) + + def set_colors(self, colors=None, stroke=None, fill=None): + """Set 'stroke' and 'fill' colors. + + Use either a dict or the direct arguments. + """ + if self.type[0] == mupdf.PDF_ANNOT_FREE_TEXT: + raise ValueError("cannot be used for FreeText annotations") + + CheckParent(self) + doc = self.get_parent().parent + if type(colors) is not dict: + colors = {"fill": fill, "stroke": stroke} + fill = colors.get("fill") + stroke = colors.get("stroke") + + fill_annots = (mupdf.PDF_ANNOT_CIRCLE, mupdf.PDF_ANNOT_SQUARE, mupdf.PDF_ANNOT_LINE, mupdf.PDF_ANNOT_POLY_LINE, mupdf.PDF_ANNOT_POLYGON, + mupdf.PDF_ANNOT_REDACT,) + + if stroke in ([], ()): + doc.xref_set_key(self.xref, "C", "[]") + elif stroke is not None: + if hasattr(stroke, "__float__"): + stroke = [float(stroke)] + CheckColor(stroke) + assert len(stroke) in (1, 3, 4) + s = f"[{_format_g(stroke)}]" + doc.xref_set_key(self.xref, "C", s) + + if fill and self.type[0] not in fill_annots: + message("Warning: fill color ignored for annot type '%s'." 
% self.type[1]) + return + if fill in ([], ()): + doc.xref_set_key(self.xref, "IC", "[]") + elif fill is not None: + if hasattr(fill, "__float__"): + fill = [float(fill)] + CheckColor(fill) + assert len(fill) in (1, 3, 4) + s = f"[{_format_g(fill)}]" + doc.xref_set_key(self.xref, "IC", s) + + def set_flags(self, flags): + """Set annotation flags.""" + CheckParent(self) + annot = self.this + mupdf.pdf_set_annot_flags(annot, flags) + + def set_info(self, info=None, content=None, title=None, creationDate=None, modDate=None, subject=None): + """Set various properties.""" + CheckParent(self) + if type(info) is dict: # build the args from the dictionary + content = info.get("content", None) + title = info.get("title", None) + creationDate = info.get("creationDate", None) + modDate = info.get("modDate", None) + subject = info.get("subject", None) + info = None + annot = self.this + # use this to indicate a 'markup' annot type + is_markup = mupdf.pdf_annot_has_author(annot) + # contents + if content: + mupdf.pdf_set_annot_contents(annot, content) + if is_markup: + # title (= author) + if title: + mupdf.pdf_set_annot_author(annot, title) + # creation date + if creationDate: + mupdf.pdf_dict_put_text_string(mupdf.pdf_annot_obj(annot), PDF_NAME('CreationDate'), creationDate) + # mod date + if modDate: + mupdf.pdf_dict_put_text_string(mupdf.pdf_annot_obj(annot), PDF_NAME('M'), modDate) + # subject + if subject: + mupdf.pdf_dict_puts(mupdf.pdf_annot_obj(annot), "Subj", mupdf.pdf_new_text_string(subject)) + + def set_irt_xref(self, xref): + ''' + Set annotation IRT xref + ''' + annot = self.this + annot_obj = mupdf.pdf_annot_obj( annot) + page = _pdf_annot_page(annot) + if xref < 1 or xref >= mupdf.pdf_xref_len( page.doc()): + raise ValueError( MSG_BAD_XREF) + irt = mupdf.pdf_new_indirect( page.doc(), xref, 0) + subt = mupdf.pdf_dict_get( irt, PDF_NAME('Subtype')) + irt_subt = mupdf.pdf_annot_type_from_string( mupdf.pdf_to_name( subt)) + if irt_subt < 0: + raise ValueError( MSG_IS_NO_ANNOT) + mupdf.pdf_dict_put( annot_obj, PDF_NAME('IRT'), irt) + + def set_language(self, language=None): + """Set annotation language.""" + CheckParent(self) + this_annot = self.this + if not language: + lang = mupdf.FZ_LANG_UNSET + else: + lang = mupdf.fz_text_language_from_string(language) + mupdf.pdf_set_annot_language(this_annot, lang) + + def set_line_ends(self, start, end): + """Set line end codes.""" + CheckParent(self) + annot = self.this + if mupdf.pdf_annot_has_line_ending_styles(annot): + mupdf.pdf_set_annot_line_ending_styles(annot, start, end) + else: + message_warning("bad annot type for line ends") + + def set_name(self, name): + """Set /Name (icon) of annotation.""" + CheckParent(self) + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + mupdf.pdf_dict_put_name(annot_obj, PDF_NAME('Name'), name) + + def set_oc(self, oc=0): + """Set / remove annotation OC xref.""" + CheckParent(self) + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + if not oc: + mupdf.pdf_dict_del(annot_obj, PDF_NAME('OC')) + else: + JM_add_oc_object(mupdf.pdf_get_bound_document(annot_obj), annot_obj, oc) + + def set_opacity(self, opacity): + """Set opacity.""" + CheckParent(self) + annot = self.this + if not _INRANGE(opacity, 0.0, 1.0): + mupdf.pdf_set_annot_opacity(annot, 1) + return + mupdf.pdf_set_annot_opacity(annot, opacity) + if opacity < 1.0: + page = _pdf_annot_page(annot) + page.transparency = 1 + + def set_open(self, is_open): + """Set 'open' status of annotation or its Popup.""" + CheckParent(self) + annot = 
self.this + mupdf.pdf_set_annot_is_open(annot, is_open) + + def set_popup(self, rect): + ''' + Create annotation 'Popup' or update rectangle. + ''' + CheckParent(self) + annot = self.this + pdfpage = _pdf_annot_page(annot) + rot = JM_rotate_page_matrix(pdfpage) + r = mupdf.fz_transform_rect(JM_rect_from_py(rect), rot) + mupdf.pdf_set_annot_popup(annot, r) + + def set_rect(self, rect): + """Set annotation rectangle.""" + CheckParent(self) + annot = self.this + + pdfpage = _pdf_annot_page(annot) + rot = JM_rotate_page_matrix(pdfpage) + r = mupdf.fz_transform_rect(JM_rect_from_py(rect), rot) + if mupdf.fz_is_empty_rect(r) or mupdf.fz_is_infinite_rect(r): + raise ValueError( MSG_BAD_RECT) + try: + mupdf.pdf_set_annot_rect(annot, r) + except Exception as e: + message(f'cannot set rect: {e}') + return False + + def set_rotation(self, rotate=0): + """Set annotation rotation.""" + CheckParent(self) + + annot = self.this + type = mupdf.pdf_annot_type(annot) + if type not in ( + mupdf.PDF_ANNOT_CARET, + mupdf.PDF_ANNOT_CIRCLE, + mupdf.PDF_ANNOT_FREE_TEXT, + mupdf.PDF_ANNOT_FILE_ATTACHMENT, + mupdf.PDF_ANNOT_INK, + mupdf.PDF_ANNOT_LINE, + mupdf.PDF_ANNOT_POLY_LINE, + mupdf.PDF_ANNOT_POLYGON, + mupdf.PDF_ANNOT_SQUARE, + mupdf.PDF_ANNOT_STAMP, + mupdf.PDF_ANNOT_TEXT, + ): + return + rot = rotate + while rot < 0: + rot += 360 + while rot >= 360: + rot -= 360 + if type == mupdf.PDF_ANNOT_FREE_TEXT and rot % 90 != 0: + rot = 0 + annot_obj = mupdf.pdf_annot_obj(annot) + mupdf.pdf_dict_put_int(annot_obj, PDF_NAME('Rotate'), rot) + + @property + def type(self): + """annotation type""" + CheckParent(self) + if not self.this.m_internal: + return 'null' + type_ = mupdf.pdf_annot_type(self.this) + c = mupdf.pdf_string_from_annot_type(type_) + o = mupdf.pdf_dict_gets( mupdf.pdf_annot_obj(self.this), 'IT') + if not o.m_internal or mupdf.pdf_is_name(o): + return (type_, c) + it = mupdf.pdf_to_name(o) + return (type_, c, it) + + def update(self, + blend_mode: OptStr =None, + opacity: OptFloat =None, + fontsize: float =0, + fontname: OptStr =None, + text_color: OptSeq =None, + border_color: OptSeq =None, + fill_color: OptSeq =None, + cross_out: bool =True, + rotate: int =-1, + ): + """Update annot appearance. + + Notes: + Depending on the annot type, some parameters make no sense, + while others are only available in this method to achieve the + desired result. This is especially true for 'FreeText' annots. + Args: + blend_mode: set the blend mode, all annotations. + opacity: set the opacity, all annotations. + fontsize: set fontsize, 'FreeText' only. + fontname: set the font, 'FreeText' only. + border_color: set border color, 'FreeText' only. + text_color: set text color, 'FreeText' only. + fill_color: set fill color, all annotations. + cross_out: draw diagonal lines, 'Redact' only. + rotate: set rotation, 'FreeText' and some others. + """ + annot_obj = mupdf.pdf_annot_obj(self.this) + + if border_color: + is_rich_text = mupdf.pdf_dict_get(annot_obj, PDF_NAME("RC")) + if not is_rich_text: + raise ValueError("cannot set border_color if rich_text is False") + Annot.update_timing_test() + CheckParent(self) + def color_string(cs, code): + """Return valid PDF color operator for a given color sequence. 
+ """ + cc = ColorCode(cs, code) + if not cc: + return b"" + return (cc + "\n").encode() + + annot_type = self.type[0] # get the annot type + + dt = self.border.get("dashes", None) # get the dashes spec + bwidth = self.border.get("width", -1) # get border line width + stroke = self.colors["stroke"] # get the stroke color + if fill_color is not None: + fill = fill_color + else: + fill = self.colors["fill"] + rect = None # self.rect # prevent MuPDF fiddling with it + apnmat = self.apn_matrix # prevent MuPDF fiddling with it + if rotate != -1: # sanitize rotation value + while rotate < 0: + rotate += 360 + while rotate >= 360: + rotate -= 360 + if annot_type == mupdf.PDF_ANNOT_FREE_TEXT and rotate % 90 != 0: + rotate = 0 + + #------------------------------------------------------------------ + # handle opacity and blend mode + #------------------------------------------------------------------ + if blend_mode is None: + blend_mode = self.blendmode + if not hasattr(opacity, "__float__"): + opacity = self.opacity + + if 0 <= opacity < 1 or blend_mode: + opa_code = "/H gs\n" # then we must reference this 'gs' + else: + opa_code = "" + + if annot_type == mupdf.PDF_ANNOT_FREE_TEXT: + CheckColor(text_color) + CheckColor(fill_color) + tcol, fname, fsize = TOOLS._parse_da(self) + + # read and update default appearance as necessary + if fsize <= 0: + fsize = 12 + if text_color: + tcol = text_color + if fontname: + fname = fontname + if fontsize > 0: + fsize = fontsize + JM_make_annot_DA(self, len(tcol), tcol, fname, fsize) + blend_mode = None # not supported for free text annotations! + + #------------------------------------------------------------------ + # now invoke MuPDF to update the annot appearance + #------------------------------------------------------------------ + val = self._update_appearance( + opacity=opacity, + blend_mode=blend_mode, + fill_color=fill, + rotate=rotate, + ) + if val is False: + raise RuntimeError("Error updating annotation.") + + if annot_type == mupdf.PDF_ANNOT_FREE_TEXT: + # in absence of previous opacity, we may need to modify the AP + ap = self._getAP() + if 0 <= opacity < 1 and not ap.startswith(b"/H gs"): + self._setAP(b"/H gs\n" + ap) + return + + bfill = color_string(fill, "f") + bstroke = color_string(stroke, "c") + + p_ctm = self.get_parent().transformation_matrix + imat = ~p_ctm # inverse page transf. 
matrix + + if dt: + dashes = "[" + " ".join(map(str, dt)) + "] 0 d\n" + dashes = dashes.encode("utf-8") + else: + dashes = None + + if self.line_ends: + line_end_le, line_end_ri = self.line_ends + else: + line_end_le, line_end_ri = 0, 0 # init line end codes + + # read contents as created by MuPDF + ap = self._getAP() + ap_tab = ap.splitlines() # split in single lines + ap_updated = False # assume we did nothing + + if annot_type == mupdf.PDF_ANNOT_REDACT: + if cross_out: # create crossed-out rect + ap_updated = True + ap_tab = ap_tab[:-1] + _, LL, LR, UR, UL = ap_tab + ap_tab.append(LR) + ap_tab.append(LL) + ap_tab.append(UR) + ap_tab.append(LL) + ap_tab.append(UL) + ap_tab.append(b"S") + + if bwidth > 0 or bstroke != b"": + ap_updated = True + ntab = [_format_g(bwidth).encode() + b" w"] if bwidth > 0 else [] + for line in ap_tab: + if line.endswith(b"w"): + continue + if line.endswith(b"RG") and bstroke != b"": + line = bstroke[:-1] + ntab.append(line) + ap_tab = ntab + + ap = b"\n".join(ap_tab) + + if annot_type in (mupdf.PDF_ANNOT_POLYGON, mupdf.PDF_ANNOT_POLY_LINE): + ap = b"\n".join(ap_tab[:-1]) + b"\n" + ap_updated = True + if bfill != b"": + if annot_type == mupdf.PDF_ANNOT_POLYGON: + ap = ap + bfill + b"b" # close, fill, and stroke + elif annot_type == mupdf.PDF_ANNOT_POLY_LINE: + ap = ap + b"S" # stroke + else: + if annot_type == mupdf.PDF_ANNOT_POLYGON: + ap = ap + b"s" # close and stroke + elif annot_type == mupdf.PDF_ANNOT_POLY_LINE: + ap = ap + b"S" # stroke + + if dashes is not None: # handle dashes + ap = dashes + ap + # reset dashing - only applies for LINE annots with line ends given + ap = ap.replace(b"\nS\n", b"\nS\n[] 0 d\n", 1) + ap_updated = True + + if opa_code: + ap = opa_code.encode("utf-8") + ap + ap_updated = True + + ap = b"q\n" + ap + b"\nQ\n" + #---------------------------------------------------------------------- + # the following handles line end symbols for 'Polygon' and 'Polyline' + #---------------------------------------------------------------------- + if line_end_le + line_end_ri > 0 and annot_type in (mupdf.PDF_ANNOT_POLYGON, mupdf.PDF_ANNOT_POLY_LINE): + + le_funcs = (None, TOOLS._le_square, TOOLS._le_circle, + TOOLS._le_diamond, TOOLS._le_openarrow, + TOOLS._le_closedarrow, TOOLS._le_butt, + TOOLS._le_ropenarrow, TOOLS._le_rclosedarrow, + TOOLS._le_slash) + le_funcs_range = range(1, len(le_funcs)) + d = 2 * max(1, self.border["width"]) + rect = self.rect + (-d, -d, d, d) + ap_updated = True + points = self.vertices + if line_end_le in le_funcs_range: + p1 = Point(points[0]) * imat + p2 = Point(points[1]) * imat + left = le_funcs[line_end_le](self, p1, p2, False, fill_color) + ap += left.encode() + if line_end_ri in le_funcs_range: + p1 = Point(points[-2]) * imat + p2 = Point(points[-1]) * imat + left = le_funcs[line_end_ri](self, p1, p2, True, fill_color) + ap += left.encode() + + if ap_updated: + if rect: # rect modified here? 
+ self.set_rect(rect) + self._setAP(ap, rect=1) + else: + self._setAP(ap, rect=0) + + #------------------------------- + # handle annotation rotations + #------------------------------- + if annot_type not in ( # only these types are supported + mupdf.PDF_ANNOT_CARET, + mupdf.PDF_ANNOT_CIRCLE, + mupdf.PDF_ANNOT_FILE_ATTACHMENT, + mupdf.PDF_ANNOT_INK, + mupdf.PDF_ANNOT_LINE, + mupdf.PDF_ANNOT_POLY_LINE, + mupdf.PDF_ANNOT_POLYGON, + mupdf.PDF_ANNOT_SQUARE, + mupdf.PDF_ANNOT_STAMP, + mupdf.PDF_ANNOT_TEXT, + ): + return + + rot = self.rotation # get value from annot object + if rot == -1: # nothing to change + return + + M = (self.rect.tl + self.rect.br) / 2 # center of annot rect + + if rot == 0: # undo rotations + if abs(apnmat - Matrix(1, 1)) < 1e-5: + return # matrix already is a no-op + quad = self.rect.morph(M, ~apnmat) # derotate rect + self.setRect(quad.rect) + self.set_apn_matrix(Matrix(1, 1)) # appearance matrix = no-op + return + + mat = Matrix(rot) + quad = self.rect.morph(M, mat) + self.set_rect(quad.rect) + self.set_apn_matrix(apnmat * mat) + + def update_file(self, buffer_=None, filename=None, ufilename=None, desc=None): + """Update attached file.""" + CheckParent(self) + annot = self.this + annot_obj = mupdf.pdf_annot_obj(annot) + pdf = mupdf.pdf_get_bound_document(annot_obj) # the owning PDF + type = mupdf.pdf_annot_type(annot) + if type != mupdf.PDF_ANNOT_FILE_ATTACHMENT: + raise TypeError( MSG_BAD_ANNOT_TYPE) + stream = mupdf.pdf_dict_getl(annot_obj, PDF_NAME('FS'), PDF_NAME('EF'), PDF_NAME('F')) + # the object for file content + if not stream.m_internal: + RAISEPY( "bad PDF: no /EF object", JM_Exc_FileDataError) + + fs = mupdf.pdf_dict_get(annot_obj, PDF_NAME('FS')) + + # file content given + res = JM_BufferFromBytes(buffer_) + if buffer_ and not res.m_internal: + raise ValueError( MSG_BAD_BUFFER) + if res: + JM_update_stream(pdf, stream, res, 1) + # adjust /DL and /Size parameters + len, _ = mupdf.fz_buffer_storage(res) + l = mupdf.pdf_new_int(len) + mupdf.pdf_dict_put(stream, PDF_NAME('DL'), l) + mupdf.pdf_dict_putl(stream, l, PDF_NAME('Params'), PDF_NAME('Size')) + + if filename: + mupdf.pdf_dict_put_text_string(stream, PDF_NAME('F'), filename) + mupdf.pdf_dict_put_text_string(fs, PDF_NAME('F'), filename) + mupdf.pdf_dict_put_text_string(stream, PDF_NAME('UF'), filename) + mupdf.pdf_dict_put_text_string(fs, PDF_NAME('UF'), filename) + mupdf.pdf_dict_put_text_string(annot_obj, PDF_NAME('Contents'), filename) + + if ufilename: + mupdf.pdf_dict_put_text_string(stream, PDF_NAME('UF'), ufilename) + mupdf.pdf_dict_put_text_string(fs, PDF_NAME('UF'), ufilename) + + if desc: + mupdf.pdf_dict_put_text_string(stream, PDF_NAME('Desc'), desc) + mupdf.pdf_dict_put_text_string(fs, PDF_NAME('Desc'), desc) + + @staticmethod + def update_timing_test(): + total = 0 + for i in range( 30*1000): + total += i + return total + + @property + def vertices(self): + """annotation vertex points""" + CheckParent(self) + annot = self.this + assert isinstance(annot, mupdf.PdfAnnot) + annot_obj = mupdf.pdf_annot_obj(annot) + page = _pdf_annot_page(annot) + page_ctm = mupdf.FzMatrix() # page transformation matrix + dummy = mupdf.FzRect() # Out-param for mupdf.pdf_page_transform(). + mupdf.pdf_page_transform(page, dummy, page_ctm) + derot = JM_derotate_page_matrix(page) + page_ctm = mupdf.fz_concat(page_ctm, derot) + + #---------------------------------------------------------------- + # The following objects occur in different annotation types. + # So we are sure that (!o) occurs at most once. 
+ # Every pair of floats is one point, that needs to be separately + # transformed with the page transformation matrix. + #---------------------------------------------------------------- + o = mupdf.pdf_dict_get(annot_obj, PDF_NAME('Vertices')) + if not o.m_internal: o = mupdf.pdf_dict_get(annot_obj, PDF_NAME('L')) + if not o.m_internal: o = mupdf.pdf_dict_get(annot_obj, PDF_NAME('QuadPoints')) + if not o.m_internal: o = mupdf.pdf_dict_gets(annot_obj, 'CL') + + if o.m_internal: + # handle lists with 1-level depth + # weiter + res = [] + for i in range(0, mupdf.pdf_array_len(o), 2): + x = mupdf.pdf_to_real(mupdf.pdf_array_get(o, i)) + y = mupdf.pdf_to_real(mupdf.pdf_array_get(o, i+1)) + point = mupdf.FzPoint(x, y) + point = mupdf.fz_transform_point(point, page_ctm) + res.append( (point.x, point.y)) + return res + + o = mupdf.pdf_dict_gets(annot_obj, 'InkList') + if o.m_internal: + # InkList has 2-level lists + #inklist: + res = [] + for i in range(mupdf.pdf_array_len(o)): + res1 = [] + o1 = mupdf.pdf_array_get(o, i) + for j in range(0, mupdf.pdf_array_len(o1), 2): + x = mupdf.pdf_to_real(mupdf.pdf_array_get(o1, j)) + y = mupdf.pdf_to_real(mupdf.pdf_array_get(o1, j+1)) + point = mupdf.FzPoint(x, y) + point = mupdf.fz_transform_point(point, page_ctm) + res1.append( (point.x, point.y)) + res.append(res1) + return res + + @property + def xref(self): + """annotation xref number""" + CheckParent(self) + annot = self.this + return mupdf.pdf_to_num(mupdf.pdf_annot_obj(annot)) + + +class Archive: + def __init__( self, *args): + ''' + Archive(dirname [, path]) - from folder + Archive(file [, path]) - from file name or object + Archive(data, name) - from memory item + Archive() - empty archive + Archive(archive [, path]) - from archive + ''' + self._subarchives = list() + self.this = mupdf.fz_new_multi_archive() + if args: + self.add( *args) + + def __repr__( self): + return f'Archive, sub-archives: {len(self._subarchives)}' + + def _add_arch( self, subarch, path=None): + mupdf.fz_mount_multi_archive( self.this, subarch, path) + + def _add_dir( self, folder, path=None): + sub = mupdf.fz_open_directory( folder) + mupdf.fz_mount_multi_archive( self.this, sub, path) + + def _add_treeitem( self, memory, name, path=None): + buff = JM_BufferFromBytes( memory) + sub = mupdf.fz_new_tree_archive( mupdf.FzTree()) + mupdf.fz_tree_archive_add_buffer( sub, name, buff) + mupdf.fz_mount_multi_archive( self.this, sub, path) + + def _add_ziptarfile( self, filepath, type_, path=None): + if type_ == 1: + sub = mupdf.fz_open_zip_archive( filepath) + else: + sub = mupdf.fz_open_tar_archive( filepath) + mupdf.fz_mount_multi_archive( self.this, sub, path) + + def _add_ziptarmemory( self, memory, type_, path=None): + buff = JM_BufferFromBytes( memory) + stream = mupdf.fz_open_buffer( buff) + if type_==1: + sub = mupdf.fz_open_zip_archive_with_stream( stream) + else: + sub = mupdf.fz_open_tar_archive_with_stream( stream) + mupdf.fz_mount_multi_archive( self.this, sub, path) + + def add( self, content, path=None): + ''' + Add a sub-archive. + + Args: + content: + The content to be added. May be one of: + `str` - must be path of directory or file. + `bytes`, `bytearray`, `io.BytesIO` - raw data. + `zipfile.Zipfile`. + `tarfile.TarFile`. + `pymupdf.Archive`. + A two-item tuple `(data, name)`. + List or tuple (but not tuple with length 2) of the above. + path: (str) a "virtual" path name, under which the elements + of content can be retrieved. Use it to e.g. cope with + duplicate element names. 
+ ''' + def is_binary_data(x): + return isinstance(x, (bytes, bytearray, io.BytesIO)) + + def make_subarch(entries, mount, fmt): + subarch = dict(fmt=fmt, entries=entries, path=mount) + if fmt != "tree" or self._subarchives == []: + self._subarchives.append(subarch) + else: + ltree = self._subarchives[-1] + if ltree["fmt"] != "tree" or ltree["path"] != subarch["path"]: + self._subarchives.append(subarch) + else: + ltree["entries"].extend(subarch["entries"]) + self._subarchives[-1] = ltree + + if isinstance(content, pathlib.Path): + content = str(content) + + if isinstance(content, str): + if os.path.isdir(content): + self._add_dir(content, path) + return make_subarch(os.listdir(content), path, 'dir') + elif os.path.isfile(content): + assert isinstance(path, str) and path != '', \ + f'Need name for binary content, but {path=}.' + with open(content) as f: + ff = f.read() + self._add_treeitem(ff, path) + return make_subarch([path], None, 'tree') + else: + raise ValueError(f'Not a file or directory: {content!r}') + + elif is_binary_data(content): + assert isinstance(path, str) and path != '' \ + f'Need name for binary content, but {path=}.' + self._add_treeitem(content, path) + return make_subarch([path], None, 'tree') + + elif isinstance(content, zipfile.ZipFile): + filename = getattr(content, "filename", None) + if filename is None: + fp = content.fp.getvalue() + self._add_ziptarmemory(fp, 1, path) + else: + self._add_ziptarfile(filename, 1, path) + return make_subarch(content.namelist(), path, 'zip') + + elif isinstance(content, tarfile.TarFile): + filename = getattr(content.fileobj, "name", None) + if filename is None: + fp = content.fileobj + if not isinstance(fp, io.BytesIO): + fp = fp.fileobj + self._add_ziptarmemory(fp.getvalue(), 0, path) + else: + self._add_ziptarfile(filename, 0, path) + return make_subarch(content.getnames(), path, 'tar') + + elif isinstance(content, Archive): + self._add_arch(content, path) + return make_subarch([], path, 'multi') + + if isinstance(content, tuple) and len(content) == 2: + # covers the tree item plus path + data, name = content + assert isinstance(name, str), f'Unexpected {type(name)=}' + if is_binary_data(data): + self._add_treeitem(data, name, path=path) + elif isinstance(data, str): + if os.path.isfile(data): + with open(data, 'rb') as f: + ff = f.read() + self._add_treeitem(ff, name, path=path) + else: + assert 0, f'Unexpected {type(data)=}.' + return make_subarch([name], path, 'tree') + + elif hasattr(content, '__getitem__'): + # Deal with sequence of disparate items. + for item in content: + self.add(item, path) + return + + else: + raise TypeError(f'Unrecognised type {type(content)}.') + assert 0 + + @property + def entry_list( self): + ''' + List of sub archives. 
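+
+        Each entry is a dict of the form
+        {"fmt": "dir", "entries": [...], "path": None}, as recorded by
+        add() above (the values shown here are purely illustrative).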
+ ''' + return self._subarchives + + def has_entry( self, name): + return mupdf.fz_has_archive_entry( self.this, name) + + def read_entry( self, name): + buff = mupdf.fz_read_archive_entry( self.this, name) + return JM_BinFromBuffer( buff) + + +class Xml: + + def __enter__(self): + return self + + def __exit__(self, *args): + pass + + def __init__(self, rhs): + if isinstance(rhs, mupdf.FzXml): + self.this = rhs + elif isinstance(rhs, str): + buff = mupdf.fz_new_buffer_from_copied_data(rhs) + self.this = mupdf.fz_parse_xml_from_html5(buff) + else: + assert 0, f'Unsupported type for rhs: {type(rhs)}' + + def _get_node_tree( self): + def show_node(node, items, shift): + while node is not None: + if node.is_text: + items.append((shift, f'"{node.text}"')) + node = node.next + continue + items.append((shift, f"({node.tagname}")) + for k, v in node.get_attributes().items(): + items.append((shift, f"={k} '{v}'")) + child = node.first_child + if child: + items = show_node(child, items, shift + 1) + items.append((shift, f"){node.tagname}")) + node = node.next + return items + + shift = 0 + items = [] + items = show_node(self, items, shift) + return items + + def add_bullet_list(self): + """Add bulleted list ("ul" tag)""" + child = self.create_element("ul") + self.append_child(child) + return child + + def add_class(self, text): + """Set some class via CSS. Replaces complete class spec.""" + cls = self.get_attribute_value("class") + if cls is not None and text in cls: + return self + self.remove_attribute("class") + if cls is None: + cls = text + else: + cls += " " + text + self.set_attribute("class", cls) + return self + + def add_code(self, text=None): + """Add a "code" tag""" + child = self.create_element("code") + if type(text) is str: + child.append_child(self.create_text_node(text)) + prev = self.span_bottom() + if prev is None: + prev = self + prev.append_child(child) + return self + + def add_codeblock(self): + """Add monospaced lines ("pre" node)""" + child = self.create_element("pre") + self.append_child(child) + return child + + def add_description_list(self): + """Add description list ("dl" tag)""" + child = self.create_element("dl") + self.append_child(child) + return child + + def add_division(self): + """Add "div" tag""" + child = self.create_element("div") + self.append_child(child) + return child + + def add_header(self, level=1): + """Add header tag""" + if level not in range(1, 7): + raise ValueError("Header level must be in [1, 6]") + this_tag = self.tagname + new_tag = f"h{level}" + child = self.create_element(new_tag) + if this_tag not in ("h1", "h2", "h3", "h4", "h5", "h6", "p"): + self.append_child(child) + return child + self.parent.append_child(child) + return child + + def add_horizontal_line(self): + """Add horizontal line ("hr" tag)""" + child = self.create_element("hr") + self.append_child(child) + return child + + def add_image(self, name, width=None, height=None, imgfloat=None, align=None): + """Add image node (tag "img").""" + child = self.create_element("img") + if width is not None: + child.set_attribute("width", f"{width}") + if height is not None: + child.set_attribute("height", f"{height}") + if imgfloat is not None: + child.set_attribute("style", f"float: {imgfloat}") + if align is not None: + child.set_attribute("align", f"{align}") + child.set_attribute("src", f"{name}") + self.append_child(child) + return child + + def add_link(self, href, text=None): + """Add a hyperlink ("a" tag)""" + child = self.create_element("a") + if not isinstance(text, str): + text = 
href + child.set_attribute("href", href) + child.append_child(self.create_text_node(text)) + prev = self.span_bottom() + if prev is None: + prev = self + prev.append_child(child) + return self + + def add_list_item(self): + """Add item ("li" tag) under a (numbered or bulleted) list.""" + if self.tagname not in ("ol", "ul"): + raise ValueError("cannot add list item to", self.tagname) + child = self.create_element("li") + self.append_child(child) + return child + + def add_number_list(self, start=1, numtype=None): + """Add numbered list ("ol" tag)""" + child = self.create_element("ol") + if start > 1: + child.set_attribute("start", str(start)) + if numtype is not None: + child.set_attribute("type", numtype) + self.append_child(child) + return child + + def add_paragraph(self): + """Add "p" tag""" + child = self.create_element("p") + if self.tagname != "p": + self.append_child(child) + else: + self.parent.append_child(child) + return child + + def add_span(self): + child = self.create_element("span") + self.append_child(child) + return child + + def add_style(self, text): + """Set some style via CSS style. Replaces complete style spec.""" + style = self.get_attribute_value("style") + if style is not None and text in style: + return self + self.remove_attribute("style") + if style is None: + style = text + else: + style += ";" + text + self.set_attribute("style", style) + return self + + def add_subscript(self, text=None): + """Add a subscript ("sub" tag)""" + child = self.create_element("sub") + if type(text) is str: + child.append_child(self.create_text_node(text)) + prev = self.span_bottom() + if prev is None: + prev = self + prev.append_child(child) + return self + + def add_superscript(self, text=None): + """Add a superscript ("sup" tag)""" + child = self.create_element("sup") + if type(text) is str: + child.append_child(self.create_text_node(text)) + prev = self.span_bottom() + if prev is None: + prev = self + prev.append_child(child) + return self + + def add_text(self, text): + """Add text. 
Line breaks are honored.""" + lines = text.splitlines() + line_count = len(lines) + prev = self.span_bottom() + if prev is None: + prev = self + + for i, line in enumerate(lines): + prev.append_child(self.create_text_node(line)) + if i < line_count - 1: + prev.append_child(self.create_element("br")) + return self + + def append_child( self, child): + mupdf.fz_dom_append_child( self.this, child.this) + + def append_styled_span(self, style): + span = self.create_element("span") + span.add_style(style) + prev = self.span_bottom() + if prev is None: + prev = self + prev.append_child(span) + return prev + + def bodytag( self): + return Xml( mupdf.fz_dom_body( self.this)) + + def clone( self): + ret = mupdf.fz_dom_clone( self.this) + return Xml( ret) + + @staticmethod + def color_text(color): + if type(color) is str: + return color + if type(color) is int: + return f"rgb({sRGB_to_rgb(color)})" + if type(color) in (tuple, list): + return f"rgb{tuple(color)}" + return color + + def create_element( self, tag): + return Xml( mupdf.fz_dom_create_element( self.this, tag)) + + def create_text_node( self, text): + return Xml( mupdf.fz_dom_create_text_node( self.this, text)) + + def debug(self): + """Print a list of the node tree below self.""" + items = self._get_node_tree() + for item in items: + message(" " * item[0] + item[1].replace("\n", "\\n")) + + def find( self, tag, att, match): + ret = mupdf.fz_dom_find( self.this, tag, att, match) + if ret.m_internal: + return Xml( ret) + + def find_next( self, tag, att, match): + ret = mupdf.fz_dom_find_next( self.this, tag, att, match) + if ret.m_internal: + return Xml( ret) + + @property + def first_child( self): + if mupdf.fz_xml_text( self.this): + # text node, has no child. + return + ret = mupdf.fz_dom_first_child( self) + if ret.m_internal: + return Xml( ret) + + def get_attribute_value( self, key): + assert key + return mupdf.fz_dom_attribute( self.this, key) + + def get_attributes( self): + if mupdf.fz_xml_text( self.this): + # text node, has no attributes. 
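+            # Illustrative: for an element node this method returns a dict
+            # such as {"href": "x.html"}; for a text node there is nothing
+            # to return.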
+ return + result = dict() + i = 0 + while 1: + val, key = mupdf.fz_dom_get_attribute( self.this, i) + if not val or not key: + break + result[ key] = val + i += 1 + return result + + def insert_after( self, node): + mupdf.fz_dom_insert_after( self.this, node.this) + + def insert_before( self, node): + mupdf.fz_dom_insert_before( self.this, node.this) + + def insert_text(self, text): + lines = text.splitlines() + line_count = len(lines) + for i, line in enumerate(lines): + self.append_child(self.create_text_node(line)) + if i < line_count - 1: + self.append_child(self.create_element("br")) + return self + + @property + def is_text(self): + """Check if this is a text node.""" + return self.text is not None + + @property + def last_child(self): + """Return last child node.""" + child = self.first_child + if child is None: + return None + while True: + next = child.next + if not next: + return child + child = next + + @property + def next( self): + ret = mupdf.fz_dom_next( self.this) + if ret.m_internal: + return Xml( ret) + + @property + def parent( self): + ret = mupdf.fz_dom_parent( self.this) + if ret.m_internal: + return Xml( ret) + + @property + def previous( self): + ret = mupdf.fz_dom_previous( self.this) + if ret.m_internal: + return Xml( ret) + + def remove( self): + mupdf.fz_dom_remove( self.this) + + def remove_attribute( self, key): + assert key + mupdf.fz_dom_remove_attribute( self.this, key) + + @property + def root( self): + return Xml( mupdf.fz_xml_root( self.this)) + + def set_align(self, align): + """Set text alignment via CSS style""" + text = "text-align: %s" + if isinstance( align, str): + t = align + elif align == TEXT_ALIGN_LEFT: + t = "left" + elif align == TEXT_ALIGN_CENTER: + t = "center" + elif align == TEXT_ALIGN_RIGHT: + t = "right" + elif align == TEXT_ALIGN_JUSTIFY: + t = "justify" + else: + raise ValueError(f"Unrecognised {align=}") + text = text % t + self.add_style(text) + return self + + def set_attribute( self, key, value): + assert key + mupdf.fz_dom_add_attribute( self.this, key, value) + + def set_bgcolor(self, color): + """Set background color via CSS style""" + text = f"background-color: %s" % self.color_text(color) + self.add_style(text) # does not work on span level + return self + + def set_bold(self, val=True): + """Set bold on / off via CSS style""" + if val: + val="bold" + else: + val="normal" + text = "font-weight: %s" % val + self.append_styled_span(text) + return self + + def set_color(self, color): + """Set text color via CSS style""" + text = f"color: %s" % self.color_text(color) + self.append_styled_span(text) + return self + + def set_columns(self, cols): + """Set number of text columns via CSS style""" + text = f"columns: {cols}" + self.append_styled_span(text) + return self + + def set_font(self, font): + """Set font-family name via CSS style""" + text = "font-family: %s" % font + self.append_styled_span(text) + return self + + def set_fontsize(self, fontsize): + """Set font size name via CSS style""" + if type(fontsize) is str: + px="" + else: + px="px" + text = f"font-size: {fontsize}{px}" + self.append_styled_span(text) + return self + + def set_id(self, unique): + """Set a unique id.""" + # check uniqueness + root = self.root + if root.find(None, "id", unique): + raise ValueError(f"id '{unique}' already exists") + self.set_attribute("id", unique) + return self + + def set_italic(self, val=True): + """Set italic on / off via CSS style""" + if val: + val="italic" + else: + val="normal" + text = "font-style: %s" % val + 
self.append_styled_span(text) + return self + + def set_leading(self, leading): + """Set inter-line spacing value via CSS style - block-level only.""" + text = f"-mupdf-leading: {leading}" + self.add_style(text) + return self + + def set_letter_spacing(self, spacing): + """Set inter-letter spacing value via CSS style""" + text = f"letter-spacing: {spacing}" + self.append_styled_span(text) + return self + + def set_lineheight(self, lineheight): + """Set line height name via CSS style - block-level only.""" + text = f"line-height: {lineheight}" + self.add_style(text) + return self + + def set_margins(self, val): + """Set margin values via CSS style""" + text = "margins: %s" % val + self.append_styled_span(text) + return self + + def set_opacity(self, opacity): + """Set opacity via CSS style""" + text = f"opacity: {opacity}" + self.append_styled_span(text) + return self + + def set_pagebreak_after(self): + """Insert a page break after this node.""" + text = "page-break-after: always" + self.add_style(text) + return self + + def set_pagebreak_before(self): + """Insert a page break before this node.""" + text = "page-break-before: always" + self.add_style(text) + return self + + def set_properties( + self, + align=None, + bgcolor=None, + bold=None, + color=None, + columns=None, + font=None, + fontsize=None, + indent=None, + italic=None, + leading=None, + letter_spacing=None, + lineheight=None, + margins=None, + pagebreak_after=None, + pagebreak_before=None, + word_spacing=None, + unqid=None, + cls=None, + ): + """Set any or all properties of a node. + + To be used for existing nodes preferably. + """ + root = self.root + temp = root.add_division() + if align is not None: + temp.set_align(align) + if bgcolor is not None: + temp.set_bgcolor(bgcolor) + if bold is not None: + temp.set_bold(bold) + if color is not None: + temp.set_color(color) + if columns is not None: + temp.set_columns(columns) + if font is not None: + temp.set_font(font) + if fontsize is not None: + temp.set_fontsize(fontsize) + if indent is not None: + temp.set_text_indent(indent) + if italic is not None: + temp.set_italic(italic) + if leading is not None: + temp.set_leading(leading) + if letter_spacing is not None: + temp.set_letter_spacing(letter_spacing) + if lineheight is not None: + temp.set_lineheight(lineheight) + if margins is not None: + temp.set_margins(margins) + if pagebreak_after is not None: + temp.set_pagebreak_after() + if pagebreak_before is not None: + temp.set_pagebreak_before() + if word_spacing is not None: + temp.set_word_spacing(word_spacing) + if unqid is not None: + self.set_id(unqid) + if cls is not None: + self.add_class(cls) + + styles = [] + top_style = temp.get_attribute_value("style") + if top_style is not None: + styles.append(top_style) + child = temp.first_child + while child: + styles.append(child.get_attribute_value("style")) + child = child.first_child + self.set_attribute("style", ";".join(styles)) + temp.remove() + return self + + def set_text_indent(self, indent): + """Set text indentation name via CSS style - block-level only.""" + text = f"text-indent: {indent}" + self.add_style(text) + return self + + def set_underline(self, val="underline"): + text = "text-decoration: %s" % val + self.append_styled_span(text) + return self + + def set_word_spacing(self, spacing): + """Set inter-word spacing value via CSS style""" + text = f"word-spacing: {spacing}" + self.append_styled_span(text) + return self + + def span_bottom(self): + """Find deepest level in stacked spans.""" + parent = self + 
child = self.last_child + if child is None: + return None + while child.is_text: + child = child.previous + if child is None: + break + if child is None or child.tagname != "span": + return None + + while True: + if child is None: + return parent + if child.tagname in ("a", "sub","sup","body") or child.is_text: + child = child.next + continue + if child.tagname == "span": + parent = child + child = child.first_child + else: + return parent + + @property + def tagname( self): + return mupdf.fz_xml_tag( self.this) + + @property + def text( self): + return mupdf.fz_xml_text( self.this) + + add_var = add_code + add_samp = add_code + add_kbd = add_code + + +class Colorspace: + + def __init__(self, type_): + """Supported are GRAY, RGB and CMYK.""" + if isinstance( type_, mupdf.FzColorspace): + self.this = type_ + elif type_ == CS_GRAY: + self.this = mupdf.FzColorspace(mupdf.FzColorspace.Fixed_GRAY) + elif type_ == CS_CMYK: + self.this = mupdf.FzColorspace(mupdf.FzColorspace.Fixed_CMYK) + elif type_ == CS_RGB: + self.this = mupdf.FzColorspace(mupdf.FzColorspace.Fixed_RGB) + else: + self.this = mupdf.FzColorspace(mupdf.FzColorspace.Fixed_RGB) + + def __repr__(self): + x = ("", "GRAY", "", "RGB", "CMYK")[self.n] + return "Colorspace(CS_%s) - %s" % (x, self.name) + + def _name(self): + return mupdf.fz_colorspace_name(self.this) + + @property + def n(self): + """Size of one pixel.""" + return mupdf.fz_colorspace_n(self.this) + + @property + def name(self): + """Name of the Colorspace.""" + return self._name() + + +class DeviceWrapper: + def __init__(self, *args): + if args_match( args, mupdf.FzDevice): + device, = args + self.this = device + elif args_match( args, Pixmap, None): + pm, clip = args + bbox = JM_irect_from_py( clip) + if mupdf.fz_is_infinite_irect( bbox): + self.this = mupdf.fz_new_draw_device( mupdf.FzMatrix(), pm) + else: + self.this = mupdf.fz_new_draw_device_with_bbox( mupdf.FzMatrix(), pm, bbox) + elif args_match( args, mupdf.FzDisplayList): + dl, = args + self.this = mupdf.fz_new_list_device( dl) + elif args_match( args, mupdf.FzStextPage, None): + tp, flags = args + opts = mupdf.FzStextOptions( flags) + self.this = mupdf.fz_new_stext_device( tp, opts) + else: + raise Exception( f'Unrecognised args for DeviceWrapper: {args!r}') + + +class DisplayList: + def __del__(self): + if not type(self) is DisplayList: return + self.thisown = False + + def __init__(self, *args): + if len(args) == 1 and isinstance(args[0], mupdf.FzRect): + self.this = mupdf.FzDisplayList(args[0]) + elif len(args) == 1 and isinstance(args[0], mupdf.FzDisplayList): + self.this = args[0] + else: + assert 0, f'Unrecognised {args=}' + + def get_pixmap(self, matrix=None, colorspace=None, alpha=0, clip=None): + if isinstance(colorspace, Colorspace): + colorspace = colorspace.this + else: + colorspace = mupdf.FzColorspace(mupdf.FzColorspace.Fixed_RGB) + val = JM_pixmap_from_display_list(self.this, matrix, colorspace, alpha, clip, None) + val.thisown = True + return val + + def get_textpage(self, flags=3): + """Make a TextPage from a DisplayList.""" + stext_options = mupdf.FzStextOptions() + stext_options.flags = flags + val = mupdf.FzStextPage(self.this, stext_options) + val.thisown = True + return val + + @property + def rect(self): + val = JM_py_from_rect(mupdf.fz_bound_display_list(self.this)) + val = Rect(val) + return val + + def run(self, dw, m, area): + mupdf.fz_run_display_list( + self.this, + dw.device, + JM_matrix_from_py(m), + JM_rect_from_py(area), + mupdf.FzCookie(), + ) + +if g_use_extra: + 
extra_FzDocument_insert_pdf = extra.FzDocument_insert_pdf + + +class Document: + + def __contains__(self, loc) -> bool: + if type(loc) is int: + if loc < self.page_count: + return True + return False + if type(loc) not in (tuple, list) or len(loc) != 2: + return False + chapter, pno = loc + if (0 + or not isinstance(chapter, int) + or chapter < 0 + or chapter >= self.chapter_count + ): + return False + if (0 + or not isinstance(pno, int) + or pno < 0 + or pno >= self.chapter_page_count(chapter) + ): + return False + return True + + def __delitem__(self, i)->None: + if not self.is_pdf: + raise ValueError("is no PDF") + if type(i) is int: + return self.delete_page(i) + if type(i) in (list, tuple, range): + return self.delete_pages(i) + if type(i) is not slice: + raise ValueError("bad argument type") + pc = self.page_count + start = i.start if i.start else 0 + stop = i.stop if i.stop else pc + step = i.step if i.step else 1 + while start < 0: + start += pc + if start >= pc: + raise ValueError("bad page number(s)") + while stop < 0: + stop += pc + if stop > pc: + raise ValueError("bad page number(s)") + return self.delete_pages(range(start, stop, step)) + + def __enter__(self): + return self + + def __exit__(self, *args): + self.close() + + @typing.overload + def __getitem__(self, i: int = 0) -> Page: + ... + + if sys.version_info >= (3, 9): + @typing.overload + def __getitem__(self, i: slice) -> list[Page]: + ... + + @typing.overload + def __getitem__(self, i: tuple[int, int]) -> Page: + ... + + def __getitem__(self, i=0): + if isinstance(i, slice): + return [self[j] for j in range(*i.indices(len(self)))] + assert isinstance(i, int) or (isinstance(i, tuple) and len(i) == 2 and all(isinstance(x, int) for x in i)), \ + f'Invalid item number: {i=}.' + if i not in self: + raise IndexError(f"page {i} not in document") + return self.load_page(i) + + def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0, height=0, fontsize=11): + """Creates a document. Use 'open' as a synonym. + + Notes: + Basic usages: + open() - new PDF document + open(filename) - string or pathlib.Path, must have supported + file extension. + open(type, buffer) - type: valid extension, buffer: bytes object. + open(stream=buffer, filetype=type) - keyword version of previous. + open(filename, fileype=type) - filename with unrecognized extension. + rect, width, height, fontsize: layout reflowable document + on open (e.g. EPUB). Ignored if n/a. + """ + # We temporarily set JM_mupdf_show_errors=0 while we are constructing, + # then restore its original value in a `finally:` block. + # + global JM_mupdf_show_errors + JM_mupdf_show_errors_old = JM_mupdf_show_errors + JM_mupdf_show_errors = 0 + + try: + self.is_closed = False + self.is_encrypted = False + self.is_encrypted = False + self.metadata = None + self.FontInfos = [] + self.Graftmaps = {} + self.ShownPages = {} + self.InsertedImages = {} + self._page_refs = weakref.WeakValueDictionary() + if isinstance(filename, mupdf.PdfDocument): + pdf_document = filename + self.this = pdf_document + self.this_is_pdf = True + return + + w = width + h = height + r = JM_rect_from_py(rect) + if not mupdf.fz_is_infinite_rect(r): + w = r.x1 - r.x0 + h = r.y1 - r.y0 + + self._name = filename + self.stream = stream + + if stream is not None: + if filename is not None and filetype is None: + # 2025-05-06: Use as the filetype. This is + # reversing precedence - we used to use if both + # were set. 
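+                    # Illustrative example (argument values are made up):
+                    # pymupdf.open("input.svg", stream=data) now passes
+                    # "input.svg" on as the filetype hint, because no
+                    # explicit filetype was given.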
+                    filetype = filename
+                if isinstance(stream, (bytes, memoryview)):
+                    pass
+                elif isinstance(stream, bytearray):
+                    stream = bytes(stream)
+                elif isinstance(stream, io.BytesIO):
+                    stream = stream.getvalue()
+                else:
+                    raise TypeError(f"bad stream: {type(stream)=}.")
+                self.stream = stream
+
+                assert isinstance(stream, (bytes, memoryview))
+                if len(stream) == 0:
+                    # MuPDF raises an exception for this but also generates
+                    # warnings, which is not very helpful for us. So instead we
+                    # raise a specific exception.
+                    raise EmptyFileError('Cannot open empty stream.')
+
+                stream2 = mupdf.fz_open_memory(mupdf.python_buffer_data(stream), len(stream))
+                try:
+                    doc = mupdf.fz_open_document_with_stream(filetype if filetype else '', stream2)
+                except Exception as e:
+                    if g_exceptions_verbose > 1: exception_info()
+                    raise FileDataError('Failed to open stream') from e
+
+            elif filename:
+                assert not stream
+                if isinstance(filename, str):
+                    pass
+                elif hasattr(filename, "absolute"):
+                    filename = str(filename)
+                elif hasattr(filename, "name"):
+                    filename = filename.name
+                else:
+                    raise TypeError(f"bad filename: {type(filename)=} {filename=}.")
+                self._name = filename
+
+                # Generate our own specific exceptions. This avoids MuPDF
+                # generating warnings etc.
+                if not os.path.exists(filename):
+                    raise FileNotFoundError(f"no such file: '{filename}'")
+                elif not os.path.isfile(filename):
+                    raise FileDataError(f"'{filename}' is no file")
+                elif os.path.getsize(filename) == 0:
+                    raise EmptyFileError(f'Cannot open empty file: {filename=}.')
+
+                if filetype:
+                    # Override the type implied by the filename. MuPDF does not
+                    # have a way to do this directly so we open via a stream.
+                    try:
+                        fz_stream = mupdf.fz_open_file(filename)
+                        doc = mupdf.fz_open_document_with_stream(filetype, fz_stream)
+                    except Exception as e:
+                        if g_exceptions_verbose > 1: exception_info()
+                        raise FileDataError(f'Failed to open file {filename!r} as type {filetype!r}.') from e
+                else:
+                    try:
+                        doc = mupdf.fz_open_document(filename)
+                    except Exception as e:
+                        if g_exceptions_verbose > 1: exception_info()
+                        raise FileDataError(f'Failed to open file {filename!r}.') from e
+
+            else:
+                pdf = mupdf.PdfDocument()
+                doc = mupdf.FzDocument(pdf)
+
+            if w > 0 and h > 0:
+                mupdf.fz_layout_document(doc, w, h, fontsize)
+            elif mupdf.fz_is_document_reflowable(doc):
+                mupdf.fz_layout_document(doc, 400, 600, 11)
+
+            self.this = doc
+
+            # fixme: not sure where self.thisown gets initialised in PyMuPDF.
+            #
+            self.thisown = True
+
+            if self.thisown:
+                self._graft_id = TOOLS.gen_id()
+                if self.needs_pass:
+                    self.is_encrypted = True
+                else: # we won't init until doc is decrypted
+                    self.init_doc()
+                # the following hack detects invalid/empty SVG files, which else may lead
+                # to interpreter crashes
+                if filename and filename.lower().endswith("svg") or filetype and "svg" in filetype.lower():
+                    try:
+                        _ = self.convert_to_pdf() # this seems to always work
+                    except Exception as e:
+                        if g_exceptions_verbose > 1: exception_info()
+                        raise FileDataError("cannot open broken document") from e
+
+            if g_use_extra:
+                self.this_is_pdf = isinstance( self.this, mupdf.PdfDocument)
+                if self.this_is_pdf:
+                    self.page_count2 = extra.page_count_pdf
+                else:
+                    self.page_count2 = extra.page_count_fz
+        finally:
+            JM_mupdf_show_errors = JM_mupdf_show_errors_old
+
+    def __len__(self) -> int:
+        return self.page_count
+
+    def __repr__(self) -> str:
+        m = "closed " if self.is_closed else ""
+        if self.stream is None:
+            if self.name == "":
+                return m + "Document(<new PDF, doc# %i>)" % self._graft_id
+            return m + "Document('%s')" % (self.name,)
+        return m + "Document('%s', <memory, doc# %i>)" % (self.name, self._graft_id)
+
+    def _addFormFont(self, name, font):
+        """Add new form font."""
+        if self.is_closed or self.is_encrypted:
+            raise ValueError("document closed or encrypted")
+        pdf = _as_pdf_document(self, required=0)
+        if not pdf.m_internal:
+            return
+        fonts = mupdf.pdf_dict_getl(
+                mupdf.pdf_trailer( pdf),
+                PDF_NAME('Root'),
+                PDF_NAME('AcroForm'),
+                PDF_NAME('DR'),
+                PDF_NAME('Font'),
+                )
+        if not fonts.m_internal or not mupdf.pdf_is_dict( fonts):
+            raise RuntimeError( "PDF has no form fonts yet")
+        k = mupdf.pdf_new_name( name)
+        v = JM_pdf_obj_from_str( pdf, font)
+        mupdf.pdf_dict_put( fonts, k, v)
+
+    def _delToC(self):
+        """Delete the TOC."""
+        if self.is_closed or self.is_encrypted:
+            raise ValueError("document closed or encrypted")
+        xrefs = [] # create Python list
+        pdf = _as_pdf_document(self, required=0)
+        if not pdf.m_internal:
+            return xrefs # not a pdf
+        # get the main root
+        root = mupdf.pdf_dict_get(mupdf.pdf_trailer(pdf), PDF_NAME('Root'))
+        # get the outline root
+        olroot = mupdf.pdf_dict_get(root, PDF_NAME('Outlines'))
+        if not olroot.m_internal:
+            return xrefs # no outlines or some problem
+
+        first = mupdf.pdf_dict_get(olroot, PDF_NAME('First')) # first outline
+
+        xrefs = JM_outline_xrefs(first, xrefs)
+        xref_count = len(xrefs)
+
+        olroot_xref = mupdf.pdf_to_num(olroot) # delete OL root
+        mupdf.pdf_delete_object(pdf, olroot_xref) # delete OL root
+        mupdf.pdf_dict_del(root, PDF_NAME('Outlines')) # delete OL root
+
+        for i in range(xref_count):
+            _, xref = JM_INT_ITEM(xrefs, i)
+            mupdf.pdf_delete_object(pdf, xref) # delete outline item
+        xrefs.append(olroot_xref)
+        val = xrefs
+        self.init_doc()
+        return val
+
+    def _delete_page(self, pno):
+        pdf = _as_pdf_document(self)
+        mupdf.pdf_delete_page( pdf, pno)
+        if pdf.m_internal.rev_page_map:
+            mupdf.ll_pdf_drop_page_tree( pdf.m_internal)
+
+    def _deleteObject(self, xref):
+        """Delete object."""
+        pdf = _as_pdf_document(self)
+        if not _INRANGE(xref, 1, mupdf.pdf_xref_len(pdf)-1):
+            raise ValueError( MSG_BAD_XREF)
+        mupdf.pdf_delete_object(pdf, xref)
+
+    def _embeddedFileGet(self, idx):
+        pdf = _as_pdf_document(self)
+        names = mupdf.pdf_dict_getl(
+                mupdf.pdf_trailer(pdf),
+                PDF_NAME('Root'),
+                PDF_NAME('Names'),
+                PDF_NAME('EmbeddedFiles'),
+                PDF_NAME('Names'),
+                )
+        entry = mupdf.pdf_array_get(names, 2*idx+1)
+        filespec = mupdf.pdf_dict_getl(entry, PDF_NAME('EF'), PDF_NAME('F'))
+        buf = 
mupdf.pdf_load_stream(filespec) + cont = JM_BinFromBuffer(buf) + return cont + + def _embeddedFileIndex(self, item: typing.Union[int, str]) -> int: + filenames = self.embfile_names() + msg = "'%s' not in EmbeddedFiles array." % str(item) + if item in filenames: + idx = filenames.index(item) + elif item in range(len(filenames)): + idx = item + else: + raise ValueError(msg) + return idx + + def _embfile_add(self, name, buffer_, filename=None, ufilename=None, desc=None): + pdf = _as_pdf_document(self) + data = JM_BufferFromBytes(buffer_) + if not data.m_internal: + raise TypeError( MSG_BAD_BUFFER) + + names = mupdf.pdf_dict_getl( + mupdf.pdf_trailer(pdf), + PDF_NAME('Root'), + PDF_NAME('Names'), + PDF_NAME('EmbeddedFiles'), + PDF_NAME('Names'), + ) + if not mupdf.pdf_is_array(names): + root = mupdf.pdf_dict_get(mupdf.pdf_trailer(pdf), PDF_NAME('Root')) + names = mupdf.pdf_new_array(pdf, 6) # an even number! + mupdf.pdf_dict_putl( + root, + names, + PDF_NAME('Names'), + PDF_NAME('EmbeddedFiles'), + PDF_NAME('Names'), + ) + fileentry = JM_embed_file(pdf, data, filename, ufilename, desc, 1) + xref = mupdf.pdf_to_num( + mupdf.pdf_dict_getl(fileentry, PDF_NAME('EF'), PDF_NAME('F')) + ) + mupdf.pdf_array_push(names, mupdf.pdf_new_text_string(name)) + mupdf.pdf_array_push(names, fileentry) + return xref + + def _embfile_del(self, idx): + pdf = _as_pdf_document(self) + names = mupdf.pdf_dict_getl( + mupdf.pdf_trailer(pdf), + PDF_NAME('Root'), + PDF_NAME('Names'), + PDF_NAME('EmbeddedFiles'), + PDF_NAME('Names'), + ) + mupdf.pdf_array_delete(names, idx + 1) + mupdf.pdf_array_delete(names, idx) + + def _embfile_info(self, idx, infodict): + pdf = _as_pdf_document(self) + xref = 0 + ci_xref=0 + + trailer = mupdf.pdf_trailer(pdf) + + names = mupdf.pdf_dict_getl( + trailer, + PDF_NAME('Root'), + PDF_NAME('Names'), + PDF_NAME('EmbeddedFiles'), + PDF_NAME('Names'), + ) + o = mupdf.pdf_array_get(names, 2*idx+1) + ci = mupdf.pdf_dict_get(o, PDF_NAME('CI')) + if ci.m_internal: + ci_xref = mupdf.pdf_to_num(ci) + infodict["collection"] = ci_xref + name = mupdf.pdf_to_text_string(mupdf.pdf_dict_get(o, PDF_NAME('F'))) + infodict[dictkey_filename] = JM_EscapeStrFromStr(name) + + name = mupdf.pdf_to_text_string(mupdf.pdf_dict_get(o, PDF_NAME('UF'))) + infodict[dictkey_ufilename] = JM_EscapeStrFromStr(name) + + name = mupdf.pdf_to_text_string(mupdf.pdf_dict_get(o, PDF_NAME('Desc'))) + infodict[dictkey_descr] = JM_UnicodeFromStr(name) + + len_ = -1 + DL = -1 + fileentry = mupdf.pdf_dict_getl(o, PDF_NAME('EF'), PDF_NAME('F')) + xref = mupdf.pdf_to_num(fileentry) + o = mupdf.pdf_dict_get(fileentry, PDF_NAME('Length')) + if o.m_internal: + len_ = mupdf.pdf_to_int(o) + + o = mupdf.pdf_dict_get(fileentry, PDF_NAME('DL')) + if o.m_internal: + DL = mupdf.pdf_to_int(o) + else: + o = mupdf.pdf_dict_getl(fileentry, PDF_NAME('Params'), PDF_NAME('Size')) + if o.m_internal: + DL = mupdf.pdf_to_int(o) + infodict[dictkey_size] = DL + infodict[dictkey_length] = len_ + return xref + + def _embfile_names(self, namelist): + """Get list of embedded file names.""" + pdf = _as_pdf_document(self) + names = mupdf.pdf_dict_getl( + mupdf.pdf_trailer(pdf), + PDF_NAME('Root'), + PDF_NAME('Names'), + PDF_NAME('EmbeddedFiles'), + PDF_NAME('Names'), + ) + if mupdf.pdf_is_array(names): + n = mupdf.pdf_array_len(names) + for i in range(0, n, 2): + val = JM_EscapeStrFromStr( + mupdf.pdf_to_text_string( + mupdf.pdf_array_get(names, i) + ) + ) + namelist.append(val) + + def _embfile_upd(self, idx, buffer_=None, filename=None, ufilename=None, desc=None): 
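+        # Update the embedded file at index `idx`: new content comes from
+        # `buffer_`, new names / description from the remaining arguments.
+        # Returns the xref of the file's stream object.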
+ pdf = _as_pdf_document(self) + xref = 0 + names = mupdf.pdf_dict_getl( + mupdf.pdf_trailer(pdf), + PDF_NAME('Root'), + PDF_NAME('Names'), + PDF_NAME('EmbeddedFiles'), + PDF_NAME('Names'), + ) + entry = mupdf.pdf_array_get(names, 2*idx+1) + + filespec = mupdf.pdf_dict_getl(entry, PDF_NAME('EF'), PDF_NAME('F')) + if not filespec.m_internal: + RAISEPY( "bad PDF: no /EF object", JM_Exc_FileDataError) + res = JM_BufferFromBytes(buffer_) + if buffer_ and buffer_.m_internal and not res.m_internal: + raise TypeError( MSG_BAD_BUFFER) + if res.m_internal and buffer_ and buffer_.m_internal: + JM_update_stream(pdf, filespec, res, 1) + # adjust /DL and /Size parameters + len, _ = mupdf.fz_buffer_storage(res) + l = mupdf.pdf_new_int(len) + mupdf.pdf_dict_put(filespec, PDF_NAME('DL'), l) + mupdf.pdf_dict_putl(filespec, l, PDF_NAME('Params'), PDF_NAME('Size')) + xref = mupdf.pdf_to_num(filespec) + if filename: + mupdf.pdf_dict_put_text_string(entry, PDF_NAME('F'), filename) + + if ufilename: + mupdf.pdf_dict_put_text_string(entry, PDF_NAME('UF'), ufilename) + + if desc: + mupdf.pdf_dict_put_text_string(entry, PDF_NAME('Desc'), desc) + return xref + + def _extend_toc_items(self, items): + """Add color info to all items of an extended TOC list.""" + if self.is_closed: + raise ValueError("document closed") + if g_use_extra: + return extra.Document_extend_toc_items( self.this, items) + pdf = _as_pdf_document(self) + zoom = "zoom" + bold = "bold" + italic = "italic" + collapse = "collapse" + + root = mupdf.pdf_dict_get(mupdf.pdf_trailer(pdf), PDF_NAME('Root')) + if not root.m_internal: + return + olroot = mupdf.pdf_dict_get(root, PDF_NAME('Outlines')) + if not olroot.m_internal: + return + first = mupdf.pdf_dict_get(olroot, PDF_NAME('First')) + if not first.m_internal: + return + xrefs = [] + xrefs = JM_outline_xrefs(first, xrefs) + n = len(xrefs) + m = len(items) + if not n: + return + if n != m: + raise IndexError( "internal error finding outline xrefs") + + # update all TOC item dictionaries + for i in range(n): + xref = int(xrefs[i]) + item = items[i] + itemdict = item[3] + if not isinstance(itemdict, dict): + raise ValueError( "need non-simple TOC format") + itemdict[dictkey_xref] = xrefs[i] + bm = mupdf.pdf_load_object(pdf, xref) + flags = mupdf.pdf_to_int( mupdf.pdf_dict_get(bm, PDF_NAME('F'))) + if flags == 1: + itemdict[italic] = True + elif flags == 2: + itemdict[bold] = True + elif flags == 3: + itemdict[italic] = True + itemdict[bold] = True + count = mupdf.pdf_to_int( mupdf.pdf_dict_get(bm, PDF_NAME('Count'))) + if count < 0: + itemdict[collapse] = True + elif count > 0: + itemdict[collapse] = False + col = mupdf.pdf_dict_get(bm, PDF_NAME('C')) + if mupdf.pdf_is_array(col) and mupdf.pdf_array_len(col) == 3: + color = ( + mupdf.pdf_to_real(mupdf.pdf_array_get(col, 0)), + mupdf.pdf_to_real(mupdf.pdf_array_get(col, 1)), + mupdf.pdf_to_real(mupdf.pdf_array_get(col, 2)), + ) + itemdict[dictkey_color] = color + z=0 + obj = mupdf.pdf_dict_get(bm, PDF_NAME('Dest')) + if not obj.m_internal or not mupdf.pdf_is_array(obj): + obj = mupdf.pdf_dict_getl(bm, PDF_NAME('A'), PDF_NAME('D')) + if mupdf.pdf_is_array(obj) and mupdf.pdf_array_len(obj) == 5: + z = mupdf.pdf_to_real(mupdf.pdf_array_get(obj, 4)) + itemdict[zoom] = float(z) + item[3] = itemdict + items[i] = item + + def _forget_page(self, page: Page): + """Remove a page from document page dict.""" + pid = id(page) + if pid in self._page_refs: + #self._page_refs[pid] = None + del self._page_refs[pid] + + def _get_char_widths(self, xref: int, bfname: str, 
ext: str, ordering: int, limit: int, idx: int = 0): + pdf = _as_pdf_document(self) + mylimit = limit + if mylimit < 256: + mylimit = 256 + if ordering >= 0: + data, size, index = mupdf.fz_lookup_cjk_font(ordering) + font = mupdf.fz_new_font_from_memory(None, data, size, index, 0) + else: + data, size = mupdf.fz_lookup_base14_font(bfname) + if data: + font = mupdf.fz_new_font_from_memory(bfname, data, size, 0, 0) + else: + buf = JM_get_fontbuffer(pdf, xref) + if not buf.m_internal: + raise Exception("font at xref %d is not supported" % xref) + + font = mupdf.fz_new_font_from_buffer(None, buf, idx, 0) + wlist = [] + for i in range(mylimit): + glyph = mupdf.fz_encode_character(font, i) + adv = mupdf.fz_advance_glyph(font, glyph, 0) + if ordering >= 0: + glyph = i + if glyph > 0: + wlist.append( (glyph, adv)) + else: + wlist.append( (glyph, 0.0)) + return wlist + + def _get_page_labels(self): + pdf = _as_pdf_document(self) + rc = [] + pagelabels = mupdf.pdf_new_name("PageLabels") + obj = mupdf.pdf_dict_getl( mupdf.pdf_trailer(pdf), PDF_NAME('Root'), pagelabels) + if not obj.m_internal: + return rc + # simple case: direct /Nums object + nums = mupdf.pdf_resolve_indirect( mupdf.pdf_dict_get( obj, PDF_NAME('Nums'))) + if nums.m_internal: + JM_get_page_labels(rc, nums) + return rc + # case: /Kids/Nums + nums = mupdf.pdf_resolve_indirect( mupdf.pdf_dict_getl(obj, PDF_NAME('Kids'), PDF_NAME('Nums'))) + if nums.m_internal: + JM_get_page_labels(rc, nums) + return rc + # case: /Kids is an array of multiple /Nums + kids = mupdf.pdf_resolve_indirect( mupdf.pdf_dict_get( obj, PDF_NAME('Kids'))) + if not kids.m_internal or not mupdf.pdf_is_array(kids): + return rc + n = mupdf.pdf_array_len(kids) + for i in range(n): + nums = mupdf.pdf_resolve_indirect( + mupdf.pdf_dict_get( + mupdf.pdf_array_get(kids, i), + PDF_NAME('Nums'), + ) + ) + JM_get_page_labels(rc, nums) + return rc + + def _getMetadata(self, key): + """Get metadata.""" + try: + return mupdf.fz_lookup_metadata2( self.this, key) + except Exception: + if g_exceptions_verbose > 2: exception_info() + return '' + + def _getOLRootNumber(self): + """Get xref of Outline Root, create it if missing.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + # get main root + root = mupdf.pdf_dict_get( mupdf.pdf_trailer( pdf), PDF_NAME('Root')) + # get outline root + olroot = mupdf.pdf_dict_get( root, PDF_NAME('Outlines')) + if not olroot.m_internal: + olroot = mupdf.pdf_new_dict( pdf, 4) + mupdf.pdf_dict_put( olroot, PDF_NAME('Type'), PDF_NAME('Outlines')) + ind_obj = mupdf.pdf_add_object( pdf, olroot) + mupdf.pdf_dict_put( root, PDF_NAME('Outlines'), ind_obj) + olroot = mupdf.pdf_dict_get( root, PDF_NAME('Outlines')) + return mupdf.pdf_to_num( olroot) + + def _getPDFfileid(self): + """Get PDF file id.""" + pdf = _as_pdf_document(self, required=0) + if not pdf.m_internal: + return + idlist = [] + identity = mupdf.pdf_dict_get(mupdf.pdf_trailer(pdf), PDF_NAME('ID')) + if identity.m_internal: + n = mupdf.pdf_array_len(identity) + for i in range(n): + o = mupdf.pdf_array_get(identity, i) + text = mupdf.pdf_to_text_string(o) + hex_ = binascii.hexlify(text) + idlist.append(hex_) + return idlist + + def _getPageInfo(self, pno, what): + """List fonts, images, XObjects used on a page.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + doc = self.this + pageCount = mupdf.pdf_count_pages(doc) if isinstance(doc, mupdf.PdfDocument) else 
mupdf.fz_count_pages(doc) + n = pno # pno < 0 is allowed + while n < 0: + n += pageCount # make it non-negative + if n >= pageCount: + raise ValueError( MSG_BAD_PAGENO) + pdf = _as_pdf_document(self) + pageref = mupdf.pdf_lookup_page_obj(pdf, n) + rsrc = mupdf.pdf_dict_get_inheritable(pageref, mupdf.PDF_ENUM_NAME_Resources) + liste = [] + tracer = [] + if rsrc.m_internal: + JM_scan_resources(pdf, rsrc, liste, what, 0, tracer) + return liste + + def _insert_font(self, fontfile=None, fontbuffer=None): + ''' + Utility: insert font from file or binary. + ''' + pdf = _as_pdf_document(self) + if not fontfile and not fontbuffer: + raise ValueError( MSG_FILE_OR_BUFFER) + value = JM_insert_font(pdf, None, fontfile, fontbuffer, 0, 0, 0, 0, 0, -1) + return value + + def _loadOutline(self): + """Load first outline.""" + doc = self.this + assert isinstance( doc, mupdf.FzDocument) + try: + ol = mupdf.fz_load_outline( doc) + except Exception: + if g_exceptions_verbose > 1: exception_info() + return + return Outline( ol) + + def _make_page_map(self): + """Make an array page number -> page object.""" + if self.is_closed: + raise ValueError("document closed") + assert 0, f'_make_page_map() is no-op' + + def _move_copy_page(self, pno, nb, before, copy): + """Move or copy a PDF page reference.""" + pdf = _as_pdf_document(self) + same = 0 + # get the two page objects ----------------------------------- + # locate the /Kids arrays and indices in each + + page1, parent1, i1 = pdf_lookup_page_loc( pdf, pno) + + kids1 = mupdf.pdf_dict_get( parent1, PDF_NAME('Kids')) + + page2, parent2, i2 = pdf_lookup_page_loc( pdf, nb) + kids2 = mupdf.pdf_dict_get( parent2, PDF_NAME('Kids')) + if before: # calc index of source page in target /Kids + pos = i2 + else: + pos = i2 + 1 + + # same /Kids array? 
------------------------------------------ + same = mupdf.pdf_objcmp( kids1, kids2) + + # put source page in target /Kids array ---------------------- + if not copy and same != 0: # update parent in page object + mupdf.pdf_dict_put( page1, PDF_NAME('Parent'), parent2) + mupdf.pdf_array_insert( kids2, page1, pos) + + if same != 0: # different /Kids arrays ---------------------- + parent = parent2 + while parent.m_internal: # increase /Count objects in parents + count = mupdf.pdf_dict_get_int( parent, PDF_NAME('Count')) + mupdf.pdf_dict_put_int( parent, PDF_NAME('Count'), count + 1) + parent = mupdf.pdf_dict_get( parent, PDF_NAME('Parent')) + if not copy: # delete original item + mupdf.pdf_array_delete( kids1, i1) + parent = parent1 + while parent.m_internal: # decrease /Count objects in parents + count = mupdf.pdf_dict_get_int( parent, PDF_NAME('Count')) + mupdf.pdf_dict_put_int( parent, PDF_NAME('Count'), count - 1) + parent = mupdf.pdf_dict_get( parent, PDF_NAME('Parent')) + else: # same /Kids array + if copy: # source page is copied + parent = parent2 + while parent.m_internal: # increase /Count object in parents + count = mupdf.pdf_dict_get_int( parent, PDF_NAME('Count')) + mupdf.pdf_dict_put_int( parent, PDF_NAME('Count'), count + 1) + parent = mupdf.pdf_dict_get( parent, PDF_NAME('Parent')) + else: + if i1 < pos: + mupdf.pdf_array_delete( kids1, i1) + else: + mupdf.pdf_array_delete( kids1, i1 + 1) + if pdf.m_internal.rev_page_map: # page map no longer valid: drop it + mupdf.ll_pdf_drop_page_tree( pdf.m_internal) + + self._reset_page_refs() + + def _newPage(self, pno=-1, width=595, height=842): + """Make a new PDF page.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if g_use_extra: + extra._newPage( self.this, pno, width, height) + else: + pdf = _as_pdf_document(self) + mediabox = mupdf.FzRect(mupdf.FzRect.Fixed_UNIT) + mediabox.x1 = width + mediabox.y1 = height + contents = mupdf.FzBuffer() + if pno < -1: + raise ValueError( MSG_BAD_PAGENO) + # create /Resources and /Contents objects + #resources = pdf.add_object(pdf.new_dict(1)) + resources = mupdf.pdf_add_new_dict(pdf, 1) + page_obj = mupdf.pdf_add_page( pdf, mediabox, 0, resources, contents) + mupdf.pdf_insert_page( pdf, pno, page_obj) + # fixme: pdf->dirty = 1; + + self._reset_page_refs() + return self[pno] + + def _remove_links_to(self, numbers): + pdf = _as_pdf_document(self) + _remove_dest_range(pdf, numbers) + + def _remove_toc_item(self, xref): + # "remove" bookmark by letting it point to nowhere + pdf = _as_pdf_document(self) + item = mupdf.pdf_new_indirect(pdf, xref, 0) + mupdf.pdf_dict_del( item, PDF_NAME('Dest')) + mupdf.pdf_dict_del( item, PDF_NAME('A')) + color = mupdf.pdf_new_array( pdf, 3) + for i in range(3): + mupdf.pdf_array_push_real( color, 0.8) + mupdf.pdf_dict_put( item, PDF_NAME('C'), color) + + def _reset_page_refs(self): + """Invalidate all pages in document dictionary.""" + if getattr(self, "is_closed", True): + return + pages = [p for p in self._page_refs.values()] + for page in pages: + if page: + page._erase() + page = None + self._page_refs.clear() + + def _set_page_labels(self, labels): + pdf = _as_pdf_document(self) + pagelabels = mupdf.pdf_new_name("PageLabels") + root = mupdf.pdf_dict_get(mupdf.pdf_trailer(pdf), PDF_NAME('Root')) + mupdf.pdf_dict_del(root, pagelabels) + mupdf.pdf_dict_putl(root, mupdf.pdf_new_array(pdf, 0), pagelabels, PDF_NAME('Nums')) + + xref = self.pdf_catalog() + text = self.xref_object(xref, compressed=True) + text = 
text.replace("/Nums[]", "/Nums[%s]" % labels) + self.update_object(xref, text) + + def _update_toc_item(self, xref, action=None, title=None, flags=0, collapse=None, color=None): + ''' + "update" bookmark by letting it point to nowhere + ''' + pdf = _as_pdf_document(self) + item = mupdf.pdf_new_indirect( pdf, xref, 0) + if title: + mupdf.pdf_dict_put_text_string( item, PDF_NAME('Title'), title) + if action: + mupdf.pdf_dict_del( item, PDF_NAME('Dest')) + obj = JM_pdf_obj_from_str( pdf, action) + mupdf.pdf_dict_put( item, PDF_NAME('A'), obj) + mupdf.pdf_dict_put_int( item, PDF_NAME('F'), flags) + if color: + c = mupdf.pdf_new_array( pdf, 3) + for i in range(3): + f = color[i] + mupdf.pdf_array_push_real( c, f) + mupdf.pdf_dict_put( item, PDF_NAME('C'), c) + elif color is not None: + mupdf.pdf_dict_del( item, PDF_NAME('C')) + if collapse is not None: + if mupdf.pdf_dict_get( item, PDF_NAME('Count')).m_internal: + i = mupdf.pdf_dict_get_int( item, PDF_NAME('Count')) + if (i < 0 and collapse is False) or (i > 0 and collapse is True): + i = i * (-1) + mupdf.pdf_dict_put_int( item, PDF_NAME('Count'), i) + + @property + def FormFonts(self): + """Get list of field font resource names.""" + pdf = _as_pdf_document(self, required=0) + if not pdf.m_internal: + return + fonts = mupdf.pdf_dict_getl( + mupdf.pdf_trailer(pdf), + PDF_NAME('Root'), + PDF_NAME('AcroForm'), + PDF_NAME('DR'), + PDF_NAME('Font'), + ) + liste = list() + if fonts.m_internal and mupdf.pdf_is_dict(fonts): # fonts exist + n = mupdf.pdf_dict_len(fonts) + for i in range(n): + f = mupdf.pdf_dict_get_key(fonts, i) + liste.append(JM_UnicodeFromStr(mupdf.pdf_to_name(f))) + return liste + + def add_layer(self, name, creator=None, on=None): + """Add a new OC layer.""" + pdf = _as_pdf_document(self) + JM_add_layer_config( pdf, name, creator, on) + mupdf.ll_pdf_read_ocg( pdf.m_internal) + + def add_ocg(self, name, config=-1, on=1, intent=None, usage=None): + """Add new optional content group.""" + xref = 0 + pdf = _as_pdf_document(self) + + # make the OCG + ocg = mupdf.pdf_add_new_dict(pdf, 3) + mupdf.pdf_dict_put(ocg, PDF_NAME('Type'), PDF_NAME('OCG')) + mupdf.pdf_dict_put_text_string(ocg, PDF_NAME('Name'), name) + intents = mupdf.pdf_dict_put_array(ocg, PDF_NAME('Intent'), 2) + if not intent: + mupdf.pdf_array_push(intents, PDF_NAME('View')) + elif not isinstance(intent, str): + assert 0, f'fixme: intent is not a str. 
{type(intent)=} {type=}' + #n = len(intent) + #for i in range(n): + # item = intent[i] + # c = JM_StrAsChar(item); + # if (c) { + # pdf_array_push(gctx, intents, pdf_new_name(gctx, c)); + # } + # Py_DECREF(item); + #} + else: + mupdf.pdf_array_push(intents, mupdf.pdf_new_name(intent)) + use_for = mupdf.pdf_dict_put_dict(ocg, PDF_NAME('Usage'), 3) + ci_name = mupdf.pdf_new_name("CreatorInfo") + cre_info = mupdf.pdf_dict_put_dict(use_for, ci_name, 2) + mupdf.pdf_dict_put_text_string(cre_info, PDF_NAME('Creator'), "PyMuPDF") + if usage: + mupdf.pdf_dict_put_name(cre_info, PDF_NAME('Subtype'), usage) + else: + mupdf.pdf_dict_put_name(cre_info, PDF_NAME('Subtype'), "Artwork") + indocg = mupdf.pdf_add_object(pdf, ocg) + + # Insert OCG in the right config + ocp = JM_ensure_ocproperties(pdf) + obj = mupdf.pdf_dict_get(ocp, PDF_NAME('OCGs')) + mupdf.pdf_array_push(obj, indocg) + + if config > -1: + obj = mupdf.pdf_dict_get(ocp, PDF_NAME('Configs')) + if not mupdf.pdf_is_array(obj): + raise ValueError( MSG_BAD_OC_CONFIG) + cfg = mupdf.pdf_array_get(obj, config) + if not cfg.m_internal: + raise ValueError( MSG_BAD_OC_CONFIG) + else: + cfg = mupdf.pdf_dict_get(ocp, PDF_NAME('D')) + + obj = mupdf.pdf_dict_get(cfg, PDF_NAME('Order')) + if not obj.m_internal: + obj = mupdf.pdf_dict_put_array(cfg, PDF_NAME('Order'), 1) + mupdf.pdf_array_push(obj, indocg) + if on: + obj = mupdf.pdf_dict_get(cfg, PDF_NAME('ON')) + if not obj.m_internal: + obj = mupdf.pdf_dict_put_array(cfg, PDF_NAME('ON'), 1) + else: + obj =mupdf.pdf_dict_get(cfg, PDF_NAME('OFF')) + if not obj.m_internal: + obj =mupdf.pdf_dict_put_array(cfg, PDF_NAME('OFF'), 1) + mupdf.pdf_array_push(obj, indocg) + + # let MuPDF take note: re-read OCProperties + mupdf.ll_pdf_read_ocg(pdf.m_internal) + + xref = mupdf.pdf_to_num(indocg) + return xref + + def authenticate(self, password): + """Decrypt document.""" + if self.is_closed: + raise ValueError("document closed") + val = mupdf.fz_authenticate_password(self.this, password) + if val: # the doc is decrypted successfully and we init the outline + self.is_encrypted = False + self.is_encrypted = False + self.init_doc() + self.thisown = True + return val + + def can_save_incrementally(self): + """Check whether incremental saves are possible.""" + pdf = _as_pdf_document(self, required=0) + if not pdf.m_internal: + return False + return mupdf.pdf_can_be_saved_incrementally(pdf) + + def bake(self, *, annots: bool = True, widgets: bool = True) -> None: + """Convert annotations or fields to permanent content. + + Notes: + Converts annotations or widgets to permanent page content, like + text and vector graphics, as appropriate. + After execution, pages will still look the same, but no longer + have annotations, respectively no fields. + If widgets are selected the PDF will no longer be a Form PDF. 
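+
+        Example (illustrative; the file names are made up):
+            doc = pymupdf.open("form.pdf")
+            doc.bake(annots=True, widgets=False)  # flatten annotations only
+            doc.save("form-baked.pdf")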
+ + Args: + annots: convert annotations + widgets: convert form fields + + """ + pdf = _as_pdf_document(self) + mupdf.pdf_bake_document(pdf, int(annots), int(widgets)) + + @property + def chapter_count(self): + """Number of chapters.""" + if self.is_closed: + raise ValueError("document closed") + return mupdf.fz_count_chapters( self.this) + + def chapter_page_count(self, chapter): + """Page count of chapter.""" + if self.is_closed: + raise ValueError("document closed") + chapters = mupdf.fz_count_chapters( self.this) + if chapter < 0 or chapter >= chapters: + raise ValueError( "bad chapter number") + pages = mupdf.fz_count_chapter_pages( self.this, chapter) + return pages + + def close(self): + """Close document.""" + if getattr(self, "is_closed", True): + raise ValueError("document closed") + # self._cleanup() + if hasattr(self, "_outline") and self._outline: + self._outline = None + self._reset_page_refs() + #self.metadata = None + #self.stream = None + self.is_closed = True + #self.FontInfos = [] + self.Graftmaps = {} # Fixes test_3140(). + #self.ShownPages = {} + #self.InsertedImages = {} + #self.this = None + self.this = None + + def convert_to_pdf(self, from_page=0, to_page=-1, rotate=0): + """Convert document to a PDF, selecting page range and optional rotation. Output bytes object.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + fz_doc = self.this + fp = from_page + tp = to_page + srcCount = mupdf.fz_count_pages(fz_doc) + if fp < 0: + fp = 0 + if fp > srcCount - 1: + fp = srcCount - 1 + if tp < 0: + tp = srcCount - 1 + if tp > srcCount - 1: + tp = srcCount - 1 + len0 = len(JM_mupdf_warnings_store) + doc = JM_convert_to_pdf(fz_doc, fp, tp, rotate) + len1 = len(JM_mupdf_warnings_store) + for i in range(len0, len1): + message(f'{JM_mupdf_warnings_store[i]}') + return doc + + def copy_page(self, pno: int, to: int =-1): + """Copy a page within a PDF document. + + This will only create another reference of the same page object. + Args: + pno: source page number + to: put before this page, '-1' means after last page. + """ + if self.is_closed: + raise ValueError("document closed") + + page_count = len(self) + if ( + pno not in range(page_count) + or to not in range(-1, page_count) + ): + raise ValueError("bad page number(s)") + before = 1 + copy = 1 + if to == -1: + to = page_count - 1 + before = 0 + + return self._move_copy_page(pno, to, before, copy) + + def del_xml_metadata(self): + """Delete XML metadata.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + root = mupdf.pdf_dict_get( mupdf.pdf_trailer( pdf), PDF_NAME('Root')) + if root.m_internal: + mupdf.pdf_dict_del( root, PDF_NAME('Metadata')) + + def delete_page(self, pno: int =-1): + """ Delete one page from a PDF. + """ + return self.delete_pages(pno) + + def delete_pages(self, *args, **kw): + """Delete pages from a PDF. + + Args: + Either keywords 'from_page'/'to_page', or two integers to + specify the first/last page to delete. + Or a list/tuple/range object, which can contain arbitrary + page numbers. + Or a single integer page number. 
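+
+        Example (illustrative):
+            doc.delete_pages(2, 5)                      # pages 2 through 5
+            doc.delete_pages([0, 4, 7])                 # a list of page numbers
+            doc.delete_pages(from_page=-2, to_page=-1)  # the last two pages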
+ """ + if not self.is_pdf: + raise ValueError("is no PDF") + if self.is_closed: + raise ValueError("document closed") + + page_count = self.page_count # page count of document + f = t = -1 + if kw: # check if keywords were used + if args: # then no positional args are allowed + raise ValueError("cannot mix keyword and positional argument") + f = kw.get("from_page", -1) # first page to delete + t = kw.get("to_page", -1) # last page to delete + while f < 0: + f += page_count + while t < 0: + t += page_count + if not f <= t < page_count: + raise ValueError("bad page number(s)") + numbers = tuple(range(f, t + 1)) + else: + if len(args) > 2 or args == []: + raise ValueError("need 1 or 2 positional arguments") + if len(args) == 2: + f, t = args + if not (type(f) is int and type(t) is int): + raise ValueError("both arguments must be int") + if f > t: + f, t = t, f + if not f <= t < page_count: + raise ValueError("bad page number(s)") + numbers = tuple(range(f, t + 1)) + elif isinstance(args[0], int): + pno = args[0] + while pno < 0: + pno += page_count + numbers = (pno,) + else: + numbers = tuple(args[0]) + + numbers = list(map(int, set(numbers))) # ensure unique integers + if numbers == []: + message("nothing to delete") + return + numbers.sort() + if numbers[0] < 0 or numbers[-1] >= page_count: + raise ValueError("bad page number(s)") + frozen_numbers = frozenset(numbers) + toc = self.get_toc() + for i, xref in enumerate(self.get_outline_xrefs()): + if toc[i][2] - 1 in frozen_numbers: + self._remove_toc_item(xref) # remove target in PDF object + + self._remove_links_to(frozen_numbers) + + for i in reversed(numbers): # delete pages, last to first + self._delete_page(i) + + self._reset_page_refs() + + def embfile_add(self, + name: str, + buffer_: ByteString, + filename: OptStr =None, + ufilename: OptStr =None, + desc: OptStr =None, + ) -> None: + """Add an item to the EmbeddedFiles array. + + Args: + name: name of the new item, must not already exist. + buffer_: (binary data) the file content. + filename: (str) the file name, default: the name + ufilename: (unicode) the file name, default: filename + desc: (str) the description. + """ + filenames = self.embfile_names() + msg = "Name '%s' already exists." % str(name) + if name in filenames: + raise ValueError(msg) + + if filename is None: + filename = name + if ufilename is None: + ufilename = filename + if desc is None: + desc = name + xref = self._embfile_add( + name, + buffer_=buffer_, + filename=filename, + ufilename=ufilename, + desc=desc, + ) + date = get_pdf_now() + self.xref_set_key(xref, "Type", "/EmbeddedFile") + self.xref_set_key(xref, "Params/CreationDate", get_pdf_str(date)) + self.xref_set_key(xref, "Params/ModDate", get_pdf_str(date)) + return xref + + def embfile_count(self) -> int: + """Get number of EmbeddedFiles.""" + return len(self.embfile_names()) + + def embfile_del(self, item: typing.Union[int, str]): + """Delete an entry from EmbeddedFiles. + + Notes: + The argument must be name or index of an EmbeddedFiles item. + Physical deletion of data will happen on save to a new + file with appropriate garbage option. + Args: + item: name or number of item. + Returns: + None + """ + idx = self._embeddedFileIndex(item) + return self._embfile_del(idx) + + def embfile_get(self, item: typing.Union[int, str]) -> bytes: + """Get the content of an item in the EmbeddedFiles array. + + Args: + item: number or name of item. + Returns: + (bytes) The file content. 
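+
+        Example (illustrative; the item name is made up):
+            data = doc.embfile_get("invoice.xml")   # by name ...
+            data = doc.embfile_get(0)               # ... or by index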
+ """ + idx = self._embeddedFileIndex(item) + return self._embeddedFileGet(idx) + + def embfile_info(self, item: typing.Union[int, str]) -> dict: + """Get information of an item in the EmbeddedFiles array. + + Args: + item: number or name of item. + Returns: + Information dictionary. + """ + idx = self._embeddedFileIndex(item) + infodict = {"name": self.embfile_names()[idx]} + xref = self._embfile_info(idx, infodict) + t, date = self.xref_get_key(xref, "Params/CreationDate") + if t != "null": + infodict["creationDate"] = date + t, date = self.xref_get_key(xref, "Params/ModDate") + if t != "null": + infodict["modDate"] = date + t, md5 = self.xref_get_key(xref, "Params/CheckSum") + if t != "null": + infodict["checksum"] = binascii.hexlify(md5.encode()).decode() + return infodict + + def embfile_names(self) -> list: + """Get list of names of EmbeddedFiles.""" + filenames = [] + self._embfile_names(filenames) + return filenames + + def embfile_upd(self, + item: typing.Union[int, str], + buffer_: OptBytes =None, + filename: OptStr =None, + ufilename: OptStr =None, + desc: OptStr =None, + ) -> None: + """Change an item of the EmbeddedFiles array. + + Notes: + Only provided parameters are changed. If all are omitted, + the method is a no-op. + Args: + item: number or name of item. + buffer_: (binary data) the new file content. + filename: (str) the new file name. + ufilename: (unicode) the new filen ame. + desc: (str) the new description. + """ + idx = self._embeddedFileIndex(item) + xref = self._embfile_upd( + idx, + buffer_=buffer_, + filename=filename, + ufilename=ufilename, + desc=desc, + ) + date = get_pdf_now() + self.xref_set_key(xref, "Params/ModDate", get_pdf_str(date)) + return xref + + def extract_font(self, xref=0, info_only=0, named=None): + ''' + Get a font by xref. Returns a tuple or dictionary. + ''' + #log( '{=xref info_only}') + pdf = _as_pdf_document(self) + obj = mupdf.pdf_load_object(pdf, xref) + type_ = mupdf.pdf_dict_get(obj, PDF_NAME('Type')) + subtype = mupdf.pdf_dict_get(obj, PDF_NAME('Subtype')) + if (mupdf.pdf_name_eq(type_, PDF_NAME('Font')) + and not mupdf.pdf_to_name( subtype).startswith('CIDFontType') + ): + basefont = mupdf.pdf_dict_get(obj, PDF_NAME('BaseFont')) + if not basefont.m_internal or mupdf.pdf_is_null(basefont): + bname = mupdf.pdf_dict_get(obj, PDF_NAME('Name')) + else: + bname = basefont + ext = JM_get_fontextension(pdf, xref) + if ext != 'n/a' and not info_only: + buffer_ = JM_get_fontbuffer(pdf, xref) + bytes_ = JM_BinFromBuffer(buffer_) + else: + bytes_ = b'' + if not named: + rc = ( + JM_EscapeStrFromStr(mupdf.pdf_to_name(bname)), + JM_UnicodeFromStr(ext), + JM_UnicodeFromStr(mupdf.pdf_to_name(subtype)), + bytes_, + ) + else: + rc = { + dictkey_name: JM_EscapeStrFromStr(mupdf.pdf_to_name(bname)), + dictkey_ext: JM_UnicodeFromStr(ext), + dictkey_type: JM_UnicodeFromStr(mupdf.pdf_to_name(subtype)), + dictkey_content: bytes_, + } + else: + if not named: + rc = '', '', '', b'' + else: + rc = { + dictkey_name: '', + dictkey_ext: '', + dictkey_type: '', + dictkey_content: b'', + } + return rc + + def extract_image(self, xref): + """Get image by xref. 
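+        Only xrefs whose object carries /Subtype /Image are accepted
+        (see the check below); anything else raises ValueError.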
Returns a dictionary.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + + pdf = _as_pdf_document(self) + + if not _INRANGE(xref, 1, mupdf.pdf_xref_len(pdf)-1): + raise ValueError( MSG_BAD_XREF) + + obj = mupdf.pdf_new_indirect(pdf, xref, 0) + subtype = mupdf.pdf_dict_get(obj, PDF_NAME('Subtype')) + + if not mupdf.pdf_name_eq(subtype, PDF_NAME('Image')): + raise ValueError( "not an image") + + o = mupdf.pdf_dict_geta(obj, PDF_NAME('SMask'), PDF_NAME('Mask')) + if o.m_internal: + smask = mupdf.pdf_to_num(o) + else: + smask = 0 + + # load the image + img = mupdf.pdf_load_image(pdf, obj) + rc = dict() + _make_image_dict(img, rc) + rc[dictkey_smask] = smask + rc[dictkey_cs_name] = mupdf.fz_colorspace_name(img.colorspace()) + return rc + + def ez_save( + self, + filename, + garbage=3, + clean=False, + deflate=True, + deflate_images=True, + deflate_fonts=True, + incremental=False, + ascii=False, + expand=False, + linear=False, + pretty=False, + encryption=1, + permissions=4095, + owner_pw=None, + user_pw=None, + no_new_id=True, + preserve_metadata=1, + use_objstms=1, + compression_effort=0, + ): + ''' + Save PDF using some different defaults + ''' + return self.save( + filename, + garbage=garbage, + clean=clean, + deflate=deflate, + deflate_images=deflate_images, + deflate_fonts=deflate_fonts, + incremental=incremental, + ascii=ascii, + expand=expand, + linear=linear, + pretty=pretty, + encryption=encryption, + permissions=permissions, + owner_pw=owner_pw, + user_pw=user_pw, + no_new_id=no_new_id, + preserve_metadata=preserve_metadata, + use_objstms=use_objstms, + compression_effort=compression_effort, + ) + + def find_bookmark(self, bm): + """Find new location after layouting a document.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + location = mupdf.fz_lookup_bookmark2( self.this, bm) + return location.chapter, location.page + + def fullcopy_page(self, pno, to=-1): + """Make a full page duplicate.""" + pdf = _as_pdf_document(self) + page_count = mupdf.pdf_count_pages( pdf) + try: + if (not _INRANGE(pno, 0, page_count - 1) + or not _INRANGE(to, -1, page_count - 1) + ): + raise ValueError( MSG_BAD_PAGENO) + + page1 = mupdf.pdf_resolve_indirect( mupdf.pdf_lookup_page_obj( pdf, pno)) + + page2 = mupdf.pdf_deep_copy_obj( page1) + old_annots = mupdf.pdf_dict_get( page2, PDF_NAME('Annots')) + + # copy annotations, but remove Popup and IRT types + if old_annots.m_internal: + n = mupdf.pdf_array_len( old_annots) + new_annots = mupdf.pdf_new_array( pdf, n) + for i in range(n): + o = mupdf.pdf_array_get( old_annots, i) + subtype = mupdf.pdf_dict_get( o, PDF_NAME('Subtype')) + if mupdf.pdf_name_eq( subtype, PDF_NAME('Popup')): + continue + if mupdf.pdf_dict_gets( o, "IRT").m_internal: + continue + copy_o = mupdf.pdf_deep_copy_obj( mupdf.pdf_resolve_indirect( o)) + xref = mupdf.pdf_create_object( pdf) + mupdf.pdf_update_object( pdf, xref, copy_o) + copy_o = mupdf.pdf_new_indirect( pdf, xref, 0) + mupdf.pdf_dict_del( copy_o, PDF_NAME('Popup')) + mupdf.pdf_dict_del( copy_o, PDF_NAME('P')) + mupdf.pdf_array_push( new_annots, copy_o) + mupdf.pdf_dict_put( page2, PDF_NAME('Annots'), new_annots) + + # copy the old contents stream(s) + res = JM_read_contents( page1) + + # create new /Contents object for page2 + if res and res.m_internal: + #contents = mupdf.pdf_add_stream( pdf, mupdf.fz_new_buffer_from_copied_data( b" ", 1), NULL, 0) + contents = mupdf.pdf_add_stream( pdf, mupdf.fz_new_buffer_from_copied_data( b" "), 
mupdf.PdfObj(), 0) + JM_update_stream( pdf, contents, res, 1) + mupdf.pdf_dict_put( page2, PDF_NAME('Contents'), contents) + + # now insert target page, making sure it is an indirect object + xref = mupdf.pdf_create_object( pdf) # get new xref + mupdf.pdf_update_object( pdf, xref, page2) # store new page + + page2 = mupdf.pdf_new_indirect( pdf, xref, 0) # reread object + mupdf.pdf_insert_page( pdf, to, page2) # and store the page + finally: + mupdf.ll_pdf_drop_page_tree( pdf.m_internal) + + self._reset_page_refs() + + def get_layer(self, config=-1): + """Content of ON, OFF, RBGroups of an OC layer.""" + pdf = _as_pdf_document(self) + ocp = mupdf.pdf_dict_getl( + mupdf.pdf_trailer( pdf), + PDF_NAME('Root'), + PDF_NAME('OCProperties'), + ) + if not ocp.m_internal: + return + if config == -1: + obj = mupdf.pdf_dict_get( ocp, PDF_NAME('D')) + else: + obj = mupdf.pdf_array_get( + mupdf.pdf_dict_get( ocp, PDF_NAME('Configs')), + config, + ) + if not obj.m_internal: + raise ValueError( MSG_BAD_OC_CONFIG) + rc = JM_get_ocg_arrays( obj) + return rc + + def get_layers(self): + """Show optional OC layers.""" + pdf = _as_pdf_document(self) + n = mupdf.pdf_count_layer_configs( pdf) + if n == 1: + obj = mupdf.pdf_dict_getl( + mupdf.pdf_trailer( pdf), + PDF_NAME('Root'), + PDF_NAME('OCProperties'), + PDF_NAME('Configs'), + ) + if not mupdf.pdf_is_array( obj): + n = 0 + rc = [] + info = mupdf.PdfLayerConfig() + for i in range(n): + mupdf.pdf_layer_config_info( pdf, i, info) + item = { + "number": i, + "name": info.name, + "creator": info.creator, + } + rc.append( item) + return rc + + def get_new_xref(self): + """Make new xref.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + xref = 0 + ENSURE_OPERATION(pdf) + xref = mupdf.pdf_create_object(pdf) + return xref + + def get_ocgs(self): + """Show existing optional content groups.""" + ci = mupdf.pdf_new_name( "CreatorInfo") + pdf = _as_pdf_document(self) + ocgs = mupdf.pdf_dict_getl( + mupdf.pdf_dict_get( mupdf.pdf_trailer( pdf), PDF_NAME('Root')), + PDF_NAME('OCProperties'), + PDF_NAME('OCGs'), + ) + rc = dict() + if not mupdf.pdf_is_array( ocgs): + return rc + n = mupdf.pdf_array_len( ocgs) + for i in range(n): + ocg = mupdf.pdf_array_get( ocgs, i) + xref = mupdf.pdf_to_num( ocg) + name = mupdf.pdf_to_text_string( mupdf.pdf_dict_get( ocg, PDF_NAME('Name'))) + obj = mupdf.pdf_dict_getl( ocg, PDF_NAME('Usage'), ci, PDF_NAME('Subtype')) + usage = None + if obj.m_internal: + usage = mupdf.pdf_to_name( obj) + intents = list() + intent = mupdf.pdf_dict_get( ocg, PDF_NAME('Intent')) + if intent.m_internal: + if mupdf.pdf_is_name( intent): + intents.append( mupdf.pdf_to_name( intent)) + elif mupdf.pdf_is_array( intent): + m = mupdf.pdf_array_len( intent) + for j in range(m): + o = mupdf.pdf_array_get( intent, j) + if mupdf.pdf_is_name( o): + intents.append( mupdf.pdf_to_name( o)) + hidden = mupdf.pdf_is_ocg_hidden( pdf, mupdf.PdfObj(), usage, ocg) + item = { + "name": name, + "intent": intents, + "on": not hidden, + "usage": usage, + } + temp = xref + rc[ temp] = item + return rc + + def get_outline_xrefs(self): + """Get list of outline xref numbers.""" + xrefs = [] + pdf = _as_pdf_document(self, required=0) + if not pdf.m_internal: + return xrefs + root = mupdf.pdf_dict_get(mupdf.pdf_trailer(pdf), PDF_NAME('Root')) + if not root.m_internal: + return xrefs + olroot = mupdf.pdf_dict_get(root, PDF_NAME('Outlines')) + if not olroot.m_internal: + return xrefs + first = 
mupdf.pdf_dict_get(olroot, PDF_NAME('First')) + if not first.m_internal: + return xrefs + xrefs = JM_outline_xrefs(first, xrefs) + return xrefs + + def get_page_fonts(self, pno: int, full: bool =False) -> list: + """Retrieve a list of fonts used on a page. + """ + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if not self.is_pdf: + return () + if type(pno) is not int: + try: + pno = pno.number + except Exception: + exception_info() + raise ValueError("need a Page or page number") + val = self._getPageInfo(pno, 1) + if not full: + return [v[:-1] for v in val] + return val + + def get_page_images(self, pno: int, full: bool =False) -> list: + """Retrieve a list of images used on a page. + """ + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if not self.is_pdf: + return () + val = self._getPageInfo(pno, 2) + if not full: + return [v[:-1] for v in val] + return val + + def get_page_xobjects(self, pno: int) -> list: + """Retrieve a list of XObjects used on a page. + """ + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if not self.is_pdf: + return () + val = self._getPageInfo(pno, 3) + return val + + def get_sigflags(self): + """Get the /SigFlags value.""" + pdf = _as_pdf_document(self, required=0) + if not pdf.m_internal: + return -1 # not a PDF + sigflags = mupdf.pdf_dict_getl( + mupdf.pdf_trailer(pdf), + PDF_NAME('Root'), + PDF_NAME('AcroForm'), + PDF_NAME('SigFlags'), + ) + sigflag = -1 + if sigflags.m_internal: + sigflag = mupdf.pdf_to_int(sigflags) + return sigflag + + def get_xml_metadata(self): + """Get document XML metadata.""" + xml = None + pdf = _as_pdf_document(self, required=0) + if pdf.m_internal: + xml = mupdf.pdf_dict_getl( + mupdf.pdf_trailer(pdf), + PDF_NAME('Root'), + PDF_NAME('Metadata'), + ) + if xml is not None and xml.m_internal: + buff = mupdf.pdf_load_stream(xml) + rc = JM_UnicodeFromBuffer(buff) + else: + rc = '' + return rc + + def init_doc(self): + if self.is_encrypted: + raise ValueError("cannot initialize - document still encrypted") + self._outline = self._loadOutline() + self.metadata = dict( + [ + (k,self._getMetadata(v)) for k,v in { + 'format':'format', + 'title':'info:Title', + 'author':'info:Author', + 'subject':'info:Subject', + 'keywords':'info:Keywords', + 'creator':'info:Creator', + 'producer':'info:Producer', + 'creationDate':'info:CreationDate', + 'modDate':'info:ModDate', + 'trapped':'info:Trapped' + }.items() + ] + ) + self.metadata['encryption'] = None if self._getMetadata('encryption')=='None' else self._getMetadata('encryption') + + def insert_file(self, + infile, + from_page=-1, + to_page=-1, + start_at=-1, + rotate=-1, + links=True, + annots=True, + show_progress=0, + final=1, + ): + ''' + Insert an arbitrary supported document to an existing PDF. + + The infile may be given as a filename, a Document or a Pixmap. Other + parameters - where applicable - equal those of insert_pdf(). 
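+        Example (illustrative; 'doc' is an open target PDF, 'scan.png' and
+        'other_doc' are assumed inputs):
+            doc.insert_file("scan.png")                        # append an image as a new page
+            doc.insert_file(other_doc, from_page=0, to_page=2) # insert its first three pages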
+ ''' + src = None + if isinstance(infile, Pixmap): + if infile.colorspace.n > 3: + infile = Pixmap(csRGB, infile) + src = Document("png", infile.tobytes()) + elif isinstance(infile, Document): + src = infile + else: + src = Document(infile) + if not src: + raise ValueError("bad infile parameter") + if not src.is_pdf: + pdfbytes = src.convert_to_pdf() + src = Document("pdf", pdfbytes) + return self.insert_pdf( + src, + from_page=from_page, + to_page=to_page, + start_at=start_at, + rotate=rotate, + links=links, + annots=annots, + show_progress=show_progress, + final=final, + ) + + def insert_pdf( + self, + docsrc, + *, + from_page=-1, + to_page=-1, + start_at=-1, + rotate=-1, + links=1, + annots=1, + widgets=1, + join_duplicates=0, + show_progress=0, + final=1, + _gmap=None, + ): + """Insert a page range from another PDF. + + Args: + docsrc: PDF to copy from. Must be different object, but may be same file. + from_page: (int) first source page to copy, 0-based, default 0. + to_page: (int) last source page to copy, 0-based, default last page. + start_at: (int) from_page will become this page number in target. + rotate: (int) rotate copied pages, default -1 is no change. + links: (int/bool) whether to also copy links. + annots: (int/bool) whether to also copy annotations. + widgets: (int/bool) whether to also copy form fields. + join_duplicates: (int/bool) join or rename duplicate widget names. + show_progress: (int) progress message interval, 0 is no messages. + final: (bool) indicates last insertion from this source PDF. + _gmap: internal use only + + Copy sequence reversed if from_page > to_page.""" + + # Insert pages from a source PDF into this PDF. + # For reconstructing the links (_do_links method), we must save the + # insertion point (start_at) if it was specified as -1. 
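+        # Descriptive overview of the steps below: normalize the page range and
+        # the insertion point, reuse or create a Graftmap for the source PDF so
+        # shared objects are copied only once, merge the page range, then rebuild
+        # links and form fields if requested.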
+ #log( 'insert_pdf(): start') + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if self._graft_id == docsrc._graft_id: + raise ValueError("source and target cannot be same object") + sa = start_at + if sa < 0: + sa = self.page_count + outCount = self.page_count + srcCount = docsrc.page_count + + # local copies of page numbers + fp = from_page + tp = to_page + sa = start_at + + # normalize page numbers + fp = max(fp, 0) # -1 = first page + fp = min(fp, srcCount - 1) # but do not exceed last page + + if tp < 0: + tp = srcCount - 1 # -1 = last page + tp = min(tp, srcCount - 1) # but do not exceed last page + + if sa < 0: + sa = outCount # -1 = behind last page + sa = min(sa, outCount) # but that is also the limit + + if len(docsrc) > show_progress > 0: + inname = os.path.basename(docsrc.name) + if not inname: + inname = "memory PDF" + outname = os.path.basename(self.name) + if not outname: + outname = "memory PDF" + message("Inserting '%s' at '%s'" % (inname, outname)) + + # retrieve / make a Graftmap to avoid duplicate objects + #log( 'insert_pdf(): Graftmaps') + isrt = docsrc._graft_id + _gmap = self.Graftmaps.get(isrt, None) + if _gmap is None: + #log( 'insert_pdf(): Graftmaps2') + _gmap = Graftmap(self) + self.Graftmaps[isrt] = _gmap + + if g_use_extra: + #log( 'insert_pdf(): calling extra_FzDocument_insert_pdf()') + extra_FzDocument_insert_pdf( + self.this, + docsrc.this, + from_page, + to_page, + start_at, + rotate, + links, + annots, + show_progress, + final, + _gmap, + ) + #log( 'insert_pdf(): extra_FzDocument_insert_pdf() returned.') + else: + pdfout = _as_pdf_document(self) + pdfsrc = _as_pdf_document(docsrc) + + if not pdfout.m_internal or not pdfsrc.m_internal: + raise TypeError( "source or target not a PDF") + ENSURE_OPERATION(pdfout) + JM_merge_range(pdfout, pdfsrc, fp, tp, sa, rotate, links, annots, show_progress, _gmap) + + #log( 'insert_pdf(): calling self._reset_page_refs()') + self._reset_page_refs() + if links: + #log( 'insert_pdf(): calling self._do_links()') + self._do_links(docsrc, from_page=fp, to_page=tp, start_at=sa) + if widgets: + self._do_widgets(docsrc, _gmap, from_page=fp, to_page=tp, start_at=sa, join_duplicates=join_duplicates) + if final == 1: + self.Graftmaps[isrt] = None + #log( 'insert_pdf(): returning') + + @property + def is_dirty(self): + pdf = _as_pdf_document(self, required=0) + if not pdf.m_internal: + return False + r = mupdf.pdf_has_unsaved_changes(pdf) + return True if r else False + + @property + def is_fast_webaccess(self): + ''' + Check whether we have a linearized PDF. 
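+        Returns True if the PDF was saved linearized ('fast web view'),
+        False otherwise (including for non-PDF documents).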
+ ''' + pdf = _as_pdf_document(self, required=0) + if pdf.m_internal: + return mupdf.pdf_doc_was_linearized(pdf) + return False # gracefully handle non-PDF + + @property + def is_form_pdf(self): + """Either False or PDF field count.""" + pdf = _as_pdf_document(self, required=0) + if not pdf.m_internal: + return False + count = -1 + try: + fields = mupdf.pdf_dict_getl( + mupdf.pdf_trailer(pdf), + mupdf.PDF_ENUM_NAME_Root, + mupdf.PDF_ENUM_NAME_AcroForm, + mupdf.PDF_ENUM_NAME_Fields, + ) + if mupdf.pdf_is_array(fields): + count = mupdf.pdf_array_len(fields) + except Exception: + if g_exceptions_verbose: exception_info() + return False + if count >= 0: + return count + return False + + @property + def is_pdf(self): + """Check for PDF.""" + if isinstance(self.this, mupdf.PdfDocument): + return True + # Avoid calling smupdf.pdf_specifics because it will end up creating + # a new PdfDocument which will call pdf_create_document(), which is ok + # but a little unnecessary. + # + if mupdf.ll_pdf_specifics(self.this.m_internal): + ret = True + else: + ret = False + return ret + + @property + def is_reflowable(self): + """Check if document is layoutable.""" + if self.is_closed: + raise ValueError("document closed") + return bool(mupdf.fz_is_document_reflowable(self)) + + @property + def is_repaired(self): + """Check whether PDF was repaired.""" + pdf = _as_pdf_document(self, required=0) + if not pdf.m_internal: + return False + r = mupdf.pdf_was_repaired(pdf) + if r: + return True + return False + + def journal_can_do(self): + """Show if undo and / or redo are possible.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + undo=0 + redo=0 + pdf = _as_pdf_document(self) + undo = mupdf.pdf_can_undo(pdf) + redo = mupdf.pdf_can_redo(pdf) + return {'undo': bool(undo), 'redo': bool(redo)} + + def journal_enable(self): + """Activate document journalling.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + mupdf.pdf_enable_journal(pdf) + + def journal_is_enabled(self): + """Check if journalling is enabled.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + enabled = pdf.m_internal and pdf.m_internal.journal + return enabled + + def journal_load(self, filename): + """Load a journal from a file.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + if isinstance(filename, str): + mupdf.pdf_load_journal(pdf, filename) + else: + res = JM_BufferFromBytes(filename) + stm = mupdf.fz_open_buffer(res) + mupdf.pdf_deserialise_journal(pdf, stm) + if not pdf.m_internal.journal: + RAISEPY( "Journal and document do not match", JM_Exc_FileDataError) + + def journal_op_name(self, step): + """Show operation name for given step.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + name = mupdf.pdf_undoredo_step(pdf, step) + return name + + def journal_position(self): + """Show journalling state.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + steps=0 + pdf = _as_pdf_document(self) + rc, steps = mupdf.pdf_undoredo_state(pdf) + return rc, steps + + def journal_redo(self): + """Move forward in the journal.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + 
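+        # Move one step forward in the undo/redo history; journal_can_do()
+        # (above) reports whether a redo step is currently available.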
mupdf.pdf_redo(pdf) + return True + + def journal_save(self, filename): + """Save journal to a file.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + if isinstance(filename, str): + mupdf.pdf_save_journal(pdf, filename) + else: + out = JM_new_output_fileptr(filename) + mupdf.pdf_write_journal(pdf, out) + out.fz_close_output() + + def journal_start_op(self, name=None): + """Begin a journalling operation.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + if not pdf.m_internal.journal: + raise RuntimeError( "Journalling not enabled") + if name: + mupdf.pdf_begin_operation(pdf, name) + else: + mupdf.pdf_begin_implicit_operation(pdf) + + def journal_stop_op(self): + """End a journalling operation.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + mupdf.pdf_end_operation(pdf) + + def journal_undo(self): + """Move backwards in the journal.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + mupdf.pdf_undo(pdf) + return True + + @property + def language(self): + """Document language.""" + pdf = _as_pdf_document(self, required=0) + if not pdf.m_internal: + return + lang = mupdf.pdf_document_language(pdf) + if lang == mupdf.FZ_LANG_UNSET: + return + return mupdf.fz_string_from_text_language2(lang) + + @property + def last_location(self): + """Id (chapter, page) of last page.""" + if self.is_closed: + raise ValueError("document closed") + last_loc = mupdf.fz_last_page(self.this) + return last_loc.chapter, last_loc.page + + def layer_ui_configs(self): + """Show OC visibility status modifiable by user.""" + pdf = _as_pdf_document(self) + info = mupdf.PdfLayerConfigUi() + n = mupdf.pdf_count_layer_config_ui( pdf) + rc = [] + for i in range(n): + mupdf.pdf_layer_config_ui_info( pdf, i, info) + if info.type == 1: + type_ = "checkbox" + elif info.type == 2: + type_ = "radiobox" + else: + type_ = "label" + item = { + "number": i, + "text": info.text, + "depth": info.depth, + "type": type_, + "on": info.selected, + "locked": info.locked, + } + rc.append(item) + return rc + + def layout(self, rect=None, width=0, height=0, fontsize=11): + """Re-layout a reflowable document.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + doc = self.this + if not mupdf.fz_is_document_reflowable( doc): + return + w = width + h = height + r = JM_rect_from_py(rect) + if not mupdf.fz_is_infinite_rect(r): + w = r.x1 - r.x0 + h = r.y1 - r.y0 + if w <= 0.0 or h <= 0.0: + raise ValueError( "bad page size") + mupdf.fz_layout_document( doc, w, h, fontsize) + + self._reset_page_refs() + self.init_doc() + + def load_page(self, page_id): + """Load a page. + + 'page_id' is either a 0-based page number or a tuple (chapter, pno), + with chapter number and page number within that chapter. 
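+        Examples (illustrative):
+            page = doc.load_page(0)        # first page
+            page = doc.load_page(-1)       # last page (negatives count from the end)
+            page = doc.load_page((2, 0))   # first page of the third chapter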
+ """ + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if page_id is None: + page_id = 0 + if page_id not in self: + raise ValueError("page not in document") + if type(page_id) is int and page_id < 0: + np = self.page_count + while page_id < 0: + page_id += np + if isinstance(page_id, int): + page = mupdf.fz_load_page(self.this, page_id) + else: + chapter, pagenum = page_id + page = mupdf.fz_load_chapter_page(self.this, chapter, pagenum) + val = Page(page, self) + + val.thisown = True + val.parent = self + self._page_refs[id(val)] = val + val._annot_refs = weakref.WeakValueDictionary() + val.number = page_id + return val + + def location_from_page_number(self, pno): + """Convert pno to (chapter, page).""" + if self.is_closed: + raise ValueError("document closed") + this_doc = self.this + loc = mupdf.fz_make_location(-1, -1) + page_count = mupdf.fz_count_pages(this_doc) + while pno < 0: + pno += page_count + if pno >= page_count: + raise ValueError( MSG_BAD_PAGENO) + loc = mupdf.fz_location_from_page_number(this_doc, pno) + return loc.chapter, loc.page + + def make_bookmark(self, loc): + """Make a page pointer before layouting document.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + loc = mupdf.FzLocation(*loc) + mark = mupdf.ll_fz_make_bookmark2( self.this.m_internal, loc.internal()) + return mark + + @property + def markinfo(self) -> dict: + """Return the PDF MarkInfo value.""" + xref = self.pdf_catalog() + if xref == 0: + return None + rc = self.xref_get_key(xref, "MarkInfo") + if rc[0] == "null": + return {} + if rc[0] == "xref": + xref = int(rc[1].split()[0]) + val = self.xref_object(xref, compressed=True) + elif rc[0] == "dict": + val = rc[1] + else: + val = None + if val is None or not (val[:2] == "<<" and val[-2:] == ">>"): + return {} + valid = {"Marked": False, "UserProperties": False, "Suspects": False} + val = val[2:-2].split("/") + for v in val[1:]: + try: + key, value = v.split() + except Exception: + if g_exceptions_verbose > 1: exception_info() + return valid + if value == "true": + valid[key] = True + return valid + + def move_page(self, pno: int, to: int =-1): + """Move a page within a PDF document. + + Args: + pno: source page number. + to: put before this page, '-1' means after last page. 
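+            Examples (illustrative):
+                doc.move_page(0)       # move the first page behind the last page
+                doc.move_page(5, 0)    # make page 5 the new first page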
+ """ + if self.is_closed: + raise ValueError("document closed") + page_count = len(self) + if (pno not in range(page_count) or to not in range(-1, page_count)): + raise ValueError("bad page number(s)") + before = 1 + copy = 0 + if to == -1: + to = page_count - 1 + before = 0 + + return self._move_copy_page(pno, to, before, copy) + + @property + def name(self): + return self._name + + def need_appearances(self, value=None): + """Get/set the NeedAppearances value.""" + if not self.is_form_pdf: + return None + + pdf = _as_pdf_document(self) + oldval = -1 + appkey = "NeedAppearances" + + form = mupdf.pdf_dict_getp( + mupdf.pdf_trailer(pdf), + "Root/AcroForm", + ) + app = mupdf.pdf_dict_gets(form, appkey) + if mupdf.pdf_is_bool(app): + oldval = mupdf.pdf_to_bool(app) + if value: + mupdf.pdf_dict_puts(form, appkey, mupdf.PDF_TRUE) + else: + mupdf.pdf_dict_puts(form, appkey, mupdf.PDF_FALSE) + if value is None: + return oldval >= 0 + return value + + @property + def needs_pass(self): + """Indicate password required.""" + if self.is_closed: + raise ValueError("document closed") + document = self.this if isinstance(self.this, mupdf.FzDocument) else self.this.super() + ret = mupdf.fz_needs_password( document) + return ret + + def next_location(self, page_id): + """Get (chapter, page) of next page.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if type(page_id) is int: + page_id = (0, page_id) + if page_id not in self: + raise ValueError("page id not in document") + if tuple(page_id) == self.last_location: + return () + this_doc = _as_fz_document(self) + val = page_id[ 0] + if not isinstance(val, int): + RAISEPY(MSG_BAD_PAGEID, PyExc_ValueError) + chapter = val + val = page_id[ 1] + pno = val + loc = mupdf.fz_make_location(chapter, pno) + next_loc = mupdf.fz_next_page( this_doc, loc) + return next_loc.chapter, next_loc.page + + def page_annot_xrefs(self, n): + if g_use_extra: + return extra.page_annot_xrefs( self.this, n) + + if isinstance(self.this, mupdf.PdfDocument): + page_count = mupdf.pdf_count_pages(self.this) + pdf_document = self.this + else: + page_count = mupdf.fz_count_pages(self.this) + pdf_document = _as_pdf_document(self) + while n < 0: + n += page_count + if n > page_count: + raise ValueError( MSG_BAD_PAGENO) + page_obj = mupdf.pdf_lookup_page_obj(pdf_document, n) + annots = JM_get_annot_xref_list(page_obj) + return annots + + @property + def page_count(self): + """Number of pages.""" + if self.is_closed: + raise ValueError('document closed') + if g_use_extra: + return self.page_count2(self) + if isinstance( self.this, mupdf.FzDocument): + return mupdf.fz_count_pages( self.this) + else: + return mupdf.pdf_count_pages( self.this) + + def page_cropbox(self, pno): + """Get CropBox of page number (without loading page).""" + if self.is_closed: + raise ValueError("document closed") + this_doc = self.this + page_count = mupdf.fz_count_pages( this_doc) + n = pno + while n < 0: + n += page_count + pdf = _as_pdf_document(self) + if n >= page_count: + raise ValueError( MSG_BAD_PAGENO) + pageref = mupdf.pdf_lookup_page_obj( pdf, n) + cropbox = JM_cropbox(pageref) + val = JM_py_from_rect(cropbox) + + val = Rect(val) + + return val + + def page_number_from_location(self, page_id): + """Convert (chapter, pno) to page number.""" + if type(page_id) is int: + np = self.page_count + while page_id < 0: + page_id += np + page_id = (0, page_id) + if page_id not in self: + raise ValueError("page id not in document") + chapter, pno = page_id + loc = 
mupdf.fz_make_location( chapter, pno) + page_n = mupdf.fz_page_number_from_location( self.this, loc) + return page_n + + def page_xref(self, pno): + """Get xref of page number.""" + if g_use_extra: + return extra.page_xref( self.this, pno) + if self.is_closed: + raise ValueError("document closed") + page_count = mupdf.fz_count_pages(self.this) + n = pno + while n < 0: + n += page_count + pdf = _as_pdf_document(self) + xref = 0 + if n >= page_count: + raise ValueError( MSG_BAD_PAGENO) + xref = mupdf.pdf_to_num(mupdf.pdf_lookup_page_obj(pdf, n)) + return xref + + @property + def pagelayout(self) -> str: + """Return the PDF PageLayout value. + """ + xref = self.pdf_catalog() + if xref == 0: + return None + rc = self.xref_get_key(xref, "PageLayout") + if rc[0] == "null": + return "SinglePage" + if rc[0] == "name": + return rc[1][1:] + return "SinglePage" + + @property + def pagemode(self) -> str: + """Return the PDF PageMode value. + """ + xref = self.pdf_catalog() + if xref == 0: + return None + rc = self.xref_get_key(xref, "PageMode") + if rc[0] == "null": + return "UseNone" + if rc[0] == "name": + return rc[1][1:] + return "UseNone" + + if sys.implementation.version < (3, 9): + # Appending `[Page]` causes `TypeError: 'ABCMeta' object is not subscriptable`. + _pages_ret = collections.abc.Iterable + else: + _pages_ret = collections.abc.Iterable[Page] + + def pages(self, start: OptInt =None, stop: OptInt =None, step: OptInt =None) -> _pages_ret: + """Return a generator iterator over a page range. + + Arguments have the same meaning as for the range() built-in. + """ + if not self.page_count: + return + # set the start value + start = start or 0 + while start < 0: + start += self.page_count + if start not in range(self.page_count): + raise ValueError("bad start page number") + + # set the stop value + stop = stop if stop is not None and stop <= self.page_count else self.page_count + + # set the step value + if step == 0: + raise ValueError("arg 3 must not be zero") + if step is None: + if start > stop: + step = -1 + else: + step = 1 + + for pno in range(start, stop, step): + yield (self.load_page(pno)) + + def pdf_catalog(self): + """Get xref of PDF catalog.""" + pdf = _as_pdf_document(self, required=0) + xref = 0 + if not pdf.m_internal: + return xref + root = mupdf.pdf_dict_get(mupdf.pdf_trailer(pdf), PDF_NAME('Root')) + xref = mupdf.pdf_to_num(root) + return xref + + def pdf_trailer(self, compressed=0, ascii=0): + """Get PDF trailer as a string.""" + return self.xref_object(-1, compressed=compressed, ascii=ascii) + + @property + def permissions(self): + """Document permissions.""" + if self.is_encrypted: + return 0 + doc =self.this + pdf = mupdf.pdf_document_from_fz_document(doc) + + # for PDF return result of standard function + if pdf.m_internal: + return mupdf.pdf_document_permissions(pdf) + + # otherwise simulate the PDF return value + perm = 0xFFFFFFFC # all permissions granted + # now switch off where needed + if not mupdf.fz_has_permission(doc, mupdf.FZ_PERMISSION_PRINT): + perm = perm ^ mupdf.PDF_PERM_PRINT + if not mupdf.fz_has_permission(doc, mupdf.FZ_PERMISSION_EDIT): + perm = perm ^ mupdf.PDF_PERM_MODIFY + if not mupdf.fz_has_permission(doc, mupdf.FZ_PERMISSION_COPY): + perm = perm ^ mupdf.PDF_PERM_COPY + if not mupdf.fz_has_permission(doc, mupdf.FZ_PERMISSION_ANNOTATE): + perm = perm ^ mupdf.PDF_PERM_ANNOTATE + return perm + + def prev_location(self, page_id): + + """Get (chapter, page) of previous page.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document 
closed or encrypted") + if type(page_id) is int: + page_id = (0, page_id) + if page_id not in self: + raise ValueError("page id not in document") + if page_id == (0, 0): + return () + chapter, pno = page_id + loc = mupdf.fz_make_location(chapter, pno) + prev_loc = mupdf.fz_previous_page(self.this, loc) + return prev_loc.chapter, prev_loc.page + + def reload_page(self, page: Page) -> Page: + """Make a fresh copy of a page.""" + old_annots = {} # copy annot references to here + pno = page.number # save the page number + for k, v in page._annot_refs.items(): # save the annot dictionary + old_annots[k] = v + + # When we call `self.load_page()` below, it will end up in + # fz_load_chapter_page(), which will return any matching page in the + # document's list of non-ref-counted loaded pages, instead of actually + # reloading the page. + # + # We want to assert that we have actually reloaded the fz_page, and not + # simply returned the same `fz_page*` pointer from the document's list + # of non-ref-counted loaded pages. + # + # So we first remove our reference to the `fz_page*`. This will + # decrement .refs, and if .refs was 1, this is guaranteed to free the + # `fz_page*` and remove it from the document's list if it was there. So + # we are guaranteed that our returned `fz_page*` is from a genuine + # reload, even if it happens to reuse the original block of memory. + # + # However if the original .refs is greater than one, there must be + # other references to the `fz_page` somewhere, and we require that + # these other references are not keeping the page in the document's + # list. We check that we are returning a newly loaded page by + # asserting that our returned `fz_page*` is different from the original + # `fz_page*` - the original was not freed, so a new `fz_page` cannot + # reuse the same block of memory. + # + + refs_old = page.this.m_internal.refs + m_internal_old = page.this.m_internal_value() + + page.this = None + page._erase() # remove the page + page = None + TOOLS.store_shrink(100) + page = self.load_page(pno) # reload the page + + # copy annot refs over to the new dictionary + #page_proxy = weakref.proxy(page) + for k, v in old_annots.items(): + annot = old_annots[k] + #annot.parent = page_proxy # refresh parent to new page + page._annot_refs[k] = annot + if refs_old == 1: + # We know that `page.this = None` will have decremented the ref + # count to zero so we are guaranteed that the new `fz_page` is a + # new page even if it happens to have reused the same block of + # memory. + pass + else: + # Check that the new `fz_page*` is different from the original. + m_internal_new = page.this.m_internal_value() + assert m_internal_new != m_internal_old, \ + f'{refs_old=} {m_internal_old=:#x} {m_internal_new=:#x}' + return page + + def resolve_link(self, uri=None, chapters=0): + """Calculate internal link destination. + + Args: + uri: (str) some Link.uri + chapters: (bool) whether to use (chapter, page) format + Returns: + (page_id, x, y) where x, y are point coordinates on the page. + page_id is either page number (if chapters=0), or (chapter, pno). 
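+            Example (illustrative, 'lnk' being a Link with an internal target):
+                pno, x, y = doc.resolve_link(lnk.uri)
+                (chapter, pno), x, y = doc.resolve_link(lnk.uri, chapters=True)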
+ """ + if not uri: + if chapters: + return (-1, -1), 0, 0 + return -1, 0, 0 + try: + loc, xp, yp = mupdf.fz_resolve_link(self.this, uri) + except Exception: + if g_exceptions_verbose: exception_info() + if chapters: + return (-1, -1), 0, 0 + return -1, 0, 0 + if chapters: + return (loc.chapter, loc.page), xp, yp + pno = mupdf.fz_page_number_from_location(self.this, loc) + return pno, xp, yp + + def rewrite_images( + self, + dpi_threshold=None, + dpi_target=0, + quality=0, + lossy=True, + lossless=True, + bitonal=True, + color=True, + gray=True, + set_to_gray=False, + options=None, + ): + """Rewrite images in a PDF document. + + The typical use case is to reduce the size of the PDF by recompressing + images. Default parameters will convert all images to JPEG where + possible, using the specified resolutions and quality. Exclude + undesired images by setting parameters to False. + Args: + dpi_threshold: look at images with a larger DPI only. + dpi_target: change eligible images to this DPI. + quality: Quality of the recompressed images (0-100). + lossy: process lossy image types (e.g. JPEG). + lossless: process lossless image types (e.g. PNG). + bitonal: process black-and-white images (e.g. FAX) + color: process colored images. + gray: process gray images. + set_to_gray: whether to change the PDF to gray at process start. + options: (PdfImageRewriterOptions) Custom options for image + rewriting (optional). Expert use only. If provided, other + parameters are ignored, except set_to_gray. + """ + quality_str = str(quality) + if not dpi_threshold: + dpi_threshold = dpi_target = 0 + if dpi_target > 0 and dpi_target >= dpi_threshold: + raise ValueError("{dpi_target=} must be less than {dpi_threshold=}") + template_opts = mupdf.PdfImageRewriterOptions() + dir1 = set(dir(template_opts)) # for checking that only existing options are set + if not options: + opts = mupdf.PdfImageRewriterOptions() + if bitonal: + opts.bitonal_image_recompress_method = mupdf.FZ_RECOMPRESS_FAX + opts.bitonal_image_subsample_method = mupdf.FZ_SUBSAMPLE_AVERAGE + opts.bitonal_image_subsample_to = dpi_target + opts.bitonal_image_recompress_quality = quality_str + opts.bitonal_image_subsample_threshold = dpi_threshold + if color: + if lossless: + opts.color_lossless_image_recompress_method = mupdf.FZ_RECOMPRESS_JPEG + opts.color_lossless_image_subsample_method = mupdf.FZ_SUBSAMPLE_AVERAGE + opts.color_lossless_image_subsample_to = dpi_target + opts.color_lossless_image_subsample_threshold = dpi_threshold + opts.color_lossless_image_recompress_quality = quality_str + if lossy: + opts.color_lossy_image_recompress_method = mupdf.FZ_RECOMPRESS_JPEG + opts.color_lossy_image_subsample_method = mupdf.FZ_SUBSAMPLE_AVERAGE + opts.color_lossy_image_subsample_threshold = dpi_threshold + opts.color_lossy_image_subsample_to = dpi_target + opts.color_lossy_image_recompress_quality = quality_str + if gray: + if lossless: + opts.gray_lossless_image_recompress_method = mupdf.FZ_RECOMPRESS_JPEG + opts.gray_lossless_image_subsample_method = mupdf.FZ_SUBSAMPLE_AVERAGE + opts.gray_lossless_image_subsample_to = dpi_target + opts.gray_lossless_image_subsample_threshold = dpi_threshold + opts.gray_lossless_image_recompress_quality = quality_str + if lossy: + opts.gray_lossy_image_recompress_method = mupdf.FZ_RECOMPRESS_JPEG + opts.gray_lossy_image_subsample_method = mupdf.FZ_SUBSAMPLE_AVERAGE + opts.gray_lossy_image_subsample_threshold = dpi_threshold + opts.gray_lossy_image_subsample_to = dpi_target + opts.gray_lossy_image_recompress_quality = 
quality_str + else: + opts = options + + dir2 = set(dir(opts)) # checking that only possible options were used + invalid_options = dir2 - dir1 + if invalid_options: + raise ValueError(f"Invalid options: {invalid_options}") + + if set_to_gray: + self.recolor(1) + pdf = _as_pdf_document(self) + mupdf.pdf_rewrite_images(pdf, opts) + + def recolor(self, components=1): + """Change the color component count on all pages. + + Args: + components: (int) desired color component count, one of 1, 3, 4. + + Invokes the same-named method for all pages. + """ + if not self.is_pdf: + raise ValueError("is no PDF") + for i in range(self.page_count): + self.load_page(i).recolor(components) + + def resolve_names(self): + """Convert the PDF's destination names into a Python dict. + + The only parameter is the pymupdf.Document. + All names found in the catalog under keys "/Dests" and "/Names/Dests" are + being included. + + Returns: + A dcitionary with the following layout: + - key: (str) the name + - value: (dict) with the following layout: + * "page": target page number (0-based). If no page number found -1. + * "to": (x, y) target point on page - currently in PDF coordinates, + i.e. point (0,0) is the bottom-left of the page. + * "zoom": (float) the zoom factor + * "dest": (str) only occurs if the target location on the page has + not been provided as "/XYZ" or if no page number was found. + Examples: + {'__bookmark_1': {'page': 0, 'to': (0.0, 541.0), 'zoom': 0.0}, + '__bookmark_2': {'page': 0, 'to': (0.0, 481.45), 'zoom': 0.0}} + + or + + '21154a7c20684ceb91f9c9adc3b677c40': {'page': -1, 'dest': '/XYZ 15.75 1486 0'}, ... + """ + if hasattr(self, "_resolved_names"): # do not execute multiple times! + return self._resolved_names + # this is a backward listing of page xref to page number + page_xrefs = {self.page_xref(i): i for i in range(self.page_count)} + + def obj_string(obj): + """Return string version of a PDF object definition.""" + buffer = mupdf.fz_new_buffer(512) + output = mupdf.FzOutput(buffer) + mupdf.pdf_print_obj(output, obj, 1, 0) + output.fz_close_output() + return JM_UnicodeFromBuffer(buffer) + + def get_array(val): + """Generate value of one item of the names dictionary.""" + templ_dict = {"page": -1, "dest": ""} # value template + if val.pdf_is_indirect(): + val = mupdf.pdf_resolve_indirect(val) + if val.pdf_is_array(): + array = obj_string(val) + elif val.pdf_is_dict(): + array = obj_string(mupdf.pdf_dict_gets(val, "D")) + else: # if all fails return the empty template + return templ_dict + + # replace PDF "null" by zero, omit the square brackets + array = array.replace("null", "0")[1:-1] + + # find stuff before first "/" + idx = array.find("/") + if idx < 1: # this has no target page spec + templ_dict["dest"] = array # return the orig. 
string + return templ_dict + + subval = array[:idx].strip() # stuff before "/" + array = array[idx:] # stuff from "/" onwards + templ_dict["dest"] = array + # if we start with /XYZ: extract x, y, zoom + # 1, 2 or 3 of these values may actually be supplied + if array.startswith("/XYZ"): + del templ_dict["dest"] # don't return orig string in this case + + # make a list of the 3 tokens following "/XYZ" + array_list = array.split()[1:4] # omit "/XYZ" + + # fill up missing tokens with "0" strings + while len(array_list) < 3: # fill up if too short + array_list.append("0") # add missing values + + # make list of 3 floats: x, y and zoom + t = list(map(float, array_list)) # the resulting x, y, z values + templ_dict["to"] = (t[0], t[1]) + templ_dict["zoom"] = t[2] + + # extract page number + if subval.endswith("0 R"): # page xref given? + templ_dict["page"] = page_xrefs.get(int(subval.split()[0]),-1) + else: # naked page number given + templ_dict["page"] = int(subval) + return templ_dict + + def fill_dict(dest_dict, pdf_dict): + """Generate name resolution items for pdf_dict. + + This may be either "/Names/Dests" or just "/Dests" + """ + # length of the PDF dictionary + name_count = mupdf.pdf_dict_len(pdf_dict) + + # extract key-val of each dict item + for i in range(name_count): + key = mupdf.pdf_dict_get_key(pdf_dict, i) + val = mupdf.pdf_dict_get_val(pdf_dict, i) + if key.pdf_is_name(): # this should always be true! + dict_key = key.pdf_to_name() + else: + message(f"key {i} is no /Name") + dict_key = None + + if dict_key: + dest_dict[dict_key] = get_array(val) # store key/value in dict + + # access underlying PDF document of fz Document + pdf = mupdf.pdf_document_from_fz_document(self) + + # access PDF catalog + catalog = mupdf.pdf_dict_gets(mupdf.pdf_trailer(pdf), "Root") + + dest_dict = {} + + # make PDF_NAME(Dests) + dests = mupdf.pdf_new_name("Dests") + + # extract destinations old style (PDF 1.1) + old_dests = mupdf.pdf_dict_get(catalog, dests) + if old_dests.pdf_is_dict(): + fill_dict(dest_dict, old_dests) + + # extract destinations new style (PDF 1.2+) + tree = mupdf.pdf_load_name_tree(pdf, dests) + if tree.pdf_is_dict(): + fill_dict(dest_dict, tree) + + self._resolved_names = dest_dict # store result or reuse + return dest_dict + + def save( + self, + filename, + garbage=0, + clean=0, + deflate=0, + deflate_images=0, + deflate_fonts=0, + incremental=0, + ascii=0, + expand=0, + linear=0, + no_new_id=0, + appearance=0, + pretty=0, + encryption=1, + permissions=4095, + owner_pw=None, + user_pw=None, + preserve_metadata=1, + use_objstms=0, + compression_effort=0, + ): + # From %pythonprepend save + # + """Save PDF to file, pathlib.Path or file pointer.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if type(filename) is str: + pass + elif hasattr(filename, "open"): # assume: pathlib.Path + filename = str(filename) + elif hasattr(filename, "name"): # assume: file object + filename = filename.name + elif not hasattr(filename, "seek"): # assume file object + raise ValueError("filename must be str, Path or file object") + if filename == self.name and not incremental: + raise ValueError("save to original must be incremental") + if linear and use_objstms: + raise ValueError("'linear' and 'use_objstms' cannot both be requested") + if self.page_count < 1: + raise ValueError("cannot save with zero pages") + if incremental: + if self.name != filename or self.stream: + raise ValueError("incremental needs original file") + if user_pw and len(user_pw) > 
40 or owner_pw and len(owner_pw) > 40: + raise ValueError("password length must not exceed 40") + + pdf = _as_pdf_document(self) + opts = mupdf.PdfWriteOptions() + opts.do_incremental = incremental + opts.do_ascii = ascii + opts.do_compress = deflate + opts.do_compress_images = deflate_images + opts.do_compress_fonts = deflate_fonts + opts.do_decompress = expand + opts.do_garbage = garbage + opts.do_pretty = pretty + opts.do_linear = linear + opts.do_clean = clean + opts.do_sanitize = clean + opts.dont_regenerate_id = no_new_id + opts.do_appearance = appearance + opts.do_encrypt = encryption + opts.permissions = permissions + if owner_pw is not None: + opts.opwd_utf8_set_value(owner_pw) + elif user_pw is not None: + opts.opwd_utf8_set_value(user_pw) + if user_pw is not None: + opts.upwd_utf8_set_value(user_pw) + opts.do_preserve_metadata = preserve_metadata + opts.do_use_objstms = use_objstms + opts.compression_effort = compression_effort + + out = None + pdf.m_internal.resynth_required = 0 + JM_embedded_clean(pdf) + if no_new_id == 0: + JM_ensure_identity(pdf) + if isinstance(filename, str): + #log( 'calling mupdf.pdf_save_document()') + mupdf.pdf_save_document(pdf, filename, opts) + else: + out = JM_new_output_fileptr(filename) + #log( f'{type(out)=} {type(out.this)=}') + mupdf.pdf_write_document(pdf, out, opts) + out.fz_close_output() + + def save_snapshot(self, filename): + """Save a file snapshot suitable for journalling.""" + if self.is_closed: + raise ValueError("doc is closed") + if type(filename) is str: + pass + elif hasattr(filename, "open"): # assume: pathlib.Path + filename = str(filename) + elif hasattr(filename, "name"): # assume: file object + filename = filename.name + else: + raise ValueError("filename must be str, Path or file object") + if filename == self.name: + raise ValueError("cannot snapshot to original") + pdf = _as_pdf_document(self) + mupdf.pdf_save_snapshot(pdf, filename) + + def saveIncr(self): + """ Save PDF incrementally""" + return self.save(self.name, incremental=True, encryption=mupdf.PDF_ENCRYPT_KEEP) + + def select(self, pyliste): + """Build sub-pdf with page numbers in the list.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if not self.is_pdf: + raise ValueError("is no PDF") + if not hasattr(pyliste, "__getitem__"): + raise ValueError("sequence required") + + valid_range = range(len(self)) + if (len(pyliste) == 0 + or min(pyliste) not in valid_range + or max(pyliste) not in valid_range + ): + raise ValueError("bad page number(s)") + + # get underlying pdf document, + pdf = _as_pdf_document(self) + # create page sub-pdf via pdf_rearrange_pages2(). + # + if mupdf_version_tuple >= (1, 25, 3): + # We use PDF_CLEAN_STRUCTURE_KEEP otherwise we lose structure tree + # which, for example, breaks test_3705. 
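+            # pdf_rearrange_pages2() retains exactly the pages listed in pyliste,
+            # in the given order; all other pages are dropped.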
+ mupdf.pdf_rearrange_pages2(pdf, pyliste, mupdf.PDF_CLEAN_STRUCTURE_KEEP) + else: + mupdf.pdf_rearrange_pages2(pdf, pyliste) + + # remove any existing pages with their kids + self._reset_page_refs() + + def set_language(self, language=None): + pdf = _as_pdf_document(self) + if not language: + lang = mupdf.FZ_LANG_UNSET + else: + lang = mupdf.fz_text_language_from_string(language) + mupdf.pdf_set_document_language(pdf, lang) + return True + + def set_layer(self, config, basestate=None, on=None, off=None, rbgroups=None, locked=None): + """Set the PDF keys /ON, /OFF, /RBGroups of an OC layer.""" + if self.is_closed: + raise ValueError("document closed") + ocgs = set(self.get_ocgs().keys()) + if ocgs == set(): + raise ValueError("document has no optional content") + + if on: + if type(on) not in (list, tuple): + raise ValueError("bad type: 'on'") + s = set(on).difference(ocgs) + if s != set(): + raise ValueError("bad OCGs in 'on': %s" % s) + + if off: + if type(off) not in (list, tuple): + raise ValueError("bad type: 'off'") + s = set(off).difference(ocgs) + if s != set(): + raise ValueError("bad OCGs in 'off': %s" % s) + + if locked: + if type(locked) not in (list, tuple): + raise ValueError("bad type: 'locked'") + s = set(locked).difference(ocgs) + if s != set(): + raise ValueError("bad OCGs in 'locked': %s" % s) + + if rbgroups: + if type(rbgroups) not in (list, tuple): + raise ValueError("bad type: 'rbgroups'") + for x in rbgroups: + if not type(x) in (list, tuple): + raise ValueError("bad RBGroup '%s'" % x) + s = set(x).difference(ocgs) + if s != set(): + raise ValueError("bad OCGs in RBGroup: %s" % s) + + if basestate: + basestate = str(basestate).upper() + if basestate == "UNCHANGED": + basestate = "Unchanged" + if basestate not in ("ON", "OFF", "Unchanged"): + raise ValueError("bad 'basestate'") + pdf = _as_pdf_document(self) + ocp = mupdf.pdf_dict_getl( + mupdf.pdf_trailer( pdf), + PDF_NAME('Root'), + PDF_NAME('OCProperties'), + ) + if not ocp.m_internal: + return + if config == -1: + obj = mupdf.pdf_dict_get( ocp, PDF_NAME('D')) + else: + obj = mupdf.pdf_array_get( + mupdf.pdf_dict_get( ocp, PDF_NAME('Configs')), + config, + ) + if not obj.m_internal: + raise ValueError( MSG_BAD_OC_CONFIG) + JM_set_ocg_arrays( obj, basestate, on, off, rbgroups, locked) + mupdf.ll_pdf_read_ocg( pdf.m_internal) + + def set_layer_ui_config(self, number, action=0): + """Set / unset OC intent configuration.""" + # The user might have given the name instead of sequence number, + # so select by that name and continue with corresp. 
number + if isinstance(number, str): + select = [ui["number"] for ui in self.layer_ui_configs() if ui["text"] == number] + if select == []: + raise ValueError(f"bad OCG '{number}'.") + number = select[0] # this is the number for the name + pdf = _as_pdf_document(self) + if action == 1: + mupdf.pdf_toggle_layer_config_ui(pdf, number) + elif action == 2: + mupdf.pdf_deselect_layer_config_ui(pdf, number) + else: + mupdf.pdf_select_layer_config_ui(pdf, number) + + def set_markinfo(self, markinfo: dict) -> bool: + """Set the PDF MarkInfo values.""" + xref = self.pdf_catalog() + if xref == 0: + raise ValueError("not a PDF") + if not markinfo or not isinstance(markinfo, dict): + return False + valid = {"Marked": False, "UserProperties": False, "Suspects": False} + + if not set(valid.keys()).issuperset(markinfo.keys()): + badkeys = f"bad MarkInfo key(s): {set(markinfo.keys()).difference(valid.keys())}" + raise ValueError(badkeys) + pdfdict = "<<" + valid.update(markinfo) + for key, value in valid.items(): + value=str(value).lower() + if value not in ("true", "false"): + raise ValueError(f"bad key value '{key}': '{value}'") + pdfdict += f"/{key} {value}" + pdfdict += ">>" + self.xref_set_key(xref, "MarkInfo", pdfdict) + return True + + def set_pagelayout(self, pagelayout: str): + """Set the PDF PageLayout value.""" + valid = ("SinglePage", "OneColumn", "TwoColumnLeft", "TwoColumnRight", "TwoPageLeft", "TwoPageRight") + xref = self.pdf_catalog() + if xref == 0: + raise ValueError("not a PDF") + if not pagelayout: + raise ValueError("bad PageLayout value") + if pagelayout[0] == "/": + pagelayout = pagelayout[1:] + for v in valid: + if pagelayout.lower() == v.lower(): + self.xref_set_key(xref, "PageLayout", f"/{v}") + return True + raise ValueError("bad PageLayout value") + + def set_pagemode(self, pagemode: str): + """Set the PDF PageMode value.""" + valid = ("UseNone", "UseOutlines", "UseThumbs", "FullScreen", "UseOC", "UseAttachments") + xref = self.pdf_catalog() + if xref == 0: + raise ValueError("not a PDF") + if not pagemode: + raise ValueError("bad PageMode value") + if pagemode[0] == "/": + pagemode = pagemode[1:] + for v in valid: + if pagemode.lower() == v.lower(): + self.xref_set_key(xref, "PageMode", f"/{v}") + return True + raise ValueError("bad PageMode value") + + def set_xml_metadata(self, metadata): + """Store XML document level metadata.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + root = mupdf.pdf_dict_get( mupdf.pdf_trailer( pdf), PDF_NAME('Root')) + if not root.m_internal: + RAISEPY( MSG_BAD_PDFROOT, JM_Exc_FileDataError) + res = mupdf.fz_new_buffer_from_copied_data( metadata.encode('utf-8')) + xml = mupdf.pdf_dict_get( root, PDF_NAME('Metadata')) + if xml.m_internal: + JM_update_stream( pdf, xml, res, 0) + else: + xml = mupdf.pdf_add_stream( pdf, res, mupdf.PdfObj(), 0) + mupdf.pdf_dict_put( xml, PDF_NAME('Type'), PDF_NAME('Metadata')) + mupdf.pdf_dict_put( xml, PDF_NAME('Subtype'), PDF_NAME('XML')) + mupdf.pdf_dict_put( root, PDF_NAME('Metadata'), xml) + + def switch_layer(self, config, as_default=0): + """Activate an OC layer.""" + pdf = _as_pdf_document(self) + cfgs = mupdf.pdf_dict_getl( + mupdf.pdf_trailer( pdf), + PDF_NAME('Root'), + PDF_NAME('OCProperties'), + PDF_NAME('Configs') + ) + if not mupdf.pdf_is_array( cfgs) or not mupdf.pdf_array_len( cfgs): + if config < 1: + return + raise ValueError( MSG_BAD_OC_LAYER) + if config < 0: + return + mupdf.pdf_select_layer_config( pdf, config) + if 
as_default: + mupdf.pdf_set_layer_config_as_default( pdf) + mupdf.ll_pdf_read_ocg( pdf.m_internal) + + def update_object(self, xref, text, page=None): + """Replace object definition source.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + xreflen = mupdf.pdf_xref_len(pdf) + if not _INRANGE(xref, 1, xreflen-1): + RAISEPY("bad xref", MSG_BAD_XREF) + ENSURE_OPERATION(pdf) + # create new object with passed-in string + new_obj = JM_pdf_obj_from_str(pdf, text) + mupdf.pdf_update_object(pdf, xref, new_obj) + if page: + JM_refresh_links( _as_pdf_page(page)) + + def update_stream(self, xref=0, stream=None, new=1, compress=1): + """Replace xref stream part.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + xreflen = mupdf.pdf_xref_len(pdf) + if xref < 1 or xref > xreflen: + raise ValueError( MSG_BAD_XREF) + # get the object + obj = mupdf.pdf_new_indirect(pdf, xref, 0) + if not mupdf.pdf_is_dict(obj): + raise ValueError( MSG_IS_NO_DICT) + res = JM_BufferFromBytes(stream) + if not res.m_internal: + raise TypeError( MSG_BAD_BUFFER) + JM_update_stream(pdf, obj, res, compress) + pdf.dirty = 1 + + @property + def version_count(self): + ''' + Count versions of PDF document. + ''' + pdf = _as_pdf_document(self, required=0) + if pdf.m_internal: + return mupdf.pdf_count_versions(pdf) + return 0 + + def write( + self, + garbage=False, + clean=False, + deflate=False, + deflate_images=False, + deflate_fonts=False, + incremental=False, + ascii=False, + expand=False, + linear=False, + no_new_id=False, + appearance=False, + pretty=False, + encryption=1, + permissions=4095, + owner_pw=None, + user_pw=None, + preserve_metadata=1, + use_objstms=0, + compression_effort=0, + ): + from io import BytesIO + bio = BytesIO() + self.save( + bio, + garbage=garbage, + clean=clean, + no_new_id=no_new_id, + appearance=appearance, + deflate=deflate, + deflate_images=deflate_images, + deflate_fonts=deflate_fonts, + incremental=incremental, + ascii=ascii, + expand=expand, + linear=linear, + pretty=pretty, + encryption=encryption, + permissions=permissions, + owner_pw=owner_pw, + user_pw=user_pw, + preserve_metadata=preserve_metadata, + use_objstms=use_objstms, + compression_effort=compression_effort, + ) + return bio.getvalue() + + @property + def xref(self): + """PDF xref number of page.""" + CheckParent(self) + return self.parent.page_xref(self.number) + + def xref_get_key(self, xref, key): + """Get PDF dict key value of object at 'xref'.""" + pdf = _as_pdf_document(self) + xreflen = mupdf.pdf_xref_len(pdf) + if not _INRANGE(xref, 1, xreflen-1) and xref != -1: + raise ValueError( MSG_BAD_XREF) + if xref > 0: + obj = mupdf.pdf_load_object(pdf, xref) + else: + obj = mupdf.pdf_trailer(pdf) + if not obj.m_internal: + return ("null", "null") + subobj = mupdf.pdf_dict_getp(obj, key) + if not subobj.m_internal: + return ("null", "null") + text = None + if mupdf.pdf_is_indirect(subobj): + type = "xref" + text = "%i 0 R" % mupdf.pdf_to_num(subobj) + elif mupdf.pdf_is_array(subobj): + type = "array" + elif mupdf.pdf_is_dict(subobj): + type = "dict" + elif mupdf.pdf_is_int(subobj): + type = "int" + text = "%i" % mupdf.pdf_to_int(subobj) + elif mupdf.pdf_is_real(subobj): + type = "float" + elif mupdf.pdf_is_null(subobj): + type = "null" + text = "null" + elif mupdf.pdf_is_bool(subobj): + type = "bool" + if mupdf.pdf_to_bool(subobj): + text = "true" + else: + text = "false" + elif 
mupdf.pdf_is_name(subobj): + type = "name" + text = "/%s" % mupdf.pdf_to_name(subobj) + elif mupdf.pdf_is_string(subobj): + type = "string" + text = JM_UnicodeFromStr(mupdf.pdf_to_text_string(subobj)) + else: + type = "unknown" + if text is None: + res = JM_object_to_buffer(subobj, 1, 0) + text = JM_UnicodeFromBuffer(res) + return (type, text) + + def xref_get_keys(self, xref): + """Get the keys of PDF dict object at 'xref'. Use -1 for the PDF trailer.""" + pdf = _as_pdf_document(self) + xreflen = mupdf.pdf_xref_len( pdf) + if not _INRANGE(xref, 1, xreflen-1) and xref != -1: + raise ValueError( MSG_BAD_XREF) + if xref > 0: + obj = mupdf.pdf_load_object( pdf, xref) + else: + obj = mupdf.pdf_trailer( pdf) + n = mupdf.pdf_dict_len( obj) + rc = [] + if n == 0: + return rc + for i in range(n): + key = mupdf.pdf_to_name( mupdf.pdf_dict_get_key( obj, i)) + rc.append(key) + return rc + + def xref_is_font(self, xref): + """Check if xref is a font object.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if self.xref_get_key(xref, "Type")[1] == "/Font": + return True + return False + + def xref_is_image(self, xref): + """Check if xref is an image object.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if self.xref_get_key(xref, "Subtype")[1] == "/Image": + return True + return False + + def xref_is_stream(self, xref=0): + """Check if xref is a stream object.""" + pdf = _as_pdf_document(self, required=0) + if not pdf.m_internal: + return False # not a PDF + return bool(mupdf.pdf_obj_num_is_stream(pdf, xref)) + + def xref_is_xobject(self, xref): + """Check if xref is a form xobject.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if self.xref_get_key(xref, "Subtype")[1] == "/Form": + return True + return False + + def xref_length(self): + """Get length of xref table.""" + xreflen = 0 + pdf = _as_pdf_document(self, required=0) + if pdf.m_internal: + xreflen = mupdf.pdf_xref_len(pdf) + return xreflen + + def xref_object(self, xref, compressed=0, ascii=0): + """Get xref object source as a string.""" + if self.is_closed: + raise ValueError("document closed") + if g_use_extra: + ret = extra.xref_object( self.this, xref, compressed, ascii) + return ret + pdf = _as_pdf_document(self) + xreflen = mupdf.pdf_xref_len(pdf) + if not _INRANGE(xref, 1, xreflen-1) and xref != -1: + raise ValueError( MSG_BAD_XREF) + if xref > 0: + obj = mupdf.pdf_load_object(pdf, xref) + else: + obj = mupdf.pdf_trailer(pdf) + res = JM_object_to_buffer(mupdf.pdf_resolve_indirect(obj), compressed, ascii) + text = JM_EscapeStrFromBuffer(res) + return text + + def xref_set_key(self, xref, key, value): + """Set the value of a PDF dictionary key.""" + if self.is_closed: + raise ValueError("document closed") + + if not key or not isinstance(key, str) or INVALID_NAME_CHARS.intersection(key) not in (set(), {"/"}): + raise ValueError("bad 'key'") + if not isinstance(value, str) or not value or value[0] == "/" and INVALID_NAME_CHARS.intersection(value[1:]) != set(): + raise ValueError("bad 'value'") + + pdf = _as_pdf_document(self) + xreflen = mupdf.pdf_xref_len(pdf) + #if not _INRANGE(xref, 1, xreflen-1) and xref != -1: + # THROWMSG("bad xref") + #if len(value) == 0: + # THROWMSG("bad 'value'") + #if len(key) == 0: + # THROWMSG("bad 'key'") + if not _INRANGE(xref, 1, xreflen-1) and xref != -1: + raise ValueError( MSG_BAD_XREF) + if xref != -1: + obj = mupdf.pdf_load_object(pdf, xref) + else: + obj 
= mupdf.pdf_trailer(pdf) + new_obj = JM_set_object_value(obj, key, value) + if not new_obj.m_internal: + return # did not work: skip update + if xref != -1: + mupdf.pdf_update_object(pdf, xref, new_obj) + else: + n = mupdf.pdf_dict_len(new_obj) + for i in range(n): + mupdf.pdf_dict_put( + obj, + mupdf.pdf_dict_get_key(new_obj, i), + mupdf.pdf_dict_get_val(new_obj, i), + ) + + def xref_stream(self, xref): + """Get decompressed xref stream.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + xreflen = mupdf.pdf_xref_len( pdf) + if not _INRANGE(xref, 1, xreflen-1) and xref != -1: + raise ValueError( MSG_BAD_XREF) + if xref >= 0: + obj = mupdf.pdf_new_indirect( pdf, xref, 0) + else: + obj = mupdf.pdf_trailer( pdf) + r = None + if mupdf.pdf_is_stream( obj): + res = mupdf.pdf_load_stream_number( pdf, xref) + r = JM_BinFromBuffer( res) + return r + + def xref_stream_raw(self, xref): + """Get xref stream without decompression.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + pdf = _as_pdf_document(self) + xreflen = mupdf.pdf_xref_len( pdf) + if not _INRANGE(xref, 1, xreflen-1) and xref != -1: + raise ValueError( MSG_BAD_XREF) + if xref >= 0: + obj = mupdf.pdf_new_indirect( pdf, xref, 0) + else: + obj = mupdf.pdf_trailer( pdf) + r = None + if mupdf.pdf_is_stream( obj): + res = mupdf.pdf_load_raw_stream_number( pdf, xref) + r = JM_BinFromBuffer( res) + return r + + def xref_xml_metadata(self): + """Get xref of document XML metadata.""" + pdf = _as_pdf_document(self) + root = mupdf.pdf_dict_get( mupdf.pdf_trailer( pdf), PDF_NAME('Root')) + if not root.m_internal: + RAISEPY( MSG_BAD_PDFROOT, JM_Exc_FileDataError) + xml = mupdf.pdf_dict_get( root, PDF_NAME('Metadata')) + xref = 0 + if xml.m_internal: + xref = mupdf.pdf_to_num( xml) + return xref + + __slots__ = ('this', 'page_count2', 'this_is_pdf', '__dict__') + + outline = property(lambda self: self._outline) + tobytes = write + is_stream = xref_is_stream + +open = Document + + +class DocumentWriter: + + def __enter__(self): + return self + + def __exit__(self, *args): + self.close() + + def __init__(self, path, options=''): + if isinstance( path, str): + pass + elif hasattr( path, 'absolute'): + path = str( path) + elif hasattr( path, 'name'): + path = path.name + if isinstance( path, str): + self.this = mupdf.FzDocumentWriter( path, options, mupdf.FzDocumentWriter.PathType_PDF) + else: + # Need to keep the Python JM_new_output_fileptr_Output instance + # alive for the lifetime of this DocumentWriter, otherwise calls + # to virtual methods implemented in Python fail. So we make it a + # member of this DocumentWriter. + # + # Unrelated to this, mupdf.FzDocumentWriter will set + # self._out.m_internal to null because ownership is passed in. 
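+            #
+            # Illustrative usage (a sketch, not part of the library API):
+            # any argument that is neither a str/pathlib.Path nor an object
+            # with a 'name' attribute ends up in this branch, e.g. an
+            # io.BytesIO instance:
+            #
+            #   buf = io.BytesIO()
+            #   writer = DocumentWriter(buf)
+            #   dev = writer.begin_page(mediabox)  # mediabox: rect-like
+            #   ...                                # draw via 'dev'
+            #   writer.end_page()
+            #   writer.close()
+            #
+            # The assertions below then verify that ownership of the wrapped
+            # output has been transferred to mupdf.FzDocumentWriter.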
+ # + out = JM_new_output_fileptr( path) + self.this = mupdf.FzDocumentWriter( out, options, mupdf.FzDocumentWriter.OutputType_PDF) + assert out.m_internal_value() == 0 + assert hasattr( self.this, '_out') + + def begin_page( self, mediabox): + mediabox2 = JM_rect_from_py(mediabox) + device = mupdf.fz_begin_page( self.this, mediabox2) + device_wrapper = DeviceWrapper( device) + return device_wrapper + + def close( self): + mupdf.fz_close_document_writer( self.this) + + def end_page( self): + mupdf.fz_end_page( self.this) + + +class Font: + + def __del__(self): + if type(self) is not Font: + return None + + def __init__( + self, + fontname=None, + fontfile=None, + fontbuffer=None, + script=0, + language=None, + ordering=-1, + is_bold=0, + is_italic=0, + is_serif=0, + embed=1, + ): + + if fontbuffer: + if hasattr(fontbuffer, "getvalue"): + fontbuffer = fontbuffer.getvalue() + elif isinstance(fontbuffer, bytearray): + fontbuffer = bytes(fontbuffer) + if not isinstance(fontbuffer, bytes): + raise ValueError("bad type: 'fontbuffer'") + + if isinstance(fontname, str): + fname_lower = fontname.lower() + if "/" in fname_lower or "\\" in fname_lower or "." in fname_lower: + message("Warning: did you mean a fontfile?") + + if fname_lower in ("cjk", "china-t", "china-ts"): + ordering = 0 + + elif fname_lower.startswith("china-s"): + ordering = 1 + elif fname_lower.startswith("korea"): + ordering = 3 + elif fname_lower.startswith("japan"): + ordering = 2 + elif fname_lower in fitz_fontdescriptors.keys(): + import pymupdf_fonts # optional fonts + fontbuffer = pymupdf_fonts.myfont(fname_lower) # make a copy + fontname = None # ensure using fontbuffer only + del pymupdf_fonts # remove package again + + elif ordering < 0: + fontname = Base14_fontdict.get(fontname, fontname) + + lang = mupdf.fz_text_language_from_string(language) + font = JM_get_font(fontname, fontfile, + fontbuffer, script, lang, ordering, + is_bold, is_italic, is_serif, embed) + self.this = font + + def __repr__(self): + return "Font('%s')" % self.name + + @property + def ascender(self): + """Return the glyph ascender value.""" + return mupdf.fz_font_ascender(self.this) + + @property + def bbox(self): + return self.this.fz_font_bbox() + + @property + def buffer(self): + buffer_ = mupdf.FzBuffer( mupdf.ll_fz_keep_buffer( self.this.m_internal.buffer)) + return mupdf.fz_buffer_extract_copy( buffer_) + + def char_lengths(self, text, fontsize=11, language=None, script=0, wmode=0, small_caps=0): + """Return tuple of char lengths of unicode 'text' under a fontsize.""" + lang = mupdf.fz_text_language_from_string(language) + rc = [] + for ch in text: + c = ord(ch) + if small_caps: + gid = mupdf.fz_encode_character_sc(self.this, c) + if gid >= 0: + font = self.this + else: + gid, font = mupdf.fz_encode_character_with_fallback(self.this, c, script, lang) + rc.append(fontsize * mupdf.fz_advance_glyph(font, gid, wmode)) + return rc + + @property + def descender(self): + """Return the glyph descender value.""" + return mupdf.fz_font_descender(self.this) + + @property + def flags(self): + f = mupdf.ll_fz_font_flags(self.this.m_internal) + if not f: + return + assert isinstance( f, mupdf.fz_font_flags_t) + #log( '{=f}') + if mupdf_cppyy: + # cppyy includes remaining higher bits. 
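+            # The block below unpacks the flags word by hand: 'v' starts out
+            # as the raw value read from f.is_mono (which, per the note
+            # above, still carries the higher bits), and each b(n) call pops
+            # the lowest n bits off it, so the single-bit reads follow the
+            # field order of fz_font_flags_t.  For example, a raw value of
+            # 0b101 would yield is_mono=1, is_serif=0, is_bold=1.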
+ v = [f.is_mono] + def b(bits): + ret = v[0] & ((1 << bits)-1) + v[0] = v[0] >> bits + return ret + is_mono = b(1) + is_serif = b(1) + is_bold = b(1) + is_italic = b(1) + ft_substitute = b(1) + ft_stretch = b(1) + fake_bold = b(1) + fake_italic = b(1) + has_opentype = b(1) + invalid_bbox = b(1) + cjk_lang = b(1) + embed = b(1) + never_embed = b(1) + return { + "mono": is_mono if mupdf_cppyy else f.is_mono, + "serif": is_serif if mupdf_cppyy else f.is_serif, + "bold": is_bold if mupdf_cppyy else f.is_bold, + "italic": is_italic if mupdf_cppyy else f.is_italic, + "substitute": ft_substitute if mupdf_cppyy else f.ft_substitute, + "stretch": ft_stretch if mupdf_cppyy else f.ft_stretch, + "fake-bold": fake_bold if mupdf_cppyy else f.fake_bold, + "fake-italic": fake_italic if mupdf_cppyy else f.fake_italic, + "opentype": has_opentype if mupdf_cppyy else f.has_opentype, + "invalid-bbox": invalid_bbox if mupdf_cppyy else f.invalid_bbox, + 'cjk': cjk_lang if mupdf_cppyy else f.cjk, + 'cjk-lang': cjk_lang if mupdf_cppyy else f.cjk_lang, + 'embed': embed if mupdf_cppyy else f.embed, + 'never-embed': never_embed if mupdf_cppyy else f.never_embed, + } + + def glyph_advance(self, chr_, language=None, script=0, wmode=0, small_caps=0): + """Return the glyph width of a unicode (font size 1).""" + lang = mupdf.fz_text_language_from_string(language) + if small_caps: + gid = mupdf.fz_encode_character_sc(self.this, chr_) + if gid >= 0: + font = self.this + else: + gid, font = mupdf.fz_encode_character_with_fallback(self.this, chr_, script, lang) + return mupdf.fz_advance_glyph(font, gid, wmode) + + def glyph_bbox(self, char, language=None, script=0, small_caps=0): + """Return the glyph bbox of a unicode (font size 1).""" + lang = mupdf.fz_text_language_from_string(language) + if small_caps: + gid = mupdf.fz_encode_character_sc( self.this, char) + if gid >= 0: + font = self.this + else: + gid, font = mupdf.fz_encode_character_with_fallback( self.this, char, script, lang) + return Rect(mupdf.fz_bound_glyph( font, gid, mupdf.FzMatrix())) + + @property + def glyph_count(self): + return self.this.m_internal.glyph_count + + def glyph_name_to_unicode(self, name): + """Return the unicode for a glyph name.""" + return glyph_name_to_unicode(name) + + def has_glyph(self, chr, language=None, script=0, fallback=0, small_caps=0): + """Check whether font has a glyph for this unicode.""" + if fallback: + lang = mupdf.fz_text_language_from_string(language) + gid, font = mupdf.fz_encode_character_with_fallback(self.this, chr, script, lang) + else: + if small_caps: + gid = mupdf.fz_encode_character_sc(self.this, chr) + else: + gid = mupdf.fz_encode_character(self.this, chr) + return gid + + @property + def is_bold(self): + return mupdf.fz_font_is_bold( self.this) + + @property + def is_italic(self): + return mupdf.fz_font_is_italic( self.this) + + @property + def is_monospaced(self): + return mupdf.fz_font_is_monospaced( self.this) + + @property + def is_serif(self): + return mupdf.fz_font_is_serif( self.this) + + @property + def is_writable(self): + return True # see pymupdf commit ef4056ee4da2 + font = self.this + flags = mupdf.ll_fz_font_flags(font.m_internal) + if mupdf_cppyy: + # cppyy doesn't handle bitfields correctly. 
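+            # cppyy does not model C bitfields, so ft_substitute cannot be
+            # read as a plain attribute here; the helper call below extracts
+            # that field from the flags struct instead.  (Note this branch is
+            # currently unreachable because of the unconditional 'return True'
+            # at the top of the property.)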
+ import cppyy + ft_substitute = cppyy.gbl.mupdf_mfz_font_flags_ft_substitute( flags) + else: + ft_substitute = flags.ft_substitute + + if ( mupdf.ll_fz_font_t3_procs(font.m_internal) + or ft_substitute + or not mupdf.pdf_font_writing_supported(font) + ): + return False + return True + + @property + def name(self): + ret = mupdf.fz_font_name(self.this) + #log( '{ret=}') + return ret + + def text_length(self, text, fontsize=11, language=None, script=0, wmode=0, small_caps=0): + """Return length of unicode 'text' under a fontsize.""" + thisfont = self.this + lang = mupdf.fz_text_language_from_string(language) + rc = 0 + if not isinstance(text, str): + raise TypeError( MSG_BAD_TEXT) + for ch in text: + c = ord(ch) + if small_caps: + gid = mupdf.fz_encode_character_sc(thisfont, c) + if gid >= 0: + font = thisfont + else: + gid, font = mupdf.fz_encode_character_with_fallback(thisfont, c, script, lang) + rc += mupdf.fz_advance_glyph(font, gid, wmode) + rc *= fontsize + return rc + + def unicode_to_glyph_name(self, ch): + """Return the glyph name for a unicode.""" + return unicode_to_glyph_name(ch) + + def valid_codepoints(self): + ''' + Returns sorted list of valid unicodes of a fz_font. + ''' + ucs_gids = mupdf.fz_enumerate_font_cmap2(self.this) + ucss = [i.ucs for i in ucs_gids] + ucss_unique = set(ucss) + ucss_unique_sorted = sorted(ucss_unique) + return ucss_unique_sorted + + +class Graftmap: + + def __del__(self): + if not type(self) is Graftmap: + return + self.thisown = False + + def __init__(self, doc): + dst = _as_pdf_document(doc) + map_ = mupdf.pdf_new_graft_map(dst) + self.this = map_ + self.thisown = True + + +class Link: + def __del__(self): + self._erase() + + def __init__( self, this): + assert isinstance( this, mupdf.FzLink) + self.this = this + + def __repr__(self): + CheckParent(self) + return "link on " + str(self.parent) + + def __str__(self): + CheckParent(self) + return "link on " + str(self.parent) + + def _border(self, doc, xref): + pdf = _as_pdf_document(doc, required=0) + if not pdf.m_internal: + return + link_obj = mupdf.pdf_new_indirect(pdf, xref, 0) + if not link_obj.m_internal: + return + b = JM_annot_border(link_obj) + return b + + def _colors(self, doc, xref): + pdf = _as_pdf_document(doc, required=0) + if not pdf.m_internal: + return + link_obj = mupdf.pdf_new_indirect( pdf, xref, 0) + if not link_obj.m_internal: + raise ValueError( MSG_BAD_XREF) + b = JM_annot_colors( link_obj) + return b + + def _erase(self): + self.parent = None + self.thisown = False + + def _setBorder(self, border, doc, xref): + pdf = _as_pdf_document(doc, required=0) + if not pdf.m_internal: + return + link_obj = mupdf.pdf_new_indirect(pdf, xref, 0) + if not link_obj.m_internal: + return + b = JM_annot_set_border(border, pdf, link_obj) + return b + + @property + def border(self): + return self._border(self.parent.parent.this, self.xref) + + @property + def colors(self): + return self._colors(self.parent.parent.this, self.xref) + + @property + def dest(self): + """Create link destination details.""" + if hasattr(self, "parent") and self.parent is None: + raise ValueError("orphaned object: parent is None") + if self.parent.parent.is_closed or self.parent.parent.is_encrypted: + raise ValueError("document closed or encrypted") + doc = self.parent.parent + + if self.is_external or self.uri.startswith("#"): + uri = None + else: + uri = doc.resolve_link(self.uri) + + return linkDest(self, uri, doc) + + @property + def flags(self)->int: + CheckParent(self) + doc = self.parent.parent + if not 
doc.is_pdf: + return 0 + f = doc.xref_get_key(self.xref, "F") + if f[1] != "null": + return int(f[1]) + return 0 + + @property + def is_external(self): + """Flag the link as external.""" + CheckParent(self) + if g_use_extra: + return extra.Link_is_external( self.this) + this_link = self.this + if not this_link.m_internal or not this_link.m_internal.uri: + return False + return bool( mupdf.fz_is_external_link( this_link.m_internal.uri)) + + @property + def next(self): + """Next link.""" + if not self.this.m_internal: + return None + CheckParent(self) + if 0 and g_use_extra: + val = extra.Link_next( self.this) + else: + val = self.this.next() + if not val.m_internal: + return None + val = Link( val) + if val: + val.thisown = True + val.parent = self.parent # copy owning page from prev link + val.parent._annot_refs[id(val)] = val + if self.xref > 0: # prev link has an xref + link_xrefs = [x[0] for x in self.parent.annot_xrefs() if x[1] == mupdf.PDF_ANNOT_LINK] + link_ids = [x[2] for x in self.parent.annot_xrefs() if x[1] == mupdf.PDF_ANNOT_LINK] + idx = link_xrefs.index(self.xref) + val.xref = link_xrefs[idx + 1] + val.id = link_ids[idx + 1] + else: + val.xref = 0 + val.id = "" + return val + + @property + def rect(self): + """Rectangle ('hot area').""" + CheckParent(self) + # utils.py:getLinkDict() appears to expect exceptions from us, so we + # ensure that we raise on error. + if self.this is None or not self.this.m_internal: + raise Exception( 'self.this.m_internal not available') + val = JM_py_from_rect( self.this.rect()) + val = Rect(val) + return val + + def set_border(self, border=None, width=0, dashes=None, style=None): + if type(border) is not dict: + border = {"width": width, "style": style, "dashes": dashes} + return self._setBorder(border, self.parent.parent.this, self.xref) + + def set_colors(self, colors=None, stroke=None, fill=None): + """Set border colors.""" + CheckParent(self) + doc = self.parent.parent + if type(colors) is not dict: + colors = {"fill": fill, "stroke": stroke} + fill = colors.get("fill") + stroke = colors.get("stroke") + if fill is not None: + message("warning: links have no fill color") + if stroke in ([], ()): + doc.xref_set_key(self.xref, "C", "[]") + return + if hasattr(stroke, "__float__"): + stroke = [float(stroke)] + CheckColor(stroke) + assert len(stroke) in (1, 3, 4) + s = f"[{_format_g(stroke)}]" + doc.xref_set_key(self.xref, "C", s) + + def set_flags(self, flags): + CheckParent(self) + doc = self.parent.parent + if not doc.is_pdf: + raise ValueError("is no PDF") + if not type(flags) is int: + raise ValueError("bad 'flags' value") + doc.xref_set_key(self.xref, "F", str(flags)) + return None + + @property + def uri(self): + """Uri string.""" + #CheckParent(self) + if g_use_extra: + return extra.link_uri(self.this) + this_link = self.this + return this_link.m_internal.uri if this_link.m_internal else '' + + page = -1 + + +class Matrix: + + def __abs__(self): + return math.sqrt(sum([c*c for c in self])) + + def __add__(self, m): + if hasattr(m, "__float__"): + return Matrix(self.a + m, self.b + m, self.c + m, + self.d + m, self.e + m, self.f + m) + if len(m) != 6: + raise ValueError("Matrix: bad seq len") + return Matrix(self.a + m[0], self.b + m[1], self.c + m[2], + self.d + m[3], self.e + m[4], self.f + m[5]) + + def __bool__(self): + return not (max(self) == min(self) == 0) + + def __eq__(self, mat): + if not hasattr(mat, "__len__"): + return False + return len(mat) == 6 and not (self - mat) + + def __getitem__(self, i): + return (self.a, self.b, 
self.c, self.d, self.e, self.f)[i] + + def __init__(self, *args, a=None, b=None, c=None, d=None, e=None, f=None): + """ + Matrix() - all zeros + Matrix(a, b, c, d, e, f) + Matrix(zoom-x, zoom-y) - zoom + Matrix(shear-x, shear-y, 1) - shear + Matrix(degree) - rotate + Matrix(Matrix) - new copy + Matrix(sequence) - from 'sequence' + Matrix(mupdf.FzMatrix) - from MuPDF class wrapper for fz_matrix. + + Explicit keyword args a, b, c, d, e, f override any earlier settings if + not None. + """ + if not args: + self.a = self.b = self.c = self.d = self.e = self.f = 0.0 + elif len(args) > 6: + raise ValueError("Matrix: bad seq len") + elif len(args) == 6: # 6 numbers + self.a, self.b, self.c, self.d, self.e, self.f = map(float, args) + elif len(args) == 1: # either an angle or a sequ + if isinstance(args[0], mupdf.FzMatrix): + self.a = args[0].a + self.b = args[0].b + self.c = args[0].c + self.d = args[0].d + self.e = args[0].e + self.f = args[0].f + elif hasattr(args[0], "__float__"): + theta = math.radians(args[0]) + c_ = round(math.cos(theta), 8) + s_ = round(math.sin(theta), 8) + self.a = self.d = c_ + self.b = s_ + self.c = -s_ + self.e = self.f = 0.0 + else: + self.a, self.b, self.c, self.d, self.e, self.f = map(float, args[0]) + elif len(args) == 2 or len(args) == 3 and args[2] == 0: + self.a, self.b, self.c, self.d, self.e, self.f = float(args[0]), \ + 0.0, 0.0, float(args[1]), 0.0, 0.0 + elif len(args) == 3 and args[2] == 1: + self.a, self.b, self.c, self.d, self.e, self.f = 1.0, \ + float(args[1]), float(args[0]), 1.0, 0.0, 0.0 + else: + raise ValueError("Matrix: bad args") + + # Override with explicit args if specified. + if a is not None: self.a = a + if b is not None: self.b = b + if c is not None: self.c = c + if d is not None: self.d = d + if e is not None: self.e = e + if f is not None: self.f = f + + def __invert__(self): + """Calculate inverted matrix.""" + m1 = Matrix() + m1.invert(self) + return m1 + + def __len__(self): + return 6 + + def __mul__(self, m): + if hasattr(m, "__float__"): + return Matrix(self.a * m, self.b * m, self.c * m, + self.d * m, self.e * m, self.f * m) + m1 = Matrix(1,1) + return m1.concat(self, m) + + def __neg__(self): + return Matrix(-self.a, -self.b, -self.c, -self.d, -self.e, -self.f) + + def __nonzero__(self): + return not (max(self) == min(self) == 0) + + def __pos__(self): + return Matrix(self) + + def __repr__(self): + return "Matrix" + str(tuple(self)) + + def __setitem__(self, i, v): + v = float(v) + if i == 0: self.a = v + elif i == 1: self.b = v + elif i == 2: self.c = v + elif i == 3: self.d = v + elif i == 4: self.e = v + elif i == 5: self.f = v + else: + raise IndexError("index out of range") + return + + def __sub__(self, m): + if hasattr(m, "__float__"): + return Matrix(self.a - m, self.b - m, self.c - m, + self.d - m, self.e - m, self.f - m) + if len(m) != 6: + raise ValueError("Matrix: bad seq len") + return Matrix(self.a - m[0], self.b - m[1], self.c - m[2], + self.d - m[3], self.e - m[4], self.f - m[5]) + + def __truediv__(self, m): + if hasattr(m, "__float__"): + return Matrix(self.a * 1./m, self.b * 1./m, self.c * 1./m, + self.d * 1./m, self.e * 1./m, self.f * 1./m) + m1 = util_invert_matrix(m)[1] + if not m1: + raise ZeroDivisionError("matrix not invertible") + m2 = Matrix(1,1) + return m2.concat(self, m1) + + def concat(self, one, two): + """Multiply two matrices and replace current one.""" + if not len(one) == len(two) == 6: + raise ValueError("Matrix: bad seq len") + self.a, self.b, self.c, self.d, self.e, self.f = 
util_concat_matrix(one, two) + return self + + def invert(self, src=None): + """Calculate the inverted matrix. Return 0 if successful and replace + current one. Else return 1 and do nothing. + """ + if src is None: + dst = util_invert_matrix(self) + else: + dst = util_invert_matrix(src) + if dst[0] == 1: + return 1 + self.a, self.b, self.c, self.d, self.e, self.f = dst[1] + return 0 + + @property + def is_rectilinear(self): + """True if rectangles are mapped to rectangles.""" + return (abs(self.b) < EPSILON and abs(self.c) < EPSILON) or \ + (abs(self.a) < EPSILON and abs(self.d) < EPSILON) + + def prerotate(self, theta): + """Calculate pre rotation and replace current matrix.""" + theta = float(theta) + while theta < 0: theta += 360 + while theta >= 360: theta -= 360 + if abs(0 - theta) < EPSILON: + pass + + elif abs(90.0 - theta) < EPSILON: + a = self.a + b = self.b + self.a = self.c + self.b = self.d + self.c = -a + self.d = -b + + elif abs(180.0 - theta) < EPSILON: + self.a = -self.a + self.b = -self.b + self.c = -self.c + self.d = -self.d + + elif abs(270.0 - theta) < EPSILON: + a = self.a + b = self.b + self.a = -self.c + self.b = -self.d + self.c = a + self.d = b + + else: + rad = math.radians(theta) + s = math.sin(rad) + c = math.cos(rad) + a = self.a + b = self.b + self.a = c * a + s * self.c + self.b = c * b + s * self.d + self.c =-s * a + c * self.c + self.d =-s * b + c * self.d + + return self + + def prescale(self, sx, sy): + """Calculate pre scaling and replace current matrix.""" + sx = float(sx) + sy = float(sy) + self.a *= sx + self.b *= sx + self.c *= sy + self.d *= sy + return self + + def preshear(self, h, v): + """Calculate pre shearing and replace current matrix.""" + h = float(h) + v = float(v) + a, b = self.a, self.b + self.a += v * self.c + self.b += v * self.d + self.c += h * a + self.d += h * b + return self + + def pretranslate(self, tx, ty): + """Calculate pre translation and replace current matrix.""" + tx = float(tx) + ty = float(ty) + self.e += tx * self.a + ty * self.c + self.f += tx * self.b + ty * self.d + return self + + __inv__ = __invert__ + __div__ = __truediv__ + norm = __abs__ + + +class IdentityMatrix(Matrix): + """Identity matrix [1, 0, 0, 1, 0, 0]""" + + def __hash__(self): + return hash((1,0,0,1,0,0)) + + def __init__(self): + Matrix.__init__(self, 1.0, 1.0) + + def __repr__(self): + return "IdentityMatrix(1.0, 0.0, 0.0, 1.0, 0.0, 0.0)" + + def __setattr__(self, name, value): + if name in "ad": + self.__dict__[name] = 1.0 + elif name in "bcef": + self.__dict__[name] = 0.0 + else: + self.__dict__[name] = value + + def checkargs(*args): + raise NotImplementedError("Identity is readonly") + +Identity = IdentityMatrix() + + +class linkDest: + """link or outline destination details""" + + def __init__(self, obj, rlink, document=None): + isExt = obj.is_external + isInt = not isExt + self.dest = "" + self.file_spec = "" + self.flags = 0 + self.is_map = False + self.is_uri = False + self.kind = LINK_NONE + self.lt = Point(0, 0) + self.named = dict() + self.new_window = "" + self.page = obj.page + self.rb = Point(0, 0) + self.uri = obj.uri + + def uri_to_dict(uri): + items = self.uri[1:].split('&') + ret = dict() + for item in items: + eq = item.find('=') + if eq >= 0: + ret[item[:eq]] = item[eq+1:] + else: + ret[item] = None + return ret + + def unescape(name): + """Unescape '%AB' substrings to chr(0xAB).""" + split = name.replace("%%", "%25") # take care of escaped '%' + split = split.split("%") + newname = split[0] + for item in split[1:]: + piece = 
item[:2] + newname += chr(int(piece, base=16)) + newname += item[2:] + return newname + + if rlink and not self.uri.startswith("#"): + self.uri = f"#page={rlink[0] + 1}&zoom=0,{_format_g(rlink[1])},{_format_g(rlink[2])}" + if obj.is_external: + self.page = -1 + self.kind = LINK_URI + if not self.uri: + self.page = -1 + self.kind = LINK_NONE + if isInt and self.uri: + self.uri = self.uri.replace("&zoom=nan", "&zoom=0") + if self.uri.startswith("#"): + self.kind = LINK_GOTO + m = re.match('^#page=([0-9]+)&zoom=([0-9.]+),(-?[0-9.]+),(-?[0-9.]+)$', self.uri) + if m: + self.page = int(m.group(1)) - 1 + self.lt = Point(float((m.group(3))), float(m.group(4))) + self.flags = self.flags | LINK_FLAG_L_VALID | LINK_FLAG_T_VALID + else: + m = re.match('^#page=([0-9]+)$', self.uri) + if m: + self.page = int(m.group(1)) - 1 + else: + self.kind = LINK_NAMED + m = re.match('^#nameddest=(.*)', self.uri) + assert document + if document and m: + named = unescape(m.group(1)) + self.named = document.resolve_names().get(named) + if self.named is None: + # document.resolve_names() does not contain an + # entry for `named` so use an empty dict. + self.named = dict() + self.named['nameddest'] = named + else: + self.named = uri_to_dict(self.uri[1:]) + else: + self.kind = LINK_NAMED + self.named = uri_to_dict(self.uri) + if obj.is_external: + if not self.uri: + pass + elif self.uri.startswith("file:"): + self.file_spec = self.uri[5:] + if self.file_spec.startswith("//"): + self.file_spec = self.file_spec[2:] + self.is_uri = False + self.uri = "" + self.kind = LINK_LAUNCH + ftab = self.file_spec.split("#") + if len(ftab) == 2: + if ftab[1].startswith("page="): + self.kind = LINK_GOTOR + self.file_spec = ftab[0] + self.page = int(ftab[1].split("&")[0][5:]) - 1 + elif ":" in self.uri: + self.is_uri = True + self.kind = LINK_URI + else: + self.is_uri = True + self.kind = LINK_LAUNCH + assert isinstance(self.named, dict) + +class Widget: + ''' + Class describing a PDF form field ("widget") + ''' + + def __init__(self): + self.border_color = None + self.border_style = "S" + self.border_width = 0 + self.border_dashes = None + self.choice_values = None # choice fields only + self.rb_parent = None # radio buttons only: xref of owning parent + + self.field_name = None # field name + self.field_label = None # field label + self.field_value = None + self.field_flags = 0 + self.field_display = 0 + self.field_type = 0 # valid range 1 through 7 + self.field_type_string = None # field type as string + + self.fill_color = None + self.button_caption = None # button caption + self.is_signed = None # True / False if signature + self.text_color = (0, 0, 0) + self.text_font = "Helv" + self.text_fontsize = 0 + self.text_maxlen = 0 # text fields only + self.text_format = 0 # text fields only + self._text_da = "" # /DA = default appearance + + self.script = None # JavaScript (/A) + self.script_stroke = None # JavaScript (/AA/K) + self.script_format = None # JavaScript (/AA/F) + self.script_change = None # JavaScript (/AA/V) + self.script_calc = None # JavaScript (/AA/C) + self.script_blur = None # JavaScript (/AA/Bl) + self.script_focus = None # JavaScript (/AA/Fo) codespell:ignore + + self.rect = None # annot value + self.xref = 0 # annot value + + def __repr__(self): + #return "'%s' widget on %s" % (self.field_type_string, str(self.parent)) + # No self.parent. 
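+        # The f-string below is what callers actually see; the second
+        # 'return' statement after it is never reached.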
+ return f'Widget:(field_type={self.field_type_string} script={self.script})' + return "'%s' widget" % (self.field_type_string) + + def _adjust_font(self): + """Ensure text_font is from our list and correctly spelled. + """ + if not self.text_font: + self.text_font = "Helv" + return + valid_fonts = ("Cour", "TiRo", "Helv", "ZaDb") + for f in valid_fonts: + if self.text_font.lower() == f.lower(): + self.text_font = f + return + self.text_font = "Helv" + return + + def _checker(self): + """Any widget type checks. + """ + if self.field_type not in range(1, 8): + raise ValueError("bad field type") + + # if setting a radio button to ON, first set Off all buttons + # in the group - this is not done by MuPDF: + if self.field_type == mupdf.PDF_WIDGET_TYPE_RADIOBUTTON and self.field_value not in (False, "Off") and hasattr(self, "parent"): + # so we are about setting this button to ON/True + # check other buttons in same group and set them to 'Off' + doc = self.parent.parent + kids_type, kids_value = doc.xref_get_key(self.xref, "Parent/Kids") + if kids_type == "array": + xrefs = tuple(map(int, kids_value[1:-1].replace("0 R","").split())) + for xref in xrefs: + if xref != self.xref: + doc.xref_set_key(xref, "AS", "/Off") + # the calling method will now set the intended button to on and + # will find everything prepared for correct functioning. + + def _parse_da(self): + """Extract font name, size and color from default appearance string (/DA object). + + Equivalent to 'pdf_parse_default_appearance' function in MuPDF's 'pdf-annot.c'. + """ + if not self._text_da: + return + font = "Helv" + fsize = 0 + col = (0, 0, 0) + dat = self._text_da.split() # split on any whitespace + for i, item in enumerate(dat): + if item == "Tf": + font = dat[i - 2][1:] + fsize = float(dat[i - 1]) + dat[i] = dat[i-1] = dat[i-2] = "" + continue + if item == "g": # unicolor text + col = [(float(dat[i - 1]))] + dat[i] = dat[i-1] = "" + continue + if item == "rg": # RGB colored text + col = [float(f) for f in dat[i - 3:i]] + dat[i] = dat[i-1] = dat[i-2] = dat[i-3] = "" + continue + self.text_font = font + self.text_fontsize = fsize + self.text_color = col + self._text_da = "" + return + + def _validate(self): + """Validate the class entries. 
+ """ + if (self.rect.is_infinite + or self.rect.is_empty + ): + raise ValueError("bad rect") + + if not self.field_name: + raise ValueError("field name missing") + + if self.field_label == "Unnamed": + self.field_label = None + CheckColor(self.border_color) + CheckColor(self.fill_color) + if not self.text_color: + self.text_color = (0, 0, 0) + CheckColor(self.text_color) + + if not self.border_width: + self.border_width = 0 + + if not self.text_fontsize: + self.text_fontsize = 0 + + self.border_style = self.border_style.upper()[0:1] + + # standardize content of JavaScript entries + btn_type = self.field_type in ( + mupdf.PDF_WIDGET_TYPE_BUTTON, + mupdf.PDF_WIDGET_TYPE_CHECKBOX, + mupdf.PDF_WIDGET_TYPE_RADIOBUTTON, + ) + if not self.script: + self.script = None + elif type(self.script) is not str: + raise ValueError("script content must be a string") + + # buttons cannot have the following script actions + if btn_type or not self.script_calc: + self.script_calc = None + elif type(self.script_calc) is not str: + raise ValueError("script_calc content must be a string") + + if btn_type or not self.script_change: + self.script_change = None + elif type(self.script_change) is not str: + raise ValueError("script_change content must be a string") + + if btn_type or not self.script_format: + self.script_format = None + elif type(self.script_format) is not str: + raise ValueError("script_format content must be a string") + + if btn_type or not self.script_stroke: + self.script_stroke = None + elif type(self.script_stroke) is not str: + raise ValueError("script_stroke content must be a string") + + if btn_type or not self.script_blur: + self.script_blur = None + elif type(self.script_blur) is not str: + raise ValueError("script_blur content must be a string") + + if btn_type or not self.script_focus: + self.script_focus = None + elif type(self.script_focus) is not str: + raise ValueError("script_focus content must be a string") + + self._checker() # any field_type specific checks + + def _sync_flags(self): + """Propagate the field flags. + + If this widget has a "/Parent", set its field flags and that of all + its /Kids widgets to the value of the current widget. + Only possible for widgets existing in the PDF. + + Returns True or False. + """ + if not self.xref: + return False # no xref: widget not in the PDF + doc = self.parent.parent # the owning document + assert doc + pdf = _as_pdf_document(doc) + # load underlying PDF object + pdf_widget = mupdf.pdf_load_object(pdf, self.xref) + Parent = mupdf.pdf_dict_get(pdf_widget, PDF_NAME("Parent")) + if not Parent.pdf_is_dict(): + return False # no /Parent: nothing to do + + # put the field flags value into the parent field flags: + Parent.pdf_dict_put_int(PDF_NAME("Ff"), self.field_flags) + + # also put that value into all kids of the Parent + kids = Parent.pdf_dict_get(PDF_NAME("Kids")) + if not kids.pdf_is_array(): + message("warning: malformed PDF, Parent has no Kids array") + return False # no /Kids: should never happen! 
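+        # At this point the relevant structure is roughly
+        #     Parent << ... /Ff <flags> /Kids [ kid1 kid2 ... ] >>
+        # and the loop below copies the same /Ff value into every kid widget
+        # of the group except this one (and skips non-widget kids).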
+ + for i in range(kids.pdf_array_len()): # walk through all kids + # access kid widget, and do some precautionary checks + kid = kids.pdf_array_get(i) + if not kid.pdf_is_dict(): + continue + xref = kid.pdf_to_num() # get xref of the kid + if xref == self.xref: # skip self widget + continue + subtype = kid.pdf_dict_get(PDF_NAME("Subtype")) + if not subtype.pdf_to_name() == "Widget": + continue + # put the field flags value into the kid field flags: + kid.pdf_dict_put_int(PDF_NAME("Ff"), self.field_flags) + + return True # all done + + def button_states(self): + """Return the on/off state names for button widgets. + + A button may have 'normal' or 'pressed down' appearances. While the 'Off' + state is usually called like this, the 'On' state is often given a name + relating to the functional context. + """ + if self.field_type not in (2, 5): + return None # no button type + if hasattr(self, "parent"): # field already exists on page + doc = self.parent.parent + else: + return + xref = self.xref + states = {"normal": None, "down": None} + APN = doc.xref_get_key(xref, "AP/N") + if APN[0] == "dict": + nstates = [] + APN = APN[1][2:-2] + apnt = APN.split("/")[1:] + for x in apnt: + nstates.append(x.split()[0]) + states["normal"] = nstates + if APN[0] == "xref": + nstates = [] + nxref = int(APN[1].split(" ")[0]) + APN = doc.xref_object(nxref) + apnt = APN.split("/")[1:] + for x in apnt: + nstates.append(x.split()[0]) + states["normal"] = nstates + APD = doc.xref_get_key(xref, "AP/D") + if APD[0] == "dict": + dstates = [] + APD = APD[1][2:-2] + apdt = APD.split("/")[1:] + for x in apdt: + dstates.append(x.split()[0]) + states["down"] = dstates + if APD[0] == "xref": + dstates = [] + dxref = int(APD[1].split(" ")[0]) + APD = doc.xref_object(dxref) + apdt = APD.split("/")[1:] + for x in apdt: + dstates.append(x.split()[0]) + states["down"] = dstates + return states + + @property + def next(self): + return self._annot.next + + def on_state(self): + """Return the "On" value for button widgets. + + This is useful for radio buttons mainly. Checkboxes will always return + "Yes". Radio buttons will return the string that is unequal to "Off" + as returned by method button_states(). + If the radio button is new / being created, it does not yet have an + "On" value. In this case, a warning is shown and True is returned. + """ + if self.field_type not in (2, 5): + return None # no checkbox or radio button + bstate = self.button_states() + if bstate is None: + bstate = dict() + for k in bstate.keys(): + for v in bstate[k]: + if v != "Off": + return v + message("warning: radio button has no 'On' value.") + return True + + def reset(self): + """Reset the field value to its default. + """ + TOOLS._reset_widget(self._annot) + + def update(self, sync_flags=False): + """Reflect Python object in the PDF.""" + self._validate() + + self._adjust_font() # ensure valid text_font name + + # now create the /DA string + self._text_da = "" + if len(self.text_color) == 3: + fmt = "{:g} {:g} {:g} rg /{f:s} {s:g} Tf" + self._text_da + elif len(self.text_color) == 1: + fmt = "{:g} g /{f:s} {s:g} Tf" + self._text_da + elif len(self.text_color) == 4: + fmt = "{:g} {:g} {:g} {:g} k /{f:s} {s:g} Tf" + self._text_da + self._text_da = fmt.format(*self.text_color, f=self.text_font, + s=self.text_fontsize) + # finally update the widget + + # if widget has a '/AA/C' script, make sure it is in the '/CO' + # array of the '/AcroForm' dictionary. 
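+        # (/CO is the AcroForm "calculation order" array: fields whose value
+        # is computed by a /AA/C script must be listed there so that viewers
+        # re-run the calculations, in that order, when other fields change.)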
+ if self.script_calc: # there is a "calculation" script: + # make sure we are in the /CO array + util_ensure_widget_calc(self._annot) + + # finally update the widget + TOOLS._save_widget(self._annot, self) + self._text_da = "" + if sync_flags: + self._sync_flags() # propagate field flags to parent and kids + + +from . import _extra + + +class Outline: + + def __init__(self, ol): + self.this = ol + + @property + def dest(self): + '''outline destination details''' + return linkDest(self, None, None) + + def destination(self, document): + ''' + Like `dest` property but uses `document` to resolve destinations for + kind=LINK_NAMED. + ''' + return linkDest(self, None, document) + + @property + def down(self): + ol = self.this + down_ol = ol.down() + if not down_ol.m_internal: + return + return Outline(down_ol) + + @property + def is_external(self): + if g_use_extra: + # calling _extra.* here appears to save significant time in + # test_toc.py:test_full_toc, 1.2s=>0.94s. + # + return _extra.Outline_is_external( self.this) + ol = self.this + if not ol.m_internal: + return False + uri = ol.m_internal.uri if 1 else ol.uri() + if uri is None: + return False + return mupdf.fz_is_external_link(uri) + + @property + def is_open(self): + if 1: + return self.this.m_internal.is_open + return self.this.is_open() + + @property + def next(self): + ol = self.this + next_ol = ol.next() + if not next_ol.m_internal: + return + return Outline(next_ol) + + @property + def page(self): + if 1: + return self.this.m_internal.page.page + return self.this.page().page + + @property + def title(self): + return self.this.m_internal.title + + @property + def uri(self): + ol = self.this + if not ol.m_internal: + return None + return ol.m_internal.uri + + @property + def x(self): + return self.this.m_internal.x + + @property + def y(self): + return self.this.m_internal.y + + __slots__ = [ 'this'] + + +def _make_PdfFilterOptions( + recurse=0, + instance_forms=0, + ascii=0, + no_update=0, + sanitize=0, + sopts=None, + ): + ''' + Returns a mupdf.PdfFilterOptions instance. + ''' + + filter_ = mupdf.PdfFilterOptions() + filter_.recurse = recurse + filter_.instance_forms = instance_forms + filter_.ascii = ascii + + filter_.no_update = no_update + if sanitize: + # We want to use a PdfFilterFactory whose `.filter` fn pointer is + # set to MuPDF's `pdf_new_sanitize_filter()`. But not sure how to + # get access to this raw fn in Python; and on Windows raw MuPDF + # functions are not even available to C++. + # + # So we use SWIG Director to implement our own + # PdfFilterFactory whose `filter()` method calls + # `mupdf.ll_pdf_new_sanitize_filter()`. 
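+        # The Factory defined below is also attached to the returned options
+        # object (filter_._factory = factory), which keeps the Python
+        # director instance alive for as long as the options object is
+        # referenced; without that, its filter() callback could be invoked
+        # after the instance has been garbage collected.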
+ if sopts: + assert isinstance(sopts, mupdf.PdfSanitizeFilterOptions) + else: + sopts = mupdf.PdfSanitizeFilterOptions() + class Factory(mupdf.PdfFilterFactory2): + def __init__(self): + super().__init__() + self.use_virtual_filter() + self.sopts = sopts + def filter(self, ctx, doc, chain, struct_parents, transform, options): + if 0: + log(f'sanitize filter.filter():') + log(f' {self=}') + log(f' {ctx=}') + log(f' {doc=}') + log(f' {chain=}') + log(f' {struct_parents=}') + log(f' {transform=}') + log(f' {options=}') + log(f' {self.sopts.internal()=}') + return mupdf.ll_pdf_new_sanitize_filter( + doc, + chain, + struct_parents, + transform, + options, + self.sopts.internal(), + ) + + factory = Factory() + filter_.add_factory(factory.internal()) + filter_._factory = factory + return filter_ + + +class Page: + + def __init__(self, page, document): + assert isinstance(page, (mupdf.FzPage, mupdf.PdfPage)), f'page is: {page}' + self.this = page + self.thisown = True + self.last_point = None + self.draw_cont = '' + self._annot_refs = dict() + self.parent = document + if page.m_internal: + if isinstance( page, mupdf.PdfPage): + self.number = page.m_internal.super.number + else: + self.number = page.m_internal.number + else: + self.number = None + + def __repr__(self): + return self.__str__() + CheckParent(self) + x = self.parent.name + if self.parent.stream is not None: + x = "" % (self.parent._graft_id,) + if x == "": + x = "" % self.parent._graft_id + return "page %s of %s" % (self.number, x) + + def __str__(self): + #CheckParent(self) + parent = getattr(self, 'parent', None) + if isinstance(self.this.m_internal, mupdf.pdf_page): + number = self.this.m_internal.super.number + else: + number = self.this.m_internal.number + ret = f'page {number}' + if parent: + x = self.parent.name + if self.parent.stream is not None: + x = "" % (self.parent._graft_id,) + if x == "": + x = "" % self.parent._graft_id + ret += f' of {x}' + return ret + + def _add_caret_annot(self, point): + if g_use_extra: + annot = extra._add_caret_annot( self.this, JM_point_from_py(point)) + else: + page = self._pdf_page() + annot = mupdf.pdf_create_annot(page, mupdf.PDF_ANNOT_CARET) + if point: + p = JM_point_from_py(point) + r = mupdf.pdf_annot_rect(annot) + r = mupdf.FzRect(p.x, p.y, p.x + r.x1 - r.x0, p.y + r.y1 - r.y0) + mupdf.pdf_set_annot_rect(annot, r) + mupdf.pdf_update_annot(annot) + JM_add_annot_id(annot, "A") + return annot + + def _add_file_annot(self, point, buffer_, filename, ufilename=None, desc=None, icon=None): + page = self._pdf_page() + uf = ufilename if ufilename else filename + d = desc if desc else filename + p = JM_point_from_py(point) + filebuf = JM_BufferFromBytes(buffer_) + if not filebuf.m_internal: + raise TypeError( MSG_BAD_BUFFER) + annot = mupdf.pdf_create_annot(page, mupdf.PDF_ANNOT_FILE_ATTACHMENT) + r = mupdf.pdf_annot_rect(annot) + r = mupdf.fz_make_rect(p.x, p.y, p.x + r.x1 - r.x0, p.y + r.y1 - r.y0) + mupdf.pdf_set_annot_rect(annot, r) + flags = mupdf.PDF_ANNOT_IS_PRINT + mupdf.pdf_set_annot_flags(annot, flags) + + if icon: + mupdf.pdf_set_annot_icon_name(annot, icon) + + val = JM_embed_file(page.doc(), filebuf, filename, uf, d, 1) + mupdf.pdf_dict_put(mupdf.pdf_annot_obj(annot), PDF_NAME('FS'), val) + mupdf.pdf_dict_put_text_string(mupdf.pdf_annot_obj(annot), PDF_NAME('Contents'), filename) + mupdf.pdf_update_annot(annot) + mupdf.pdf_set_annot_rect(annot, r) + mupdf.pdf_set_annot_flags(annot, flags) + JM_add_annot_id(annot, "A") + return Annot(annot) + + def _add_freetext_annot( + self, 
rect, + text, + fontsize=11, + fontname=None, + text_color=None, + fill_color=None, + border_color=None, + border_width=0, + dashes=None, + callout=None, + line_end=mupdf.PDF_ANNOT_LE_OPEN_ARROW, + opacity=1, + align=0, + rotate=0, + richtext=False, + style=None, + ): + rc = f""" + + {text}""" + page = self._pdf_page() + if border_color and not richtext: + raise ValueError("cannot set border_color if rich_text is False") + if border_color and not text_color: + text_color = border_color + nfcol, fcol = JM_color_FromSequence(fill_color) + ntcol, tcol = JM_color_FromSequence(text_color) + r = JM_rect_from_py(rect) + if mupdf.fz_is_infinite_rect(r) or mupdf.fz_is_empty_rect(r): + raise ValueError( MSG_BAD_RECT) + annot = mupdf.pdf_create_annot(page, mupdf.PDF_ANNOT_FREE_TEXT) + annot_obj = mupdf.pdf_annot_obj(annot) + + #insert text as 'contents' or 'RC' depending on 'richtext' + if not richtext: + mupdf.pdf_set_annot_contents(annot, text) + else: + mupdf.pdf_dict_put_text_string(annot_obj,PDF_NAME("RC"), rc) + if style: + mupdf.pdf_dict_put_text_string(annot_obj,PDF_NAME("DS"), style) + + mupdf.pdf_set_annot_rect(annot, r) + + while rotate < 0: + rotate += 360 + while rotate >= 360: + rotate -= 360 + if rotate != 0: + mupdf.pdf_dict_put_int(annot_obj, PDF_NAME('Rotate'), rotate) + + mupdf.pdf_set_annot_quadding(annot, align) + + if nfcol > 0: + mupdf.pdf_set_annot_color(annot, fcol[:nfcol]) + + mupdf.pdf_set_annot_border_width(annot, border_width) + mupdf.pdf_set_annot_opacity(annot, opacity) + if dashes: + for d in dashes: + mupdf.pdf_add_annot_border_dash_item(annot, float(d)) + + # Insert callout information + if callout: + mupdf.pdf_dict_put(annot_obj, PDF_NAME("IT"), PDF_NAME("FreeTextCallout")) + mupdf.pdf_set_annot_callout_style(annot, line_end) + point_count = len(callout) + extra.JM_set_annot_callout_line(annot, tuple(callout), point_count) + + # insert the default appearance string + if not richtext: + JM_make_annot_DA(annot, ntcol, tcol, fontname, fontsize) + + mupdf.pdf_update_annot(annot) + JM_add_annot_id(annot, "A") + val = Annot(annot) + return val + + def _add_ink_annot(self, list): + page = _as_pdf_page(self.this) + if not PySequence_Check(list): + raise ValueError( MSG_BAD_ARG_INK_ANNOT) + ctm = mupdf.FzMatrix() + mupdf.pdf_page_transform(page, mupdf.FzRect(0), ctm) + inv_ctm = mupdf.fz_invert_matrix(ctm) + annot = mupdf.pdf_create_annot(page, mupdf.PDF_ANNOT_INK) + annot_obj = mupdf.pdf_annot_obj(annot) + n0 = len(list) + inklist = mupdf.pdf_new_array(page.doc(), n0) + + for j in range(n0): + sublist = list[j] + n1 = len(sublist) + stroke = mupdf.pdf_new_array(page.doc(), 2 * n1) + + for i in range(n1): + p = sublist[i] + if not PySequence_Check(p) or PySequence_Size(p) != 2: + raise ValueError( MSG_BAD_ARG_INK_ANNOT) + point = mupdf.fz_transform_point(JM_point_from_py(p), inv_ctm) + mupdf.pdf_array_push_real(stroke, point.x) + mupdf.pdf_array_push_real(stroke, point.y) + + mupdf.pdf_array_push(inklist, stroke) + + mupdf.pdf_dict_put(annot_obj, PDF_NAME('InkList'), inklist) + mupdf.pdf_update_annot(annot) + JM_add_annot_id(annot, "A") + return Annot(annot) + + def _add_line_annot(self, p1, p2): + page = self._pdf_page() + annot = mupdf.pdf_create_annot(page, mupdf.PDF_ANNOT_LINE) + a = JM_point_from_py(p1) + b = JM_point_from_py(p2) + mupdf.pdf_set_annot_line(annot, a, b) + mupdf.pdf_update_annot(annot) + JM_add_annot_id(annot, "A") + assert annot.m_internal + return Annot(annot) + + def _add_multiline(self, points, annot_type): + page = self._pdf_page() + if len(points) < 
2: + raise ValueError( MSG_BAD_ARG_POINTS) + annot = mupdf.pdf_create_annot(page, annot_type) + for p in points: + if (PySequence_Size(p) != 2): + raise ValueError( MSG_BAD_ARG_POINTS) + point = JM_point_from_py(p) + mupdf.pdf_add_annot_vertex(annot, point) + + mupdf.pdf_update_annot(annot) + JM_add_annot_id(annot, "A") + return Annot(annot) + + def _add_redact_annot(self, quad, text=None, da_str=None, align=0, fill=None, text_color=None): + page = self._pdf_page() + fcol = [ 1, 1, 1, 0] + nfcol = 0 + annot = mupdf.pdf_create_annot(page, mupdf.PDF_ANNOT_REDACT) + q = JM_quad_from_py(quad) + r = mupdf.fz_rect_from_quad(q) + # TODO calculate de-rotated rect + mupdf.pdf_set_annot_rect(annot, r) + if fill: + nfcol, fcol = JM_color_FromSequence(fill) + arr = mupdf.pdf_new_array(page.doc(), nfcol) + for i in range(nfcol): + mupdf.pdf_array_push_real(arr, fcol[i]) + mupdf.pdf_dict_put(mupdf.pdf_annot_obj(annot), PDF_NAME('IC'), arr) + if text: + assert da_str + mupdf.pdf_dict_puts( + mupdf.pdf_annot_obj(annot), + "OverlayText", + mupdf.pdf_new_text_string(text), + ) + mupdf.pdf_dict_put_text_string(mupdf.pdf_annot_obj(annot), PDF_NAME('DA'), da_str) + mupdf.pdf_dict_put_int(mupdf.pdf_annot_obj(annot), PDF_NAME('Q'), align) + mupdf.pdf_update_annot(annot) + JM_add_annot_id(annot, "A") + annot = mupdf.ll_pdf_keep_annot(annot.m_internal) + annot = mupdf.PdfAnnot( annot) + return Annot(annot) + + def _add_square_or_circle(self, rect, annot_type): + page = self._pdf_page() + r = JM_rect_from_py(rect) + if mupdf.fz_is_infinite_rect(r) or mupdf.fz_is_empty_rect(r): + raise ValueError( MSG_BAD_RECT) + annot = mupdf.pdf_create_annot(page, annot_type) + mupdf.pdf_set_annot_rect(annot, r) + mupdf.pdf_update_annot(annot) + JM_add_annot_id(annot, "A") + assert annot.m_internal + return Annot(annot) + + def _add_stamp_annot(self, rect, stamp=0): + rect = Rect(rect) + r = JM_rect_from_py(rect) + if mupdf.fz_is_infinite_rect(r) or mupdf.fz_is_empty_rect(r): + raise ValueError(MSG_BAD_RECT) + page = self._pdf_page() + stamp_id = [ + "Approved", + "AsIs", + "Confidential", + "Departmental", + "Experimental", + "Expired", + "Final", + "ForComment", + "ForPublicRelease", + "NotApproved", + "NotForPublicRelease", + "Sold", + "TopSecret", + "Draft", + ] + n = len(stamp_id) + buf = None + name = None + if stamp in range(n): + name = stamp_id[stamp] + elif isinstance(stamp, Pixmap): + buf = stamp.tobytes() + elif isinstance(stamp, str): + buf = pathlib.Path(stamp).read_bytes() + elif isinstance(stamp, (bytes, bytearray)): + buf = stamp + elif isinstance(stamp, io.BytesIO): + buf = stamp.getvalue() + else: + name = stamp_id[0] + + annot = mupdf.pdf_create_annot(page, mupdf.PDF_ANNOT_STAMP) + if buf: # image stamp + fzbuff = mupdf.fz_new_buffer_from_copied_data(buf) + img = mupdf.fz_new_image_from_buffer(fzbuff) + + # compute image boundary box on page + w, h = img.w(), img.h() + scale = min(rect.width / w, rect.height / h) + width = w * scale # bbox width + height = h * scale # bbox height + + # center of "rect" + center = (rect.tl + rect.br) / 2 + x0 = center.x - width / 2 + y0 = center.y - height / 2 + x1 = x0 + width + y1 = y0 + height + r = mupdf.fz_make_rect(x0, y0, x1, y1) + mupdf.pdf_set_annot_rect(annot, r) + mupdf.pdf_set_annot_stamp_image(annot, img) + mupdf.pdf_dict_put(mupdf.pdf_annot_obj(annot), PDF_NAME("Name"), mupdf.pdf_new_name("ImageStamp")) + mupdf.pdf_set_annot_contents(annot, "Image Stamp") + else: # text stamp + mupdf.pdf_set_annot_rect(annot, r) + mupdf.pdf_dict_put(mupdf.pdf_annot_obj(annot), 
PDF_NAME("Name"), PDF_NAME(name)) + mupdf.pdf_set_annot_contents(annot, name) + mupdf.pdf_update_annot(annot) + JM_add_annot_id(annot, "A") + return Annot(annot) + + def _add_text_annot(self, point, text, icon=None): + page = self._pdf_page() + p = JM_point_from_py( point) + annot = mupdf.pdf_create_annot(page, mupdf.PDF_ANNOT_TEXT) + r = mupdf.pdf_annot_rect(annot) + r = mupdf.fz_make_rect(p.x, p.y, p.x + r.x1 - r.x0, p.y + r.y1 - r.y0) + mupdf.pdf_set_annot_rect(annot, r) + mupdf.pdf_set_annot_contents(annot, text) + if icon: + mupdf.pdf_set_annot_icon_name(annot, icon) + mupdf.pdf_update_annot(annot) + JM_add_annot_id(annot, "A") + return Annot(annot) + + def _add_text_marker(self, quads, annot_type): + + CheckParent(self) + if not self.parent.is_pdf: + raise ValueError("is no PDF") + + val = Page__add_text_marker(self, quads, annot_type) + if not val: + return None + val.parent = weakref.proxy(self) + self._annot_refs[id(val)] = val + + return val + + def _addAnnot_FromString(self, linklist): + """Add links from list of object sources.""" + CheckParent(self) + if g_use_extra: + self.__class__._addAnnot_FromString = extra.Page_addAnnot_FromString + #log('Page._addAnnot_FromString() deferring to extra.Page_addAnnot_FromString().') + return extra.Page_addAnnot_FromString( self.this, linklist) + page = _as_pdf_page(self.this) + lcount = len(linklist) # link count + if lcount < 1: + return + i = -1 + + # insert links from the provided sources + if not isinstance(linklist, tuple): + raise ValueError( "bad 'linklist' argument") + if not mupdf.pdf_dict_get( page.obj(), PDF_NAME('Annots')).m_internal: + mupdf.pdf_dict_put_array( page.obj(), PDF_NAME('Annots'), lcount) + annots = mupdf.pdf_dict_get( page.obj(), PDF_NAME('Annots')) + assert annots.m_internal, f'{lcount=} {annots.m_internal=}' + for i in range(lcount): + txtpy = linklist[i] + text = JM_StrAsChar(txtpy) + if not text: + message("skipping bad link / annot item %i.", i) + continue + try: + annot = mupdf.pdf_add_object( page.doc(), JM_pdf_obj_from_str( page.doc(), text)) + ind_obj = mupdf.pdf_new_indirect( page.doc(), mupdf.pdf_to_num( annot), 0) + mupdf.pdf_array_push( annots, ind_obj) + except Exception: + if g_exceptions_verbose: exception_info() + message("skipping bad link / annot item %i.\n" % i) + + def _addWidget(self, field_type, field_name): + page = self._pdf_page() + pdf = page.doc() + annot = JM_create_widget(pdf, page, field_type, field_name) + if not annot.m_internal: + raise RuntimeError( "cannot create widget") + JM_add_annot_id(annot, "W") + return Annot(annot) + + def _apply_redactions(self, text, images, graphics): + page = self._pdf_page() + opts = mupdf.PdfRedactOptions() + opts.black_boxes = 0 # no black boxes + opts.text = text # how to treat text + opts.image_method = images # how to treat images + opts.line_art = graphics # how to treat vector graphics + success = mupdf.pdf_redact_page(page.doc(), page, opts) + return success + + def _erase(self): + self._reset_annot_refs() + try: + self.parent._forget_page(self) + except Exception: + exception_info() + pass + self.parent = None + self.thisown = False + self.number = None + self.this = None + + def _count_q_balance(self): + """Count missing graphic state pushs and pops. + + Returns: + A pair of integers (push, pop). Push is the number of missing + PDF "q" commands, pop is the number of "Q" commands. 
+ A balanced graphics state for the page will be reached if its + /Contents is prepended with 'push' copies of string "q\n" + and appended with 'pop' copies of "\nQ". + """ + page = _as_pdf_page(self) # need the underlying PDF page + res = mupdf.pdf_dict_get( # access /Resources + page.obj(), + mupdf.PDF_ENUM_NAME_Resources, + ) + cont = mupdf.pdf_dict_get( # access /Contents + page.obj(), + mupdf.PDF_ENUM_NAME_Contents, + ) + pdf = _as_pdf_document(self.parent) # need underlying PDF document + + # return value of MuPDF function + return mupdf.pdf_count_q_balance_outparams_fn(pdf, res, cont) + + def _get_optional_content(self, oc: OptInt) -> OptStr: + if oc is None or oc == 0: + return None + doc = self.parent + check = doc.xref_object(oc, compressed=True) + if not ("/Type/OCG" in check or "/Type/OCMD" in check): + #log( 'raising "bad optional content"') + raise ValueError("bad optional content: 'oc'") + #log( 'Looking at self._get_resource_properties()') + props = {} + for p, x in self._get_resource_properties(): + props[x] = p + if oc in props.keys(): + return props[oc] + i = 0 + mc = "MC%i" % i + while mc in props.values(): + i += 1 + mc = "MC%i" % i + self._set_resource_property(mc, oc) + #log( 'returning {mc=}') + return mc + + def _get_resource_properties(self): + ''' + page list Resource/Properties + ''' + page = self._pdf_page() + rc = JM_get_resource_properties(page.obj()) + return rc + + def _get_textpage(self, clip=None, flags=0, matrix=None): + if g_use_extra: + ll_tpage = extra.page_get_textpage(self.this, clip, flags, matrix) + tpage = mupdf.FzStextPage(ll_tpage) + return tpage + page = self.this + options = mupdf.FzStextOptions(flags) + rect = JM_rect_from_py(clip) + # Default to page's rect if `clip` not specified, for #2048. + rect = mupdf.fz_bound_page(page) if clip is None else JM_rect_from_py(clip) + ctm = JM_matrix_from_py(matrix) + tpage = mupdf.FzStextPage(rect) + dev = mupdf.fz_new_stext_device(tpage, options) + if _globals.no_device_caching: + mupdf.fz_enable_device_hints( dev, mupdf.FZ_NO_CACHE) + if isinstance(page, mupdf.FzPage): + pass + elif isinstance(page, mupdf.PdfPage): + page = page.super() + else: + assert 0, f'Unrecognised {type(page)=}' + mupdf.fz_run_page(page, dev, ctm, mupdf.FzCookie()) + mupdf.fz_close_device(dev) + return tpage + + def _insert_image(self, + filename=None, pixmap=None, stream=None, imask=None, clip=None, + overlay=1, rotate=0, keep_proportion=1, oc=0, width=0, height=0, + xref=0, alpha=-1, _imgname=None, digests=None + ): + maskbuf = mupdf.FzBuffer() + page = self._pdf_page() + # This will create an empty PdfDocument with a call to + # pdf_new_document() then assign page.doc()'s return value to it (which + # drop the original empty pdf_document). 
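+        # The do_process_* / do_have_* booleans below emulate the goto-style
+        # flow that the '#goto ...()' comments refer to: once a stage has
+        # produced its result, the stages it would have jumped past are
+        # simply switched off instead.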
+ pdf = page.doc() + w = width + h = height + img_xref = xref + rc_digest = 0 + + do_process_pixmap = 1 + do_process_stream = 1 + do_have_imask = 1 + do_have_image = 1 + do_have_xref = 1 + + if xref > 0: + ref = mupdf.pdf_new_indirect(pdf, xref, 0) + w = mupdf.pdf_to_int( mupdf.pdf_dict_geta( ref, PDF_NAME('Width'), PDF_NAME('W'))) + h = mupdf.pdf_to_int( mupdf.pdf_dict_geta( ref, PDF_NAME('Height'), PDF_NAME('H'))) + if w + h == 0: + raise ValueError( MSG_IS_NO_IMAGE) + #goto have_xref() + do_process_pixmap = 0 + do_process_stream = 0 + do_have_imask = 0 + do_have_image = 0 + + else: + if stream: + imgbuf = JM_BufferFromBytes(stream) + do_process_pixmap = 0 + else: + if filename: + imgbuf = mupdf.fz_read_file(filename) + #goto have_stream() + do_process_pixmap = 0 + + if do_process_pixmap: + #log( 'do_process_pixmap') + # process pixmap --------------------------------- + arg_pix = pixmap.this + w = arg_pix.w() + h = arg_pix.h() + digest = mupdf.fz_md5_pixmap2(arg_pix) + md5_py = digest + temp = digests.get(md5_py, None) + if temp is not None: + img_xref = temp + ref = mupdf.pdf_new_indirect(page.doc(), img_xref, 0) + #goto have_xref() + do_process_stream = 0 + do_have_imask = 0 + do_have_image = 0 + else: + if arg_pix.alpha() == 0: + image = mupdf.fz_new_image_from_pixmap(arg_pix, mupdf.FzImage()) + else: + pm = mupdf.fz_convert_pixmap( + arg_pix, + mupdf.FzColorspace(), + mupdf.FzColorspace(), + mupdf.FzDefaultColorspaces(None), + mupdf.FzColorParams(), + 1, + ) + pm.alpha = 0 + pm.colorspace = None + mask = mupdf.fz_new_image_from_pixmap(pm, mupdf.FzImage()) + image = mupdf.fz_new_image_from_pixmap(arg_pix, mask) + #goto have_image() + do_process_stream = 0 + do_have_imask = 0 + + if do_process_stream: + #log( 'do_process_stream') + # process stream --------------------------------- + state = mupdf.FzMd5() + if mupdf_cppyy: + mupdf.fz_md5_update_buffer( state, imgbuf) + else: + mupdf.fz_md5_update(state, imgbuf.m_internal.data, imgbuf.m_internal.len) + if imask: + maskbuf = JM_BufferFromBytes(imask) + if mupdf_cppyy: + mupdf.fz_md5_update_buffer( state, maskbuf) + else: + mupdf.fz_md5_update(state, maskbuf.m_internal.data, maskbuf.m_internal.len) + digest = mupdf.fz_md5_final2(state) + md5_py = bytes(digest) + temp = digests.get(md5_py, None) + if temp is not None: + img_xref = temp + ref = mupdf.pdf_new_indirect(page.doc(), img_xref, 0) + w = mupdf.pdf_to_int( mupdf.pdf_dict_geta( ref, PDF_NAME('Width'), PDF_NAME('W'))) + h = mupdf.pdf_to_int( mupdf.pdf_dict_geta( ref, PDF_NAME('Height'), PDF_NAME('H'))) + #goto have_xref() + do_have_imask = 0 + do_have_image = 0 + else: + image = mupdf.fz_new_image_from_buffer(imgbuf) + w = image.w() + h = image.h() + if not imask: + #goto have_image() + do_have_imask = 0 + + if do_have_imask: + # `fz_compressed_buffer` is reference counted and + # `mupdf.fz_new_image_from_compressed_buffer2()` + # is povided as a Swig-friendly wrapper for + # `fz_new_image_from_compressed_buffer()`, so we can do things + # straightfowardly. 
+ # + cbuf1 = mupdf.fz_compressed_image_buffer( image) + if not cbuf1.m_internal: + raise ValueError( "uncompressed image cannot have mask") + bpc = image.bpc() + colorspace = image.colorspace() + xres, yres = mupdf.fz_image_resolution(image) + mask = mupdf.fz_new_image_from_buffer(maskbuf) + image = mupdf.fz_new_image_from_compressed_buffer2( + w, + h, + bpc, + colorspace, + xres, + yres, + 1, # interpolate + 0, # imagemask, + list(), # decode + list(), # colorkey + cbuf1, + mask, + ) + + if do_have_image: + #log( 'do_have_image') + ref = mupdf.pdf_add_image(pdf, image) + if oc: + JM_add_oc_object(pdf, ref, oc) + img_xref = mupdf.pdf_to_num(ref) + digests[md5_py] = img_xref + rc_digest = 1 + + if do_have_xref: + #log( 'do_have_xref') + resources = mupdf.pdf_dict_get_inheritable(page.obj(), PDF_NAME('Resources')) + if not resources.m_internal: + resources = mupdf.pdf_dict_put_dict(page.obj(), PDF_NAME('Resources'), 2) + xobject = mupdf.pdf_dict_get(resources, PDF_NAME('XObject')) + if not xobject.m_internal: + xobject = mupdf.pdf_dict_put_dict(resources, PDF_NAME('XObject'), 2) + mat = calc_image_matrix(w, h, clip, rotate, keep_proportion) + mupdf.pdf_dict_puts(xobject, _imgname, ref) + nres = mupdf.fz_new_buffer(50) + s = f"\nq\n{_format_g((mat.a, mat.b, mat.c, mat.d, mat.e, mat.f))} cm\n/{_imgname} Do\nQ\n" + #s = s.replace('\n', '\r\n') + mupdf.fz_append_string(nres, s) + JM_insert_contents(pdf, page.obj(), nres, overlay) + + if rc_digest: + return img_xref, digests + else: + return img_xref, None + + def _insertFont(self, fontname, bfname, fontfile, fontbuffer, set_simple, idx, wmode, serif, encoding, ordering): + page = self._pdf_page() + pdf = page.doc() + + value = JM_insert_font(pdf, bfname, fontfile,fontbuffer, set_simple, idx, wmode, serif, encoding, ordering) + # get the objects /Resources, /Resources/Font + resources = mupdf.pdf_dict_get_inheritable(page.obj(), PDF_NAME('Resources')) + if not resources.pdf_is_dict(): + resources = mupdf.pdf_dict_put_dict(page.obj(), PDF_NAME("Resources"), 5) + fonts = mupdf.pdf_dict_get(resources, PDF_NAME('Font')) + if not fonts.m_internal: # page has no fonts yet + fonts = mupdf.pdf_new_dict(pdf, 5) + mupdf.pdf_dict_putl(page.obj(), fonts, PDF_NAME('Resources'), PDF_NAME('Font')) + # store font in resources and fonts objects will contain named reference to font + _, xref = JM_INT_ITEM(value, 0) + if not xref: + raise RuntimeError( "cannot insert font") + font_obj = mupdf.pdf_new_indirect(pdf, xref, 0) + mupdf.pdf_dict_puts(fonts, fontname, font_obj) + return value + + def _load_annot(self, name, xref): + page = self._pdf_page() + if xref == 0: + annot = JM_get_annot_by_name(page, name) + else: + annot = JM_get_annot_by_xref(page, xref) + if annot.m_internal: + return Annot(annot) + + def _makePixmap(self, doc, ctm, cs, alpha=0, annots=1, clip=None): + pix = JM_pixmap_from_page(doc, self.this, ctm, cs, alpha, annots, clip) + return Pixmap(pix) + + def _other_box(self, boxtype): + rect = mupdf.FzRect( mupdf.FzRect.Fixed_INFINITE) + page = _as_pdf_page(self.this, required=False) + if page.m_internal: + obj = mupdf.pdf_dict_gets( page.obj(), boxtype) + if mupdf.pdf_is_array(obj): + rect = mupdf.pdf_to_rect(obj) + if mupdf.fz_is_infinite_rect( rect): + return + return JM_py_from_rect(rect) + + def _pdf_page(self, required=True): + return _as_pdf_page(self.this, required=required) + + def _reset_annot_refs(self): + """Invalidate / delete all annots of this page.""" + self._annot_refs.clear() + + def _set_opacity(self, gstate=None, CA=1, ca=1, 
blendmode=None): + + if CA >= 1 and ca >= 1 and blendmode is None: + return + tCA = int(round(max(CA , 0) * 100)) + if tCA >= 100: + tCA = 99 + tca = int(round(max(ca, 0) * 100)) + if tca >= 100: + tca = 99 + gstate = "fitzca%02i%02i" % (tCA, tca) + + if not gstate: + return + page = _as_pdf_page(self.this) + resources = mupdf.pdf_dict_get(page.obj(), PDF_NAME('Resources')) + if not resources.m_internal: + resources = mupdf.pdf_dict_put_dict(page.obj(), PDF_NAME('Resources'), 2) + extg = mupdf.pdf_dict_get(resources, PDF_NAME('ExtGState')) + if not extg.m_internal: + extg = mupdf.pdf_dict_put_dict(resources, PDF_NAME('ExtGState'), 2) + n = mupdf.pdf_dict_len(extg) + for i in range(n): + o1 = mupdf.pdf_dict_get_key(extg, i) + name = mupdf.pdf_to_name(o1) + if name == gstate: + return gstate + opa = mupdf.pdf_new_dict(page.doc(), 3) + mupdf.pdf_dict_put_real(opa, PDF_NAME('CA'), CA) + mupdf.pdf_dict_put_real(opa, PDF_NAME('ca'), ca) + mupdf.pdf_dict_puts(extg, gstate, opa) + return gstate + + def _set_pagebox(self, boxtype, rect): + doc = self.parent + if doc is None: + raise ValueError("orphaned object: parent is None") + + if not doc.is_pdf: + raise ValueError("is no PDF") + + valid_boxes = ("CropBox", "BleedBox", "TrimBox", "ArtBox") + + if boxtype not in valid_boxes: + raise ValueError("bad boxtype") + + rect = Rect(rect) + mb = self.mediabox + rect = Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1]) + if not (mb.x0 <= rect.x0 < rect.x1 <= mb.x1 and mb.y0 <= rect.y0 < rect.y1 <= mb.y1): + raise ValueError(f"{boxtype} not in MediaBox") + + doc.xref_set_key(self.xref, boxtype, f"[{_format_g(tuple(rect))}]") + + def _set_resource_property(self, name, xref): + page = self._pdf_page() + JM_set_resource_property(page.obj(), name, xref) + + def _show_pdf_page(self, fz_srcpage, overlay=1, matrix=None, xref=0, oc=0, clip=None, graftmap=None, _imgname=None): + cropbox = JM_rect_from_py(clip) + mat = JM_matrix_from_py(matrix) + rc_xref = xref + tpage = _as_pdf_page(self.this) + tpageref = tpage.obj() + pdfout = tpage.doc() # target PDF + ENSURE_OPERATION(pdfout) + #------------------------------------------------------------- + # convert the source page to a Form XObject + #------------------------------------------------------------- + xobj1 = JM_xobject_from_page(pdfout, fz_srcpage, xref, graftmap.this) + if not rc_xref: + rc_xref = mupdf.pdf_to_num(xobj1) + + #------------------------------------------------------------- + # create referencing XObject (controls display on target page) + #------------------------------------------------------------- + # fill reference to xobj1 into the /Resources + #------------------------------------------------------------- + subres1 = mupdf.pdf_new_dict(pdfout, 5) + mupdf.pdf_dict_puts(subres1, "fullpage", xobj1) + subres = mupdf.pdf_new_dict(pdfout, 5) + mupdf.pdf_dict_put(subres, PDF_NAME('XObject'), subres1) + + res = mupdf.fz_new_buffer(20) + mupdf.fz_append_string(res, "/fullpage Do") + + xobj2 = mupdf.pdf_new_xobject(pdfout, cropbox, mat, subres, res) + if oc > 0: + JM_add_oc_object(pdfout, mupdf.pdf_resolve_indirect(xobj2), oc) + + #------------------------------------------------------------- + # update target page with xobj2: + #------------------------------------------------------------- + # 1. 
insert Xobject in Resources + #------------------------------------------------------------- + resources = mupdf.pdf_dict_get_inheritable(tpageref, PDF_NAME('Resources')) + if not resources.m_internal: + resources = mupdf.pdf_dict_put_dict(tpageref,PDF_NAME('Resources'), 5) + subres = mupdf.pdf_dict_get(resources, PDF_NAME('XObject')) + if not subres.m_internal: + subres = mupdf.pdf_dict_put_dict(resources, PDF_NAME('XObject'), 5) + + mupdf.pdf_dict_puts(subres, _imgname, xobj2) + + #------------------------------------------------------------- + # 2. make and insert new Contents object + #------------------------------------------------------------- + nres = mupdf.fz_new_buffer(50) # buffer for Do-command + mupdf.fz_append_string(nres, " q /") # Do-command + mupdf.fz_append_string(nres, _imgname) + mupdf.fz_append_string(nres, " Do Q ") + + JM_insert_contents(pdfout, tpageref, nres, overlay) + return rc_xref + + def add_caret_annot(self, point: point_like) -> Annot: + """Add a 'Caret' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_caret_annot(point) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot = Annot( annot) + annot_postprocess(self, annot) + assert hasattr( annot, 'parent') + return annot + + def add_circle_annot(self, rect: rect_like) -> Annot: + """Add a 'Circle' (ellipse, oval) annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_square_or_circle(rect, mupdf.PDF_ANNOT_CIRCLE) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + def add_file_annot( + self, + point: point_like, + buffer_: ByteString, + filename: str, + ufilename: OptStr =None, + desc: OptStr =None, + icon: OptStr =None + ) -> Annot: + """Add a 'FileAttachment' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_file_annot(point, + buffer_, + filename, + ufilename=ufilename, + desc=desc, + icon=icon, + ) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + def add_freetext_annot( + self, + rect: rect_like, + text: str, + *, + fontsize: float =11, + fontname: OptStr =None, + text_color: OptSeq =None, + fill_color: OptSeq =None, + border_color: OptSeq =None, + border_width: float =0, + dashes: OptSeq =None, + callout: OptSeq =None, + line_end: int=mupdf.PDF_ANNOT_LE_OPEN_ARROW, + opacity: float =1, + align: int =0, + rotate: int =0, + richtext=False, + style=None, + ) -> Annot: + """Add a 'FreeText' annotation.""" + + old_rotation = annot_preprocess(self) + try: + annot = self._add_freetext_annot( + rect, + text, + fontsize=fontsize, + fontname=fontname, + text_color=text_color, + fill_color=fill_color, + border_color=border_color, + border_width=border_width, + dashes=dashes, + callout=callout, + line_end=line_end, + opacity=opacity, + align=align, + rotate=rotate, + richtext=richtext, + style=style, + ) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + def add_highlight_annot(self, quads=None, start=None, + stop=None, clip=None) -> Annot: + """Add a 'Highlight' annotation.""" + if quads is None: + q = get_highlight_selection(self, start=start, stop=stop, clip=clip) + else: + q = CheckMarkerArg(quads) + ret = self._add_text_marker(q, mupdf.PDF_ANNOT_HIGHLIGHT) + return ret + + def add_ink_annot(self, handwriting: list) -> Annot: + """Add a 'Ink' ('handwriting') annotation. 
+ + The argument must be a list of lists of point_likes. + """ + old_rotation = annot_preprocess(self) + try: + annot = self._add_ink_annot(handwriting) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + def add_line_annot(self, p1: point_like, p2: point_like) -> Annot: + """Add a 'Line' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_line_annot(p1, p2) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + def add_polygon_annot(self, points: list) -> Annot: + """Add a 'Polygon' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_multiline(points, mupdf.PDF_ANNOT_POLYGON) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + def add_polyline_annot(self, points: list) -> Annot: + """Add a 'PolyLine' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_multiline(points, mupdf.PDF_ANNOT_POLY_LINE) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + def add_rect_annot(self, rect: rect_like) -> Annot: + """Add a 'Square' (rectangle) annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_square_or_circle(rect, mupdf.PDF_ANNOT_SQUARE) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + def add_redact_annot( + self, + quad, + text: OptStr =None, + fontname: OptStr =None, + fontsize: float =11, + align: int =0, + fill: OptSeq =None, + text_color: OptSeq =None, + cross_out: bool =True, + ) -> Annot: + """Add a 'Redact' annotation.""" + da_str = None + if text and not set(string.whitespace).issuperset(text): + CheckColor(fill) + CheckColor(text_color) + if not fontname: + fontname = "Helv" + if not fontsize: + fontsize = 11 + if not text_color: + text_color = (0, 0, 0) + if hasattr(text_color, "__float__"): + text_color = (text_color, text_color, text_color) + if len(text_color) > 3: + text_color = text_color[:3] + fmt = "{:g} {:g} {:g} rg /{f:s} {s:g} Tf" + da_str = fmt.format(*text_color, f=fontname, s=fontsize) + if fill is None: + fill = (1, 1, 1) + if fill: + if hasattr(fill, "__float__"): + fill = (fill, fill, fill) + if len(fill) > 3: + fill = fill[:3] + else: + text = None + + old_rotation = annot_preprocess(self) + try: + annot = self._add_redact_annot(quad, text=text, da_str=da_str, + align=align, fill=fill) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + #------------------------------------------------------------- + # change appearance to show a crossed-out rectangle + #------------------------------------------------------------- + if cross_out: + ap_tab = annot._getAP().splitlines()[:-1] # get the 4 commands only + _, LL, LR, UR, UL = ap_tab + ap_tab.append(LR) + ap_tab.append(LL) + ap_tab.append(UR) + ap_tab.append(LL) + ap_tab.append(UL) + ap_tab.append(b"S") + ap = b"\n".join(ap_tab) + annot._setAP(ap, 0) + return annot + + def add_squiggly_annot( + self, + quads=None, + start=None, + stop=None, + clip=None, + ) -> Annot: + """Add a 'Squiggly' annotation.""" + if quads is None: + q = get_highlight_selection(self, start=start, stop=stop, clip=clip) + else: + q = CheckMarkerArg(quads) + return self._add_text_marker(q, mupdf.PDF_ANNOT_SQUIGGLY) + + def add_stamp_annot(self, 
rect: rect_like, stamp=0) -> Annot: + """Add a ('rubber') 'Stamp' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_stamp_annot(rect, stamp) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + def add_strikeout_annot(self, quads=None, start=None, stop=None, clip=None) -> Annot: + """Add a 'StrikeOut' annotation.""" + if quads is None: + q = get_highlight_selection(self, start=start, stop=stop, clip=clip) + else: + q = CheckMarkerArg(quads) + return self._add_text_marker(q, mupdf.PDF_ANNOT_STRIKE_OUT) + + def add_text_annot(self, point: point_like, text: str, icon: str ="Note") -> Annot: + """Add a 'Text' (sticky note) annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_text_annot(point, text, icon=icon) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + def add_underline_annot(self, quads=None, start=None, stop=None, clip=None) -> Annot: + """Add a 'Underline' annotation.""" + if quads is None: + q = get_highlight_selection(self, start=start, stop=stop, clip=clip) + else: + q = CheckMarkerArg(quads) + return self._add_text_marker(q, mupdf.PDF_ANNOT_UNDERLINE) + + def add_widget(self, widget: Widget) -> Annot: + """Add a 'Widget' (form field).""" + CheckParent(self) + doc = self.parent + if not doc.is_pdf: + raise ValueError("is no PDF") + widget._validate() + annot = self._addWidget(widget.field_type, widget.field_name) + if not annot: + return None + annot.thisown = True + annot.parent = weakref.proxy(self) # owning page object + self._annot_refs[id(annot)] = annot + widget.parent = annot.parent + widget._annot = annot + widget.update() + return annot + + def annot_names(self): + ''' + page get list of annot names + ''' + """List of names of annotations, fields and links.""" + CheckParent(self) + page = self._pdf_page(required=False) + if not page.m_internal: + return [] + return JM_get_annot_id_list(page) + + def annot_xrefs(self): + ''' + List of xref numbers of annotations, fields and links. + ''' + return JM_get_annot_xref_list2(self) + + def annots(self, types=None): + """ Generator over the annotations of a page. + + Args: + types: (list) annotation types to subselect from. If none, + all annotations are returned. E.g. types=[PDF_ANNOT_LINE] + will only yield line annotations. + """ + skip_types = (mupdf.PDF_ANNOT_LINK, mupdf.PDF_ANNOT_POPUP, mupdf.PDF_ANNOT_WIDGET) + if not hasattr(types, "__getitem__"): + annot_xrefs = [a[0] for a in self.annot_xrefs() if a[1] not in skip_types] + else: + annot_xrefs = [a[0] for a in self.annot_xrefs() if a[1] in types and a[1] not in skip_types] + for xref in annot_xrefs: + annot = self.load_annot(xref) + annot._yielded=True + yield annot + + def recolor(self, components=1): + """Convert colorspaces of objects on the page. + + Valid values are 1, 3 and 4. 
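+
+        Example (a minimal sketch; 1 component converts page objects to gray):
+
+            doc = pymupdf.open("input.pdf")  # "input.pdf" is a placeholder
+            doc[0].recolor(components=1)
+            doc.save("gray.pdf")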
+ """ + if components not in (1, 3, 4): + raise ValueError("components must be one of 1, 3, 4") + pdfdoc = _as_pdf_document(self.parent) + ropt = mupdf.pdf_recolor_options() + ropt.num_comp = components + ropts = mupdf.PdfRecolorOptions(ropt) + mupdf.pdf_recolor_page(pdfdoc, self.number, ropts) + + def clip_to_rect(self, rect): + """Clip away page content outside the rectangle.""" + clip = Rect(rect) + if clip.is_infinite or (clip & self.rect).is_empty: + raise ValueError("rect must not be infinite or empty") + clip *= self.transformation_matrix + pdfpage = _as_pdf_page(self) + pclip = JM_rect_from_py(clip) + mupdf.pdf_clip_page(pdfpage, pclip) + + @property + def artbox(self): + """The ArtBox""" + rect = self._other_box("ArtBox") + if rect is None: + return self.cropbox + mb = self.mediabox + return Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1]) + + @property + def bleedbox(self): + """The BleedBox""" + rect = self._other_box("BleedBox") + if rect is None: + return self.cropbox + mb = self.mediabox + return Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1]) + + def bound(self): + """Get page rectangle.""" + CheckParent(self) + page = _as_fz_page(self.this) + val = mupdf.fz_bound_page(page) + val = Rect(val) + + if val.is_infinite and self.parent.is_pdf: + cb = self.cropbox + w, h = cb.width, cb.height + if self.rotation not in (0, 180): + w, h = h, w + val = Rect(0, 0, w, h) + msg = TOOLS.mupdf_warnings(reset=False).splitlines()[-1] + message(msg) + + return val + + def clean_contents(self, sanitize=1): + if not sanitize and not self.is_wrapped: + self.wrap_contents() + page = _as_pdf_page( self.this, required=False) + if not page.m_internal: + return + filter_ = _make_PdfFilterOptions(recurse=1, sanitize=sanitize) + mupdf.pdf_filter_page_contents( page.doc(), page, filter_) + + @property + def cropbox(self): + """The CropBox.""" + CheckParent(self) + page = self._pdf_page(required=False) + if not page.m_internal: + val = mupdf.fz_bound_page(self.this) + else: + val = JM_cropbox(page.obj()) + val = Rect(val) + + return val + + @property + def cropbox_position(self): + return self.cropbox.tl + + def delete_annot(self, annot): + """Delete annot and return next one.""" + CheckParent(self) + CheckParent(annot) + + page = self._pdf_page() + while 1: + # first loop through all /IRT annots and remove them + irt_annot = JM_find_annot_irt(annot.this) + if not irt_annot: # no more there + break + mupdf.pdf_delete_annot(page, irt_annot.this) + nextannot = mupdf.pdf_next_annot(annot.this) # store next + mupdf.pdf_delete_annot(page, annot.this) + val = Annot(nextannot) + + if val: + val.thisown = True + val.parent = weakref.proxy(self) # owning page object + val.parent._annot_refs[id(val)] = val + annot._erase() + return val + + def delete_link(self, linkdict): + """Delete a Link.""" + CheckParent(self) + if not isinstance( linkdict, dict): + return # have no dictionary + + def finished(): + if linkdict["xref"] == 0: return + try: + linkid = linkdict["id"] + linkobj = self._annot_refs[linkid] + linkobj._erase() + except Exception: + # Don't print this exception, to match classic. Issue #2841. 
+ if g_exceptions_verbose > 1: exception_info() + pass + + page = _as_pdf_page(self.this, required=False) + if not page.m_internal: + return finished() # have no PDF + xref = linkdict[dictkey_xref] + if xref < 1: + return finished() # invalid xref + annots = mupdf.pdf_dict_get( page.obj(), PDF_NAME('Annots')) + if not annots.m_internal: + return finished() # have no annotations + len_ = mupdf.pdf_array_len( annots) + if len_ == 0: + return finished() + oxref = 0 + for i in range( len_): + oxref = mupdf.pdf_to_num( mupdf.pdf_array_get( annots, i)) + if xref == oxref: + break # found xref in annotations + + if xref != oxref: + return finished() # xref not in annotations + mupdf.pdf_array_delete( annots, i) # delete entry in annotations + mupdf.pdf_delete_object( page.doc(), xref) # delete link object + mupdf.pdf_dict_put( page.obj(), PDF_NAME('Annots'), annots) + JM_refresh_links( page) + + return finished() + + @property + def derotation_matrix(self) -> Matrix: + """Reflects page de-rotation.""" + if g_use_extra: + return Matrix(extra.Page_derotate_matrix( self.this)) + pdfpage = self._pdf_page(required=False) + if not pdfpage.m_internal: + return Matrix(mupdf.FzRect(mupdf.FzRect.UNIT)) + return Matrix(JM_derotate_page_matrix(pdfpage)) + + def extend_textpage(self, tpage, flags=0, matrix=None): + page = self.this + tp = tpage.this + assert isinstance( tp, mupdf.FzStextPage) + options = mupdf.FzStextOptions() + options.flags = flags + ctm = JM_matrix_from_py(matrix) + dev = mupdf.FzDevice(tp, options) + mupdf.fz_run_page( page, dev, ctm, mupdf.FzCookie()) + mupdf.fz_close_device( dev) + + @property + def first_annot(self): + """First annotation.""" + CheckParent(self) + page = self._pdf_page(required=False) + if not page.m_internal: + return + annot = mupdf.pdf_first_annot(page) + if not annot.m_internal: + return + val = Annot(annot) + val.thisown = True + val.parent = weakref.proxy(self) # owning page object + self._annot_refs[id(val)] = val + return val + + @property + def first_link(self): + ''' + First link on page + ''' + return self.load_links() + + @property + def first_widget(self): + """First widget/field.""" + CheckParent(self) + annot = 0 + page = self._pdf_page(required=False) + if not page.m_internal: + return + annot = mupdf.pdf_first_widget(page) + if not annot.m_internal: + return + val = Annot(annot) + val.thisown = True + val.parent = weakref.proxy(self) # owning page object + self._annot_refs[id(val)] = val + widget = Widget() + TOOLS._fill_widget(val, widget) + val = widget + return val + + def get_bboxlog(self, layers=None): + CheckParent(self) + old_rotation = self.rotation + if old_rotation != 0: + self.set_rotation(0) + page = self.this + rc = [] + inc_layers = True if layers else False + dev = JM_new_bbox_device( rc, inc_layers) + mupdf.fz_run_page( page, dev, mupdf.FzMatrix(), mupdf.FzCookie()) + mupdf.fz_close_device( dev) + + if old_rotation != 0: + self.set_rotation(old_rotation) + return rc + + def get_cdrawings(self, extended=None, callback=None, method=None): + """Extract vector graphics ("line art") from the page.""" + CheckParent(self) + old_rotation = self.rotation + if old_rotation != 0: + self.set_rotation(0) + page = self.this + if isinstance(page, mupdf.PdfPage): + # Downcast pdf_page to fz_page. 
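+            # (The device below is run via fz_run_page(), which takes the base fz_page type.)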
+ page = mupdf.FzPage(page) + assert isinstance(page, mupdf.FzPage), f'{self.this=}' + clips = True if extended else False + prect = mupdf.fz_bound_page(page) + if g_use_extra: + rc = extra.get_cdrawings(page, extended, callback, method) + else: + rc = list() + if callable(callback) or method is not None: + dev = JM_new_lineart_device_Device(callback, clips, method) + else: + dev = JM_new_lineart_device_Device(rc, clips, method) + dev.ptm = mupdf.FzMatrix(1, 0, 0, -1, 0, prect.y1) + mupdf.fz_run_page(page, dev, mupdf.FzMatrix(), mupdf.FzCookie()) + mupdf.fz_close_device(dev) + + if old_rotation != 0: + self.set_rotation(old_rotation) + if callable(callback) or method is not None: + return + return rc + + def get_contents(self): + """Get xrefs of /Contents objects.""" + CheckParent(self) + ret = [] + page = _as_pdf_page(self.this) + obj = page.obj() + contents = mupdf.pdf_dict_get(obj, mupdf.PDF_ENUM_NAME_Contents) + if mupdf.pdf_is_array(contents): + n = mupdf.pdf_array_len(contents) + for i in range(n): + icont = mupdf.pdf_array_get(contents, i) + xref = mupdf.pdf_to_num(icont) + ret.append(xref) + elif contents.m_internal: + xref = mupdf.pdf_to_num(contents) + ret.append( xref) + return ret + + def get_displaylist(self, annots=1): + ''' + Make a DisplayList from the page for Pixmap generation. + + Include (default) or exclude annotations. + ''' + CheckParent(self) + if annots: + dl = mupdf.fz_new_display_list_from_page(self.this) + else: + dl = mupdf.fz_new_display_list_from_page_contents(self.this) + return DisplayList(dl) + + def get_drawings(self, extended: bool=False) -> list: + """Retrieve vector graphics. The extended version includes clips. + + Note: + For greater comfort, this method converts point-likes, rect-likes, quad-likes + of the C version to respective Point / Rect / Quad objects. + It also adds default items that are missing in original path types. 
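+
+        Example (illustrative sketch):
+
+            for path in page.get_drawings():
+                print(path["type"], path["rect"], len(path["items"]))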
+ """ + allkeys = ( + 'closePath', + 'fill', + 'color', + 'width', + 'lineCap', + 'lineJoin', + 'dashes', + 'stroke_opacity', + 'fill_opacity', + 'even_odd', + ) + val = self.get_cdrawings(extended=extended) + for i in range(len(val)): + npath = val[i] + if not npath["type"].startswith("clip"): + npath["rect"] = Rect(npath["rect"]) + else: + npath["scissor"] = Rect(npath["scissor"]) + if npath["type"]!="group": + items = npath["items"] + newitems = [] + for item in items: + cmd = item[0] + rest = item[1:] + if cmd == "re": + item = ("re", Rect(rest[0]).normalize(), rest[1]) + elif cmd == "qu": + item = ("qu", Quad(rest[0])) + else: + item = tuple([cmd] + [Point(i) for i in rest]) + newitems.append(item) + npath["items"] = newitems + if npath['type'] in ('f', 's'): + for k in allkeys: + npath[k] = npath.get(k) + + val[i] = npath + return val + + class Drawpath(object): + """Reflects a path dictionary from get_cdrawings().""" + def __init__(self, **args): + self.__dict__.update(args) + + class Drawpathlist(object): + """List of Path objects representing get_cdrawings() output.""" + def __getitem__(self, item): + return self.paths.__getitem__(item) + + def __init__(self): + self.paths = [] + self.path_count = 0 + self.group_count = 0 + self.clip_count = 0 + self.fill_count = 0 + self.stroke_count = 0 + self.fillstroke_count = 0 + + def __len__(self): + return self.paths.__len__() + + def append(self, path): + self.paths.append(path) + self.path_count += 1 + if path.type == "clip": + self.clip_count += 1 + elif path.type == "group": + self.group_count += 1 + elif path.type == "f": + self.fill_count += 1 + elif path.type == "s": + self.stroke_count += 1 + elif path.type == "fs": + self.fillstroke_count += 1 + + def clip_parents(self, i): + """Return list of parent clip paths. + + Args: + i: (int) return parents of this path. + Returns: + List of the clip parents.""" + if i >= self.path_count: + raise IndexError("bad path index") + while i < 0: + i += self.path_count + lvl = self.paths[i].level + clips = list( # clip paths before identified one + reversed( + [ + p + for p in self.paths[:i] + if p.type == "clip" and p.level < lvl + ] + ) + ) + if clips == []: # none found: empty list + return [] + nclips = [clips[0]] # init return list + for p in clips[1:]: + if p.level >= nclips[-1].level: + continue # only accept smaller clip levels + nclips.append(p) + return nclips + + def group_parents(self, i): + """Return list of parent group paths. + + Args: + i: (int) return parents of this path. + Returns: + List of the group parents.""" + if i >= self.path_count: + raise IndexError("bad path index") + while i < 0: + i += self.path_count + lvl = self.paths[i].level + groups = list( # group paths before identified one + reversed( + [ + p + for p in self.paths[:i] + if p.type == "group" and p.level < lvl + ] + ) + ) + if groups == []: # none found: empty list + return [] + ngroups = [groups[0]] # init return list + for p in groups[1:]: + if p.level >= ngroups[-1].level: + continue # only accept smaller group levels + ngroups.append(p) + return ngroups + + def get_lineart(self) -> object: + """Get page drawings paths. + + Note: + For greater comfort, this method converts point-like, rect-like, quad-like + tuples of the C version to respective Point / Rect / Quad objects. + Also adds default items that are missing in original path types. + In contrast to get_drawings(), this output is an object. 
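+
+        Example (illustrative sketch):
+
+            paths = page.get_lineart()
+            print(paths.path_count, "paths, thereof", paths.clip_count, "clips")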
+ """ + + val = self.get_cdrawings(extended=True) + paths = self.Drawpathlist() + for path in val: + npath = self.Drawpath(**path) + if npath.type != "clip": + npath.rect = Rect(path["rect"]) + else: + npath.scissor = Rect(path["scissor"]) + if npath.type != "group": + items = path["items"] + newitems = [] + for item in items: + cmd = item[0] + rest = item[1:] + if cmd == "re": + item = ("re", Rect(rest[0]).normalize(), rest[1]) + elif cmd == "qu": + item = ("qu", Quad(rest[0])) + else: + item = tuple([cmd] + [Point(i) for i in rest]) + newitems.append(item) + npath.items = newitems + + if npath.type == "f": + npath.stroke_opacity = None + npath.dashes = None + npath.line_join = None + npath.line_cap = None + npath.color = None + npath.width = None + + paths.append(npath) + + val = None + return paths + + def remove_rotation(self): + """Set page rotation to 0 while maintaining visual appearance.""" + rot = self.rotation # normalized rotation value + if rot == 0: + return Identity # nothing to do + + # need to derotate the page's content + mb = self.mediabox # current mediabox + + if rot == 90: + # before derotation, shift content horizontally + mat0 = Matrix(1, 0, 0, 1, mb.y1 - mb.x1 - mb.x0 - mb.y0, 0) + elif rot == 270: + # before derotation, shift content vertically + mat0 = Matrix(1, 0, 0, 1, 0, mb.x1 - mb.y1 - mb.y0 - mb.x0) + else: # rot = 180 + mat0 = Matrix(1, 0, 0, 1, -2 * mb.x0, -2 * mb.y0) + + # prefix with derotation matrix + mat = mat0 * self.derotation_matrix + cmd = _format_g(tuple(mat)) + ' cm ' + cmd = cmd.encode('utf8') + _ = TOOLS._insert_contents(self, cmd, False) # prepend to page contents + + # swap x- and y-coordinates + if rot in (90, 270): + x0, y0, x1, y1 = mb + mb.x0 = y0 + mb.y0 = x0 + mb.x1 = y1 + mb.y1 = x1 + self.set_mediabox(mb) + + self.set_rotation(0) + rot = ~mat # inverse of the derotation matrix + + for annot in self.annots(): # modify rectangles of annotations + r = annot.rect * rot + # TODO: only try to set rectangle for applicable annot types + annot.set_rect(r) + for link in self.get_links(): # modify 'from' rectangles of links + r = link["from"] * rot + self.delete_link(link) + link["from"] = r + try: # invalid links remain deleted + self.insert_link(link) + except Exception: + pass + for widget in self.widgets(): # modify field rectangles + r = widget.rect * rot + widget.rect = r + widget.update() + return rot # the inverse of the generated derotation matrix + + def cluster_drawings( + self, clip=None, drawings=None, x_tolerance: float = 3, y_tolerance: float = 3, + final_filter: bool = True, + ) -> list: + """Join rectangles of neighboring vector graphic items. + + Args: + clip: optional rect-like to restrict the page area to consider. + drawings: (optional) output of a previous "get_drawings()". + x_tolerance: horizontal neighborhood threshold. + y_tolerance: vertical neighborhood threshold. + + Notes: + Vector graphics (also called line-art or drawings) usually consist + of independent items like rectangles, lines or curves to jointly + form table grid lines or bar, line, pie charts and similar. + This method identifies rectangles wrapping these disparate items. + + Returns: + A list of Rect items, each wrapping line-art items that are close + enough to be considered forming a common vector graphic. + Only "significant" rectangles will be returned, i.e. having both, + width and height larger than the tolerance values. 
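+
+        Example (a minimal sketch, assuming the usual Page.draw_rect() helper):
+
+            for bbox in page.cluster_drawings():
+                page.draw_rect(bbox, color=(1, 0, 0), width=0.5)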
+ """ + CheckParent(self) + parea = self.rect # the default clipping area + if clip is not None: + parea = Rect(clip) + delta_x = x_tolerance # shorter local name + delta_y = y_tolerance # shorter local name + if drawings is None: # if we cannot re-use a previous output + drawings = self.get_drawings() + + def are_neighbors(r1, r2): + """Detect whether r1, r2 are "neighbors". + + Items r1, r2 are called neighbors if the minimum distance between + their points is less-equal delta. + + Both parameters must be (potentially invalid) rectangles. + """ + # normalize rectangles as needed + rr1_x0, rr1_x1 = (r1.x0, r1.x1) if r1.x1 > r1.x0 else (r1.x1, r1.x0) + rr1_y0, rr1_y1 = (r1.y0, r1.y1) if r1.y1 > r1.y0 else (r1.y1, r1.y0) + rr2_x0, rr2_x1 = (r2.x0, r2.x1) if r2.x1 > r2.x0 else (r2.x1, r2.x0) + rr2_y0, rr2_y1 = (r2.y0, r2.y1) if r2.y1 > r2.y0 else (r2.y1, r2.y0) + if ( + 0 + or rr1_x1 < rr2_x0 - delta_x + or rr1_x0 > rr2_x1 + delta_x + or rr1_y1 < rr2_y0 - delta_y + or rr1_y0 > rr2_y1 + delta_y + ): + # Rects do not overlap. + return False + else: + # Rects overlap. + return True + + # exclude graphics not contained in the clip + paths = [ + p + for p in drawings + if 1 + and p["rect"].x0 >= parea.x0 + and p["rect"].x1 <= parea.x1 + and p["rect"].y0 >= parea.y0 + and p["rect"].y1 <= parea.y1 + ] + + # list of all vector graphic rectangles + prects = sorted([p["rect"] for p in paths], key=lambda r: (r.y1, r.x0)) + + new_rects = [] # the final list of the joined rectangles + + # ------------------------------------------------------------------------- + # The strategy is to identify and join all rects that are neighbors + # ------------------------------------------------------------------------- + while prects: # the algorithm will empty this list + r = +prects[0] # copy of first rectangle + repeat = True + while repeat: + repeat = False + for i in range(len(prects) - 1, 0, -1): # from back to front + if are_neighbors(prects[i], r): + r |= prects[i].tl # include in first rect + r |= prects[i].br # include in first rect + del prects[i] # delete this rect + repeat = True + + new_rects.append(r) + del prects[0] + prects = sorted(set(prects), key=lambda r: (r.y1, r.x0)) + + new_rects = sorted(set(new_rects), key=lambda r: (r.y1, r.x0)) + if not final_filter: + return new_rects + return [r for r in new_rects if r.width > delta_x and r.height > delta_y] + + def get_fonts(self, full=False): + """List of fonts defined in the page object.""" + CheckParent(self) + return self.parent.get_page_fonts(self.number, full=full) + + def get_image_bbox(self, name, transform=0): + """Get rectangle occupied by image 'name'. + + 'name' is either an item of the image list, or the referencing + name string - elem[7] of the resp. item. + Option 'transform' also returns the image transformation matrix. + """ + CheckParent(self) + doc = self.parent + if doc.is_closed or doc.is_encrypted: + raise ValueError('document closed or encrypted') + + inf_rect = Rect(1, 1, -1, -1) + null_mat = Matrix() + if transform: + rc = (inf_rect, null_mat) + else: + rc = inf_rect + + if type(name) in (list, tuple): + if not type(name[-1]) is int: + raise ValueError('need item of full page image list') + item = name + else: + imglist = [i for i in doc.get_page_images(self.number, True) if name == i[7]] + if len(imglist) == 1: + item = imglist[0] + elif imglist == []: + raise ValueError('bad image name') + else: + raise ValueError("found multiple images named '%s'." 
% name) + xref = item[-1] + if xref != 0 or transform: + try: + return self.get_image_rects(item, transform=transform)[0] + except Exception: + exception_info() + return inf_rect + pdf_page = self._pdf_page() + val = JM_image_reporter(pdf_page) + + if not bool(val): + return rc + + for v in val: + if v[0] != item[-3]: + continue + q = Quad(v[1]) + bbox = q.rect + if transform == 0: + rc = bbox + break + + hm = Matrix(util_hor_matrix(q.ll, q.lr)) + h = abs(q.ll - q.ul) + w = abs(q.ur - q.ul) + m0 = Matrix(1 / w, 0, 0, 1 / h, 0, 0) + m = ~(hm * m0) + rc = (bbox, m) + break + val = rc + + return val + + def get_images(self, full=False): + """List of images defined in the page object.""" + CheckParent(self) + return self.parent.get_page_images(self.number, full=full) + + def get_oc_items(self) -> list: + """Get OCGs and OCMDs used in the page's contents. + + Returns: + List of items (name, xref, type), where type is one of "ocg" / "ocmd", + and name is the property name. + """ + rc = [] + for pname, xref in self._get_resource_properties(): + text = self.parent.xref_object(xref, compressed=True) + if "/Type/OCG" in text: + octype = "ocg" + elif "/Type/OCMD" in text: + octype = "ocmd" + else: + continue + rc.append((pname, xref, octype)) + return rc + + def get_svg_image(self, matrix=None, text_as_path=1): + """Make SVG image from page.""" + CheckParent(self) + mediabox = mupdf.fz_bound_page(self.this) + ctm = JM_matrix_from_py(matrix) + tbounds = mediabox + text_option = mupdf.FZ_SVG_TEXT_AS_PATH if text_as_path == 1 else mupdf.FZ_SVG_TEXT_AS_TEXT + tbounds = mupdf.fz_transform_rect(tbounds, ctm) + + res = mupdf.fz_new_buffer(1024) + out = mupdf.FzOutput(res) + dev = mupdf.fz_new_svg_device( + out, + tbounds.x1-tbounds.x0, # width + tbounds.y1-tbounds.y0, # height + text_option, + 1, + ) + mupdf.fz_run_page(self.this, dev, ctm, mupdf.FzCookie()) + mupdf.fz_close_device(dev) + out.fz_close_output() + text = JM_EscapeStrFromBuffer(res) + return text + + def get_textbox( + page: Page, + rect: rect_like, + textpage=None, #: TextPage = None, + ) -> str: + tp = textpage + if tp is None: + tp = page.get_textpage() + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + rc = tp.extractTextbox(rect) + if textpage is None: + del tp + return rc + + def get_textpage(self, clip: rect_like = None, flags: int = 0, matrix=None) -> "TextPage": + CheckParent(self) + if matrix is None: + matrix = Matrix(1, 1) + old_rotation = self.rotation + if old_rotation != 0: + self.set_rotation(0) + try: + textpage = self._get_textpage(clip, flags=flags, matrix=matrix) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + textpage = TextPage(textpage) + textpage.parent = weakref.proxy(self) + return textpage + + def get_texttrace(self): + + CheckParent(self) + old_rotation = self.rotation + if old_rotation != 0: + self.set_rotation(0) + page = self.this + rc = [] + if g_use_extra: + dev = extra.JM_new_texttrace_device(rc) + else: + dev = JM_new_texttrace_device(rc) + prect = mupdf.fz_bound_page(page) + dev.ptm = mupdf.FzMatrix(1, 0, 0, -1, 0, prect.y1) + mupdf.fz_run_page(page, dev, mupdf.FzMatrix(), mupdf.FzCookie()) + mupdf.fz_close_device(dev) + + if old_rotation != 0: + self.set_rotation(old_rotation) + return rc + + def get_xobjects(self): + """List of xobjects defined in the page object.""" + CheckParent(self) + return self.parent.get_page_xobjects(self.number) + + def insert_font(self, fontname="helv", fontfile=None, fontbuffer=None, + set_simple=False, wmode=0, 
encoding=0): + doc = self.parent + if doc is None: + raise ValueError("orphaned object: parent is None") + idx = 0 + + if fontname.startswith("/"): + fontname = fontname[1:] + inv_chars = INVALID_NAME_CHARS.intersection(fontname) + if inv_chars != set(): + raise ValueError(f"bad fontname chars {inv_chars}") + + font = CheckFont(self, fontname) + if font is not None: # font already in font list of page + xref = font[0] # this is the xref + if CheckFontInfo(doc, xref): # also in our document font list? + return xref # yes: we are done + # need to build the doc FontInfo entry - done via get_char_widths + doc.get_char_widths(xref) + return xref + + #-------------------------------------------------------------------------- + # the font is not present for this page + #-------------------------------------------------------------------------- + + bfname = Base14_fontdict.get(fontname.lower(), None) # BaseFont if Base-14 font + + serif = 0 + CJK_number = -1 + CJK_list_n = ["china-t", "china-s", "japan", "korea"] + CJK_list_s = ["china-ts", "china-ss", "japan-s", "korea-s"] + + try: + CJK_number = CJK_list_n.index(fontname) + serif = 0 + except Exception: + # Verbose in PyMuPDF/tests. + if g_exceptions_verbose > 1: exception_info() + pass + + if CJK_number < 0: + try: + CJK_number = CJK_list_s.index(fontname) + serif = 1 + except Exception: + # Verbose in PyMuPDF/tests. + if g_exceptions_verbose > 1: exception_info() + pass + + if fontname.lower() in fitz_fontdescriptors.keys(): + import pymupdf_fonts + fontbuffer = pymupdf_fonts.myfont(fontname) # make a copy + del pymupdf_fonts + + # install the font for the page + if fontfile is not None: + if type(fontfile) is str: + fontfile_str = fontfile + elif hasattr(fontfile, "absolute"): + fontfile_str = str(fontfile) + elif hasattr(fontfile, "name"): + fontfile_str = fontfile.name + else: + raise ValueError("bad fontfile") + else: + fontfile_str = None + val = self._insertFont(fontname, bfname, fontfile_str, fontbuffer, set_simple, idx, + wmode, serif, encoding, CJK_number) + + if not val: # did not work, error return + return val + + xref = val[0] # xref of installed font + fontdict = val[1] + + if CheckFontInfo(doc, xref): # check again: document already has this font + return xref # we are done + + # need to create document font info + doc.get_char_widths(xref, fontdict=fontdict) + return xref + + @property + def is_wrapped(self): + """Check if /Contents is in a balanced graphics state.""" + return self._count_q_balance() == (0, 0) + + @property + def language(self): + """Page language.""" + pdfpage = _as_pdf_page(self.this, required=False) + if not pdfpage.m_internal: + return + lang = mupdf.pdf_dict_get_inheritable(pdfpage.obj(), PDF_NAME('Lang')) + if not lang.m_internal: + return + return mupdf.pdf_to_str_buf(lang) + + def links(self, kinds=None): + """ Generator over the links of a page. + + Args: + kinds: (list) link kinds to subselect from. If none, + all links are returned. E.g. kinds=[LINK_URI] + will only yield URI links. + """ + all_links = self.get_links() + for link in all_links: + if kinds is None or link["kind"] in kinds: + yield (link) + + def load_annot(self, ident: typing.Union[str, int]) -> Annot: + """Load an annot by name (/NM key) or xref. + + Args: + ident: identifier, either name (str) or xref (int). 
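+
+        Example (illustrative; the name is a placeholder):
+
+            annot = page.load_annot(123)          # by xref number
+            annot = page.load_annot("my-annot")   # by /NM name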
+ """ + CheckParent(self) + if type(ident) is str: + xref = 0 + name = ident + elif type(ident) is int: + xref = ident + name = None + else: + raise ValueError("identifier must be a string or integer") + val = self._load_annot(name, xref) + if not val: + return val + val.thisown = True + val.parent = weakref.proxy(self) + self._annot_refs[id(val)] = val + return val + + def load_links(self): + """Get first Link.""" + CheckParent(self) + val = mupdf.fz_load_links( self.this) + if not val.m_internal: + return + val = Link( val) + val.thisown = True + val.parent = weakref.proxy(self) # owning page object + self._annot_refs[id(val)] = val + val.xref = 0 + val.id = "" + if self.parent.is_pdf: + xrefs = self.annot_xrefs() + xrefs = [x for x in xrefs if x[1] == mupdf.PDF_ANNOT_LINK] + if xrefs: + link_id = xrefs[0] + val.xref = link_id[0] + val.id = link_id[2] + else: + val.xref = 0 + val.id = "" + return val + + #---------------------------------------------------------------- + # page load widget by xref + #---------------------------------------------------------------- + def load_widget( self, xref): + """Load a widget by its xref.""" + CheckParent(self) + + page = _as_pdf_page(self.this) + annot = JM_get_widget_by_xref( page, xref) + #log( '{=type(annot)}') + val = annot + if not val: + return val + val.thisown = True + val.parent = weakref.proxy(self) + self._annot_refs[id(val)] = val + widget = Widget() + TOOLS._fill_widget(val, widget) + val = widget + return val + + @property + def mediabox(self): + """The MediaBox.""" + CheckParent(self) + page = self._pdf_page(required=False) + if not page.m_internal: + rect = mupdf.fz_bound_page( self.this) + else: + rect = JM_mediabox( page.obj()) + return Rect(rect) + + @property + def mediabox_size(self): + return Point(self.mediabox.x1, self.mediabox.y1) + + #@property + #def parent( self): + # assert self._parent + # if self._parent: + # return self._parent + # return Document( self.this.document()) + + def read_contents(self): + """All /Contents streams concatenated to one bytes object.""" + return TOOLS._get_all_contents(self) + + def refresh(self): + """Refresh page after link/annot/widget updates.""" + CheckParent(self) + doc = self.parent + page = doc.reload_page(self) + # fixme this looks wrong. + self.this = page + + @property + def rotation(self): + """Page rotation.""" + CheckParent(self) + page = _as_pdf_page(self.this, required=0) + if not page.m_internal: + return 0 + return JM_page_rotation(page) + + @property + def rotation_matrix(self) -> Matrix: + """Reflects page rotation.""" + return Matrix(TOOLS._rotate_matrix(self)) + + def run(self, dw, m): + """Run page through a device. + dw: DeviceWrapper + """ + CheckParent(self) + mupdf.fz_run_page(self.this, dw.device, JM_matrix_from_py(m), mupdf.FzCookie()) + + def set_artbox(self, rect): + """Set the ArtBox.""" + return self._set_pagebox("ArtBox", rect) + + def set_bleedbox(self, rect): + """Set the BleedBox.""" + return self._set_pagebox("BleedBox", rect) + + def set_contents(self, xref): + """Set object at 'xref' as the page's /Contents.""" + CheckParent(self) + doc = self.parent + if doc.is_closed: + raise ValueError("document closed") + if not doc.is_pdf: + raise ValueError("is no PDF") + if xref not in range(1, doc.xref_length()): + raise ValueError("bad xref") + if not doc.xref_is_stream(xref): + raise ValueError("xref is no stream") + doc.xref_set_key(self.xref, "Contents", "%i 0 R" % xref) + + def set_cropbox(self, rect): + """Set the CropBox. 
Will also change Page.rect."""
+        return self._set_pagebox("CropBox", rect)
+
+    def set_language(self, language=None):
+        """Set PDF page default language."""
+        CheckParent(self)
+        pdfpage = _as_pdf_page(self.this)
+        if not language:
+            mupdf.pdf_dict_del(pdfpage.obj(), PDF_NAME('Lang'))
+        else:
+            lang = mupdf.fz_text_language_from_string(language)
+            assert hasattr(mupdf, 'fz_string_from_text_language2')
+            mupdf.pdf_dict_put_text_string(
+                    pdfpage.obj(),
+                    PDF_NAME('Lang'),
+                    mupdf.fz_string_from_text_language2(lang)
+                    )
+
+    def set_mediabox(self, rect):
+        """Set the MediaBox."""
+        CheckParent(self)
+        page = self._pdf_page()
+        mediabox = JM_rect_from_py(rect)
+        if (mupdf.fz_is_empty_rect(mediabox)
+                or mupdf.fz_is_infinite_rect(mediabox)
+                ):
+            raise ValueError( MSG_BAD_RECT)
+        mupdf.pdf_dict_put_rect( page.obj(), PDF_NAME('MediaBox'), mediabox)
+        mupdf.pdf_dict_del( page.obj(), PDF_NAME('CropBox'))
+        mupdf.pdf_dict_del( page.obj(), PDF_NAME('ArtBox'))
+        mupdf.pdf_dict_del( page.obj(), PDF_NAME('BleedBox'))
+        mupdf.pdf_dict_del( page.obj(), PDF_NAME('TrimBox'))
+
+    def set_rotation(self, rotation):
+        """Set page rotation."""
+        CheckParent(self)
+        page = _as_pdf_page(self.this)
+        rot = JM_norm_rotation(rotation)
+        mupdf.pdf_dict_put_int( page.obj(), PDF_NAME('Rotate'), rot)
+
+    def set_trimbox(self, rect):
+        """Set the TrimBox."""
+        return self._set_pagebox("TrimBox", rect)
+
+    @property
+    def transformation_matrix(self):
+        """Page transformation matrix."""
+        CheckParent(self)
+
+        ctm = mupdf.FzMatrix()
+        page = self._pdf_page(required=False)
+        if not page.m_internal:
+            return JM_py_from_matrix(ctm)
+        mediabox = mupdf.FzRect(mupdf.FzRect.Fixed_UNIT)  # fixme: original code passed mediabox=NULL.
+        mupdf.pdf_page_transform(page, mediabox, ctm)
+        val = JM_py_from_matrix(ctm)
+
+        if self.rotation % 360 == 0:
+            val = Matrix(val)
+        else:
+            val = Matrix(1, 0, 0, -1, 0, self.cropbox.height)
+        return val
+
+    @property
+    def trimbox(self):
+        """The TrimBox"""
+        rect = self._other_box("TrimBox")
+        if rect is None:
+            return self.cropbox
+        mb = self.mediabox
+        return Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1])
+
+    def widgets(self, types=None):
+        """ Generator over the widgets of a page.
+
+        Args:
+            types: (list) field types to subselect from. If none,
+                    all fields are returned. E.g. types=[PDF_WIDGET_TYPE_TEXT]
+                    will only yield text fields.
+        """
+        #for a in self.annot_xrefs():
+        #    log( '{a=}')
+        widget_xrefs = [a[0] for a in self.annot_xrefs() if a[1] == mupdf.PDF_ANNOT_WIDGET]
+        #log(f'widgets(): {widget_xrefs=}')
+        for xref in widget_xrefs:
+            widget = self.load_widget(xref)
+            if types is None or widget.field_type in types:
+                yield (widget)
+
+    def wrap_contents(self):
+        """Ensure page is in a balanced graphics state."""
+        push, pop = self._count_q_balance()  # count missing "q"/"Q" commands
+        if push > 0:  # prepend required push commands
+            prepend = b"q\n" * push
+            TOOLS._insert_contents(self, prepend, False)
+        if pop > 0:  # append required pop commands
+            append = b"\nQ" * pop + b"\n"
+            TOOLS._insert_contents(self, append, True)
+
+    @property
+    def xref(self):
+        """PDF xref number of page."""
+        CheckParent(self)
+        return self.parent.page_xref(self.number)
+
+    rect = property(bound, doc="page rectangle")
+
+
+class Pixmap:
+
+    def __init__(self, *args):
+        """
+        Pixmap(colorspace, irect, alpha) - empty pixmap.
+        Pixmap(colorspace, src) - copy changing colorspace.
+        Pixmap(src, width, height,[clip]) - scaled copy, float dimensions.
+        Pixmap(src, alpha=1) - copy and add or drop alpha channel.
+ Pixmap(filename) - from an image in a file. + Pixmap(image) - from an image in memory (bytes). + Pixmap(colorspace, width, height, samples, alpha) - from samples data. + Pixmap(PDFdoc, xref) - from an image at xref in a PDF document. + """ + # Cache for property `self.samples_mv`. Set here so __del_() sees it if + # we raise. + # + self._samples_mv = None + + # 2024-01-16: Experimental support for a memory-view of the underlying + # data. Doesn't seem to make much difference to Pixmap.set_pixel() so + # not currently used. + self._memory_view = None + + if 0: + pass + + elif args_match(args, + (Colorspace, mupdf.FzColorspace), + (mupdf.FzRect, mupdf.FzIrect, IRect, Rect, tuple) + ): + # create empty pixmap with colorspace and IRect + cs, rect = args + alpha = 0 + pm = mupdf.fz_new_pixmap_with_bbox(cs, JM_irect_from_py(rect), mupdf.FzSeparations(0), alpha) + self.this = pm + + elif args_match(args, + (Colorspace, mupdf.FzColorspace), + (mupdf.FzRect, mupdf.FzIrect, IRect, Rect, tuple), + (int, bool) + ): + # create empty pixmap with colorspace and IRect + cs, rect, alpha = args + pm = mupdf.fz_new_pixmap_with_bbox(cs, JM_irect_from_py(rect), mupdf.FzSeparations(0), alpha) + self.this = pm + + elif args_match(args, (Colorspace, mupdf.FzColorspace, type(None)), (Pixmap, mupdf.FzPixmap)): + # copy pixmap, converting colorspace + cs, spix = args + if isinstance(cs, Colorspace): + cs = cs.this + elif cs is None: + cs = mupdf.FzColorspace(None) + if isinstance(spix, Pixmap): + spix = spix.this + if not mupdf.fz_pixmap_colorspace(spix).m_internal: + raise ValueError( "source colorspace must not be None") + + if cs.m_internal: + self.this = mupdf.fz_convert_pixmap( + spix, + cs, + mupdf.FzColorspace(), + mupdf.FzDefaultColorspaces(None), + mupdf.FzColorParams(), + 1 + ) + else: + self.this = mupdf.fz_new_pixmap_from_alpha_channel( spix) + if not self.this.m_internal: + raise RuntimeError( MSG_PIX_NOALPHA) + + elif args_match(args, (Pixmap, mupdf.FzPixmap), (Pixmap, mupdf.FzPixmap)): + # add mask to a pixmap w/o alpha channel + spix, mpix = args + if isinstance(spix, Pixmap): + spix = spix.this + if isinstance(mpix, Pixmap): + mpix = mpix.this + spm = spix + mpm = mpix + if not spix.m_internal: # intercept NULL for spix: make alpha only pix + dst = mupdf.fz_new_pixmap_from_alpha_channel(mpm) + if not dst.m_internal: + raise RuntimeError( MSG_PIX_NOALPHA) + else: + dst = mupdf.fz_new_pixmap_from_color_and_mask(spm, mpm) + self.this = dst + + elif (args_match(args, (Pixmap, mupdf.FzPixmap), (float, int), (float, int), None) or + args_match(args, (Pixmap, mupdf.FzPixmap), (float, int), (float, int))): + # create pixmap as scaled copy of another one + if len(args) == 3: + spix, w, h = args + bbox = mupdf.FzIrect(mupdf.fz_infinite_irect) + else: + spix, w, h, clip = args + bbox = JM_irect_from_py(clip) + + src_pix = spix.this if isinstance(spix, Pixmap) else spix + if not mupdf.fz_is_infinite_irect(bbox): + pm = mupdf.fz_scale_pixmap(src_pix, src_pix.x(), src_pix.y(), w, h, bbox) + else: + pm = mupdf.fz_scale_pixmap(src_pix, src_pix.x(), src_pix.y(), w, h, mupdf.FzIrect(mupdf.fz_infinite_irect)) + self.this = pm + + elif args_match(args, str, (Pixmap, mupdf.FzPixmap)) and args[0] == 'raw': + # Special raw construction where we set .this directly. 
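+            # E.g. Pixmap('raw', fz_pixmap) adopts an existing pixmap without copying its samples.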
+ _, pm = args + if isinstance(pm, Pixmap): + pm = pm.this + self.this = pm + + elif args_match(args, (Pixmap, mupdf.FzPixmap), (int, None)): + # Pixmap(struct Pixmap *spix, int alpha=1) + # copy pixmap & add / drop the alpha channel + spix = args[0] + alpha = args[1] if len(args) == 2 else 1 + src_pix = spix.this if isinstance(spix, Pixmap) else spix + if not _INRANGE(alpha, 0, 1): + raise ValueError( "bad alpha value") + cs = mupdf.fz_pixmap_colorspace(src_pix) + if not cs.m_internal and not alpha: + raise ValueError( "cannot drop alpha for 'NULL' colorspace") + seps = mupdf.FzSeparations() + n = mupdf.fz_pixmap_colorants(src_pix) + w = mupdf.fz_pixmap_width(src_pix) + h = mupdf.fz_pixmap_height(src_pix) + pm = mupdf.fz_new_pixmap(cs, w, h, seps, alpha) + pm.m_internal.x = src_pix.m_internal.x + pm.m_internal.y = src_pix.m_internal.y + pm.m_internal.xres = src_pix.m_internal.xres + pm.m_internal.yres = src_pix.m_internal.yres + + # copy samples data ------------------------------------------ + if 1: + # We use our pixmap_copy() to get best performance. + # test_pixmap.py:test_setalpha(): 3.9s t=0.0062 + extra.pixmap_copy( pm.m_internal, src_pix.m_internal, n) + elif 1: + # Use memoryview. + # test_pixmap.py:test_setalpha(): 4.6 t=0.51 + src_view = mupdf.fz_pixmap_samples_memoryview( src_pix) + pm_view = mupdf.fz_pixmap_samples_memoryview( pm) + if src_pix.alpha() == pm.alpha(): # identical samples + #memcpy(tptr, sptr, w * h * (n + alpha)); + size = w * h * (n + alpha) + pm_view[ 0 : size] = src_view[ 0 : size] + else: + tptr = 0 + sptr = 0 + # This is a little faster than calling + # pm.fz_samples_set(), but still quite slow. E.g. reduces + # test_pixmap.py:test_setalpha() from 6.7s to 4.5s. + # + # t=0.53 + pm_stride = pm.stride() + pm_n = pm.n() + pm_alpha = pm.alpha() + src_stride = src_pix.stride() + src_n = src_pix.n() + #log( '{=pm_stride pm_n src_stride src_n}') + for y in range( h): + for x in range( w): + pm_i = pm_stride * y + pm_n * x + src_i = src_stride * y + src_n * x + pm_view[ pm_i : pm_i + n] = src_view[ src_i : src_i + n] + if pm_alpha: + pm_view[ pm_i + n] = 255 + else: + # Copy individual bytes from Python. Very slow. 
+ # test_pixmap.py:test_setalpha(): 6.89 t=2.601 + if src_pix.alpha() == pm.alpha(): # identical samples + #memcpy(tptr, sptr, w * h * (n + alpha)); + for i in range(w * h * (n + alpha)): + mupdf.fz_samples_set(pm, i, mupdf.fz_samples_get(src_pix, i)) + else: + # t=2.56 + tptr = 0 + sptr = 0 + src_pix_alpha = src_pix.alpha() + for i in range(w * h): + #memcpy(tptr, sptr, n); + for j in range(n): + mupdf.fz_samples_set(pm, tptr + j, mupdf.fz_samples_get(src_pix, sptr + j)) + tptr += n + if pm.alpha(): + mupdf.fz_samples_set(pm, tptr, 255) + tptr += 1 + sptr += n + src_pix_alpha + self.this = pm + + elif args_match(args, (mupdf.FzColorspace, Colorspace), int, int, None, (int, bool)): + # create pixmap from samples data + cs, w, h, samples, alpha = args + if isinstance(cs, Colorspace): + cs = cs.this + assert isinstance(cs, mupdf.FzColorspace) + n = mupdf.fz_colorspace_n(cs) + stride = (n + alpha) * w + seps = mupdf.FzSeparations() + pm = mupdf.fz_new_pixmap(cs, w, h, seps, alpha) + + if isinstance( samples, (bytes, bytearray)): + #log('using mupdf.python_buffer_data()') + samples2 = mupdf.python_buffer_data(samples) + size = len(samples) + else: + res = JM_BufferFromBytes(samples) + if not res.m_internal: + raise ValueError( "bad samples data") + size, c = mupdf.fz_buffer_storage(res) + samples2 = mupdf.python_buffer_data(samples) # raw swig proxy for `const unsigned char*`. + if stride * h != size: + raise ValueError( f"bad samples length {w=} {h=} {alpha=} {n=} {stride=} {size=}") + mupdf.ll_fz_pixmap_copy_raw( pm.m_internal, samples2) + self.this = pm + + elif args_match(args, None): + # create pixmap from filename, file object, pathlib.Path or memory + imagedata, = args + name = 'name' + if hasattr(imagedata, "resolve"): + fname = imagedata.__str__() + if fname: + img = mupdf.fz_new_image_from_file(fname) + elif hasattr(imagedata, name): + fname = imagedata.name + if fname: + img = mupdf.fz_new_image_from_file(fname) + elif isinstance(imagedata, str): + img = mupdf.fz_new_image_from_file(imagedata) + else: + res = JM_BufferFromBytes(imagedata) + if not res.m_internal or not res.m_internal.len: + raise ValueError( "bad image data") + img = mupdf.fz_new_image_from_buffer(res) + + # Original code passed null for subarea and ctm, but that's not + # possible with MuPDF's python bindings. The equivalent is an + # infinite rect and identify matrix scaled by img.w() and img.h(). + pm, w, h = mupdf.fz_get_pixmap_from_image( + img, + mupdf.FzIrect(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, FZ_MAX_INF_RECT), + mupdf.FzMatrix( img.w(), 0, 0, img.h(), 0, 0), + ) + xres, yres = mupdf.fz_image_resolution(img) + pm.m_internal.xres = xres + pm.m_internal.yres = yres + self.this = pm + + elif args_match(args, (Document, mupdf.FzDocument), int): + # Create pixmap from PDF image identified by XREF number + doc, xref = args + pdf = _as_pdf_document(doc) + xreflen = mupdf.pdf_xref_len(pdf) + if not _INRANGE(xref, 1, xreflen-1): + raise ValueError( MSG_BAD_XREF) + ref = mupdf.pdf_new_indirect(pdf, xref, 0) + type_ = mupdf.pdf_dict_get(ref, PDF_NAME('Subtype')) + if (not mupdf.pdf_name_eq(type_, PDF_NAME('Image')) + and not mupdf.pdf_name_eq(type_, PDF_NAME('Alpha')) + and not mupdf.pdf_name_eq(type_, PDF_NAME('Luminosity')) + ): + raise ValueError( MSG_IS_NO_IMAGE) + img = mupdf.pdf_load_image(pdf, ref) + # Original code passed null for subarea and ctm, but that's not + # possible with MuPDF's python bindings. The equivalent is an + # infinite rect and identify matrix scaled by img.w() and img.h(). 
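+ # In effect the whole image is decoded at its natural size, giving a
+ # pixmap of img.w() x img.h() pixels.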
+ pix, w, h = mupdf.fz_get_pixmap_from_image( + img, + mupdf.FzIrect(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, FZ_MAX_INF_RECT), + mupdf.FzMatrix(img.w(), 0, 0, img.h(), 0, 0), + ) + self.this = pix + + else: + text = 'Unrecognised args for constructing Pixmap:\n' + for arg in args: + text += f' {type(arg)}: {arg}\n' + raise Exception( text) + + def __len__(self): + return self.size + + def __repr__(self): + if not type(self) is Pixmap: return + if self.colorspace: + return "Pixmap(%s, %s, %s)" % (self.colorspace.this.m_internal.name, self.irect, self.alpha) + else: + return "Pixmap(%s, %s, %s)" % ('None', self.irect, self.alpha) + + def _tobytes(self, format_, jpg_quality): + ''' + Pixmap._tobytes + ''' + pm = self.this + size = mupdf.fz_pixmap_stride(pm) * pm.h() + res = mupdf.fz_new_buffer(size) + out = mupdf.FzOutput(res) + if format_ == 1: mupdf.fz_write_pixmap_as_png(out, pm) + elif format_ == 2: mupdf.fz_write_pixmap_as_pnm(out, pm) + elif format_ == 3: mupdf.fz_write_pixmap_as_pam(out, pm) + elif format_ == 5: mupdf.fz_write_pixmap_as_psd(out, pm) + elif format_ == 6: mupdf.fz_write_pixmap_as_ps(out, pm) + elif format_ == 7: + mupdf.fz_write_pixmap_as_jpeg(out, pm, jpg_quality, 0) + else: + mupdf.fz_write_pixmap_as_png(out, pm) + out.fz_close_output() + barray = JM_BinFromBuffer(res) + return barray + + def _writeIMG(self, filename, format_, jpg_quality): + pm = self.this + if format_ == 1: mupdf.fz_save_pixmap_as_png(pm, filename) + elif format_ == 2: mupdf.fz_save_pixmap_as_pnm(pm, filename) + elif format_ == 3: mupdf.fz_save_pixmap_as_pam(pm, filename) + elif format_ == 5: mupdf.fz_save_pixmap_as_psd(pm, filename) + elif format_ == 6: mupdf.fz_save_pixmap_as_ps(pm, filename) + elif format_ == 7: mupdf.fz_save_pixmap_as_jpeg(pm, filename, jpg_quality) + else: mupdf.fz_save_pixmap_as_png(pm, filename) + + @property + def alpha(self): + """Indicates presence of alpha channel.""" + return mupdf.fz_pixmap_alpha(self.this) + + def clear_with(self, value=None, bbox=None): + """Fill all color components with same value.""" + if value is None: + mupdf.fz_clear_pixmap(self.this) + elif bbox is None: + mupdf.fz_clear_pixmap_with_value(self.this, value) + else: + JM_clear_pixmap_rect_with_value(self.this, value, JM_irect_from_py(bbox)) + + def color_count(self, colors=0, clip=None): + ''' + Return count of each color. 
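+
+ With the default `colors=0` only the number of distinct pixel values is
+ returned; with a truthy `colors` the full mapping is returned. A rough
+ usage sketch, assuming `pix` is a Pixmap:
+
+ distinct = pix.color_count() # int: number of distinct pixel values
+ usage = pix.color_count(colors=1) # dict mapping each pixel value to its count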
+ ''' + pm = self.this + rc = JM_color_count( pm, clip) + if not colors: + return len( rc) + return rc + + def color_topusage(self, clip=None): + """Return most frequent color and its usage ratio.""" + allpixels = 0 + cnt = 0 + if clip is not None and self.irect in Rect(clip): + clip = self.irect + for pixel, count in self.color_count(colors=True,clip=clip).items(): + allpixels += count + if count > cnt: + cnt = count + maxpixel = pixel + if not allpixels: + return (1, bytes([255] * self.n)) + return (cnt / allpixels, maxpixel) + + @property + def colorspace(self): + """Pixmap Colorspace.""" + cs = Colorspace(mupdf.fz_pixmap_colorspace(self.this)) + if cs.name == "None": + return None + return cs + + def copy(self, src, bbox): + """Copy bbox from another Pixmap.""" + pm = self.this + src_pix = src.this + if not mupdf.fz_pixmap_colorspace(src_pix): + raise ValueError( "cannot copy pixmap with NULL colorspace") + if pm.alpha() != src_pix.alpha(): + raise ValueError( "source and target alpha must be equal") + mupdf.fz_copy_pixmap_rect(pm, src_pix, JM_irect_from_py(bbox), mupdf.FzDefaultColorspaces(None)) + + @property + def digest(self): + """MD5 digest of pixmap (bytes).""" + ret = mupdf.fz_md5_pixmap2(self.this) + return bytes(ret) + + def gamma_with(self, gamma): + """Apply correction with some float. + gamma=1 is a no-op.""" + if not mupdf.fz_pixmap_colorspace( self.this): + message_warning("colorspace invalid for function") + return + mupdf.fz_gamma_pixmap( self.this, gamma) + + @property + def h(self): + """The height.""" + return mupdf.fz_pixmap_height(self.this) + + def invert_irect(self, bbox=None): + """Invert the colors inside a bbox.""" + pm = self.this + if not mupdf.fz_pixmap_colorspace(pm).m_internal: + message_warning("ignored for stencil pixmap") + return False + r = JM_irect_from_py(bbox) + if mupdf.fz_is_infinite_irect(r): + mupdf.fz_invert_pixmap(pm) + return True + mupdf.fz_invert_pixmap_rect(pm, r) + return True + + @property + def irect(self): + """Pixmap bbox - an IRect object.""" + val = mupdf.fz_pixmap_bbox(self.this) + return JM_py_from_irect( val) + + @property + def is_monochrome(self): + """Check if pixmap is monochrome.""" + return mupdf.fz_is_pixmap_monochrome( self.this) + + @property + def is_unicolor(self): + ''' + Check if pixmap has only one color. + ''' + pm = self.this + n = pm.n() + count = pm.w() * pm.h() * n + def _pixmap_read_samples(pm, offset, n): + ret = list() + for i in range(n): + ret.append(mupdf.fz_samples_get(pm, offset+i)) + return ret + for offset in range( 0, count, n): + if offset == 0: + sample0 = _pixmap_read_samples( pm, 0, n) + else: + sample = _pixmap_read_samples( pm, offset, n) + if sample != sample0: + return False + return True + + @property + def n(self): + """The size of one pixel.""" + if g_use_extra: + # Setting self.__class__.n gives a small reduction in overhead of + # test_general.py:test_2093, e.g. 1.4x -> 1.3x. + #return extra.pixmap_n(self.this) + def n2(self): + return extra.pixmap_n(self.this) + self.__class__.n = property(n2) + return self.n + return mupdf.fz_pixmap_components(self.this) + + def pdfocr_save(self, filename, compress=1, language=None, tessdata=None): + ''' + Save pixmap as an OCR-ed PDF page. 
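+
+ `filename` may be a path string or a writable binary file object. A
+ minimal sketch, assuming Tesseract and its language data are available
+ and `pix` is a Pixmap:
+
+ pix.pdfocr_save("ocr-page.pdf", compress=1, language="eng")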
+ ''' + tessdata = get_tessdata(tessdata) + opts = mupdf.FzPdfocrOptions() + opts.compress = compress + if language: + opts.language_set2( language) + if tessdata: + opts.datadir_set2( tessdata) + pix = self.this + if isinstance(filename, str): + mupdf.fz_save_pixmap_as_pdfocr( pix, filename, 0, opts) + else: + out = JM_new_output_fileptr( filename) + try: + mupdf.fz_write_pixmap_as_pdfocr( out, pix, opts) + finally: + out.fz_close_output() # Avoid MuPDF warning. + + def pdfocr_tobytes(self, compress=True, language="eng", tessdata=None): + """Save pixmap as an OCR-ed PDF page. + + Args: + compress: (bool) compress, default 1 (True). + language: (str) language(s) occurring on page, default "eng" (English), + multiples like "eng+ger" for English and German. + tessdata: (str) folder name of Tesseract's language support. If None + we use environment variable TESSDATA_PREFIX or search for + Tesseract installation. + Notes: + On failure, make sure Tesseract is installed and you have set + or environment variable "TESSDATA_PREFIX" to the folder + containing your Tesseract's language support data. + """ + tessdata = get_tessdata(tessdata) + from io import BytesIO + bio = BytesIO() + self.pdfocr_save(bio, compress=compress, language=language, tessdata=tessdata) + return bio.getvalue() + + def pil_image(self): + """Create a Pillow Image from the Pixmap.""" + try: + from PIL import Image + except ImportError: + message("PIL/Pillow not installed") + raise + + cspace = self.colorspace + if not cspace: + mode = "L" + elif cspace.n == 1: + mode = "L" if not self.alpha else "LA" + elif cspace.n == 3: + mode = "RGB" if not self.alpha else "RGBA" + else: + mode = "CMYK" + + img = Image.frombytes(mode, (self.width, self.height), self.samples) + return img + + def pil_save(self, *args, **kwargs): + """Write to image file using Pillow. + + An intermediate PIL Image is created, and its "save" method is used + to store the image. See Pillow documentation to learn about the + meaning of possible positional and keyword parameters. + Use this when other output formats are desired. + """ + img = self.pil_image() + + if "dpi" not in kwargs.keys(): + kwargs["dpi"] = (self.xres, self.yres) + + img.save(*args, **kwargs) + + def pil_tobytes(self, *args, **kwargs): + """Convert to an image in memory using Pillow. + + An intermediate PIL Image is created, and its "save" method is used + to store the image. See Pillow documentation to learn about the + meaning of possible positional or keyword parameters. + Use this when other output formats are desired. + """ + bytes_out = io.BytesIO() + img = self.pil_image() + + if "dpi" not in kwargs.keys(): + kwargs["dpi"] = (self.xres, self.yres) + + img.save(bytes_out, *args, **kwargs) + return bytes_out.getvalue() + + def pixel(self, x, y): + """Get color tuple of pixel (x, y). + Last item is the alpha if Pixmap.alpha is true.""" + if g_use_extra: + return extra.pixmap_pixel(self.this.m_internal, x, y) + if (0 + or x < 0 + or x >= self.this.m_internal.w + or y < 0 + or y >= self.this.m_internal.h + ): + RAISEPY(MSG_PIXEL_OUTSIDE, PyExc_ValueError) + n = self.this.m_internal.n + stride = self.this.m_internal.stride + i = stride * y + n * x + ret = tuple( self.samples_mv[ i: i+n]) + return ret + + @property + def samples(self)->bytes: + mv = self.samples_mv + return bytes( mv) + + @property + def samples_mv(self): + ''' + Pixmap samples memoryview. 
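+
+ Unlike `Pixmap.samples` this does not copy the sample data - it is a
+ zero-copy view into the underlying buffer. Sketch, assuming `pix` is a
+ Pixmap:
+
+ mv = pix.samples_mv
+ first_pixel = tuple(mv[:pix.n]) # components of the top-left pixel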
+ ''' + # We remember the returned memoryview so that our `__del__()` can + # release it; otherwise accessing it after we have been destructed will + # fail, possibly crashing Python; this is #4155. + # + if self._samples_mv is None: + self._samples_mv = mupdf.fz_pixmap_samples_memoryview(self.this) + return self._samples_mv + + def _samples_mv_release(self): + if self._samples_mv: + self._samples_mv.release() + + @property + def samples_ptr(self): + return mupdf.fz_pixmap_samples_int(self.this) + + def save(self, filename, output=None, jpg_quality=95): + """Output as image in format determined by filename extension. + + Args: + output: (str) only use to overrule filename extension. Default is PNG. + Others are JPEG, JPG, PNM, PGM, PPM, PBM, PAM, PSD, PS. + """ + valid_formats = { + "png": 1, + "pnm": 2, + "pgm": 2, + "ppm": 2, + "pbm": 2, + "pam": 3, + "psd": 5, + "ps": 6, + "jpg": 7, + "jpeg": 7, + } + + if type(filename) is str: + pass + elif hasattr(filename, "absolute"): + filename = str(filename) + elif hasattr(filename, "name"): + filename = filename.name + if output is None: + _, ext = os.path.splitext(filename) + output = ext[1:] + + idx = valid_formats.get(output.lower(), None) + if idx is None: + raise ValueError(f"Image format {output} not in {tuple(valid_formats.keys())}") + if self.alpha and idx in (2, 6, 7): + raise ValueError("'%s' cannot have alpha" % output) + if self.colorspace and self.colorspace.n > 3 and idx in (1, 2, 4): + raise ValueError(f"unsupported colorspace for '{output}'") + if idx == 7: + self.set_dpi(self.xres, self.yres) + return self._writeIMG(filename, idx, jpg_quality) + + def set_alpha(self, alphavalues=None, premultiply=1, opaque=None, matte=None): + """Set alpha channel to values contained in a byte array. + If omitted, set alphas to 255. + + Args: + alphavalues: (bytes) with length (width * height) or 'None'. + premultiply: (bool, True) premultiply colors with alpha values. + opaque: (tuple, length colorspace.n) this color receives opacity 0. + matte: (tuple, length colorspace.n)) preblending background color. + """ + pix = self.this + alpha = 0 + m = 0 + if pix.alpha() == 0: + raise ValueError( MSG_PIX_NOALPHA) + n = mupdf.fz_pixmap_colorants(pix) + w = mupdf.fz_pixmap_width(pix) + h = mupdf.fz_pixmap_height(pix) + balen = w * h * (n+1) + colors = [0, 0, 0, 0] # make this color opaque + bgcolor = [0, 0, 0, 0] # preblending background color + zero_out = 0 + bground = 0 + if opaque and isinstance(opaque, (list, tuple)) and len(opaque) == n: + for i in range(n): + colors[i] = opaque[i] + zero_out = 1 + if matte and isinstance( matte, (tuple, list)) and len(matte) == n: + for i in range(n): + bgcolor[i] = matte[i] + bground = 1 + data = bytes() + data_len = 0 + if alphavalues: + #res = JM_BufferFromBytes(alphavalues) + #data_len, data = mupdf.fz_buffer_storage(res) + #if data_len < w * h: + # THROWMSG("bad alpha values") + # fixme: don't seem to need to create an fz_buffer - can + # use directly? + if isinstance(alphavalues, (bytes, bytearray)): + data = alphavalues + data_len = len(alphavalues) + else: + assert 0, f'unexpected type for alphavalues: {type(alphavalues)}' + if data_len < w * h: + raise ValueError( "bad alpha values") + if 1: + # Use C implementation for speed. 
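+ # Same semantics as the Python fallback below: each pixel's alpha byte is
+ # set from `data` (255 when no alpha values were given), pixels matching
+ # the `opaque` colour get alpha 0, and colours are premultiplied or
+ # matte-adjusted as requested.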
+ mupdf.Pixmap_set_alpha_helper( + balen, + n, + data_len, + zero_out, + mupdf.python_buffer_data( data), + pix.m_internal, + premultiply, + bground, + colors, + bgcolor, + ) + else: + i = k = j = 0 + data_fix = 255 + while i < balen: + alpha = data[k] + if zero_out: + for j in range(i, i+n): + if mupdf.fz_samples_get(pix, j) != colors[j - i]: + data_fix = 255 + break + else: + data_fix = 0 + if data_len: + def fz_mul255( a, b): + x = a * b + 128 + x += x // 256 + return x // 256 + + if data_fix == 0: + mupdf.fz_samples_set(pix, i+n, 0) + else: + mupdf.fz_samples_set(pix, i+n, alpha) + if premultiply and not bground: + for j in range(i, i+n): + mupdf.fz_samples_set(pix, j, fz_mul255( mupdf.fz_samples_get(pix, j), alpha)) + elif bground: + for j in range( i, i+n): + m = bgcolor[j - i] + mupdf.fz_samples_set(pix, j, fz_mul255( mupdf.fz_samples_get(pix, j) - m, alpha)) + else: + mupdf.fz_samples_set(pix, i+n, data_fix) + i += n+1 + k += 1 + + def tobytes(self, output="png", jpg_quality=95): + ''' + Convert to binary image stream of desired type. + ''' + valid_formats = { + "png": 1, + "pnm": 2, + "pgm": 2, + "ppm": 2, + "pbm": 2, + "pam": 3, + "tga": 4, + "tpic": 4, + "psd": 5, + "ps": 6, + 'jpg': 7, + 'jpeg': 7, + } + idx = valid_formats.get(output.lower(), None) + if idx is None: + raise ValueError(f"Image format {output} not in {tuple(valid_formats.keys())}") + if self.alpha and idx in (2, 6, 7): + raise ValueError("'{output}' cannot have alpha") + if self.colorspace and self.colorspace.n > 3 and idx in (1, 2, 4): + raise ValueError(f"unsupported colorspace for '{output}'") + if idx == 7: + self.set_dpi(self.xres, self.yres) + barray = self._tobytes(idx, jpg_quality) + return barray + + def set_dpi(self, xres, yres): + """Set resolution in both dimensions.""" + pm = self.this + pm.m_internal.xres = xres + pm.m_internal.yres = yres + + def set_origin(self, x, y): + """Set top-left coordinates.""" + pm = self.this + pm.m_internal.x = x + pm.m_internal.y = y + + def set_pixel(self, x, y, color): + """Set color of pixel (x, y).""" + if g_use_extra: + return extra.set_pixel(self.this.m_internal, x, y, color) + pm = self.this + if not _INRANGE(x, 0, pm.w() - 1) or not _INRANGE(y, 0, pm.h() - 1): + raise ValueError( MSG_PIXEL_OUTSIDE) + n = pm.n() + for j in range(n): + i = color[j] + if not _INRANGE(i, 0, 255): + raise ValueError( MSG_BAD_COLOR_SEQ) + stride = mupdf.fz_pixmap_stride( pm) + i = stride * y + n * x + if 0: + # Using a cached self._memory_view doesn't actually make much + # difference to speed. + if not self._memory_view: + self._memory_view = self.samples_mv + for j in range(n): + self._memory_view[i + j] = color[j] + else: + for j in range(n): + pm.fz_samples_set(i + j, color[j]) + + def set_rect(self, bbox, color): + """Set color of all pixels in bbox.""" + pm = self.this + n = pm.n() + c = [] + for j in range(n): + i = color[j] + if not _INRANGE(i, 0, 255): + raise ValueError( MSG_BAD_COLOR_SEQ) + c.append(i) + bbox = JM_irect_from_py(bbox) + i = JM_fill_pixmap_rect_with_color(pm, c, bbox) + rc = bool(i) + return rc + + def shrink(self, factor): + """Divide width and height by 2**factor. + E.g. factor=1 shrinks to 25% of original size (in place).""" + if factor < 1: + message_warning("ignoring shrink factor < 1") + return + mupdf.fz_subsample_pixmap( self.this, factor) + # Pixmap has changed so clear our memory view. 
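+ # (A memoryview obtained earlier via `samples_mv` would now describe the
+ # old geometry, so the cached views are released here.)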
+ self._memory_view = None + self._samples_mv_release() + + @property + def size(self): + """Pixmap size.""" + return mupdf.fz_pixmap_size( self.this) + + @property + def stride(self): + """Length of one image line (width * n).""" + return self.this.stride() + + def tint_with(self, black, white): + """Tint colors with modifiers for black and white.""" + if not self.colorspace or self.colorspace.n > 3: + message("warning: colorspace invalid for function") + return + return mupdf.fz_tint_pixmap( self.this, black, white) + + @property + def w(self): + """The width.""" + return mupdf.fz_pixmap_width(self.this) + + def warp(self, quad, width, height): + """Return pixmap from a warped quad.""" + if not quad.is_convex: raise ValueError("quad must be convex") + q = JM_quad_from_py(quad) + points = [ q.ul, q.ur, q.lr, q.ll] + dst = mupdf.fz_warp_pixmap( self.this, points, width, height) + return Pixmap( dst) + + @property + def x(self): + """x component of Pixmap origin.""" + return mupdf.fz_pixmap_x(self.this) + + @property + def xres(self): + """Resolution in x direction.""" + return self.this.xres() + + @property + def y(self): + """y component of Pixmap origin.""" + return mupdf.fz_pixmap_y(self.this) + + @property + def yres(self): + """Resolution in y direction.""" + return self.this.yres() + + width = w + height = h + + def __del__(self): + if self._samples_mv: + self._samples_mv.release() + + +del Point +class Point: + + def __abs__(self): + return math.sqrt(self.x * self.x + self.y * self.y) + + def __add__(self, p): + if hasattr(p, "__float__"): + return Point(self.x + p, self.y + p) + if len(p) != 2: + raise ValueError("Point: bad seq len") + return Point(self.x + p[0], self.y + p[1]) + + def __bool__(self): + return not (max(self) == min(self) == 0) + + def __eq__(self, p): + if not hasattr(p, "__len__"): + return False + return len(p) == 2 and not (self - p) + + def __getitem__(self, i): + return (self.x, self.y)[i] + + def __hash__(self): + return hash(tuple(self)) + + def __init__(self, *args, x=None, y=None): + ''' + Point() - all zeros + Point(x, y) + Point(Point) - new copy + Point(sequence) - from 'sequence' + + Explicit keyword args x, y override earlier settings if not None. 
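+
+ For example, per the rules above:
+
+ Point() == Point(0.0, 0.0)
+ Point((3, 4)) == Point(3.0, 4.0)
+ Point(Point(1, 2), y=7) == Point(1.0, 7.0)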
+ ''' + if not args: + self.x = 0.0 + self.y = 0.0 + elif len(args) > 2: + raise ValueError("Point: bad seq len") + elif len(args) == 2: + self.x = float(args[0]) + self.y = float(args[1]) + elif len(args) == 1: + l = args[0] + if isinstance(l, (mupdf.FzPoint, mupdf.fz_point)): + self.x = l.x + self.y = l.y + else: + if not hasattr(l, "__getitem__"): + raise ValueError("Point: bad args") + if len(l) != 2: + raise ValueError("Point: bad seq len") + self.x = float(l[0]) + self.y = float(l[1]) + else: + raise ValueError("Point: bad seq len") + if x is not None: self.x = x + if y is not None: self.y = y + + def __len__(self): + return 2 + + def __mul__(self, m): + if hasattr(m, "__float__"): + return Point(self.x * m, self.y * m) + if hasattr(m, "__getitem__") and len(m) == 2: + # dot product + return self.x * m[0] + self.y * m[1] + p = Point(self) + return p.transform(m) + + def __neg__(self): + return Point(-self.x, -self.y) + + def __nonzero__(self): + return not (max(self) == min(self) == 0) + + def __pos__(self): + return Point(self) + + def __repr__(self): + return "Point" + str(tuple(self)) + + def __setitem__(self, i, v): + v = float(v) + if i == 0: self.x = v + elif i == 1: self.y = v + else: + raise IndexError("index out of range") + return None + + def __sub__(self, p): + if hasattr(p, "__float__"): + return Point(self.x - p, self.y - p) + if len(p) != 2: + raise ValueError("Point: bad seq len") + return Point(self.x - p[0], self.y - p[1]) + + def __truediv__(self, m): + if hasattr(m, "__float__"): + return Point(self.x * 1./m, self.y * 1./m) + m1 = util_invert_matrix(m)[1] + if not m1: + raise ZeroDivisionError("matrix not invertible") + p = Point(self) + return p.transform(m1) + + @property + def abs_unit(self): + """Unit vector with positive coordinates.""" + s = self.x * self.x + self.y * self.y + if s < EPSILON: + return Point(0,0) + s = math.sqrt(s) + return Point(abs(self.x) / s, abs(self.y) / s) + + def distance_to(self, *args): + """Return distance to rectangle or another point.""" + if not len(args) > 0: + raise ValueError("at least one parameter must be given") + + x = args[0] + if len(x) == 2: + x = Point(x) + elif len(x) == 4: + x = Rect(x) + else: + raise ValueError("arg1 must be point-like or rect-like") + + if len(args) > 1: + unit = args[1] + else: + unit = "px" + u = {"px": (1.,1.), "in": (1.,72.), "cm": (2.54, 72.), + "mm": (25.4, 72.)} + f = u[unit][0] / u[unit][1] + + if type(x) is Point: + return abs(self - x) * f + + # from here on, x is a rectangle + # as a safeguard, make a finite copy of it + r = Rect(x.top_left, x.top_left) + r = r | x.bottom_right + if self in r: + return 0.0 + if self.x > r.x1: + if self.y >= r.y1: + return self.distance_to(r.bottom_right, unit) + elif self.y <= r.y0: + return self.distance_to(r.top_right, unit) + else: + return (self.x - r.x1) * f + elif r.x0 <= self.x <= r.x1: + if self.y >= r.y1: + return (self.y - r.y1) * f + else: + return (r.y0 - self.y) * f + else: + if self.y >= r.y1: + return self.distance_to(r.bottom_left, unit) + elif self.y <= r.y0: + return self.distance_to(r.top_left, unit) + else: + return (r.x0 - self.x) * f + + def transform(self, m): + """Replace point by its transformation with matrix-like m.""" + if len(m) != 6: + raise ValueError("Matrix: bad seq len") + self.x, self.y = util_transform_point(self, m) + return self + + @property + def unit(self): + """Unit vector of the point.""" + s = self.x * self.x + self.y * self.y + if s < EPSILON: + return Point(0,0) + s = math.sqrt(s) + return Point(self.x / s, 
self.y / s) + + __div__ = __truediv__ + norm = __abs__ + + +class Quad: + + def __abs__(self): + if self.is_empty: + return 0.0 + return abs(self.ul - self.ur) * abs(self.ul - self.ll) + + def __add__(self, q): + if hasattr(q, "__float__"): + return Quad(self.ul + q, self.ur + q, self.ll + q, self.lr + q) + if len(q) != 4: + raise ValueError("Quad: bad seq len") + return Quad(self.ul + q[0], self.ur + q[1], self.ll + q[2], self.lr + q[3]) + + def __bool__(self): + return not self.is_empty + + def __contains__(self, x): + try: + l = x.__len__() + except Exception: + if g_exceptions_verbose > 1: exception_info() + return False + if l == 2: + return util_point_in_quad(x, self) + if l != 4: + return False + if CheckRect(x): + if Rect(x).is_empty: + return True + return util_point_in_quad(x[:2], self) and util_point_in_quad(x[2:], self) + if CheckQuad(x): + for i in range(4): + if not util_point_in_quad(x[i], self): + return False + return True + return False + + def __eq__(self, quad): + if not hasattr(quad, "__len__"): + return False + return len(quad) == 4 and ( + self.ul == quad[0] and + self.ur == quad[1] and + self.ll == quad[2] and + self.lr == quad[3] + ) + + def __getitem__(self, i): + return (self.ul, self.ur, self.ll, self.lr)[i] + + def __hash__(self): + return hash(tuple(self)) + + def __init__(self, *args, ul=None, ur=None, ll=None, lr=None): + ''' + Quad() - all zero points + Quad(ul, ur, ll, lr) + Quad(quad) - new copy + Quad(sequence) - from 'sequence' + + Explicit keyword args ul, ur, ll, lr override earlier settings if not + None. + + ''' + if not args: + self.ul = self.ur = self.ll = self.lr = Point() + elif len(args) > 4: + raise ValueError("Quad: bad seq len") + elif len(args) == 4: + self.ul, self.ur, self.ll, self.lr = map(Point, args) + elif len(args) == 1: + l = args[0] + if isinstance(l, mupdf.FzQuad): + self.this = l + self.ul, self.ur, self.ll, self.lr = Point(l.ul), Point(l.ur), Point(l.ll), Point(l.lr) + elif not hasattr(l, "__getitem__"): + raise ValueError("Quad: bad args") + elif len(l) != 4: + raise ValueError("Quad: bad seq len") + else: + self.ul, self.ur, self.ll, self.lr = map(Point, l) + else: + raise ValueError("Quad: bad args") + if ul is not None: self.ul = Point(ul) + if ur is not None: self.ur = Point(ur) + if ll is not None: self.ll = Point(ll) + if lr is not None: self.lr = Point(lr) + + def __len__(self): + return 4 + + def __mul__(self, m): + q = Quad(self) + q = q.transform(m) + return q + + def __neg__(self): + return Quad(-self.ul, -self.ur, -self.ll, -self.lr) + + def __nonzero__(self): + return not self.is_empty + + def __pos__(self): + return Quad(self) + + def __repr__(self): + return "Quad" + str(tuple(self)) + + def __setitem__(self, i, v): + if i == 0: self.ul = Point(v) + elif i == 1: self.ur = Point(v) + elif i == 2: self.ll = Point(v) + elif i == 3: self.lr = Point(v) + else: + raise IndexError("index out of range") + return None + + def __sub__(self, q): + if hasattr(q, "__float__"): + return Quad(self.ul - q, self.ur - q, self.ll - q, self.lr - q) + if len(q) != 4: + raise ValueError("Quad: bad seq len") + return Quad(self.ul - q[0], self.ur - q[1], self.ll - q[2], self.lr - q[3]) + + def __truediv__(self, m): + if hasattr(m, "__float__"): + im = 1. / m + else: + im = util_invert_matrix(m)[1] + if not im: + raise ZeroDivisionError("Matrix not invertible") + q = Quad(self) + q = q.transform(im) + return q + + @property + def is_convex(self): + """Check if quad is convex and not degenerate. 
+ + Notes: + Check that for the two diagonals, the other two corners are not + on the same side of the diagonal. + Returns: + True or False. + """ + m = planish_line(self.ul, self.lr) # puts this diagonal on x-axis + p1 = self.ll * m # transform the + p2 = self.ur * m # other two points + if p1.y * p2.y > 0: + return False + m = planish_line(self.ll, self.ur) # puts other diagonal on x-axis + p1 = self.lr * m # transform the + p2 = self.ul * m # remaining points + if p1.y * p2.y > 0: + return False + return True + + @property + def is_empty(self): + """Check whether all quad corners are on the same line. + + This is the case if width or height is zero. + """ + return self.width < EPSILON or self.height < EPSILON + + @property + def is_infinite(self): + """Check whether this is the infinite quad.""" + return self.rect.is_infinite + + @property + def is_rectangular(self): + """Check if quad is rectangular. + + Notes: + Some rotation matrix can thus transform it into a rectangle. + This is equivalent to three corners enclose 90 degrees. + Returns: + True or False. + """ + + sine = util_sine_between(self.ul, self.ur, self.lr) + if abs(sine - 1) > EPSILON: # the sine of the angle + return False + + sine = util_sine_between(self.ur, self.lr, self.ll) + if abs(sine - 1) > EPSILON: + return False + + sine = util_sine_between(self.lr, self.ll, self.ul) + if abs(sine - 1) > EPSILON: + return False + + return True + + def morph(self, p, m): + """Morph the quad with matrix-like 'm' and point-like 'p'. + + Return a new quad.""" + if self.is_infinite: + return INFINITE_QUAD() + delta = Matrix(1, 1).pretranslate(p.x, p.y) + q = self * ~delta * m * delta + return q + + @property + def rect(self): + r = Rect() + r.x0 = min(self.ul.x, self.ur.x, self.lr.x, self.ll.x) + r.y0 = min(self.ul.y, self.ur.y, self.lr.y, self.ll.y) + r.x1 = max(self.ul.x, self.ur.x, self.lr.x, self.ll.x) + r.y1 = max(self.ul.y, self.ur.y, self.lr.y, self.ll.y) + return r + + def transform(self, m): + """Replace quad by its transformation with matrix m.""" + if hasattr(m, "__float__"): + pass + elif len(m) != 6: + raise ValueError("Matrix: bad seq len") + self.ul *= m + self.ur *= m + self.ll *= m + self.lr *= m + return self + + __div__ = __truediv__ + width = property(lambda self: max(abs(self.ul - self.ur), abs(self.ll - self.lr))) + height = property(lambda self: max(abs(self.ul - self.ll), abs(self.ur - self.lr))) + + +class Rect: + + def __abs__(self): + if self.is_empty or self.is_infinite: + return 0.0 + return (self.x1 - self.x0) * (self.y1 - self.y0) + + def __add__(self, p): + if hasattr(p, "__float__"): + return Rect(self.x0 + p, self.y0 + p, self.x1 + p, self.y1 + p) + if len(p) != 4: + raise ValueError("Rect: bad seq len") + return Rect(self.x0 + p[0], self.y0 + p[1], self.x1 + p[2], self.y1 + p[3]) + + def __and__(self, x): + if not hasattr(x, "__len__"): + raise ValueError("bad operand 2") + + r1 = Rect(x) + r = Rect(self) + return r.intersect(r1) + + def __bool__(self): + return not (max(self) == min(self) == 0) + + def __contains__(self, x): + if hasattr(x, "__float__"): + return x in tuple(self) + l = len(x) + if l == 2: + return util_is_point_in_rect(x, self) + if l == 4: + r = INFINITE_RECT() + try: + r = Rect(x) + except Exception: + if g_exceptions_verbose > 1: exception_info() + r = Quad(x).rect + return (self.x0 <= r.x0 <= r.x1 <= self.x1 and + self.y0 <= r.y0 <= r.y1 <= self.y1) + return False + + def __eq__(self, rect): + if not hasattr(rect, "__len__"): + return False + return len(rect) == 4 and not (self 
- rect) + + def __getitem__(self, i): + return (self.x0, self.y0, self.x1, self.y1)[i] + + def __hash__(self): + return hash(tuple(self)) + + def __init__(self, *args, p0=None, p1=None, x0=None, y0=None, x1=None, y1=None): + """ + Rect() - all zeros + Rect(x0, y0, x1, y1) + Rect(top-left, x1, y1) + Rect(x0, y0, bottom-right) + Rect(top-left, bottom-right) + Rect(Rect or IRect) - new copy + Rect(sequence) - from 'sequence' + + Explicit keyword args p0, p1, x0, y0, x1, y1 override earlier settings + if not None. + """ + x0, y0, x1, y1 = util_make_rect( *args, p0=p0, p1=p1, x0=x0, y0=y0, x1=x1, y1=y1) + self.x0 = float( x0) + self.y0 = float( y0) + self.x1 = float( x1) + self.y1 = float( y1) + + def __len__(self): + return 4 + + def __mul__(self, m): + if hasattr(m, "__float__"): + return Rect(self.x0 * m, self.y0 * m, self.x1 * m, self.y1 * m) + r = Rect(self) + r = r.transform(m) + return r + + def __neg__(self): + return Rect(-self.x0, -self.y0, -self.x1, -self.y1) + + def __nonzero__(self): + return not (max(self) == min(self) == 0) + + def __or__(self, x): + if not hasattr(x, "__len__"): + raise ValueError("bad operand 2") + r = Rect(self) + if len(x) == 2: + return r.include_point(x) + if len(x) == 4: + return r.include_rect(x) + raise ValueError("bad operand 2") + + def __pos__(self): + return Rect(self) + + def __repr__(self): + return "Rect" + str(tuple(self)) + + def __setitem__(self, i, v): + v = float(v) + if i == 0: self.x0 = v + elif i == 1: self.y0 = v + elif i == 2: self.x1 = v + elif i == 3: self.y1 = v + else: + raise IndexError("index out of range") + return None + + def __sub__(self, p): + if hasattr(p, "__float__"): + return Rect(self.x0 - p, self.y0 - p, self.x1 - p, self.y1 - p) + if len(p) != 4: + raise ValueError("Rect: bad seq len") + return Rect(self.x0 - p[0], self.y0 - p[1], self.x1 - p[2], self.y1 - p[3]) + + def __truediv__(self, m): + if hasattr(m, "__float__"): + return Rect(self.x0 * 1./m, self.y0 * 1./m, self.x1 * 1./m, self.y1 * 1./m) + im = util_invert_matrix(m)[1] + if not im: + raise ZeroDivisionError(f"Matrix not invertible: {m}") + r = Rect(self) + r = r.transform(im) + return r + + @property + def bottom_left(self): + """Bottom-left corner.""" + return Point(self.x0, self.y1) + + @property + def bottom_right(self): + """Bottom-right corner.""" + return Point(self.x1, self.y1) + + def contains(self, x): + """Check if containing point-like or rect-like x.""" + return self.__contains__(x) + + @property + def height(self): + return max(0, self.y1 - self.y0) + + def include_point(self, p): + """Extend to include point-like p.""" + if len(p) != 2: + raise ValueError("Point: bad seq len") + self.x0, self.y0, self.x1, self.y1 = util_include_point_in_rect(self, p) + return self + + def include_rect(self, r): + """Extend to include rect-like r.""" + if len(r) != 4: + raise ValueError("Rect: bad seq len") + r = Rect(r) + if r.is_infinite or self.is_infinite: + self.x0, self.y0, self.x1, self.y1 = FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, FZ_MAX_INF_RECT + elif r.is_empty: + return self + elif self.is_empty: + self.x0, self.y0, self.x1, self.y1 = r.x0, r.y0, r.x1, r.y1 + else: + self.x0, self.y0, self.x1, self.y1 = util_union_rect(self, r) + return self + + def intersect(self, r): + """Restrict to common rect with rect-like r.""" + if not len(r) == 4: + raise ValueError("Rect: bad seq len") + r = Rect(r) + if r.is_infinite: + return self + elif self.is_infinite: + self.x0, self.y0, self.x1, self.y1 = r.x0, r.y0, r.x1, r.y1 + elif r.is_empty: + self.x0, 
self.y0, self.x1, self.y1 = r.x0, r.y0, r.x1, r.y1 + elif self.is_empty: + return self + else: + self.x0, self.y0, self.x1, self.y1 = util_intersect_rect(self, r) + return self + + def intersects(self, x): + """Check if intersection with rectangle x is not empty.""" + rect2 = Rect(x) + return (1 + and not self.is_empty + and not self.is_infinite + and not rect2.is_empty + and not rect2.is_infinite + and self.x0 < rect2.x1 + and rect2.x0 < self.x1 + and self.y0 < rect2.y1 + and rect2.y0 < self.y1 + ) + + @property + def is_empty(self): + """True if rectangle area is empty.""" + return self.x0 >= self.x1 or self.y0 >= self.y1 + + @property + def is_infinite(self): + """True if this is the infinite rectangle.""" + return self.x0 == self.y0 == FZ_MIN_INF_RECT and self.x1 == self.y1 == FZ_MAX_INF_RECT + + @property + def is_valid(self): + """True if rectangle is valid.""" + return self.x0 <= self.x1 and self.y0 <= self.y1 + + def morph(self, p, m): + """Morph with matrix-like m and point-like p. + + Returns a new quad.""" + if self.is_infinite: + return INFINITE_QUAD() + return self.quad.morph(p, m) + + def norm(self): + return math.sqrt(sum([c*c for c in self])) + + def normalize(self): + """Replace rectangle with its finite version.""" + if self.x1 < self.x0: + self.x0, self.x1 = self.x1, self.x0 + if self.y1 < self.y0: + self.y0, self.y1 = self.y1, self.y0 + return self + + @property + def quad(self): + """Return Quad version of rectangle.""" + return Quad(self.tl, self.tr, self.bl, self.br) + + def round(self): + """Return the IRect.""" + return IRect(util_round_rect(self)) + + @property + def top_left(self): + """Top-left corner.""" + return Point(self.x0, self.y0) + + @property + def top_right(self): + """Top-right corner.""" + return Point(self.x1, self.y0) + + def torect(self, r): + """Return matrix that converts to target rect.""" + + r = Rect(r) + if self.is_infinite or self.is_empty or r.is_infinite or r.is_empty: + raise ValueError("rectangles must be finite and not empty") + return ( + Matrix(1, 0, 0, 1, -self.x0, -self.y0) + * Matrix(r.width / self.width, r.height / self.height) + * Matrix(1, 0, 0, 1, r.x0, r.y0) + ) + + def transform(self, m): + """Replace with the transformation by matrix-like m.""" + if not len(m) == 6: + raise ValueError("Matrix: bad seq len") + self.x0, self.y0, self.x1, self.y1 = util_transform_rect(self, m) + return self + + @property + def width(self): + return max(0, self.x1 - self.x0) + + __div__ = __truediv__ + + bl = bottom_left + br = bottom_right + irect = property(round) + tl = top_left + tr = top_right + + +class Story: + + def __init__( self, html='', user_css=None, em=12, archive=None): + buffer_ = mupdf.fz_new_buffer_from_copied_data( html.encode('utf-8')) + if archive and not isinstance(archive, Archive): + archive = Archive(archive) + arch = archive.this if archive else mupdf.FzArchive( None) + if hasattr(mupdf, 'FzStoryS'): + self.this = mupdf.FzStoryS( buffer_, user_css, em, arch) + else: + self.this = mupdf.FzStory( buffer_, user_css, em, arch) + + def add_header_ids(self): + ''' + Look for `` items in `self` and adds unique `id` + attributes if not already present. 
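+ (Header items are the HTML heading elements `h1` .. `h6`; each one that
+ has no `id` yet receives a generated one of the form `h_id_<n>`.)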
+ ''' + dom = self.body + i = 0 + x = dom.find(None, None, None) + while x: + name = x.tagname + if len(name) == 2 and name[0]=="h" and name[1] in "123456": + attr = x.get_attribute_value("id") + if not attr: + id_ = f"h_id_{i}" + #log(f"{name=}: setting {id_=}") + x.set_attribute("id", id_) + i += 1 + x = x.find_next(None, None, None) + + @staticmethod + def add_pdf_links(document_or_stream, positions): + """ + Adds links to PDF document. + Args: + document_or_stream: + A PDF `Document` or raw PDF content, for example an + `io.BytesIO` instance. + positions: + List of `ElementPosition`'s for `document_or_stream`, + typically from Story.element_positions(). We raise an + exception if two or more positions have same id. + Returns: + `document_or_stream` if a `Document` instance, otherwise a + new `Document` instance. + We raise an exception if an `href` in `positions` refers to an + internal position `#` but no item in `positions` has `id = + name`. + """ + if isinstance(document_or_stream, Document): + document = document_or_stream + else: + document = Document("pdf", document_or_stream) + + # Create dict from id to position, which we will use to find + # link destinations. + # + id_to_position = dict() + #log(f"positions: {positions}") + for position in positions: + #log(f"add_pdf_links(): position: {position}") + if (position.open_close & 1) and position.id: + #log(f"add_pdf_links(): position with id: {position}") + if position.id in id_to_position: + #log(f"Ignoring duplicate positions with id={position.id!r}") + pass + else: + id_to_position[ position.id] = position + + # Insert links for all positions that have an `href`. + # + for position_from in positions: + + if (position_from.open_close & 1) and position_from.href: + + #log(f"add_pdf_links(): position with href: {position}") + link = dict() + link['from'] = Rect(position_from.rect) + + if position_from.href.startswith("#"): + #`...` internal link. + target_id = position_from.href[1:] + try: + position_to = id_to_position[ target_id] + except Exception as e: + if g_exceptions_verbose > 1: exception_info() + raise RuntimeError(f"No destination with id={target_id}, required by position_from: {position_from}") from e + # Make link from `position_from`'s rect to top-left of + # `position_to`'s rect. + if 0: + log(f"add_pdf_links(): making link from:") + log(f"add_pdf_links(): {position_from}") + log(f"add_pdf_links(): to:") + log(f"add_pdf_links(): {position_to}") + link["kind"] = LINK_GOTO + x0, y0, x1, y1 = position_to.rect + # This appears to work well with viewers which scroll + # to make destination point top-left of window. + link["to"] = Point(x0, y0) + link["page"] = position_to.page_num - 1 + + else: + # `...` external link. + if position_from.href.startswith('name:'): + link['kind'] = LINK_NAMED + link['name'] = position_from.href[5:] + else: + link['kind'] = LINK_URI + link['uri'] = position_from.href + + #log(f'Adding link: {position_from.page_num=} {link=}.') + document[position_from.page_num - 1].insert_link(link) + + return document + + @property + def body(self): + dom = self.document() + return dom.bodytag() + + def document( self): + dom = mupdf.fz_story_document( self.this) + return Xml( dom) + + def draw( self, device, matrix=None): + ctm2 = JM_matrix_from_py( matrix) + dev = device.this if device else mupdf.FzDevice( None) + mupdf.fz_draw_story( self.this, dev, ctm2) + + def element_positions( self, function, args=None): + ''' + Trigger a callback function to record where items have been placed. 
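+
+ The callable must accept exactly one argument. It receives an object with
+ attributes `depth`, `heading`, `id`, `href`, `rect`, `rect_num`, `text`
+ and `open_close`, plus any extra key/value pairs supplied via `args`. A
+ rough sketch, assuming `story` is a Story that has just been placed:
+
+ recorded = []
+ story.element_positions(lambda pos: recorded.append((pos.heading, pos.rect)))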
+ ''' + if type(args) is dict: + for k in args.keys(): + if not (type(k) is str and k.isidentifier()): + raise ValueError(f"invalid key '{k}'") + else: + args = {} + if not callable(function) or function.__code__.co_argcount != 1: + raise ValueError("callback 'function' must be a callable with exactly one argument") + + def function2( position): + class Position2: + pass + position2 = Position2() + position2.depth = position.depth + position2.heading = position.heading + position2.id = position.id + position2.rect = JM_py_from_rect(position.rect) + position2.text = position.text + position2.open_close = position.open_close + position2.rect_num = position.rectangle_num + position2.href = position.href + if args: + for k, v in args.items(): + setattr( position2, k, v) + function( position2) + mupdf.fz_story_positions( self.this, function2) + + def place( self, where): + where = JM_rect_from_py( where) + filled = mupdf.FzRect() + more = mupdf.fz_place_story( self.this, where, filled) + return more, JM_py_from_rect( filled) + + def reset( self): + mupdf.fz_reset_story( self.this) + + def write(self, writer, rectfn, positionfn=None, pagefn=None): + dev = None + page_num = 0 + rect_num = 0 + filled = Rect(0, 0, 0, 0) + while 1: + mediabox, rect, ctm = rectfn(rect_num, filled) + rect_num += 1 + if mediabox: + # new page. + page_num += 1 + more, filled = self.place( rect) + if positionfn: + def positionfn2(position): + # We add a `.page_num` member to the + # `ElementPosition` instance. + position.page_num = page_num + positionfn(position) + self.element_positions(positionfn2) + if writer: + if mediabox: + # new page. + if dev: + if pagefn: + pagefn(page_num, mediabox, dev, 1) + writer.end_page() + dev = writer.begin_page( mediabox) + if pagefn: + pagefn(page_num, mediabox, dev, 0) + self.draw( dev, ctm) + if not more: + if pagefn: + pagefn( page_num, mediabox, dev, 1) + writer.end_page() + else: + self.draw(None, ctm) + if not more: + break + + @staticmethod + def write_stabilized(writer, contentfn, rectfn, user_css=None, em=12, positionfn=None, pagefn=None, archive=None, add_header_ids=True): + positions = list() + content = None + # Iterate until stable. 
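+ # Each pass regenerates the content from the positions recorded by the
+ # previous pass; once two consecutive passes produce identical content,
+ # the layout is considered stable and that final pass is written out.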
+ while 1: + content_prev = content + content = contentfn( positions) + stable = False + if content == content_prev: + stable = True + content2 = content + story = Story(content2, user_css, em, archive) + + if add_header_ids: + story.add_header_ids() + + positions = list() + def positionfn2(position): + #log(f"write_stabilized(): {stable=} {positionfn=} {position=}") + positions.append(position) + if stable and positionfn: + positionfn(position) + story.write( + writer if stable else None, + rectfn, + positionfn2, + pagefn, + ) + if stable: + break + + @staticmethod + def write_stabilized_with_links(contentfn, rectfn, user_css=None, em=12, positionfn=None, pagefn=None, archive=None, add_header_ids=True): + #log("write_stabilized_with_links()") + stream = io.BytesIO() + writer = DocumentWriter(stream) + positions = [] + def positionfn2(position): + #log(f"write_stabilized_with_links(): {position=}") + positions.append(position) + if positionfn: + positionfn(position) + Story.write_stabilized(writer, contentfn, rectfn, user_css, em, positionfn2, pagefn, archive, add_header_ids) + writer.close() + stream.seek(0) + return Story.add_pdf_links(stream, positions) + + def write_with_links(self, rectfn, positionfn=None, pagefn=None): + #log("write_with_links()") + stream = io.BytesIO() + writer = DocumentWriter(stream) + positions = [] + def positionfn2(position): + #log(f"write_with_links(): {position=}") + positions.append(position) + if positionfn: + positionfn(position) + self.write(writer, rectfn, positionfn=positionfn2, pagefn=pagefn) + writer.close() + stream.seek(0) + return Story.add_pdf_links(stream, positions) + + class FitResult: + ''' + The result from a `Story.fit*()` method. + + Members: + + `big_enough`: + `True` if the fit succeeded. + `filled`: + From the last call to `Story.place()`. + `more`: + `False` if the fit succeeded. + `numcalls`: + Number of calls made to `self.place()`. + `parameter`: + The successful parameter value, or the largest failing value. + `rect`: + The rect created from `parameter`. + ''' + def __init__(self, big_enough=None, filled=None, more=None, numcalls=None, parameter=None, rect=None): + self.big_enough = big_enough + self.filled = filled + self.more = more + self.numcalls = numcalls + self.parameter = parameter + self.rect = rect + + def __repr__(self): + return ( + f' big_enough={self.big_enough}' + f' filled={self.filled}' + f' more={self.more}' + f' numcalls={self.numcalls}' + f' parameter={self.parameter}' + f' rect={self.rect}' + ) + + def fit(self, fn, pmin=None, pmax=None, delta=0.001, verbose=False): + ''' + Finds optimal rect that contains the story `self`. + + Returns a `Story.FitResult` instance. + + On success, the last call to `self.place()` will have been with the + returned rectangle, so `self.draw()` can be used directly. + + Args: + :arg fn: + A callable taking a floating point `parameter` and returning a + `pymupdf.Rect()`. If the rect is empty, we assume the story will + not fit and do not call `self.place()`. + + Must guarantee that `self.place()` behaves monotonically when + given rect `fn(parameter`) as `parameter` increases. This + usually means that both width and height increase or stay + unchanged as `parameter` increases. + :arg pmin: + Minimum parameter to consider; `None` for -infinity. + :arg pmax: + Maximum parameter to consider; `None` for +infinity. + :arg delta: + Maximum error in returned `parameter`. + :arg verbose: + If true we output diagnostics. 
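+
+ A rough sketch, assuming `story` is a Story and we keep the width fixed
+ at 300 points while searching for a sufficient height:
+
+ def rect_for(height):
+ return pymupdf.Rect(0, 0, 300, height)
+
+ result = story.fit(rect_for, pmin=0, pmax=10000)
+ if result.big_enough:
+ print(result.parameter, result.rect) # fitting height (within delta) and rect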
+ ''' + def log(text): + assert verbose + message(f'fit(): {text}') + + assert isinstance(pmin, (int, float)) or pmin is None + assert isinstance(pmax, (int, float)) or pmax is None + + class State: + def __init__(self): + self.pmin = pmin + self.pmax = pmax + self.pmin_result = None + self.pmax_result = None + self.result = None + self.numcalls = 0 + if verbose: + self.pmin0 = pmin + self.pmax0 = pmax + state = State() + + if verbose: + log(f'starting. {state.pmin=} {state.pmax=}.') + + self.reset() + + def ret(): + if state.pmax is not None: + if state.last_p != state.pmax: + if verbose: + log(f'Calling update() with pmax, because was overwritten by later calls.') + big_enough = update(state.pmax) + assert big_enough + result = state.pmax_result + else: + result = state.pmin_result if state.pmin_result else Story.FitResult(numcalls=state.numcalls) + if verbose: + log(f'finished. {state.pmin0=} {state.pmax0=} {state.pmax=}: returning {result=}') + return result + + def update(parameter): + ''' + Evaluates `more, _ = self.place(fn(parameter))`. If `more` is + false, then `rect` is big enough to contain `self` and we + set `state.pmax=parameter` and return True. Otherwise we set + `state.pmin=parameter` and return False. + ''' + rect = fn(parameter) + assert isinstance(rect, Rect), f'{type(rect)=} {rect=}' + if rect.is_empty: + big_enough = False + result = Story.FitResult(parameter=parameter, numcalls=state.numcalls) + if verbose: + log(f'update(): not calling self.place() because rect is empty.') + else: + more, filled = self.place(rect) + state.numcalls += 1 + big_enough = not more + result = Story.FitResult( + filled=filled, + more=more, + numcalls=state.numcalls, + parameter=parameter, + rect=rect, + big_enough=big_enough, + ) + if verbose: + log(f'update(): called self.place(): {state.numcalls:>2d}: {more=} {parameter=} {rect=}.') + if big_enough: + state.pmax = parameter + state.pmax_result = result + else: + state.pmin = parameter + state.pmin_result = result + state.last_p = parameter + return big_enough + + def opposite(p, direction): + ''' + Returns same sign as `direction`, larger or smaller than `p` if + direction is positive or negative respectively. + ''' + if p is None or p==0: + return direction + if direction * p > 0: + return 2 * p + return -p + + if state.pmin is None: + # Find an initial finite pmin value. + if verbose: log(f'finding pmin.') + parameter = opposite(state.pmax, -1) + while 1: + if not update(parameter): + break + parameter *= 2 + else: + if update(state.pmin): + if verbose: log(f'{state.pmin=} is big enough.') + return ret() + + if state.pmax is None: + # Find an initial finite pmax value. + if verbose: log(f'finding pmax.') + parameter = opposite(state.pmin, +1) + while 1: + if update(parameter): + break + parameter *= 2 + else: + if not update(state.pmax): + # No solution possible. + state.pmax = None + if verbose: log(f'No solution possible {state.pmax=}.') + return ret() + + # Do binary search in pmin..pmax. + if verbose: log(f'doing binary search with {state.pmin=} {state.pmax=}.') + while 1: + if state.pmax - state.pmin < delta: + return ret() + parameter = (state.pmin + state.pmax) / 2 + update(parameter) + + def fit_scale(self, rect, scale_min=0, scale_max=None, delta=0.001, verbose=False): + ''' + Finds smallest value `scale` in range `scale_min..scale_max` where + `scale * rect` is large enough to contain the story `self`. + + Returns a `Story.FitResult` instance. + + :arg width: + width of rect. + :arg height: + height of rect. 
+ :arg scale_min: + Minimum scale to consider; must be >= 0. + :arg scale_max: + Maximum scale to consider, must be >= scale_min or `None` for + infinite. + :arg delta: + Maximum error in returned scale. + :arg verbose: + If true we output diagnostics. + ''' + x0, y0, x1, y1 = rect + width = x1 - x0 + height = y1 - y0 + def fn(scale): + return Rect(x0, y0, x0 + scale*width, y0 + scale*height) + return self.fit(fn, scale_min, scale_max, delta, verbose) + + def fit_height(self, width, height_min=0, height_max=None, origin=(0, 0), delta=0.001, verbose=False): + ''' + Finds smallest height in range `height_min..height_max` where a rect + with size `(width, height)` is large enough to contain the story + `self`. + + Returns a `Story.FitResult` instance. + + :arg width: + width of rect. + :arg height_min: + Minimum height to consider; must be >= 0. + :arg height_max: + Maximum height to consider, must be >= height_min or `None` for + infinite. + :arg origin: + `(x0, y0)` of rect. + :arg delta: + Maximum error in returned height. + :arg verbose: + If true we output diagnostics. + ''' + x0, y0 = origin + x1 = x0 + width + def fn(height): + return Rect(x0, y0, x1, y0+height) + return self.fit(fn, height_min, height_max, delta, verbose) + + def fit_width(self, height, width_min=0, width_max=None, origin=(0, 0), delta=0.001, verbose=False): + ''' + Finds smallest width in range `width_min..width_max` where a rect with size + `(width, height)` is large enough to contain the story `self`. + + Returns a `Story.FitResult` instance. + Returns a `FitResult` instance. + + :arg height: + height of rect. + :arg width_min: + Minimum width to consider; must be >= 0. + :arg width_max: + Maximum width to consider, must be >= width_min or `None` for + infinite. + :arg origin: + `(x0, y0)` of rect. + :arg delta: + Maximum error in returned width. + :arg verbose: + If true we output diagnostics. + ''' + x0, y0 = origin + y1 = y0 + height + def fn(width): + return Rect(x0, y0, x0+width, y1) + return self.fit(fn, width_min, width_max, delta, verbose) + + +class TextPage: + + def __init__(self, *args): + if args_match(args, mupdf.FzRect): + mediabox = args[0] + self.this = mupdf.FzStextPage( mediabox) + elif args_match(args, mupdf.FzStextPage): + self.this = args[0] + else: + raise Exception(f'Unrecognised args: {args}') + self.thisown = True + self.parent = None + + def _extractText(self, format_): + this_tpage = self.this + res = mupdf.fz_new_buffer(1024) + out = mupdf.FzOutput( res) + # fixme: mupdfwrap.py thinks fz_output is not copyable, possibly + # because there is no .refs member visible and no fz_keep_output() fn, + # although there is an fz_drop_output(). So mupdf.fz_new_output_with_buffer() + # doesn't convert the returned fz_output* into a mupdf.FzOutput. 
+ #out = mupdf.FzOutput(out) + if format_ == 1: + mupdf.fz_print_stext_page_as_html(out, this_tpage, 0) + elif format_ == 3: + mupdf.fz_print_stext_page_as_xml(out, this_tpage, 0) + elif format_ == 4: + mupdf.fz_print_stext_page_as_xhtml(out, this_tpage, 0) + else: + JM_print_stext_page_as_text(res, this_tpage) + out.fz_close_output() + text = JM_EscapeStrFromBuffer(res) + return text + + def _getNewBlockList(self, page_dict, raw): + JM_make_textpage_dict(self.this, page_dict, raw) + + def _textpage_dict(self, raw=False): + page_dict = {"width": self.rect.width, "height": self.rect.height} + self._getNewBlockList(page_dict, raw) + return page_dict + + def extractBLOCKS(self): + """Return a list with text block information.""" + if g_use_extra: + return extra.extractBLOCKS(self.this) + block_n = -1 + this_tpage = self.this + tp_rect = mupdf.FzRect(this_tpage.m_internal.mediabox) + res = mupdf.fz_new_buffer(1024) + lines = [] + for block in this_tpage: + block_n += 1 + blockrect = mupdf.FzRect(mupdf.FzRect.Fixed_EMPTY) + if block.m_internal.type == mupdf.FZ_STEXT_BLOCK_TEXT: + mupdf.fz_clear_buffer(res) # set text buffer to empty + line_n = -1 + last_char = 0 + for line in block: + line_n += 1 + linerect = mupdf.FzRect(mupdf.FzRect.Fixed_EMPTY) + for ch in line: + cbbox = JM_char_bbox(line, ch) + if (not JM_rects_overlap(tp_rect, cbbox) + and not mupdf.fz_is_infinite_rect(tp_rect) + ): + continue + JM_append_rune(res, ch.m_internal.c) + last_char = ch.m_internal.c + linerect = mupdf.fz_union_rect(linerect, cbbox) + if last_char != 10 and not mupdf.fz_is_empty_rect(linerect): + mupdf.fz_append_byte(res, 10) + blockrect = mupdf.fz_union_rect(blockrect, linerect) + text = JM_EscapeStrFromBuffer(res) + elif (JM_rects_overlap(tp_rect, block.m_internal.bbox) + or mupdf.fz_is_infinite_rect(tp_rect) + ): + img = block.i_image() + cs = img.colorspace() + text = "" % ( + mupdf.fz_colorspace_name(cs), + img.w(), img.h(), img.bpc() + ) + blockrect = mupdf.fz_union_rect(blockrect, mupdf.FzRect(block.m_internal.bbox)) + if not mupdf.fz_is_empty_rect(blockrect): + litem = ( + blockrect.x0, + blockrect.y0, + blockrect.x1, + blockrect.y1, + text, + block_n, + block.m_internal.type, + ) + lines.append(litem) + return lines + + def extractDICT(self, cb=None, sort=False) -> dict: + """Return page content as a Python dict of images and text spans.""" + val = self._textpage_dict(raw=False) + if cb is not None: + val["width"] = cb.width + val["height"] = cb.height + if sort: + blocks = val["blocks"] + blocks.sort(key=lambda b: (b["bbox"][3], b["bbox"][0])) + val["blocks"] = blocks + return val + + def extractHTML(self) -> str: + """Return page content as a HTML string.""" + return self._extractText(1) + + def extractIMGINFO(self, hashes=0): + """Return a list with image meta information.""" + block_n = -1 + this_tpage = self.this + rc = [] + for block in this_tpage: + block_n += 1 + if block.m_internal.type == mupdf.FZ_STEXT_BLOCK_TEXT: + continue + img = block.i_image() + img_size = 0 + mask = img.mask() + if mask.m_internal: + has_mask = True + else: + has_mask = False + compr_buff = mupdf.fz_compressed_image_buffer(img) + if compr_buff.m_internal: + img_size = compr_buff.fz_compressed_buffer_size() + compr_buff = None + if hashes: + r = mupdf.FzIrect(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, FZ_MAX_INF_RECT) + assert mupdf.fz_is_infinite_irect(r) + m = mupdf.FzMatrix(img.w(), 0, 0, img.h(), 0, 0) + pix, w, h = mupdf.fz_get_pixmap_from_image(img, r, m) + digest = mupdf.fz_md5_pixmap2(pix) + digest = 
bytes(digest) + if img_size == 0: + img_size = img.w() * img.h() * img.n() + cs = mupdf.FzColorspace(mupdf.ll_fz_keep_colorspace(img.m_internal.colorspace)) + block_dict = dict() + block_dict[dictkey_number] = block_n + block_dict[dictkey_bbox] = JM_py_from_rect(block.m_internal.bbox) + block_dict[dictkey_matrix] = JM_py_from_matrix(block.i_transform()) + block_dict[dictkey_width] = img.w() + block_dict[dictkey_height] = img.h() + block_dict[dictkey_colorspace] = mupdf.fz_colorspace_n(cs) + block_dict[dictkey_cs_name] = mupdf.fz_colorspace_name(cs) + block_dict[dictkey_xres] = img.xres() + block_dict[dictkey_yres] = img.yres() + block_dict[dictkey_bpc] = img.bpc() + block_dict[dictkey_size] = img_size + if hashes: + block_dict["digest"] = digest + block_dict["has-mask"] = has_mask + rc.append(block_dict) + return rc + + def extractJSON(self, cb=None, sort=False) -> str: + """Return 'extractDICT' converted to JSON format.""" + import base64 + import json + val = self._textpage_dict(raw=False) + + class b64encode(json.JSONEncoder): + def default(self, s): + if type(s) in (bytes, bytearray): + return base64.b64encode(s).decode() + + if cb is not None: + val["width"] = cb.width + val["height"] = cb.height + if sort: + blocks = val["blocks"] + blocks.sort(key=lambda b: (b["bbox"][3], b["bbox"][0])) + val["blocks"] = blocks + + val = json.dumps(val, separators=(",", ":"), cls=b64encode, indent=1) + return val + + def extractRAWDICT(self, cb=None, sort=False) -> dict: + """Return page content as a Python dict of images and text characters.""" + val = self._textpage_dict(raw=True) + if cb is not None: + val["width"] = cb.width + val["height"] = cb.height + if sort: + blocks = val["blocks"] + blocks.sort(key=lambda b: (b["bbox"][3], b["bbox"][0])) + val["blocks"] = blocks + return val + + def extractRAWJSON(self, cb=None, sort=False) -> str: + """Return 'extractRAWDICT' converted to JSON format.""" + import base64 + import json + val = self._textpage_dict(raw=True) + + class b64encode(json.JSONEncoder): + def default(self,s): + if type(s) in (bytes, bytearray): + return base64.b64encode(s).decode() + + if cb is not None: + val["width"] = cb.width + val["height"] = cb.height + if sort: + blocks = val["blocks"] + blocks.sort(key=lambda b: (b["bbox"][3], b["bbox"][0])) + val["blocks"] = blocks + val = json.dumps(val, separators=(",", ":"), cls=b64encode, indent=1) + return val + + def extractSelection(self, pointa, pointb): + a = JM_point_from_py(pointa) + b = JM_point_from_py(pointb) + found = mupdf.fz_copy_selection(self.this, a, b, 0) + return found + + def extractText(self, sort=False) -> str: + """Return simple, bare text on the page.""" + if not sort: + return self._extractText(0) + blocks = self.extractBLOCKS()[:] + blocks.sort(key=lambda b: (b[3], b[0])) + return "".join([b[4] for b in blocks]) + + def extractTextbox(self, rect): + this_tpage = self.this + assert isinstance(this_tpage, mupdf.FzStextPage) + area = JM_rect_from_py(rect) + found = JM_copy_rectangle(this_tpage, area) + rc = PyUnicode_DecodeRawUnicodeEscape(found) + return rc + + def extractWORDS(self, delimiters=None): + """Return a list with text word information.""" + if g_use_extra: + return extra.extractWORDS(self.this, delimiters) + buflen = 0 + last_char_rtl = 0 + block_n = -1 + wbbox = mupdf.FzRect(mupdf.FzRect.Fixed_EMPTY) # word bbox + this_tpage = self.this + tp_rect = mupdf.FzRect(this_tpage.m_internal.mediabox) + + lines = None + buff = mupdf.fz_new_buffer(64) + lines = [] + for block in this_tpage: + block_n += 1 + if 
block.m_internal.type != mupdf.FZ_STEXT_BLOCK_TEXT: + continue + line_n = -1 + for line in block: + line_n += 1 + word_n = 0 # word counter per line + mupdf.fz_clear_buffer(buff) # reset word buffer + buflen = 0 # reset char counter + for ch in line: + cbbox = JM_char_bbox(line, ch) + if (not JM_rects_overlap(tp_rect, cbbox) + and not mupdf.fz_is_infinite_rect(tp_rect) + ): + continue + word_delimiter = JM_is_word_delimiter(ch.m_internal.c, delimiters) + this_char_rtl = JM_is_rtl_char(ch.m_internal.c) + if word_delimiter or this_char_rtl != last_char_rtl: + if buflen == 0 and word_delimiter: + continue # skip delimiters at line start + if not mupdf.fz_is_empty_rect(wbbox): + word_n, wbbox = JM_append_word(lines, buff, wbbox, block_n, line_n, word_n) + mupdf.fz_clear_buffer(buff) + buflen = 0 # reset char counter + if word_delimiter: + continue + # append one unicode character to the word + JM_append_rune(buff, ch.m_internal.c) + last_char_rtl = this_char_rtl + buflen += 1 + # enlarge word bbox + wbbox = mupdf.fz_union_rect(wbbox, JM_char_bbox(line, ch)) + if buflen and not mupdf.fz_is_empty_rect(wbbox): + word_n, wbbox = JM_append_word(lines, buff, wbbox, block_n, line_n, word_n) + buflen = 0 + return lines + + def extractXHTML(self) -> str: + """Return page content as a XHTML string.""" + return self._extractText(4) + + def extractXML(self) -> str: + """Return page content as a XML string.""" + return self._extractText(3) + + def poolsize(self): + """TextPage current poolsize.""" + tpage = self.this + pool = mupdf.Pool(tpage.m_internal.pool) + size = mupdf.fz_pool_size( pool) + pool.m_internal = None # Ensure that pool's destructor does not free the pool. + return size + + @property + def rect(self): + """Page rectangle.""" + this_tpage = self.this + mediabox = this_tpage.m_internal.mediabox + val = JM_py_from_rect(mediabox) + val = Rect(val) + + return val + + def search(self, needle, hit_max=0, quads=1): + """Locate 'needle' returning rects or quads.""" + val = JM_search_stext_page(self.this, needle) + if not val: + return val + items = len(val) + for i in range(items): # change entries to quads or rects + q = Quad(val[i]) + if quads: + val[i] = q + else: + val[i] = q.rect + if quads: + return val + i = 0 # join overlapping rects on the same line + while i < items - 1: + v1 = val[i] + v2 = val[i + 1] + if v1.y1 != v2.y1 or (v1 & v2).is_empty: + i += 1 + continue # no overlap on same line + val[i] = v1 | v2 # join rectangles + del val[i + 1] # remove v2 + items -= 1 # reduce item count + return val + + extractTEXT = extractText + + +class TextWriter: + + def __init__(self, page_rect, opacity=1, color=None): + """Stores text spans for later output on compatible PDF pages.""" + self.this = mupdf.fz_new_text() + + self.opacity = opacity + self.color = color + self.rect = Rect(page_rect) + self.ctm = Matrix(1, 0, 0, -1, 0, self.rect.height) + self.ictm = ~self.ctm + self.last_point = Point() + self.last_point.__doc__ = "Position following last text insertion." + self.text_rect = Rect() + + self.text_rect.__doc__ = "Accumulated area of text spans." 
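+        # `ctm` / `ictm` above flip the y-axis to convert between the
+        # top-down page coordinates used by callers and MuPDF's bottom-up
+        # text space: append() maps the given point with `ictm` and maps
+        # results back with `ctm`.  Illustrative usage sketch (assuming
+        # `page` is a PDF page of the same size):
+        #   tw = TextWriter(page.rect)
+        #   tw.append((72, 72), "Hello")
+        #   tw.write_text(page)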
+ self.used_fonts = set() + self.thisown = True + + @property + def _bbox(self): + val = JM_py_from_rect( mupdf.fz_bound_text( self.this, mupdf.FzStrokeState(None), mupdf.FzMatrix())) + val = Rect(val) + return val + + def append(self, pos, text, font=None, fontsize=11, language=None, right_to_left=0, small_caps=0): + """Store 'text' at point 'pos' using 'font' and 'fontsize'.""" + pos = Point(pos) * self.ictm + #log( '{font=}') + if font is None: + font = Font("helv") + if not font.is_writable: + if 0: + log( '{font.this.m_internal.name=}') + log( '{font.this.m_internal.t3matrix=}') + log( '{font.this.m_internal.bbox=}') + log( '{font.this.m_internal.glyph_count=}') + log( '{font.this.m_internal.use_glyph_bbox=}') + log( '{font.this.m_internal.width_count=}') + log( '{font.this.m_internal.width_default=}') + log( '{font.this.m_internal.has_digest=}') + log( 'Unsupported font {font.name=}') + if mupdf_cppyy: + import cppyy + log( f'Unsupported font {cppyy.gbl.mupdf_font_name(font.this.m_internal)=}') + raise ValueError("Unsupported font '%s'." % font.name) + if right_to_left: + text = self.clean_rtl(text) + text = "".join(reversed(text)) + right_to_left = 0 + + lang = mupdf.fz_text_language_from_string(language) + p = JM_point_from_py(pos) + trm = mupdf.fz_make_matrix(fontsize, 0, 0, fontsize, p.x, p.y) + markup_dir = 0 + wmode = 0 + if small_caps == 0: + trm = mupdf.fz_show_string( self.this, font.this, trm, text, wmode, right_to_left, markup_dir, lang) + else: + trm = JM_show_string_cs( self.this, font.this, trm, text, wmode, right_to_left, markup_dir, lang) + val = JM_py_from_matrix(trm) + + self.last_point = Point(val[-2:]) * self.ctm + self.text_rect = self._bbox * self.ctm + val = self.text_rect, self.last_point + if font.flags["mono"] == 1: + self.used_fonts.add(font) + return val + + def appendv(self, pos, text, font=None, fontsize=11, language=None, small_caps=False): + lheight = fontsize * 1.2 + for c in text: + self.append(pos, c, font=font, fontsize=fontsize, + language=language, small_caps=small_caps) + pos.y += lheight + return self.text_rect, self.last_point + + def clean_rtl(self, text): + """Revert the sequence of Latin text parts. + + Text with right-to-left writing direction (Arabic, Hebrew) often + contains Latin parts, which are written in left-to-right: numbers, names, + etc. For output as PDF text we need *everything* in right-to-left. + E.g. an input like " ABCDE FG HIJ KL " will be + converted to " JIH GF EDCBA LK ". The Arabic + parts remain untouched. + + Args: + text: str + Returns: + Massaged string. + """ + if not text: + return text + # split into words at space boundaries + words = text.split(" ") + idx = [] + for i in range(len(words)): + w = words[i] + # revert character sequence for Latin only words + if not (len(w) < 2 or max([ord(c) for c in w]) > 255): + words[i] = "".join(reversed(w)) + idx.append(i) # stored index of Latin word + + # adjacent Latin words must revert their sequence, too + idx2 = [] # store indices of adjacent Latin words + for i in range(len(idx)): + if idx2 == []: # empty yet? + idx2.append(idx[i]) # store Latin word number + + elif idx[i] > idx2[-1] + 1: # large gap to last? + if len(idx2) > 1: # at least two consecutives? 
+ words[idx2[0] : idx2[-1] + 1] = reversed( + words[idx2[0] : idx2[-1] + 1] + ) # revert their sequence + idx2 = [idx[i]] # re-initialize + + elif idx[i] == idx2[-1] + 1: # new adjacent Latin word + idx2.append(idx[i]) + + text = " ".join(words) + return text + + def write_text(self, page, color=None, opacity=-1, overlay=1, morph=None, matrix=None, render_mode=0, oc=0): + """Write the text to a PDF page having the TextWriter's page size. + + Args: + page: a PDF page having same size. + color: override text color. + opacity: override transparency. + overlay: put in foreground or background. + morph: tuple(Point, Matrix), apply a matrix with a fixpoint. + matrix: Matrix to be used instead of 'morph' argument. + render_mode: (int) PDF render mode operator 'Tr'. + """ + CheckParent(page) + if abs(self.rect - page.rect) > 1e-3: + raise ValueError("incompatible page rect") + if morph is not None: + if (type(morph) not in (tuple, list) + or type(morph[0]) is not Point + or type(morph[1]) is not Matrix + ): + raise ValueError("morph must be (Point, Matrix) or None") + if matrix is not None and morph is not None: + raise ValueError("only one of matrix, morph is allowed") + if getattr(opacity, "__float__", None) is None or opacity == -1: + opacity = self.opacity + if color is None: + color = self.color + + if 1: + pdfpage = page._pdf_page() + alpha = 1 + if opacity >= 0 and opacity < 1: + alpha = opacity + ncol = 1 + dev_color = [0, 0, 0, 0] + if color: + ncol, dev_color = JM_color_FromSequence(color) + if ncol == 3: + colorspace = mupdf.fz_device_rgb() + elif ncol == 4: + colorspace = mupdf.fz_device_cmyk() + else: + colorspace = mupdf.fz_device_gray() + + resources = mupdf.pdf_new_dict(pdfpage.doc(), 5) + contents = mupdf.fz_new_buffer(1024) + dev = mupdf.pdf_new_pdf_device( pdfpage.doc(), mupdf.FzMatrix(), resources, contents) + #log( '=== {dev_color!r=}') + mupdf.fz_fill_text( + dev, + self.this, + mupdf.FzMatrix(), + colorspace, + dev_color, + alpha, + mupdf.FzColorParams(mupdf.fz_default_color_params), + ) + mupdf.fz_close_device( dev) + + # copy generated resources into the one of the page + max_nums = JM_merge_resources( pdfpage, resources) + cont_string = JM_EscapeStrFromBuffer( contents) + result = (max_nums, cont_string) + val = result + + max_nums = val[0] + content = val[1] + max_alp, max_font = max_nums + old_cont_lines = content.splitlines() + + optcont = page._get_optional_content(oc) + if optcont is not None: + bdc = "/OC /%s BDC" % optcont + emc = "EMC" + else: + bdc = emc = "" + + new_cont_lines = ["q"] + if bdc: + new_cont_lines.append(bdc) + + cb = page.cropbox_position + if page.rotation in (90, 270): + delta = page.rect.height - page.rect.width + else: + delta = 0 + mb = page.mediabox + if bool(cb) or mb.y0 != 0 or delta != 0: + new_cont_lines.append(f"1 0 0 1 {_format_g((cb.x, cb.y + mb.y0 - delta))} cm") + + if morph: + p = morph[0] * self.ictm + delta = Matrix(1, 1).pretranslate(p.x, p.y) + matrix = ~delta * morph[1] * delta + if morph or matrix: + new_cont_lines.append(_format_g(JM_TUPLE(matrix)) + " cm") + + for line in old_cont_lines: + if line.endswith(" cm"): + continue + if line == "BT": + new_cont_lines.append(line) + new_cont_lines.append("%i Tr" % render_mode) + continue + if line.endswith(" gs"): + alp = int(line.split()[0][4:]) + max_alp + line = "/Alp%i gs" % alp + elif line.endswith(" Tf"): + temp = line.split() + fsize = float(temp[1]) + if render_mode != 0: + w = fsize * 0.05 + else: + w = 1 + new_cont_lines.append(_format_g(w) + " w") + font = 
int(temp[0][2:]) + max_font + line = " ".join(["/F%i" % font] + temp[1:]) + elif line.endswith(" rg"): + new_cont_lines.append(line.replace("rg", "RG")) + elif line.endswith(" g"): + new_cont_lines.append(line.replace(" g", " G")) + elif line.endswith(" k"): + new_cont_lines.append(line.replace(" k", " K")) + new_cont_lines.append(line) + if emc: + new_cont_lines.append(emc) + new_cont_lines.append("Q\n") + content = "\n".join(new_cont_lines).encode("utf-8") + TOOLS._insert_contents(page, content, overlay=overlay) + val = None + for font in self.used_fonts: + repair_mono_font(page, font) + return val + + +class IRect: + """ + IRect() - all zeros + IRect(x0, y0, x1, y1) - 4 coordinates + IRect(top-left, x1, y1) - point and 2 coordinates + IRect(x0, y0, bottom-right) - 2 coordinates and point + IRect(top-left, bottom-right) - 2 points + IRect(sequ) - new from sequence or rect-like + """ + + def __add__(self, p): + return Rect.__add__(self, p).round() + + def __and__(self, x): + return Rect.__and__(self, x).round() + + def __contains__(self, x): + return Rect.__contains__(self, x) + + def __eq__(self, r): + if not hasattr(r, "__len__"): + return False + return len(r) == 4 and self.x0 == r[0] and self.y0 == r[1] and self.x1 == r[2] and self.y1 == r[3] + + def __getitem__(self, i): + return (self.x0, self.y0, self.x1, self.y1)[i] + + def __hash__(self): + return hash(tuple(self)) + + def __init__(self, *args, p0=None, p1=None, x0=None, y0=None, x1=None, y1=None): + self.x0, self.y0, self.x1, self.y1 = util_make_irect( *args, p0=p0, p1=p1, x0=x0, y0=y0, x1=x1, y1=y1) + + def __len__(self): + return 4 + + def __mul__(self, m): + return Rect.__mul__(self, m).round() + + def __neg__(self): + return IRect(-self.x0, -self.y0, -self.x1, -self.y1) + + def __or__(self, x): + return Rect.__or__(self, x).round() + + def __pos__(self): + return IRect(self) + + def __repr__(self): + return "IRect" + str(tuple(self)) + + def __setitem__(self, i, v): + v = int(v) + if i == 0: self.x0 = v + elif i == 1: self.y0 = v + elif i == 2: self.x1 = v + elif i == 3: self.y1 = v + else: + raise IndexError("index out of range") + return None + + def __sub__(self, p): + return Rect.__sub__(self, p).round() + + def __truediv__(self, m): + return Rect.__truediv__(self, m).round() + + @property + def bottom_left(self): + """Bottom-left corner.""" + return Point(self.x0, self.y1) + + @property + def bottom_right(self): + """Bottom-right corner.""" + return Point(self.x1, self.y1) + + @property + def height(self): + return max(0, self.y1 - self.y0) + + def contains(self, x): + """Check if x is in the rectangle.""" + return self.__contains__(x) + + def include_point(self, p): + """Extend rectangle to include point p.""" + rect = self.rect.include_point(p) + return rect.irect + + def include_rect(self, r): + """Extend rectangle to include rectangle r.""" + rect = self.rect.include_rect(r) + return rect.irect + + def intersect(self, r): + """Restrict rectangle to intersection with rectangle r.""" + return Rect.intersect(self, r).round() + + def intersects(self, x): + return Rect.intersects(self, x) + + @property + def is_empty(self): + """True if rectangle area is empty.""" + return self.x0 >= self.x1 or self.y0 >= self.y1 + + @property + def is_infinite(self): + """True if rectangle is infinite.""" + return self.x0 == self.y0 == FZ_MIN_INF_RECT and self.x1 == self.y1 == FZ_MAX_INF_RECT + + @property + def is_valid(self): + """True if rectangle is valid.""" + return self.x0 <= self.x1 and self.y0 <= self.y1 + + def morph(self, p, 
m): + """Morph with matrix-like m and point-like p. + + Returns a new quad.""" + if self.is_infinite: + return INFINITE_QUAD() + return self.quad.morph(p, m) + + def norm(self): + return math.sqrt(sum([c*c for c in self])) + + def normalize(self): + """Replace rectangle with its valid version.""" + if self.x1 < self.x0: + self.x0, self.x1 = self.x1, self.x0 + if self.y1 < self.y0: + self.y0, self.y1 = self.y1, self.y0 + return self + + @property + def quad(self): + """Return Quad version of rectangle.""" + return Quad(self.tl, self.tr, self.bl, self.br) + + @property + def rect(self): + return Rect(self) + + @property + def top_left(self): + """Top-left corner.""" + return Point(self.x0, self.y0) + + @property + def top_right(self): + """Top-right corner.""" + return Point(self.x1, self.y0) + + def torect(self, r): + """Return matrix that converts to target rect.""" + r = Rect(r) + if self.is_infinite or self.is_empty or r.is_infinite or r.is_empty: + raise ValueError("rectangles must be finite and not empty") + return ( + Matrix(1, 0, 0, 1, -self.x0, -self.y0) + * Matrix(r.width / self.width, r.height / self.height) + * Matrix(1, 0, 0, 1, r.x0, r.y0) + ) + + def transform(self, m): + return Rect.transform(self, m).round() + + @property + def width(self): + return max(0, self.x1 - self.x0) + + br = bottom_right + bl = bottom_left + tl = top_left + tr = top_right + + +# Data +# + +if 1: + _self = sys.modules[__name__] + if 1: + for _name, _value in mupdf.__dict__.items(): + if _name.startswith(('PDF_', 'UCDN_SCRIPT_')): + if _name.startswith('PDF_ENUM_NAME_'): + # Not a simple enum. + pass + else: + #assert not inspect.isroutine(value) + #log(f'importing {_name=} {_value=}.') + setattr(_self, _name, _value) + #log(f'{getattr( self, name, None)=}') + else: + # This is slow due to importing inspect, e.g. 0.019 instead of 0.004. + for _name, _value in inspect.getmembers(mupdf): + if _name.startswith(('PDF_', 'UCDN_SCRIPT_')): + if _name.startswith('PDF_ENUM_NAME_'): + # Not a simple enum. + pass + else: + #assert not inspect.isroutine(value) + #log(f'importing {name}') + setattr(_self, _name, _value) + #log(f'{getattr( self, name, None)=}') + + # This is a macro so not preserved in mupdf C++/Python bindings. + # + PDF_SIGNATURE_DEFAULT_APPEARANCE = (0 + | mupdf.PDF_SIGNATURE_SHOW_LABELS + | mupdf.PDF_SIGNATURE_SHOW_DN + | mupdf.PDF_SIGNATURE_SHOW_DATE + | mupdf.PDF_SIGNATURE_SHOW_TEXT_NAME + | mupdf.PDF_SIGNATURE_SHOW_GRAPHIC_NAME + | mupdf.PDF_SIGNATURE_SHOW_LOGO + ) + + #UCDN_SCRIPT_ADLAM = mupdf.UCDN_SCRIPT_ADLAM + #setattr(self, 'UCDN_SCRIPT_ADLAM', mupdf.UCDN_SCRIPT_ADLAM) + + assert mupdf.UCDN_EAST_ASIAN_H == 1 + + # Flake8 incorrectly fails next two lines because we've dynamically added + # items to self. 
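+    # For example, mupdf.PDF_TX_FIELD_IS_MULTILINE and mupdf.UCDN_SCRIPT_ADLAM
+    # are now also available as module-level names (checked by the asserts
+    # below); only the PDF_ENUM_NAME_* values were skipped above.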
+ assert PDF_TX_FIELD_IS_MULTILINE == mupdf.PDF_TX_FIELD_IS_MULTILINE # noqa: F821 + assert UCDN_SCRIPT_ADLAM == mupdf.UCDN_SCRIPT_ADLAM # noqa: F821 + del _self, _name, _value + +AnyType = typing.Any + +Base14_fontnames = ( + "Courier", + "Courier-Oblique", + "Courier-Bold", + "Courier-BoldOblique", + "Helvetica", + "Helvetica-Oblique", + "Helvetica-Bold", + "Helvetica-BoldOblique", + "Times-Roman", + "Times-Italic", + "Times-Bold", + "Times-BoldItalic", + "Symbol", + "ZapfDingbats", + ) + +Base14_fontdict = {} +for f in Base14_fontnames: + Base14_fontdict[f.lower()] = f +Base14_fontdict["helv"] = "Helvetica" +Base14_fontdict["heit"] = "Helvetica-Oblique" +Base14_fontdict["hebo"] = "Helvetica-Bold" +Base14_fontdict["hebi"] = "Helvetica-BoldOblique" +Base14_fontdict["cour"] = "Courier" +Base14_fontdict["coit"] = "Courier-Oblique" +Base14_fontdict["cobo"] = "Courier-Bold" +Base14_fontdict["cobi"] = "Courier-BoldOblique" +Base14_fontdict["tiro"] = "Times-Roman" +Base14_fontdict["tibo"] = "Times-Bold" +Base14_fontdict["tiit"] = "Times-Italic" +Base14_fontdict["tibi"] = "Times-BoldItalic" +Base14_fontdict["symb"] = "Symbol" +Base14_fontdict["zadb"] = "ZapfDingbats" + +EPSILON = 1e-5 +FLT_EPSILON = 1e-5 + +# largest 32bit integers surviving C float conversion roundtrips +# used by MuPDF to define infinite rectangles +FZ_MIN_INF_RECT = -0x80000000 +FZ_MAX_INF_RECT = 0x7fffff80 + +JM_annot_id_stem = "fitz" +JM_mupdf_warnings_store = [] +JM_mupdf_show_errors = 1 +JM_mupdf_show_warnings = 0 + + +# ------------------------------------------------------------------------------ +# Image recompression constants +# ------------------------------------------------------------------------------ +FZ_RECOMPRESS_NEVER = mupdf.FZ_RECOMPRESS_NEVER +FZ_RECOMPRESS_SAME = mupdf.FZ_RECOMPRESS_SAME +FZ_RECOMPRESS_LOSSLESS = mupdf.FZ_RECOMPRESS_LOSSLESS +FZ_RECOMPRESS_JPEG = mupdf.FZ_RECOMPRESS_JPEG +FZ_RECOMPRESS_J2K = mupdf.FZ_RECOMPRESS_J2K +FZ_RECOMPRESS_FAX = mupdf.FZ_RECOMPRESS_FAX +FZ_SUBSAMPLE_AVERAGE = mupdf.FZ_SUBSAMPLE_AVERAGE +FZ_SUBSAMPLE_BICUBIC = mupdf.FZ_SUBSAMPLE_BICUBIC + +# ------------------------------------------------------------------------------ +# Various PDF Optional Content Flags +# ------------------------------------------------------------------------------ +PDF_OC_ON = 0 +PDF_OC_TOGGLE = 1 +PDF_OC_OFF = 2 + +# ------------------------------------------------------------------------------ +# link kinds and link flags +# ------------------------------------------------------------------------------ +LINK_NONE = 0 +LINK_GOTO = 1 +LINK_URI = 2 +LINK_LAUNCH = 3 +LINK_NAMED = 4 +LINK_GOTOR = 5 +LINK_FLAG_L_VALID = 1 +LINK_FLAG_T_VALID = 2 +LINK_FLAG_R_VALID = 4 +LINK_FLAG_B_VALID = 8 +LINK_FLAG_FIT_H = 16 +LINK_FLAG_FIT_V = 32 +LINK_FLAG_R_IS_ZOOM = 64 + +SigFlag_SignaturesExist = 1 +SigFlag_AppendOnly = 2 + +STAMP_Approved = 0 +STAMP_AsIs = 1 +STAMP_Confidential = 2 +STAMP_Departmental = 3 +STAMP_Experimental = 4 +STAMP_Expired = 5 +STAMP_Final = 6 +STAMP_ForComment = 7 +STAMP_ForPublicRelease = 8 +STAMP_NotApproved = 9 +STAMP_NotForPublicRelease = 10 +STAMP_Sold = 11 +STAMP_TopSecret = 12 +STAMP_Draft = 13 + +TEXT_ALIGN_LEFT = 0 +TEXT_ALIGN_CENTER = 1 +TEXT_ALIGN_RIGHT = 2 +TEXT_ALIGN_JUSTIFY = 3 + +TEXT_FONT_SUPERSCRIPT = 1 +TEXT_FONT_ITALIC = 2 +TEXT_FONT_SERIFED = 4 +TEXT_FONT_MONOSPACED = 8 +TEXT_FONT_BOLD = 16 + +TEXT_OUTPUT_TEXT = 0 +TEXT_OUTPUT_HTML = 1 +TEXT_OUTPUT_JSON = 2 +TEXT_OUTPUT_XML = 3 +TEXT_OUTPUT_XHTML = 4 + +TEXT_PRESERVE_LIGATURES = mupdf.FZ_STEXT_PRESERVE_LIGATURES 
+TEXT_PRESERVE_WHITESPACE = mupdf.FZ_STEXT_PRESERVE_WHITESPACE +TEXT_PRESERVE_IMAGES = mupdf.FZ_STEXT_PRESERVE_IMAGES +TEXT_INHIBIT_SPACES = mupdf.FZ_STEXT_INHIBIT_SPACES +TEXT_DEHYPHENATE = mupdf.FZ_STEXT_DEHYPHENATE +TEXT_PRESERVE_SPANS = mupdf.FZ_STEXT_PRESERVE_SPANS +TEXT_MEDIABOX_CLIP = mupdf.FZ_STEXT_MEDIABOX_CLIP +TEXT_USE_CID_FOR_UNKNOWN_UNICODE = mupdf.FZ_STEXT_USE_CID_FOR_UNKNOWN_UNICODE +TEXT_COLLECT_STRUCTURE = mupdf.FZ_STEXT_COLLECT_STRUCTURE +TEXT_ACCURATE_BBOXES = mupdf.FZ_STEXT_ACCURATE_BBOXES +TEXT_COLLECT_VECTORS = mupdf.FZ_STEXT_COLLECT_VECTORS +TEXT_IGNORE_ACTUALTEXT = mupdf.FZ_STEXT_IGNORE_ACTUALTEXT +TEXT_SEGMENT = mupdf.FZ_STEXT_SEGMENT + +if mupdf_version_tuple >= (1, 26): + TEXT_PARAGRAPH_BREAK = mupdf.FZ_STEXT_PARAGRAPH_BREAK + TEXT_TABLE_HUNT = mupdf.FZ_STEXT_TABLE_HUNT + TEXT_COLLECT_STYLES = mupdf.FZ_STEXT_COLLECT_STYLES + TEXT_USE_GID_FOR_UNKNOWN_UNICODE = mupdf.FZ_STEXT_USE_GID_FOR_UNKNOWN_UNICODE + TEXT_CLIP_RECT = mupdf.FZ_STEXT_CLIP_RECT + TEXT_ACCURATE_ASCENDERS = mupdf.FZ_STEXT_ACCURATE_ASCENDERS + TEXT_ACCURATE_SIDE_BEARINGS = mupdf.FZ_STEXT_ACCURATE_SIDE_BEARINGS + +# 2025-05-07: Non-standard names preserved for backwards compatibility. +TEXT_STEXT_SEGMENT = TEXT_SEGMENT +TEXT_CID_FOR_UNKNOWN_UNICODE = TEXT_USE_CID_FOR_UNKNOWN_UNICODE + +TEXTFLAGS_WORDS = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_USE_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_BLOCKS = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_USE_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_DICT = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_PRESERVE_IMAGES + | TEXT_USE_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_RAWDICT = TEXTFLAGS_DICT + +TEXTFLAGS_SEARCH = (0 + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_DEHYPHENATE + | TEXT_USE_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_HTML = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_PRESERVE_IMAGES + | TEXT_USE_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_XHTML = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_PRESERVE_IMAGES + | TEXT_USE_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_XML = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_USE_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_TEXT = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_USE_CID_FOR_UNKNOWN_UNICODE + ) + +# Simple text encoding options +TEXT_ENCODING_LATIN = 0 +TEXT_ENCODING_GREEK = 1 +TEXT_ENCODING_CYRILLIC = 2 + +TOOLS_JM_UNIQUE_ID = 0 + +# colorspace identifiers +CS_RGB = 1 +CS_GRAY = 2 +CS_CMYK = 3 + +# PDF Blend Modes +PDF_BM_Color = "Color" +PDF_BM_ColorBurn = "ColorBurn" +PDF_BM_ColorDodge = "ColorDodge" +PDF_BM_Darken = "Darken" +PDF_BM_Difference = "Difference" +PDF_BM_Exclusion = "Exclusion" +PDF_BM_HardLight = "HardLight" +PDF_BM_Hue = "Hue" +PDF_BM_Lighten = "Lighten" +PDF_BM_Luminosity = "Luminosity" +PDF_BM_Multiply = "Multiply" +PDF_BM_Normal = "Normal" +PDF_BM_Overlay = "Overlay" +PDF_BM_Saturation = "Saturation" +PDF_BM_Screen = "Screen" +PDF_BM_SoftLight = "Softlight" + + +annot_skel = { + "goto1": lambda a, b, c, d, e: f"<>/Rect[{e}]/BS<>/Subtype/Link>>", + "goto2": lambda a, b: f"<>/Rect[{b}]/BS<>/Subtype/Link>>", + "gotor1": lambda a, b, c, d, e, f, g: f"<>>>/Rect[{g}]/BS<>/Subtype/Link>>", + "gotor2": lambda a, b, c: f"<>/Rect[{c}]/BS<>/Subtype/Link>>", + 
"launch": lambda a, b, c: f"<>>>/Rect[{c}]/BS<>/Subtype/Link>>", + "uri": lambda a, b: f"<>/Rect[{b}]/BS<>/Subtype/Link>>", + "named": lambda a, b: f"<>/Rect[{b}]/BS<>/Subtype/Link>>", + } + +class FileDataError(RuntimeError): + """Raised for documents with file structure issues.""" + pass + +class FileNotFoundError(RuntimeError): + """Raised if file does not exist.""" + pass + +class EmptyFileError(FileDataError): + """Raised when creating documents from zero-length data.""" + pass + +# propagate exception class to C-level code +#_set_FileDataError(FileDataError) + +csRGB = Colorspace(CS_RGB) +csGRAY = Colorspace(CS_GRAY) +csCMYK = Colorspace(CS_CMYK) + +# These don't appear to be visible in classic, but are used +# internally. +# +dictkey_align = "align" +dictkey_asc = "ascender" +dictkey_bidi = "bidi" +dictkey_bbox = "bbox" +dictkey_blocks = "blocks" +dictkey_bpc = "bpc" +dictkey_c = "c" +dictkey_chars = "chars" +dictkey_color = "color" +dictkey_colorspace = "colorspace" +dictkey_content = "content" +dictkey_creationDate = "creationDate" +dictkey_cs_name = "cs-name" +dictkey_da = "da" +dictkey_dashes = "dashes" +dictkey_descr = "description" +dictkey_desc = "descender" +dictkey_dir = "dir" +dictkey_effect = "effect" +dictkey_ext = "ext" +dictkey_filename = "filename" +dictkey_fill = "fill" +dictkey_flags = "flags" +dictkey_char_flags = "char_flags" +dictkey_font = "font" +dictkey_glyph = "glyph" +dictkey_height = "height" +dictkey_id = "id" +dictkey_image = "image" +dictkey_items = "items" +dictkey_length = "length" +dictkey_lines = "lines" +dictkey_matrix = "transform" +dictkey_modDate = "modDate" +dictkey_name = "name" +dictkey_number = "number" +dictkey_origin = "origin" +dictkey_rect = "rect" +dictkey_size = "size" +dictkey_smask = "smask" +dictkey_spans = "spans" +dictkey_stroke = "stroke" +dictkey_style = "style" +dictkey_subject = "subject" +dictkey_text = "text" +dictkey_title = "title" +dictkey_type = "type" +dictkey_ufilename = "ufilename" +dictkey_width = "width" +dictkey_wmode = "wmode" +dictkey_xref = "xref" +dictkey_xres = "xres" +dictkey_yres = "yres" + + +try: + from pymupdf_fonts import fontdescriptors, fontbuffers + + fitz_fontdescriptors = fontdescriptors.copy() + for k in fitz_fontdescriptors.keys(): + fitz_fontdescriptors[k]["loader"] = fontbuffers[k] + del fontdescriptors, fontbuffers +except ImportError: + fitz_fontdescriptors = {} + +symbol_glyphs = ( # Glyph list for the built-in font 'Symbol' + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (32, 0.25), + (33, 0.333), + (34, 0.713), + (35, 0.5), + (36, 0.549), + (37, 0.833), + (38, 0.778), + (39, 0.439), + (40, 0.333), + (41, 0.333), + (42, 0.5), + (43, 0.549), + (44, 0.25), + (45, 0.549), + (46, 0.25), + (47, 0.278), + (48, 0.5), + (49, 0.5), + (50, 0.5), + (51, 0.5), + (52, 0.5), + (53, 0.5), + (54, 0.5), + (55, 0.5), + (56, 0.5), + (57, 0.5), + (58, 0.278), + (59, 0.278), + (60, 0.549), + (61, 0.549), + (62, 0.549), + (63, 0.444), + (64, 0.549), + (65, 0.722), + (66, 0.667), + (67, 0.722), + (68, 0.612), + (69, 0.611), + (70, 0.763), + (71, 0.603), + (72, 0.722), + (73, 0.333), + (74, 
0.631), + (75, 0.722), + (76, 0.686), + (77, 0.889), + (78, 0.722), + (79, 0.722), + (80, 0.768), + (81, 0.741), + (82, 0.556), + (83, 0.592), + (84, 0.611), + (85, 0.69), + (86, 0.439), + (87, 0.768), + (88, 0.645), + (89, 0.795), + (90, 0.611), + (91, 0.333), + (92, 0.863), + (93, 0.333), + (94, 0.658), + (95, 0.5), + (96, 0.5), + (97, 0.631), + (98, 0.549), + (99, 0.549), + (100, 0.494), + (101, 0.439), + (102, 0.521), + (103, 0.411), + (104, 0.603), + (105, 0.329), + (106, 0.603), + (107, 0.549), + (108, 0.549), + (109, 0.576), + (110, 0.521), + (111, 0.549), + (112, 0.549), + (113, 0.521), + (114, 0.549), + (115, 0.603), + (116, 0.439), + (117, 0.576), + (118, 0.713), + (119, 0.686), + (120, 0.493), + (121, 0.686), + (122, 0.494), + (123, 0.48), + (124, 0.2), + (125, 0.48), + (126, 0.549), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (160, 0.25), + (161, 0.62), + (162, 0.247), + (163, 0.549), + (164, 0.167), + (165, 0.713), + (166, 0.5), + (167, 0.753), + (168, 0.753), + (169, 0.753), + (170, 0.753), + (171, 1.042), + (172, 0.713), + (173, 0.603), + (174, 0.987), + (175, 0.603), + (176, 0.4), + (177, 0.549), + (178, 0.411), + (179, 0.549), + (180, 0.549), + (181, 0.576), + (182, 0.494), + (183, 0.46), + (184, 0.549), + (185, 0.549), + (186, 0.549), + (187, 0.549), + (188, 1), + (189, 0.603), + (190, 1), + (191, 0.658), + (192, 0.823), + (193, 0.686), + (194, 0.795), + (195, 0.987), + (196, 0.768), + (197, 0.768), + (198, 0.823), + (199, 0.768), + (200, 0.768), + (201, 0.713), + (202, 0.713), + (203, 0.713), + (204, 0.713), + (205, 0.713), + (206, 0.713), + (207, 0.713), + (208, 0.768), + (209, 0.713), + (210, 0.79), + (211, 0.79), + (212, 0.89), + (213, 0.823), + (214, 0.549), + (215, 0.549), + (216, 0.713), + (217, 0.603), + (218, 0.603), + (219, 1.042), + (220, 0.987), + (221, 0.603), + (222, 0.987), + (223, 0.603), + (224, 0.494), + (225, 0.329), + (226, 0.79), + (227, 0.79), + (228, 0.786), + (229, 0.713), + (230, 0.384), + (231, 0.384), + (232, 0.384), + (233, 0.384), + (234, 0.384), + (235, 0.384), + (236, 0.494), + (237, 0.494), + (238, 0.494), + (239, 0.494), + (183, 0.46), + (241, 0.329), + (242, 0.274), + (243, 0.686), + (244, 0.686), + (245, 0.686), + (246, 0.384), + (247, 0.549), + (248, 0.384), + (249, 0.384), + (250, 0.384), + (251, 0.384), + (252, 0.494), + (253, 0.494), + (254, 0.494), + (183, 0.46), + ) + + +zapf_glyphs = ( # Glyph list for the built-in font 'ZapfDingbats' + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (32, 0.278), + (33, 0.974), + (34, 0.961), + (35, 0.974), + (36, 0.98), + (37, 0.719), + (38, 0.789), + (39, 0.79), + (40, 0.791), + (41, 0.69), + 
(42, 0.96), + (43, 0.939), + (44, 0.549), + (45, 0.855), + (46, 0.911), + (47, 0.933), + (48, 0.911), + (49, 0.945), + (50, 0.974), + (51, 0.755), + (52, 0.846), + (53, 0.762), + (54, 0.761), + (55, 0.571), + (56, 0.677), + (57, 0.763), + (58, 0.76), + (59, 0.759), + (60, 0.754), + (61, 0.494), + (62, 0.552), + (63, 0.537), + (64, 0.577), + (65, 0.692), + (66, 0.786), + (67, 0.788), + (68, 0.788), + (69, 0.79), + (70, 0.793), + (71, 0.794), + (72, 0.816), + (73, 0.823), + (74, 0.789), + (75, 0.841), + (76, 0.823), + (77, 0.833), + (78, 0.816), + (79, 0.831), + (80, 0.923), + (81, 0.744), + (82, 0.723), + (83, 0.749), + (84, 0.79), + (85, 0.792), + (86, 0.695), + (87, 0.776), + (88, 0.768), + (89, 0.792), + (90, 0.759), + (91, 0.707), + (92, 0.708), + (93, 0.682), + (94, 0.701), + (95, 0.826), + (96, 0.815), + (97, 0.789), + (98, 0.789), + (99, 0.707), + (100, 0.687), + (101, 0.696), + (102, 0.689), + (103, 0.786), + (104, 0.787), + (105, 0.713), + (106, 0.791), + (107, 0.785), + (108, 0.791), + (109, 0.873), + (110, 0.761), + (111, 0.762), + (112, 0.762), + (113, 0.759), + (114, 0.759), + (115, 0.892), + (116, 0.892), + (117, 0.788), + (118, 0.784), + (119, 0.438), + (120, 0.138), + (121, 0.277), + (122, 0.415), + (123, 0.392), + (124, 0.392), + (125, 0.668), + (126, 0.668), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (161, 0.732), + (162, 0.544), + (163, 0.544), + (164, 0.91), + (165, 0.667), + (166, 0.76), + (167, 0.76), + (168, 0.776), + (169, 0.595), + (170, 0.694), + (171, 0.626), + (172, 0.788), + (173, 0.788), + (174, 0.788), + (175, 0.788), + (176, 0.788), + (177, 0.788), + (178, 0.788), + (179, 0.788), + (180, 0.788), + (181, 0.788), + (182, 0.788), + (183, 0.788), + (184, 0.788), + (185, 0.788), + (186, 0.788), + (187, 0.788), + (188, 0.788), + (189, 0.788), + (190, 0.788), + (191, 0.788), + (192, 0.788), + (193, 0.788), + (194, 0.788), + (195, 0.788), + (196, 0.788), + (197, 0.788), + (198, 0.788), + (199, 0.788), + (200, 0.788), + (201, 0.788), + (202, 0.788), + (203, 0.788), + (204, 0.788), + (205, 0.788), + (206, 0.788), + (207, 0.788), + (208, 0.788), + (209, 0.788), + (210, 0.788), + (211, 0.788), + (212, 0.894), + (213, 0.838), + (214, 1.016), + (215, 0.458), + (216, 0.748), + (217, 0.924), + (218, 0.748), + (219, 0.918), + (220, 0.927), + (221, 0.928), + (222, 0.928), + (223, 0.834), + (224, 0.873), + (225, 0.828), + (226, 0.924), + (227, 0.924), + (228, 0.917), + (229, 0.93), + (230, 0.931), + (231, 0.463), + (232, 0.883), + (233, 0.836), + (234, 0.836), + (235, 0.867), + (236, 0.867), + (237, 0.696), + (238, 0.696), + (239, 0.874), + (183, 0.788), + (241, 0.874), + (242, 0.76), + (243, 0.946), + (244, 0.771), + (245, 0.865), + (246, 0.771), + (247, 0.888), + (248, 0.967), + (249, 0.888), + (250, 0.831), + (251, 0.873), + (252, 0.927), + (253, 0.97), + (183, 0.788), + (183, 0.788), + ) + + +# Functions +# + +def _read_samples( pixmap, offset, n): + # fixme: need to be able to get a sample in one call, as a Python + # bytes or similar. 
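+    # Returns the `n` samples starting at `offset` as a Python bytes object,
+    # read one byte per call via mupdf.fz_samples_get(); if the pixmap has no
+    # sample data, an empty list is returned instead to avoid a segfault.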
+ ret = [] + if not pixmap.samples(): + # mupdf.fz_samples_get() gives a segv if pixmap->samples is null. + return ret + for i in range( n): + ret.append( mupdf.fz_samples_get( pixmap, offset + i)) + return bytes( ret) + + +def _INRANGE(v, low, high): + return low <= v and v <= high + + +def _remove_dest_range(pdf, numbers): + pagecount = mupdf.pdf_count_pages(pdf) + for i in range(pagecount): + n1 = i + if n1 in numbers: + continue + + pageref = mupdf.pdf_lookup_page_obj( pdf, i) + annots = mupdf.pdf_dict_get( pageref, PDF_NAME('Annots')) + if not annots.m_internal: + continue + len_ = mupdf.pdf_array_len(annots) + for j in range(len_ - 1, -1, -1): + o = mupdf.pdf_array_get( annots, j) + if not mupdf.pdf_name_eq( mupdf.pdf_dict_get( o, PDF_NAME('Subtype')), PDF_NAME('Link')): + continue + action = mupdf.pdf_dict_get( o, PDF_NAME('A')) + dest = mupdf.pdf_dict_get( o, PDF_NAME('Dest')) + if action.m_internal: + if not mupdf.pdf_name_eq( mupdf.pdf_dict_get( action, PDF_NAME('S')), PDF_NAME('GoTo')): + continue + dest = mupdf.pdf_dict_get( action, PDF_NAME('D')) + pno = -1 + if mupdf.pdf_is_array( dest): + target = mupdf.pdf_array_get( dest, 0) + pno = mupdf.pdf_lookup_page_number( pdf, target) + elif mupdf.pdf_is_string( dest): + location, _, _ = mupdf.fz_resolve_link( pdf.super(), mupdf.pdf_to_text_string( dest)) + pno = location.page + if pno < 0: # page number lookup did not work + continue + n1 = pno + if n1 in numbers: + mupdf.pdf_array_delete( annots, j) + + +def ASSERT_PDF(cond): + assert isinstance(cond, (mupdf.PdfPage, mupdf.PdfDocument)), f'{type(cond)=} {cond=}' + if not cond.m_internal: + raise Exception(MSG_IS_NO_PDF) + + +def EMPTY_IRECT(): + return IRect(FZ_MAX_INF_RECT, FZ_MAX_INF_RECT, FZ_MIN_INF_RECT, FZ_MIN_INF_RECT) + + +def EMPTY_QUAD(): + return EMPTY_RECT().quad + + +def EMPTY_RECT(): + return Rect(FZ_MAX_INF_RECT, FZ_MAX_INF_RECT, FZ_MIN_INF_RECT, FZ_MIN_INF_RECT) + + +def ENSURE_OPERATION(pdf): + if not JM_have_operation(pdf): + raise Exception("No journalling operation started") + + +def INFINITE_IRECT(): + return IRect(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, FZ_MAX_INF_RECT) + + +def INFINITE_QUAD(): + return INFINITE_RECT().quad + + +def INFINITE_RECT(): + return Rect(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, FZ_MAX_INF_RECT) + + +def JM_BinFromBuffer(buffer_): + ''' + Turn fz_buffer into a Python bytes object + ''' + assert isinstance(buffer_, mupdf.FzBuffer) + ret = mupdf.fz_buffer_extract_copy(buffer_) + return ret + + +def JM_EscapeStrFromStr(c): + # `c` is typically from SWIG which will have converted a `const char*` from + # C into a Python `str` using `PyUnicode_DecodeUTF8(carray, static_cast< + # Py_ssize_t >(size), "surrogateescape")`. This gives us a Python `str` + # with some characters encoded as a \0xdcXY sequence, where `XY` are hex + # digits for an invalid byte in the original `const char*`. + # + # This is actually a reasonable way of representing arbitrary + # strings from C, but we want to mimic what PyMuPDF does. It uses + # `PyUnicode_DecodeRawUnicodeEscape(c, (Py_ssize_t) strlen(c), "replace")` + # which gives a string containing actual unicode characters for any invalid + # bytes. + # + # We mimic this by converting the `str` to a `bytes` with 'surrogateescape' + # to recognise \0xdcXY sequences, then convert the individual bytes into a + # `str` using `chr()`. + # + # Would be good to have a more efficient way to do this. 
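+    # Worked example: an invalid UTF-8 byte 0x80 arrives from SWIG as the
+    # lone surrogate '\udc80'; encoding with 'surrogateescape' recovers the
+    # byte b'\x80', and the loop below then appends chr(0x80) to the result.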
+ # + if c is None: + return '' + assert isinstance(c, str), f'{type(c)=}' + b = c.encode('utf8', 'surrogateescape') + ret = '' + for bb in b: + ret += chr(bb) + return ret + + +def JM_BufferFromBytes(stream): + ''' + Make fz_buffer from a PyBytes, PyByteArray or io.BytesIO object. If a text + io.BytesIO, we convert to binary by encoding as utf8. + ''' + if isinstance(stream, (bytes, bytearray)): + data = stream + elif hasattr(stream, 'getvalue'): + data = stream.getvalue() + if isinstance(data, str): + data = data.encode('utf-8') + if not isinstance(data, (bytes, bytearray)): + raise Exception(f'.getvalue() returned unexpected type: {type(data)}') + else: + return mupdf.FzBuffer() + return mupdf.fz_new_buffer_from_copied_data(data) + + +def JM_FLOAT_ITEM(obj, idx): + if not PySequence_Check(obj): + return None + return float(obj[idx]) + +def JM_INT_ITEM(obj, idx): + if idx < len(obj): + temp = obj[idx] + if isinstance(temp, (int, float)): + return 0, temp + return 1, None + + +def JM_pixmap_from_page(doc, page, ctm, cs, alpha, annots, clip): + ''' + Pixmap creation directly using a short-lived displaylist, so we can support + separations. + ''' + SPOTS_NONE = 0 + SPOTS_OVERPRINT_SIM = 1 + SPOTS_FULL = 2 + + FZ_ENABLE_SPOT_RENDERING = True # fixme: this is a build-time setting in MuPDF's config.h. + if FZ_ENABLE_SPOT_RENDERING: + spots = SPOTS_OVERPRINT_SIM + else: + spots = SPOTS_NONE + + seps = None + colorspace = cs + + matrix = JM_matrix_from_py(ctm) + rect = mupdf.fz_bound_page(page) + rclip = JM_rect_from_py(clip) + rect = mupdf.fz_intersect_rect(rect, rclip) # no-op if clip is not given + rect = mupdf.fz_transform_rect(rect, matrix) + bbox = mupdf.fz_round_rect(rect) + + # Pixmap of the document's /OutputIntents ("output intents") + oi = mupdf.fz_document_output_intent(doc) + # if present and compatible, use it instead of the parameter + if oi.m_internal: + if mupdf.fz_colorspace_n(oi) == mupdf.fz_colorspace_n(cs): + colorspace = mupdf.fz_keep_colorspace(oi) + + # check if spots rendering is available and if so use separations + if spots != SPOTS_NONE: + seps = mupdf.fz_page_separations(page) + if seps.m_internal: + n = mupdf.fz_count_separations(seps) + if spots == SPOTS_FULL: + for i in range(n): + mupdf.fz_set_separation_behavior(seps, i, mupdf.FZ_SEPARATION_SPOT) + else: + for i in range(n): + mupdf.fz_set_separation_behavior(seps, i, mupdf.FZ_SEPARATION_COMPOSITE) + elif mupdf.fz_page_uses_overprint(page): + # This page uses overprint, so we need an empty + # sep object to force the overprint simulation on. + seps = mupdf.fz_new_separations(0) + elif oi.m_internal and mupdf.fz_colorspace_n(oi) != mupdf.fz_colorspace_n(colorspace): + # We have an output intent, and it's incompatible + # with the colorspace our device needs. Force the + # overprint simulation on, because this ensures that + # we 'simulate' the output intent too. + seps = mupdf.fz_new_separations(0) + + pix = mupdf.fz_new_pixmap_with_bbox(colorspace, bbox, seps, alpha) + + if alpha: + mupdf.fz_clear_pixmap(pix) + else: + mupdf.fz_clear_pixmap_with_value(pix, 0xFF) + + dev = mupdf.fz_new_draw_device(matrix, pix) + if annots: + mupdf.fz_run_page(page, dev, mupdf.FzMatrix(), mupdf.FzCookie()) + else: + mupdf.fz_run_page_contents(page, dev, mupdf.FzMatrix(), mupdf.FzCookie()) + mupdf.fz_close_device(dev) + return pix + + +def JM_StrAsChar(x): + # fixme: should encode, but swig doesn't pass bytes to C as const char*. 
+ return x + #return x.encode('utf8') + + +def JM_TUPLE(o: typing.Sequence) -> tuple: + return tuple(map(lambda x: round(x, 5) if abs(x) >= 1e-4 else 0, o)) + + +def JM_TUPLE3(o: typing.Sequence) -> tuple: + return tuple(map(lambda x: round(x, 3) if abs(x) >= 1e-3 else 0, o)) + + +def JM_UnicodeFromStr(s): + if s is None: + return '' + if isinstance(s, bytes): + s = s.decode('utf8') + assert isinstance(s, str), f'{type(s)=} {s=}' + return s + + +def JM_add_annot_id(annot, stem): + ''' + Add a unique /NM key to an annotation or widget. + Append a number to 'stem' such that the result is a unique name. + ''' + assert isinstance(annot, mupdf.PdfAnnot) + page = _pdf_annot_page(annot) + annot_obj = mupdf.pdf_annot_obj( annot) + names = JM_get_annot_id_list(page) + i = 0 + while 1: + stem_id = f'{JM_annot_id_stem}-{stem}{i}' + if stem_id not in names: + break + i += 1 + response = JM_StrAsChar(stem_id) + name = mupdf.pdf_new_string( response, len(response)) + mupdf.pdf_dict_puts(annot_obj, "NM", name) + page.doc().m_internal.resynth_required = 0 + + +def JM_add_oc_object(pdf, ref, xref): + ''' + Add OC object reference to a dictionary + ''' + indobj = mupdf.pdf_new_indirect(pdf, xref, 0) + if not mupdf.pdf_is_dict(indobj): + RAISEPY(MSG_BAD_OC_REF, PyExc_ValueError) + type_ = mupdf.pdf_dict_get(indobj, PDF_NAME('Type')) + if (mupdf.pdf_objcmp(type_, PDF_NAME('OCG')) == 0 + or mupdf.pdf_objcmp(type_, PDF_NAME('OCMD')) == 0 + ): + mupdf.pdf_dict_put(ref, PDF_NAME('OC'), indobj) + else: + RAISEPY(MSG_BAD_OC_REF, PyExc_ValueError) + + +def JM_annot_border(annot_obj): + dash_py = list() + style = None + width = -1 + clouds = -1 + obj = None + + obj = mupdf.pdf_dict_get( annot_obj, PDF_NAME('Border')) + if mupdf.pdf_is_array( obj): + width = mupdf.pdf_to_real( mupdf.pdf_array_get( obj, 2)) + if mupdf.pdf_array_len( obj) == 4: + dash = mupdf.pdf_array_get( obj, 3) + for i in range( mupdf.pdf_array_len( dash)): + val = mupdf.pdf_to_int( mupdf.pdf_array_get( dash, i)) + dash_py.append( val) + + bs_o = mupdf.pdf_dict_get( annot_obj, PDF_NAME('BS')) + if bs_o.m_internal: + width = mupdf.pdf_to_real( mupdf.pdf_dict_get( bs_o, PDF_NAME('W'))) + style = mupdf.pdf_to_name( mupdf.pdf_dict_get( bs_o, PDF_NAME('S'))) + if style == '': + style = None + obj = mupdf.pdf_dict_get( bs_o, PDF_NAME('D')) + if obj.m_internal: + for i in range( mupdf.pdf_array_len( obj)): + val = mupdf.pdf_to_int( mupdf.pdf_array_get( obj, i)) + dash_py.append( val) + + obj = mupdf.pdf_dict_get( annot_obj, PDF_NAME('BE')) + if obj.m_internal: + clouds = mupdf.pdf_to_int( mupdf.pdf_dict_get( obj, PDF_NAME('I'))) + + res = dict() + res[ dictkey_width] = width + res[ dictkey_dashes] = tuple( dash_py) + res[ dictkey_style] = style + res[ 'clouds'] = clouds + return res + + +def JM_annot_colors(annot_obj): + res = dict() + bc = list() # stroke colors + fc =list() # fill colors + o = mupdf.pdf_dict_get(annot_obj, mupdf.PDF_ENUM_NAME_C) + if mupdf.pdf_is_array(o): + n = mupdf.pdf_array_len(o) + for i in range(n): + col = mupdf.pdf_to_real( mupdf.pdf_array_get(o, i)) + bc.append(col) + res[dictkey_stroke] = bc + + o = mupdf.pdf_dict_gets(annot_obj, "IC") + if mupdf.pdf_is_array(o): + n = mupdf.pdf_array_len(o) + for i in range(n): + col = mupdf.pdf_to_real( mupdf.pdf_array_get(o, i)) + fc.append(col) + + res[dictkey_fill] = fc + return res + + +def JM_annot_set_border( border, doc, annot_obj): + assert isinstance(border, dict) + obj = None + dashlen = 0 + nwidth = border.get( dictkey_width) # new width + ndashes = border.get( dictkey_dashes) # 
new dashes + nstyle = border.get( dictkey_style) # new style + nclouds = border.get( 'clouds', -1) # new clouds value + + # get old border properties + oborder = JM_annot_border( annot_obj) + + # delete border-related entries + mupdf.pdf_dict_del( annot_obj, PDF_NAME('BS')) + mupdf.pdf_dict_del( annot_obj, PDF_NAME('BE')) + mupdf.pdf_dict_del( annot_obj, PDF_NAME('Border')) + + # populate border items: keep old values for any omitted new ones + if nwidth < 0: + nwidth = oborder.get( dictkey_width) # no new width: keep current + if ndashes is None: + ndashes = oborder.get( dictkey_dashes) # no new dashes: keep old + if nstyle is None: + nstyle = oborder.get( dictkey_style) # no new style: keep old + if nclouds < 0: + nclouds = oborder.get( "clouds", -1) # no new clouds: keep old + + if isinstance( ndashes, tuple) and len( ndashes) > 0: + dashlen = len( ndashes) + darr = mupdf.pdf_new_array( doc, dashlen) + for d in ndashes: + mupdf.pdf_array_push_int( darr, d) + mupdf.pdf_dict_putl( annot_obj, darr, PDF_NAME('BS'), PDF_NAME('D')) + + mupdf.pdf_dict_putl( + annot_obj, + mupdf.pdf_new_real( nwidth), + PDF_NAME('BS'), + PDF_NAME('W'), + ) + + if dashlen == 0: + obj = JM_get_border_style( nstyle) + else: + obj = PDF_NAME('D') + mupdf.pdf_dict_putl( annot_obj, obj, PDF_NAME('BS'), PDF_NAME('S')) + + if nclouds > 0: + mupdf.pdf_dict_put_dict( annot_obj, PDF_NAME('BE'), 2) + obj = mupdf.pdf_dict_get( annot_obj, PDF_NAME('BE')) + mupdf.pdf_dict_put( obj, PDF_NAME('S'), PDF_NAME('C')) + mupdf.pdf_dict_put_int( obj, PDF_NAME('I'), nclouds) + + +def make_escape(ch): + if ch == 92: + return "\\u005c" + elif 32 <= ch <= 127 or ch == 10: + return chr(ch) + elif 0xd800 <= ch <= 0xdfff: # orphaned surrogate + return "\\ufffd" + elif ch <= 0xffff: + return "\\u%04x" % ch + else: + return "\\U%08x" % ch + + +def JM_append_rune(buff, ch): + """ + APPEND non-ascii runes in unicode escape format to fz_buffer. 
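+
+    Characters in the range 32..127 and newline are appended literally; the
+    backslash and all other code points are written via make_escape() as
+    unicode escape sequences, with orphaned surrogates becoming the escape
+    for U+FFFD.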
+ """ + mupdf.fz_append_string(buff, make_escape(ch)) + + +def JM_append_word(lines, buff, wbbox, block_n, line_n, word_n): + ''' + Functions for wordlist output + ''' + s = JM_EscapeStrFromBuffer(buff) + litem = ( + wbbox.x0, + wbbox.y0, + wbbox.x1, + wbbox.y1, + s, + block_n, + line_n, + word_n, + ) + lines.append(litem) + return word_n + 1, mupdf.FzRect(mupdf.FzRect.Fixed_EMPTY) # word counter + + +def JM_add_layer_config( pdf, name, creator, ON): + ''' + Add OC configuration to the PDF catalog + ''' + ocp = JM_ensure_ocproperties( pdf) + configs = mupdf.pdf_dict_get( ocp, PDF_NAME('Configs')) + if not mupdf.pdf_is_array( configs): + configs = mupdf.pdf_dict_put_array( ocp, PDF_NAME('Configs'), 1) + D = mupdf.pdf_new_dict( pdf, 5) + mupdf.pdf_dict_put_text_string( D, PDF_NAME('Name'), name) + if creator is not None: + mupdf.pdf_dict_put_text_string( D, PDF_NAME('Creator'), creator) + mupdf.pdf_dict_put( D, PDF_NAME('BaseState'), PDF_NAME('OFF')) + onarray = mupdf.pdf_dict_put_array( D, PDF_NAME('ON'), 5) + if not ON: + pass + else: + ocgs = mupdf.pdf_dict_get( ocp, PDF_NAME('OCGs')) + n = len(ON) + for i in range(n): + xref = 0 + e, xref = JM_INT_ITEM(ON, i) + if e == 1: + continue + ind = mupdf.pdf_new_indirect( pdf, xref, 0) + if mupdf.pdf_array_contains( ocgs, ind): + mupdf.pdf_array_push( onarray, ind) + mupdf.pdf_array_push( configs, D) + + +def JM_char_bbox(line, ch): + ''' + return rect of char quad + ''' + q = JM_char_quad(line, ch) + r = mupdf.fz_rect_from_quad(q) + if not line.m_internal.wmode: + return r + if r.y1 < r.y0 + ch.m_internal.size: + r.y0 = r.y1 - ch.m_internal.size + return r + + +def JM_char_font_flags(font, line, ch): + flags = 0 + if line and ch: + flags += detect_super_script(line, ch) + flags += mupdf.fz_font_is_italic(font) * TEXT_FONT_ITALIC + flags += mupdf.fz_font_is_serif(font) * TEXT_FONT_SERIFED + flags += mupdf.fz_font_is_monospaced(font) * TEXT_FONT_MONOSPACED + flags += mupdf.fz_font_is_bold(font) * TEXT_FONT_BOLD + return flags + + +def JM_char_quad(line, ch): + ''' + re-compute char quad if ascender/descender values make no sense + ''' + if 1 and g_use_extra: + # This reduces time taken to extract text from PyMuPDF.pdf from 20s to + # 15s. + return mupdf.FzQuad(extra.JM_char_quad( line.m_internal, ch.m_internal)) + + assert isinstance(line, mupdf.FzStextLine) + assert isinstance(ch, mupdf.FzStextChar) + if _globals.skip_quad_corrections: # no special handling + return ch.quad + if line.m_internal.wmode: # never touch vertical write mode + return ch.quad + font = mupdf.FzFont(mupdf.ll_fz_keep_font(ch.m_internal.font)) + asc = JM_font_ascender(font) + dsc = JM_font_descender(font) + fsize = ch.m_internal.size + asc_dsc = asc - dsc + FLT_EPSILON + if asc_dsc >= 1 and _globals.small_glyph_heights == 0: # no problem + return mupdf.FzQuad(ch.m_internal.quad) + + # Re-compute quad with adjusted ascender / descender values: + # Move ch->origin to (0,0) and de-rotate quad, then adjust the corners, + # re-rotate and move back to ch->origin location. 
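+    # In the code below, (c, s) is the line's unit direction; trm1 = (c, -s,
+    # s, c) removes that rotation and trm2 = (c, s, -s, c) restores it, so
+    # the ascender / descender adjustment is done on an axis-aligned quad.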
+ fsize = ch.m_internal.size + bbox = mupdf.fz_font_bbox(font) + fwidth = bbox.x1 - bbox.x0 + if asc < 1e-3: # probably Tesseract glyphless font + dsc = -0.1 + asc = 0.9 + asc_dsc = 1.0 + + if _globals.small_glyph_heights or asc_dsc < 1: + dsc = dsc / asc_dsc + asc = asc / asc_dsc + asc_dsc = asc - dsc + asc = asc * fsize / asc_dsc + dsc = dsc * fsize / asc_dsc + + # Re-compute quad with the adjusted ascender / descender values: + # Move ch->origin to (0,0) and de-rotate quad, then adjust the corners, + # re-rotate and move back to ch->origin location. + c = line.m_internal.dir.x # cosine + s = line.m_internal.dir.y # sine + trm1 = mupdf.fz_make_matrix(c, -s, s, c, 0, 0) # derotate + trm2 = mupdf.fz_make_matrix(c, s, -s, c, 0, 0) # rotate + if (c == -1): # left-right flip + trm1.d = 1 + trm2.d = 1 + xlate1 = mupdf.fz_make_matrix(1, 0, 0, 1, -ch.m_internal.origin.x, -ch.m_internal.origin.y) + xlate2 = mupdf.fz_make_matrix(1, 0, 0, 1, ch.m_internal.origin.x, ch.m_internal.origin.y) + + quad = mupdf.fz_transform_quad(mupdf.FzQuad(ch.m_internal.quad), xlate1) # move origin to (0,0) + quad = mupdf.fz_transform_quad(quad, trm1) # de-rotate corners + + # adjust vertical coordinates + if c == 1 and quad.ul.y > 0: # up-down flip + quad.ul.y = asc + quad.ur.y = asc + quad.ll.y = dsc + quad.lr.y = dsc + else: + quad.ul.y = -asc + quad.ur.y = -asc + quad.ll.y = -dsc + quad.lr.y = -dsc + + # adjust horizontal coordinates that are too crazy: + # (1) left x must be >= 0 + # (2) if bbox width is 0, lookup char advance in font. + if quad.ll.x < 0: + quad.ll.x = 0 + quad.ul.x = 0 + + cwidth = quad.lr.x - quad.ll.x + if cwidth < FLT_EPSILON: + glyph = mupdf.fz_encode_character( font, ch.m_internal.c) + if glyph: + fwidth = mupdf.fz_advance_glyph( font, glyph, line.m_internal.wmode) + quad.lr.x = quad.ll.x + fwidth * fsize + quad.ur.x = quad.lr.x + + quad = mupdf.fz_transform_quad(quad, trm2) # rotate back + quad = mupdf.fz_transform_quad(quad, xlate2) # translate back + return quad + + +def JM_choice_options(annot): + ''' + return list of choices for list or combo boxes + ''' + annot_obj = mupdf.pdf_annot_obj( annot.this) + + opts = mupdf.pdf_choice_widget_options2( annot, 0) + n = len( opts) + if n == 0: + return # wrong widget type + + optarr = mupdf.pdf_dict_get( annot_obj, PDF_NAME('Opt')) + liste = [] + + for i in range( n): + m = mupdf.pdf_array_len( mupdf.pdf_array_get( optarr, i)) + if m == 2: + val = ( + mupdf.pdf_to_text_string( mupdf.pdf_array_get( mupdf.pdf_array_get( optarr, i), 0)), + mupdf.pdf_to_text_string( mupdf.pdf_array_get( mupdf.pdf_array_get( optarr, i), 1)), + ) + liste.append( val) + else: + val = mupdf.pdf_to_text_string( mupdf.pdf_array_get( optarr, i)) + liste.append( val) + return liste + + +def JM_clear_pixmap_rect_with_value(dest, value, b): + ''' + Clear a pixmap rectangle - my version also supports non-alpha pixmaps + ''' + b = mupdf.fz_intersect_irect(b, mupdf.fz_pixmap_bbox(dest)) + w = b.x1 - b.x0 + y = b.y1 - b.y0 + if w <= 0 or y <= 0: + return 0 + + destspan = dest.stride() + destp = destspan * (b.y0 - dest.y()) + dest.n() * (b.x0 - dest.x()) + + # CMYK needs special handling (and potentially any other subtractive colorspaces) + if mupdf.fz_colorspace_n(dest.colorspace()) == 4: + value = 255 - value + while 1: + s = destp + for x in range(0, w): + mupdf.fz_samples_set(dest, s, 0) + s += 1 + mupdf.fz_samples_set(dest, s, 0) + s += 1 + mupdf.fz_samples_set(dest, s, 0) + s += 1 + mupdf.fz_samples_set(dest, s, value) + s += 1 + if dest.alpha(): + mupdf.fz_samples_set(dest, 
s, 255) + s += 1 + destp += destspan + if y == 0: + break + y -= 1 + return 1 + + while 1: + s = destp + for x in range(w): + for k in range(dest.n()-1): + mupdf.fz_samples_set(dest, s, value) + s += 1 + if dest.alpha(): + mupdf.fz_samples_set(dest, s, 255) + s += 1 + else: + mupdf.fz_samples_set(dest, s, value) + s += 1 + destp += destspan + if y == 0: + break + y -= 1 + return 1 + + +def JM_color_FromSequence(color): + + if isinstance(color, (int, float)): # maybe just a single float + color = [color] + + if not isinstance( color, (list, tuple)): + return -1, [] + + if len(color) not in (0, 1, 3, 4): + return -1, [] + + ret = color[:] + for i in range(len(ret)): + if ret[i] < 0 or ret[i] > 1: + ret[i] = 1 + return len(ret), ret + + +def JM_color_count( pm, clip): + if g_use_extra: + return extra.ll_JM_color_count(pm.m_internal, clip) + + rc = dict() + cnt = 0 + irect = mupdf.fz_pixmap_bbox( pm) + irect = mupdf.fz_intersect_irect(irect, mupdf.fz_round_rect(JM_rect_from_py(clip))) + stride = pm.stride() + width = irect.x1 - irect.x0 + height = irect.y1 - irect.y0 + n = pm.n() + substride = width * n + s = stride * (irect.y0 - pm.y()) + (irect.x0 - pm.x()) * n + oldpix = _read_samples( pm, s, n) + cnt = 0 + if mupdf.fz_is_empty_irect(irect): + return rc + for i in range( height): + for j in range( 0, substride, n): + newpix = _read_samples( pm, s + j, n) + if newpix != oldpix: + pixel = oldpix + c = rc.get( pixel, None) + if c is not None: + cnt += c + rc[ pixel] = cnt + cnt = 1 + oldpix = newpix + else: + cnt += 1 + s += stride + pixel = oldpix + c = rc.get( pixel) + if c is not None: + cnt += c + rc[ pixel] = cnt + return rc + + +def JM_compress_buffer(inbuffer): + ''' + compress char* into a new buffer + ''' + data, compressed_length = mupdf.fz_new_deflated_data_from_buffer( + inbuffer, + mupdf.FZ_DEFLATE_BEST, + ) + #log( '{=data compressed_length}') + if not data or compressed_length == 0: + return None + buf = mupdf.FzBuffer(mupdf.fz_new_buffer_from_data(data, compressed_length)) + mupdf.fz_resize_buffer(buf, compressed_length) + return buf + + +def JM_copy_rectangle(page, area): + need_new_line = 0 + buffer = io.StringIO() + for block in page: + if block.m_internal.type != mupdf.FZ_STEXT_BLOCK_TEXT: + continue + for line in block: + line_had_text = 0 + for ch in line: + r = JM_char_bbox(line, ch) + if JM_rects_overlap(area, r): + line_had_text = 1 + if need_new_line: + buffer.write("\n") + need_new_line = 0 + buffer.write(make_escape(ch.m_internal.c)) + if line_had_text: + need_new_line = 1 + + s = buffer.getvalue() # take over the data + return s + + +def JM_convert_to_pdf(doc, fp, tp, rotate): + ''' + Convert any MuPDF document to a PDF + Returns bytes object containing the PDF, created via 'write' function. + ''' + pdfout = mupdf.PdfDocument() + incr = 1 + s = fp + e = tp + if fp > tp: + incr = -1 # count backwards + s = tp # adjust ... + e = fp # ... 
range + rot = JM_norm_rotation(rotate) + i = fp + while 1: # interpret & write document pages as PDF pages + if not _INRANGE(i, s, e): + break + page = mupdf.fz_load_page(doc, i) + mediabox = mupdf.fz_bound_page(page) + dev, resources, contents = mupdf.pdf_page_write(pdfout, mediabox) + mupdf.fz_run_page(page, dev, mupdf.FzMatrix(), mupdf.FzCookie()) + mupdf.fz_close_device(dev) + dev = None + page_obj = mupdf.pdf_add_page(pdfout, mediabox, rot, resources, contents) + mupdf.pdf_insert_page(pdfout, -1, page_obj) + i += incr + # PDF created - now write it to Python bytearray + # prepare write options structure + opts = mupdf.PdfWriteOptions() + opts.do_garbage = 4 + opts.do_compress = 1 + opts.do_compress_images = 1 + opts.do_compress_fonts = 1 + opts.do_sanitize = 1 + opts.do_incremental = 0 + opts.do_ascii = 0 + opts.do_decompress = 0 + opts.do_linear = 0 + opts.do_clean = 1 + opts.do_pretty = 0 + + res = mupdf.fz_new_buffer(8192) + out = mupdf.FzOutput(res) + mupdf.pdf_write_document(pdfout, out, opts) + out.fz_close_output() + c = mupdf.fz_buffer_extract_copy(res) + assert isinstance(c, bytes) + return c + + +# Copied from MuPDF v1.14 +# Create widget +def JM_create_widget(doc, page, type, fieldname): + old_sigflags = mupdf.pdf_to_int(mupdf.pdf_dict_getp(mupdf.pdf_trailer(doc), "Root/AcroForm/SigFlags")) + #log( '*** JM_create_widget()') + #log( f'{mupdf.pdf_create_annot_raw=}') + #log( f'{page=}') + #log( f'{mupdf.PDF_ANNOT_WIDGET=}') + annot = mupdf.pdf_create_annot_raw(page, mupdf.PDF_ANNOT_WIDGET) + annot_obj = mupdf.pdf_annot_obj(annot) + try: + JM_set_field_type(doc, annot_obj, type) + mupdf.pdf_dict_put_text_string(annot_obj, PDF_NAME('T'), fieldname) + + if type == mupdf.PDF_WIDGET_TYPE_SIGNATURE: + sigflags = old_sigflags | (SigFlag_SignaturesExist | SigFlag_AppendOnly) + mupdf.pdf_dict_putl( + mupdf.pdf_trailer(doc), + mupdf.pdf_new_int(sigflags), + PDF_NAME('Root'), + PDF_NAME('AcroForm'), + PDF_NAME('SigFlags'), + ) + # pdf_create_annot will have linked the new widget into the page's + # annot array. 
We also need it linked into the document's form + form = mupdf.pdf_dict_getp(mupdf.pdf_trailer(doc), "Root/AcroForm/Fields") + if not form.m_internal: + form = mupdf.pdf_new_array(doc, 1) + mupdf.pdf_dict_putl( + mupdf.pdf_trailer(doc), + form, + PDF_NAME('Root'), + PDF_NAME('AcroForm'), + PDF_NAME('Fields'), + ) + mupdf.pdf_array_push(form, annot_obj) # Cleanup relies on this statement being last + except Exception: + if g_exceptions_verbose: exception_info() + mupdf.pdf_delete_annot(page, annot) + + if type == mupdf.PDF_WIDGET_TYPE_SIGNATURE: + mupdf.pdf_dict_putl( + mupdf.pdf_trailer(doc), + mupdf.pdf_new_int(old_sigflags), + PDF_NAME('Root'), + PDF_NAME('AcroForm'), + PDF_NAME('SigFlags'), + ) + raise + return annot + + +def JM_cropbox(page_obj): + ''' + return a PDF page's CropBox + ''' + if g_use_extra: + return extra.JM_cropbox(page_obj) + + mediabox = JM_mediabox(page_obj) + cropbox = mupdf.pdf_to_rect( + mupdf.pdf_dict_get_inheritable(page_obj, PDF_NAME('CropBox')) + ) + if mupdf.fz_is_infinite_rect(cropbox) or mupdf.fz_is_empty_rect(cropbox): + cropbox = mediabox + y0 = mediabox.y1 - cropbox.y1 + y1 = mediabox.y1 - cropbox.y0 + cropbox.y0 = y0 + cropbox.y1 = y1 + return cropbox + + +def JM_cropbox_size(page_obj): + rect = JM_cropbox(page_obj) + w = abs(rect.x1 - rect.x0) + h = abs(rect.y1 - rect.y0) + size = mupdf.fz_make_point(w, h) + return size + + +def JM_derotate_page_matrix(page): + ''' + just the inverse of rotation + ''' + mp = JM_rotate_page_matrix(page) + return mupdf.fz_invert_matrix(mp) + + +def JM_embed_file( + pdf, + buf, + filename, + ufilename, + desc, + compress, + ): + ''' + embed a new file in a PDF (not only /EmbeddedFiles entries) + ''' + len_ = 0 + val = mupdf.pdf_new_dict(pdf, 6) + mupdf.pdf_dict_put_dict(val, PDF_NAME('CI'), 4) + ef = mupdf.pdf_dict_put_dict(val, PDF_NAME('EF'), 4) + mupdf.pdf_dict_put_text_string(val, PDF_NAME('F'), filename) + mupdf.pdf_dict_put_text_string(val, PDF_NAME('UF'), ufilename) + mupdf.pdf_dict_put_text_string(val, PDF_NAME('Desc'), desc) + mupdf.pdf_dict_put(val, PDF_NAME('Type'), PDF_NAME('Filespec')) + bs = b' ' + f = mupdf.pdf_add_stream( + pdf, + #mupdf.fz_fz_new_buffer_from_copied_data(bs), + mupdf.fz_new_buffer_from_copied_data(bs), + mupdf.PdfObj(), + 0, + ) + mupdf.pdf_dict_put(ef, PDF_NAME('F'), f) + JM_update_stream(pdf, f, buf, compress) + len_, _ = mupdf.fz_buffer_storage(buf) + mupdf.pdf_dict_put_int(f, PDF_NAME('DL'), len_) + mupdf.pdf_dict_put_int(f, PDF_NAME('Length'), len_) + params = mupdf.pdf_dict_put_dict(f, PDF_NAME('Params'), 4) + mupdf.pdf_dict_put_int(params, PDF_NAME('Size'), len_) + return val + + +def JM_embedded_clean(pdf): + ''' + perform some cleaning if we have /EmbeddedFiles: + (1) remove any /Limits if /Names exists + (2) remove any empty /Collection + (3) set /PageMode/UseAttachments + ''' + root = mupdf.pdf_dict_get( mupdf.pdf_trailer( pdf), PDF_NAME('Root')) + + # remove any empty /Collection entry + coll = mupdf.pdf_dict_get(root, PDF_NAME('Collection')) + if coll.m_internal and mupdf.pdf_dict_len(coll) == 0: + mupdf.pdf_dict_del(root, PDF_NAME('Collection')) + + efiles = mupdf.pdf_dict_getl( + root, + PDF_NAME('Names'), + PDF_NAME('EmbeddedFiles'), + PDF_NAME('Names'), + ) + if efiles.m_internal: + mupdf.pdf_dict_put_name(root, PDF_NAME('PageMode'), "UseAttachments") + + +def JM_EscapeStrFromBuffer(buff): + if not buff.m_internal: + return '' + s = mupdf.fz_buffer_extract_copy(buff) + val = PyUnicode_DecodeRawUnicodeEscape(s, errors='replace') + return val + + +def 
JM_ensure_identity(pdf): + ''' + Store ID in PDF trailer + ''' + id_ = mupdf.pdf_dict_get( mupdf.pdf_trailer(pdf), PDF_NAME('ID')) + if not id_.m_internal: + rnd0 = mupdf.fz_memrnd2(16) + # Need to convert raw bytes into a str to send to + # mupdf.pdf_new_string(). chr() seems to work for this. + rnd = '' + for i in rnd0: + rnd += chr(i) + id_ = mupdf.pdf_dict_put_array( mupdf.pdf_trailer( pdf), PDF_NAME('ID'), 2) + mupdf.pdf_array_push( id_, mupdf.pdf_new_string( rnd, len(rnd))) + mupdf.pdf_array_push( id_, mupdf.pdf_new_string( rnd, len(rnd))) + +def JM_ensure_ocproperties(pdf): + ''' + Ensure OCProperties, return /OCProperties key + ''' + ocp = mupdf.pdf_dict_get(mupdf.pdf_dict_get(mupdf.pdf_trailer(pdf), PDF_NAME('Root')), PDF_NAME('OCProperties')) + if ocp.m_internal: + return ocp + root = mupdf.pdf_dict_get(mupdf.pdf_trailer(pdf), PDF_NAME('Root')) + ocp = mupdf.pdf_dict_put_dict(root, PDF_NAME('OCProperties'), 2) + mupdf.pdf_dict_put_array(ocp, PDF_NAME('OCGs'), 0) + D = mupdf.pdf_dict_put_dict(ocp, PDF_NAME('D'), 5) + mupdf.pdf_dict_put_array(D, PDF_NAME('ON'), 0) + mupdf.pdf_dict_put_array(D, PDF_NAME('OFF'), 0) + mupdf.pdf_dict_put_array(D, PDF_NAME('Order'), 0) + mupdf.pdf_dict_put_array(D, PDF_NAME('RBGroups'), 0) + return ocp + + +def JM_expand_fname(name): + ''' + Make /DA string of annotation + ''' + if not name: return "Helv" + if name.startswith("Co"): return "Cour" + if name.startswith("co"): return "Cour" + if name.startswith("Ti"): return "TiRo" + if name.startswith("ti"): return "TiRo" + if name.startswith("Sy"): return "Symb" + if name.startswith("sy"): return "Symb" + if name.startswith("Za"): return "ZaDb" + if name.startswith("za"): return "ZaDb" + return "Helv" + + +def JM_field_type_text(wtype): + ''' + String from widget type + ''' + if wtype == mupdf.PDF_WIDGET_TYPE_BUTTON: + return "Button" + if wtype == mupdf.PDF_WIDGET_TYPE_CHECKBOX: + return "CheckBox" + if wtype == mupdf.PDF_WIDGET_TYPE_RADIOBUTTON: + return "RadioButton" + if wtype == mupdf.PDF_WIDGET_TYPE_TEXT: + return "Text" + if wtype == mupdf.PDF_WIDGET_TYPE_LISTBOX: + return "ListBox" + if wtype == mupdf.PDF_WIDGET_TYPE_COMBOBOX: + return "ComboBox" + if wtype == mupdf.PDF_WIDGET_TYPE_SIGNATURE: + return "Signature" + return "unknown" + + +def JM_fill_pixmap_rect_with_color(dest, col, b): + assert isinstance(dest, mupdf.FzPixmap) + # fill a rect with a color tuple + b = mupdf.fz_intersect_irect(b, mupdf.fz_pixmap_bbox( dest)) + w = b.x1 - b.x0 + y = b.y1 - b.y0 + if w <= 0 or y <= 0: + return 0 + destspan = dest.stride() + destp = destspan * (b.y0 - dest.y()) + dest.n() * (b.x0 - dest.x()) + while 1: + s = destp + for x in range(w): + for i in range( dest.n()): + mupdf.fz_samples_set(dest, s, col[i]) + s += 1 + destp += destspan + y -= 1 + if y == 0: + break + return 1 + + +def JM_find_annot_irt(annot): + ''' + Return the first annotation whose /IRT key ("In Response To") points to + annot. Used to remove the response chain of a given annotation. 
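+    Returns None if no annotation refers to 'annot' in this way.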
+ ''' + assert isinstance(annot, mupdf.PdfAnnot) + irt_annot = None # returning this + annot_obj = mupdf.pdf_annot_obj(annot) + found = 0 + # loop thru MuPDF's internal annots array + page = _pdf_annot_page(annot) + irt_annot = mupdf.pdf_first_annot(page) + while 1: + assert isinstance(irt_annot, mupdf.PdfAnnot) + if not irt_annot.m_internal: + break + irt_annot_obj = mupdf.pdf_annot_obj(irt_annot) + o = mupdf.pdf_dict_gets(irt_annot_obj, 'IRT') + if o.m_internal: + if not mupdf.pdf_objcmp(o, annot_obj): + found = 1 + break + irt_annot = mupdf.pdf_next_annot(irt_annot) + if found: + return irt_annot + + +def JM_font_ascender(font): + ''' + need own versions of ascender / descender + ''' + assert isinstance(font, mupdf.FzFont) + if _globals.skip_quad_corrections: + return 0.8 + return mupdf.fz_font_ascender(font) + + +def JM_font_descender(font): + ''' + need own versions of ascender / descender + ''' + assert isinstance(font, mupdf.FzFont) + if _globals.skip_quad_corrections: + return -0.2 + ret = mupdf.fz_font_descender(font) + return ret + + +def JM_is_word_delimiter(ch, delimiters): + """Check if ch is an extra word delimiting character. + """ + if (0 + or ch <= 32 + or ch == 160 + or 0x202a <= ch <= 0x202e + ): + # covers any whitespace plus unicodes that switch between + # right-to-left and left-to-right languages + return True + if not delimiters: # no extra delimiters provided + return False + char = chr(ch) + for d in delimiters: + if d == char: + return True + return False + + +def JM_is_rtl_char(ch): + if ch < 0x590 or ch > 0x900: + return False + return True + + +def JM_font_name(font): + assert isinstance(font, mupdf.FzFont) + name = mupdf.fz_font_name(font) + s = name.find('+') + if _globals.subset_fontnames or s == -1 or s != 6: + return name + return name[s + 1:] + + +def JM_gather_fonts(pdf, dict_, fontlist, stream_xref): + rc = 1 + n = mupdf.pdf_dict_len(dict_) + for i in range(n): + + refname = mupdf.pdf_dict_get_key(dict_, i) + fontdict = mupdf.pdf_dict_get_val(dict_, i) + if not mupdf.pdf_is_dict(fontdict): + mupdf.fz_warn( f"'{mupdf.pdf_to_name(refname)}' is no font dict ({mupdf.pdf_to_num(fontdict)} 0 R)") + continue + + subtype = mupdf.pdf_dict_get(fontdict, mupdf.PDF_ENUM_NAME_Subtype) + basefont = mupdf.pdf_dict_get(fontdict, mupdf.PDF_ENUM_NAME_BaseFont) + if not basefont.m_internal or mupdf.pdf_is_null(basefont): + name = mupdf.pdf_dict_get(fontdict, mupdf.PDF_ENUM_NAME_Name) + else: + name = basefont + encoding = mupdf.pdf_dict_get(fontdict, mupdf.PDF_ENUM_NAME_Encoding) + if mupdf.pdf_is_dict(encoding): + encoding = mupdf.pdf_dict_get(encoding, mupdf.PDF_ENUM_NAME_BaseEncoding) + xref = mupdf.pdf_to_num(fontdict) + ext = "n/a" + if xref: + ext = JM_get_fontextension(pdf, xref) + entry = ( + xref, + ext, + mupdf.pdf_to_name(subtype), + JM_EscapeStrFromStr(mupdf.pdf_to_name(name)), + mupdf.pdf_to_name(refname), + mupdf.pdf_to_name(encoding), + stream_xref, + ) + fontlist.append(entry) + return rc + + +def JM_gather_forms(doc, dict_: mupdf.PdfObj, imagelist, stream_xref: int): + ''' + Store info of a /Form xobject in Python list + ''' + assert isinstance(doc, mupdf.PdfDocument) + rc = 1 + n = mupdf.pdf_dict_len(dict_) + for i in range(n): + refname = mupdf.pdf_dict_get_key( dict_, i) + imagedict = mupdf.pdf_dict_get_val(dict_, i) + if not mupdf.pdf_is_dict(imagedict): + mupdf.fz_warn( f"'{mupdf.pdf_to_name(refname)}' is no form dict ({mupdf.pdf_to_num(imagedict)} 0 R)") + continue + + type_ = mupdf.pdf_dict_get(imagedict, PDF_NAME('Subtype')) + if not 
mupdf.pdf_name_eq(type_, PDF_NAME('Form')): + continue + + o = mupdf.pdf_dict_get(imagedict, PDF_NAME('BBox')) + m = mupdf.pdf_dict_get(imagedict, PDF_NAME('Matrix')) + if m.m_internal: + mat = mupdf.pdf_to_matrix(m) + else: + mat = mupdf.FzMatrix() + if o.m_internal: + bbox = mupdf.fz_transform_rect( mupdf.pdf_to_rect(o), mat) + else: + bbox = mupdf.FzRect(mupdf.FzRect.Fixed_INFINITE) + xref = mupdf.pdf_to_num(imagedict) + + entry = ( + xref, + mupdf.pdf_to_name( refname), + stream_xref, + JM_py_from_rect(bbox), + ) + imagelist.append(entry) + return rc + + +def JM_gather_images(doc: mupdf.PdfDocument, dict_: mupdf.PdfObj, imagelist, stream_xref: int): + ''' + Store info of an image in Python list + ''' + rc = 1 + n = mupdf.pdf_dict_len( dict_) + for i in range(n): + refname = mupdf.pdf_dict_get_key(dict_, i) + imagedict = mupdf.pdf_dict_get_val(dict_, i) + if not mupdf.pdf_is_dict(imagedict): + mupdf.fz_warn(f"'{mupdf.pdf_to_name(refname)}' is no image dict ({mupdf.pdf_to_num(imagedict)} 0 R)") + continue + + type_ = mupdf.pdf_dict_get(imagedict, PDF_NAME('Subtype')) + if not mupdf.pdf_name_eq(type_, PDF_NAME('Image')): + continue + + xref = mupdf.pdf_to_num(imagedict) + gen = 0 + smask = mupdf.pdf_dict_geta(imagedict, PDF_NAME('SMask'), PDF_NAME('Mask')) + if smask.m_internal: + gen = mupdf.pdf_to_num(smask) + + filter_ = mupdf.pdf_dict_geta(imagedict, PDF_NAME('Filter'), PDF_NAME('F')) + if mupdf.pdf_is_array(filter_): + filter_ = mupdf.pdf_array_get(filter_, 0) + + altcs = mupdf.PdfObj(0) + cs = mupdf.pdf_dict_geta(imagedict, PDF_NAME('ColorSpace'), PDF_NAME('CS')) + if mupdf.pdf_is_array(cs): + cses = cs + cs = mupdf.pdf_array_get(cses, 0) + if (mupdf.pdf_name_eq(cs, PDF_NAME('DeviceN')) + or mupdf.pdf_name_eq(cs, PDF_NAME('Separation')) + ): + altcs = mupdf.pdf_array_get(cses, 2) + if mupdf.pdf_is_array(altcs): + altcs = mupdf.pdf_array_get(altcs, 0) + width = mupdf.pdf_dict_geta(imagedict, PDF_NAME('Width'), PDF_NAME('W')) + height = mupdf.pdf_dict_geta(imagedict, PDF_NAME('Height'), PDF_NAME('H')) + bpc = mupdf.pdf_dict_geta(imagedict, PDF_NAME('BitsPerComponent'), PDF_NAME('BPC')) + + entry = ( + xref, + gen, + mupdf.pdf_to_int(width), + mupdf.pdf_to_int(height), + mupdf.pdf_to_int(bpc), + JM_EscapeStrFromStr(mupdf.pdf_to_name(cs)), + JM_EscapeStrFromStr(mupdf.pdf_to_name(altcs)), + JM_EscapeStrFromStr(mupdf.pdf_to_name(refname)), + JM_EscapeStrFromStr(mupdf.pdf_to_name(filter_)), + stream_xref, + ) + imagelist.append(entry) + return rc + + +def JM_get_annot_by_xref(page, xref): + ''' + retrieve annot by its xref + ''' + assert isinstance(page, mupdf.PdfPage) + found = 0 + # loop thru MuPDF's internal annots array + annot = mupdf.pdf_first_annot(page) + while 1: + if not annot.m_internal: + break + if xref == mupdf.pdf_to_num(mupdf.pdf_annot_obj(annot)): + found = 1 + break + annot = mupdf.pdf_next_annot( annot) + if not found: + raise Exception("xref %d is not an annot of this page" % xref) + return annot + + +def JM_get_annot_by_name(page, name): + ''' + retrieve annot by name (/NM key) + ''' + assert isinstance(page, mupdf.PdfPage) + if not name: + return + found = 0 + # loop thru MuPDF's internal annots and widget arrays + annot = mupdf.pdf_first_annot(page) + while 1: + if not annot.m_internal: + break + + response, len_ = mupdf.pdf_to_string(mupdf.pdf_dict_gets(mupdf.pdf_annot_obj(annot), "NM")) + if name == response: + found = 1 + break + annot = mupdf.pdf_next_annot(annot) + if not found: + raise Exception("'%s' is not an annot of this page" % name) + return annot + + 
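+# Illustrative sketch of the two lookup helpers above (placeholder values):
+# both expect a mupdf.PdfPage and raise if no matching annotation exists.
+#
+#   annot = JM_get_annot_by_xref(pdf_page, 42)          # lookup by xref number
+#   annot = JM_get_annot_by_name(pdf_page, "my-note")   # lookup by /NM string
+
+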
+def JM_get_annot_id_list(page): + names = [] + annots = mupdf.pdf_dict_get( page.obj(), mupdf.PDF_ENUM_NAME_Annots) + if not annots.m_internal: + return names + for i in range( mupdf.pdf_array_len(annots)): + annot_obj = mupdf.pdf_array_get(annots, i) + name = mupdf.pdf_dict_gets(annot_obj, "NM") + if name.m_internal: + names.append( + mupdf.pdf_to_text_string(name) + ) + return names + +def JM_get_annot_xref_list( page_obj): + ''' + return the xrefs and /NM ids of a page's annots, links and fields + ''' + if g_use_extra: + names = extra.JM_get_annot_xref_list( page_obj) + return names + + names = [] + annots = mupdf.pdf_dict_get( page_obj, PDF_NAME('Annots')) + n = mupdf.pdf_array_len( annots) + for i in range( n): + annot_obj = mupdf.pdf_array_get( annots, i) + xref = mupdf.pdf_to_num( annot_obj) + subtype = mupdf.pdf_dict_get( annot_obj, PDF_NAME('Subtype')) + if not subtype.m_internal: + continue # subtype is required + type_ = mupdf.pdf_annot_type_from_string( mupdf.pdf_to_name( subtype)) + if type_ == mupdf.PDF_ANNOT_UNKNOWN: + continue # only accept valid annot types + id_ = mupdf.pdf_dict_gets( annot_obj, "NM") + names.append( (xref, type_, mupdf.pdf_to_text_string( id_))) + return names + + +def JM_get_annot_xref_list2(page): + page = page._pdf_page(required=False) + if not page.m_internal: + return list() + return JM_get_annot_xref_list( page.obj()) + + +def JM_get_border_style(style): + ''' + return pdf_obj "border style" from Python str + ''' + val = mupdf.PDF_ENUM_NAME_S + if style is None: + return val + s = style + if s.startswith("b") or s.startswith("B"): val = mupdf.PDF_ENUM_NAME_B + elif s.startswith("d") or s.startswith("D"): val = mupdf.PDF_ENUM_NAME_D + elif s.startswith("i") or s.startswith("I"): val = mupdf.PDF_ENUM_NAME_I + elif s.startswith("u") or s.startswith("U"): val = mupdf.PDF_ENUM_NAME_U + elif s.startswith("s") or s.startswith("S"): val = mupdf.PDF_ENUM_NAME_S + return val + + +def JM_get_font( + fontname, + fontfile, + fontbuffer, + script, + lang, + ordering, + is_bold, + is_italic, + is_serif, + embed, + ): + ''' + return a fz_font from a number of parameters + ''' + def fertig(font): + if not font.m_internal: + raise RuntimeError(MSG_FONT_FAILED) + # if font allows this, set embedding + if not font.m_internal.flags.never_embed: + mupdf.fz_set_font_embedding(font, embed) + return font + + index = 0 + font = None + if fontfile: + #goto have_file; + font = mupdf.fz_new_font_from_file( None, fontfile, index, 0) + return fertig(font) + + if fontbuffer: + #goto have_buffer; + res = JM_BufferFromBytes(fontbuffer) + font = mupdf.fz_new_font_from_buffer( None, res, index, 0) + return fertig(font) + + if ordering > -1: + # goto have_cjk; + font = mupdf.fz_new_cjk_font(ordering) + return fertig(font) + + if fontname: + # goto have_base14; + # Base-14 or a MuPDF builtin font + font = mupdf.fz_new_base14_font(fontname) + if font.m_internal: + return fertig(font) + font = mupdf.fz_new_builtin_font(fontname, is_bold, is_italic) + return fertig(font) + + # Check for NOTO font + #have_noto:; + data, size, index = mupdf.fz_lookup_noto_font( script, lang) + font = None + if data: + font = mupdf.fz_new_font_from_memory( None, data, size, index, 0) + if font.m_internal: + return fertig(font) + font = mupdf.fz_load_fallback_font( script, lang, is_serif, is_bold, is_italic) + return fertig(font) + + +def JM_get_fontbuffer(doc, xref): + ''' + Return the contents of a font file, identified by xref + ''' + if xref < 1: + return + o = mupdf.pdf_load_object(doc, xref) + desft = 
mupdf.pdf_dict_get(o, PDF_NAME('DescendantFonts')) + if desft.m_internal: + obj = mupdf.pdf_resolve_indirect(mupdf.pdf_array_get(desft, 0)) + obj = mupdf.pdf_dict_get(obj, PDF_NAME('FontDescriptor')) + else: + obj = mupdf.pdf_dict_get(o, PDF_NAME('FontDescriptor')) + + if not obj.m_internal: + message(f"invalid font - FontDescriptor missing") + return + + o = obj + + stream = None + + obj = mupdf.pdf_dict_get(o, PDF_NAME('FontFile')) + if obj.m_internal: + stream = obj # ext = "pfa" + + obj = mupdf.pdf_dict_get(o, PDF_NAME('FontFile2')) + if obj.m_internal: + stream = obj # ext = "ttf" + + obj = mupdf.pdf_dict_get(o, PDF_NAME('FontFile3')) + if obj.m_internal: + stream = obj + + obj = mupdf.pdf_dict_get(obj, PDF_NAME('Subtype')) + if obj.m_internal and not mupdf.pdf_is_name(obj): + message("invalid font descriptor subtype") + return + + if mupdf.pdf_name_eq(obj, PDF_NAME('Type1C')): + pass # Prev code did: ext = "cff", but this has no effect. + elif mupdf.pdf_name_eq(obj, PDF_NAME('CIDFontType0C')): + pass # Prev code did: ext = "cid", but this has no effect. + elif mupdf.pdf_name_eq(obj, PDF_NAME('OpenType')): + pass # Prev code did: ext = "otf", but this has no effect. */ + else: + message('warning: unhandled font type {pdf_to_name(ctx, obj)!r}') + + if not stream: + message('warning: unhandled font type') + return + + return mupdf.pdf_load_stream(stream) + + +def JM_get_resource_properties(ref): + ''' + Return the items of Resources/Properties (used for Marked Content) + Argument may be e.g. a page object or a Form XObject + ''' + properties = mupdf.pdf_dict_getl(ref, PDF_NAME('Resources'), PDF_NAME('Properties')) + if not properties.m_internal: + return () + else: + n = mupdf.pdf_dict_len(properties) + if n < 1: + return () + rc = [] + for i in range(n): + key = mupdf.pdf_dict_get_key(properties, i) + val = mupdf.pdf_dict_get_val(properties, i) + c = mupdf.pdf_to_name(key) + xref = mupdf.pdf_to_num(val) + rc.append((c, xref)) + return rc + + +def JM_get_widget_by_xref( page, xref): + ''' + retrieve widget by its xref + ''' + found = False + annot = mupdf.pdf_first_widget( page) + while annot.m_internal: + annot_obj = mupdf.pdf_annot_obj( annot) + if xref == mupdf.pdf_to_num( annot_obj): + found = True + break + annot = mupdf.pdf_next_widget( annot) + if not found: + raise Exception( f"xref {xref} is not a widget of this page") + return Annot( annot) + + +def JM_get_widget_properties(annot, Widget): + ''' + Populate a Python Widget object with the values from a PDF form field. + Called by "Page.first_widget" and "Widget.next". + ''' + #log( '{type(annot)=}') + annot_obj = mupdf.pdf_annot_obj(annot.this) + #log( 'Have called mupdf.pdf_annot_obj()') + page = _pdf_annot_page(annot.this) + pdf = page.doc() + tw = annot + + def SETATTR(key, value): + setattr(Widget, key, value) + + def SETATTR_DROP(mod, key, value): + # Original C code for this function deletes if PyObject* is NULL. We + # don't have a representation for that in Python - e.g. None is not + # represented by NULL. 
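+        # The attribute is therefore always set, even when 'value' is None.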
+ setattr(mod, key, value) + + #log( '=== + mupdf.pdf_widget_type(tw)') + field_type = mupdf.pdf_widget_type(tw.this) + #log( '=== - mupdf.pdf_widget_type(tw)') + Widget.field_type = field_type + if field_type == mupdf.PDF_WIDGET_TYPE_SIGNATURE: + if mupdf.pdf_signature_is_signed(pdf, annot_obj): + SETATTR("is_signed", True) + else: + SETATTR("is_signed",False) + else: + SETATTR("is_signed", None) + SETATTR_DROP(Widget, "border_style", JM_UnicodeFromStr(mupdf.pdf_field_border_style(annot_obj))) + SETATTR_DROP(Widget, "field_type_string", JM_UnicodeFromStr(JM_field_type_text(field_type))) + + field_name = mupdf.pdf_load_field_name(annot_obj) + SETATTR_DROP(Widget, "field_name", field_name) + + def pdf_dict_get_inheritable_nonempty_label(node, key): + ''' + This is a modified version of MuPDF's pdf_dict_get_inheritable(), with + some changes: + * Returns string from pdf_to_text_string() or None if not found. + * Recurses to parent if current node exists but with empty string + value. + ''' + slow = node + halfbeat = 11 # Don't start moving slow pointer for a while. + while 1: + if not node.m_internal: + return + val = mupdf.pdf_dict_get(node, key) + if val.m_internal: + label = mupdf.pdf_to_text_string(val) + if label: + return label + node = mupdf.pdf_dict_get(node, PDF_NAME('Parent')) + if node.m_internal == slow.m_internal: + raise Exception("cycle in resources") + halfbeat -= 1 + if halfbeat == 0: + slow = mupdf.pdf_dict_get(slow, PDF_NAME('Parent')) + halfbeat = 2 + + # In order to address #3950, we use our modified pdf_dict_get_inheritable() + # to ignore empty-string child values. + label = pdf_dict_get_inheritable_nonempty_label(annot_obj, PDF_NAME('TU')) + if label is not None: + SETATTR_DROP(Widget, "field_label", label) + + fvalue = None + if field_type == mupdf.PDF_WIDGET_TYPE_RADIOBUTTON: + obj = mupdf.pdf_dict_get( annot_obj, PDF_NAME('Parent')) # owning RB group + if obj.m_internal: + SETATTR_DROP(Widget, "rb_parent", mupdf.pdf_to_num( obj)) + obj = mupdf.pdf_dict_get(annot_obj, PDF_NAME('AS')) + if obj.m_internal: + fvalue = mupdf.pdf_to_name(obj) + if not fvalue: + fvalue = mupdf.pdf_field_value(annot_obj) + SETATTR_DROP(Widget, "field_value", JM_UnicodeFromStr(fvalue)) + + SETATTR_DROP(Widget, "field_display", mupdf.pdf_field_display(annot_obj)) + + border_width = mupdf.pdf_to_real(mupdf.pdf_dict_getl(annot_obj, PDF_NAME('BS'), PDF_NAME('W'))) + if border_width == 0: + border_width = 1 + SETATTR_DROP(Widget, "border_width", border_width) + + obj = mupdf.pdf_dict_getl(annot_obj, PDF_NAME('BS'), PDF_NAME('D')) + if mupdf.pdf_is_array(obj): + n = mupdf.pdf_array_len(obj) + d = [0] * n + for i in range(n): + d[i] = mupdf.pdf_to_int(mupdf.pdf_array_get(obj, i)) + SETATTR_DROP(Widget, "border_dashes", d) + + SETATTR_DROP(Widget, "text_maxlen", mupdf.pdf_text_widget_max_len(tw.this)) + + SETATTR_DROP(Widget, "text_format", mupdf.pdf_text_widget_format(tw.this)) + + obj = mupdf.pdf_dict_getl(annot_obj, PDF_NAME('MK'), PDF_NAME('BG')) + if mupdf.pdf_is_array(obj): + n = mupdf.pdf_array_len(obj) + col = [0] * n + for i in range(n): + col[i] = mupdf.pdf_to_real(mupdf.pdf_array_get(obj, i)) + SETATTR_DROP(Widget, "fill_color", col) + + obj = mupdf.pdf_dict_getl(annot_obj, PDF_NAME('MK'), PDF_NAME('BC')) + if mupdf.pdf_is_array(obj): + n = mupdf.pdf_array_len(obj) + col = [0] * n + for i in range(n): + col[i] = mupdf.pdf_to_real(mupdf.pdf_array_get(obj, i)) + SETATTR_DROP(Widget, "border_color", col) + + SETATTR_DROP(Widget, "choice_values", JM_choice_options(annot)) + + da = 
mupdf.pdf_to_text_string(mupdf.pdf_dict_get_inheritable(annot_obj, PDF_NAME('DA'))) + SETATTR_DROP(Widget, "_text_da", JM_UnicodeFromStr(da)) + + obj = mupdf.pdf_dict_getl(annot_obj, PDF_NAME('MK'), PDF_NAME('CA')) + if obj.m_internal: + SETATTR_DROP(Widget, "button_caption", JM_UnicodeFromStr(mupdf.pdf_to_text_string(obj))) + + SETATTR_DROP(Widget, "field_flags", mupdf.pdf_field_flags(annot_obj)) + + # call Py method to reconstruct text color, font name, size + Widget._parse_da() + + # extract JavaScript action texts + s = mupdf.pdf_dict_get(annot_obj, PDF_NAME('A')) + ss = JM_get_script(s) + SETATTR_DROP(Widget, "script", ss) + + SETATTR_DROP(Widget, "script_stroke", + JM_get_script(mupdf.pdf_dict_getl(annot_obj, PDF_NAME('AA'), PDF_NAME('K'))) + ) + + SETATTR_DROP(Widget, "script_format", + JM_get_script(mupdf.pdf_dict_getl(annot_obj, PDF_NAME('AA'), PDF_NAME('F'))) + ) + + SETATTR_DROP(Widget, "script_change", + JM_get_script(mupdf.pdf_dict_getl(annot_obj, PDF_NAME('AA'), PDF_NAME('V'))) + ) + + SETATTR_DROP(Widget, "script_calc", + JM_get_script(mupdf.pdf_dict_getl(annot_obj, PDF_NAME('AA'), PDF_NAME('C'))) + ) + + SETATTR_DROP(Widget, "script_blur", + JM_get_script(mupdf.pdf_dict_getl(annot_obj, PDF_NAME('AA'), mupdf.pdf_new_name('Bl'))) + ) + + SETATTR_DROP(Widget, "script_focus", + JM_get_script(mupdf.pdf_dict_getl(annot_obj, PDF_NAME('AA'), mupdf.pdf_new_name('Fo'))) + ) + + +def JM_get_fontextension(doc, xref): + ''' + Return the file extension of a font file, identified by xref + ''' + if xref < 1: + return "n/a" + o = mupdf.pdf_load_object(doc, xref) + desft = mupdf.pdf_dict_get(o, PDF_NAME('DescendantFonts')) + if desft.m_internal: + obj = mupdf.pdf_resolve_indirect(mupdf.pdf_array_get(desft, 0)) + obj = mupdf.pdf_dict_get(obj, PDF_NAME('FontDescriptor')) + else: + obj = mupdf.pdf_dict_get(o, PDF_NAME('FontDescriptor')) + if not obj.m_internal: + return "n/a" # this is a base-14 font + + o = obj # we have the FontDescriptor + + obj = mupdf.pdf_dict_get(o, PDF_NAME('FontFile')) + if obj.m_internal: + return "pfa" + + obj = mupdf.pdf_dict_get(o, PDF_NAME('FontFile2')) + if obj.m_internal: + return "ttf" + + obj = mupdf.pdf_dict_get(o, PDF_NAME('FontFile3')) + if obj.m_internal: + obj = mupdf.pdf_dict_get(obj, PDF_NAME('Subtype')) + if obj.m_internal and not mupdf.pdf_is_name(obj): + message("invalid font descriptor subtype") + return "n/a" + if mupdf.pdf_name_eq(obj, PDF_NAME('Type1C')): + return "cff" + elif mupdf.pdf_name_eq(obj, PDF_NAME('CIDFontType0C')): + return "cid" + elif mupdf.pdf_name_eq(obj, PDF_NAME('OpenType')): + return "otf" + else: + message("unhandled font type '%s'", mupdf.pdf_to_name(obj)) + + return "n/a" + + +def JM_get_ocg_arrays_imp(arr): + ''' + Get OCG arrays from OC configuration + Returns dict {"basestate":name, "on":list, "off":list, "rbg":list, "locked":list} + ''' + list_ = list() + if mupdf.pdf_is_array( arr): + n = mupdf.pdf_array_len( arr) + for i in range(n): + obj = mupdf.pdf_array_get( arr, i) + item = mupdf.pdf_to_num( obj) + if item not in list_: + list_.append(item) + return list_ + + +def JM_get_ocg_arrays(conf): + + rc = dict() + arr = mupdf.pdf_dict_get( conf, PDF_NAME('ON')) + list_ = JM_get_ocg_arrays_imp( arr) + if list_: + rc["on"] = list_ + arr = mupdf.pdf_dict_get( conf, PDF_NAME('OFF')) + list_ = JM_get_ocg_arrays_imp( arr) + if list_: + rc["off"] = list_ + arr = mupdf.pdf_dict_get( conf, PDF_NAME('Locked')) + list_ = JM_get_ocg_arrays_imp( arr) + if list_: + rc['locked'] = list_ + list_ = list() + arr = mupdf.pdf_dict_get( conf, 
PDF_NAME('RBGroups')) + if mupdf.pdf_is_array( arr): + n = mupdf.pdf_array_len( arr) + for i in range(n): + obj = mupdf.pdf_array_get( arr, i) + list1 = JM_get_ocg_arrays_imp( obj) + list_.append(list1) + if list_: + rc["rbgroups"] = list_ + obj = mupdf.pdf_dict_get( conf, PDF_NAME('BaseState')) + + if obj.m_internal: + state = mupdf.pdf_to_name( obj) + rc["basestate"] = state + return rc + + +def JM_get_page_labels(liste, nums): + n = mupdf.pdf_array_len(nums) + for i in range(0, n, 2): + key = mupdf.pdf_resolve_indirect( mupdf.pdf_array_get(nums, i)) + pno = mupdf.pdf_to_int(key) + val = mupdf.pdf_resolve_indirect( mupdf.pdf_array_get(nums, i + 1)) + res = JM_object_to_buffer(val, 1, 0) + c = mupdf.fz_buffer_extract(res) + assert isinstance(c, bytes) + c = c.decode('utf-8') + liste.append( (pno, c)) + + +def JM_get_script(key): + ''' + JavaScript extractor + Returns either the script source or None. Parameter is a PDF action + dictionary, which must have keys /S and /JS. The value of /S must be + '/JavaScript'. The value of /JS is returned. + ''' + if not key.m_internal: + return + + j = mupdf.pdf_dict_get(key, PDF_NAME('S')) + jj = mupdf.pdf_to_name(j) + if jj == "JavaScript": + js = mupdf.pdf_dict_get(key, PDF_NAME('JS')) + if not js.m_internal: + return + else: + return + + if mupdf.pdf_is_string(js): + script = JM_UnicodeFromStr(mupdf.pdf_to_text_string(js)) + elif mupdf.pdf_is_stream(js): + res = mupdf.pdf_load_stream(js) + script = JM_EscapeStrFromBuffer(res) + else: + return + if script: # do not return an empty script + return script + return + + +def JM_have_operation(pdf): + ''' + Ensure valid journalling state + ''' + if pdf.m_internal.journal and not mupdf.pdf_undoredo_step(pdf, 0): + return 0 + return 1 + + +def JM_image_extension(type_): + ''' + return extension for MuPDF image type + ''' + if type_ == mupdf.FZ_IMAGE_FAX: return "fax" + if type_ == mupdf.FZ_IMAGE_RAW: return "raw" + if type_ == mupdf.FZ_IMAGE_FLATE: return "flate" + if type_ == mupdf.FZ_IMAGE_LZW: return "lzw" + if type_ == mupdf.FZ_IMAGE_RLD: return "rld" + if type_ == mupdf.FZ_IMAGE_BMP: return "bmp" + if type_ == mupdf.FZ_IMAGE_GIF: return "gif" + if type_ == mupdf.FZ_IMAGE_JBIG2: return "jb2" + if type_ == mupdf.FZ_IMAGE_JPEG: return "jpeg" + if type_ == mupdf.FZ_IMAGE_JPX: return "jpx" + if type_ == mupdf.FZ_IMAGE_JXR: return "jxr" + if type_ == mupdf.FZ_IMAGE_PNG: return "png" + if type_ == mupdf.FZ_IMAGE_PNM: return "pnm" + if type_ == mupdf.FZ_IMAGE_TIFF: return "tiff" + #if type_ == mupdf.FZ_IMAGE_PSD: return "psd" + return "n/a" + + +# fixme: need to avoid using a global for this. +g_img_info = None + + +def JM_image_filter(opaque, ctm, name, image): + assert isinstance(ctm, mupdf.FzMatrix) + r = mupdf.FzRect(mupdf.FzRect.Fixed_UNIT) + q = mupdf.fz_transform_quad( mupdf.fz_quad_from_rect(r), ctm) + q = mupdf.fz_transform_quad( q, g_img_info_matrix) + temp = name, JM_py_from_quad(q) + g_img_info.append(temp) + + +def JM_image_profile( imagedata, keep_image): + ''' + Return basic properties of an image provided as bytes or bytearray + The function creates an fz_image and optionally returns it. 
+ ''' + if not imagedata: + return None # nothing given + + len_ = len( imagedata) + if len_ < 8: + message( "bad image data") + return None + c = imagedata + #log( 'calling mfz_recognize_image_format with {c!r=}') + type_ = mupdf.fz_recognize_image_format( c) + if type_ == mupdf.FZ_IMAGE_UNKNOWN: + return None + + if keep_image: + res = mupdf.fz_new_buffer_from_copied_data( c, len_) + else: + res = mupdf.fz_new_buffer_from_shared_data( c, len_) + image = mupdf.fz_new_image_from_buffer( res) + ctm = mupdf.fz_image_orientation_matrix( image) + xres, yres = mupdf.fz_image_resolution(image) + orientation = mupdf.fz_image_orientation( image) + cs_name = mupdf.fz_colorspace_name( image.colorspace()) + result = dict() + result[ dictkey_width] = image.w() + result[ dictkey_height] = image.h() + result[ "orientation"] = orientation + result[ dictkey_matrix] = JM_py_from_matrix(ctm) + result[ dictkey_xres] = xres + result[ dictkey_yres] = yres + result[ dictkey_colorspace] = image.n() + result[ dictkey_bpc] = image.bpc() + result[ dictkey_ext] = JM_image_extension(type_) + result[ dictkey_cs_name] = cs_name + + if keep_image: + result[ dictkey_image] = image + return result + + +def JM_image_reporter(page): + doc = page.doc() + global g_img_info_matrix + g_img_info_matrix = mupdf.FzMatrix() + mediabox = mupdf.FzRect() + mupdf.pdf_page_transform(page, mediabox, g_img_info_matrix) + + class SanitizeFilterOptions(mupdf.PdfSanitizeFilterOptions2): + def __init__(self): + super().__init__() + self.use_virtual_image_filter() + def image_filter(self, ctx, ctm, name, image, scissor): + JM_image_filter(None, mupdf.FzMatrix(ctm), name, image) + + sanitize_filter_options = SanitizeFilterOptions() + + filter_options = _make_PdfFilterOptions( + instance_forms=1, + ascii=1, + no_update=1, + sanitize=1, + sopts=sanitize_filter_options, + ) + + global g_img_info + g_img_info = [] + + mupdf.pdf_filter_page_contents( doc, page, filter_options) + + rc = tuple(g_img_info) + g_img_info = [] + return rc + + +def JM_fitz_config(): + have_TOFU = not hasattr(mupdf, 'TOFU') + have_TOFU_BASE14 = not hasattr(mupdf, 'TOFU_BASE14') + have_TOFU_CJK = not hasattr(mupdf, 'TOFU_CJK') + have_TOFU_CJK_EXT = not hasattr(mupdf, 'TOFU_CJK_EXT') + have_TOFU_CJK_LANG = not hasattr(mupdf, 'TOFU_CJK_LANG') + have_TOFU_EMOJI = not hasattr(mupdf, 'TOFU_EMOJI') + have_TOFU_HISTORIC = not hasattr(mupdf, 'TOFU_HISTORIC') + have_TOFU_SIL = not hasattr(mupdf, 'TOFU_SIL') + have_TOFU_SYMBOL = not hasattr(mupdf, 'TOFU_SYMBOL') + + ret = dict() + ret["base14"] = have_TOFU_BASE14 + ret["cbz"] = bool(mupdf.FZ_ENABLE_CBZ) + ret["epub"] = bool(mupdf.FZ_ENABLE_EPUB) + ret["html"] = bool(mupdf.FZ_ENABLE_HTML) + ret["icc"] = bool(mupdf.FZ_ENABLE_ICC) + ret["img"] = bool(mupdf.FZ_ENABLE_IMG) + ret["jpx"] = bool(mupdf.FZ_ENABLE_JPX) + ret["js"] = bool(mupdf.FZ_ENABLE_JS) + ret["pdf"] = bool(mupdf.FZ_ENABLE_PDF) + ret["plotter-cmyk"] = bool(mupdf.FZ_PLOTTERS_CMYK) + ret["plotter-g"] = bool(mupdf.FZ_PLOTTERS_G) + ret["plotter-n"] = bool(mupdf.FZ_PLOTTERS_N) + ret["plotter-rgb"] = bool(mupdf.FZ_PLOTTERS_RGB) + ret["py-memory"] = bool(JM_MEMORY) + ret["svg"] = bool(mupdf.FZ_ENABLE_SVG) + ret["tofu"] = have_TOFU + ret["tofu-cjk"] = have_TOFU_CJK + ret["tofu-cjk-ext"] = have_TOFU_CJK_EXT + ret["tofu-cjk-lang"] = have_TOFU_CJK_LANG + ret["tofu-emoji"] = have_TOFU_EMOJI + ret["tofu-historic"] = have_TOFU_HISTORIC + ret["tofu-sil"] = have_TOFU_SIL + ret["tofu-symbol"] = have_TOFU_SYMBOL + ret["xps"] = bool(mupdf.FZ_ENABLE_XPS) + return ret + + +def 
JM_insert_contents(pdf, pageref, newcont, overlay): + ''' + Insert a buffer as a new separate /Contents object of a page. + 1. Create a new stream object from buffer 'newcont' + 2. If /Contents already is an array, then just prepend or append this object + 3. Else, create new array and put old content obj and this object into it. + If the page had no /Contents before, just create a 1-item array. + ''' + contents = mupdf.pdf_dict_get(pageref, PDF_NAME('Contents')) + newconts = mupdf.pdf_add_stream(pdf, newcont, mupdf.PdfObj(), 0) + xref = mupdf.pdf_to_num(newconts) + if mupdf.pdf_is_array(contents): + if overlay: # append new object + mupdf.pdf_array_push(contents, newconts) + else: # prepend new object + mupdf.pdf_array_insert(contents, newconts, 0) + else: + carr = mupdf.pdf_new_array(pdf, 5) + if overlay: + if contents.m_internal: + mupdf.pdf_array_push(carr, contents) + mupdf.pdf_array_push(carr, newconts) + else: + mupdf.pdf_array_push(carr, newconts) + if contents.m_internal: + mupdf.pdf_array_push(carr, contents) + mupdf.pdf_dict_put(pageref, PDF_NAME('Contents'), carr) + return xref + + +def JM_insert_font(pdf, bfname, fontfile, fontbuffer, set_simple, idx, wmode, serif, encoding, ordering): + ''' + Insert a font in a PDF + ''' + font = None + res = None + data = None + ixref = 0 + index = 0 + simple = 0 + value=None + name=None + subt=None + exto = None + + ENSURE_OPERATION(pdf) + # check for CJK font + if ordering > -1: + data, size, index = mupdf.fz_lookup_cjk_font(ordering) + if data: + font = mupdf.fz_new_font_from_memory(None, data, size, index, 0) + font_obj = mupdf.pdf_add_cjk_font(pdf, font, ordering, wmode, serif) + exto = "n/a" + simple = 0 + #goto weiter; + else: + + # check for PDF Base-14 font + if bfname: + data, size = mupdf.fz_lookup_base14_font(bfname) + if data: + font = mupdf.fz_new_font_from_memory(bfname, data, size, 0, 0) + font_obj = mupdf.pdf_add_simple_font(pdf, font, encoding) + exto = "n/a" + simple = 1 + #goto weiter; + + else: + if fontfile: + font = mupdf.fz_new_font_from_file(None, fontfile, idx, 0) + else: + res = JM_BufferFromBytes(fontbuffer) + if not res.m_internal: + RAISEPY(MSG_FILE_OR_BUFFER, PyExc_ValueError) + font = mupdf.fz_new_font_from_buffer(None, res, idx, 0) + + if not set_simple: + font_obj = mupdf.pdf_add_cid_font(pdf, font) + simple = 0 + else: + font_obj = mupdf.pdf_add_simple_font(pdf, font, encoding) + simple = 2 + #weiter: ; + ixref = mupdf.pdf_to_num(font_obj) + name = JM_EscapeStrFromStr( mupdf.pdf_to_name( mupdf.pdf_dict_get(font_obj, PDF_NAME('BaseFont')))) + + subt = JM_UnicodeFromStr( mupdf.pdf_to_name( mupdf.pdf_dict_get( font_obj, PDF_NAME('Subtype')))) + + if not exto: + exto = JM_UnicodeFromStr(JM_get_fontextension(pdf, ixref)) + + asc = mupdf.fz_font_ascender(font) + dsc = mupdf.fz_font_descender(font) + value = [ + ixref, + { + "name": name, # base font name + "type": subt, # subtype + "ext": exto, # file extension + "simple": bool(simple), # simple font? + "ordering": ordering, # CJK font? + "ascender": asc, + "descender": dsc, + }, + ] + return value + +def JM_irect_from_py(r): + ''' + PySequence to mupdf.FzIrect. Default: infinite irect + ''' + if isinstance(r, mupdf.FzIrect): + return r + if isinstance(r, IRect): + r = mupdf.FzIrect( r.x0, r.y0, r.x1, r.y1) + return r + if isinstance(r, Rect): + ret = mupdf.FzRect(r.x0, r.y0, r.x1, r.y1) + ret = mupdf.FzIrect(ret) # Uses fz_irect_from_rect(). + return ret + if isinstance(r, mupdf.FzRect): + ret = mupdf.FzIrect(r) # Uses fz_irect_from_rect(). 
+ return ret + if not r or not PySequence_Check(r) or PySequence_Size(r) != 4: + return mupdf.FzIrect(mupdf.fz_infinite_irect) + f = [0, 0, 0, 0] + for i in range(4): + f[i] = r[i] + if f[i] is None: + return mupdf.FzIrect(mupdf.fz_infinite_irect) + if f[i] < FZ_MIN_INF_RECT: + f[i] = FZ_MIN_INF_RECT + if f[i] > FZ_MAX_INF_RECT: + f[i] = FZ_MAX_INF_RECT + return mupdf.fz_make_irect(f[0], f[1], f[2], f[3]) + +def JM_listbox_value( annot): + ''' + ListBox retrieve value + ''' + # may be single value or array + annot_obj = mupdf.pdf_annot_obj( annot) + optarr = mupdf.pdf_dict_get( annot_obj, PDF_NAME('V')) + if mupdf.pdf_is_string( optarr): # a single string + return mupdf.pdf_to_text_string( optarr) + + # value is an array (may have len 0) + n = mupdf.pdf_array_len( optarr) + liste = [] + + # extract a list of strings + # each entry may again be an array: take second entry then + for i in range( n): + elem = mupdf.pdf_array_get( optarr, i) + if mupdf.pdf_is_array( elem): + elem = mupdf.pdf_array_get( elem, 1) + liste.append( JM_UnicodeFromStr( mupdf.pdf_to_text_string( elem))) + return liste + + +def JM_make_annot_DA(annot, ncol, col, fontname, fontsize): + # PyMuPDF uses a fz_buffer to build up the string, but it's non-trivial to + # convert the fz_buffer's `unsigned char*` into a `const char*` suitable + # for passing to pdf_dict_put_text_string(). So instead we build up the + # string directly in Python. + buf = '' + if ncol < 1: + buf += f'0 g ' + elif ncol == 1: + buf += f'{col[0]:g} g ' + elif ncol == 2: + assert 0 + elif ncol == 3: + buf += f'{col[0]:g} {col[1]:g} {col[2]:g} rg ' + else: + buf += f'{col[0]:g} {col[1]:g} {col[2]:g} {col[3]:g} k ' + buf += f'/{JM_expand_fname(fontname)} {fontsize} Tf' + mupdf.pdf_dict_put_text_string(mupdf.pdf_annot_obj(annot), mupdf.PDF_ENUM_NAME_DA, buf) + + +def JM_make_spanlist(line_dict, line, raw, buff, tp_rect): + if g_use_extra: + return extra.JM_make_spanlist(line_dict, line, raw, buff, tp_rect) + char_list = None + span_list = [] + mupdf.fz_clear_buffer(buff) + span_rect = mupdf.FzRect(mupdf.FzRect.Fixed_EMPTY) + line_rect = mupdf.FzRect(mupdf.FzRect.Fixed_EMPTY) + + class char_style: + def __init__(self, rhs=None): + if rhs: + self.size = rhs.size + self.flags = rhs.flags + if mupdf_version_tuple >= (1, 25, 2): + self.char_flags = rhs.char_flags + self.font = rhs.font + self.argb = rhs.argb + self.asc = rhs.asc + self.desc = rhs.desc + self.bidi = rhs.bidi + else: + self.size = -1 + self.flags = -1 + if mupdf_version_tuple >= (1, 25, 2): + self.char_flags = -1 + self.font = '' + self.argb = -1 + self.asc = 0 + self.desc = 0 + self.bidi = 0 + def __str__(self): + ret = f'{self.size} {self.flags}' + if mupdf_version_tuple >= (1, 25, 2): + ret += f' {self.char_flags}' + ret += f' {self.font} {self.color} {self.asc} {self.desc}' + return ret + + old_style = char_style() + style = char_style() + span = None + span_origin = None + + for ch in line: + # start-trace + r = JM_char_bbox(line, ch) + if (not JM_rects_overlap(tp_rect, r) + and not mupdf.fz_is_infinite_rect(tp_rect) + ): + continue + + # Info from: + # detect_super_script() + # fz_font_is_italic() + # fz_font_is_serif() + # fz_font_is_monospaced() + # fz_font_is_bold() + + flags = JM_char_font_flags(mupdf.FzFont(mupdf.ll_fz_keep_font(ch.m_internal.font)), line, ch) + origin = mupdf.FzPoint(ch.m_internal.origin) + style.size = ch.m_internal.size + style.flags = flags + if mupdf_version_tuple >= (1, 25, 2): + # FZ_STEXT_SYNTHETIC is per-char, not per-span. 
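+            # Masking the bit out prevents a change in the synthetic flag alone
+            # from starting a new span; in raw mode the flag is still reported
+            # per character via the 'synthetic' key further below.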
+ style.char_flags = ch.m_internal.flags & ~mupdf.FZ_STEXT_SYNTHETIC + style.font = JM_font_name(mupdf.FzFont(mupdf.ll_fz_keep_font(ch.m_internal.font))) + style.argb = ch.m_internal.argb + style.asc = JM_font_ascender(mupdf.FzFont(mupdf.ll_fz_keep_font(ch.m_internal.font))) + style.desc = JM_font_descender(mupdf.FzFont(mupdf.ll_fz_keep_font(ch.m_internal.font))) + style.bidi = ch.m_internal.bidi + + if (style.size != old_style.size + or style.flags != old_style.flags + or (mupdf_version_tuple >= (1, 25, 2) + and (style.char_flags != old_style.char_flags) + ) + or style.argb != old_style.argb + or style.font != old_style.font + or style.bidi != old_style.bidi + ): + if old_style.size >= 0: + # not first one, output previous + if raw: + # put character list in the span + span[dictkey_chars] = char_list + char_list = None + else: + # put text string in the span + span[dictkey_text] = JM_EscapeStrFromBuffer( buff) + mupdf.fz_clear_buffer(buff) + + span[dictkey_origin] = JM_py_from_point(span_origin) + span[dictkey_bbox] = JM_py_from_rect(span_rect) + line_rect = mupdf.fz_union_rect(line_rect, span_rect) + span_list.append( span) + span = None + + span = dict() + asc = style.asc + desc = style.desc + if style.asc < 1e-3: + asc = 0.9 + desc = -0.1 + + span[dictkey_size] = style.size + span[dictkey_flags] = style.flags + span[dictkey_bidi] = style.bidi + if mupdf_version_tuple >= (1, 25, 2): + span[dictkey_char_flags] = style.char_flags + span[dictkey_font] = JM_EscapeStrFromStr(style.font) + span[dictkey_color] = style.argb & 0xffffff + if mupdf_version_tuple >= (1, 25, 0): + span['alpha'] = style.argb >> 24 + span["ascender"] = asc + span["descender"] = desc + + # Need to be careful here - doing 'old_style=style' does a shallow + # copy, but we need to keep old_style as a distinct instance. + old_style = char_style(style) + span_rect = r + span_origin = origin + + span_rect = mupdf.fz_union_rect(span_rect, r) + + if raw: # make and append a char dict + char_dict = dict() + char_dict[dictkey_origin] = JM_py_from_point( ch.m_internal.origin) + char_dict[dictkey_bbox] = JM_py_from_rect(r) + char_dict[dictkey_c] = chr(ch.m_internal.c) + char_dict['synthetic'] = bool(ch.m_internal.flags & mupdf.FZ_STEXT_SYNTHETIC) + + if char_list is None: + char_list = [] + char_list.append(char_dict) + else: # add character byte to buffer + JM_append_rune(buff, ch.m_internal.c) + + # all characters processed, now flush remaining span + if span: + if raw: + span[dictkey_chars] = char_list + char_list = None + else: + span[dictkey_text] = JM_EscapeStrFromBuffer(buff) + mupdf.fz_clear_buffer(buff) + span[dictkey_origin] = JM_py_from_point(span_origin) + span[dictkey_bbox] = JM_py_from_rect(span_rect) + + if not mupdf.fz_is_empty_rect(span_rect): + span_list.append(span) + line_rect = mupdf.fz_union_rect(line_rect, span_rect) + span = None + if not mupdf.fz_is_empty_rect(line_rect): + line_dict[dictkey_spans] = span_list + else: + line_dict[dictkey_spans] = span_list + return line_rect + +def _make_image_dict(img, img_dict): + """Populate a dictionary with information extracted from a given image. + + Used by 'Document.extract_image' and by 'JM_make_image_block'. + Both of these functions will add some more specific information. 
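+    Images without a usable compressed buffer (or of unsupported type) are
+    re-encoded as PNG; CMYK JPEGs are re-encoded as JPEG with inverted colors.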
+ """ + img_type = img.fz_compressed_image_type() + ext = JM_image_extension(img_type) + + # compressed image buffer if present, else None + ll_cbuf = mupdf.ll_fz_compressed_image_buffer(img.m_internal) + + if (0 + or not ll_cbuf + or img_type in (mupdf.FZ_IMAGE_JBIG2, mupdf.FZ_IMAGE_UNKNOWN) + or img_type < mupdf.FZ_IMAGE_BMP + ): + # not an image with a compressed buffer: convert to PNG + res = mupdf.fz_new_buffer_from_image_as_png( + img, + mupdf.FzColorParams(mupdf.fz_default_color_params), + ) + ext = "png" + elif ext == "jpeg" and img.n() == 4: + # JPEG with CMYK: invert colors + res = mupdf.fz_new_buffer_from_image_as_jpeg( + img, mupdf.FzColorParams(mupdf.fz_default_color_params), 95, 1) + else: + # copy the compressed buffer + res = mupdf.FzBuffer(mupdf.ll_fz_keep_buffer(ll_cbuf.buffer)) + + bytes_ = JM_BinFromBuffer(res) + img_dict[dictkey_width] = img.w() + img_dict[dictkey_height] = img.h() + img_dict[dictkey_ext] = ext + img_dict[dictkey_colorspace] = img.n() + img_dict[dictkey_xres] = img.xres() + img_dict[dictkey_yres] = img.yres() + img_dict[dictkey_bpc] = img.bpc() + img_dict[dictkey_size] = len(bytes_) + img_dict[dictkey_image] = bytes_ + +def JM_make_image_block(block, block_dict): + img = block.i_image() + _make_image_dict(img, block_dict) + # if the image has a mask, store it as a PNG buffer + mask = img.mask() + if mask.m_internal: + buff = mask.fz_new_buffer_from_image_as_png(mupdf.FzColorParams(mupdf.fz_default_color_params)) + block_dict["mask"] = buff.fz_buffer_extract() + else: + block_dict["mask"] = None + block_dict[dictkey_matrix] = JM_py_from_matrix(block.i_transform()) + + +def JM_make_text_block(block, block_dict, raw, buff, tp_rect): + if g_use_extra: + return extra.JM_make_text_block(block.m_internal, block_dict, raw, buff.m_internal, tp_rect.m_internal) + line_list = [] + block_rect = mupdf.FzRect(mupdf.FzRect.Fixed_EMPTY) + #log(f'{block=}') + for line in block: + #log(f'{line=}') + if (mupdf.fz_is_empty_rect(mupdf.fz_intersect_rect(tp_rect, mupdf.FzRect(line.m_internal.bbox))) + and not mupdf.fz_is_infinite_rect(tp_rect) + ): + continue + line_dict = dict() + line_rect = JM_make_spanlist(line_dict, line, raw, buff, tp_rect) + block_rect = mupdf.fz_union_rect(block_rect, line_rect) + line_dict[dictkey_wmode] = line.m_internal.wmode + line_dict[dictkey_dir] = JM_py_from_point(line.m_internal.dir) + line_dict[dictkey_bbox] = JM_py_from_rect(line_rect) + line_list.append(line_dict) + block_dict[dictkey_bbox] = JM_py_from_rect(block_rect) + block_dict[dictkey_lines] = line_list + + +def JM_make_textpage_dict(tp, page_dict, raw): + if g_use_extra: + return extra.JM_make_textpage_dict(tp.m_internal, page_dict, raw) + text_buffer = mupdf.fz_new_buffer(128) + block_list = [] + tp_rect = mupdf.FzRect(tp.m_internal.mediabox) + block_n = -1 + #log( 'JM_make_textpage_dict {=tp}') + for block in tp: + block_n += 1 + if (not mupdf.fz_contains_rect(tp_rect, mupdf.FzRect(block.m_internal.bbox)) + and not mupdf.fz_is_infinite_rect(tp_rect) + and block.m_internal.type == mupdf.FZ_STEXT_BLOCK_IMAGE + ): + continue + if (not mupdf.fz_is_infinite_rect(tp_rect) + and mupdf.fz_is_empty_rect(mupdf.fz_intersect_rect(tp_rect, mupdf.FzRect(block.m_internal.bbox))) + ): + continue + + block_dict = dict() + block_dict[dictkey_number] = block_n + block_dict[dictkey_type] = block.m_internal.type + if block.m_internal.type == mupdf.FZ_STEXT_BLOCK_IMAGE: + block_dict[dictkey_bbox] = JM_py_from_rect(block.m_internal.bbox) + JM_make_image_block(block, block_dict) + else: + 
JM_make_text_block(block, block_dict, raw, text_buffer, tp_rect) + + block_list.append(block_dict) + page_dict[dictkey_blocks] = block_list + + +def JM_matrix_from_py(m): + a = [0, 0, 0, 0, 0, 0] + if isinstance(m, mupdf.FzMatrix): + return m + if isinstance(m, Matrix): + return mupdf.FzMatrix(m.a, m.b, m.c, m.d, m.e, m.f) + if not m or not PySequence_Check(m) or PySequence_Size(m) != 6: + return mupdf.FzMatrix() + for i in range(6): + a[i] = JM_FLOAT_ITEM(m, i) + if a[i] is None: + return mupdf.FzRect() + return mupdf.FzMatrix(a[0], a[1], a[2], a[3], a[4], a[5]) + + +def JM_mediabox(page_obj): + ''' + return a PDF page's MediaBox + ''' + page_mediabox = mupdf.FzRect(mupdf.FzRect.Fixed_UNIT) + mediabox = mupdf.pdf_to_rect( + mupdf.pdf_dict_get_inheritable(page_obj, PDF_NAME('MediaBox')) + ) + if mupdf.fz_is_empty_rect(mediabox) or mupdf.fz_is_infinite_rect(mediabox): + mediabox.x0 = 0 + mediabox.y0 = 0 + mediabox.x1 = 612 + mediabox.y1 = 792 + + page_mediabox = mupdf.FzRect( + mupdf.fz_min(mediabox.x0, mediabox.x1), + mupdf.fz_min(mediabox.y0, mediabox.y1), + mupdf.fz_max(mediabox.x0, mediabox.x1), + mupdf.fz_max(mediabox.y0, mediabox.y1), + ) + + if (page_mediabox.x1 - page_mediabox.x0 < 1 + or page_mediabox.y1 - page_mediabox.y0 < 1 + ): + page_mediabox = mupdf.FzRect(mupdf.FzRect.Fixed_UNIT) + + return page_mediabox + + +def JM_merge_range( + doc_des, + doc_src, + spage, + epage, + apage, + rotate, + links, + annots, + show_progress, + graft_map, + ): + ''' + Copy a range of pages (spage, epage) from a source PDF to a specified + location (apage) of the target PDF. + If spage > epage, the sequence of source pages is reversed. + ''' + if g_use_extra: + return extra.JM_merge_range( + doc_des, + doc_src, + spage, + epage, + apage, + rotate, + links, + annots, + show_progress, + graft_map, + ) + afterpage = apage + counter = 0 # copied pages counter + total = mupdf.fz_absi(epage - spage) + 1 # total pages to copy + + if spage < epage: + page = spage + while page <= epage: + page_merge(doc_des, doc_src, page, afterpage, rotate, links, annots, graft_map) + counter += 1 + if show_progress > 0 and counter % show_progress == 0: + message(f"Inserted {counter} of {total} pages.") + page += 1 + afterpage += 1 + else: + page = spage + while page >= epage: + page_merge(doc_des, doc_src, page, afterpage, rotate, links, annots, graft_map) + counter += 1 + if show_progress > 0 and counter % show_progress == 0: + message(f"Inserted {counter} of {total} pages.") + page -= 1 + afterpage += 1 + + +def JM_merge_resources( page, temp_res): + ''' + Merge the /Resources object created by a text pdf device into the page. + The device may have created multiple /ExtGState/Alp? and /Font/F? objects. + These need to be renamed (renumbered) to not overwrite existing page + objects from previous executions. + Returns the next available numbers n, m for objects /Alp, /F. 
+ ''' + # page objects /Resources, /Resources/ExtGState, /Resources/Font + resources = mupdf.pdf_dict_get(page.obj(), PDF_NAME('Resources')) + if not resources.m_internal: + resources = mupdf.pdf_dict_put_dict(page.obj(), PDF_NAME('Resources'), 5) + main_extg = mupdf.pdf_dict_get(resources, PDF_NAME('ExtGState')) + main_fonts = mupdf.pdf_dict_get(resources, PDF_NAME('Font')) + + # text pdf device objects /ExtGState, /Font + temp_extg = mupdf.pdf_dict_get(temp_res, PDF_NAME('ExtGState')) + temp_fonts = mupdf.pdf_dict_get(temp_res, PDF_NAME('Font')) + + max_alp = -1 + max_fonts = -1 + + # Handle /Alp objects + if mupdf.pdf_is_dict(temp_extg): # any created at all? + n = mupdf.pdf_dict_len(temp_extg) + if mupdf.pdf_is_dict(main_extg): # does page have /ExtGState yet? + for i in range(mupdf.pdf_dict_len(main_extg)): + # get highest number of objects named /Alpxxx + alp = mupdf.pdf_to_name( mupdf.pdf_dict_get_key(main_extg, i)) + if not alp.startswith('Alp'): + continue + j = mupdf.fz_atoi(alp[3:]) + if j > max_alp: + max_alp = j + else: # create a /ExtGState for the page + main_extg = mupdf.pdf_dict_put_dict(resources, PDF_NAME('ExtGState'), n) + + max_alp += 1 + for i in range(n): # copy over renumbered /Alp objects + alp = mupdf.pdf_to_name( mupdf.pdf_dict_get_key( temp_extg, i)) + j = mupdf.fz_atoi(alp[3:]) + max_alp + text = f'Alp{j}' + val = mupdf.pdf_dict_get_val( temp_extg, i) + mupdf.pdf_dict_puts(main_extg, text, val) + + if mupdf.pdf_is_dict(main_fonts): # has page any fonts yet? + for i in range(mupdf.pdf_dict_len(main_fonts)): # get max font number + font = mupdf.pdf_to_name( mupdf.pdf_dict_get_key( main_fonts, i)) + if not font.startswith("F"): + continue + j = mupdf.fz_atoi(font[1:]) + if j > max_fonts: + max_fonts = j + else: # create a Resources/Font for the page + main_fonts = mupdf.pdf_dict_put_dict(resources, PDF_NAME('Font'), 2) + + max_fonts += 1 + for i in range(mupdf.pdf_dict_len(temp_fonts)): # copy renumbered fonts + font = mupdf.pdf_to_name( mupdf.pdf_dict_get_key( temp_fonts, i)) + j = mupdf.fz_atoi(font[1:]) + max_fonts + text = f'F{j}' + val = mupdf.pdf_dict_get_val(temp_fonts, i) + mupdf.pdf_dict_puts(main_fonts, text, val) + return (max_alp, max_fonts) # next available numbers + + +def JM_mupdf_warning( text): + ''' + redirect MuPDF warnings + ''' + JM_mupdf_warnings_store.append(text) + if JM_mupdf_show_warnings: + message(f'MuPDF warning: {text}') + + +def JM_mupdf_error( text): + JM_mupdf_warnings_store.append(text) + if JM_mupdf_show_errors: + message(f'MuPDF error: {text}\n') + + +def JM_new_bbox_device(rc, inc_layers): + assert isinstance(rc, list) + return JM_new_bbox_device_Device( rc, inc_layers) + + +def JM_new_buffer_from_stext_page(page): + ''' + make a buffer from an stext_page's text + ''' + assert isinstance(page, mupdf.FzStextPage) + rect = mupdf.FzRect(page.m_internal.mediabox) + buf = mupdf.fz_new_buffer(256) + for block in page: + if block.m_internal.type == mupdf.FZ_STEXT_BLOCK_TEXT: + for line in block: + for ch in line: + if (not JM_rects_overlap(rect, JM_char_bbox(line, ch)) + and not mupdf.fz_is_infinite_rect(rect) + ): + continue + mupdf.fz_append_rune(buf, ch.m_internal.c) + mupdf.fz_append_byte(buf, ord('\n')) + mupdf.fz_append_byte(buf, ord('\n')) + return buf + + +def JM_new_javascript(pdf, value): + ''' + make new PDF action object from JavaScript source + Parameters are a PDF document and a Python string. + Returns a PDF action object. 
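+    Returns None if 'value' is None or cannot be converted to a string.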
+ ''' + if value is None: + # no argument given + return + data = JM_StrAsChar(value) + if data is None: + # not convertible to char* + return + + res = mupdf.fz_new_buffer_from_copied_data(data.encode('utf8')) + source = mupdf.pdf_add_stream(pdf, res, mupdf.PdfObj(), 0) + newaction = mupdf.pdf_add_new_dict(pdf, 4) + mupdf.pdf_dict_put(newaction, PDF_NAME('S'), mupdf.pdf_new_name('JavaScript')) + mupdf.pdf_dict_put(newaction, PDF_NAME('JS'), source) + return newaction + + +def JM_new_output_fileptr(bio): + return JM_new_output_fileptr_Output( bio) + + +def JM_norm_rotation(rotate): + ''' + # return normalized /Rotate value:one of 0, 90, 180, 270 + ''' + while rotate < 0: + rotate += 360 + while rotate >= 360: + rotate -= 360 + if rotate % 90 != 0: + return 0 + return rotate + + +def JM_object_to_buffer(what, compress, ascii): + res = mupdf.fz_new_buffer(512) + out = mupdf.FzOutput(res) + mupdf.pdf_print_obj(out, what, compress, ascii) + out.fz_close_output() + mupdf.fz_terminate_buffer(res) + return res + + +def JM_outline_xrefs(obj, xrefs): + ''' + Return list of outline xref numbers. Recursive function. Arguments: + 'obj' first OL item + 'xrefs' empty Python list + ''' + if not obj.m_internal: + return xrefs + thisobj = obj + while thisobj.m_internal: + newxref = mupdf.pdf_to_num( thisobj) + if newxref in xrefs or mupdf.pdf_dict_get( thisobj, PDF_NAME('Type')).m_internal: + # circular ref or top of chain: terminate + break + xrefs.append( newxref) + first = mupdf.pdf_dict_get( thisobj, PDF_NAME('First')) # try go down + if mupdf.pdf_is_dict( first): + xrefs = JM_outline_xrefs( first, xrefs) + thisobj = mupdf.pdf_dict_get( thisobj, PDF_NAME('Next')) # try go next + parent = mupdf.pdf_dict_get( thisobj, PDF_NAME('Parent')) # get parent + if not mupdf.pdf_is_dict( thisobj): + thisobj = parent + return xrefs + + +def JM_page_rotation(page): + ''' + return a PDF page's /Rotate value: one of (0, 90, 180, 270) + ''' + rotate = 0 + + obj = mupdf.pdf_dict_get_inheritable( page.obj(), mupdf.PDF_ENUM_NAME_Rotate) + rotate = mupdf.pdf_to_int(obj) + rotate = JM_norm_rotation(rotate) + return rotate + + +def JM_pdf_obj_from_str(doc, src): + ''' + create PDF object from given string (new in v1.14.0: MuPDF dropped it) + ''' + # fixme: seems inefficient to convert to bytes instance then make another + # copy inside fz_new_buffer_from_copied_data(), but no other way? 
+ # + buffer_ = mupdf.fz_new_buffer_from_copied_data(bytes(src, 'utf8')) + stream = mupdf.fz_open_buffer(buffer_) + lexbuf = mupdf.PdfLexbuf(mupdf.PDF_LEXBUF_SMALL) + result = mupdf.pdf_parse_stm_obj(doc, stream, lexbuf) + return result + + +def JM_pixmap_from_display_list( + list_, + ctm, + cs, + alpha, + clip, + seps, + ): + ''' + Version of fz_new_pixmap_from_display_list (util.c) to also support + rendering of only the 'clip' part of the displaylist rectangle + ''' + assert isinstance(list_, mupdf.FzDisplayList) + if seps is None: + seps = mupdf.FzSeparations() + assert seps is None or isinstance(seps, mupdf.FzSeparations), f'{type(seps)=}: {seps}' + + rect = mupdf.fz_bound_display_list(list_) + matrix = JM_matrix_from_py(ctm) + rclip = JM_rect_from_py(clip) + rect = mupdf.fz_intersect_rect(rect, rclip) # no-op if clip is not given + + rect = mupdf.fz_transform_rect(rect, matrix) + irect = mupdf.fz_round_rect(rect) + + assert isinstance( cs, mupdf.FzColorspace) + + pix = mupdf.fz_new_pixmap_with_bbox(cs, irect, seps, alpha) + if alpha: + mupdf.fz_clear_pixmap(pix) + else: + mupdf.fz_clear_pixmap_with_value(pix, 0xFF) + + if not mupdf.fz_is_infinite_rect(rclip): + dev = mupdf.fz_new_draw_device_with_bbox(matrix, pix, irect) + mupdf.fz_run_display_list(list_, dev, mupdf.FzMatrix(), rclip, mupdf.FzCookie()) + else: + dev = mupdf.fz_new_draw_device(matrix, pix) + mupdf.fz_run_display_list(list_, dev, mupdf.FzMatrix(), mupdf.FzRect(mupdf.FzRect.Fixed_INFINITE), mupdf.FzCookie()) + + mupdf.fz_close_device(dev) + # Use special raw Pixmap constructor so we don't set alpha to true. + return Pixmap( 'raw', pix) + + +def JM_point_from_py(p): + ''' + PySequence to fz_point. Default: (FZ_MIN_INF_RECT, FZ_MIN_INF_RECT) + ''' + if isinstance(p, mupdf.FzPoint): + return p + if isinstance(p, Point): + return mupdf.FzPoint(p.x, p.y) + if g_use_extra: + return extra.JM_point_from_py( p) + + p0 = mupdf.FzPoint(0, 0) + x = JM_FLOAT_ITEM(p, 0) + y = JM_FLOAT_ITEM(p, 1) + if x is None or y is None: + return p0 + x = max( x, FZ_MIN_INF_RECT) + y = max( y, FZ_MIN_INF_RECT) + x = min( x, FZ_MAX_INF_RECT) + y = min( y, FZ_MAX_INF_RECT) + return mupdf.FzPoint(x, y) + + +def JM_print_stext_page_as_text(res, page): + ''' + Plain text output. An identical copy of fz_print_stext_page_as_text, + but lines within a block are concatenated by space instead a new-line + character (which else leads to 2 new-lines). + ''' + if 1 and g_use_extra: + return extra.JM_print_stext_page_as_text(res, page) + + assert isinstance(res, mupdf.FzBuffer) + assert isinstance(page, mupdf.FzStextPage) + rect = mupdf.FzRect(page.m_internal.mediabox) + last_char = 0 + + n_blocks = 0 + n_lines = 0 + n_chars = 0 + for n_blocks2, block in enumerate( page): + if block.m_internal.type == mupdf.FZ_STEXT_BLOCK_TEXT: + for n_lines2, line in enumerate( block): + for n_chars2, ch in enumerate( line): + pass + n_chars += n_chars2 + n_lines += n_lines2 + n_blocks += n_blocks2 + + for block in page: + if block.m_internal.type == mupdf.FZ_STEXT_BLOCK_TEXT: + for line in block: + last_char = 0 + for ch in line: + chbbox = JM_char_bbox(line, ch) + if (mupdf.fz_is_infinite_rect(rect) + or JM_rects_overlap(rect, chbbox) + ): + #raw += chr(ch.m_internal.c) + last_char = ch.m_internal.c + #log( '{=last_char!r utf!r}') + JM_append_rune(res, last_char) + if last_char != 10 and last_char > 0: + mupdf.fz_append_string(res, "\n") + + +def JM_put_script(annot_obj, key1, key2, value): + ''' + Create a JavaScript PDF action. 
+ Usable for all object types which support PDF actions, even if the + argument name suggests annotations. Up to 2 key values can be specified, so + JavaScript actions can be stored for '/A' and '/AA/?' keys. + ''' + key1_obj = mupdf.pdf_dict_get(annot_obj, key1) + pdf = mupdf.pdf_get_bound_document(annot_obj) # owning PDF + + # if no new script given, just delete corresponding key + if not value: + if key2 is None or not key2.m_internal: + mupdf.pdf_dict_del(annot_obj, key1) + elif key1_obj.m_internal: + mupdf.pdf_dict_del(key1_obj, key2) + return + + # read any existing script as a PyUnicode string + if not key2.m_internal or not key1_obj.m_internal: + script = JM_get_script(key1_obj) + else: + script = JM_get_script(mupdf.pdf_dict_get(key1_obj, key2)) + + # replace old script, if different from new one + if value != script: + newaction = JM_new_javascript(pdf, value) + if not key2.m_internal: + mupdf.pdf_dict_put(annot_obj, key1, newaction) + else: + mupdf.pdf_dict_putl(annot_obj, newaction, key1, key2) + + +def JM_py_from_irect(r): + return r.x0, r.y0, r.x1, r.y1 + + +def JM_py_from_matrix(m): + return m.a, m.b, m.c, m.d, m.e, m.f + + +def JM_py_from_point(p): + return p.x, p.y + + +def JM_py_from_quad(q): + ''' + PySequence from fz_quad. + ''' + return ( + (q.ul.x, q.ul.y), + (q.ur.x, q.ur.y), + (q.ll.x, q.ll.y), + (q.lr.x, q.lr.y), + ) + + +def JM_py_from_rect(r): + return r.x0, r.y0, r.x1, r.y1 + + +def JM_quad_from_py(r): + if isinstance(r, mupdf.FzQuad): + return r + # cover all cases of 4-float-sequences + if hasattr(r, "__getitem__") and len(r) == 4 and hasattr(r[0], "__float__"): + r = mupdf.FzRect(*tuple(r)) + if isinstance( r, mupdf.FzRect): + return mupdf.fz_quad_from_rect( r) + if isinstance( r, Quad): + return mupdf.fz_make_quad( + r.ul.x, r.ul.y, + r.ur.x, r.ur.y, + r.ll.x, r.ll.y, + r.lr.x, r.lr.y, + ) + q = mupdf.fz_make_quad(0, 0, 0, 0, 0, 0, 0, 0) + p = [0,0,0,0] + if not r or not isinstance(r, (tuple, list)) or len(r) != 4: + return q + + if JM_FLOAT_ITEM(r, 0) is None: + return mupdf.fz_quad_from_rect(JM_rect_from_py(r)) + + for i in range(4): + if i >= len(r): + return q # invalid: cancel the rest + obj = r[i] # next point item + if not PySequence_Check(obj) or PySequence_Size(obj) != 2: + return q # invalid: cancel the rest + + p[i].x = JM_FLOAT_ITEM(obj, 0) + p[i].y = JM_FLOAT_ITEM(obj, 1) + if p[i].x is None or p[i].y is None: + return q + p[i].x = max( p[i].x, FZ_MIN_INF_RECT) + p[i].y = max( p[i].y, FZ_MIN_INF_RECT) + p[i].x = min( p[i].x, FZ_MAX_INF_RECT) + p[i].y = min( p[i].y, FZ_MAX_INF_RECT) + q.ul = p[0] + q.ur = p[1] + q.ll = p[2] + q.lr = p[3] + return q + + +def JM_read_contents(pageref): + ''' + Read and concatenate a PDF page's /Contents object(s) in a buffer + ''' + assert isinstance(pageref, mupdf.PdfObj), f'{type(pageref)}' + contents = mupdf.pdf_dict_get(pageref, mupdf.PDF_ENUM_NAME_Contents) + if mupdf.pdf_is_array(contents): + res = mupdf.FzBuffer(1024) + for i in range(mupdf.pdf_array_len(contents)): + if i > 0: + mupdf.fz_append_byte(res, 32) + obj = mupdf.pdf_array_get(contents, i) + if mupdf.pdf_is_stream(obj): + nres = mupdf.pdf_load_stream(obj) + mupdf.fz_append_buffer(res, nres) + elif contents.m_internal: + res = mupdf.pdf_load_stream(contents) + else: + res = mupdf.FzBuffer(0) + return res + + +def JM_rect_from_py(r): + if isinstance(r, mupdf.FzRect): + return r + if isinstance(r, mupdf.FzIrect): + return mupdf.FzRect(r) + if isinstance(r, Rect): + return mupdf.fz_make_rect(r.x0, r.y0, r.x1, r.y1) + if isinstance(r, IRect): + return 
mupdf.fz_make_rect(r.x0, r.y0, r.x1, r.y1) + if not r or not PySequence_Check(r) or PySequence_Size(r) != 4: + return mupdf.FzRect(mupdf.FzRect.Fixed_INFINITE) + f = [0, 0, 0, 0] + for i in range(4): + f[i] = JM_FLOAT_ITEM(r, i) + if f[i] is None: + return mupdf.FzRect(mupdf.FzRect.Fixed_INFINITE) + if f[i] < FZ_MIN_INF_RECT: + f[i] = FZ_MIN_INF_RECT + if f[i] > FZ_MAX_INF_RECT: + f[i] = FZ_MAX_INF_RECT + return mupdf.fz_make_rect(f[0], f[1], f[2], f[3]) + + +def JM_rects_overlap(a, b): + if (0 + or a.x0 >= b.x1 + or a.y0 >= b.y1 + or a.x1 <= b.x0 + or a.y1 <= b.y0 + ): + return 0 + return 1 + + +def JM_refresh_links( page): + ''' + refreshes the link and annotation tables of a page + ''' + if page is None or not page.m_internal: + return + obj = mupdf.pdf_dict_get( page.obj(), PDF_NAME('Annots')) + if obj.m_internal: + pdf = page.doc() + number = mupdf.pdf_lookup_page_number( pdf, page.obj()) + page_mediabox = mupdf.FzRect() + page_ctm = mupdf.FzMatrix() + mupdf.pdf_page_transform( page, page_mediabox, page_ctm) + link = mupdf.pdf_load_link_annots( pdf, page, obj, number, page_ctm) + page.m_internal.links = mupdf.ll_fz_keep_link( link.m_internal) + + +def JM_rotate_page_matrix(page): + ''' + calculate page rotation matrices + ''' + if not page.m_internal: + return mupdf.FzMatrix() # no valid pdf page given + rotation = JM_page_rotation(page) + #log( '{rotation=}') + if rotation == 0: + return mupdf.FzMatrix() # no rotation + cb_size = JM_cropbox_size(page.obj()) + w = cb_size.x + h = cb_size.y + #log( '{=h w}') + if rotation == 90: + m = mupdf.fz_make_matrix(0, 1, -1, 0, h, 0) + elif rotation == 180: + m = mupdf.fz_make_matrix(-1, 0, 0, -1, w, h) + else: + m = mupdf.fz_make_matrix(0, -1, 1, 0, 0, w) + #log( 'returning {m=}') + return m + + +def JM_search_stext_page(page, needle): + if g_use_extra: + return extra.JM_search_stext_page(page.m_internal, needle) + + rect = mupdf.FzRect(page.m_internal.mediabox) + if not needle: + return + quads = [] + class Hits: + def __str__(self): + return f'Hits(len={self.len} quads={self.quads} hfuzz={self.hfuzz} vfuzz={self.vfuzz}' + hits = Hits() + hits.len = 0 + hits.quads = quads + hits.hfuzz = 0.2 # merge kerns but not large gaps + hits.vfuzz = 0.1 + + buffer_ = JM_new_buffer_from_stext_page(page) + haystack_string = mupdf.fz_string_from_buffer(buffer_) + haystack = 0 + begin, end = find_string(haystack_string[haystack:], needle) + if begin is None: + #goto no_more_matches; + return quads + + begin += haystack + end += haystack + inside = 0 + i = 0 + for block in page: + if block.m_internal.type != mupdf.FZ_STEXT_BLOCK_TEXT: + continue + for line in block: + for ch in line: + i += 1 + if not mupdf.fz_is_infinite_rect(rect): + r = JM_char_bbox(line, ch) + if not JM_rects_overlap(rect, r): + #goto next_char; + continue + while 1: + #try_new_match: + if not inside: + if haystack >= begin: + inside = 1 + if inside: + if haystack < end: + on_highlight_char(hits, line, ch) + break + else: + inside = 0 + begin, end = find_string(haystack_string[haystack:], needle) + if begin is None: + #goto no_more_matches; + return quads + else: + #goto try_new_match; + begin += haystack + end += haystack + continue + break + haystack += 1 + #next_char:; + assert haystack_string[haystack] == '\n', \ + f'{haystack=} {haystack_string[haystack]=}' + haystack += 1 + assert haystack_string[haystack] == '\n', \ + f'{haystack=} {haystack_string[haystack]=}' + haystack += 1 + #no_more_matches:; + return quads + + +def JM_scan_resources(pdf, rsrc, liste, what, stream_xref, 
tracer): + ''' + Step through /Resources, looking up image, xobject or font information + ''' + if mupdf.pdf_mark_obj(rsrc): + mupdf.fz_warn('Circular dependencies! Consider page cleaning.') + return # Circular dependencies! + try: + xobj = mupdf.pdf_dict_get(rsrc, mupdf.PDF_ENUM_NAME_XObject) + + if what == 1: # lookup fonts + font = mupdf.pdf_dict_get(rsrc, mupdf.PDF_ENUM_NAME_Font) + JM_gather_fonts(pdf, font, liste, stream_xref) + elif what == 2: # look up images + JM_gather_images(pdf, xobj, liste, stream_xref) + elif what == 3: # look up form xobjects + JM_gather_forms(pdf, xobj, liste, stream_xref) + else: # should never happen + return + + # check if we need to recurse into Form XObjects + n = mupdf.pdf_dict_len(xobj) + for i in range(n): + obj = mupdf.pdf_dict_get_val(xobj, i) + if mupdf.pdf_is_stream(obj): + sxref = mupdf.pdf_to_num(obj) + else: + sxref = 0 + subrsrc = mupdf.pdf_dict_get(obj, mupdf.PDF_ENUM_NAME_Resources) + if subrsrc.m_internal: + sxref_t = sxref + if sxref_t not in tracer: + tracer.append(sxref_t) + JM_scan_resources( pdf, subrsrc, liste, what, sxref, tracer) + else: + mupdf.fz_warn('Circular dependencies! Consider page cleaning.') + return + finally: + mupdf.pdf_unmark_obj(rsrc) + + +def JM_set_choice_options(annot, liste): + ''' + set ListBox / ComboBox values + ''' + if not liste: + return + assert isinstance( liste, (tuple, list)) + n = len( liste) + if n == 0: + return + annot_obj = mupdf.pdf_annot_obj( annot) + pdf = mupdf.pdf_get_bound_document( annot_obj) + optarr = mupdf.pdf_new_array( pdf, n) + for i in range(n): + val = liste[i] + opt = val + if isinstance(opt, str): + mupdf.pdf_array_push_text_string( optarr, opt) + else: + assert isinstance( val, (tuple, list)) and len( val) == 2, 'bad choice field list' + opt1, opt2 = val + assert opt1 and opt2, 'bad choice field list' + optarrsub = mupdf.pdf_array_push_array( optarr, 2) + mupdf.pdf_array_push_text_string( optarrsub, opt1) + mupdf.pdf_array_push_text_string( optarrsub, opt2) + mupdf.pdf_dict_put( annot_obj, PDF_NAME('Opt'), optarr) + + +def JM_set_field_type(doc, obj, type): + ''' + Set the field type + ''' + setbits = 0 + clearbits = 0 + typename = None + if type == mupdf.PDF_WIDGET_TYPE_BUTTON: + typename = PDF_NAME('Btn') + setbits = mupdf.PDF_BTN_FIELD_IS_PUSHBUTTON + elif type == mupdf.PDF_WIDGET_TYPE_RADIOBUTTON: + typename = PDF_NAME('Btn') + clearbits = mupdf.PDF_BTN_FIELD_IS_PUSHBUTTON + setbits = mupdf.PDF_BTN_FIELD_IS_RADIO + elif type == mupdf.PDF_WIDGET_TYPE_CHECKBOX: + typename = PDF_NAME('Btn') + clearbits = (mupdf.PDF_BTN_FIELD_IS_PUSHBUTTON | mupdf.PDF_BTN_FIELD_IS_RADIO) + elif type == mupdf.PDF_WIDGET_TYPE_TEXT: + typename = PDF_NAME('Tx') + elif type == mupdf.PDF_WIDGET_TYPE_LISTBOX: + typename = PDF_NAME('Ch') + clearbits = mupdf.PDF_CH_FIELD_IS_COMBO + elif type == mupdf.PDF_WIDGET_TYPE_COMBOBOX: + typename = PDF_NAME('Ch') + setbits = mupdf.PDF_CH_FIELD_IS_COMBO + elif type == mupdf.PDF_WIDGET_TYPE_SIGNATURE: + typename = PDF_NAME('Sig') + + if typename is not None and typename.m_internal: + mupdf.pdf_dict_put(obj, PDF_NAME('FT'), typename) + + if setbits != 0 or clearbits != 0: + bits = mupdf.pdf_dict_get_int(obj, PDF_NAME('Ff')) + bits &= ~clearbits + bits |= setbits + mupdf.pdf_dict_put_int(obj, PDF_NAME('Ff'), bits) + + +def JM_set_object_value(obj, key, value): + ''' + Set a PDF dict key to some value + ''' + eyecatcher = "fitz: replace me!" 
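+    # Strategy: the placeholder defined above is temporarily written under
+    # the key path, the owning object is serialized to text, the placeholder
+    # is replaced by the desired value, and the result is parsed back into a
+    # PDF object.  Sketch (illustration only):
+    #     JM_set_object_value(obj, "A/S", "/GoTo")
+    # ends up writing "/S /GoTo" inside the "/A" dictionary of obj.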
+ pdf = mupdf.pdf_get_bound_document(obj) + # split PDF key at path seps and take last key part + list_ = key.split('/') + len_ = len(list_) + i = len_ - 1 + skey = list_[i] + + del list_[i] # del the last sub-key + len_ = len(list_) # remaining length + testkey = mupdf.pdf_dict_getp(obj, key) # check if key already exists + if not testkey.m_internal: + #No, it will be created here. But we cannot allow this happening if + #indirect objects are referenced. So we check all higher level + #sub-paths for indirect references. + while len_ > 0: + t = '/'.join(list_) # next high level + if mupdf.pdf_is_indirect(mupdf.pdf_dict_getp(obj, JM_StrAsChar(t))): + raise Exception("path to '%s' has indirects", JM_StrAsChar(skey)) + del list_[len_ - 1] # del last sub-key + len_ = len(list_) # remaining length + # Insert our eyecatcher. Will create all sub-paths in the chain, or + # respectively remove old value of key-path. + mupdf.pdf_dict_putp(obj, key, mupdf.pdf_new_text_string(eyecatcher)) + testkey = mupdf.pdf_dict_getp(obj, key) + if not mupdf.pdf_is_string(testkey): + raise Exception("cannot insert value for '%s'", key) + temp = mupdf.pdf_to_text_string(testkey) + if temp != eyecatcher: + raise Exception("cannot insert value for '%s'", key) + # read the result as a string + res = JM_object_to_buffer(obj, 1, 0) + objstr = JM_EscapeStrFromBuffer(res) + + # replace 'eyecatcher' by desired 'value' + nullval = "/%s(%s)" % ( skey, eyecatcher) + newval = "/%s %s" % (skey, value) + newstr = objstr.replace(nullval, newval, 1) + + # make PDF object from resulting string + new_obj = JM_pdf_obj_from_str(pdf, newstr) + return new_obj + + +def JM_set_ocg_arrays(conf, basestate, on, off, rbgroups, locked): + if basestate: + mupdf.pdf_dict_put_name( conf, PDF_NAME('BaseState'), basestate) + + if on is not None: + mupdf.pdf_dict_del( conf, PDF_NAME('ON')) + if on: + arr = mupdf.pdf_dict_put_array( conf, PDF_NAME('ON'), 1) + JM_set_ocg_arrays_imp( arr, on) + if off is not None: + mupdf.pdf_dict_del( conf, PDF_NAME('OFF')) + if off: + arr = mupdf.pdf_dict_put_array( conf, PDF_NAME('OFF'), 1) + JM_set_ocg_arrays_imp( arr, off) + if locked is not None: + mupdf.pdf_dict_del( conf, PDF_NAME('Locked')) + if locked: + arr = mupdf.pdf_dict_put_array( conf, PDF_NAME('Locked'), 1) + JM_set_ocg_arrays_imp( arr, locked) + if rbgroups is not None: + mupdf.pdf_dict_del( conf, PDF_NAME('RBGroups')) + if rbgroups: + arr = mupdf.pdf_dict_put_array( conf, PDF_NAME('RBGroups'), 1) + n =len(rbgroups) + for i in range(n): + item0 = rbgroups[i] + obj = mupdf.pdf_array_push_array( arr, 1) + JM_set_ocg_arrays_imp( obj, item0) + + +def JM_set_ocg_arrays_imp(arr, list_): + ''' + Set OCG arrays from dict of Python lists + Works with dict like {"basestate":name, "on":list, "off":list, "rbg":list} + ''' + pdf = mupdf.pdf_get_bound_document(arr) + for xref in list_: + obj = mupdf.pdf_new_indirect(pdf, xref, 0) + mupdf.pdf_array_push(arr, obj) + + +def JM_set_resource_property(ref, name, xref): + ''' + Insert an item into Resources/Properties (used for Marked Content) + Arguments: + (1) e.g. 
page object, Form XObject + (2) marked content name + (3) xref of the referenced object (insert as indirect reference) + ''' + pdf = mupdf.pdf_get_bound_document(ref) + ind = mupdf.pdf_new_indirect(pdf, xref, 0) + if not ind.m_internal: + RAISEPY(MSG_BAD_XREF, PyExc_ValueError) + resources = mupdf.pdf_dict_get(ref, PDF_NAME('Resources')) + if not resources.m_internal: + resources = mupdf.pdf_dict_put_dict(ref, PDF_NAME('Resources'), 1) + properties = mupdf.pdf_dict_get(resources, PDF_NAME('Properties')) + if not properties.m_internal: + properties = mupdf.pdf_dict_put_dict(resources, PDF_NAME('Properties'), 1) + mupdf.pdf_dict_put(properties, mupdf.pdf_new_name(name), ind) + + +def JM_set_widget_properties(annot, Widget): + ''' + Update the PDF form field with the properties from a Python Widget object. + Called by "Page.add_widget" and "Annot.update_widget". + ''' + if isinstance( annot, Annot): + annot = annot.this + assert isinstance( annot, mupdf.PdfAnnot), f'{type(annot)=} {type=}' + page = _pdf_annot_page(annot) + assert page.m_internal, 'Annot is not bound to a page' + annot_obj = mupdf.pdf_annot_obj(annot) + pdf = page.doc() + def GETATTR(name): + return getattr(Widget, name, None) + + value = GETATTR("field_type") + field_type = value + + # rectangle -------------------------------------------------------------- + value = GETATTR("rect") + rect = JM_rect_from_py(value) + rot_mat = JM_rotate_page_matrix(page) + rect = mupdf.fz_transform_rect(rect, rot_mat) + mupdf.pdf_set_annot_rect(annot, rect) + + # fill color ------------------------------------------------------------- + value = GETATTR("fill_color") + if value and PySequence_Check(value): + n = len(value) + fill_col = mupdf.pdf_new_array(pdf, n) + col = 0 + for i in range(n): + col = value[i] + mupdf.pdf_array_push_real(fill_col, col) + mupdf.pdf_field_set_fill_color(annot_obj, fill_col) + + # dashes ----------------------------------------------------------------- + value = GETATTR("border_dashes") + if value and PySequence_Check(value): + n = len(value) + dashes = mupdf.pdf_new_array(pdf, n) + for i in range(n): + mupdf.pdf_array_push_int(dashes, value[i]) + mupdf.pdf_dict_putl(annot_obj, dashes, PDF_NAME('BS'), PDF_NAME('D')) + + # border color ----------------------------------------------------------- + value = GETATTR("border_color") + if value and PySequence_Check(value): + n = len(value) + border_col = mupdf.pdf_new_array(pdf, n) + col = 0 + for i in range(n): + col = value[i] + mupdf.pdf_array_push_real(border_col, col) + mupdf.pdf_dict_putl(annot_obj, border_col, PDF_NAME('MK'), PDF_NAME('BC')) + + # entry ignored - may be used later + # + #int text_format = (int) PyInt_AsLong(GETATTR("text_format")); + # + + # field label ----------------------------------------------------------- + value = GETATTR("field_label") + if value is not None: + label = JM_StrAsChar(value) + mupdf.pdf_dict_put_text_string(annot_obj, PDF_NAME('TU'), label) + + # field name ------------------------------------------------------------- + value = GETATTR("field_name") + if value is not None: + name = JM_StrAsChar(value) + old_name = mupdf.pdf_load_field_name(annot_obj) + if name != old_name: + mupdf.pdf_dict_put_text_string(annot_obj, PDF_NAME('T'), name) + + # max text len ----------------------------------------------------------- + if field_type == mupdf.PDF_WIDGET_TYPE_TEXT: + value = GETATTR("text_maxlen") + text_maxlen = value + if text_maxlen: + mupdf.pdf_dict_put_int(annot_obj, PDF_NAME('MaxLen'), text_maxlen) + value = 
GETATTR("field_display") + d = value + mupdf.pdf_field_set_display(annot_obj, d) + + # choice values ---------------------------------------------------------- + if field_type in (mupdf.PDF_WIDGET_TYPE_LISTBOX, mupdf.PDF_WIDGET_TYPE_COMBOBOX): + value = GETATTR("choice_values") + JM_set_choice_options(annot, value) + + # border style ----------------------------------------------------------- + value = GETATTR("border_style") + val = JM_get_border_style(value) + mupdf.pdf_dict_putl(annot_obj, val, PDF_NAME('BS'), PDF_NAME('S')) + + # border width ----------------------------------------------------------- + value = GETATTR("border_width") + border_width = value + mupdf.pdf_dict_putl( + annot_obj, + mupdf.pdf_new_real(border_width), + PDF_NAME('BS'), + PDF_NAME('W'), + ) + + # /DA string ------------------------------------------------------------- + value = GETATTR("_text_da") + da = JM_StrAsChar(value) + mupdf.pdf_dict_put_text_string(annot_obj, PDF_NAME('DA'), da) + mupdf.pdf_dict_del(annot_obj, PDF_NAME('DS')) # not supported by MuPDF + mupdf.pdf_dict_del(annot_obj, PDF_NAME('RC')) # not supported by MuPDF + + # field flags ------------------------------------------------------------ + field_flags = GETATTR("field_flags") + if field_flags is not None: + if field_type == mupdf.PDF_WIDGET_TYPE_COMBOBOX: + field_flags |= mupdf.PDF_CH_FIELD_IS_COMBO + elif field_type == mupdf.PDF_WIDGET_TYPE_RADIOBUTTON: + field_flags |= mupdf.PDF_BTN_FIELD_IS_RADIO + elif field_type == mupdf.PDF_WIDGET_TYPE_BUTTON: + field_flags |= mupdf.PDF_BTN_FIELD_IS_PUSHBUTTON + mupdf.pdf_dict_put_int( annot_obj, PDF_NAME('Ff'), field_flags) + + # button caption --------------------------------------------------------- + value = GETATTR("button_caption") + ca = JM_StrAsChar(value) + if ca: + mupdf.pdf_field_set_button_caption(annot_obj, ca) + + # script (/A) ------------------------------------------------------- + value = GETATTR("script") + JM_put_script(annot_obj, PDF_NAME('A'), mupdf.PdfObj(), value) + + # script (/AA/K) ------------------------------------------------------- + value = GETATTR("script_stroke") + JM_put_script(annot_obj, PDF_NAME('AA'), PDF_NAME('K'), value) + + # script (/AA/F) ------------------------------------------------------- + value = GETATTR("script_format") + JM_put_script(annot_obj, PDF_NAME('AA'), PDF_NAME('F'), value) + + # script (/AA/V) ------------------------------------------------------- + value = GETATTR("script_change") + JM_put_script(annot_obj, PDF_NAME('AA'), PDF_NAME('V'), value) + + # script (/AA/C) ------------------------------------------------------- + value = GETATTR("script_calc") + JM_put_script(annot_obj, PDF_NAME('AA'), PDF_NAME('C'), value) + + # script (/AA/Bl) ------------------------------------------------------- + value = GETATTR("script_blur") + JM_put_script(annot_obj, PDF_NAME('AA'), mupdf.pdf_new_name('Bl'), value) + + # script (/AA/Fo) codespell:ignore -------------------------------------- + value = GETATTR("script_focus") + JM_put_script(annot_obj, PDF_NAME('AA'), mupdf.pdf_new_name('Fo'), value) + + # field value ------------------------------------------------------------ + value = GETATTR("field_value") # field value + text = JM_StrAsChar(value) # convert to text (may fail!) 
+ if field_type == mupdf.PDF_WIDGET_TYPE_RADIOBUTTON: + if not value: + mupdf.pdf_set_field_value(pdf, annot_obj, "Off", 1) + mupdf.pdf_dict_put_name(annot_obj, PDF_NAME('AS'), "Off") + else: + # TODO check if another button in the group is ON and if so set it Off + onstate = mupdf.pdf_button_field_on_state(annot_obj) + if onstate.m_internal: + on = mupdf.pdf_to_name(onstate) + mupdf.pdf_set_field_value(pdf, annot_obj, on, 1) + mupdf.pdf_dict_put_name(annot_obj, PDF_NAME('AS'), on) + elif text: + mupdf.pdf_dict_put_name(annot_obj, PDF_NAME('AS'), text) + elif field_type == mupdf.PDF_WIDGET_TYPE_CHECKBOX: + onstate = mupdf.pdf_button_field_on_state(annot_obj) + on = onstate.pdf_to_name() + if value in (True, on) or text == 'Yes': + mupdf.pdf_set_field_value(pdf, annot_obj, on, 1) + mupdf.pdf_dict_put_name(annot_obj, PDF_NAME('AS'), on) + mupdf.pdf_dict_put_name(annot_obj, PDF_NAME('V'), on) + else: + mupdf.pdf_dict_put_name( annot_obj, PDF_NAME('AS'), 'Off') + mupdf.pdf_dict_put_name( annot_obj, PDF_NAME('V'), 'Off') + else: + if text: + mupdf.pdf_set_field_value(pdf, annot_obj, text, 1) + if field_type in (mupdf.PDF_WIDGET_TYPE_COMBOBOX, mupdf.PDF_WIDGET_TYPE_LISTBOX): + mupdf.pdf_dict_del(annot_obj, PDF_NAME('I')) + mupdf.pdf_dirty_annot(annot) + mupdf.pdf_set_annot_hot(annot, 1) + mupdf.pdf_set_annot_active(annot, 1) + mupdf.pdf_update_annot(annot) + + +def JM_show_string_cs( + text, + user_font, + trm, + s, + wmode, + bidi_level, + markup_dir, + language, + ): + i = 0 + while i < len(s): + l, ucs = mupdf.fz_chartorune(s[i:]) + i += l + gid = mupdf.fz_encode_character_sc(user_font, ucs) + if gid == 0: + gid, font = mupdf.fz_encode_character_with_fallback(user_font, ucs, 0, language) + else: + font = user_font + mupdf.fz_show_glyph(text, font, trm, gid, ucs, wmode, bidi_level, markup_dir, language) + adv = mupdf.fz_advance_glyph(font, gid, wmode) + if wmode == 0: + trm = mupdf.fz_pre_translate(trm, adv, 0) + else: + trm = mupdf.fz_pre_translate(trm, 0, -adv) + return trm + + +def JM_UnicodeFromBuffer(buff): + buff_bytes = mupdf.fz_buffer_extract_copy(buff) + val = buff_bytes.decode(errors='replace') + z = val.find(chr(0)) + if z >= 0: + val = val[:z] + return val + + +def message_warning(text): + ''' + Generate a warning. + ''' + message(f'warning: {text}') + + +def JM_update_stream(doc, obj, buffer_, compress): + ''' + update a stream object + compress stream when beneficial + ''' + if compress: + length, _ = mupdf.fz_buffer_storage(buffer_) + if length > 30: # ignore small stuff + buffer_compressed = JM_compress_buffer(buffer_) + assert isinstance(buffer_compressed, mupdf.FzBuffer) + if buffer_compressed.m_internal: + length_compressed, _ = mupdf.fz_buffer_storage(buffer_compressed) + if length_compressed < length: # was it worth the effort? 
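+                    # Compression paid off: mark the stream as FlateDecode and
+                    # store the already-deflated buffer (the final argument 1
+                    # of pdf_update_stream() says the data is compressed).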
+ mupdf.pdf_dict_put( + obj, + mupdf.PDF_ENUM_NAME_Filter, + mupdf.PDF_ENUM_NAME_FlateDecode, + ) + mupdf.pdf_update_stream(doc, obj, buffer_compressed, 1) + return + + mupdf.pdf_update_stream(doc, obj, buffer_, 0) + + +def JM_xobject_from_page(pdfout, fsrcpage, xref, gmap): + ''' + Make an XObject from a PDF page + For a positive xref assume that its object can be used instead + ''' + assert isinstance(gmap, mupdf.PdfGraftMap), f'{type(gmap)=}' + if xref > 0: + xobj1 = mupdf.pdf_new_indirect(pdfout, xref, 0) + else: + srcpage = _as_pdf_page(fsrcpage.this) + spageref = srcpage.obj() + mediabox = mupdf.pdf_to_rect(mupdf.pdf_dict_get_inheritable(spageref, PDF_NAME('MediaBox'))) + # Deep-copy resources object of source page + o = mupdf.pdf_dict_get_inheritable(spageref, PDF_NAME('Resources')) + if gmap.m_internal: + # use graftmap when possible + resources = mupdf.pdf_graft_mapped_object(gmap, o) + else: + resources = mupdf.pdf_graft_object(pdfout, o) + + # get spgage contents source + res = JM_read_contents(spageref) + + #------------------------------------------------------------- + # create XObject representing the source page + #------------------------------------------------------------- + xobj1 = mupdf.pdf_new_xobject(pdfout, mediabox, mupdf.FzMatrix(), mupdf.PdfObj(0), res) + # store spage contents + JM_update_stream(pdfout, xobj1, res, 1) + + # store spage resources + mupdf.pdf_dict_put(xobj1, PDF_NAME('Resources'), resources) + return xobj1 + + +def PySequence_Check(s): + return isinstance(s, (tuple, list)) + + +def PySequence_Size(s): + return len(s) + + +# constants: error messages. These are also in extra.i. +# +MSG_BAD_ANNOT_TYPE = "bad annot type" +MSG_BAD_APN = "bad or missing annot AP/N" +MSG_BAD_ARG_INK_ANNOT = "arg must be seq of seq of float pairs" +MSG_BAD_ARG_POINTS = "bad seq of points" +MSG_BAD_BUFFER = "bad type: 'buffer'" +MSG_BAD_COLOR_SEQ = "bad color sequence" +MSG_BAD_DOCUMENT = "cannot open broken document" +MSG_BAD_FILETYPE = "bad filetype" +MSG_BAD_LOCATION = "bad location" +MSG_BAD_OC_CONFIG = "bad config number" +MSG_BAD_OC_LAYER = "bad layer number" +MSG_BAD_OC_REF = "bad 'oc' reference" +MSG_BAD_PAGEID = "bad page id" +MSG_BAD_PAGENO = "bad page number(s)" +MSG_BAD_PDFROOT = "PDF has no root" +MSG_BAD_RECT = "rect is infinite or empty" +MSG_BAD_TEXT = "bad type: 'text'" +MSG_BAD_XREF = "bad xref" +MSG_COLOR_COUNT_FAILED = "color count failed" +MSG_FILE_OR_BUFFER = "need font file or buffer" +MSG_FONT_FAILED = "cannot create font" +MSG_IS_NO_ANNOT = "is no annotation" +MSG_IS_NO_IMAGE = "is no image" +MSG_IS_NO_PDF = "is no PDF" +MSG_IS_NO_DICT = "object is no PDF dict" +MSG_PIX_NOALPHA = "source pixmap has no alpha" +MSG_PIXEL_OUTSIDE = "pixel(s) outside image" + + +JM_Exc_FileDataError = 'FileDataError' +PyExc_ValueError = 'ValueError' + +def RAISEPY( msg, exc): + #JM_Exc_CurrentException=exc + #fz_throw(context, FZ_ERROR_GENERIC, msg) + raise Exception( msg) + + +def PyUnicode_DecodeRawUnicodeEscape(s, errors='strict'): + # FIXED: handle raw unicode escape sequences + if not s: + return "" + if isinstance(s, str): + rc = s.encode("utf8", errors=errors) + elif isinstance(s, bytes): + rc = s[:] + ret = rc.decode('raw_unicode_escape', errors=errors) + return ret + + +def CheckColor(c: OptSeq): + if c: + if ( + type(c) not in (list, tuple) + or len(c) not in (1, 3, 4) + or min(c) < 0 + or max(c) > 1 + ): + raise ValueError("need 1, 3 or 4 color components in range 0 to 1") + + +def CheckFont(page: Page, fontname: str) -> tuple: + """Return an entry in the 
page's font list if reference name matches. + """ + for f in page.get_fonts(): + if f[4] == fontname: + return f + + +def CheckFontInfo(doc: Document, xref: int) -> list: + """Return a font info if present in the document. + """ + for f in doc.FontInfos: + if xref == f[0]: + return f + + +def CheckMarkerArg(quads: typing.Any) -> tuple: + if CheckRect(quads): + r = Rect(quads) + return (r.quad,) + if CheckQuad(quads): + return (quads,) + for q in quads: + if not (CheckRect(q) or CheckQuad(q)): + raise ValueError("bad quads entry") + return quads + + +def CheckMorph(o: typing.Any) -> bool: + if not bool(o): + return False + if not (type(o) in (list, tuple) and len(o) == 2): + raise ValueError("morph must be a sequence of length 2") + if not (len(o[0]) == 2 and len(o[1]) == 6): + raise ValueError("invalid morph param 0") + if not o[1][4] == o[1][5] == 0: + raise ValueError("invalid morph param 1") + return True + + +def CheckParent(o: typing.Any): + return + if not hasattr(o, "parent") or o.parent is None: + raise ValueError(f"orphaned object {type(o)=}: parent is None") + + +def CheckQuad(q: typing.Any) -> bool: + """Check whether an object is convex, not empty quad-like. + + It must be a sequence of 4 number pairs. + """ + try: + q0 = Quad(q) + except Exception: + if g_exceptions_verbose > 1: exception_info() + return False + return q0.is_convex + + +def CheckRect(r: typing.Any) -> bool: + """Check whether an object is non-degenerate rect-like. + + It must be a sequence of 4 numbers. + """ + try: + r = Rect(r) + except Exception: + if g_exceptions_verbose > 1: exception_info() + return False + return not (r.is_empty or r.is_infinite) + + +def ColorCode(c: typing.Union[list, tuple, float, None], f: str) -> str: + if not c: + return "" + if hasattr(c, "__float__"): + c = (c,) + CheckColor(c) + if len(c) == 1: + s = _format_g(c[0]) + " " + return s + "G " if f == "c" else s + "g " + + if len(c) == 3: + s = _format_g(tuple(c)) + " " + return s + "RG " if f == "c" else s + "rg " + + s = _format_g(tuple(c)) + " " + return s + "K " if f == "c" else s + "k " + + +def Page__add_text_marker(self, quads, annot_type): + pdfpage = self._pdf_page() + rotation = JM_page_rotation(pdfpage) + def final(): + if rotation != 0: + mupdf.pdf_dict_put_int(pdfpage.obj(), PDF_NAME('Rotate'), rotation) + try: + if rotation != 0: + mupdf.pdf_dict_put_int(pdfpage.obj(), PDF_NAME('Rotate'), 0) + annot = mupdf.pdf_create_annot(pdfpage, annot_type) + for item in quads: + q = JM_quad_from_py(item) + mupdf.pdf_add_annot_quad_point(annot, q) + mupdf.pdf_update_annot(annot) + JM_add_annot_id(annot, "A") + final() + except Exception: + if g_exceptions_verbose: exception_info() + final() + return + return Annot(annot) + + +def PDF_NAME(x): + assert isinstance(x, str) + ret = getattr(mupdf, f'PDF_ENUM_NAME_{x}') + # Note that we return a (swig proxy for) pdf_obj*, not a mupdf.PdfObj. In + # the C++ API, the constructor PdfObj::PdfObj(pdf_obj*) is marked as + # explicit, but this seems to be ignored by SWIG. If SWIG started to + # generate code that respected `explicit`, we would need to do `return + # mupdf.PdfObj(ret)`. + # + # [Compare with extra.i, where we define our own PDF_NAME2() macro that + # returns a mupdf::PdfObj.] 
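+    # Example (illustration only): PDF_NAME('Resources') evaluates to
+    # mupdf.PDF_ENUM_NAME_Resources, i.e. the SWIG proxy for the interned
+    # pdf_obj* representing the /Resources name.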
+ return ret + + +def UpdateFontInfo(doc: Document, info: typing.Sequence): + xref = info[0] + found = False + for i, fi in enumerate(doc.FontInfos): + if fi[0] == xref: + found = True + break + if found: + doc.FontInfos[i] = info + else: + doc.FontInfos.append(info) + + +def args_match(args, *types): + ''' + Returns true if matches . + + Each item in is a type or tuple of types. Any of these types will + match an item in . `None` will match anything in . `type(None)` + will match an arg whose value is `None`. + ''' + j = 0 + for i in range(len(types)): + type_ = types[i] + if j >= len(args): + if isinstance(type_, tuple) and None in type_: + # arg is missing but has default value. + continue + else: + return False + if type_ is not None and not isinstance(args[j], type_): + return False + j += 1 + if j != len(args): + return False + return True + + +def calc_image_matrix(width, height, tr, rotate, keep): + ''' + # compute image insertion matrix + ''' + trect = JM_rect_from_py(tr) + rot = mupdf.fz_rotate(rotate) + trw = trect.x1 - trect.x0 + trh = trect.y1 - trect.y0 + w = trw + h = trh + if keep: + large = max(width, height) + fw = width / large + fh = height / large + else: + fw = fh = 1 + small = min(fw, fh) + if rotate != 0 and rotate != 180: + f = fw + fw = fh + fh = f + if fw < 1: + if trw / fw > trh / fh: + w = trh * small + h = trh + else: + w = trw + h = trw / small + elif fw != fh: + if trw / fw > trh / fh: + w = trh / small + h = trh + else: + w = trw + h = trw * small + else: + w = trw + h = trh + tmp = mupdf.fz_make_point( + (trect.x0 + trect.x1) / 2, + (trect.y0 + trect.y1) / 2, + ) + mat = mupdf.fz_make_matrix(1, 0, 0, 1, -0.5, -0.5) + mat = mupdf.fz_concat(mat, rot) + mat = mupdf.fz_concat(mat, mupdf.fz_scale(w, h)) + mat = mupdf.fz_concat(mat, mupdf.fz_translate(tmp.x, tmp.y)) + return mat + + +def detect_super_script(line, ch): + if line.m_internal.wmode == 0 and line.m_internal.dir.x == 1 and line.m_internal.dir.y == 0: + return ch.m_internal.origin.y < line.m_internal.first_char.origin.y - ch.m_internal.size * 0.1 + return 0 + + +def dir_str(x): + ret = f'{x} {type(x)} ({len(dir(x))}):\n' + for i in dir(x): + ret += f' {i}\n' + return ret + + +def getTJstr(text: str, glyphs: typing.Union[list, tuple, None], simple: bool, ordering: int) -> str: + """ Return a PDF string enclosed in [] brackets, suitable for the PDF TJ + operator. + + Notes: + The input string is converted to either 2 or 4 hex digits per character. + Args: + simple: no glyphs: 2-chars, use char codes as the glyph + glyphs: 2-chars, use glyphs instead of char codes (Symbol, + ZapfDingbats) + not simple: ordering < 0: 4-chars, use glyphs not char codes + ordering >=0: a CJK font! 
4 chars, use char codes as glyphs + """ + if text.startswith("[<") and text.endswith(">]"): # already done + return text + + if not bool(text): + return "[<>]" + + if simple: # each char or its glyph is coded as a 2-byte hex + if glyphs is None: # not Symbol, not ZapfDingbats: use char code + otxt = "".join(["%02x" % ord(c) if ord(c) < 256 else "b7" for c in text]) + else: # Symbol or ZapfDingbats: use glyphs + otxt = "".join( + ["%02x" % glyphs[ord(c)][0] if ord(c) < 256 else "b7" for c in text] + ) + return "[<" + otxt + ">]" + + # non-simple fonts: each char or its glyph is coded as 4-byte hex + if ordering < 0: # not a CJK font: use the glyphs + otxt = "".join(["%04x" % glyphs[ord(c)][0] for c in text]) + else: # CJK: use the char codes + otxt = "".join(["%04x" % ord(c) for c in text]) + + return "[<" + otxt + ">]" + + +def get_pdf_str(s: str) -> str: + """ Return a PDF string depending on its coding. + + Notes: + Returns a string bracketed with either "()" or "<>" for hex values. + If only ascii then "(original)" is returned, else if only 8 bit chars + then "(original)" with interspersed octal strings \nnn is returned, + else a string "" is returned, where [hexstring] is the + UTF-16BE encoding of the original. + """ + if not bool(s): + return "()" + + def make_utf16be(s): + r = bytearray([254, 255]) + bytearray(s, "UTF-16BE") + return "<" + r.hex() + ">" # brackets indicate hex + + # The following either returns the original string with mixed-in + # octal numbers \nnn for chars outside the ASCII range, or returns + # the UTF-16BE BOM version of the string. + r = "" + for c in s: + oc = ord(c) + if oc > 255: # shortcut if beyond 8-bit code range + return make_utf16be(s) + + if oc > 31 and oc < 127: # in ASCII range + if c in ("(", ")", "\\"): # these need to be escaped + r += "\\" + r += c + continue + + if oc > 127: # beyond ASCII + r += "\\%03o" % oc + continue + + # now the white spaces + if oc == 8: # backspace + r += "\\b" + elif oc == 9: # tab + r += "\\t" + elif oc == 10: # line feed + r += "\\n" + elif oc == 12: # form feed + r += "\\f" + elif oc == 13: # carriage return + r += "\\r" + else: + r += "\\267" # unsupported: replace by 0xB7 + + return "(" + r + ")" + + +def get_tessdata(tessdata=None): + """Detect Tesseract language support folder. + + This function is used to enable OCR via Tesseract even if the language + support folder is not specified directly or in environment variable + TESSDATA_PREFIX. + + * If is set we return it directly. + + * Otherwise we return `os.environ['TESSDATA_PREFIX']` if set. + + * Otherwise we search for a Tesseract installation and return its language + support folder. + + * Otherwise we raise an exception. + """ + if tessdata: + return tessdata + tessdata = os.getenv("TESSDATA_PREFIX") + if tessdata: # use environment variable if set + return tessdata + + # Try to locate the tesseract-ocr installation. 
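+    # Usage sketch (illustration only; method names assume a current PyMuPDF
+    # build with OCR support and 'page' being a pymupdf.Page):
+    #     tess = get_tessdata()
+    #     pix = page.get_pixmap(dpi=300)
+    #     ocr_pdf_bytes = pix.pdfocr_tobytes(tessdata=tess)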
+ + import subprocess + + cp = subprocess.run('tesseract --list-langs', shell=1, capture_output=1, check=0, text=True) + if cp.returncode == 0: + m = re.search('List of available languages in "(.+)"', cp.stdout) + if m: + tessdata = m.group(1) + return tessdata + + # Windows systems: + if sys.platform == "win32": + cp = subprocess.run("where tesseract", shell=1, capture_output=1, check=0, text=True) + response = cp.stdout.strip() + if cp.returncode or not response: + raise RuntimeError("No tessdata specified and Tesseract is not installed") + dirname = os.path.dirname(response) # path of tesseract.exe + tessdata = os.path.join(dirname, "tessdata") # language support + if os.path.exists(tessdata): # all ok? + return tessdata + else: # should not happen! + raise RuntimeError("No tessdata specified and Tesseract installation has no {tessdata} folder") + + # Unix-like systems: + attempts = list() + for path in 'tesseract-ocr', 'tesseract': + cp = subprocess.run(f'whereis {path}', shell=1, capture_output=1, check=0, text=True) + if cp.returncode == 0: + response = cp.stdout.strip().split() + if len(response) == 2: + # search tessdata in folder structure + dirname = response[1] # contains tesseract-ocr installation folder + pattern = f"{dirname}/*/tessdata" + attempts.append(pattern) + tessdatas = glob.glob(pattern) + tessdatas.sort() + if tessdatas: + return tessdatas[-1] + if attempts: + text = 'No tessdata specified and no match for:\n' + for attempt in attempts: + text += f' {attempt}' + raise RuntimeError(text) + else: + raise RuntimeError('No tessdata specified and Tesseract is not installed') + + +def css_for_pymupdf_font( + fontcode: str, *, CSS: OptStr = None, archive: AnyType = None, name: OptStr = None +) -> str: + """Create @font-face items for the given fontcode of pymupdf-fonts. + + Adds @font-face support for fonts contained in package pymupdf-fonts. + + Creates a CSS font-family for all fonts starting with string 'fontcode'. + + Note: + The font naming convention in package pymupdf-fonts is "fontcode", + where the suffix "sf" is either empty or one of "it", "bo" or "bi". + These suffixes thus represent the regular, italic, bold or bold-italic + variants of a font. For example, font code "notos" refers to fonts + "notos" - "Noto Sans Regular" + "notosit" - "Noto Sans Italic" + "notosbo" - "Noto Sans Bold" + "notosbi" - "Noto Sans Bold Italic" + + This function creates four CSS @font-face definitions and collectively + assigns the font-family name "notos" to them (or the "name" value). + + All fitting font buffers of the pymupdf-fonts package are placed / added + to the archive provided as parameter. + To use the font in pymupdf.Story, execute 'set_font(fontcode)'. The correct + font weight (bold) or style (italic) will automatically be selected. + Expects and returns the CSS source, with the new CSS definitions appended. + + Args: + fontcode: (str) font code for naming the font variants to include. + E.g. "fig" adds notos, notosi, notosb, notosbi fonts. + A maximum of 4 font variants is accepted. + CSS: (str) CSS string to add @font-face definitions to. + archive: (Archive, mandatory) where to place the font buffers. + name: (str) use this as family-name instead of 'fontcode'. + Returns: + Modified CSS, with appended @font-face statements for each font variant + of fontcode. + Fontbuffers associated with "fontcode" will be added to 'archive'. 
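+
+    Usage sketch (illustration only; assumes package pymupdf-fonts is
+    installed and 'HTML' holds some HTML source):
+
+        arch = pymupdf.Archive()
+        css = css_for_pymupdf_font("notos", CSS="", archive=arch)
+        story = pymupdf.Story(HTML, user_css=css, archive=arch)
+        # in the HTML use e.g. style="font-family: notos"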
+ """ + # @font-face template string + CSSFONT = "\n@font-face {font-family: %s; src: url(%s);%s%s}\n" + + if not type(archive) is Archive: + raise ValueError("'archive' must be an Archive") + if CSS is None: + CSS = "" + + # select font codes starting with the pass-in string + font_keys = [k for k in fitz_fontdescriptors.keys() if k.startswith(fontcode)] + if font_keys == []: + raise ValueError(f"No font code '{fontcode}' found in pymupdf-fonts.") + if len(font_keys) > 4: + raise ValueError("fontcode too short") + if name is None: # use this name for font-family + name = fontcode + + for fkey in font_keys: + font = fitz_fontdescriptors[fkey] + bold = font["bold"] # determine font property + italic = font["italic"] # determine font property + fbuff = font["loader"]() # load the fontbuffer + archive.add(fbuff, fkey) # update the archive + bold_text = "font-weight: bold;" if bold else "" + italic_text = "font-style: italic;" if italic else "" + CSS += CSSFONT % (name, fkey, bold_text, italic_text) + return CSS + + +def get_text_length(text: str, fontname: str ="helv", fontsize: float =11, encoding: int =0) -> float: + """Calculate length of a string for a built-in font. + + Args: + fontname: name of the font. + fontsize: font size points. + encoding: encoding to use, 0=Latin (default), 1=Greek, 2=Cyrillic. + Returns: + (float) length of text. + """ + fontname = fontname.lower() + basename = Base14_fontdict.get(fontname, None) + + glyphs = None + if basename == "Symbol": + glyphs = symbol_glyphs + if basename == "ZapfDingbats": + glyphs = zapf_glyphs + if glyphs is not None: + w = sum([glyphs[ord(c)][1] if ord(c) < 256 else glyphs[183][1] for c in text]) + return w * fontsize + + if fontname in Base14_fontdict.keys(): + return util_measure_string( + text, Base14_fontdict[fontname], fontsize, encoding + ) + + if fontname in ( + "china-t", + "china-s", + "china-ts", + "china-ss", + "japan", + "japan-s", + "korea", + "korea-s", + ): + return len(text) * fontsize + + raise ValueError("Font '%s' is unsupported" % fontname) + + +def image_profile(img: ByteString) -> dict: + """ Return basic properties of an image. + + Args: + img: bytes, bytearray, io.BytesIO object or an opened image file. + Returns: + A dictionary with keys width, height, colorspace.n, bpc, type, ext and size, + where 'type' is the MuPDF image type (0 to 14) and 'ext' the suitable + file extension. + """ + if type(img) is io.BytesIO: + stream = img.getvalue() + elif hasattr(img, "read"): + stream = img.read() + elif type(img) in (bytes, bytearray): + stream = img + else: + raise ValueError("bad argument 'img'") + + return TOOLS.image_profile(stream) + + +def jm_append_merge(dev): + ''' + Append current path to list or merge into last path of the list. + (1) Append if first path, different item lists or not a 'stroke' version + of previous path + (2) If new path has the same items, merge its content into previous path + and change path["type"] to "fs". + (3) If "out" is callable, skip the previous and pass dictionary to it. + ''' + #log(f'{getattr(dev, "pathdict", None)=}') + assert isinstance(dev.out, list) + #log( f'{dev.out=}') + + if callable(dev.method) or dev.method: # function or method + # callback. + if dev.method is None: + # fixme, this surely cannot happen? 
+ assert 0 + #resp = PyObject_CallFunctionObjArgs(out, dev.pathdict, NULL) + else: + #log(f'calling {dev.out=} {dev.method=} {dev.pathdict=}') + resp = getattr(dev.out, dev.method)(dev.pathdict) + if not resp: + message("calling cdrawings callback function/method failed!") + dev.pathdict = None + return + + def append(): + #log(f'jm_append_merge(): clearing dev.pathdict') + dev.out.append(dev.pathdict.copy()) + dev.pathdict.clear() + assert isinstance(dev.out, list) + len_ = len(dev.out) # len of output list so far + #log('{len_=}') + if len_ == 0: # always append first path + return append() + #log(f'{getattr(dev, "pathdict", None)=}') + thistype = dev.pathdict[ dictkey_type] + #log(f'{thistype=}') + if thistype != 's': # if not stroke, then append + return append() + prev = dev.out[ len_-1] # get prev path + #log( f'{prev=}') + prevtype = prev[ dictkey_type] + #log( f'{prevtype=}') + if prevtype != 'f': # if previous not fill, append + return append() + # last check: there must be the same list of items for "f" and "s". + previtems = prev[ dictkey_items] + thisitems = dev.pathdict[ dictkey_items] + if previtems != thisitems: + return append() + + #rc = PyDict_Merge(prev, dev.pathdict, 0); // merge with no override + try: + for k, v in dev.pathdict.items(): + if k not in prev: + prev[k] = v + rc = 0 + except Exception: + if g_exceptions_verbose: exception_info() + #raise + rc = -1 + if rc == 0: + prev[ dictkey_type] = 'fs' + dev.pathdict.clear() + else: + message("could not merge stroke and fill path") + append() + + +def jm_bbox_add_rect( dev, ctx, rect, code): + if not dev.layers: + dev.result.append( (code, JM_py_from_rect(rect))) + else: + dev.result.append( (code, JM_py_from_rect(rect), dev.layer_name)) + + +def jm_bbox_fill_image( dev, ctx, image, ctm, alpha, color_params): + r = mupdf.FzRect(mupdf.FzRect.Fixed_UNIT) + r = mupdf.ll_fz_transform_rect( r.internal(), ctm) + jm_bbox_add_rect( dev, ctx, r, "fill-image") + + +def jm_bbox_fill_image_mask( dev, ctx, image, ctm, colorspace, color, alpha, color_params): + try: + jm_bbox_add_rect( dev, ctx, mupdf.ll_fz_transform_rect(mupdf.fz_unit_rect, ctm), "fill-imgmask") + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +def jm_bbox_fill_path( dev, ctx, path, even_odd, ctm, colorspace, color, alpha, color_params): + even_odd = True if even_odd else False + try: + jm_bbox_add_rect( dev, ctx, mupdf.ll_fz_bound_path(path, None, ctm), "fill-path") + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +def jm_bbox_fill_shade( dev, ctx, shade, ctm, alpha, color_params): + try: + jm_bbox_add_rect( dev, ctx, mupdf.ll_fz_bound_shade( shade, ctm), "fill-shade") + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +def jm_bbox_stroke_text( dev, ctx, text, stroke, ctm, *args): + try: + jm_bbox_add_rect( dev, ctx, mupdf.ll_fz_bound_text( text, stroke, ctm), "stroke-text") + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +def jm_bbox_fill_text( dev, ctx, text, ctm, *args): + try: + jm_bbox_add_rect( dev, ctx, mupdf.ll_fz_bound_text( text, None, ctm), "fill-text") + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +def jm_bbox_ignore_text( dev, ctx, text, ctm): + jm_bbox_add_rect( dev, ctx, mupdf.ll_fz_bound_text(text, None, ctm), "ignore-text") + + +def jm_bbox_stroke_path( dev, ctx, path, stroke, ctm, colorspace, color, alpha, color_params): + try: + jm_bbox_add_rect( dev, ctx, mupdf.ll_fz_bound_path( path, stroke, ctm), 
"stroke-path") + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +def jm_checkquad(dev): + ''' + Check whether the last 4 lines represent a quad. + Because of how we count, the lines are a polyline already, i.e. last point + of a line equals 1st point of next line. + So we check for a polygon (last line's end point equals start point). + If not true we return 0. + ''' + #log(f'{getattr(dev, "pathdict", None)=}') + items = dev.pathdict[ dictkey_items] + len_ = len(items) + f = [0] * 8 # coordinates of the 4 corners + # fill the 8 floats in f, start from items[-4:] + for i in range( 4): # store line start points + line = items[ len_ - 4 + i] + temp = JM_point_from_py( line[1]) + f[i * 2] = temp.x + f[i * 2 + 1] = temp.y + lp = JM_point_from_py( line[ 2]) + if lp.x != f[0] or lp.y != f[1]: + # not a polygon! + #dev.linecount -= 1 + return 0 + + # we have detected a quad + dev.linecount = 0 # reset this + # a quad item is ("qu", (ul, ur, ll, lr)), where the tuple items + # are pairs of floats representing a quad corner each. + + # relationship of float array to quad points: + # (0, 1) = ul, (2, 3) = ll, (6, 7) = ur, (4, 5) = lr + q = mupdf.fz_make_quad(f[0], f[1], f[6], f[7], f[2], f[3], f[4], f[5]) + rect = ('qu', JM_py_from_quad(q)) + + items[ len_ - 4] = rect # replace item -4 by rect + del items[ len_ - 3 : len_] # delete remaining 3 items + return 1 + + +def jm_checkrect(dev): + ''' + Check whether the last 3 path items represent a rectangle. + Returns 1 if we have modified the path, otherwise 0. + ''' + #log(f'{getattr(dev, "pathdict", None)=}') + dev.linecount = 0 # reset line count + orientation = 0 # area orientation of rectangle + items = dev.pathdict[ dictkey_items] + len_ = len(items) + + line0 = items[ len_ - 3] + ll = JM_point_from_py( line0[ 1]) + lr = JM_point_from_py( line0[ 2]) + + # no need to extract "line1"! + line2 = items[ len_ - 1] + ur = JM_point_from_py( line2[ 1]) + ul = JM_point_from_py( line2[ 2]) + + # Assumption: + # When decomposing rects, MuPDF always starts with a horizontal line, + # followed by a vertical line, followed by a horizontal line. + # First line: (ll, lr), third line: (ul, ur). + # If 1st line is below 3rd line, we record anti-clockwise (+1), else + # clockwise (-1) orientation. + + if (0 + or ll.y != lr.y + or ll.x != ul.x + or ur.y != ul.y + or ur.x != lr.x + ): + return 0 # not a rectangle + + # we have a rect, replace last 3 "l" items by one "re" item. 
+ if ul.y < lr.y: + r = mupdf.fz_make_rect(ul.x, ul.y, lr.x, lr.y) + orientation = 1 + else: + r = mupdf.fz_make_rect(ll.x, ll.y, ur.x, ur.y) + orientation = -1 + + rect = ( 're', JM_py_from_rect(r), orientation) + items[ len_ - 3] = rect # replace item -3 by rect + del items[ len_ - 2 : len_] # delete remaining 2 items + return 1 + + +def jm_trace_text( dev, text, type_, ctm, colorspace, color, alpha, seqno): + span = text.head + while 1: + if not span: + break + jm_trace_text_span( dev, span, type_, ctm, colorspace, color, alpha, seqno) + span = span.next + + +def jm_trace_text_span(dev, span, type_, ctm, colorspace, color, alpha, seqno): + ''' + jm_trace_text_span(fz_context *ctx, PyObject *out, fz_text_span *span, int type, fz_matrix ctm, fz_colorspace *colorspace, const float *color, float alpha, size_t seqno) + ''' + out_font = None + assert isinstance( span, mupdf.fz_text_span) + span = mupdf.FzTextSpan( span) + assert isinstance( ctm, mupdf.fz_matrix) + ctm = mupdf.FzMatrix( ctm) + fontname = JM_font_name( span.font()) + #float rgb[3]; + #PyObject *chars = PyTuple_New(span->len); + + mat = mupdf.fz_concat(span.trm(), ctm) # text transformation matrix + dir = mupdf.fz_transform_vector(mupdf.fz_make_point(1, 0), mat) # writing direction + fsize = math.sqrt(dir.x * dir.x + dir.y * dir.y) # font size + + dir = mupdf.fz_normalize_vector(dir) + + space_adv = 0 + asc = JM_font_ascender( span.font()) + dsc = JM_font_descender( span.font()) + if asc < 1e-3: # probably Tesseract font + dsc = -0.1 + asc = 0.9 + + # compute effective ascender / descender + ascsize = asc * fsize / (asc - dsc) + dscsize = dsc * fsize / (asc - dsc) + fflags = 0 # font flags + mono = mupdf.fz_font_is_monospaced( span.font()) + fflags += mono * TEXT_FONT_MONOSPACED + fflags += mupdf.fz_font_is_italic( span.font()) * TEXT_FONT_ITALIC + fflags += mupdf.fz_font_is_serif( span.font()) * TEXT_FONT_SERIFED + fflags += mupdf.fz_font_is_bold( span.font()) * TEXT_FONT_BOLD + + last_adv = 0 + + # walk through characters of span + span_bbox = mupdf.FzRect() + rot = mupdf.fz_make_matrix(dir.x, dir.y, -dir.y, dir.x, 0, 0) + if dir.x == -1: # left-right flip + rot.d = 1 + + chars = [] + for i in range( span.m_internal.len): + adv = 0 + if span.items(i).gid >= 0: + adv = mupdf.fz_advance_glyph( span.font(), span.items(i).gid, span.m_internal.wmode) + adv *= fsize + last_adv = adv + if span.items(i).ucs == 32: + space_adv = adv + char_orig = mupdf.fz_make_point(span.items(i).x, span.items(i).y) + char_orig = mupdf.fz_transform_point(char_orig, ctm) + m1 = mupdf.fz_make_matrix(1, 0, 0, 1, -char_orig.x, -char_orig.y) + m1 = mupdf.fz_concat(m1, rot) + m1 = mupdf.fz_concat(m1, mupdf.FzMatrix(1, 0, 0, 1, char_orig.x, char_orig.y)) + x0 = char_orig.x + x1 = x0 + adv + if ( + (mat.d > 0 and (dir.x == 1 or dir.x == -1)) + or + (mat.b != 0 and mat.b == -mat.c) + ): # up-down flip + y0 = char_orig.y + dscsize + y1 = char_orig.y + ascsize + else: + y0 = char_orig.y - ascsize + y1 = char_orig.y - dscsize + char_bbox = mupdf.fz_make_rect(x0, y0, x1, y1) + char_bbox = mupdf.fz_transform_rect(char_bbox, m1) + chars.append( + ( + span.items(i).ucs, + span.items(i).gid, + ( + char_orig.x, + char_orig.y, + ), + ( + char_bbox.x0, + char_bbox.y0, + char_bbox.x1, + char_bbox.y1, + ), + ) + ) + if i > 0: + span_bbox = mupdf.fz_union_rect(span_bbox, char_bbox) + else: + span_bbox = char_bbox + chars = tuple(chars) + + if not space_adv: + if not (fflags & TEXT_FONT_MONOSPACED): + c, out_font = mupdf.fz_encode_character_with_fallback( span.font(), 32, 0, 
0) + space_adv = mupdf.fz_advance_glyph( + span.font(), + c, + span.m_internal.wmode, + ) + space_adv *= fsize + if not space_adv: + space_adv = last_adv + else: + space_adv = last_adv # for mono, any char width suffices + + # make the span dictionary + span_dict = dict() + span_dict[ 'dir'] = JM_py_from_point(dir) + span_dict[ 'font'] = JM_EscapeStrFromStr(fontname) + span_dict[ 'wmode'] = span.m_internal.wmode + span_dict[ 'flags'] =fflags + span_dict[ "bidi_lvl"] =span.m_internal.bidi_level + span_dict[ "bidi_dir"] = span.m_internal.markup_dir + span_dict[ 'ascender'] = asc + span_dict[ 'descender'] = dsc + span_dict[ 'colorspace'] = 3 + + if colorspace: + rgb = mupdf.fz_convert_color( + mupdf.FzColorspace( mupdf.ll_fz_keep_colorspace( colorspace)), + color, + mupdf.fz_device_rgb(), + mupdf.FzColorspace(), + mupdf.FzColorParams(), + ) + rgb = rgb[:3] # mupdf.fz_convert_color() always returns 4 items. + else: + rgb = (0, 0, 0) + + if dev.linewidth > 0: # width of character border + linewidth = dev.linewidth + else: + linewidth = fsize * 0.05 # default: 5% of font size + #log(f'{dev.linewidth=:.4f} {fsize=:.4f} {linewidth=:.4f}') + + span_dict[ 'color'] = rgb + span_dict[ 'size'] = fsize + span_dict[ "opacity"] = alpha + span_dict[ "linewidth"] = linewidth + span_dict[ "spacewidth"] = space_adv + span_dict[ 'type'] = type_ + span_dict[ 'bbox'] = JM_py_from_rect(span_bbox) + span_dict[ 'layer'] = dev.layer_name + span_dict[ "seqno"] = seqno + span_dict[ 'chars'] = chars + #log(f'{span_dict=}') + dev.out.append( span_dict) + + +def jm_lineart_color(colorspace, color): + #log(f' ') + if colorspace: + try: + # Need to be careful to use a named Python object to ensure + # that the `params` we pass to mupdf.ll_fz_convert_color() is + # valid. E.g. doing: + # + # rgb = mupdf.ll_fz_convert_color(..., mupdf.FzColorParams().internal()) + # + # - seems to end up with a corrupted `params`. 
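+            # Keeping 'cs' and 'cp' in named locals (below) ensures the
+            # underlying MuPDF colorspace / color-params structs stay alive
+            # while ll_fz_convert_color() runs.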
+ # + cs = mupdf.FzColorspace( mupdf.FzColorspace.Fixed_RGB) + cp = mupdf.FzColorParams() + rgb = mupdf.ll_fz_convert_color( + colorspace, + color, + cs.m_internal, + None, + cp.internal(), + ) + except Exception: + if g_exceptions_verbose: exception_info() + raise + return rgb[:3] + return () + + +def jm_lineart_drop_device(dev, ctx): + if isinstance(dev.out, list): + dev.out = [] + dev.scissors = [] + + +def jm_lineart_fill_path( dev, ctx, path, even_odd, ctm, colorspace, color, alpha, color_params): + #log(f'{getattr(dev, "pathdict", None)=}') + #log(f'jm_lineart_fill_path(): {dev.seqno=}') + even_odd = True if even_odd else False + try: + assert isinstance( ctm, mupdf.fz_matrix) + dev.ctm = mupdf.FzMatrix( ctm) # fz_concat(ctm, dev_ptm); + dev.path_type = trace_device_FILL_PATH + jm_lineart_path( dev, ctx, path) + if dev.pathdict is None: + return + #item_count = len(dev.pathdict[ dictkey_items]) + #if item_count == 0: + # return + dev.pathdict[ dictkey_type] ="f" + dev.pathdict[ "even_odd"] = even_odd + dev.pathdict[ "fill_opacity"] = alpha + #log(f'setting dev.pathdict[ "closePath"] to false') + #dev.pathdict[ "closePath"] = False + dev.pathdict[ "fill"] = jm_lineart_color( colorspace, color) + dev.pathdict[ dictkey_rect] = JM_py_from_rect(dev.pathrect) + dev.pathdict[ "seqno"] = dev.seqno + #jm_append_merge(dev) + dev.pathdict[ 'layer'] = dev.layer_name + if dev.clips: + dev.pathdict[ 'level'] = dev.depth + jm_append_merge(dev) + dev.seqno += 1 + #log(f'jm_lineart_fill_path() end: {getattr(dev, "pathdict", None)=}') + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +# There are 3 text trace types: +# 0 - fill text (PDF Tr 0) +# 1 - stroke text (PDF Tr 1) +# 3 - ignore text (PDF Tr 3) + +def jm_lineart_fill_text( dev, ctx, text, ctm, colorspace, color, alpha, color_params): + if 0: + log(f'{type(ctx)=} {ctx=}') + log(f'{type(dev)=} {dev=}') + log(f'{type(text)=} {text=}') + log(f'{type(ctm)=} {ctm=}') + log(f'{type(colorspace)=} {colorspace=}') + log(f'{type(color)=} {color=}') + log(f'{type(alpha)=} {alpha=}') + log(f'{type(color_params)=} {color_params=}') + jm_trace_text(dev, text, 0, ctm, colorspace, color, alpha, dev.seqno) + dev.seqno += 1 + + +def jm_lineart_ignore_text(dev, text, ctm): + #log(f'{getattr(dev, "pathdict", None)=}') + jm_trace_text(dev, text, 3, ctm, None, None, 1, dev.seqno) + dev.seqno += 1 + + +class Walker(mupdf.FzPathWalker2): + + def __init__(self, dev): + super().__init__() + self.use_virtual_moveto() + self.use_virtual_lineto() + self.use_virtual_curveto() + self.use_virtual_closepath() + self.dev = dev + + def closepath(self, ctx): # trace_close(). + #log(f'Walker(): {self.dev.pathdict=}') + try: + if self.dev.linecount == 3: + if jm_checkrect(self.dev): + #log(f'end1: {self.dev.pathdict=}') + return + self.dev.linecount = 0 # reset # of consec. lines + + if self.dev.havemove: + if self.dev.lastpoint != self.dev.firstpoint: + item = ("l", JM_py_from_point(self.dev.lastpoint), + JM_py_from_point(self.dev.firstpoint)) + self.dev.pathdict[dictkey_items].append(item) + self.dev.lastpoint = self.dev.firstpoint + self.dev.pathdict["closePath"] = False + + else: + #log('setting self.dev.pathdict[ "closePath"] to true') + self.dev.pathdict[ "closePath"] = True + #log(f'end2: {self.dev.pathdict=}') + + self.dev.havemove = 0 + + except Exception: + if g_exceptions_verbose: exception_info() + raise + + def curveto(self, ctx, x1, y1, x2, y2, x3, y3): # trace_curveto(). 
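+        # Record a cubic Bezier segment: the three control points are
+        # transformed with dev.ctm, included in dev.pathrect, and appended
+        # as a ("c", lastpoint, p1, p2, p3) item to the path dictionary.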
+ #log(f'Walker(): {self.dev.pathdict=}') + try: + self.dev.linecount = 0 # reset # of consec. lines + p1 = mupdf.fz_make_point(x1, y1) + p2 = mupdf.fz_make_point(x2, y2) + p3 = mupdf.fz_make_point(x3, y3) + p1 = mupdf.fz_transform_point(p1, self.dev.ctm) + p2 = mupdf.fz_transform_point(p2, self.dev.ctm) + p3 = mupdf.fz_transform_point(p3, self.dev.ctm) + self.dev.pathrect = mupdf.fz_include_point_in_rect(self.dev.pathrect, p1) + self.dev.pathrect = mupdf.fz_include_point_in_rect(self.dev.pathrect, p2) + self.dev.pathrect = mupdf.fz_include_point_in_rect(self.dev.pathrect, p3) + + list_ = ( + "c", + JM_py_from_point(self.dev.lastpoint), + JM_py_from_point(p1), + JM_py_from_point(p2), + JM_py_from_point(p3), + ) + self.dev.lastpoint = p3 + self.dev.pathdict[ dictkey_items].append( list_) + except Exception: + if g_exceptions_verbose: exception_info() + raise + + def lineto(self, ctx, x, y): # trace_lineto(). + #log(f'Walker(): {self.dev.pathdict=}') + try: + p1 = mupdf.fz_transform_point( mupdf.fz_make_point(x, y), self.dev.ctm) + self.dev.pathrect = mupdf.fz_include_point_in_rect( self.dev.pathrect, p1) + list_ = ( + 'l', + JM_py_from_point( self.dev.lastpoint), + JM_py_from_point(p1), + ) + self.dev.lastpoint = p1 + items = self.dev.pathdict[ dictkey_items] + items.append( list_) + self.dev.linecount += 1 # counts consecutive lines + if self.dev.linecount == 4 and self.dev.path_type != trace_device_FILL_PATH: + # shrink to "re" or "qu" item + jm_checkquad(self.dev) + except Exception: + if g_exceptions_verbose: exception_info() + raise + + def moveto(self, ctx, x, y): # trace_moveto(). + if 0 and isinstance(self.dev.pathdict, dict): + log(f'self.dev.pathdict:') + for n, v in self.dev.pathdict.items(): + log( ' {type(n)=} {len(n)=} {n!r} {n}: {v!r}: {v}') + + #log(f'Walker(): {type(self.dev.pathdict)=} {self.dev.pathdict=}') + + try: + #log( '{=dev.ctm type(dev.ctm)}') + self.dev.lastpoint = mupdf.fz_transform_point( + mupdf.fz_make_point(x, y), + self.dev.ctm, + ) + if mupdf.fz_is_infinite_rect( self.dev.pathrect): + self.dev.pathrect = mupdf.fz_make_rect( + self.dev.lastpoint.x, + self.dev.lastpoint.y, + self.dev.lastpoint.x, + self.dev.lastpoint.y, + ) + self.dev.firstpoint = self.dev.lastpoint + self.dev.havemove = 1 + self.dev.linecount = 0 # reset # of consec. lines + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +def jm_lineart_path(dev, ctx, path): + ''' + Create the "items" list of the path dictionary + * either create or empty the path dictionary + * reset the end point of the path + * reset count of consecutive lines + * invoke fz_walk_path(), which create the single items + * if no items detected, empty path dict again + ''' + #log(f'{getattr(dev, "pathdict", None)=}') + try: + dev.pathrect = mupdf.FzRect( mupdf.FzRect.Fixed_INFINITE) + dev.linecount = 0 + dev.lastpoint = mupdf.FzPoint( 0, 0) + dev.pathdict = dict() + dev.pathdict[ dictkey_items] = [] + + # First time we create a Walker instance is slow, e.g. 0.3s, then later + # times run in around 0.01ms. If Walker is defined locally instead of + # globally, each time takes 0.3s. + # + walker = Walker(dev) + # Unlike fz_run_page(), fz_path_walker callbacks are not passed + # a pointer to the struct, instead they get an arbitrary + # void*. The underlying C++ Director callbacks use this void* to + # identify the fz_path_walker instance so in turn we need to pass + # arg=walker.m_internal. 
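+        #
+        # The walker callbacks fill dev.pathdict[dictkey_items] with entries
+        # such as ("l", p1, p2), ("c", p1, c1, c2, p2) and - after
+        # jm_checkrect()/jm_checkquad() - "re" (rect, orientation) or "qu"
+        # items.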
+ mupdf.fz_walk_path( mupdf.FzPath(mupdf.ll_fz_keep_path(path)), walker, walker.m_internal) + # Check if any items were added ... + if not dev.pathdict[ dictkey_items]: + dev.pathdict = None + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +def jm_lineart_stroke_path( dev, ctx, path, stroke, ctm, colorspace, color, alpha, color_params): + #log(f'{dev.pathdict=} {dev.clips=}') + try: + assert isinstance( ctm, mupdf.fz_matrix) + dev.pathfactor = 1 + if ctm.a != 0 and abs(ctm.a) == abs(ctm.d): + dev.pathfactor = abs(ctm.a) + elif ctm.b != 0 and abs(ctm.b) == abs(ctm.c): + dev.pathfactor = abs(ctm.b) + dev.ctm = mupdf.FzMatrix( ctm) # fz_concat(ctm, dev_ptm); + dev.path_type = trace_device_STROKE_PATH + + jm_lineart_path( dev, ctx, path) + if dev.pathdict is None: + return + dev.pathdict[ dictkey_type] = 's' + dev.pathdict[ 'stroke_opacity'] = alpha + dev.pathdict[ 'color'] = jm_lineart_color( colorspace, color) + dev.pathdict[ dictkey_width] = dev.pathfactor * stroke.linewidth + dev.pathdict[ 'lineCap'] = ( + stroke.start_cap, + stroke.dash_cap, + stroke.end_cap, + ) + dev.pathdict[ 'lineJoin'] = dev.pathfactor * stroke.linejoin + if 'closePath' not in dev.pathdict: + #log('setting dev.pathdict["closePath"] to false') + dev.pathdict['closePath'] = False + + # output the "dashes" string + if stroke.dash_len: + buff = mupdf.fz_new_buffer( 256) + mupdf.fz_append_string( buff, "[ ") # left bracket + for i in range( stroke.dash_len): + # We use mupdf python's SWIG-generated floats_getitem() fn to + # access float *stroke.dash_list[]. + value = mupdf.floats_getitem( stroke.dash_list, i) # stroke.dash_list[i]. + mupdf.fz_append_string( buff, f'{_format_g(dev.pathfactor * value)} ') + mupdf.fz_append_string( buff, f'] {_format_g(dev.pathfactor * stroke.dash_phase)}') + dev.pathdict[ 'dashes'] = buff + else: + dev.pathdict[ 'dashes'] = '[] 0' + dev.pathdict[ dictkey_rect] = JM_py_from_rect(dev.pathrect) + dev.pathdict['layer'] = dev.layer_name + dev.pathdict[ 'seqno'] = dev.seqno + if dev.clips: + dev.pathdict[ 'level'] = dev.depth + jm_append_merge(dev) + dev.seqno += 1 + + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +def jm_lineart_clip_path(dev, ctx, path, even_odd, ctm, scissor): + if not dev.clips: + return + dev.ctm = mupdf.FzMatrix(ctm) # fz_concat(ctm, trace_device_ptm); + dev.path_type = trace_device_CLIP_PATH + jm_lineart_path(dev, ctx, path) + if dev.pathdict is None: + return + dev.pathdict[ dictkey_type] = 'clip' + dev.pathdict[ 'even_odd'] = bool(even_odd) + if 'closePath' not in dev.pathdict: + #log(f'setting dev.pathdict["closePath"] to False') + dev.pathdict['closePath'] = False + + dev.pathdict['scissor'] = JM_py_from_rect(compute_scissor(dev)) + dev.pathdict['level'] = dev.depth + dev.pathdict['layer'] = dev.layer_name + jm_append_merge(dev) + dev.depth += 1 + + +def jm_lineart_clip_stroke_path(dev, ctx, path, stroke, ctm, scissor): + if not dev.clips: + return + dev.ctm = mupdf.FzMatrix(ctm) # fz_concat(ctm, trace_device_ptm); + dev.path_type = trace_device_CLIP_STROKE_PATH + jm_lineart_path(dev, ctx, path) + if dev.pathdict is None: + return + dev.pathdict['dictkey_type'] = 'clip' + dev.pathdict['even_odd'] = None + if 'closePath' not in dev.pathdict: + #log(f'setting dev.pathdict["closePath"] to False') + dev.pathdict['closePath'] = False + dev.pathdict['scissor'] = JM_py_from_rect(compute_scissor(dev)) + dev.pathdict['level'] = dev.depth + dev.pathdict['layer'] = dev.layer_name + jm_append_merge(dev) + dev.depth += 1 + + 
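+# The fill/stroke/clip callbacks above feed the path dictionaries consumed by
+# Page.get_cdrawings() (and, indirectly, Page.get_drawings()).  A minimal
+# usage sketch - the file name "example.pdf" is only a placeholder:
+#
+#     import pymupdf
+#     with pymupdf.open("example.pdf") as doc:
+#         for path in doc[0].get_drawings(extended=True):
+#             # 'type' is e.g. "f", "s", "clip" or "group", as set above.
+#             print(path["type"], path.get("rect"), len(path.get("items", ())))
+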
+def jm_lineart_clip_stroke_text(dev, ctx, text, stroke, ctm, scissor): + if not dev.clips: + return + compute_scissor(dev) + dev.depth += 1 + + +def jm_lineart_clip_text(dev, ctx, text, ctm, scissor): + if not dev.clips: + return + compute_scissor(dev) + dev.depth += 1 + + +def jm_lineart_clip_image_mask( dev, ctx, image, ctm, scissor): + if not dev.clips: + return + compute_scissor(dev) + dev.depth += 1 + + +def jm_lineart_pop_clip(dev, ctx): + if not dev.clips or not dev.scissors: + return + len_ = len(dev.scissors) + if len_ < 1: + return + del dev.scissors[-1] + dev.depth -= 1 + + +def jm_lineart_begin_layer(dev, ctx, name): + if name: + dev.layer_name = name + else: + dev.layer_name = "" + + +def jm_lineart_end_layer(dev, ctx): + dev.layer_name = "" + + +def jm_lineart_begin_group(dev, ctx, bbox, cs, isolated, knockout, blendmode, alpha): + #log(f'{dev.pathdict=} {dev.clips=}') + if not dev.clips: + return + dev.pathdict = { # Py_BuildValue("{s:s,s:N,s:N,s:N,s:s,s:f,s:i,s:N}", + "type": "group", + "rect": JM_py_from_rect(bbox), + "isolated": bool(isolated), + "knockout": bool(knockout), + "blendmode": mupdf.fz_blendmode_name(blendmode), + "opacity": alpha, + "level": dev.depth, + "layer": dev.layer_name + } + jm_append_merge(dev) + dev.depth += 1 + + +def jm_lineart_end_group(dev, ctx): + #log(f'{dev.pathdict=} {dev.clips=}') + if not dev.clips: + return + dev.depth -= 1 + + +def jm_lineart_stroke_text(dev, ctx, text, stroke, ctm, colorspace, color, alpha, color_params): + jm_trace_text(dev, text, 1, ctm, colorspace, color, alpha, dev.seqno) + dev.seqno += 1 + + +def jm_dev_linewidth( dev, ctx, path, stroke, matrix, colorspace, color, alpha, color_params): + dev.linewidth = stroke.linewidth + jm_increase_seqno( dev, ctx) + + +def jm_increase_seqno( dev, ctx, *vargs): + try: + dev.seqno += 1 + except Exception: + if g_exceptions_verbose: exception_info() + raise + + +def planish_line(p1: point_like, p2: point_like) -> Matrix: + """Compute matrix which maps line from p1 to p2 to the x-axis, such that it + maintains its length and p1 * matrix = Point(0, 0). + + Args: + p1, p2: point_like + Returns: + Matrix which maps p1 to Point(0, 0) and p2 to a point on the x axis at + the same distance to Point(0,0). Will always combine a rotation and a + transformation. + """ + p1 = Point(p1) + p2 = Point(p2) + return Matrix(util_hor_matrix(p1, p2)) + + +class JM_image_reporter_Filter(mupdf.PdfFilterOptions2): + def __init__(self): + super().__init__() + self.use_virtual_image_filter() + + def image_filter( self, ctx, ctm, name, image): + assert isinstance(ctm, mupdf.fz_matrix) + JM_image_filter(self, mupdf.FzMatrix(ctm), name, image) + if mupdf_cppyy: + # cppyy doesn't appear to treat returned None as nullptr, + # resulting in obscure 'python exception' exception. 
+ return 0 + + +class JM_new_bbox_device_Device(mupdf.FzDevice2): + def __init__(self, result, layers): + super().__init__() + self.result = result + self.layers = layers + self.layer_name = "" + self.use_virtual_fill_path() + self.use_virtual_stroke_path() + self.use_virtual_fill_text() + self.use_virtual_stroke_text() + self.use_virtual_ignore_text() + self.use_virtual_fill_shade() + self.use_virtual_fill_image() + self.use_virtual_fill_image_mask() + + self.use_virtual_begin_layer() + self.use_virtual_end_layer() + + begin_layer = jm_lineart_begin_layer + end_layer = jm_lineart_end_layer + + fill_path = jm_bbox_fill_path + stroke_path = jm_bbox_stroke_path + fill_text = jm_bbox_fill_text + stroke_text = jm_bbox_stroke_text + ignore_text = jm_bbox_ignore_text + fill_shade = jm_bbox_fill_shade + fill_image = jm_bbox_fill_image + fill_image_mask = jm_bbox_fill_image_mask + + +class JM_new_output_fileptr_Output(mupdf.FzOutput2): + def __init__(self, bio): + super().__init__() + self.bio = bio + self.use_virtual_write() + self.use_virtual_seek() + self.use_virtual_tell() + self.use_virtual_truncate() + + def seek( self, ctx, offset, whence): + return self.bio.seek( offset, whence) + + def tell( self, ctx): + ret = self.bio.tell() + return ret + + def truncate( self, ctx): + return self.bio.truncate() + + def write(self, ctx, data_raw, data_length): + data = mupdf.raw_to_python_bytes(data_raw, data_length) + return self.bio.write(data) + + +def compute_scissor(dev): + ''' + Every scissor of a clip is a sub rectangle of the preceding clip scissor + if the clip level is larger. + ''' + if dev.scissors is None: + dev.scissors = list() + num_scissors = len(dev.scissors) + if num_scissors > 0: + last_scissor = dev.scissors[num_scissors-1] + scissor = JM_rect_from_py(last_scissor) + scissor = mupdf.fz_intersect_rect(scissor, dev.pathrect) + else: + scissor = dev.pathrect + dev.scissors.append(JM_py_from_rect(scissor)) + return scissor + + +class JM_new_lineart_device_Device(mupdf.FzDevice2): + ''' + LINEART device for Python method Page.get_cdrawings() + ''' + #log(f'JM_new_lineart_device_Device()') + def __init__(self, out, clips, method): + #log(f'JM_new_lineart_device_Device.__init__()') + super().__init__() + # fixme: this results in "Unexpected call of unimplemented virtual_fnptrs fn FzDevice2::drop_device().". 
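+        # The use_virtual_*() calls below enable the corresponding
+        # virtual_fnptrs overrides on this FzDevice2; the actual handlers are
+        # the jm_lineart_* / jm_increase_seqno functions assigned as class
+        # attributes at the end of this class.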
+ #self.use_virtual_drop_device() + self.use_virtual_fill_path() + self.use_virtual_stroke_path() + self.use_virtual_clip_path() + self.use_virtual_clip_image_mask() + self.use_virtual_clip_stroke_path() + self.use_virtual_clip_stroke_text() + self.use_virtual_clip_text() + + self.use_virtual_fill_text + self.use_virtual_stroke_text + self.use_virtual_ignore_text + + self.use_virtual_fill_shade() + self.use_virtual_fill_image() + self.use_virtual_fill_image_mask() + + self.use_virtual_pop_clip() + + self.use_virtual_begin_group() + self.use_virtual_end_group() + + self.use_virtual_begin_layer() + self.use_virtual_end_layer() + + self.out = out + self.seqno = 0 + self.depth = 0 + self.clips = clips + self.method = method + + self.scissors = None + self.layer_name = "" # optional content name + self.pathrect = None + + self.linewidth = 0 + self.ptm = mupdf.FzMatrix() + self.ctm = mupdf.FzMatrix() + self.rot = mupdf.FzMatrix() + self.lastpoint = mupdf.FzPoint() + self.firstpoint = mupdf.FzPoint() + self.havemove = 0 + self.pathrect = mupdf.FzRect() + self.pathfactor = 0 + self.linecount = 0 + self.path_type = 0 + + #drop_device = jm_lineart_drop_device + + fill_path = jm_lineart_fill_path + stroke_path = jm_lineart_stroke_path + clip_image_mask = jm_lineart_clip_image_mask + clip_path = jm_lineart_clip_path + clip_stroke_path = jm_lineart_clip_stroke_path + clip_text = jm_lineart_clip_text + clip_stroke_text = jm_lineart_clip_stroke_text + + fill_text = jm_increase_seqno + stroke_text = jm_increase_seqno + ignore_text = jm_increase_seqno + + fill_shade = jm_increase_seqno + fill_image = jm_increase_seqno + fill_image_mask = jm_increase_seqno + + pop_clip = jm_lineart_pop_clip + + begin_group = jm_lineart_begin_group + end_group = jm_lineart_end_group + + begin_layer = jm_lineart_begin_layer + end_layer = jm_lineart_end_layer + + +class JM_new_texttrace_device(mupdf.FzDevice2): + ''' + Trace TEXT device for Python method Page.get_texttrace() + ''' + + def __init__(self, out): + super().__init__() + self.use_virtual_fill_path() + self.use_virtual_stroke_path() + self.use_virtual_fill_text() + self.use_virtual_stroke_text() + self.use_virtual_ignore_text() + self.use_virtual_fill_shade() + self.use_virtual_fill_image() + self.use_virtual_fill_image_mask() + + self.use_virtual_begin_layer() + self.use_virtual_end_layer() + + self.out = out + + self.seqno = 0 + self.depth = 0 + self.clips = 0 + self.method = None + + self.seqno = 0 + + self.pathdict = dict() + self.scissors = list() + self.linewidth = 0 + self.ptm = mupdf.FzMatrix() + self.ctm = mupdf.FzMatrix() + self.rot = mupdf.FzMatrix() + self.lastpoint = mupdf.FzPoint() + self.pathrect = mupdf.FzRect() + self.pathfactor = 0 + self.linecount = 0 + self.path_type = 0 + self.layer_name = "" + + fill_path = jm_increase_seqno + stroke_path = jm_dev_linewidth + fill_text = jm_lineart_fill_text + stroke_text = jm_lineart_stroke_text + ignore_text = jm_lineart_ignore_text + fill_shade = jm_increase_seqno + fill_image = jm_increase_seqno + fill_image_mask = jm_increase_seqno + + begin_layer = jm_lineart_begin_layer + end_layer = jm_lineart_end_layer + + +def ConversionHeader(i: str, filename: OptStr ="unknown"): + t = i.lower() + import textwrap + html = textwrap.dedent(""" + + + + + + + """) + + xml = textwrap.dedent(""" + + + """ + % filename + ) + + xhtml = textwrap.dedent(""" + + + + + + + + """) + + text = "" + json = '{"document": "%s", "pages": [\n' % filename + if t == "html": + r = html + elif t == "json": + r = json + elif t == "xml": + r = 
xml + elif t == "xhtml": + r = xhtml + else: + r = text + + return r + + +def ConversionTrailer(i: str): + t = i.lower() + text = "" + json = "]\n}" + html = "\n\n" + xml = "\n" + xhtml = html + if t == "html": + r = html + elif t == "json": + r = json + elif t == "xml": + r = xml + elif t == "xhtml": + r = xhtml + else: + r = text + + return r + + +def annot_preprocess(page: "Page") -> int: + """Prepare for annotation insertion on the page. + + Returns: + Old page rotation value. Temporarily sets rotation to 0 when required. + """ + CheckParent(page) + if not page.parent.is_pdf: + raise ValueError("is no PDF") + old_rotation = page.rotation + if old_rotation != 0: + page.set_rotation(0) + return old_rotation + + +def annot_postprocess(page: "Page", annot: "Annot") -> None: + """Clean up after annotation insertion. + + Set ownership flag and store annotation in page annotation dictionary. + """ + #annot.parent = weakref.proxy(page) + assert isinstance( page, Page) + assert isinstance( annot, Annot) + annot.parent = page + page._annot_refs[id(annot)] = annot + annot.thisown = True + + +def canon(c): + assert isinstance(c, int) + # TODO: proper unicode case folding + # TODO: character equivalence (a matches ä, etc) + if c == 0xA0 or c == 0x2028 or c == 0x2029: + return ord(' ') + if c == ord('\r') or c == ord('\n') or c == ord('\t'): + return ord(' ') + if c >= ord('A') and c <= ord('Z'): + return c - ord('A') + ord('a') + return c + + +def chartocanon(s): + assert isinstance(s, str) + n, c = mupdf.fz_chartorune(s) + c = canon(c) + return n, c + + +def dest_is_valid(o, page_count, page_object_nums, names_list): + p = mupdf.pdf_dict_get( o, PDF_NAME('A')) + if ( + mupdf.pdf_name_eq( + mupdf.pdf_dict_get( p, PDF_NAME('S')), + PDF_NAME('GoTo') + ) + and not string_in_names_list( + mupdf.pdf_dict_get( p, PDF_NAME('D')), + names_list + ) + ): + return 0 + + p = mupdf.pdf_dict_get( o, PDF_NAME('Dest')) + if not p.m_internal: + pass + elif mupdf.pdf_is_string( p): + return string_in_names_list( p, names_list) + elif not dest_is_valid_page( + mupdf.pdf_array_get( p, 0), + page_object_nums, + page_count, + ): + return 0 + return 1 + + +def dest_is_valid_page(obj, page_object_nums, pagecount): + num = mupdf.pdf_to_num(obj) + + if num == 0: + return 0 + for i in range(pagecount): + if page_object_nums[i] == num: + return 1 + return 0 + + +def find_string(s, needle): + assert isinstance(s, str) + for i in range(len(s)): + end = match_string(s[i:], needle) + if end is not None: + end += i + return i, end + return None, None + + +def get_pdf_now() -> str: + ''' + "Now" timestamp in PDF Format + ''' + import time + tz = "%s'%s'" % ( + str(abs(time.altzone // 3600)).rjust(2, "0"), + str((abs(time.altzone // 60) % 60)).rjust(2, "0"), + ) + tstamp = time.strftime("D:%Y%m%d%H%M%S", time.localtime()) + if time.altzone > 0: + tstamp += "-" + tz + elif time.altzone < 0: + tstamp += "+" + tz + else: + pass + return tstamp + + +class ElementPosition(object): + """Convert a dictionary with element position information to an object.""" + + def __init__(self): + pass + + +def make_story_elpos(): + return ElementPosition() + + +def get_highlight_selection(page, start: point_like =None, stop: point_like =None, clip: rect_like =None) -> list: + """Return rectangles of text lines between two points. + + Notes: + The default of 'start' is top-left of 'clip'. The default of 'stop' + is bottom-reight of 'clip'. 
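+
+        Typical use is turning a user selection into highlight annotations,
+        e.g. (p1 / p2 stand for the selection's start and stop points):
+
+            for r in get_highlight_selection(page, start=p1, stop=p2):
+                page.add_highlight_annot(r)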
+ + Args: + start: start point_like + stop: end point_like, must be 'below' start + clip: consider this rect_like only, default is page rectangle + Returns: + List of line bbox intersections with the area established by the + parameters. + """ + # validate and normalize arguments + if clip is None: + clip = page.rect + clip = Rect(clip) + if start is None: + start = clip.tl + if stop is None: + stop = clip.br + clip.y0 = start.y + clip.y1 = stop.y + if clip.is_empty or clip.is_infinite: + return [] + + # extract text of page, clip only, no images, expand ligatures + blocks = page.get_text( + "dict", flags=0, clip=clip, + )["blocks"] + + lines = [] # will return this list of rectangles + for b in blocks: + bbox = Rect(b["bbox"]) + if bbox.is_infinite or bbox.is_empty: + continue + for line in b["lines"]: + bbox = Rect(line["bbox"]) + if bbox.is_infinite or bbox.is_empty: + continue + lines.append(bbox) + + if lines == []: # did not select anything + return lines + + lines.sort(key=lambda bbox: bbox.y1) # sort by vertical positions + + # cut off prefix from first line if start point is close to its top + bboxf = lines.pop(0) + if bboxf.y0 - start.y <= 0.1 * bboxf.height: # close enough? + r = Rect(start.x, bboxf.y0, bboxf.br) # intersection rectangle + if not (r.is_empty or r.is_infinite): + lines.insert(0, r) # insert again if not empty + else: + lines.insert(0, bboxf) # insert again + + if lines == []: # the list might have been emptied + return lines + + # cut off suffix from last line if stop point is close to its bottom + bboxl = lines.pop() + if stop.y - bboxl.y1 <= 0.1 * bboxl.height: # close enough? + r = Rect(bboxl.tl, stop.x, bboxl.y1) # intersection rectangle + if not (r.is_empty or r.is_infinite): + lines.append(r) # append if not empty + else: + lines.append(bboxl) # append again + + return lines + + +def glyph_name_to_unicode(name: str) -> int: + """Convenience function accessing unicodedata.""" + import unicodedata + try: + unc = ord(unicodedata.lookup(name)) + except Exception: + unc = 65533 + return unc + + +def hdist(dir, a, b): + dx = b.x - a.x + dy = b.y - a.y + return mupdf.fz_abs(dx * dir.x + dy * dir.y) + + +def make_table(rect: rect_like =(0, 0, 1, 1), cols: int =1, rows: int =1) -> list: + """Return a list of (rows x cols) equal sized rectangles. + + Notes: + A utility to fill a given area with table cells of equal size. + Args: + rect: rect_like to use as the table area + rows: number of rows + cols: number of columns + Returns: + A list with items, where each item is a list of + PyMuPDF Rect objects of equal sizes. 
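+    Example:
+        make_table((0, 0, 100, 50), cols=2, rows=2) yields 2 rows of 2 cells,
+        the first cell being Rect(0.0, 0.0, 50.0, 25.0).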
+ """ + rect = Rect(rect) # ensure this is a Rect + if rect.is_empty or rect.is_infinite: + raise ValueError("rect must be finite and not empty") + tl = rect.tl + + height = rect.height / rows # height of one table cell + width = rect.width / cols # width of one table cell + delta_h = (width, 0, width, 0) # diff to next right rect + delta_v = (0, height, 0, height) # diff to next lower rect + + r = Rect(tl, tl.x + width, tl.y + height) # first rectangle + + # make the first row + row = [r] + for i in range(1, cols): + r += delta_h # build next rect to the right + row.append(r) + + # make result, starts with first row + rects = [row] + for i in range(1, rows): + row = rects[i - 1] # take previously appended row + nrow = [] # the new row to append + for r in row: # for each previous cell add its downward copy + nrow.append(r + delta_v) + rects.append(nrow) # append new row to result + + return rects + + +def util_ensure_widget_calc(annot): + ''' + Ensure that widgets with /AA/C JavaScript are in array AcroForm/CO + ''' + annot_obj = mupdf.pdf_annot_obj(annot.this) + pdf = mupdf.pdf_get_bound_document(annot_obj) + PDFNAME_CO = mupdf.pdf_new_name("CO") # = PDF_NAME(CO) + acro = mupdf.pdf_dict_getl( # get AcroForm dict + mupdf.pdf_trailer(pdf), + PDF_NAME('Root'), + PDF_NAME('AcroForm'), + ) + + CO = mupdf.pdf_dict_get(acro, PDFNAME_CO) # = AcroForm/CO + if not mupdf.pdf_is_array(CO): + CO = mupdf.pdf_dict_put_array(acro, PDFNAME_CO, 2) + n = mupdf.pdf_array_len(CO) + found = 0 + xref = mupdf.pdf_to_num(annot_obj) + for i in range(n): + nxref = mupdf.pdf_to_num(mupdf.pdf_array_get(CO, i)) + if xref == nxref: + found = 1 + break + if not found: + mupdf.pdf_array_push(CO, mupdf.pdf_new_indirect(pdf, xref, 0)) + + +def util_make_rect( *args, p0=None, p1=None, x0=None, y0=None, x1=None, y1=None): + ''' + Helper for initialising rectangle classes. + + 2022-09-02: This is quite different from PyMuPDF's util_make_rect(), which + uses `goto` in ways that don't easily translate to Python. + + Returns (x0, y0, x1, y1) derived from , then override with p0, p1, + x0, y0, x1, y1 if they are not None. + + Accepts following forms for : + () returns all zeros. + (top-left, bottom-right) + (top-left, x1, y1) + (x0, y0, bottom-right) + (x0, y0, x1, y1) + (rect) + + Where top-left and bottom-right are (x, y) or something with .x, .y + members; rect is something with .x0, .y0, .x1, and .y1 members. + + 2023-11-18: we now override with p0, p1, x0, y0, x1, y1 if not None. 
+ ''' + def get_xy( arg): + if isinstance( arg, (list, tuple)) and len( arg) == 2: + return arg[0], arg[1] + if isinstance( arg, (Point, mupdf.FzPoint, mupdf.fz_point)): + return arg.x, arg.y + return None, None + def make_tuple( a): + if isinstance( a, tuple): + return a + if isinstance( a, Point): + return a.x, a.y + elif isinstance( a, (Rect, IRect, mupdf.FzRect, mupdf.fz_rect)): + return a.x0, a.y0, a.x1, a.y1 + if not isinstance( a, (list, tuple)): + a = a, + return a + def handle_args(): + if len(args) == 0: + return 0, 0, 0, 0 + elif len(args) == 1: + arg = args[0] + if isinstance( arg, (list, tuple)) and len( arg) == 2: + p1, p2 = arg + ret = *p1, *p2 + assert len(ret) == 4 + return ret + if isinstance( arg, (list, tuple)) and len( arg) == 3: + a, b, c = arg + a = make_tuple(a) + b = make_tuple(b) + c = make_tuple(c) + ret = *a, *b, *c + assert len(ret) == 4 + return ret + ret = make_tuple( arg) + assert len(ret) == 4, f'{arg=} {ret=}' + return ret + elif len(args) == 2: + ret = get_xy( args[0]) + get_xy( args[1]) + assert len(ret) == 4 + return ret + elif len(args) == 3: + x0, y0 = get_xy( args[0]) + if (x0, y0) != (None, None): + return x0, y0, args[1], args[2] + x1, y1 = get_xy( args[2]) + if (x1, y1) != (None, None): + return args[0], args[1], x1, y1 + elif len(args) == 4: + return args[0], args[1], args[2], args[3] + raise Exception( f'Unrecognised args: {args}') + ret_x0, ret_y0, ret_x1, ret_y1 = handle_args() + if p0 is not None: ret_x0, ret_y0 = get_xy(p0) + if p1 is not None: ret_x1, ret_y1 = get_xy(p1) + if x0 is not None: ret_x0 = x0 + if y0 is not None: ret_y0 = y0 + if x1 is not None: ret_x1 = x1 + if y1 is not None: ret_y1 = y1 + return ret_x0, ret_y0, ret_x1, ret_y1 + + +def util_make_irect( *args, p0=None, p1=None, x0=None, y0=None, x1=None, y1=None): + a, b, c, d = util_make_rect( *args, p0=p0, p1=p1, x0=x0, y0=y0, x1=x1, y1=y1) + def convert(x, ceil): + if ceil: + return int(math.ceil(x)) + else: + return int(math.floor(x)) + a = convert(a, False) + b = convert(b, False) + c = convert(c, True) + d = convert(d, True) + return a, b, c, d + + +def util_round_rect( rect): + return JM_py_from_irect(mupdf.fz_round_rect(JM_rect_from_py(rect))) + + +def util_transform_rect( rect, matrix): + if g_use_extra: + return extra.util_transform_rect( rect, matrix) + return JM_py_from_rect(mupdf.fz_transform_rect(JM_rect_from_py(rect), JM_matrix_from_py(matrix))) + + +def util_intersect_rect( r1, r2): + return JM_py_from_rect( + mupdf.fz_intersect_rect( + JM_rect_from_py(r1), + JM_rect_from_py(r2), + ) + ) + + +def util_is_point_in_rect( p, r): + return mupdf.fz_is_point_inside_rect( + JM_point_from_py(p), + JM_rect_from_py(r), + ) + +def util_include_point_in_rect( r, p): + return JM_py_from_rect( + mupdf.fz_include_point_in_rect( + JM_rect_from_py(r), + JM_point_from_py(p), + ) + ) + + +def util_point_in_quad( P, Q): + p = JM_point_from_py(P) + q = JM_quad_from_py(Q) + return mupdf.fz_is_point_inside_quad(p, q) + + +def util_transform_point( point, matrix): + return JM_py_from_point( + mupdf.fz_transform_point( + JM_point_from_py(point), + JM_matrix_from_py(matrix), + ) + ) + + +def util_union_rect( r1, r2): + return JM_py_from_rect( + mupdf.fz_union_rect( + JM_rect_from_py(r1), + JM_rect_from_py(r2), + ) + ) + + +def util_concat_matrix( m1, m2): + return JM_py_from_matrix( + mupdf.fz_concat( + JM_matrix_from_py(m1), + JM_matrix_from_py(m2), + ) + ) + + +def util_invert_matrix(matrix): + if 0: + # Use MuPDF's fz_invert_matrix(). 
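+        # (This branch is disabled; the Python fallback further below inverts
+        # the affine matrix directly: for m = (a, b, c, d, e, f) the inverse
+        # is (d, -b, -c, a, c*f - d*e, b*e - a*f), each divided by the
+        # determinant a*d - b*c, which must be non-zero.)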
+ if isinstance( matrix, (tuple, list)): + matrix = mupdf.FzMatrix( *matrix) + elif isinstance( matrix, mupdf.fz_matrix): + matrix = mupdf.FzMatrix( matrix) + elif isinstance( matrix, Matrix): + matrix = mupdf.FzMatrix( matrix.a, matrix.b, matrix.c, matrix.d, matrix.e, matrix.f) + assert isinstance( matrix, mupdf.FzMatrix), f'{type(matrix)=}: {matrix}' + ret = mupdf.fz_invert_matrix( matrix) + if ret == matrix and (0 + or abs( matrix.a - 1) >= sys.float_info.epsilon + or abs( matrix.b - 0) >= sys.float_info.epsilon + or abs( matrix.c - 0) >= sys.float_info.epsilon + or abs( matrix.d - 1) >= sys.float_info.epsilon + ): + # Inversion not possible. + return 1, () + return 0, (ret.a, ret.b, ret.c, ret.d, ret.e, ret.f) + # Do inversion in python. + src = JM_matrix_from_py(matrix) + a = src.a + det = a * src.d - src.b * src.c + if det < -sys.float_info.epsilon or det > sys.float_info.epsilon: + dst = mupdf.FzMatrix() + rdet = 1 / det + dst.a = src.d * rdet + dst.b = -src.b * rdet + dst.c = -src.c * rdet + dst.d = a * rdet + a = -src.e * dst.a - src.f * dst.c + dst.f = -src.e * dst.b - src.f * dst.d + dst.e = a + return 0, (dst.a, dst.b, dst.c, dst.d, dst.e, dst.f) + + return 1, () + + +def util_measure_string( text, fontname, fontsize, encoding): + font = mupdf.fz_new_base14_font(fontname) + w = 0 + pos = 0 + while pos < len(text): + t, c = mupdf.fz_chartorune(text[pos:]) + pos += t + if encoding == mupdf.PDF_SIMPLE_ENCODING_GREEK: + c = mupdf.fz_iso8859_7_from_unicode(c) + elif encoding == mupdf.PDF_SIMPLE_ENCODING_CYRILLIC: + c = mupdf.fz_windows_1251_from_unicode(c) + else: + c = mupdf.fz_windows_1252_from_unicode(c) + if c < 0: + c = 0xB7 + g = mupdf.fz_encode_character(font, c) + dw = mupdf.fz_advance_glyph(font, g, 0) + w += dw + ret = w * fontsize + return ret + + +def util_sine_between(C, P, Q): + # for points C, P, Q compute the sine between lines CP and QP + c = JM_point_from_py(C) + p = JM_point_from_py(P) + q = JM_point_from_py(Q) + s = mupdf.fz_normalize_vector(mupdf.fz_make_point(q.x - p.x, q.y - p.y)) + m1 = mupdf.fz_make_matrix(1, 0, 0, 1, -p.x, -p.y) + m2 = mupdf.fz_make_matrix(s.x, -s.y, s.y, s.x, 0, 0) + m1 = mupdf.fz_concat(m1, m2) + c = mupdf.fz_transform_point(c, m1) + c = mupdf.fz_normalize_vector(c) + return c.y + + +def util_hor_matrix(C, P): + ''' + Return the matrix that maps two points C, P to the x-axis such that + C -> (0,0) and the image of P have the same distance. 
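+
+    Example: for C = (0, 0) and P = (1, 1) the result is a rotation by -45
+    degrees, mapping C to (0, 0) and P to (sqrt(2), 0).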
+ ''' + c = JM_point_from_py(C) + p = JM_point_from_py(P) + + # compute (cosine, sine) of vector P-C with double precision: + s = mupdf.fz_normalize_vector(mupdf.fz_make_point(p.x - c.x, p.y - c.y)) + + m1 = mupdf.fz_make_matrix(1, 0, 0, 1, -c.x, -c.y) + m2 = mupdf.fz_make_matrix(s.x, -s.y, s.y, s.x, 0, 0) + return JM_py_from_matrix(mupdf.fz_concat(m1, m2)) + + +def match_string(h0, n0): + h = 0 + n = 0 + e = h + delta_h, hc = chartocanon(h0[h:]) + h += delta_h + delta_n, nc = chartocanon(n0[n:]) + n += delta_n + while hc == nc: + e = h + if hc == ord(' '): + while 1: + delta_h, hc = chartocanon(h0[h:]) + h += delta_h + if hc != ord(' '): + break + else: + delta_h, hc = chartocanon(h0[h:]) + h += delta_h + if nc == ord(' '): + while 1: + delta_n, nc = chartocanon(n0[n:]) + n += delta_n + if nc != ord(' '): + break + else: + delta_n, nc = chartocanon(n0[n:]) + n += delta_n + return None if nc != 0 else e + + +def on_highlight_char(hits, line, ch): + assert hits + assert isinstance(line, mupdf.FzStextLine) + assert isinstance(ch, mupdf.FzStextChar) + vfuzz = ch.m_internal.size * hits.vfuzz + hfuzz = ch.m_internal.size * hits.hfuzz + ch_quad = JM_char_quad(line, ch) + if hits.len > 0: + # fixme: end = hits.quads[-1] + quad = hits.quads[hits.len - 1] + end = JM_quad_from_py(quad) + if ( 1 + and hdist(line.m_internal.dir, end.lr, ch_quad.ll) < hfuzz + and vdist(line.m_internal.dir, end.lr, ch_quad.ll) < vfuzz + and hdist(line.m_internal.dir, end.ur, ch_quad.ul) < hfuzz + and vdist(line.m_internal.dir, end.ur, ch_quad.ul) < vfuzz + ): + end.ur = ch_quad.ur + end.lr = ch_quad.lr + assert hits.quads[-1] == end + return + hits.quads.append(ch_quad) + hits.len += 1 + + +def page_merge(doc_des, doc_src, page_from, page_to, rotate, links, copy_annots, graft_map): + ''' + Deep-copies a source page to the target. + Modified version of function of pdfmerge.c: we also copy annotations, but + we skip some subtypes. In addition we rotate output. 
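+    Annotations of subtype Link, Popup or Widget, and annotations carrying an
+    IRT key, are not copied; /P and /Popup entries are removed from the copies.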
+ ''' + if g_use_extra: + #log( 'Calling C++ extra.page_merge()') + return extra.page_merge( doc_des, doc_src, page_from, page_to, rotate, links, copy_annots, graft_map) + + # list of object types (per page) we want to copy + known_page_objs = [ + PDF_NAME('Contents'), + PDF_NAME('Resources'), + PDF_NAME('MediaBox'), + PDF_NAME('CropBox'), + PDF_NAME('BleedBox'), + PDF_NAME('TrimBox'), + PDF_NAME('ArtBox'), + PDF_NAME('Rotate'), + PDF_NAME('UserUnit'), + ] + page_ref = mupdf.pdf_lookup_page_obj(doc_src, page_from) + + # make new page dict in dest doc + page_dict = mupdf.pdf_new_dict(doc_des, 4) + mupdf.pdf_dict_put(page_dict, PDF_NAME('Type'), PDF_NAME('Page')) + + # copy objects of source page into it + for i in range( len(known_page_objs)): + obj = mupdf.pdf_dict_get_inheritable( page_ref, known_page_objs[i]) + if obj.m_internal: + #log( '{=type(graft_map) type(graft_map.this)}') + mupdf.pdf_dict_put( page_dict, known_page_objs[i], mupdf.pdf_graft_mapped_object(graft_map.this, obj)) + + # Copy annotations, but skip Link, Popup, IRT, Widget types + # If selected, remove dict keys P (parent) and Popup + if copy_annots: + old_annots = mupdf.pdf_dict_get( page_ref, PDF_NAME('Annots')) + n = mupdf.pdf_array_len( old_annots) + if n > 0: + new_annots = mupdf.pdf_dict_put_array( page_dict, PDF_NAME('Annots'), n) + for i in range(n): + o = mupdf.pdf_array_get( old_annots, i) + if not o.m_internal or not mupdf.pdf_is_dict(o): + continue # skip non-dict items + if mupdf.pdf_dict_gets( o, "IRT").m_internal: + continue + subtype = mupdf.pdf_dict_get( o, PDF_NAME('Subtype')) + if mupdf.pdf_name_eq( subtype, PDF_NAME('Link')): + continue + if mupdf.pdf_name_eq( subtype, PDF_NAME('Popup')): + continue + if mupdf.pdf_name_eq(subtype, PDF_NAME('Widget')): + continue + mupdf.pdf_dict_del( o, PDF_NAME('Popup')) + mupdf.pdf_dict_del( o, PDF_NAME('P')) + copy_o = mupdf.pdf_graft_mapped_object( graft_map.this, o) + annot = mupdf.pdf_new_indirect( doc_des, mupdf.pdf_to_num( copy_o), 0) + mupdf.pdf_array_push( new_annots, annot) + + # rotate the page + if rotate != -1: + mupdf.pdf_dict_put_int( page_dict, PDF_NAME('Rotate'), rotate) + # Now add the page dictionary to dest PDF + ref = mupdf.pdf_add_object( doc_des, page_dict) + + # Insert new page at specified location + mupdf.pdf_insert_page( doc_des, page_to, ref) + + +def paper_rect(s: str) -> Rect: + """Return a Rect for the paper size indicated in string 's'. Must conform to the argument of method 'PaperSize', which will be invoked. + """ + width, height = paper_size(s) + return Rect(0.0, 0.0, width, height) + + +def paper_size(s: str) -> tuple: + """Return a tuple (width, height) for a given paper format string. + + Notes: + 'A4-L' will return (842, 595), the values for A4 landscape. + Suffix '-P' and no suffix return the portrait tuple. + """ + size = s.lower() + f = "p" + if size.endswith("-l"): + f = "l" + size = size[:-2] + if size.endswith("-p"): + size = size[:-2] + rc = paper_sizes().get(size, (-1, -1)) + if f == "p": + return rc + return (rc[1], rc[0]) + + +def paper_sizes(): + """Known paper formats @ 72 dpi as a dictionary. Key is the format string + like "a4" for ISO-A4. Value is the tuple (width, height). 
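+    For example paper_sizes()["a4"] == (595, 842); paper_size("a4-l") swaps
+    this to the landscape variant (842, 595).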
+ + Information taken from the following web sites: + www.din-formate.de + www.din-formate.info/amerikanische-formate.html + www.directtools.de/wissen/normen/iso.htm + """ + return { + "a0": (2384, 3370), + "a1": (1684, 2384), + "a10": (74, 105), + "a2": (1191, 1684), + "a3": (842, 1191), + "a4": (595, 842), + "a5": (420, 595), + "a6": (298, 420), + "a7": (210, 298), + "a8": (147, 210), + "a9": (105, 147), + "b0": (2835, 4008), + "b1": (2004, 2835), + "b10": (88, 125), + "b2": (1417, 2004), + "b3": (1001, 1417), + "b4": (709, 1001), + "b5": (499, 709), + "b6": (354, 499), + "b7": (249, 354), + "b8": (176, 249), + "b9": (125, 176), + "c0": (2599, 3677), + "c1": (1837, 2599), + "c10": (79, 113), + "c2": (1298, 1837), + "c3": (918, 1298), + "c4": (649, 918), + "c5": (459, 649), + "c6": (323, 459), + "c7": (230, 323), + "c8": (162, 230), + "c9": (113, 162), + "card-4x6": (288, 432), + "card-5x7": (360, 504), + "commercial": (297, 684), + "executive": (522, 756), + "invoice": (396, 612), + "ledger": (792, 1224), + "legal": (612, 1008), + "legal-13": (612, 936), + "letter": (612, 792), + "monarch": (279, 540), + "tabloid-extra": (864, 1296), + } + +def pdf_lookup_page_loc(doc, needle): + return mupdf.pdf_lookup_page_loc(doc, needle) + + +def pdfobj_string(o, prefix=''): + ''' + Returns description of mupdf.PdfObj (wrapper for pdf_obj) . + ''' + assert 0, 'use mupdf.pdf_debug_obj() ?' + ret = '' + if mupdf.pdf_is_array(o): + l = mupdf.pdf_array_len(o) + ret += f'array {l}\n' + for i in range(l): + oo = mupdf.pdf_array_get(o, i) + ret += pdfobj_string(oo, prefix + ' ') + ret += '\n' + elif mupdf.pdf_is_bool(o): + ret += f'bool: {o.array_get_bool()}\n' + elif mupdf.pdf_is_dict(o): + l = mupdf.pdf_dict_len(o) + ret += f'dict {l}\n' + for i in range(l): + key = mupdf.pdf_dict_get_key(o, i) + value = mupdf.pdf_dict_get( o, key) + ret += f'{prefix} {key}: ' + ret += pdfobj_string( value, prefix + ' ') + ret += '\n' + elif mupdf.pdf_is_embedded_file(o): + ret += f'embedded_file: {o.embedded_file_name()}\n' + elif mupdf.pdf_is_indirect(o): + ret += f'indirect: ...\n' + elif mupdf.pdf_is_int(o): + ret += f'int: {mupdf.pdf_to_int(o)}\n' + elif mupdf.pdf_is_jpx_image(o): + ret += f'jpx_image:\n' + elif mupdf.pdf_is_name(o): + ret += f'name: {mupdf.pdf_to_name(o)}\n' + elif o.pdf_is_null: + ret += f'null\n' + #elif o.pdf_is_number: + # ret += f'number\n' + elif o.pdf_is_real: + ret += f'real: {o.pdf_to_real()}\n' + elif mupdf.pdf_is_stream(o): + ret += f'stream\n' + elif mupdf.pdf_is_string(o): + ret += f'string: {mupdf.pdf_to_string(o)}\n' + else: + ret += '<>\n' + + return ret + + +def repair_mono_font(page: "Page", font: "Font") -> None: + """Repair character spacing for mono fonts. + + Notes: + Some mono-spaced fonts are displayed with a too large character + distance, e.g. "a b c" instead of "abc". This utility adds an entry + "/W[0 65535 w]" to the descendent font(s) of font. The float w is + taken to be the width of 0x20 (space). + This should enforce viewers to use 'w' as the character width. + + Args: + page: pymupdf.Page object. + font: pymupdf.Font object. 
+ """ + if not font.flags["mono"]: # font not flagged as monospaced + return None + doc = page.parent # the document + fontlist = page.get_fonts() # list of fonts on page + xrefs = [ # list of objects referring to font + f[0] + for f in fontlist + if (f[3] == font.name and f[4].startswith("F") and f[5].startswith("Identity")) + ] + if xrefs == []: # our font does not occur + return + xrefs = set(xrefs) # drop any double counts + width = int(round((font.glyph_advance(32) * 1000))) + for xref in xrefs: + if not TOOLS.set_font_width(doc, xref, width): + log("Cannot set width for '%s' in xref %i" % (font.name, xref)) + + +def sRGB_to_pdf(srgb: int) -> tuple: + """Convert sRGB color code to a PDF color triple. + + There is **no error checking** for performance reasons! + + Args: + srgb: (int) RRGGBB (red, green, blue), each color in range(255). + Returns: + Tuple (red, green, blue) each item in interval 0 <= item <= 1. + """ + t = sRGB_to_rgb(srgb) + return t[0] / 255.0, t[1] / 255.0, t[2] / 255.0 + + +def sRGB_to_rgb(srgb: int) -> tuple: + """Convert sRGB color code to an RGB color triple. + + There is **no error checking** for performance reasons! + + Args: + srgb: (int) SSRRGGBB (red, green, blue), each color in range(255). + With MuPDF < 1.26, `s` is always 0. + Returns: + Tuple (red, green, blue) each item in interval 0 <= item <= 255. + """ + srgb &= 0xffffff + r = srgb >> 16 + g = (srgb - (r << 16)) >> 8 + b = srgb - (r << 16) - (g << 8) + return (r, g, b) + + +def string_in_names_list(p, names_list): + n = mupdf.pdf_array_len( names_list) if names_list else 0 + str_ = mupdf.pdf_to_text_string( p) + for i in range(0, n, 2): + if mupdf.pdf_to_text_string( mupdf.pdf_array_get( names_list, i)) == str_: + return 1 + return 0 + + +def strip_outline(doc, outlines, page_count, page_object_nums, names_list): + ''' + Returns (count, first, prev). + ''' + first = None + count = 0 + current = outlines + prev = None + while current.m_internal: + # Strip any children to start with. This takes care of + # First / Last / Count for us. + nc = strip_outlines(doc, current, page_count, page_object_nums, names_list) + + if not dest_is_valid(current, page_count, page_object_nums, names_list): + if nc == 0: + # Outline with invalid dest and no children. Drop it by + # pulling the next one in here. + next = mupdf.pdf_dict_get(current, PDF_NAME('Next')) + if not next.m_internal: + # There is no next one to pull in + if prev.m_internal: + mupdf.pdf_dict_del(prev, PDF_NAME('Next')) + elif prev.m_internal: + mupdf.pdf_dict_put(prev, PDF_NAME('Next'), next) + mupdf.pdf_dict_put(next, PDF_NAME('Prev'), prev) + else: + mupdf.pdf_dict_del(next, PDF_NAME('Prev')) + current = next + else: + # Outline with invalid dest, but children. Just drop the dest. 
+ mupdf.pdf_dict_del(current, PDF_NAME('Dest')) + mupdf.pdf_dict_del(current, PDF_NAME('A')) + current = mupdf.pdf_dict_get(current, PDF_NAME('Next')) + else: + # Keep this one + if not first or not first.m_internal: + first = current + prev = current + current = mupdf.pdf_dict_get(current, PDF_NAME('Next')) + count += 1 + + return count, first, prev + + +def strip_outlines(doc, outlines, page_count, page_object_nums, names_list): + if not outlines.m_internal: + return 0 + + first = mupdf.pdf_dict_get(outlines, PDF_NAME('First')) + if not first.m_internal: + nc = 0 + else: + nc, first, last = strip_outline(doc, first, page_count, page_object_nums, names_list) + + if nc == 0: + mupdf.pdf_dict_del(outlines, PDF_NAME('First')) + mupdf.pdf_dict_del(outlines, PDF_NAME('Last')) + mupdf.pdf_dict_del(outlines, PDF_NAME('Count')) + else: + old_count = mupdf.pdf_to_int(mupdf.pdf_dict_get(outlines, PDF_NAME('Count'))) + mupdf.pdf_dict_put(outlines, PDF_NAME('First'), first) + mupdf.pdf_dict_put(outlines, PDF_NAME('Last'), last) + mupdf.pdf_dict_put(outlines, PDF_NAME('Count'), mupdf.pdf_new_int(nc if old_count > 0 else -nc)) + return nc + + +trace_device_FILL_PATH = 1 +trace_device_STROKE_PATH = 2 +trace_device_CLIP_PATH = 3 +trace_device_CLIP_STROKE_PATH = 4 + + +def unicode_to_glyph_name(ch: int) -> str: + """ + Convenience function accessing unicodedata. + """ + import unicodedata + try: + name = unicodedata.name(chr(ch)) + except ValueError: + name = ".notdef" + return name + + +def vdist(dir, a, b): + dx = b.x - a.x + dy = b.y - a.y + return mupdf.fz_abs(dx * dir.y + dy * dir.x) + + +def apply_pages( + path, + pagefn, + *, + pagefn_args=(), + pagefn_kwargs=dict(), + initfn=None, + initfn_args=(), + initfn_kwargs=dict(), + pages=None, + method='single', + concurrency=None, + _stats=False, + ): + ''' + Returns list of results from `pagefn()`, optionally using concurrency for + speed. + + Args: + path: + Path of document. + pagefn: + Function to call for each page; is passed (page, *pagefn_args, + **pagefn_kwargs). Return value is added to list that we return. If + `method` is not 'single', must be a top-level function - nested + functions don't work with concurrency. + pagefn_args + pagefn_kwargs: + Additional args to pass to `pagefn`. Must be picklable. + initfn: + If true, called once in each worker process; is passed + (*initfn_args, **initfn_kwargs). + initfn_args + initfn_kwargs: + Args to pass to initfn. Must be picklable. + pages: + List of page numbers to process, or None to include all pages. + method: + 'single' + Do not use concurrency. + 'mp' + Operate concurrently using Python's `multiprocessing` module. + 'fork' + Operate concurrently using custom implementation with + `os.fork()`. Does not work on Windows. + concurrency: + Number of worker processes to use when operating concurrently. If + None, we use the number of available CPUs. + _stats: + Internal, may change or be removed. If true, we output simple + timing diagnostics. + + Note: We require a file path rather than a Document, because Document + instances do not work properly after a fork - internal file descriptor + offsets are shared between the parent and child processes. + ''' + if _stats: + t0 = time.time() + + if method == 'single': + if initfn: + initfn(*initfn_args, **initfn_kwargs) + ret = list() + document = Document(path) + if pages is None: + pages = range(len(document)) + for pno in pages: + page = document[pno] + r = pagefn(page, *pagefn_args, **initfn_kwargs) + ret.append(r) + + else: + # Use concurrency. 
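+        # Both backends split the `pages` list across up to `concurrency`
+        # worker processes; each worker opens its own Document from `path`
+        # (see the note in the docstring about why a file path is required).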
+ # + from . import _apply_pages + + if pages is None: + if _stats: + t = time.time() + with Document(path) as document: + num_pages = len(document) + pages = list(range(num_pages)) + if _stats: + t = time.time() - t + log(f'{t:.2f}s: count pages.') + + if _stats: + t = time.time() + + if method == 'mp': + ret = _apply_pages._multiprocessing( + path, + pages, + pagefn, + pagefn_args, + pagefn_kwargs, + initfn, + initfn_args, + initfn_kwargs, + concurrency, + _stats, + ) + + elif method == 'fork': + ret = _apply_pages._fork( + path, + pages, + pagefn, + pagefn_args, + pagefn_kwargs, + initfn, + initfn_args, + initfn_kwargs, + concurrency, + _stats, + ) + + else: + assert 0, f'Unrecognised {method=}.' + + if _stats: + t = time.time() - t + log(f'{t:.2f}s: work.') + + if _stats: + t = time.time() - t0 + log(f'{t:.2f}s: total.') + return ret + + +def get_text( + path, + *, + pages=None, + method='single', + concurrency=None, + + option='text', + clip=None, + flags=None, + textpage=None, + sort=False, + delimiters=None, + + _stats=False, + ): + ''' + Returns list of results from `Page.get_text()`, optionally using + concurrency for speed. + + Args: + path: + Path of document. + pages: + List of page numbers to process, or None to include all pages. + method: + 'single' + Do not use concurrency. + 'mp' + Operate concurrently using Python's `multiprocessing` module. + 'fork' + Operate concurrently using custom implementation with + `os.fork`. Does not work on Windows. + concurrency: + Number of worker processes to use when operating concurrently. If + None, we use the number of available CPUs. + option + clip + flags + textpage + sort + delimiters: + Passed to internal calls to `Page.get_text()`. + ''' + args_dict = dict( + option=option, + clip=clip, + flags=flags, + textpage=textpage, + sort=sort, + delimiters=delimiters, + ) + + return apply_pages( + path, + Page.get_text, + pagefn_kwargs=args_dict, + pages=pages, + method=method, + concurrency=concurrency, + _stats=_stats, + ) + + +class TOOLS: + ''' + We use @staticmethod to avoid the need to create an instance of this class. 
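+    For example TOOLS.glyph_cache_empty() or TOOLS.gen_id() can be called
+    directly on the class.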
+ ''' + + def _derotate_matrix(page): + if isinstance(page, mupdf.PdfPage): + return JM_py_from_matrix(JM_derotate_page_matrix(page)) + else: + return JM_py_from_matrix(mupdf.FzMatrix()) + + @staticmethod + def _fill_widget(annot, widget): + val = JM_get_widget_properties(annot, widget) + + widget.rect = Rect(annot.rect) + widget.xref = annot.xref + widget.parent = annot.parent + widget._annot = annot # backpointer to annot object + if not widget.script: + widget.script = None + if not widget.script_stroke: + widget.script_stroke = None + if not widget.script_format: + widget.script_format = None + if not widget.script_change: + widget.script_change = None + if not widget.script_calc: + widget.script_calc = None + if not widget.script_blur: + widget.script_blur = None + if not widget.script_focus: + widget.script_focus = None + return val + + @staticmethod + def _get_all_contents(page): + page = _as_pdf_page(page.this) + res = JM_read_contents(page.obj()) + result = JM_BinFromBuffer( res) + return result + + @staticmethod + def _insert_contents(page, newcont, overlay=1): + """Add bytes as a new /Contents object for a page, and return its xref.""" + pdfpage = _as_pdf_page(page, required=1) + contbuf = JM_BufferFromBytes(newcont) + xref = JM_insert_contents(pdfpage.doc(), pdfpage.obj(), contbuf, overlay) + #fixme: pdfpage->doc->dirty = 1; + return xref + + @staticmethod + def _le_annot_parms(annot, p1, p2, fill_color): + """Get common parameters for making annot line end symbols. + + Returns: + m: matrix that maps p1, p2 to points L, P on the x-axis + im: its inverse + L, P: transformed p1, p2 + w: line width + scol: stroke color string + fcol: fill color store_shrink + opacity: opacity string (gs command) + """ + w = annot.border["width"] # line width + sc = annot.colors["stroke"] # stroke color + if not sc: # black if missing + sc = (0,0,0) + scol = " ".join(map(str, sc)) + " RG\n" + if fill_color: + fc = fill_color + else: + fc = annot.colors["fill"] # fill color + if not fc: + fc = (1,1,1) # white if missing + fcol = " ".join(map(str, fc)) + " rg\n" + # nr = annot.rect + np1 = p1 # point coord relative to annot rect + np2 = p2 # point coord relative to annot rect + m = Matrix(util_hor_matrix(np1, np2)) # matrix makes the line horizontal + im = ~m # inverted matrix + L = np1 * m # converted start (left) point + R = np2 * m # converted end (right) point + if 0 <= annot.opacity < 1: + opacity = "/H gs\n" + else: + opacity = "" + return m, im, L, R, w, scol, fcol, opacity + + @staticmethod + def _le_butt(annot, p1, p2, lr, fill_color): + """Make stream commands for butt line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = TOOLS._le_annot_parms(annot, p1, p2, fill_color) + shift = 3 + d = shift * max(1, w) + M = R if lr else L + top = (M + (0, -d/2.)) * im + bot = (M + (0, d/2.)) * im + ap = "\nq\n%s%f %f m\n" % (opacity, top.x, top.y) + ap += "%f %f l\n" % (bot.x, bot.y) + ap += _format_g(w) + " w\n" + ap += scol + "s\nQ\n" + return ap + + @staticmethod + def _le_circle(annot, p1, p2, lr, fill_color): + """Make stream commands for circle line end symbol. "lr" denotes left (False) or right point. 
+ """ + m, im, L, R, w, scol, fcol, opacity = TOOLS._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 # 2*shift*width = length of square edge + d = shift * max(1, w) + M = R - (d/2., 0) if lr else L + (d/2., 0) + r = Rect(M, M) + (-d, -d, d, d) # the square + ap = "q\n" + opacity + TOOLS._oval_string(r.tl * im, r.tr * im, r.br * im, r.bl * im) + ap += _format_g(w) + " w\n" + ap += scol + fcol + "b\nQ\n" + return ap + + @staticmethod + def _le_closedarrow(annot, p1, p2, lr, fill_color): + """Make stream commands for closed arrow line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = TOOLS._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 + d = shift * max(1, w) + p2 = R + (d/2., 0) if lr else L - (d/2., 0) + p1 = p2 + (-2*d, -d) if lr else p2 + (2*d, -d) + p3 = p2 + (-2*d, d) if lr else p2 + (2*d, d) + p1 *= im + p2 *= im + p3 *= im + ap = "\nq\n%s%f %f m\n" % (opacity, p1.x, p1.y) + ap += "%f %f l\n" % (p2.x, p2.y) + ap += "%f %f l\n" % (p3.x, p3.y) + ap += _format_g(w) + " w\n" + ap += scol + fcol + "b\nQ\n" + return ap + + @staticmethod + def _le_diamond(annot, p1, p2, lr, fill_color): + """Make stream commands for diamond line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = TOOLS._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 # 2*shift*width = length of square edge + d = shift * max(1, w) + M = R - (d/2., 0) if lr else L + (d/2., 0) + r = Rect(M, M) + (-d, -d, d, d) # the square + # the square makes line longer by (2*shift - 1)*width + p = (r.tl + (r.bl - r.tl) * 0.5) * im + ap = "q\n%s%f %f m\n" % (opacity, p.x, p.y) + p = (r.tl + (r.tr - r.tl) * 0.5) * im + ap += "%f %f l\n" % (p.x, p.y) + p = (r.tr + (r.br - r.tr) * 0.5) * im + ap += "%f %f l\n" % (p.x, p.y) + p = (r.br + (r.bl - r.br) * 0.5) * im + ap += "%f %f l\n" % (p.x, p.y) + ap += _format_g(w) + " w\n" + ap += scol + fcol + "b\nQ\n" + return ap + + @staticmethod + def _le_openarrow(annot, p1, p2, lr, fill_color): + """Make stream commands for open arrow line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = TOOLS._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 + d = shift * max(1, w) + p2 = R + (d/2., 0) if lr else L - (d/2., 0) + p1 = p2 + (-2*d, -d) if lr else p2 + (2*d, -d) + p3 = p2 + (-2*d, d) if lr else p2 + (2*d, d) + p1 *= im + p2 *= im + p3 *= im + ap = "\nq\n%s%f %f m\n" % (opacity, p1.x, p1.y) + ap += "%f %f l\n" % (p2.x, p2.y) + ap += "%f %f l\n" % (p3.x, p3.y) + ap += _format_g(w) + " w\n" + ap += scol + "S\nQ\n" + return ap + + @staticmethod + def _le_rclosedarrow(annot, p1, p2, lr, fill_color): + """Make stream commands for right closed arrow line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = TOOLS._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 + d = shift * max(1, w) + p2 = R - (2*d, 0) if lr else L + (2*d, 0) + p1 = p2 + (2*d, -d) if lr else p2 + (-2*d, -d) + p3 = p2 + (2*d, d) if lr else p2 + (-2*d, d) + p1 *= im + p2 *= im + p3 *= im + ap = "\nq\n%s%f %f m\n" % (opacity, p1.x, p1.y) + ap += "%f %f l\n" % (p2.x, p2.y) + ap += "%f %f l\n" % (p3.x, p3.y) + ap += _format_g(w) + " w\n" + ap += scol + fcol + "b\nQ\n" + return ap + + @staticmethod + def _le_ropenarrow(annot, p1, p2, lr, fill_color): + """Make stream commands for right open arrow line end symbol. "lr" denotes left (False) or right point. 
+ """ + m, im, L, R, w, scol, fcol, opacity = TOOLS._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 + d = shift * max(1, w) + p2 = R - (d/3., 0) if lr else L + (d/3., 0) + p1 = p2 + (2*d, -d) if lr else p2 + (-2*d, -d) + p3 = p2 + (2*d, d) if lr else p2 + (-2*d, d) + p1 *= im + p2 *= im + p3 *= im + ap = "\nq\n%s%f %f m\n" % (opacity, p1.x, p1.y) + ap += "%f %f l\n" % (p2.x, p2.y) + ap += "%f %f l\n" % (p3.x, p3.y) + ap += _format_g(w) + " w\n" + ap += scol + fcol + "S\nQ\n" + return ap + + @staticmethod + def _le_slash(annot, p1, p2, lr, fill_color): + """Make stream commands for slash line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = TOOLS._le_annot_parms(annot, p1, p2, fill_color) + rw = 1.1547 * max(1, w) * 1.0 # makes rect diagonal a 30 deg inclination + M = R if lr else L + r = Rect(M.x - rw, M.y - 2 * w, M.x + rw, M.y + 2 * w) + top = r.tl * im + bot = r.br * im + ap = "\nq\n%s%f %f m\n" % (opacity, top.x, top.y) + ap += "%f %f l\n" % (bot.x, bot.y) + ap += _format_g(w) + " w\n" + ap += scol + "s\nQ\n" + return ap + + @staticmethod + def _le_square(annot, p1, p2, lr, fill_color): + """Make stream commands for square line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = TOOLS._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 # 2*shift*width = length of square edge + d = shift * max(1, w) + M = R - (d/2., 0) if lr else L + (d/2., 0) + r = Rect(M, M) + (-d, -d, d, d) # the square + # the square makes line longer by (2*shift - 1)*width + p = r.tl * im + ap = "q\n%s%f %f m\n" % (opacity, p.x, p.y) + p = r.tr * im + ap += "%f %f l\n" % (p.x, p.y) + p = r.br * im + ap += "%f %f l\n" % (p.x, p.y) + p = r.bl * im + ap += "%f %f l\n" % (p.x, p.y) + ap += _format_g(w) + " w\n" + ap += scol + fcol + "b\nQ\n" + return ap + + @staticmethod + def _oval_string(p1, p2, p3, p4): + """Return /AP string defining an oval within a 4-polygon provided as points + """ + def bezier(p, q, r): + f = "%f %f %f %f %f %f c\n" + return f % (p.x, p.y, q.x, q.y, r.x, r.y) + + kappa = 0.55228474983 # magic number + ml = p1 + (p4 - p1) * 0.5 # middle points ... + mo = p1 + (p2 - p1) * 0.5 # for each ... + mr = p2 + (p3 - p2) * 0.5 # polygon ... 
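+        # kappa is the standard cubic Bezier circle constant 4*(sqrt(2)-1)/3
+        # (about 0.5523): the helper points below offset each side's middle
+        # point toward the adjacent corner by kappa times the half-side, so the
+        # four curves together roughly trace an ellipse inside the polygon.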
+ mu = p4 + (p3 - p4) * 0.5 # side + ol1 = ml + (p1 - ml) * kappa # the 8 bezier + ol2 = mo + (p1 - mo) * kappa # helper points + or1 = mo + (p2 - mo) * kappa + or2 = mr + (p2 - mr) * kappa + ur1 = mr + (p3 - mr) * kappa + ur2 = mu + (p3 - mu) * kappa + ul1 = mu + (p4 - mu) * kappa + ul2 = ml + (p4 - ml) * kappa + # now draw, starting from middle point of left side + ap = "%f %f m\n" % (ml.x, ml.y) + ap += bezier(ol1, ol2, mo) + ap += bezier(or1, or2, mr) + ap += bezier(ur1, ur2, mu) + ap += bezier(ul1, ul2, ml) + return ap + + @staticmethod + def _parse_da(annot): + + if g_use_extra: + val = extra.Tools_parse_da( annot.this) + else: + def Tools__parse_da(annot): + this_annot = annot.this + assert isinstance(this_annot, mupdf.PdfAnnot) + this_annot_obj = mupdf.pdf_annot_obj( this_annot) + pdf = mupdf.pdf_get_bound_document( this_annot_obj) + try: + da = mupdf.pdf_dict_get_inheritable( this_annot_obj, PDF_NAME('DA')) + if not da.m_internal: + trailer = mupdf.pdf_trailer(pdf) + da = mupdf.pdf_dict_getl(trailer, + PDF_NAME('Root'), + PDF_NAME('AcroForm'), + PDF_NAME('DA'), + ) + da_str = mupdf.pdf_to_text_string(da) + except Exception: + if g_exceptions_verbose: exception_info() + return + return da_str + val = Tools__parse_da(annot) + + if not val: + return ((0,), "", 0) + font = "Helv" + fsize = 12 + col = (0, 0, 0) + dat = val.split() # split on any whitespace + for i, item in enumerate(dat): + if item == "Tf": + font = dat[i - 2][1:] + fsize = float(dat[i - 1]) + dat[i] = dat[i-1] = dat[i-2] = "" + continue + if item == "g": # unicolor text + col = [(float(dat[i - 1]))] + dat[i] = dat[i-1] = "" + continue + if item == "rg": # RGB colored text + col = [float(f) for f in dat[i - 3:i]] + dat[i] = dat[i-1] = dat[i-2] = dat[i-3] = "" + continue + if item == "k": # CMYK colored text + col = [float(f) for f in dat[i - 4:i]] + dat[i] = dat[i-1] = dat[i-2] = dat[i-3] = dat[i-4] = "" + continue + + val = (col, font, fsize) + return val + + @staticmethod + def _reset_widget(annot): + this_annot = annot + this_annot_obj = mupdf.pdf_annot_obj(this_annot) + pdf = mupdf.pdf_get_bound_document(this_annot_obj) + mupdf.pdf_field_reset(pdf, this_annot_obj) + + @staticmethod + def _rotate_matrix(page): + pdfpage = page._pdf_page(required=False) + if not pdfpage.m_internal: + return JM_py_from_matrix(mupdf.FzMatrix()) + return JM_py_from_matrix(JM_rotate_page_matrix(pdfpage)) + + @staticmethod + def _save_widget(annot, widget): + JM_set_widget_properties(annot, widget) + + def _update_da(annot, da_str): + if g_use_extra: + extra.Tools_update_da( annot.this, da_str) + else: + try: + this_annot = annot.this + assert isinstance(this_annot, mupdf.PdfAnnot) + mupdf.pdf_dict_put_text_string(mupdf.pdf_annot_obj(this_annot), PDF_NAME('DA'), da_str) + mupdf.pdf_dict_del(mupdf.pdf_annot_obj(this_annot), PDF_NAME('DS')) # /* not supported */ + mupdf.pdf_dict_del(mupdf.pdf_annot_obj(this_annot), PDF_NAME('RC')) # /* not supported */ + except Exception: + if g_exceptions_verbose: exception_info() + return + return + + @staticmethod + def gen_id(): + global TOOLS_JM_UNIQUE_ID + TOOLS_JM_UNIQUE_ID += 1 + return TOOLS_JM_UNIQUE_ID + + @staticmethod + def glyph_cache_empty(): + ''' + Empty the glyph cache. + ''' + mupdf.fz_purge_glyph_cache() + + @staticmethod + def image_profile(stream, keep_image=0): + ''' + Metadata of an image binary stream. + ''' + return JM_image_profile(stream, keep_image) + + @staticmethod + def mupdf_display_errors(on=None): + ''' + Set MuPDF error display to True or False. 
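+
+        Passing None (the default) leaves the setting unchanged and only
+        returns the current value; e.g. TOOLS.mupdf_display_errors(False)
+        switches error output off. mupdf_display_warnings() behaves the same
+        way for warnings.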
+ ''' + global JM_mupdf_show_errors + if on is not None: + JM_mupdf_show_errors = bool(on) + return JM_mupdf_show_errors + + @staticmethod + def mupdf_display_warnings(on=None): + ''' + Set MuPDF warnings display to True or False. + ''' + global JM_mupdf_show_warnings + if on is not None: + JM_mupdf_show_warnings = bool(on) + return JM_mupdf_show_warnings + + @staticmethod + def mupdf_version(): + '''Get version of MuPDF binary build.''' + return mupdf.FZ_VERSION + + @staticmethod + def mupdf_warnings(reset=1): + ''' + Get the MuPDF warnings/errors with optional reset (default). + ''' + # Get any trailing `... repeated times...` message. + mupdf.fz_flush_warnings() + ret = '\n'.join( JM_mupdf_warnings_store) + if reset: + TOOLS.reset_mupdf_warnings() + return ret + + @staticmethod + def reset_mupdf_warnings(): + global JM_mupdf_warnings_store + JM_mupdf_warnings_store = list() + + @staticmethod + def set_aa_level(level): + ''' + Set anti-aliasing level. + ''' + mupdf.fz_set_aa_level(level) + + @staticmethod + def set_annot_stem( stem=None): + global JM_annot_id_stem + if stem is None: + return JM_annot_id_stem + len_ = len(stem) + 1 + if len_ > 50: + len_ = 50 + JM_annot_id_stem = stem[:50] + return JM_annot_id_stem + + @staticmethod + def set_font_width(doc, xref, width): + pdf = _as_pdf_document(doc, required=0) + if not pdf.m_internal: + return False + font = mupdf.pdf_load_object(pdf, xref) + dfonts = mupdf.pdf_dict_get(font, PDF_NAME('DescendantFonts')) + if mupdf.pdf_is_array(dfonts): + n = mupdf.pdf_array_len(dfonts) + for i in range(n): + dfont = mupdf.pdf_array_get(dfonts, i) + warray = mupdf.pdf_new_array(pdf, 3) + mupdf.pdf_array_push(warray, mupdf.pdf_new_int(0)) + mupdf.pdf_array_push(warray, mupdf.pdf_new_int(65535)) + mupdf.pdf_array_push(warray, mupdf.pdf_new_int(width)) + mupdf.pdf_dict_put(dfont, PDF_NAME('W'), warray) + return True + + @staticmethod + def set_graphics_min_line_width(min_line_width): + ''' + Set the graphics minimum line width. + ''' + mupdf.fz_set_graphics_min_line_width(min_line_width) + + @staticmethod + def set_icc( on=0): + """Set ICC color handling on or off.""" + if on: + if mupdf.FZ_ENABLE_ICC: + mupdf.fz_enable_icc() + else: + RAISEPY( "MuPDF built w/o ICC support",PyExc_ValueError) + elif mupdf.FZ_ENABLE_ICC: + mupdf.fz_disable_icc() + + @staticmethod + def set_low_memory( on=None): + """Set / unset MuPDF device caching.""" + if on is not None: + _globals.no_device_caching = bool(on) + return _globals.no_device_caching + + @staticmethod + def set_small_glyph_heights(on=None): + """Set / unset small glyph heights.""" + if on is not None: + _globals.small_glyph_heights = bool(on) + if g_use_extra: + extra.set_small_glyph_heights(_globals.small_glyph_heights) + return _globals.small_glyph_heights + + @staticmethod + def set_subset_fontnames(on=None): + ''' + Set / unset returning fontnames with their subset prefix. + ''' + if on is not None: + _globals.subset_fontnames = bool(on) + if g_use_extra: + extra.set_subset_fontnames(_globals.subset_fontnames) + return _globals.subset_fontnames + + @staticmethod + def show_aa_level(): + ''' + Show anti-aliasing values. + ''' + return dict( + graphics = mupdf.fz_graphics_aa_level(), + text = mupdf.fz_text_aa_level(), + graphics_min_line_width = mupdf.fz_graphics_min_line_width(), + ) + + @staticmethod + def store_maxsize(): + ''' + MuPDF store size limit. + ''' + # fixme: return gctx->store->max. + return None + + @staticmethod + def store_shrink(percent): + ''' + Free 'percent' of current store size. 
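+
+        A value of 100 (or more) empties the store completely; smaller values
+        shrink it to roughly (100 - percent) percent of its current size, e.g.
+        TOOLS.store_shrink(50) frees about half of it.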
+ ''' + if percent >= 100: + mupdf.fz_empty_store() + return 0 + if percent > 0: + mupdf.fz_shrink_store( 100 - percent) + # fixme: return gctx->store->size. + + @staticmethod + def store_size(): + ''' + MuPDF current store size. + ''' + # fixme: return gctx->store->size. + return None + + @staticmethod + def unset_quad_corrections(on=None): + ''' + Set ascender / descender corrections on or off. + ''' + if on is not None: + _globals.skip_quad_corrections = bool(on) + if g_use_extra: + extra.set_skip_quad_corrections(_globals.skip_quad_corrections) + return _globals.skip_quad_corrections + + # fixme: also defined at top-level. + JM_annot_id_stem = 'fitz' + + fitz_config = JM_fitz_config() + + +# Callbacks not yet supported with cppyy. +if not mupdf_cppyy: + mupdf.fz_set_warning_callback(JM_mupdf_warning) + mupdf.fz_set_error_callback(JM_mupdf_error) + + +# If there are pending warnings when we exit, we end up in this sequence: +# +# atexit() +# -> mupdf::internal_thread_state::~internal_thread_state() +# -> fz_drop_context() +# -> fz_flush_warnings() +# -> SWIG Director code +# -> Python calling JM_mupdf_warning(). +# +# Unfortunately this causes a SEGV, seemingly because the SWIG Director code has +# already been torn down. +# +# So we use a Python atexit handler to explicitly call fz_flush_warnings(); +# this appears to happen early enough for the Director machinery to still +# work. So in the sequence above, fz_flush_warnings() will find that there are +# no pending warnings and will not attempt to call JM_mupdf_warning(). +# +def _atexit(): + #log( 'PyMuPDF/src/__init__.py:_atexit() called') + mupdf.fz_flush_warnings() + mupdf.fz_set_warning_callback(None) + mupdf.fz_set_error_callback(None) + #log( '_atexit() returning') +atexit.register( _atexit) + + +# List of (name, red, green, blue) where: +# name: upper-case name. +# red, green, blue: integer in range 0..255. +# +from . import _wxcolors +_wxcolors = _wxcolors._wxcolors + + +# Dict mapping from name to (red, green, blue). +# name: lower-case name. +# red, green, blue: float in range 0..1. +# +pdfcolor = dict() +for name, r, g, b in _wxcolors: + pdfcolor[name.lower()] = (r/255, g/255, b/255) + + +def colors_pdf_dict(): + ''' + Returns dict mapping from name to (red, green, blue). + name: lower-case name. + red, green, blue: float in range 0..1. + ''' + return pdfcolor + + +def colors_wx_list(): + ''' + Returns list of (name, red, green, blue) tuples: + name: upper-case name. + red, green, blue: integers in range 0..255. + ''' + return _wxcolors + + +# We cannot import utils earlier because it imports this .py file itself and +# uses some pymupdf.* types in function typing. +# +from . import utils + + +# Use utils.*() fns for some class methods. 
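+# The assignments below attach functions from utils.py as methods of the
+# classes defined here: after e.g. `Page.get_text = utils.get_text`, calling
+# page.get_text("text") dispatches into utils.get_text(page, "text").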
+# +recover_bbox_quad = utils.recover_bbox_quad +recover_char_quad = utils.recover_char_quad +recover_line_quad = utils.recover_line_quad +recover_quad = utils.recover_quad +recover_span_quad = utils.recover_span_quad + +Annot.get_text = utils.get_text +Annot.get_textbox = utils.get_textbox + +Document._do_links = utils.do_links +Document._do_widgets = utils.do_widgets +Document.del_toc_item = utils.del_toc_item +Document.get_char_widths = utils.get_char_widths +Document.get_oc = utils.get_oc +Document.get_ocmd = utils.get_ocmd +Document.get_page_labels = utils.get_page_labels +Document.get_page_numbers = utils.get_page_numbers +Document.get_page_pixmap = utils.get_page_pixmap +Document.get_page_text = utils.get_page_text +Document.get_toc = utils.get_toc +Document.has_annots = utils.has_annots +Document.has_links = utils.has_links +Document.insert_page = utils.insert_page +Document.new_page = utils.new_page +Document.scrub = utils.scrub +Document.search_page_for = utils.search_page_for +Document.set_metadata = utils.set_metadata +Document.set_oc = utils.set_oc +Document.set_ocmd = utils.set_ocmd +Document.set_page_labels = utils.set_page_labels +Document.set_toc = utils.set_toc +Document.set_toc_item = utils.set_toc_item +Document.subset_fonts = utils.subset_fonts +Document.tobytes = Document.write +Document.xref_copy = utils.xref_copy + +IRect.get_area = utils.get_area + +Page.apply_redactions = utils.apply_redactions +Page.delete_image = utils.delete_image +Page.delete_widget = utils.delete_widget +Page.draw_bezier = utils.draw_bezier +Page.draw_circle = utils.draw_circle +Page.draw_curve = utils.draw_curve +Page.draw_line = utils.draw_line +Page.draw_oval = utils.draw_oval +Page.draw_polyline = utils.draw_polyline +Page.draw_quad = utils.draw_quad +Page.draw_rect = utils.draw_rect +Page.draw_sector = utils.draw_sector +Page.draw_squiggle = utils.draw_squiggle +Page.draw_zigzag = utils.draw_zigzag +Page.get_image_info = utils.get_image_info +Page.get_image_rects = utils.get_image_rects +Page.get_label = utils.get_label +Page.get_links = utils.get_links +Page.get_pixmap = utils.get_pixmap +Page.get_text = utils.get_text +Page.get_text_blocks = utils.get_text_blocks +Page.get_text_selection = utils.get_text_selection +Page.get_text_words = utils.get_text_words +Page.get_textbox = utils.get_textbox +Page.get_textpage_ocr = utils.get_textpage_ocr +Page.insert_image = utils.insert_image +Page.insert_link = utils.insert_link +Page.insert_text = utils.insert_text +Page.insert_textbox = utils.insert_textbox +Page.insert_htmlbox = utils.insert_htmlbox +Page.new_shape = lambda x: utils.Shape(x) +Page.replace_image = utils.replace_image +Page.search_for = utils.search_for +Page.show_pdf_page = utils.show_pdf_page +Page.update_link = utils.update_link +Page.write_text = utils.write_text +Shape = utils.Shape +from .table import find_tables + +Page.find_tables = find_tables + +Rect.get_area = utils.get_area + +TextWriter.fill_textbox = utils.fill_textbox + + +class FitzDeprecation(DeprecationWarning): + pass + +def restore_aliases(): + warnings.filterwarnings( "once", category=FitzDeprecation) + + def showthis(msg, cat, filename, lineno, file=None, line=None): + text = warnings.formatwarning(msg, cat, filename, lineno, line=line) + s = text.find("FitzDeprecation") + if s < 0: + log(text) + return + text = text[s:].splitlines()[0][4:] + log(text) + + warnings.showwarning = showthis + + def _alias(class_, new_name, legacy_name=None): + ''' + Adds an alias for a class_ or module item clled .. 
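+        For example, new_name 'get_pixmap' yields the legacy name 'getPixmap'
+        unless an explicit legacy_name is supplied.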
+ + class_: + Class/module to modify; use None for the current module. + new_name: + String name of existing item, e.g. name of method. + legacy_name: + Name of legacy object to create in . If None, we generate + from by removing underscores and capitalising the next + letter. + ''' + if class_ is None: + class_ = sys.modules[__name__] + if not legacy_name: + legacy_name = '' + capitalise_next = False + for c in new_name: + if c == '_': + capitalise_next = True + elif capitalise_next: + legacy_name += c.upper() + capitalise_next = False + else: + legacy_name += c + new_object = getattr( class_, new_name) + assert not getattr( class_, legacy_name, None), f'class {class_} already has {legacy_name}' + if callable( new_object): + def deprecated_function( *args, **kwargs): + warnings.warn( + f'"{legacy_name=}" removed from {class_} after v1.19.0 - use "{new_name}".', + category=FitzDeprecation, + ) + return new_object( *args, **kwargs) + setattr( class_, legacy_name, deprecated_function) + deprecated_function.__doc__ = ( + f'*** Deprecated and removed in version after v1.19.0 - use "{new_name}". ***\n' + f'{new_object.__doc__}' + ) + else: + setattr( class_, legacy_name, new_object) + + _alias( Annot, 'get_file', 'fileGet') + _alias( Annot, 'get_pixmap') + _alias( Annot, 'get_sound', 'soundGet') + _alias( Annot, 'get_text') + _alias( Annot, 'get_textbox') + _alias( Annot, 'get_textpage', 'getTextPage') + _alias( Annot, 'line_ends') + _alias( Annot, 'set_blendmode', 'setBlendMode') + _alias( Annot, 'set_border') + _alias( Annot, 'set_colors') + _alias( Annot, 'set_flags') + _alias( Annot, 'set_info') + _alias( Annot, 'set_line_ends') + _alias( Annot, 'set_name') + _alias( Annot, 'set_oc', 'setOC') + _alias( Annot, 'set_opacity') + _alias( Annot, 'set_rect') + _alias( Annot, 'update_file', 'fileUpd') + _alias( DisplayList, 'get_pixmap') + _alias( DisplayList, 'get_textpage', 'getTextPage') + _alias( Document, 'chapter_count') + _alias( Document, 'chapter_page_count') + _alias( Document, 'convert_to_pdf', 'convertToPDF') + _alias( Document, 'copy_page') + _alias( Document, 'delete_page') + _alias( Document, 'delete_pages', 'deletePageRange') + _alias( Document, 'embfile_add', 'embeddedFileAdd') + _alias( Document, 'embfile_count', 'embeddedFileCount') + _alias( Document, 'embfile_del', 'embeddedFileDel') + _alias( Document, 'embfile_get', 'embeddedFileGet') + _alias( Document, 'embfile_info', 'embeddedFileInfo') + _alias( Document, 'embfile_names', 'embeddedFileNames') + _alias( Document, 'embfile_upd', 'embeddedFileUpd') + _alias( Document, 'extract_font') + _alias( Document, 'extract_image') + _alias( Document, 'find_bookmark') + _alias( Document, 'fullcopy_page') + _alias( Document, 'get_char_widths') + _alias( Document, 'get_ocgs', 'getOCGs') + _alias( Document, 'get_page_fonts', 'getPageFontList') + _alias( Document, 'get_page_images', 'getPageImageList') + _alias( Document, 'get_page_pixmap') + _alias( Document, 'get_page_text') + _alias( Document, 'get_page_xobjects', 'getPageXObjectList') + _alias( Document, 'get_sigflags', 'getSigFlags') + _alias( Document, 'get_toc', 'getToC') + _alias( Document, 'get_xml_metadata') + _alias( Document, 'insert_page') + _alias( Document, 'insert_pdf', 'insertPDF') + _alias( Document, 'is_dirty') + _alias( Document, 'is_form_pdf', 'isFormPDF') + _alias( Document, 'is_pdf', 'isPDF') + _alias( Document, 'is_reflowable') + _alias( Document, 'is_repaired') + _alias( Document, 'last_location') + _alias( Document, 'load_page') + _alias( Document, 
'make_bookmark') + _alias( Document, 'move_page') + _alias( Document, 'needs_pass') + _alias( Document, 'new_page') + _alias( Document, 'next_location') + _alias( Document, 'page_count') + _alias( Document, 'page_cropbox', 'pageCropBox') + _alias( Document, 'page_xref') + _alias( Document, 'pdf_catalog', 'PDFCatalog') + _alias( Document, 'pdf_trailer', 'PDFTrailer') + _alias( Document, 'prev_location', 'previousLocation') + _alias( Document, 'resolve_link') + _alias( Document, 'search_page_for') + _alias( Document, 'set_language') + _alias( Document, 'set_metadata') + _alias( Document, 'set_toc', 'setToC') + _alias( Document, 'set_xml_metadata') + _alias( Document, 'update_object') + _alias( Document, 'update_stream') + _alias( Document, 'xref_is_stream', 'isStream') + _alias( Document, 'xref_length') + _alias( Document, 'xref_object') + _alias( Document, 'xref_stream') + _alias( Document, 'xref_stream_raw') + _alias( Document, 'xref_xml_metadata', 'metadataXML') + _alias( IRect, 'get_area') + _alias( IRect, 'get_area', 'getRectArea') + _alias( IRect, 'include_point') + _alias( IRect, 'include_rect') + _alias( IRect, 'is_empty') + _alias( IRect, 'is_infinite') + _alias( Link, 'is_external') + _alias( Link, 'set_border') + _alias( Link, 'set_colors') + _alias( Matrix, 'is_rectilinear') + _alias( Matrix, 'prerotate', 'preRotate') + _alias( Matrix, 'prescale', 'preScale') + _alias( Matrix, 'preshear', 'preShear') + _alias( Matrix, 'pretranslate', 'preTranslate') + _alias( None, 'get_pdf_now', 'getPDFnow') + _alias( None, 'get_pdf_str', 'getPDFstr') + _alias( None, 'get_text_length') + _alias( None, 'get_text_length', 'getTextlength') + _alias( None, 'image_profile', 'ImageProperties') + _alias( None, 'paper_rect', 'PaperRect') + _alias( None, 'paper_size', 'PaperSize') + _alias( None, 'paper_sizes') + _alias( None, 'planish_line') + _alias( Outline, 'is_external') + _alias( Outline, 'is_open') + _alias( Page, 'add_caret_annot') + _alias( Page, 'add_circle_annot') + _alias( Page, 'add_file_annot') + _alias( Page, 'add_freetext_annot') + _alias( Page, 'add_highlight_annot') + _alias( Page, 'add_ink_annot') + _alias( Page, 'add_line_annot') + _alias( Page, 'add_polygon_annot') + _alias( Page, 'add_polyline_annot') + _alias( Page, 'add_rect_annot') + _alias( Page, 'add_redact_annot') + _alias( Page, 'add_squiggly_annot') + _alias( Page, 'add_stamp_annot') + _alias( Page, 'add_strikeout_annot') + _alias( Page, 'add_text_annot') + _alias( Page, 'add_underline_annot') + _alias( Page, 'add_widget') + _alias( Page, 'clean_contents') + _alias( Page, 'cropbox', 'CropBox') + _alias( Page, 'cropbox_position', 'CropBoxPosition') + _alias( Page, 'delete_annot') + _alias( Page, 'delete_link') + _alias( Page, 'delete_widget') + _alias( Page, 'derotation_matrix') + _alias( Page, 'draw_bezier') + _alias( Page, 'draw_circle') + _alias( Page, 'draw_curve') + _alias( Page, 'draw_line') + _alias( Page, 'draw_oval') + _alias( Page, 'draw_polyline') + _alias( Page, 'draw_quad') + _alias( Page, 'draw_rect') + _alias( Page, 'draw_sector') + _alias( Page, 'draw_squiggle') + _alias( Page, 'draw_zigzag') + _alias( Page, 'first_annot') + _alias( Page, 'first_link') + _alias( Page, 'first_widget') + _alias( Page, 'get_contents') + _alias( Page, 'get_displaylist', 'getDisplayList') + _alias( Page, 'get_drawings') + _alias( Page, 'get_fonts', 'getFontList') + _alias( Page, 'get_image_bbox') + _alias( Page, 'get_images', 'getImageList') + _alias( Page, 'get_links') + _alias( Page, 'get_pixmap') + _alias( Page, 'get_svg_image', 
'getSVGimage') + _alias( Page, 'get_text') + _alias( Page, 'get_text_blocks') + _alias( Page, 'get_text_words') + _alias( Page, 'get_textbox') + _alias( Page, 'get_textpage', 'getTextPage') + _alias( Page, 'insert_font') + _alias( Page, 'insert_image') + _alias( Page, 'insert_link') + _alias( Page, 'insert_text') + _alias( Page, 'insert_textbox') + _alias( Page, 'is_wrapped', '_isWrapped') + _alias( Page, 'load_annot') + _alias( Page, 'load_links') + _alias( Page, 'mediabox', 'MediaBox') + _alias( Page, 'mediabox_size', 'MediaBoxSize') + _alias( Page, 'new_shape') + _alias( Page, 'read_contents') + _alias( Page, 'rotation_matrix') + _alias( Page, 'search_for') + _alias( Page, 'set_cropbox', 'setCropBox') + _alias( Page, 'set_mediabox', 'setMediaBox') + _alias( Page, 'set_rotation') + _alias( Page, 'show_pdf_page', 'showPDFpage') + _alias( Page, 'transformation_matrix') + _alias( Page, 'update_link') + _alias( Page, 'wrap_contents') + _alias( Page, 'write_text') + _alias( Pixmap, 'clear_with') + _alias( Pixmap, 'copy', 'copyPixmap') + _alias( Pixmap, 'gamma_with') + _alias( Pixmap, 'invert_irect', 'invertIRect') + _alias( Pixmap, 'pil_save', 'pillowWrite') + _alias( Pixmap, 'pil_tobytes', 'pillowData') + _alias( Pixmap, 'save', 'writeImage') + _alias( Pixmap, 'save', 'writePNG') + _alias( Pixmap, 'set_alpha') + _alias( Pixmap, 'set_dpi', 'setResolution') + _alias( Pixmap, 'set_origin') + _alias( Pixmap, 'set_pixel') + _alias( Pixmap, 'set_rect') + _alias( Pixmap, 'tint_with') + _alias( Pixmap, 'tobytes', 'getImageData') + _alias( Pixmap, 'tobytes', 'getPNGData') + _alias( Pixmap, 'tobytes', 'getPNGdata') + _alias( Quad, 'is_convex') + _alias( Quad, 'is_empty') + _alias( Quad, 'is_rectangular') + _alias( Rect, 'get_area') + _alias( Rect, 'get_area', 'getRectArea') + _alias( Rect, 'include_point') + _alias( Rect, 'include_rect') + _alias( Rect, 'is_empty') + _alias( Rect, 'is_infinite') + _alias( TextWriter, 'fill_textbox') + _alias( TextWriter, 'write_text') + _alias( utils.Shape, 'draw_bezier') + _alias( utils.Shape, 'draw_circle') + _alias( utils.Shape, 'draw_curve') + _alias( utils.Shape, 'draw_line') + _alias( utils.Shape, 'draw_oval') + _alias( utils.Shape, 'draw_polyline') + _alias( utils.Shape, 'draw_quad') + _alias( utils.Shape, 'draw_rect') + _alias( utils.Shape, 'draw_sector') + _alias( utils.Shape, 'draw_squiggle') + _alias( utils.Shape, 'draw_zigzag') + _alias( utils.Shape, 'insert_text') + _alias( utils.Shape, 'insert_textbox') + +if 0: + restore_aliases() + +__version__ = VersionBind +__doc__ = ( + f'PyMuPDF {VersionBind}: Python bindings for the MuPDF {VersionFitz} library (rebased implementation).\n' + f'Python {sys.version_info[0]}.{sys.version_info[1]} running on {sys.platform} ({64 if sys.maxsize > 2**32 else 32}-bit).\n' + ) diff -r 000000000000 -r 1d09e1dec1d9 src/__main__.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/__main__.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1140 @@ +# ----------------------------------------------------------------------------- +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# Part of "PyMuPDF", Python bindings for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ----------------------------------------------------------------------------- +import argparse +import bisect +import os +import sys +import statistics +from typing import Dict, List, Set + +from . import pymupdf + +def mycenter(x): + return (" %s " % x).center(75, "-") + + +def recoverpix(doc, item): + """Return image for a given XREF.""" + x = item[0] # xref of PDF image + s = item[1] # xref of its /SMask + if s == 0: # no smask: use direct image output + return doc.extract_image(x) + + def getimage(pix): + if pix.colorspace.n != 4: + return pix + tpix = pymupdf.Pixmap(pymupdf.csRGB, pix) + return tpix + + # we need to reconstruct the alpha channel with the smask + pix1 = pymupdf.Pixmap(doc, x) + pix2 = pymupdf.Pixmap(doc, s) # create pixmap of the /SMask entry + + """Sanity check: + - both pixmaps must have the same rectangle + - both pixmaps must have alpha=0 + - pix2 must consist of 1 byte per pixel + """ + if not (pix1.irect == pix2.irect and pix1.alpha == pix2.alpha == 0 and pix2.n == 1): + pymupdf.message("Warning: unsupported /SMask %i for %i:" % (s, x)) + pymupdf.message(pix2) + pix2 = None + return getimage(pix1) # return the pixmap as is + + pix = pymupdf.Pixmap(pix1) # copy of pix1, with an alpha channel added + pix.set_alpha(pix2.samples) # treat pix2.samples as the alpha values + pix1 = pix2 = None # free temp pixmaps + + # we may need to adjust something for CMYK pixmaps here: + return getimage(pix) + + +def open_file(filename, password, show=False, pdf=True): + """Open and authenticate a document.""" + doc = pymupdf.open(filename) + if not doc.is_pdf and pdf is True: + sys.exit("this command supports PDF files only") + rc = -1 + if not doc.needs_pass: + return doc + if password: + rc = doc.authenticate(password) + if not rc: + sys.exit("authentication unsuccessful") + if show is True: + pymupdf.message("authenticated as %s" % "owner" if rc > 2 else "user") + else: + sys.exit("'%s' requires a password" % doc.name) + return doc + + +def print_dict(item): + """Print a Python dictionary.""" + l = max([len(k) for k in item.keys()]) + 1 + for k, v in item.items(): + msg = "%s: %s" % (k.rjust(l), v) + pymupdf.message(msg) + + +def print_xref(doc, xref): + """Print an object given by XREF number. + + Simulate the PDF source in "pretty" format. + For a stream also print its size. + """ + pymupdf.message("%i 0 obj" % xref) + xref_str = doc.xref_object(xref) + pymupdf.message(xref_str) + if doc.xref_is_stream(xref): + temp = xref_str.split() + try: + idx = temp.index("/Length") + 1 + size = temp[idx] + if size.endswith("0 R"): + size = "unknown" + except Exception: + size = "unknown" + pymupdf.message("stream\n...%s bytes" % size) + pymupdf.message("endstream") + pymupdf.message("endobj") + + +def get_list(rlist, limit, what="page"): + """Transform a page / xref specification into a list of integers. + + Args + ---- + rlist: (str) the specification + limit: maximum number, i.e. number of pages, number of objects + what: a string to be used in error messages + Returns + ------- + A list of integers representing the specification. 
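+
+    Example
+    -------
+    With limit=11 (a 10-page document), "1,5-7,N" becomes [1, 5, 6, 7, 10],
+    because "N" is first replaced by str(limit - 1).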
+ """ + N = str(limit - 1) + rlist = rlist.replace("N", N).replace(" ", "") + rlist_arr = rlist.split(",") + out_list = [] + for seq, item in enumerate(rlist_arr): + n = seq + 1 + if item.isdecimal(): # a single integer + i = int(item) + if 1 <= i < limit: + out_list.append(int(item)) + else: + sys.exit("bad %s specification at item %i" % (what, n)) + continue + try: # this must be a range now, and all of the following must work: + i1, i2 = item.split("-") # will fail if not 2 items produced + i1 = int(i1) # will fail on non-integers + i2 = int(i2) + except Exception: + sys.exit("bad %s range specification at item %i" % (what, n)) + + if not (1 <= i1 < limit and 1 <= i2 < limit): + sys.exit("bad %s range specification at item %i" % (what, n)) + + if i1 == i2: # just in case: a range of equal numbers + out_list.append(i1) + continue + + if i1 < i2: # first less than second + out_list += list(range(i1, i2 + 1)) + else: # first larger than second + out_list += list(range(i1, i2 - 1, -1)) + + return out_list + + +def show(args): + doc = open_file(args.input, args.password, True) + size = os.path.getsize(args.input) / 1024 + flag = "KB" + if size > 1000: + size /= 1024 + flag = "MB" + size = round(size, 1) + meta = doc.metadata # pylint: disable=no-member + pymupdf.message( + "'%s', pages: %i, objects: %i, %g %s, %s, encryption: %s" + % ( + args.input, + doc.page_count, + doc.xref_length() - 1, + size, + flag, + meta["format"], + meta["encryption"], + ) + ) + n = doc.is_form_pdf + if n > 0: + s = doc.get_sigflags() + pymupdf.message( + "document contains %i root form fields and is %ssigned" + % (n, "not " if s != 3 else "") + ) + n = doc.embfile_count() + if n > 0: + pymupdf.message("document contains %i embedded files" % n) + pymupdf.message() + if args.catalog: + pymupdf.message(mycenter("PDF catalog")) + xref = doc.pdf_catalog() + print_xref(doc, xref) + pymupdf.message() + if args.metadata: + pymupdf.message(mycenter("PDF metadata")) + print_dict(doc.metadata) # pylint: disable=no-member + pymupdf.message() + if args.xrefs: + pymupdf.message(mycenter("object information")) + xrefl = get_list(args.xrefs, doc.xref_length(), what="xref") + for xref in xrefl: + print_xref(doc, xref) + pymupdf.message() + if args.pages: + pymupdf.message(mycenter("page information")) + pagel = get_list(args.pages, doc.page_count + 1) + for pno in pagel: + n = pno - 1 + xref = doc.page_xref(n) + pymupdf.message("Page %i:" % pno) + print_xref(doc, xref) + pymupdf.message() + if args.trailer: + pymupdf.message(mycenter("PDF trailer")) + pymupdf.message(doc.pdf_trailer()) + pymupdf.message() + doc.close() + + +def clean(args): + doc = open_file(args.input, args.password, pdf=True) + encryption = args.encryption + encrypt = ("keep", "none", "rc4-40", "rc4-128", "aes-128", "aes-256").index( + encryption + ) + + if not args.pages: # simple cleaning + doc.save( + args.output, + garbage=args.garbage, + deflate=args.compress, + pretty=args.pretty, + clean=args.sanitize, + ascii=args.ascii, + linear=args.linear, + encryption=encrypt, + owner_pw=args.owner, + user_pw=args.user, + permissions=args.permission, + ) + return + + # create sub document from page numbers + pages = get_list(args.pages, doc.page_count + 1) + outdoc = pymupdf.open() + for pno in pages: + n = pno - 1 + outdoc.insert_pdf(doc, from_page=n, to_page=n) + outdoc.save( + args.output, + garbage=args.garbage, + deflate=args.compress, + pretty=args.pretty, + clean=args.sanitize, + ascii=args.ascii, + linear=args.linear, + encryption=encrypt, + 
owner_pw=args.owner, + user_pw=args.user, + permissions=args.permission, + ) + doc.close() + outdoc.close() + return + + +def doc_join(args): + """Join pages from several PDF documents.""" + doc_list = args.input # a list of input PDFs + doc = pymupdf.open() # output PDF + for src_item in doc_list: # process one input PDF + src_list = src_item.split(",") + password = src_list[1] if len(src_list) > 1 else None + src = open_file(src_list[0], password, pdf=True) + pages = ",".join(src_list[2:]) # get 'pages' specifications + if pages: # if anything there, retrieve a list of desired pages + page_list = get_list(",".join(src_list[2:]), src.page_count + 1) + else: # take all pages + page_list = range(1, src.page_count + 1) + for i in page_list: + doc.insert_pdf(src, from_page=i - 1, to_page=i - 1) # copy each source page + src.close() + + doc.save(args.output, garbage=4, deflate=True) + doc.close() + + +def embedded_copy(args): + """Copy embedded files between PDFs.""" + doc = open_file(args.input, args.password, pdf=True) + if not doc.can_save_incrementally() and ( + not args.output or args.output == args.input + ): + sys.exit("cannot save PDF incrementally") + src = open_file(args.source, args.pwdsource) + names = set(args.name) if args.name else set() + src_names = set(src.embfile_names()) + if names: + if not names <= src_names: + sys.exit("not all names are contained in source") + else: + names = src_names + if not names: + sys.exit("nothing to copy") + intersect = names & set(doc.embfile_names()) # any equal name already in target? + if intersect: + sys.exit("following names already exist in receiving PDF: %s" % str(intersect)) + + for item in names: + info = src.embfile_info(item) + buff = src.embfile_get(item) + doc.embfile_add( + item, + buff, + filename=info["filename"], + ufilename=info["ufilename"], + desc=info["desc"], + ) + pymupdf.message("copied entry '%s' from '%s'" % (item, src.name)) + src.close() + if args.output and args.output != args.input: + doc.save(args.output, garbage=3) + else: + doc.saveIncr() + doc.close() + + +def embedded_del(args): + """Delete an embedded file entry.""" + doc = open_file(args.input, args.password, pdf=True) + if not doc.can_save_incrementally() and ( + not args.output or args.output == args.input + ): + sys.exit("cannot save PDF incrementally") + + try: + doc.embfile_del(args.name) + except (ValueError, pymupdf.mupdf.FzErrorBase) as e: + sys.exit(f'no such embedded file {args.name!r}: {e}') + if not args.output or args.output == args.input: + doc.saveIncr() + else: + doc.save(args.output, garbage=1) + doc.close() + + +def embedded_get(args): + """Retrieve contents of an embedded file.""" + doc = open_file(args.input, args.password, pdf=True) + try: + stream = doc.embfile_get(args.name) + d = doc.embfile_info(args.name) + except (ValueError, pymupdf.mupdf.FzErrorBase) as e: + sys.exit(f'no such embedded file {args.name!r}: {e}') + filename = args.output if args.output else d["filename"] + with open(filename, "wb") as output: + output.write(stream) + pymupdf.message("saved entry '%s' as '%s'" % (args.name, filename)) + doc.close() + + +def embedded_add(args): + """Insert a new embedded file.""" + doc = open_file(args.input, args.password, pdf=True) + if not doc.can_save_incrementally() and ( + args.output is None or args.output == args.input + ): + sys.exit("cannot save PDF incrementally") + + try: + doc.embfile_del(args.name) + sys.exit("entry '%s' already exists" % args.name) + except Exception: + pass + + if not os.path.exists(args.path) or not 
os.path.isfile(args.path): + sys.exit("no such file '%s'" % args.path) + with open(args.path, "rb") as f: + stream = f.read() + filename = args.path + ufilename = filename + if not args.desc: + desc = filename + else: + desc = args.desc + doc.embfile_add( + args.name, stream, filename=filename, ufilename=ufilename, desc=desc + ) + if not args.output or args.output == args.input: + doc.saveIncr() + else: + doc.save(args.output, garbage=3) + doc.close() + + +def embedded_upd(args): + """Update contents or metadata of an embedded file.""" + doc = open_file(args.input, args.password, pdf=True) + if not doc.can_save_incrementally() and ( + args.output is None or args.output == args.input + ): + sys.exit("cannot save PDF incrementally") + + try: + doc.embfile_info(args.name) + except Exception: + sys.exit("no such embedded file '%s'" % args.name) + + if ( + args.path is not None + and os.path.exists(args.path) + and os.path.isfile(args.path) + ): + with open(args.path, "rb") as f: + stream = f.read() + else: + stream = None + + if args.filename: + filename = args.filename + else: + filename = None + + if args.ufilename: + ufilename = args.ufilename + elif args.filename: + ufilename = args.filename + else: + ufilename = None + + if args.desc: + desc = args.desc + else: + desc = None + + doc.embfile_upd( + args.name, stream, filename=filename, ufilename=ufilename, desc=desc + ) + if args.output is None or args.output == args.input: + doc.saveIncr() + else: + doc.save(args.output, garbage=3) + doc.close() + + +def embedded_list(args): + """List embedded files.""" + doc = open_file(args.input, args.password, pdf=True) + names = doc.embfile_names() + if args.name is not None: + if args.name not in names: + sys.exit("no such embedded file '%s'" % args.name) + else: + pymupdf.message() + pymupdf.message( + "printing 1 of %i embedded file%s:" + % (len(names), "s" if len(names) > 1 else "") + ) + pymupdf.message() + print_dict(doc.embfile_info(args.name)) + pymupdf.message() + return + if not names: + pymupdf.message("'%s' contains no embedded files" % doc.name) + return + if len(names) > 1: + msg = "'%s' contains the following %i embedded files" % (doc.name, len(names)) + else: + msg = "'%s' contains the following embedded file" % doc.name + pymupdf.message(msg) + pymupdf.message() + for name in names: + if not args.detail: + pymupdf.message(name) + continue + _ = doc.embfile_info(name) + print_dict(doc.embfile_info(name)) + pymupdf.message() + doc.close() + + +def extract_objects(args): + """Extract images and / or fonts from a PDF.""" + if not args.fonts and not args.images: + sys.exit("neither fonts nor images requested") + doc = open_file(args.input, args.password, pdf=True) + + if args.pages: + pages = get_list(args.pages, doc.page_count + 1) + else: + pages = range(1, doc.page_count + 1) + + if not args.output: + out_dir = os.path.abspath(os.curdir) + else: + out_dir = args.output + if not (os.path.exists(out_dir) and os.path.isdir(out_dir)): + sys.exit("output directory %s does not exist" % out_dir) + + font_xrefs = set() # already saved fonts + image_xrefs = set() # already saved images + + for pno in pages: + if args.fonts: + itemlist = doc.get_page_fonts(pno - 1) + for item in itemlist: + xref = item[0] + if xref not in font_xrefs: + font_xrefs.add(xref) + fontname, ext, _, buffer = doc.extract_font(xref) + if ext == "n/a" or not buffer: + continue + outname = os.path.join( + out_dir, f"{fontname.replace(' ', '-')}-{xref}.{ext}" + ) + with open(outname, "wb") as outfile: + 
outfile.write(buffer) + buffer = None + if args.images: + itemlist = doc.get_page_images(pno - 1) + for item in itemlist: + xref = item[0] + if xref not in image_xrefs: + image_xrefs.add(xref) + pix = recoverpix(doc, item) + if type(pix) is dict: + ext = pix["ext"] + imgdata = pix["image"] + outname = os.path.join(out_dir, "img-%i.%s" % (xref, ext)) + with open(outname, "wb") as outfile: + outfile.write(imgdata) + else: + outname = os.path.join(out_dir, "img-%i.png" % xref) + pix2 = ( + pix + if pix.colorspace.n < 4 + else pymupdf.Pixmap(pymupdf.csRGB, pix) + ) + pix2.save(outname) + + if args.fonts: + pymupdf.message("saved %i fonts to '%s'" % (len(font_xrefs), out_dir)) + if args.images: + pymupdf.message("saved %i images to '%s'" % (len(image_xrefs), out_dir)) + doc.close() + + +def page_simple(page, textout, GRID, fontsize, noformfeed, skip_empty, flags): + eop = b"\n" if noformfeed else bytes([12]) + text = page.get_text("text", flags=flags) + if not text: + if not skip_empty: + textout.write(eop) # write formfeed + return + textout.write(text.encode("utf8", errors="surrogatepass")) + textout.write(eop) + return + + +def page_blocksort(page, textout, GRID, fontsize, noformfeed, skip_empty, flags): + eop = b"\n" if noformfeed else bytes([12]) + blocks = page.get_text("blocks", flags=flags) + if blocks == []: + if not skip_empty: + textout.write(eop) # write formfeed + return + blocks.sort(key=lambda b: (b[3], b[0])) + for b in blocks: + textout.write(b[4].encode("utf8", errors="surrogatepass")) + textout.write(eop) + return + + +def page_layout(page, textout, GRID, fontsize, noformfeed, skip_empty, flags): + eop = b"\n" if noformfeed else bytes([12]) + + # -------------------------------------------------------------------- + def find_line_index(values: List[int], value: int) -> int: + """Find the right row coordinate. + + Args: + values: (list) y-coordinates of rows. + value: (int) lookup for this value (y-origin of char). + Returns: + y-ccordinate of appropriate line for value. 
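+
+        Example:
+            find_line_index([0, 10, 20], 14) returns 10, the bottom
+            coordinate of the row that y-origin 14 falls into.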
+ """ + i = bisect.bisect_right(values, value) + if i: + return values[i - 1] + raise RuntimeError("Line for %g not found in %s" % (value, values)) + + # -------------------------------------------------------------------- + def curate_rows(rows: Set[int], GRID) -> List: + rows = list(rows) + rows.sort() # sort ascending + nrows = [rows[0]] + for h in rows[1:]: + if h >= nrows[-1] + GRID: # only keep significant differences + nrows.append(h) + return nrows # curated list of line bottom coordinates + + def process_blocks(blocks: List[Dict], page: pymupdf.Page): + rows = set() + page_width = page.rect.width + page_height = page.rect.height + rowheight = page_height + left = page_width + right = 0 + chars = [] + for block in blocks: + for line in block["lines"]: + if line["dir"] != (1, 0): # ignore non-horizontal text + continue + x0, y0, x1, y1 = line["bbox"] + if y1 < 0 or y0 > page.rect.height: # ignore if outside CropBox + continue + # upd row height + height = y1 - y0 + + if rowheight > height: + rowheight = height + for span in line["spans"]: + if span["size"] <= fontsize: + continue + for c in span["chars"]: + x0, _, x1, _ = c["bbox"] + cwidth = x1 - x0 + ox, oy = c["origin"] + oy = int(round(oy)) + rows.add(oy) + ch = c["c"] + if left > ox and ch != " ": + left = ox # update left coordinate + if right < x1: + right = x1 # update right coordinate + # handle ligatures: + if cwidth == 0 and chars != []: # potential ligature + old_ch, old_ox, old_oy, old_cwidth = chars[-1] + if old_oy == oy: # ligature + if old_ch != chr(0xFB00): # previous "ff" char lig? + lig = joinligature(old_ch + ch) # no + # convert to one of the 3-char ligatures: + elif ch == "i": + lig = chr(0xFB03) # "ffi" + elif ch == "l": + lig = chr(0xFB04) # "ffl" + else: # something wrong, leave old char in place + lig = old_ch + chars[-1] = (lig, old_ox, old_oy, old_cwidth) + continue + chars.append((ch, ox, oy, cwidth)) # all chars on page + return chars, rows, left, right, rowheight + + def joinligature(lig: str) -> str: + """Return ligature character for a given pair / triple of characters. + + Args: + lig: (str) 2/3 characters, e.g. "ff" + Returns: + Ligature, e.g. "ff" -> chr(0xFB00) + """ + + if lig == "ff": + return chr(0xFB00) + elif lig == "fi": + return chr(0xFB01) + elif lig == "fl": + return chr(0xFB02) + elif lig == "ffi": + return chr(0xFB03) + elif lig == "ffl": + return chr(0xFB04) + elif lig == "ft": + return chr(0xFB05) + elif lig == "st": + return chr(0xFB06) + return lig + + # -------------------------------------------------------------------- + def make_textline(left, slot, minslot, lchars): + """Produce the text of one output line. + + Args: + left: (float) left most coordinate used on page + slot: (float) avg width of one character in any font in use. + minslot: (float) min width for the characters in this line. + chars: (list[tuple]) characters of this line. 
+ Returns: + text: (str) text string for this line + """ + text = "" # we output this + old_char = "" + old_x1 = 0 # end coordinate of last char + old_ox = 0 # x-origin of last char + if minslot <= pymupdf.EPSILON: + raise RuntimeError("program error: minslot too small = %g" % minslot) + + for c in lchars: # loop over characters + char, ox, _, cwidth = c + ox = ox - left # its (relative) start coordinate + x1 = ox + cwidth # ending coordinate + + # eliminate overprint effect + if old_char == char and ox - old_ox <= cwidth * 0.2: + continue + + # omit spaces overlapping previous char + if char == " " and (old_x1 - ox) / cwidth > 0.8: + continue + + old_char = char + # close enough to previous? + if ox < old_x1 + minslot: # assume char adjacent to previous + text += char # append to output + old_x1 = x1 # new end coord + old_ox = ox # new origin.x + continue + + # else next char starts after some gap: + # fill in right number of spaces, so char is positioned + # in the right slot of the line + if char == " ": # rest relevant for non-space only + continue + delta = int(ox / slot) - len(text) + if ox > old_x1 and delta > 1: + text += " " * delta + # now append char + text += char + old_x1 = x1 # new end coordinate + old_ox = ox # new origin + return text.rstrip() + + # extract page text by single characters ("rawdict") + blocks = page.get_text("rawdict", flags=flags)["blocks"] + chars, rows, left, right, rowheight = process_blocks(blocks, page) + + if chars == []: + if not skip_empty: + textout.write(eop) # write formfeed + return + # compute list of line coordinates - ignoring small (GRID) differences + rows = curate_rows(rows, GRID) + + # sort all chars by x-coordinates, so every line will receive char info, + # sorted from left to right. + chars.sort(key=lambda c: c[1]) + + # populate the lines with their char info + lines = {} # key: y1-ccordinate, value: char list + for c in chars: + _, _, oy, _ = c + y = find_line_index(rows, oy) # y-coord of the right line + lchars = lines.get(y, []) # read line chars so far + lchars.append(c) # append this char + lines[y] = lchars # write back to line + + # ensure line coordinates are ascending + keys = list(lines.keys()) + keys.sort() + + # ------------------------------------------------------------------------- + # Compute "char resolution" for the page: the char width corresponding to + # 1 text char position on output - call it 'slot'. + # For each line, compute median of its char widths. The minimum across all + # lines is 'slot'. + # The minimum char width of each line is used to determine if spaces must + # be inserted in between two characters. 
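+    # Illustration: if the text spans left=50 .. right=550 and the smallest
+    # median char width over all lines is 5, then slot=5, and a character whose
+    # origin lies 50 points right of 'left' gets padded out to roughly text
+    # column int(50 / 5) = 10 of its output line.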
+ # ------------------------------------------------------------------------- + slot = right - left + minslots = {} + for k in keys: + lchars = lines[k] + ccount = len(lchars) + if ccount < 2: + minslots[k] = 1 + continue + widths = [c[3] for c in lchars] + widths.sort() + this_slot = statistics.median(widths) # take median value + if this_slot < slot: + slot = this_slot + minslots[k] = widths[0] + + # compute line advance in text output + rowheight = rowheight * (rows[-1] - rows[0]) / (rowheight * len(rows)) * 1.2 + rowpos = rows[0] # first line positioned here + textout.write(b"\n") + for k in keys: # walk through the lines + while rowpos < k: # honor distance between lines + textout.write(b"\n") + rowpos += rowheight + text = make_textline(left, slot, minslots[k], lines[k]) + textout.write((text + "\n").encode("utf8", errors="surrogatepass")) + rowpos = k + rowheight + + textout.write(eop) # write formfeed + + +def gettext(args): + doc = open_file(args.input, args.password, pdf=False) + pagel = get_list(args.pages, doc.page_count + 1) + output = args.output + if output is None: + filename, _ = os.path.splitext(doc.name) + output = filename + ".txt" + with open(output, "wb") as textout: + flags = pymupdf.TEXT_PRESERVE_LIGATURES | pymupdf.TEXT_PRESERVE_WHITESPACE + if args.convert_white: + flags ^= pymupdf.TEXT_PRESERVE_WHITESPACE + if args.noligatures: + flags ^= pymupdf.TEXT_PRESERVE_LIGATURES + if args.extra_spaces: + flags ^= pymupdf.TEXT_INHIBIT_SPACES + func = { + "simple": page_simple, + "blocks": page_blocksort, + "layout": page_layout, + } + for pno in pagel: + page = doc[pno - 1] + func[args.mode]( + page, + textout, + args.grid, + args.fontsize, + args.noformfeed, + args.skip_empty, + flags=flags, + ) + + +def _internal(args): + pymupdf.message('This is from PyMuPDF message().') + pymupdf.log('This is from PyMuPDF log().') + +def main(): + """Define command configurations.""" + parser = argparse.ArgumentParser( + prog="pymupdf", + description=mycenter("Basic PyMuPDF Functions"), + ) + subps = parser.add_subparsers( + title="Subcommands", help="Enter 'command -h' for subcommand specific help" + ) + + # ------------------------------------------------------------------------- + # 'show' command + # ------------------------------------------------------------------------- + ps_show = subps.add_parser("show", description=mycenter("display PDF information")) + ps_show.add_argument("input", type=str, help="PDF filename") + ps_show.add_argument("-password", help="password") + ps_show.add_argument("-catalog", action="store_true", help="show PDF catalog") + ps_show.add_argument("-trailer", action="store_true", help="show PDF trailer") + ps_show.add_argument("-metadata", action="store_true", help="show PDF metadata") + ps_show.add_argument( + "-xrefs", type=str, help="show selected objects, format: 1,5-7,N" + ) + ps_show.add_argument( + "-pages", type=str, help="show selected pages, format: 1,5-7,50-N" + ) + ps_show.set_defaults(func=show) + + # ------------------------------------------------------------------------- + # 'clean' command + # ------------------------------------------------------------------------- + ps_clean = subps.add_parser( + "clean", description=mycenter("optimize PDF, or create sub-PDF if pages given") + ) + ps_clean.add_argument("input", type=str, help="PDF filename") + ps_clean.add_argument("output", type=str, help="output PDF filename") + ps_clean.add_argument("-password", help="password") + + ps_clean.add_argument( + "-encryption", + help="encryption method", + 
choices=("keep", "none", "rc4-40", "rc4-128", "aes-128", "aes-256"), + default="none", + ) + + ps_clean.add_argument("-owner", type=str, help="owner password") + ps_clean.add_argument("-user", type=str, help="user password") + + ps_clean.add_argument( + "-garbage", + type=int, + help="garbage collection level", + choices=range(5), + default=0, + ) + + ps_clean.add_argument( + "-compress", + action="store_true", + default=False, + help="compress (deflate) output", + ) + + ps_clean.add_argument( + "-ascii", action="store_true", default=False, help="ASCII encode binary data" + ) + + ps_clean.add_argument( + "-linear", + action="store_true", + default=False, + help="format for fast web display", + ) + + ps_clean.add_argument( + "-permission", type=int, default=-1, help="integer with permission levels" + ) + + ps_clean.add_argument( + "-sanitize", + action="store_true", + default=False, + help="sanitize / clean contents", + ) + ps_clean.add_argument( + "-pretty", action="store_true", default=False, help="prettify PDF structure" + ) + ps_clean.add_argument( + "-pages", help="output selected pages pages, format: 1,5-7,50-N" + ) + ps_clean.set_defaults(func=clean) + + # ------------------------------------------------------------------------- + # 'join' command + # ------------------------------------------------------------------------- + ps_join = subps.add_parser( + "join", + description=mycenter("join PDF documents"), + epilog="specify each input as 'filename[,password[,pages]]'", + ) + ps_join.add_argument("input", nargs="*", help="input filenames") + ps_join.add_argument("-output", required=True, help="output filename") + ps_join.set_defaults(func=doc_join) + + # ------------------------------------------------------------------------- + # 'extract' command + # ------------------------------------------------------------------------- + ps_extract = subps.add_parser( + "extract", description=mycenter("extract images and fonts to disk") + ) + ps_extract.add_argument("input", type=str, help="PDF filename") + ps_extract.add_argument("-images", action="store_true", help="extract images") + ps_extract.add_argument("-fonts", action="store_true", help="extract fonts") + ps_extract.add_argument( + "-output", help="folder to receive output, defaults to current" + ) + ps_extract.add_argument("-password", help="password") + ps_extract.add_argument( + "-pages", type=str, help="consider these pages only, format: 1,5-7,50-N" + ) + ps_extract.set_defaults(func=extract_objects) + + # ------------------------------------------------------------------------- + # 'embed-info' + # ------------------------------------------------------------------------- + ps_show = subps.add_parser( + "embed-info", description=mycenter("list embedded files") + ) + ps_show.add_argument("input", help="PDF filename") + ps_show.add_argument("-name", help="if given, report only this one") + ps_show.add_argument("-detail", action="store_true", help="detail information") + ps_show.add_argument("-password", help="password") + ps_show.set_defaults(func=embedded_list) + + # ------------------------------------------------------------------------- + # 'embed-add' command + # ------------------------------------------------------------------------- + ps_embed_add = subps.add_parser( + "embed-add", description=mycenter("add embedded file") + ) + ps_embed_add.add_argument("input", help="PDF filename") + ps_embed_add.add_argument("-password", help="password") + ps_embed_add.add_argument( + "-output", help="output PDF filename, incremental save 
if none" + ) + ps_embed_add.add_argument("-name", required=True, help="name of new entry") + ps_embed_add.add_argument("-path", required=True, help="path to data for new entry") + ps_embed_add.add_argument("-desc", help="description of new entry") + ps_embed_add.set_defaults(func=embedded_add) + + # ------------------------------------------------------------------------- + # 'embed-del' command + # ------------------------------------------------------------------------- + ps_embed_del = subps.add_parser( + "embed-del", description=mycenter("delete embedded file") + ) + ps_embed_del.add_argument("input", help="PDF filename") + ps_embed_del.add_argument("-password", help="password") + ps_embed_del.add_argument( + "-output", help="output PDF filename, incremental save if none" + ) + ps_embed_del.add_argument("-name", required=True, help="name of entry to delete") + ps_embed_del.set_defaults(func=embedded_del) + + # ------------------------------------------------------------------------- + # 'embed-upd' command + # ------------------------------------------------------------------------- + ps_embed_upd = subps.add_parser( + "embed-upd", + description=mycenter("update embedded file"), + epilog="except '-name' all parameters are optional", + ) + ps_embed_upd.add_argument("input", help="PDF filename") + ps_embed_upd.add_argument("-name", required=True, help="name of entry") + ps_embed_upd.add_argument("-password", help="password") + ps_embed_upd.add_argument( + "-output", help="Output PDF filename, incremental save if none" + ) + ps_embed_upd.add_argument("-path", help="path to new data for entry") + ps_embed_upd.add_argument("-filename", help="new filename to store in entry") + ps_embed_upd.add_argument( + "-ufilename", help="new unicode filename to store in entry" + ) + ps_embed_upd.add_argument("-desc", help="new description to store in entry") + ps_embed_upd.set_defaults(func=embedded_upd) + + # ------------------------------------------------------------------------- + # 'embed-extract' command + # ------------------------------------------------------------------------- + ps_embed_extract = subps.add_parser( + "embed-extract", description=mycenter("extract embedded file to disk") + ) + ps_embed_extract.add_argument("input", type=str, help="PDF filename") + ps_embed_extract.add_argument("-name", required=True, help="name of entry") + ps_embed_extract.add_argument("-password", help="password") + ps_embed_extract.add_argument( + "-output", help="output filename, default is stored name" + ) + ps_embed_extract.set_defaults(func=embedded_get) + + # ------------------------------------------------------------------------- + # 'embed-copy' command + # ------------------------------------------------------------------------- + ps_embed_copy = subps.add_parser( + "embed-copy", description=mycenter("copy embedded files between PDFs") + ) + ps_embed_copy.add_argument("input", type=str, help="PDF to receive embedded files") + ps_embed_copy.add_argument("-password", help="password of input") + ps_embed_copy.add_argument( + "-output", help="output PDF, incremental save to 'input' if omitted" + ) + ps_embed_copy.add_argument( + "-source", required=True, help="copy embedded files from here" + ) + ps_embed_copy.add_argument("-pwdsource", help="password of 'source' PDF") + ps_embed_copy.add_argument( + "-name", nargs="*", help="restrict copy to these entries" + ) + ps_embed_copy.set_defaults(func=embedded_copy) + + # ------------------------------------------------------------------------- + # 'textlayout' 
command + # ------------------------------------------------------------------------- + ps_gettext = subps.add_parser( + "gettext", description=mycenter("extract text in various formatting modes") + ) + ps_gettext.add_argument("input", type=str, help="input document filename") + ps_gettext.add_argument("-password", help="password for input document") + ps_gettext.add_argument( + "-mode", + type=str, + help="mode: simple, block sort, or layout (default)", + choices=("simple", "blocks", "layout"), + default="layout", + ) + ps_gettext.add_argument( + "-pages", + type=str, + help="select pages, format: 1,5-7,50-N", + default="1-N", + ) + ps_gettext.add_argument( + "-noligatures", + action="store_true", + help="expand ligature characters (default False)", + default=False, + ) + ps_gettext.add_argument( + "-convert-white", + action="store_true", + help="convert whitespace characters to white (default False)", + default=False, + ) + ps_gettext.add_argument( + "-extra-spaces", + action="store_true", + help="fill gaps with spaces (default False)", + default=False, + ) + ps_gettext.add_argument( + "-noformfeed", + action="store_true", + help="write linefeeds, no formfeeds (default False)", + default=False, + ) + ps_gettext.add_argument( + "-skip-empty", + action="store_true", + help="suppress pages with no text (default False)", + default=False, + ) + ps_gettext.add_argument( + "-output", + help="store text in this file (default inputfilename.txt)", + ) + ps_gettext.add_argument( + "-grid", + type=float, + help="merge lines if closer than this (default 2)", + default=2, + ) + ps_gettext.add_argument( + "-fontsize", + type=float, + help="only include text with a larger fontsize (default 3)", + default=3, + ) + ps_gettext.set_defaults(func=gettext) + + # ------------------------------------------------------------------------- + # '_internal' command + # ------------------------------------------------------------------------- + ps_internal = subps.add_parser( + "internal", description=mycenter("internal testing") + ) + ps_internal.set_defaults(func=_internal) + + # ------------------------------------------------------------------------- + # start program + # ------------------------------------------------------------------------- + args = parser.parse_args() # create parameter arguments class + if not hasattr(args, "func"): # no function selected + parser.print_help() # so print top level help + else: + args.func(args) # execute requested command + + +if __name__ == "__main__": + main() diff -r 000000000000 -r 1d09e1dec1d9 src/_apply_pages.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/_apply_pages.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,253 @@ +import multiprocessing +import os +import time + +import pymupdf + + +# Support for concurrent processing of document pages. 
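A minimal sketch of the concurrency pattern this helper module provides, using only the public pymupdf API (the file name is a placeholder): each worker process opens its own Document from the path, because document handles cannot be shared across processes, and then handles one page per task. The module below additionally caches the Document per worker and offers a fork-based fallback; this sketch simply reopens the file per page for brevity.

    import multiprocessing

    import pymupdf

    def extract_page_text(args):
        path, page_number = args
        doc = pymupdf.Document(path)   # open independently in this process
        return doc[page_number].get_text()

    if __name__ == "__main__":
        path = "example.pdf"           # placeholder input file
        n = pymupdf.Document(path).page_count
        with multiprocessing.Pool() as pool:
            texts = pool.map(extract_page_text, [(path, i) for i in range(n)])
        print(sum(len(t) for t in texts), "characters extracted")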
+# + +class _worker_State: + pass +_worker_state = _worker_State() + + +def _worker_init( + path, + initfn, + initfn_args, + initfn_kwargs, + pagefn, + pagefn_args, + pagefn_kwargs, + stats, + ): + # pylint: disable=attribute-defined-outside-init + _worker_state.path = path + _worker_state.pagefn = pagefn + _worker_state.pagefn_args = pagefn_args + _worker_state.pagefn_kwargs = pagefn_kwargs + _worker_state.stats = stats + _worker_state.document = None + if initfn: + initfn(*initfn_args, **initfn_kwargs) + + +def _stats_write(t, label): + t = time.time() - t + if t >= 10: + pymupdf.log(f'{os.getpid()=}: {t:2f}s: {label}.') + + +def _worker_fn(page_number): + # Create Document from filename if we haven't already done so. + if not _worker_state.document: + if _worker_state.stats: + t = time.time() + _worker_state.document = pymupdf.Document(_worker_state.path) # pylint: disable=attribute-defined-outside-init + if _worker_state.stats: + _stats_write(t, 'pymupdf.Document()') + + if _worker_state.stats: + t = time.time() + page = _worker_state.document[page_number] + if _worker_state.stats: + _stats_write(t, '_worker_state.document[page_number]') + + if _worker_state.stats: + t = time.time() + ret = _worker_state.pagefn( + page, + *_worker_state.pagefn_args, + **_worker_state.pagefn_kwargs, + ) + if _worker_state.stats: + _stats_write(t, '_worker_state.pagefn()') + + return ret + + +def _multiprocessing( + path, + pages, + pagefn, + pagefn_args, + pagefn_kwargs, + initfn, + initfn_args, + initfn_kwargs, + concurrency, + stats, + ): + #print(f'_worker_mp(): {concurrency=}', flush=1) + with multiprocessing.Pool( + concurrency, + _worker_init, + ( + path, + initfn, initfn_args, initfn_kwargs, + pagefn, pagefn_args, pagefn_kwargs, + stats, + ), + ) as pool: + result = pool.map_async(_worker_fn, pages) + return result.get() + + +def _fork( + path, + pages, + pagefn, + pagefn_args, + pagefn_kwargs, + initfn, + initfn_args, + initfn_kwargs, + concurrency, + stats, + ): + verbose = 0 + if concurrency is None: + concurrency = multiprocessing.cpu_count() + # We write page numbers to `queue_down` and read `(page_num, text)` from + # `queue_up`. Workers each repeatedly read the next available page number + # from `queue_down`, extract the text and write it onto `queue_up`. + # + # This is better than pre-allocating a subset of pages to each worker + # because it ensures there will never be idle workers until we are near the + # end with fewer pages left than workers. 
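# (Editor's illustration, not part of the module: the load-balancing idea
# described in the comment above, reduced to a self-contained runnable
# sketch with dummy work. Idle workers pull the next id from the shared
# queue; one None sentinel per worker ends the run.)
import multiprocessing as _mp

def _demo_worker(q_down, q_up):
    while True:
        item = q_down.get()
        if item is None:                   # sentinel: no more work
            break
        q_up.put((item, item * item))      # pretend this is the per-page result

if __name__ == "__main__":
    q_down, q_up = _mp.Queue(), _mp.Queue()
    workers = [
        _mp.Process(target=_demo_worker, args=(q_down, q_up))
        for _ in range(3)
    ]
    for w in workers:
        w.start()
    for i in range(10):
        q_down.put(i)                      # whichever worker is free takes it
    results = dict(q_up.get() for _ in range(10))
    for _ in workers:
        q_down.put(None)                   # one sentinel per worker
    for w in workers:
        w.join()
    print(results)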
+ # + queue_down = multiprocessing.Queue() + queue_up = multiprocessing.Queue() + def childfn(): + document = None + if verbose: + pymupdf.log(f'{os.getpid()=}: {initfn=} {initfn_args=}') + _worker_init( + path, + initfn, + initfn_args, + initfn_kwargs, + pagefn, + pagefn_args, + pagefn_kwargs, + stats, + ) + while 1: + if verbose: + pymupdf.log(f'{os.getpid()=}: calling get().') + page_num = queue_down.get() + if verbose: + pymupdf.log(f'{os.getpid()=}: {page_num=}.') + if page_num is None: + break + try: + if not document: + if stats: + t = time.time() + document = pymupdf.Document(path) + if stats: + _stats_write(t, 'pymupdf.Document(path)') + + if stats: + t = time.time() + page = document[page_num] + if stats: + _stats_write(t, 'document[page_num]') + + if verbose: + pymupdf.log(f'{os.getpid()=}: {_worker_state=}') + + if stats: + t = time.time() + ret = pagefn( + page, + *_worker_state.pagefn_args, + **_worker_state.pagefn_kwargs, + ) + if stats: + _stats_write(t, f'{page_num=} pagefn()') + except Exception as e: + if verbose: pymupdf.log(f'{os.getpid()=}: exception {e=}') + ret = e + if verbose: + pymupdf.log(f'{os.getpid()=}: sending {page_num=} {ret=}') + + queue_up.put( (page_num, ret) ) + + error = None + + pids = list() + try: + # Start child processes. + if stats: + t = time.time() + for i in range(concurrency): + p = os.fork() # pylint: disable=no-member + if p == 0: + # Child process. + try: + try: + childfn() + except Exception as e: + pymupdf.log(f'{os.getpid()=}: childfn() => {e=}') + raise + finally: + if verbose: + pymupdf.log(f'{os.getpid()=}: calling os._exit(0)') + os._exit(0) + pids.append(p) + if stats: + _stats_write(t, 'create child processes') + + # Send page numbers. + if stats: + t = time.time() + if verbose: + pymupdf.log(f'Sending page numbers.') + for page_num in range(len(pages)): + queue_down.put(page_num) + if stats: + _stats_write(t, 'Send page numbers') + + # Collect results. We give up if any worker sends an exception instead + # of text, but this hasn't been tested. + ret = [None] * len(pages) + for i in range(len(pages)): + page_num, text = queue_up.get() + if verbose: + pymupdf.log(f'{page_num=} {len(text)=}') + assert ret[page_num] is None + if isinstance(text, Exception): + if not error: + error = text + break + ret[page_num] = text + + # Close queue. This should cause exception in workers and terminate + # them, but on macos-arm64 this does not seem to happen, so we also + # send None, which makes workers terminate. + for i in range(concurrency): + queue_down.put(None) + if verbose: pymupdf.log(f'Closing queues.') + queue_down.close() + + if error: + raise error + if verbose: + pymupdf.log(f'After concurrent, returning {len(ret)=}') + return ret + + finally: + # Join all child processes. 
+ if stats: + t = time.time() + for pid in pids: + if verbose: + pymupdf.log(f'waiting for {pid=}.') + e = os.waitpid(pid, 0) + if verbose: + pymupdf.log(f'{pid=} => {e=}') + if stats: + _stats_write(t, 'Join all child proceses') diff -r 000000000000 -r 1d09e1dec1d9 src/_wxcolors.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/_wxcolors.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,562 @@ +_wxcolors = [ + ("ALICEBLUE", 240, 248, 255), + ("ANTIQUEWHITE", 250, 235, 215), + ("ANTIQUEWHITE1", 255, 239, 219), + ("ANTIQUEWHITE2", 238, 223, 204), + ("ANTIQUEWHITE3", 205, 192, 176), + ("ANTIQUEWHITE4", 139, 131, 120), + ("AQUA", 0, 255, 255), + ("AQUAMARINE", 127, 255, 212), + ("AQUAMARINE1", 127, 255, 212), + ("AQUAMARINE2", 118, 238, 198), + ("AQUAMARINE3", 102, 205, 170), + ("AQUAMARINE4", 69, 139, 116), + ("AZURE", 240, 255, 255), + ("AZURE1", 240, 255, 255), + ("AZURE2", 224, 238, 238), + ("AZURE3", 193, 205, 205), + ("AZURE4", 131, 139, 139), + ("BEIGE", 245, 245, 220), + ("BISQUE", 255, 228, 196), + ("BISQUE1", 255, 228, 196), + ("BISQUE2", 238, 213, 183), + ("BISQUE3", 205, 183, 158), + ("BISQUE4", 139, 125, 107), + ("BLACK", 0, 0, 0), + ("BLANCHEDALMOND", 255, 235, 205), + ("BLUE", 0, 0, 255), + ("BLUE1", 0, 0, 255), + ("BLUE2", 0, 0, 238), + ("BLUE3", 0, 0, 205), + ("BLUE4", 0, 0, 139), + ("BLUEVIOLET", 138, 43, 226), + ("BROWN", 165, 42, 42), + ("BROWN1", 255, 64, 64), + ("BROWN2", 238, 59, 59), + ("BROWN3", 205, 51, 51), + ("BROWN4", 139, 35, 35), + ("BURLYWOOD", 222, 184, 135), + ("BURLYWOOD1", 255, 211, 155), + ("BURLYWOOD2", 238, 197, 145), + ("BURLYWOOD3", 205, 170, 125), + ("BURLYWOOD4", 139, 115, 85), + ("CADETBLUE", 95, 158, 160), + ("CADETBLUE1", 152, 245, 255), + ("CADETBLUE2", 142, 229, 238), + ("CADETBLUE3", 122, 197, 205), + ("CADETBLUE4", 83, 134, 139), + ("CHARTREUSE", 127, 255, 0), + ("CHARTREUSE1", 127, 255, 0), + ("CHARTREUSE2", 118, 238, 0), + ("CHARTREUSE3", 102, 205, 0), + ("CHARTREUSE4", 69, 139, 0), + ("CHOCOLATE", 210, 105, 30), + ("CHOCOLATE1", 255, 127, 36), + ("CHOCOLATE2", 238, 118, 33), + ("CHOCOLATE3", 205, 102, 29), + ("CHOCOLATE4", 139, 69, 19), + ("COFFEE", 156, 79, 0), + ("CORAL", 255, 127, 80), + ("CORAL1", 255, 114, 86), + ("CORAL2", 238, 106, 80), + ("CORAL3", 205, 91, 69), + ("CORAL4", 139, 62, 47), + ("CORNFLOWERBLUE", 100, 149, 237), + ("CORNSILK", 255, 248, 220), + ("CORNSILK1", 255, 248, 220), + ("CORNSILK2", 238, 232, 205), + ("CORNSILK3", 205, 200, 177), + ("CORNSILK4", 139, 136, 120), + ("CRIMSON", 220, 20, 60), + ("CYAN", 0, 255, 255), + ("CYAN1", 0, 255, 255), + ("CYAN2", 0, 238, 238), + ("CYAN3", 0, 205, 205), + ("CYAN4", 0, 139, 139), + ("DARKBLUE", 0, 0, 139), + ("DARKCYAN", 0, 139, 139), + ("DARKGOLDENROD", 184, 134, 11), + ("DARKGOLDENROD1", 255, 185, 15), + ("DARKGOLDENROD2", 238, 173, 14), + ("DARKGOLDENROD3", 205, 149, 12), + ("DARKGOLDENROD4", 139, 101, 8), + ("DARKGRAY", 169, 169, 169), + ("DARKGREEN", 0, 100, 0), + ("DARKGREY", 169, 169, 169), + ("DARKKHAKI", 189, 183, 107), + ("DARKMAGENTA", 139, 0, 139), + ("DARKOLIVEGREEN", 85, 107, 47), + ("DARKOLIVEGREEN1", 202, 255, 112), + ("DARKOLIVEGREEN2", 188, 238, 104), + ("DARKOLIVEGREEN3", 162, 205, 90), + ("DARKOLIVEGREEN4", 110, 139, 61), + ("DARKORANGE", 255, 140, 0), + ("DARKORANGE1", 255, 127, 0), + ("DARKORANGE2", 238, 118, 0), + ("DARKORANGE3", 205, 102, 0), + ("DARKORANGE4", 139, 69, 0), + ("DARKORCHID", 153, 50, 204), + ("DARKORCHID1", 191, 62, 255), + ("DARKORCHID2", 178, 58, 238), + ("DARKORCHID3", 154, 50, 205), + ("DARKORCHID4", 104, 34, 139), + ("DARKRED", 
139, 0, 0), + ("DARKSALMON", 233, 150, 122), + ("DARKSEAGREEN", 143, 188, 143), + ("DARKSEAGREEN1", 193, 255, 193), + ("DARKSEAGREEN2", 180, 238, 180), + ("DARKSEAGREEN3", 155, 205, 155), + ("DARKSEAGREEN4", 105, 139, 105), + ("DARKSLATEBLUE", 72, 61, 139), + ("DARKSLATEGRAY", 47, 79, 79), + ("DARKSLATEGREY", 47, 79, 79), + ("DARKTURQUOISE", 0, 206, 209), + ("DARKVIOLET", 148, 0, 211), + ("DEEPPINK", 255, 20, 147), + ("DEEPPINK1", 255, 20, 147), + ("DEEPPINK2", 238, 18, 137), + ("DEEPPINK3", 205, 16, 118), + ("DEEPPINK4", 139, 10, 80), + ("DEEPSKYBLUE", 0, 191, 255), + ("DEEPSKYBLUE1", 0, 191, 255), + ("DEEPSKYBLUE2", 0, 178, 238), + ("DEEPSKYBLUE3", 0, 154, 205), + ("DEEPSKYBLUE4", 0, 104, 139), + ("DIMGRAY", 105, 105, 105), + ("DIMGREY", 105, 105, 105), + ("DODGERBLUE", 30, 144, 255), + ("DODGERBLUE1", 30, 144, 255), + ("DODGERBLUE2", 28, 134, 238), + ("DODGERBLUE3", 24, 116, 205), + ("DODGERBLUE4", 16, 78, 139), + ("FIREBRICK", 178, 34, 34), + ("FIREBRICK1", 255, 48, 48), + ("FIREBRICK2", 238, 44, 44), + ("FIREBRICK3", 205, 38, 38), + ("FIREBRICK4", 139, 26, 26), + ("FLORALWHITE", 255, 250, 240), + ("FORESTGREEN", 34, 139, 34), + ("FUCHSIA", 255, 0, 255), + ("GAINSBORO", 220, 220, 220), + ("GHOSTWHITE", 248, 248, 255), + ("GOLD", 255, 215, 0), + ("GOLD1", 255, 215, 0), + ("GOLD2", 238, 201, 0), + ("GOLD3", 205, 173, 0), + ("GOLD4", 139, 117, 0), + ("GOLDENROD", 218, 165, 32), + ("GOLDENROD1", 255, 193, 37), + ("GOLDENROD2", 238, 180, 34), + ("GOLDENROD3", 205, 155, 29), + ("GOLDENROD4", 139, 105, 20), + ("GRAY", 190, 190, 190), + ("GRAY0", 0, 0, 0), + ("GRAY1", 3, 3, 3), + ("GRAY10", 26, 26, 26), + ("GRAY100", 255, 255, 255), + ("GRAY11", 28, 28, 28), + ("GRAY12", 31, 31, 31), + ("GRAY13", 33, 33, 33), + ("GRAY14", 36, 36, 36), + ("GRAY15", 38, 38, 38), + ("GRAY16", 41, 41, 41), + ("GRAY17", 43, 43, 43), + ("GRAY18", 46, 46, 46), + ("GRAY19", 48, 48, 48), + ("GRAY2", 5, 5, 5), + ("GRAY20", 51, 51, 51), + ("GRAY21", 54, 54, 54), + ("GRAY22", 56, 56, 56), + ("GRAY23", 59, 59, 59), + ("GRAY24", 61, 61, 61), + ("GRAY25", 64, 64, 64), + ("GRAY26", 66, 66, 66), + ("GRAY27", 69, 69, 69), + ("GRAY28", 71, 71, 71), + ("GRAY29", 74, 74, 74), + ("GRAY3", 8, 8, 8), + ("GRAY30", 77, 77, 77), + ("GRAY31", 79, 79, 79), + ("GRAY32", 82, 82, 82), + ("GRAY33", 84, 84, 84), + ("GRAY34", 87, 87, 87), + ("GRAY35", 89, 89, 89), + ("GRAY36", 92, 92, 92), + ("GRAY37", 94, 94, 94), + ("GRAY38", 97, 97, 97), + ("GRAY39", 99, 99, 99), + ("GRAY4", 10, 10, 10), + ("GRAY40", 102, 102, 102), + ("GRAY41", 105, 105, 105), + ("GRAY42", 107, 107, 107), + ("GRAY43", 110, 110, 110), + ("GRAY44", 112, 112, 112), + ("GRAY45", 115, 115, 115), + ("GRAY46", 117, 117, 117), + ("GRAY47", 120, 120, 120), + ("GRAY48", 122, 122, 122), + ("GRAY49", 125, 125, 125), + ("GRAY5", 13, 13, 13), + ("GRAY50", 127, 127, 127), + ("GRAY51", 130, 130, 130), + ("GRAY52", 133, 133, 133), + ("GRAY53", 135, 135, 135), + ("GRAY54", 138, 138, 138), + ("GRAY55", 140, 140, 140), + ("GRAY56", 143, 143, 143), + ("GRAY57", 145, 145, 145), + ("GRAY58", 148, 148, 148), + ("GRAY59", 150, 150, 150), + ("GRAY6", 15, 15, 15), + ("GRAY60", 153, 153, 153), + ("GRAY61", 156, 156, 156), + ("GRAY62", 158, 158, 158), + ("GRAY63", 161, 161, 161), + ("GRAY64", 163, 163, 163), + ("GRAY65", 166, 166, 166), + ("GRAY66", 168, 168, 168), + ("GRAY67", 171, 171, 171), + ("GRAY68", 173, 173, 173), + ("GRAY69", 176, 176, 176), + ("GRAY7", 18, 18, 18), + ("GRAY70", 179, 179, 179), + ("GRAY71", 181, 181, 181), + ("GRAY72", 184, 184, 184), + ("GRAY73", 186, 186, 186), + ("GRAY74", 
189, 189, 189), + ("GRAY75", 191, 191, 191), + ("GRAY76", 194, 194, 194), + ("GRAY77", 196, 196, 196), + ("GRAY78", 199, 199, 199), + ("GRAY79", 201, 201, 201), + ("GRAY8", 20, 20, 20), + ("GRAY80", 204, 204, 204), + ("GRAY81", 207, 207, 207), + ("GRAY82", 209, 209, 209), + ("GRAY83", 212, 212, 212), + ("GRAY84", 214, 214, 214), + ("GRAY85", 217, 217, 217), + ("GRAY86", 219, 219, 219), + ("GRAY87", 222, 222, 222), + ("GRAY88", 224, 224, 224), + ("GRAY89", 227, 227, 227), + ("GRAY9", 23, 23, 23), + ("GRAY90", 229, 229, 229), + ("GRAY91", 232, 232, 232), + ("GRAY92", 235, 235, 235), + ("GRAY93", 237, 237, 237), + ("GRAY94", 240, 240, 240), + ("GRAY95", 242, 242, 242), + ("GRAY96", 245, 245, 245), + ("GRAY97", 247, 247, 247), + ("GRAY98", 250, 250, 250), + ("GRAY99", 252, 252, 252), + ("GREEN YELLOW", 173, 255, 47), + ("GREEN", 0, 255, 0), + ("GREEN1", 0, 255, 0), + ("GREEN2", 0, 238, 0), + ("GREEN3", 0, 205, 0), + ("GREEN4", 0, 139, 0), + ("GREENYELLOW", 173, 255, 47), + ("GREY", 128, 128, 128), + ("HONEYDEW", 240, 255, 240), + ("HONEYDEW1", 240, 255, 240), + ("HONEYDEW2", 224, 238, 224), + ("HONEYDEW3", 193, 205, 193), + ("HONEYDEW4", 131, 139, 131), + ("HOTPINK", 255, 105, 180), + ("HOTPINK1", 255, 110, 180), + ("HOTPINK2", 238, 106, 167), + ("HOTPINK3", 205, 96, 144), + ("HOTPINK4", 139, 58, 98), + ("INDIANRED", 205, 92, 92), + ("INDIANRED1", 255, 106, 106), + ("INDIANRED2", 238, 99, 99), + ("INDIANRED3", 205, 85, 85), + ("INDIANRED4", 139, 58, 58), + ("INDIGO", 75, 0, 130), + ("IVORY", 255, 255, 240), + ("IVORY1", 255, 255, 240), + ("IVORY2", 238, 238, 224), + ("IVORY3", 205, 205, 193), + ("IVORY4", 139, 139, 131), + ("KHAKI", 240, 230, 140), + ("KHAKI1", 255, 246, 143), + ("KHAKI2", 238, 230, 133), + ("KHAKI3", 205, 198, 115), + ("KHAKI4", 139, 134, 78), + ("LAVENDER", 230, 230, 250), + ("LAVENDERBLUSH", 255, 240, 245), + ("LAVENDERBLUSH1", 255, 240, 245), + ("LAVENDERBLUSH2", 238, 224, 229), + ("LAVENDERBLUSH3", 205, 193, 197), + ("LAVENDERBLUSH4", 139, 131, 134), + ("LAWNGREEN", 124, 252, 0), + ("LEMONCHIFFON", 255, 250, 205), + ("LEMONCHIFFON1", 255, 250, 205), + ("LEMONCHIFFON2", 238, 233, 191), + ("LEMONCHIFFON3", 205, 201, 165), + ("LEMONCHIFFON4", 139, 137, 112), + ("LIGHTBLUE", 173, 216, 230), + ("LIGHTBLUE1", 191, 239, 255), + ("LIGHTBLUE2", 178, 223, 238), + ("LIGHTBLUE3", 154, 192, 205), + ("LIGHTBLUE4", 104, 131, 139), + ("LIGHTCORAL", 240, 128, 128), + ("LIGHTCYAN", 224, 255, 255), + ("LIGHTCYAN1", 224, 255, 255), + ("LIGHTCYAN2", 209, 238, 238), + ("LIGHTCYAN3", 180, 205, 205), + ("LIGHTCYAN4", 122, 139, 139), + ("LIGHTGOLDENROD", 238, 221, 130), + ("LIGHTGOLDENROD1", 255, 236, 139), + ("LIGHTGOLDENROD2", 238, 220, 130), + ("LIGHTGOLDENROD3", 205, 190, 112), + ("LIGHTGOLDENROD4", 139, 129, 76), + ("LIGHTGOLDENRODYELLOW", 250, 250, 210), + ("LIGHTGRAY", 211, 211, 211), + ("LIGHTGREEN", 144, 238, 144), + ("LIGHTGREY", 211, 211, 211), + ("LIGHTPINK", 255, 182, 193), + ("LIGHTPINK1", 255, 174, 185), + ("LIGHTPINK2", 238, 162, 173), + ("LIGHTPINK3", 205, 140, 149), + ("LIGHTPINK4", 139, 95, 101), + ("LIGHTSALMON", 255, 160, 122), + ("LIGHTSALMON1", 255, 160, 122), + ("LIGHTSALMON2", 238, 149, 114), + ("LIGHTSALMON3", 205, 129, 98), + ("LIGHTSALMON4", 139, 87, 66), + ("LIGHTSEAGREEN", 32, 178, 170), + ("LIGHTSKYBLUE", 135, 206, 250), + ("LIGHTSKYBLUE1", 176, 226, 255), + ("LIGHTSKYBLUE2", 164, 211, 238), + ("LIGHTSKYBLUE3", 141, 182, 205), + ("LIGHTSKYBLUE4", 96, 123, 139), + ("LIGHTSLATEBLUE", 132, 112, 255), + ("LIGHTSLATEGRAY", 119, 136, 153), + ("LIGHTSLATEGREY", 119, 136, 
153), + ("LIGHTSTEELBLUE", 176, 196, 222), + ("LIGHTSTEELBLUE1", 202, 225, 255), + ("LIGHTSTEELBLUE2", 188, 210, 238), + ("LIGHTSTEELBLUE3", 162, 181, 205), + ("LIGHTSTEELBLUE4", 110, 123, 139), + ("LIGHTYELLOW", 255, 255, 224), + ("LIGHTYELLOW1", 255, 255, 224), + ("LIGHTYELLOW2", 238, 238, 209), + ("LIGHTYELLOW3", 205, 205, 180), + ("LIGHTYELLOW4", 139, 139, 122), + ("LIME", 0, 255, 0), + ("LIMEGREEN", 50, 205, 50), + ("LINEN", 250, 240, 230), + ("MAGENTA", 255, 0, 255), + ("MAGENTA1", 255, 0, 255), + ("MAGENTA2", 238, 0, 238), + ("MAGENTA3", 205, 0, 205), + ("MAGENTA4", 139, 0, 139), + ("MAROON", 176, 48, 96), + ("MAROON1", 255, 52, 179), + ("MAROON2", 238, 48, 167), + ("MAROON3", 205, 41, 144), + ("MAROON4", 139, 28, 98), + ("MEDIUMAQUAMARINE", 102, 205, 170), + ("MEDIUMBLUE", 0, 0, 205), + ("MEDIUMORCHID", 186, 85, 211), + ("MEDIUMORCHID1", 224, 102, 255), + ("MEDIUMORCHID2", 209, 95, 238), + ("MEDIUMORCHID3", 180, 82, 205), + ("MEDIUMORCHID4", 122, 55, 139), + ("MEDIUMPURPLE", 147, 112, 219), + ("MEDIUMPURPLE1", 171, 130, 255), + ("MEDIUMPURPLE2", 159, 121, 238), + ("MEDIUMPURPLE3", 137, 104, 205), + ("MEDIUMPURPLE4", 93, 71, 139), + ("MEDIUMSEAGREEN", 60, 179, 113), + ("MEDIUMSLATEBLUE", 123, 104, 238), + ("MEDIUMSPRINGGREEN", 0, 250, 154), + ("MEDIUMTURQUOISE", 72, 209, 204), + ("MEDIUMVIOLETRED", 199, 21, 133), + ("MIDNIGHTBLUE", 25, 25, 112), + ("MINTCREAM", 245, 255, 250), + ("MISTYROSE", 255, 228, 225), + ("MISTYROSE1", 255, 228, 225), + ("MISTYROSE2", 238, 213, 210), + ("MISTYROSE3", 205, 183, 181), + ("MISTYROSE4", 139, 125, 123), + ("MOCCASIN", 255, 228, 181), + ("MUPDFBLUE", 37, 114, 172), + ("NAVAJOWHITE", 255, 222, 173), + ("NAVAJOWHITE1", 255, 222, 173), + ("NAVAJOWHITE2", 238, 207, 161), + ("NAVAJOWHITE3", 205, 179, 139), + ("NAVAJOWHITE4", 139, 121, 94), + ("NAVY", 0, 0, 128), + ("NAVYBLUE", 0, 0, 128), + ("OLDLACE", 253, 245, 230), + ("OLIVE", 128, 128, 0), + ("OLIVEDRAB", 107, 142, 35), + ("OLIVEDRAB1", 192, 255, 62), + ("OLIVEDRAB2", 179, 238, 58), + ("OLIVEDRAB3", 154, 205, 50), + ("OLIVEDRAB4", 105, 139, 34), + ("ORANGE", 255, 165, 0), + ("ORANGE1", 255, 165, 0), + ("ORANGE2", 238, 154, 0), + ("ORANGE3", 205, 133, 0), + ("ORANGE4", 139, 90, 0), + ("ORANGERED", 255, 69, 0), + ("ORANGERED1", 255, 69, 0), + ("ORANGERED2", 238, 64, 0), + ("ORANGERED3", 205, 55, 0), + ("ORANGERED4", 139, 37, 0), + ("ORCHID", 218, 112, 214), + ("ORCHID1", 255, 131, 250), + ("ORCHID2", 238, 122, 233), + ("ORCHID3", 205, 105, 201), + ("ORCHID4", 139, 71, 137), + ("PALEGOLDENROD", 238, 232, 170), + ("PALEGREEN", 152, 251, 152), + ("PALEGREEN1", 154, 255, 154), + ("PALEGREEN2", 144, 238, 144), + ("PALEGREEN3", 124, 205, 124), + ("PALEGREEN4", 84, 139, 84), + ("PALETURQUOISE", 175, 238, 238), + ("PALETURQUOISE1", 187, 255, 255), + ("PALETURQUOISE2", 174, 238, 238), + ("PALETURQUOISE3", 150, 205, 205), + ("PALETURQUOISE4", 102, 139, 139), + ("PALEVIOLETRED", 219, 112, 147), + ("PALEVIOLETRED1", 255, 130, 171), + ("PALEVIOLETRED2", 238, 121, 159), + ("PALEVIOLETRED3", 205, 104, 137), + ("PALEVIOLETRED4", 139, 71, 93), + ("PAPAYAWHIP", 255, 239, 213), + ("PEACHPUFF", 255, 218, 185), + ("PEACHPUFF1", 255, 218, 185), + ("PEACHPUFF2", 238, 203, 173), + ("PEACHPUFF3", 205, 175, 149), + ("PEACHPUFF4", 139, 119, 101), + ("PERU", 205, 133, 63), + ("PINK", 255, 192, 203), + ("PINK1", 255, 181, 197), + ("PINK2", 238, 169, 184), + ("PINK3", 205, 145, 158), + ("PINK4", 139, 99, 108), + ("PLUM", 221, 160, 221), + ("PLUM1", 255, 187, 255), + ("PLUM2", 238, 174, 238), + ("PLUM3", 205, 150, 205), + ("PLUM4", 
139, 102, 139), + ("POWDERBLUE", 176, 224, 230), + ("PURPLE", 160, 32, 240), + ("PURPLE1", 155, 48, 255), + ("PURPLE2", 145, 44, 238), + ("PURPLE3", 125, 38, 205), + ("PURPLE4", 85, 26, 139), + ("PY_COLOR", 240, 255, 210), + ("RED", 255, 0, 0), + ("RED1", 255, 0, 0), + ("RED2", 238, 0, 0), + ("RED3", 205, 0, 0), + ("RED4", 139, 0, 0), + ("ROSYBROWN", 188, 143, 143), + ("ROSYBROWN1", 255, 193, 193), + ("ROSYBROWN2", 238, 180, 180), + ("ROSYBROWN3", 205, 155, 155), + ("ROSYBROWN4", 139, 105, 105), + ("ROYALBLUE", 65, 105, 225), + ("ROYALBLUE1", 72, 118, 255), + ("ROYALBLUE2", 67, 110, 238), + ("ROYALBLUE3", 58, 95, 205), + ("ROYALBLUE4", 39, 64, 139), + ("SADDLEBROWN", 139, 69, 19), + ("SALMON", 250, 128, 114), + ("SALMON1", 255, 140, 105), + ("SALMON2", 238, 130, 98), + ("SALMON3", 205, 112, 84), + ("SALMON4", 139, 76, 57), + ("SANDYBROWN", 244, 164, 96), + ("SEAGREEN", 46, 139, 87), + ("SEAGREEN1", 84, 255, 159), + ("SEAGREEN2", 78, 238, 148), + ("SEAGREEN3", 67, 205, 128), + ("SEAGREEN4", 46, 139, 87), + ("SEASHELL", 255, 245, 238), + ("SEASHELL1", 255, 245, 238), + ("SEASHELL2", 238, 229, 222), + ("SEASHELL3", 205, 197, 191), + ("SEASHELL4", 139, 134, 130), + ("SIENNA", 160, 82, 45), + ("SIENNA1", 255, 130, 71), + ("SIENNA2", 238, 121, 66), + ("SIENNA3", 205, 104, 57), + ("SIENNA4", 139, 71, 38), + ("SILVER", 192, 192, 192), + ("SKYBLUE", 135, 206, 235), + ("SKYBLUE1", 135, 206, 255), + ("SKYBLUE2", 126, 192, 238), + ("SKYBLUE3", 108, 166, 205), + ("SKYBLUE4", 74, 112, 139), + ("SLATEBLUE", 106, 90, 205), + ("SLATEBLUE1", 131, 111, 255), + ("SLATEBLUE2", 122, 103, 238), + ("SLATEBLUE3", 105, 89, 205), + ("SLATEBLUE4", 71, 60, 139), + ("SLATEGRAY", 112, 128, 144), + ("SLATEGREY", 112, 128, 144), + ("SNOW", 255, 250, 250), + ("SNOW1", 255, 250, 250), + ("SNOW2", 238, 233, 233), + ("SNOW3", 205, 201, 201), + ("SNOW4", 139, 137, 137), + ("SPRINGGREEN", 0, 255, 127), + ("SPRINGGREEN1", 0, 255, 127), + ("SPRINGGREEN2", 0, 238, 118), + ("SPRINGGREEN3", 0, 205, 102), + ("SPRINGGREEN4", 0, 139, 69), + ("STEELBLUE", 70, 130, 180), + ("STEELBLUE1", 99, 184, 255), + ("STEELBLUE2", 92, 172, 238), + ("STEELBLUE3", 79, 148, 205), + ("STEELBLUE4", 54, 100, 139), + ("TAN", 210, 180, 140), + ("TAN1", 255, 165, 79), + ("TAN2", 238, 154, 73), + ("TAN3", 205, 133, 63), + ("TAN4", 139, 90, 43), + ("TEAL", 0, 128, 128), + ("THISTLE", 216, 191, 216), + ("THISTLE1", 255, 225, 255), + ("THISTLE2", 238, 210, 238), + ("THISTLE3", 205, 181, 205), + ("THISTLE4", 139, 123, 139), + ("TOMATO", 255, 99, 71), + ("TOMATO1", 255, 99, 71), + ("TOMATO2", 238, 92, 66), + ("TOMATO3", 205, 79, 57), + ("TOMATO4", 139, 54, 38), + ("TURQUOISE", 64, 224, 208), + ("TURQUOISE1", 0, 245, 255), + ("TURQUOISE2", 0, 229, 238), + ("TURQUOISE3", 0, 197, 205), + ("TURQUOISE4", 0, 134, 139), + ("VIOLET", 238, 130, 238), + ("VIOLETRED", 208, 32, 144), + ("VIOLETRED1", 255, 62, 150), + ("VIOLETRED2", 238, 58, 140), + ("VIOLETRED3", 205, 50, 120), + ("VIOLETRED4", 139, 34, 82), + ("WHEAT", 245, 222, 179), + ("WHEAT1", 255, 231, 186), + ("WHEAT2", 238, 216, 174), + ("WHEAT3", 205, 186, 150), + ("WHEAT4", 139, 126, 102), + ("WHITE", 255, 255, 255), + ("WHITESMOKE", 245, 245, 245), + ("YELLOW", 255, 255, 0), + ("YELLOW1", 255, 255, 0), + ("YELLOW2", 238, 238, 0), + ("YELLOW3", 205, 205, 0), + ("YELLOW4", 139, 139, 0), + ("YELLOWGREEN", 154, 205, 50), + ] diff -r 000000000000 -r 1d09e1dec1d9 src/extra.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/extra.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,4285 @@ +%module fitz_extra + +%pythoncode 
%{ +# pylint: disable=all +%} + +%begin +%{ +#define SWIG_PYTHON_INTERPRETER_NO_DEBUG + +/* This seems to be necessary on some Windows machines with Py_LIMITED_API, +otherwise compilation can fail because free() and malloc() are not declared. */ +#include +%} + +%init +%{ + /* Initialise some globals that require Python functions. + + [Prior to 2023-08-18 we initialised these global variables inline, + but this causes a SEGV on Windows with Python-3.10 for `dictkey_c` + (actually any string of length 1 failed).] */ + + dictkey_align = PyUnicode_InternFromString("align"); + dictkey_ascender = PyUnicode_InternFromString("ascender"); + dictkey_bidi = PyUnicode_InternFromString("bidi"); + dictkey_bbox = PyUnicode_InternFromString("bbox"); + dictkey_blocks = PyUnicode_InternFromString("blocks"); + dictkey_bpc = PyUnicode_InternFromString("bpc"); + dictkey_c = PyUnicode_InternFromString("c"); + dictkey_chars = PyUnicode_InternFromString("chars"); + dictkey_color = PyUnicode_InternFromString("color"); + dictkey_colorspace = PyUnicode_InternFromString("colorspace"); + dictkey_content = PyUnicode_InternFromString("content"); + dictkey_creationDate = PyUnicode_InternFromString("creationDate"); + dictkey_cs_name = PyUnicode_InternFromString("cs-name"); + dictkey_da = PyUnicode_InternFromString("da"); + dictkey_dashes = PyUnicode_InternFromString("dashes"); + dictkey_desc = PyUnicode_InternFromString("descender"); + dictkey_descender = PyUnicode_InternFromString("descender"); + dictkey_dir = PyUnicode_InternFromString("dir"); + dictkey_effect = PyUnicode_InternFromString("effect"); + dictkey_ext = PyUnicode_InternFromString("ext"); + dictkey_filename = PyUnicode_InternFromString("filename"); + dictkey_fill = PyUnicode_InternFromString("fill"); + dictkey_flags = PyUnicode_InternFromString("flags"); + dictkey_char_flags = PyUnicode_InternFromString("char_flags"); /* Only used with mupdf >= 1.25.2. 
*/ + dictkey_font = PyUnicode_InternFromString("font"); + dictkey_glyph = PyUnicode_InternFromString("glyph"); + dictkey_height = PyUnicode_InternFromString("height"); + dictkey_id = PyUnicode_InternFromString("id"); + dictkey_image = PyUnicode_InternFromString("image"); + dictkey_items = PyUnicode_InternFromString("items"); + dictkey_length = PyUnicode_InternFromString("length"); + dictkey_lines = PyUnicode_InternFromString("lines"); + dictkey_matrix = PyUnicode_InternFromString("transform"); + dictkey_modDate = PyUnicode_InternFromString("modDate"); + dictkey_name = PyUnicode_InternFromString("name"); + dictkey_number = PyUnicode_InternFromString("number"); + dictkey_origin = PyUnicode_InternFromString("origin"); + dictkey_rect = PyUnicode_InternFromString("rect"); + dictkey_size = PyUnicode_InternFromString("size"); + dictkey_smask = PyUnicode_InternFromString("smask"); + dictkey_spans = PyUnicode_InternFromString("spans"); + dictkey_stroke = PyUnicode_InternFromString("stroke"); + dictkey_style = PyUnicode_InternFromString("style"); + dictkey_subject = PyUnicode_InternFromString("subject"); + dictkey_text = PyUnicode_InternFromString("text"); + dictkey_title = PyUnicode_InternFromString("title"); + dictkey_type = PyUnicode_InternFromString("type"); + dictkey_ufilename = PyUnicode_InternFromString("ufilename"); + dictkey_width = PyUnicode_InternFromString("width"); + dictkey_wmode = PyUnicode_InternFromString("wmode"); + dictkey_xref = PyUnicode_InternFromString("xref"); + dictkey_xres = PyUnicode_InternFromString("xres"); + dictkey_yres = PyUnicode_InternFromString("yres"); +%} + +%include std_string.i + +%include exception.i +%exception { + try { + $action + } + +/* this might not be ok on windows. +catch (Swig::DirectorException &e) { + SWIG_fail; +}*/ +catch(std::exception& e) { + SWIG_exception(SWIG_RuntimeError, e.what()); +} +catch(...) { + SWIG_exception(SWIG_RuntimeError, "Unknown exception"); + } +} + +%{ +#include "mupdf/classes2.h" +#include "mupdf/exceptions.h" +#include "mupdf/internal.h" + +#include +#include + + +#define MAKE_MUPDF_VERSION_INT(major, minor, patch) ((major << 16) + (minor << 8) + (patch << 0)) + +#define MUPDF_VERSION_INT MAKE_MUPDF_VERSION_INT(FZ_VERSION_MAJOR, FZ_VERSION_MINOR, FZ_VERSION_PATCH) + +#define MUPDF_VERSION_GE(major, minor, patch) \ + MUPDF_VERSION_INT >= MAKE_MUPDF_VERSION_INT(major, minor, patch) + +/* Define a wrapper for PDF_NAME that returns a mupdf::PdfObj instead of a +pdf_obj*. This avoids implicit construction of a mupdf::PdfObj, which is +deliberately prohibited (with `explicit` on constructors) by recent MuPDF. */ +#define PDF_NAME2(X) mupdf::PdfObj(PDF_NAME(X)) + +/* Returns equivalent of `repr(x)`. */ +static std::string repr(PyObject* x) +{ + PyObject* repr = PyObject_Repr(x); + PyObject* repr_str = PyUnicode_AsEncodedString(repr, "utf-8", "~E~"); + #ifdef Py_LIMITED_API + const char* repr_str_s = PyBytes_AsString(repr_str); + #else + const char* repr_str_s = PyBytes_AS_STRING(repr_str); + #endif + std::string ret = repr_str_s; + Py_DECREF(repr_str); + Py_DECREF(repr); + return ret; +} + +#ifdef Py_LIMITED_API + static PyObject* PySequence_ITEM(PyObject* o, Py_ssize_t i) + { + return PySequence_GetItem(o, i); + } + + static const char* PyUnicode_AsUTF8(PyObject* o) + { + static PyObject* string = nullptr; + Py_XDECREF(string); + string = PyUnicode_AsUTF8String(o); + return PyBytes_AsString(string); + } +#endif + + +/* These are also in pymupdf/__init__.py. 
*/ +const char MSG_BAD_ANNOT_TYPE[] = "bad annot type"; +const char MSG_BAD_APN[] = "bad or missing annot AP/N"; +const char MSG_BAD_ARG_INK_ANNOT[] = "arg must be seq of seq of float pairs"; +const char MSG_BAD_ARG_POINTS[] = "bad seq of points"; +const char MSG_BAD_BUFFER[] = "bad type: 'buffer'"; +const char MSG_BAD_COLOR_SEQ[] = "bad color sequence"; +const char MSG_BAD_DOCUMENT[] = "cannot open broken document"; +const char MSG_BAD_FILETYPE[] = "bad filetype"; +const char MSG_BAD_LOCATION[] = "bad location"; +const char MSG_BAD_OC_CONFIG[] = "bad config number"; +const char MSG_BAD_OC_LAYER[] = "bad layer number"; +const char MSG_BAD_OC_REF[] = "bad 'oc' reference"; +const char MSG_BAD_PAGEID[] = "bad page id"; +const char MSG_BAD_PAGENO[] = "bad page number(s)"; +const char MSG_BAD_PDFROOT[] = "PDF has no root"; +const char MSG_BAD_RECT[] = "rect is infinite or empty"; +const char MSG_BAD_TEXT[] = "bad type: 'text'"; +const char MSG_BAD_XREF[] = "bad xref"; +const char MSG_COLOR_COUNT_FAILED[] = "color count failed"; +const char MSG_FILE_OR_BUFFER[] = "need font file or buffer"; +const char MSG_FONT_FAILED[] = "cannot create font"; +const char MSG_IS_NO_ANNOT[] = "is no annotation"; +const char MSG_IS_NO_IMAGE[] = "is no image"; +const char MSG_IS_NO_PDF[] = "is no PDF"; +const char MSG_IS_NO_DICT[] = "object is no PDF dict"; +const char MSG_PIX_NOALPHA[] = "source pixmap has no alpha"; +const char MSG_PIXEL_OUTSIDE[] = "pixel(s) outside image"; + +#define JM_BOOL(x) PyBool_FromLong((long) (x)) + +static PyObject *JM_UnicodeFromStr(const char *c); + + +#ifdef _WIN32 + +/* These functions are not provided on Windows. */ + +int vasprintf(char** str, const char* fmt, va_list ap) +{ + va_list ap2; + + va_copy(ap2, ap); + int len = vsnprintf(nullptr, 0, fmt, ap2); + va_end(ap2); + + char* buffer = (char*) malloc(len + 1); + if (!buffer) + { + *str = nullptr; + return -1; + } + va_copy(ap2, ap); + int len2 = vsnprintf(buffer, len + 1, fmt, ap2); + va_end(ap2); + assert(len2 == len); + *str = buffer; + return len; +} + +int asprintf(char** str, const char* fmt, ...) +{ + va_list ap; + va_start(ap, fmt); + int ret = vasprintf(str, fmt, ap); + va_end(ap); + + return ret; +} +#endif + + +static void messagev(const char* format, va_list va) +{ + static PyObject* pymupdf_module = PyImport_ImportModule("pymupdf"); + static PyObject* message_fn = PyObject_GetAttrString(pymupdf_module, "message"); + char* text; + vasprintf(&text, format, va); + PyObject* text_py = PyString_FromString(text); + PyObject* args = PyTuple_Pack(1, text_py); + PyObject* ret = PyObject_CallObject(message_fn, args); + Py_XDECREF(ret); + Py_XDECREF(args); + Py_XDECREF(text_py); + free(text); +} + +static void messagef(const char* format, ...) 
+{ + va_list args; + va_start(args, format); + messagev(format, args); + va_end(args); +} + +PyObject* JM_EscapeStrFromStr(const char* c) +{ + if (!c) return PyUnicode_FromString(""); + PyObject* val = PyUnicode_DecodeRawUnicodeEscape(c, (Py_ssize_t) strlen(c), "replace"); + if (!val) + { + val = PyUnicode_FromString(""); + PyErr_Clear(); + } + return val; +} + +PyObject* JM_EscapeStrFromBuffer(fz_buffer* buff) +{ + if (!buff) return PyUnicode_FromString(""); + unsigned char* s = nullptr; + size_t len = mupdf::ll_fz_buffer_storage(buff, &s); + PyObject* val = PyUnicode_DecodeRawUnicodeEscape((const char*) s, (Py_ssize_t) len, "replace"); + if (!val) + { + val = PyUnicode_FromString(""); + PyErr_Clear(); + } + return val; +} + +//---------------------------------------------------------------------------- +// Deep-copies a source page to the target. +// Modified version of function of pdfmerge.c: we also copy annotations, but +// we skip some subtypes. In addition we rotate output. +//---------------------------------------------------------------------------- +static void page_merge( + mupdf::PdfDocument& doc_des, + mupdf::PdfDocument& doc_src, + int page_from, + int page_to, + int rotate, + int links, + int copy_annots, + mupdf::PdfGraftMap& graft_map + ) +{ + // list of object types (per page) we want to copy + + /* Fixme: on linux these get destructed /after/ + mupdf/platform/c++/implementation/internal.cpp:s_thread_state, which causes + problems - s_thread_state::m_ctx will have been freed. We have a hack + that sets s_thread_state::m_ctx when destructed, so it mostly works when + s_thread_state.get_context() is called after destruction, but this causes + memento leaks and is clearly incorrect. + + Perhaps we could use pdf_obj* known_page_objs[] = {...} and create PdfObj + wrappers as used - this would avoid any cleanup at exit. And it's a general + solution to problem of ordering of cleanup of globals. 
+ */ + static pdf_obj* known_page_objs[] = { + PDF_NAME(Contents), + PDF_NAME(Resources), + PDF_NAME(MediaBox), + PDF_NAME(CropBox), + PDF_NAME(BleedBox), + PDF_NAME(TrimBox), + PDF_NAME(ArtBox), + PDF_NAME(Rotate), + PDF_NAME(UserUnit) + }; + int known_page_objs_num = sizeof(known_page_objs) / sizeof(known_page_objs[0]); + mupdf::PdfObj page_ref = mupdf::pdf_lookup_page_obj(doc_src, page_from); + + // make new page dict in dest doc + mupdf::PdfObj page_dict = mupdf::pdf_new_dict(doc_des, 4); + mupdf::pdf_dict_put(page_dict, PDF_NAME2(Type), PDF_NAME2(Page)); + + for (int i = 0; i < known_page_objs_num; ++i) + { + mupdf::PdfObj known_page_obj(known_page_objs[i]); + mupdf::PdfObj obj = mupdf::pdf_dict_get_inheritable(page_ref, known_page_obj); + if (obj.m_internal) + { + mupdf::pdf_dict_put( + page_dict, + known_page_obj, + mupdf::pdf_graft_mapped_object(graft_map, obj) + ); + } + } + + // Copy annotations, but skip Link, Popup, IRT, Widget types + // If selected, remove dict keys P (parent) and Popup + if (copy_annots) + { + mupdf::PdfObj old_annots = mupdf::pdf_dict_get(page_ref, PDF_NAME2(Annots)); + int n = mupdf::pdf_array_len(old_annots); + if (n > 0) + { + mupdf::PdfObj new_annots = mupdf::pdf_dict_put_array(page_dict, PDF_NAME2(Annots), n); + for (int i = 0; i < n; i++) + { + mupdf::PdfObj o = mupdf::pdf_array_get(old_annots, i); + if (!o.m_internal || !mupdf::pdf_is_dict(o)) // skip non-dict items + { + continue; // skip invalid/null/non-dict items + } + if (mupdf::pdf_dict_get(o, PDF_NAME2(IRT)).m_internal) continue; + mupdf::PdfObj subtype = mupdf::pdf_dict_get(o, PDF_NAME2(Subtype)); + if (mupdf::pdf_name_eq(subtype, PDF_NAME2(Link))) continue; + if (mupdf::pdf_name_eq(subtype, PDF_NAME2(Popup))) continue; + if (mupdf::pdf_name_eq(subtype, PDF_NAME2(Widget))) continue; + mupdf::pdf_dict_del(o, PDF_NAME2(Popup)); + mupdf::pdf_dict_del(o, PDF_NAME2(P)); + mupdf::PdfObj copy_o = mupdf::pdf_graft_mapped_object(graft_map, o); + mupdf::PdfObj annot = mupdf::pdf_new_indirect( + doc_des, + mupdf::pdf_to_num(copy_o), + 0 + ); + mupdf::pdf_array_push(new_annots, annot); + } + } + } + // rotate the page + if (rotate != -1) + { + mupdf::pdf_dict_put_int(page_dict, PDF_NAME2(Rotate), rotate); + } + // Now add the page dictionary to dest PDF + mupdf::PdfObj ref = mupdf::pdf_add_object(doc_des, page_dict); + + // Insert new page at specified location + mupdf::pdf_insert_page(doc_des, page_to, ref); +} + +//----------------------------------------------------------------------------- +// Copy a range of pages (spage, epage) from a source PDF to a specified +// location (apage) of the target PDF. +// If spage > epage, the sequence of source pages is reversed. 
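The page_merge()/JM_merge_range() helpers above appear to back the Python-level Document.insert_pdf() call; a minimal usage sketch follows (file names are placeholders, and rotate=0 forces upright pages where -1 would keep the source rotation):

    import pymupdf

    dst = pymupdf.Document("target.pdf")   # placeholder file names
    src = pymupdf.Document("source.pdf")

    # Copy pages 2-5 of src (0-based) to the front of dst; swapping the
    # range (from_page > to_page) would insert them in reverse order.
    dst.insert_pdf(src, from_page=2, to_page=5, start_at=0, rotate=0,
                   links=True, annots=True, show_progress=0)
    dst.save("merged.pdf")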
+//----------------------------------------------------------------------------- +static void JM_merge_range( + mupdf::PdfDocument& doc_des, + mupdf::PdfDocument& doc_src, + int spage, + int epage, + int apage, + int rotate, + int links, + int annots, + int show_progress, + mupdf::PdfGraftMap& graft_map + ) +{ + int afterpage = apage; + int counter = 0; // copied pages counter + int total = mupdf::ll_fz_absi(epage - spage) + 1; // total pages to copy + + if (spage < epage) + { + for (int page = spage; page <= epage; page++, afterpage++) + { + page_merge(doc_des, doc_src, page, afterpage, rotate, links, annots, graft_map); + counter++; + if (show_progress > 0 && counter % show_progress == 0) + { + messagef("Inserted %i of %i pages.", counter, total); + } + } + } + else + { + for (int page = spage; page >= epage; page--, afterpage++) + { + page_merge(doc_des, doc_src, page, afterpage, rotate, links, annots, graft_map); + counter++; + if (show_progress > 0 && counter % show_progress == 0) + { + messagef("Inserted %i of %i pages.", counter, total); + } + } + } +} + +static bool JM_have_operation(mupdf::PdfDocument& pdf) +{ + // Ensure valid journalling state + if (pdf.m_internal->journal and !mupdf::pdf_undoredo_step(pdf, 0)) + { + return 0; + } + return 1; +} + +static void JM_ensure_operation(mupdf::PdfDocument& pdf) +{ + if (!JM_have_operation(pdf)) + { + throw std::runtime_error("No journalling operation started"); + } +} + + +static void FzDocument_insert_pdf( + mupdf::FzDocument& doc, + mupdf::FzDocument& src, + int from_page, + int to_page, + int start_at, + int rotate, + int links, + int annots, + int show_progress, + int final, + mupdf::PdfGraftMap& graft_map + ) +{ + //std::cerr << __FILE__ << ":" << __LINE__ << ":" << __FUNCTION__ << "\n"; + mupdf::PdfDocument pdfout = mupdf::pdf_specifics(doc); + mupdf::PdfDocument pdfsrc = mupdf::pdf_specifics(src); + int outCount = mupdf::fz_count_pages(doc); + int srcCount = mupdf::fz_count_pages(src); + + // local copies of page numbers + int fp = from_page; + int tp = to_page; + int sa = start_at; + + // normalize page numbers + fp = std::max(fp, 0); // -1 = first page + fp = std::min(fp, srcCount - 1); // but do not exceed last page + + if (tp < 0) tp = srcCount - 1; // -1 = last page + tp = std::min(tp, srcCount - 1); // but do not exceed last page + + if (sa < 0) sa = outCount; // -1 = behind last page + sa = std::min(sa, outCount); // but that is also the limit + + if (!pdfout.m_internal || !pdfsrc.m_internal) + { + throw std::runtime_error("source or target not a PDF"); + } + JM_ensure_operation(pdfout); + JM_merge_range(pdfout, pdfsrc, fp, tp, sa, rotate, links, annots, show_progress, graft_map); +} + +static int page_xref(mupdf::FzDocument& this_doc, int pno) +{ + int page_count = mupdf::fz_count_pages(this_doc); + int n = pno; + while (n < 0) + { + n += page_count; + } + mupdf::PdfDocument pdf = mupdf::pdf_specifics(this_doc); + assert(pdf.m_internal); + int xref = 0; + if (n >= page_count) + { + throw std::runtime_error(MSG_BAD_PAGENO);//, PyExc_ValueError); + } + xref = mupdf::pdf_to_num(mupdf::pdf_lookup_page_obj(pdf, n)); + return xref; +} + +static void _newPage(mupdf::PdfDocument& pdf, int pno=-1, float width=595, float height=842) +{ + if (!pdf.m_internal) + { + throw std::runtime_error("is no PDF"); + } + mupdf::FzRect mediabox(0, 0, width, height); + if (pno < -1) + { + throw std::runtime_error("bad page number(s)"); // Should somehow be Python ValueError + } + JM_ensure_operation(pdf); + // create /Resources and /Contents 
objects + mupdf::PdfObj resources = mupdf::pdf_add_new_dict(pdf, 1); + mupdf::FzBuffer contents; + mupdf::PdfObj page_obj = mupdf::pdf_add_page(pdf, mediabox, 0, resources, contents); + mupdf::pdf_insert_page(pdf, pno, page_obj); +} + +static void _newPage(mupdf::FzDocument& self, int pno=-1, float width=595, float height=842) +{ + mupdf::PdfDocument pdf = mupdf::pdf_specifics(self); + _newPage(pdf, pno, width, height); +} + + +//------------------------------------------------------------------------ +// return the annotation names (list of /NM entries) +//------------------------------------------------------------------------ +static std::vector< std::string> JM_get_annot_id_list(mupdf::PdfPage& page) +{ + std::vector< std::string> names; + mupdf::PdfObj annots = mupdf::pdf_dict_get(page.obj(), PDF_NAME2(Annots)); + if (!annots.m_internal) return names; + int n = mupdf::pdf_array_len(annots); + for (int i = 0; i < n; i++) + { + mupdf::PdfObj annot_obj = mupdf::pdf_array_get(annots, i); + mupdf::PdfObj name = mupdf::pdf_dict_gets(annot_obj, "NM"); + if (name.m_internal) + { + names.push_back(mupdf::pdf_to_text_string(name)); + } + } + return names; +} + + +//------------------------------------------------------------------------ +// Add a unique /NM key to an annotation or widget. +// Append a number to 'stem' such that the result is a unique name. +//------------------------------------------------------------------------ +static void JM_add_annot_id(mupdf::PdfAnnot& annot, const char* stem) +{ + mupdf::PdfPage page = mupdf::pdf_annot_page(annot); + mupdf::PdfObj annot_obj = mupdf::pdf_annot_obj(annot); + std::vector< std::string> names = JM_get_annot_id_list(page); + char* stem_id = nullptr; + for (int i=0; ; ++i) + { + free(stem_id); + asprintf(&stem_id, "fitz-%s%d", stem, i); + if (std::find(names.begin(), names.end(), stem_id) == names.end()) + { + break; + } + } + mupdf::PdfObj name = mupdf::pdf_new_string(stem_id, strlen(stem_id)); + free(stem_id); + mupdf::pdf_dict_puts(annot_obj, "NM", name); + page.m_internal->doc->resynth_required = 0; +} + +//---------------------------------------------------------------- +// page add_caret_annot +//---------------------------------------------------------------- +static mupdf::PdfAnnot _add_caret_annot(mupdf::PdfPage& page, mupdf::FzPoint& point) +{ + mupdf::PdfAnnot annot = mupdf::pdf_create_annot(page, ::PDF_ANNOT_CARET); + mupdf::FzPoint p = point; + mupdf::FzRect r = mupdf::pdf_annot_rect(annot); + r = mupdf::fz_make_rect(p.x, p.y, p.x + r.x1 - r.x0, p.y + r.y1 - r.y0); + mupdf::pdf_set_annot_rect(annot, r); + mupdf::pdf_update_annot(annot); + JM_add_annot_id(annot, "A"); + return annot; +} + +static mupdf::PdfAnnot _add_caret_annot(mupdf::FzPage& page, mupdf::FzPoint& point) +{ + mupdf::PdfPage pdf_page = mupdf::pdf_page_from_fz_page(page); + return _add_caret_annot(pdf_page, point); +} + +static const char* Tools_parse_da(mupdf::PdfAnnot& this_annot) +{ + const char* da_str = nullptr; + mupdf::PdfObj this_annot_obj = mupdf::pdf_annot_obj(this_annot); + mupdf::PdfDocument pdf = mupdf::pdf_get_bound_document(this_annot_obj); + try + { + mupdf::PdfObj da = mupdf::pdf_dict_get_inheritable(this_annot_obj, PDF_NAME2(DA)); + if (!da.m_internal) + { + mupdf::PdfObj trailer = mupdf::pdf_trailer(pdf); + da = mupdf::pdf_dict_getl( + &trailer, + PDF_NAME(Root), + PDF_NAME(AcroForm), + PDF_NAME(DA), + nullptr + ); + } + da_str = mupdf::pdf_to_text_string(da); + } + catch (std::exception&) + { + return nullptr; + } + return da_str; +} + 
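JM_add_annot_id() above is what gives every newly created annotation a unique /NM name of the form "fitz-<stem><n>". A small sketch of the corresponding public API, assuming Annot.info exposes that name under the "id" key (the output path is a placeholder):

    import pymupdf

    doc = pymupdf.Document()                 # new, empty PDF
    page = doc.new_page()                    # default size 595 x 842 points
    annot = page.add_caret_annot(pymupdf.Point(72, 72))
    print(annot.type)                        # annotation type information
    print(annot.info.get("id"))              # auto-generated /NM, e.g. "fitz-A0"
    doc.save("caret.pdf")                    # placeholder output path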
+//---------------------------------------------------------------------------- +// Turn fz_buffer into a Python bytes object +//---------------------------------------------------------------------------- +static PyObject* JM_BinFromBuffer(fz_buffer* buffer) +{ + if (!buffer) + { + return PyBytes_FromStringAndSize("", 0); + } + unsigned char* c = nullptr; + size_t len = mupdf::ll_fz_buffer_storage(buffer, &c); + return PyBytes_FromStringAndSize((const char*) c, len); +} +static PyObject* JM_BinFromBuffer(mupdf::FzBuffer& buffer) +{ + return JM_BinFromBuffer( buffer.m_internal); +} + +static PyObject* Annot_getAP(mupdf::PdfAnnot& annot) +{ + mupdf::PdfObj annot_obj = mupdf::pdf_annot_obj(annot); + mupdf::PdfObj ap = mupdf::pdf_dict_getl( + &annot_obj, + PDF_NAME(AP), + PDF_NAME(N), + nullptr + ); + if (mupdf::pdf_is_stream(ap)) + { + mupdf::FzBuffer res = mupdf::pdf_load_stream(ap); + return JM_BinFromBuffer(res); + } + return PyBytes_FromStringAndSize("", 0); +} + +void Tools_update_da(mupdf::PdfAnnot& this_annot, const char* da_str) +{ + mupdf::PdfObj this_annot_obj = mupdf::pdf_annot_obj(this_annot); + mupdf::pdf_dict_put_text_string(this_annot_obj, PDF_NAME2(DA), da_str); + mupdf::pdf_dict_del(this_annot_obj, PDF_NAME2(DS)); /* not supported */ + mupdf::pdf_dict_del(this_annot_obj, PDF_NAME2(RC)); /* not supported */ +} + +static int +jm_float_item(PyObject* obj, Py_ssize_t idx, double* result) +{ + PyObject* temp = PySequence_ITEM(obj, idx); + if (!temp) return 1; + *result = PyFloat_AsDouble(temp); + Py_DECREF(temp); + if (PyErr_Occurred()) + { + PyErr_Clear(); + return 1; + } + return 0; +} + + +static mupdf::FzPoint JM_point_from_py(PyObject* p) +{ + fz_point p0 = fz_make_point(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT); + if (!p || !PySequence_Check(p) || PySequence_Size(p) != 2) + { + return p0; + } + double x; + double y; + if (jm_float_item(p, 0, &x) == 1) return p0; + if (jm_float_item(p, 1, &y) == 1) return p0; + if (x < FZ_MIN_INF_RECT) x = FZ_MIN_INF_RECT; + if (y < FZ_MIN_INF_RECT) y = FZ_MIN_INF_RECT; + if (x > FZ_MAX_INF_RECT) x = FZ_MAX_INF_RECT; + if (y > FZ_MAX_INF_RECT) y = FZ_MAX_INF_RECT; + + return fz_make_point(x, y); +} + +static int s_list_append_drop(PyObject* list, PyObject* item) +{ + if (!list || !PyList_Check(list) || !item) + { + return -2; + } + int rc = PyList_Append(list, item); + Py_DECREF(item); + return rc; +} + +static int LIST_APPEND_DROP(PyObject *list, PyObject *item) +{ + if (!list || !PyList_Check(list) || !item) return -2; + int rc = PyList_Append(list, item); + Py_DECREF(item); + return rc; +} + +static int LIST_APPEND(PyObject *list, PyObject *item) +{ + if (!list || !PyList_Check(list) || !item) return -2; + int rc = PyList_Append(list, item); + return rc; +} + +static int DICT_SETITEM_DROP(PyObject *dict, PyObject *key, PyObject *value) +{ + if (!dict || !PyDict_Check(dict) || !key || !value) return -2; + int rc = PyDict_SetItem(dict, key, value); + Py_DECREF(value); + return rc; +} + +static int DICT_SETITEMSTR_DROP(PyObject *dict, const char *key, PyObject *value) +{ + if (!dict || !PyDict_Check(dict) || !key || !value) return -2; + int rc = PyDict_SetItemString(dict, key, value); + Py_DECREF(value); + return rc; +} + + +//----------------------------------------------------------------------------- +// Functions converting between PySequences and pymupdf geometry objects +//----------------------------------------------------------------------------- +static int +jm_init_item(PyObject* obj, Py_ssize_t idx, int* result) +{ + PyObject* temp = 
PySequence_ITEM(obj, idx); + if (!temp) + { + return 1; + } + if (PyLong_Check(temp)) + { + *result = (int) PyLong_AsLong(temp); + Py_DECREF(temp); + } + else if (PyFloat_Check(temp)) + { + *result = (int) PyFloat_AsDouble(temp); + Py_DECREF(temp); + } + else + { + Py_DECREF(temp); + return 1; + } + if (PyErr_Occurred()) + { + PyErr_Clear(); + return 1; + } + return 0; +} + +// TODO: ------------------------------------------------------------------ +// This is a temporary solution and should be replaced by a C++ extension: +// There is no way in Python specify an array of fz_point - as is required +// for function pdf_set_annot_callout_line(). +static void JM_set_annot_callout_line(mupdf::PdfAnnot& annot, PyObject *callout, int count) +{ + fz_point points[3]; + mupdf::FzPoint p; + for (int i = 0; i < count; i++) + { + p = JM_point_from_py(PyTuple_GetItem(callout, (Py_ssize_t) i)); + points[i] = fz_make_point(p.x, p.y); + } + mupdf::pdf_set_annot_callout_line(annot, points, count); +} + + +//---------------------------------------------------------------------------- +// Return list of outline xref numbers. Recursive function. Arguments: +// 'obj' first OL item +// 'xrefs' empty Python list +//---------------------------------------------------------------------------- +static PyObject* JM_outline_xrefs(mupdf::PdfObj obj, PyObject* xrefs) +{ + if (!obj.m_internal) + { + return xrefs; + } + PyObject* newxref = nullptr; + mupdf::PdfObj thisobj = obj; + while (thisobj.m_internal) + { + int nxr = mupdf::pdf_to_num(thisobj); + newxref = PyLong_FromLong((long) nxr); + if (PySequence_Contains(xrefs, newxref) + or mupdf::pdf_dict_get(thisobj, PDF_NAME2(Type)).m_internal + ) + { + // circular ref or top of chain: terminate + Py_DECREF(newxref); + break; + } + s_list_append_drop(xrefs, newxref); + mupdf::PdfObj first = mupdf::pdf_dict_get(thisobj, PDF_NAME2(First)); // try go down + if (mupdf::pdf_is_dict(first)) + { + xrefs = JM_outline_xrefs(first, xrefs); + } + thisobj = mupdf::pdf_dict_get(thisobj, PDF_NAME2(Next)); // try go next + mupdf::PdfObj parent = mupdf::pdf_dict_get(thisobj, PDF_NAME2(Parent)); // get parent + if (!mupdf::pdf_is_dict(thisobj)) + { + thisobj = parent; + } + } + return xrefs; +} + + +PyObject* dictkey_align = NULL; +PyObject* dictkey_ascender = NULL; +PyObject* dictkey_bidi = NULL; +PyObject* dictkey_bbox = NULL; +PyObject* dictkey_blocks = NULL; +PyObject* dictkey_bpc = NULL; +PyObject* dictkey_c = NULL; +PyObject* dictkey_chars = NULL; +PyObject* dictkey_color = NULL; +PyObject* dictkey_colorspace = NULL; +PyObject* dictkey_content = NULL; +PyObject* dictkey_creationDate = NULL; +PyObject* dictkey_cs_name = NULL; +PyObject* dictkey_da = NULL; +PyObject* dictkey_dashes = NULL; +PyObject* dictkey_desc = NULL; +PyObject* dictkey_descender = NULL; +PyObject* dictkey_dir = NULL; +PyObject* dictkey_effect = NULL; +PyObject* dictkey_ext = NULL; +PyObject* dictkey_filename = NULL; +PyObject* dictkey_fill = NULL; +PyObject* dictkey_flags = NULL; +PyObject* dictkey_char_bidi = NULL; +PyObject* dictkey_char_flags = NULL; +PyObject* dictkey_font = NULL; +PyObject* dictkey_glyph = NULL; +PyObject* dictkey_height = NULL; +PyObject* dictkey_id = NULL; +PyObject* dictkey_image = NULL; +PyObject* dictkey_items = NULL; +PyObject* dictkey_length = NULL; +PyObject* dictkey_lines = NULL; +PyObject* dictkey_matrix = NULL; +PyObject* dictkey_modDate = NULL; +PyObject* dictkey_name = NULL; +PyObject* dictkey_number = NULL; +PyObject* dictkey_origin = NULL; +PyObject* dictkey_rect = NULL; 
+PyObject* dictkey_size = NULL; +PyObject* dictkey_smask = NULL; +PyObject* dictkey_spans = NULL; +PyObject* dictkey_stroke = NULL; +PyObject* dictkey_style = NULL; +PyObject* dictkey_subject = NULL; +PyObject* dictkey_text = NULL; +PyObject* dictkey_title = NULL; +PyObject* dictkey_type = NULL; +PyObject* dictkey_ufilename = NULL; +PyObject* dictkey_width = NULL; +PyObject* dictkey_wmode = NULL; +PyObject* dictkey_xref = NULL; +PyObject* dictkey_xres = NULL; +PyObject* dictkey_yres = NULL; + +static int dict_setitem_drop(PyObject* dict, PyObject* key, PyObject* value) +{ + if (!dict || !PyDict_Check(dict) || !key || !value) + { + return -2; + } + int rc = PyDict_SetItem(dict, key, value); + Py_DECREF(value); + return rc; +} + +static int dict_setitemstr_drop(PyObject* dict, const char* key, PyObject* value) +{ + if (!dict || !PyDict_Check(dict) || !key || !value) + { + return -2; + } + int rc = PyDict_SetItemString(dict, key, value); + Py_DECREF(value); + return rc; +} + + +static void Document_extend_toc_items(mupdf::PdfDocument& pdf, PyObject* items) +{ + PyObject* item=nullptr; + PyObject* itemdict=nullptr; + PyObject* xrefs=nullptr; + + PyObject* bold = PyUnicode_FromString("bold"); + PyObject* italic = PyUnicode_FromString("italic"); + PyObject* collapse = PyUnicode_FromString("collapse"); + PyObject* zoom = PyUnicode_FromString("zoom"); + + try + { + /* Need to define these things early because later code uses + `goto`; otherwise we get compiler warnings 'jump bypasses variable + initialization' */ + int xref = 0; + mupdf::PdfObj root; + mupdf::PdfObj olroot; + mupdf::PdfObj first; + Py_ssize_t n; + Py_ssize_t m; + + root = mupdf::pdf_dict_get(mupdf::pdf_trailer(pdf), PDF_NAME2(Root)); + if (!root.m_internal) goto end; + + olroot = mupdf::pdf_dict_get(root, PDF_NAME2(Outlines)); + if (!olroot.m_internal) goto end; + + first = mupdf::pdf_dict_get(olroot, PDF_NAME2(First)); + if (!first.m_internal) goto end; + + xrefs = PyList_New(0); // pre-allocate an empty list + xrefs = JM_outline_xrefs(first, xrefs); + n = PySequence_Size(xrefs); + m = PySequence_Size(items); + if (!n) goto end; + + if (n != m) + { + throw std::runtime_error("internal error finding outline xrefs"); + } + + // update all TOC item dictionaries + for (int i = 0; i < n; i++) + { + jm_init_item(xrefs, i, &xref); + item = PySequence_ITEM(items, i); + itemdict = PySequence_ITEM(item, 3); + if (!itemdict || !PyDict_Check(itemdict)) + { + throw std::runtime_error("need non-simple TOC format"); + } + PyDict_SetItem(itemdict, dictkey_xref, PySequence_ITEM(xrefs, i)); + mupdf::PdfObj bm = mupdf::pdf_load_object(pdf, xref); + int flags = mupdf::pdf_to_int(mupdf::pdf_dict_get(bm, PDF_NAME2(F))); + if (flags == 1) + { + PyDict_SetItem(itemdict, italic, Py_True); + } + else if (flags == 2) + { + PyDict_SetItem(itemdict, bold, Py_True); + } + else if (flags == 3) + { + PyDict_SetItem(itemdict, italic, Py_True); + PyDict_SetItem(itemdict, bold, Py_True); + } + int count = mupdf::pdf_to_int(mupdf::pdf_dict_get(bm, PDF_NAME2(Count))); + if (count < 0) + { + PyDict_SetItem(itemdict, collapse, Py_True); + } + else if (count > 0) + { + PyDict_SetItem(itemdict, collapse, Py_False); + } + mupdf::PdfObj col = mupdf::pdf_dict_get(bm, PDF_NAME2(C)); + if (mupdf::pdf_is_array(col) && mupdf::pdf_array_len(col) == 3) + { + PyObject* color = PyTuple_New(3); + PyTuple_SET_ITEM(color, 0, Py_BuildValue("f", mupdf::pdf_to_real(mupdf::pdf_array_get(col, 0)))); + PyTuple_SET_ITEM(color, 1, Py_BuildValue("f", 
mupdf::pdf_to_real(mupdf::pdf_array_get(col, 1)))); + PyTuple_SET_ITEM(color, 2, Py_BuildValue("f", mupdf::pdf_to_real(mupdf::pdf_array_get(col, 2)))); + dict_setitem_drop(itemdict, dictkey_color, color); + } + float z=0; + mupdf::PdfObj obj = mupdf::pdf_dict_get(bm, PDF_NAME2(Dest)); + if (!obj.m_internal || !mupdf::pdf_is_array(obj)) + { + obj = mupdf::pdf_dict_getl(&bm, PDF_NAME(A), PDF_NAME(D), nullptr); + } + if (mupdf::pdf_is_array(obj) && mupdf::pdf_array_len(obj) == 5) + { + z = mupdf::pdf_to_real(mupdf::pdf_array_get(obj, 4)); + } + dict_setitem_drop(itemdict, zoom, Py_BuildValue("f", z)); + PyList_SetItem(item, 3, itemdict); + PyList_SetItem(items, i, item); + } + end:; + } + catch (std::exception&) + { + } + Py_CLEAR(xrefs); + Py_CLEAR(bold); + Py_CLEAR(italic); + Py_CLEAR(collapse); + Py_CLEAR(zoom); +} + +static void Document_extend_toc_items(mupdf::FzDocument& document, PyObject* items) +{ + mupdf::PdfDocument pdf = mupdf::pdf_document_from_fz_document(document); + return Document_extend_toc_items(pdf, items); +} + +//----------------------------------------------------------------------------- +// PySequence from fz_rect +//----------------------------------------------------------------------------- +static PyObject* JM_py_from_rect(fz_rect r) +{ + return Py_BuildValue("ffff", r.x0, r.y0, r.x1, r.y1); +} +static PyObject* JM_py_from_rect(mupdf::FzRect r) +{ + return JM_py_from_rect(*r.internal()); +} + +//----------------------------------------------------------------------------- +// PySequence from fz_point +//----------------------------------------------------------------------------- +static PyObject* JM_py_from_point(fz_point p) +{ + return Py_BuildValue("ff", p.x, p.y); +} + +//----------------------------------------------------------------------------- +// PySequence from fz_quad. +//----------------------------------------------------------------------------- +static PyObject * +JM_py_from_quad(fz_quad q) +{ + return Py_BuildValue("((f,f),(f,f),(f,f),(f,f))", + q.ul.x, q.ul.y, q.ur.x, q.ur.y, + q.ll.x, q.ll.y, q.lr.x, q.lr.y); +} + +//---------------------------------------------------------------- +// annotation rectangle +//---------------------------------------------------------------- +static mupdf::FzRect Annot_rect(mupdf::PdfAnnot& annot) +{ + mupdf::FzRect rect = mupdf::pdf_bound_annot(annot); + return rect; +} + +static PyObject* Annot_rect3(mupdf::PdfAnnot& annot) +{ + fz_rect rect = mupdf::ll_pdf_bound_annot(annot.m_internal); + return JM_py_from_rect(rect); +} + +//----------------------------------------------------------------------------- +// PySequence to fz_rect. Default: infinite rect +//----------------------------------------------------------------------------- +static fz_rect JM_rect_from_py(PyObject* r) +{ + if (!r || !PySequence_Check(r) || PySequence_Size(r) != 4) + { + return *mupdf::FzRect(mupdf::FzRect::Fixed_INFINITE).internal();// fz_infinite_rect; + } + double f[4]; + for (int i = 0; i < 4; i++) + { + if (jm_float_item(r, i, &f[i]) == 1) + { + return *mupdf::FzRect(mupdf::FzRect::Fixed_INFINITE).internal(); + } + if (f[i] < FZ_MIN_INF_RECT) f[i] = FZ_MIN_INF_RECT; + if (f[i] > FZ_MAX_INF_RECT) f[i] = FZ_MAX_INF_RECT; + } + return mupdf::ll_fz_make_rect( + (float) f[0], + (float) f[1], + (float) f[2], + (float) f[3] + ); +} + +//----------------------------------------------------------------------------- +// PySequence to fz_matrix. 
Default: fz_identity +//----------------------------------------------------------------------------- +static fz_matrix JM_matrix_from_py(PyObject* m) +{ + double a[6]; + + if (!m || !PySequence_Check(m) || PySequence_Size(m) != 6) + { + return fz_identity; + } + for (int i = 0; i < 6; i++) + { + if (jm_float_item(m, i, &a[i]) == 1) + { + return *mupdf::FzMatrix().internal(); + } + } + return mupdf::ll_fz_make_matrix( + (float) a[0], + (float) a[1], + (float) a[2], + (float) a[3], + (float) a[4], + (float) a[5] + ); +} + +PyObject* util_transform_rect(PyObject* rect, PyObject* matrix) +{ + return JM_py_from_rect( + mupdf::ll_fz_transform_rect( + JM_rect_from_py(rect), + JM_matrix_from_py(matrix) + ) + ); +} + +//---------------------------------------------------------------------------- +// return normalized /Rotate value:one of 0, 90, 180, 270 +//---------------------------------------------------------------------------- +static int JM_norm_rotation(int rotate) +{ + while (rotate < 0) rotate += 360; + while (rotate >= 360) rotate -= 360; + if (rotate % 90 != 0) return 0; + return rotate; +} + + +//---------------------------------------------------------------------------- +// return a PDF page's /Rotate value: one of (0, 90, 180, 270) +//---------------------------------------------------------------------------- +static int JM_page_rotation(mupdf::PdfPage& page) +{ + int rotate = 0; + rotate = mupdf::pdf_to_int( + mupdf::pdf_dict_get_inheritable(page.obj(), PDF_NAME2(Rotate)) + ); + rotate = JM_norm_rotation(rotate); + return rotate; +} + + +//---------------------------------------------------------------------------- +// return a PDF page's MediaBox +//---------------------------------------------------------------------------- +static mupdf::FzRect JM_mediabox(mupdf::PdfObj& page_obj) +{ + mupdf::FzRect mediabox = mupdf::pdf_to_rect( + mupdf::pdf_dict_get_inheritable(page_obj, PDF_NAME2(MediaBox)) + ); + if (mupdf::fz_is_empty_rect(mediabox) || mupdf::fz_is_infinite_rect(mediabox)) + { + mediabox.x0 = 0; + mediabox.y0 = 0; + mediabox.x1 = 612; + mediabox.y1 = 792; + } + mupdf::FzRect page_mediabox; + page_mediabox.x0 = mupdf::fz_min(mediabox.x0, mediabox.x1); + page_mediabox.y0 = mupdf::fz_min(mediabox.y0, mediabox.y1); + page_mediabox.x1 = mupdf::fz_max(mediabox.x0, mediabox.x1); + page_mediabox.y1 = mupdf::fz_max(mediabox.y0, mediabox.y1); + if (0 + || page_mediabox.x1 - page_mediabox.x0 < 1 + || page_mediabox.y1 - page_mediabox.y0 < 1 + ) + { + page_mediabox = *mupdf::FzRect(mupdf::FzRect::Fixed_UNIT).internal(); //fz_unit_rect; + } + return page_mediabox; +} + + +//---------------------------------------------------------------------------- +// return a PDF page's CropBox +//---------------------------------------------------------------------------- +mupdf::FzRect JM_cropbox(mupdf::PdfObj& page_obj) +{ + mupdf::FzRect mediabox = JM_mediabox(page_obj); + mupdf::FzRect cropbox = mupdf::pdf_to_rect( + mupdf::pdf_dict_get_inheritable(page_obj, PDF_NAME2(CropBox)) + ); + if (mupdf::fz_is_infinite_rect(cropbox) || mupdf::fz_is_empty_rect(cropbox)) + { + cropbox = mediabox; + } + float y0 = mediabox.y1 - cropbox.y1; + float y1 = mediabox.y1 - cropbox.y0; + cropbox.y0 = y0; + cropbox.y1 = y1; + return cropbox; +} + + +//---------------------------------------------------------------------------- +// calculate width and height of the UNROTATED page +//---------------------------------------------------------------------------- +static mupdf::FzPoint JM_cropbox_size(mupdf::PdfObj& 
page_obj) +{ + mupdf::FzPoint size; + mupdf::FzRect rect = JM_cropbox(page_obj); + float w = (rect.x0 < rect.x1) ? rect.x1 - rect.x0 : rect.x0 - rect.x1; + float h = (rect.y0 < rect.y1) ? rect.y1 - rect.y0 : rect.y0 - rect.y1; + size = fz_make_point(w, h); + return size; +} + + +//---------------------------------------------------------------------------- +// calculate page rotation matrices +//---------------------------------------------------------------------------- +static mupdf::FzMatrix JM_rotate_page_matrix(mupdf::PdfPage& page) +{ + if (!page.m_internal) + { + return *mupdf::FzMatrix().internal(); // no valid pdf page given + } + int rotation = JM_page_rotation(page); + if (rotation == 0) + { + return *mupdf::FzMatrix().internal(); // no rotation + } + auto po = page.obj(); + mupdf::FzPoint cb_size = JM_cropbox_size(po); + float w = cb_size.x; + float h = cb_size.y; + mupdf::FzMatrix m; + if (rotation == 90) + { + m = mupdf::fz_make_matrix(0, 1, -1, 0, h, 0); + } + else if (rotation == 180) + { + m = mupdf::fz_make_matrix(-1, 0, 0, -1, w, h); + } + else + { + m = mupdf::fz_make_matrix(0, -1, 1, 0, 0, w); + } + return m; +} + + +static mupdf::FzMatrix JM_derotate_page_matrix(mupdf::PdfPage& page) +{ // just the inverse of rotation + return mupdf::fz_invert_matrix(JM_rotate_page_matrix(page)); +} + +//----------------------------------------------------------------------------- +// PySequence from fz_matrix +//----------------------------------------------------------------------------- +static PyObject* JM_py_from_matrix(mupdf::FzMatrix m) +{ + return Py_BuildValue("ffffff", m.a, m.b, m.c, m.d, m.e, m.f); +} + +static mupdf::FzMatrix Page_derotate_matrix(mupdf::PdfPage& pdfpage) +{ + if (!pdfpage.m_internal) + { + return mupdf::FzMatrix(); + } + return JM_derotate_page_matrix(pdfpage); +} + +static mupdf::FzMatrix Page_derotate_matrix(mupdf::FzPage& page) +{ + mupdf::PdfPage pdf_page = mupdf::pdf_page_from_fz_page(page); + return Page_derotate_matrix(pdf_page); +} + + +static PyObject *lll_JM_get_annot_xref_list(pdf_obj *page_obj) +{ + fz_context* ctx = mupdf::internal_context_get(); + PyObject *names = PyList_New(0); + pdf_obj *id, *subtype, *annots, *annot_obj; + int xref, type, i, n; + fz_try(ctx) { + annots = pdf_dict_get(ctx, page_obj, PDF_NAME(Annots)); + n = pdf_array_len(ctx, annots); + for (i = 0; i < n; i++) { + annot_obj = pdf_array_get(ctx, annots, i); + xref = pdf_to_num(ctx, annot_obj); + subtype = pdf_dict_get(ctx, annot_obj, PDF_NAME(Subtype)); + if (!subtype) { + continue; // subtype is required + } + type = pdf_annot_type_from_string(ctx, pdf_to_name(ctx, subtype)); + if (type == PDF_ANNOT_UNKNOWN) { + continue; // only accept valid annot types + } + id = pdf_dict_gets(ctx, annot_obj, "NM"); + LIST_APPEND_DROP(names, Py_BuildValue("iis", xref, type, pdf_to_text_string(ctx, id))); + } + } + fz_catch(ctx) { + return names; + } + return names; +} +//------------------------------------------------------------------------ +// return the xrefs and /NM ids of a page's annots, links and fields +//------------------------------------------------------------------------ +static PyObject* JM_get_annot_xref_list(const mupdf::PdfObj& page_obj) +{ + PyObject* names = PyList_New(0); + if (!page_obj.m_internal) + { + return names; + } + return lll_JM_get_annot_xref_list( page_obj.m_internal); +} + +static mupdf::FzBuffer JM_object_to_buffer(const mupdf::PdfObj& what, int compress, int ascii) +{ + mupdf::FzBuffer res = mupdf::fz_new_buffer(512); + mupdf::FzOutput out(res); + 
mupdf::pdf_print_obj(out, what, compress, ascii); + out.fz_close_output(); + mupdf::fz_terminate_buffer(res); + return res; +} + +static PyObject* JM_EscapeStrFromBuffer(mupdf::FzBuffer& buff) +{ + if (!buff.m_internal) + { + return PyUnicode_FromString(""); + } + unsigned char* s = nullptr; + size_t len = mupdf::fz_buffer_storage(buff, &s); + PyObject* val = PyUnicode_DecodeRawUnicodeEscape((const char*) s, (Py_ssize_t) len, "replace"); + if (!val) + { + val = PyUnicode_FromString(""); + PyErr_Clear(); + } + return val; +} + +static PyObject* xref_object(mupdf::PdfDocument& pdf, int xref, int compressed=0, int ascii=0) +{ + if (!pdf.m_internal) + { + throw std::runtime_error(MSG_IS_NO_PDF); + } + int xreflen = mupdf::pdf_xref_len(pdf); + if ((xref < 1 || xref >= xreflen) and xref != -1) + { + throw std::runtime_error(MSG_BAD_XREF); + } + mupdf::PdfObj obj = (xref > 0) ? mupdf::pdf_load_object(pdf, xref) : mupdf::pdf_trailer(pdf); + mupdf::FzBuffer res = JM_object_to_buffer(mupdf::pdf_resolve_indirect(obj), compressed, ascii); + PyObject* text = JM_EscapeStrFromBuffer(res); + return text; +} + +static PyObject* xref_object(mupdf::FzDocument& document, int xref, int compressed=0, int ascii=0) +{ + mupdf::PdfDocument pdf = mupdf::pdf_document_from_fz_document(document); + return xref_object(pdf, xref, compressed, ascii); +} + + +//------------------------------------- +// fz_output for Python file objects +//------------------------------------- + +static PyObject* Link_is_external(mupdf::FzLink& this_link) +{ + const char* uri = this_link.m_internal->uri; + if (!uri) + { + return PyBool_FromLong(0); + } + bool ret = mupdf::fz_is_external_link(uri); + return PyBool_FromLong((long) ret); +} + +static mupdf::FzLink Link_next(mupdf::FzLink& this_link) +{ + return this_link.next(); +} + + +//----------------------------------------------------------------------------- +// create PDF object from given string +//----------------------------------------------------------------------------- +static pdf_obj *lll_JM_pdf_obj_from_str(fz_context *ctx, pdf_document *doc, const char *src) +{ + pdf_obj *result = NULL; + pdf_lexbuf lexbuf; + fz_stream *stream = fz_open_memory(ctx, (unsigned char *)src, strlen(src)); + + pdf_lexbuf_init(ctx, &lexbuf, PDF_LEXBUF_SMALL); + + fz_try(ctx) { + result = pdf_parse_stm_obj(ctx, doc, stream, &lexbuf); + } + + fz_always(ctx) { + pdf_lexbuf_fin(ctx, &lexbuf); + fz_drop_stream(ctx, stream); + } + + fz_catch(ctx) { + mupdf::internal_throw_exception(ctx); + } + + return result; + +} + +/*********************************************************************/ +// Page._addAnnot_FromString +// Add new links provided as an array of string object definitions. 
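+// Each list entry must be the textual form of a PDF annotation
+// dictionary. An illustrative (not verbatim) example of such a string is
+// "<</Type/Annot/Subtype/Link/Rect[36 36 144 72]/A<</S/URI/URI(https://example.com)>>>>".
+// Every string is parsed with lll_JM_pdf_obj_from_str(), the resulting
+// object is added to the document with pdf_add_object_drop(), and an
+// indirect reference to it is pushed onto the page's /Annots array.
+// Malformed entries are skipped with a message rather than aborting.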
+/*********************************************************************/ +PyObject* Page_addAnnot_FromString(mupdf::PdfPage& page, PyObject* linklist) +{ + PyObject* txtpy = nullptr; + int lcount = (int) PySequence_Size(linklist); // link count + //printf("Page_addAnnot_FromString(): lcount=%i\n", lcount); + if (lcount < 1) + { + Py_RETURN_NONE; + } + try + { + // insert links from the provided sources + if (!page.m_internal) + { + throw std::runtime_error(MSG_IS_NO_PDF); + } + if (!mupdf::pdf_dict_get(page.obj(), PDF_NAME2(Annots)).m_internal) + { + mupdf::pdf_dict_put_array(page.obj(), PDF_NAME2(Annots), lcount); + } + mupdf::PdfObj annots = mupdf::pdf_dict_get(page.obj(), PDF_NAME2(Annots)); + mupdf::PdfDocument doc = page.doc(); + //printf("lcount=%i\n", lcount); + fz_context* ctx = mupdf::internal_context_get(); + for (int i = 0; i < lcount; i++) + { + const char* text = nullptr; + txtpy = PySequence_ITEM(linklist, (Py_ssize_t) i); + text = PyUnicode_AsUTF8(txtpy); + Py_CLEAR(txtpy); + if (!text) + { + messagef("skipping bad link / annot item %i.", i); + continue; + } + try + { + pdf_obj* obj = lll_JM_pdf_obj_from_str(ctx, doc.m_internal, text); + pdf_obj* annot = pdf_add_object_drop( + ctx, + doc.m_internal, + obj + ); + pdf_obj* ind_obj = pdf_new_indirect(ctx, doc.m_internal, pdf_to_num(ctx, annot), 0); + pdf_array_push_drop(ctx, annots.m_internal, ind_obj); + pdf_drop_obj(ctx, annot); + } + catch (std::exception&) + { + messagef("skipping bad link / annot item %i.", i); + } + } + } + catch (std::exception&) + { + PyErr_Clear(); + return nullptr; + } + Py_RETURN_NONE; +} + +PyObject* Page_addAnnot_FromString(mupdf::FzPage& page, PyObject* linklist) +{ + mupdf::PdfPage pdf_page = mupdf::pdf_page_from_fz_page(page); + return Page_addAnnot_FromString(pdf_page, linklist); +} + +static int page_count_fz2(void* document) +{ + mupdf::FzDocument* document2 = (mupdf::FzDocument*) document; + return mupdf::fz_count_pages(*document2); +} + +static int page_count_fz(mupdf::FzDocument& document) +{ + return mupdf::fz_count_pages(document); +} + +static int page_count_pdf(mupdf::PdfDocument& pdf) +{ + mupdf::FzDocument document = pdf.super(); + return page_count_fz(document); +} + +static int page_count(mupdf::FzDocument& document) +{ + return mupdf::fz_count_pages(document); +} + +static int page_count(mupdf::PdfDocument& pdf) +{ + mupdf::FzDocument document = pdf.super(); + return page_count(document); +} + +static PyObject* page_annot_xrefs(mupdf::FzDocument& document, mupdf::PdfDocument& pdf, int pno) +{ + int page_count = mupdf::fz_count_pages(document); + int n = pno; + while (n < 0) + { + n += page_count; + } + PyObject* annots = nullptr; + if (n >= page_count) + { + throw std::runtime_error(MSG_BAD_PAGENO); + } + if (!pdf.m_internal) + { + throw std::runtime_error(MSG_IS_NO_PDF); + } + annots = JM_get_annot_xref_list(mupdf::pdf_lookup_page_obj(pdf, n)); + return annots; +} + +static PyObject* page_annot_xrefs(mupdf::FzDocument& document, int pno) +{ + mupdf::PdfDocument pdf = mupdf::pdf_specifics(document); + return page_annot_xrefs(document, pdf, pno); +} + +static PyObject* page_annot_xrefs(mupdf::PdfDocument& pdf, int pno) +{ + mupdf::FzDocument document = pdf.super(); + return page_annot_xrefs(document, pdf, pno); +} + +static bool Outline_is_external(mupdf::FzOutline* outline) +{ + if (!outline->m_internal->uri) + { + return false; + } + return mupdf::ll_fz_is_external_link(outline->m_internal->uri); +} + +int ll_fz_absi(int i) +{ + return mupdf::ll_fz_absi(i); +} + +enum +{ + 
TEXT_FONT_SUPERSCRIPT = 1, + TEXT_FONT_ITALIC = 2, + TEXT_FONT_SERIFED = 4, + TEXT_FONT_MONOSPACED = 8, + TEXT_FONT_BOLD = 16, +}; + +int g_skip_quad_corrections = 0; +int g_subset_fontnames = 0; +int g_small_glyph_heights = 0; + +void set_skip_quad_corrections(int on) +{ + g_skip_quad_corrections = on; +} + +void set_subset_fontnames(int on) +{ + g_subset_fontnames = on; +} + +void set_small_glyph_heights(int on) +{ + g_small_glyph_heights = on; +} + +struct jm_lineart_device +{ + fz_device super; + + PyObject* out = {}; + PyObject* method = {}; + PyObject* pathdict = {}; + PyObject* scissors = {}; + float pathfactor = {}; + fz_matrix ctm = {}; + fz_matrix ptm = {}; + fz_matrix rot = {}; + fz_point lastpoint = {}; + fz_point firstpoint = {}; + int havemove = 0; + fz_rect pathrect = {}; + int clips = {}; + int linecount = {}; + float linewidth = {}; + int path_type = {}; + long depth = {}; + size_t seqno = {}; + char* layer_name; +}; + + +static void jm_lineart_drop_device(fz_context *ctx, fz_device *dev_) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (PyList_Check(dev->out)) { + Py_CLEAR(dev->out); + } + Py_CLEAR(dev->method); + Py_CLEAR(dev->scissors); + mupdf::ll_fz_free(dev->layer_name); + dev->layer_name = nullptr; +} + +typedef jm_lineart_device jm_tracedraw_device; + +// need own versions of ascender / descender +static float JM_font_ascender(fz_font* font) +{ + if (g_skip_quad_corrections) + { + return 0.8f; + } + return mupdf::ll_fz_font_ascender(font); +} + +static float JM_font_descender(fz_font* font) +{ + if (g_skip_quad_corrections) + { + return -0.2f; + } + return mupdf::ll_fz_font_descender(font); +} + + +//---------------------------------------------------------------- +// Return true if character is considered to be a word delimiter +//---------------------------------------------------------------- +static int +JM_is_word_delimiter(int c, PyObject *delimiters) +{ + if (c <= 32 || c == 160) return 1; // a standard delimiter + if (0x202a <= c && c <= 0x202e) + { + return 1; // change between writing directions + } + + // extra delimiters must be a non-empty sequence + if (!delimiters || PyObject_Not(delimiters) || !PySequence_Check(delimiters)) { + return 0; + } + + // convert to tuple for easier looping + PyObject *delims = PySequence_Tuple(delimiters); + if (!delims) { + PyErr_Clear(); + return 0; + } + + // Make 1-char PyObject from character given as integer + PyObject *cchar = Py_BuildValue("C", c); // single character PyObject + Py_ssize_t i, len = PyTuple_Size(delims); + for (i = 0; i < len; i++) { + int rc = PyUnicode_Compare(cchar, PyTuple_GET_ITEM(delims, i)); + if (rc == 0) { // equal to a delimiter character + Py_DECREF(cchar); + Py_DECREF(delims); + PyErr_Clear(); + return 1; + } + } + + Py_DECREF(delims); + PyErr_Clear(); + return 0; +} + +static int +JM_is_rtl_char(int c) +{ + if (c < 0x590 || c > 0x900) return 0; + return 1; +} + +static const char* JM_font_name(fz_font* font) +{ + const char* name = mupdf::ll_fz_font_name(font); + const char* s = strchr(name, '+'); + if (g_subset_fontnames || !s || s-name != 6) + { + return name; + } + return s + 1; +} + +static int detect_super_script(fz_stext_line *line, fz_stext_char *ch) +{ + if (line->wmode == 0 && line->dir.x == 1 && line->dir.y == 0) + { + return ch->origin.y < line->first_char->origin.y - ch->size * 0.1f; + } + return 0; +} + +static int JM_char_font_flags(fz_font *font, fz_stext_line *line, fz_stext_char *ch) +{ + int flags = 0; + if (line && ch) + { + flags += 
detect_super_script(line, ch) * TEXT_FONT_SUPERSCRIPT; + } + flags += mupdf::ll_fz_font_is_italic(font) * TEXT_FONT_ITALIC; + flags += mupdf::ll_fz_font_is_serif(font) * TEXT_FONT_SERIFED; + flags += mupdf::ll_fz_font_is_monospaced(font) * TEXT_FONT_MONOSPACED; + flags += mupdf::ll_fz_font_is_bold(font) * TEXT_FONT_BOLD; + return flags; +} + +static void jm_trace_text_span( + jm_tracedraw_device* dev, + fz_text_span* span, + int type, + fz_matrix ctm, + fz_colorspace* colorspace, + const float* color, + float alpha, + size_t seqno + ) +{ + //printf("extra.jm_trace_text_span(): seqno=%zi\n", seqno); + //fz_matrix join = mupdf::ll_fz_concat(span->trm, ctm); + //double fsize = sqrt(fabs((double) span->trm.a * (double) span->trm.d)); + fz_matrix mat = mupdf::ll_fz_concat(span->trm, ctm); // text transformation matrix + fz_point dir = mupdf::ll_fz_transform_vector(mupdf::ll_fz_make_point(1, 0), mat); // writing direction + double fsize = sqrt(dir.x * dir.x + dir.y * dir.y); // font size + + dir = mupdf::ll_fz_normalize_vector(dir); + + // compute effective ascender / descender + double asc = (double) JM_font_ascender(span->font); + double dsc = (double) JM_font_descender(span->font); + if (asc < 1e-3) { // probably Tesseract font + dsc = -0.1; + asc = 0.9; + } + + double ascsize = asc * fsize / (asc - dsc); + double dscsize = dsc * fsize / (asc - dsc); + int fflags = 0; // font flags + int mono = mupdf::ll_fz_font_is_monospaced(span->font); + fflags += mono * TEXT_FONT_MONOSPACED; + fflags += mupdf::ll_fz_font_is_italic(span->font) * TEXT_FONT_ITALIC; + fflags += mupdf::ll_fz_font_is_serif(span->font) * TEXT_FONT_SERIFED; + fflags += mupdf::ll_fz_font_is_bold(span->font) * TEXT_FONT_BOLD; + + // walk through characters of span + fz_matrix rot = mupdf::ll_fz_make_matrix(dir.x, dir.y, -dir.y, dir.x, 0, 0); + if (dir.x == -1) + { + // left-right flip + rot.d = 1; + } + PyObject* chars = PyTuple_New(span->len); + double space_adv = 0; + double last_adv = 0; + fz_rect span_bbox; + + for (int i = 0; i < span->len; i++) + { + double adv = 0; + if (span->items[i].gid >= 0) + { + adv = (double) mupdf::ll_fz_advance_glyph(span->font, span->items[i].gid, span->wmode); + } + adv *= fsize; + last_adv = adv; + if (span->items[i].ucs == 32) + { + space_adv = adv; + } + fz_point char_orig; + char_orig = fz_make_point(span->items[i].x, span->items[i].y); + char_orig = fz_transform_point(char_orig, ctm); + fz_matrix m1 = mupdf::ll_fz_make_matrix(1, 0, 0, 1, -char_orig.x, -char_orig.y); + m1 = mupdf::ll_fz_concat(m1, rot); + m1 = mupdf::ll_fz_concat(m1, mupdf::ll_fz_make_matrix(1, 0, 0, 1, char_orig.x, char_orig.y)); + float x0 = char_orig.x; + float x1 = x0 + adv; + float y0; + float y1; + if ( + (mat.d > 0 && (dir.x == 1 || dir.x == -1)) + || + (mat.b !=0 && mat.b == -mat.c) + ) // up-down flip + { + // up-down flip + y0 = char_orig.y + dscsize; + y1 = char_orig.y + ascsize; + } + else + { + y0 = char_orig.y - ascsize; + y1 = char_orig.y - dscsize; + } + fz_rect char_bbox = mupdf::ll_fz_make_rect(x0, y0, x1, y1); + char_bbox = mupdf::ll_fz_transform_rect(char_bbox, m1); + PyTuple_SET_ITEM( + chars, + (Py_ssize_t) i, + Py_BuildValue( + "ii(ff)(ffff)", + span->items[i].ucs, + span->items[i].gid, + char_orig.x, + char_orig.y, + char_bbox.x0, + char_bbox.y0, + char_bbox.x1, + char_bbox.y1 + ) + ); + if (i > 0) + { + span_bbox = mupdf::ll_fz_union_rect(span_bbox, char_bbox); + } + else + { + span_bbox = char_bbox; + } + } + if (!space_adv) + { + if (!(fflags & TEXT_FONT_MONOSPACED)) + { + fz_font* out_font = 
nullptr; + space_adv = mupdf::ll_fz_advance_glyph( + span->font, + mupdf::ll_fz_encode_character_with_fallback(span->font, 32, 0, 0, &out_font), + span->wmode + ); + space_adv *= fsize; + if (!space_adv) + { + space_adv = last_adv; + } + } + else + { + space_adv = last_adv; // for mono any char width suffices + } + } + // make the span dictionary + PyObject* span_dict = PyDict_New(); + dict_setitemstr_drop(span_dict, "dir", JM_py_from_point(dir)); + dict_setitem_drop(span_dict, dictkey_font, JM_EscapeStrFromStr(JM_font_name(span->font))); + dict_setitem_drop(span_dict, dictkey_wmode, PyLong_FromLong((long) span->wmode)); + dict_setitem_drop(span_dict, dictkey_flags, PyLong_FromLong((long) fflags)); + dict_setitemstr_drop(span_dict, "bidi_lvl", PyLong_FromLong((long) span->bidi_level)); + dict_setitemstr_drop(span_dict, "bidi_dir", PyLong_FromLong((long) span->markup_dir)); + dict_setitem_drop(span_dict, dictkey_ascender, PyFloat_FromDouble(asc)); + dict_setitem_drop(span_dict, dictkey_descender, PyFloat_FromDouble(dsc)); + dict_setitem_drop(span_dict, dictkey_colorspace, PyLong_FromLong(3)); + float rgb[3]; + if (colorspace) + { + mupdf::ll_fz_convert_color( + colorspace, + color, + mupdf::ll_fz_device_rgb(), + rgb, + nullptr, + fz_default_color_params + ); + } + else + { + rgb[0] = rgb[1] = rgb[2] = 0; + } + double linewidth; + if (dev->linewidth > 0) // width of character border + { + linewidth = (double) dev->linewidth; + } + else + { + linewidth = fsize * 0.05; // default: 5% of font size + } + if (0) std::cout + << " dev->linewidth=" << dev->linewidth + << " fsize=" << fsize + << " linewidth=" << linewidth + << "\n"; + dict_setitem_drop(span_dict, dictkey_color, Py_BuildValue("fff", rgb[0], rgb[1], rgb[2])); + dict_setitem_drop(span_dict, dictkey_size, PyFloat_FromDouble(fsize)); + dict_setitemstr_drop(span_dict, "opacity", PyFloat_FromDouble((double) alpha)); + dict_setitemstr_drop(span_dict, "linewidth", PyFloat_FromDouble((double) linewidth)); + dict_setitemstr_drop(span_dict, "spacewidth", PyFloat_FromDouble(space_adv)); + dict_setitem_drop(span_dict, dictkey_type, PyLong_FromLong((long) type)); + dict_setitem_drop(span_dict, dictkey_bbox, JM_py_from_rect(span_bbox)); + dict_setitemstr_drop(span_dict, "layer", JM_UnicodeFromStr(dev->layer_name)); + dict_setitemstr_drop(span_dict, "seqno", PyLong_FromSize_t(seqno)); + dict_setitem_drop(span_dict, dictkey_chars, chars); + //std::cout << "span_dict=" << repr(span_dict) << "\n"; + s_list_append_drop(dev->out, span_dict); +} + +static inline void jm_increase_seqno(fz_context* ctx, fz_device* dev_) +{ + jm_tracedraw_device* dev = (jm_tracedraw_device*) dev_; + dev->seqno += 1; +} + +static void jm_fill_path( + fz_context* ctx, + fz_device* dev, + const fz_path*, + int even_odd, + fz_matrix, + fz_colorspace*, + const float* color, + float alpha, + fz_color_params + ) +{ + jm_increase_seqno(ctx, dev); +} + +static void jm_fill_shade( + fz_context* ctx, + fz_device* dev, + fz_shade* shd, + fz_matrix ctm, + float alpha, + fz_color_params color_params + ) +{ + jm_increase_seqno(ctx, dev); +} + +static void jm_fill_image( + fz_context* ctx, + fz_device* dev, + fz_image* img, + fz_matrix ctm, + float alpha, + fz_color_params color_params + ) +{ + jm_increase_seqno(ctx, dev); +} + +static void jm_fill_image_mask( + fz_context* ctx, + fz_device* dev, + fz_image* img, + fz_matrix ctm, + fz_colorspace* cs, + const float* color, + float alpha, + fz_color_params color_params + ) +{ + jm_increase_seqno(ctx, dev); +} + +static void jm_dev_linewidth( + 
fz_context* ctx, + fz_device* dev_, + const fz_path* path, + const fz_stroke_state* stroke, + fz_matrix ctm, + fz_colorspace* colorspace, + const float* color, + float alpha, + fz_color_params color_params + ) +{ + jm_tracedraw_device* dev = (jm_tracedraw_device*) dev_; + if (0) std::cout << "jm_dev_linewidth(): changing dev->linewidth from " << dev->linewidth + << " to stroke->linewidth=" << stroke->linewidth + << "\n"; + dev->linewidth = stroke->linewidth; + jm_increase_seqno(ctx, dev_); +} + +static void jm_trace_text( + jm_tracedraw_device* dev, + const fz_text* text, + int type, + fz_matrix ctm, + fz_colorspace* colorspace, + const float* color, + float alpha, + size_t seqno + ) +{ + fz_text_span* span; + for (span = text->head; span; span = span->next) + { + jm_trace_text_span(dev, span, type, ctm, colorspace, color, alpha, seqno); + } +} + +/*--------------------------------------------------------- +There are 3 text trace types: +0 - fill text (PDF Tr 0) +1 - stroke text (PDF Tr 1) +3 - ignore text (PDF Tr 3) +---------------------------------------------------------*/ +static void +jm_tracedraw_fill_text( + fz_context* ctx, + fz_device* dev_, + const fz_text* text, + fz_matrix ctm, + fz_colorspace* colorspace, + const float* color, + float alpha, + fz_color_params color_params + ) +{ + jm_tracedraw_device* dev = (jm_tracedraw_device*) dev_; + jm_trace_text(dev, text, 0, ctm, colorspace, color, alpha, dev->seqno); + dev->seqno += 1; +} + +static void +jm_tracedraw_stroke_text( + fz_context* ctx, + fz_device* dev_, + const fz_text* text, + const fz_stroke_state* stroke, + fz_matrix ctm, + fz_colorspace* colorspace, + const float* color, + float alpha, + fz_color_params color_params + ) +{ + jm_tracedraw_device* dev = (jm_tracedraw_device*) dev_; + jm_trace_text(dev, text, 1, ctm, colorspace, color, alpha, dev->seqno); + dev->seqno += 1; +} + + +static void +jm_tracedraw_ignore_text( + fz_context* ctx, + fz_device* dev_, + const fz_text* text, + fz_matrix ctm + ) +{ + jm_tracedraw_device* dev = (jm_tracedraw_device*) dev_; + jm_trace_text(dev, text, 3, ctm, nullptr, nullptr, 1, dev->seqno); + dev->seqno += 1; +} + +static void +jm_lineart_begin_layer(fz_context *ctx, fz_device *dev_, const char *name) +{ + jm_tracedraw_device* dev = (jm_tracedraw_device*) dev_; + mupdf::ll_fz_free(dev->layer_name); + dev->layer_name = mupdf::ll_fz_strdup(name); +} + +static void +jm_lineart_end_layer(fz_context *ctx, fz_device *dev_) +{ + jm_tracedraw_device* dev = (jm_tracedraw_device*) dev_; + mupdf::ll_fz_free(dev->layer_name); + dev->layer_name = nullptr; +} + + +mupdf::FzDevice JM_new_texttrace_device(PyObject* out) +{ + mupdf::FzDevice device(sizeof(jm_tracedraw_device)); + jm_tracedraw_device* dev = (jm_tracedraw_device*) device.m_internal; + + dev->super.close_device = nullptr; + dev->super.drop_device = jm_lineart_drop_device; + dev->super.fill_path = jm_fill_path; + dev->super.stroke_path = jm_dev_linewidth; + dev->super.clip_path = nullptr; + dev->super.clip_stroke_path = nullptr; + + dev->super.fill_text = jm_tracedraw_fill_text; + dev->super.stroke_text = jm_tracedraw_stroke_text; + dev->super.clip_text = nullptr; + dev->super.clip_stroke_text = nullptr; + dev->super.ignore_text = jm_tracedraw_ignore_text; + + dev->super.fill_shade = jm_fill_shade; + dev->super.fill_image = jm_fill_image; + dev->super.fill_image_mask = jm_fill_image_mask; + dev->super.clip_image_mask = nullptr; + + dev->super.pop_clip = nullptr; + + dev->super.begin_mask = nullptr; + dev->super.end_mask = nullptr; + 
dev->super.begin_group = nullptr; + dev->super.end_group = nullptr; + + dev->super.begin_tile = nullptr; + dev->super.end_tile = nullptr; + + dev->super.begin_layer = jm_lineart_begin_layer; + dev->super.end_layer = jm_lineart_end_layer; + + dev->super.begin_structure = nullptr; + dev->super.end_structure = nullptr; + + dev->super.begin_metatext = nullptr; + dev->super.end_metatext = nullptr; + + dev->super.render_flags = nullptr; + dev->super.set_default_colorspaces = nullptr; + + Py_XINCREF(out); + dev->out = out; + dev->seqno = 0; + return device; +} + + +static fz_quad +JM_char_quad(fz_stext_line *line, fz_stext_char *ch) +{ + if (g_skip_quad_corrections) { // no special handling + return ch->quad; + } + if (line->wmode) { // never touch vertical write mode + return ch->quad; + } + fz_font *font = ch->font; + float asc = JM_font_ascender(font); + float dsc = JM_font_descender(font); + float c, s, fsize = ch->size; + float asc_dsc = asc - dsc + FLT_EPSILON; + if (asc_dsc >= 1 && g_small_glyph_heights == 0) { // no problem + return ch->quad; + } + if (asc < 1e-3) { // probably Tesseract glyphless font + dsc = -0.1f; + asc = 0.9f; + asc_dsc = 1.0f; + } + + if (g_small_glyph_heights || asc_dsc < 1) { + dsc = dsc / asc_dsc; + asc = asc / asc_dsc; + } + asc_dsc = asc - dsc; + asc = asc * fsize / asc_dsc; + dsc = dsc * fsize / asc_dsc; + + /* ------------------------------ + Re-compute quad with the adjusted ascender / descender values: + Move ch->origin to (0,0) and de-rotate quad, then adjust the corners, + re-rotate and move back to ch->origin location. + ------------------------------ */ + fz_matrix trm1, trm2, xlate1, xlate2; + fz_quad quad; + c = line->dir.x; // cosine + s = line->dir.y; // sine + trm1 = mupdf::ll_fz_make_matrix(c, -s, s, c, 0, 0); // derotate + trm2 = mupdf::ll_fz_make_matrix(c, s, -s, c, 0, 0); // rotate + if (c == -1) { // left-right flip + trm1.d = 1; + trm2.d = 1; + } + xlate1 = mupdf::ll_fz_make_matrix(1, 0, 0, 1, -ch->origin.x, -ch->origin.y); + xlate2 = mupdf::ll_fz_make_matrix(1, 0, 0, 1, ch->origin.x, ch->origin.y); + + quad = mupdf::ll_fz_transform_quad(ch->quad, xlate1); // move origin to (0,0) + quad = mupdf::ll_fz_transform_quad(quad, trm1); // de-rotate corners + + // adjust vertical coordinates + if (c == 1 && quad.ul.y > 0) { // up-down flip + quad.ul.y = asc; + quad.ur.y = asc; + quad.ll.y = dsc; + quad.lr.y = dsc; + } else { + quad.ul.y = -asc; + quad.ur.y = -asc; + quad.ll.y = -dsc; + quad.lr.y = -dsc; + } + + // adjust horizontal coordinates that are too crazy: + // (1) left x must be >= 0 + // (2) if bbox width is 0, lookup char advance in font. 
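+ // (quad.ll / quad.ul are the left corners after de-rotation; clamping
+ // their x to 0 and widening zero-width glyphs by the font advance
+ // keeps the resulting bbox usable for text extraction.)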
+ if (quad.ll.x < 0) { + quad.ll.x = 0; + quad.ul.x = 0; + } + float cwidth = quad.lr.x - quad.ll.x; + if (cwidth < FLT_EPSILON) { + int glyph = mupdf::ll_fz_encode_character( font, ch->c); + if (glyph) { + float fwidth = mupdf::ll_fz_advance_glyph( font, glyph, line->wmode); + quad.lr.x = quad.ll.x + fwidth * fsize; + quad.ur.x = quad.lr.x; + } + } + + quad = mupdf::ll_fz_transform_quad(quad, trm2); // rotate back + quad = mupdf::ll_fz_transform_quad(quad, xlate2); // translate back + return quad; +} + + +static fz_rect JM_char_bbox(fz_stext_line* line, fz_stext_char* ch) +{ + fz_rect r = mupdf::ll_fz_rect_from_quad(JM_char_quad( line, ch)); + if (!line->wmode) { + return r; + } + if (r.y1 < r.y0 + ch->size) { + r.y0 = r.y1 - ch->size; + } + return r; +} + +fz_rect JM_char_bbox(const mupdf::FzStextLine& line, const mupdf::FzStextChar& ch) +{ + return JM_char_bbox( line.m_internal, ch.m_internal); +} + +static int JM_rects_overlap(const fz_rect a, const fz_rect b) +{ + if (0 + || a.x0 >= b.x1 + || a.y0 >= b.y1 + || a.x1 <= b.x0 + || a.y1 <= b.y0 + ) + return 0; + return 1; +} + +// +void JM_append_rune(fz_buffer *buff, int ch); + +//----------------------------------------------------------------------------- +// Plain text output. An identical copy of fz_print_stext_page_as_text, +// but lines within a block are concatenated by space instead a new-line +// character (which else leads to 2 new-lines). +//----------------------------------------------------------------------------- +void JM_print_stext_page_as_text(mupdf::FzBuffer& res, mupdf::FzStextPage& page) +{ + fz_rect rect = page.m_internal->mediabox; + + for (auto block: page) + { + if (block.m_internal->type == FZ_STEXT_BLOCK_TEXT) + { + for (auto line: block) + { + int last_char = 0; + for (auto ch: line) + { + fz_rect chbbox = JM_char_bbox( line, ch); + if (mupdf::ll_fz_is_infinite_rect(rect) + || JM_rects_overlap(rect, chbbox) + ) + { + last_char = ch.m_internal->c; + JM_append_rune(res.m_internal, last_char); + } + } + if (last_char != 10 && last_char > 0) + { + mupdf::ll_fz_append_string(res.m_internal, "\n"); + } + } + } + } +} + + + +// path_type is one of: +#define FILL_PATH 1 +#define STROKE_PATH 2 +#define CLIP_PATH 3 +#define CLIP_STROKE_PATH 4 + +// Every scissor of a clip is a sub rectangle of the preceding clip scissor if +// the clip level is larger. +static fz_rect compute_scissor(jm_lineart_device *dev) +{ + PyObject *last_scissor = NULL; + fz_rect scissor; + if (!dev->scissors) { + dev->scissors = PyList_New(0); + } + Py_ssize_t num_scissors = PyList_Size(dev->scissors); + if (num_scissors > 0) { + last_scissor = PyList_GET_ITEM(dev->scissors, num_scissors-1); + scissor = JM_rect_from_py(last_scissor); + scissor = fz_intersect_rect(scissor, dev->pathrect); + } else { + scissor = dev->pathrect; + } + LIST_APPEND_DROP(dev->scissors, JM_py_from_rect(scissor)); + return scissor; +} + + +/* +-------------------------------------------------------------------------- +Check whether the last 4 lines represent a quad. +Because of how we count, the lines are a polyline already, i.e. last point +of a line equals 1st point of next line. +So we check for a polygon (last line's end point equals start point). +If not true we return 0. 
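+For example, four consecutive "l" items A->B, B->C, C->D, D->A form a
+closed polygon; their start points A, B, C, D (read from items[-4:])
+then become the quad corners A = ul, B = ll, C = lr, D = ur, following
+the float-array layout documented inside the function.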
+-------------------------------------------------------------------------- +*/ +static int +jm_checkquad(jm_lineart_device* dev) +{ + PyObject *items = PyDict_GetItem(dev->pathdict, dictkey_items); + Py_ssize_t i, len = PyList_Size(items); + float f[8]; // coordinates of the 4 corners + mupdf::FzPoint temp, lp; // line = (temp, lp) + PyObject *rect; + PyObject *line; + // fill the 8 floats in f, start from items[-4:] + for (i = 0; i < 4; i++) { // store line start points + line = PyList_GET_ITEM(items, len - 4 + i); + temp = JM_point_from_py(PyTuple_GET_ITEM(line, 1)); + f[i * 2] = temp.x; + f[i * 2 + 1] = temp.y; + lp = JM_point_from_py(PyTuple_GET_ITEM(line, 2)); + } + if (lp.x != f[0] || lp.y != f[1]) { + // not a polygon! + //dev_linecount -= 1; + return 0; + } + + // we have detected a quad + dev->linecount = 0; // reset this + // a quad item is ("qu", (ul, ur, ll, lr)), where the tuple items + // are pairs of floats representing a quad corner each. + rect = PyTuple_New(2); + PyTuple_SET_ITEM(rect, 0, PyUnicode_FromString("qu")); + /* ---------------------------------------------------- + * relationship of float array to quad points: + * (0, 1) = ul, (2, 3) = ll, (6, 7) = ur, (4, 5) = lr + ---------------------------------------------------- */ + fz_quad q = fz_make_quad(f[0], f[1], f[6], f[7], f[2], f[3], f[4], f[5]); + PyTuple_SET_ITEM(rect, 1, JM_py_from_quad(q)); + PyList_SetItem(items, len - 4, rect); // replace item -4 by rect + PyList_SetSlice(items, len - 3, len, NULL); // delete remaining 3 items + return 1; +} + + +/* +-------------------------------------------------------------------------- +Check whether the last 3 path items represent a rectangle. +Line 1 and 3 must be horizontal, line 2 must be vertical. +Returns 1 if we have modified the path, otherwise 0. +-------------------------------------------------------------------------- +*/ +static int +jm_checkrect(jm_lineart_device* dev) +{ + dev->linecount = 0; // reset line count + long orientation = 0; // area orientation of rectangle + mupdf::FzPoint ll, lr, ur, ul; + mupdf::FzRect r; + PyObject *rect; + PyObject *line0, *line2; + PyObject *items = PyDict_GetItem(dev->pathdict, dictkey_items); + Py_ssize_t len = PyList_Size(items); + + line0 = PyList_GET_ITEM(items, len - 3); + ll = JM_point_from_py(PyTuple_GET_ITEM(line0, 1)); + lr = JM_point_from_py(PyTuple_GET_ITEM(line0, 2)); + // no need to extract "line1"! + line2 = PyList_GET_ITEM(items, len - 1); + ur = JM_point_from_py(PyTuple_GET_ITEM(line2, 1)); + ul = JM_point_from_py(PyTuple_GET_ITEM(line2, 2)); + + /* + --------------------------------------------------------------------- + Assumption: + When decomposing rects, MuPDF always starts with a horizontal line, + followed by a vertical line, followed by a horizontal line. + First line: (ll, lr), third line: (ul, ur). + If 1st line is below 3rd line, we record anti-clockwise (+1), else + clockwise (-1) orientation. + --------------------------------------------------------------------- + */ + if (ll.y != lr.y || + ll.x != ul.x || + ur.y != ul.y || + ur.x != lr.x) { + goto drop_out; // not a rectangle + } + + // we have a rect, replace last 3 "l" items by one "re" item. 
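+ // The replacement item is ("re", (x0, y0, x1, y1), orientation), with
+ // orientation +1 for the anti-clockwise and -1 for the clockwise case
+ // described above.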
+ if (ul.y < lr.y) { + r = fz_make_rect(ul.x, ul.y, lr.x, lr.y); + orientation = 1; + } else { + r = fz_make_rect(ll.x, ll.y, ur.x, ur.y); + orientation = -1; + } + rect = PyTuple_New(3); + PyTuple_SET_ITEM(rect, 0, PyUnicode_FromString("re")); + PyTuple_SET_ITEM(rect, 1, JM_py_from_rect(r)); + PyTuple_SET_ITEM(rect, 2, PyLong_FromLong(orientation)); + PyList_SetItem(items, len - 3, rect); // replace item -3 by rect + PyList_SetSlice(items, len - 2, len, NULL); // delete remaining 2 items + return 1; + drop_out:; + return 0; +} + +static PyObject * +jm_lineart_color(fz_colorspace *colorspace, const float *color) +{ + float rgb[3]; + if (colorspace) { + mupdf::ll_fz_convert_color(colorspace, color, mupdf::ll_fz_device_rgb(), + rgb, NULL, fz_default_color_params); + return Py_BuildValue("fff", rgb[0], rgb[1], rgb[2]); + } + return PyTuple_New(0); +} + +static void +trace_moveto(fz_context *ctx, void *dev_, float x, float y) +{ + jm_lineart_device* dev = (jm_lineart_device*) dev_; + dev->lastpoint = mupdf::ll_fz_transform_point(fz_make_point(x, y), dev->ctm); + if (mupdf::ll_fz_is_infinite_rect(dev->pathrect)) + { + dev->pathrect = mupdf::ll_fz_make_rect( + dev->lastpoint.x, + dev->lastpoint.y, + dev->lastpoint.x, + dev->lastpoint.y + ); + } + dev->firstpoint = dev->lastpoint; + dev->havemove = 1; + dev->linecount = 0; // reset # of consec. lines +} + +static void +trace_lineto(fz_context *ctx, void *dev_, float x, float y) +{ + jm_lineart_device* dev = (jm_lineart_device*) dev_; + fz_point p1 = fz_transform_point(fz_make_point(x, y), dev->ctm); + dev->pathrect = fz_include_point_in_rect(dev->pathrect, p1); + PyObject *list = PyTuple_New(3); + PyTuple_SET_ITEM(list, 0, PyUnicode_FromString("l")); + PyTuple_SET_ITEM(list, 1, JM_py_from_point(dev->lastpoint)); + PyTuple_SET_ITEM(list, 2, JM_py_from_point(p1)); + dev->lastpoint = p1; + PyObject *items = PyDict_GetItem(dev->pathdict, dictkey_items); + LIST_APPEND_DROP(items, list); + dev->linecount += 1; // counts consecutive lines + if (dev->linecount == 4 && dev->path_type != FILL_PATH) { // shrink to "re" or "qu" item + jm_checkquad(dev); + } +} + +static void +trace_curveto(fz_context *ctx, void *dev_, float x1, float y1, float x2, float y2, float x3, float y3) +{ + jm_lineart_device* dev = (jm_lineart_device*) dev_; + dev->linecount = 0; // reset # of consec. lines + fz_point p1 = fz_make_point(x1, y1); + fz_point p2 = fz_make_point(x2, y2); + fz_point p3 = fz_make_point(x3, y3); + p1 = fz_transform_point(p1, dev->ctm); + p2 = fz_transform_point(p2, dev->ctm); + p3 = fz_transform_point(p3, dev->ctm); + dev->pathrect = fz_include_point_in_rect(dev->pathrect, p1); + dev->pathrect = fz_include_point_in_rect(dev->pathrect, p2); + dev->pathrect = fz_include_point_in_rect(dev->pathrect, p3); + + PyObject *list = PyTuple_New(5); + PyTuple_SET_ITEM(list, 0, PyUnicode_FromString("c")); + PyTuple_SET_ITEM(list, 1, JM_py_from_point(dev->lastpoint)); + PyTuple_SET_ITEM(list, 2, JM_py_from_point(p1)); + PyTuple_SET_ITEM(list, 3, JM_py_from_point(p2)); + PyTuple_SET_ITEM(list, 4, JM_py_from_point(p3)); + dev->lastpoint = p3; + PyObject *items = PyDict_GetItem(dev->pathdict, dictkey_items); + LIST_APPEND_DROP(items, list); +} + +static void +trace_close(fz_context *ctx, void *dev_) +{ + jm_lineart_device* dev = (jm_lineart_device*) dev_; + if (dev->linecount == 3) { + if (jm_checkrect(dev)) { + return; + } + } + dev->linecount = 0; // reset # of consec. 
lines + if (dev->havemove) { + if (dev->firstpoint.x != dev->lastpoint.x || dev->firstpoint.y != dev->lastpoint.y) { + PyObject *list = PyTuple_New(3); + PyTuple_SET_ITEM(list, 0, PyUnicode_FromString("l")); + PyTuple_SET_ITEM(list, 1, JM_py_from_point(dev->lastpoint)); + PyTuple_SET_ITEM(list, 2, JM_py_from_point(dev->firstpoint)); + dev->lastpoint = dev->firstpoint; + PyObject *items = PyDict_GetItem(dev->pathdict, dictkey_items); + LIST_APPEND_DROP(items, list); + } + dev->havemove = 0; + DICT_SETITEMSTR_DROP(dev->pathdict, "closePath", JM_BOOL(0)); + } else { + DICT_SETITEMSTR_DROP(dev->pathdict, "closePath", JM_BOOL(1)); + } +} + +static const fz_path_walker trace_path_walker = + { + trace_moveto, + trace_lineto, + trace_curveto, + trace_close + }; + +/* +--------------------------------------------------------------------- +Create the "items" list of the path dictionary +* either create or empty the path dictionary +* reset the end point of the path +* reset count of consecutive lines +* invoke fz_walk_path(), which create the single items +* if no items detected, empty path dict again +--------------------------------------------------------------------- +*/ +static void +jm_lineart_path(jm_lineart_device *dev, const fz_path *path) +{ + dev->pathrect = fz_infinite_rect; + dev->linecount = 0; + dev->lastpoint = fz_make_point(0, 0); + dev->firstpoint = fz_make_point(0, 0); + if (dev->pathdict) { + Py_CLEAR(dev->pathdict); + } + dev->pathdict = PyDict_New(); + DICT_SETITEM_DROP(dev->pathdict, dictkey_items, PyList_New(0)); + mupdf::ll_fz_walk_path(path, &trace_path_walker, dev); + // Check if any items were added ... + if (!PyDict_GetItem(dev->pathdict, dictkey_items) || !PyList_Size(PyDict_GetItem(dev->pathdict, dictkey_items))) + { + Py_CLEAR(dev->pathdict); + } +} + +//--------------------------------------------------------------------------- +// Append current path to list or merge into last path of the list. +// (1) Append if first path, different item lists or not a 'stroke' version +// of previous path +// (2) If new path has the same items, merge its content into previous path +// and change path["type"] to "fs". +// (3) If "out" is callable, skip the previous and pass dictionary to it. +//--------------------------------------------------------------------------- +static void +// todo: remove `method` arg - it is dev->method. +jm_append_merge(jm_lineart_device *dev) +{ + Py_ssize_t len; + int rc; + PyObject *prev; + PyObject *previtems; + PyObject *thisitems; + const char *thistype; + const char *prevtype; + if (PyCallable_Check(dev->out) || dev->method != Py_None) { // function or method + goto callback; + } + len = PyList_Size(dev->out); // len of output list so far + if (len == 0) { // always append first path + goto append; + } + thistype = PyUnicode_AsUTF8(PyDict_GetItem(dev->pathdict, dictkey_type)); + if (strcmp(thistype, "s") != 0) { // if not stroke, then append + goto append; + } + prev = PyList_GET_ITEM(dev->out, len - 1); // get prev path + prevtype = PyUnicode_AsUTF8(PyDict_GetItem(prev, dictkey_type)); + if (strcmp(prevtype, "f") != 0) { // if previous not fill, append + goto append; + } + // last check: there must be the same list of items for "f" and "s". 
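+ // If the item lists compare equal, PyDict_Merge() (no-override mode)
+ // copies the stroke-only keys (color, width, dashes, ...) into the
+ // previous fill path and its "type" becomes "fs"; otherwise the stroke
+ // path is appended as a separate dictionary.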
+ previtems = PyDict_GetItem(prev, dictkey_items); + thisitems = PyDict_GetItem(dev->pathdict, dictkey_items); + if (PyObject_RichCompareBool(previtems, thisitems, Py_NE)) { + goto append; + } + rc = PyDict_Merge(prev, dev->pathdict, 0); // merge, do not override + if (rc == 0) { + DICT_SETITEM_DROP(prev, dictkey_type, PyUnicode_FromString("fs")); + goto postappend; + } else { + messagef("could not merge stroke and fill path"); + goto append; + } + append:; + //printf("Appending to dev->out. len(dev->out)=%zi\n", PyList_Size(dev->out)); + PyList_Append(dev->out, dev->pathdict); + postappend:; + Py_CLEAR(dev->pathdict); + return; + + callback:; // callback function or method + PyObject *resp = NULL; + if (dev->method == Py_None) { + resp = PyObject_CallFunctionObjArgs(dev->out, dev->pathdict, NULL); + } else { + resp = PyObject_CallMethodObjArgs(dev->out, dev->method, dev->pathdict, NULL); + } + if (resp) { + Py_DECREF(resp); + } else { + messagef("calling cdrawings callback function/method failed!"); + PyErr_Clear(); + } + Py_CLEAR(dev->pathdict); + return; +} + +static void +jm_lineart_fill_path(fz_context *ctx, fz_device *dev_, const fz_path *path, + int even_odd, fz_matrix ctm, fz_colorspace *colorspace, + const float *color, float alpha, fz_color_params color_params) +{ + jm_lineart_device *dev = (jm_lineart_device *) dev_; + //printf("extra.jm_lineart_fill_path(): dev->seqno=%zi\n", dev->seqno); + dev->ctm = ctm; //fz_concat(ctm, trace_device_ptm); + dev->path_type = FILL_PATH; + jm_lineart_path(dev, path); + if (!dev->pathdict) { + return; + } + DICT_SETITEM_DROP(dev->pathdict, dictkey_type, PyUnicode_FromString("f")); + DICT_SETITEMSTR_DROP(dev->pathdict, "even_odd", JM_BOOL(even_odd)); + DICT_SETITEMSTR_DROP(dev->pathdict, "fill_opacity", Py_BuildValue("f", alpha)); + DICT_SETITEMSTR_DROP(dev->pathdict, "fill", jm_lineart_color(colorspace, color)); + DICT_SETITEM_DROP(dev->pathdict, dictkey_rect, JM_py_from_rect(dev->pathrect)); + DICT_SETITEMSTR_DROP(dev->pathdict, "seqno", PyLong_FromSize_t(dev->seqno)); + DICT_SETITEMSTR_DROP(dev->pathdict, "layer", JM_UnicodeFromStr(dev->layer_name)); + if (dev->clips) { + DICT_SETITEMSTR_DROP(dev->pathdict, "level", PyLong_FromLong(dev->depth)); + } + jm_append_merge(dev); + dev->seqno += 1; +} + +static void +jm_lineart_stroke_path(fz_context *ctx, fz_device *dev_, const fz_path *path, + const fz_stroke_state *stroke, fz_matrix ctm, + fz_colorspace *colorspace, const float *color, float alpha, + fz_color_params color_params) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + //printf("extra.jm_lineart_stroke_path(): dev->seqno=%zi\n", dev->seqno); + int i; + dev->pathfactor = 1; + if (ctm.a != 0 && fz_abs(ctm.a) == fz_abs(ctm.d)) { + dev->pathfactor = fz_abs(ctm.a); + } else { + if (ctm.b != 0 && fz_abs(ctm.b) == fz_abs(ctm.c)) { + dev->pathfactor = fz_abs(ctm.b); + } + } + dev->ctm = ctm; // fz_concat(ctm, trace_device_ptm); + dev->path_type = STROKE_PATH; + + jm_lineart_path(dev, path); + if (!dev->pathdict) { + return; + } + DICT_SETITEM_DROP(dev->pathdict, dictkey_type, PyUnicode_FromString("s")); + DICT_SETITEMSTR_DROP(dev->pathdict, "stroke_opacity", Py_BuildValue("f", alpha)); + DICT_SETITEMSTR_DROP(dev->pathdict, "color", jm_lineart_color(colorspace, color)); + DICT_SETITEM_DROP(dev->pathdict, dictkey_width, Py_BuildValue("f", dev->pathfactor * stroke->linewidth)); + DICT_SETITEMSTR_DROP(dev->pathdict, "lineCap", Py_BuildValue("iii", stroke->start_cap, stroke->dash_cap, stroke->end_cap)); + DICT_SETITEMSTR_DROP(dev->pathdict, 
"lineJoin", Py_BuildValue("f", dev->pathfactor * stroke->linejoin)); + if (!PyDict_GetItemString(dev->pathdict, "closePath")) { + DICT_SETITEMSTR_DROP(dev->pathdict, "closePath", JM_BOOL(0)); + } + + // output the "dashes" string + if (stroke->dash_len) { + mupdf::FzBuffer buff(256); + mupdf::fz_append_string(buff, "[ "); // left bracket + for (i = 0; i < stroke->dash_len; i++) { + fz_append_printf(ctx, buff.m_internal, "%g ", dev->pathfactor * stroke->dash_list[i]); + } + fz_append_printf(ctx, buff.m_internal, "] %g", dev->pathfactor * stroke->dash_phase); + DICT_SETITEMSTR_DROP(dev->pathdict, "dashes", JM_EscapeStrFromBuffer(buff)); + } else { + DICT_SETITEMSTR_DROP(dev->pathdict, "dashes", PyUnicode_FromString("[] 0")); + } + + DICT_SETITEM_DROP(dev->pathdict, dictkey_rect, JM_py_from_rect(dev->pathrect)); + DICT_SETITEMSTR_DROP(dev->pathdict, "layer", JM_UnicodeFromStr(dev->layer_name)); + DICT_SETITEMSTR_DROP(dev->pathdict, "seqno", PyLong_FromSize_t(dev->seqno)); + if (dev->clips) { + DICT_SETITEMSTR_DROP(dev->pathdict, "level", PyLong_FromLong(dev->depth)); + } + // output the dict - potentially merging it with a previous fill_path twin + jm_append_merge(dev); + dev->seqno += 1; +} + +static void +jm_lineart_clip_path(fz_context *ctx, fz_device *dev_, const fz_path *path, int even_odd, fz_matrix ctm, fz_rect scissor) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + dev->ctm = ctm; //fz_concat(ctm, trace_device_ptm); + dev->path_type = CLIP_PATH; + jm_lineart_path(dev, path); + if (!dev->pathdict) { + return; + } + DICT_SETITEM_DROP(dev->pathdict, dictkey_type, PyUnicode_FromString("clip")); + DICT_SETITEMSTR_DROP(dev->pathdict, "even_odd", JM_BOOL(even_odd)); + if (!PyDict_GetItemString(dev->pathdict, "closePath")) { + DICT_SETITEMSTR_DROP(dev->pathdict, "closePath", JM_BOOL(0)); + } + DICT_SETITEMSTR_DROP(dev->pathdict, "scissor", JM_py_from_rect(compute_scissor(dev))); + DICT_SETITEMSTR_DROP(dev->pathdict, "level", PyLong_FromLong(dev->depth)); + DICT_SETITEMSTR_DROP(dev->pathdict, "layer", JM_UnicodeFromStr(dev->layer_name)); + jm_append_merge(dev); + dev->depth++; +} + +static void +jm_lineart_clip_stroke_path(fz_context *ctx, fz_device *dev_, const fz_path *path, const fz_stroke_state *stroke, fz_matrix ctm, fz_rect scissor) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + dev->ctm = ctm; //fz_concat(ctm, trace_device_ptm); + dev->path_type = CLIP_STROKE_PATH; + jm_lineart_path(dev, path); + if (!dev->pathdict) { + return; + } + DICT_SETITEM_DROP(dev->pathdict, dictkey_type, PyUnicode_FromString("clip")); + DICT_SETITEMSTR_DROP(dev->pathdict, "even_odd", Py_BuildValue("s", NULL)); + if (!PyDict_GetItemString(dev->pathdict, "closePath")) { + DICT_SETITEMSTR_DROP(dev->pathdict, "closePath", JM_BOOL(0)); + } + DICT_SETITEMSTR_DROP(dev->pathdict, "scissor", JM_py_from_rect(compute_scissor(dev))); + DICT_SETITEMSTR_DROP(dev->pathdict, "level", PyLong_FromLong(dev->depth)); + DICT_SETITEMSTR_DROP(dev->pathdict, "layer", JM_UnicodeFromStr(dev->layer_name)); + jm_append_merge(dev); + dev->depth++; +} + + +static void +jm_lineart_clip_stroke_text(fz_context *ctx, fz_device *dev_, const fz_text *text, const fz_stroke_state *stroke, fz_matrix ctm, fz_rect scissor) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + compute_scissor(dev); + dev->depth++; +} + +static void +jm_lineart_clip_text(fz_context *ctx, fz_device *dev_, const fz_text *text, fz_matrix ctm, fz_rect scissor) +{ + 
jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + compute_scissor(dev); + dev->depth++; +} + +static void +jm_lineart_clip_image_mask(fz_context *ctx, fz_device *dev_, fz_image *image, fz_matrix ctm, fz_rect scissor) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + compute_scissor(dev); + dev->depth++; +} + +static void +jm_lineart_pop_clip(fz_context *ctx, fz_device *dev_) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + if (!dev->scissors) return; + Py_ssize_t len = PyList_Size(dev->scissors); + if (len < 1) return; + PyList_SetSlice(dev->scissors, len - 1, len, NULL); + dev->depth--; +} + + +static void +jm_lineart_begin_group(fz_context *ctx, fz_device *dev_, fz_rect bbox, fz_colorspace *cs, int isolated, int knockout, int blendmode, float alpha) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + dev->pathdict = Py_BuildValue("{s:s,s:N,s:N,s:N,s:s,s:f,s:i,s:N}", + "type", "group", + "rect", JM_py_from_rect(bbox), + "isolated", JM_BOOL(isolated), + "knockout", JM_BOOL(knockout), + "blendmode", fz_blendmode_name(blendmode), + "opacity", alpha, + "level", dev->depth, + "layer", JM_UnicodeFromStr(dev->layer_name) + ); + jm_append_merge(dev); + dev->depth++; +} + +static void +jm_lineart_end_group(fz_context *ctx, fz_device *dev_) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + dev->depth--; +} + +static void jm_lineart_fill_text(fz_context *ctx, fz_device *dev, const fz_text *, fz_matrix, fz_colorspace *, const float *color, float alpha, fz_color_params) +{ + jm_increase_seqno(ctx, dev); +} + +static void jm_lineart_stroke_text(fz_context *ctx, fz_device *dev, const fz_text *, const fz_stroke_state *, fz_matrix, fz_colorspace *, const float *color, float alpha, fz_color_params) +{ + jm_increase_seqno(ctx, dev); +} + +static void jm_lineart_fill_shade(fz_context *ctx, fz_device *dev, fz_shade *shd, fz_matrix ctm, float alpha, fz_color_params color_params) +{ + jm_increase_seqno(ctx, dev); +} + +static void jm_lineart_fill_image(fz_context *ctx, fz_device *dev, fz_image *img, fz_matrix ctm, float alpha, fz_color_params color_params) +{ + jm_increase_seqno(ctx, dev); +} + +static void jm_lineart_fill_image_mask(fz_context *ctx, fz_device *dev, fz_image *img, fz_matrix ctm, fz_colorspace *, const float *color, float alpha, fz_color_params color_params) +{ + jm_increase_seqno(ctx, dev); +} + +static void jm_lineart_ignore_text(fz_context *ctx, fz_device *dev, const fz_text *, fz_matrix) +{ + jm_increase_seqno(ctx, dev); +} + + +//------------------------------------------------------------------- +// LINEART device for Python method Page.get_cdrawings() +//------------------------------------------------------------------- +mupdf::FzDevice JM_new_lineart_device(PyObject *out, int clips, PyObject *method) +{ + //printf("extra.JM_new_lineart_device()\n"); + jm_lineart_device* dev = (jm_lineart_device*) mupdf::ll_fz_new_device_of_size(sizeof(jm_lineart_device)); + + dev->super.close_device = NULL; + dev->super.drop_device = jm_lineart_drop_device; + dev->super.fill_path = jm_lineart_fill_path; + dev->super.stroke_path = jm_lineart_stroke_path; + dev->super.clip_path = jm_lineart_clip_path; + dev->super.clip_stroke_path = jm_lineart_clip_stroke_path; + + dev->super.fill_text = jm_lineart_fill_text; + dev->super.stroke_text = jm_lineart_stroke_text; + dev->super.clip_text = jm_lineart_clip_text; + dev->super.clip_stroke_text 
= jm_lineart_clip_stroke_text; + dev->super.ignore_text = jm_lineart_ignore_text; + + dev->super.fill_shade = jm_lineart_fill_shade; + dev->super.fill_image = jm_lineart_fill_image; + dev->super.fill_image_mask = jm_lineart_fill_image_mask; + dev->super.clip_image_mask = jm_lineart_clip_image_mask; + + dev->super.pop_clip = jm_lineart_pop_clip; + + dev->super.begin_mask = NULL; + dev->super.end_mask = NULL; + dev->super.begin_group = jm_lineart_begin_group; + dev->super.end_group = jm_lineart_end_group; + + dev->super.begin_tile = NULL; + dev->super.end_tile = NULL; + + dev->super.begin_layer = jm_lineart_begin_layer; + dev->super.end_layer = jm_lineart_end_layer; + + dev->super.begin_structure = NULL; + dev->super.end_structure = NULL; + + dev->super.begin_metatext = NULL; + dev->super.end_metatext = NULL; + + dev->super.render_flags = NULL; + dev->super.set_default_colorspaces = NULL; + + if (PyList_Check(out)) { + Py_INCREF(out); + } + Py_INCREF(method); + dev->out = out; + dev->seqno = 0; + dev->depth = 0; + dev->clips = clips; + dev->method = method; + dev->pathdict = nullptr; + + return mupdf::FzDevice(&dev->super); +} + +PyObject* get_cdrawings(mupdf::FzPage& page, PyObject *extended=NULL, PyObject *callback=NULL, PyObject *method=NULL) +{ + //fz_page *page = (fz_page *) $self; + //fz_device *dev = NULL; + PyObject *rc = NULL; + int clips = PyObject_IsTrue(extended); + + mupdf::FzDevice dev; + if (PyCallable_Check(callback) || method != Py_None) { + dev = JM_new_lineart_device(callback, clips, method); + } else { + rc = PyList_New(0); + dev = JM_new_lineart_device(rc, clips, method); + } + mupdf::FzRect prect = mupdf::fz_bound_page(page); + ((jm_lineart_device*) dev.m_internal)->ptm = mupdf::ll_fz_make_matrix(1, 0, 0, -1, 0, prect.y1); + + mupdf::FzCookie cookie; + mupdf::FzMatrix identity; + mupdf::fz_run_page( page, dev, *identity.internal(), cookie); + mupdf::fz_close_device( dev); + if (PyCallable_Check(callback) || method != Py_None) + { + Py_RETURN_NONE; + } + return rc; +} + + +//--------------------------------------------------------------------------- +// APPEND non-ascii runes in unicode escape format to fz_buffer +//--------------------------------------------------------------------------- +void JM_append_rune(fz_buffer *buff, int ch) +{ + char text[32]; + if (ch == 92) // prevent accidental "\u", "\U" sequences + { + mupdf::ll_fz_append_string(buff, "\\u005c"); + } + else if ((ch >= 32 && ch <= 127) || ch == 10) + { + mupdf::ll_fz_append_byte(buff, ch); + } + else if (ch >= 0xd800 && ch <= 0xdfff) // orphaned surrogate Unicodes + { + mupdf::ll_fz_append_string(buff, "\\ufffd"); + } + else if (ch <= 0xffff) + { + // 4 hex digits + snprintf(text, sizeof(text), "\\u%04x", ch); + mupdf::ll_fz_append_string(buff, text); + } + else + { + // 8 hex digits + snprintf(text, sizeof(text), "\\U%08x", ch); + mupdf::ll_fz_append_string(buff, text); + } +} + + +mupdf::FzRect JM_make_spanlist( + PyObject *line_dict, + mupdf::FzStextLine& line, + int raw, + mupdf::FzBuffer& buff, + mupdf::FzRect& tp_rect + ) +{ + PyObject *span = NULL, *char_list = NULL, *char_dict; + PyObject *span_list = PyList_New(0); + mupdf::fz_clear_buffer(buff); + fz_rect span_rect = fz_empty_rect; + fz_rect line_rect = fz_empty_rect; + fz_point span_origin = {0, 0}; + struct char_style + { + float size = -1; + unsigned flags = 0; + + #if MUPDF_VERSION_GE(1, 25, 2) + /* From mupdf:include/mupdf/fitz/structured-text.h:fz_stext_char::flags, which + uses anonymous enum values: + FZ_STEXT_STRIKEOUT = 1, + 
FZ_STEXT_UNDERLINE = 2, + FZ_STEXT_SYNTHETIC = 4, + FZ_STEXT_FILLED = 16, + FZ_STEXT_STROKED = 32, + FZ_STEXT_CLIPPED = 64 + */ + unsigned char_flags = 0; + #endif + + const char *font = ""; + unsigned argb = 0; + float asc = 0; + float desc = 0; + uint16_t bidi = 0; + }; + char_style old_style; + char_style style; + + for (mupdf::FzStextChar ch: line) + { + fz_rect r = JM_char_bbox(line, ch); + if (!JM_rects_overlap(*tp_rect.internal(), r) && !fz_is_infinite_rect(tp_rect)) + { + continue; + } + /* Info from: + detect_super_script() + fz_font_is_italic() + fz_font_is_serif() + fz_font_is_monospaced() + fz_font_is_bold() + */ + int flags = JM_char_font_flags( ch.m_internal->font, line.m_internal, ch.m_internal); + fz_point origin = ch.m_internal->origin; + style.size = ch.m_internal->size; + style.flags = flags; + #if MUPDF_VERSION_GE(1, 25, 2) + /* FZ_STEXT_SYNTHETIC is per-char, not per-span. */ + style.char_flags = ch.m_internal->flags & ~FZ_STEXT_SYNTHETIC; + #endif + style.font = JM_font_name(ch.m_internal->font); + #if MUPDF_VERSION_GE(1, 25, 0) + style.argb = ch.m_internal->argb; + #else + style.argb = ch.m_internal->color; + #endif + style.asc = JM_font_ascender(ch.m_internal->font); + style.desc = JM_font_descender(ch.m_internal->font); + + if (0 + || style.size != old_style.size + || style.flags != old_style.flags + #if MUPDF_VERSION_GE(1, 25, 2) + || style.char_flags != old_style.char_flags + #endif + || style.argb != old_style.argb + || strcmp(style.font, old_style.font) != 0 + || style.bidi != old_style.bidi + ) + { + if (old_style.size >= 0) + { + // not first one, output previous + if (raw) + { + // put character list in the span + DICT_SETITEM_DROP(span, dictkey_chars, char_list); + char_list = NULL; + } + else + { + // put text string in the span + DICT_SETITEM_DROP(span, dictkey_text, JM_EscapeStrFromBuffer(buff)); + mupdf::fz_clear_buffer(buff); + } + + DICT_SETITEM_DROP(span, dictkey_origin, JM_py_from_point(span_origin)); + DICT_SETITEM_DROP(span, dictkey_bbox, JM_py_from_rect(span_rect)); + line_rect = mupdf::ll_fz_union_rect(line_rect, span_rect); + LIST_APPEND_DROP(span_list, span); + span = NULL; + } + + span = PyDict_New(); + float asc = style.asc, desc = style.desc; + if (style.asc < 1e-3) + { + asc = 0.9f; + desc = -0.1f; + } + + DICT_SETITEM_DROP(span, dictkey_size, Py_BuildValue("f", style.size)); + DICT_SETITEM_DROP(span, dictkey_flags, Py_BuildValue("I", style.flags)); + DICT_SETITEM_DROP(span, dictkey_bidi, Py_BuildValue("I", style.bidi)); + #if MUPDF_VERSION_GE(1, 25, 2) + DICT_SETITEM_DROP(span, dictkey_char_flags, Py_BuildValue("I", style.char_flags)); + #endif + DICT_SETITEM_DROP(span, dictkey_font, JM_EscapeStrFromStr(style.font)); + DICT_SETITEM_DROP(span, dictkey_color, Py_BuildValue("I", style.argb & 0xffffff)); + #if MUPDF_VERSION_GE(1, 25, 0) + DICT_SETITEMSTR_DROP(span, "alpha", Py_BuildValue("I", style.argb >> 24)); + #endif + DICT_SETITEMSTR_DROP(span, "ascender", Py_BuildValue("f", asc)); + DICT_SETITEMSTR_DROP(span, "descender", Py_BuildValue("f", desc)); + + old_style = style; + span_rect = r; + span_origin = origin; + + } + span_rect = mupdf::ll_fz_union_rect(span_rect, r); + + if (raw) + { + // make and append a char dict + char_dict = PyDict_New(); + DICT_SETITEM_DROP(char_dict, dictkey_origin, JM_py_from_point(ch.m_internal->origin)); + + DICT_SETITEM_DROP(char_dict, dictkey_bbox, JM_py_from_rect(r)); + + DICT_SETITEM_DROP(char_dict, dictkey_c, Py_BuildValue("C", ch.m_internal->c)); + DICT_SETITEMSTR_DROP(char_dict, "synthetic", 
Py_BuildValue("O", (ch.m_internal->flags & FZ_STEXT_SYNTHETIC) ? Py_True : Py_False)); + if (!char_list) + { + char_list = PyList_New(0); + } + LIST_APPEND_DROP(char_list, char_dict); + } + else + { + // add character byte to buffer + JM_append_rune(buff.m_internal, ch.m_internal->c); + } + } + // all characters processed, now flush remaining span + if (span) + { + if (raw) + { + DICT_SETITEM_DROP(span, dictkey_chars, char_list); + char_list = NULL; + } + else + { + DICT_SETITEM_DROP(span, dictkey_text, JM_EscapeStrFromBuffer(buff)); + mupdf::fz_clear_buffer(buff); + } + DICT_SETITEM_DROP(span, dictkey_origin, JM_py_from_point(span_origin)); + DICT_SETITEM_DROP(span, dictkey_bbox, JM_py_from_rect(span_rect)); + + if (!fz_is_empty_rect(span_rect)) + { + LIST_APPEND_DROP(span_list, span); + line_rect = fz_union_rect(line_rect, span_rect); + } + else + { + Py_DECREF(span); + } + span = NULL; + } + if (!mupdf::fz_is_empty_rect(line_rect)) + { + DICT_SETITEM_DROP(line_dict, dictkey_spans, span_list); + } + else + { + DICT_SETITEM_DROP(line_dict, dictkey_spans, span_list); + } + return line_rect; +} + +//----------------------------------------------------------------------------- +// Functions for wordlist output +//----------------------------------------------------------------------------- +int JM_append_word( + PyObject* lines, + fz_buffer* buff, + fz_rect* wbbox, + int block_n, + int line_n, + int word_n + ) +{ + PyObject* s = JM_EscapeStrFromBuffer(buff); + PyObject* litem = Py_BuildValue( + "ffffOiii", + wbbox->x0, + wbbox->y0, + wbbox->x1, + wbbox->y1, + s, + block_n, + line_n, + word_n + ); + LIST_APPEND_DROP(lines, litem); + Py_DECREF(s); + *wbbox = fz_empty_rect; + return word_n + 1; // word counter +} + +PyObject* extractWORDS(mupdf::FzStextPage& this_tpage, PyObject *delimiters) +{ + int block_n = -1; + fz_rect wbbox = fz_empty_rect; // word bbox + fz_rect tp_rect = this_tpage.m_internal->mediabox; + + PyObject *lines = NULL; + mupdf::FzBuffer buff = mupdf::fz_new_buffer(64); + lines = PyList_New(0); + for (mupdf::FzStextBlock block: this_tpage) + { + block_n++; + if (block.m_internal->type != FZ_STEXT_BLOCK_TEXT) + { + continue; + } + int line_n = -1; + for (mupdf::FzStextLine line: block) + { + line_n++; + int word_n = 0; // word counter per line + mupdf::fz_clear_buffer(buff); // reset word buffer + size_t buflen = 0; // reset char counter + int last_char_rtl = 0; // was last character RTL? 
+ for (mupdf::FzStextChar ch: line) + { + mupdf::FzRect cbbox = JM_char_bbox(line, ch); + if (!JM_rects_overlap(tp_rect, *cbbox.internal()) && !fz_is_infinite_rect(tp_rect)) + { + continue; + } + + int word_delimiter = JM_is_word_delimiter(ch.m_internal->c, delimiters); + int this_char_rtl = JM_is_rtl_char(ch.m_internal->c); + if (word_delimiter || this_char_rtl != last_char_rtl) + { + if (buflen == 0 && word_delimiter) + { + continue; // skip delimiters at line start + } + if (!fz_is_empty_rect(wbbox)) + { + word_n = JM_append_word( + lines, + buff.m_internal, + &wbbox, + block_n, + line_n, + word_n + ); + } + mupdf::fz_clear_buffer(buff); + buflen = 0; // reset char counter + if (word_delimiter) continue; + } + // append one unicode character to the word + JM_append_rune(buff.m_internal, ch.m_internal->c); + last_char_rtl = this_char_rtl; + buflen++; + // enlarge word bbox + wbbox = fz_union_rect(wbbox, JM_char_bbox(line, ch)); + } + if (buflen && !fz_is_empty_rect(wbbox)) + { + word_n = JM_append_word( + lines, + buff.m_internal, + &wbbox, + block_n, + line_n, + word_n + ); + } + mupdf::fz_clear_buffer(buff); + buflen = 0; + } + } + return lines; +} + + + +struct ScopedPyObject +/* PyObject* wrapper, destructor calls Py_CLEAR() unless `release()` has been +called. */ +{ + ScopedPyObject(PyObject* rhs=nullptr) + : + m_pyobject(rhs) + {} + + PyObject*& get() + { + return m_pyobject; + } + + ScopedPyObject& operator= (PyObject* rhs) + { + Py_CLEAR(m_pyobject); + m_pyobject = rhs; + return *this; + } + + PyObject* release() + { + PyObject* ret = m_pyobject; + m_pyobject = nullptr; + return ret; + } + ~ScopedPyObject() + { + Py_CLEAR(m_pyobject); + } + + PyObject* m_pyobject = nullptr; +}; + + +PyObject* extractBLOCKS(mupdf::FzStextPage& self) +{ + fz_stext_page *this_tpage = self.m_internal; + fz_rect tp_rect = this_tpage->mediabox; + mupdf::FzBuffer res(1024); + ScopedPyObject lines( PyList_New(0)); + int block_n = -1; + for (fz_stext_block* block = this_tpage->first_block; block; block = block->next) + { + ScopedPyObject text; + block_n++; + fz_rect blockrect = fz_empty_rect; + if (block->type == FZ_STEXT_BLOCK_TEXT) + { + mupdf::fz_clear_buffer(res); // set text buffer to empty + int line_n = -1; + int last_char = 0; + (void) line_n; /* Not actually used, but keeping in the code for now. 
*/
+            for (fz_stext_line* line = block->u.t.first_line; line; line = line->next)
+            {
+                line_n++;
+                fz_rect linerect = fz_empty_rect;
+                for (fz_stext_char* ch = line->first_char; ch; ch = ch->next)
+                {
+                    fz_rect cbbox = JM_char_bbox(line, ch);
+                    if (!JM_rects_overlap(tp_rect, cbbox) && !fz_is_infinite_rect(tp_rect))
+                    {
+                        continue;
+                    }
+                    JM_append_rune(res.m_internal, ch->c);
+                    last_char = ch->c;
+                    linerect = fz_union_rect(linerect, cbbox);
+                }
+                if (last_char != 10 && !fz_is_empty_rect(linerect))
+                {
+                    mupdf::fz_append_byte(res, 10);
+                }
+                blockrect = fz_union_rect(blockrect, linerect);
+            }
+            text = JM_EscapeStrFromBuffer(res);
+        }
+        else if (JM_rects_overlap(tp_rect, block->bbox) || fz_is_infinite_rect(tp_rect))
+        {
+            fz_image *img = block->u.i.image;
+            fz_colorspace *cs = img->colorspace;
+            text = PyUnicode_FromFormat(
+                    "<image: %s, width: %d, height: %d, bpc: %d>",
+                    mupdf::ll_fz_colorspace_name(cs),
+                    img->w,
+                    img->h,
+                    img->bpc
+                    );
+            blockrect = fz_union_rect(blockrect, block->bbox);
+        }
+        if (!fz_is_empty_rect(blockrect))
+        {
+            ScopedPyObject litem = PyTuple_New(7);
+            PyTuple_SET_ITEM(litem.get(), 0, Py_BuildValue("f", blockrect.x0));
+            PyTuple_SET_ITEM(litem.get(), 1, Py_BuildValue("f", blockrect.y0));
+            PyTuple_SET_ITEM(litem.get(), 2, Py_BuildValue("f", blockrect.x1));
+            PyTuple_SET_ITEM(litem.get(), 3, Py_BuildValue("f", blockrect.y1));
+            PyTuple_SET_ITEM(litem.get(), 4, Py_BuildValue("O", text.get()));
+            PyTuple_SET_ITEM(litem.get(), 5, Py_BuildValue("i", block_n));
+            PyTuple_SET_ITEM(litem.get(), 6, Py_BuildValue("i", block->type));
+            LIST_APPEND(lines.get(), litem.get());
+        }
+    }
+    return lines.release();
+}
+
+#define EMPTY_STRING PyUnicode_FromString("")
+
+static PyObject *JM_UnicodeFromStr(const char *c)
+{
+    if (!c) return EMPTY_STRING;
+    PyObject *val = Py_BuildValue("s", c);
+    if (!val) {
+        val = EMPTY_STRING;
+        PyErr_Clear();
+    }
+    return val;
+}
+
+PyObject* link_uri(mupdf::FzLink& link)
+{
+    return JM_UnicodeFromStr( link.m_internal->uri);
+}
+
+fz_stext_page* page_get_textpage(
+        mupdf::FzPage& self,
+        PyObject* clip,
+        int flags,
+        PyObject* matrix
+        )
+{
+    fz_context* ctx = mupdf::internal_context_get();
+    fz_stext_page *tpage=NULL;
+    fz_page *page = self.m_internal;
+    fz_device *dev = NULL;
+    fz_stext_options options;
+    memset(&options, 0, sizeof options);
+    options.flags = flags;
+    fz_try(ctx) {
+        // Default to page's rect if `clip` not specified, for #2048.
+        fz_rect rect = (clip==Py_None) ?
fz_bound_page(ctx, page) : JM_rect_from_py(clip); + fz_matrix ctm = JM_matrix_from_py(matrix); + tpage = fz_new_stext_page(ctx, rect); + dev = fz_new_stext_device(ctx, tpage, &options); + fz_run_page(ctx, page, dev, ctm, NULL); + fz_close_device(ctx, dev); + } + fz_always(ctx) { + fz_drop_device(ctx, dev); + } + fz_catch(ctx) { + mupdf::internal_throw_exception(ctx); + } + return tpage; +} + +// return extension for pymupdf image type +const char *JM_image_extension(int type) +{ + switch (type) { + case(FZ_IMAGE_RAW): return "raw"; + case(FZ_IMAGE_FLATE): return "flate"; + case(FZ_IMAGE_LZW): return "lzw"; + case(FZ_IMAGE_RLD): return "rld"; + case(FZ_IMAGE_BMP): return "bmp"; + case(FZ_IMAGE_GIF): return "gif"; + case(FZ_IMAGE_JBIG2): return "jb2"; + case(FZ_IMAGE_JPEG): return "jpeg"; + case(FZ_IMAGE_JPX): return "jpx"; + case(FZ_IMAGE_JXR): return "jxr"; + case(FZ_IMAGE_PNG): return "png"; + case(FZ_IMAGE_PNM): return "pnm"; + case(FZ_IMAGE_TIFF): return "tiff"; + default: return "n/a"; + } +} + +void JM_make_image_block(fz_stext_block *block, PyObject *block_dict) +{ + fz_context* ctx = mupdf::internal_context_get(); + fz_image *image = block->u.i.image; + fz_buffer *buf = NULL, *freebuf = NULL, *mask_buf = NULL; + fz_compressed_buffer *buffer = fz_compressed_image_buffer(ctx, image); + fz_var(buf); + fz_var(freebuf); + fz_var(mask_buf); + int n = fz_colorspace_n(ctx, image->colorspace); + int w = image->w; + int h = image->h; + const char *ext = ""; + int type = FZ_IMAGE_UNKNOWN; + if (buffer) { + type = buffer->params.type; + ext = JM_image_extension(type); + } + if (type < FZ_IMAGE_BMP || type == FZ_IMAGE_JBIG2) + type = FZ_IMAGE_UNKNOWN; + PyObject *bytes = NULL; + fz_var(bytes); + PyObject *mask_bytes = NULL; + fz_var(mask_bytes); + fz_try(ctx) { + if (!buffer || type == FZ_IMAGE_UNKNOWN) + { + buf = freebuf = fz_new_buffer_from_image_as_png(ctx, image, fz_default_color_params); + ext = "png"; + } + else if (n == 4 && strcmp(ext, "jpeg") == 0) // JPEG CMYK needs another step + { + buf = freebuf = fz_new_buffer_from_image_as_jpeg(ctx, image, fz_default_color_params, 95, 1); + } + else + { + buf = buffer->buffer; + } + bytes = JM_BinFromBuffer(buf); + if (image->mask) { + mask_buf = fz_new_buffer_from_image_as_png(ctx, image->mask, fz_default_color_params); + mask_bytes = JM_BinFromBuffer(mask_buf); + } else { + mask_bytes = Py_BuildValue("s", NULL); + } + } + fz_always(ctx) { + if (!bytes) + bytes = PyBytes_FromString(""); + DICT_SETITEM_DROP(block_dict, dictkey_width, + Py_BuildValue("i", w)); + DICT_SETITEM_DROP(block_dict, dictkey_height, + Py_BuildValue("i", h)); + DICT_SETITEM_DROP(block_dict, dictkey_ext, + Py_BuildValue("s", ext)); + DICT_SETITEM_DROP(block_dict, dictkey_colorspace, + Py_BuildValue("i", n)); + DICT_SETITEM_DROP(block_dict, dictkey_xres, + Py_BuildValue("i", image->xres)); + DICT_SETITEM_DROP(block_dict, dictkey_yres, + Py_BuildValue("i", image->xres)); + DICT_SETITEM_DROP(block_dict, dictkey_bpc, + Py_BuildValue("i", (int) image->bpc)); + DICT_SETITEM_DROP(block_dict, dictkey_matrix, + JM_py_from_matrix(block->u.i.transform)); + DICT_SETITEM_DROP(block_dict, dictkey_size, + Py_BuildValue("n", PyBytes_Size(bytes))); + DICT_SETITEM_DROP(block_dict, dictkey_image, bytes); + DICT_SETITEMSTR_DROP(block_dict, "mask", mask_bytes); + fz_drop_buffer(ctx, mask_buf); + fz_drop_buffer(ctx, freebuf); + } + fz_catch(ctx) {;} + return; +} + +static void JM_make_text_block(fz_stext_block *block, PyObject *block_dict, int raw, fz_buffer *buff, fz_rect tp_rect) +{ + 
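// Build the per-line dictionaries of one text block: lines outside the
+    // clip rectangle tp_rect are skipped, span construction is delegated to
+    // JM_make_spanlist(), and each line records its wmode, dir and bbox.
+    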
fz_stext_line *line; + PyObject *line_list = PyList_New(0), *line_dict; + fz_rect block_rect = fz_empty_rect; + for (line = block->u.t.first_line; line; line = line->next) { + if (fz_is_empty_rect(fz_intersect_rect(tp_rect, line->bbox)) && + !fz_is_infinite_rect(tp_rect)) { + continue; + } + line_dict = PyDict_New(); + mupdf::FzStextLine line2(line); + mupdf::FzBuffer buff2( mupdf::ll_fz_keep_buffer( buff)); + mupdf::FzRect tp_rect2( tp_rect); + mupdf::FzRect line_rect2 = JM_make_spanlist( + line_dict, + line2, + raw, + buff2, + tp_rect2 + ); + fz_rect& line_rect = *line_rect2.internal(); + block_rect = fz_union_rect(block_rect, line_rect); + DICT_SETITEM_DROP(line_dict, dictkey_wmode, + Py_BuildValue("i", line->wmode)); + DICT_SETITEM_DROP(line_dict, dictkey_dir, JM_py_from_point(line->dir)); + DICT_SETITEM_DROP(line_dict, dictkey_bbox, + JM_py_from_rect(line_rect)); + LIST_APPEND_DROP(line_list, line_dict); + } + DICT_SETITEM_DROP(block_dict, dictkey_bbox, JM_py_from_rect(block_rect)); + DICT_SETITEM_DROP(block_dict, dictkey_lines, line_list); + return; +} + +void JM_make_textpage_dict(fz_stext_page *tp, PyObject *page_dict, int raw) +{ + fz_context* ctx = mupdf::internal_context_get(); + fz_stext_block *block; + fz_buffer *text_buffer = fz_new_buffer(ctx, 128); + PyObject *block_dict, *block_list = PyList_New(0); + fz_rect tp_rect = tp->mediabox; + int block_n = -1; + for (block = tp->first_block; block; block = block->next) { + block_n++; + if (!fz_contains_rect(tp_rect, block->bbox) && + !fz_is_infinite_rect(tp_rect) && + block->type == FZ_STEXT_BLOCK_IMAGE) { + continue; + } + if (!fz_is_infinite_rect(tp_rect) && + fz_is_empty_rect(fz_intersect_rect(tp_rect, block->bbox))) { + continue; + } + + block_dict = PyDict_New(); + DICT_SETITEM_DROP(block_dict, dictkey_number, Py_BuildValue("i", block_n)); + DICT_SETITEM_DROP(block_dict, dictkey_type, Py_BuildValue("i", block->type)); + if (block->type == FZ_STEXT_BLOCK_IMAGE) { + DICT_SETITEM_DROP(block_dict, dictkey_bbox, JM_py_from_rect(block->bbox)); + JM_make_image_block(block, block_dict); + } else { + JM_make_text_block(block, block_dict, raw, text_buffer, tp_rect); + } + + LIST_APPEND_DROP(block_list, block_dict); + } + DICT_SETITEM_DROP(page_dict, dictkey_blocks, block_list); + fz_drop_buffer(ctx, text_buffer); +} + +//----------------------------------------------------------------- +// get one pixel as a list +//----------------------------------------------------------------- +PyObject *pixmap_pixel(fz_pixmap* pm, int x, int y) +{ + fz_context* ctx = mupdf::internal_context_get(); + PyObject *p = NULL; + if (0 + || x < 0 + || x >= pm->w + || y < 0 + || y >= pm->h + ) + { + throw std::range_error( MSG_PIXEL_OUTSIDE); + } + int n = pm->n; + int stride = fz_pixmap_stride(ctx, pm); + int i = stride * y + n * x; + p = PyTuple_New(n); + for (int j = 0; j < n; j++) + { + PyTuple_SET_ITEM(p, j, Py_BuildValue("i", pm->samples[i + j])); + } + return p; +} + +int pixmap_n(mupdf::FzPixmap& pixmap) +{ + return mupdf::fz_pixmap_components( pixmap); +} + +static int +JM_INT_ITEM(PyObject *obj, Py_ssize_t idx, int *result) +{ + PyObject *temp = PySequence_ITEM(obj, idx); + if (!temp) return 1; + if (PyLong_Check(temp)) { + *result = (int) PyLong_AsLong(temp); + Py_DECREF(temp); + } else if (PyFloat_Check(temp)) { + *result = (int) PyFloat_AsDouble(temp); + Py_DECREF(temp); + } else { + Py_DECREF(temp); + return 1; + } + if (PyErr_Occurred()) { + PyErr_Clear(); + return 1; + } + return 0; +} + +PyObject *set_pixel(fz_pixmap* pm, int x, int y, 
PyObject *color) +{ + fz_context* ctx = mupdf::internal_context_get(); + if (0 + || x < 0 + || x >= pm->w + || y < 0 + || y >= pm->h + ) + { + throw std::range_error( MSG_PIXEL_OUTSIDE); + } + int n = pm->n; + if (!PySequence_Check(color) || PySequence_Size(color) != n) { + throw std::range_error(MSG_BAD_COLOR_SEQ); + } + int i, j; + unsigned char c[5]; + for (j = 0; j < n; j++) { + if (JM_INT_ITEM(color, j, &i) == 1) { + throw std::range_error(MSG_BAD_COLOR_SEQ); + } + if (i < 0 or i >= 256) { + throw std::range_error(MSG_BAD_COLOR_SEQ); + } + c[j] = (unsigned char) i; + } + int stride = fz_pixmap_stride(ctx, pm); + i = stride * y + n * x; + for (j = 0; j < n; j++) { + pm->samples[i + j] = c[j]; + } + Py_RETURN_NONE; +} +//------------------------------------------- +// make a buffer from an stext_page's text +//------------------------------------------- +fz_buffer * +JM_new_buffer_from_stext_page(fz_stext_page *page) +{ + fz_context* ctx = mupdf::internal_context_get(); + fz_stext_block *block; + fz_stext_line *line; + fz_stext_char *ch; + fz_rect rect = page->mediabox; + fz_buffer *buf = NULL; + + fz_try(ctx) + { + buf = fz_new_buffer(ctx, 256); + for (block = page->first_block; block; block = block->next) { + if (block->type == FZ_STEXT_BLOCK_TEXT) { + for (line = block->u.t.first_line; line; line = line->next) { + for (ch = line->first_char; ch; ch = ch->next) { + if (!JM_rects_overlap(rect, JM_char_bbox(line, ch)) && + !fz_is_infinite_rect(rect)) { + continue; + } + fz_append_rune(ctx, buf, ch->c); + } + fz_append_byte(ctx, buf, '\n'); + } + fz_append_byte(ctx, buf, '\n'); + } + } + } + fz_catch(ctx) { + fz_drop_buffer(ctx, buf); + mupdf::internal_throw_exception(ctx); + } + return buf; +} + +static inline int canon(int c) +{ + /* TODO: proper unicode case folding */ + /* TODO: character equivalence (a matches ä, etc) */ + if (c == 0xA0 || c == 0x2028 || c == 0x2029) + return ' '; + if (c == '\r' || c == '\n' || c == '\t') + return ' '; + if (c >= 'A' && c <= 'Z') + return c - 'A' + 'a'; + return c; +} + +static inline int chartocanon(int *c, const char *s) +{ + int n = fz_chartorune(c, s); + *c = canon(*c); + return n; +} + +static const char *match_string(const char *h, const char *n) +{ + int hc, nc; + const char *e = h; + h += chartocanon(&hc, h); + n += chartocanon(&nc, n); + while (hc == nc) + { + e = h; + if (hc == ' ') + do + h += chartocanon(&hc, h); + while (hc == ' '); + else + h += chartocanon(&hc, h); + if (nc == ' ') + do + n += chartocanon(&nc, n); + while (nc == ' '); + else + n += chartocanon(&nc, n); + } + return nc == 0 ? e : NULL; +} + + +static const char *find_string(const char *s, const char *needle, const char **endp) +{ + const char *end; + while (*s) + { + end = match_string(s, needle); + if (end) + { + *endp = end; + return s; + } + ++s; + } + *endp = NULL; + return NULL; +} + +struct highlight +{ + Py_ssize_t len; + PyObject *quads; + float hfuzz, vfuzz; +}; + + +static int +JM_FLOAT_ITEM(PyObject *obj, Py_ssize_t idx, double *result) +{ + PyObject *temp = PySequence_ITEM(obj, idx); + if (!temp) return 1; + *result = PyFloat_AsDouble(temp); + Py_DECREF(temp); + if (PyErr_Occurred()) { + PyErr_Clear(); + return 1; + } + return 0; +} + + +//----------------------------------------------------------------------------- +// fz_quad from PySequence. Four floats are treated as rect. +// Else must be four pairs of floats. 
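+// Example: (0, 0, 100, 50) is treated as a rect and yields its quad, while
+// ((0, 0), (100, 0), (0, 50), (100, 50)) supplies the corners ul, ur, ll, lr
+// directly.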
+//----------------------------------------------------------------------------- +static fz_quad +JM_quad_from_py(PyObject *r) +{ + fz_quad q = fz_make_quad(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, + FZ_MAX_INF_RECT, FZ_MIN_INF_RECT, + FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, + FZ_MAX_INF_RECT, FZ_MAX_INF_RECT); + fz_point p[4]; + double test, x, y; + Py_ssize_t i; + PyObject *obj = NULL; + + if (!r || !PySequence_Check(r) || PySequence_Size(r) != 4) + return q; + + if (JM_FLOAT_ITEM(r, 0, &test) == 0) + return fz_quad_from_rect(JM_rect_from_py(r)); + + for (i = 0; i < 4; i++) { + obj = PySequence_ITEM(r, i); // next point item + if (!obj || !PySequence_Check(obj) || PySequence_Size(obj) != 2) + goto exit_result; // invalid: cancel the rest + + if (JM_FLOAT_ITEM(obj, 0, &x) == 1) goto exit_result; + if (JM_FLOAT_ITEM(obj, 1, &y) == 1) goto exit_result; + if (x < FZ_MIN_INF_RECT) x = FZ_MIN_INF_RECT; + if (y < FZ_MIN_INF_RECT) y = FZ_MIN_INF_RECT; + if (x > FZ_MAX_INF_RECT) x = FZ_MAX_INF_RECT; + if (y > FZ_MAX_INF_RECT) y = FZ_MAX_INF_RECT; + p[i] = fz_make_point((float) x, (float) y); + + Py_CLEAR(obj); + } + q.ul = p[0]; + q.ur = p[1]; + q.ll = p[2]; + q.lr = p[3]; + return q; + + exit_result:; + Py_CLEAR(obj); + return q; +} + +static float hdist(fz_point *dir, fz_point *a, fz_point *b) +{ + float dx = b->x - a->x; + float dy = b->y - a->y; + return fz_abs(dx * dir->x + dy * dir->y); +} + +static float vdist(fz_point *dir, fz_point *a, fz_point *b) +{ + float dx = b->x - a->x; + float dy = b->y - a->y; + return fz_abs(dx * dir->y + dy * dir->x); +} + +static void on_highlight_char(fz_context *ctx, void *arg, fz_stext_line *line, fz_stext_char *ch) +{ + struct highlight* hits = (struct highlight*) arg; + float vfuzz = ch->size * hits->vfuzz; + float hfuzz = ch->size * hits->hfuzz; + fz_quad ch_quad = JM_char_quad(line, ch); + if (hits->len > 0) { + PyObject *quad = PySequence_ITEM(hits->quads, hits->len - 1); + fz_quad end = JM_quad_from_py(quad); + Py_DECREF(quad); + if (hdist(&line->dir, &end.lr, &ch_quad.ll) < hfuzz + && vdist(&line->dir, &end.lr, &ch_quad.ll) < vfuzz + && hdist(&line->dir, &end.ur, &ch_quad.ul) < hfuzz + && vdist(&line->dir, &end.ur, &ch_quad.ul) < vfuzz) + { + end.ur = ch_quad.ur; + end.lr = ch_quad.lr; + quad = JM_py_from_quad(end); + PyList_SetItem(hits->quads, hits->len - 1, quad); + return; + } + } + LIST_APPEND_DROP(hits->quads, JM_py_from_quad(ch_quad)); + hits->len++; +} + + +PyObject* JM_search_stext_page(fz_stext_page *page, const char *needle) +{ + fz_context* ctx = mupdf::internal_context_get(); + struct highlight hits; + fz_stext_block *block; + fz_stext_line *line; + fz_stext_char *ch; + fz_buffer *buffer = NULL; + const char *haystack, *begin, *end; + fz_rect rect = page->mediabox; + int c, inside; + + if (strlen(needle) == 0) Py_RETURN_NONE; + PyObject *quads = PyList_New(0); + hits.len = 0; + hits.quads = quads; + hits.hfuzz = 0.2f; /* merge kerns but not large gaps */ + hits.vfuzz = 0.1f; + + fz_try(ctx) { + buffer = JM_new_buffer_from_stext_page( page); + haystack = fz_string_from_buffer(ctx, buffer); + begin = find_string(haystack, needle, &end); + if (!begin) goto no_more_matches; + + inside = 0; + for (block = page->first_block; block; block = block->next) { + if (block->type != FZ_STEXT_BLOCK_TEXT) { + continue; + } + for (line = block->u.t.first_line; line; line = line->next) { + for (ch = line->first_char; ch; ch = ch->next) { + if (!fz_is_infinite_rect(rect) && + !JM_rects_overlap(rect, JM_char_bbox(line, ch))) { + goto next_char; + } +try_new_match: + if 
(!inside) {
+                        if (haystack >= begin) inside = 1;
+                    }
+                    if (inside) {
+                        if (haystack < end) {
+                            on_highlight_char(ctx, &hits, line, ch);
+                        } else {
+                            inside = 0;
+                            begin = find_string(haystack, needle, &end);
+                            if (!begin) goto no_more_matches;
+                            else goto try_new_match;
+                        }
+                    }
+                    haystack += fz_chartorune(&c, haystack);
+next_char:;
+                }
+                assert(*haystack == '\n');
+                ++haystack;
+            }
+            assert(*haystack == '\n');
+            ++haystack;
+        }
+no_more_matches:;
+    }
+    fz_always(ctx)
+        fz_drop_buffer(ctx, buffer);
+    fz_catch(ctx)
+        mupdf::internal_throw_exception(ctx);
+
+    return quads;
+}
+
+void pixmap_copy( fz_pixmap* pm, const fz_pixmap* src, int n)
+{
+    assert(pm->w == src->w);
+    assert(pm->h == src->h);
+    assert(n <= pm->n);
+    assert(n <= src->n);
+
+    if (pm->n == src->n)
+    {
+        // identical samples
+        assert(pm->stride == src->stride);
+        memcpy(pm->samples, src->samples, pm->w * pm->h * pm->n);
+    }
+    else
+    {
+        int nn;
+        int do_alpha;
+        if (pm->n > src->n)
+        {
+            assert(pm->n == src->n + 1);
+            nn = src->n;
+            assert(!src->alpha);
+            assert(pm->alpha);
+            do_alpha = 1;
+        }
+        else
+        {
+            assert(src->n == pm->n + 1);
+            nn = pm->n;
+            assert(src->alpha);
+            assert(!pm->alpha);
+            do_alpha = 0;
+        }
+        for (int y=0; y<pm->h; ++y)
+        {
+            for (int x=0; x<pm->w; ++x)
+            {
+                memcpy(
+                        pm->samples + pm->stride * y + pm->n * x,
+                        src->samples + src->stride * y + src->n * x,
+                        nn
+                        );
+                if (do_alpha)
+                {
+                    pm->samples[pm->stride * y + pm->n * x + pm->n-1] = 255;
+                }
+            }
+        }
+    }
+}
+
+
+PyObject* ll_JM_color_count(fz_pixmap *pm, PyObject *clip)
+{
+    fz_context* ctx = mupdf::internal_context_get();
+    PyObject* rc = PyDict_New();
+    fz_irect irect = fz_pixmap_bbox(ctx, pm);
+    irect = fz_intersect_irect(irect, fz_round_rect(JM_rect_from_py(clip)));
+    if (fz_is_empty_irect(irect))
+    {
+        return rc;
+    }
+    size_t stride = pm->stride;
+    size_t width = irect.x1 - irect.x0;
+    size_t height = irect.y1 - irect.y0;
+    size_t n = (size_t) pm->n;
+    size_t substride = width * n;
+    unsigned char* s = pm->samples + stride * (irect.y0 - pm->y) + n * (irect.x0 - pm->x);
+    // Cache previous pixel.
+    char oldpix[10];
+    assert(n <= sizeof(oldpix));
+    memcpy(oldpix, s, n);
+    long cnt = 0;
+    for (size_t i = 0; i < height; i++)
+    {
+        for (size_t j = 0; j < substride; j += n)
+        {
+            const char* newpix = (const char*) s + j;
+            if (memcmp(oldpix, newpix, n))
+            {
+                /* Pixel differs from previous pixel, so update results with
+                last run of pixels. We get a PyObject representation of pixel
+                so we can look up in Python dict rc. */
+                PyObject* pixel = PyBytes_FromStringAndSize(&oldpix[0], n);
+                PyObject* c = PyDict_GetItem(rc, pixel);
+                if (c) cnt += PyLong_AsLong(c);
+                DICT_SETITEM_DROP(rc, pixel, PyLong_FromLong(cnt));
+                Py_DECREF(pixel);
+                /* Start next run of identical pixels. */
+                cnt = 1;
+                memcpy(oldpix, newpix, n);
+            }
+            else
+            {
+                cnt += 1;
+            }
+        }
+        s += stride;
+    }
+    /* Update results with last pixel. */
+    PyObject* pixel = PyBytes_FromStringAndSize(&oldpix[0], n);
+    PyObject* c = PyDict_GetItem(rc, pixel);
+    if (c) cnt += PyLong_AsLong(c);
+    DICT_SETITEM_DROP(rc, pixel, PyLong_FromLong(cnt));
+    Py_DECREF(pixel);
+    PyErr_Clear();
+    return rc;
+}
+
+%}
+
+/* Declarations for functions defined above.
*/ + +void page_merge( + mupdf::PdfDocument& doc_des, + mupdf::PdfDocument& doc_src, + int page_from, + int page_to, + int rotate, + int links, + int copy_annots, + mupdf::PdfGraftMap& graft_map + ); + +void JM_merge_range( + mupdf::PdfDocument& doc_des, + mupdf::PdfDocument& doc_src, + int spage, + int epage, + int apage, + int rotate, + int links, + int annots, + int show_progress, + mupdf::PdfGraftMap& graft_map + ); + +void FzDocument_insert_pdf( + mupdf::FzDocument& doc, + mupdf::FzDocument& src, + int from_page, + int to_page, + int start_at, + int rotate, + int links, + int annots, + int show_progress, + int final, + mupdf::PdfGraftMap& graft_map + ); + +int page_xref(mupdf::FzDocument& this_doc, int pno); +void _newPage(mupdf::FzDocument& self, int pno=-1, float width=595, float height=842); +void _newPage(mupdf::PdfDocument& self, int pno=-1, float width=595, float height=842); +void JM_add_annot_id(mupdf::PdfAnnot& annot, const char* stem); +void JM_set_annot_callout_line(mupdf::PdfAnnot& annot, PyObject *callout, int count); +std::vector< std::string> JM_get_annot_id_list(mupdf::PdfPage& page); +mupdf::PdfAnnot _add_caret_annot(mupdf::PdfPage& self, mupdf::FzPoint& point); +mupdf::PdfAnnot _add_caret_annot(mupdf::FzPage& self, mupdf::FzPoint& point); +const char* Tools_parse_da(mupdf::PdfAnnot& this_annot); +PyObject* Annot_getAP(mupdf::PdfAnnot& annot); +void Tools_update_da(mupdf::PdfAnnot& this_annot, const char* da_str); +mupdf::FzPoint JM_point_from_py(PyObject* p); +mupdf::FzRect Annot_rect(mupdf::PdfAnnot& annot); +PyObject* util_transform_rect(PyObject* rect, PyObject* matrix); +PyObject* Annot_rect3(mupdf::PdfAnnot& annot); +mupdf::FzMatrix Page_derotate_matrix(mupdf::PdfPage& pdfpage); +mupdf::FzMatrix Page_derotate_matrix(mupdf::FzPage& pdfpage); +PyObject* JM_get_annot_xref_list(const mupdf::PdfObj& page_obj); +PyObject* xref_object(mupdf::PdfDocument& pdf, int xref, int compressed=0, int ascii=0); +PyObject* xref_object(mupdf::FzDocument& document, int xref, int compressed=0, int ascii=0); + +PyObject* Link_is_external(mupdf::FzLink& this_link); +PyObject* Page_addAnnot_FromString(mupdf::PdfPage& page, PyObject* linklist); +PyObject* Page_addAnnot_FromString(mupdf::FzPage& page, PyObject* linklist); +mupdf::FzLink Link_next(mupdf::FzLink& this_link); + +static int page_count_fz2(void* document); +int page_count_fz(mupdf::FzDocument& document); +int page_count_pdf(mupdf::PdfDocument& pdf); +int page_count(mupdf::FzDocument& document); +int page_count(mupdf::PdfDocument& pdf); + +PyObject* page_annot_xrefs(mupdf::PdfDocument& pdf, int pno); +PyObject* page_annot_xrefs(mupdf::FzDocument& document, int pno); +bool Outline_is_external(mupdf::FzOutline* outline); +void Document_extend_toc_items(mupdf::PdfDocument& pdf, PyObject* items); +void Document_extend_toc_items(mupdf::FzDocument& document, PyObject* items); + +int ll_fz_absi(int i); + +mupdf::FzDevice JM_new_texttrace_device(PyObject* out); + +fz_rect JM_char_bbox(const mupdf::FzStextLine& line, const mupdf::FzStextChar& ch); + +static fz_quad JM_char_quad( fz_stext_line *line, fz_stext_char *ch); +void JM_print_stext_page_as_text(mupdf::FzBuffer& res, mupdf::FzStextPage& page); + +void set_skip_quad_corrections(int on); +void set_subset_fontnames(int on); +void set_small_glyph_heights(int on); + +mupdf::FzRect JM_cropbox(mupdf::PdfObj& page_obj); +PyObject* get_cdrawings(mupdf::FzPage& page, PyObject *extended=NULL, PyObject *callback=NULL, PyObject *method=NULL); + +mupdf::FzRect JM_make_spanlist( + PyObject 
*line_dict,
+        mupdf::FzStextLine& line,
+        int raw,
+        mupdf::FzBuffer& buff,
+        mupdf::FzRect& tp_rect
+        );
+
+PyObject* extractWORDS(mupdf::FzStextPage& this_tpage, PyObject *delimiters);
+PyObject* extractBLOCKS(mupdf::FzStextPage& self);
+
+PyObject* link_uri(mupdf::FzLink& link);
+
+fz_stext_page* page_get_textpage(
+        mupdf::FzPage& self,
+        PyObject* clip,
+        int flags,
+        PyObject* matrix
+        );
+
+void JM_make_textpage_dict(fz_stext_page *tp, PyObject *page_dict, int raw);
+PyObject *pixmap_pixel(fz_pixmap* pm, int x, int y);
+int pixmap_n(mupdf::FzPixmap& pixmap);
+
+PyObject* JM_search_stext_page(fz_stext_page *page, const char *needle);
+
+PyObject *set_pixel(fz_pixmap* pm, int x, int y, PyObject *color);
+
+/* Copies from src to pm, which must have same width and height. pm->n -
+src->n must be -1, 0 or +1. If -1, src must have alpha and pm must not have
+alpha, and we copy the non-alpha bytes. If +1, src must not have alpha and
+pm must have alpha, and we set pm's alpha bytes all to 255. */
+void pixmap_copy(fz_pixmap* pm, const fz_pixmap* src, int n);
+
+PyObject* ll_JM_color_count(fz_pixmap *pm, PyObject *clip);
diff -r 000000000000 -r 1d09e1dec1d9 src/fitz___init__.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/src/fitz___init__.py Mon Sep 15 11:37:51 2025 +0200
@@ -0,0 +1,13 @@
+# pylint: disable=wildcard-import,unused-import,unused-wildcard-import
+from pymupdf import *
+from pymupdf import _as_fz_document
+from pymupdf import _as_fz_page
+from pymupdf import _as_pdf_document
+from pymupdf import _as_pdf_page
+from pymupdf import _log_items
+from pymupdf import _log_items_active
+from pymupdf import _log_items_clear
+from pymupdf import __version__
+from pymupdf import __doc__
+from pymupdf import _globals
+from pymupdf import _g_out_message
diff -r 000000000000 -r 1d09e1dec1d9 src/fitz_table.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/src/fitz_table.py Mon Sep 15 11:37:51 2025 +0200
@@ -0,0 +1,2 @@
+# pylint: disable=wildcard-import,unused-wildcard-import
+from pymupdf.table import *
diff -r 000000000000 -r 1d09e1dec1d9 src/fitz_utils.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/src/fitz_utils.py Mon Sep 15 11:37:51 2025 +0200
@@ -0,0 +1,2 @@
+# pylint: disable=wildcard-import,unused-wildcard-import
+from pymupdf.utils import *
diff -r 000000000000 -r 1d09e1dec1d9 src/pymupdf.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/src/pymupdf.py Mon Sep 15 11:37:51 2025 +0200
@@ -0,0 +1,2 @@
+# pylint: disable=wildcard-import,unused-import
+from . import *
diff -r 000000000000 -r 1d09e1dec1d9 src/table.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/src/table.py Mon Sep 15 11:37:51 2025 +0200
@@ -0,0 +1,2563 @@
+"""
+Copyright (C) 2023 Artifex Software, Inc.
+
+This file is part of PyMuPDF.
+
+PyMuPDF is free software: you can redistribute it and/or modify it under the
+terms of the GNU Affero General Public License as published by the Free
+Software Foundation, either version 3 of the License, or (at your option)
+any later version.
+
+PyMuPDF is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
+details.
+
+You should have received a copy of the GNU Affero General Public License
+along with MuPDF. If not, see <https://www.gnu.org/licenses/>.
+
+Alternative licensing terms are available from the licensor.
+For commercial licensing, see or contact +Artifex Software, Inc., 39 Mesa Street, Suite 108A, San Francisco, +CA 94129, USA, for further information. + +--------------------------------------------------------------------- +Portions of this code have been ported from pdfplumber, see +https://pypi.org/project/pdfplumber/. + +The ported code is under the following MIT license: + +--------------------------------------------------------------------- +The MIT License (MIT) + +Copyright (c) 2015, Jeremy Singer-Vine + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. +--------------------------------------------------------------------- +Also see here: https://github.com/jsvine/pdfplumber/blob/stable/LICENSE.txt +--------------------------------------------------------------------- + +The porting mainly pertains to files "table.py" and relevant parts of +"utils/text.py" within pdfplumber's repository on Github. +With respect to "text.py", we have removed functions or features that are not +used by table processing. Examples are: + +* the text search function +* simple text extraction +* text extraction by lines + +Original pdfplumber code does neither detect, nor identify table headers. +This PyMuPDF port adds respective code to the 'Table' class as method '_get_header'. +This is implemented as new class TableHeader with the properties: +* bbox: A tuple for the header's bbox +* cells: A tuple for each bbox of a column header +* names: A list of strings with column header text +* external: A bool indicating whether the header is outside the table cells. + +""" + +import inspect +import itertools +import string +import html +from collections.abc import Sequence +from dataclasses import dataclass +from operator import itemgetter +import weakref + +# ------------------------------------------------------------------- +# Start of PyMuPDF interface code +# ------------------------------------------------------------------- +from . 
import ( + Rect, + Matrix, + TEXTFLAGS_TEXT, + TEXT_FONT_BOLD, + TEXT_FONT_ITALIC, + TEXT_FONT_MONOSPACED, + TEXT_FONT_SUPERSCRIPT, + TEXT_COLLECT_STYLES, + TOOLS, + EMPTY_RECT, + sRGB_to_pdf, + Point, + message, + mupdf, +) + +EDGES = [] # vector graphics from PyMuPDF +CHARS = [] # text characters from PyMuPDF +TEXTPAGE = None +TEXT_BOLD = mupdf.FZ_STEXT_BOLD +TEXT_STRIKEOUT = mupdf.FZ_STEXT_STRIKEOUT +FLAGS = TEXTFLAGS_TEXT | TEXT_COLLECT_STYLES + +white_spaces = set(string.whitespace) # for checking white space only cells + + +def extract_cells(textpage, cell, markdown=False): + """Extract text from a rect-like 'cell' as plain or MD style text. + + This function should ultimately be used to extract text from a table cell. + Markdown output will only work correctly if extraction flag bit + TEXT_COLLECT_STYLES is set. + + Args: + textpage: A PyMuPDF TextPage object. Must have been created with + TEXTFLAGS_TEXT | TEXT_COLLECT_STYLES. + cell: A tuple (x0, y0, x1, y1) defining the cell's bbox. + markdown: If True, return text formatted for Markdown. + + Returns: + A string with the text extracted from the cell. + """ + text = "" + for block in textpage.extractRAWDICT()["blocks"]: + if block["type"] != 0: + continue + block_bbox = block["bbox"] + if ( + 0 + or block_bbox[0] > cell[2] + or block_bbox[2] < cell[0] + or block_bbox[1] > cell[3] + or block_bbox[3] < cell[1] + ): + continue # skip block outside cell + for line in block["lines"]: + lbbox = line["bbox"] + if ( + 0 + or lbbox[0] > cell[2] + or lbbox[2] < cell[0] + or lbbox[1] > cell[3] + or lbbox[3] < cell[1] + ): + continue # skip line outside cell + + if text: # must be a new line in the cell + text += "
" if markdown else "\n" + + # strikeout detection only works with horizontal text + horizontal = line["dir"] == (0, 1) or line["dir"] == (1, 0) + + for span in line["spans"]: + sbbox = span["bbox"] + if ( + 0 + or sbbox[0] > cell[2] + or sbbox[2] < cell[0] + or sbbox[1] > cell[3] + or sbbox[3] < cell[1] + ): + continue # skip spans outside cell + + # only include chars with more than 50% bbox overlap + span_text = "" + for char in span["chars"]: + bbox = Rect(char["bbox"]) + if abs(bbox & cell) > 0.5 * abs(bbox): + span_text += char["c"] + + if not span_text: + continue # skip empty span + + if not markdown: # no MD styling + text += span_text + continue + + prefix = "" + suffix = "" + if horizontal and span["char_flags"] & TEXT_STRIKEOUT: + prefix += "~~" + suffix = "~~" + suffix + if span["char_flags"] & TEXT_BOLD: + prefix += "**" + suffix = "**" + suffix + if span["flags"] & TEXT_FONT_ITALIC: + prefix += "_" + suffix = "_" + suffix + if span["flags"] & TEXT_FONT_MONOSPACED: + prefix += "`" + suffix = "`" + suffix + + if len(span["chars"]) > 2: + span_text = span_text.rstrip() + + # if span continues previous styling: extend cell text + if (ls := len(suffix)) and text.endswith(suffix): + text = text[:-ls] + span_text + suffix + else: # append the span with new styling + if not span_text.strip(): + text += " " + else: + text += prefix + span_text + suffix + + return text.strip() + + +# ------------------------------------------------------------------- +# End of PyMuPDF interface code +# ------------------------------------------------------------------- + + +class UnsetFloat(float): + pass + + +NON_NEGATIVE_SETTINGS = [ + "snap_tolerance", + "snap_x_tolerance", + "snap_y_tolerance", + "join_tolerance", + "join_x_tolerance", + "join_y_tolerance", + "edge_min_length", + "min_words_vertical", + "min_words_horizontal", + "intersection_tolerance", + "intersection_x_tolerance", + "intersection_y_tolerance", +] + + +TABLE_STRATEGIES = ["lines", "lines_strict", "text", "explicit"] +UNSET = UnsetFloat(0) +DEFAULT_SNAP_TOLERANCE = 3 +DEFAULT_JOIN_TOLERANCE = 3 +DEFAULT_MIN_WORDS_VERTICAL = 3 +DEFAULT_MIN_WORDS_HORIZONTAL = 1 +DEFAULT_X_TOLERANCE = 3 +DEFAULT_Y_TOLERANCE = 3 +DEFAULT_X_DENSITY = 7.25 +DEFAULT_Y_DENSITY = 13 +bbox_getter = itemgetter("x0", "top", "x1", "bottom") + + +LIGATURES = { + "ff": "ff", + "ffi": "ffi", + "ffl": "ffl", + "fi": "fi", + "fl": "fl", + "st": "st", + "ſt": "st", +} + + +def to_list(collection) -> list: + if isinstance(collection, list): + return collection + elif isinstance(collection, Sequence): + return list(collection) + elif hasattr(collection, "to_dict"): + res = collection.to_dict("records") # pragma: nocover + return res + else: + return list(collection) + + +class TextMap: + """ + A TextMap maps each unicode character in the text to an individual `char` + object (or, in the case of layout-implied whitespace, `None`). 
+ """ + + def __init__(self, tuples=None) -> None: + self.tuples = tuples + self.as_string = "".join(map(itemgetter(0), tuples)) + + def match_to_dict( + self, + m, + main_group: int = 0, + return_groups: bool = True, + return_chars: bool = True, + ) -> dict: + subset = self.tuples[m.start(main_group) : m.end(main_group)] + chars = [c for (text, c) in subset if c is not None] + x0, top, x1, bottom = objects_to_bbox(chars) + + result = { + "text": m.group(main_group), + "x0": x0, + "top": top, + "x1": x1, + "bottom": bottom, + } + + if return_groups: + result["groups"] = m.groups() + + if return_chars: + result["chars"] = chars + + return result + + +class WordMap: + """ + A WordMap maps words->chars. + """ + + def __init__(self, tuples) -> None: + self.tuples = tuples + + def to_textmap( + self, + layout: bool = False, + layout_width=0, + layout_height=0, + layout_width_chars: int = 0, + layout_height_chars: int = 0, + x_density=DEFAULT_X_DENSITY, + y_density=DEFAULT_Y_DENSITY, + x_shift=0, + y_shift=0, + y_tolerance=DEFAULT_Y_TOLERANCE, + use_text_flow: bool = False, + presorted: bool = False, + expand_ligatures: bool = True, + ) -> TextMap: + """ + Given a list of (word, chars) tuples (i.e., a WordMap), return a list of + (char-text, char) tuples (i.e., a TextMap) that can be used to mimic the + structural layout of the text on the page(s), using the following approach: + + - Sort the words by (doctop, x0) if not already sorted. + + - Calculate the initial doctop for the starting page. + + - Cluster the words by doctop (taking `y_tolerance` into account), and + iterate through them. + + - For each cluster, calculate the distance between that doctop and the + initial doctop, in points, minus `y_shift`. Divide that distance by + `y_density` to calculate the minimum number of newlines that should come + before this cluster. Append that number of newlines *minus* the number of + newlines already appended, with a minimum of one. + + - Then for each cluster, iterate through each word in it. Divide each + word's x0, minus `x_shift`, by `x_density` to calculate the minimum + number of characters that should come before this cluster. Append that + number of spaces *minus* the number of characters and spaces already + appended, with a minimum of one. Then append the word's text. + + - At the termination of each line, add more spaces if necessary to + mimic `layout_width`. + + - Finally, add newlines to the end if necessary to mimic to + `layout_height`. + + Note: This approach currently works best for horizontal, left-to-right + text, but will display all words regardless of orientation. There is room + for improvement in better supporting right-to-left text, as well as + vertical text. + """ + _textmap = [] + + if not len(self.tuples): + return TextMap(_textmap) + + expansions = LIGATURES if expand_ligatures else {} + + if layout: + if layout_width_chars: + if layout_width: + raise ValueError( + "`layout_width` and `layout_width_chars` cannot both be set." + ) + else: + layout_width_chars = int(round(layout_width / x_density)) + + if layout_height_chars: + if layout_height: + raise ValueError( + "`layout_height` and `layout_height_chars` cannot both be set." 
+ ) + else: + layout_height_chars = int(round(layout_height / y_density)) + + blank_line = [(" ", None)] * layout_width_chars + else: + blank_line = [] + + num_newlines = 0 + + words_sorted_doctop = ( + self.tuples + if presorted or use_text_flow + else sorted(self.tuples, key=lambda x: float(x[0]["doctop"])) + ) + + first_word = words_sorted_doctop[0][0] + doctop_start = first_word["doctop"] - first_word["top"] + + for i, ws in enumerate( + cluster_objects( + words_sorted_doctop, lambda x: float(x[0]["doctop"]), y_tolerance + ) + ): + y_dist = ( + (ws[0][0]["doctop"] - (doctop_start + y_shift)) / y_density + if layout + else 0 + ) + num_newlines_prepend = max( + # At least one newline, unless this iis the first line + int(i > 0), + # ... or as many as needed to get the imputed "distance" from the top + round(y_dist) - num_newlines, + ) + + for i in range(num_newlines_prepend): + if not len(_textmap) or _textmap[-1][0] == "\n": + _textmap += blank_line + _textmap.append(("\n", None)) + + num_newlines += num_newlines_prepend + + line_len = 0 + + line_words_sorted_x0 = ( + ws + if presorted or use_text_flow + else sorted(ws, key=lambda x: float(x[0]["x0"])) + ) + + for word, chars in line_words_sorted_x0: + x_dist = (word["x0"] - x_shift) / x_density if layout else 0 + num_spaces_prepend = max(min(1, line_len), round(x_dist) - line_len) + _textmap += [(" ", None)] * num_spaces_prepend + line_len += num_spaces_prepend + + for c in chars: + letters = expansions.get(c["text"], c["text"]) + for letter in letters: + _textmap.append((letter, c)) + line_len += 1 + + # Append spaces at end of line + if layout: + _textmap += [(" ", None)] * (layout_width_chars - line_len) + + # Append blank lines at end of text + if layout: + num_newlines_append = layout_height_chars - (num_newlines + 1) + for i in range(num_newlines_append): + if i > 0: + _textmap += blank_line + _textmap.append(("\n", None)) + + # Remove terminal newline + if _textmap[-1] == ("\n", None): + _textmap = _textmap[:-1] + + return TextMap(_textmap) + + +class WordExtractor: + def __init__( + self, + x_tolerance=DEFAULT_X_TOLERANCE, + y_tolerance=DEFAULT_Y_TOLERANCE, + keep_blank_chars: bool = False, + use_text_flow=False, + horizontal_ltr=True, # Should words be read left-to-right? + vertical_ttb=False, # Should vertical words be read top-to-bottom? 
+ extra_attrs=None, + split_at_punctuation=False, + expand_ligatures=True, + ): + self.x_tolerance = x_tolerance + self.y_tolerance = y_tolerance + self.keep_blank_chars = keep_blank_chars + self.use_text_flow = use_text_flow + self.horizontal_ltr = horizontal_ltr + self.vertical_ttb = vertical_ttb + self.extra_attrs = [] if extra_attrs is None else extra_attrs + + # Note: string.punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' + self.split_at_punctuation = ( + string.punctuation + if split_at_punctuation is True + else (split_at_punctuation or "") + ) + + self.expansions = LIGATURES if expand_ligatures else {} + + def merge_chars(self, ordered_chars: list): + x0, top, x1, bottom = objects_to_bbox(ordered_chars) + doctop_adj = ordered_chars[0]["doctop"] - ordered_chars[0]["top"] + upright = ordered_chars[0]["upright"] + direction = 1 if (self.horizontal_ltr if upright else self.vertical_ttb) else -1 + + matrix = ordered_chars[0]["matrix"] + + rotation = 0 + if not upright and matrix[1] < 0: + ordered_chars = reversed(ordered_chars) + rotation = 270 + + if matrix[0] < 0 and matrix[3] < 0: + rotation = 180 + elif matrix[1] > 0: + rotation = 90 + + word = { + "text": "".join( + self.expansions.get(c["text"], c["text"]) for c in ordered_chars + ), + "x0": x0, + "x1": x1, + "top": top, + "doctop": top + doctop_adj, + "bottom": bottom, + "upright": upright, + "direction": direction, + "rotation": rotation, + } + + for key in self.extra_attrs: + word[key] = ordered_chars[0][key] + + return word + + def char_begins_new_word( + self, + prev_char, + curr_char, + ) -> bool: + """This method takes several factors into account to determine if + `curr_char` represents the beginning of a new word: + + - Whether the text is "upright" (i.e., non-rotated) + - Whether the user has specified that horizontal text runs + left-to-right (default) or right-to-left, as represented by + self.horizontal_ltr + - Whether the user has specified that vertical text the text runs + top-to-bottom (default) or bottom-to-top, as represented by + self.vertical_ttb + - The x0, top, x1, and bottom attributes of prev_char and + curr_char + - The self.x_tolerance and self.y_tolerance settings. Note: In + this case, x/y refer to those directions for non-rotated text. + For vertical text, they are flipped. A more accurate terminology + might be "*intra*line character distance tolerance" and + "*inter*line character distance tolerance" + + An important note: The *intra*line distance is measured from the + *end* of the previous character to the *beginning* of the current + character, while the *inter*line distance is measured from the + *top* of the previous character to the *top* of the next + character. The reasons for this are partly repository-historical, + and partly logical, as successive text lines' bounding boxes often + overlap slightly (and we don't want that overlap to be interpreted + as the two lines being the same line). + + The upright-ness of the character determines the attributes to + compare, while horizontal_ltr/vertical_ttb determine the direction + of the comparison. + """ + + # Note: Due to the grouping step earlier in the process, + # curr_char["upright"] will always equal prev_char["upright"]. 
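+        # Illustrative example, assuming the default x_tolerance of 3: for
+        # upright left-to-right text with prev_char["x1"] == 100.0, a
+        # curr_char starting at x0 == 104.5 fails the intraline test
+        # (104.5 > 100.0 + 3) and so begins a new word, while x0 == 101.0
+        # would continue the current word (given the tops are within
+        # y_tolerance).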
+ if curr_char["upright"]: + x = self.x_tolerance + y = self.y_tolerance + ay = prev_char["top"] + cy = curr_char["top"] + if self.horizontal_ltr: + ax = prev_char["x0"] + bx = prev_char["x1"] + cx = curr_char["x0"] + else: + ax = -prev_char["x1"] + bx = -prev_char["x0"] + cx = -curr_char["x1"] + + else: + x = self.y_tolerance + y = self.x_tolerance + ay = prev_char["x0"] + cy = curr_char["x0"] + if self.vertical_ttb: + ax = prev_char["top"] + bx = prev_char["bottom"] + cx = curr_char["top"] + else: + ax = -prev_char["bottom"] + bx = -prev_char["top"] + cx = -curr_char["bottom"] + + return bool( + # Intraline test + (cx < ax) + or (cx > bx + x) + # Interline test + or (cy > ay + y) + ) + + def iter_chars_to_words(self, ordered_chars): + current_word: list = [] + + def start_next_word(new_char=None): + nonlocal current_word + + if current_word: + yield current_word + + current_word = [] if new_char is None else [new_char] + + for char in ordered_chars: + text = char["text"] + + if not self.keep_blank_chars and text.isspace(): + yield from start_next_word(None) + + elif text in self.split_at_punctuation: + yield from start_next_word(char) + yield from start_next_word(None) + + elif current_word and self.char_begins_new_word(current_word[-1], char): + yield from start_next_word(char) + + else: + current_word.append(char) + + # Finally, after all chars processed + if current_word: + yield current_word + + def iter_sort_chars(self, chars): + def upright_key(x) -> int: + return -int(x["upright"]) + + for upright_cluster in cluster_objects(list(chars), upright_key, 0): + upright = upright_cluster[0]["upright"] + cluster_key = "doctop" if upright else "x0" + + # Cluster by line + subclusters = cluster_objects( + upright_cluster, itemgetter(cluster_key), self.y_tolerance + ) + + for sc in subclusters: + # Sort within line + sort_key = "x0" if upright else "doctop" + to_yield = sorted(sc, key=itemgetter(sort_key)) + + # Reverse order if necessary + if not (self.horizontal_ltr if upright else self.vertical_ttb): + yield from reversed(to_yield) + else: + yield from to_yield + + def iter_extract_tuples(self, chars): + ordered_chars = chars if self.use_text_flow else self.iter_sort_chars(chars) + + grouping_key = itemgetter("upright", *self.extra_attrs) + grouped_chars = itertools.groupby(ordered_chars, grouping_key) + + for keyvals, char_group in grouped_chars: + for word_chars in self.iter_chars_to_words(char_group): + yield (self.merge_chars(word_chars), word_chars) + + def extract_wordmap(self, chars) -> WordMap: + return WordMap(list(self.iter_extract_tuples(chars))) + + def extract_words(self, chars: list) -> list: + words = list(word for word, word_chars in self.iter_extract_tuples(chars)) + return words + + +def extract_words(chars: list, **kwargs) -> list: + return WordExtractor(**kwargs).extract_words(chars) + + +TEXTMAP_KWARGS = inspect.signature(WordMap.to_textmap).parameters.keys() +WORD_EXTRACTOR_KWARGS = inspect.signature(WordExtractor).parameters.keys() + + +def chars_to_textmap(chars: list, **kwargs) -> TextMap: + kwargs.update({"presorted": True}) + + extractor = WordExtractor( + **{k: kwargs[k] for k in WORD_EXTRACTOR_KWARGS if k in kwargs} + ) + wordmap = extractor.extract_wordmap(chars) + textmap = wordmap.to_textmap( + **{k: kwargs[k] for k in TEXTMAP_KWARGS if k in kwargs} + ) + + return textmap + + +def extract_text(chars: list, **kwargs) -> str: + chars = to_list(chars) + if len(chars) == 0: + return "" + + if kwargs.get("layout"): + return chars_to_textmap(chars, 
**kwargs).as_string + else: + y_tolerance = kwargs.get("y_tolerance", DEFAULT_Y_TOLERANCE) + extractor = WordExtractor( + **{k: kwargs[k] for k in WORD_EXTRACTOR_KWARGS if k in kwargs} + ) + words = extractor.extract_words(chars) + if words: + rotation = words[0]["rotation"] # rotation cannot change within a cell + else: + rotation = 0 + + if rotation == 90: + words.sort(key=lambda w: (w["x1"], -w["top"])) + lines = " ".join([w["text"] for w in words]) + elif rotation == 270: + words.sort(key=lambda w: (-w["x1"], w["top"])) + lines = " ".join([w["text"] for w in words]) + else: + lines = cluster_objects(words, itemgetter("doctop"), y_tolerance) + lines = "\n".join(" ".join(word["text"] for word in line) for line in lines) + if rotation == 180: # needs extra treatment + lines = "".join([(c if c != "\n" else " ") for c in reversed(lines)]) + + return lines + + +def collate_line( + line_chars: list, + tolerance=DEFAULT_X_TOLERANCE, +) -> str: + coll = "" + last_x1 = None + for char in sorted(line_chars, key=itemgetter("x0")): + if (last_x1 is not None) and (char["x0"] > (last_x1 + tolerance)): + coll += " " + last_x1 = char["x1"] + coll += char["text"] + return coll + + +def dedupe_chars(chars: list, tolerance=1) -> list: + """ + Removes duplicate chars — those sharing the same text, fontname, size, + and positioning (within `tolerance`) as other characters in the set. + """ + key = itemgetter("fontname", "size", "upright", "text") + pos_key = itemgetter("doctop", "x0") + + def yield_unique_chars(chars: list): + sorted_chars = sorted(chars, key=key) + for grp, grp_chars in itertools.groupby(sorted_chars, key=key): + for y_cluster in cluster_objects( + list(grp_chars), itemgetter("doctop"), tolerance + ): + for x_cluster in cluster_objects( + y_cluster, itemgetter("x0"), tolerance + ): + yield sorted(x_cluster, key=pos_key)[0] + + deduped = yield_unique_chars(chars) + return sorted(deduped, key=chars.index) + + +def line_to_edge(line): + edge = dict(line) + edge["orientation"] = "h" if (line["top"] == line["bottom"]) else "v" + return edge + + +def rect_to_edges(rect) -> list: + top, bottom, left, right = [dict(rect) for x in range(4)] + top.update( + { + "object_type": "rect_edge", + "height": 0, + "y0": rect["y1"], + "bottom": rect["top"], + "orientation": "h", + } + ) + bottom.update( + { + "object_type": "rect_edge", + "height": 0, + "y1": rect["y0"], + "top": rect["top"] + rect["height"], + "doctop": rect["doctop"] + rect["height"], + "orientation": "h", + } + ) + left.update( + { + "object_type": "rect_edge", + "width": 0, + "x1": rect["x0"], + "orientation": "v", + } + ) + right.update( + { + "object_type": "rect_edge", + "width": 0, + "x0": rect["x1"], + "orientation": "v", + } + ) + return [top, bottom, left, right] + + +def curve_to_edges(curve) -> list: + point_pairs = zip(curve["pts"], curve["pts"][1:]) + return [ + { + "object_type": "curve_edge", + "x0": min(p0[0], p1[0]), + "x1": max(p0[0], p1[0]), + "top": min(p0[1], p1[1]), + "doctop": min(p0[1], p1[1]) + (curve["doctop"] - curve["top"]), + "bottom": max(p0[1], p1[1]), + "width": abs(p0[0] - p1[0]), + "height": abs(p0[1] - p1[1]), + "orientation": "v" if p0[0] == p1[0] else ("h" if p0[1] == p1[1] else None), + } + for p0, p1 in point_pairs + ] + + +def obj_to_edges(obj) -> list: + t = obj["object_type"] + if "_edge" in t: + return [obj] + elif t == "line": + return [line_to_edge(obj)] + else: + return {"rect": rect_to_edges, "curve": curve_to_edges}[t](obj) + + +def filter_edges( + edges, + orientation=None, + edge_type=None, 
+ min_length=1, +) -> list: + if orientation not in ("v", "h", None): + raise ValueError("Orientation must be 'v' or 'h'") + + def test(e) -> bool: + dim = "height" if e["orientation"] == "v" else "width" + et_correct = e["object_type"] == edge_type if edge_type is not None else True + orient_correct = orientation is None or e["orientation"] == orientation + return bool(et_correct and orient_correct and (e[dim] >= min_length)) + + return list(filter(test, edges)) + + +def cluster_list(xs, tolerance=0) -> list: + if tolerance == 0: + return [[x] for x in sorted(xs)] + if len(xs) < 2: + return [[x] for x in sorted(xs)] + groups = [] + xs = list(sorted(xs)) + current_group = [xs[0]] + last = xs[0] + for x in xs[1:]: + if x <= (last + tolerance): + current_group.append(x) + else: + groups.append(current_group) + current_group = [x] + last = x + groups.append(current_group) + return groups + + +def make_cluster_dict(values, tolerance) -> dict: + clusters = cluster_list(list(set(values)), tolerance) + + nested_tuples = [ + [(val, i) for val in value_cluster] for i, value_cluster in enumerate(clusters) + ] + + return dict(itertools.chain(*nested_tuples)) + + +def cluster_objects(xs, key_fn, tolerance) -> list: + if not callable(key_fn): + key_fn = itemgetter(key_fn) + + values = map(key_fn, xs) + cluster_dict = make_cluster_dict(values, tolerance) + + get_0, get_1 = itemgetter(0), itemgetter(1) + + cluster_tuples = sorted(((x, cluster_dict.get(key_fn(x))) for x in xs), key=get_1) + + grouped = itertools.groupby(cluster_tuples, key=get_1) + + return [list(map(get_0, v)) for k, v in grouped] + + +def move_object(obj, axis: str, value): + assert axis in ("h", "v") + if axis == "h": + new_items = [ + ("x0", obj["x0"] + value), + ("x1", obj["x1"] + value), + ] + if axis == "v": + new_items = [ + ("top", obj["top"] + value), + ("bottom", obj["bottom"] + value), + ] + if "doctop" in obj: + new_items += [("doctop", obj["doctop"] + value)] + if "y0" in obj: + new_items += [ + ("y0", obj["y0"] - value), + ("y1", obj["y1"] - value), + ] + return obj.__class__(tuple(obj.items()) + tuple(new_items)) + + +def snap_objects(objs, attr: str, tolerance) -> list: + axis = {"x0": "h", "x1": "h", "top": "v", "bottom": "v"}[attr] + list_objs = list(objs) + clusters = cluster_objects(list_objs, itemgetter(attr), tolerance) + avgs = [sum(map(itemgetter(attr), cluster)) / len(cluster) for cluster in clusters] + snapped_clusters = [ + [move_object(obj, axis, avg - obj[attr]) for obj in cluster] + for cluster, avg in zip(clusters, avgs) + ] + return list(itertools.chain(*snapped_clusters)) + + +def snap_edges( + edges, + x_tolerance=DEFAULT_SNAP_TOLERANCE, + y_tolerance=DEFAULT_SNAP_TOLERANCE, +): + """ + Given a list of edges, snap any within `tolerance` pixels of one another + to their positional average. 
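+
+    For illustration (hypothetical numbers): with x_tolerance=3, two vertical
+    edges with x0 = 99.6 and x0 = 100.3 end up in one cluster and are both
+    moved to the cluster average x0 = 99.95, so that they can later be joined
+    and intersected exactly.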
+ """ + by_orientation = {"v": [], "h": []} + for e in edges: + by_orientation[e["orientation"]].append(e) + + snapped_v = snap_objects(by_orientation["v"], "x0", x_tolerance) + snapped_h = snap_objects(by_orientation["h"], "top", y_tolerance) + return snapped_v + snapped_h + + +def resize_object(obj, key: str, value): + assert key in ("x0", "x1", "top", "bottom") + old_value = obj[key] + diff = value - old_value + new_items = [ + (key, value), + ] + if key == "x0": + assert value <= obj["x1"] + new_items.append(("width", obj["x1"] - value)) + elif key == "x1": + assert value >= obj["x0"] + new_items.append(("width", value - obj["x0"])) + elif key == "top": + assert value <= obj["bottom"] + new_items.append(("doctop", obj["doctop"] + diff)) + new_items.append(("height", obj["height"] - diff)) + if "y1" in obj: + new_items.append(("y1", obj["y1"] - diff)) + elif key == "bottom": + assert value >= obj["top"] + new_items.append(("height", obj["height"] + diff)) + if "y0" in obj: + new_items.append(("y0", obj["y0"] - diff)) + return obj.__class__(tuple(obj.items()) + tuple(new_items)) + + +def join_edge_group(edges, orientation: str, tolerance=DEFAULT_JOIN_TOLERANCE): + """ + Given a list of edges along the same infinite line, join those that + are within `tolerance` pixels of one another. + """ + if orientation == "h": + min_prop, max_prop = "x0", "x1" + elif orientation == "v": + min_prop, max_prop = "top", "bottom" + else: + raise ValueError("Orientation must be 'v' or 'h'") + + sorted_edges = list(sorted(edges, key=itemgetter(min_prop))) + joined = [sorted_edges[0]] + for e in sorted_edges[1:]: + last = joined[-1] + if e[min_prop] <= (last[max_prop] + tolerance): + if e[max_prop] > last[max_prop]: + # Extend current edge to new extremity + joined[-1] = resize_object(last, max_prop, e[max_prop]) + else: + # Edge is separate from previous edges + joined.append(e) + + return joined + + +def merge_edges( + edges, + snap_x_tolerance, + snap_y_tolerance, + join_x_tolerance, + join_y_tolerance, +): + """ + Using the `snap_edges` and `join_edge_group` methods above, + merge a list of edges into a more "seamless" list. + """ + + def get_group(edge): + if edge["orientation"] == "h": + return ("h", edge["top"]) + else: + return ("v", edge["x0"]) + + if snap_x_tolerance > 0 or snap_y_tolerance > 0: + edges = snap_edges(edges, snap_x_tolerance, snap_y_tolerance) + + _sorted = sorted(edges, key=get_group) + edge_groups = itertools.groupby(_sorted, key=get_group) + edge_gen = ( + join_edge_group( + items, k[0], (join_x_tolerance if k[0] == "h" else join_y_tolerance) + ) + for k, items in edge_groups + ) + edges = list(itertools.chain(*edge_gen)) + return edges + + +def bbox_to_rect(bbox) -> dict: + """ + Return the rectangle (i.e a dict with keys "x0", "top", "x1", + "bottom") for an object. + """ + return {"x0": bbox[0], "top": bbox[1], "x1": bbox[2], "bottom": bbox[3]} + + +def objects_to_rect(objects) -> dict: + """ + Given an iterable of objects, return the smallest rectangle (i.e. a + dict with "x0", "top", "x1", and "bottom" keys) that contains them + all. + """ + return bbox_to_rect(objects_to_bbox(objects)) + + +def merge_bboxes(bboxes): + """ + Given an iterable of bounding boxes, return the smallest bounding box + that contains them all. + """ + x0, top, x1, bottom = zip(*bboxes) + return (min(x0), min(top), max(x1), max(bottom)) + + +def objects_to_bbox(objects): + """ + Given an iterable of objects, return the smallest bounding box that + contains them all. 
+ """ + return merge_bboxes(map(bbox_getter, objects)) + + +def words_to_edges_h(words, word_threshold: int = DEFAULT_MIN_WORDS_HORIZONTAL): + """ + Find (imaginary) horizontal lines that connect the tops + of at least `word_threshold` words. + """ + by_top = cluster_objects(words, itemgetter("top"), 1) + large_clusters = filter(lambda x: len(x) >= word_threshold, by_top) + rects = list(map(objects_to_rect, large_clusters)) + if len(rects) == 0: + return [] + min_x0 = min(map(itemgetter("x0"), rects)) + max_x1 = max(map(itemgetter("x1"), rects)) + + edges = [] + for r in rects: + edges += [ + # Top of text + { + "x0": min_x0, + "x1": max_x1, + "top": r["top"], + "bottom": r["top"], + "width": max_x1 - min_x0, + "orientation": "h", + }, + # For each detected row, we also add the 'bottom' line. This will + # generate extra edges, (some will be redundant with the next row + # 'top' line), but this catches the last row of every table. + { + "x0": min_x0, + "x1": max_x1, + "top": r["bottom"], + "bottom": r["bottom"], + "width": max_x1 - min_x0, + "orientation": "h", + }, + ] + + return edges + + +def get_bbox_overlap(a, b): + a_left, a_top, a_right, a_bottom = a + b_left, b_top, b_right, b_bottom = b + o_left = max(a_left, b_left) + o_right = min(a_right, b_right) + o_bottom = min(a_bottom, b_bottom) + o_top = max(a_top, b_top) + o_width = o_right - o_left + o_height = o_bottom - o_top + if o_height >= 0 and o_width >= 0 and o_height + o_width > 0: + return (o_left, o_top, o_right, o_bottom) + else: + return None + + +def words_to_edges_v(words, word_threshold: int = DEFAULT_MIN_WORDS_VERTICAL): + """ + Find (imaginary) vertical lines that connect the left, right, or + center of at least `word_threshold` words. + """ + # Find words that share the same left, right, or centerpoints + by_x0 = cluster_objects(words, itemgetter("x0"), 1) + by_x1 = cluster_objects(words, itemgetter("x1"), 1) + + def get_center(word): + return float(word["x0"] + word["x1"]) / 2 + + by_center = cluster_objects(words, get_center, 1) + clusters = by_x0 + by_x1 + by_center + + # Find the points that align with the most words + sorted_clusters = sorted(clusters, key=lambda x: -len(x)) + large_clusters = filter(lambda x: len(x) >= word_threshold, sorted_clusters) + + # For each of those points, find the bboxes fitting all matching words + bboxes = list(map(objects_to_bbox, large_clusters)) + + # Iterate through those bboxes, condensing overlapping bboxes + condensed_bboxes = [] + for bbox in bboxes: + overlap = any(get_bbox_overlap(bbox, c) for c in condensed_bboxes) + if not overlap: + condensed_bboxes.append(bbox) + + if not condensed_bboxes: + return [] + + condensed_rects = map(bbox_to_rect, condensed_bboxes) + sorted_rects = list(sorted(condensed_rects, key=itemgetter("x0"))) + + max_x1 = max(map(itemgetter("x1"), sorted_rects)) + min_top = min(map(itemgetter("top"), sorted_rects)) + max_bottom = max(map(itemgetter("bottom"), sorted_rects)) + + return [ + { + "x0": b["x0"], + "x1": b["x0"], + "top": min_top, + "bottom": max_bottom, + "height": max_bottom - min_top, + "orientation": "v", + } + for b in sorted_rects + ] + [ + { + "x0": max_x1, + "x1": max_x1, + "top": min_top, + "bottom": max_bottom, + "height": max_bottom - min_top, + "orientation": "v", + } + ] + + +def edges_to_intersections(edges, x_tolerance=1, y_tolerance=1) -> dict: + """ + Given a list of edges, return the points at which they intersect + within `tolerance` pixels. 
+ """ + intersections = {} + v_edges, h_edges = [ + list(filter(lambda x: x["orientation"] == o, edges)) for o in ("v", "h") + ] + for v in sorted(v_edges, key=itemgetter("x0", "top")): + for h in sorted(h_edges, key=itemgetter("top", "x0")): + if ( + (v["top"] <= (h["top"] + y_tolerance)) + and (v["bottom"] >= (h["top"] - y_tolerance)) + and (v["x0"] >= (h["x0"] - x_tolerance)) + and (v["x0"] <= (h["x1"] + x_tolerance)) + ): + vertex = (v["x0"], h["top"]) + if vertex not in intersections: + intersections[vertex] = {"v": [], "h": []} + intersections[vertex]["v"].append(v) + intersections[vertex]["h"].append(h) + return intersections + + +def obj_to_bbox(obj): + """ + Return the bounding box for an object. + """ + return bbox_getter(obj) + + +def intersections_to_cells(intersections): + """ + Given a list of points (`intersections`), return all rectangular "cells" + that those points describe. + + `intersections` should be a dictionary with (x0, top) tuples as keys, + and a list of edge objects as values. The edge objects should correspond + to the edges that touch the intersection. + """ + + def edge_connects(p1, p2) -> bool: + def edges_to_set(edges): + return set(map(obj_to_bbox, edges)) + + if p1[0] == p2[0]: + common = edges_to_set(intersections[p1]["v"]).intersection( + edges_to_set(intersections[p2]["v"]) + ) + if len(common): + return True + + if p1[1] == p2[1]: + common = edges_to_set(intersections[p1]["h"]).intersection( + edges_to_set(intersections[p2]["h"]) + ) + if len(common): + return True + return False + + points = list(sorted(intersections.keys())) + n_points = len(points) + + def find_smallest_cell(points, i: int): + if i == n_points - 1: + return None + pt = points[i] + rest = points[i + 1 :] + # Get all the points directly below and directly right + below = [x for x in rest if x[0] == pt[0]] + right = [x for x in rest if x[1] == pt[1]] + for below_pt in below: + if not edge_connects(pt, below_pt): + continue + + for right_pt in right: + if not edge_connects(pt, right_pt): + continue + + bottom_right = (right_pt[0], below_pt[1]) + + if ( + (bottom_right in intersections) + and edge_connects(bottom_right, right_pt) + and edge_connects(bottom_right, below_pt) + ): + return (pt[0], pt[1], bottom_right[0], bottom_right[1]) + return None + + cell_gen = (find_smallest_cell(points, i) for i in range(len(points))) + return list(filter(None, cell_gen)) + + +def cells_to_tables(page, cells) -> list: + """ + Given a list of bounding boxes (`cells`), return a list of tables that + hold those cells most simply (and contiguously). + """ + + def bbox_to_corners(bbox) -> tuple: + x0, top, x1, bottom = bbox + return ((x0, top), (x0, bottom), (x1, top), (x1, bottom)) + + remaining_cells = list(cells) + + # Iterate through the cells found above, and assign them + # to contiguous tables + + current_corners = set() + current_cells = [] + + tables = [] + while len(remaining_cells): + initial_cell_count = len(current_cells) + for cell in list(remaining_cells): + cell_corners = bbox_to_corners(cell) + # If we're just starting a table ... + if len(current_cells) == 0: + # ... immediately assign it to the empty group + current_corners |= set(cell_corners) + current_cells.append(cell) + remaining_cells.remove(cell) + else: + # How many corners does this table share with the current group? + corner_count = sum(c in current_corners for c in cell_corners) + + # If touching on at least one corner... + if corner_count > 0: + # ... 
assign it to the current group + current_corners |= set(cell_corners) + current_cells.append(cell) + remaining_cells.remove(cell) + + # If this iteration did not find any more cells to append... + if len(current_cells) == initial_cell_count: + # ... start a new cell group + tables.append(list(current_cells)) + current_corners.clear() + current_cells.clear() + + # Once we have exhausting the list of cells ... + + # ... and we have a cell group that has not been stored + if len(current_cells): + # ... store it. + tables.append(list(current_cells)) + + # PyMuPDF modification: + # Remove tables without text or having only 1 column + for i in range(len(tables) - 1, -1, -1): + r = EMPTY_RECT() + x1_vals = set() + x0_vals = set() + for c in tables[i]: + r |= c + x1_vals.add(c[2]) + x0_vals.add(c[0]) + if ( + len(x1_vals) < 2 + or len(x0_vals) < 2 + or white_spaces.issuperset( + page.get_textbox( + r, + textpage=TEXTPAGE, + ) + ) + ): + del tables[i] + + # Sort the tables top-to-bottom-left-to-right based on the value of the + # topmost-and-then-leftmost coordinate of a table. + _sorted = sorted(tables, key=lambda t: min((c[1], c[0]) for c in t)) + return _sorted + + +class CellGroup: + def __init__(self, cells): + self.cells = cells + self.bbox = ( + min(map(itemgetter(0), filter(None, cells))), + min(map(itemgetter(1), filter(None, cells))), + max(map(itemgetter(2), filter(None, cells))), + max(map(itemgetter(3), filter(None, cells))), + ) + + +class TableRow(CellGroup): + pass + + +class TableHeader: + """PyMuPDF extension containing the identified table header.""" + + def __init__(self, bbox, cells, names, above): + self.bbox = bbox + self.cells = cells + self.names = names + self.external = above + + +class Table: + def __init__(self, page, cells): + self.page = page + self.cells = cells + self.header = self._get_header() # PyMuPDF extension + + @property + def bbox(self): + c = self.cells + return ( + min(map(itemgetter(0), c)), + min(map(itemgetter(1), c)), + max(map(itemgetter(2), c)), + max(map(itemgetter(3), c)), + ) + + @property + def rows(self) -> list: + _sorted = sorted(self.cells, key=itemgetter(1, 0)) + xs = list(sorted(set(map(itemgetter(0), self.cells)))) + rows = [] + for y, row_cells in itertools.groupby(_sorted, itemgetter(1)): + xdict = {cell[0]: cell for cell in row_cells} + row = TableRow([xdict.get(x) for x in xs]) + rows.append(row) + return rows + + @property + def row_count(self) -> int: # PyMuPDF extension + return len(self.rows) + + @property + def col_count(self) -> int: # PyMuPDF extension + return max([len(r.cells) for r in self.rows]) + + def extract(self, **kwargs) -> list: + chars = CHARS + table_arr = [] + + def char_in_bbox(char, bbox) -> bool: + v_mid = (char["top"] + char["bottom"]) / 2 + h_mid = (char["x0"] + char["x1"]) / 2 + x0, top, x1, bottom = bbox + return bool( + (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom) + ) + + for row in self.rows: + arr = [] + row_chars = [char for char in chars if char_in_bbox(char, row.bbox)] + + for cell in row.cells: + if cell is None: + cell_text = None + else: + cell_chars = [ + char for char in row_chars if char_in_bbox(char, cell) + ] + + if len(cell_chars): + kwargs["x_shift"] = cell[0] + kwargs["y_shift"] = cell[1] + if "layout" in kwargs: + kwargs["layout_width"] = cell[2] - cell[0] + kwargs["layout_height"] = cell[3] - cell[1] + cell_text = extract_text(cell_chars, **kwargs) + else: + cell_text = "" + arr.append(cell_text) + table_arr.append(arr) + + return table_arr + + def 
to_markdown(self, clean=False, fill_empty=True):
+        """Output table content as a string in GitHub-markdown format.
+
+        If "clean" then markdown syntax is removed from cell content.
+        If "fill_empty" then cell content None is replaced by the values
+        above (columns) or left (rows) in an effort to approximate row and
+        column spans.
+
+        """
+        output = "|"
+        rows = self.row_count
+        cols = self.col_count
+
+        # cell coordinates
+        cell_boxes = [[c for c in r.cells] for r in self.rows]
+
+        # cell text strings
+        cells = [[None for i in range(cols)] for j in range(rows)]
+        for i, row in enumerate(cell_boxes):
+            for j, cell in enumerate(row):
+                if cell is not None:
+                    cells[i][j] = extract_cells(
+                        TEXTPAGE, cell_boxes[i][j], markdown=True
+                    )
+
+        if fill_empty: # fill "None" cells where possible
+
+            # for rows, copy content from left to right
+            for j in range(rows):
+                for i in range(cols - 1):
+                    if cells[j][i + 1] is None:
+                        cells[j][i + 1] = cells[j][i]
+
+            # for columns, copy top to bottom
+            for i in range(cols):
+                for j in range(rows - 1):
+                    if cells[j + 1][i] is None:
+                        cells[j + 1][i] = cells[j][i]
+
+        # generate header string and MD separator
+        for i, name in enumerate(self.header.names):
+            if not name: # generate a name if empty
+                name = f"Col{i+1}"
+            name = name.replace("\n", "<br>
") # use HTML line breaks + if clean: # remove sensitive syntax + name = html.escape(name.replace("-", "-")) + output += name + "|" + + output += "\n" + # insert GitHub header line separator + output += "|" + "|".join("---" for i in range(self.col_count)) + "|\n" + + # skip first row in details if header is part of the table + j = 0 if self.header.external else 1 + + # iterate over detail rows + for row in cells[j:]: + line = "|" + for i, cell in enumerate(row): + # replace None cells with empty string + # use HTML line break tag + if cell is None: + cell = "" + if clean: # remove sensitive syntax + cell = html.escape(cell.replace("-", "-")) + line += cell + "|" + line += "\n" + output += line + return output + "\n" + + def to_pandas(self, **kwargs): + """Return a pandas DataFrame version of the table.""" + try: + import pandas as pd + except ModuleNotFoundError: + message("Package 'pandas' is not installed") + raise + + pd_dict = {} + extract = self.extract() + hdr = self.header + names = self.header.names + hdr_len = len(names) + # ensure uniqueness of column names + for i in range(hdr_len): + name = names[i] + if not name: + names[i] = f"Col{i}" + if hdr_len != len(set(names)): + for i in range(hdr_len): + name = names[i] + if name != f"Col{i}": + names[i] = f"{i}-{name}" + + if not hdr.external: # header is part of 'extract' + extract = extract[1:] + + for i in range(hdr_len): + key = names[i] + value = [] + for j in range(len(extract)): + value.append(extract[j][i]) + pd_dict[key] = value + + return pd.DataFrame(pd_dict) + + def _get_header(self, y_tolerance=3): + """Identify the table header. + + *** PyMuPDF extension. *** + + Starting from the first line above the table upwards, check if it + qualifies to be part of the table header. + + Criteria include: + * A one-line table never has an extra header. + * Column borders must not intersect any word. If this happens, all + text of this line and above of it is ignored. + * No excess inter-line distance: If a line further up has a distance + of more than 1.5 times of its font size, it will be ignored and + all lines above of it. + * Must have same text properties. + * Starting with the top table line, a bold text property cannot change + back to non-bold. + + If not all criteria are met (or there is no text above the table), + the first table row is assumed to be the header. + """ + page = self.page + y_delta = y_tolerance + + def top_row_bg_color(self): + """ + Compare top row background color with color of same-sized bbox + above. If different, return True indicating that the original + table top row is already the header. + """ + bbox0 = Rect(self.rows[0].bbox) + bboxt = bbox0 + (0, -bbox0.height, 0, -bbox0.height) # area above + top_color0 = page.get_pixmap(clip=bbox0).color_topusage()[1] + top_colort = page.get_pixmap(clip=bboxt).color_topusage()[1] + if top_color0 != top_colort: + return True # top row is header + return False + + def row_has_bold(bbox): + """Check if a row contains some bold text. + + If e.g. true for the top row, then it will be used as (internal) + column header row if any of the following is true: + * the previous (above) text line has no bold span + * the second table row text has no bold span + + Returns True if any spans are bold else False. 
+ """ + blocks = page.get_text("dict", flags=TEXTFLAGS_TEXT, clip=bbox)["blocks"] + spans = [s for b in blocks for l in b["lines"] for s in l["spans"]] + + return any(s["flags"] & TEXT_FONT_BOLD for s in spans) + + try: + row = self.rows[0] + cells = row.cells + bbox = Rect(row.bbox) + except IndexError: # this table has no rows + return None + + # return this if we determine that the top row is the header + header_top_row = TableHeader(bbox, cells, self.extract()[0], False) + + # 1-line tables have no extra header + if len(self.rows) < 2: + return header_top_row + + # 1-column tables have no extra header + if len(cells) < 2: + return header_top_row + + # assume top row is the header if second row is empty + row2 = self.rows[1] # second row + if all(c is None for c in row2.cells): # no valid cell bboxes in row2 + return header_top_row + + # Special check: is top row bold? + top_row_bold = row_has_bold(bbox) + + # assume top row is header if it is bold and any cell + # of 2nd row is non-bold + if top_row_bold and not row_has_bold(row2.bbox): + return header_top_row + + if top_row_bg_color(self): + # if area above top row has a different background color, + # then top row is already the header + return header_top_row + + # column coordinates (x1 values) in top row + col_x = [c[2] if c is not None else None for c in cells[:-1]] + + # clip = page area above the table + # We will inspect this area for text qualifying as column header. + clip = +bbox # take row 0 bbox + clip.y0 = 0 # start at top of page + clip.y1 = bbox.y0 # end at top of table + + blocks = page.get_text("dict", clip=clip, flags=TEXTFLAGS_TEXT)["blocks"] + # non-empty, non-superscript spans above table, sorted descending by y1 + spans = sorted( + [ + s + for b in blocks + for l in b["lines"] + for s in l["spans"] + if not ( + white_spaces.issuperset(s["text"]) + or s["flags"] & TEXT_FONT_SUPERSCRIPT + ) + ], + key=lambda s: s["bbox"][3], + reverse=True, + ) + + select = [] # y1 coordinates above, sorted descending + line_heights = [] # line heights above, sorted descending + line_bolds = [] # bold indicator per line above, same sorting + + # walk through the spans and fill above 3 lists + for i in range(len(spans)): + s = spans[i] + y1 = s["bbox"][3] # span bottom + h = y1 - s["bbox"][1] # span bbox height + bold = s["flags"] & TEXT_FONT_BOLD + + # use first item to start the lists + if i == 0: + select.append(y1) + line_heights.append(h) + line_bolds.append(bold) + continue + + # get previous items from the 3 lists + y0 = select[-1] + h0 = line_heights[-1] + bold0 = line_bolds[-1] + + if bold0 and not bold: + break # stop if switching from bold to non-bold + + # if fitting in height of previous span, modify bbox + if y0 - y1 <= y_delta or abs((y0 - h0) - s["bbox"][1]) <= y_delta: + s["bbox"] = (s["bbox"][0], y0 - h0, s["bbox"][2], y0) + spans[i] = s + if bold: + line_bolds[-1] = bold + continue + elif y0 - y1 > 1.5 * h0: + break # stop if distance to previous line too large + select.append(y1) + line_heights.append(h) + line_bolds.append(bold) + + if select == []: # nothing above the table? 
+ return header_top_row + + select = select[:5] # accept up to 5 lines for an external header + + # assume top row as header if text above is too far away + if bbox.y0 - select[0] >= line_heights[0]: + return header_top_row + + # accept top row as header if bold, but line above is not + if top_row_bold and not line_bolds[0]: + return header_top_row + + if spans == []: # nothing left above the table, return top row + return header_top_row + + # re-compute clip above table + nclip = EMPTY_RECT() + for s in [s for s in spans if s["bbox"][3] >= select[-1]]: + nclip |= s["bbox"] + if not nclip.is_empty: + clip = nclip + + clip.y1 = bbox.y0 # make sure we still include every word above + + # Confirm that no word in clip is intersecting a column separator + word_rects = [Rect(w[:4]) for w in page.get_text("words", clip=clip)] + word_tops = sorted(list(set([r[1] for r in word_rects])), reverse=True) + + select = [] + + # exclude lines with words that intersect a column border + for top in word_tops: + intersecting = [ + (x, r) + for x in col_x + if x is not None + for r in word_rects + if r[1] == top and r[0] < x and r[2] > x + ] + if intersecting == []: + select.append(top) + else: # detected a word crossing a column border + break + + if select == []: # nothing left over: return first row + return header_top_row + + hdr_bbox = +clip # compute the header cells + hdr_bbox.y0 = select[-1] # hdr_bbox top is smallest top coord of words + hdr_cells = [ + (c[0], hdr_bbox.y0, c[2], hdr_bbox.y1) if c is not None else None + for c in cells + ] + + # adjust left/right of header bbox + hdr_bbox.x0 = self.bbox[0] + hdr_bbox.x1 = self.bbox[2] + + # column names: no line breaks, no excess spaces + hdr_names = [ + ( + page.get_textbox(c).replace("\n", " ").replace(" ", " ").strip() + if c is not None + else "" + ) + for c in hdr_cells + ] + return TableHeader(tuple(hdr_bbox), hdr_cells, hdr_names, True) + + +@dataclass +class TableSettings: + vertical_strategy: str = "lines" + horizontal_strategy: str = "lines" + explicit_vertical_lines: list = None + explicit_horizontal_lines: list = None + snap_tolerance: float = DEFAULT_SNAP_TOLERANCE + snap_x_tolerance: float = UNSET + snap_y_tolerance: float = UNSET + join_tolerance: float = DEFAULT_JOIN_TOLERANCE + join_x_tolerance: float = UNSET + join_y_tolerance: float = UNSET + edge_min_length: float = 3 + min_words_vertical: float = DEFAULT_MIN_WORDS_VERTICAL + min_words_horizontal: float = DEFAULT_MIN_WORDS_HORIZONTAL + intersection_tolerance: float = 3 + intersection_x_tolerance: float = UNSET + intersection_y_tolerance: float = UNSET + text_settings: dict = None + + def __post_init__(self) -> "TableSettings": + """Clean up user-provided table settings. + + Validates that the table settings provided consists of acceptable values and + returns a cleaned up version. The cleaned up version fills out the missing + values with the default values in the provided settings. + + TODO: Can be further used to validate that the values are of the correct + type. For example, raising a value error when a non-boolean input is + provided for the key ``keep_blank_chars``. + + :param table_settings: User-provided table settings. + :returns: A cleaned up version of the user-provided table settings. + :raises ValueError: When an unrecognised key is provided. 
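+
+        For illustration (hypothetical values):
+
+            settings = TableSettings(snap_tolerance=5)
+            # after __post_init__:
+            # settings.snap_x_tolerance == settings.snap_y_tolerance == 5
+            # settings.text_settings == {"x_tolerance": 3, "y_tolerance": 3}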
+ """ + + for setting in NON_NEGATIVE_SETTINGS: + if (getattr(self, setting) or 0) < 0: + raise ValueError(f"Table setting '{setting}' cannot be negative") + + for orientation in ["horizontal", "vertical"]: + strategy = getattr(self, orientation + "_strategy") + if strategy not in TABLE_STRATEGIES: + raise ValueError( + f"{orientation}_strategy must be one of" + f'{{{",".join(TABLE_STRATEGIES)}}}' + ) + + if self.text_settings is None: + self.text_settings = {} + + # This next section is for backwards compatibility + for attr in ["x_tolerance", "y_tolerance"]: + if attr not in self.text_settings: + self.text_settings[attr] = self.text_settings.get("tolerance", 3) + + if "tolerance" in self.text_settings: + del self.text_settings["tolerance"] + # End of that section + + for attr, fallback in [ + ("snap_x_tolerance", "snap_tolerance"), + ("snap_y_tolerance", "snap_tolerance"), + ("join_x_tolerance", "join_tolerance"), + ("join_y_tolerance", "join_tolerance"), + ("intersection_x_tolerance", "intersection_tolerance"), + ("intersection_y_tolerance", "intersection_tolerance"), + ]: + if getattr(self, attr) is UNSET: + setattr(self, attr, getattr(self, fallback)) + + return self + + @classmethod + def resolve(cls, settings=None): + if settings is None: + return cls() + elif isinstance(settings, cls): + return settings + elif isinstance(settings, dict): + core_settings = {} + text_settings = {} + for k, v in settings.items(): + if k[:5] == "text_": + text_settings[k[5:]] = v + else: + core_settings[k] = v + core_settings["text_settings"] = text_settings + return cls(**core_settings) + else: + raise ValueError(f"Cannot resolve settings: {settings}") + + +class TableFinder: + """ + Given a PDF page, find plausible table structures. + + Largely borrowed from Anssi Nurminen's master's thesis: + http://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3 + + ... and inspired by Tabula: + https://github.com/tabulapdf/tabula-extractor/issues/16 + """ + + def __init__(self, page, settings=None): + self.page = weakref.proxy(page) + self.settings = TableSettings.resolve(settings) + self.edges = self.get_edges() + self.intersections = edges_to_intersections( + self.edges, + self.settings.intersection_x_tolerance, + self.settings.intersection_y_tolerance, + ) + self.cells = intersections_to_cells(self.intersections) + self.tables = [ + Table(self.page, cell_group) + for cell_group in cells_to_tables(self.page, self.cells) + ] + + def get_edges(self) -> list: + settings = self.settings + + for orientation in ["vertical", "horizontal"]: + strategy = getattr(settings, orientation + "_strategy") + if strategy == "explicit": + lines = getattr(settings, "explicit_" + orientation + "_lines") + if len(lines) < 2: + raise ValueError( + f"If {orientation}_strategy == 'explicit', " + f"explicit_{orientation}_lines " + f"must be specified as a list/tuple of two or more " + f"floats/ints." 
+ ) + + v_strat = settings.vertical_strategy + h_strat = settings.horizontal_strategy + + if v_strat == "text" or h_strat == "text": + words = extract_words(CHARS, **(settings.text_settings or {})) + else: + words = [] + + v_explicit = [] + for desc in settings.explicit_vertical_lines or []: + if isinstance(desc, dict): + for e in obj_to_edges(desc): + if e["orientation"] == "v": + v_explicit.append(e) + else: + v_explicit.append( + { + "x0": desc, + "x1": desc, + "top": self.page.rect[1], + "bottom": self.page.rect[3], + "height": self.page.rect[3] - self.page.rect[1], + "orientation": "v", + } + ) + + if v_strat == "lines": + v_base = filter_edges(EDGES, "v") + elif v_strat == "lines_strict": + v_base = filter_edges(EDGES, "v", edge_type="line") + elif v_strat == "text": + v_base = words_to_edges_v(words, word_threshold=settings.min_words_vertical) + elif v_strat == "explicit": + v_base = [] + else: + v_base = [] + + v = v_base + v_explicit + + h_explicit = [] + for desc in settings.explicit_horizontal_lines or []: + if isinstance(desc, dict): + for e in obj_to_edges(desc): + if e["orientation"] == "h": + h_explicit.append(e) + else: + h_explicit.append( + { + "x0": self.page.rect[0], + "x1": self.page.rect[2], + "width": self.page.rect[2] - self.page.rect[0], + "top": desc, + "bottom": desc, + "orientation": "h", + } + ) + + if h_strat == "lines": + h_base = filter_edges(EDGES, "h") + elif h_strat == "lines_strict": + h_base = filter_edges(EDGES, "h", edge_type="line") + elif h_strat == "text": + h_base = words_to_edges_h( + words, word_threshold=settings.min_words_horizontal + ) + elif h_strat == "explicit": + h_base = [] + else: + h_base = [] + + h = h_base + h_explicit + + edges = list(v) + list(h) + + edges = merge_edges( + edges, + snap_x_tolerance=settings.snap_x_tolerance, + snap_y_tolerance=settings.snap_y_tolerance, + join_x_tolerance=settings.join_x_tolerance, + join_y_tolerance=settings.join_y_tolerance, + ) + + return filter_edges(edges, min_length=settings.edge_min_length) + + def __getitem__(self, i): + tcount = len(self.tables) + if i >= tcount: + raise IndexError("table not on page") + while i < 0: + i += tcount + return self.tables[i] + + +""" +Start of PyMuPDF interface code. +The following functions are executed when "page.find_tables()" is called. + +* make_chars: Fills the CHARS list with text character information extracted + via "rawdict" text extraction. Items in CHARS are formatted + as expected by the table code. +* make_edges: Fills the EDGES list with vector graphic information extracted + via "get_drawings". Items in EDGES are formatted as expected + by the table code. + +The lists CHARS and EDGES are used to replace respective document access +of pdfplumber or, respectively pdfminer. +The table code has been modified to use these lists instead of accessing +page information themselves. 
+""" + + +# ----------------------------------------------------------------------------- +# Extract all page characters to fill the CHARS list +# ----------------------------------------------------------------------------- +def make_chars(page, clip=None): + """Extract text as "rawdict" to fill CHARS.""" + global TEXTPAGE + page_number = page.number + 1 + page_height = page.rect.height + ctm = page.transformation_matrix + TEXTPAGE = page.get_textpage(clip=clip, flags=FLAGS) + blocks = page.get_text("rawdict", textpage=TEXTPAGE)["blocks"] + doctop_base = page_height * page.number + for block in blocks: + for line in block["lines"]: + ldir = line["dir"] # = (cosine, sine) of angle + ldir = (round(ldir[0], 4), round(ldir[1], 4)) + matrix = Matrix(ldir[0], -ldir[1], ldir[1], ldir[0], 0, 0) + if ldir[1] == 0: + upright = True + else: + upright = False + for span in sorted(line["spans"], key=lambda s: s["bbox"][0]): + fontname = span["font"] + fontsize = span["size"] + color = sRGB_to_pdf(span["color"]) + for char in sorted(span["chars"], key=lambda c: c["bbox"][0]): + bbox = Rect(char["bbox"]) + bbox_ctm = bbox * ctm + origin = Point(char["origin"]) * ctm + matrix.e = origin.x + matrix.f = origin.y + text = char["c"] + char_dict = { + "adv": bbox.x1 - bbox.x0 if upright else bbox.y1 - bbox.y0, + "bottom": bbox.y1, + "doctop": bbox.y0 + doctop_base, + "fontname": fontname, + "height": bbox.y1 - bbox.y0, + "matrix": tuple(matrix), + "ncs": "DeviceRGB", + "non_stroking_color": color, + "non_stroking_pattern": None, + "object_type": "char", + "page_number": page_number, + "size": fontsize if upright else bbox.y1 - bbox.y0, + "stroking_color": color, + "stroking_pattern": None, + "text": text, + "top": bbox.y0, + "upright": upright, + "width": bbox.x1 - bbox.x0, + "x0": bbox.x0, + "x1": bbox.x1, + "y0": bbox_ctm.y0, + "y1": bbox_ctm.y1, + } + CHARS.append(char_dict) + + +# ------------------------------------------------------------------------ +# Extract all page vector graphics to fill the EDGES list. +# We are ignoring Bézier curves completely and are converting everything +# else to lines. +# ------------------------------------------------------------------------ +def make_edges(page, clip=None, tset=None, paths=None, add_lines=None, add_boxes=None): + snap_x = tset.snap_x_tolerance + snap_y = tset.snap_y_tolerance + min_length = tset.edge_min_length + lines_strict = ( + tset.vertical_strategy == "lines_strict" + or tset.horizontal_strategy == "lines_strict" + ) + page_height = page.rect.height + doctop_basis = page.number * page_height + page_number = page.number + 1 + prect = page.rect + if page.rotation in (90, 270): + w, h = prect.br + prect = Rect(0, 0, h, w) + if clip is not None: + clip = Rect(clip) + else: + clip = prect + + def are_neighbors(r1, r2): + """Detect whether r1, r2 are neighbors. + + Defined as: + The minimum distance between points of r1 and points of r2 is not + larger than some delta. + + This check supports empty rect-likes and thus also lines. + + Note: + This type of check is MUCH faster than native Rect containment checks. + """ + if ( # check if x-coordinates of r1 are within those of r2 + r2.x0 - snap_x <= r1.x0 <= r2.x1 + snap_x + or r2.x0 - snap_x <= r1.x1 <= r2.x1 + snap_x + ) and ( # ... same for y-coordinates + r2.y0 - snap_y <= r1.y0 <= r2.y1 + snap_y + or r2.y0 - snap_y <= r1.y1 <= r2.y1 + snap_y + ): + return True + + # same check with r1 / r2 exchanging their roles (this is necessary!) 
+ if ( + r1.x0 - snap_x <= r2.x0 <= r1.x1 + snap_x + or r1.x0 - snap_x <= r2.x1 <= r1.x1 + snap_x + ) and ( + r1.y0 - snap_y <= r2.y0 <= r1.y1 + snap_y + or r1.y0 - snap_y <= r2.y1 <= r1.y1 + snap_y + ): + return True + return False + + def clean_graphics(npaths=None): + """Detect and join rectangles of "connected" vector graphics.""" + if npaths is None: + allpaths = page.get_drawings() + else: # accept passed-in vector graphics + allpaths = npaths[:] # paths relevant for table detection + paths = [] + for p in allpaths: + # If only looking at lines, we ignore fill-only paths, + # except simulated lines (i.e. small width or height). + if ( + lines_strict + and p["type"] == "f" + and p["rect"].width > snap_x + and p["rect"].height > snap_y + ): + continue + paths.append(p) + + # start with all vector graphics rectangles + prects = sorted(set([p["rect"] for p in paths]), key=lambda r: (r.y1, r.x0)) + new_rects = [] # the final list of joined rectangles + # ---------------------------------------------------------------- + # Strategy: Join rectangles that "almost touch" each other. + # Extend first rectangle with any other that is a "neighbor". + # Then move it to the final list and continue with the rest. + # ---------------------------------------------------------------- + while prects: # the algorithm will empty this list + prect0 = prects[0] # copy of first rectangle (performance reasons!) + repeat = True + while repeat: # this loop extends first rect in list + repeat = False # set to true again if some other rect touches + for i in range(len(prects) - 1, 0, -1): # run backwards + if are_neighbors(prect0, prects[i]): # close enough to rect 0? + prect0 |= prects[i].tl # extend rect 0 + prect0 |= prects[i].br # extend rect 0 + del prects[i] # delete this rect + repeat = True # keep checking the rest + + # move rect 0 over to result list if there is some text in it + if not white_spaces.issuperset(page.get_textbox(prect0, textpage=TEXTPAGE)): + # contains text, so accept it as a table bbox candidate + new_rects.append(prect0) + del prects[0] # remove from rect list + + return new_rects, paths + + bboxes, paths = clean_graphics(npaths=paths) + + def is_parallel(p1, p2): + """Check if line is roughly axis-parallel.""" + if abs(p1.x - p2.x) <= snap_x or abs(p1.y - p2.y) <= snap_y: + return True + return False + + def make_line(p, p1, p2, clip): + """Given 2 points, make a line dictionary for table detection.""" + if not is_parallel(p1, p2): # only accepting axis-parallel lines + return {} + # compute the extremal values + x0 = min(p1.x, p2.x) + x1 = max(p1.x, p2.x) + y0 = min(p1.y, p2.y) + y1 = max(p1.y, p2.y) + + # check for outside clip + if x0 > clip.x1 or x1 < clip.x0 or y0 > clip.y1 or y1 < clip.y0: + return {} + + if x0 < clip.x0: + x0 = clip.x0 # adjust to clip boundary + + if x1 > clip.x1: + x1 = clip.x1 # adjust to clip boundary + + if y0 < clip.y0: + y0 = clip.y0 # adjust to clip boundary + + if y1 > clip.y1: + y1 = clip.y1 # adjust to clip boundary + + width = x1 - x0 # from adjusted values + height = y1 - y0 # from adjusted values + if width == height == 0: + return {} # nothing left to deal with + line_dict = { + "x0": x0, + "y0": page_height - y0, + "x1": x1, + "y1": page_height - y1, + "width": width, + "height": height, + "pts": [(x0, y0), (x1, y1)], + "linewidth": p["width"], + "stroke": True, + "fill": False, + "evenodd": False, + "stroking_color": p["color"] if p["color"] else p["fill"], + "non_stroking_color": None, + "object_type": "line", + "page_number": page_number, + 
"stroking_pattern": None, + "non_stroking_pattern": None, + "top": y0, + "bottom": y1, + "doctop": y0 + doctop_basis, + } + return line_dict + + for p in paths: + items = p["items"] # items in this path + + # if 'closePath', add a line from last to first point + if p["closePath"] and items[0][0] == "l" and items[-1][0] == "l": + items.append(("l", items[-1][2], items[0][1])) + + for i in items: + if i[0] not in ("l", "re", "qu"): + continue # ignore anything else + + if i[0] == "l": # a line + p1, p2 = i[1:] + line_dict = make_line(p, p1, p2, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + elif i[0] == "re": + # A rectangle: decompose into 4 lines, but filter out + # the ones that simulate a line + rect = i[1].normalize() # normalize the rectangle + + if ( + rect.width <= min_length and rect.width < rect.height + ): # simulates a vertical line + x = abs(rect.x1 + rect.x0) / 2 # take middle value for x + p1 = Point(x, rect.y0) + p2 = Point(x, rect.y1) + line_dict = make_line(p, p1, p2, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + continue + + if ( + rect.height <= min_length and rect.height < rect.width + ): # simulates a horizontal line + y = abs(rect.y1 + rect.y0) / 2 # take middle value for y + p1 = Point(rect.x0, y) + p2 = Point(rect.x1, y) + line_dict = make_line(p, p1, p2, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + continue + + line_dict = make_line(p, rect.tl, rect.bl, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + line_dict = make_line(p, rect.bl, rect.br, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + line_dict = make_line(p, rect.br, rect.tr, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + line_dict = make_line(p, rect.tr, rect.tl, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + else: # must be a quad + # we convert it into (up to) 4 lines + ul, ur, ll, lr = i[1] + + line_dict = make_line(p, ul, ll, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + line_dict = make_line(p, ll, lr, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + line_dict = make_line(p, lr, ur, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + line_dict = make_line(p, ur, ul, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + path = {"color": (0, 0, 0), "fill": None, "width": 1} + for bbox in bboxes: # add the border lines for all enveloping bboxes + line_dict = make_line(path, bbox.tl, bbox.tr, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + line_dict = make_line(path, bbox.bl, bbox.br, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + line_dict = make_line(path, bbox.tl, bbox.bl, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + line_dict = make_line(path, bbox.tr, bbox.br, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + if add_lines is not None: # add user-specified lines + assert isinstance(add_lines, (tuple, list)) + else: + add_lines = [] + for p1, p2 in add_lines: + p1 = Point(p1) + p2 = Point(p2) + line_dict = make_line(path, p1, p2, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + if add_boxes is not None: # add user-specified rectangles + assert isinstance(add_boxes, (tuple, list)) + else: + add_boxes = [] + for box in add_boxes: + r = Rect(box) + line_dict = make_line(path, r.tl, r.bl, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + line_dict = make_line(path, r.bl, r.br, clip) + if line_dict: + 
EDGES.append(line_to_edge(line_dict)) + line_dict = make_line(path, r.br, r.tr, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + line_dict = make_line(path, r.tr, r.tl, clip) + if line_dict: + EDGES.append(line_to_edge(line_dict)) + + +def page_rotation_set0(page): + """Nullify page rotation. + + To correctly detect tables, page rotation must be zero. + This function performs the necessary adjustments and returns information + for reverting this changes. + """ + mediabox = page.mediabox + rot = page.rotation # contains normalized rotation value + # need to derotate the page's content + mb = page.mediabox # current mediabox + + if rot == 90: + # before derotation, shift content horizontally + mat0 = Matrix(1, 0, 0, 1, mb.y1 - mb.x1 - mb.x0 - mb.y0, 0) + elif rot == 270: + # before derotation, shift content vertically + mat0 = Matrix(1, 0, 0, 1, 0, mb.x1 - mb.y1 - mb.y0 - mb.x0) + else: + mat0 = Matrix(1, 0, 0, 1, -2 * mb.x0, -2 * mb.y0) + + # prefix with derotation matrix + mat = mat0 * page.derotation_matrix + cmd = b"%g %g %g %g %g %g cm " % tuple(mat) + xref = TOOLS._insert_contents(page, cmd, 0) + + # swap x- and y-coordinates + if rot in (90, 270): + x0, y0, x1, y1 = mb + mb.x0 = y0 + mb.y0 = x0 + mb.x1 = y1 + mb.y1 = x1 + page.set_mediabox(mb) + + page.set_rotation(0) + + # refresh the page to apply these changes + doc = page.parent + pno = page.number + page = doc[pno] + return page, xref, rot, mediabox + + +def page_rotation_reset(page, xref, rot, mediabox): + """Reset page rotation to original values. + + To be used before we return tables.""" + doc = page.parent # document of the page + doc.update_stream(xref, b" ") # remove de-rotation matrix + page.set_mediabox(mediabox) # set mediabox to old value + page.set_rotation(rot) # set rotation to old value + pno = page.number + page = doc[pno] # update page info + return page + + +def find_tables( + page, + clip=None, + vertical_strategy: str = "lines", + horizontal_strategy: str = "lines", + vertical_lines: list = None, + horizontal_lines: list = None, + snap_tolerance: float = DEFAULT_SNAP_TOLERANCE, + snap_x_tolerance: float = None, + snap_y_tolerance: float = None, + join_tolerance: float = DEFAULT_JOIN_TOLERANCE, + join_x_tolerance: float = None, + join_y_tolerance: float = None, + edge_min_length: float = 3, + min_words_vertical: float = DEFAULT_MIN_WORDS_VERTICAL, + min_words_horizontal: float = DEFAULT_MIN_WORDS_HORIZONTAL, + intersection_tolerance: float = 3, + intersection_x_tolerance: float = None, + intersection_y_tolerance: float = None, + text_tolerance=3, + text_x_tolerance=3, + text_y_tolerance=3, + strategy=None, # offer abbreviation + add_lines=None, # user-specified lines + add_boxes=None, # user-specified rectangles + paths=None, # accept vector graphics as parameter +): + global CHARS, EDGES + CHARS = [] + EDGES = [] + old_small = bool(TOOLS.set_small_glyph_heights()) # save old value + TOOLS.set_small_glyph_heights(True) # we need minimum bboxes + if page.rotation != 0: + page, old_xref, old_rot, old_mediabox = page_rotation_set0(page) + else: + old_xref, old_rot, old_mediabox = None, None, None + + if snap_x_tolerance is None: + snap_x_tolerance = UNSET + if snap_y_tolerance is None: + snap_y_tolerance = UNSET + if join_x_tolerance is None: + join_x_tolerance = UNSET + if join_y_tolerance is None: + join_y_tolerance = UNSET + if intersection_x_tolerance is None: + intersection_x_tolerance = UNSET + if intersection_y_tolerance is None: + intersection_y_tolerance = UNSET + if strategy is not None: + 
vertical_strategy = strategy + horizontal_strategy = strategy + + settings = { + "vertical_strategy": vertical_strategy, + "horizontal_strategy": horizontal_strategy, + "explicit_vertical_lines": vertical_lines, + "explicit_horizontal_lines": horizontal_lines, + "snap_tolerance": snap_tolerance, + "snap_x_tolerance": snap_x_tolerance, + "snap_y_tolerance": snap_y_tolerance, + "join_tolerance": join_tolerance, + "join_x_tolerance": join_x_tolerance, + "join_y_tolerance": join_y_tolerance, + "edge_min_length": edge_min_length, + "min_words_vertical": min_words_vertical, + "min_words_horizontal": min_words_horizontal, + "intersection_tolerance": intersection_tolerance, + "intersection_x_tolerance": intersection_x_tolerance, + "intersection_y_tolerance": intersection_y_tolerance, + "text_tolerance": text_tolerance, + "text_x_tolerance": text_x_tolerance, + "text_y_tolerance": text_y_tolerance, + } + tset = TableSettings.resolve(settings=settings) + page.table_settings = tset + + make_chars(page, clip=clip) # create character list of page + make_edges( + page, + clip=clip, + tset=tset, + paths=paths, + add_lines=add_lines, + add_boxes=add_boxes, + ) # create lines and curves + tables = TableFinder(page, settings=tset) + + TOOLS.set_small_glyph_heights(old_small) + if old_xref is not None: + page = page_rotation_reset(page, old_xref, old_rot, old_mediabox) + return tables diff -r 000000000000 -r 1d09e1dec1d9 src/utils.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/utils.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,5679 @@ +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. +# ------------------------------------------------------------------------ +import io +import math +import os +import typing +import weakref + +try: + from . import pymupdf +except Exception: + import pymupdf +try: + from . import mupdf +except Exception: + import mupdf + +_format_g = pymupdf.format_g + +g_exceptions_verbose = pymupdf.g_exceptions_verbose + +point_like = "point_like" +rect_like = "rect_like" +matrix_like = "matrix_like" +quad_like = "quad_like" + +# ByteString is gone from typing in 3.14. +# collections.abc.Buffer available from 3.12 only +try: + ByteString = typing.ByteString +except AttributeError: + # pylint: disable=unsupported-binary-operation + ByteString = bytes | bytearray | memoryview + +AnyType = typing.Any +OptInt = typing.Union[int, None] +OptFloat = typing.Optional[float] +OptStr = typing.Optional[str] +OptDict = typing.Optional[dict] +OptBytes = typing.Optional[ByteString] +OptSeq = typing.Optional[typing.Sequence] + +""" +This is a collection of functions to extend PyMupdf. +""" + + +def write_text( + page: pymupdf.Page, + rect=None, + writers=None, + overlay=True, + color=None, + opacity=None, + keep_proportion=True, + rotate=0, + oc=0, + ) -> None: + """Write the text of one or more pymupdf.TextWriter objects. + + Args: + rect: target rectangle. If None, the union of the text writers is used. + writers: one or more pymupdf.TextWriter objects. + overlay: put in foreground or background. + keep_proportion: maintain aspect ratio of rectangle sides. + rotate: arbitrary rotation angle. 
+ oc: the xref of an optional content object + """ + assert isinstance(page, pymupdf.Page) + if not writers: + raise ValueError("need at least one pymupdf.TextWriter") + if type(writers) is pymupdf.TextWriter: + if rotate == 0 and rect is None: + writers.write_text(page, opacity=opacity, color=color, overlay=overlay) + return None + else: + writers = (writers,) + clip = writers[0].text_rect + textdoc = pymupdf.Document() + tpage = textdoc.new_page(width=page.rect.width, height=page.rect.height) + for writer in writers: + clip |= writer.text_rect + writer.write_text(tpage, opacity=opacity, color=color) + if rect is None: + rect = clip + page.show_pdf_page( + rect, + textdoc, + 0, + overlay=overlay, + keep_proportion=keep_proportion, + rotate=rotate, + clip=clip, + oc=oc, + ) + textdoc = None + tpage = None + + +def show_pdf_page( + page, + rect, + docsrc, + pno=0, + keep_proportion=True, + overlay=True, + oc=0, + rotate=0, + clip=None, + ) -> int: + """Show page number 'pno' of PDF 'docsrc' in rectangle 'rect'. + + Args: + rect: (rect-like) where to place the source image + docsrc: (document) source PDF + pno: (int) source page number + keep_proportion: (bool) do not change width-height-ratio + overlay: (bool) put in foreground + oc: (xref) make visibility dependent on this OCG / OCMD (which must be defined in the target PDF) + rotate: (int) degrees (multiple of 90) + clip: (rect-like) part of source page rectangle + Returns: + xref of inserted object (for reuse) + """ + def calc_matrix(sr, tr, keep=True, rotate=0): + """Calculate transformation matrix from source to target rect. + + Notes: + The product of four matrices in this sequence: (1) translate correct + source corner to origin, (2) rotate, (3) scale, (4) translate to + target's top-left corner. + Args: + sr: source rect in PDF (!) coordinate system + tr: target rect in PDF coordinate system + keep: whether to keep source ratio of width to height + rotate: rotation angle in degrees + Returns: + Transformation matrix. + """ + # calc center point of source rect + smp = (sr.tl + sr.br) / 2.0 + # calc center point of target rect + tmp = (tr.tl + tr.br) / 2.0 + + # m moves to (0, 0), then rotates + m = pymupdf.Matrix(1, 0, 0, 1, -smp.x, -smp.y) * pymupdf.Matrix(rotate) + + sr1 = sr * m # resulting source rect to calculate scale factors + + fw = tr.width / sr1.width # scale the width + fh = tr.height / sr1.height # scale the height + if keep: + fw = fh = min(fw, fh) # take min if keeping aspect ratio + + m *= pymupdf.Matrix(fw, fh) # concat scale matrix + m *= pymupdf.Matrix(1, 0, 0, 1, tmp.x, tmp.y) # concat move to target center + return pymupdf.JM_TUPLE(m) + + pymupdf.CheckParent(page) + doc = page.parent + + if not doc.is_pdf or not docsrc.is_pdf: + raise ValueError("is no PDF") + + if rect.is_empty or rect.is_infinite: + raise ValueError("rect must be finite and not empty") + + while pno < 0: # support negative page numbers + pno += docsrc.page_count + src_page = docsrc[pno] # load source page + + tar_rect = rect * ~page.transformation_matrix # target rect in PDF coordinates + + src_rect = src_page.rect if not clip else src_page.rect & clip # source rect + if src_rect.is_empty or src_rect.is_infinite: + raise ValueError("clip must be finite and not empty") + src_rect = src_rect * ~src_page.transformation_matrix # ... 
in PDF coord + + matrix = calc_matrix(src_rect, tar_rect, keep=keep_proportion, rotate=rotate) + + # list of existing /Form /XObjects + ilst = [i[1] for i in doc.get_page_xobjects(page.number)] + ilst += [i[7] for i in doc.get_page_images(page.number)] + ilst += [i[4] for i in doc.get_page_fonts(page.number)] + + # create a name not in that list + n = "fzFrm" + i = 0 + _imgname = n + "0" + while _imgname in ilst: + i += 1 + _imgname = n + str(i) + + isrc = docsrc._graft_id # used as key for graftmaps + if doc._graft_id == isrc: + raise ValueError("source document must not equal target") + + # retrieve / make pymupdf.Graftmap for source PDF + gmap = doc.Graftmaps.get(isrc, None) + if gmap is None: + gmap = pymupdf.Graftmap(doc) + doc.Graftmaps[isrc] = gmap + + # take note of generated xref for automatic reuse + pno_id = (isrc, pno) # id of docsrc[pno] + xref = doc.ShownPages.get(pno_id, 0) + + if overlay: + page.wrap_contents() # ensure a balanced graphics state + xref = page._show_pdf_page( + src_page, + overlay=overlay, + matrix=matrix, + xref=xref, + oc=oc, + clip=src_rect, + graftmap=gmap, + _imgname=_imgname, + ) + doc.ShownPages[pno_id] = xref + + return xref + + +def replace_image(page: pymupdf.Page, xref: int, *, filename=None, pixmap=None, stream=None): + """Replace the image referred to by xref. + + Replace the image by changing the object definition stored under xref. This + will leave the pages appearance instructions intact, so the new image is + being displayed with the same bbox, rotation etc. + By providing a small fully transparent image, an effect as if the image had + been deleted can be achieved. + A typical use may include replacing large images by a smaller version, + e.g. with a lower resolution or graylevel instead of colored. + + Args: + xref: the xref of the image to replace. + filename, pixmap, stream: exactly one of these must be provided. The + meaning being the same as in Page.insert_image. + """ + doc = page.parent # the owning document + if not doc.xref_is_image(xref): + raise ValueError("xref not an image") # insert new image anywhere in page + if bool(filename) + bool(stream) + bool(pixmap) != 1: + raise ValueError("Exactly one of filename/stream/pixmap must be given") + new_xref = page.insert_image( + page.rect, filename=filename, stream=stream, pixmap=pixmap + ) + doc.xref_copy(new_xref, xref) # copy over new to old + last_contents_xref = page.get_contents()[-1] + # new image insertion has created a new /Contents source, + # which we will set to spaces now + doc.update_stream(last_contents_xref, b" ") + page._image_info = None # clear cache of extracted image information + + +def delete_image(page: pymupdf.Page, xref: int): + """Delete the image referred to by xef. + + Actually replaces by a small transparent Pixmap using method Page.replace_image. + + Args: + xref: xref of the image to delete. + """ + # make a small 100% transparent pixmap (of just any dimension) + pix = pymupdf.Pixmap(pymupdf.csGRAY, (0, 0, 1, 1), 1) + pix.clear_with() # clear all samples bytes to 0x00 + page.replace_image(xref, pixmap=pix) + + +def insert_image( + page, + rect, + *, + alpha=-1, + filename=None, + height=0, + keep_proportion=True, + mask=None, + oc=0, + overlay=True, + pixmap=None, + rotate=0, + stream=None, + width=0, + xref=0, + ): + """Insert an image for display in a rectangle. + + Args: + rect: (rect_like) position of image on the page. + alpha: (int, optional) set to 0 if image has no transparency. + filename: (str, Path, file object) image filename. 
+ height: (int) + keep_proportion: (bool) keep width / height ratio (default). + mask: (bytes, optional) image consisting of alpha values to use. + oc: (int) xref of OCG or OCMD to declare as Optional Content. + overlay: (bool) put in foreground (default) or background. + pixmap: (pymupdf.Pixmap) use this as image. + rotate: (int) rotate by 0, 90, 180 or 270 degrees. + stream: (bytes) use this as image. + width: (int) + xref: (int) use this as image. + + 'page' and 'rect' are positional, all other parameters are keywords. + + If 'xref' is given, that image is used. Other input options are ignored. + Else, exactly one of pixmap, stream or filename must be given. + + 'alpha=0' for non-transparent images improves performance significantly. + Affects stream and filename only. + + Optimum transparent insertions are possible by using filename / stream in + conjunction with a 'mask' image of alpha values. + + Returns: + xref (int) of inserted image. Re-use as argument for multiple insertions. + """ + pymupdf.CheckParent(page) + doc = page.parent + if not doc.is_pdf: + raise ValueError("is no PDF") + + if xref == 0 and (bool(filename) + bool(stream) + bool(pixmap) != 1): + raise ValueError("xref=0 needs exactly one of filename, pixmap, stream") + + if filename: + if type(filename) is str: + pass + elif hasattr(filename, "absolute"): + filename = str(filename) + elif hasattr(filename, "name"): + filename = filename.name + else: + raise ValueError("bad filename") + + if filename and not os.path.exists(filename): + raise FileNotFoundError("No such file: '%s'" % filename) + elif stream and type(stream) not in (bytes, bytearray, io.BytesIO): + raise ValueError("stream must be bytes-like / BytesIO") + elif pixmap and type(pixmap) is not pymupdf.Pixmap: + raise ValueError("pixmap must be a pymupdf.Pixmap") + if mask and not (stream or filename): + raise ValueError("mask requires stream or filename") + if mask and type(mask) not in (bytes, bytearray, io.BytesIO): + raise ValueError("mask must be bytes-like / BytesIO") + while rotate < 0: + rotate += 360 + while rotate >= 360: + rotate -= 360 + if rotate not in (0, 90, 180, 270): + raise ValueError("bad rotate value") + + r = pymupdf.Rect(rect) + if r.is_empty or r.is_infinite: + raise ValueError("rect must be finite and not empty") + clip = r * ~page.transformation_matrix + + # Create a unique image reference name. + ilst = [i[7] for i in doc.get_page_images(page.number)] + ilst += [i[1] for i in doc.get_page_xobjects(page.number)] + ilst += [i[4] for i in doc.get_page_fonts(page.number)] + n = "fzImg" # 'pymupdf image' + i = 0 + _imgname = n + "0" # first name candidate + while _imgname in ilst: + i += 1 + _imgname = n + str(i) # try new name + + if overlay: + page.wrap_contents() # ensure a balanced graphics state + digests = doc.InsertedImages + xref, digests = page._insert_image( + filename=filename, + pixmap=pixmap, + stream=stream, + imask=mask, + clip=clip, + overlay=overlay, + oc=oc, + xref=xref, + rotate=rotate, + keep_proportion=keep_proportion, + width=width, + height=height, + alpha=alpha, + _imgname=_imgname, + digests=digests, + ) + if digests is not None: + doc.InsertedImages = digests + + return xref + + +def search_for( + page, + text, + *, + clip=None, + quads=False, + flags=pymupdf.TEXT_DEHYPHENATE + | pymupdf.TEXT_PRESERVE_WHITESPACE + | pymupdf.TEXT_PRESERVE_LIGATURES + | pymupdf.TEXT_MEDIABOX_CLIP + , + textpage=None, + ) -> list: + """Search for a string on a page. 
+ + Args: + text: string to be searched for + clip: restrict search to this rectangle + quads: (bool) return quads instead of rectangles + flags: bit switches, default: join hyphened words + textpage: a pre-created pymupdf.TextPage + Returns: + a list of rectangles or quads, each containing one occurrence. + """ + if clip is not None: + clip = pymupdf.Rect(clip) + + pymupdf.CheckParent(page) + tp = textpage + if tp is None: + tp = page.get_textpage(clip=clip, flags=flags) # create pymupdf.TextPage + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + rlist = tp.search(text, quads=quads) + if textpage is None: + del tp + return rlist + + +def search_page_for( + doc: pymupdf.Document, + pno: int, + text: str, + quads: bool = False, + clip: rect_like = None, + flags: int = pymupdf.TEXT_DEHYPHENATE + | pymupdf.TEXT_PRESERVE_LIGATURES + | pymupdf.TEXT_PRESERVE_WHITESPACE + | pymupdf.TEXT_MEDIABOX_CLIP + , + textpage: pymupdf.TextPage = None, +) -> list: + """Search for a string on a page. + + Args: + pno: page number + text: string to be searched for + clip: restrict search to this rectangle + quads: (bool) return quads instead of rectangles + flags: bit switches, default: join hyphened words + textpage: reuse a prepared textpage + Returns: + a list of rectangles or quads, each containing an occurrence. + """ + + return doc[pno].search_for( + text, + quads=quads, + clip=clip, + flags=flags, + textpage=textpage, + ) + + +def get_text_blocks( + page: pymupdf.Page, + clip: rect_like = None, + flags: OptInt = None, + textpage: pymupdf.TextPage = None, + sort: bool = False, +) -> list: + """Return the text blocks on a page. + + Notes: + Lines in a block are concatenated with line breaks. + Args: + flags: (int) control the amount of data parsed into the textpage. + Returns: + A list of the blocks. Each item contains the containing rectangle + coordinates, text lines, running block number and block type. + """ + pymupdf.CheckParent(page) + if flags is None: + flags = pymupdf.TEXTFLAGS_BLOCKS + tp = textpage + if tp is None: + tp = page.get_textpage(clip=clip, flags=flags) + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + + blocks = tp.extractBLOCKS() + if textpage is None: + del tp + if sort: + blocks.sort(key=lambda b: (b[3], b[0])) + return blocks + + +def get_text_words( + page: pymupdf.Page, + clip: rect_like = None, + flags: OptInt = None, + textpage: pymupdf.TextPage = None, + sort: bool = False, + delimiters=None, + tolerance=3, +) -> list: + """Return the text words as a list with the bbox for each word. + + Args: + page: pymupdf.Page + clip: (rect-like) area on page to consider + flags: (int) control the amount of data parsed into the textpage. + textpage: (pymupdf.TextPage) either passed-in or None. + sort: (bool) sort the words in reading sequence. + delimiters: (str,list) characters to use as word delimiters. + tolerance: (float) consider words to be part of the same line if + top or bottom coordinate are not larger than this. Relevant + only if sort=True. + + Returns: + Word tuples (x0, y0, x1, y1, "word", bno, lno, wno). 
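+
+    Example:
+        A minimal sketch, assuming "example.pdf" is any existing PDF and that
+        this helper is installed as the usual Page method:
+
+            import pymupdf
+
+            doc = pymupdf.open("example.pdf")  # hypothetical sample file
+            page = doc[0]
+            for x0, y0, x1, y1, word, bno, lno, wno in page.get_text_words(sort=True):
+                print(word, pymupdf.Rect(x0, y0, x1, y1))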
+ """ + + def sort_words(words): + """Sort words line-wise, forgiving small deviations.""" + words.sort(key=lambda w: (w[3], w[0])) + nwords = [] # final word list + line = [words[0]] # collects words roughly in same line + lrect = pymupdf.Rect(words[0][:4]) # start the line rectangle + for w in words[1:]: + wrect = pymupdf.Rect(w[:4]) + if ( + abs(wrect.y0 - lrect.y0) <= tolerance + or abs(wrect.y1 - lrect.y1) <= tolerance + ): + line.append(w) + lrect |= wrect + else: + line.sort(key=lambda w: w[0]) # sort words in line l-t-r + nwords.extend(line) # append to final words list + line = [w] # start next line + lrect = wrect # start next line rect + + line.sort(key=lambda w: w[0]) # sort words in line l-t-r + nwords.extend(line) # append to final words list + + return nwords + + pymupdf.CheckParent(page) + if flags is None: + flags = pymupdf.TEXTFLAGS_WORDS + tp = textpage + if tp is None: + tp = page.get_textpage(clip=clip, flags=flags) + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + + words = tp.extractWORDS(delimiters) + + # if textpage was given, we subselect the words in clip + if textpage is not None and clip is not None: + # sub-select words contained in clip + clip = pymupdf.Rect(clip) + words = [ + w for w in words if abs(clip & w[:4]) >= 0.5 * abs(pymupdf.Rect(w[:4])) + ] + + if textpage is None: + del tp + if words and sort: + # advanced sort if any words found + words = sort_words(words) + + return words + + +def get_sorted_text( + page: pymupdf.Page, + clip: rect_like = None, + flags: OptInt = None, + textpage: pymupdf.TextPage = None, + tolerance=3, +) -> str: + """Extract plain text avoiding unacceptable line breaks. + + Text contained in clip will be sorted in reading sequence. Some effort + is also spent to simulate layout vertically and horizontally. + + Args: + page: pymupdf.Page + clip: (rect-like) only consider text inside + flags: (int) text extraction flags + textpage: pymupdf.TextPage + tolerance: (float) consider words to be on the same line if their top + or bottom coordinates do not differ more than this. + + Notes: + If a TextPage is provided, all text is checked for being inside clip + with at least 50% of its bbox. + This allows to use some "global" TextPage in conjunction with sub- + selecting words in parts of the defined TextPage rectangle. + + Returns: + A text string in reading sequence. Left indentation of each line, + inter-line and inter-word distances strive to reflect the layout. + """ + + def line_text(clip, line): + """Create the string of one text line. + + We are trying to simulate some horizontal layout here, too. + + Args: + clip: (pymupdf.Rect) the area from which all text is being read. + line: (list) word tuples (rect, text) contained in the line + Returns: + Text in this line. Generated from words in 'line'. Distance from + predecessor is translated to multiple spaces, thus simulating + text indentations and large horizontal distances. + """ + line.sort(key=lambda w: w[0].x0) + ltext = "" # text in the line + x1 = clip.x0 # end coordinate of ltext + lrect = pymupdf.EMPTY_RECT() # bbox of this line + for r, t in line: + lrect |= r # update line bbox + # convert distance to previous word to multiple spaces + dist = max( + int(round((r.x0 - x1) / r.width * len(t))), + 0 if (x1 == clip.x0 or r.x0 <= x1) else 1, + ) # number of space characters + + ltext += " " * dist + t # append word string + x1 = r.x1 # update new end position + return ltext + + # Extract words in correct sequence first. 
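+    # (With sort=True, get_text_words() already groups the words into lines,
+    # using 'tolerance' to decide whether two words share a line; line_text()
+    # then turns horizontal gaps between word rectangles into spaces.)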
+ words = [ + (pymupdf.Rect(w[:4]), w[4]) + for w in get_text_words( + page, + clip=clip, + flags=flags, + textpage=textpage, + sort=True, + tolerance=tolerance, + ) + ] + + if not words: # no text present + return "" + totalbox = pymupdf.EMPTY_RECT() # area covering all text + for wr, text in words: + totalbox |= wr + + lines = [] # list of reconstituted lines + line = [words[0]] # current line + lrect = words[0][0] # the line's rectangle + + # walk through the words + for wr, text in words[1:]: # start with second word + w0r, _ = line[-1] # read previous word in current line + + # if this word matches top or bottom of the line, append it + if abs(lrect.y0 - wr.y0) <= tolerance or abs(lrect.y1 - wr.y1) <= tolerance: + line.append((wr, text)) + lrect |= wr + else: + # output current line and re-initialize + ltext = line_text(totalbox, line) + lines.append((lrect, ltext)) + line = [(wr, text)] + lrect = wr + + # also append unfinished last line + ltext = line_text(totalbox, line) + lines.append((lrect, ltext)) + + # sort all lines vertically + lines.sort(key=lambda l: (l[0].y1)) + + text = lines[0][1] # text of first line + y1 = lines[0][0].y1 # its bottom coordinate + for lrect, ltext in lines[1:]: + distance = min(int(round((lrect.y0 - y1) / lrect.height)), 5) + breaks = "\n" * (distance + 1) + text += breaks + ltext + y1 = lrect.y1 + + # return text in clip + return text + + +def get_textbox( + page: pymupdf.Page, + rect: rect_like, + textpage: pymupdf.TextPage = None, +) -> str: + tp = textpage + if tp is None: + tp = page.get_textpage() + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + rc = tp.extractTextbox(rect) + if textpage is None: + del tp + return rc + + +def get_text_selection( + page: pymupdf.Page, + p1: point_like, + p2: point_like, + clip: rect_like = None, + textpage: pymupdf.TextPage = None, +): + pymupdf.CheckParent(page) + tp = textpage + if tp is None: + tp = page.get_textpage(clip=clip, flags=pymupdf.TEXT_DEHYPHENATE) + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + rc = tp.extractSelection(p1, p2) + if textpage is None: + del tp + return rc + + +def get_textpage_ocr( + page: pymupdf.Page, + flags: int = 0, + language: str = "eng", + dpi: int = 72, + full: bool = False, + tessdata: str = None, +) -> pymupdf.TextPage: + """Create a Textpage from combined results of normal and OCR text parsing. + + Args: + flags: (int) control content becoming part of the result. + language: (str) specify expected language(s). Default is "eng" (English). + dpi: (int) resolution in dpi, default 72. 
+ full: (bool) whether to OCR the full page image, or only its images (default) + """ + pymupdf.CheckParent(page) + tessdata = pymupdf.get_tessdata(tessdata) + + def full_ocr(page, dpi, language, flags): + zoom = dpi / 72 + mat = pymupdf.Matrix(zoom, zoom) + pix = page.get_pixmap(matrix=mat) + ocr_pdf = pymupdf.Document( + "pdf", + pix.pdfocr_tobytes( + compress=False, + language=language, + tessdata=tessdata, + ), + ) + ocr_page = ocr_pdf.load_page(0) + unzoom = page.rect.width / ocr_page.rect.width + ctm = pymupdf.Matrix(unzoom, unzoom) * page.derotation_matrix + tpage = ocr_page.get_textpage(flags=flags, matrix=ctm) + ocr_pdf.close() + pix = None + tpage.parent = weakref.proxy(page) + return tpage + + # if OCR for the full page, OCR its pixmap @ desired dpi + if full: + return full_ocr(page, dpi, language, flags) + + # For partial OCR, make a normal textpage, then extend it with text that + # is OCRed from each image. + # Because of this, we need the images flag bit set ON. + tpage = page.get_textpage(flags=flags) + for block in page.get_text("dict", flags=pymupdf.TEXT_PRESERVE_IMAGES)["blocks"]: + if block["type"] != 1: # only look at images + continue + bbox = pymupdf.Rect(block["bbox"]) + if bbox.width <= 3 or bbox.height <= 3: # ignore tiny stuff + continue + try: + pix = pymupdf.Pixmap(block["image"]) # get image pixmap + if pix.n - pix.alpha != 3: # we need to convert this to RGB! + pix = pymupdf.Pixmap(pymupdf.csRGB, pix) + if pix.alpha: # must remove alpha channel + pix = pymupdf.Pixmap(pix, 0) + imgdoc = pymupdf.Document( + "pdf", + pix.pdfocr_tobytes(language=language, tessdata=tessdata), + ) # pdf with OCRed page + imgpage = imgdoc.load_page(0) # read image as a page + pix = None + # compute matrix to transform coordinates back to that of 'page' + imgrect = imgpage.rect # page size of image PDF + shrink = pymupdf.Matrix(1 / imgrect.width, 1 / imgrect.height) + mat = shrink * block["transform"] + imgpage.extend_textpage(tpage, flags=0, matrix=mat) + imgdoc.close() + except (RuntimeError, mupdf.FzErrorBase): + if 0 and g_exceptions_verbose: + # Don't show exception info here because it can happen in + # normal operation (see test_3842b). + pymupdf.exception_info() + tpage = None + pymupdf.message("Falling back to full page OCR") + return full_ocr(page, dpi, language, flags) + + return tpage + + +def get_image_info(page: pymupdf.Page, hashes: bool = False, xrefs: bool = False) -> list: + """Extract image information only from a pymupdf.TextPage. + + Args: + hashes: (bool) include MD5 hash for each image. + xrefs: (bool) try to find the xref for each image. Sets hashes to true. + """ + doc = page.parent + if xrefs and doc.is_pdf: + hashes = True + if not doc.is_pdf: + xrefs = False + imginfo = getattr(page, "_image_info", None) + if imginfo and not xrefs: + return imginfo + if not imginfo: + tp = page.get_textpage(flags=pymupdf.TEXT_PRESERVE_IMAGES) + imginfo = tp.extractIMGINFO(hashes=hashes) + del tp + if hashes: + page._image_info = imginfo + if not xrefs or not doc.is_pdf: + return imginfo + imglist = page.get_images() + digests = {} + for item in imglist: + xref = item[0] + pix = pymupdf.Pixmap(doc, xref) + digests[pix.digest] = xref + del pix + for i in range(len(imginfo)): + item = imginfo[i] + xref = digests.get(item["digest"], 0) + item["xref"] = xref + imginfo[i] = item + return imginfo + + +def get_image_rects(page: pymupdf.Page, name, transform=False) -> list: + """Return list of image positions on a page. + + Args: + name: (str, list, int) image identification. 
May be a reference name, an
+            item of the page's image list or an xref.
+        transform: (bool) whether to also return the transformation matrix.
+    Returns:
+        A list of pymupdf.Rect objects or tuples of (pymupdf.Rect, pymupdf.Matrix)
+        for all image locations on the page.
+    """
+    if type(name) in (list, tuple):
+        xref = name[0]
+    elif type(name) is int:
+        xref = name
+    else:
+        imglist = [i for i in page.get_images() if i[7] == name]
+        if imglist == []:
+            raise ValueError("bad image name")
+        elif len(imglist) != 1:
+            raise ValueError("multiple image names found")
+        xref = imglist[0][0]
+    pix = pymupdf.Pixmap(page.parent, xref)  # make pixmap of the image to compute MD5
+    digest = pix.digest
+    del pix
+    infos = page.get_image_info(hashes=True)
+    if not transform:
+        bboxes = [pymupdf.Rect(im["bbox"]) for im in infos if im["digest"] == digest]
+    else:
+        bboxes = [
+            (pymupdf.Rect(im["bbox"]), pymupdf.Matrix(im["transform"]))
+            for im in infos
+            if im["digest"] == digest
+        ]
+    return bboxes
+
+
+def get_text(
+    page: pymupdf.Page,
+    option: str = "text",
+    *,
+    clip: rect_like = None,
+    flags: OptInt = None,
+    textpage: pymupdf.TextPage = None,
+    sort: bool = False,
+    delimiters=None,
+    tolerance=3,
+):
+    """Extract text from a page or an annotation.
+
+    This is a unifying wrapper for various methods of the pymupdf.TextPage class.
+
+    Args:
+        option: (str) text, words, blocks, html, dict, json, rawdict, xhtml or xml.
+        clip: (rect-like) restrict output to this area.
+        flags: bit switches to e.g. exclude images or decompose ligatures.
+        textpage: reuse this pymupdf.TextPage and make no new one. If specified,
+            'flags' and 'clip' are ignored.
+
+    Returns:
+        The output of methods get_text_words / get_text_blocks or of the
+        pymupdf.TextPage methods extractText, extractHTML, extractDICT,
+        extractJSON, extractRAWDICT, extractXHTML or extractXML, respectively.
+        The default, and the fallback for a misspelled option, is "text".
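+
+    Example:
+        A minimal sketch, assuming "example.pdf" is any existing PDF
+        (hypothetical file name):
+
+            import pymupdf
+
+            doc = pymupdf.open("example.pdf")
+            page = doc[0]
+            text = page.get_text("text", sort=True)     # plain text in reading order
+            words = page.get_text("words")               # list of word tuples
+            blocks = page.get_text("blocks", sort=True)  # list of block tuples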
+ """ + formats = { + "text": pymupdf.TEXTFLAGS_TEXT, + "html": pymupdf.TEXTFLAGS_HTML, + "json": pymupdf.TEXTFLAGS_DICT, + "rawjson": pymupdf.TEXTFLAGS_RAWDICT, + "xml": pymupdf.TEXTFLAGS_XML, + "xhtml": pymupdf.TEXTFLAGS_XHTML, + "dict": pymupdf.TEXTFLAGS_DICT, + "rawdict": pymupdf.TEXTFLAGS_RAWDICT, + "words": pymupdf.TEXTFLAGS_WORDS, + "blocks": pymupdf.TEXTFLAGS_BLOCKS, + } + option = option.lower() + assert option in formats + if option not in formats: + option = "text" + if flags is None: + flags = formats[option] + + if option == "words": + return get_text_words( + page, + clip=clip, + flags=flags, + textpage=textpage, + sort=sort, + delimiters=delimiters, + ) + if option == "blocks": + return get_text_blocks( + page, clip=clip, flags=flags, textpage=textpage, sort=sort + ) + + if option == "text" and sort: + return get_sorted_text( + page, + clip=clip, + flags=flags, + textpage=textpage, + tolerance=tolerance, + ) + + pymupdf.CheckParent(page) + cb = None + if option in ("html", "xml", "xhtml"): # no clipping for MuPDF functions + clip = page.cropbox + if clip is not None: + clip = pymupdf.Rect(clip) + cb = None + elif type(page) is pymupdf.Page: + cb = page.cropbox + # pymupdf.TextPage with or without images + tp = textpage + #pymupdf.exception_info() + if tp is None: + tp = page.get_textpage(clip=clip, flags=flags) + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + #pymupdf.log( '{option=}') + if option == "json": + t = tp.extractJSON(cb=cb, sort=sort) + elif option == "rawjson": + t = tp.extractRAWJSON(cb=cb, sort=sort) + elif option == "dict": + t = tp.extractDICT(cb=cb, sort=sort) + elif option == "rawdict": + t = tp.extractRAWDICT(cb=cb, sort=sort) + elif option == "html": + t = tp.extractHTML() + elif option == "xml": + t = tp.extractXML() + elif option == "xhtml": + t = tp.extractXHTML() + else: + t = tp.extractText(sort=sort) + + if textpage is None: + del tp + return t + + +def get_page_text( + doc: pymupdf.Document, + pno: int, + option: str = "text", + clip: rect_like = None, + flags: OptInt = None, + textpage: pymupdf.TextPage = None, + sort: bool = False, +) -> typing.Any: + """Extract a document page's text by page number. + + Notes: + Convenience function calling page.get_text(). + Args: + pno: page number + option: (str) text, words, blocks, html, dict, json, rawdict, xhtml or xml. + Returns: + output from page.TextPage(). + """ + return doc[pno].get_text(option, clip=clip, flags=flags, sort=sort) + +def get_pixmap( + page: pymupdf.Page, + *, + matrix: matrix_like=pymupdf.Identity, + dpi=None, + colorspace: pymupdf.Colorspace=pymupdf.csRGB, + clip: rect_like=None, + alpha: bool=False, + annots: bool=True, + ) -> pymupdf.Pixmap: + """Create pixmap of page. + + Keyword args: + matrix: Matrix for transformation (default: Identity). + dpi: desired dots per inch. If given, matrix is ignored. + colorspace: (str/Colorspace) cmyk, rgb, gray - case ignored, default csRGB. + clip: (irect-like) restrict rendering to this area. 
+ alpha: (bool) whether to include alpha channel + annots: (bool) whether to also render annotations + """ + if dpi: + zoom = dpi / 72 + matrix = pymupdf.Matrix(zoom, zoom) + + if type(colorspace) is str: + if colorspace.upper() == "GRAY": + colorspace = pymupdf.csGRAY + elif colorspace.upper() == "CMYK": + colorspace = pymupdf.csCMYK + else: + colorspace = pymupdf.csRGB + if colorspace.n not in (1, 3, 4): + raise ValueError("unsupported colorspace") + + dl = page.get_displaylist(annots=annots) + pix = dl.get_pixmap(matrix=matrix, colorspace=colorspace, alpha=alpha, clip=clip) + dl = None + if dpi: + pix.set_dpi(dpi, dpi) + return pix + + +def get_page_pixmap( + doc: pymupdf.Document, + pno: int, + *, + matrix: matrix_like = pymupdf.Identity, + dpi=None, + colorspace: pymupdf.Colorspace = pymupdf.csRGB, + clip: rect_like = None, + alpha: bool = False, + annots: bool = True, +) -> pymupdf.Pixmap: + """Create pixmap of document page by page number. + + Notes: + Convenience function calling page.get_pixmap. + Args: + pno: (int) page number + matrix: pymupdf.Matrix for transformation (default: pymupdf.Identity). + colorspace: (str,pymupdf.Colorspace) rgb, rgb, gray - case ignored, default csRGB. + clip: (irect-like) restrict rendering to this area. + alpha: (bool) include alpha channel + annots: (bool) also render annotations + """ + return doc[pno].get_pixmap( + matrix=matrix, + dpi=dpi, colorspace=colorspace, + clip=clip, + alpha=alpha, + annots=annots + ) + + +def getLinkDict(ln, document=None) -> dict: + if isinstance(ln, pymupdf.Outline): + dest = ln.destination(document) + elif isinstance(ln, pymupdf.Link): + dest = ln.dest + else: + assert 0, f'Unexpected {type(ln)=}.' + nl = {"kind": dest.kind, "xref": 0} + try: + if hasattr(ln, 'rect'): + nl["from"] = ln.rect + except Exception: + # This seems to happen quite often in PyMuPDF/tests. + if g_exceptions_verbose >= 2: pymupdf.exception_info() + pass + pnt = pymupdf.Point(0, 0) + if dest.flags & pymupdf.LINK_FLAG_L_VALID: + pnt.x = dest.lt.x + if dest.flags & pymupdf.LINK_FLAG_T_VALID: + pnt.y = dest.lt.y + + if dest.kind == pymupdf.LINK_URI: + nl["uri"] = dest.uri + + elif dest.kind == pymupdf.LINK_GOTO: + nl["page"] = dest.page + nl["to"] = pnt + if dest.flags & pymupdf.LINK_FLAG_R_IS_ZOOM: + nl["zoom"] = dest.rb.x + else: + nl["zoom"] = 0.0 + + elif dest.kind == pymupdf.LINK_GOTOR: + nl["file"] = dest.file_spec.replace("\\", "/") + nl["page"] = dest.page + if dest.page < 0: + nl["to"] = dest.dest + else: + nl["to"] = pnt + if dest.flags & pymupdf.LINK_FLAG_R_IS_ZOOM: + nl["zoom"] = dest.rb.x + else: + nl["zoom"] = 0.0 + + elif dest.kind == pymupdf.LINK_LAUNCH: + nl["file"] = dest.file_spec.replace("\\", "/") + + elif dest.kind == pymupdf.LINK_NAMED: + # The dicts should not have same key(s). + assert not (dest.named.keys() & nl.keys()) + nl.update(dest.named) + if 'to' in nl: + nl['to'] = pymupdf.Point(nl['to']) + + else: + nl["page"] = dest.page + return nl + + +def get_links(page: pymupdf.Page) -> list: + """Create a list of all links contained in a PDF page. + + Notes: + see PyMuPDF ducmentation for details. 
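+
+    Example:
+        A minimal sketch, assuming "example.pdf" is any existing PDF that
+        contains links (hypothetical file name):
+
+            import pymupdf
+
+            doc = pymupdf.open("example.pdf")
+            for lnk in doc[0].get_links():
+                if lnk["kind"] == pymupdf.LINK_URI:
+                    print(lnk["uri"], lnk["from"])  # target URL and hot area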
+ """ + + pymupdf.CheckParent(page) + ln = page.first_link + links = [] + while ln: + nl = getLinkDict(ln, page.parent) + links.append(nl) + ln = ln.next + if links != [] and page.parent.is_pdf: + linkxrefs = [x for x in + #page.annot_xrefs() + pymupdf.JM_get_annot_xref_list2(page) + if x[1] == pymupdf.PDF_ANNOT_LINK # pylint: disable=no-member + ] + if len(linkxrefs) == len(links): + for i in range(len(linkxrefs)): + links[i]["xref"] = linkxrefs[i][0] + links[i]["id"] = linkxrefs[i][2] + return links + + +def get_toc( + doc: pymupdf.Document, + simple: bool = True, +) -> list: + """Create a table of contents. + + Args: + simple: a bool to control output. Returns a list, where each entry consists of outline level, title, page number and link destination (if simple = False). For details see PyMuPDF's documentation. + """ + def recurse(olItem, liste, lvl): + """Recursively follow the outline item chain and record item information in a list.""" + while olItem and olItem.this.m_internal: + if olItem.title: + title = olItem.title + else: + title = " " + + if not olItem.is_external: + if olItem.uri: + if olItem.page == -1: + resolve = doc.resolve_link(olItem.uri) + page = resolve[0] + 1 + else: + page = olItem.page + 1 + else: + page = -1 + else: + page = -1 + + if not simple: + link = getLinkDict(olItem, doc) + liste.append([lvl, title, page, link]) + else: + liste.append([lvl, title, page]) + + if olItem.down: + liste = recurse(olItem.down, liste, lvl + 1) + olItem = olItem.next + return liste + + # ensure document is open + if doc.is_closed: + raise ValueError("document closed") + doc.init_doc() + olItem = doc.outline + if not olItem: + return [] + lvl = 1 + liste = [] + toc = recurse(olItem, liste, lvl) + if doc.is_pdf and not simple: + doc._extend_toc_items(toc) + return toc + + +def del_toc_item( + doc: pymupdf.Document, + idx: int, +) -> None: + """Delete TOC / bookmark item by index.""" + xref = doc.get_outline_xrefs()[idx] + doc._remove_toc_item(xref) + + +def set_toc_item( + doc: pymupdf.Document, + idx: int, + dest_dict: OptDict = None, + kind: OptInt = None, + pno: OptInt = None, + uri: OptStr = None, + title: OptStr = None, + to: point_like = None, + filename: OptStr = None, + zoom: float = 0, +) -> None: + """Update TOC item by index. + + It allows changing the item's title and link destination. + + Args: + idx: + (int) desired index of the TOC list, as created by get_toc. + dest_dict: + (dict) destination dictionary as created by get_toc(False). + Outrules all other parameters. If None, the remaining parameters + are used to make a dest dictionary. + kind: + (int) kind of link (pymupdf.LINK_GOTO, etc.). If None, then only + the title will be updated. If pymupdf.LINK_NONE, the TOC item will + be deleted. + pno: + (int) page number (1-based like in get_toc). Required if + pymupdf.LINK_GOTO. + uri: + (str) the URL, required if pymupdf.LINK_URI. + title: + (str) the new title. No change if None. + to: + (point-like) destination on the target page. If omitted, (72, 36) + will be used as target coordinates. + filename: + (str) destination filename, required for pymupdf.LINK_GOTOR and + pymupdf.LINK_LAUNCH. + name: + (str) a destination name for pymupdf.LINK_NAMED. + zoom: + (float) a zoom factor for the target location (pymupdf.LINK_GOTO). 
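+
+    Example:
+        A minimal sketch, assuming "example.pdf" is a PDF with at least two
+        TOC entries and at least five pages (hypothetical file name):
+
+            import pymupdf
+
+            doc = pymupdf.open("example.pdf")
+            doc.set_toc_item(0, title="Overview")  # rename the first entry only
+            doc.set_toc_item(  # point the second entry to page 5 (1-based)
+                1, kind=pymupdf.LINK_GOTO, pno=5, to=pymupdf.Point(72, 100)
+            )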
+ """ + xref = doc.get_outline_xrefs()[idx] + page_xref = 0 + if type(dest_dict) is dict: + if dest_dict["kind"] == pymupdf.LINK_GOTO: + pno = dest_dict["page"] + page_xref = doc.page_xref(pno) + page_height = doc.page_cropbox(pno).height + to = dest_dict.get('to', pymupdf.Point(72, 36)) + to.y = page_height - to.y + dest_dict["to"] = to + action = getDestStr(page_xref, dest_dict) + if not action.startswith("/A"): + raise ValueError("bad bookmark dest") + color = dest_dict.get("color") + if color: + color = list(map(float, color)) + if len(color) != 3 or min(color) < 0 or max(color) > 1: + raise ValueError("bad color value") + bold = dest_dict.get("bold", False) + italic = dest_dict.get("italic", False) + flags = italic + 2 * bold + collapse = dest_dict.get("collapse") + return doc._update_toc_item( + xref, + action=action[2:], + title=title, + color=color, + flags=flags, + collapse=collapse, + ) + + if kind == pymupdf.LINK_NONE: # delete bookmark item + return doc.del_toc_item(idx) + if kind is None and title is None: # treat as no-op + return None + if kind is None: # only update title text + return doc._update_toc_item(xref, action=None, title=title) + + if kind == pymupdf.LINK_GOTO: + if pno is None or pno not in range(1, doc.page_count + 1): + raise ValueError("bad page number") + page_xref = doc.page_xref(pno - 1) + page_height = doc.page_cropbox(pno - 1).height + if to is None: + to = pymupdf.Point(72, page_height - 36) + else: + to = pymupdf.Point(to) + to.y = page_height - to.y + + ddict = { + "kind": kind, + "to": to, + "uri": uri, + "page": pno, + "file": filename, + "zoom": zoom, + } + action = getDestStr(page_xref, ddict) + if action == "" or not action.startswith("/A"): + raise ValueError("bad bookmark dest") + + return doc._update_toc_item(xref, action=action[2:], title=title) + + +def get_area(*args) -> float: + """Calculate area of rectangle.\nparameter is one of 'px' (default), 'in', 'cm', or 'mm'.""" + rect = args[0] + if len(args) > 1: + unit = args[1] + else: + unit = "px" + u = {"px": (1, 1), "in": (1.0, 72.0), "cm": (2.54, 72.0), "mm": (25.4, 72.0)} + f = (u[unit][0] / u[unit][1]) ** 2 + return f * rect.width * rect.height + + +def set_metadata(doc: pymupdf.Document, m: dict = None) -> None: + """Update the PDF /Info object. + + Args: + m: a dictionary like doc.metadata. 
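+
+    Example:
+        A minimal sketch, assuming "example.pdf" is any existing PDF; the
+        metadata values are made up:
+
+            import pymupdf
+
+            doc = pymupdf.open("example.pdf")
+            doc.set_metadata({
+                "title": "Annual Report",
+                "author": "Jane Doe",
+                "creationDate": "D:20250101120000",
+            })
+            doc.set_metadata({})  # alternatively: remove all /Info metadata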
+ """ + if not doc.is_pdf: + raise ValueError("is no PDF") + if doc.is_closed or doc.is_encrypted: + raise ValueError("document closed or encrypted") + if m is None: + m = {} + elif type(m) is not dict: + raise ValueError("bad metadata") + keymap = { + "author": "Author", + "producer": "Producer", + "creator": "Creator", + "title": "Title", + "format": None, + "encryption": None, + "creationDate": "CreationDate", + "modDate": "ModDate", + "subject": "Subject", + "keywords": "Keywords", + "trapped": "Trapped", + } + valid_keys = set(keymap.keys()) + diff_set = set(m.keys()).difference(valid_keys) + if diff_set != set(): + msg = "bad dict key(s): %s" % diff_set + raise ValueError(msg) + + t, temp = doc.xref_get_key(-1, "Info") + if t != "xref": + info_xref = 0 + else: + info_xref = int(temp.replace("0 R", "")) + + if m == {} and info_xref == 0: # nothing to do + return + + if info_xref == 0: # no prev metadata: get new xref + info_xref = doc.get_new_xref() + doc.update_object(info_xref, "<<>>") # fill it with empty object + doc.xref_set_key(-1, "Info", "%i 0 R" % info_xref) + elif m == {}: # remove existing metadata + doc.xref_set_key(-1, "Info", "null") + doc.init_doc() + return + + for key, val in [(k, v) for k, v in m.items() if keymap[k] is not None]: + pdf_key = keymap[key] + if not bool(val) or val in ("none", "null"): + val = "null" + else: + val = pymupdf.get_pdf_str(val) + doc.xref_set_key(info_xref, pdf_key, val) + doc.init_doc() + return + + +def getDestStr(xref: int, ddict: dict) -> str: + """Calculate the PDF action string. + + Notes: + Supports Link annotations and outline items (bookmarks). + """ + if not ddict: + return "" + str_goto = lambda a, b, c, d: f"/A<>" + str_gotor1 = lambda a, b, c, d, e, f: f"/A<>>>" + str_gotor2 = lambda a, b, c: f"/A<>>>" + str_launch = lambda a, b: f"/A<>>>" + str_uri = lambda a: f"/A<>" + + if type(ddict) in (int, float): + dest = str_goto(xref, 0, ddict, 0) + return dest + d_kind = ddict.get("kind", pymupdf.LINK_NONE) + + if d_kind == pymupdf.LINK_NONE: + return "" + + if ddict["kind"] == pymupdf.LINK_GOTO: + d_zoom = ddict.get("zoom", 0) + to = ddict.get("to", pymupdf.Point(0, 0)) + d_left, d_top = to + dest = str_goto(xref, d_left, d_top, d_zoom) + return dest + + if ddict["kind"] == pymupdf.LINK_URI: + dest = str_uri(pymupdf.get_pdf_str(ddict["uri"]),) + return dest + + if ddict["kind"] == pymupdf.LINK_LAUNCH: + fspec = pymupdf.get_pdf_str(ddict["file"]) + dest = str_launch(fspec, fspec) + return dest + + if ddict["kind"] == pymupdf.LINK_GOTOR and ddict["page"] < 0: + fspec = pymupdf.get_pdf_str(ddict["file"]) + dest = str_gotor2(pymupdf.get_pdf_str(ddict["to"]), fspec, fspec) + return dest + + if ddict["kind"] == pymupdf.LINK_GOTOR and ddict["page"] >= 0: + fspec = pymupdf.get_pdf_str(ddict["file"]) + dest = str_gotor1( + ddict["page"], + ddict["to"].x, + ddict["to"].y, + ddict["zoom"], + fspec, + fspec, + ) + return dest + + return "" + + +def set_toc( + doc: pymupdf.Document, + toc: list, + collapse: int = 1, +) -> int: + """Create new outline tree (table of contents, TOC). + + Args: + toc: (list, tuple) each entry must contain level, title, page and + optionally top margin on the page. None or '()' remove the TOC. + collapse: (int) collapses entries beyond this level. Zero or None + shows all entries unfolded. + Returns: + the number of inserted items, or the number of removed items respectively. 
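+
+    Example:
+        A minimal sketch, assuming "example.pdf" is a PDF with at least four
+        pages; the titles are made up:
+
+            import pymupdf
+
+            doc = pymupdf.open("example.pdf")
+            toc = [
+                [1, "Chapter 1", 1],
+                [2, "Section 1.1", 2],
+                [2, "Section 1.2", 3, 200],  # optional 4th item: top position
+                [1, "Chapter 2", 4],
+            ]
+            doc.set_toc(toc)   # returns 4, the number of inserted items
+            doc.set_toc(None)  # removes the complete TOC again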
+ """ + if doc.is_closed or doc.is_encrypted: + raise ValueError("document closed or encrypted") + if not doc.is_pdf: + raise ValueError("is no PDF") + if not toc: # remove all entries + return len(doc._delToC()) + + # validity checks -------------------------------------------------------- + if type(toc) not in (list, tuple): + raise ValueError("'toc' must be list or tuple") + toclen = len(toc) + page_count = doc.page_count + t0 = toc[0] + if type(t0) not in (list, tuple): + raise ValueError("items must be sequences of 3 or 4 items") + if t0[0] != 1: + raise ValueError("hierarchy level of item 0 must be 1") + for i in list(range(toclen - 1)): + t1 = toc[i] + t2 = toc[i + 1] + if not -1 <= t1[2] <= page_count: + raise ValueError("row %i: page number out of range" % i) + if (type(t2) not in (list, tuple)) or len(t2) not in (3, 4): + raise ValueError("bad row %i" % (i + 1)) + if (type(t2[0]) is not int) or t2[0] < 1: + raise ValueError("bad hierarchy level in row %i" % (i + 1)) + if t2[0] > t1[0] + 1: + raise ValueError("bad hierarchy level in row %i" % (i + 1)) + # no formal errors in toc -------------------------------------------------- + + # -------------------------------------------------------------------------- + # make a list of xref numbers, which we can use for our TOC entries + # -------------------------------------------------------------------------- + old_xrefs = doc._delToC() # del old outlines, get their xref numbers + + # prepare table of xrefs for new bookmarks + old_xrefs = [] + xref = [0] + old_xrefs + xref[0] = doc._getOLRootNumber() # entry zero is outline root xref number + if toclen > len(old_xrefs): # too few old xrefs? + for i in range((toclen - len(old_xrefs))): + xref.append(doc.get_new_xref()) # acquire new ones + + lvltab = {0: 0} # to store last entry per hierarchy level + + # ------------------------------------------------------------------------------ + # contains new outline objects as strings - first one is the outline root + # ------------------------------------------------------------------------------ + olitems = [{"count": 0, "first": -1, "last": -1, "xref": xref[0]}] + # ------------------------------------------------------------------------------ + # build olitems as a list of PDF-like connected dictionaries + # ------------------------------------------------------------------------------ + for i in range(toclen): + o = toc[i] + lvl = o[0] # level + title = pymupdf.get_pdf_str(o[1]) # title + pno = min(doc.page_count - 1, max(0, o[2] - 1)) # page number + page_xref = doc.page_xref(pno) + page_height = doc.page_cropbox(pno).height + top = pymupdf.Point(72, page_height - 36) + dest_dict = {"to": top, "kind": pymupdf.LINK_GOTO} # fall back target + if o[2] < 0: + dest_dict["kind"] = pymupdf.LINK_NONE + if len(o) > 3: # some target is specified + if type(o[3]) in (int, float): # convert a number to a point + dest_dict["to"] = pymupdf.Point(72, page_height - o[3]) + else: # if something else, make sure we have a dict + # We make a copy of o[3] to avoid modifying our caller's data. + dest_dict = o[3].copy() if type(o[3]) is dict else dest_dict + if "to" not in dest_dict: # target point not in dict? 
+ dest_dict["to"] = top # put default in + else: # transform target to PDF coordinates + page = doc[pno] + point = pymupdf.Point(dest_dict["to"]) + point.y = page.cropbox.height - point.y + point = point * page.rotation_matrix + dest_dict["to"] = (point.x, point.y) + d = {} + d["first"] = -1 + d["count"] = 0 + d["last"] = -1 + d["prev"] = -1 + d["next"] = -1 + d["dest"] = getDestStr(page_xref, dest_dict) + d["top"] = dest_dict["to"] + d["title"] = title + d["parent"] = lvltab[lvl - 1] + d["xref"] = xref[i + 1] + d["color"] = dest_dict.get("color") + d["flags"] = dest_dict.get("italic", 0) + 2 * dest_dict.get("bold", 0) + lvltab[lvl] = i + 1 + parent = olitems[lvltab[lvl - 1]] # the parent entry + + if ( + dest_dict.get("collapse") or collapse and lvl > collapse + ): # suppress expansion + parent["count"] -= 1 # make /Count negative + else: + parent["count"] += 1 # positive /Count + + if parent["first"] == -1: + parent["first"] = i + 1 + parent["last"] = i + 1 + else: + d["prev"] = parent["last"] + prev = olitems[parent["last"]] + prev["next"] = i + 1 + parent["last"] = i + 1 + olitems.append(d) + + # ------------------------------------------------------------------------------ + # now create each outline item as a string and insert it in the PDF + # ------------------------------------------------------------------------------ + for i, ol in enumerate(olitems): + txt = "<<" + if ol["count"] != 0: + txt += "/Count %i" % ol["count"] + try: + txt += ol["dest"] + except Exception: + # Verbose in PyMuPDF/tests. + if g_exceptions_verbose >= 2: pymupdf.exception_info() + pass + try: + if ol["first"] > -1: + txt += "/First %i 0 R" % xref[ol["first"]] + except Exception: + if g_exceptions_verbose >= 2: pymupdf.exception_info() + pass + try: + if ol["last"] > -1: + txt += "/Last %i 0 R" % xref[ol["last"]] + except Exception: + if g_exceptions_verbose >= 2: pymupdf.exception_info() + pass + try: + if ol["next"] > -1: + txt += "/Next %i 0 R" % xref[ol["next"]] + except Exception: + # Verbose in PyMuPDF/tests. + if g_exceptions_verbose >= 2: pymupdf.exception_info() + pass + try: + if ol["parent"] > -1: + txt += "/Parent %i 0 R" % xref[ol["parent"]] + except Exception: + # Verbose in PyMuPDF/tests. + if g_exceptions_verbose >= 2: pymupdf.exception_info() + pass + try: + if ol["prev"] > -1: + txt += "/Prev %i 0 R" % xref[ol["prev"]] + except Exception: + # Verbose in PyMuPDF/tests. + if g_exceptions_verbose >= 2: pymupdf.exception_info() + pass + try: + txt += "/Title" + ol["title"] + except Exception: + # Verbose in PyMuPDF/tests. + if g_exceptions_verbose >= 2: pymupdf.exception_info() + pass + + if ol.get("color") and len(ol["color"]) == 3: + txt += f"/C[ {_format_g(tuple(ol['color']))}]" + if ol.get("flags", 0) > 0: + txt += "/F %i" % ol["flags"] + + if i == 0: # special: this is the outline root + txt += "/Type/Outlines" # so add the /Type entry + txt += ">>" + doc.update_object(xref[i], txt) # insert the PDF object + + doc.init_doc() + return toclen + + +def do_widgets( + tar: pymupdf.Document, + src: pymupdf.Document, + graftmap, + from_page: int = -1, + to_page: int = -1, + start_at: int = -1, + join_duplicates=0, +) -> None: + """Insert widgets of copied page range into target PDF. + + Parameter values **must** equal those of method insert_pdf() which + must have been previously executed. 
+ """ + if not src.is_form_pdf: # nothing to do: source PDF has no fields + return + + def clean_kid_parents(acro_fields): + """ Make sure all kids have correct "Parent" pointers.""" + for i in range(acro_fields.pdf_array_len()): + parent = acro_fields.pdf_array_get(i) + kids = parent.pdf_dict_get(pymupdf.PDF_NAME("Kids")) + for j in range(kids.pdf_array_len()): + kid = kids.pdf_array_get(j) + kid.pdf_dict_put(pymupdf.PDF_NAME("Parent"), parent) + + def join_widgets(pdf, acro_fields, xref1, xref2, name): + """Called for each pair of widgets having the same name. + + Args: + pdf: target MuPDF document + acro_fields: object Root/AcroForm/Fields + xref1, xref2: widget xrefs having same names + name: (str) the name + + Result: + Defined or updated widget parent that points to both widgets. + """ + + def re_target(pdf, acro_fields, xref1, kids1, xref2, kids2): + """Merge widget in xref2 into "Kids" list of widget xref1. + + Args: + xref1, kids1: target widget and its "Kids" array. + xref2, kids2: source wwidget and its "Kids" array (may be empty). + """ + # make indirect objects from widgets + w1_ind = mupdf.pdf_new_indirect(pdf, xref1, 0) + w2_ind = mupdf.pdf_new_indirect(pdf, xref2, 0) + # find source widget in "Fields" array + idx = acro_fields.pdf_array_find(w2_ind) + acro_fields.pdf_array_delete(idx) + + if not kids2.pdf_is_array(): # source widget has no kids + widget = mupdf.pdf_load_object(pdf, xref2) + + # delete name from widget and insert target as parent + widget.pdf_dict_del(pymupdf.PDF_NAME("T")) + widget.pdf_dict_put(pymupdf.PDF_NAME("Parent"), w1_ind) + + # put in target Kids + kids1.pdf_array_push(w2_ind) + else: # copy source kids to target kids + for i in range(kids2.pdf_array_len()): + kid = kids2.pdf_array_get(i) + kid.pdf_dict_put(pymupdf.PDF_NAME("Parent"), w1_ind) + kid_ind = mupdf.pdf_new_indirect(pdf, kid.pdf_to_num(), 0) + kids1.pdf_array_push(kid_ind) + + def new_target(pdf, acro_fields, xref1, w1, xref2, w2, name): + """Make new "Parent" for two widgets with same name. + + Args: + xref1, w1: first widget + xref2, w2: second widget + name: field name + + Result: + Both widgets have no "Kids". We create a new object with the + name and a "Kids" array containing the widgets. + Original widgets must be removed from AcroForm/Fields. 
+ """ + # make new "Parent" object + new = mupdf.pdf_new_dict(pdf, 5) + new.pdf_dict_put_text_string(pymupdf.PDF_NAME("T"), name) + kids = new.pdf_dict_put_array(pymupdf.PDF_NAME("Kids"), 2) + new_obj = mupdf.pdf_add_object(pdf, new) + new_obj_xref = new_obj.pdf_to_num() + new_ind = mupdf.pdf_new_indirect(pdf, new_obj_xref, 0) + + # copy over some required source widget properties + ft = w1.pdf_dict_get(pymupdf.PDF_NAME("FT")) + w1.pdf_dict_del(pymupdf.PDF_NAME("FT")) + new_obj.pdf_dict_put(pymupdf.PDF_NAME("FT"), ft) + + aa = w1.pdf_dict_get(pymupdf.PDF_NAME("AA")) + w1.pdf_dict_del(pymupdf.PDF_NAME("AA")) + new_obj.pdf_dict_put(pymupdf.PDF_NAME("AA"), aa) + + # remove name field, insert "Parent" field in source widgets + w1.pdf_dict_del(pymupdf.PDF_NAME("T")) + w1.pdf_dict_put(pymupdf.PDF_NAME("Parent"), new_ind) + w2.pdf_dict_del(pymupdf.PDF_NAME("T")) + w2.pdf_dict_put(pymupdf.PDF_NAME("Parent"), new_ind) + + # put source widgets in "kids" array + ind1 = mupdf.pdf_new_indirect(pdf, xref1, 0) + ind2 = mupdf.pdf_new_indirect(pdf, xref2, 0) + kids.pdf_array_push(ind1) + kids.pdf_array_push(ind2) + + # remove source widgets from "AcroForm/Fields" + idx = acro_fields.pdf_array_find(ind1) + acro_fields.pdf_array_delete(idx) + idx = acro_fields.pdf_array_find(ind2) + acro_fields.pdf_array_delete(idx) + + acro_fields.pdf_array_push(new_ind) + + w1 = mupdf.pdf_load_object(pdf, xref1) + w2 = mupdf.pdf_load_object(pdf, xref2) + kids1 = w1.pdf_dict_get(pymupdf.PDF_NAME("Kids")) + kids2 = w2.pdf_dict_get(pymupdf.PDF_NAME("Kids")) + + # check which widget has a suitable "Kids" array + if kids1.pdf_is_array(): + re_target(pdf, acro_fields, xref1, kids1, xref2, kids2) # pylint: disable=arguments-out-of-order + elif kids2.pdf_is_array(): + re_target(pdf, acro_fields, xref2, kids2, xref1, kids1) # pylint: disable=arguments-out-of-order + else: + new_target(pdf, acro_fields, xref1, w1, xref2, w2, name) # pylint: disable=arguments-out-of-order + + def get_kids(parent, kids_list): + """Return xref list of leaf kids for a parent. + + Call with an empty list. + """ + kids = mupdf.pdf_dict_get(parent, pymupdf.PDF_NAME("Kids")) + if not kids.pdf_is_array(): + return kids_list + for i in range(kids.pdf_array_len()): + kid = kids.pdf_array_get(i) + if mupdf.pdf_is_dict(mupdf.pdf_dict_get(kid, pymupdf.PDF_NAME("Kids"))): + kids_list = get_kids(kid, kids_list) + else: + kids_list.append(kid.pdf_to_num()) + return kids_list + + def kids_xrefs(widget): + """Get the xref of top "Parent" and the list of leaf widgets.""" + kids_list = [] + parent = mupdf.pdf_dict_get(widget, pymupdf.PDF_NAME("Parent")) + parent_xref = parent.pdf_to_num() + if parent_xref == 0: + return parent_xref, kids_list + kids_list = get_kids(parent, kids_list) + return parent_xref, kids_list + + def deduplicate_names(pdf, acro_fields, join_duplicates=False): + """Handle any widget name duplicates caused by the merge.""" + names = {} # key is a widget name, value a list of widgets having it. + + # extract all names and widgets in "AcroForm/Fields" + for i in range(mupdf.pdf_array_len(acro_fields)): + wobject = mupdf.pdf_array_get(acro_fields, i) + xref = wobject.pdf_to_num() + + # extract widget name and collect widget(s) using it + T = mupdf.pdf_dict_get_text_string(wobject, pymupdf.PDF_NAME("T")) + xrefs = names.get(T, []) + xrefs.append(xref) + names[T] = xrefs + + for name, xrefs in names.items(): + if len(xrefs) < 2: + continue + xref0, xref1 = xrefs[:2] # only exactly 2 should occur! 
+ if join_duplicates: # combine fields with equal names + join_widgets(pdf, acro_fields, xref0, xref1, name) + else: # make field names unique + newname = name + f" [{xref1}]" # append this to the name + wobject = mupdf.pdf_load_object(pdf, xref1) + wobject.pdf_dict_put_text_string(pymupdf.PDF_NAME("T"), newname) + + clean_kid_parents(acro_fields) + + def get_acroform(doc): + """Retrieve the AcroForm dictionary form a PDF.""" + pdf = mupdf.pdf_document_from_fz_document(doc) + # AcroForm (= central form field info) + return mupdf.pdf_dict_getp(mupdf.pdf_trailer(pdf), "Root/AcroForm") + + tarpdf = mupdf.pdf_document_from_fz_document(tar) + srcpdf = mupdf.pdf_document_from_fz_document(src) + + if tar.is_form_pdf: + # target is a Form PDF, so use it to include source fields + acro = get_acroform(tar) + # Important arrays in AcroForm + acro_fields = acro.pdf_dict_get(pymupdf.PDF_NAME("Fields")) + tar_co = acro.pdf_dict_get(pymupdf.PDF_NAME("CO")) + if not tar_co.pdf_is_array(): + tar_co = acro.pdf_dict_put_array(pymupdf.PDF_NAME("CO"), 5) + else: + # target is no Form PDF, so copy over source AcroForm + acro = mupdf.pdf_deep_copy_obj(get_acroform(src)) # make a copy + + # Clear "Fields" and "CO" arrays: will be populated by page fields. + # This is required to avoid copying unneeded objects. + acro.pdf_dict_del(pymupdf.PDF_NAME("Fields")) + acro.pdf_dict_put_array(pymupdf.PDF_NAME("Fields"), 5) + acro.pdf_dict_del(pymupdf.PDF_NAME("CO")) + acro.pdf_dict_put_array(pymupdf.PDF_NAME("CO"), 5) + + # Enrich AcroForm for copying to target + acro_graft = mupdf.pdf_graft_mapped_object(graftmap, acro) + + # Insert AcroForm into target PDF + acro_tar = mupdf.pdf_add_object(tarpdf, acro_graft) + acro_fields = acro_tar.pdf_dict_get(pymupdf.PDF_NAME("Fields")) + tar_co = acro_tar.pdf_dict_get(pymupdf.PDF_NAME("CO")) + + # get its xref and insert it into target catalog + tar_xref = acro_tar.pdf_to_num() + acro_tar_ind = mupdf.pdf_new_indirect(tarpdf, tar_xref, 0) + root = mupdf.pdf_dict_get(mupdf.pdf_trailer(tarpdf), pymupdf.PDF_NAME("Root")) + root.pdf_dict_put(pymupdf.PDF_NAME("AcroForm"), acro_tar_ind) + + if from_page <= to_page: + src_range = range(from_page, to_page + 1) + else: + src_range = range(from_page, to_page - 1, -1) + + parents = {} # information about widget parents + + # remove "P" owning page reference from all widgets of all source pages + for i in src_range: + src_page = src[i] + for xref in [ + xref + for xref, wtype, _ in src_page.annot_xrefs() + if wtype == pymupdf.PDF_ANNOT_WIDGET # pylint: disable=no-member + ]: + w_obj = mupdf.pdf_load_object(srcpdf, xref) + w_obj.pdf_dict_del(pymupdf.PDF_NAME("P")) + + # get the widget's parent structure + parent_xref, old_kids = kids_xrefs(w_obj) + if parent_xref: + parents[parent_xref] = { + "new_xref": 0, + "old_kids": old_kids, + "new_kids": [], + } + # Copy over Parent widgets first - they are not page-dependent + for xref in parents.keys(): # pylint: disable=consider-using-dict-items + parent = mupdf.pdf_load_object(srcpdf, xref) + parent_graft = mupdf.pdf_graft_mapped_object(graftmap, parent) + parent_tar = mupdf.pdf_add_object(tarpdf, parent_graft) + kids_xrefs_new = get_kids(parent_tar, []) + parent_xref_new = parent_tar.pdf_to_num() + parent_ind = mupdf.pdf_new_indirect(tarpdf, parent_xref_new, 0) + acro_fields.pdf_array_push(parent_ind) + parents[xref]["new_xref"] = parent_xref_new + parents[xref]["new_kids"] = kids_xrefs_new + + for i in range(len(src_range)): + # read first copied over page in target + tar_page = tar[start_at + i] + + 
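+        # (Each iteration pairs the i-th copied page in the target with its
+        # original page in the source and transplants that page's widgets.)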
# read the original page in the source PDF + src_page = src[src_range[i]] + + # now walk through source page widgets and copy over + w_xrefs = [ # widget xrefs of the source page + xref + for xref, wtype, _ in src_page.annot_xrefs() + if wtype == pymupdf.PDF_ANNOT_WIDGET # pylint: disable=no-member + ] + if not w_xrefs: # no widgets on this source page + continue + + # convert to formal PDF page + tar_page_pdf = mupdf.pdf_page_from_fz_page(tar_page) + + # extract annotations array + tar_annots = mupdf.pdf_dict_get(tar_page_pdf.obj(), pymupdf.PDF_NAME("Annots")) + if not mupdf.pdf_is_array(tar_annots): + tar_annots = mupdf.pdf_dict_put_array( + tar_page_pdf.obj(), pymupdf.PDF_NAME("Annots"), 5 + ) + + for xref in w_xrefs: + w_obj = mupdf.pdf_load_object(srcpdf, xref) + + # check if field takes part in inter-field validations + is_aac = mupdf.pdf_is_dict(mupdf.pdf_dict_getp(w_obj, "AA/C")) + + # check if parent of widget already in target + parent_xref = mupdf.pdf_to_num( + w_obj.pdf_dict_get(pymupdf.PDF_NAME("Parent")) + ) + if parent_xref == 0: # parent not in target yet + try: + w_obj_graft = mupdf.pdf_graft_mapped_object(graftmap, w_obj) + except Exception as e: + pymupdf.message_warning(f"cannot copy widget at {xref=}: {e}") + continue + w_obj_tar = mupdf.pdf_add_object(tarpdf, w_obj_graft) + tar_xref = w_obj_tar.pdf_to_num() + w_obj_tar_ind = mupdf.pdf_new_indirect(tarpdf, tar_xref, 0) + mupdf.pdf_array_push(tar_annots, w_obj_tar_ind) + mupdf.pdf_array_push(acro_fields, w_obj_tar_ind) + else: + parent = parents[parent_xref] + idx = parent["old_kids"].index(xref) # search for xref in parent + tar_xref = parent["new_kids"][idx] + w_obj_tar_ind = mupdf.pdf_new_indirect(tarpdf, tar_xref, 0) + mupdf.pdf_array_push(tar_annots, w_obj_tar_ind) + + # Into "AcroForm/CO" if a computation field. + if is_aac: + mupdf.pdf_array_push(tar_co, w_obj_tar_ind) + + deduplicate_names(tarpdf, acro_fields, join_duplicates=join_duplicates) + +def do_links( + doc1: pymupdf.Document, + doc2: pymupdf.Document, + from_page: int = -1, + to_page: int = -1, + start_at: int = -1, +) -> None: + """Insert links contained in copied page range into destination PDF. + + Parameter values **must** equal those of method insert_pdf(), which must + have been previously executed. 
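+
+    Example:
+        A minimal sketch, assuming "target.pdf" and "source.pdf" are existing
+        PDFs (hypothetical names) and the source has at least three pages; the
+        page range values repeat those of the preceding insert_pdf() call:
+
+            import pymupdf
+
+            tar = pymupdf.open("target.pdf")
+            src = pymupdf.open("source.pdf")
+            tar.insert_pdf(src, from_page=0, to_page=2, start_at=0, links=False)
+            do_links(tar, src, from_page=0, to_page=2, start_at=0)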
+ """ + #pymupdf.log( 'utils.do_links()') + # -------------------------------------------------------------------------- + # internal function to create the actual "/Annots" object string + # -------------------------------------------------------------------------- + def cre_annot(lnk, xref_dst, pno_src, ctm): + """Create annotation object string for a passed-in link.""" + + r = lnk["from"] * ctm # rect in PDF coordinates + rect = _format_g(tuple(r)) + if lnk["kind"] == pymupdf.LINK_GOTO: + txt = pymupdf.annot_skel["goto1"] # annot_goto + idx = pno_src.index(lnk["page"]) + p = lnk["to"] * ctm # target point in PDF coordinates + annot = txt(xref_dst[idx], p.x, p.y, lnk["zoom"], rect) + + elif lnk["kind"] == pymupdf.LINK_GOTOR: + if lnk["page"] >= 0: + txt = pymupdf.annot_skel["gotor1"] # annot_gotor + pnt = lnk.get("to", pymupdf.Point(0, 0)) # destination point + if type(pnt) is not pymupdf.Point: + pnt = pymupdf.Point(0, 0) + annot = txt( + lnk["page"], + pnt.x, + pnt.y, + lnk["zoom"], + lnk["file"], + lnk["file"], + rect, + ) + else: + txt = pymupdf.annot_skel["gotor2"] # annot_gotor_n + to = pymupdf.get_pdf_str(lnk["to"]) + to = to[1:-1] + f = lnk["file"] + annot = txt(to, f, rect) + + elif lnk["kind"] == pymupdf.LINK_LAUNCH: + txt = pymupdf.annot_skel["launch"] # annot_launch + annot = txt(lnk["file"], lnk["file"], rect) + + elif lnk["kind"] == pymupdf.LINK_URI: + txt = pymupdf.annot_skel["uri"] # annot_uri + annot = txt(lnk["uri"], rect) + + else: + annot = "" + + return annot + + # -------------------------------------------------------------------------- + + # validate & normalize parameters + if from_page < 0: + fp = 0 + elif from_page >= doc2.page_count: + fp = doc2.page_count - 1 + else: + fp = from_page + + if to_page < 0 or to_page >= doc2.page_count: + tp = doc2.page_count - 1 + else: + tp = to_page + + if start_at < 0: + raise ValueError("'start_at' must be >= 0") + sa = start_at + + incr = 1 if fp <= tp else -1 # page range could be reversed + + # lists of source / destination page numbers + pno_src = list(range(fp, tp + incr, incr)) + pno_dst = [sa + i for i in range(len(pno_src))] + + # lists of source / destination page xrefs + xref_src = [] + xref_dst = [] + for i in range(len(pno_src)): + p_src = pno_src[i] + p_dst = pno_dst[i] + old_xref = doc2.page_xref(p_src) + new_xref = doc1.page_xref(p_dst) + xref_src.append(old_xref) + xref_dst.append(new_xref) + + # create the links for each copied page in destination PDF + for i in range(len(xref_src)): + page_src = doc2[pno_src[i]] # load source page + links = page_src.get_links() # get all its links + #pymupdf.log( '{pno_src=}') + #pymupdf.log( '{type(page_src)=}') + #pymupdf.log( '{page_src=}') + #pymupdf.log( '{=i len(links)}') + if len(links) == 0: # no links there + page_src = None + continue + ctm = ~page_src.transformation_matrix # calc page transformation matrix + page_dst = doc1[pno_dst[i]] # load destination page + link_tab = [] # store all link definitions here + for l in links: + if l["kind"] == pymupdf.LINK_GOTO and (l["page"] not in pno_src): + continue # GOTO link target not in copied pages + annot_text = cre_annot(l, xref_dst, pno_src, ctm) + if annot_text: + link_tab.append(annot_text) + if link_tab != []: + page_dst._addAnnot_FromString( tuple(link_tab)) + #pymupdf.log( 'utils.do_links() returning.') + + +def getLinkText(page: pymupdf.Page, lnk: dict) -> str: + # -------------------------------------------------------------------------- + # define skeletons for /Annots object texts + # 
-------------------------------------------------------------------------- + ctm = page.transformation_matrix + ictm = ~ctm + r = lnk["from"] + rect = _format_g(tuple(r * ictm)) + + annot = "" + if lnk["kind"] == pymupdf.LINK_GOTO: + if lnk["page"] >= 0: + txt = pymupdf.annot_skel["goto1"] # annot_goto + pno = lnk["page"] + xref = page.parent.page_xref(pno) + pnt = lnk.get("to", pymupdf.Point(0, 0)) # destination point + dest_page = page.parent[pno] + dest_ctm = dest_page.transformation_matrix + dest_ictm = ~dest_ctm + ipnt = pnt * dest_ictm + annot = txt(xref, ipnt.x, ipnt.y, lnk.get("zoom", 0), rect) + else: + txt = pymupdf.annot_skel["goto2"] # annot_goto_n + annot = txt(pymupdf.get_pdf_str(lnk["to"]), rect) + + elif lnk["kind"] == pymupdf.LINK_GOTOR: + if lnk["page"] >= 0: + txt = pymupdf.annot_skel["gotor1"] # annot_gotor + pnt = lnk.get("to", pymupdf.Point(0, 0)) # destination point + if type(pnt) is not pymupdf.Point: + pnt = pymupdf.Point(0, 0) + annot = txt( + lnk["page"], + pnt.x, + pnt.y, + lnk.get("zoom", 0), + lnk["file"], + lnk["file"], + rect, + ) + else: + txt = pymupdf.annot_skel["gotor2"] # annot_gotor_n + annot = txt(pymupdf.get_pdf_str(lnk["to"]), lnk["file"], rect) + + elif lnk["kind"] == pymupdf.LINK_LAUNCH: + txt = pymupdf.annot_skel["launch"] # annot_launch + annot = txt(lnk["file"], lnk["file"], rect) + + elif lnk["kind"] == pymupdf.LINK_URI: + txt = pymupdf.annot_skel["uri"] # txt = annot_uri + annot = txt(lnk["uri"], rect) + + elif lnk["kind"] == pymupdf.LINK_NAMED: + txt = pymupdf.annot_skel["named"] # annot_named + lname = lnk.get("name") # check presence of key + if lname is None: # if missing, fall back to alternative + lname = lnk["nameddest"] + annot = txt(lname, rect) + if not annot: + return annot + + # add a /NM PDF key to the object definition + link_names = dict( # existing ids and their xref + [(x[0], x[2]) for x in page.annot_xrefs() if x[1] == pymupdf.PDF_ANNOT_LINK] # pylint: disable=no-member + ) + + old_name = lnk.get("id", "") # id value in the argument + + if old_name and (lnk["xref"], old_name) in link_names.items(): + name = old_name # no new name if this is an update only + else: + i = 0 + stem = pymupdf.TOOLS.set_annot_stem() + "-L%i" + while True: + name = stem % i + if name not in link_names.values(): + break + i += 1 + # add /NM key to object definition + annot = annot.replace("/Link", "/Link/NM(%s)" % name) + return annot + + +def delete_widget(page: pymupdf.Page, widget: pymupdf.Widget) -> pymupdf.Widget: + """Delete widget from page and return the next one.""" + pymupdf.CheckParent(page) + annot = getattr(widget, "_annot", None) + if annot is None: + raise ValueError("bad type: widget") + nextwidget = widget.next + page.delete_annot(annot) + widget._annot.parent = None + keylist = list(widget.__dict__.keys()) + for key in keylist: + del widget.__dict__[key] + return nextwidget + + +def update_link(page: pymupdf.Page, lnk: dict) -> None: + """Update a link on the current page.""" + pymupdf.CheckParent(page) + annot = getLinkText(page, lnk) + if annot == "": + raise ValueError("link kind not supported") + + page.parent.update_object(lnk["xref"], annot, page=page) + + +def insert_link(page: pymupdf.Page, lnk: dict, mark: bool = True) -> None: + """Insert a new link for the current page.""" + pymupdf.CheckParent(page) + annot = getLinkText(page, lnk) + if annot == "": + raise ValueError("link kind not supported") + page._addAnnot_FromString((annot,)) + + +def insert_textbox( + page: pymupdf.Page, + rect: rect_like, + buffer: 
typing.Union[str, list], + *, + fontname: str = "helv", + fontfile: OptStr = None, + set_simple: int = 0, + encoding: int = 0, + fontsize: float = 11, + lineheight: OptFloat = None, + color: OptSeq = None, + fill: OptSeq = None, + expandtabs: int = 1, + align: int = 0, + rotate: int = 0, + render_mode: int = 0, + miter_limit: float = 1, + border_width: float = 0.05, + morph: OptSeq = None, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> float: + """Insert text into a given rectangle. + + Notes: + Creates a Shape object, uses its same-named method and commits it. + Parameters: + rect: (rect-like) area to use for text. + buffer: text to be inserted + fontname: a Base-14 font, font name or '/name' + fontfile: name of a font file + fontsize: font size + lineheight: overwrite the font property + color: RGB color triple + expandtabs: handles tabulators with string function + align: left, center, right, justified + rotate: 0, 90, 180, or 270 degrees + morph: morph box with a matrix and a fixpoint + overlay: put text in foreground or background + Returns: + unused or deficit rectangle area (float) + """ + img = page.new_shape() + rc = img.insert_textbox( + rect, + buffer, + fontsize=fontsize, + lineheight=lineheight, + fontname=fontname, + fontfile=fontfile, + set_simple=set_simple, + encoding=encoding, + color=color, + fill=fill, + expandtabs=expandtabs, + render_mode=render_mode, + miter_limit=miter_limit, + border_width=border_width, + align=align, + rotate=rotate, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + if rc >= 0: + img.commit(overlay) + return rc + + +def insert_text( + page: pymupdf.Page, + point: point_like, + text: typing.Union[str, list], + *, + fontsize: float = 11, + lineheight: OptFloat = None, + fontname: str = "helv", + fontfile: OptStr = None, + set_simple: int = 0, + encoding: int = 0, + color: OptSeq = None, + fill: OptSeq = None, + border_width: float = 0.05, + miter_limit: float = 1, + render_mode: int = 0, + rotate: int = 0, + morph: OptSeq = None, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +): + + img = page.new_shape() + rc = img.insert_text( + point, + text, + fontsize=fontsize, + lineheight=lineheight, + fontname=fontname, + fontfile=fontfile, + set_simple=set_simple, + encoding=encoding, + color=color, + fill=fill, + border_width=border_width, + render_mode=render_mode, + miter_limit=miter_limit, + rotate=rotate, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + if rc >= 0: + img.commit(overlay) + return rc + + +def insert_htmlbox( + page, + rect, + text, + *, + css=None, + scale_low=0, + archive=None, + rotate=0, + oc=0, + opacity=1, + overlay=True, +) -> float: + """Insert text with optional HTML tags and stylings into a rectangle. + + Args: + rect: (rect-like) rectangle into which the text should be placed. + text: (str) text with optional HTML tags and stylings. + css: (str) CSS styling commands. + scale_low: (float) force-fit content by scaling it down. Must be in + range [0, 1]. If 1, no scaling will take place. If 0, arbitrary + down-scaling is acceptable. A value of 0.1 would mean that content + may be scaled down by at most 90%. + archive: Archive object pointing to locations of used fonts or images + rotate: (int) rotate the text in the box by a multiple of 90 degrees. + oc: (int) the xref of an OCG / OCMD (Optional Content). 
+ opacity: (float) set opacity of inserted content. + overlay: (bool) put text on top of page content. + Returns: + A tuple of floats (spare_height, scale). + spare_height: -1 if content did not fit, else >= 0. It is the height of the + unused (still available) rectangle stripe. Positive only if + scale_min = 1 (no down scaling). + scale: downscaling factor, 0 < scale <= 1. Set to 0 if spare_height = -1 (no fit). + """ + + # normalize rotation angle + if not rotate % 90 == 0: + raise ValueError("bad rotation angle") + while rotate < 0: + rotate += 360 + while rotate >= 360: + rotate -= 360 + + if not 0 <= scale_low <= 1: + raise ValueError("'scale_low' must be in [0, 1]") + + if css is None: + css = "" + + rect = pymupdf.Rect(rect) + if rotate in (90, 270): + temp_rect = pymupdf.Rect(0, 0, rect.height, rect.width) + else: + temp_rect = pymupdf.Rect(0, 0, rect.width, rect.height) + + # use a small border by default + mycss = "body {margin:1px;}" + css # append user CSS + + # either make a story, or accept a given one + if isinstance(text, str): # if a string, convert to a Story + story = pymupdf.Story(html=text, user_css=mycss, archive=archive) + elif isinstance(text, pymupdf.Story): + story = text + else: + raise ValueError("'text' must be a string or a Story") + # ---------------------------------------------------------------- + # Find a scaling factor that lets our story fit in + # ---------------------------------------------------------------- + scale_max = None if scale_low == 0 else 1 / scale_low + + fit = story.fit_scale(temp_rect, scale_min=1, scale_max=scale_max) + if not fit.big_enough: # there was no fit + return (-1, scale_low) + + filled = fit.filled + scale = 1 / fit.parameter # shrink factor + + spare_height = fit.rect.y1 - filled[3] # unused room at rectangle bottom + # Note: due to MuPDF's logic this may be negative even for successful fits. + if scale != 1 or spare_height < 0: # if scaling occurred, set spare_height to 0 + spare_height = 0 + + def rect_function(*args): + return fit.rect, fit.rect, pymupdf.Identity + + # draw story on temp PDF page + doc = story.write_with_links(rect_function) + + # Insert opacity if requested. + # For this, we prepend a command to the /Contents. 
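+    # A note on the mechanism: an /ExtGState resource carrying the stroke and
+    # fill alpha (CA / ca) is created on the temporary page, and a "gs"
+    # operator referencing it is prepended to that page's /Contents, so the
+    # whole story output is rendered with the requested opacity. A value of 1
+    # (fully opaque) needs no such treatment and is skipped below.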
+ if 0 <= opacity < 1: + tpage = doc[0] # load page + # generate /ExtGstate for the page + alp0 = tpage._set_opacity(CA=opacity, ca=opacity) + s = f"/{alp0} gs\n" # generate graphic state command + pymupdf.TOOLS._insert_contents(tpage, s.encode(), 0) + + # put result in target page + page.show_pdf_page(rect, doc, 0, rotate=rotate, oc=oc, overlay=overlay) + + # ------------------------------------------------------------------------- + # re-insert links in target rect (show_pdf_page cannot copy annotations) + # ------------------------------------------------------------------------- + # scaled center point of fit.rect + mp1 = (fit.rect.tl + fit.rect.br) / 2 * scale + + # center point of target rect + mp2 = (rect.tl + rect.br) / 2 + + # compute link positioning matrix: + # - move center of scaled-down fit.rect to (0,0) + # - rotate + # - move (0,0) to center of target rect + mat = ( + pymupdf.Matrix(scale, 0, 0, scale, -mp1.x, -mp1.y) + * pymupdf.Matrix(-rotate) + * pymupdf.Matrix(1, 0, 0, 1, mp2.x, mp2.y) + ) + + # copy over links + for link in doc[0].get_links(): + link["from"] *= mat + page.insert_link(link) + + return spare_height, scale + + +def new_page( + doc: pymupdf.Document, + pno: int = -1, + width: float = 595, + height: float = 842, +) -> pymupdf.Page: + """Create and return a new page object. + + Args: + pno: (int) insert before this page. Default: after last page. + width: (float) page width in points. Default: 595 (ISO A4 width). + height: (float) page height in points. Default 842 (ISO A4 height). + Returns: + A pymupdf.Page object. + """ + doc._newPage(pno, width=width, height=height) + return doc[pno] + + +def insert_page( + doc: pymupdf.Document, + pno: int, + text: typing.Union[str, list, None] = None, + fontsize: float = 11, + width: float = 595, + height: float = 842, + fontname: str = "helv", + fontfile: OptStr = None, + color: OptSeq = (0,), +) -> int: + """Create a new PDF page and insert some text. + + Notes: + Function combining pymupdf.Document.new_page() and pymupdf.Page.insert_text(). + For parameter details see these methods. 
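+
+    Example:
+        Illustrative sketch only; assumes 'doc' is an open pymupdf.Document:
+
+            doc.insert_page(-1, text="Hello PyMuPDF", fontsize=14)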
+ """ + page = doc.new_page(pno=pno, width=width, height=height) + if not bool(text): + return 0 + rc = page.insert_text( + (50, 72), + text, + fontsize=fontsize, + fontname=fontname, + fontfile=fontfile, + color=color, + ) + return rc + + +def draw_line( + page: pymupdf.Page, + p1: point_like, + p2: point_like, + color: OptSeq = (0,), + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + morph: OptSeq = None, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc=0, +) -> pymupdf.Point: + """Draw a line from point p1 to point p2.""" + img = page.new_shape() + p = img.draw_line(pymupdf.Point(p1), pymupdf.Point(p2)) + img.finish( + color=color, + dashes=dashes, + width=width, + closePath=False, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return p + + +def draw_squiggle( + page: pymupdf.Page, + p1: point_like, + p2: point_like, + breadth: float = 2, + color: OptSeq = (0,), + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + morph: OptSeq = None, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> pymupdf.Point: + """Draw a squiggly line from point p1 to point p2.""" + img = page.new_shape() + p = img.draw_squiggle(pymupdf.Point(p1), pymupdf.Point(p2), breadth=breadth) + img.finish( + color=color, + dashes=dashes, + width=width, + closePath=False, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return p + + +def draw_zigzag( + page: pymupdf.Page, + p1: point_like, + p2: point_like, + breadth: float = 2, + color: OptSeq = (0,), + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + morph: OptSeq = None, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> pymupdf.Point: + """Draw a zigzag line from point p1 to point p2.""" + img = page.new_shape() + p = img.draw_zigzag(pymupdf.Point(p1), pymupdf.Point(p2), breadth=breadth) + img.finish( + color=color, + dashes=dashes, + width=width, + closePath=False, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return p + + +def draw_rect( + page: pymupdf.Page, + rect: rect_like, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + morph: OptSeq = None, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, + radius=None, + ) -> pymupdf.Point: + ''' + Draw a rectangle. See Shape class method for details. 
+ ''' + img = page.new_shape() + Q = img.draw_rect(pymupdf.Rect(rect), radius=radius) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_quad( + page: pymupdf.Page, + quad: quad_like, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + morph: OptSeq = None, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> pymupdf.Point: + """Draw a quadrilateral.""" + img = page.new_shape() + Q = img.draw_quad(pymupdf.Quad(quad)) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_polyline( + page: pymupdf.Page, + points: list, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + morph: OptSeq = None, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + closePath: bool = False, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> pymupdf.Point: + """Draw multiple connected line segments.""" + img = page.new_shape() + Q = img.draw_polyline(points) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + closePath=closePath, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_circle( + page: pymupdf.Page, + center: point_like, + radius: float, + color: OptSeq = (0,), + fill: OptSeq = None, + morph: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> pymupdf.Point: + """Draw a circle given its center and radius.""" + img = page.new_shape() + Q = img.draw_circle(pymupdf.Point(center), radius) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + return Q + + +def draw_oval( + page: pymupdf.Page, + rect: typing.Union[rect_like, quad_like], + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + morph: OptSeq = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> pymupdf.Point: + """Draw an oval given its containing rectangle or quad.""" + img = page.new_shape() + Q = img.draw_oval(rect) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_curve( + page: pymupdf.Page, + p1: point_like, + p2: point_like, + p3: point_like, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + morph: OptSeq = None, + closePath: bool = False, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> pymupdf.Point: + 
"""Draw a special Bezier curve from p1 to p3, generating control points on lines p1 to p2 and p2 to p3.""" + img = page.new_shape() + Q = img.draw_curve(pymupdf.Point(p1), pymupdf.Point(p2), pymupdf.Point(p3)) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + closePath=closePath, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_bezier( + page: pymupdf.Page, + p1: point_like, + p2: point_like, + p3: point_like, + p4: point_like, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + morph: OptStr = None, + closePath: bool = False, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> pymupdf.Point: + """Draw a general cubic Bezier curve from p1 to p4 using control points p2 and p3.""" + img = page.new_shape() + Q = img.draw_bezier(pymupdf.Point(p1), pymupdf.Point(p2), pymupdf.Point(p3), pymupdf.Point(p4)) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + closePath=closePath, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_sector( + page: pymupdf.Page, + center: point_like, + point: point_like, + beta: float, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + fullSector: bool = True, + morph: OptSeq = None, + width: float = 1, + closePath: bool = False, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> pymupdf.Point: + """Draw a circle sector given circle center, one arc end point and the angle of the arc. + + Parameters: + center -- center of circle + point -- arc end point + beta -- angle of arc (degrees) + fullSector -- connect arc ends with center + """ + img = page.new_shape() + Q = img.draw_sector(pymupdf.Point(center), pymupdf.Point(point), beta, fullSector=fullSector) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + closePath=closePath, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +# ---------------------------------------------------------------------- +# Name: wx.lib.colourdb.py +# Purpose: Adds a bunch of colour names and RGB values to the +# colour database so they can be found by name +# +# Author: Robin Dunn +# +# Created: 13-March-2001 +# Copyright: (c) 2001-2017 by Total Control Software +# Licence: wxWindows license +# Tags: phoenix-port, unittest, documented +# ---------------------------------------------------------------------- + + +def getColorList() -> list: + """ + Returns a list of upper-case colour names. + :rtype: list of strings + """ + return [name for name, r, g, b in pymupdf.colors_wx_list()] + + +def getColorInfoList() -> list: + """ + Returns list of (name, red, gree, blue) tuples, where: + name: upper-case color name. + read, green, blue: integers in range 0..255. + :rtype: list of tuples + """ + return pymupdf.colors_wx_list() + + +def getColor(name: str) -> tuple: + """Retrieve RGB color in PDF format by name. + + Returns: + a triple of floats in range 0 to 1. In case of name-not-found, "white" is returned. 
+ """ + return pymupdf.colors_pdf_dict().get(name.lower(), (1, 1, 1)) + + +def getColorHSV(name: str) -> tuple: + """Retrieve the hue, saturation, value triple of a color name. + + Returns: + a triple (degree, percent, percent). If not found (-1, -1, -1) is returned. + """ + try: + x = getColorInfoList()[getColorList().index(name.upper())] + except Exception: + if g_exceptions_verbose: pymupdf.exception_info() + return (-1, -1, -1) + + r = x[1] / 255.0 + g = x[2] / 255.0 + b = x[3] / 255.0 + cmax = max(r, g, b) + V = round(cmax * 100, 1) + cmin = min(r, g, b) + delta = cmax - cmin + if delta == 0: + hue = 0 + elif cmax == r: + hue = 60.0 * (((g - b) / delta) % 6) + elif cmax == g: + hue = 60.0 * (((b - r) / delta) + 2) + else: + hue = 60.0 * (((r - g) / delta) + 4) + + H = int(round(hue)) + + if cmax == 0: + sat = 0 + else: + sat = delta / cmax + S = int(round(sat * 100)) + + return (H, S, V) + + +def _get_font_properties(doc: pymupdf.Document, xref: int) -> tuple: + fontname, ext, stype, buffer = doc.extract_font(xref) + asc = 0.8 + dsc = -0.2 + if ext == "": + return fontname, ext, stype, asc, dsc + + if buffer: + try: + font = pymupdf.Font(fontbuffer=buffer) + asc = font.ascender + dsc = font.descender + bbox = font.bbox + if asc - dsc < 1: + if bbox.y0 < dsc: + dsc = bbox.y0 + asc = 1 - dsc + except Exception: + pymupdf.exception_info() + asc *= 1.2 + dsc *= 1.2 + return fontname, ext, stype, asc, dsc + if ext != "n/a": + try: + font = pymupdf.Font(fontname) + asc = font.ascender + dsc = font.descender + except Exception: + pymupdf.exception_info() + asc *= 1.2 + dsc *= 1.2 + else: + asc *= 1.2 + dsc *= 1.2 + return fontname, ext, stype, asc, dsc + + +def get_char_widths( + doc: pymupdf.Document, xref: int, limit: int = 256, idx: int = 0, fontdict: OptDict = None +) -> list: + """Get list of glyph information of a font. + + Notes: + Must be provided by its XREF number. If we already dealt with the + font, it will be recorded in doc.FontInfos. Otherwise we insert an + entry there. + Finally we return the glyphs for the font. This is a list of + (glyph, width) where glyph is an integer controlling the char + appearance, and width is a float controlling the char's spacing: + width * fontsize is the actual space. + For 'simple' fonts, glyph == ord(char) will usually be true. + Exceptions are 'Symbol' and 'ZapfDingbats'. We are providing data for these directly here. 
+ """ + fontinfo = pymupdf.CheckFontInfo(doc, xref) + if fontinfo is None: # not recorded yet: create it + if fontdict is None: + name, ext, stype, asc, dsc = _get_font_properties(doc, xref) + fontdict = { + "name": name, + "type": stype, + "ext": ext, + "ascender": asc, + "descender": dsc, + } + else: + name = fontdict["name"] + ext = fontdict["ext"] + stype = fontdict["type"] + ordering = fontdict["ordering"] + simple = fontdict["simple"] + + if ext == "": + raise ValueError("xref is not a font") + + # check for 'simple' fonts + if stype in ("Type1", "MMType1", "TrueType"): + simple = True + else: + simple = False + + # check for CJK fonts + if name in ("Fangti", "Ming"): + ordering = 0 + elif name in ("Heiti", "Song"): + ordering = 1 + elif name in ("Gothic", "Mincho"): + ordering = 2 + elif name in ("Dotum", "Batang"): + ordering = 3 + else: + ordering = -1 + + fontdict["simple"] = simple + + if name == "ZapfDingbats": + glyphs = pymupdf.zapf_glyphs + elif name == "Symbol": + glyphs = pymupdf.symbol_glyphs + else: + glyphs = None + + fontdict["glyphs"] = glyphs + fontdict["ordering"] = ordering + fontinfo = [xref, fontdict] + doc.FontInfos.append(fontinfo) + else: + fontdict = fontinfo[1] + glyphs = fontdict["glyphs"] + simple = fontdict["simple"] + ordering = fontdict["ordering"] + + if glyphs is None: + oldlimit = 0 + else: + oldlimit = len(glyphs) + + mylimit = max(256, limit) + + if mylimit <= oldlimit: + return glyphs + + if ordering < 0: # not a CJK font + glyphs = doc._get_char_widths( + xref, fontdict["name"], fontdict["ext"], fontdict["ordering"], mylimit, idx + ) + else: # CJK fonts use char codes and width = 1 + glyphs = None + + fontdict["glyphs"] = glyphs + fontinfo[1] = fontdict + pymupdf.UpdateFontInfo(doc, fontinfo) + + return glyphs + + +class Shape: + """Create a new shape.""" + + @staticmethod + def horizontal_angle(C, P): + """Return the angle to the horizontal for the connection from C to P. + This uses the arcus sine function and resolves its inherent ambiguity by + looking up in which quadrant vector S = P - C is located. + """ + S = pymupdf.Point(P - C).unit # unit vector 'C' -> 'P' + alfa = math.asin(abs(S.y)) # absolute angle from horizontal + if S.x < 0: # make arcsin result unique + if S.y <= 0: # bottom-left + alfa = -(math.pi - alfa) + else: # top-left + alfa = math.pi - alfa + else: + if S.y >= 0: # top-right + pass + else: # bottom-right + alfa = -alfa + return alfa + + def __init__(self, page: pymupdf.Page): + pymupdf.CheckParent(page) + self.page = page + self.doc = page.parent + if not self.doc.is_pdf: + raise ValueError("is no PDF") + self.height = page.mediabox_size.y + self.width = page.mediabox_size.x + self.x = page.cropbox_position.x + self.y = page.cropbox_position.y + + self.pctm = page.transformation_matrix # page transf. matrix + self.ipctm = ~self.pctm # inverted transf. 
matrix + + self.draw_cont = "" + self.text_cont = "" + self.totalcont = "" + self.last_point = None + self.rect = None + + def updateRect(self, x): + if self.rect is None: + if len(x) == 2: + self.rect = pymupdf.Rect(x, x) + else: + self.rect = pymupdf.Rect(x) + + else: + if len(x) == 2: + x = pymupdf.Point(x) + self.rect.x0 = min(self.rect.x0, x.x) + self.rect.y0 = min(self.rect.y0, x.y) + self.rect.x1 = max(self.rect.x1, x.x) + self.rect.y1 = max(self.rect.y1, x.y) + else: + x = pymupdf.Rect(x) + self.rect.x0 = min(self.rect.x0, x.x0) + self.rect.y0 = min(self.rect.y0, x.y0) + self.rect.x1 = max(self.rect.x1, x.x1) + self.rect.y1 = max(self.rect.y1, x.y1) + + def draw_line(self, p1: point_like, p2: point_like) -> pymupdf.Point: + """Draw a line between two points.""" + p1 = pymupdf.Point(p1) + p2 = pymupdf.Point(p2) + if not (self.last_point == p1): + self.draw_cont += _format_g(pymupdf.JM_TUPLE(p1 * self.ipctm)) + " m\n" + self.last_point = p1 + self.updateRect(p1) + + self.draw_cont += _format_g(pymupdf.JM_TUPLE(p2 * self.ipctm)) + " l\n" + self.updateRect(p2) + self.last_point = p2 + return self.last_point + + def draw_polyline(self, points: list) -> pymupdf.Point: + """Draw several connected line segments.""" + for i, p in enumerate(points): + if i == 0: + if not (self.last_point == pymupdf.Point(p)): + self.draw_cont += _format_g(pymupdf.JM_TUPLE(pymupdf.Point(p) * self.ipctm)) + " m\n" + self.last_point = pymupdf.Point(p) + else: + self.draw_cont += _format_g(pymupdf.JM_TUPLE(pymupdf.Point(p) * self.ipctm)) + " l\n" + self.updateRect(p) + + self.last_point = pymupdf.Point(points[-1]) + return self.last_point + + def draw_bezier( + self, + p1: point_like, + p2: point_like, + p3: point_like, + p4: point_like, + ) -> pymupdf.Point: + """Draw a standard cubic Bezier curve.""" + p1 = pymupdf.Point(p1) + p2 = pymupdf.Point(p2) + p3 = pymupdf.Point(p3) + p4 = pymupdf.Point(p4) + if not (self.last_point == p1): + self.draw_cont += _format_g(pymupdf.JM_TUPLE(p1 * self.ipctm)) + " m\n" + args = pymupdf.JM_TUPLE(list(p2 * self.ipctm) + list(p3 * self.ipctm) + list(p4 * self.ipctm)) + self.draw_cont += _format_g(args) + " c\n" + self.updateRect(p1) + self.updateRect(p2) + self.updateRect(p3) + self.updateRect(p4) + self.last_point = p4 + return self.last_point + + def draw_oval(self, tetra: typing.Union[quad_like, rect_like]) -> pymupdf.Point: + """Draw an ellipse inside a tetrapod.""" + if len(tetra) != 4: + raise ValueError("invalid arg length") + if hasattr(tetra[0], "__float__"): + q = pymupdf.Rect(tetra).quad + else: + q = pymupdf.Quad(tetra) + + mt = q.ul + (q.ur - q.ul) * 0.5 + mr = q.ur + (q.lr - q.ur) * 0.5 + mb = q.ll + (q.lr - q.ll) * 0.5 + ml = q.ul + (q.ll - q.ul) * 0.5 + if not (self.last_point == ml): + self.draw_cont += _format_g(pymupdf.JM_TUPLE(ml * self.ipctm)) + " m\n" + self.last_point = ml + self.draw_curve(ml, q.ll, mb) + self.draw_curve(mb, q.lr, mr) + self.draw_curve(mr, q.ur, mt) + self.draw_curve(mt, q.ul, ml) + self.updateRect(q.rect) + self.last_point = ml + return self.last_point + + def draw_circle(self, center: point_like, radius: float) -> pymupdf.Point: + """Draw a circle given its center and radius.""" + if not radius > pymupdf.EPSILON: + raise ValueError("radius must be positive") + center = pymupdf.Point(center) + p1 = center - (radius, 0) + return self.draw_sector(center, p1, 360, fullSector=False) + + def draw_curve( + self, + p1: point_like, + p2: point_like, + p3: point_like, + ) -> pymupdf.Point: + """Draw a curve between points using one control 
point.""" + kappa = 0.55228474983 + p1 = pymupdf.Point(p1) + p2 = pymupdf.Point(p2) + p3 = pymupdf.Point(p3) + k1 = p1 + (p2 - p1) * kappa + k2 = p3 + (p2 - p3) * kappa + return self.draw_bezier(p1, k1, k2, p3) + + def draw_sector( + self, + center: point_like, + point: point_like, + beta: float, + fullSector: bool = True, + ) -> pymupdf.Point: + """Draw a circle sector.""" + center = pymupdf.Point(center) + point = pymupdf.Point(point) + l3 = lambda a, b: _format_g((a, b)) + " m\n" + l4 = lambda a, b, c, d, e, f: _format_g((a, b, c, d, e, f)) + " c\n" + l5 = lambda a, b: _format_g((a, b)) + " l\n" + betar = math.radians(-beta) + w360 = math.radians(math.copysign(360, betar)) * (-1) + w90 = math.radians(math.copysign(90, betar)) + w45 = w90 / 2 + while abs(betar) > 2 * math.pi: + betar += w360 # bring angle below 360 degrees + if not (self.last_point == point): + self.draw_cont += l3(*pymupdf.JM_TUPLE(point * self.ipctm)) + self.last_point = point + Q = pymupdf.Point(0, 0) # just make sure it exists + C = center + P = point + S = P - C # vector 'center' -> 'point' + rad = abs(S) # circle radius + + if not rad > pymupdf.EPSILON: + raise ValueError("radius must be positive") + + alfa = self.horizontal_angle(center, point) + while abs(betar) > abs(w90): # draw 90 degree arcs + q1 = C.x + math.cos(alfa + w90) * rad + q2 = C.y + math.sin(alfa + w90) * rad + Q = pymupdf.Point(q1, q2) # the arc's end point + r1 = C.x + math.cos(alfa + w45) * rad / math.cos(w45) + r2 = C.y + math.sin(alfa + w45) * rad / math.cos(w45) + R = pymupdf.Point(r1, r2) # crossing point of tangents + kappah = (1 - math.cos(w45)) * 4 / 3 / abs(R - Q) + kappa = kappah * abs(P - Q) + cp1 = P + (R - P) * kappa # control point 1 + cp2 = Q + (R - Q) * kappa # control point 2 + self.draw_cont += l4(*pymupdf.JM_TUPLE( + list(cp1 * self.ipctm) + list(cp2 * self.ipctm) + list(Q * self.ipctm) + )) + + betar -= w90 # reduce param angle by 90 deg + alfa += w90 # advance start angle by 90 deg + P = Q # advance to arc end point + # draw (remaining) arc + if abs(betar) > 1e-3: # significant degrees left? + beta2 = betar / 2 + q1 = C.x + math.cos(alfa + betar) * rad + q2 = C.y + math.sin(alfa + betar) * rad + Q = pymupdf.Point(q1, q2) # the arc's end point + r1 = C.x + math.cos(alfa + beta2) * rad / math.cos(beta2) + r2 = C.y + math.sin(alfa + beta2) * rad / math.cos(beta2) + R = pymupdf.Point(r1, r2) # crossing point of tangents + # kappa height is 4/3 of segment height + kappah = (1 - math.cos(beta2)) * 4 / 3 / abs(R - Q) # kappa height + kappa = kappah * abs(P - Q) / (1 - math.cos(betar)) + cp1 = P + (R - P) * kappa # control point 1 + cp2 = Q + (R - Q) * kappa # control point 2 + self.draw_cont += l4(*pymupdf.JM_TUPLE( + list(cp1 * self.ipctm) + list(cp2 * self.ipctm) + list(Q * self.ipctm) + )) + if fullSector: + self.draw_cont += l3(*pymupdf.JM_TUPLE(point * self.ipctm)) + self.draw_cont += l5(*pymupdf.JM_TUPLE(center * self.ipctm)) + self.draw_cont += l5(*pymupdf.JM_TUPLE(Q * self.ipctm)) + self.last_point = Q + return self.last_point + + def draw_rect(self, rect: rect_like, *, radius=None) -> pymupdf.Point: + """Draw a rectangle. + + Args: + radius: if not None, the rectangle will have rounded corners. + This is the radius of the curvature, given as percentage of + the rectangle width or height. Valid are values 0 < v <= 0.5. + For a sequence of two values, the corners will have different + radii. Otherwise, the percentage will be computed from the + shorter side. A value of (0.5, 0.5) will draw an ellipse. 
+ """ + r = pymupdf.Rect(rect) + if radius is None: # standard rectangle + self.draw_cont += _format_g(pymupdf.JM_TUPLE( + list(r.bl * self.ipctm) + [r.width, r.height] + )) + " re\n" + self.updateRect(r) + self.last_point = r.tl + return self.last_point + # rounded corners requested. This requires 1 or 2 values, each + # with 0 < value <= 0.5 + if hasattr(radius, "__float__"): + if radius <= 0 or radius > 0.5: + raise ValueError(f"bad radius value {radius}.") + d = min(r.width, r.height) * radius + px = (d, 0) + py = (0, d) + elif hasattr(radius, "__len__") and len(radius) == 2: + rx, ry = radius + px = (rx * r.width, 0) + py = (0, ry * r.height) + if min(rx, ry) <= 0 or max(rx, ry) > 0.5: + raise ValueError(f"bad radius value {radius}.") + else: + raise ValueError(f"bad radius value {radius}.") + + lp = self.draw_line(r.tl + py, r.bl - py) + lp = self.draw_curve(lp, r.bl, r.bl + px) + + lp = self.draw_line(lp, r.br - px) + lp = self.draw_curve(lp, r.br, r.br - py) + + lp = self.draw_line(lp, r.tr + py) + lp = self.draw_curve(lp, r.tr, r.tr - px) + + lp = self.draw_line(lp, r.tl + px) + self.last_point = self.draw_curve(lp, r.tl, r.tl + py) + + self.updateRect(r) + return self.last_point + + def draw_quad(self, quad: quad_like) -> pymupdf.Point: + """Draw a Quad.""" + q = pymupdf.Quad(quad) + return self.draw_polyline([q.ul, q.ll, q.lr, q.ur, q.ul]) + + def draw_zigzag( + self, + p1: point_like, + p2: point_like, + breadth: float = 2, + ) -> pymupdf.Point: + """Draw a zig-zagged line from p1 to p2.""" + p1 = pymupdf.Point(p1) + p2 = pymupdf.Point(p2) + S = p2 - p1 # vector start - end + rad = abs(S) # distance of points + cnt = 4 * int(round(rad / (4 * breadth), 0)) # always take full phases + if cnt < 4: + raise ValueError("points too close") + mb = rad / cnt # revised breadth + matrix = pymupdf.Matrix(pymupdf.util_hor_matrix(p1, p2)) # normalize line to x-axis + i_mat = ~matrix # get original position + points = [] # stores edges + for i in range(1, cnt): + if i % 4 == 1: # point "above" connection + p = pymupdf.Point(i, -1) * mb + elif i % 4 == 3: # point "below" connection + p = pymupdf.Point(i, 1) * mb + else: # ignore others + continue + points.append(p * i_mat) + self.draw_polyline([p1] + points + [p2]) # add start and end points + return p2 + + def draw_squiggle( + self, + p1: point_like, + p2: point_like, + breadth=2, + ) -> pymupdf.Point: + """Draw a squiggly line from p1 to p2.""" + p1 = pymupdf.Point(p1) + p2 = pymupdf.Point(p2) + S = p2 - p1 # vector start - end + rad = abs(S) # distance of points + cnt = 4 * int(round(rad / (4 * breadth), 0)) # always take full phases + if cnt < 4: + raise ValueError("points too close") + mb = rad / cnt # revised breadth + matrix = pymupdf.Matrix(pymupdf.util_hor_matrix(p1, p2)) # normalize line to x-axis + i_mat = ~matrix # get original position + k = 2.4142135623765633 # y of draw_curve helper point + + points = [] # stores edges + for i in range(1, cnt): + if i % 4 == 1: # point "above" connection + p = pymupdf.Point(i, -k) * mb + elif i % 4 == 3: # point "below" connection + p = pymupdf.Point(i, k) * mb + else: # else on connection line + p = pymupdf.Point(i, 0) * mb + points.append(p * i_mat) + + points = [p1] + points + [p2] + cnt = len(points) + i = 0 + while i + 2 < cnt: + self.draw_curve(points[i], points[i + 1], points[i + 2]) + i += 2 + return p2 + + # ============================================================================== + # Shape.insert_text + # 
============================================================================== + def insert_text( + self, + point: point_like, + buffer: typing.Union[str, list], + *, + fontsize: float = 11, + lineheight: OptFloat = None, + fontname: str = "helv", + fontfile: OptStr = None, + set_simple: bool = 0, + encoding: int = 0, + color: OptSeq = None, + fill: OptSeq = None, + render_mode: int = 0, + border_width: float = 0.05, + miter_limit: float = 1, + rotate: int = 0, + morph: OptSeq = None, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, + ) -> int: + + # ensure 'text' is a list of strings, worth dealing with + if not bool(buffer): + return 0 + + if type(buffer) not in (list, tuple): + text = buffer.splitlines() + else: + text = buffer + + if not len(text) > 0: + return 0 + + point = pymupdf.Point(point) + try: + maxcode = max([ord(c) for c in " ".join(text)]) + except Exception: + pymupdf.exception_info() + return 0 + + # ensure valid 'fontname' + fname = fontname + if fname.startswith("/"): + fname = fname[1:] + + xref = self.page.insert_font( + fontname=fname, fontfile=fontfile, encoding=encoding, set_simple=set_simple + ) + fontinfo = pymupdf.CheckFontInfo(self.doc, xref) + + fontdict = fontinfo[1] + ordering = fontdict["ordering"] + simple = fontdict["simple"] + bfname = fontdict["name"] + ascender = fontdict["ascender"] + descender = fontdict["descender"] + if lineheight: + lheight = fontsize * lineheight + elif ascender - descender <= 1: + lheight = fontsize * 1.2 + else: + lheight = fontsize * (ascender - descender) + + if maxcode > 255: + glyphs = self.doc.get_char_widths(xref, maxcode + 1) + else: + glyphs = fontdict["glyphs"] + + tab = [] + for t in text: + if simple and bfname not in ("Symbol", "ZapfDingbats"): + g = None + else: + g = glyphs + tab.append(pymupdf.getTJstr(t, g, simple, ordering)) + text = tab + + color_str = pymupdf.ColorCode(color, "c") + fill_str = pymupdf.ColorCode(fill, "f") + if not fill and render_mode == 0: # ensure fill color when 0 Tr + fill = color + fill_str = pymupdf.ColorCode(color, "f") + + morphing = pymupdf.CheckMorph(morph) + rot = rotate + if rot % 90 != 0: + raise ValueError("bad rotate value") + + while rot < 0: + rot += 360 + rot = rot % 360 # text rotate = 0, 90, 270, 180 + + templ1 = lambda a, b, c, d, e, f, g: f"\nq\n{a}{b}BT\n{c}1 0 0 1 {_format_g((d, e))} Tm\n/{f} {_format_g(g)} Tf " + templ2 = lambda a: f"TJ\n0 -{_format_g(a)} TD\n" + cmp90 = "0 1 -1 0 0 0 cm\n" # rotates 90 deg counter-clockwise + cmm90 = "0 -1 1 0 0 0 cm\n" # rotates 90 deg clockwise + cm180 = "-1 0 0 -1 0 0 cm\n" # rotates by 180 deg. + height = self.height + width = self.width + + # setting up for standard rotation directions + # case rotate = 0 + if morphing: + m1 = pymupdf.Matrix(1, 0, 0, 1, morph[0].x + self.x, height - morph[0].y - self.y) + mat = ~m1 * morph[1] * m1 + cm = _format_g(pymupdf.JM_TUPLE(mat)) + " cm\n" + else: + cm = "" + top = height - point.y - self.y # start of 1st char + left = point.x + self.x # start of 1. 
char + space = top # space available + #headroom = point.y + self.y # distance to page border + if rot == 90: + left = height - point.y - self.y + top = -point.x - self.x + cm += cmp90 + space = width - abs(top) + #headroom = point.x + self.x + + elif rot == 270: + left = -height + point.y + self.y + top = point.x + self.x + cm += cmm90 + space = abs(top) + #headroom = width - point.x - self.x + + elif rot == 180: + left = -point.x - self.x + top = -height + point.y + self.y + cm += cm180 + space = abs(point.y + self.y) + #headroom = height - point.y - self.y + + optcont = self.page._get_optional_content(oc) + if optcont is not None: + bdc = "/OC /%s BDC\n" % optcont + emc = "EMC\n" + else: + bdc = emc = "" + + alpha = self.page._set_opacity(CA=stroke_opacity, ca=fill_opacity) + if alpha is None: + alpha = "" + else: + alpha = "/%s gs\n" % alpha + nres = templ1(bdc, alpha, cm, left, top, fname, fontsize) + + if render_mode > 0: + nres += "%i Tr " % render_mode + nres += _format_g(border_width * fontsize) + " w " + if miter_limit is not None: + nres += _format_g(miter_limit) + " M " + if color is not None: + nres += color_str + if fill is not None: + nres += fill_str + + # ========================================================================= + # start text insertion + # ========================================================================= + nres += text[0] + nlines = 1 # set output line counter + if len(text) > 1: + nres += templ2(lheight) # line 1 + else: + nres += 'TJ' + for i in range(1, len(text)): + if space < lheight: + break # no space left on page + if i > 1: + nres += "\nT* " + nres += text[i] + 'TJ' + space -= lheight + nlines += 1 + + nres += "\nET\n%sQ\n" % emc + + # ========================================================================= + # end of text insertion + # ========================================================================= + # update the /Contents object + self.text_cont += nres + return nlines + + # ============================================================================== + # Shape.insert_textbox + # ============================================================================== + def insert_textbox( + self, + rect: rect_like, + buffer: typing.Union[str, list], + *, + fontname: OptStr = "helv", + fontfile: OptStr = None, + fontsize: float = 11, + lineheight: OptFloat = None, + set_simple: bool = 0, + encoding: int = 0, + color: OptSeq = None, + fill: OptSeq = None, + expandtabs: int = 1, + border_width: float = 0.05, + miter_limit: float = 1, + align: int = 0, + render_mode: int = 0, + rotate: int = 0, + morph: OptSeq = None, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, + ) -> float: + """Insert text into a given rectangle. 
+ + Args: + rect -- the textbox to fill + buffer -- text to be inserted + fontname -- a Base-14 font, font name or '/name' + fontfile -- name of a font file + fontsize -- font size + lineheight -- overwrite the font property + color -- RGB stroke color triple + fill -- RGB fill color triple + render_mode -- text rendering control + border_width -- thickness of glyph borders as percentage of fontsize + expandtabs -- handles tabulators with string function + align -- left, center, right, justified + rotate -- 0, 90, 180, or 270 degrees + morph -- morph box with a matrix and a fixpoint + Returns: + unused or deficit rectangle area (float) + """ + rect = pymupdf.Rect(rect) + if rect.is_empty or rect.is_infinite: + raise ValueError("text box must be finite and not empty") + + color_str = pymupdf.ColorCode(color, "c") + fill_str = pymupdf.ColorCode(fill, "f") + if fill is None and render_mode == 0: # ensure fill color for 0 Tr + fill = color + fill_str = pymupdf.ColorCode(color, "f") + + optcont = self.page._get_optional_content(oc) + if optcont is not None: + bdc = "/OC /%s BDC\n" % optcont + emc = "EMC\n" + else: + bdc = emc = "" + + # determine opacity / transparency + alpha = self.page._set_opacity(CA=stroke_opacity, ca=fill_opacity) + if alpha is None: + alpha = "" + else: + alpha = "/%s gs\n" % alpha + + if rotate % 90 != 0: + raise ValueError("rotate must be multiple of 90") + + rot = rotate + while rot < 0: + rot += 360 + rot = rot % 360 + + # is buffer worth of dealing with? + if not bool(buffer): + return rect.height if rot in (0, 180) else rect.width + + cmp90 = "0 1 -1 0 0 0 cm\n" # rotates counter-clockwise + cmm90 = "0 -1 1 0 0 0 cm\n" # rotates clockwise + cm180 = "-1 0 0 -1 0 0 cm\n" # rotates by 180 deg. + height = self.height + + fname = fontname + if fname.startswith("/"): + fname = fname[1:] + + xref = self.page.insert_font( + fontname=fname, fontfile=fontfile, encoding=encoding, set_simple=set_simple + ) + fontinfo = pymupdf.CheckFontInfo(self.doc, xref) + + fontdict = fontinfo[1] + ordering = fontdict["ordering"] + simple = fontdict["simple"] + glyphs = fontdict["glyphs"] + bfname = fontdict["name"] + ascender = fontdict["ascender"] + descender = fontdict["descender"] + + if lineheight: + lheight_factor = lineheight + elif ascender - descender <= 1: + lheight_factor = 1.2 + else: + lheight_factor = ascender - descender + lheight = fontsize * lheight_factor + + # create a list from buffer, split into its lines + if type(buffer) in (list, tuple): + t0 = "\n".join(buffer) + else: + t0 = buffer + + maxcode = max([ord(c) for c in t0]) + # replace invalid char codes for simple fonts + if simple and maxcode > 255: + t0 = "".join([c if ord(c) < 256 else "?" 
for c in t0]) + + t0 = t0.splitlines() + + glyphs = self.doc.get_char_widths(xref, maxcode + 1) + if simple and bfname not in ("Symbol", "ZapfDingbats"): + tj_glyphs = None + else: + tj_glyphs = glyphs + + # ---------------------------------------------------------------------- + # calculate pixel length of a string + # ---------------------------------------------------------------------- + def pixlen(x): + """Calculate pixel length of x.""" + if ordering < 0: + return sum([glyphs[ord(c)][1] for c in x]) * fontsize + else: + return len(x) * fontsize + + # --------------------------------------------------------------------- + + if ordering < 0: + blen = glyphs[32][1] * fontsize # pixel size of space character + else: + blen = fontsize + + text = "" # output buffer + + if pymupdf.CheckMorph(morph): + m1 = pymupdf.Matrix( + 1, 0, 0, 1, morph[0].x + self.x, self.height - morph[0].y - self.y + ) + mat = ~m1 * morph[1] * m1 + cm = _format_g(pymupdf.JM_TUPLE(mat)) + " cm\n" + else: + cm = "" + + # --------------------------------------------------------------------- + # adjust for text orientation / rotation + # --------------------------------------------------------------------- + progr = 1 # direction of line progress + c_pnt = pymupdf.Point(0, fontsize * ascender) # used for line progress + if rot == 0: # normal orientation + point = rect.tl + c_pnt # line 1 is 'lheight' below top + maxwidth = rect.width # pixels available in one line + maxheight = rect.height # available text height + + elif rot == 90: # rotate counter clockwise + c_pnt = pymupdf.Point(fontsize * ascender, 0) # progress in x-direction + point = rect.bl + c_pnt # line 1 'lheight' away from left + maxwidth = rect.height # pixels available in one line + maxheight = rect.width # available text height + cm += cmp90 + + elif rot == 180: # text upside down + # progress upwards in y direction + c_pnt = -pymupdf.Point(0, fontsize * ascender) + point = rect.br + c_pnt # line 1 'lheight' above bottom + maxwidth = rect.width # pixels available in one line + progr = -1 # subtract lheight for next line + maxheight =rect.height # available text height + cm += cm180 + + else: # rotate clockwise (270 or -90) + # progress from right to left + c_pnt = -pymupdf.Point(fontsize * ascender, 0) + point = rect.tr + c_pnt # line 1 'lheight' left of right + maxwidth = rect.height # pixels available in one line + progr = -1 # subtract lheight for next line + maxheight = rect.width # available text height + cm += cmm90 + + # ===================================================================== + # line loop + # ===================================================================== + just_tab = [] # 'justify' indicators per line + + for i, line in enumerate(t0): + line_t = line.expandtabs(expandtabs).split(" ") # split into words + num_words = len(line_t) + lbuff = "" # init line buffer + rest = maxwidth # available line pixels + # ================================================================= + # word loop + # ================================================================= + for j in range(num_words): + word = line_t[j] + pl_w = pixlen(word) # pixel len of word + if rest >= pl_w: # does it fit on the line? 
+ lbuff += word + " " # yes, append word + rest -= pl_w + blen # update available line space + continue # next word + + # word doesn't fit - output line (if not empty) + if lbuff: + lbuff = lbuff.rstrip() + "\n" # line full, append line break + text += lbuff # append to total text + just_tab.append(True) # can align-justify + + lbuff = "" # re-init line buffer + rest = maxwidth # re-init avail. space + + if pl_w <= maxwidth: # word shorter than 1 line? + lbuff = word + " " # start the line with it + rest = maxwidth - pl_w - blen # update free space + continue + + # long word: split across multiple lines - char by char ... + if len(just_tab) > 0: + just_tab[-1] = False # cannot align-justify + for c in word: + if pixlen(lbuff) <= maxwidth - pixlen(c): + lbuff += c + else: # line full + lbuff += "\n" # close line + text += lbuff # append to text + just_tab.append(False) # cannot align-justify + lbuff = c # start new line with this char + + lbuff += " " # finish long word + rest = maxwidth - pixlen(lbuff) # long word stored + + if lbuff: # unprocessed line content? + text += lbuff.rstrip() # append to text + just_tab.append(False) # cannot align-justify + + if i < len(t0) - 1: # not the last line? + text += "\n" # insert line break + + # compute used part of the textbox + if text.endswith("\n"): + text = text[:-1] + lb_count = text.count("\n") + 1 # number of lines written + + # text height = line count * line height plus one descender value + text_height = lheight * lb_count - descender * fontsize + + more = text_height - maxheight # difference to height limit + if more > pymupdf.EPSILON: # landed too much outside rect + return (-1) * more # return deficit, don't output + + more = abs(more) + if more < pymupdf.EPSILON: + more = 0 # don't bother with epsilons + nres = "\nq\n%s%sBT\n" % (bdc, alpha) + cm # initialize output buffer + templ = lambda a, b, c, d: f"1 0 0 1 {_format_g((a, b))} Tm /{c} {_format_g(d)} Tf " + # center, right, justify: output each line with its own specifics + text_t = text.splitlines() # split text in lines again + just_tab[-1] = False # never justify last line + for i, t in enumerate(text_t): + spacing = 0 + pl = maxwidth - pixlen(t) # length of empty line part + pnt = point + c_pnt * (i * lheight_factor) # text start of line + if align == 1: # center: right shift by half width + if rot in (0, 180): + pnt = pnt + pymupdf.Point(pl / 2, 0) * progr + else: + pnt = pnt - pymupdf.Point(0, pl / 2) * progr + elif align == 2: # right: right shift by full width + if rot in (0, 180): + pnt = pnt + pymupdf.Point(pl, 0) * progr + else: + pnt = pnt - pymupdf.Point(0, pl) * progr + elif align == 3: # justify + spaces = t.count(" ") # number of spaces in line + if spaces > 0 and just_tab[i]: # if any, and we may justify + spacing = pl / spaces # make every space this much larger + else: + spacing = 0 # keep normal space length + top = height - pnt.y - self.y + left = pnt.x + self.x + if rot == 90: + left = height - pnt.y - self.y + top = -pnt.x - self.x + elif rot == 270: + left = -height + pnt.y + self.y + top = pnt.x + self.x + elif rot == 180: + left = -pnt.x - self.x + top = -height + pnt.y + self.y + + nres += templ(left, top, fname, fontsize) + + if render_mode > 0: + nres += "%i Tr " % render_mode + nres += _format_g(border_width * fontsize) + " w " + if miter_limit is not None: + nres += _format_g(miter_limit) + " M " + + if align == 3: + nres += _format_g(spacing) + " Tw " + + if color is not None: + nres += color_str + if fill is not None: + nres += fill_str + nres += 
"%sTJ\n" % pymupdf.getTJstr(t, tj_glyphs, simple, ordering) + + nres += "ET\n%sQ\n" % emc + + self.text_cont += nres + self.updateRect(rect) + return more + + def finish( + self, + width: float = 1, + color: OptSeq = (0,), + fill: OptSeq = None, + lineCap: int = 0, + lineJoin: int = 0, + dashes: OptStr = None, + even_odd: bool = False, + morph: OptSeq = None, + closePath: bool = True, + fill_opacity: float = 1, + stroke_opacity: float = 1, + oc: int = 0, + ) -> None: + """Finish the current drawing segment. + + Notes: + Apply colors, opacity, dashes, line style and width, or + morphing. Also whether to close the path + by connecting last to first point. + """ + if self.draw_cont == "": # treat empty contents as no-op + return + + if width == 0: # border color makes no sense then + color = None + elif color is None: # vice versa + width = 0 + # if color == None and fill == None: + # raise ValueError("at least one of 'color' or 'fill' must be given") + color_str = pymupdf.ColorCode(color, "c") # ensure proper color string + fill_str = pymupdf.ColorCode(fill, "f") # ensure proper fill string + + optcont = self.page._get_optional_content(oc) + if optcont is not None: + self.draw_cont = "/OC /%s BDC\n" % optcont + self.draw_cont + emc = "EMC\n" + else: + emc = "" + + alpha = self.page._set_opacity(CA=stroke_opacity, ca=fill_opacity) + if alpha is not None: + self.draw_cont = "/%s gs\n" % alpha + self.draw_cont + + if width != 1 and width != 0: + self.draw_cont += _format_g(width) + " w\n" + + if lineCap != 0: + self.draw_cont = "%i J\n" % lineCap + self.draw_cont + if lineJoin != 0: + self.draw_cont = "%i j\n" % lineJoin + self.draw_cont + + if dashes not in (None, "", "[] 0"): + self.draw_cont = "%s d\n" % dashes + self.draw_cont + + if closePath: + self.draw_cont += "h\n" + self.last_point = None + + if color is not None: + self.draw_cont += color_str + + if fill is not None: + self.draw_cont += fill_str + if color is not None: + if not even_odd: + self.draw_cont += "B\n" + else: + self.draw_cont += "B*\n" + else: + if not even_odd: + self.draw_cont += "f\n" + else: + self.draw_cont += "f*\n" + else: + self.draw_cont += "S\n" + + self.draw_cont += emc + if pymupdf.CheckMorph(morph): + m1 = pymupdf.Matrix( + 1, 0, 0, 1, morph[0].x + self.x, self.height - morph[0].y - self.y + ) + mat = ~m1 * morph[1] * m1 + self.draw_cont = _format_g(pymupdf.JM_TUPLE(mat)) + " cm\n" + self.draw_cont + + self.totalcont += "\nq\n" + self.draw_cont + "Q\n" + self.draw_cont = "" + self.last_point = None + return + + def commit(self, overlay: bool = True) -> None: + """Update the page's /Contents object with Shape data. + + The argument controls whether data appear in foreground (default) + or background. + """ + pymupdf.CheckParent(self.page) # doc may have died meanwhile + self.totalcont += self.text_cont + self.totalcont = self.totalcont.encode() + + if self.totalcont: + if overlay: + self.page.wrap_contents() # ensure a balanced graphics state + # make /Contents object with dummy stream + xref = pymupdf.TOOLS._insert_contents(self.page, b" ", overlay) + # update it with potential compression + self.doc.update_stream(xref, self.totalcont) + + self.last_point = None # clean up ... + self.rect = None # + self.draw_cont = "" # for potential ... + self.text_cont = "" # ... + self.totalcont = "" # re-use + + +def apply_redactions( + page: pymupdf.Page, images: int = 2, graphics: int = 1, text: int = 0 +) -> bool: + """Apply the redaction annotations of the page. + + Args: + page: the PDF page. 
+ images: + 0 - ignore images + 1 - remove all overlapping images + 2 - blank out overlapping image parts + 3 - remove image unless invisible + graphics: + 0 - ignore graphics + 1 - remove graphics if contained in rectangle + 2 - remove all overlapping graphics + text: + 0 - remove text + 1 - ignore text + """ + + def center_rect(annot_rect, new_text, font, fsize): + """Calculate minimal sub-rectangle for the overlay text. + + Notes: + Because 'insert_textbox' supports no vertical text centering, + we calculate an approximate number of lines here and return a + sub-rect with smaller height, which should still be sufficient. + Args: + annot_rect: the annotation rectangle + new_text: the text to insert. + font: the fontname. Must be one of the CJK or Base-14 set, else + the rectangle is returned unchanged. + fsize: the fontsize + Returns: + A rectangle to use instead of the annot rectangle. + """ + if not new_text or annot_rect.width <= pymupdf.EPSILON: + return annot_rect + try: + text_width = pymupdf.get_text_length(new_text, font, fsize) + except (ValueError, mupdf.FzErrorBase): # unsupported font + if g_exceptions_verbose: + pymupdf.exception_info() + return annot_rect + line_height = fsize * 1.2 + limit = annot_rect.width + h = math.ceil(text_width / limit) * line_height # estimate rect height + if h >= annot_rect.height: + return annot_rect + r = annot_rect + y = (annot_rect.tl.y + annot_rect.bl.y - h) * 0.5 + r.y0 = y + return r + + pymupdf.CheckParent(page) + doc = page.parent + if doc.is_encrypted or doc.is_closed: + raise ValueError("document closed or encrypted") + if not doc.is_pdf: + raise ValueError("is no PDF") + + redact_annots = [] # storage of annot values + for annot in page.annots( + types=(pymupdf.PDF_ANNOT_REDACT,) # pylint: disable=no-member + ): + # loop redactions + redact_annots.append(annot._get_redact_values()) # save annot values + + if redact_annots == []: # any redactions on this page? + return False # no redactions + + rc = page._apply_redactions(text, images, graphics) # call MuPDF + if not rc: # should not happen really + raise ValueError("Error applying redactions.") + + # now write replacement text in old redact rectangles + shape = page.new_shape() + for redact in redact_annots: + annot_rect = redact["rect"] + fill = redact["fill"] + if fill: + shape.draw_rect(annot_rect) # colorize the rect background + shape.finish(fill=fill, color=fill) + if "text" in redact.keys(): # if we also have text + new_text = redact["text"] + align = redact.get("align", 0) + fname = redact["fontname"] + fsize = redact["fontsize"] + color = redact["text_color"] + # try finding vertical centered sub-rect + trect = center_rect(annot_rect, new_text, fname, fsize) + + rc = -1 + while rc < 0 and fsize >= 4: # while not enough room + # (re-) try insertion + rc = shape.insert_textbox( + trect, + new_text, + fontname=fname, + fontsize=fsize, + color=color, + align=align, + ) + fsize -= 0.5 # reduce font if unsuccessful + shape.commit() # append new contents object + return True + + +# ------------------------------------------------------------------------------ +# Remove potentially sensitive data from a PDF. 
Similar to the Adobe +# Acrobat 'sanitize' function +# ------------------------------------------------------------------------------ +def scrub( + doc: pymupdf.Document, + attached_files: bool = True, + clean_pages: bool = True, + embedded_files: bool = True, + hidden_text: bool = True, + javascript: bool = True, + metadata: bool = True, + redactions: bool = True, + redact_images: int = 0, + remove_links: bool = True, + reset_fields: bool = True, + reset_responses: bool = True, + thumbnails: bool = True, + xml_metadata: bool = True, +) -> None: + def remove_hidden(cont_lines): + """Remove hidden text from a PDF page. + + Args: + cont_lines: list of lines with /Contents content. Should have status + from after page.cleanContents(). + + Returns: + List of /Contents lines from which hidden text has been removed. + + Notes: + The input must have been created after the page's /Contents object(s) + have been cleaned with page.cleanContents(). This ensures a standard + formatting: one command per line, single spaces between operators. + This allows for drastic simplification of this code. + """ + out_lines = [] # will return this + in_text = False # indicate if within BT/ET object + suppress = False # indicate text suppression active + make_return = False + for line in cont_lines: + if line == b"BT": # start of text object + in_text = True # switch on + out_lines.append(line) # output it + continue + if line == b"ET": # end of text object + in_text = False # switch off + out_lines.append(line) # output it + continue + if line == b"3 Tr": # text suppression operator + suppress = True # switch on + make_return = True + continue + if line[-2:] == b"Tr" and line[0] != b"3": + suppress = False # text rendering changed + out_lines.append(line) + continue + if line == b"Q": # unstack command also switches off + suppress = False + out_lines.append(line) + continue + if suppress and in_text: # suppress hidden lines + continue + out_lines.append(line) + if make_return: + return out_lines + else: + return None + + if not doc.is_pdf: # only works for PDF + raise ValueError("is no PDF") + if doc.is_encrypted or doc.is_closed: + raise ValueError("closed or encrypted doc") + + if not clean_pages: + hidden_text = False + redactions = False + + if metadata: + doc.set_metadata({}) # remove standard metadata + + for page in doc: + if reset_fields: + # reset form fields (widgets) + for widget in page.widgets(): + widget.reset() + + if remove_links: + links = page.get_links() # list of all links on page + for link in links: # remove all links + page.delete_link(link) + + found_redacts = False + for annot in page.annots(): + if annot.type[0] == mupdf.PDF_ANNOT_FILE_ATTACHMENT and attached_files: + annot.update_file(buffer_=b" ") # set file content to empty + if reset_responses: + annot.delete_responses() + if annot.type[0] == pymupdf.PDF_ANNOT_REDACT: # pylint: disable=no-member + found_redacts = True + + if redactions and found_redacts: + page.apply_redactions(images=redact_images) + + if not (clean_pages or hidden_text): + continue # done with the page + + page.clean_contents() + if not page.get_contents(): + continue + if hidden_text: + xref = page.get_contents()[0] # only one b/o cleaning! + cont = doc.xref_stream(xref) + cont_lines = remove_hidden(cont.splitlines()) # remove hidden text + if cont_lines: # something was actually removed + cont = b"\n".join(cont_lines) + doc.update_stream(xref, cont) # rewrite the page /Contents + + if thumbnails: # remove page thumbnails? 
+ if doc.xref_get_key(page.xref, "Thumb")[0] != "null": + doc.xref_set_key(page.xref, "Thumb", "null") + + # pages are scrubbed, now perform document-wide scrubbing + # remove embedded files + if embedded_files: + for name in doc.embfile_names(): + doc.embfile_del(name) + + if xml_metadata: + doc.del_xml_metadata() + if not (xml_metadata or javascript): + xref_limit = 0 + else: + xref_limit = doc.xref_length() + for xref in range(1, xref_limit): + if not doc.xref_object(xref): + msg = "bad xref %i - clean PDF before scrubbing" % xref + raise ValueError(msg) + if javascript and doc.xref_get_key(xref, "S")[1] == "/JavaScript": + # a /JavaScript action object + obj = "<>" # replace with a null JavaScript + doc.update_object(xref, obj) # update this object + continue # no further handling + + if not xml_metadata: + continue + + if doc.xref_get_key(xref, "Type")[1] == "/Metadata": + # delete any metadata object directly + doc.update_object(xref, "<<>>") + doc.update_stream(xref, b"deleted", new=True) + continue + + if doc.xref_get_key(xref, "Metadata")[0] != "null": + doc.xref_set_key(xref, "Metadata", "null") + + +def _show_fz_text( text): + #if mupdf_cppyy: + # assert isinstance( text, cppyy.gbl.mupdf.Text) + #else: + # assert isinstance( text, mupdf.Text) + num_spans = 0 + num_chars = 0 + span = text.m_internal.head + while 1: + if not span: + break + num_spans += 1 + num_chars += span.len + span = span.next + return f'num_spans={num_spans} num_chars={num_chars}' + +def fill_textbox( + writer: pymupdf.TextWriter, + rect: rect_like, + text: typing.Union[str, list], + pos: point_like = None, + font: typing.Optional[pymupdf.Font] = None, + fontsize: float = 11, + lineheight: OptFloat = None, + align: int = 0, + warn: bool = None, + right_to_left: bool = False, + small_caps: bool = False, +) -> tuple: + """Fill a rectangle with text. + + Args: + writer: pymupdf.TextWriter object (= "self") + rect: rect-like to receive the text. + text: string or list/tuple of strings. + pos: point-like start position of first word. + font: pymupdf.Font object (default pymupdf.Font('helv')). + fontsize: the fontsize. + lineheight: overwrite the font property + align: (int) 0 = left, 1 = center, 2 = right, 3 = justify + warn: (bool) text overflow action: none, warn, or exception + right_to_left: (bool) indicate right-to-left language. 
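        small_caps: (bool) request small-caps rendering; the flag is passed
            through to the font measuring calls and to TextWriter.append().

    Returns:
        The lines (with their computed lengths) that did not fit and were
        therefore not written.

    Example:
        A minimal usage sketch (rectangle coordinates and text are assumptions):

            import pymupdf
            doc = pymupdf.open()
            page = doc.new_page()
            tw = pymupdf.TextWriter(page.rect)
            tw.fill_textbox(pymupdf.Rect(72, 72, 300, 200), "Hello, world!", align=0)
            tw.write_text(page)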
+ """ + rect = pymupdf.Rect(rect) + if rect.is_empty: + raise ValueError("fill rect must not empty.") + if type(font) is not pymupdf.Font: + font = pymupdf.Font("helv") + + def textlen(x): + """Return length of a string.""" + return font.text_length( + x, fontsize=fontsize, small_caps=small_caps + ) # abbreviation + + def char_lengths(x): + """Return list of single character lengths for a string.""" + return font.char_lengths(x, fontsize=fontsize, small_caps=small_caps) + + def append_this(pos, text): + ret = writer.append( + pos, text, font=font, fontsize=fontsize, small_caps=small_caps + ) + return ret + + tolerance = fontsize * 0.2 # extra distance to left border + space_len = textlen(" ") + std_width = rect.width - tolerance + std_start = rect.x0 + tolerance + + def norm_words(width, words): + """Cut any word in pieces no longer than 'width'.""" + nwords = [] + word_lengths = [] + for w in words: + wl_lst = char_lengths(w) + wl = sum(wl_lst) + if wl <= width: # nothing to do - copy over + nwords.append(w) + word_lengths.append(wl) + continue + + # word longer than rect width - split it in parts + n = len(wl_lst) + while n > 0: + wl = sum(wl_lst[:n]) + if wl <= width: + nwords.append(w[:n]) + word_lengths.append(wl) + w = w[n:] + wl_lst = wl_lst[n:] + n = len(wl_lst) + else: + n -= 1 + return nwords, word_lengths + + def output_justify(start, line): + """Justified output of a line.""" + # ignore leading / trailing / multiple spaces + words = [w for w in line.split(" ") if w != ""] + nwords = len(words) + if nwords == 0: + return + if nwords == 1: # single word cannot be justified + append_this(start, words[0]) + return + tl = sum([textlen(w) for w in words]) # total word lengths + gaps = nwords - 1 # number of word gaps + gapl = (std_width - tl) / gaps # width of each gap + for w in words: + _, lp = append_this(start, w) # output one word + start.x = lp.x + gapl # next start at word end plus gap + return + + asc = font.ascender + dsc = font.descender + if not lineheight: + if asc - dsc <= 1: + lheight = 1.2 + else: + lheight = asc - dsc + else: + lheight = lineheight + + LINEHEIGHT = fontsize * lheight # effective line height + width = std_width # available horizontal space + + # starting point of text + if pos is not None: + pos = pymupdf.Point(pos) + else: # default is just below rect top-left + pos = rect.tl + (tolerance, fontsize * asc) + if pos not in rect: + raise ValueError("Text must start in rectangle.") + + # calculate displacement factor for alignment + if align == pymupdf.TEXT_ALIGN_CENTER: + factor = 0.5 + elif align == pymupdf.TEXT_ALIGN_RIGHT: + factor = 1.0 + else: + factor = 0 + + # split in lines if just a string was given + if type(text) is str: + textlines = text.splitlines() + else: + textlines = [] + for line in text: + textlines.extend(line.splitlines()) + + max_lines = int((rect.y1 - pos.y) / LINEHEIGHT) + 1 + + new_lines = [] # the final list of textbox lines + no_justify = [] # no justify for these line numbers + for i, line in enumerate(textlines): + if line in ("", " "): + new_lines.append((line, space_len)) + width = rect.width - tolerance + no_justify.append((len(new_lines) - 1)) + continue + if i == 0: + width = rect.x1 - pos.x + else: + width = rect.width - tolerance + + if right_to_left: # reverses Arabic / Hebrew text front to back + line = writer.clean_rtl(line) + tl = textlen(line) + if tl <= width: # line short enough + new_lines.append((line, tl)) + no_justify.append((len(new_lines) - 1)) + continue + + # we need to split the line in fitting parts + 
words = line.split(" ") # the words in the line + + # cut in parts any words that are longer than rect width + words, word_lengths = norm_words(width, words) + + n = len(words) + while True: + line0 = " ".join(words[:n]) + wl = sum(word_lengths[:n]) + space_len * (n - 1) + if wl <= width: + new_lines.append((line0, wl)) + words = words[n:] + word_lengths = word_lengths[n:] + n = len(words) + line0 = None + else: + n -= 1 + + if len(words) == 0: + break + assert n + + # ------------------------------------------------------------------------- + # List of lines created. Each item is (text, tl), where 'tl' is the PDF + # output length (float) and 'text' is the text. Except for justified text, + # this is output-ready. + # ------------------------------------------------------------------------- + nlines = len(new_lines) + if nlines > max_lines: + msg = "Only fitting %i of %i lines." % (max_lines, nlines) + if warn is None: + pass + elif warn: + pymupdf.message("Warning: " + msg) + else: + raise ValueError(msg) + + start = pymupdf.Point() + no_justify += [len(new_lines) - 1] # no justifying of last line + for i in range(max_lines): + try: + line, tl = new_lines.pop(0) + except IndexError: + if g_exceptions_verbose >= 2: pymupdf.exception_info() + break + + if right_to_left: # Arabic, Hebrew + line = "".join(reversed(line)) + + if i == 0: # may have different start for first line + start = pos + + if align == pymupdf.TEXT_ALIGN_JUSTIFY and i not in no_justify and tl < std_width: + output_justify(start, line) + start.x = std_start + start.y += LINEHEIGHT + continue + + if i > 0 or pos.x == std_start: # left, center, right alignments + start.x += (width - tl) * factor + + append_this(start, line) + start.x = std_start + start.y += LINEHEIGHT + + return new_lines # return non-written lines + + +# ------------------------------------------------------------------------ +# Optional Content functions +# ------------------------------------------------------------------------ +def get_oc(doc: pymupdf.Document, xref: int) -> int: + """Return optional content object xref for an image or form xobject. + + Args: + xref: (int) xref number of an image or form xobject. + """ + if doc.is_closed or doc.is_encrypted: + raise ValueError("document close or encrypted") + t, name = doc.xref_get_key(xref, "Subtype") + if t != "name" or name not in ("/Image", "/Form"): + raise ValueError("bad object type at xref %i" % xref) + t, oc = doc.xref_get_key(xref, "OC") + if t != "xref": + return 0 + rc = int(oc.replace("0 R", "")) + return rc + + +def set_oc(doc: pymupdf.Document, xref: int, oc: int) -> None: + """Attach optional content object to image or form xobject. 
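
    Example (illustrative sketch; the choice of page and image is an assumption):

        ocg = doc.add_ocg("Layer 1")                  # new optional content group
        img_xref = page.get_images(full=True)[0][0]   # xref of first page image
        doc.set_oc(img_xref, ocg)                     # image visibility now follows the OCG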
+ + Args: + xref: (int) xref number of an image or form xobject + oc: (int) xref number of an OCG or OCMD + """ + if doc.is_closed or doc.is_encrypted: + raise ValueError("document close or encrypted") + t, name = doc.xref_get_key(xref, "Subtype") + if t != "name" or name not in ("/Image", "/Form"): + raise ValueError("bad object type at xref %i" % xref) + if oc > 0: + t, name = doc.xref_get_key(oc, "Type") + if t != "name" or name not in ("/OCG", "/OCMD"): + raise ValueError("bad object type at xref %i" % oc) + if oc == 0 and "OC" in doc.xref_get_keys(xref): + doc.xref_set_key(xref, "OC", "null") + return None + doc.xref_set_key(xref, "OC", "%i 0 R" % oc) + return None + + +def set_ocmd( + doc: pymupdf.Document, + xref: int = 0, + ocgs: typing.Union[list, None] = None, + policy: OptStr = None, + ve: typing.Union[list, None] = None, +) -> int: + """Create or update an OCMD object in a PDF document. + + Args: + xref: (int) 0 for creating a new object, otherwise update existing one. + ocgs: (list) OCG xref numbers, which shall be subject to 'policy'. + policy: one of 'AllOn', 'AllOff', 'AnyOn', 'AnyOff' (any casing). + ve: (list) visibility expression. Use instead of 'ocgs' with 'policy'. + + Returns: + Xref of the created or updated OCMD. + """ + + all_ocgs = set(doc.get_ocgs().keys()) + + def ve_maker(ve): + if type(ve) not in (list, tuple) or len(ve) < 2: + raise ValueError("bad 've' format: %s" % ve) + if ve[0].lower() not in ("and", "or", "not"): + raise ValueError("bad operand: %s" % ve[0]) + if ve[0].lower() == "not" and len(ve) != 2: + raise ValueError("bad 've' format: %s" % ve) + item = "[/%s" % ve[0].title() + for x in ve[1:]: + if type(x) is int: + if x not in all_ocgs: + raise ValueError("bad OCG %i" % x) + item += " %i 0 R" % x + else: + item += " %s" % ve_maker(x) + item += "]" + return item + + text = "< dict: + """Return the definition of an OCMD (optional content membership dictionary). + + Recognizes PDF dict keys /OCGs (PDF array of OCGs), /P (policy string) and + /VE (visibility expression, PDF array). Via string manipulation, this + info is converted to a Python dictionary with keys "xref", "ocgs", "policy" + and "ve" - ready to recycle as input for 'set_ocmd()'. + """ + + if xref not in range(doc.xref_length()): + raise ValueError("bad xref") + text = doc.xref_object(xref, compressed=True) + if "/Type/OCMD" not in text: + raise ValueError("bad object type") + textlen = len(text) + + p0 = text.find("/OCGs[") # look for /OCGs key + p1 = text.find("]", p0) + if p0 < 0 or p1 < 0: # no OCGs found + ocgs = None + else: + ocgs = text[p0 + 6 : p1].replace("0 R", " ").split() + ocgs = list(map(int, ocgs)) + + p0 = text.find("/P/") # look for /P policy key + if p0 < 0: + policy = None + else: + p1 = text.find("ff", p0) + if p1 < 0: + p1 = text.find("on", p0) + if p1 < 0: # some irregular syntax + raise ValueError("bad object at xref") + else: + policy = text[p0 + 3 : p1 + 2] + + p0 = text.find("/VE[") # look for /VE visibility expression key + if p0 < 0: # no visibility expression found + ve = None + else: + lp = rp = 0 # find end of /VE by finding last ']'. 
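        # lp / rp count '[' and ']' while scanning forward; the loop below ends
        # at the position where the brackets of the /VE array are balanced.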
+ p1 = p0 + while lp < 1 or lp != rp: + p1 += 1 + if not p1 < textlen: # some irregular syntax + raise ValueError("bad object at xref") + if text[p1] == "[": + lp += 1 + if text[p1] == "]": + rp += 1 + # p1 now positioned at the last "]" + ve = text[p0 + 3 : p1 + 1] # the PDF /VE array + ve = ( + ve.replace("/And", '"and",') + .replace("/Not", '"not",') + .replace("/Or", '"or",') + ) + ve = ve.replace(" 0 R]", "]").replace(" 0 R", ",").replace("][", "],[") + import json + try: + ve = json.loads(ve) + except Exception: + pymupdf.exception_info() + pymupdf.message(f"bad /VE key: {ve!r}") + raise + return {"xref": xref, "ocgs": ocgs, "policy": policy, "ve": ve} + + +""" +Handle page labels for PDF documents. + +Reading +------- +* compute the label of a page +* find page number(s) having the given label. + +Writing +------- +Supports setting (defining) page labels for PDF documents. + +A big Thank You goes to WILLIAM CHAPMAN who contributed the idea and +significant parts of the following code during late December 2020 +through early January 2021. +""" + + +def rule_dict(item): + """Make a Python dict from a PDF page label rule. + + Args: + item -- a tuple (pno, rule) with the start page number and the rule + string like <>. + Returns: + A dict like + {'startpage': int, 'prefix': str, 'style': str, 'firstpagenum': int}. + """ + # Jorj McKie, 2021-01-06 + + pno, rule = item + rule = rule[2:-2].split("/")[1:] # strip "<<" and ">>" + d = {"startpage": pno, "prefix": "", "firstpagenum": 1} + skip = False + for i, item in enumerate(rule): # pylint: disable=redefined-argument-from-local + if skip: # this item has already been processed + skip = False # deactivate skipping again + continue + if item == "S": # style specification + d["style"] = rule[i + 1] # next item has the style + skip = True # do not process next item again + continue + if item.startswith("P"): # prefix specification: extract the string + x = item[1:].replace("(", "").replace(")", "") + d["prefix"] = x + continue + if item.startswith("St"): # start page number specification + x = int(item[2:]) + d["firstpagenum"] = x + return d + + +def get_label_pno(pgNo, labels): + """Return the label for this page number. + + Args: + pgNo: page number, 0-based. + labels: result of doc._get_page_labels(). + Returns: + The label (str) of the page number. Errors return an empty string. + """ + # Jorj McKie, 2021-01-06 + + item = [x for x in labels if x[0] <= pgNo][-1] + rule = rule_dict(item) + prefix = rule.get("prefix", "") + style = rule.get("style", "") + # make sure we start at 0 when enumerating the alphabet + delta = -1 if style in ("a", "A") else 0 + pagenumber = pgNo - rule["startpage"] + rule["firstpagenum"] + delta + return construct_label(style, prefix, pagenumber) + + +def get_label(page): + """Return the label for this PDF page. + + Args: + page: page object. + Returns: + The label (str) of the page. Errors return an empty string. + """ + # Jorj McKie, 2021-01-06 + + labels = page.parent._get_page_labels() + if not labels: + return "" + labels.sort() + return get_label_pno(page.number, labels) + + +def get_page_numbers(doc, label, only_one=False): + """Return a list of page numbers with the given label. + + Args: + doc: PDF document object (resp. 'self'). + label: (str) label. + only_one: (bool) stop searching after first hit. + Returns: + List of page numbers having this label. 
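
    Example:
        Illustrative sketch, assuming the document's front matter is labeled
        'i', 'ii', 'iii', ...:

            doc.get_page_numbers("iii")   # -> [2]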
+ """ + # Jorj McKie, 2021-01-06 + + numbers = [] + if not label: + return numbers + labels = doc._get_page_labels() + if labels == []: + return numbers + for i in range(doc.page_count): + plabel = get_label_pno(i, labels) + if plabel == label: + numbers.append(i) + if only_one: + break + return numbers + + +def construct_label(style, prefix, pno) -> str: + """Construct a label based on style, prefix and page number.""" + # William Chapman, 2021-01-06 + + n_str = "" + if style == "D": + n_str = str(pno) + elif style == "r": + n_str = integerToRoman(pno).lower() + elif style == "R": + n_str = integerToRoman(pno).upper() + elif style == "a": + n_str = integerToLetter(pno).lower() + elif style == "A": + n_str = integerToLetter(pno).upper() + result = prefix + n_str + return result + + +def integerToLetter(i) -> str: + """Returns letter sequence string for integer i.""" + # William Chapman, Jorj McKie, 2021-01-06 + import string + ls = string.ascii_uppercase + n, a = 1, i + while pow(26, n) <= a: + a -= int(math.pow(26, n)) + n += 1 + + str_t = "" + for j in reversed(range(n)): + f, g = divmod(a, int(math.pow(26, j))) + str_t += ls[f] + a = g + return str_t + + +def integerToRoman(num: int) -> str: + """Return roman numeral for an integer.""" + # William Chapman, Jorj McKie, 2021-01-06 + + roman = ( + (1000, "M"), + (900, "CM"), + (500, "D"), + (400, "CD"), + (100, "C"), + (90, "XC"), + (50, "L"), + (40, "XL"), + (10, "X"), + (9, "IX"), + (5, "V"), + (4, "IV"), + (1, "I"), + ) + + def roman_num(num): + for r, ltr in roman: + x, _ = divmod(num, r) + yield ltr * x + num -= r * x + if num <= 0: + break + + return "".join([a for a in roman_num(num)]) + + +def get_page_labels(doc): + """Return page label definitions in PDF document. + + Args: + doc: PDF document (resp. 'self'). + Returns: + A list of dictionaries with the following format: + {'startpage': int, 'prefix': str, 'style': str, 'firstpagenum': int}. + """ + # Jorj McKie, 2021-01-10 + return [rule_dict(item) for item in doc._get_page_labels()] + + +def set_page_labels(doc, labels): + """Add / replace page label definitions in PDF document. + + Args: + doc: PDF document (resp. 'self'). + labels: list of label dictionaries like: + {'startpage': int, 'prefix': str, 'style': str, 'firstpagenum': int}, + as returned by get_page_labels(). + """ + # William Chapman, 2021-01-06 + + def create_label_str(label): + """Convert Python label dict to corresponding PDF rule string. + + Args: + label: (dict) build rule for the label. + Returns: + PDF label rule string wrapped in "<<", ">>". + """ + s = "%i<<" % label["startpage"] + if label.get("prefix", "") != "": + s += "/P(%s)" % label["prefix"] + if label.get("style", "") != "": + s += "/S/%s" % label["style"] + if label.get("firstpagenum", 1) > 1: + s += "/St %i" % label["firstpagenum"] + s += ">>" + return s + + def create_nums(labels): + """Return concatenated string of all labels rules. + + Args: + labels: (list) dictionaries as created by function 'rule_dict'. + Returns: + PDF compatible string for page label definitions, ready to be + enclosed in PDF array 'Nums[...]'. 
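
        Example:
            Illustrative sketch: the dictionary
            {'startpage': 4, 'prefix': 'A-', 'style': 'D', 'firstpagenum': 10}
            is rendered by create_label_str() as "4<</P(A-)/S/D/St 10>>"; this
            function concatenates such strings in 'startpage' order.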
+ """ + labels.sort(key=lambda x: x["startpage"]) + s = "".join([create_label_str(label) for label in labels]) + return s + + doc._set_page_labels(create_nums(labels)) + + +# End of Page Label Code ------------------------------------------------- + + +def has_links(doc: pymupdf.Document) -> bool: + """Check whether there are links on any page.""" + if doc.is_closed: + raise ValueError("document closed") + if not doc.is_pdf: + raise ValueError("is no PDF") + for i in range(doc.page_count): + for item in doc.page_annot_xrefs(i): + if item[1] == pymupdf.PDF_ANNOT_LINK: # pylint: disable=no-member + return True + return False + + +def has_annots(doc: pymupdf.Document) -> bool: + """Check whether there are annotations on any page.""" + if doc.is_closed: + raise ValueError("document closed") + if not doc.is_pdf: + raise ValueError("is no PDF") + for i in range(doc.page_count): + for item in doc.page_annot_xrefs(i): + # pylint: disable=no-member + if not (item[1] == pymupdf.PDF_ANNOT_LINK or item[1] == pymupdf.PDF_ANNOT_WIDGET): # pylint: disable=no-member + return True + return False + + +# ------------------------------------------------------------------- +# Functions to recover the quad contained in a text extraction bbox +# ------------------------------------------------------------------- +def recover_bbox_quad(line_dir: tuple, span: dict, bbox: tuple) -> pymupdf.Quad: + """Compute the quad located inside the bbox. + + The bbox may be any of the resp. tuples occurring inside the given span. + + Args: + line_dir: (tuple) 'line["dir"]' of the owning line or None. + span: (dict) the span. May be from get_texttrace() method. + bbox: (tuple) the bbox of the span or any of its characters. + Returns: + The quad which is wrapped by the bbox. + """ + if line_dir is None: + line_dir = span["dir"] + cos, sin = line_dir + bbox = pymupdf.Rect(bbox) # make it a rect + if pymupdf.TOOLS.set_small_glyph_heights(): # ==> just fontsize as height + d = 1 + else: + d = span["ascender"] - span["descender"] + + height = d * span["size"] # the quad's rectangle height + # The following are distances from the bbox corners, at which we find the + # respective quad points. The computation depends on in which quadrant the + # text writing angle is located. + hs = height * sin + hc = height * cos + if hc >= 0 and hs <= 0: # quadrant 1 + ul = bbox.bl - (0, hc) + ur = bbox.tr + (hs, 0) + ll = bbox.bl - (hs, 0) + lr = bbox.tr + (0, hc) + elif hc <= 0 and hs <= 0: # quadrant 2 + ul = bbox.br + (hs, 0) + ur = bbox.tl - (0, hc) + ll = bbox.br + (0, hc) + lr = bbox.tl - (hs, 0) + elif hc <= 0 and hs >= 0: # quadrant 3 + ul = bbox.tr - (0, hc) + ur = bbox.bl + (hs, 0) + ll = bbox.tr - (hs, 0) + lr = bbox.bl + (0, hc) + else: # quadrant 4 + ul = bbox.tl + (hs, 0) + ur = bbox.br - (0, hc) + ll = bbox.tl + (0, hc) + lr = bbox.br - (hs, 0) + return pymupdf.Quad(ul, ur, ll, lr) + + +def recover_quad(line_dir: tuple, span: dict) -> pymupdf.Quad: + """Recover the quadrilateral of a text span. + + Args: + line_dir: (tuple) 'line["dir"]' of the owning line. + span: the span. + Returns: + The quadrilateral enveloping the span's text. + """ + if type(line_dir) is not tuple or len(line_dir) != 2: + raise ValueError("bad line dir argument") + if type(span) is not dict: + raise ValueError("bad span argument") + return recover_bbox_quad(line_dir, span, span["bbox"]) + + +def recover_line_quad(line: dict, spans: list = None) -> pymupdf.Quad: + """Calculate the line quad for 'dict' / 'rawdict' text extractions. 
+ + The lower quad points are those of the first, resp. last span quad. + The upper points are determined by the maximum span quad height. + From this, compute a rect with bottom-left in (0, 0), convert this to a + quad and rotate and shift back to cover the text of the spans. + + Args: + spans: (list, optional) sub-list of spans to consider. + Returns: + pymupdf.Quad covering selected spans. + """ + if spans is None: # no sub-selection + spans = line["spans"] # all spans + if len(spans) == 0: + raise ValueError("bad span list") + line_dir = line["dir"] # text direction + cos, sin = line_dir + q0 = recover_quad(line_dir, spans[0]) # quad of first span + if len(spans) > 1: # get quad of last span + q1 = recover_quad(line_dir, spans[-1]) + else: + q1 = q0 # last = first + + line_ll = q0.ll # lower-left of line quad + line_lr = q1.lr # lower-right of line quad + + mat0 = pymupdf.planish_line(line_ll, line_lr) + + # map base line to x-axis such that line_ll goes to (0, 0) + x_lr = line_lr * mat0 + + small = pymupdf.TOOLS.set_small_glyph_heights() # small glyph heights? + + h = max( + [s["size"] * (1 if small else (s["ascender"] - s["descender"])) for s in spans] + ) + + line_rect = pymupdf.Rect(0, -h, x_lr.x, 0) # line rectangle + line_quad = line_rect.quad # make it a quad and: + line_quad *= ~mat0 + return line_quad + + +def recover_span_quad(line_dir: tuple, span: dict, chars: list = None) -> pymupdf.Quad: + """Calculate the span quad for 'dict' / 'rawdict' text extractions. + + Notes: + There are two execution paths: + 1. For the full span quad, the result of 'recover_quad' is returned. + 2. For the quad of a sub-list of characters, the char quads are + computed and joined. This is only supported for the "rawdict" + extraction option. + + Args: + line_dir: (tuple) 'line["dir"]' of the owning line. + span: (dict) the span. + chars: (list, optional) sub-list of characters to consider. + Returns: + pymupdf.Quad covering selected characters. + """ + if line_dir is None: # must be a span from get_texttrace() + line_dir = span["dir"] + if chars is None: # no sub-selection + return recover_quad(line_dir, span) + if "chars" not in span.keys(): + raise ValueError("need 'rawdict' option to sub-select chars") + + q0 = recover_char_quad(line_dir, span, chars[0]) # quad of first char + if len(chars) > 1: # get quad of last char + q1 = recover_char_quad(line_dir, span, chars[-1]) + else: + q1 = q0 # last = first + + span_ll = q0.ll # lower-left of span quad + span_lr = q1.lr # lower-right of span quad + mat0 = pymupdf.planish_line(span_ll, span_lr) + # map base line to x-axis such that span_ll goes to (0, 0) + x_lr = span_lr * mat0 + + small = pymupdf.TOOLS.set_small_glyph_heights() # small glyph heights? + h = span["size"] * (1 if small else (span["ascender"] - span["descender"])) + + span_rect = pymupdf.Rect(0, -h, x_lr.x, 0) # line rectangle + span_quad = span_rect.quad # make it a quad and: + span_quad *= ~mat0 # rotate back and shift back + return span_quad + + +def recover_char_quad(line_dir: tuple, span: dict, char: dict) -> pymupdf.Quad: + """Recover the quadrilateral of a text character. + + This requires the "rawdict" option of text extraction. + + Args: + line_dir: (tuple) 'line["dir"]' of the span's line. + span: (dict) the span dict. + char: (dict) the character dict. + Returns: + The quadrilateral enveloping the character. 
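
    Example:
        Illustrative sketch using a "rawdict" extraction (the block, line, span
        and char indices are assumptions):

            page_dict = page.get_text("rawdict")
            line = page_dict["blocks"][0]["lines"][0]
            span = line["spans"][0]
            char = span["chars"][0]
            quad = recover_char_quad(line["dir"], span, char)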
+ """ + if line_dir is None: + line_dir = span["dir"] + if type(line_dir) is not tuple or len(line_dir) != 2: + raise ValueError("bad line dir argument") + if type(span) is not dict: + raise ValueError("bad span argument") + if type(char) is dict: + bbox = pymupdf.Rect(char["bbox"]) + elif type(char) is tuple: + bbox = pymupdf.Rect(char[3]) + else: + raise ValueError("bad span argument") + + return recover_bbox_quad(line_dir, span, bbox) + + +# ------------------------------------------------------------------- +# Building font subsets using fontTools +# ------------------------------------------------------------------- +def subset_fonts(doc: pymupdf.Document, verbose: bool = False, fallback: bool = False) -> OptInt: + """Build font subsets in a PDF. + + Eligible fonts are potentially replaced by smaller versions. Page text is + NOT rewritten and thus should retain properties like being hidden or + controlled by optional content. + + This method by default uses MuPDF's own internal feature to create subset + fonts. As this is a new function, errors may still occur. In this case, + please fall back to using the previous version by using "fallback=True". + Fallback mode requires the external package 'fontTools'. + + Args: + fallback: use the older deprecated implementation. + verbose: only used by fallback mode. + + Returns: + The new MuPDF-based code returns None. The deprecated fallback + mode returns 0 if there are no fonts to subset. Otherwise, it + returns the decrease in fontsize (the difference in fontsize), + measured in bytes. + """ + # Font binaries: - "buffer" -> (names, xrefs, (unicodes, glyphs)) + # An embedded font is uniquely defined by its fontbuffer only. It may have + # multiple names and xrefs. + # Once the sets of used unicodes and glyphs are known, we compute a + # smaller version of the buffer user package fontTools. + + if not fallback: # by default use MuPDF function + pdf = mupdf.pdf_document_from_fz_document(doc) + mupdf.pdf_subset_fonts2(pdf, list(range(doc.page_count))) + return + + font_buffers = {} + + def get_old_widths(xref): + """Retrieve old font '/W' and '/DW' values.""" + df = doc.xref_get_key(xref, "DescendantFonts") + if df[0] != "array": # only handle xref specifications + return None, None + df_xref = int(df[1][1:-1].replace("0 R", "")) + widths = doc.xref_get_key(df_xref, "W") + if widths[0] != "array": # no widths key found + widths = None + else: + widths = widths[1] + dwidths = doc.xref_get_key(df_xref, "DW") + if dwidths[0] != "int": + dwidths = None + else: + dwidths = dwidths[1] + return widths, dwidths + + def set_old_widths(xref, widths, dwidths): + """Restore the old '/W' and '/DW' in subsetted font. + + If either parameter is None or evaluates to False, the corresponding + dictionary key will be set to null. + """ + df = doc.xref_get_key(xref, "DescendantFonts") + if df[0] != "array": # only handle xref specs + return None + df_xref = int(df[1][1:-1].replace("0 R", "")) + if (type(widths) is not str or not widths) and doc.xref_get_key(df_xref, "W")[ + 0 + ] != "null": + doc.xref_set_key(df_xref, "W", "null") + else: + doc.xref_set_key(df_xref, "W", widths) + if (type(dwidths) is not str or not dwidths) and doc.xref_get_key( + df_xref, "DW" + )[0] != "null": + doc.xref_set_key(df_xref, "DW", "null") + else: + doc.xref_set_key(df_xref, "DW", dwidths) + return None + + def set_subset_fontname(new_xref): + """Generate a name prefix to tag a font as subset. + + We use a random generator to select 6 upper case ASCII characters. 
+ The prefixed name must be put in the font xref as the "/BaseFont" value + and in the FontDescriptor object as the '/FontName' value. + """ + # The following generates a prefix like 'ABCDEF+' + import random + import string + prefix = "".join(random.choices(tuple(string.ascii_uppercase), k=6)) + "+" + font_str = doc.xref_object(new_xref, compressed=True) + font_str = font_str.replace("/BaseFont/", "/BaseFont/" + prefix) + df = doc.xref_get_key(new_xref, "DescendantFonts") + if df[0] == "array": + df_xref = int(df[1][1:-1].replace("0 R", "")) + fd = doc.xref_get_key(df_xref, "FontDescriptor") + if fd[0] == "xref": + fd_xref = int(fd[1].replace("0 R", "")) + fd_str = doc.xref_object(fd_xref, compressed=True) + fd_str = fd_str.replace("/FontName/", "/FontName/" + prefix) + doc.update_object(fd_xref, fd_str) + doc.update_object(new_xref, font_str) + + def build_subset(buffer, unc_set, gid_set): + """Build font subset using fontTools. + + Args: + buffer: (bytes) the font given as a binary buffer. + unc_set: (set) required glyph ids. + Returns: + Either None if subsetting is unsuccessful or the subset font buffer. + """ + try: + import fontTools.subset as fts + except ImportError: + if g_exceptions_verbose: pymupdf.exception_info() + pymupdf.message("This method requires fontTools to be installed.") + raise + import tempfile + with tempfile.TemporaryDirectory() as tmp_dir: + oldfont_path = f"{tmp_dir}/oldfont.ttf" + newfont_path = f"{tmp_dir}/newfont.ttf" + uncfile_path = f"{tmp_dir}/uncfile.txt" + args = [ + oldfont_path, + "--retain-gids", + f"--output-file={newfont_path}", + "--layout-features=*", + "--passthrough-tables", + "--ignore-missing-glyphs", + "--ignore-missing-unicodes", + "--symbol-cmap", + ] + + # store glyph ids or unicodes as file + with open(f"{tmp_dir}/uncfile.txt", "w", encoding='utf8') as unc_file: + if 0xFFFD in unc_set: # error unicode exists -> use glyphs + args.append(f"--gids-file={uncfile_path}") + gid_set.add(189) + unc_list = list(gid_set) + for unc in unc_list: + unc_file.write("%i\n" % unc) + else: + args.append(f"--unicodes-file={uncfile_path}") + unc_set.add(255) + unc_list = list(unc_set) + for unc in unc_list: + unc_file.write("%04x\n" % unc) + + # store fontbuffer as a file + with open(oldfont_path, "wb") as fontfile: + fontfile.write(buffer) + try: + os.remove(newfont_path) # remove old file + except Exception: + pass + try: # invoke fontTools subsetter + fts.main(args) + font = pymupdf.Font(fontfile=newfont_path) + new_buffer = font.buffer # subset font binary + if font.glyph_count == 0: # intercept empty font + new_buffer = None + except Exception: + pymupdf.exception_info() + new_buffer = None + return new_buffer + + def repl_fontnames(doc): + """Populate 'font_buffers'. + + For each font candidate, store its xref and the list of names + by which PDF text may refer to it (there may be multiple). + """ + + def norm_name(name): + """Recreate font name that contains PDF hex codes. + + E.g. #20 -> space, chr(32) + """ + while "#" in name: + p = name.find("#") + c = int(name[p + 1 : p + 3], 16) + name = name.replace(name[p : p + 3], chr(c)) + return name + + def get_fontnames(doc, item): + """Return a list of fontnames for an item of page.get_fonts(). + + There may be multiple names e.g. for Type0 fonts. 
+ """ + fontname = item[3] + names = [fontname] + fontname = doc.xref_get_key(item[0], "BaseFont")[1][1:] + fontname = norm_name(fontname) + if fontname not in names: + names.append(fontname) + descendents = doc.xref_get_key(item[0], "DescendantFonts") + if descendents[0] != "array": + return names + descendents = descendents[1][1:-1] + if descendents.endswith(" 0 R"): + xref = int(descendents[:-4]) + descendents = doc.xref_object(xref, compressed=True) + p1 = descendents.find("/BaseFont") + if p1 >= 0: + p2 = descendents.find("/", p1 + 1) + p1 = min(descendents.find("/", p2 + 1), descendents.find(">>", p2 + 1)) + fontname = descendents[p2 + 1 : p1] + fontname = norm_name(fontname) + if fontname not in names: + names.append(fontname) + return names + + for i in range(doc.page_count): + for f in doc.get_page_fonts(i, full=True): + font_xref = f[0] # font xref + font_ext = f[1] # font file extension + basename = f[3] # font basename + + if font_ext not in ( # skip if not supported by fontTools + "otf", + "ttf", + "woff", + "woff2", + ): + continue + # skip fonts which already are subsets + if len(basename) > 6 and basename[6] == "+": + continue + + extr = doc.extract_font(font_xref) + fontbuffer = extr[-1] + names = get_fontnames(doc, f) + name_set, xref_set, subsets = font_buffers.get( + fontbuffer, (set(), set(), (set(), set())) + ) + xref_set.add(font_xref) + for name in names: + name_set.add(name) + font = pymupdf.Font(fontbuffer=fontbuffer) + name_set.add(font.name) + del font + font_buffers[fontbuffer] = (name_set, xref_set, subsets) + + def find_buffer_by_name(name): + for buffer, (name_set, _, _) in font_buffers.items(): + if name in name_set: + return buffer + return None + + # ----------------- + # main function + # ----------------- + repl_fontnames(doc) # populate font information + if not font_buffers: # nothing found to do + if verbose: + pymupdf.message(f'No fonts to subset.') + return 0 + + old_fontsize = 0 + new_fontsize = 0 + for fontbuffer in font_buffers.keys(): + old_fontsize += len(fontbuffer) + + # Scan page text for usage of subsettable fonts + for page in doc: + # go through the text and extend set of used glyphs by font + # we use a modified MuPDF trace device, which delivers us glyph ids. 
+ for span in page.get_texttrace(): + if type(span) is not dict: # skip useless information + continue + fontname = span["font"][:33] # fontname for the span + buffer = find_buffer_by_name(fontname) + if buffer is None: + continue + name_set, xref_set, (set_ucs, set_gid) = font_buffers[buffer] + for c in span["chars"]: + set_ucs.add(c[0]) # unicode + set_gid.add(c[1]) # glyph id + font_buffers[buffer] = (name_set, xref_set, (set_ucs, set_gid)) + + # build the font subsets + for old_buffer, (name_set, xref_set, subsets) in font_buffers.items(): + new_buffer = build_subset(old_buffer, subsets[0], subsets[1]) + fontname = list(name_set)[0] + if new_buffer is None or len(new_buffer) >= len(old_buffer): + # subset was not created or did not get smaller + if verbose: + pymupdf.message(f'Cannot subset {fontname!r}.') + continue + if verbose: + pymupdf.message(f"Built subset of font {fontname!r}.") + val = doc._insert_font(fontbuffer=new_buffer) # store subset font in PDF + new_xref = val[0] # get its xref + set_subset_fontname(new_xref) # tag fontname as subset font + font_str = doc.xref_object( # get its object definition + new_xref, + compressed=True, + ) + # walk through the original font xrefs and replace each by the subset def + for font_xref in xref_set: + # we need the original '/W' and '/DW' width values + width_table, def_width = get_old_widths(font_xref) + # ... and replace original font definition at xref with it + doc.update_object(font_xref, font_str) + # now copy over old '/W' and '/DW' values + if width_table or def_width: + set_old_widths(font_xref, width_table, def_width) + # 'new_xref' remains unused in the PDF and must be removed + # by garbage collection. + new_fontsize += len(new_buffer) + + return old_fontsize - new_fontsize + + +# ------------------------------------------------------------------- +# Copy XREF object to another XREF +# ------------------------------------------------------------------- +def xref_copy(doc: pymupdf.Document, source: int, target: int, *, keep: list = None) -> None: + """Copy a PDF dictionary object to another one given their xref numbers. + + Args: + doc: PDF document object + source: source xref number + target: target xref number, the xref must already exist + keep: an optional list of 1st level keys in target that should not be + removed before copying. + Notes: + This works similar to the copy() method of dictionaries in Python. The + source may be a stream object. 
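
    Example:
        Illustrative sketch (the xref numbers are assumptions):

            src = 12    # e.g. a Form XObject
            tgt = 34    # existing object that should become a copy of it
            doc.xref_copy(src, tgt, keep=["Parent"])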
+ """ + if doc.xref_is_stream(source): + # read new xref stream, maintaining compression + stream = doc.xref_stream_raw(source) + doc.update_stream( + target, + stream, + compress=False, # keeps source compression + new=True, # in case target is no stream + ) + + # empty the target completely, observe exceptions + if keep is None: + keep = [] + for key in doc.xref_get_keys(target): + if key in keep: + continue + doc.xref_set_key(target, key, "null") + # copy over all source dict items + for key in doc.xref_get_keys(source): + item = doc.xref_get_key(source, key) + doc.xref_set_key(target, key, item[1]) diff -r 000000000000 -r 1d09e1dec1d9 src_classic/__init__.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/__init__.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,506 @@ +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. +# ------------------------------------------------------------------------ +import sys + +import glob +import os +if os.path.exists( 'fitz/__init__.py'): + if not glob.glob( 'fitz/_fitz*'): + print( '#' * 40) + print( '# Warning: current directory appears to contain an incomplete') + print( '# fitz/ installation directory so "import fitz" may fail.') + print( '# This can happen if current directory is a PyMuPDF source tree.') + print( '# Suggest changing to a different current directory.') + print( '#' * 40) + +def message(text=''): + print(text) + +from fitz_old.fitz_old import * + +# Allow this to work: +# import fitz_old as fitz +# fitz.fitz.TEXT_ALIGN_CENTER +# +fitz = fitz_old + +# define the supported colorspaces for convenience +fitz_old.csRGB = fitz_old.Colorspace(fitz_old.CS_RGB) +fitz_old.csGRAY = fitz_old.Colorspace(fitz_old.CS_GRAY) +fitz_old.csCMYK = fitz_old.Colorspace(fitz_old.CS_CMYK) +csRGB = fitz_old.csRGB +csGRAY = fitz_old.csGRAY +csCMYK = fitz_old.csCMYK + +# create the TOOLS object. +# +# Unfortunately it seems that this is never be destructed even if we use an +# atexit() handler, which makes MuPDF's Memento list it as a leak. In fitz_old.i +# we use Memento_startLeaking()/Memento_stopLeaking() when allocating +# the Tools instance so at least the leak is marked as known. +# +TOOLS = fitz_old.Tools() +TOOLS.thisown = True +fitz_old.TOOLS = TOOLS + +# This atexit handler runs, but doesn't cause ~Tools() to be run. +# +import atexit + + +def cleanup_tools(TOOLS): + # print(f'cleanup_tools: TOOLS={TOOLS} id(TOOLS)={id(TOOLS)}') + # print(f'TOOLS.thisown={TOOLS.thisown}') + del TOOLS + del fitz_old.TOOLS + + +atexit.register(cleanup_tools, TOOLS) + + +# Require that MuPDF matches fitz_old.TOOLS.mupdf_version(); also allow use with +# next minor version (e.g. 1.21.2 => 1.22), so we can test with mupdf master. 
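# For example (illustrative): v_str_to_tuple("1.21.2") gives (1, 21, 2); the
# 'prev'/'next' tuples computed below then describe the adjacent minor
# versions (1, 20) and (1, 22) used by the compatibility check.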
+# +def v_str_to_tuple(s): + return tuple(map(int, s.split('.'))) + +def v_tuple_to_string(t): + return '.'.join(map(str, t)) + +mupdf_version_tuple = v_str_to_tuple(fitz_old.TOOLS.mupdf_version()) +mupdf_version_tuple_required = v_str_to_tuple(fitz_old.VersionFitz) +mupdf_version_tuple_required_prev = (mupdf_version_tuple_required[0], mupdf_version_tuple_required[1]-1) +mupdf_version_tuple_required_next = (mupdf_version_tuple_required[0], mupdf_version_tuple_required[1]+1) + +# copy functions in 'utils' to their respective fitz classes +import fitz_old.utils +from .table import find_tables + +# ------------------------------------------------------------------------------ +# General +# ------------------------------------------------------------------------------ +fitz_old.recover_quad = fitz_old.utils.recover_quad +fitz_old.recover_bbox_quad = fitz_old.utils.recover_bbox_quad +fitz_old.recover_line_quad = fitz_old.utils.recover_line_quad +fitz_old.recover_span_quad = fitz_old.utils.recover_span_quad +fitz_old.recover_char_quad = fitz_old.utils.recover_char_quad + +# ------------------------------------------------------------------------------ +# Document +# ------------------------------------------------------------------------------ +fitz_old.open = fitz_old.Document +fitz_old.Document._do_links = fitz_old.utils.do_links +fitz_old.Document.del_toc_item = fitz_old.utils.del_toc_item +fitz_old.Document.get_char_widths = fitz_old.utils.get_char_widths +fitz_old.Document.get_ocmd = fitz_old.utils.get_ocmd +fitz_old.Document.get_page_labels = fitz_old.utils.get_page_labels +fitz_old.Document.get_page_numbers = fitz_old.utils.get_page_numbers +fitz_old.Document.get_page_pixmap = fitz_old.utils.get_page_pixmap +fitz_old.Document.get_page_text = fitz_old.utils.get_page_text +fitz_old.Document.get_toc = fitz_old.utils.get_toc +fitz_old.Document.has_annots = fitz_old.utils.has_annots +fitz_old.Document.has_links = fitz_old.utils.has_links +fitz_old.Document.insert_page = fitz_old.utils.insert_page +fitz_old.Document.new_page = fitz_old.utils.new_page +fitz_old.Document.scrub = fitz_old.utils.scrub +fitz_old.Document.search_page_for = fitz_old.utils.search_page_for +fitz_old.Document.set_metadata = fitz_old.utils.set_metadata +fitz_old.Document.set_ocmd = fitz_old.utils.set_ocmd +fitz_old.Document.set_page_labels = fitz_old.utils.set_page_labels +fitz_old.Document.set_toc = fitz_old.utils.set_toc +fitz_old.Document.set_toc_item = fitz_old.utils.set_toc_item +fitz_old.Document.tobytes = fitz_old.Document.write +fitz_old.Document.subset_fonts = fitz_old.utils.subset_fonts +fitz_old.Document.get_oc = fitz_old.utils.get_oc +fitz_old.Document.set_oc = fitz_old.utils.set_oc +fitz_old.Document.xref_copy = fitz_old.utils.xref_copy + + +# ------------------------------------------------------------------------------ +# Page +# ------------------------------------------------------------------------------ +fitz_old.Page.apply_redactions = fitz_old.utils.apply_redactions +fitz_old.Page.delete_widget = fitz_old.utils.delete_widget +fitz_old.Page.draw_bezier = fitz_old.utils.draw_bezier +fitz_old.Page.draw_circle = fitz_old.utils.draw_circle +fitz_old.Page.draw_curve = fitz_old.utils.draw_curve +fitz_old.Page.draw_line = fitz_old.utils.draw_line +fitz_old.Page.draw_oval = fitz_old.utils.draw_oval +fitz_old.Page.draw_polyline = fitz_old.utils.draw_polyline +fitz_old.Page.draw_quad = fitz_old.utils.draw_quad +fitz_old.Page.draw_rect = fitz_old.utils.draw_rect +fitz_old.Page.draw_sector = 
fitz_old.utils.draw_sector +fitz_old.Page.draw_squiggle = fitz_old.utils.draw_squiggle +fitz_old.Page.draw_zigzag = fitz_old.utils.draw_zigzag +fitz_old.Page.get_links = fitz_old.utils.get_links +fitz_old.Page.get_pixmap = fitz_old.utils.get_pixmap +fitz_old.Page.get_text = fitz_old.utils.get_text +fitz_old.Page.get_image_info = fitz_old.utils.get_image_info +fitz_old.Page.get_text_blocks = fitz_old.utils.get_text_blocks +fitz_old.Page.get_text_selection = fitz_old.utils.get_text_selection +fitz_old.Page.get_text_words = fitz_old.utils.get_text_words +fitz_old.Page.get_textbox = fitz_old.utils.get_textbox +fitz_old.Page.insert_image = fitz_old.utils.insert_image +fitz_old.Page.insert_link = fitz_old.utils.insert_link +fitz_old.Page.insert_text = fitz_old.utils.insert_text +fitz_old.Page.insert_textbox = fitz_old.utils.insert_textbox +fitz_old.Page.new_shape = lambda x: fitz_old.utils.Shape(x) +fitz_old.Page.search_for = fitz_old.utils.search_for +fitz_old.Page.show_pdf_page = fitz_old.utils.show_pdf_page +fitz_old.Page.update_link = fitz_old.utils.update_link +fitz_old.Page.write_text = fitz_old.utils.write_text +fitz_old.Page.get_label = fitz_old.utils.get_label +fitz_old.Page.get_image_rects = fitz_old.utils.get_image_rects +fitz_old.Page.get_textpage_ocr = fitz_old.utils.get_textpage_ocr +fitz_old.Page.delete_image = fitz_old.utils.delete_image +fitz_old.Page.replace_image = fitz_old.utils.replace_image +fitz_old.Page.find_tables = find_tables +# ------------------------------------------------------------------------ +# Annot +# ------------------------------------------------------------------------ +fitz_old.Annot.get_text = fitz_old.utils.get_text +fitz_old.Annot.get_textbox = fitz_old.utils.get_textbox + +# ------------------------------------------------------------------------ +# Rect and IRect +# ------------------------------------------------------------------------ +fitz_old.Rect.get_area = fitz_old.utils.get_area +fitz_old.IRect.get_area = fitz_old.utils.get_area + +# ------------------------------------------------------------------------ +# TextWriter +# ------------------------------------------------------------------------ +fitz_old.TextWriter.fill_textbox = fitz_old.utils.fill_textbox + + +class FitzDeprecation(DeprecationWarning): + pass + + +def restore_aliases(): + import warnings + + warnings.filterwarnings( + "once", + category=FitzDeprecation, + ) + + def showthis(msg, cat, filename, lineno, file=None, line=None): + text = warnings.formatwarning(msg, cat, filename, lineno, line=line) + s = text.find("FitzDeprecation") + if s < 0: + print(text, file=sys.stderr) + return + text = text[s:].splitlines()[0][4:] + print(text, file=sys.stderr) + + warnings.showwarning = showthis + + def _alias(fitz_class, old, new): + fname = getattr(fitz_class, new) + r = str(fitz_class)[1:-1] + objname = " ".join(r.split()[:2]) + objname = objname.replace("fitz_old.fitz_old.", "") + objname = objname.replace("fitz_old.utils.", "") + if callable(fname): + + def deprecated_function(*args, **kw): + msg = "'%s' removed from %s after v1.19 - use '%s'." 
% ( + old, + objname, + new, + ) + if not VersionBind.startswith("1.18"): + warnings.warn(msg, category=FitzDeprecation) + return fname(*args, **kw) + + setattr(fitz_class, old, deprecated_function) + else: + if type(fname) is property: + setattr(fitz_class, old, property(fname.fget)) + else: + setattr(fitz_class, old, fname) + + eigen = getattr(fitz_class, old) + x = fname.__doc__ + if not x: + x = "" + try: + if callable(fname) or type(fname) is property: + eigen.__doc__ = ( + "*** Deprecated and removed after v1.19 - use '%s'. ***\n" % new + x + ) + except: + pass + + # deprecated Document aliases + _alias(fitz_old.Document, "chapterCount", "chapter_count") + _alias(fitz_old.Document, "chapterPageCount", "chapter_page_count") + _alias(fitz_old.Document, "convertToPDF", "convert_to_pdf") + _alias(fitz_old.Document, "copyPage", "copy_page") + _alias(fitz_old.Document, "deletePage", "delete_page") + _alias(fitz_old.Document, "deletePageRange", "delete_pages") + _alias(fitz_old.Document, "embeddedFileAdd", "embfile_add") + _alias(fitz_old.Document, "embeddedFileCount", "embfile_count") + _alias(fitz_old.Document, "embeddedFileDel", "embfile_del") + _alias(fitz_old.Document, "embeddedFileGet", "embfile_get") + _alias(fitz_old.Document, "embeddedFileInfo", "embfile_info") + _alias(fitz_old.Document, "embeddedFileNames", "embfile_names") + _alias(fitz_old.Document, "embeddedFileUpd", "embfile_upd") + _alias(fitz_old.Document, "extractFont", "extract_font") + _alias(fitz_old.Document, "extractImage", "extract_image") + _alias(fitz_old.Document, "findBookmark", "find_bookmark") + _alias(fitz_old.Document, "fullcopyPage", "fullcopy_page") + _alias(fitz_old.Document, "getCharWidths", "get_char_widths") + _alias(fitz_old.Document, "getOCGs", "get_ocgs") + _alias(fitz_old.Document, "getPageFontList", "get_page_fonts") + _alias(fitz_old.Document, "getPageImageList", "get_page_images") + _alias(fitz_old.Document, "getPagePixmap", "get_page_pixmap") + _alias(fitz_old.Document, "getPageText", "get_page_text") + _alias(fitz_old.Document, "getPageXObjectList", "get_page_xobjects") + _alias(fitz_old.Document, "getSigFlags", "get_sigflags") + _alias(fitz_old.Document, "getToC", "get_toc") + _alias(fitz_old.Document, "getXmlMetadata", "get_xml_metadata") + _alias(fitz_old.Document, "insertPage", "insert_page") + _alias(fitz_old.Document, "insertPDF", "insert_pdf") + _alias(fitz_old.Document, "isDirty", "is_dirty") + _alias(fitz_old.Document, "isFormPDF", "is_form_pdf") + _alias(fitz_old.Document, "isPDF", "is_pdf") + _alias(fitz_old.Document, "isReflowable", "is_reflowable") + _alias(fitz_old.Document, "isRepaired", "is_repaired") + _alias(fitz_old.Document, "isStream", "xref_is_stream") + _alias(fitz_old.Document, "is_stream", "xref_is_stream") + _alias(fitz_old.Document, "lastLocation", "last_location") + _alias(fitz_old.Document, "loadPage", "load_page") + _alias(fitz_old.Document, "makeBookmark", "make_bookmark") + _alias(fitz_old.Document, "metadataXML", "xref_xml_metadata") + _alias(fitz_old.Document, "movePage", "move_page") + _alias(fitz_old.Document, "needsPass", "needs_pass") + _alias(fitz_old.Document, "newPage", "new_page") + _alias(fitz_old.Document, "nextLocation", "next_location") + _alias(fitz_old.Document, "pageCount", "page_count") + _alias(fitz_old.Document, "pageCropBox", "page_cropbox") + _alias(fitz_old.Document, "pageXref", "page_xref") + _alias(fitz_old.Document, "PDFCatalog", "pdf_catalog") + _alias(fitz_old.Document, "PDFTrailer", "pdf_trailer") + _alias(fitz_old.Document, 
"previousLocation", "prev_location") + _alias(fitz_old.Document, "resolveLink", "resolve_link") + _alias(fitz_old.Document, "searchPageFor", "search_page_for") + _alias(fitz_old.Document, "setLanguage", "set_language") + _alias(fitz_old.Document, "setMetadata", "set_metadata") + _alias(fitz_old.Document, "setToC", "set_toc") + _alias(fitz_old.Document, "setXmlMetadata", "set_xml_metadata") + _alias(fitz_old.Document, "updateObject", "update_object") + _alias(fitz_old.Document, "updateStream", "update_stream") + _alias(fitz_old.Document, "xrefLength", "xref_length") + _alias(fitz_old.Document, "xrefObject", "xref_object") + _alias(fitz_old.Document, "xrefStream", "xref_stream") + _alias(fitz_old.Document, "xrefStreamRaw", "xref_stream_raw") + + # deprecated Page aliases + _alias(fitz_old.Page, "_isWrapped", "is_wrapped") + _alias(fitz_old.Page, "addCaretAnnot", "add_caret_annot") + _alias(fitz_old.Page, "addCircleAnnot", "add_circle_annot") + _alias(fitz_old.Page, "addFileAnnot", "add_file_annot") + _alias(fitz_old.Page, "addFreetextAnnot", "add_freetext_annot") + _alias(fitz_old.Page, "addHighlightAnnot", "add_highlight_annot") + _alias(fitz_old.Page, "addInkAnnot", "add_ink_annot") + _alias(fitz_old.Page, "addLineAnnot", "add_line_annot") + _alias(fitz_old.Page, "addPolygonAnnot", "add_polygon_annot") + _alias(fitz_old.Page, "addPolylineAnnot", "add_polyline_annot") + _alias(fitz_old.Page, "addRectAnnot", "add_rect_annot") + _alias(fitz_old.Page, "addRedactAnnot", "add_redact_annot") + _alias(fitz_old.Page, "addSquigglyAnnot", "add_squiggly_annot") + _alias(fitz_old.Page, "addStampAnnot", "add_stamp_annot") + _alias(fitz_old.Page, "addStrikeoutAnnot", "add_strikeout_annot") + _alias(fitz_old.Page, "addTextAnnot", "add_text_annot") + _alias(fitz_old.Page, "addUnderlineAnnot", "add_underline_annot") + _alias(fitz_old.Page, "addWidget", "add_widget") + _alias(fitz_old.Page, "cleanContents", "clean_contents") + _alias(fitz_old.Page, "CropBox", "cropbox") + _alias(fitz_old.Page, "CropBoxPosition", "cropbox_position") + _alias(fitz_old.Page, "deleteAnnot", "delete_annot") + _alias(fitz_old.Page, "deleteLink", "delete_link") + _alias(fitz_old.Page, "deleteWidget", "delete_widget") + _alias(fitz_old.Page, "derotationMatrix", "derotation_matrix") + _alias(fitz_old.Page, "drawBezier", "draw_bezier") + _alias(fitz_old.Page, "drawCircle", "draw_circle") + _alias(fitz_old.Page, "drawCurve", "draw_curve") + _alias(fitz_old.Page, "drawLine", "draw_line") + _alias(fitz_old.Page, "drawOval", "draw_oval") + _alias(fitz_old.Page, "drawPolyline", "draw_polyline") + _alias(fitz_old.Page, "drawQuad", "draw_quad") + _alias(fitz_old.Page, "drawRect", "draw_rect") + _alias(fitz_old.Page, "drawSector", "draw_sector") + _alias(fitz_old.Page, "drawSquiggle", "draw_squiggle") + _alias(fitz_old.Page, "drawZigzag", "draw_zigzag") + _alias(fitz_old.Page, "firstAnnot", "first_annot") + _alias(fitz_old.Page, "firstLink", "first_link") + _alias(fitz_old.Page, "firstWidget", "first_widget") + _alias(fitz_old.Page, "getContents", "get_contents") + _alias(fitz_old.Page, "getDisplayList", "get_displaylist") + _alias(fitz_old.Page, "getDrawings", "get_drawings") + _alias(fitz_old.Page, "getFontList", "get_fonts") + _alias(fitz_old.Page, "getImageBbox", "get_image_bbox") + _alias(fitz_old.Page, "getImageList", "get_images") + _alias(fitz_old.Page, "getLinks", "get_links") + _alias(fitz_old.Page, "getPixmap", "get_pixmap") + _alias(fitz_old.Page, "getSVGimage", "get_svg_image") + _alias(fitz_old.Page, "getText", "get_text") + 
_alias(fitz_old.Page, "getTextBlocks", "get_text_blocks") + _alias(fitz_old.Page, "getTextbox", "get_textbox") + _alias(fitz_old.Page, "getTextPage", "get_textpage") + _alias(fitz_old.Page, "getTextWords", "get_text_words") + _alias(fitz_old.Page, "insertFont", "insert_font") + _alias(fitz_old.Page, "insertImage", "insert_image") + _alias(fitz_old.Page, "insertLink", "insert_link") + _alias(fitz_old.Page, "insertText", "insert_text") + _alias(fitz_old.Page, "insertTextbox", "insert_textbox") + _alias(fitz_old.Page, "loadAnnot", "load_annot") + _alias(fitz_old.Page, "loadLinks", "load_links") + _alias(fitz_old.Page, "MediaBox", "mediabox") + _alias(fitz_old.Page, "MediaBoxSize", "mediabox_size") + _alias(fitz_old.Page, "newShape", "new_shape") + _alias(fitz_old.Page, "readContents", "read_contents") + _alias(fitz_old.Page, "rotationMatrix", "rotation_matrix") + _alias(fitz_old.Page, "searchFor", "search_for") + _alias(fitz_old.Page, "setCropBox", "set_cropbox") + _alias(fitz_old.Page, "setMediaBox", "set_mediabox") + _alias(fitz_old.Page, "setRotation", "set_rotation") + _alias(fitz_old.Page, "showPDFpage", "show_pdf_page") + _alias(fitz_old.Page, "transformationMatrix", "transformation_matrix") + _alias(fitz_old.Page, "updateLink", "update_link") + _alias(fitz_old.Page, "wrapContents", "wrap_contents") + _alias(fitz_old.Page, "writeText", "write_text") + + # deprecated Shape aliases + _alias(fitz_old.utils.Shape, "drawBezier", "draw_bezier") + _alias(fitz_old.utils.Shape, "drawCircle", "draw_circle") + _alias(fitz_old.utils.Shape, "drawCurve", "draw_curve") + _alias(fitz_old.utils.Shape, "drawLine", "draw_line") + _alias(fitz_old.utils.Shape, "drawOval", "draw_oval") + _alias(fitz_old.utils.Shape, "drawPolyline", "draw_polyline") + _alias(fitz_old.utils.Shape, "drawQuad", "draw_quad") + _alias(fitz_old.utils.Shape, "drawRect", "draw_rect") + _alias(fitz_old.utils.Shape, "drawSector", "draw_sector") + _alias(fitz_old.utils.Shape, "drawSquiggle", "draw_squiggle") + _alias(fitz_old.utils.Shape, "drawZigzag", "draw_zigzag") + _alias(fitz_old.utils.Shape, "insertText", "insert_text") + _alias(fitz_old.utils.Shape, "insertTextbox", "insert_textbox") + + # deprecated Annot aliases + _alias(fitz_old.Annot, "getText", "get_text") + _alias(fitz_old.Annot, "getTextbox", "get_textbox") + _alias(fitz_old.Annot, "fileGet", "get_file") + _alias(fitz_old.Annot, "fileUpd", "update_file") + _alias(fitz_old.Annot, "getPixmap", "get_pixmap") + _alias(fitz_old.Annot, "getTextPage", "get_textpage") + _alias(fitz_old.Annot, "lineEnds", "line_ends") + _alias(fitz_old.Annot, "setBlendMode", "set_blendmode") + _alias(fitz_old.Annot, "setBorder", "set_border") + _alias(fitz_old.Annot, "setColors", "set_colors") + _alias(fitz_old.Annot, "setFlags", "set_flags") + _alias(fitz_old.Annot, "setInfo", "set_info") + _alias(fitz_old.Annot, "setLineEnds", "set_line_ends") + _alias(fitz_old.Annot, "setName", "set_name") + _alias(fitz_old.Annot, "setOpacity", "set_opacity") + _alias(fitz_old.Annot, "setRect", "set_rect") + _alias(fitz_old.Annot, "setOC", "set_oc") + _alias(fitz_old.Annot, "soundGet", "get_sound") + + # deprecated TextWriter aliases + _alias(fitz_old.TextWriter, "writeText", "write_text") + _alias(fitz_old.TextWriter, "fillTextbox", "fill_textbox") + + # deprecated DisplayList aliases + _alias(fitz_old.DisplayList, "getPixmap", "get_pixmap") + _alias(fitz_old.DisplayList, "getTextPage", "get_textpage") + + # deprecated Pixmap aliases + _alias(fitz_old.Pixmap, "setAlpha", "set_alpha") + _alias(fitz_old.Pixmap, 
"gammaWith", "gamma_with") + _alias(fitz_old.Pixmap, "tintWith", "tint_with") + _alias(fitz_old.Pixmap, "clearWith", "clear_with") + _alias(fitz_old.Pixmap, "copyPixmap", "copy") + _alias(fitz_old.Pixmap, "getImageData", "tobytes") + _alias(fitz_old.Pixmap, "getPNGData", "tobytes") + _alias(fitz_old.Pixmap, "getPNGdata", "tobytes") + _alias(fitz_old.Pixmap, "writeImage", "save") + _alias(fitz_old.Pixmap, "writePNG", "save") + _alias(fitz_old.Pixmap, "pillowWrite", "pil_save") + _alias(fitz_old.Pixmap, "pillowData", "pil_tobytes") + _alias(fitz_old.Pixmap, "invertIRect", "invert_irect") + _alias(fitz_old.Pixmap, "setPixel", "set_pixel") + _alias(fitz_old.Pixmap, "setOrigin", "set_origin") + _alias(fitz_old.Pixmap, "setRect", "set_rect") + _alias(fitz_old.Pixmap, "setResolution", "set_dpi") + + # deprecated geometry aliases + _alias(fitz_old.Rect, "getArea", "get_area") + _alias(fitz_old.IRect, "getArea", "get_area") + _alias(fitz_old.Rect, "getRectArea", "get_area") + _alias(fitz_old.IRect, "getRectArea", "get_area") + _alias(fitz_old.Rect, "includePoint", "include_point") + _alias(fitz_old.IRect, "includePoint", "include_point") + _alias(fitz_old.Rect, "includeRect", "include_rect") + _alias(fitz_old.IRect, "includeRect", "include_rect") + _alias(fitz_old.Rect, "isInfinite", "is_infinite") + _alias(fitz_old.IRect, "isInfinite", "is_infinite") + _alias(fitz_old.Rect, "isEmpty", "is_empty") + _alias(fitz_old.IRect, "isEmpty", "is_empty") + _alias(fitz_old.Quad, "isEmpty", "is_empty") + _alias(fitz_old.Quad, "isRectangular", "is_rectangular") + _alias(fitz_old.Quad, "isConvex", "is_convex") + _alias(fitz_old.Matrix, "isRectilinear", "is_rectilinear") + _alias(fitz_old.Matrix, "preRotate", "prerotate") + _alias(fitz_old.Matrix, "preScale", "prescale") + _alias(fitz_old.Matrix, "preShear", "preshear") + _alias(fitz_old.Matrix, "preTranslate", "pretranslate") + + # deprecated other aliases + _alias(fitz_old.Outline, "isExternal", "is_external") + _alias(fitz_old.Outline, "isOpen", "is_open") + _alias(fitz_old.Link, "isExternal", "is_external") + _alias(fitz_old.Link, "setBorder", "set_border") + _alias(fitz_old.Link, "setColors", "set_colors") + _alias(fitz, "getPDFstr", "get_pdf_str") + _alias(fitz, "getPDFnow", "get_pdf_now") + _alias(fitz, "PaperSize", "paper_size") + _alias(fitz, "PaperRect", "paper_rect") + _alias(fitz, "paperSizes", "paper_sizes") + _alias(fitz, "ImageProperties", "image_profile") + _alias(fitz, "planishLine", "planish_line") + _alias(fitz, "getTextLength", "get_text_length") + _alias(fitz, "getTextlength", "get_text_length") + + +fitz_old.__doc__ = """ +PyMuPDF %s: Python bindings for the MuPDF %s library. +Version date: %s. +Built for Python %i.%i on %s (%i-bit). 
+""" % ( + fitz_old.VersionBind, + fitz_old.VersionFitz, + fitz_old.VersionDate, + sys.version_info[0], + sys.version_info[1], + sys.platform, + 64 if sys.maxsize > 2**32 else 32, +) + +if VersionBind.startswith("1.19"): # don't generate aliases after v1.19.* + restore_aliases() + +pdfcolor = dict( + [ + (k, (r / 255, g / 255, b / 255)) + for k, (r, g, b) in fitz_old.utils.getColorInfoDict().items() + ] +) +__version__ = fitz_old.VersionBind diff -r 000000000000 -r 1d09e1dec1d9 src_classic/__main__.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/__main__.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1136 @@ +# ----------------------------------------------------------------------------- +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# Part of "PyMuPDF", Python bindings for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. +# ----------------------------------------------------------------------------- +import argparse +import bisect +import os +import sys +import statistics +from typing import Dict, List, Set, Tuple + +import fitz +from fitz.fitz import ( + TEXT_INHIBIT_SPACES, + TEXT_PRESERVE_LIGATURES, + TEXT_PRESERVE_WHITESPACE, +) + +mycenter = lambda x: (" %s " % x).center(75, "-") + + +def recoverpix(doc, item): + """Return image for a given XREF.""" + x = item[0] # xref of PDF image + s = item[1] # xref of its /SMask + if s == 0: # no smask: use direct image output + return doc.extract_image(x) + + def getimage(pix): + if pix.colorspace.n != 4: + return pix + tpix = fitz.Pixmap(fitz.csRGB, pix) + return tpix + + # we need to reconstruct the alpha channel with the smask + pix1 = fitz.Pixmap(doc, x) + pix2 = fitz.Pixmap(doc, s) # create pixmap of the /SMask entry + + """Sanity check: + - both pixmaps must have the same rectangle + - both pixmaps must have alpha=0 + - pix2 must consist of 1 byte per pixel + """ + if not (pix1.irect == pix2.irect and pix1.alpha == pix2.alpha == 0 and pix2.n == 1): + print("Warning: unsupported /SMask %i for %i:" % (s, x)) + print(pix2) + pix2 = None + return getimage(pix1) # return the pixmap as is + + pix = fitz.Pixmap(pix1) # copy of pix1, with an alpha channel added + pix.set_alpha(pix2.samples) # treat pix2.samples as the alpha values + pix1 = pix2 = None # free temp pixmaps + + # we may need to adjust something for CMYK pixmaps here: + return getimage(pix) + + +def open_file(filename, password, show=False, pdf=True): + """Open and authenticate a document.""" + doc = fitz.open(filename) + if not doc.is_pdf and pdf is True: + sys.exit("this command supports PDF files only") + rc = -1 + if not doc.needs_pass: + return doc + if password: + rc = doc.authenticate(password) + if not rc: + sys.exit("authentication unsuccessful") + if show is True: + print("authenticated as %s" % "owner" if rc > 2 else "user") + else: + sys.exit("'%s' requires a password" % doc.name) + return doc + + +def print_dict(item): + """Print a Python dictionary.""" + l = max([len(k) for k in item.keys()]) + 1 + for k, v in item.items(): + msg = "%s: %s" % (k.rjust(l), v) + print(msg) + return + + +def print_xref(doc, xref): + """Print an object given by XREF number. + + Simulate the PDF source in "pretty" format. + For a stream also print its size. 
+ """ + print("%i 0 obj" % xref) + xref_str = doc.xref_object(xref) + print(xref_str) + if doc.xref_is_stream(xref): + temp = xref_str.split() + try: + idx = temp.index("/Length") + 1 + size = temp[idx] + if size.endswith("0 R"): + size = "unknown" + except: + size = "unknown" + print("stream\n...%s bytes" % size) + print("endstream") + print("endobj") + + +def get_list(rlist, limit, what="page"): + """Transform a page / xref specification into a list of integers. + + Args + ---- + rlist: (str) the specification + limit: maximum number, i.e. number of pages, number of objects + what: a string to be used in error messages + Returns + ------- + A list of integers representing the specification. + """ + N = str(limit - 1) + rlist = rlist.replace("N", N).replace(" ", "") + rlist_arr = rlist.split(",") + out_list = [] + for seq, item in enumerate(rlist_arr): + n = seq + 1 + if item.isdecimal(): # a single integer + i = int(item) + if 1 <= i < limit: + out_list.append(int(item)) + else: + sys.exit("bad %s specification at item %i" % (what, n)) + continue + try: # this must be a range now, and all of the following must work: + i1, i2 = item.split("-") # will fail if not 2 items produced + i1 = int(i1) # will fail on non-integers + i2 = int(i2) + except: + sys.exit("bad %s range specification at item %i" % (what, n)) + + if not (1 <= i1 < limit and 1 <= i2 < limit): + sys.exit("bad %s range specification at item %i" % (what, n)) + + if i1 == i2: # just in case: a range of equal numbers + out_list.append(i1) + continue + + if i1 < i2: # first less than second + out_list += list(range(i1, i2 + 1)) + else: # first larger than second + out_list += list(range(i1, i2 - 1, -1)) + + return out_list + + +def show(args): + doc = open_file(args.input, args.password, True) + size = os.path.getsize(args.input) / 1024 + flag = "KB" + if size > 1000: + size /= 1024 + flag = "MB" + size = round(size, 1) + meta = doc.metadata + print( + "'%s', pages: %i, objects: %i, %g %s, %s, encryption: %s" + % ( + args.input, + doc.page_count, + doc.xref_length() - 1, + size, + flag, + meta["format"], + meta["encryption"], + ) + ) + n = doc.is_form_pdf + if n > 0: + s = doc.get_sigflags() + print( + "document contains %i root form fields and is %ssigned" + % (n, "not " if s != 3 else "") + ) + n = doc.embfile_count() + if n > 0: + print("document contains %i embedded files" % n) + print() + if args.catalog: + print(mycenter("PDF catalog")) + xref = doc.pdf_catalog() + print_xref(doc, xref) + print() + if args.metadata: + print(mycenter("PDF metadata")) + print_dict(doc.metadata) + print() + if args.xrefs: + print(mycenter("object information")) + xrefl = get_list(args.xrefs, doc.xref_length(), what="xref") + for xref in xrefl: + print_xref(doc, xref) + print() + if args.pages: + print(mycenter("page information")) + pagel = get_list(args.pages, doc.page_count + 1) + for pno in pagel: + n = pno - 1 + xref = doc.page_xref(n) + print("Page %i:" % pno) + print_xref(doc, xref) + print() + if args.trailer: + print(mycenter("PDF trailer")) + print(doc.pdf_trailer()) + print() + doc.close() + + +def clean(args): + doc = open_file(args.input, args.password, pdf=True) + encryption = args.encryption + encrypt = ("keep", "none", "rc4-40", "rc4-128", "aes-128", "aes-256").index( + encryption + ) + + if not args.pages: # simple cleaning + doc.save( + args.output, + garbage=args.garbage, + deflate=args.compress, + pretty=args.pretty, + clean=args.sanitize, + ascii=args.ascii, + linear=args.linear, + encryption=encrypt, + owner_pw=args.owner, 
+ user_pw=args.user, + permissions=args.permission, + ) + return + + # create sub document from page numbers + pages = get_list(args.pages, doc.page_count + 1) + outdoc = fitz.open() + for pno in pages: + n = pno - 1 + outdoc.insert_pdf(doc, from_page=n, to_page=n) + outdoc.save( + args.output, + garbage=args.garbage, + deflate=args.compress, + pretty=args.pretty, + clean=args.sanitize, + ascii=args.ascii, + linear=args.linear, + encryption=encrypt, + owner_pw=args.owner, + user_pw=args.user, + permissions=args.permission, + ) + doc.close() + outdoc.close() + return + + +def doc_join(args): + """Join pages from several PDF documents.""" + doc_list = args.input # a list of input PDFs + doc = fitz.open() # output PDF + for src_item in doc_list: # process one input PDF + src_list = src_item.split(",") + password = src_list[1] if len(src_list) > 1 else None + src = open_file(src_list[0], password, pdf=True) + pages = ",".join(src_list[2:]) # get 'pages' specifications + if pages: # if anything there, retrieve a list of desired pages + page_list = get_list(",".join(src_list[2:]), src.page_count + 1) + else: # take all pages + page_list = range(1, src.page_count + 1) + for i in page_list: + doc.insert_pdf(src, from_page=i - 1, to_page=i - 1) # copy each source page + src.close() + + doc.save(args.output, garbage=4, deflate=True) + doc.close() + + +def embedded_copy(args): + """Copy embedded files between PDFs.""" + doc = open_file(args.input, args.password, pdf=True) + if not doc.can_save_incrementally() and ( + not args.output or args.output == args.input + ): + sys.exit("cannot save PDF incrementally") + src = open_file(args.source, args.pwdsource) + names = set(args.name) if args.name else set() + src_names = set(src.embfile_names()) + if names: + if not names <= src_names: + sys.exit("not all names are contained in source") + else: + names = src_names + if not names: + sys.exit("nothing to copy") + intersect = names & set(doc.embfile_names()) # any equal name already in target? 
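For orientation, the copy step that `embedded_copy()` performs boils down to set algebra on entry names plus the regular embedded-file API. A condensed, standalone sketch of the same operation is shown below (file names are invented; unlike the command-line version above, it silently skips entries that already exist in the target instead of aborting):

    import fitz  # PyMuPDF

    src = fitz.open("source.pdf")    # hypothetical source PDF
    tgt = fitz.open("target.pdf")    # hypothetical receiving PDF
    for name in sorted(set(src.embfile_names()) - set(tgt.embfile_names())):
        info = src.embfile_info(name)
        tgt.embfile_add(name, src.embfile_get(name),
                        filename=info["filename"],
                        ufilename=info["ufilename"],
                        desc=info["desc"])
    src.close()
    tgt.save("target-copied.pdf", garbage=3)   # or tgt.saveIncr() for in-place
    tgt.close()
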
+ if intersect: + sys.exit("following names already exist in receiving PDF: %s" % str(intersect)) + + for item in names: + info = src.embfile_info(item) + buff = src.embfile_get(item) + doc.embfile_add( + item, + buff, + filename=info["filename"], + ufilename=info["ufilename"], + desc=info["desc"], + ) + print("copied entry '%s' from '%s'" % (item, src.name)) + src.close() + if args.output and args.output != args.input: + doc.save(args.output, garbage=3) + else: + doc.saveIncr() + doc.close() + + +def embedded_del(args): + """Delete an embedded file entry.""" + doc = open_file(args.input, args.password, pdf=True) + if not doc.can_save_incrementally() and ( + not args.output or args.output == args.input + ): + sys.exit("cannot save PDF incrementally") + + try: + doc.embfile_del(args.name) + except ValueError: + sys.exit("no such embedded file '%s'" % args.name) + if not args.output or args.output == args.input: + doc.save_incr() + else: + doc.save(args.output, garbage=1) + doc.close() + + +def embedded_get(args): + """Retrieve contents of an embedded file.""" + doc = open_file(args.input, args.password, pdf=True) + try: + stream = doc.embfile_get(args.name) + d = doc.embfile_info(args.name) + except ValueError: + sys.exit("no such embedded file '%s'" % args.name) + filename = args.output if args.output else d["filename"] + output = open(filename, "wb") + output.write(stream) + output.close() + print("saved entry '%s' as '%s'" % (args.name, filename)) + doc.close() + + +def embedded_add(args): + """Insert a new embedded file.""" + doc = open_file(args.input, args.password, pdf=True) + if not doc.can_save_incrementally() and ( + args.output is None or args.output == args.input + ): + sys.exit("cannot save PDF incrementally") + + try: + doc.embfile_del(args.name) + sys.exit("entry '%s' already exists" % args.name) + except: + pass + + if not os.path.exists(args.path) or not os.path.isfile(args.path): + sys.exit("no such file '%s'" % args.path) + stream = open(args.path, "rb").read() + filename = args.path + ufilename = filename + if not args.desc: + desc = filename + else: + desc = args.desc + doc.embfile_add( + args.name, stream, filename=filename, ufilename=ufilename, desc=desc + ) + if not args.output or args.output == args.input: + doc.saveIncr() + else: + doc.save(args.output, garbage=3) + doc.close() + + +def embedded_upd(args): + """Update contents or metadata of an embedded file.""" + doc = open_file(args.input, args.password, pdf=True) + if not doc.can_save_incrementally() and ( + args.output is None or args.output == args.input + ): + sys.exit("cannot save PDF incrementally") + + try: + doc.embfile_info(args.name) + except: + sys.exit("no such embedded file '%s'" % args.name) + + if ( + args.path is not None + and os.path.exists(args.path) + and os.path.isfile(args.path) + ): + stream = open(args.path, "rb").read() + else: + stream = None + + if args.filename: + filename = args.filename + else: + filename = None + + if args.ufilename: + ufilename = args.ufilename + elif args.filename: + ufilename = args.filename + else: + ufilename = None + + if args.desc: + desc = args.desc + else: + desc = None + + doc.embfile_upd( + args.name, stream, filename=filename, ufilename=ufilename, desc=desc + ) + if args.output is None or args.output == args.input: + doc.saveIncr() + else: + doc.save(args.output, garbage=3) + doc.close() + + +def embedded_list(args): + """List embedded files.""" + doc = open_file(args.input, args.password, pdf=True) + names = doc.embfile_names() + if args.name is not 
None: + if args.name not in names: + sys.exit("no such embedded file '%s'" % args.name) + else: + print() + print( + "printing 1 of %i embedded file%s:" + % (len(names), "s" if len(names) > 1 else "") + ) + print() + print_dict(doc.embfile_info(args.name)) + print() + return + if not names: + print("'%s' contains no embedded files" % doc.name) + return + if len(names) > 1: + msg = "'%s' contains the following %i embedded files" % (doc.name, len(names)) + else: + msg = "'%s' contains the following embedded file" % doc.name + print(msg) + print() + for name in names: + if not args.detail: + print(name) + continue + _ = doc.embfile_info(name) + print_dict(doc.embfile_info(name)) + print() + doc.close() + + +def extract_objects(args): + """Extract images and / or fonts from a PDF.""" + if not args.fonts and not args.images: + sys.exit("neither fonts nor images requested") + doc = open_file(args.input, args.password, pdf=True) + + if args.pages: + pages = get_list(args.pages, doc.page_count + 1) + else: + pages = range(1, doc.page_count + 1) + + if not args.output: + out_dir = os.path.abspath(os.curdir) + else: + out_dir = args.output + if not (os.path.exists(out_dir) and os.path.isdir(out_dir)): + sys.exit("output directory %s does not exist" % out_dir) + + font_xrefs = set() # already saved fonts + image_xrefs = set() # already saved images + + for pno in pages: + if args.fonts: + itemlist = doc.get_page_fonts(pno - 1) + for item in itemlist: + xref = item[0] + if xref not in font_xrefs: + font_xrefs.add(xref) + fontname, ext, _, buffer = doc.extract_font(xref) + if ext == "n/a" or not buffer: + continue + outname = os.path.join( + out_dir, f"{fontname.replace(' ', '-')}-{xref}.{ext}" + ) + outfile = open(outname, "wb") + outfile.write(buffer) + outfile.close() + buffer = None + if args.images: + itemlist = doc.get_page_images(pno - 1) + for item in itemlist: + xref = item[0] + if xref not in image_xrefs: + image_xrefs.add(xref) + pix = recoverpix(doc, item) + if type(pix) is dict: + ext = pix["ext"] + imgdata = pix["image"] + outname = os.path.join(out_dir, "img-%i.%s" % (xref, ext)) + outfile = open(outname, "wb") + outfile.write(imgdata) + outfile.close() + else: + outname = os.path.join(out_dir, "img-%i.png" % xref) + pix2 = ( + pix + if pix.colorspace.n < 4 + else fitz.Pixmap(fitz.csRGB, pix) + ) + pix2.save(outname) + + if args.fonts: + print("saved %i fonts to '%s'" % (len(font_xrefs), out_dir)) + if args.images: + print("saved %i images to '%s'" % (len(image_xrefs), out_dir)) + doc.close() + + +def page_simple(page, textout, GRID, fontsize, noformfeed, skip_empty, flags): + eop = b"\n" if noformfeed else bytes([12]) + text = page.get_text("text", flags=flags) + if not text: + if not skip_empty: + textout.write(eop) # write formfeed + return + textout.write(text.encode("utf8", errors="surrogatepass")) + textout.write(eop) + return + + +def page_blocksort(page, textout, GRID, fontsize, noformfeed, skip_empty, flags): + eop = b"\n" if noformfeed else bytes([12]) + blocks = page.get_text("blocks", flags=flags) + if blocks == []: + if not skip_empty: + textout.write(eop) # write formfeed + return + blocks.sort(key=lambda b: (b[3], b[0])) + for b in blocks: + textout.write(b[4].encode("utf8", errors="surrogatepass")) + textout.write(eop) + return + + +def page_layout(page, textout, GRID, fontsize, noformfeed, skip_empty, flags): + eop = b"\n" if noformfeed else bytes([12]) + + # -------------------------------------------------------------------- + def find_line_index(values: List[int], 
value: int) -> int: + """Find the right row coordinate. + + Args: + values: (list) y-coordinates of rows. + value: (int) lookup for this value (y-origin of char). + Returns: + y-ccordinate of appropriate line for value. + """ + i = bisect.bisect_right(values, value) + if i: + return values[i - 1] + raise RuntimeError("Line for %g not found in %s" % (value, values)) + + # -------------------------------------------------------------------- + def curate_rows(rows: Set[int], GRID) -> List: + rows = list(rows) + rows.sort() # sort ascending + nrows = [rows[0]] + for h in rows[1:]: + if h >= nrows[-1] + GRID: # only keep significant differences + nrows.append(h) + return nrows # curated list of line bottom coordinates + + def process_blocks(blocks: List[Dict], page: fitz.Page): + rows = set() + page_width = page.rect.width + page_height = page.rect.height + rowheight = page_height + left = page_width + right = 0 + chars = [] + for block in blocks: + for line in block["lines"]: + if line["dir"] != (1, 0): # ignore non-horizontal text + continue + x0, y0, x1, y1 = line["bbox"] + if y1 < 0 or y0 > page.rect.height: # ignore if outside CropBox + continue + # upd row height + height = y1 - y0 + + if rowheight > height: + rowheight = height + for span in line["spans"]: + if span["size"] <= fontsize: + continue + for c in span["chars"]: + x0, _, x1, _ = c["bbox"] + cwidth = x1 - x0 + ox, oy = c["origin"] + oy = int(round(oy)) + rows.add(oy) + ch = c["c"] + if left > ox and ch != " ": + left = ox # update left coordinate + if right < x1: + right = x1 # update right coordinate + # handle ligatures: + if cwidth == 0 and chars != []: # potential ligature + old_ch, old_ox, old_oy, old_cwidth = chars[-1] + if old_oy == oy: # ligature + if old_ch != chr(0xFB00): # previous "ff" char lig? + lig = joinligature(old_ch + ch) # no + # convert to one of the 3-char ligatures: + elif ch == "i": + lig = chr(0xFB03) # "ffi" + elif ch == "l": + lig = chr(0xFB04) # "ffl" + else: # something wrong, leave old char in place + lig = old_ch + chars[-1] = (lig, old_ox, old_oy, old_cwidth) + continue + chars.append((ch, ox, oy, cwidth)) # all chars on page + return chars, rows, left, right, rowheight + + def joinligature(lig: str) -> str: + """Return ligature character for a given pair / triple of characters. + + Args: + lig: (str) 2/3 characters, e.g. "ff" + Returns: + Ligature, e.g. "ff" -> chr(0xFB00) + """ + + if lig == "ff": + return chr(0xFB00) + elif lig == "fi": + return chr(0xFB01) + elif lig == "fl": + return chr(0xFB02) + elif lig == "ffi": + return chr(0xFB03) + elif lig == "ffl": + return chr(0xFB04) + elif lig == "ft": + return chr(0xFB05) + elif lig == "st": + return chr(0xFB06) + return lig + + # -------------------------------------------------------------------- + def make_textline(left, slot, minslot, lchars): + """Produce the text of one output line. + + Args: + left: (float) left most coordinate used on page + slot: (float) avg width of one character in any font in use. + minslot: (float) min width for the characters in this line. + chars: (list[tuple]) characters of this line. 
+ Returns: + text: (str) text string for this line + """ + text = "" # we output this + old_char = "" + old_x1 = 0 # end coordinate of last char + old_ox = 0 # x-origin of last char + if minslot <= fitz.EPSILON: + raise RuntimeError("program error: minslot too small = %g" % minslot) + + for c in lchars: # loop over characters + char, ox, _, cwidth = c + ox = ox - left # its (relative) start coordinate + x1 = ox + cwidth # ending coordinate + + # eliminate overprint effect + if old_char == char and ox - old_ox <= cwidth * 0.2: + continue + + # omit spaces overlapping previous char + if char == " " and (old_x1 - ox) / cwidth > 0.8: + continue + + old_char = char + # close enough to previous? + if ox < old_x1 + minslot: # assume char adjacent to previous + text += char # append to output + old_x1 = x1 # new end coord + old_ox = ox # new origin.x + continue + + # else next char starts after some gap: + # fill in right number of spaces, so char is positioned + # in the right slot of the line + if char == " ": # rest relevant for non-space only + continue + delta = int(ox / slot) - len(text) + if ox > old_x1 and delta > 1: + text += " " * delta + # now append char + text += char + old_x1 = x1 # new end coordinate + old_ox = ox # new origin + return text.rstrip() + + # extract page text by single characters ("rawdict") + blocks = page.get_text("rawdict", flags=flags)["blocks"] + chars, rows, left, right, rowheight = process_blocks(blocks, page) + + if chars == []: + if not skip_empty: + textout.write(eop) # write formfeed + return + # compute list of line coordinates - ignoring small (GRID) differences + rows = curate_rows(rows, GRID) + + # sort all chars by x-coordinates, so every line will receive char info, + # sorted from left to right. + chars.sort(key=lambda c: c[1]) + + # populate the lines with their char info + lines = {} # key: y1-ccordinate, value: char list + for c in chars: + _, _, oy, _ = c + y = find_line_index(rows, oy) # y-coord of the right line + lchars = lines.get(y, []) # read line chars so far + lchars.append(c) # append this char + lines[y] = lchars # write back to line + + # ensure line coordinates are ascending + keys = list(lines.keys()) + keys.sort() + + # ------------------------------------------------------------------------- + # Compute "char resolution" for the page: the char width corresponding to + # 1 text char position on output - call it 'slot'. + # For each line, compute median of its char widths. The minimum across all + # lines is 'slot'. + # The minimum char width of each line is used to determine if spaces must + # be inserted in between two characters. 
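    #
    # A small worked illustration of this idea (invented numbers, not part of
    # the original source): given per-line character widths
    #
    #     import statistics
    #     line_widths = {10.0: [5.0, 5.2, 4.8], 22.0: [6.0, 6.1]}   # y -> widths
    #     slot = min(statistics.median(w) for w in line_widths.values())  # 5.0
    #     minslots = {y: min(w) for y, w in line_widths.items()}
    #
    # a character starting at ox = 25.0 after the text "ab" would be padded by
    # int(ox / slot) - len(text) = 5 - 2 = 3 spaces in make_textline().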
+ # ------------------------------------------------------------------------- + slot = right - left + minslots = {} + for k in keys: + lchars = lines[k] + ccount = len(lchars) + if ccount < 2: + minslots[k] = 1 + continue + widths = [c[3] for c in lchars] + widths.sort() + this_slot = statistics.median(widths) # take median value + if this_slot < slot: + slot = this_slot + minslots[k] = widths[0] + + # compute line advance in text output + rowheight = rowheight * (rows[-1] - rows[0]) / (rowheight * len(rows)) * 1.2 + rowpos = rows[0] # first line positioned here + textout.write(b"\n") + for k in keys: # walk through the lines + while rowpos < k: # honor distance between lines + textout.write(b"\n") + rowpos += rowheight + text = make_textline(left, slot, minslots[k], lines[k]) + textout.write((text + "\n").encode("utf8", errors="surrogatepass")) + rowpos = k + rowheight + + textout.write(eop) # write formfeed + + +def gettext(args): + doc = open_file(args.input, args.password, pdf=False) + pagel = get_list(args.pages, doc.page_count + 1) + output = args.output + if output == None: + filename, _ = os.path.splitext(doc.name) + output = filename + ".txt" + textout = open(output, "wb") + flags = TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE + if args.convert_white: + flags ^= TEXT_PRESERVE_WHITESPACE + if args.noligatures: + flags ^= TEXT_PRESERVE_LIGATURES + if args.extra_spaces: + flags ^= TEXT_INHIBIT_SPACES + func = { + "simple": page_simple, + "blocks": page_blocksort, + "layout": page_layout, + } + for pno in pagel: + page = doc[pno - 1] + func[args.mode]( + page, + textout, + args.grid, + args.fontsize, + args.noformfeed, + args.skip_empty, + flags=flags, + ) + + textout.close() + + +def main(): + """Define command configurations.""" + parser = argparse.ArgumentParser( + prog="fitz", + description=mycenter("Basic PyMuPDF Functions"), + ) + subps = parser.add_subparsers( + title="Subcommands", help="Enter 'command -h' for subcommand specific help" + ) + + # ------------------------------------------------------------------------- + # 'show' command + # ------------------------------------------------------------------------- + ps_show = subps.add_parser("show", description=mycenter("display PDF information")) + ps_show.add_argument("input", type=str, help="PDF filename") + ps_show.add_argument("-password", help="password") + ps_show.add_argument("-catalog", action="store_true", help="show PDF catalog") + ps_show.add_argument("-trailer", action="store_true", help="show PDF trailer") + ps_show.add_argument("-metadata", action="store_true", help="show PDF metadata") + ps_show.add_argument( + "-xrefs", type=str, help="show selected objects, format: 1,5-7,N" + ) + ps_show.add_argument( + "-pages", type=str, help="show selected pages, format: 1,5-7,50-N" + ) + ps_show.set_defaults(func=show) + + # ------------------------------------------------------------------------- + # 'clean' command + # ------------------------------------------------------------------------- + ps_clean = subps.add_parser( + "clean", description=mycenter("optimize PDF, or create sub-PDF if pages given") + ) + ps_clean.add_argument("input", type=str, help="PDF filename") + ps_clean.add_argument("output", type=str, help="output PDF filename") + ps_clean.add_argument("-password", help="password") + + ps_clean.add_argument( + "-encryption", + help="encryption method", + choices=("keep", "none", "rc4-40", "rc4-128", "aes-128", "aes-256"), + default="none", + ) + + ps_clean.add_argument("-owner", type=str, help="owner 
password") + ps_clean.add_argument("-user", type=str, help="user password") + + ps_clean.add_argument( + "-garbage", + type=int, + help="garbage collection level", + choices=range(5), + default=0, + ) + + ps_clean.add_argument( + "-compress", + action="store_true", + default=False, + help="compress (deflate) output", + ) + + ps_clean.add_argument( + "-ascii", action="store_true", default=False, help="ASCII encode binary data" + ) + + ps_clean.add_argument( + "-linear", + action="store_true", + default=False, + help="format for fast web display", + ) + + ps_clean.add_argument( + "-permission", type=int, default=-1, help="integer with permission levels" + ) + + ps_clean.add_argument( + "-sanitize", + action="store_true", + default=False, + help="sanitize / clean contents", + ) + ps_clean.add_argument( + "-pretty", action="store_true", default=False, help="prettify PDF structure" + ) + ps_clean.add_argument( + "-pages", help="output selected pages pages, format: 1,5-7,50-N" + ) + ps_clean.set_defaults(func=clean) + + # ------------------------------------------------------------------------- + # 'join' command + # ------------------------------------------------------------------------- + ps_join = subps.add_parser( + "join", + description=mycenter("join PDF documents"), + epilog="specify each input as 'filename[,password[,pages]]'", + ) + ps_join.add_argument("input", nargs="*", help="input filenames") + ps_join.add_argument("-output", required=True, help="output filename") + ps_join.set_defaults(func=doc_join) + + # ------------------------------------------------------------------------- + # 'extract' command + # ------------------------------------------------------------------------- + ps_extract = subps.add_parser( + "extract", description=mycenter("extract images and fonts to disk") + ) + ps_extract.add_argument("input", type=str, help="PDF filename") + ps_extract.add_argument("-images", action="store_true", help="extract images") + ps_extract.add_argument("-fonts", action="store_true", help="extract fonts") + ps_extract.add_argument( + "-output", help="folder to receive output, defaults to current" + ) + ps_extract.add_argument("-password", help="password") + ps_extract.add_argument( + "-pages", type=str, help="consider these pages only, format: 1,5-7,50-N" + ) + ps_extract.set_defaults(func=extract_objects) + + # ------------------------------------------------------------------------- + # 'embed-info' + # ------------------------------------------------------------------------- + ps_show = subps.add_parser( + "embed-info", description=mycenter("list embedded files") + ) + ps_show.add_argument("input", help="PDF filename") + ps_show.add_argument("-name", help="if given, report only this one") + ps_show.add_argument("-detail", action="store_true", help="detail information") + ps_show.add_argument("-password", help="password") + ps_show.set_defaults(func=embedded_list) + + # ------------------------------------------------------------------------- + # 'embed-add' command + # ------------------------------------------------------------------------- + ps_embed_add = subps.add_parser( + "embed-add", description=mycenter("add embedded file") + ) + ps_embed_add.add_argument("input", help="PDF filename") + ps_embed_add.add_argument("-password", help="password") + ps_embed_add.add_argument( + "-output", help="output PDF filename, incremental save if none" + ) + ps_embed_add.add_argument("-name", required=True, help="name of new entry") + ps_embed_add.add_argument("-path", required=True, 
help="path to data for new entry") + ps_embed_add.add_argument("-desc", help="description of new entry") + ps_embed_add.set_defaults(func=embedded_add) + + # ------------------------------------------------------------------------- + # 'embed-del' command + # ------------------------------------------------------------------------- + ps_embed_del = subps.add_parser( + "embed-del", description=mycenter("delete embedded file") + ) + ps_embed_del.add_argument("input", help="PDF filename") + ps_embed_del.add_argument("-password", help="password") + ps_embed_del.add_argument( + "-output", help="output PDF filename, incremental save if none" + ) + ps_embed_del.add_argument("-name", required=True, help="name of entry to delete") + ps_embed_del.set_defaults(func=embedded_del) + + # ------------------------------------------------------------------------- + # 'embed-upd' command + # ------------------------------------------------------------------------- + ps_embed_upd = subps.add_parser( + "embed-upd", + description=mycenter("update embedded file"), + epilog="except '-name' all parameters are optional", + ) + ps_embed_upd.add_argument("input", help="PDF filename") + ps_embed_upd.add_argument("-name", required=True, help="name of entry") + ps_embed_upd.add_argument("-password", help="password") + ps_embed_upd.add_argument( + "-output", help="Output PDF filename, incremental save if none" + ) + ps_embed_upd.add_argument("-path", help="path to new data for entry") + ps_embed_upd.add_argument("-filename", help="new filename to store in entry") + ps_embed_upd.add_argument( + "-ufilename", help="new unicode filename to store in entry" + ) + ps_embed_upd.add_argument("-desc", help="new description to store in entry") + ps_embed_upd.set_defaults(func=embedded_upd) + + # ------------------------------------------------------------------------- + # 'embed-extract' command + # ------------------------------------------------------------------------- + ps_embed_extract = subps.add_parser( + "embed-extract", description=mycenter("extract embedded file to disk") + ) + ps_embed_extract.add_argument("input", type=str, help="PDF filename") + ps_embed_extract.add_argument("-name", required=True, help="name of entry") + ps_embed_extract.add_argument("-password", help="password") + ps_embed_extract.add_argument( + "-output", help="output filename, default is stored name" + ) + ps_embed_extract.set_defaults(func=embedded_get) + + # ------------------------------------------------------------------------- + # 'embed-copy' command + # ------------------------------------------------------------------------- + ps_embed_copy = subps.add_parser( + "embed-copy", description=mycenter("copy embedded files between PDFs") + ) + ps_embed_copy.add_argument("input", type=str, help="PDF to receive embedded files") + ps_embed_copy.add_argument("-password", help="password of input") + ps_embed_copy.add_argument( + "-output", help="output PDF, incremental save to 'input' if omitted" + ) + ps_embed_copy.add_argument( + "-source", required=True, help="copy embedded files from here" + ) + ps_embed_copy.add_argument("-pwdsource", help="password of 'source' PDF") + ps_embed_copy.add_argument( + "-name", nargs="*", help="restrict copy to these entries" + ) + ps_embed_copy.set_defaults(func=embedded_copy) + + # ------------------------------------------------------------------------- + # 'textlayout' command + # ------------------------------------------------------------------------- + ps_gettext = subps.add_parser( + "gettext", 
description=mycenter("extract text in various formatting modes") + ) + ps_gettext.add_argument("input", type=str, help="input document filename") + ps_gettext.add_argument("-password", help="password for input document") + ps_gettext.add_argument( + "-mode", + type=str, + help="mode: simple, block sort, or layout (default)", + choices=("simple", "blocks", "layout"), + default="layout", + ) + ps_gettext.add_argument( + "-pages", + type=str, + help="select pages, format: 1,5-7,50-N", + default="1-N", + ) + ps_gettext.add_argument( + "-noligatures", + action="store_true", + help="expand ligature characters (default False)", + default=False, + ) + ps_gettext.add_argument( + "-convert-white", + action="store_true", + help="convert whitespace characters to white (default False)", + default=False, + ) + ps_gettext.add_argument( + "-extra-spaces", + action="store_true", + help="fill gaps with spaces (default False)", + default=False, + ) + ps_gettext.add_argument( + "-noformfeed", + action="store_true", + help="write linefeeds, no formfeeds (default False)", + default=False, + ) + ps_gettext.add_argument( + "-skip-empty", + action="store_true", + help="suppress pages with no text (default False)", + default=False, + ) + ps_gettext.add_argument( + "-output", + help="store text in this file (default inputfilename.txt)", + ) + ps_gettext.add_argument( + "-grid", + type=float, + help="merge lines if closer than this (default 2)", + default=2, + ) + ps_gettext.add_argument( + "-fontsize", + type=float, + help="only include text with a larger fontsize (default 3)", + default=3, + ) + ps_gettext.set_defaults(func=gettext) + + # ------------------------------------------------------------------------- + # start program + # ------------------------------------------------------------------------- + args = parser.parse_args() # create parameter arguments class + if not hasattr(args, "func"): # no function selected + parser.print_help() # so print top level help + else: + args.func(args) # execute requested command + + +if __name__ == "__main__": + main() diff -r 000000000000 -r 1d09e1dec1d9 src_classic/_config.h --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/_config.h Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,222 @@ +// Copyright (C) 2004-2021 Artifex Software, Inc. +// +// This file is part of MuPDF. +// +// MuPDF is free software: you can redistribute it and/or modify it under the +// terms of the GNU Affero General Public License as published by the Free +// Software Foundation, either version 3 of the License, or (at your option) +// any later version. +// +// MuPDF is distributed in the hope that it will be useful, but WITHOUT ANY +// WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +// FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more +// details. +// +// You should have received a copy of the GNU Affero General Public License +// along with MuPDF. If not, see +// +// Alternative licensing terms are available from the licensor. +// For commercial licensing, see or contact +// Artifex Software, Inc., 39 Mesa Street, Suite 108A, San Francisco, +// CA 94129, USA, for further information. + +#ifndef FZ_CONFIG_H + +#define FZ_CONFIG_H + +/** + Enable the following for spot (and hence overprint/overprint + simulation) capable rendering. This forces FZ_PLOTTERS_N on. +*/ +/* #define FZ_ENABLE_SPOT_RENDERING 1 */ + +/** + Choose which plotters we need. + By default we build all the plotters in. 
To avoid building + plotters in that aren't needed, define the unwanted + FZ_PLOTTERS_... define to 0. +*/ +/* #define FZ_PLOTTERS_G 1 */ +/* #define FZ_PLOTTERS_RGB 1 */ +/* #define FZ_PLOTTERS_CMYK 1 */ +/* #define FZ_PLOTTERS_N 1 */ + +/** + Choose which document agents to include. + By default all are enabled. To avoid building unwanted + ones, define FZ_ENABLE_... to 0. +*/ +/* #define FZ_ENABLE_PDF 1 */ +/* #define FZ_ENABLE_XPS 1 */ +/* #define FZ_ENABLE_SVG 1 */ +/* #define FZ_ENABLE_CBZ 1 */ +/* #define FZ_ENABLE_IMG 1 */ +/* #define FZ_ENABLE_HTML 1 */ +/* #define FZ_ENABLE_EPUB 1 */ + +/** + Choose which document writers to include. + By default all are enabled. To avoid building unwanted + ones, define FZ_ENABLE_..._OUTPUT to 0. +*/ +/* #define FZ_ENABLE_OCR_OUTPUT 1 */ +/* #define FZ_ENABLE_DOCX_OUTPUT 1 */ +/* #define FZ_ENABLE_ODT_OUTPUT 1 */ + +/** + Choose whether to enable ICC color profiles. +*/ +/* #define FZ_ENABLE_ICC 1 */ + +/** + Choose whether to enable JPEG2000 decoding. + By default, it is enabled, but due to frequent security + issues with the third party libraries we support disabling + it with this flag. +*/ +/* #define FZ_ENABLE_JPX 1 */ + +/** + Choose whether to enable JavaScript. + By default JavaScript is enabled both for mutool and PDF + interactivity. +*/ +/* #define FZ_ENABLE_JS 1 */ + +/** + Choose which fonts to include. + By default we include the base 14 PDF fonts, + DroidSansFallback from Android for CJK, and + Charis SIL from SIL for epub/html. + Enable the following defines to AVOID including + unwanted fonts. +*/ +/* To avoid all noto fonts except CJK, enable: */ +/* #define TOFU */ + +/* To skip the CJK font, enable: (this implicitly enables TOFU_CJK_EXT + * and TOFU_CJK_LANG) */ +/* #define TOFU_CJK */ + +/* To skip CJK Extension A, enable: (this implicitly enables + * TOFU_CJK_LANG) */ +#define TOFU_CJK_EXT 1 + +/* To skip CJK language specific fonts, enable: */ +/* #define TOFU_CJK_LANG */ + +/* To skip the Emoji font, enable: */ +/* #define TOFU_EMOJI */ + +/* To skip the ancient/historic scripts, enable: */ +/* #define TOFU_HISTORIC */ + +/* To skip the symbol font, enable: */ +/* #define TOFU_SYMBOL */ + +/* To skip the SIL fonts, enable: */ +/* #define TOFU_SIL */ + +/* To skip the Base14 fonts, enable: */ +/* #define TOFU_BASE14 */ +/* (You probably really don't want to do that except for measurement + * purposes!) 
*/ + +/* ---------- DO NOT EDIT ANYTHING UNDER THIS LINE ---------- */ + +#ifndef FZ_ENABLE_SPOT_RENDERING +#define FZ_ENABLE_SPOT_RENDERING 1 +#endif + +#if FZ_ENABLE_SPOT_RENDERING +#undef FZ_PLOTTERS_N +#define FZ_PLOTTERS_N 1 +#endif /* FZ_ENABLE_SPOT_RENDERING */ + +#ifndef FZ_PLOTTERS_G +#define FZ_PLOTTERS_G 1 +#endif /* FZ_PLOTTERS_G */ + +#ifndef FZ_PLOTTERS_RGB +#define FZ_PLOTTERS_RGB 1 +#endif /* FZ_PLOTTERS_RGB */ + +#ifndef FZ_PLOTTERS_CMYK +#define FZ_PLOTTERS_CMYK 1 +#endif /* FZ_PLOTTERS_CMYK */ + +#ifndef FZ_PLOTTERS_N +#define FZ_PLOTTERS_N 1 +#endif /* FZ_PLOTTERS_N */ + +/* We need at least 1 plotter defined */ +#if FZ_PLOTTERS_G == 0 && FZ_PLOTTERS_RGB == 0 && FZ_PLOTTERS_CMYK == 0 +#undef FZ_PLOTTERS_N +#define FZ_PLOTTERS_N 1 +#endif + +#ifndef FZ_ENABLE_PDF +#define FZ_ENABLE_PDF 1 +#endif /* FZ_ENABLE_PDF */ + +#ifndef FZ_ENABLE_XPS +#define FZ_ENABLE_XPS 1 +#endif /* FZ_ENABLE_XPS */ + +#ifndef FZ_ENABLE_SVG +#define FZ_ENABLE_SVG 1 +#endif /* FZ_ENABLE_SVG */ + +#ifndef FZ_ENABLE_CBZ +#define FZ_ENABLE_CBZ 1 +#endif /* FZ_ENABLE_CBZ */ + +#ifndef FZ_ENABLE_IMG +#define FZ_ENABLE_IMG 1 +#endif /* FZ_ENABLE_IMG */ + +#ifndef FZ_ENABLE_HTML +#define FZ_ENABLE_HTML 1 +#endif /* FZ_ENABLE_HTML */ + +#ifndef FZ_ENABLE_EPUB +#define FZ_ENABLE_EPUB 1 +#endif /* FZ_ENABLE_EPUB */ + +#ifndef FZ_ENABLE_OCR_OUTPUT +#define FZ_ENABLE_OCR_OUTPUT 1 +#endif /* FZ_ENABLE_OCR_OUTPUT */ + +#ifndef FZ_ENABLE_ODT_OUTPUT +#define FZ_ENABLE_ODT_OUTPUT 1 +#endif /* FZ_ENABLE_ODT_OUTPUT */ + +#ifndef FZ_ENABLE_DOCX_OUTPUT +#define FZ_ENABLE_DOCX_OUTPUT 1 +#endif /* FZ_ENABLE_DOCX_OUTPUT */ + +#ifndef FZ_ENABLE_JPX +#define FZ_ENABLE_JPX 1 +#endif /* FZ_ENABLE_JPX */ + +#ifndef FZ_ENABLE_JS +#define FZ_ENABLE_JS 1 +#endif /* FZ_ENABLE_JS */ + +#ifndef FZ_ENABLE_ICC +#define FZ_ENABLE_ICC 1 +#endif /* FZ_ENABLE_ICC */ + +/* If Epub and HTML are both disabled, disable SIL fonts */ +#if FZ_ENABLE_HTML == 0 && FZ_ENABLE_EPUB == 0 +#undef TOFU_SIL +#define TOFU_SIL +#endif + +#if !defined(HAVE_LEPTONICA) || !defined(HAVE_TESSERACT) +#ifndef OCR_DISABLED +#define OCR_DISABLED +#endif +#endif + +#endif /* FZ_CONFIG_H */ diff -r 000000000000 -r 1d09e1dec1d9 src_classic/fitz_old.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/fitz_old.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,15210 @@ +%module fitz +%pythonbegin %{ +%} +//------------------------------------------------------------------------ +// SWIG macros: handle fitz exceptions +//------------------------------------------------------------------------ +%define FITZEXCEPTION(meth, cond) +%exception meth +{ + $action + if (cond) { + return JM_ReturnException(gctx); + } +} +%enddef + + +%define FITZEXCEPTION2(meth, cond) +%exception meth +{ + $action + if (cond) { + const char *msg = fz_caught_message(gctx); + if (strcmp(msg, MSG_BAD_FILETYPE) == 0) { + PyErr_SetString(PyExc_ValueError, msg); + } else { + PyErr_SetString(JM_Exc_FileDataError, MSG_BAD_DOCUMENT); + } + return NULL; + } +} +%enddef + +//------------------------------------------------------------------------ +// SWIG macro: check that a document is not closed / encrypted +//------------------------------------------------------------------------ +%define CLOSECHECK(meth, doc) +%pythonprepend meth %{doc +if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted")%} +%enddef + +%define CLOSECHECK0(meth, doc) +%pythonprepend meth%{doc +if self.is_closed: + raise ValueError("document closed")%} +%enddef + 
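These guard macros work through SWIG's %pythonprepend: the quoted lines are pasted verbatim at the top of the generated Python wrapper, before the call into the C layer. For illustration, a method wrapped with CLOSECHECK behaves roughly like the sketch below, where `_fitz_impl` is only a stand-in for the SWIG-generated C call, not a real symbol:

    def _fitz_impl(doc, *args, **kwargs):
        """Stand-in for the C-level function that SWIG would actually call."""
        ...

    class Document:
        is_closed = False
        is_encrypted = False

        def save(self, filename, **kwargs):
            """Save the document."""            # doc text supplied to the macro
            # injected by CLOSECHECK(save, ...) via %pythonprepend:
            if self.is_closed or self.is_encrypted:
                raise ValueError("document closed or encrypted")
            return _fitz_impl(self, filename, **kwargs)

CLOSECHECK0 differs only in that its injected guard tests is_closed alone.
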
+//------------------------------------------------------------------------ +// SWIG macro: check if object has a valid parent +//------------------------------------------------------------------------ +%define PARENTCHECK(meth, doc) +%pythonprepend meth %{doc +CheckParent(self)%} +%enddef + + +//------------------------------------------------------------------------ +// SWIG macro: ensure object still exists +//------------------------------------------------------------------------ +%define ENSURE_OWNERSHIP(meth, doc) +%pythonprepend meth %{doc +EnsureOwnership(self)%} +%enddef + +%include "mupdf/fitz/version.h" + +%{ +#define MEMDEBUG 0 +#if MEMDEBUG == 1 + #define DEBUGMSG1(x) PySys_WriteStderr("[DEBUG] free %s ", x) + #define DEBUGMSG2 PySys_WriteStderr("... done!\n") +#else + #define DEBUGMSG1(x) + #define DEBUGMSG2 +#endif + +#ifndef FLT_EPSILON + #define FLT_EPSILON 1e-5 +#endif + +#define SWIG_FILE_WITH_INIT + +// JM_MEMORY controls what allocators we tell MuPDF to use when we call +// fz_new_context(): +// +// JM_MEMORY=0: MuPDF uses malloc()/free(). +// JM_MEMORY=1: MuPDF uses PyMem_Malloc()/PyMem_Free(). +// +// There are also a small number of places where we call malloc() or +// PyMem_Malloc() ourselves, depending on JM_MEMORY. +// +#define JM_MEMORY 0 + +#if JM_MEMORY == 1 + #define JM_Alloc(type, len) PyMem_New(type, len) + #define JM_Free(x) PyMem_Del(x) +#else + #define JM_Alloc(type, len) (type *) malloc(sizeof(type)*len) + #define JM_Free(x) free(x) +#endif + +#define EMPTY_STRING PyUnicode_FromString("") +#define EXISTS(x) (x != NULL && PyObject_IsTrue(x)==1) +#define RAISEPY(context, msg, exc) {JM_Exc_CurrentException=exc; fz_throw(context, FZ_ERROR_GENERIC, msg);} +#define ASSERT_PDF(cond) if (cond == NULL) RAISEPY(gctx, MSG_IS_NO_PDF, PyExc_RuntimeError) +#define ENSURE_OPERATION(ctx, pdf) if (!JM_have_operation(ctx, pdf)) RAISEPY(ctx, "No journalling operation started", PyExc_RuntimeError) +#define INRANGE(v, low, high) ((low) <= v && v <= (high)) +#define JM_BOOL(x) PyBool_FromLong((long) (x)) +#define JM_PyErr_Clear if (PyErr_Occurred()) PyErr_Clear() + +#define JM_StrAsChar(x) (char *)PyUnicode_AsUTF8(x) +#define JM_BinFromChar(x) PyBytes_FromString(x) +#define JM_BinFromCharSize(x, y) PyBytes_FromStringAndSize(x, (Py_ssize_t) y) + +#include +#include +#include +// freetype includes >> -------------------------------------------------- +#include +#include FT_FREETYPE_H +#ifdef FT_FONT_FORMATS_H +#include FT_FONT_FORMATS_H +#else +#include FT_XFREE86_H +#endif +#include FT_TRUETYPE_TABLES_H + +#ifndef FT_SFNT_HEAD +#define FT_SFNT_HEAD ft_sfnt_head +#endif +// << freetype includes -------------------------------------------------- + +void JM_delete_widget(fz_context *ctx, pdf_page *page, pdf_annot *annot); +static void JM_get_page_labels(fz_context *ctx, PyObject *liste, pdf_obj *nums); +static int DICT_SETITEMSTR_DROP(PyObject *dict, const char *key, PyObject *value); +static int LIST_APPEND_DROP(PyObject *list, PyObject *item); +static int LIST_APPEND_DROP(PyObject *list, PyObject *item); +static fz_irect JM_irect_from_py(PyObject *r); +static fz_matrix JM_matrix_from_py(PyObject *m); +static fz_point JM_normalize_vector(float x, float y); +static fz_point JM_point_from_py(PyObject *p); +static fz_quad JM_quad_from_py(PyObject *r); +static fz_rect JM_rect_from_py(PyObject *r); +static int JM_FLOAT_ITEM(PyObject *obj, Py_ssize_t idx, double *result); +static int JM_INT_ITEM(PyObject *obj, Py_ssize_t idx, int *result); +static PyObject 
*JM_py_from_irect(fz_irect r); +static PyObject *JM_py_from_matrix(fz_matrix m); +static PyObject *JM_py_from_point(fz_point p); +static PyObject *JM_py_from_quad(fz_quad q); +static PyObject *JM_py_from_rect(fz_rect r); +static void show(const char* prefix, PyObject* obj); + + +// additional headers ---------------------------------------------- +#if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR == 23 && FZ_VERSION_PATCH < 8 +pdf_obj *pdf_lookup_page_loc(fz_context *ctx, pdf_document *doc, int needle, pdf_obj **parentp, int *indexp); +fz_pixmap *fz_scale_pixmap(fz_context *ctx, fz_pixmap *src, float x, float y, float w, float h, const fz_irect *clip); +int fz_pixmap_size(fz_context *ctx, fz_pixmap *src); +void fz_subsample_pixmap(fz_context *ctx, fz_pixmap *tile, int factor); +void fz_copy_pixmap_rect(fz_context *ctx, fz_pixmap *dest, fz_pixmap *src, fz_irect b, const fz_default_colorspaces *default_cs); +void fz_write_pixmap_as_jpeg(fz_context *ctx, fz_output *out, fz_pixmap *pix, int jpg_quality); +#endif +static const float JM_font_ascender(fz_context *ctx, fz_font *font); +static const float JM_font_descender(fz_context *ctx, fz_font *font); +// end of additional headers -------------------------------------------- + +static PyObject *JM_mupdf_warnings_store; +static int JM_mupdf_show_errors; +static int JM_mupdf_show_warnings; +static PyObject *JM_Exc_FileDataError; +static PyObject *JM_Exc_CurrentException; +%} + +//------------------------------------------------------------------------ +// global context +//------------------------------------------------------------------------ +%init %{ + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR >= 22 + /* Stop Memento backtraces if we reach the Python interpreter. + `cfunction_call()` isn't the only way that Python calls C though, so we + might need extra calls to Memento_addBacktraceLimitFnname(). + + We put this inside `#ifdef MEMENTO` because memento.h's disabling macro + causes "warning: statement with no effect" from cc. */ + #ifdef MEMENTO + Memento_addBacktraceLimitFnname("cfunction_call"); + #endif + #endif + + /* + We end up with Memento leaks from fz_new_context()'s allocs even when our + atexit handler calls fz_drop_context(), so remove these from Memento's + accounting. 
+ */ + Memento_startLeaking(); +#if JM_MEMORY == 1 + gctx = fz_new_context(&JM_Alloc_Context, NULL, FZ_STORE_DEFAULT); +#else + gctx = fz_new_context(NULL, NULL, FZ_STORE_DEFAULT); +#endif + Memento_stopLeaking(); + if(!gctx) + { + PyErr_SetString(PyExc_RuntimeError, "Fatal error: cannot create global context."); + return NULL; + } + fz_register_document_handlers(gctx); + +//------------------------------------------------------------------------ +// START redirect stdout/stderr +//------------------------------------------------------------------------ +JM_mupdf_warnings_store = PyList_New(0); +JM_mupdf_show_errors = 1; +JM_mupdf_show_warnings = 0; +char user[] = "PyMuPDF"; +fz_set_warning_callback(gctx, JM_mupdf_warning, &user); +fz_set_error_callback(gctx, JM_mupdf_error, &user); +JM_Exc_FileDataError = NULL; +JM_Exc_CurrentException = PyExc_RuntimeError; +//------------------------------------------------------------------------ +// STOP redirect stdout/stderr +//------------------------------------------------------------------------ +// init global constants +//------------------------------------------------------------------------ +dictkey_align = PyUnicode_InternFromString("align"); +dictkey_ascender = PyUnicode_InternFromString("ascender"); +dictkey_bbox = PyUnicode_InternFromString("bbox"); +dictkey_blocks = PyUnicode_InternFromString("blocks"); +dictkey_bpc = PyUnicode_InternFromString("bpc"); +dictkey_c = PyUnicode_InternFromString("c"); +dictkey_chars = PyUnicode_InternFromString("chars"); +dictkey_color = PyUnicode_InternFromString("color"); +dictkey_colorspace = PyUnicode_InternFromString("colorspace"); +dictkey_content = PyUnicode_InternFromString("content"); +dictkey_creationDate = PyUnicode_InternFromString("creationDate"); +dictkey_cs_name = PyUnicode_InternFromString("cs-name"); +dictkey_da = PyUnicode_InternFromString("da"); +dictkey_dashes = PyUnicode_InternFromString("dashes"); +dictkey_desc = PyUnicode_InternFromString("desc"); +dictkey_desc = PyUnicode_InternFromString("descender"); +dictkey_descender = PyUnicode_InternFromString("descender"); +dictkey_dir = PyUnicode_InternFromString("dir"); +dictkey_effect = PyUnicode_InternFromString("effect"); +dictkey_ext = PyUnicode_InternFromString("ext"); +dictkey_filename = PyUnicode_InternFromString("filename"); +dictkey_fill = PyUnicode_InternFromString("fill"); +dictkey_flags = PyUnicode_InternFromString("flags"); +dictkey_font = PyUnicode_InternFromString("font"); +dictkey_glyph = PyUnicode_InternFromString("glyph"); +dictkey_height = PyUnicode_InternFromString("height"); +dictkey_id = PyUnicode_InternFromString("id"); +dictkey_image = PyUnicode_InternFromString("image"); +dictkey_items = PyUnicode_InternFromString("items"); +dictkey_length = PyUnicode_InternFromString("length"); +dictkey_lines = PyUnicode_InternFromString("lines"); +dictkey_matrix = PyUnicode_InternFromString("transform"); +dictkey_modDate = PyUnicode_InternFromString("modDate"); +dictkey_name = PyUnicode_InternFromString("name"); +dictkey_number = PyUnicode_InternFromString("number"); +dictkey_origin = PyUnicode_InternFromString("origin"); +dictkey_rect = PyUnicode_InternFromString("rect"); +dictkey_size = PyUnicode_InternFromString("size"); +dictkey_smask = PyUnicode_InternFromString("smask"); +dictkey_spans = PyUnicode_InternFromString("spans"); +dictkey_stroke = PyUnicode_InternFromString("stroke"); +dictkey_style = PyUnicode_InternFromString("style"); +dictkey_subject = PyUnicode_InternFromString("subject"); +dictkey_text = 
PyUnicode_InternFromString("text"); +dictkey_title = PyUnicode_InternFromString("title"); +dictkey_type = PyUnicode_InternFromString("type"); +dictkey_ufilename = PyUnicode_InternFromString("ufilename"); +dictkey_width = PyUnicode_InternFromString("width"); +dictkey_wmode = PyUnicode_InternFromString("wmode"); +dictkey_xref = PyUnicode_InternFromString("xref"); +dictkey_xres = PyUnicode_InternFromString("xres"); +dictkey_yres = PyUnicode_InternFromString("yres"); + +atexit( cleanup); +%} + +%header %{ +fz_context *gctx; + +static void cleanup() +{ + fz_drop_context( gctx); +} + +static int JM_UNIQUE_ID = 0; + +struct DeviceWrapper { + fz_device *device; + fz_display_list *list; +}; +%} + +//------------------------------------------------------------------------ +// include version information and several other helpers +//------------------------------------------------------------------------ +%pythoncode %{ +import sys +import io +import math +import os +import weakref +import hashlib +import typing +import binascii +import re +import tarfile +import zipfile +import pathlib +import string + +# PDF names must not contain these characters: +INVALID_NAME_CHARS = set(string.whitespace + "()<>[]{}/%" + chr(0)) + +TESSDATA_PREFIX = os.getenv("TESSDATA_PREFIX") +point_like = "point_like" +rect_like = "rect_like" +matrix_like = "matrix_like" +quad_like = "quad_like" + +# ByteString is gone from typing in 3.14. +# collections.abc.Buffer available from 3.12 only +try: + ByteString = typing.ByteString +except AttributeError: + ByteString = bytes | bytearray | memoryview + +AnyType = typing.Any +OptInt = typing.Union[int, None] +OptFloat = typing.Optional[float] +OptStr = typing.Optional[str] +OptDict = typing.Optional[dict] +OptBytes = typing.Optional[ByteString] +OptSeq = typing.Optional[typing.Sequence] + +try: + from pymupdf_fonts import fontdescriptors, fontbuffers + + fitz_fontdescriptors = fontdescriptors.copy() + for k in fitz_fontdescriptors.keys(): + fitz_fontdescriptors[k]["loader"] = fontbuffers[k] + del fontdescriptors, fontbuffers +except ImportError: + fitz_fontdescriptors = {} +%} +%include version.i +%include helper-git-versions.i +%include helper-defines.i +%include helper-globals.i +%include helper-geo-c.i +%include helper-other.i +%include helper-pixmap.i +%include helper-geo-py.i +%include helper-annot.i +%include helper-fields.i +%include helper-python.i +%include helper-portfolio.i +%include helper-select.i +%include helper-stext.i +%include helper-xobject.i +%include helper-pdfinfo.i +%include helper-convert.i +%include helper-fileobj.i +%include helper-devices.i + +%{ +// Declaring these structs here prevents gcc from generating warnings like: +// +// warning: 'struct Document' declared inside parameter list will not be visible outside of this definition or declaration +// +struct Colorspace; +struct Document; +struct Font; +struct Graftmap; +struct TextPage; +struct TextWriter; +struct DocumentWriter; +struct Xml; +struct Archive; +struct Story; +%} + +//------------------------------------------------------------------------ +// fz_document +//------------------------------------------------------------------------ +struct Document +{ + %extend + { + ~Document() + { + DEBUGMSG1("Document"); + fz_document *this_doc = (fz_document *) $self; + fz_drop_document(gctx, this_doc); + DEBUGMSG2; + } + FITZEXCEPTION2(Document, !result) + + %pythonprepend Document %{ + """Creates a document. Use 'open' as a synonym. 
+ + Notes: + Basic usages: + open() - new PDF document + open(filename) - string, pathlib.Path, or file object. + open(filename, fileype=type) - overwrite filename extension. + open(type, buffer) - type: extension, buffer: bytes object. + open(stream=buffer, filetype=type) - keyword version of previous. + Parameters rect, width, height, fontsize: layout reflowable + document on open (e.g. EPUB). Ignored if n/a. + """ + self.is_closed = False + self.is_encrypted = False + self.isEncrypted = False + self.metadata = None + self.FontInfos = [] + self.Graftmaps = {} + self.ShownPages = {} + self.InsertedImages = {} + self._page_refs = weakref.WeakValueDictionary() + + if not filename or type(filename) is str: + pass + elif hasattr(filename, "absolute"): + filename = str(filename) + elif hasattr(filename, "name"): + filename = filename.name + else: + msg = "bad filename" + raise TypeError(msg) + + if stream != None: + if type(stream) is bytes: + self.stream = stream + elif type(stream) is bytearray: + self.stream = bytes(stream) + elif type(stream) is io.BytesIO: + self.stream = stream.getvalue() + else: + msg = "bad type: 'stream'" + raise TypeError(msg) + stream = self.stream + if not (filename or filetype): + filename = "pdf" + else: + self.stream = None + + if filename and self.stream == None: + self.name = filename + from_file = True + else: + from_file = False + self.name = "" + + if from_file: + if not os.path.exists(filename): + msg = f"no such file: '{filename}'" + raise FileNotFoundError(msg) + elif not os.path.isfile(filename): + msg = f"'{filename}' is no file" + raise FileDataError(msg) + if from_file and os.path.getsize(filename) == 0 or type(self.stream) is bytes and len(self.stream) == 0: + msg = "cannot open empty document" + raise EmptyFileError(msg) + %} + %pythonappend Document %{ + if self.thisown: + self._graft_id = TOOLS.gen_id() + if self.needs_pass is True: + self.is_encrypted = True + self.isEncrypted = True + else: # we won't init until doc is decrypted + self.init_doc() + # the following hack detects invalid/empty SVG files, which else may lead + # to interpreter crashes + if filename and filename.lower().endswith("svg") or filetype and "svg" in filetype.lower(): + try: + _ = self.convert_to_pdf() # this seems to always work + except: + raise FileDataError("cannot open broken document") from None + %} + + Document(const char *filename=NULL, PyObject *stream=NULL, + const char *filetype=NULL, PyObject *rect=NULL, + float width=0, float height=0, + float fontsize=11) + { + int old_msg_option = JM_mupdf_show_errors; + JM_mupdf_show_errors = 0; + fz_document *doc = NULL; + const fz_document_handler *handler; + char *c = NULL; + char *magic = NULL; + size_t len = 0; + fz_stream *data = NULL; + float w = width, h = height; + fz_rect r = JM_rect_from_py(rect); + if (!fz_is_infinite_rect(r)) { + w = r.x1 - r.x0; + h = r.y1 - r.y0; + } + + fz_try(gctx) { + if (stream != Py_None) { // stream given, **MUST** be bytes! 
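A short usage sketch of the constructor variants listed in the docstring above (file names are placeholders; keyword arguments are accepted by the generated wrapper):

```python
import fitz
import pathlib

doc1 = fitz.open()                              # new, empty PDF
doc2 = fitz.open("input.pdf")                   # from a file path
raw = pathlib.Path("input.xps").read_bytes()
doc3 = fitz.open(stream=raw, filetype="xps")    # from memory, type given explicitly
doc4 = fitz.open("book.epub", width=400, height=600, fontsize=9)  # reflowable layout
```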
+ c = PyBytes_AS_STRING(stream); // just a pointer, no new obj + len = (size_t) PyBytes_Size(stream); + data = fz_open_memory(gctx, (const unsigned char *) c, len); + magic = (char *)filename; + if (!magic) magic = (char *)filetype; + handler = fz_recognize_document(gctx, magic); + if (!handler) { + RAISEPY(gctx, MSG_BAD_FILETYPE, PyExc_ValueError); + } + doc = fz_open_document_with_stream(gctx, magic, data); + } else { + if (filename && strlen(filename)) { + if (!filetype || strlen(filetype) == 0) { + doc = fz_open_document(gctx, filename); + } else { + handler = fz_recognize_document(gctx, filetype); + if (!handler) { + RAISEPY(gctx, MSG_BAD_FILETYPE, PyExc_ValueError); + } + #if FZ_VERSION_MINOR >= 24 + if (handler->open) + { + fz_stream* filename_stream = fz_open_file(gctx, filename); + fz_try(gctx) + { + doc = handler->open(gctx, filename_stream, NULL, NULL); + } + fz_always(gctx) + { + fz_drop_stream(gctx, filename_stream); + } + fz_catch(gctx) + { + fz_rethrow(gctx); + } + } + #else + if (handler->open) { + doc = handler->open(gctx, filename); + } else if (handler->open_with_stream) { + data = fz_open_file(gctx, filename); + doc = handler->open_with_stream(gctx, data); + } + #endif + } + } else { + pdf_document *pdf = pdf_create_document(gctx); + doc = (fz_document *) pdf; + } + } + } + fz_always(gctx) { + fz_drop_stream(gctx, data); + } + fz_catch(gctx) { + JM_mupdf_show_errors = old_msg_option; + return NULL; + } + if (w > 0 && h > 0) { + fz_layout_document(gctx, doc, w, h, fontsize); + } else if (fz_is_document_reflowable(gctx, doc)) { + fz_layout_document(gctx, doc, 400, 600, 11); + } + return (struct Document *) doc; + } + + + FITZEXCEPTION(load_page, !result) + %pythonprepend load_page %{ + """Load a page. + + 'page_id' is either a 0-based page number or a tuple (chapter, pno), + with chapter number and page number within that chapter. 
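The page addressing rules described in this docstring, as a usage sketch (hypothetical file name):

```python
import fitz

doc = fitz.open("input.pdf")     # hypothetical file
page = doc.load_page(0)          # first page, 0-based
last = doc.load_page(-1)         # negative numbers count from the end
same = doc[0]                    # sequence-style access maps to the same code
# documents with chapters (e.g. EPUB) also accept a (chapter, page) tuple:
# page = doc.load_page((1, 0))
```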
+ """ + + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if page_id is None: + page_id = 0 + if page_id not in self: + raise ValueError("page not in document") + if type(page_id) is int and page_id < 0: + np = self.page_count + while page_id < 0: + page_id += np + %} + %pythonappend load_page %{ + val.thisown = True + val.parent = weakref.proxy(self) + self._page_refs[id(val)] = val + val._annot_refs = weakref.WeakValueDictionary() + val.number = page_id + %} + struct Page * + load_page(PyObject *page_id) + { + fz_page *page = NULL; + fz_document *doc = (fz_document *) $self; + int pno = 0, chapter = 0; + fz_try(gctx) { + if (PySequence_Check(page_id)) { + if (JM_INT_ITEM(page_id, 0, &chapter) == 1) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + if (JM_INT_ITEM(page_id, 1, &pno) == 1) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + page = fz_load_chapter_page(gctx, doc, chapter, pno); + } else { + pno = (int) PyLong_AsLong(page_id); + if (PyErr_Occurred()) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + page = fz_load_page(gctx, doc, pno); + } + } + fz_catch(gctx) { + PyErr_Clear(); + return NULL; + } + PyErr_Clear(); + return (struct Page *) page; + } + + + FITZEXCEPTION(_remove_links_to, !result) + PyObject *_remove_links_to(PyObject *numbers) + { + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + remove_dest_range(gctx, pdf, numbers); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + CLOSECHECK0(_loadOutline, """Load first outline.""") + struct Outline *_loadOutline() + { + fz_outline *ol = NULL; + fz_document *doc = (fz_document *) $self; + fz_try(gctx) { + ol = fz_load_outline(gctx, doc); + } + fz_catch(gctx) { + return NULL; + } + return (struct Outline *) ol; + } + + void _dropOutline(struct Outline *ol) { + DEBUGMSG1("Outline"); + fz_outline *this_ol = (fz_outline *) ol; + fz_drop_outline(gctx, this_ol); + DEBUGMSG2; + } + + FITZEXCEPTION(_insert_font, !result) + CLOSECHECK0(_insert_font, """Utility: insert font from file or binary.""") + PyObject * + _insert_font(char *fontfile=NULL, PyObject *fontbuffer=NULL) + { + PyObject *value=NULL; + pdf_document *pdf = pdf_specifics(gctx, (fz_document *)$self); + + fz_try(gctx) { + ASSERT_PDF(pdf); + if (!fontfile && !EXISTS(fontbuffer)) { + RAISEPY(gctx, MSG_FILE_OR_BUFFER, PyExc_ValueError); + } + value = JM_insert_font(gctx, pdf, NULL, fontfile, fontbuffer, + 0, 0, 0, 0, 0, -1); + } + fz_catch(gctx) { + return NULL; + } + return value; + } + + + FITZEXCEPTION(get_outline_xrefs, !result) + CLOSECHECK0(get_outline_xrefs, """Get list of outline xref numbers.""") + PyObject * + get_outline_xrefs() + { + PyObject *xrefs = PyList_New(0); + pdf_document *pdf = pdf_specifics(gctx, (fz_document *)$self); + if (!pdf) { + return xrefs; + } + fz_try(gctx) { + pdf_obj *root = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root)); + if (!root) goto finished; + pdf_obj *olroot = pdf_dict_get(gctx, root, PDF_NAME(Outlines)); + if (!olroot) goto finished; + pdf_obj *first = pdf_dict_get(gctx, olroot, PDF_NAME(First)); + if (!first) goto finished; + xrefs = JM_outline_xrefs(gctx, first, xrefs); + finished:; + } + fz_catch(gctx) { + Py_DECREF(xrefs); + return NULL; + } + return xrefs; + } + + + FITZEXCEPTION(xref_get_keys, !result) + CLOSECHECK0(xref_get_keys, """Get the keys of PDF dict object at 'xref'. 
Use -1 for the PDF trailer.""") + PyObject * + xref_get_keys(int xref) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *)$self); + pdf_obj *obj=NULL; + PyObject *rc = NULL; + int i, n; + fz_try(gctx) { + ASSERT_PDF(pdf); + int xreflen = pdf_xref_len(gctx, pdf); + if (!INRANGE(xref, 1, xreflen-1) && xref != -1) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + if (xref > 0) { + obj = pdf_load_object(gctx, pdf, xref); + } else { + obj = pdf_trailer(gctx, pdf); + } + n = pdf_dict_len(gctx, obj); + rc = PyTuple_New(n); + if (!n) goto finished; + for (i = 0; i < n; i++) { + const char *key = pdf_to_name(gctx, pdf_dict_get_key(gctx, obj, i)); + PyTuple_SET_ITEM(rc, i, Py_BuildValue("s", key)); + } + finished:; + } + fz_always(gctx) { + if (xref > 0) { + pdf_drop_obj(gctx, obj); + } + } + fz_catch(gctx) { + return NULL; + } + return rc; + } + + + FITZEXCEPTION(xref_get_key, !result) + CLOSECHECK0(xref_get_key, """Get PDF dict key value of object at 'xref'.""") + PyObject * + xref_get_key(int xref, const char *key) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *)$self); + pdf_obj *obj=NULL, *subobj=NULL; + PyObject *rc = NULL; + fz_buffer *res = NULL; + PyObject *text = NULL; + fz_try(gctx) { + ASSERT_PDF(pdf); + int xreflen = pdf_xref_len(gctx, pdf); + if (!INRANGE(xref, 1, xreflen-1) && xref != -1) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + if (xref > 0) { + obj = pdf_load_object(gctx, pdf, xref); + } else { + obj = pdf_trailer(gctx, pdf); + } + if (!obj) { + goto not_found; + } + subobj = pdf_dict_getp(gctx, obj, key); + if (!subobj) { + goto not_found; + } + char *type; + if (pdf_is_indirect(gctx, subobj)) { + type = "xref"; + text = PyUnicode_FromFormat("%i 0 R", pdf_to_num(gctx, subobj)); + } else if (pdf_is_array(gctx, subobj)) { + type = "array"; + } else if (pdf_is_dict(gctx, subobj)) { + type = "dict"; + } else if (pdf_is_int(gctx, subobj)) { + type = "int"; + text = PyUnicode_FromFormat("%i", pdf_to_int(gctx, subobj)); + } else if (pdf_is_real(gctx, subobj)) { + type = "float"; + } else if (pdf_is_null(gctx, subobj)) { + type = "null"; + text = PyUnicode_FromString("null"); + } else if (pdf_is_bool(gctx, subobj)) { + type = "bool"; + if (pdf_to_bool(gctx, subobj)) { + text = PyUnicode_FromString("true"); + } else { + text = PyUnicode_FromString("false"); + } + } else if (pdf_is_name(gctx, subobj)) { + type = "name"; + text = PyUnicode_FromFormat("/%s", pdf_to_name(gctx, subobj)); + } else if (pdf_is_string(gctx, subobj)) { + type = "string"; + text = JM_UnicodeFromStr(pdf_to_text_string(gctx, subobj)); + } else { + type = "unknown"; + } + if (!text) { + res = JM_object_to_buffer(gctx, subobj, 1, 0); + text = JM_UnicodeFromBuffer(gctx, res); + } + rc = Py_BuildValue("sO", type, text); + Py_DECREF(text); + goto finished; + + not_found:; + rc = Py_BuildValue("ss", "null", "null"); + finished:; + } + fz_always(gctx) { + if (xref > 0) { + pdf_drop_obj(gctx, obj); + } + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + return NULL; + } + return rc; + } + + + FITZEXCEPTION(xref_set_key, !result) + %pythonprepend xref_set_key %{ + """Set the value of a PDF dictionary key.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if not key or not isinstance(key, str) or INVALID_NAME_CHARS.intersection(key) not in (set(), {"/"}): + raise ValueError("bad 'key'") + if not isinstance(value, str) or not value or value[0] == "/" and INVALID_NAME_CHARS.intersection(value[1:]) != set(): + raise ValueError("bad 
'value'") + %} + PyObject * + xref_set_key(int xref, const char *key, char *value) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *)$self); + pdf_obj *obj = NULL, *new_obj = NULL; + int i, n; + fz_try(gctx) { + ASSERT_PDF(pdf); + if (!key || strlen(key) == 0) { + RAISEPY(gctx, "bad 'key'", PyExc_ValueError); + } + if (!value || strlen(value) == 0) { + RAISEPY(gctx, "bad 'value'", PyExc_ValueError); + } + int xreflen = pdf_xref_len(gctx, pdf); + if (!INRANGE(xref, 1, xreflen-1) && xref != -1) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + if (xref != -1) { + obj = pdf_load_object(gctx, pdf, xref); + } else { + obj = pdf_trailer(gctx, pdf); + } + // if val=="null" and no path hierarchy, delete "key" from object + // chr(47) = "/" + if (strcmp(value, "null") == 0 && strchr(key, 47) == NULL) { + pdf_dict_dels(gctx, obj, key); + goto finished; + } + new_obj = JM_set_object_value(gctx, obj, key, value); + if (!new_obj) { + goto finished; // did not work: skip update + } + if (xref != -1) { + pdf_drop_obj(gctx, obj); + obj = NULL; + pdf_update_object(gctx, pdf, xref, new_obj); + } else { + n = pdf_dict_len(gctx, new_obj); + for (i = 0; i < n; i++) { + pdf_dict_put(gctx, obj, pdf_dict_get_key(gctx, new_obj, i), pdf_dict_get_val(gctx, new_obj, i)); + } + } + finished:; + } + fz_always(gctx) { + if (xref != -1) { + pdf_drop_obj(gctx, obj); + } + pdf_drop_obj(gctx, new_obj); + PyErr_Clear(); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(_extend_toc_items, !result) + CLOSECHECK0(_extend_toc_items, """Add color info to all items of an extended TOC list.""") + PyObject * + _extend_toc_items(PyObject *items) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *)$self); + pdf_obj *bm, *col, *obj; + int count, flags; + PyObject *item=NULL, *itemdict=NULL, *xrefs, *bold, *italic, *collapse, *zoom; + zoom = PyUnicode_FromString("zoom"); + bold = PyUnicode_FromString("bold"); + italic = PyUnicode_FromString("italic"); + collapse = PyUnicode_FromString("collapse"); + fz_try(gctx) { + pdf_obj *root = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root)); + if (!root) goto finished; + pdf_obj *olroot = pdf_dict_get(gctx, root, PDF_NAME(Outlines)); + if (!olroot) goto finished; + pdf_obj *first = pdf_dict_get(gctx, olroot, PDF_NAME(First)); + if (!first) goto finished; + xrefs = PyList_New(0); // pre-allocate an empty list + xrefs = JM_outline_xrefs(gctx, first, xrefs); + Py_ssize_t i, n = PySequence_Size(xrefs), m = PySequence_Size(items); + if (!n) goto finished; + if (n != m) { + RAISEPY(gctx, "internal error finding outline xrefs", PyExc_IndexError); + } + int xref; + + // update all TOC item dictionaries + for (i = 0; i < n; i++) { + JM_INT_ITEM(xrefs, i, &xref); + item = PySequence_ITEM(items, i); + itemdict = PySequence_ITEM(item, 3); + if (!itemdict || !PyDict_Check(itemdict)) { + RAISEPY(gctx, "need non-simple TOC format", PyExc_ValueError); + } + PyDict_SetItem(itemdict, dictkey_xref, PySequence_ITEM(xrefs, i)); + bm = pdf_load_object(gctx, pdf, xref); + flags = pdf_to_int(gctx, (pdf_dict_get(gctx, bm, PDF_NAME(F)))); + if (flags == 1) { + PyDict_SetItem(itemdict, italic, Py_True); + } else if (flags == 2) { + PyDict_SetItem(itemdict, bold, Py_True); + } else if (flags == 3) { + PyDict_SetItem(itemdict, italic, Py_True); + PyDict_SetItem(itemdict, bold, Py_True); + } + count = pdf_to_int(gctx, (pdf_dict_get(gctx, bm, PDF_NAME(Count)))); + if (count < 0) { + PyDict_SetItem(itemdict, collapse, Py_True); + } else if (count > 0) { + 
PyDict_SetItem(itemdict, collapse, Py_False); + } + col = pdf_dict_get(gctx, bm, PDF_NAME(C)); + if (pdf_is_array(gctx, col) && pdf_array_len(gctx, col) == 3) { + PyObject *color = PyTuple_New(3); + PyTuple_SET_ITEM(color, 0, Py_BuildValue("f", pdf_to_real(gctx, pdf_array_get(gctx, col, 0)))); + PyTuple_SET_ITEM(color, 1, Py_BuildValue("f", pdf_to_real(gctx, pdf_array_get(gctx, col, 1)))); + PyTuple_SET_ITEM(color, 2, Py_BuildValue("f", pdf_to_real(gctx, pdf_array_get(gctx, col, 2)))); + DICT_SETITEM_DROP(itemdict, dictkey_color, color); + } + float z=0; + obj = pdf_dict_get(gctx, bm, PDF_NAME(Dest)); + if (!obj || !pdf_is_array(gctx, obj)) { + obj = pdf_dict_getl(gctx, bm, PDF_NAME(A), PDF_NAME(D), NULL); + } + if (pdf_is_array(gctx, obj) && pdf_array_len(gctx, obj) == 5) { + z = pdf_to_real(gctx, pdf_array_get(gctx, obj, 4)); + } + DICT_SETITEM_DROP(itemdict, zoom, Py_BuildValue("f", z)); + PyList_SetItem(item, 3, itemdict); + PyList_SetItem(items, i, item); + pdf_drop_obj(gctx, bm); + bm = NULL; + } + finished:; + } + fz_always(gctx) { + Py_CLEAR(xrefs); + Py_CLEAR(bold); + Py_CLEAR(italic); + Py_CLEAR(collapse); + Py_CLEAR(zoom); + pdf_drop_obj(gctx, bm); + PyErr_Clear(); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------------------------------------- + // EmbeddedFiles utility functions + //---------------------------------------------------------------- + FITZEXCEPTION(_embfile_names, !result) + CLOSECHECK0(_embfile_names, """Get list of embedded file names.""") + PyObject *_embfile_names(PyObject *namelist) + { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_specifics(gctx, doc); + fz_try(gctx) { + ASSERT_PDF(pdf); + PyObject *val; + pdf_obj *names = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), + PDF_NAME(Names), + PDF_NAME(EmbeddedFiles), + PDF_NAME(Names), + NULL); + if (pdf_is_array(gctx, names)) { + int i, n = pdf_array_len(gctx, names); + for (i=0; i < n; i+=2) { + val = JM_EscapeStrFromStr(pdf_to_text_string(gctx, + pdf_array_get(gctx, names, i))); + LIST_APPEND_DROP(namelist, val); + } + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + FITZEXCEPTION(_embfile_del, !result) + PyObject *_embfile_del(int idx) + { + fz_try(gctx) { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_document_from_fz_document(gctx, doc); + pdf_obj *names = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), + PDF_NAME(Names), + PDF_NAME(EmbeddedFiles), + PDF_NAME(Names), + NULL); + pdf_array_delete(gctx, names, idx + 1); + pdf_array_delete(gctx, names, idx); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + FITZEXCEPTION(_embfile_info, !result) + PyObject *_embfile_info(int idx, PyObject *infodict) + { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_document_from_fz_document(gctx, doc); + char *name; + int xref = 0, ci_xref=0; + fz_try(gctx) { + pdf_obj *names = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), + PDF_NAME(Names), + PDF_NAME(EmbeddedFiles), + PDF_NAME(Names), + NULL); + + pdf_obj *o = pdf_array_get(gctx, names, 2*idx+1); + pdf_obj *ci = pdf_dict_get(gctx, o, PDF_NAME(CI)); + if (ci) { + ci_xref = pdf_to_num(gctx, ci); + } + DICT_SETITEMSTR_DROP(infodict, "collection", Py_BuildValue("i", ci_xref)); + name = (char *) pdf_to_text_string(gctx, + pdf_dict_get(gctx, o, PDF_NAME(F))); + DICT_SETITEM_DROP(infodict, dictkey_filename, JM_EscapeStrFromStr(name)); + + name = (char *) 
pdf_to_text_string(gctx, + pdf_dict_get(gctx, o, PDF_NAME(UF))); + DICT_SETITEM_DROP(infodict, dictkey_ufilename, JM_EscapeStrFromStr(name)); + + name = (char *) pdf_to_text_string(gctx, + pdf_dict_get(gctx, o, PDF_NAME(Desc))); + DICT_SETITEM_DROP(infodict, dictkey_desc, JM_UnicodeFromStr(name)); + + int len = -1, DL = -1; + pdf_obj *fileentry = pdf_dict_getl(gctx, o, PDF_NAME(EF), PDF_NAME(F), NULL); + xref = pdf_to_num(gctx, fileentry); + o = pdf_dict_get(gctx, fileentry, PDF_NAME(Length)); + if (o) len = pdf_to_int(gctx, o); + + o = pdf_dict_get(gctx, fileentry, PDF_NAME(DL)); + if (o) { + DL = pdf_to_int(gctx, o); + } else { + o = pdf_dict_getl(gctx, fileentry, PDF_NAME(Params), + PDF_NAME(Size), NULL); + if (o) DL = pdf_to_int(gctx, o); + } + DICT_SETITEM_DROP(infodict, dictkey_size, Py_BuildValue("i", DL)); + DICT_SETITEM_DROP(infodict, dictkey_length, Py_BuildValue("i", len)); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", xref); + } + + FITZEXCEPTION(_embfile_upd, !result) + PyObject *_embfile_upd(int idx, PyObject *buffer = NULL, char *filename = NULL, char *ufilename = NULL, char *desc = NULL) + { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_document_from_fz_document(gctx, doc); + fz_buffer *res = NULL; + fz_var(res); + int xref = 0; + fz_try(gctx) { + pdf_obj *names = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), + PDF_NAME(Names), + PDF_NAME(EmbeddedFiles), + PDF_NAME(Names), + NULL); + + pdf_obj *entry = pdf_array_get(gctx, names, 2*idx+1); + + pdf_obj *filespec = pdf_dict_getl(gctx, entry, PDF_NAME(EF), + PDF_NAME(F), NULL); + if (!filespec) { + RAISEPY(gctx, "bad PDF: no /EF object", JM_Exc_FileDataError); + } + res = JM_BufferFromBytes(gctx, buffer); + if (EXISTS(buffer) && !res) { + RAISEPY(gctx, MSG_BAD_BUFFER, PyExc_TypeError); + } + if (res && buffer != Py_None) + { + JM_update_stream(gctx, pdf, filespec, res, 1); + // adjust /DL and /Size parameters + int64_t len = (int64_t) fz_buffer_storage(gctx, res, NULL); + pdf_obj *l = pdf_new_int(gctx, len); + pdf_dict_put(gctx, filespec, PDF_NAME(DL), l); + pdf_dict_putl(gctx, filespec, l, PDF_NAME(Params), PDF_NAME(Size), NULL); + } + xref = pdf_to_num(gctx, filespec); + if (filename) + pdf_dict_put_text_string(gctx, entry, PDF_NAME(F), filename); + + if (ufilename) + pdf_dict_put_text_string(gctx, entry, PDF_NAME(UF), ufilename); + + if (desc) + pdf_dict_put_text_string(gctx, entry, PDF_NAME(Desc), desc); + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) + return NULL; + + return Py_BuildValue("i", xref); + } + + FITZEXCEPTION(_embeddedFileGet, !result) + PyObject *_embeddedFileGet(int idx) + { + fz_document *doc = (fz_document *) $self; + PyObject *cont = NULL; + pdf_document *pdf = pdf_document_from_fz_document(gctx, doc); + fz_buffer *buf = NULL; + fz_var(buf); + fz_try(gctx) { + pdf_obj *names = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), + PDF_NAME(Names), + PDF_NAME(EmbeddedFiles), + PDF_NAME(Names), + NULL); + + pdf_obj *entry = pdf_array_get(gctx, names, 2*idx+1); + pdf_obj *filespec = pdf_dict_getl(gctx, entry, PDF_NAME(EF), + PDF_NAME(F), NULL); + buf = pdf_load_stream(gctx, filespec); + cont = JM_BinFromBuffer(gctx, buf); + } + fz_always(gctx) { + fz_drop_buffer(gctx, buf); + } + fz_catch(gctx) { + return NULL; + } + return cont; + } + + FITZEXCEPTION(_embfile_add, !result) + PyObject *_embfile_add(const char *name, PyObject *buffer, char *filename=NULL, char *ufilename=NULL, char *desc=NULL) + { + fz_document 
*doc = (fz_document *) $self; + pdf_document *pdf = pdf_document_from_fz_document(gctx, doc); + fz_buffer *data = NULL; + fz_var(data); + pdf_obj *names = NULL; + int xref = 0; // xref of file entry + fz_try(gctx) { + ASSERT_PDF(pdf); + data = JM_BufferFromBytes(gctx, buffer); + if (!data) { + RAISEPY(gctx, MSG_BAD_BUFFER, PyExc_TypeError); + } + + names = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), + PDF_NAME(Names), + PDF_NAME(EmbeddedFiles), + PDF_NAME(Names), + NULL); + if (!pdf_is_array(gctx, names)) { + pdf_obj *root = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root)); + names = pdf_new_array(gctx, pdf, 6); // an even number! + pdf_dict_putl_drop(gctx, root, names, + PDF_NAME(Names), + PDF_NAME(EmbeddedFiles), + PDF_NAME(Names), + NULL); + } + + pdf_obj *fileentry = JM_embed_file(gctx, pdf, data, + filename, + ufilename, + desc, 1); + xref = pdf_to_num(gctx, pdf_dict_getl(gctx, fileentry, + PDF_NAME(EF), PDF_NAME(F), NULL)); + pdf_array_push_drop(gctx, names, pdf_new_text_string(gctx, name)); + pdf_array_push_drop(gctx, names, fileentry); + } + fz_always(gctx) { + fz_drop_buffer(gctx, data); + } + fz_catch(gctx) { + return NULL; + } + + return Py_BuildValue("i", xref); + } + + + %pythoncode %{ + def embfile_names(self) -> list: + """Get list of names of EmbeddedFiles.""" + filenames = [] + self._embfile_names(filenames) + return filenames + + def _embeddedFileIndex(self, item: typing.Union[int, str]) -> int: + filenames = self.embfile_names() + msg = "'%s' not in EmbeddedFiles array." % str(item) + if item in filenames: + idx = filenames.index(item) + elif item in range(len(filenames)): + idx = item + else: + raise ValueError(msg) + return idx + + def embfile_count(self) -> int: + """Get number of EmbeddedFiles.""" + return len(self.embfile_names()) + + def embfile_del(self, item: typing.Union[int, str]): + """Delete an entry from EmbeddedFiles. + + Notes: + The argument must be name or index of an EmbeddedFiles item. + Physical deletion of data will happen on save to a new + file with appropriate garbage option. + Args: + item: name or number of item. + Returns: + None + """ + idx = self._embeddedFileIndex(item) + return self._embfile_del(idx) + + def embfile_info(self, item: typing.Union[int, str]) -> dict: + """Get information of an item in the EmbeddedFiles array. + + Args: + item: number or name of item. + Returns: + Information dictionary. + """ + idx = self._embeddedFileIndex(item) + infodict = {"name": self.embfile_names()[idx]} + xref = self._embfile_info(idx, infodict) + t, date = self.xref_get_key(xref, "Params/CreationDate") + if t != "null": + infodict["creationDate"] = date + t, date = self.xref_get_key(xref, "Params/ModDate") + if t != "null": + infodict["modDate"] = date + t, md5 = self.xref_get_key(xref, "Params/CheckSum") + if t != "null": + infodict["checksum"] = binascii.hexlify(md5.encode()).decode() + return infodict + + def embfile_get(self, item: typing.Union[int, str]) -> bytes: + """Get the content of an item in the EmbeddedFiles array. + + Args: + item: number or name of item. + Returns: + (bytes) The file content. + """ + idx = self._embeddedFileIndex(item) + return self._embeddedFileGet(idx) + + def embfile_upd(self, item: typing.Union[int, str], + buffer: OptBytes =None, + filename: OptStr =None, + ufilename: OptStr =None, + desc: OptStr =None,) -> None: + """Change an item of the EmbeddedFiles array. + + Notes: + Only provided parameters are changed. If all are omitted, + the method is a no-op. 
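These wrappers, together with `embfile_add()` defined a little further down, cover the full embedded-files round trip. A sketch with placeholder names and data:

```python
import fitz

doc = fitz.open("input.pdf")                        # hypothetical file
doc.embfile_add("data.csv", b"a;b\n1;2\n", desc="raw numbers")
print(doc.embfile_names(), doc.embfile_count())     # ['data.csv'] 1
info = doc.embfile_info("data.csv")                 # name, size, dates, checksum, ...
blob = doc.embfile_get("data.csv")                  # the stored bytes
doc.embfile_upd("data.csv", buffer=b"new content", desc="updated")
doc.save("output.pdf")
```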
+ Args: + item: number or name of item. + buffer: (binary data) the new file content. + filename: (str) the new file name. + ufilename: (unicode) the new filen ame. + desc: (str) the new description. + """ + idx = self._embeddedFileIndex(item) + xref = self._embfile_upd(idx, buffer=buffer, + filename=filename, + ufilename=ufilename, + desc=desc) + date = get_pdf_now() + self.xref_set_key(xref, "Params/ModDate", get_pdf_str(date)) + return xref + + def embfile_add(self, name: str, buffer: ByteString, + filename: OptStr =None, + ufilename: OptStr =None, + desc: OptStr =None,) -> None: + """Add an item to the EmbeddedFiles array. + + Args: + name: name of the new item, must not already exist. + buffer: (binary data) the file content. + filename: (str) the file name, default: the name + ufilename: (unicode) the file name, default: filename + desc: (str) the description. + """ + filenames = self.embfile_names() + msg = "Name '%s' already exists." % str(name) + if name in filenames: + raise ValueError(msg) + + if filename is None: + filename = name + if ufilename is None: + ufilename = unicode(filename, "utf8") if str is bytes else filename + if desc is None: + desc = name + xref = self._embfile_add(name, buffer=buffer, + filename=filename, + ufilename=ufilename, + desc=desc) + date = get_pdf_now() + self.xref_set_key(xref, "Type", "/EmbeddedFile") + self.xref_set_key(xref, "Params/CreationDate", get_pdf_str(date)) + self.xref_set_key(xref, "Params/ModDate", get_pdf_str(date)) + return xref + %} + + FITZEXCEPTION(convert_to_pdf, !result) + %pythonprepend convert_to_pdf %{ + """Convert document to a PDF, selecting page range and optional rotation. Output bytes object.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + %} + PyObject *convert_to_pdf(int from_page=0, int to_page=-1, int rotate=0) + { + PyObject *doc = NULL; + fz_document *fz_doc = (fz_document *) $self; + fz_try(gctx) { + int fp = from_page, tp = to_page, srcCount = fz_count_pages(gctx, fz_doc); + if (fp < 0) fp = 0; + if (fp > srcCount - 1) fp = srcCount - 1; + if (tp < 0) tp = srcCount - 1; + if (tp > srcCount - 1) tp = srcCount - 1; + Py_ssize_t len0 = PyList_Size(JM_mupdf_warnings_store); + doc = JM_convert_to_pdf(gctx, fz_doc, fp, tp, rotate); + Py_ssize_t len1 = PyList_Size(JM_mupdf_warnings_store); + Py_ssize_t i = len0; + while (i < len1) { + PySys_WriteStderr("%s\n", JM_StrAsChar(PyList_GetItem(JM_mupdf_warnings_store, i))); + i++; + } + } + fz_catch(gctx) { + return NULL; + } + if (doc) { + return doc; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(page_count, !result) + CLOSECHECK0(page_count, """Number of pages.""") + %pythoncode%{@property%} + PyObject *page_count() + { + PyObject *ret; + fz_try(gctx) { + ret = PyLong_FromLong((long) fz_count_pages(gctx, (fz_document *) $self)); + } + fz_catch(gctx) { + PyErr_Clear(); + return NULL; + } + return ret; + } + + FITZEXCEPTION(chapter_count, !result) + CLOSECHECK0(chapter_count, """Number of chapters.""") + %pythoncode%{@property%} + PyObject *chapter_count() + { + PyObject *ret; + fz_try(gctx) { + ret = PyLong_FromLong((long) fz_count_chapters(gctx, (fz_document *) $self)); + } + fz_catch(gctx) { + return NULL; + } + return ret; + } + + FITZEXCEPTION(last_location, !result) + CLOSECHECK0(last_location, """Id (chapter, page) of last page.""") + %pythoncode%{@property%} + PyObject *last_location() + { + fz_document *this_doc = (fz_document *) $self; + fz_location last_loc; + fz_try(gctx) { + last_loc = fz_last_page(gctx, 
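`convert_to_pdf()` returns PDF bytes, which can be re-opened through the (type, buffer) constructor form shown earlier. A sketch with a hypothetical XPS input:

```python
import fitz

src = fitz.open("input.xps")            # any supported non-PDF format
pdfbytes = src.convert_to_pdf()         # optionally restrict with from_page / to_page / rotate
pdf = fitz.open("pdf", pdfbytes)        # wrap the bytes as a new PDF Document
pdf.save("converted.pdf")
```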
this_doc); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("ii", last_loc.chapter, last_loc.page); + } + + + FITZEXCEPTION(chapter_page_count, !result) + CLOSECHECK0(chapter_page_count, """Page count of chapter.""") + PyObject *chapter_page_count(int chapter) + { + long pages = 0; + fz_try(gctx) { + int chapters = fz_count_chapters(gctx, (fz_document *) $self); + if (chapter < 0 || chapter >= chapters) { + RAISEPY(gctx, "bad chapter number", PyExc_ValueError); + } + pages = (long) fz_count_chapter_pages(gctx, (fz_document *) $self, chapter); + } + fz_catch(gctx) { + return NULL; + } + return PyLong_FromLong(pages); + } + + FITZEXCEPTION(prev_location, !result) + %pythonprepend prev_location %{ + """Get (chapter, page) of previous page.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if type(page_id) is int: + page_id = (0, page_id) + if page_id not in self: + raise ValueError("page id not in document") + if page_id == (0, 0): + return () + %} + PyObject *prev_location(PyObject *page_id) + { + fz_document *this_doc = (fz_document *) $self; + fz_location prev_loc, loc; + PyObject *val; + int pno; + fz_try(gctx) { + val = PySequence_GetItem(page_id, 0); + if (!val) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + int chapter = (int) PyLong_AsLong(val); + Py_DECREF(val); + if (PyErr_Occurred()) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + + val = PySequence_GetItem(page_id, 1); + if (!val) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + pno = (int) PyLong_AsLong(val); + Py_DECREF(val); + if (PyErr_Occurred()) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + loc = fz_make_location(chapter, pno); + prev_loc = fz_previous_page(gctx, this_doc, loc); + } + fz_catch(gctx) { + PyErr_Clear(); + return NULL; + } + return Py_BuildValue("ii", prev_loc.chapter, prev_loc.page); + } + + + FITZEXCEPTION(next_location, !result) + %pythonprepend next_location %{ + """Get (chapter, page) of next page.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if type(page_id) is int: + page_id = (0, page_id) + if page_id not in self: + raise ValueError("page id not in document") + if tuple(page_id) == self.last_location: + return () + %} + PyObject *next_location(PyObject *page_id) + { + fz_document *this_doc = (fz_document *) $self; + fz_location next_loc, loc; + PyObject *val; + int pno; + fz_try(gctx) { + val = PySequence_GetItem(page_id, 0); + if (!val) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + int chapter = (int) PyLong_AsLong(val); + Py_DECREF(val); + if (PyErr_Occurred()) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + + val = PySequence_GetItem(page_id, 1); + if (!val) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + pno = (int) PyLong_AsLong(val); + Py_DECREF(val); + if (PyErr_Occurred()) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + loc = fz_make_location(chapter, pno); + next_loc = fz_next_page(gctx, this_doc, loc); + } + fz_catch(gctx) { + PyErr_Clear(); + return NULL; + } + return Py_BuildValue("ii", next_loc.chapter, next_loc.page); + } + + + FITZEXCEPTION(location_from_page_number, !result) + CLOSECHECK0(location_from_page_number, """Convert pno to (chapter, page).""") + PyObject *location_from_page_number(int pno) + { + fz_document *this_doc = (fz_document *) $self; + fz_location loc = fz_make_location(-1, -1); + int page_count = fz_count_pages(gctx, this_doc); + while (pno < 0) pno += page_count; 
+ fz_try(gctx) { + if (pno >= page_count) { + RAISEPY(gctx, MSG_BAD_PAGENO, PyExc_ValueError); + } + loc = fz_location_from_page_number(gctx, this_doc, pno); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("ii", loc.chapter, loc.page); + } + + FITZEXCEPTION(page_number_from_location, !result) + %pythonprepend page_number_from_location%{ + """Convert (chapter, pno) to page number.""" + if type(page_id) is int: + np = self.page_count + while page_id < 0: + page_id += np + page_id = (0, page_id) + if page_id not in self: + raise ValueError("page id not in document") + %} + PyObject *page_number_from_location(PyObject *page_id) + { + fz_document *this_doc = (fz_document *) $self; + fz_location loc; + long page_n = -1; + PyObject *val; + int pno; + fz_try(gctx) { + val = PySequence_GetItem(page_id, 0); + if (!val) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + int chapter = (int) PyLong_AsLong(val); + Py_DECREF(val); + if (PyErr_Occurred()) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + + val = PySequence_GetItem(page_id, 1); + if (!val) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + pno = (int) PyLong_AsLong(val); + Py_DECREF(val); + if (PyErr_Occurred()) { + RAISEPY(gctx, MSG_BAD_PAGEID, PyExc_ValueError); + } + + loc = fz_make_location(chapter, pno); + page_n = (long) fz_page_number_from_location(gctx, this_doc, loc); + } + fz_catch(gctx) { + PyErr_Clear(); + return NULL; + } + return PyLong_FromLong(page_n); + } + + FITZEXCEPTION(_getMetadata, !result) + CLOSECHECK0(_getMetadata, """Get metadata.""") + PyObject * + _getMetadata(const char *key) + { + PyObject *res = NULL; + fz_document *doc = (fz_document *) $self; + int vsize; + char *value; + fz_try(gctx) { + vsize = fz_lookup_metadata(gctx, doc, key, NULL, 0)+1; + if(vsize > 1) { + value = JM_Alloc(char, vsize); + fz_lookup_metadata(gctx, doc, key, value, vsize); + res = JM_UnicodeFromStr(value); + JM_Free(value); + } else { + res = EMPTY_STRING; + } + } + fz_always(gctx) { + PyErr_Clear(); + } + fz_catch(gctx) { + return EMPTY_STRING; + } + return res; + } + + CLOSECHECK0(needs_pass, """Indicate password required.""") + %pythoncode%{@property%} + PyObject *needs_pass() { + return JM_BOOL(fz_needs_password(gctx, (fz_document *) $self)); + } + + %pythoncode%{@property%} + CLOSECHECK0(language, """Document language.""") + PyObject *language() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_NONE; + fz_text_language lang = pdf_document_language(gctx, pdf); + char buf[8]; + if (lang == FZ_LANG_UNSET) Py_RETURN_NONE; + return PyUnicode_FromString(fz_string_from_text_language(buf, lang)); + } + + FITZEXCEPTION(set_language, !result) + PyObject *set_language(char *language=NULL) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + fz_try(gctx) { + ASSERT_PDF(pdf); + fz_text_language lang; + if (!language) + lang = FZ_LANG_UNSET; + else + lang = fz_text_language_from_string(language); + pdf_set_document_language(gctx, pdf, lang); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_TRUE; + } + + + %pythonprepend resolve_link %{ + """Calculate internal link destination. + + Args: + uri: (str) some Link.uri + chapters: (bool) whether to use (chapter, page) format + Returns: + (page_id, x, y) where x, y are point coordinates on the page. + page_id is either page number (if chapters=0), or (chapter, pno). 
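A sketch of feeding `resolve_link()` from a page's Link objects (placeholder file; unresolvable URIs come back with page number -1):

```python
import fitz

doc = fitz.open("input.pdf")                 # hypothetical file
for link in doc[0].links():                  # Link objects of the first page
    pno, x, y = doc.resolve_link(link.uri)   # target page and point on that page
    print(link.uri, "->", pno, (x, y))
# with chapters=True the target is returned as ((chapter, page), x, y) instead
```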
+ """ + %} + PyObject *resolve_link(char *uri=NULL, int chapters=0) + { + if (!uri) { + if (chapters) return Py_BuildValue("(ii)ff", -1, -1, 0, 0); + return Py_BuildValue("iff", -1, 0, 0); + } + fz_document *this_doc = (fz_document *) $self; + float xp = 0, yp = 0; + fz_location loc = {0, 0}; + fz_try(gctx) { + loc = fz_resolve_link(gctx, (fz_document *) $self, uri, &xp, &yp); + } + fz_catch(gctx) { + if (chapters) return Py_BuildValue("(ii)ff", -1, -1, 0, 0); + return Py_BuildValue("iff", -1, 0, 0); + } + if (chapters) + return Py_BuildValue("(ii)ff", loc.chapter, loc.page, xp, yp); + int pno = fz_page_number_from_location(gctx, this_doc, loc); + return Py_BuildValue("iff", pno, xp, yp); + } + + FITZEXCEPTION(layout, !result) + CLOSECHECK(layout, """Re-layout a reflowable document.""") + %pythonappend layout %{ + self._reset_page_refs() + self.init_doc()%} + PyObject *layout(PyObject *rect = NULL, float width = 0, float height = 0, float fontsize = 11) + { + fz_document *doc = (fz_document *) $self; + if (!fz_is_document_reflowable(gctx, doc)) Py_RETURN_NONE; + fz_try(gctx) { + float w = width, h = height; + fz_rect r = JM_rect_from_py(rect); + if (!fz_is_infinite_rect(r)) { + w = r.x1 - r.x0; + h = r.y1 - r.y0; + } + if (w <= 0.0f || h <= 0.0f) { + RAISEPY(gctx, "bad page size", PyExc_ValueError); + } + fz_layout_document(gctx, doc, w, h, fontsize); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + FITZEXCEPTION(make_bookmark, !result) + CLOSECHECK(make_bookmark, """Make a page pointer before layouting document.""") + PyObject *make_bookmark(PyObject *loc) + { + fz_document *doc = (fz_document *) $self; + fz_location location; + fz_bookmark mark; + fz_try(gctx) { + if (JM_INT_ITEM(loc, 0, &location.chapter) == 1) { + RAISEPY(gctx, MSG_BAD_LOCATION, PyExc_ValueError); + } + if (JM_INT_ITEM(loc, 1, &location.page) == 1) { + RAISEPY(gctx, MSG_BAD_LOCATION, PyExc_ValueError); + } + mark = fz_make_bookmark(gctx, doc, location); + if (!mark) { + RAISEPY(gctx, MSG_BAD_LOCATION, PyExc_ValueError); + } + } + fz_catch(gctx) { + return NULL; + } + return PyLong_FromVoidPtr((void *) mark); + } + + + FITZEXCEPTION(find_bookmark, !result) + CLOSECHECK(find_bookmark, """Find new location after layouting a document.""") + PyObject *find_bookmark(PyObject *bm) + { + fz_document *doc = (fz_document *) $self; + fz_location location; + fz_try(gctx) { + intptr_t mark = (intptr_t) PyLong_AsVoidPtr(bm); + location = fz_lookup_bookmark(gctx, doc, mark); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("ii", location.chapter, location.page); + } + + + CLOSECHECK0(is_reflowable, """Check if document is layoutable.""") + %pythoncode%{@property%} + PyObject *is_reflowable() + { + return JM_BOOL(fz_is_document_reflowable(gctx, (fz_document *) $self)); + } + + FITZEXCEPTION(_deleteObject, !result) + CLOSECHECK0(_deleteObject, """Delete object.""") + PyObject *_deleteObject(int xref) + { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_specifics(gctx, doc); + fz_try(gctx) { + ASSERT_PDF(pdf); + if (!INRANGE(xref, 1, pdf_xref_len(gctx, pdf)-1)) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + pdf_delete_object(gctx, pdf, xref); + } + fz_catch(gctx) { + return NULL; + } + + Py_RETURN_NONE; + } + + FITZEXCEPTION(pdf_catalog, !result) + CLOSECHECK0(pdf_catalog, """Get xref of PDF catalog.""") + PyObject *pdf_catalog() + { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_specifics(gctx, doc); + int xref = 0; + if (!pdf) return 
Py_BuildValue("i", xref); + fz_try(gctx) { + pdf_obj *root = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root)); + xref = pdf_to_num(gctx, root); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", xref); + } + + FITZEXCEPTION(_getPDFfileid, !result) + CLOSECHECK0(_getPDFfileid, """Get PDF file id.""") + PyObject *_getPDFfileid() + { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_specifics(gctx, doc); + if (!pdf) Py_RETURN_NONE; + PyObject *idlist = PyList_New(0); + fz_buffer *buffer = NULL; + unsigned char *hex; + pdf_obj *o; + int n, i, len; + PyObject *bytes; + + fz_try(gctx) { + pdf_obj *identity = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(ID)); + if (identity) { + n = pdf_array_len(gctx, identity); + for (i = 0; i < n; i++) { + o = pdf_array_get(gctx, identity, i); + len = (int) pdf_to_str_len(gctx, o); + buffer = fz_new_buffer(gctx, 2 * len); + fz_buffer_storage(gctx, buffer, &hex); + hexlify(len, (unsigned char *) pdf_to_text_string(gctx, o), hex); + LIST_APPEND_DROP(idlist, JM_UnicodeFromStr(hex)); + Py_CLEAR(bytes); + fz_drop_buffer(gctx, buffer); + buffer = NULL; + } + } + } + fz_catch(gctx) { + fz_drop_buffer(gctx, buffer); + } + return idlist; + } + + CLOSECHECK0(version_count, """Count versions of PDF document.""") + %pythoncode%{@property%} + PyObject *version_count() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) return Py_BuildValue("i", 0); + return Py_BuildValue("i", pdf_count_versions(gctx, pdf)); + } + + + CLOSECHECK0(is_pdf, """Check for PDF.""") + %pythoncode%{@property%} + PyObject *is_pdf() + { + if (pdf_specifics(gctx, (fz_document *) $self)) Py_RETURN_TRUE; + else Py_RETURN_FALSE; + } + + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR <= 21 + /* The underlying struct members that these methods give access to, are + not available. 
*/ + CLOSECHECK0(has_xref_streams, """Check if xref table is a stream.""") + %pythoncode%{@property%} + PyObject *has_xref_streams() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_FALSE; + if (pdf->has_xref_streams) Py_RETURN_TRUE; + Py_RETURN_FALSE; + } + + CLOSECHECK0(has_old_style_xrefs, """Check if xref table is old style.""") + %pythoncode%{@property%} + PyObject *has_old_style_xrefs() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_FALSE; + if (pdf->has_old_style_xrefs) Py_RETURN_TRUE; + Py_RETURN_FALSE; + } + #endif + + CLOSECHECK0(is_dirty, """True if PDF has unsaved changes.""") + %pythoncode%{@property%} + PyObject *is_dirty() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_FALSE; + return JM_BOOL(pdf_has_unsaved_changes(gctx, pdf)); + } + + CLOSECHECK0(can_save_incrementally, """Check whether incremental saves are possible.""") + PyObject *can_save_incrementally() + { + pdf_document *pdf = pdf_document_from_fz_document(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_FALSE; // gracefully handle non-PDF + return JM_BOOL(pdf_can_be_saved_incrementally(gctx, pdf)); + } + + CLOSECHECK0(is_fast_webaccess, """Check whether we have a linearized PDF.""") + %pythoncode%{@property%} + PyObject *is_fast_webaccess() + { + pdf_document *pdf = pdf_document_from_fz_document(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_FALSE; // gracefully handle non-PDF + return JM_BOOL(pdf_doc_was_linearized(gctx, pdf)); + } + + CLOSECHECK0(is_repaired, """Check whether PDF was repaired.""") + %pythoncode%{@property%} + PyObject *is_repaired() + { + pdf_document *pdf = pdf_document_from_fz_document(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_FALSE; // gracefully handle non-PDF + return JM_BOOL(pdf_was_repaired(gctx, pdf)); + } + + FITZEXCEPTION(save_snapshot, !result) + %pythonprepend save_snapshot %{ + """Save a file snapshot suitable for journalling.""" + if self.is_closed: + raise ValueError("doc is closed") + if type(filename) == str: + pass + elif hasattr(filename, "open"): # assume: pathlib.Path + filename = str(filename) + elif hasattr(filename, "name"): # assume: file object + filename = filename.name + else: + raise ValueError("filename must be str, Path or file object") + if filename == self.name: + raise ValueError("cannot snapshot to original") + %} + PyObject *save_snapshot(const char *filename) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + fz_try(gctx) { + ASSERT_PDF(pdf); + pdf_save_snapshot(gctx, pdf, filename); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + CLOSECHECK0(authenticate, """Decrypt document.""") + %pythonappend authenticate %{ + if val: # the doc is decrypted successfully and we init the outline + self.is_encrypted = False + self.isEncrypted = False + self.init_doc() + self.thisown = True + %} + PyObject *authenticate(char *password) + { + return Py_BuildValue("i", fz_authenticate_password(gctx, (fz_document *) $self, (const char *) password)); + } + + //------------------------------------------------------------------ + // save a PDF + //------------------------------------------------------------------ + FITZEXCEPTION(save, !result) + %pythonprepend save %{ + """Save PDF to file, pathlib.Path or file pointer.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if type(filename) == str: + pass + elif hasattr(filename, "open"): # assume: 
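Combining `needs_pass` with `authenticate()` (placeholder file and password; the method returns 0 on failure, otherwise a value indicating which password matched):

```python
import fitz

doc = fitz.open("protected.pdf")        # hypothetical encrypted file
if doc.needs_pass:
    if not doc.authenticate("secret"):  # placeholder password
        raise ValueError("authentication failed")
print(doc.is_encrypted)                 # False after successful authentication
print(doc.page_count)                   # document is usable now
```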
pathlib.Path + filename = str(filename) + elif hasattr(filename, "name"): # assume: file object + filename = filename.name + elif not hasattr(filename, "seek"): # assume file object + raise ValueError("filename must be str, Path or file object") + if filename == self.name and not incremental: + raise ValueError("save to original must be incremental") + if self.page_count < 1: + raise ValueError("cannot save with zero pages") + if incremental: + if self.name != filename or self.stream: + raise ValueError("incremental needs original file") + if user_pw and len(user_pw) > 40 or owner_pw and len(owner_pw) > 40: + raise ValueError("password length must not exceed 40") + %} + + PyObject * + save(PyObject *filename, int garbage=0, int clean=0, + int deflate=0, int deflate_images=0, int deflate_fonts=0, + int incremental=0, int ascii=0, int expand=0, int linear=0, + int no_new_id=0, int appearance=0, + int pretty=0, int encryption=1, int permissions=4095, + char *owner_pw=NULL, char *user_pw=NULL) + { + pdf_write_options opts = pdf_default_write_options; + opts.do_incremental = incremental; + opts.do_ascii = ascii; + opts.do_compress = deflate; + opts.do_compress_images = deflate_images; + opts.do_compress_fonts = deflate_fonts; + opts.do_decompress = expand; + opts.do_garbage = garbage; + opts.do_pretty = pretty; + opts.do_linear = linear; + opts.do_clean = clean; + opts.do_sanitize = clean; + opts.dont_regenerate_id = no_new_id; + opts.do_appearance = appearance; + opts.do_encrypt = encryption; + opts.permissions = permissions; + if (owner_pw) { + memcpy(&opts.opwd_utf8, owner_pw, strlen(owner_pw)+1); + } else if (user_pw) { + memcpy(&opts.opwd_utf8, user_pw, strlen(user_pw)+1); + } + if (user_pw) { + memcpy(&opts.upwd_utf8, user_pw, strlen(user_pw)+1); + } + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_specifics(gctx, doc); + fz_output *out = NULL; + fz_try(gctx) { + ASSERT_PDF(pdf); + pdf->resynth_required = 0; + JM_embedded_clean(gctx, pdf); + if (no_new_id == 0) { + JM_ensure_identity(gctx, pdf); + } + if (PyUnicode_Check(filename)) { + pdf_save_document(gctx, pdf, JM_StrAsChar(filename), &opts); + } else { + out = JM_new_output_fileptr(gctx, filename); + pdf_write_document(gctx, pdf, out, &opts); + } + } + fz_always(gctx) { + fz_drop_output(gctx, out); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + %pythoncode %{ + def write(self, garbage=False, clean=False, + deflate=False, deflate_images=False, deflate_fonts=False, + incremental=False, ascii=False, expand=False, linear=False, + no_new_id=False, appearance=False, pretty=False, encryption=1, permissions=4095, + owner_pw=None, user_pw=None): + from io import BytesIO + bio = BytesIO() + self.save(bio, garbage=garbage, clean=clean, + no_new_id=no_new_id, appearance=appearance, + deflate=deflate, deflate_images=deflate_images, deflate_fonts=deflate_fonts, + incremental=incremental, ascii=ascii, expand=expand, linear=linear, + pretty=pretty, encryption=encryption, permissions=permissions, + owner_pw=owner_pw, user_pw=user_pw) + return bio.getvalue() + %} + + //---------------------------------------------------------------- + // Insert pages from a source PDF into this PDF. + // For reconstructing the links (_do_links method), we must save the + // insertion point (start_at) if it was specified as -1. + //---------------------------------------------------------------- + FITZEXCEPTION(insert_pdf, !result) + %pythonprepend insert_pdf %{ + """Insert a page range from another PDF. 
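A sketch of common `save()` option combinations, plus `write()` for an in-memory result (file names are placeholders; passing PDF_ENCRYPT_KEEP for incremental saves follows usual PyMuPDF practice):

```python
import fitz

doc = fitz.open("input.pdf")                                    # hypothetical file
doc.save("smaller.pdf", garbage=4, deflate=True, clean=True)    # rewrite and compact
data = doc.write(deflate=True)                                  # same options, to a bytes object
doc.save("input.pdf", incremental=True,                         # append changes to the original
         encryption=fitz.PDF_ENCRYPT_KEEP)
```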
+ + Args: + docsrc: PDF to copy from. Must be different object, but may be same file. + from_page: (int) first source page to copy, 0-based, default 0. + to_page: (int) last source page to copy, 0-based, default last page. + start_at: (int) from_page will become this page number in target. + rotate: (int) rotate copied pages, default -1 is no change. + links: (int/bool) whether to also copy links. + annots: (int/bool) whether to also copy annotations. + show_progress: (int) progress message interval, 0 is no messages. + final: (bool) indicates last insertion from this source PDF. + _gmap: internal use only + + Copy sequence reversed if from_page > to_page.""" + + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if self._graft_id == docsrc._graft_id: + raise ValueError("source and target cannot be same object") + sa = start_at + if sa < 0: + sa = self.page_count + if len(docsrc) > show_progress > 0: + inname = os.path.basename(docsrc.name) + if not inname: + inname = "memory PDF" + outname = os.path.basename(self.name) + if not outname: + outname = "memory PDF" + print("Inserting '%s' at '%s'" % (inname, outname)) + + # retrieve / make a Graftmap to avoid duplicate objects + isrt = docsrc._graft_id + _gmap = self.Graftmaps.get(isrt, None) + if _gmap is None: + _gmap = Graftmap(self) + self.Graftmaps[isrt] = _gmap + %} + + %pythonappend insert_pdf %{ + self._reset_page_refs() + if links: + self._do_links(docsrc, from_page = from_page, to_page = to_page, + start_at = sa) + if final == 1: + self.Graftmaps[isrt] = None%} + + PyObject * + insert_pdf(struct Document *docsrc, + int from_page=-1, + int to_page=-1, + int start_at=-1, + int rotate=-1, + int links=1, + int annots=1, + int show_progress=0, + int final = 1, + struct Graftmap *_gmap=NULL) + { + fz_document *doc = (fz_document *) $self; + fz_document *src = (fz_document *) docsrc; + pdf_document *pdfout = pdf_specifics(gctx, doc); + pdf_document *pdfsrc = pdf_specifics(gctx, src); + int outCount = fz_count_pages(gctx, doc); + int srcCount = fz_count_pages(gctx, src); + + // local copies of page numbers + int fp = from_page, tp = to_page, sa = start_at; + + // normalize page numbers + fp = Py_MAX(fp, 0); // -1 = first page + fp = Py_MIN(fp, srcCount - 1); // but do not exceed last page + + if (tp < 0) tp = srcCount - 1; // -1 = last page + tp = Py_MIN(tp, srcCount - 1); // but do not exceed last page + + if (sa < 0) sa = outCount; // -1 = behind last page + sa = Py_MIN(sa, outCount); // but that is also the limit + + fz_try(gctx) { + if (!pdfout || !pdfsrc) { + RAISEPY(gctx, "source or target not a PDF", PyExc_TypeError); + } + ENSURE_OPERATION(gctx, pdfout); + JM_merge_range(gctx, pdfout, pdfsrc, fp, tp, sa, rotate, links, annots, show_progress, (pdf_graft_map *) _gmap); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + %pythoncode %{ + def insert_file(self, infile, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=True, annots=True,show_progress=0, final=1): + """Insert an arbitrary supported document to an existing PDF. + + The infile may be given as a filename, a Document or a Pixmap. + Other paramters - where applicable - equal those of insert_pdf(). 
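A minimal merge sketch for `insert_pdf()` (placeholder files; source and target must be distinct Document objects):

```python
import fitz

target = fitz.open("a.pdf")                                     # hypothetical files
source = fitz.open("b.pdf")
target.insert_pdf(source)                                       # append all pages of b.pdf
target.insert_pdf(source, from_page=0, to_page=2, start_at=0)   # copy pages 1-3 to the front
target.save("merged.pdf")
```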
+ """ + src = None + if isinstance(infile, Pixmap): + if infile.colorspace.n > 3: + infile = Pixmap(csRGB, infile) + src = Document("png", infile.tobytes()) + elif isinstance(infile, Document): + src = infile + else: + src = Document(infile) + if not src: + raise ValueError("bad infile parameter") + if not src.is_pdf: + pdfbytes = src.convert_to_pdf() + src = Document("pdf", pdfbytes) + return self.insert_pdf(src, from_page=from_page, to_page=to_page, start_at=start_at, rotate=rotate,links=links, annots=annots, show_progress=show_progress, final=final) + %} + + //------------------------------------------------------------------ + // Create and insert a new page (PDF) + //------------------------------------------------------------------ + FITZEXCEPTION(_newPage, !result) + CLOSECHECK(_newPage, """Make a new PDF page.""") + %pythonappend _newPage %{self._reset_page_refs()%} + PyObject *_newPage(int pno=-1, float width=595, float height=842) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + fz_rect mediabox = fz_unit_rect; + mediabox.x1 = width; + mediabox.y1 = height; + pdf_obj *resources = NULL, *page_obj = NULL; + fz_buffer *contents = NULL; + fz_var(contents); + fz_var(page_obj); + fz_var(resources); + fz_try(gctx) { + ASSERT_PDF(pdf); + if (pno < -1) { + RAISEPY(gctx, MSG_BAD_PAGENO, PyExc_ValueError); + } + ENSURE_OPERATION(gctx, pdf); + // create /Resources and /Contents objects + resources = pdf_add_new_dict(gctx, pdf, 1); + page_obj = pdf_add_page(gctx, pdf, mediabox, 0, resources, contents); + pdf_insert_page(gctx, pdf, pno, page_obj); + } + fz_always(gctx) { + fz_drop_buffer(gctx, contents); + pdf_drop_obj(gctx, page_obj); + pdf_drop_obj(gctx, resources); + } + fz_catch(gctx) { + return NULL; + } + + Py_RETURN_NONE; + } + + //------------------------------------------------------------------ + // Create sub-document to keep only selected pages. + // Parameter is a Python sequence of the wanted page numbers. 
+ //------------------------------------------------------------------ + FITZEXCEPTION(select, !result) + %pythonprepend select %{"""Build sub-pdf with page numbers in the list.""" +if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") +if not self.is_pdf: + raise ValueError("is no PDF") +if not hasattr(pyliste, "__getitem__"): + raise ValueError("sequence required") +if len(pyliste) == 0 or min(pyliste) not in range(len(self)) or max(pyliste) not in range(len(self)): + raise ValueError("bad page number(s)") +pyliste = tuple(pyliste)%} + %pythonappend select %{self._reset_page_refs()%} + PyObject *select(PyObject *pyliste) + { + // preparatory stuff: + // (1) get underlying pdf document, + // (2) transform Python list into integer array + + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + int *pages = NULL; + fz_try(gctx) { + // call retainpages (code copy of fz_clean_file.c) + int i, len = (int) PyTuple_Size(pyliste); + pages = fz_realloc_array(gctx, pages, len, int); + for (i = 0; i < len; i++) { + pages[i] = (int) PyLong_AsLong(PyTuple_GET_ITEM(pyliste, (Py_ssize_t) i)); + } + pdf_rearrange_pages(gctx, pdf, len, pages); + if (pdf->rev_page_map) + { + pdf_drop_page_tree(gctx, pdf); + } + } + fz_always(gctx) { + fz_free(gctx, pages); + } + fz_catch(gctx) { + return NULL; + } + + Py_RETURN_NONE; + } + + //------------------------------------------------------------------ + // remove one page + //------------------------------------------------------------------ + FITZEXCEPTION(_delete_page, !result) + PyObject *_delete_page(int pno) + { + fz_try(gctx) { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_specifics(gctx, doc); + pdf_delete_page(gctx, pdf, pno); + if (pdf->rev_page_map) + { + pdf_drop_page_tree(gctx, pdf); + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //------------------------------------------------------------------ + // get document permissions + //------------------------------------------------------------------ + %pythoncode%{@property%} + %pythonprepend permissions %{ + """Document permissions.""" + + if self.is_encrypted: + return 0 + %} + PyObject *permissions() + { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_document_from_fz_document(gctx, doc); + + // for PDF return result of standard function + if (pdf) + return Py_BuildValue("i", pdf_document_permissions(gctx, pdf)); + + // otherwise simulate the PDF return value + int perm = (int) 0xFFFFFFFC; // all permissions granted + // now switch off where needed + if (!fz_has_permission(gctx, doc, FZ_PERMISSION_PRINT)) + perm = perm ^ PDF_PERM_PRINT; + if (!fz_has_permission(gctx, doc, FZ_PERMISSION_EDIT)) + perm = perm ^ PDF_PERM_MODIFY; + if (!fz_has_permission(gctx, doc, FZ_PERMISSION_COPY)) + perm = perm ^ PDF_PERM_COPY; + if (!fz_has_permission(gctx, doc, FZ_PERMISSION_ANNOTATE)) + perm = perm ^ PDF_PERM_ANNOTATE; + return Py_BuildValue("i", perm); + } + + + FITZEXCEPTION(journal_enable, !result) + CLOSECHECK(journal_enable, """Activate document journalling.""") + PyObject *journal_enable() + { + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + ASSERT_PDF(pdf); + pdf_enable_journal(gctx, pdf); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(journal_start_op, !result) + CLOSECHECK(journal_start_op, """Begin a journalling operation.""") + PyObject *journal_start_op(const char *name=NULL) + { + fz_try(gctx) { + pdf_document *pdf 
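`select()` keeps and/or reorders pages according to the given number sequence; a sketch with a placeholder file:

```python
import fitz

doc = fitz.open("input.pdf")                      # hypothetical file
doc.select([0, 2, 4])                             # keep only pages 1, 3 and 5
doc.select(list(range(doc.page_count))[::-1])     # reverse the remaining pages
doc.save("subset.pdf", garbage=4)                 # garbage collection drops unused objects
```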
= pdf_specifics(gctx, (fz_document *) $self); + ASSERT_PDF(pdf); + if (!pdf->journal) { + RAISEPY(gctx, "Journalling not enabled", PyExc_RuntimeError); + } + if (name) { + pdf_begin_operation(gctx, pdf, name); + } else { + pdf_begin_implicit_operation(gctx, pdf); + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(journal_stop_op, !result) + CLOSECHECK(journal_stop_op, """End a journalling operation.""") + PyObject *journal_stop_op() + { + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + ASSERT_PDF(pdf); + pdf_end_operation(gctx, pdf); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(journal_position, !result) + CLOSECHECK(journal_position, """Show journalling state.""") + PyObject *journal_position() + { + int rc, steps=0; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + ASSERT_PDF(pdf); + rc = pdf_undoredo_state(gctx, pdf, &steps); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("ii", rc, steps); + } + + + FITZEXCEPTION(journal_op_name, !result) + CLOSECHECK(journal_op_name, """Show operation name for given step.""") + PyObject *journal_op_name(int step) + { + const char *name=NULL; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + ASSERT_PDF(pdf); + name = pdf_undoredo_step(gctx, pdf, step); + } + fz_catch(gctx) { + return NULL; + } + if (name) { + return PyUnicode_FromString(name); + } else { + Py_RETURN_NONE; + } + } + + + FITZEXCEPTION(journal_can_do, !result) + CLOSECHECK(journal_can_do, """Show if undo and / or redo are possible.""") + PyObject *journal_can_do() + { + int undo=0, redo=0; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + ASSERT_PDF(pdf); + undo = pdf_can_undo(gctx, pdf); + redo = pdf_can_redo(gctx, pdf); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("{s:N,s:N}", "undo", JM_BOOL(undo), "redo", JM_BOOL(redo)); + } + + + FITZEXCEPTION(journal_undo, !result) + CLOSECHECK(journal_undo, """Move backwards in the journal.""") + PyObject *journal_undo() + { + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + ASSERT_PDF(pdf); + pdf_undo(gctx, pdf); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_TRUE; + } + + + FITZEXCEPTION(journal_redo, !result) + CLOSECHECK(journal_redo, """Move forward in the journal.""") + PyObject *journal_redo() + { + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + ASSERT_PDF(pdf); + pdf_redo(gctx, pdf); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_TRUE; + } + + + FITZEXCEPTION(journal_save, !result) + CLOSECHECK(journal_save, """Save journal to a file.""") + PyObject *journal_save(PyObject *filename) + { + fz_output *out = NULL; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + ASSERT_PDF(pdf); + if (PyUnicode_Check(filename)) { + pdf_save_journal(gctx, pdf, (const char *) PyUnicode_AsUTF8(filename)); + } else { + out = JM_new_output_fileptr(gctx, filename); + pdf_write_journal(gctx, pdf, out); + } + } + fz_always(gctx) { + fz_drop_output(gctx, out); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(journal_load, !result) + CLOSECHECK(journal_load, """Load a journal from a file.""") + PyObject *journal_load(PyObject *filename) + { + fz_buffer *res = NULL; + fz_stream *stm = NULL; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) 
$self); + ASSERT_PDF(pdf); + if (PyUnicode_Check(filename)) { + pdf_load_journal(gctx, pdf, PyUnicode_AsUTF8(filename)); + } else { + res = JM_BufferFromBytes(gctx, filename); + stm = fz_open_buffer(gctx, res); + pdf_deserialise_journal(gctx, pdf, stm); + } + if (!pdf->journal) { + RAISEPY(gctx, "Journal and document do not match", JM_Exc_FileDataError); + } + } + fz_always(gctx) { + fz_drop_stream(gctx, stm); + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(journal_is_enabled, !result) + CLOSECHECK(journal_is_enabled, """Check if journalling is enabled.""") + PyObject *journal_is_enabled() + { + int enabled = 0; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + enabled = pdf && pdf->journal; + } + fz_catch(gctx) { + return NULL; + } + return JM_BOOL(enabled); + } + + + FITZEXCEPTION(_get_char_widths, !result) + CLOSECHECK(_get_char_widths, """Return list of glyphs and glyph widths of a font.""") + PyObject *_get_char_widths(int xref, char *bfname, char *ext, + int ordering, int limit, int idx = 0) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + PyObject *wlist = NULL; + int i, glyph, mylimit; + mylimit = limit; + if (mylimit < 256) mylimit = 256; + const unsigned char *data; + int size, index; + fz_font *font = NULL; + fz_buffer *buf = NULL; + + fz_try(gctx) { + ASSERT_PDF(pdf); + if (ordering >= 0) { + data = fz_lookup_cjk_font(gctx, ordering, &size, &index); + font = fz_new_font_from_memory(gctx, NULL, data, size, index, 0); + goto weiter; + } + data = fz_lookup_base14_font(gctx, bfname, &size); + if (data) { + font = fz_new_font_from_memory(gctx, bfname, data, size, 0, 0); + goto weiter; + } + buf = JM_get_fontbuffer(gctx, pdf, xref); + if (!buf) { + fz_throw(gctx, FZ_ERROR_GENERIC, "font at xref %d is not supported", xref); + } + font = fz_new_font_from_buffer(gctx, NULL, buf, idx, 0); + + weiter:; + wlist = PyList_New(0); + float adv; + for (i = 0; i < mylimit; i++) { + glyph = fz_encode_character(gctx, font, i); + adv = fz_advance_glyph(gctx, font, glyph, 0); + if (ordering >= 0) { + glyph = i; + } + if (glyph > 0) { + LIST_APPEND_DROP(wlist, Py_BuildValue("if", glyph, adv)); + } else { + LIST_APPEND_DROP(wlist, Py_BuildValue("if", glyph, 0.0)); + } + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, buf); + fz_drop_font(gctx, font); + } + fz_catch(gctx) { + return NULL; + } + return wlist; + } + + + FITZEXCEPTION(page_xref, !result) + CLOSECHECK0(page_xref, """Get xref of page number.""") + PyObject *page_xref(int pno) + { + fz_document *this_doc = (fz_document *) $self; + int page_count = fz_count_pages(gctx, this_doc); + int n = pno; + while (n < 0) n += page_count; + pdf_document *pdf = pdf_specifics(gctx, this_doc); + int xref = 0; + fz_try(gctx) { + if (n >= page_count) { + RAISEPY(gctx, MSG_BAD_PAGENO, PyExc_ValueError); + } + ASSERT_PDF(pdf); + xref = pdf_to_num(gctx, pdf_lookup_page_obj(gctx, pdf, n)); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", xref); + } + + + FITZEXCEPTION(page_annot_xrefs, !result) + CLOSECHECK0(page_annot_xrefs, """Get list annotations of page number.""") + PyObject *page_annot_xrefs(int pno) + { + fz_document *this_doc = (fz_document *) $self; + int page_count = fz_count_pages(gctx, this_doc); + int n = pno; + while (n < 0) n += page_count; + pdf_document *pdf = pdf_specifics(gctx, this_doc); + PyObject *annots = NULL; + fz_try(gctx) { + if (n >= page_count) { + RAISEPY(gctx, MSG_BAD_PAGENO, 
PyExc_ValueError); + } + ASSERT_PDF(pdf); + annots = JM_get_annot_xref_list(gctx, pdf_lookup_page_obj(gctx, pdf, n)); + } + fz_catch(gctx) { + return NULL; + } + return annots; + } + + + FITZEXCEPTION(page_cropbox, !result) + CLOSECHECK0(page_cropbox, """Get CropBox of page number (without loading page).""") + %pythonappend page_cropbox %{val = Rect(JM_TUPLE3(val))%} + PyObject *page_cropbox(int pno) + { + fz_document *this_doc = (fz_document *) $self; + int page_count = fz_count_pages(gctx, this_doc); + int n = pno; + while (n < 0) n += page_count; + pdf_obj *pageref = NULL; + fz_var(pageref); + pdf_document *pdf = pdf_specifics(gctx, this_doc); + fz_try(gctx) { + if (n >= page_count) { + RAISEPY(gctx, MSG_BAD_PAGENO, PyExc_ValueError); + } + ASSERT_PDF(pdf); + pageref = pdf_lookup_page_obj(gctx, pdf, n); + } + fz_catch(gctx) { + return NULL; + } + return JM_py_from_rect(JM_cropbox(gctx, pageref)); + } + + + FITZEXCEPTION(_getPageInfo, !result) + CLOSECHECK(_getPageInfo, """List fonts, images, XObjects used on a page.""") + PyObject *_getPageInfo(int pno, int what) + { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_specifics(gctx, doc); + pdf_obj *pageref, *rsrc; + PyObject *liste = NULL, *tracer = NULL; + fz_var(liste); + fz_var(tracer); + fz_try(gctx) { + int page_count = fz_count_pages(gctx, doc); + int n = pno; // pno < 0 is allowed + while (n < 0) n += page_count; // make it non-negative + if (n >= page_count) { + RAISEPY(gctx, MSG_BAD_PAGENO, PyExc_ValueError); + } + ASSERT_PDF(pdf); + pageref = pdf_lookup_page_obj(gctx, pdf, n); + rsrc = pdf_dict_get_inheritable(gctx, + pageref, PDF_NAME(Resources)); + liste = PyList_New(0); + tracer = PyList_New(0); + if (rsrc) { + JM_scan_resources(gctx, pdf, rsrc, liste, what, 0, tracer); + } + } + fz_always(gctx) { + Py_CLEAR(tracer); + } + fz_catch(gctx) { + Py_CLEAR(liste); + return NULL; + } + return liste; + } + + FITZEXCEPTION(extract_font, !result) + CLOSECHECK(extract_font, """Get a font by xref. 
Returns a tuple or dictionary.""") + PyObject *extract_font(int xref=0, int info_only=0, PyObject *named=NULL) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + + fz_try(gctx) { + ASSERT_PDF(pdf); + } + fz_catch(gctx) { + return NULL; + } + + fz_buffer *buffer = NULL; + pdf_obj *obj, *basefont, *bname; + PyObject *bytes = NULL; + char *ext = NULL; + PyObject *rc; + fz_try(gctx) { + obj = pdf_load_object(gctx, pdf, xref); + pdf_obj *type = pdf_dict_get(gctx, obj, PDF_NAME(Type)); + pdf_obj *subtype = pdf_dict_get(gctx, obj, PDF_NAME(Subtype)); + if(pdf_name_eq(gctx, type, PDF_NAME(Font)) && + strncmp(pdf_to_name(gctx, subtype), "CIDFontType", 11) != 0) { + basefont = pdf_dict_get(gctx, obj, PDF_NAME(BaseFont)); + if (!basefont || pdf_is_null(gctx, basefont)) { + bname = pdf_dict_get(gctx, obj, PDF_NAME(Name)); + } else { + bname = basefont; + } + ext = JM_get_fontextension(gctx, pdf, xref); + if (strcmp(ext, "n/a") != 0 && !info_only) { + buffer = JM_get_fontbuffer(gctx, pdf, xref); + bytes = JM_BinFromBuffer(gctx, buffer); + fz_drop_buffer(gctx, buffer); + } else { + bytes = Py_BuildValue("y", ""); + } + if (PyObject_Not(named)) { + rc = PyTuple_New(4); + PyTuple_SET_ITEM(rc, 0, JM_EscapeStrFromStr(pdf_to_name(gctx, bname))); + PyTuple_SET_ITEM(rc, 1, JM_UnicodeFromStr(ext)); + PyTuple_SET_ITEM(rc, 2, JM_UnicodeFromStr(pdf_to_name(gctx, subtype))); + PyTuple_SET_ITEM(rc, 3, bytes); + } else { + rc = PyDict_New(); + DICT_SETITEM_DROP(rc, dictkey_name, JM_EscapeStrFromStr(pdf_to_name(gctx, bname))); + DICT_SETITEM_DROP(rc, dictkey_ext, JM_UnicodeFromStr(ext)); + DICT_SETITEM_DROP(rc, dictkey_type, JM_UnicodeFromStr(pdf_to_name(gctx, subtype))); + DICT_SETITEM_DROP(rc, dictkey_content, bytes); + } + } else { + if (PyObject_Not(named)) { + rc = Py_BuildValue("sssy", "", "", "", ""); + } else { + rc = PyDict_New(); + DICT_SETITEM_DROP(rc, dictkey_name, Py_BuildValue("s", "")); + DICT_SETITEM_DROP(rc, dictkey_ext, Py_BuildValue("s", "")); + DICT_SETITEM_DROP(rc, dictkey_type, Py_BuildValue("s", "")); + DICT_SETITEM_DROP(rc, dictkey_content, Py_BuildValue("y", "")); + } + } + } + fz_always(gctx) { + pdf_drop_obj(gctx, obj); + JM_PyErr_Clear; + } + fz_catch(gctx) { + if (PyObject_Not(named)) { + rc = Py_BuildValue("sssy", "invalid-name", "", "", ""); + } else { + rc = PyDict_New(); + DICT_SETITEM_DROP(rc, dictkey_name, Py_BuildValue("s", "invalid-name")); + DICT_SETITEM_DROP(rc, dictkey_ext, Py_BuildValue("s", "")); + DICT_SETITEM_DROP(rc, dictkey_type, Py_BuildValue("s", "")); + DICT_SETITEM_DROP(rc, dictkey_content, Py_BuildValue("y", "")); + } + } + return rc; + } + + + FITZEXCEPTION(extract_image, !result) + CLOSECHECK(extract_image, """Get image by xref. 
Returns a dictionary.""") + PyObject *extract_image(int xref) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + pdf_obj *obj = NULL; + fz_buffer *res = NULL; + fz_image *img = NULL; + PyObject *rc = NULL; + const char *ext = NULL; + const char *cs_name = NULL; + int img_type = 0, xres, yres, colorspace; + int smask = 0, width, height, bpc; + fz_compressed_buffer *cbuf = NULL; + fz_var(img); + fz_var(res); + fz_var(obj); + + fz_try(gctx) { + ASSERT_PDF(pdf); + if (!INRANGE(xref, 1, pdf_xref_len(gctx, pdf)-1)) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + obj = pdf_new_indirect(gctx, pdf, xref, 0); + pdf_obj *subtype = pdf_dict_get(gctx, obj, PDF_NAME(Subtype)); + + if (!pdf_name_eq(gctx, subtype, PDF_NAME(Image))) { + RAISEPY(gctx, "not an image", PyExc_ValueError); + } + + pdf_obj *o = pdf_dict_geta(gctx, obj, PDF_NAME(SMask), PDF_NAME(Mask)); + if (o) smask = pdf_to_num(gctx, o); + + if (pdf_is_jpx_image(gctx, obj)) { + img_type = FZ_IMAGE_JPX; + res = pdf_load_stream(gctx, obj); + ext = "jpx"; + } + if (JM_is_jbig2_image(gctx, obj)) { + img_type = FZ_IMAGE_JBIG2; + res = pdf_load_stream(gctx, obj); + ext = "jb2"; + } + if (img_type == FZ_IMAGE_UNKNOWN) { + res = pdf_load_raw_stream(gctx, obj); + unsigned char *c = NULL; + fz_buffer_storage(gctx, res, &c); + img_type = fz_recognize_image_format(gctx, c); + ext = JM_image_extension(img_type); + } + if (img_type == FZ_IMAGE_UNKNOWN) { + fz_drop_buffer(gctx, res); + res = NULL; + img = pdf_load_image(gctx, pdf, obj); + cbuf = fz_compressed_image_buffer(gctx, img); + if (cbuf && + cbuf->params.type != FZ_IMAGE_RAW && + cbuf->params.type != FZ_IMAGE_FAX && + cbuf->params.type != FZ_IMAGE_FLATE && + cbuf->params.type != FZ_IMAGE_LZW && + cbuf->params.type != FZ_IMAGE_RLD) { + img_type = cbuf->params.type; + ext = JM_image_extension(img_type); + res = cbuf->buffer; + } else { + res = fz_new_buffer_from_image_as_png(gctx, img, + fz_default_color_params); + ext = "png"; + } + } else { + img = fz_new_image_from_buffer(gctx, res); + } + + fz_image_resolution(img, &xres, &yres); + width = img->w; + height = img->h; + colorspace = img->n; + bpc = img->bpc; + cs_name = fz_colorspace_name(gctx, img->colorspace); + + rc = PyDict_New(); + DICT_SETITEM_DROP(rc, dictkey_ext, + JM_UnicodeFromStr(ext)); + DICT_SETITEM_DROP(rc, dictkey_smask, + Py_BuildValue("i", smask)); + DICT_SETITEM_DROP(rc, dictkey_width, + Py_BuildValue("i", width)); + DICT_SETITEM_DROP(rc, dictkey_height, + Py_BuildValue("i", height)); + DICT_SETITEM_DROP(rc, dictkey_colorspace, + Py_BuildValue("i", colorspace)); + DICT_SETITEM_DROP(rc, dictkey_bpc, + Py_BuildValue("i", bpc)); + DICT_SETITEM_DROP(rc, dictkey_xres, + Py_BuildValue("i", xres)); + DICT_SETITEM_DROP(rc, dictkey_yres, + Py_BuildValue("i", yres)); + DICT_SETITEM_DROP(rc, dictkey_cs_name, + JM_UnicodeFromStr(cs_name)); + DICT_SETITEM_DROP(rc, dictkey_image, + JM_BinFromBuffer(gctx, res)); + } + fz_always(gctx) { + fz_drop_image(gctx, img); + if (!cbuf) fz_drop_buffer(gctx, res); + pdf_drop_obj(gctx, obj); + } + + fz_catch(gctx) { + Py_CLEAR(rc); + fz_warn(gctx, "%s", fz_caught_message(gctx)); + Py_RETURN_FALSE; + } + if (!rc) + Py_RETURN_NONE; + return rc; + } + + + //------------------------------------------------------------------ + // Delete all bookmarks (table of contents) + // returns list of deleted (now available) xref numbers + //------------------------------------------------------------------ + CLOSECHECK(_delToC, """Delete the TOC.""") + %pythonappend _delToC %{self.init_doc()%} + 
PyObject *_delToC() + { + PyObject *xrefs = PyList_New(0); // create Python list + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) return xrefs; // not a pdf + + pdf_obj *root, *olroot, *first; + int xref_count, olroot_xref, i, xref; + + // get the main root + root = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root)); + // get the outline root + olroot = pdf_dict_get(gctx, root, PDF_NAME(Outlines)); + if (!olroot) return xrefs; // no outlines or some problem + + first = pdf_dict_get(gctx, olroot, PDF_NAME(First)); // first outline + + xrefs = JM_outline_xrefs(gctx, first, xrefs); + xref_count = (int) PyList_Size(xrefs); + + olroot_xref = pdf_to_num(gctx, olroot); // delete OL root + pdf_delete_object(gctx, pdf, olroot_xref); // delete OL root + pdf_dict_del(gctx, root, PDF_NAME(Outlines)); // delete OL root + + for (i = 0; i < xref_count; i++) + { + JM_INT_ITEM(xrefs, i, &xref); + pdf_delete_object(gctx, pdf, xref); // delete outline item + } + LIST_APPEND_DROP(xrefs, Py_BuildValue("i", olroot_xref)); + + return xrefs; + } + + + //------------------------------------------------------------------ + // Check: is xref a stream object? + //------------------------------------------------------------------ + CLOSECHECK0(xref_is_stream, """Check if xref is a stream object.""") + PyObject *xref_is_stream(int xref=0) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_FALSE; // not a PDF + return JM_BOOL(pdf_obj_num_is_stream(gctx, pdf, xref)); + } + + //------------------------------------------------------------------ + // Return or set NeedAppearances + //------------------------------------------------------------------ + %pythonprepend need_appearances +%{"""Get/set the NeedAppearances value.""" +if self.is_closed: + raise ValueError("document closed") +if not self.is_form_pdf: + return None +%} + PyObject *need_appearances(PyObject *value=NULL) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + int oldval = -1; + pdf_obj *app = NULL; + char appkey[] = "NeedAppearances"; + fz_try(gctx) { + pdf_obj *form = pdf_dict_getp(gctx, pdf_trailer(gctx, pdf), + "Root/AcroForm"); + app = pdf_dict_gets(gctx, form, appkey); + if (pdf_is_bool(gctx, app)) { + oldval = pdf_to_bool(gctx, app); + } + + if (EXISTS(value)) { + pdf_dict_puts_drop(gctx, form, appkey, PDF_TRUE); + } else if (value == Py_False) { + pdf_dict_puts_drop(gctx, form, appkey, PDF_FALSE); + } + } + fz_catch(gctx) { + Py_RETURN_NONE; + } + if (value != Py_None) { + return value; + } + if (oldval >= 0) { + return JM_BOOL(oldval); + } + Py_RETURN_NONE; + } + + //------------------------------------------------------------------ + // Return the /SigFlags value + //------------------------------------------------------------------ + CLOSECHECK0(get_sigflags, """Get the /SigFlags value.""") + int get_sigflags() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) return -1; // not a PDF + int sigflag = -1; + fz_try(gctx) { + pdf_obj *sigflags = pdf_dict_getl(gctx, + pdf_trailer(gctx, pdf), + PDF_NAME(Root), + PDF_NAME(AcroForm), + PDF_NAME(SigFlags), + NULL); + if (sigflags) { + sigflag = (int) pdf_to_int(gctx, sigflags); + } + } + fz_catch(gctx) { + return -1; // any problem + } + return sigflag; + } + + //------------------------------------------------------------------ + // Check: is this an AcroForm with at least one field? 
+ //------------------------------------------------------------------ + CLOSECHECK0(is_form_pdf, """Either False or PDF field count.""") + %pythoncode%{@property%} + PyObject *is_form_pdf() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_FALSE; // not a PDF + int count = -1; // init count + fz_try(gctx) { + pdf_obj *fields = pdf_dict_getl(gctx, + pdf_trailer(gctx, pdf), + PDF_NAME(Root), + PDF_NAME(AcroForm), + PDF_NAME(Fields), + NULL); + if (pdf_is_array(gctx, fields)) { + count = pdf_array_len(gctx, fields); + } + } + fz_catch(gctx) { + Py_RETURN_FALSE; + } + if (count >= 0) { + return Py_BuildValue("i", count); + } else { + Py_RETURN_FALSE; + } + } + + //------------------------------------------------------------------ + // Return the list of field font resource names + //------------------------------------------------------------------ + CLOSECHECK0(FormFonts, """Get list of field font resource names.""") + %pythoncode%{@property%} + PyObject *FormFonts() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_NONE; // not a PDF + pdf_obj *fonts = NULL; + PyObject *liste = PyList_New(0); + fz_var(liste); + fz_try(gctx) { + fonts = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root), PDF_NAME(AcroForm), PDF_NAME(DR), PDF_NAME(Font), NULL); + if (fonts && pdf_is_dict(gctx, fonts)) // fonts exist + { + int i, n = pdf_dict_len(gctx, fonts); + for (i = 0; i < n; i++) + { + pdf_obj *f = pdf_dict_get_key(gctx, fonts, i); + LIST_APPEND_DROP(liste, JM_UnicodeFromStr(pdf_to_name(gctx, f))); + } + } + } + fz_catch(gctx) { + Py_DECREF(liste); + Py_RETURN_NONE; // any problem yields None + } + return liste; + } + + //------------------------------------------------------------------ + // Add a field font + //------------------------------------------------------------------ + FITZEXCEPTION(_addFormFont, !result) + CLOSECHECK(_addFormFont, """Add new form font.""") + PyObject *_addFormFont(char *name, char *font) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_NONE; // not a PDF + pdf_obj *fonts = NULL; + fz_try(gctx) { + fonts = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root), + PDF_NAME(AcroForm), PDF_NAME(DR), PDF_NAME(Font), NULL); + if (!fonts || !pdf_is_dict(gctx, fonts)) { + RAISEPY(gctx, "PDF has no form fonts yet", PyExc_RuntimeError); + } + pdf_obj *k = pdf_new_name(gctx, (const char *) name); + pdf_obj *v = JM_pdf_obj_from_str(gctx, pdf, font); + pdf_dict_put(gctx, fonts, k, v); + } + fz_catch(gctx) NULL; + Py_RETURN_NONE; + } + + //------------------------------------------------------------------ + // Get Xref Number of Outline Root, create it if missing + //------------------------------------------------------------------ + FITZEXCEPTION(_getOLRootNumber, !result) + CLOSECHECK(_getOLRootNumber, """Get xref of Outline Root, create it if missing.""") + PyObject *_getOLRootNumber() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + pdf_obj *ind_obj = NULL; + pdf_obj *olroot2 = NULL; + int ret; + fz_var(ind_obj); + fz_var(olroot2); + fz_try(gctx) { + ASSERT_PDF(pdf); + // get main root + pdf_obj *root = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root)); + // get outline root + pdf_obj *olroot = pdf_dict_get(gctx, root, PDF_NAME(Outlines)); + if (!olroot) + { + olroot2 = pdf_new_dict(gctx, pdf, 4); + pdf_dict_put(gctx, olroot2, PDF_NAME(Type), PDF_NAME(Outlines)); + ind_obj = pdf_add_object(gctx, pdf, olroot2); + 
pdf_dict_put(gctx, root, PDF_NAME(Outlines), ind_obj); + olroot = pdf_dict_get(gctx, root, PDF_NAME(Outlines)); + + } + ret = pdf_to_num(gctx, olroot); + } + fz_always(gctx) { + pdf_drop_obj(gctx, ind_obj); + pdf_drop_obj(gctx, olroot2); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", ret); + } + + //------------------------------------------------------------------ + // Get a new Xref number + //------------------------------------------------------------------ + FITZEXCEPTION(get_new_xref, !result) + CLOSECHECK(get_new_xref, """Make a new xref.""") + PyObject *get_new_xref() + { + int xref = 0; + fz_try(gctx) { + fz_document *doc = (fz_document *) $self; + pdf_document *pdf = pdf_specifics(gctx, doc); + ASSERT_PDF(pdf); + ENSURE_OPERATION(gctx, pdf); + xref = pdf_create_object(gctx, pdf); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", xref); + } + + //------------------------------------------------------------------ + // Get Length of XREF table + //------------------------------------------------------------------ + FITZEXCEPTION(xref_length, !result) + CLOSECHECK0(xref_length, """Get length of xref table.""") + PyObject *xref_length() + { + int xreflen = 0; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (pdf) xreflen = pdf_xref_len(gctx, pdf); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", xreflen); + } + + //------------------------------------------------------------------ + // Get XML Metadata + //------------------------------------------------------------------ + CLOSECHECK0(get_xml_metadata, """Get document XML metadata.""") + PyObject *get_xml_metadata() + { + PyObject *rc = NULL; + fz_buffer *buff = NULL; + pdf_obj *xml = NULL; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (pdf) { + xml = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root), PDF_NAME(Metadata), NULL); + } + if (xml) { + buff = pdf_load_stream(gctx, xml); + rc = JM_UnicodeFromBuffer(gctx, buff); + } else { + rc = EMPTY_STRING; + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, buff); + PyErr_Clear(); + } + fz_catch(gctx) { + return EMPTY_STRING; + } + return rc; + } + + //------------------------------------------------------------------ + // Get XML Metadata xref + //------------------------------------------------------------------ + FITZEXCEPTION(xref_xml_metadata, !result) + CLOSECHECK0(xref_xml_metadata, """Get xref of document XML metadata.""") + PyObject *xref_xml_metadata() + { + int xref = 0; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + ASSERT_PDF(pdf); + pdf_obj *root = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root)); + if (!root) { + RAISEPY(gctx, MSG_BAD_PDFROOT, JM_Exc_FileDataError); + } + pdf_obj *xml = pdf_dict_get(gctx, root, PDF_NAME(Metadata)); + if (xml) xref = pdf_to_num(gctx, xml); + } + fz_catch(gctx) {;} + return Py_BuildValue("i", xref); + } + + //------------------------------------------------------------------ + // Delete XML Metadata + //------------------------------------------------------------------ + FITZEXCEPTION(del_xml_metadata, !result) + CLOSECHECK(del_xml_metadata, """Delete XML metadata.""") + PyObject *del_xml_metadata() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + fz_try(gctx) { + ASSERT_PDF(pdf); + pdf_obj *root = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root)); + if (root) pdf_dict_del(gctx, root, PDF_NAME(Metadata)); + } + 
fz_catch(gctx) { + return NULL; + } + + Py_RETURN_NONE; + } + + //------------------------------------------------------------------ + // Set XML-based Metadata + //------------------------------------------------------------------ + FITZEXCEPTION(set_xml_metadata, !result) + CLOSECHECK(set_xml_metadata, """Store XML document level metadata.""") + PyObject *set_xml_metadata(char *metadata) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + fz_buffer *res = NULL; + fz_try(gctx) { + ASSERT_PDF(pdf); + pdf_obj *root = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root)); + if (!root) { + RAISEPY(gctx, MSG_BAD_PDFROOT, JM_Exc_FileDataError); + } + res = fz_new_buffer_from_copied_data(gctx, (const unsigned char *) metadata, strlen(metadata)); + pdf_obj *xml = pdf_dict_get(gctx, root, PDF_NAME(Metadata)); + if (xml) { + JM_update_stream(gctx, pdf, xml, res, 0); + } else { + xml = pdf_add_stream(gctx, pdf, res, NULL, 0); + pdf_dict_put(gctx, xml, PDF_NAME(Type), PDF_NAME(Metadata)); + pdf_dict_put(gctx, xml, PDF_NAME(Subtype), PDF_NAME(XML)); + pdf_dict_put_drop(gctx, root, PDF_NAME(Metadata), xml); + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + return NULL; + } + + Py_RETURN_NONE; + } + + //------------------------------------------------------------------ + // Get Object String of xref + //------------------------------------------------------------------ + FITZEXCEPTION(xref_object, !result) + CLOSECHECK0(xref_object, """Get xref object source as a string.""") + PyObject *xref_object(int xref, int compressed=0, int ascii=0) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + pdf_obj *obj = NULL; + PyObject *text = NULL; + fz_buffer *res=NULL; + fz_try(gctx) { + ASSERT_PDF(pdf); + int xreflen = pdf_xref_len(gctx, pdf); + if (!INRANGE(xref, 1, xreflen-1) && xref != -1) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + if (xref > 0) { + obj = pdf_load_object(gctx, pdf, xref); + } else { + obj = pdf_trailer(gctx, pdf); + } + res = JM_object_to_buffer(gctx, pdf_resolve_indirect(gctx, obj), compressed, ascii); + text = JM_EscapeStrFromBuffer(gctx, res); + } + fz_always(gctx) { + if (xref > 0) { + pdf_drop_obj(gctx, obj); + } + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) return EMPTY_STRING; + return text; + } + %pythoncode %{ + def pdf_trailer(self, compressed: bool=False, ascii:bool=False)->str: + """Get PDF trailer as a string.""" + return self.xref_object(-1, compressed=compressed, ascii=ascii)%} + + + //------------------------------------------------------------------ + // Get compressed stream of an object by xref + // Py_RETURN_NONE if not stream + //------------------------------------------------------------------ + FITZEXCEPTION(xref_stream_raw, !result) + CLOSECHECK(xref_stream_raw, """Get xref stream without decompression.""") + PyObject *xref_stream_raw(int xref) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + PyObject *r = NULL; + pdf_obj *obj = NULL; + fz_var(obj); + fz_buffer *res = NULL; + fz_var(res); + fz_try(gctx) { + ASSERT_PDF(pdf); + int xreflen = pdf_xref_len(gctx, pdf); + if (!INRANGE(xref, 1, xreflen-1) && xref != -1) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + if (xref >= 0) { + obj = pdf_new_indirect(gctx, pdf, xref, 0); + } else { + obj = pdf_trailer(gctx, pdf); + } + if (pdf_is_stream(gctx, obj)) + { + res = pdf_load_raw_stream_number(gctx, pdf, xref); + r = JM_BinFromBuffer(gctx, res); + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); 
+ if (xref >= 0) { + pdf_drop_obj(gctx, obj); + } + } + fz_catch(gctx) + { + Py_CLEAR(r); + return NULL; + } + if (!r) Py_RETURN_NONE; + return r; + } + + //------------------------------------------------------------------ + // Get decompressed stream of an object by xref + // Py_RETURN_NONE if not stream + //------------------------------------------------------------------ + FITZEXCEPTION(xref_stream, !result) + CLOSECHECK(xref_stream, """Get decompressed xref stream.""") + PyObject *xref_stream(int xref) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + PyObject *r = Py_None; + pdf_obj *obj = NULL; + fz_var(obj); + fz_buffer *res = NULL; + fz_var(res); + fz_try(gctx) { + ASSERT_PDF(pdf); + int xreflen = pdf_xref_len(gctx, pdf); + if (!INRANGE(xref, 1, xreflen-1) && xref != -1) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + if (xref >= 0) { + obj = pdf_new_indirect(gctx, pdf, xref, 0); + } else { + obj = pdf_trailer(gctx, pdf); + } + if (pdf_is_stream(gctx, obj)) + { + res = pdf_load_stream_number(gctx, pdf, xref); + r = JM_BinFromBuffer(gctx, res); + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + if (xref >= 0) { + pdf_drop_obj(gctx, obj); + } + } + fz_catch(gctx) + { + Py_CLEAR(r); + return NULL; + } + return r; + } + + //------------------------------------------------------------------ + // Update an Xref number with a new object given as a string + //------------------------------------------------------------------ + FITZEXCEPTION(update_object, !result) + CLOSECHECK(update_object, """Replace object definition source.""") + PyObject *update_object(int xref, char *text, struct Page *page = NULL) + { + pdf_obj *new_obj; + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + fz_try(gctx) { + ASSERT_PDF(pdf); + int xreflen = pdf_xref_len(gctx, pdf); + if (!INRANGE(xref, 1, xreflen-1)) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + ENSURE_OPERATION(gctx, pdf); + // create new object with passed-in string + new_obj = JM_pdf_obj_from_str(gctx, pdf, text); + pdf_update_object(gctx, pdf, xref, new_obj); + pdf_drop_obj(gctx, new_obj); + if (page) { + pdf_page *pdfpage = pdf_page_from_fz_page(gctx, (fz_page *) page); + JM_refresh_links(gctx, pdfpage); + } + } + fz_catch(gctx) { + return NULL; + } + + Py_RETURN_NONE; + } + + //------------------------------------------------------------------ + // Update a stream identified by its xref + //------------------------------------------------------------------ + FITZEXCEPTION(update_stream, !result) + CLOSECHECK(update_stream, """Replace xref stream part.""") + PyObject *update_stream(int xref=0, PyObject *stream=NULL, int new=1, int compress=1) + { + pdf_obj *obj = NULL; + fz_var(obj); + fz_buffer *res = NULL; + fz_var(res); + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + fz_try(gctx) { + ASSERT_PDF(pdf); + int xreflen = pdf_xref_len(gctx, pdf); + if (!INRANGE(xref, 1, xreflen-1)) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + ENSURE_OPERATION(gctx, pdf); + // get the object + obj = pdf_new_indirect(gctx, pdf, xref, 0); + if (!pdf_is_dict(gctx, obj)) { + RAISEPY(gctx, MSG_IS_NO_DICT, PyExc_ValueError); + } + res = JM_BufferFromBytes(gctx, stream); + if (!res) { + RAISEPY(gctx, MSG_BAD_BUFFER, PyExc_TypeError); + } + JM_update_stream(gctx, pdf, obj, res, compress); + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + pdf_drop_obj(gctx, obj); + } + fz_catch(gctx) + return NULL; + + Py_RETURN_NONE; + } + + + 
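+        //------------------------------------------------------------------
+        // Illustrative Python-level sketch of the xref accessors wrapped
+        // above (xref_object / xref_is_stream / xref_stream / update_stream).
+        // "example.pdf", "out.pdf" and xref number 5 are placeholder values,
+        // and the sketch assumes xref 5 refers to a stream object:
+        //
+        //     import fitz  # PyMuPDF
+        //     doc = fitz.open("example.pdf")
+        //     print(doc.xref_object(5, compressed=True))   # object source
+        //     if doc.xref_is_stream(5):
+        //         data = doc.xref_stream(5)                # decompressed bytes
+        //         doc.update_stream(5, data.replace(b"old", b"new"))
+        //     doc.save("out.pdf")
+        //------------------------------------------------------------------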
//------------------------------------------------------------------ + // create / refresh the page map + //------------------------------------------------------------------ + FITZEXCEPTION(_make_page_map, !result) + CLOSECHECK0(_make_page_map, """Make an array page number -> page object.""") + PyObject *_make_page_map() + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + if (!pdf) Py_RETURN_NONE; + fz_try(gctx) { + pdf_drop_page_tree(gctx, pdf); + pdf_load_page_tree(gctx, pdf); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", pdf->map_page_count); + } + + + //------------------------------------------------------------------ + // full (deep) copy of one page + //------------------------------------------------------------------ + FITZEXCEPTION(fullcopy_page, !result) + CLOSECHECK0(fullcopy_page, """Make a full page duplicate.""") + %pythonappend fullcopy_page %{self._reset_page_refs()%} + PyObject *fullcopy_page(int pno, int to = -1) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + int page_count = pdf_count_pages(gctx, pdf); + fz_buffer *res = NULL, *nres=NULL; + fz_buffer *contents_buffer = NULL; + fz_var(pdf); + fz_var(res); + fz_var(nres); + fz_var(contents_buffer); + fz_try(gctx) { + ASSERT_PDF(pdf); + if (!INRANGE(pno, 0, page_count - 1) || + !INRANGE(to, -1, page_count - 1)) { + RAISEPY(gctx, MSG_BAD_PAGENO, PyExc_ValueError); + } + + pdf_obj *page1 = pdf_resolve_indirect(gctx, + pdf_lookup_page_obj(gctx, pdf, pno)); + + pdf_obj *page2 = pdf_deep_copy_obj(gctx, page1); + pdf_obj *old_annots = pdf_dict_get(gctx, page2, PDF_NAME(Annots)); + + // copy annotations, but remove Popup and IRT types + if (old_annots) { + int i, n = pdf_array_len(gctx, old_annots); + pdf_obj *new_annots = pdf_new_array(gctx, pdf, n); + for (i = 0; i < n; i++) { + pdf_obj *o = pdf_array_get(gctx, old_annots, i); + pdf_obj *subtype = pdf_dict_get(gctx, o, PDF_NAME(Subtype)); + if (pdf_name_eq(gctx, subtype, PDF_NAME(Popup))) continue; + if (pdf_dict_gets(gctx, o, "IRT")) continue; + pdf_obj *copy_o = pdf_deep_copy_obj(gctx, + pdf_resolve_indirect(gctx, o)); + int xref = pdf_create_object(gctx, pdf); + pdf_update_object(gctx, pdf, xref, copy_o); + pdf_drop_obj(gctx, copy_o); + copy_o = pdf_new_indirect(gctx, pdf, xref, 0); + pdf_dict_del(gctx, copy_o, PDF_NAME(Popup)); + pdf_dict_del(gctx, copy_o, PDF_NAME(P)); + pdf_array_push_drop(gctx, new_annots, copy_o); + } + pdf_dict_put_drop(gctx, page2, PDF_NAME(Annots), new_annots); + } + + // copy the old contents stream(s) + res = JM_read_contents(gctx, page1); + + // create new /Contents object for page2 + if (res) { + contents_buffer = fz_new_buffer_from_copied_data(gctx, " ", 1); + pdf_obj *contents = pdf_add_stream(gctx, pdf, contents_buffer, NULL, 0); + JM_update_stream(gctx, pdf, contents, res, 1); + pdf_dict_put_drop(gctx, page2, PDF_NAME(Contents), contents); + } + + // now insert target page, making sure it is an indirect object + int xref = pdf_create_object(gctx, pdf); // get new xref + pdf_update_object(gctx, pdf, xref, page2); // store new page + pdf_drop_obj(gctx, page2); // give up this object for now + + page2 = pdf_new_indirect(gctx, pdf, xref, 0); // reread object + pdf_insert_page(gctx, pdf, to, page2); // and store the page + pdf_drop_obj(gctx, page2); + } + fz_always(gctx) { + pdf_drop_page_tree(gctx, pdf); + fz_drop_buffer(gctx, res); + fz_drop_buffer(gctx, nres); + fz_drop_buffer(gctx, contents_buffer); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + 
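+        //------------------------------------------------------------------
+        // Illustrative Python-level sketch of the page duplication helpers:
+        // fullcopy_page() above creates an independent copy of a page (own
+        // /Contents stream, annotations copied without Popup/IRT), while
+        // copy_page() and move_page() further below only re-reference or
+        // relocate the existing page object via _move_copy_page().
+        // "example.pdf" is a placeholder name, assumed to have >= 3 pages:
+        //
+        //     import fitz  # PyMuPDF
+        //     doc = fitz.open("example.pdf")
+        //     doc.fullcopy_page(0)     # deep copy of page 0, appended at end
+        //     doc.copy_page(0, 1)      # new reference to page 0, before page 1
+        //     doc.move_page(2, 0)      # move page 2 in front of page 0
+        //     doc.save("out.pdf")
+        //------------------------------------------------------------------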
//------------------------------------------------------------------ + // move or copy one page + //------------------------------------------------------------------ + FITZEXCEPTION(_move_copy_page, !result) + CLOSECHECK0(_move_copy_page, """Move or copy a PDF page reference.""") + %pythonappend _move_copy_page %{self._reset_page_refs()%} + PyObject *_move_copy_page(int pno, int nb, int before, int copy) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + int i1, i2, pos, count, same = 0; + pdf_obj *parent1 = NULL, *parent2 = NULL, *parent = NULL; + pdf_obj *kids1, *kids2; + fz_try(gctx) { + ASSERT_PDF(pdf); + // get the two page objects ----------------------------------- + // locate the /Kids arrays and indices in each + pdf_obj *page1 = pdf_lookup_page_loc(gctx, pdf, pno, &parent1, &i1); + kids1 = pdf_dict_get(gctx, parent1, PDF_NAME(Kids)); + + pdf_obj *page2 = pdf_lookup_page_loc(gctx, pdf, nb, &parent2, &i2); + (void) page2; + kids2 = pdf_dict_get(gctx, parent2, PDF_NAME(Kids)); + + if (before) // calc index of source page in target /Kids + pos = i2; + else + pos = i2 + 1; + + // same /Kids array? ------------------------------------------ + same = pdf_objcmp(gctx, kids1, kids2); + + // put source page in target /Kids array ---------------------- + if (!copy && same != 0) // update parent in page object + { + pdf_dict_put(gctx, page1, PDF_NAME(Parent), parent2); + } + pdf_array_insert(gctx, kids2, page1, pos); + + if (same != 0) // different /Kids arrays ---------------------- + { + parent = parent2; + while (parent) // increase /Count objects in parents + { + count = pdf_dict_get_int(gctx, parent, PDF_NAME(Count)); + pdf_dict_put_int(gctx, parent, PDF_NAME(Count), count + 1); + parent = pdf_dict_get(gctx, parent, PDF_NAME(Parent)); + } + if (!copy) // delete original item + { + pdf_array_delete(gctx, kids1, i1); + parent = parent1; + while (parent) // decrease /Count objects in parents + { + count = pdf_dict_get_int(gctx, parent, PDF_NAME(Count)); + pdf_dict_put_int(gctx, parent, PDF_NAME(Count), count - 1); + parent = pdf_dict_get(gctx, parent, PDF_NAME(Parent)); + } + } + } + else { // same /Kids array + if (copy) { // source page is copied + parent = parent2; + while (parent) // increase /Count object in parents + { + count = pdf_dict_get_int(gctx, parent, PDF_NAME(Count)); + pdf_dict_put_int(gctx, parent, PDF_NAME(Count), count + 1); + parent = pdf_dict_get(gctx, parent, PDF_NAME(Parent)); + } + } else { + if (i1 < pos) + pdf_array_delete(gctx, kids1, i1); + else + pdf_array_delete(gctx, kids1, i1 + 1); + } + } + if (pdf->rev_page_map) { // page map no longer valid: drop it + pdf_drop_page_tree(gctx, pdf); + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + FITZEXCEPTION(_remove_toc_item, !result) + PyObject *_remove_toc_item(int xref) + { + // "remove" bookmark by letting it point to nowhere + pdf_obj *item = NULL, *color; + int i; + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + fz_try(gctx) { + item = pdf_new_indirect(gctx, pdf, xref, 0); + pdf_dict_del(gctx, item, PDF_NAME(Dest)); + pdf_dict_del(gctx, item, PDF_NAME(A)); + color = pdf_new_array(gctx, pdf, 3); + for (i=0; i < 3; i++) { + pdf_array_push_real(gctx, color, 0.8); + } + pdf_dict_put_drop(gctx, item, PDF_NAME(C), color); + } + fz_always(gctx) { + pdf_drop_obj(gctx, item); + } + fz_catch(gctx){ + return NULL; + } + Py_RETURN_NONE; + } + + FITZEXCEPTION(_update_toc_item, !result) + PyObject *_update_toc_item(int xref, char *action=NULL, char *title=NULL, int 
flags=0, PyObject *collapse=NULL, PyObject *color=NULL) + { + // "update" bookmark by letting it point to nowhere + pdf_obj *item = NULL; + pdf_obj *obj = NULL; + Py_ssize_t i; + double f; + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + fz_try(gctx) { + item = pdf_new_indirect(gctx, pdf, xref, 0); + if (title) { + pdf_dict_put_text_string(gctx, item, PDF_NAME(Title), title); + } + if (action) { + pdf_dict_del(gctx, item, PDF_NAME(Dest)); + obj = JM_pdf_obj_from_str(gctx, pdf, action); + pdf_dict_put_drop(gctx, item, PDF_NAME(A), obj); + } + pdf_dict_put_int(gctx, item, PDF_NAME(F), flags); + if (EXISTS(color)) { + pdf_obj *c = pdf_new_array(gctx, pdf, 3); + for (i = 0; i < 3; i++) { + JM_FLOAT_ITEM(color, i, &f); + pdf_array_push_real(gctx, c, f); + } + pdf_dict_put_drop(gctx, item, PDF_NAME(C), c); + } else if (color != Py_None) { + pdf_dict_del(gctx, item, PDF_NAME(C)); + } + if (collapse != Py_None) { + if (pdf_dict_get(gctx, item, PDF_NAME(Count))) { + i = pdf_dict_get_int(gctx, item, PDF_NAME(Count)); + if ((i < 0 && collapse == Py_False) || (i > 0 && collapse == Py_True)) { + i = i * (-1); + pdf_dict_put_int(gctx, item, PDF_NAME(Count), i); + } + } + } + } + fz_always(gctx) { + pdf_drop_obj(gctx, item); + } + fz_catch(gctx){ + return NULL; + } + Py_RETURN_NONE; + } + + //------------------------------------------------------------------ + // PDF page label getting / setting + //------------------------------------------------------------------ + FITZEXCEPTION(_get_page_labels, !result) + PyObject * + _get_page_labels() + { + pdf_obj *obj, *nums, *kids; + PyObject *rc = NULL; + int i, n; + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + + pdf_obj *pagelabels = NULL; + fz_var(pagelabels); + fz_try(gctx) { + ASSERT_PDF(pdf); + rc = PyList_New(0); + pagelabels = pdf_new_name(gctx, "PageLabels"); + obj = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), pagelabels, NULL); + if (!obj) { + goto finished; + } + // simple case: direct /Nums object + nums = pdf_resolve_indirect(gctx, + pdf_dict_get(gctx, obj, PDF_NAME(Nums))); + if (nums) { + JM_get_page_labels(gctx, rc, nums); + goto finished; + } + // case: /Kids/Nums + nums = pdf_resolve_indirect(gctx, + pdf_dict_getl(gctx, obj, PDF_NAME(Kids), PDF_NAME(Nums), NULL) + ); + if (nums) { + JM_get_page_labels(gctx, rc, nums); + goto finished; + } + // case: /Kids is an array of multiple /Nums + kids = pdf_resolve_indirect(gctx, + pdf_dict_get(gctx, obj, PDF_NAME(Kids))); + if (!kids || !pdf_is_array(gctx, kids)) { + goto finished; + } + + n = pdf_array_len(gctx, kids); + for (i = 0; i < n; i++) { + nums = pdf_resolve_indirect(gctx, + pdf_dict_get(gctx, + pdf_array_get(gctx, kids, i), + PDF_NAME(Nums))); + JM_get_page_labels(gctx, rc, nums); + } + finished:; + } + fz_always(gctx) { + PyErr_Clear(); + pdf_drop_obj(gctx, pagelabels); + } + fz_catch(gctx){ + Py_CLEAR(rc); + return NULL; + } + return rc; + } + + + FITZEXCEPTION(_set_page_labels, !result) + %pythonappend _set_page_labels %{ + xref = self.pdf_catalog() + text = self.xref_object(xref, compressed=True) + text = text.replace("/Nums[]", "/Nums[%s]" % labels) + self.update_object(xref, text)%} + PyObject * + _set_page_labels(char *labels) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) $self); + pdf_obj *pagelabels = NULL; + fz_var(pagelabels); + fz_try(gctx) { + ASSERT_PDF(pdf); + pagelabels = pdf_new_name(gctx, "PageLabels"); + pdf_obj *root = pdf_dict_get(gctx, pdf_trailer(gctx, pdf), PDF_NAME(Root)); + pdf_dict_del(gctx, 
root, pagelabels); + pdf_dict_putl_drop(gctx, root, pdf_new_array(gctx, pdf, 0), pagelabels, PDF_NAME(Nums), NULL); + } + fz_always(gctx) { + PyErr_Clear(); + pdf_drop_obj(gctx, pagelabels); + } + fz_catch(gctx){ + return NULL; + } + Py_RETURN_NONE; + } + + + //------------------------------------------------------------------ + // PDF Optional Content functions + //------------------------------------------------------------------ + FITZEXCEPTION(get_layers, !result) + CLOSECHECK0(get_layers, """Show optional OC layers.""") + PyObject * + get_layers() + { + PyObject *rc = NULL; + pdf_layer_config info = {NULL, NULL}; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) self); + ASSERT_PDF(pdf); + int i, n = pdf_count_layer_configs(gctx, pdf); + if (n == 1) { + pdf_obj *obj = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), PDF_NAME(OCProperties), PDF_NAME(Configs), NULL); + if (!pdf_is_array(gctx, obj)) n = 0; + } + rc = PyTuple_New(n); + for (i = 0; i < n; i++) { + pdf_layer_config_info(gctx, pdf, i, &info); + PyObject *item = Py_BuildValue("{s:i,s:s,s:s}", + "number", i, "name", info.name, "creator", info.creator); + PyTuple_SET_ITEM(rc, i, item); + info.name = NULL; + info.creator = NULL; + } + } + fz_catch(gctx) { + Py_CLEAR(rc); + return NULL; + } + return rc; + } + + + FITZEXCEPTION(switch_layer, !result) + CLOSECHECK0(switch_layer, """Activate an OC layer.""") + PyObject * + switch_layer(int config, int as_default=0) + { + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) self); + ASSERT_PDF(pdf); + pdf_obj *cfgs = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), PDF_NAME(OCProperties), PDF_NAME(Configs), NULL); + if (!pdf_is_array(gctx, cfgs) || !pdf_array_len(gctx, cfgs)) { + if (config < 1) goto finished; + RAISEPY(gctx, MSG_BAD_OC_LAYER, PyExc_ValueError); + } + if (config < 0) goto finished; + pdf_select_layer_config(gctx, pdf, config); + if (as_default) { + pdf_set_layer_config_as_default(gctx, pdf); + pdf_read_ocg(gctx, pdf); + } + finished:; + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(get_layer, !result) + CLOSECHECK0(get_layer, """Content of ON, OFF, RBGroups of an OC layer.""") + PyObject * + get_layer(int config=-1) + { + PyObject *rc; + pdf_obj *obj = NULL; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) self); + ASSERT_PDF(pdf); + pdf_obj *ocp = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), PDF_NAME(OCProperties), NULL); + if (!ocp) { + rc = Py_BuildValue("s", NULL); + goto finished; + } + if (config == -1) { + obj = pdf_dict_get(gctx, ocp, PDF_NAME(D)); + } else { + obj = pdf_array_get(gctx, pdf_dict_get(gctx, ocp, PDF_NAME(Configs)), config); + } + if (!obj) { + RAISEPY(gctx, MSG_BAD_OC_CONFIG, PyExc_ValueError); + } + rc = JM_get_ocg_arrays(gctx, obj); + finished:; + } + fz_catch(gctx) { + Py_CLEAR(rc); + PyErr_Clear(); + return NULL; + } + return rc; + } + + + FITZEXCEPTION(set_layer, !result) + %pythonprepend set_layer +%{"""Set the PDF keys /ON, /OFF, /RBGroups of an OC layer.""" +if self.is_closed: + raise ValueError("document closed") +ocgs = set(self.get_ocgs().keys()) +if ocgs == set(): + raise ValueError("document has no optional content") + +if on: + if type(on) not in (list, tuple): + raise ValueError("bad type: 'on'") + s = set(on).difference(ocgs) + if s != set(): + raise ValueError("bad OCGs in 'on': %s" % s) + +if off: + if type(off) not in (list, tuple): + raise ValueError("bad type: 'off'") + s = 
set(off).difference(ocgs) + if s != set(): + raise ValueError("bad OCGs in 'off': %s" % s) + +if locked: + if type(locked) not in (list, tuple): + raise ValueError("bad type: 'locked'") + s = set(locked).difference(ocgs) + if s != set(): + raise ValueError("bad OCGs in 'locked': %s" % s) + +if rbgroups: + if type(rbgroups) not in (list, tuple): + raise ValueError("bad type: 'rbgroups'") + for x in rbgroups: + if not type(x) in (list, tuple): + raise ValueError("bad RBGroup '%s'" % x) + s = set(x).difference(ocgs) + if s != set(): + raise ValueError("bad OCGs in RBGroup: %s" % s) + +if basestate: + basestate = str(basestate).upper() + if basestate == "UNCHANGED": + basestate = "Unchanged" + if basestate not in ("ON", "OFF", "Unchanged"): + raise ValueError("bad 'basestate'") +%} + PyObject * + set_layer(int config, const char *basestate=NULL, PyObject *on=NULL, + PyObject *off=NULL, PyObject *rbgroups=NULL, PyObject *locked=NULL) + { + pdf_obj *obj = NULL; + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) self); + ASSERT_PDF(pdf); + pdf_obj *ocp = pdf_dict_getl(gctx, pdf_trailer(gctx, pdf), + PDF_NAME(Root), PDF_NAME(OCProperties), NULL); + if (!ocp) { + goto finished; + } + if (config == -1) { + obj = pdf_dict_get(gctx, ocp, PDF_NAME(D)); + } else { + obj = pdf_array_get(gctx, pdf_dict_get(gctx, ocp, PDF_NAME(Configs)), config); + } + if (!obj) { + RAISEPY(gctx, MSG_BAD_OC_CONFIG, PyExc_ValueError); + } + JM_set_ocg_arrays(gctx, obj, basestate, on, off, rbgroups, locked); + pdf_read_ocg(gctx, pdf); + finished:; + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(add_layer, !result) + CLOSECHECK0(add_layer, """Add a new OC layer.""") + PyObject *add_layer(char *name, char *creator=NULL, PyObject *on=NULL) + { + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) self); + ASSERT_PDF(pdf); + JM_add_layer_config(gctx, pdf, name, creator, on); + pdf_read_ocg(gctx, pdf); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(layer_ui_configs, !result) + CLOSECHECK0(layer_ui_configs, """Show OC visibility status modifiable by user.""") + PyObject *layer_ui_configs() + { + typedef struct + { + const char *text; + int depth; + pdf_layer_config_ui_type type; + int selected; + int locked; + } pdf_layer_config_ui; + PyObject *rc = NULL; + + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) self); + ASSERT_PDF(pdf); + pdf_layer_config_ui info; + int i, n = pdf_count_layer_config_ui(gctx, pdf); + rc = PyTuple_New(n); + char *type = NULL; + for (i = 0; i < n; i++) { + pdf_layer_config_ui_info(gctx, pdf, i, (void *) &info); + switch (info.type) + { + case (1): type = "checkbox"; break; + case (2): type = "radiobox"; break; + default: type = "label"; break; + } + PyObject *item = Py_BuildValue("{s:i,s:N,s:i,s:s,s:N,s:N}", + "number", i, + "text", JM_UnicodeFromStr(info.text), + "depth", info.depth, + "type", type, + "on", JM_BOOL(info.selected), + "locked", JM_BOOL(info.locked)); + PyTuple_SET_ITEM(rc, i, item); + } + } + fz_catch(gctx) { + Py_CLEAR(rc); + return NULL; + } + return rc; + } + + + FITZEXCEPTION(set_layer_ui_config, !result) + CLOSECHECK0(set_layer_ui_config, ) + %pythonprepend set_layer_ui_config %{ + """Set / unset OC intent configuration.""" + # The user might have given the name instead of sequence number, + # so select by that name and continue with corresp. 
number + if isinstance(number, str): + select = [ui["number"] for ui in self.layer_ui_configs() if ui["text"] == number] + if select == []: + raise ValueError(f"bad OCG '{number}'.") + number = select[0] # this is the number for the name + %} + PyObject *set_layer_ui_config(int number, int action=0) + { + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) self); + ASSERT_PDF(pdf); + switch (action) + { + case (1): + pdf_toggle_layer_config_ui(gctx, pdf, number); + break; + case (2): + pdf_deselect_layer_config_ui(gctx, pdf, number); + break; + default: + pdf_select_layer_config_ui(gctx, pdf, number); + break; + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(get_ocgs, !result) + CLOSECHECK0(get_ocgs, """Show existing optional content groups.""") + PyObject * + get_ocgs() + { + PyObject *rc = NULL; + pdf_obj *ci = pdf_new_name(gctx, "CreatorInfo"); + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) self); + ASSERT_PDF(pdf); + pdf_obj *ocgs = pdf_dict_getl(gctx, + pdf_dict_get(gctx, + pdf_trailer(gctx, pdf), PDF_NAME(Root)), + PDF_NAME(OCProperties), PDF_NAME(OCGs), NULL); + rc = PyDict_New(); + if (!pdf_is_array(gctx, ocgs)) goto fertig; + int i, n = pdf_array_len(gctx, ocgs); + for (i = 0; i < n; i++) { + pdf_obj *ocg = pdf_array_get(gctx, ocgs, i); + int xref = pdf_to_num(gctx, ocg); + const char *name = pdf_to_text_string(gctx, pdf_dict_get(gctx, ocg, PDF_NAME(Name))); + pdf_obj *obj = pdf_dict_getl(gctx, ocg, PDF_NAME(Usage), ci, PDF_NAME(Subtype), NULL); + const char *usage = NULL; + if (obj) usage = pdf_to_name(gctx, obj); + PyObject *intents = PyList_New(0); + pdf_obj *intent = pdf_dict_get(gctx, ocg, PDF_NAME(Intent)); + if (intent) { + if (pdf_is_name(gctx, intent)) { + LIST_APPEND_DROP(intents, Py_BuildValue("s", pdf_to_name(gctx, intent))); + } else if (pdf_is_array(gctx, intent)) { + int j, m = pdf_array_len(gctx, intent); + for (j = 0; j < m; j++) { + pdf_obj *o = pdf_array_get(gctx, intent, j); + if (pdf_is_name(gctx, o)) + LIST_APPEND_DROP(intents, Py_BuildValue("s", pdf_to_name(gctx, o))); + } + } + } + int hidden = pdf_is_ocg_hidden(gctx, pdf, NULL, usage, ocg); + PyObject *item = Py_BuildValue("{s:s,s:O,s:O,s:s}", + "name", name, + "intent", intents, + "on", JM_BOOL(!hidden), + "usage", usage); + Py_DECREF(intents); + PyObject *temp = Py_BuildValue("i", xref); + DICT_SETITEM_DROP(rc, temp, item); + Py_DECREF(temp); + } + fertig:; + } + fz_always(gctx) { + pdf_drop_obj(gctx, ci); + } + fz_catch(gctx) { + Py_CLEAR(rc); + return NULL; + } + return rc; + } + + + FITZEXCEPTION(add_ocg, !result) + CLOSECHECK0(add_ocg, """Add new optional content group.""") + PyObject * + add_ocg(char *name, int config=-1, int on=1, PyObject *intent=NULL, const char *usage=NULL) + { + int xref = 0; + pdf_obj *obj = NULL, *cfg = NULL; + pdf_obj *indocg = NULL; + pdf_obj *ocg = NULL; + pdf_obj *ci_name = NULL; + fz_var(indocg); + fz_var(ocg); + fz_var(ci_name); + fz_try(gctx) { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) self); + ASSERT_PDF(pdf); + + // ------------------------------ + // make the OCG + // ------------------------------ + ocg = pdf_add_new_dict(gctx, pdf, 3); + pdf_dict_put(gctx, ocg, PDF_NAME(Type), PDF_NAME(OCG)); + pdf_dict_put_text_string(gctx, ocg, PDF_NAME(Name), name); + pdf_obj *intents = pdf_dict_put_array(gctx, ocg, PDF_NAME(Intent), 2); + if (!EXISTS(intent)) { + pdf_array_push(gctx, intents, PDF_NAME(View)); + } else if (!PyUnicode_Check(intent)) { + int i, n = 
PySequence_Size(intent); + for (i = 0; i < n; i++) { + PyObject *item = PySequence_ITEM(intent, i); + char *c = JM_StrAsChar(item); + if (c) { + pdf_array_push_drop(gctx, intents, pdf_new_name(gctx, c)); + } + Py_DECREF(item); + } + } else { + char *c = JM_StrAsChar(intent); + if (c) { + pdf_array_push_drop(gctx, intents, pdf_new_name(gctx, c)); + } + } + pdf_obj *use_for = pdf_dict_put_dict(gctx, ocg, PDF_NAME(Usage), 3); + ci_name = pdf_new_name(gctx, "CreatorInfo"); + pdf_obj *cre_info = pdf_dict_put_dict(gctx, use_for, ci_name, 2); + pdf_dict_put_text_string(gctx, cre_info, PDF_NAME(Creator), "PyMuPDF"); + if (usage) { + pdf_dict_put_name(gctx, cre_info, PDF_NAME(Subtype), usage); + } else { + pdf_dict_put_name(gctx, cre_info, PDF_NAME(Subtype), "Artwork"); + } + indocg = pdf_add_object(gctx, pdf, ocg); + + // ------------------------------ + // Insert OCG in the right config + // ------------------------------ + pdf_obj *ocp = JM_ensure_ocproperties(gctx, pdf); + obj = pdf_dict_get(gctx, ocp, PDF_NAME(OCGs)); + pdf_array_push(gctx, obj, indocg); + + if (config > -1) { + obj = pdf_dict_get(gctx, ocp, PDF_NAME(Configs)); + if (!pdf_is_array(gctx, obj)) { + RAISEPY(gctx, MSG_BAD_OC_CONFIG, PyExc_ValueError); + } + cfg = pdf_array_get(gctx, obj, config); + if (!cfg) { + RAISEPY(gctx, MSG_BAD_OC_CONFIG, PyExc_ValueError); + } + } else { + cfg = pdf_dict_get(gctx, ocp, PDF_NAME(D)); + } + + obj = pdf_dict_get(gctx, cfg, PDF_NAME(Order)); + if (!obj) { + obj = pdf_dict_put_array(gctx, cfg, PDF_NAME(Order), 1); + } + pdf_array_push(gctx, obj, indocg); + if (on) { + obj = pdf_dict_get(gctx, cfg, PDF_NAME(ON)); + if (!obj) { + obj = pdf_dict_put_array(gctx, cfg, PDF_NAME(ON), 1); + } + } else { + obj = pdf_dict_get(gctx, cfg, PDF_NAME(OFF)); + if (!obj) { + obj = pdf_dict_put_array(gctx, cfg, PDF_NAME(OFF), 1); + } + } + pdf_array_push(gctx, obj, indocg); + + // let MuPDF take note: re-read OCProperties + pdf_read_ocg(gctx, pdf); + + xref = pdf_to_num(gctx, indocg); + } + fz_always(gctx) { + pdf_drop_obj(gctx, indocg); + pdf_drop_obj(gctx, ocg); + pdf_drop_obj(gctx, ci_name); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", xref); + } + + struct Annot; + + void internal_keep_annot(struct Annot* annot) + { + pdf_keep_annot(gctx, (pdf_annot*) annot); + } + + //------------------------------------------------------------------ + // Initialize document: set outline and metadata properties + //------------------------------------------------------------------ + %pythoncode %{ + def init_doc(self): + if self.is_encrypted: + raise ValueError("cannot initialize - document still encrypted") + self._outline = self._loadOutline() + if self._outline: + self._outline.thisown = True + self.metadata = dict([(k,self._getMetadata(v)) for k,v in {'format':'format', 'title':'info:Title', 'author':'info:Author','subject':'info:Subject', 'keywords':'info:Keywords','creator':'info:Creator', 'producer':'info:Producer', 'creationDate':'info:CreationDate', 'modDate':'info:ModDate', 'trapped':'info:Trapped'}.items()]) + self.metadata['encryption'] = None if self._getMetadata('encryption')=='None' else self._getMetadata('encryption') + + outline = property(lambda self: self._outline) + + + def get_page_fonts(self, pno: int, full: bool =False) -> list: + """Retrieve a list of fonts used on a page. 
+ """ + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if not self.is_pdf: + return () + if type(pno) is not int: + try: + pno = pno.number + except: + raise ValueError("need a Page or page number") + val = self._getPageInfo(pno, 1) + if full is False: + return [v[:-1] for v in val] + return val + + + def get_page_images(self, pno: int, full: bool =False) -> list: + """Retrieve a list of images used on a page. + """ + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if not self.is_pdf: + return () + if type(pno) is not int: + try: + pno = pno.number + except: + raise ValueError("need a Page or page number") + val = self._getPageInfo(pno, 2) + if full is False: + return [v[:-1] for v in val] + return val + + + def get_page_xobjects(self, pno: int) -> list: + """Retrieve a list of XObjects used on a page. + """ + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if not self.is_pdf: + return () + if type(pno) is not int: + try: + pno = pno.number + except: + raise ValueError("need a Page or page number") + val = self._getPageInfo(pno, 3) + rc = [(v[0], v[1], v[2], Rect(v[3])) for v in val] + return rc + + + def xref_is_image(self, xref): + """Check if xref is an image object.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if self.xref_get_key(xref, "Subtype")[1] == "/Image": + return True + return False + + def xref_is_font(self, xref): + """Check if xref is a font object.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if self.xref_get_key(xref, "Type")[1] == "/Font": + return True + return False + + def xref_is_xobject(self, xref): + """Check if xref is a form xobject.""" + if self.is_closed or self.is_encrypted: + raise ValueError("document closed or encrypted") + if self.xref_get_key(xref, "Subtype")[1] == "/Form": + return True + return False + + def copy_page(self, pno: int, to: int =-1): + """Copy a page within a PDF document. + + This will only create another reference of the same page object. + Args: + pno: source page number + to: put before this page, '-1' means after last page. + """ + if self.is_closed: + raise ValueError("document closed") + + page_count = len(self) + if ( + pno not in range(page_count) or + to not in range(-1, page_count) + ): + raise ValueError("bad page number(s)") + before = 1 + copy = 1 + if to == -1: + to = page_count - 1 + before = 0 + + return self._move_copy_page(pno, to, before, copy) + + def move_page(self, pno: int, to: int =-1): + """Move a page within a PDF document. + + Args: + pno: source page number. + to: put before this page, '-1' means after last page. + """ + if self.is_closed: + raise ValueError("document closed") + + page_count = len(self) + if ( + pno not in range(page_count) or + to not in range(-1, page_count) + ): + raise ValueError("bad page number(s)") + before = 1 + copy = 0 + if to == -1: + to = page_count - 1 + before = 0 + + return self._move_copy_page(pno, to, before, copy) + + def delete_page(self, pno: int =-1): + """ Delete one page from a PDF. 
+ """ + if not self.is_pdf: + raise ValueError("is no PDF") + if self.is_closed: + raise ValueError("document closed") + + page_count = self.page_count + while pno < 0: + pno += page_count + + if pno >= page_count: + raise ValueError("bad page number(s)") + + # remove TOC bookmarks pointing to deleted page + toc = self.get_toc() + ol_xrefs = self.get_outline_xrefs() + for i, item in enumerate(toc): + if item[2] == pno + 1: + self._remove_toc_item(ol_xrefs[i]) + + self._remove_links_to(frozenset((pno,))) + self._delete_page(pno) + self._reset_page_refs() + + + def delete_pages(self, *args, **kw): + """Delete pages from a PDF. + + Args: + Either keywords 'from_page'/'to_page', or two integers to + specify the first/last page to delete. + Or a list/tuple/range object, which can contain arbitrary + page numbers. + """ + if not self.is_pdf: + raise ValueError("is no PDF") + if self.is_closed: + raise ValueError("document closed") + + page_count = self.page_count # page count of document + f = t = -1 + if kw: # check if keywords were used + if args: # then no positional args are allowed + raise ValueError("cannot mix keyword and positional argument") + f = kw.get("from_page", -1) # first page to delete + t = kw.get("to_page", -1) # last page to delete + while f < 0: + f += page_count + while t < 0: + t += page_count + if not f <= t < page_count: + raise ValueError("bad page number(s)") + numbers = tuple(range(f, t + 1)) + else: + if len(args) > 2 or args == []: + raise ValueError("need 1 or 2 positional arguments") + if len(args) == 2: + f, t = args + if not (type(f) is int and type(t) is int): + raise ValueError("both arguments must be int") + if f > t: + f, t = t, f + if not f <= t < page_count: + raise ValueError("bad page number(s)") + numbers = tuple(range(f, t + 1)) + else: + r = args[0] + if type(r) not in (int, range, list, tuple): + raise ValueError("need int or sequence if one argument") + numbers = tuple(r) + + numbers = list(map(int, set(numbers))) # ensure unique integers + if numbers == []: + print("nothing to delete") + return + numbers.sort() + if numbers[0] < 0 or numbers[-1] >= page_count: + raise ValueError("bad page number(s)") + frozen_numbers = frozenset(numbers) + toc = self.get_toc() + for i, xref in enumerate(self.get_outline_xrefs()): + if toc[i][2] - 1 in frozen_numbers: + self._remove_toc_item(xref) # remove target in PDF object + + self._remove_links_to(frozen_numbers) + + for i in reversed(numbers): # delete pages, last to first + self._delete_page(i) + + self._reset_page_refs() + + + def saveIncr(self): + """ Save PDF incrementally""" + return self.save(self.name, incremental=True, encryption=PDF_ENCRYPT_KEEP) + + + def ez_save(self, filename, garbage=3, clean=False, + deflate=True, deflate_images=True, deflate_fonts=True, + incremental=False, ascii=False, expand=False, linear=False, + pretty=False, encryption=1, permissions=4095, + owner_pw=None, user_pw=None, no_new_id=True): + """ Save PDF using some different defaults""" + return self.save(filename, garbage=garbage, + clean=clean, + deflate=deflate, + deflate_images=deflate_images, + deflate_fonts=deflate_fonts, + incremental=incremental, + ascii=ascii, + expand=expand, + linear=linear, + pretty=pretty, + encryption=encryption, + permissions=permissions, + owner_pw=owner_pw, + user_pw=user_pw, + no_new_id=no_new_id,) + + + def reload_page(self, page: "struct Page *") -> "struct Page *": + """Make a fresh copy of a page.""" + old_annots = {} # copy annot references to here + pno = page.number # save the page 
number + for k, v in page._annot_refs.items(): # save the annot dictionary + # We need to call pdf_keep_annot() here, otherwise `v`'s + # refcount can reach zero even if there is an external + # reference. + self.internal_keep_annot(v) + old_annots[k] = v + page._erase() # remove the page + page = None + TOOLS.store_shrink(100) + page = self.load_page(pno) # reload the page + + # copy annot refs over to the new dictionary + page_proxy = weakref.proxy(page) + for k, v in old_annots.items(): + annot = old_annots[k] + annot.parent = page_proxy # refresh parent to new page + page._annot_refs[k] = annot + return page + + + @property + def pagemode(self) -> str: + """Return the PDF PageMode value. + """ + xref = self.pdf_catalog() + if xref == 0: + return None + rc = self.xref_get_key(xref, "PageMode") + if rc[0] == "null": + return "UseNone" + if rc[0] == "name": + return rc[1][1:] + return "UseNone" + + + def set_pagemode(self, pagemode: str): + """Set the PDF PageMode value.""" + valid = ("UseNone", "UseOutlines", "UseThumbs", "FullScreen", "UseOC", "UseAttachments") + xref = self.pdf_catalog() + if xref == 0: + raise ValueError("not a PDF") + if not pagemode: + raise ValueError("bad PageMode value") + if pagemode[0] == "/": + pagemode = pagemode[1:] + for v in valid: + if pagemode.lower() == v.lower(): + self.xref_set_key(xref, "PageMode", f"/{v}") + return True + raise ValueError("bad PageMode value") + + + @property + def pagelayout(self) -> str: + """Return the PDF PageLayout value. + """ + xref = self.pdf_catalog() + if xref == 0: + return None + rc = self.xref_get_key(xref, "PageLayout") + if rc[0] == "null": + return "SinglePage" + if rc[0] == "name": + return rc[1][1:] + return "SinglePage" + + + def set_pagelayout(self, pagelayout: str): + """Set the PDF PageLayout value.""" + valid = ("SinglePage", "OneColumn", "TwoColumnLeft", "TwoColumnRight", "TwoPageLeft", "TwoPageRight") + xref = self.pdf_catalog() + if xref == 0: + raise ValueError("not a PDF") + if not pagelayout: + raise ValueError("bad PageLayout value") + if pagelayout[0] == "/": + pagelayout = pagelayout[1:] + for v in valid: + if pagelayout.lower() == v.lower(): + self.xref_set_key(xref, "PageLayout", f"/{v}") + return True + raise ValueError("bad PageLayout value") + + + @property + def markinfo(self) -> dict: + """Return the PDF MarkInfo value.""" + xref = self.pdf_catalog() + if xref == 0: + return None + rc = self.xref_get_key(xref, "MarkInfo") + if rc[0] == "null": + return {} + if rc[0] == "xref": + xref = int(rc[1].split()[0]) + val = self.xref_object(xref, compressed=True) + elif rc[0] == "dict": + val = rc[1] + else: + val = None + if val == None or not (val[:2] == "<<" and val[-2:] == ">>"): + return {} + valid = {"Marked": False, "UserProperties": False, "Suspects": False} + val = val[2:-2].split("/") + for v in val[1:]: + try: + key, value = v.split() + except: + return valid + if value == "true": + valid[key] = True + return valid + + + def set_markinfo(self, markinfo: dict) -> bool: + """Set the PDF MarkInfo values.""" + xref = self.pdf_catalog() + if xref == 0: + raise ValueError("not a PDF") + if not markinfo or not isinstance(markinfo, dict): + return False + valid = {"Marked": False, "UserProperties": False, "Suspects": False} + + if not set(valid.keys()).issuperset(markinfo.keys()): + badkeys = f"bad MarkInfo key(s): {set(markinfo.keys()).difference(valid.keys())}" + raise ValueError(badkeys) + pdfdict = "<<" + valid.update(markinfo) + for key, value in valid.items(): + value=str(value).lower() + if 
not value in ("true", "false"): + raise ValueError(f"bad key value '{key}': '{value}'") + pdfdict += f"/{key} {value}" + pdfdict += ">>" + self.xref_set_key(xref, "MarkInfo", pdfdict) + return True + + + def __repr__(self) -> str: + m = "closed " if self.is_closed else "" + if self.stream is None: + if self.name == "": + return m + "Document()" % self._graft_id + return m + "Document('%s')" % (self.name,) + return m + "Document('%s', )" % (self.name, self._graft_id) + + + def __contains__(self, loc) -> bool: + if type(loc) is int: + if loc < self.page_count: + return True + return False + if type(loc) not in (tuple, list) or len(loc) != 2: + return False + + chapter, pno = loc + if (type(chapter) != int or + chapter < 0 or + chapter >= self.chapter_count + ): + return False + if (type(pno) != int or + pno < 0 or + pno >= self.chapter_page_count(chapter) + ): + return False + + return True + + + def __getitem__(self, i: int =0)->"Page": + assert isinstance(i, int) or (isinstance(i, tuple) and len(i) == 2 and all(isinstance(x, int) for x in i)) + if i not in self: + raise IndexError("page not in document") + return self.load_page(i) + + + def __delitem__(self, i: AnyType)->None: + if not self.is_pdf: + raise ValueError("is no PDF") + if type(i) is int: + return self.delete_page(i) + if type(i) in (list, tuple, range): + return self.delete_pages(i) + if type(i) is not slice: + raise ValueError("bad argument type") + pc = self.page_count + start = i.start if i.start else 0 + stop = i.stop if i.stop else pc + step = i.step if i.step else 1 + while start < 0: + start += pc + if start >= pc: + raise ValueError("bad page number(s)") + while stop < 0: + stop += pc + if stop > pc: + raise ValueError("bad page number(s)") + return self.delete_pages(range(start, stop, step)) + + + def pages(self, start: OptInt =None, stop: OptInt =None, step: OptInt =None): + """Return a generator iterator over a page range. + + Arguments have the same meaning as for the range() built-in. 
+ """ + # set the start value + start = start or 0 + while start < 0: + start += self.page_count + if start not in range(self.page_count): + raise ValueError("bad start page number") + + # set the stop value + stop = stop if stop is not None and stop <= self.page_count else self.page_count + + # set the step value + if step == 0: + raise ValueError("arg 3 must not be zero") + if step is None: + if start > stop: + step = -1 + else: + step = 1 + + for pno in range(start, stop, step): + yield (self.load_page(pno)) + + + def __len__(self) -> int: + return self.page_count + + def _forget_page(self, page: "struct Page *"): + """Remove a page from document page dict.""" + pid = id(page) + if pid in self._page_refs: + self._page_refs[pid] = None + + def _reset_page_refs(self): + """Invalidate all pages in document dictionary.""" + if getattr(self, "is_closed", True): + return + for page in self._page_refs.values(): + if page: + page._erase() + page = None + self._page_refs.clear() + + + + def _cleanup(self): + self._reset_page_refs() + for k in self.Graftmaps.keys(): + self.Graftmaps[k] = None + self.Graftmaps = {} + self.ShownPages = {} + self.InsertedImages = {} + self.FontInfos = [] + self.metadata = None + self.stream = None + self.is_closed = True + + + def close(self): + """Close the document.""" + if getattr(self, "is_closed", False): + raise ValueError("document closed") + self._cleanup() + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + return + else: + raise RuntimeError("document object unavailable") + + def __del__(self): + if not type(self) is Document: + return + self._cleanup() + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + + def __enter__(self): + return self + + def __exit__(self, *args): + self.close() + %} + } +}; + +/*****************************************************************************/ +// fz_page +/*****************************************************************************/ +%nodefaultctor; +struct Page { + %extend { + ~Page() + { + DEBUGMSG1("Page"); + fz_page *this_page = (fz_page *) $self; + fz_drop_page(gctx, this_page); + DEBUGMSG2; + } + //---------------------------------------------------------------- + // bound() + //---------------------------------------------------------------- + FITZEXCEPTION(bound, !result) + PARENTCHECK(bound, """Get page rectangle.""") + %pythonappend bound %{ + val = Rect(val) + if val.is_infinite and self.parent.is_pdf: + cb = self.cropbox + w, h = cb.width, cb.height + if self.rotation not in (0, 180): + w, h = h, w + val = Rect(0, 0, w, h) + msg = TOOLS.mupdf_warnings(reset=False).splitlines()[-1] + print(msg, file=sys.stderr) + %} + PyObject *bound() { + fz_rect rect = fz_infinite_rect; + fz_try(gctx) { + rect = fz_bound_page(gctx, (fz_page *) $self); + } + fz_catch(gctx) { + ; + } + return JM_py_from_rect(rect); + } + %pythoncode %{rect = property(bound, doc="page rectangle")%} + + //---------------------------------------------------------------- + // Page.get_image_bbox + //---------------------------------------------------------------- + %pythonprepend get_image_bbox %{ + """Get rectangle occupied by image 'name'. + + 'name' is either an item of the image list, or the referencing + name string - elem[7] of the resp. item. + Option 'transform' also returns the image transformation matrix. 
+ """ + CheckParent(self) + doc = self.parent + if doc.is_closed or doc.is_encrypted: + raise ValueError("document closed or encrypted") + + inf_rect = Rect(1, 1, -1, -1) + null_mat = Matrix() + if transform: + rc = (inf_rect, null_mat) + else: + rc = inf_rect + + if type(name) in (list, tuple): + if not type(name[-1]) is int: + raise ValueError("need item of full page image list") + item = name + else: + imglist = [i for i in doc.get_page_images(self.number, True) if name == i[7]] + if len(imglist) == 1: + item = imglist[0] + elif imglist == []: + raise ValueError("bad image name") + else: + raise ValueError("found multiple images named '%s'." % name) + xref = item[-1] + if xref != 0 or transform == True: + try: + return self.get_image_rects(item, transform=transform)[0] + except: + return inf_rect + %} + %pythonappend get_image_bbox %{ + if not bool(val): + return rc + + for v in val: + if v[0] != item[-3]: + continue + q = Quad(v[1]) + bbox = q.rect + if transform == 0: + rc = bbox + break + + hm = Matrix(util_hor_matrix(q.ll, q.lr)) + h = abs(q.ll - q.ul) + w = abs(q.ur - q.ul) + m0 = Matrix(1 / w, 0, 0, 1 / h, 0, 0) + m = ~(hm * m0) + rc = (bbox, m) + break + val = rc%} + PyObject * + get_image_bbox(PyObject *name, int transform=0) + { + pdf_page *pdf_page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + PyObject *rc =NULL; + fz_try(gctx) { + rc = JM_image_reporter(gctx, pdf_page); + } + fz_catch(gctx) { + Py_RETURN_NONE; + } + return rc; + } + + //---------------------------------------------------------------- + // run() + //---------------------------------------------------------------- + FITZEXCEPTION(run, !result) + PARENTCHECK(run, """Run page through a device.""") + PyObject *run(struct DeviceWrapper *dw, PyObject *m) + { + fz_try(gctx) { + fz_run_page(gctx, (fz_page *) $self, dw->device, JM_matrix_from_py(m), NULL); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------------------------------------- + // Page.extend_textpage + //---------------------------------------------------------------- + FITZEXCEPTION(extend_textpage, !result) + PyObject * + extend_textpage(struct TextPage *tpage, int flags=0, PyObject *matrix=NULL) + { + fz_page *page = (fz_page *) $self; + fz_stext_page *tp = (fz_stext_page *) tpage; + fz_device *dev = NULL; + fz_stext_options options; + memset(&options, 0, sizeof options); + options.flags = flags; + fz_try(gctx) { + fz_matrix ctm = JM_matrix_from_py(matrix); + dev = fz_new_stext_device(gctx, tp, &options); + fz_run_page(gctx, page, dev, ctm, NULL); + fz_close_device(gctx, dev); + } + fz_always(gctx) { + fz_drop_device(gctx, dev); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // Page.get_textpage + //---------------------------------------------------------------- + FITZEXCEPTION(_get_textpage, !result) + %pythonappend _get_textpage %{val.thisown = True%} + struct TextPage * + _get_textpage(PyObject *clip=NULL, int flags=0, PyObject *matrix=NULL) + { + fz_stext_page *tpage=NULL; + fz_page *page = (fz_page *) $self; + fz_device *dev = NULL; + fz_stext_options options; + memset(&options, 0, sizeof options); + options.flags = flags; + fz_try(gctx) { + // Default to page's rect if `clip` not specified, for #2048. + fz_rect rect = (clip==Py_None) ? 
fz_bound_page(gctx, page) : JM_rect_from_py(clip); + fz_matrix ctm = JM_matrix_from_py(matrix); + tpage = fz_new_stext_page(gctx, rect); + dev = fz_new_stext_device(gctx, tpage, &options); + fz_run_page(gctx, page, dev, ctm, NULL); + fz_close_device(gctx, dev); + } + fz_always(gctx) { + fz_drop_device(gctx, dev); + } + fz_catch(gctx) { + return NULL; + } + return (struct TextPage *) tpage; + } + + + %pythoncode %{ + def get_textpage(self, clip: rect_like = None, flags: int = 0, matrix=None) -> "TextPage": + CheckParent(self) + if matrix is None: + matrix = Matrix(1, 1) + old_rotation = self.rotation + if old_rotation != 0: + self.set_rotation(0) + try: + textpage = self._get_textpage(clip, flags=flags, matrix=matrix) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + textpage.parent = weakref.proxy(self) + return textpage + %} + + /* ****************** currently inactive + //---------------------------------------------------------------- + // Page._get_textpage_ocr + //---------------------------------------------------------------- + FITZEXCEPTION(_get_textpage_ocr, !result) + %pythonappend _get_textpage_ocr %{val.thisown = True%} + struct TextPage * + _get_textpage_ocr(PyObject *clip=NULL, int flags=0, const char *language=NULL, const char *tessdata=NULL) + { + fz_stext_page *textpage=NULL; + fz_try(gctx) { + fz_rect rect = JM_rect_from_py(clip); + textpage = JM_new_stext_page_ocr_from_page(gctx, (fz_page *) $self, rect, flags, language, tessdata); + } + fz_catch(gctx) { + return NULL; + } + return (struct TextPage *) textpage; + } + ************************* */ + + //---------------------------------------------------------------- + // Page.language + //---------------------------------------------------------------- + %pythoncode%{@property%} + %pythonprepend language %{"""Page language."""%} + PyObject *language() + { + pdf_page *pdfpage = pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (!pdfpage) Py_RETURN_NONE; + pdf_obj *lang = pdf_dict_get_inheritable(gctx, pdfpage->obj, PDF_NAME(Lang)); + if (!lang) Py_RETURN_NONE; + return Py_BuildValue("s", pdf_to_str_buf(gctx, lang)); + } + + + //---------------------------------------------------------------- + // Page.set_language + //---------------------------------------------------------------- + FITZEXCEPTION(set_language, !result) + PARENTCHECK(set_language, """Set PDF page default language.""") + PyObject *set_language(char *language=NULL) + { + pdf_page *pdfpage = pdf_page_from_fz_page(gctx, (fz_page *) $self); + fz_try(gctx) { + ASSERT_PDF(pdfpage); + fz_text_language lang; + char buf[8]; + if (!language) { + pdf_dict_del(gctx, pdfpage->obj, PDF_NAME(Lang)); + } else { + lang = fz_text_language_from_string(language); + pdf_dict_put_text_string(gctx, pdfpage->obj, + PDF_NAME(Lang), + fz_string_from_text_language(buf, lang)); + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_TRUE; + } + + + //---------------------------------------------------------------- + // Page.get_svg_image + //---------------------------------------------------------------- + FITZEXCEPTION(get_svg_image, !result) + PARENTCHECK(get_svg_image, """Make SVG image from page.""") + PyObject *get_svg_image(PyObject *matrix = NULL, int text_as_path=1) + { + fz_rect mediabox = fz_bound_page(gctx, (fz_page *) $self); + fz_device *dev = NULL; + fz_buffer *res = NULL; + PyObject *text = NULL; + fz_matrix ctm = JM_matrix_from_py(matrix); + fz_output *out = NULL; + fz_var(out); + fz_var(dev); + fz_var(res); + fz_rect tbounds = mediabox; + 
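+            // text_as_path=1 renders glyphs as path outlines (larger output,
+            // but no font dependencies); otherwise text is kept as SVG text
+            // elements.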
int text_option = (text_as_path == 1) ? FZ_SVG_TEXT_AS_PATH : FZ_SVG_TEXT_AS_TEXT; + tbounds = fz_transform_rect(tbounds, ctm); + + fz_try(gctx) { + res = fz_new_buffer(gctx, 1024); + out = fz_new_output_with_buffer(gctx, res); + dev = fz_new_svg_device(gctx, out, + tbounds.x1-tbounds.x0, // width + tbounds.y1-tbounds.y0, // height + text_option, 1); + fz_run_page(gctx, (fz_page *) $self, dev, ctm, NULL); + fz_close_device(gctx, dev); + text = JM_EscapeStrFromBuffer(gctx, res); + } + fz_always(gctx) { + fz_drop_device(gctx, dev); + fz_drop_output(gctx, out); + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + return NULL; + } + return text; + } + + + //---------------------------------------------------------------- + // page set opacity + //---------------------------------------------------------------- + FITZEXCEPTION(_set_opacity, !result) + %pythonprepend _set_opacity %{ + if CA >= 1 and ca >= 1 and blendmode == None: + return None + tCA = int(round(max(CA , 0) * 100)) + if tCA >= 100: + tCA = 99 + tca = int(round(max(ca, 0) * 100)) + if tca >= 100: + tca = 99 + gstate = "fitzca%02i%02i" % (tCA, tca) + %} + PyObject * + _set_opacity(char *gstate=NULL, float CA=1, float ca=1, char *blendmode=NULL) + { + if (!gstate) Py_RETURN_NONE; + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + fz_try(gctx) { + ASSERT_PDF(page); + pdf_obj *resources = pdf_dict_get(gctx, page->obj, PDF_NAME(Resources)); + if (!resources) { + resources = pdf_dict_put_dict(gctx, page->obj, PDF_NAME(Resources), 2); + } + pdf_obj *extg = pdf_dict_get(gctx, resources, PDF_NAME(ExtGState)); + if (!extg) { + extg = pdf_dict_put_dict(gctx, resources, PDF_NAME(ExtGState), 2); + } + int i, n = pdf_dict_len(gctx, extg); + for (i = 0; i < n; i++) { + pdf_obj *o1 = pdf_dict_get_key(gctx, extg, i); + char *name = (char *) pdf_to_name(gctx, o1); + if (strcmp(name, gstate) == 0) goto finished; + } + pdf_obj *opa = pdf_new_dict(gctx, page->doc, 3); + pdf_dict_put_real(gctx, opa, PDF_NAME(CA), (double) CA); + pdf_dict_put_real(gctx, opa, PDF_NAME(ca), (double) ca); + pdf_dict_puts_drop(gctx, extg, gstate, opa); + finished:; + } + fz_always(gctx) { + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("s", gstate); + } + + //---------------------------------------------------------------- + // page add_caret_annot + //---------------------------------------------------------------- + FITZEXCEPTION(_add_caret_annot, !result) + struct Annot * + _add_caret_annot(PyObject *point) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *annot = NULL; + fz_try(gctx) { + annot = pdf_create_annot(gctx, page, PDF_ANNOT_CARET); + if (point) + { + fz_point p = JM_point_from_py(point); + fz_rect r = pdf_annot_rect(gctx, annot); + r = fz_make_rect(p.x, p.y, p.x + r.x1 - r.x0, p.y + r.y1 - r.y0); + pdf_set_annot_rect(gctx, annot, r); + } + pdf_update_annot(gctx, annot); + JM_add_annot_id(gctx, annot, "A"); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + + //---------------------------------------------------------------- + // page addRedactAnnot + //---------------------------------------------------------------- + FITZEXCEPTION(_add_redact_annot, !result) + struct Annot * + _add_redact_annot(PyObject *quad, + PyObject *text=NULL, + PyObject *da_str=NULL, + int align=0, + PyObject *fill=NULL, + PyObject *text_color=NULL) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *annot = NULL; + float fcol[4] = { 1, 1, 1, 0}; + int 
nfcol = 0, i; + fz_try(gctx) { + annot = pdf_create_annot(gctx, page, PDF_ANNOT_REDACT); + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + fz_quad q = JM_quad_from_py(quad); + fz_rect r = fz_rect_from_quad(q); + + // TODO calculate de-rotated rect + pdf_set_annot_rect(gctx, annot, r); + if (EXISTS(fill)) { + JM_color_FromSequence(fill, &nfcol, fcol); + pdf_obj *arr = pdf_new_array(gctx, page->doc, nfcol); + for (i = 0; i < nfcol; i++) { + pdf_array_push_real(gctx, arr, fcol[i]); + } + pdf_dict_put_drop(gctx, annot_obj, PDF_NAME(IC), arr); + } + if (EXISTS(text)) { + const char *otext = PyUnicode_AsUTF8(text); + pdf_dict_puts_drop(gctx, annot_obj, "OverlayText", + pdf_new_text_string(gctx, otext)); + pdf_dict_put_text_string(gctx,annot_obj, PDF_NAME(DA), PyUnicode_AsUTF8(da_str)); + pdf_dict_put_int(gctx, annot_obj, PDF_NAME(Q), (int64_t) align); + } + pdf_update_annot(gctx, annot); + JM_add_annot_id(gctx, annot, "A"); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + //---------------------------------------------------------------- + // page addLineAnnot + //---------------------------------------------------------------- + FITZEXCEPTION(_add_line_annot, !result) + struct Annot * + _add_line_annot(PyObject *p1, PyObject *p2) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *annot = NULL; + fz_try(gctx) { + ASSERT_PDF(page); + annot = pdf_create_annot(gctx, page, PDF_ANNOT_LINE); + fz_point a = JM_point_from_py(p1); + fz_point b = JM_point_from_py(p2); + pdf_set_annot_line(gctx, annot, a, b); + pdf_update_annot(gctx, annot); + JM_add_annot_id(gctx, annot, "A"); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + //---------------------------------------------------------------- + // page addTextAnnot + //---------------------------------------------------------------- + FITZEXCEPTION(_add_text_annot, !result) + struct Annot * + _add_text_annot(PyObject *point, + char *text, + char *icon=NULL) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *annot = NULL; + fz_rect r; + fz_point p = JM_point_from_py(point); + fz_var(annot); + fz_try(gctx) { + ASSERT_PDF(page); + annot = pdf_create_annot(gctx, page, PDF_ANNOT_TEXT); + r = pdf_annot_rect(gctx, annot); + r = fz_make_rect(p.x, p.y, p.x + r.x1 - r.x0, p.y + r.y1 - r.y0); + pdf_set_annot_rect(gctx, annot, r); + pdf_set_annot_contents(gctx, annot, text); + if (icon) { + pdf_set_annot_icon_name(gctx, annot, icon); + } + pdf_update_annot(gctx, annot); + JM_add_annot_id(gctx, annot, "A"); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + //---------------------------------------------------------------- + // page addInkAnnot + //---------------------------------------------------------------- + FITZEXCEPTION(_add_ink_annot, !result) + struct Annot * + _add_ink_annot(PyObject *list) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *annot = NULL; + PyObject *p = NULL, *sublist = NULL; + pdf_obj *inklist = NULL, *stroke = NULL; + fz_matrix ctm, inv_ctm; + fz_point point; + fz_var(annot); + fz_try(gctx) { + ASSERT_PDF(page); + if (!PySequence_Check(list)) { + RAISEPY(gctx, MSG_BAD_ARG_INK_ANNOT, PyExc_ValueError); + } + pdf_page_transform(gctx, page, NULL, &ctm); + inv_ctm = fz_invert_matrix(ctm); + annot = pdf_create_annot(gctx, page, PDF_ANNOT_INK); + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + Py_ssize_t i, j, n0 = PySequence_Size(list), n1; + 
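+            // Build the PDF /InkList: one sub-array per stroke, holding
+            // alternating x/y coordinates in unrotated page coordinates
+            // (hence the inv_ctm transform applied to each point below).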
inklist = pdf_new_array(gctx, page->doc, n0); + + for (j = 0; j < n0; j++) { + sublist = PySequence_ITEM(list, j); + n1 = PySequence_Size(sublist); + stroke = pdf_new_array(gctx, page->doc, 2 * n1); + + for (i = 0; i < n1; i++) { + p = PySequence_ITEM(sublist, i); + if (!PySequence_Check(p) || PySequence_Size(p) != 2) { + RAISEPY(gctx, MSG_BAD_ARG_INK_ANNOT, PyExc_ValueError); + } + point = fz_transform_point(JM_point_from_py(p), inv_ctm); + Py_CLEAR(p); + pdf_array_push_real(gctx, stroke, point.x); + pdf_array_push_real(gctx, stroke, point.y); + } + + pdf_array_push_drop(gctx, inklist, stroke); + stroke = NULL; + Py_CLEAR(sublist); + } + + pdf_dict_put_drop(gctx, annot_obj, PDF_NAME(InkList), inklist); + inklist = NULL; + pdf_update_annot(gctx, annot); + JM_add_annot_id(gctx, annot, "A"); + } + + fz_catch(gctx) { + Py_CLEAR(p); + Py_CLEAR(sublist); + return NULL; + } + return (struct Annot *) annot; + } + + //---------------------------------------------------------------- + // page addStampAnnot + //---------------------------------------------------------------- + FITZEXCEPTION(_add_stamp_annot, !result) + struct Annot * + _add_stamp_annot(PyObject *rect, int stamp=0) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *annot = NULL; + pdf_obj *stamp_id[] = {PDF_NAME(Approved), PDF_NAME(AsIs), + PDF_NAME(Confidential), PDF_NAME(Departmental), + PDF_NAME(Experimental), PDF_NAME(Expired), + PDF_NAME(Final), PDF_NAME(ForComment), + PDF_NAME(ForPublicRelease), PDF_NAME(NotApproved), + PDF_NAME(NotForPublicRelease), PDF_NAME(Sold), + PDF_NAME(TopSecret), PDF_NAME(Draft)}; + int n = nelem(stamp_id); + pdf_obj *name = stamp_id[0]; + fz_try(gctx) { + ASSERT_PDF(page); + fz_rect r = JM_rect_from_py(rect); + if (fz_is_infinite_rect(r) || fz_is_empty_rect(r)) { + RAISEPY(gctx, MSG_BAD_RECT, PyExc_ValueError); + } + if (INRANGE(stamp, 0, n-1)) { + name = stamp_id[stamp]; + } + annot = pdf_create_annot(gctx, page, PDF_ANNOT_STAMP); + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_set_annot_rect(gctx, annot, r); + pdf_dict_put(gctx, annot_obj, PDF_NAME(Name), name); + pdf_set_annot_contents(gctx, annot, + pdf_dict_get_name(gctx, annot_obj, PDF_NAME(Name))); + pdf_update_annot(gctx, annot); + JM_add_annot_id(gctx, annot, "A"); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + //---------------------------------------------------------------- + // page addFileAnnot + //---------------------------------------------------------------- + FITZEXCEPTION(_add_file_annot, !result) + struct Annot * + _add_file_annot(PyObject *point, + PyObject *buffer, + char *filename, + char *ufilename=NULL, + char *desc=NULL, + char *icon=NULL) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *annot = NULL; + char *uf = ufilename, *d = desc; + if (!ufilename) uf = filename; + if (!desc) d = filename; + fz_buffer *filebuf = NULL; + fz_rect r; + fz_point p = JM_point_from_py(point); + fz_var(filebuf); + fz_try(gctx) { + ASSERT_PDF(page); + filebuf = JM_BufferFromBytes(gctx, buffer); + if (!filebuf) { + RAISEPY(gctx, MSG_BAD_BUFFER, PyExc_TypeError); + } + annot = pdf_create_annot(gctx, page, PDF_ANNOT_FILE_ATTACHMENT); + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + r = pdf_annot_rect(gctx, annot); + r = fz_make_rect(p.x, p.y, p.x + r.x1 - r.x0, p.y + r.y1 - r.y0); + pdf_set_annot_rect(gctx, annot, r); + int flags = PDF_ANNOT_IS_PRINT; + pdf_set_annot_flags(gctx, annot, flags); + + if (icon) + 
pdf_set_annot_icon_name(gctx, annot, icon); + + pdf_obj *val = JM_embed_file(gctx, page->doc, filebuf, + filename, uf, d, 1); + pdf_dict_put_drop(gctx, annot_obj, PDF_NAME(FS), val); + pdf_dict_put_text_string(gctx, annot_obj, PDF_NAME(Contents), filename); + pdf_update_annot(gctx, annot); + pdf_set_annot_rect(gctx, annot, r); + pdf_set_annot_flags(gctx, annot, flags); + JM_add_annot_id(gctx, annot, "A"); + } + fz_always(gctx) { + fz_drop_buffer(gctx, filebuf); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + + //---------------------------------------------------------------- + // page: add a text marker annotation + //---------------------------------------------------------------- + FITZEXCEPTION(_add_text_marker, !result) + %pythonprepend _add_text_marker %{ + CheckParent(self) + if not self.parent.is_pdf: + raise ValueError("is no PDF")%} + + %pythonappend _add_text_marker %{ + if not val: + return None + val.parent = weakref.proxy(self) + self._annot_refs[id(val)] = val%} + + struct Annot * + _add_text_marker(PyObject *quads, int annot_type) + { + pdf_page *pdfpage = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *annot = NULL; + PyObject *item = NULL; + int rotation = JM_page_rotation(gctx, pdfpage); + fz_quad q; + fz_var(annot); + fz_var(item); + fz_try(gctx) { + if (rotation != 0) { + pdf_dict_put_int(gctx, pdfpage->obj, PDF_NAME(Rotate), 0); + } + annot = pdf_create_annot(gctx, pdfpage, annot_type); + Py_ssize_t i, len = PySequence_Size(quads); + for (i = 0; i < len; i++) { + item = PySequence_ITEM(quads, i); + q = JM_quad_from_py(item); + Py_DECREF(item); + pdf_add_annot_quad_point(gctx, annot, q); + } + pdf_update_annot(gctx, annot); + JM_add_annot_id(gctx, annot, "A"); + } + fz_always(gctx) { + if (rotation != 0) { + pdf_dict_put_int(gctx, pdfpage->obj, PDF_NAME(Rotate), rotation); + } + } + fz_catch(gctx) { + pdf_drop_annot(gctx, annot); + return NULL; + } + return (struct Annot *) annot; + } + + + //---------------------------------------------------------------- + // page: add circle or rectangle annotation + //---------------------------------------------------------------- + FITZEXCEPTION(_add_square_or_circle, !result) + struct Annot * + _add_square_or_circle(PyObject *rect, int annot_type) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *annot = NULL; + fz_try(gctx) { + fz_rect r = JM_rect_from_py(rect); + if (fz_is_infinite_rect(r) || fz_is_empty_rect(r)) { + RAISEPY(gctx, MSG_BAD_RECT, PyExc_ValueError); + } + annot = pdf_create_annot(gctx, page, annot_type); + pdf_set_annot_rect(gctx, annot, r); + pdf_update_annot(gctx, annot); + JM_add_annot_id(gctx, annot, "A"); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + + //---------------------------------------------------------------- + // page: add multiline annotation + //---------------------------------------------------------------- + FITZEXCEPTION(_add_multiline, !result) + struct Annot * + _add_multiline(PyObject *points, int annot_type) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *annot = NULL; + fz_try(gctx) { + Py_ssize_t i, n = PySequence_Size(points); + if (n < 2) { + RAISEPY(gctx, MSG_BAD_ARG_POINTS, PyExc_ValueError); + } + annot = pdf_create_annot(gctx, page, annot_type); + for (i = 0; i < n; i++) { + PyObject *p = PySequence_ITEM(points, i); + if (PySequence_Size(p) != 2) { + Py_DECREF(p); + RAISEPY(gctx, MSG_BAD_ARG_POINTS, PyExc_ValueError); + } + 
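+                // convert the 2-sequence to a point and record it as a vertex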
fz_point point = JM_point_from_py(p); + Py_DECREF(p); + pdf_add_annot_vertex(gctx, annot, point); + } + + pdf_update_annot(gctx, annot); + JM_add_annot_id(gctx, annot, "A"); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + + //---------------------------------------------------------------- + // page addFreetextAnnot + //---------------------------------------------------------------- + FITZEXCEPTION(_add_freetext_annot, !result) + %pythonappend _add_freetext_annot %{ + ap = val._getAP() + BT = ap.find(b"BT") + ET = ap.find(b"ET") + 2 + ap = ap[BT:ET] + w = rect[2]-rect[0] + h = rect[3]-rect[1] + if rotate in (90, -90, 270): + w, h = h, w + re = b"0 0 %g %g re" % (w, h) + ap = re + b"\nW\nn\n" + ap + ope = None + bwidth = b"" + fill_string = ColorCode(fill_color, "f").encode() + if fill_string: + fill_string += b"\n" + ope = b"f" + stroke_string = ColorCode(border_color, "c").encode() + if stroke_string: + stroke_string += b"\n" + bwidth = b"1 w\n" + ope = b"S" + if fill_string and stroke_string: + ope = b"B" + if ope != None: + ap = bwidth + fill_string + stroke_string + re + b"\n" + ope + b"\n" + ap + val._setAP(ap) + %} + struct Annot * + _add_freetext_annot(PyObject *rect, char *text, + float fontsize=11, + char *fontname=NULL, + PyObject *text_color=NULL, + PyObject *fill_color=NULL, + PyObject *border_color=NULL, + int align=0, + int rotate=0) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + float fcol[4] = {1, 1, 1, 1}; // fill color: white + int nfcol = 0; + JM_color_FromSequence(fill_color, &nfcol, fcol); + float tcol[4] = {0, 0, 0, 0}; // std. text color: black + int ntcol = 0; + JM_color_FromSequence(text_color, &ntcol, tcol); + fz_rect r = JM_rect_from_py(rect); + pdf_annot *annot = NULL; + fz_try(gctx) { + if (fz_is_infinite_rect(r) || fz_is_empty_rect(r)) { + RAISEPY(gctx, MSG_BAD_RECT, PyExc_ValueError); + } + annot = pdf_create_annot(gctx, page, PDF_ANNOT_FREE_TEXT); + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_set_annot_contents(gctx, annot, text); + pdf_set_annot_rect(gctx, annot, r); + pdf_dict_put_int(gctx, annot_obj, PDF_NAME(Rotate), rotate); + pdf_dict_put_int(gctx, annot_obj, PDF_NAME(Q), align); + + if (nfcol > 0) { + pdf_set_annot_color(gctx, annot, nfcol, fcol); + } + + // insert the default appearance string + JM_make_annot_DA(gctx, annot, ntcol, tcol, fontname, fontsize); + pdf_update_annot(gctx, annot); + JM_add_annot_id(gctx, annot, "A"); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + + %pythoncode %{ + @property + def rotation_matrix(self) -> Matrix: + """Reflects page rotation.""" + return Matrix(TOOLS._rotate_matrix(self)) + + @property + def derotation_matrix(self) -> Matrix: + """Reflects page de-rotation.""" + return Matrix(TOOLS._derotate_matrix(self)) + + def add_caret_annot(self, point: point_like) -> "struct Annot *": + """Add a 'Caret' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_caret_annot(point) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_strikeout_annot(self, quads=None, start=None, stop=None, clip=None) -> "struct Annot *": + """Add a 'StrikeOut' annotation.""" + if quads is None: + q = get_highlight_selection(self, start=start, stop=stop, clip=clip) + else: + q = CheckMarkerArg(quads) + return self._add_text_marker(q, PDF_ANNOT_STRIKE_OUT) + + + def add_underline_annot(self, quads=None, start=None, stop=None, 
clip=None) -> "struct Annot *": + """Add a 'Underline' annotation.""" + if quads is None: + q = get_highlight_selection(self, start=start, stop=stop, clip=clip) + else: + q = CheckMarkerArg(quads) + return self._add_text_marker(q, PDF_ANNOT_UNDERLINE) + + + def add_squiggly_annot(self, quads=None, start=None, + stop=None, clip=None) -> "struct Annot *": + """Add a 'Squiggly' annotation.""" + if quads is None: + q = get_highlight_selection(self, start=start, stop=stop, clip=clip) + else: + q = CheckMarkerArg(quads) + return self._add_text_marker(q, PDF_ANNOT_SQUIGGLY) + + + def add_highlight_annot(self, quads=None, start=None, + stop=None, clip=None) -> "struct Annot *": + """Add a 'Highlight' annotation.""" + if quads is None: + q = get_highlight_selection(self, start=start, stop=stop, clip=clip) + else: + q = CheckMarkerArg(quads) + return self._add_text_marker(q, PDF_ANNOT_HIGHLIGHT) + + + def add_rect_annot(self, rect: rect_like) -> "struct Annot *": + """Add a 'Square' (rectangle) annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_square_or_circle(rect, PDF_ANNOT_SQUARE) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_circle_annot(self, rect: rect_like) -> "struct Annot *": + """Add a 'Circle' (ellipse, oval) annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_square_or_circle(rect, PDF_ANNOT_CIRCLE) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_text_annot(self, point: point_like, text: str, icon: str ="Note") -> "struct Annot *": + """Add a 'Text' (sticky note) annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_text_annot(point, text, icon=icon) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_line_annot(self, p1: point_like, p2: point_like) -> "struct Annot *": + """Add a 'Line' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_line_annot(p1, p2) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_polyline_annot(self, points: list) -> "struct Annot *": + """Add a 'PolyLine' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_multiline(points, PDF_ANNOT_POLY_LINE) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_polygon_annot(self, points: list) -> "struct Annot *": + """Add a 'Polygon' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_multiline(points, PDF_ANNOT_POLYGON) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_stamp_annot(self, rect: rect_like, stamp: int =0) -> "struct Annot *": + """Add a ('rubber') 'Stamp' annotation.""" + old_rotation = annot_preprocess(self) + try: + annot = self._add_stamp_annot(rect, stamp) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_ink_annot(self, handwriting: list) -> "struct Annot *": + """Add a 'Ink' ('handwriting') annotation. + + The argument must be a list of lists of point_likes. 
+ """ + old_rotation = annot_preprocess(self) + try: + annot = self._add_ink_annot(handwriting) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_file_annot(self, point: point_like, + buffer: ByteString, + filename: str, + ufilename: OptStr =None, + desc: OptStr =None, + icon: OptStr =None) -> "struct Annot *": + """Add a 'FileAttachment' annotation.""" + + old_rotation = annot_preprocess(self) + try: + annot = self._add_file_annot(point, + buffer, + filename, + ufilename=ufilename, + desc=desc, + icon=icon) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_freetext_annot(self, rect: rect_like, text: str, fontsize: float =11, + fontname: OptStr =None, border_color: OptSeq =None, + text_color: OptSeq =None, + fill_color: OptSeq =None, align: int =0, rotate: int =0) -> "struct Annot *": + """Add a 'FreeText' annotation.""" + + old_rotation = annot_preprocess(self) + try: + annot = self._add_freetext_annot(rect, text, fontsize=fontsize, + fontname=fontname, border_color=border_color,text_color=text_color, + fill_color=fill_color, align=align, rotate=rotate) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + return annot + + + def add_redact_annot(self, quad, text: OptStr =None, fontname: OptStr =None, + fontsize: float =11, align: int =0, fill: OptSeq =None, text_color: OptSeq =None, + cross_out: bool =True) -> "struct Annot *": + """Add a 'Redact' annotation.""" + da_str = None + if text: + CheckColor(fill) + CheckColor(text_color) + if not fontname: + fontname = "Helv" + if not fontsize: + fontsize = 11 + if not text_color: + text_color = (0, 0, 0) + if hasattr(text_color, "__float__"): + text_color = (text_color, text_color, text_color) + if len(text_color) > 3: + text_color = text_color[:3] + fmt = "{:g} {:g} {:g} rg /{f:s} {s:g} Tf" + da_str = fmt.format(*text_color, f=fontname, s=fontsize) + if fill is None: + fill = (1, 1, 1) + if fill: + if hasattr(fill, "__float__"): + fill = (fill, fill, fill) + if len(fill) > 3: + fill = fill[:3] + + old_rotation = annot_preprocess(self) + try: + annot = self._add_redact_annot(quad, text=text, da_str=da_str, + align=align, fill=fill) + finally: + if old_rotation != 0: + self.set_rotation(old_rotation) + annot_postprocess(self, annot) + #------------------------------------------------------------- + # change appearance to show a crossed-out rectangle + #------------------------------------------------------------- + if cross_out: + ap_tab = annot._getAP().splitlines()[:-1] # get the 4 commands only + _, LL, LR, UR, UL = ap_tab + ap_tab.append(LR) + ap_tab.append(LL) + ap_tab.append(UR) + ap_tab.append(LL) + ap_tab.append(UL) + ap_tab.append(b"S") + ap = b"\n".join(ap_tab) + annot._setAP(ap, 0) + return annot + %} + + + //---------------------------------------------------------------- + // page load annot by name or xref + //---------------------------------------------------------------- + FITZEXCEPTION(_load_annot, !result) + struct Annot * + _load_annot(char *name, int xref) + { + pdf_annot *annot = NULL; + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + fz_try(gctx) { + ASSERT_PDF(page); + if (xref == 0) + annot = JM_get_annot_by_name(gctx, page, name); + else + annot = JM_get_annot_by_xref(gctx, page, xref); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + + 
//---------------------------------------------------------------- + // page load widget by xref + //---------------------------------------------------------------- + FITZEXCEPTION(load_widget, !result) + %pythonprepend load_widget %{ + """Load a widget by its xref.""" + CheckParent(self) + %} + %pythonappend load_widget %{ + if not val: + return val + val.thisown = True + val.parent = weakref.proxy(self) + self._annot_refs[id(val)] = val + widget = Widget() + TOOLS._fill_widget(val, widget) + val = widget + %} + struct Annot * + load_widget(int xref) + { + pdf_annot *annot = NULL; + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + fz_try(gctx) { + ASSERT_PDF(page); + annot = JM_get_widget_by_xref(gctx, page, xref); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + + //---------------------------------------------------------------- + // page list Resource/Properties + //---------------------------------------------------------------- + FITZEXCEPTION(_get_resource_properties, !result) + PyObject * + _get_resource_properties() + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + PyObject *rc; + fz_try(gctx) { + ASSERT_PDF(page); + rc = JM_get_resource_properties(gctx, page->obj); + } + fz_catch(gctx) { + return NULL; + } + return rc; + } + + + //---------------------------------------------------------------- + // page list Resource/Properties + //---------------------------------------------------------------- + FITZEXCEPTION(_set_resource_property, !result) + PyObject * + _set_resource_property(char *name, int xref) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + fz_try(gctx) { + ASSERT_PDF(page); + JM_set_resource_property(gctx, page->obj, name, xref); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + %pythoncode %{ +def _get_optional_content(self, oc: OptInt) -> OptStr: + if oc == None or oc == 0: + return None + doc = self.parent + check = doc.xref_object(oc, compressed=True) + if not ("/Type/OCG" in check or "/Type/OCMD" in check): + raise ValueError("bad optional content: 'oc'") + props = {} + for p, x in self._get_resource_properties(): + props[x] = p + if oc in props.keys(): + return props[oc] + i = 0 + mc = "MC%i" % i + while mc in props.values(): + i += 1 + mc = "MC%i" % i + self._set_resource_property(mc, oc) + return mc + +def get_oc_items(self) -> list: + """Get OCGs and OCMDs used in the page's contents. + + Returns: + List of items (name, xref, type), where type is one of "ocg" / "ocmd", + and name is the property name. 
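+
+        Example (a sketch only):
+
+            for name, xref, kind in page.get_oc_items():
+                print(name, xref, kind)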
+ """ + rc = [] + for pname, xref in self._get_resource_properties(): + text = self.parent.xref_object(xref, compressed=True) + if "/Type/OCG" in text: + octype = "ocg" + elif "/Type/OCMD" in text: + octype = "ocmd" + else: + continue + rc.append((pname, xref, octype)) + return rc +%} + + //---------------------------------------------------------------- + // page get list of annot names + //---------------------------------------------------------------- + PARENTCHECK(annot_names, """List of names of annotations, fields and links.""") + PyObject *annot_names() + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + + if (!page) { + PyObject *rc = PyList_New(0); + return rc; + } + return JM_get_annot_id_list(gctx, page); + } + + + //---------------------------------------------------------------- + // page retrieve list of annotation xrefs + //---------------------------------------------------------------- + PARENTCHECK(annot_xrefs,"""List of xref numbers of annotations, fields and links.""") + PyObject *annot_xrefs() + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (!page) { + PyObject *rc = PyList_New(0); + return rc; + } + return JM_get_annot_xref_list(gctx, page->obj); + } + + + %pythoncode %{ + def load_annot(self, ident: typing.Union[str, int]) -> "struct Annot *": + """Load an annot by name (/NM key) or xref. + + Args: + ident: identifier, either name (str) or xref (int). + """ + + CheckParent(self) + if type(ident) is str: + xref = 0 + name = ident + elif type(ident) is int: + xref = ident + name = None + else: + raise ValueError("identifier must be string or integer") + val = self._load_annot(name, xref) + if not val: + return val + val.thisown = True + val.parent = weakref.proxy(self) + self._annot_refs[id(val)] = val + return val + + + #--------------------------------------------------------------------- + # page addWidget + #--------------------------------------------------------------------- + def add_widget(self, widget: Widget) -> "struct Annot *": + """Add a 'Widget' (form field).""" + CheckParent(self) + doc = self.parent + if not doc.is_pdf: + raise ValueError("is no PDF") + widget._validate() + annot = self._addWidget(widget.field_type, widget.field_name) + if not annot: + return None + annot.thisown = True + annot.parent = weakref.proxy(self) # owning page object + self._annot_refs[id(annot)] = annot + widget.parent = annot.parent + widget._annot = annot + widget.update() + return annot + %} + + FITZEXCEPTION(_addWidget, !result) + struct Annot *_addWidget(int field_type, char *field_name) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_document *pdf = page->doc; + pdf_annot *annot = NULL; + fz_var(annot); + fz_try(gctx) { + annot = JM_create_widget(gctx, pdf, page, field_type, field_name); + if (!annot) { + RAISEPY(gctx, "cannot create widget", PyExc_RuntimeError); + } + JM_add_annot_id(gctx, annot, "W"); + } + fz_catch(gctx) { + return NULL; + } + return (struct Annot *) annot; + } + + //---------------------------------------------------------------- + // Page.get_displaylist + //---------------------------------------------------------------- + FITZEXCEPTION(get_displaylist, !result) + %pythonprepend get_displaylist %{ + """Make a DisplayList from the page for Pixmap generation. 
+ + Include (default) or exclude annotations.""" + + CheckParent(self) + %} + %pythonappend get_displaylist %{val.thisown = True%} + struct DisplayList *get_displaylist(int annots=1) + { + fz_display_list *dl = NULL; + fz_try(gctx) { + if (annots) { + dl = fz_new_display_list_from_page(gctx, (fz_page *) $self); + } else { + dl = fz_new_display_list_from_page_contents(gctx, (fz_page *) $self); + } + } + fz_catch(gctx) { + return NULL; + } + return (struct DisplayList *) dl; + } + + + //---------------------------------------------------------------- + // Page.get_drawings + //---------------------------------------------------------------- + %pythoncode %{ + def get_drawings(self, extended: bool = False) -> list: + """Retrieve vector graphics. The extended version includes clips. + + Note: + For greater comfort, this method converts point-like, rect-like, quad-like + tuples of the C version to respective Point / Rect / Quad objects. + It also adds default items that are missing in original path types. + """ + allkeys = ( + "closePath", "fill", "color", "width", "lineCap", + "lineJoin", "dashes", "stroke_opacity", "fill_opacity", "even_odd", + ) + val = self.get_cdrawings(extended=extended) + for i in range(len(val)): + npath = val[i] + if not npath["type"].startswith("clip"): + npath["rect"] = Rect(npath["rect"]) + else: + npath["scissor"] = Rect(npath["scissor"]) + if npath["type"]!="group": + items = npath["items"] + newitems = [] + for item in items: + cmd = item[0] + rest = item[1:] + if cmd == "re": + item = ("re", Rect(rest[0]).normalize(), rest[1]) + elif cmd == "qu": + item = ("qu", Quad(rest[0])) + else: + item = tuple([cmd] + [Point(i) for i in rest]) + newitems.append(item) + npath["items"] = newitems + if npath["type"] in ("f", "s"): + for k in allkeys: + npath[k] = npath.get(k) + val[i] = npath + return val + + class Drawpath(object): + """Reflects a path dictionary from get_cdrawings().""" + def __init__(self, **args): + self.__dict__.update(args) + + class Drawpathlist(object): + """List of Path objects representing get_cdrawings() output.""" + def __init__(self): + self.paths = [] + self.path_count = 0 + self.group_count = 0 + self.clip_count = 0 + self.fill_count = 0 + self.stroke_count = 0 + self.fillstroke_count = 0 + + def append(self, path): + self.paths.append(path) + self.path_count += 1 + if path.type == "clip": + self.clip_count += 1 + elif path.type == "group": + self.group_count += 1 + elif path.type == "f": + self.fill_count += 1 + elif path.type == "s": + self.stroke_count += 1 + elif path.type == "fs": + self.fillstroke_count += 1 + + def clip_parents(self, i): + """Return list of parent clip paths. + + Args: + i: (int) return parents of this path. + Returns: + List of the clip parents.""" + if i >= self.path_count: + raise IndexError("bad path index") + while i < 0: + i += self.path_count + lvl = self.paths[i].level + clips = list( # clip paths before identified one + reversed( + [ + p + for p in self.paths[:i] + if p.type == "clip" and p.level < lvl + ] + ) + ) + if clips == []: # none found: empty list + return [] + nclips = [clips[0]] # init return list + for p in clips[1:]: + if p.level >= nclips[-1].level: + continue # only accept smaller clip levels + nclips.append(p) + return nclips + + def group_parents(self, i): + """Return list of parent group paths. + + Args: + i: (int) return parents of this path. 
+ Returns: + List of the group parents.""" + if i >= self.path_count: + raise IndexError("bad path index") + while i < 0: + i += self.path_count + lvl = self.paths[i].level + groups = list( # group paths before identified one + reversed( + [ + p + for p in self.paths[:i] + if p.type == "group" and p.level < lvl + ] + ) + ) + if groups == []: # none found: empty list + return [] + ngroups = [groups[0]] # init return list + for p in groups[1:]: + if p.level >= ngroups[-1].level: + continue # only accept smaller group levels + ngroups.append(p) + return ngroups + + def __getitem__(self, item): + return self.paths.__getitem__(item) + + def __len__(self): + return self.paths.__len__() + + + def get_lineart(self) -> object: + """Get page drawings paths. + + Note: + For greater comfort, this method converts point-like, rect-like, quad-like + tuples of the C version to respective Point / Rect / Quad objects. + Also adds default items that are missing in original path types. + In contrast to get_drawings(), this output is an object. + """ + + val = self.get_cdrawings(extended=True) + paths = self.Drawpathlist() + for path in val: + npath = self.Drawpath(**path) + if npath.type != "clip": + npath.rect = Rect(path["rect"]) + else: + npath.scissor = Rect(path["scissor"]) + if npath.type != "group": + items = path["items"] + newitems = [] + for item in items: + cmd = item[0] + rest = item[1:] + if cmd == "re": + item = ("re", Rect(rest[0]).normalize(), rest[1]) + elif cmd == "qu": + item = ("qu", Quad(rest[0])) + else: + item = tuple([cmd] + [Point(i) for i in rest]) + newitems.append(item) + npath.items = newitems + + if npath.type == "f": + npath.stroke_opacity = None + npath.dashes = None + npath.lineJoin = None + npath.lineCap = None + npath.color = None + npath.width = None + + paths.append(npath) + + val = None + return paths + %} + + + FITZEXCEPTION(get_cdrawings, !result) + %pythonprepend get_cdrawings %{ + """Extract vector graphics ("line art") from the page.""" + CheckParent(self) + old_rotation = self.rotation + if old_rotation != 0: + self.set_rotation(0) + %} + %pythonappend get_cdrawings %{ + if old_rotation != 0: + self.set_rotation(old_rotation) + %} + PyObject * + get_cdrawings(PyObject *extended=NULL, PyObject *callback=NULL, PyObject *method=NULL) + { + fz_page *page = (fz_page *) $self; + fz_device *dev = NULL; + PyObject *rc = NULL; + int clips = PyObject_IsTrue(extended); + fz_var(rc); + fz_try(gctx) { + fz_rect prect = fz_bound_page(gctx, page); + trace_device_ptm = fz_make_matrix(1, 0, 0, -1, 0, prect.y1); + if (PyCallable_Check(callback) || method != Py_None) { + dev = JM_new_lineart_device(gctx, callback, clips, method); + } else { + rc = PyList_New(0); + dev = JM_new_lineart_device(gctx, rc, clips, method); + } + fz_run_page(gctx, page, dev, fz_identity, NULL); + fz_close_device(gctx, dev); + } + fz_always(gctx) { + fz_drop_device(gctx, dev); + } + fz_catch(gctx) { + Py_CLEAR(rc); + return NULL; + } + if (PyCallable_Check(callback) || method != Py_None) { + Py_RETURN_NONE; + } + return rc; + } + + + FITZEXCEPTION(get_bboxlog, !result) + %pythonprepend get_bboxlog %{ + CheckParent(self) + old_rotation = self.rotation + if old_rotation != 0: + self.set_rotation(0) + %} + %pythonappend get_bboxlog %{ + if old_rotation != 0: + self.set_rotation(old_rotation) + %} + PyObject * + get_bboxlog(PyObject *layers=NULL) + { + fz_page *page = (fz_page *) $self; + fz_device *dev = NULL; + PyObject *rc = PyList_New(0); + int inc_layers = PyObject_IsTrue(layers); + fz_try(gctx) { + dev = 
JM_new_bbox_device(gctx, rc, inc_layers); + fz_run_page(gctx, page, dev, fz_identity, NULL); + fz_close_device(gctx, dev); + } + fz_always(gctx) { + fz_drop_device(gctx, dev); + } + fz_catch(gctx) { + Py_CLEAR(rc); + return NULL; + } + return rc; + } + + + FITZEXCEPTION(get_texttrace, !result) + %pythonprepend get_texttrace %{ + CheckParent(self) + old_rotation = self.rotation + if old_rotation != 0: + self.set_rotation(0) + %} + %pythonappend get_texttrace %{ + if old_rotation != 0: + self.set_rotation(old_rotation) + %} + PyObject * + get_texttrace() + { + fz_page *page = (fz_page *) $self; + fz_device *dev = NULL; + PyObject *rc = PyList_New(0); + fz_try(gctx) { + dev = JM_new_texttrace_device(gctx, rc); + fz_rect prect = fz_bound_page(gctx, page); + trace_device_rot = fz_identity; + trace_device_ptm = fz_make_matrix(1, 0, 0, -1, 0, prect.y1); + fz_run_page(gctx, page, dev, fz_identity, NULL); + fz_close_device(gctx, dev); + } + fz_always(gctx) { + fz_drop_device(gctx, dev); + } + fz_catch(gctx) { + Py_CLEAR(rc); + return NULL; + } + return rc; + } + + + //---------------------------------------------------------------- + // Page apply redactions + //---------------------------------------------------------------- + FITZEXCEPTION(_apply_redactions, !result) + PyObject *_apply_redactions(int images=PDF_REDACT_IMAGE_PIXELS) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + int success = 0; + pdf_redact_options opts = {0}; + opts.black_boxes = 0; // no black boxes + opts.image_method = images; // how to treat images + fz_try(gctx) { + ASSERT_PDF(page); + success = pdf_redact_page(gctx, page->doc, page, &opts); + } + fz_catch(gctx) { + return NULL; + } + return JM_BOOL(success); + } + + + //---------------------------------------------------------------- + // Page._makePixmap + //---------------------------------------------------------------- + FITZEXCEPTION(_makePixmap, !result) + struct Pixmap * + _makePixmap(struct Document *doc, + PyObject *ctm, + struct Colorspace *cs, + int alpha=0, + int annots=1, + PyObject *clip=NULL) + { + fz_pixmap *pix = NULL; + fz_try(gctx) { + pix = JM_pixmap_from_page(gctx, (fz_document *) doc, (fz_page *) $self, ctm, (fz_colorspace *) cs, alpha, annots, clip); + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) pix; + } + + + //---------------------------------------------------------------- + // Page.set_mediabox + //---------------------------------------------------------------- + FITZEXCEPTION(set_mediabox, !result) + PARENTCHECK(set_mediabox, """Set the MediaBox.""") + PyObject *set_mediabox(PyObject *rect) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + fz_try(gctx) { + ASSERT_PDF(page); + fz_rect mediabox = JM_rect_from_py(rect); + if (fz_is_empty_rect(mediabox) || + fz_is_infinite_rect(mediabox)) { + RAISEPY(gctx, MSG_BAD_RECT, PyExc_ValueError); + } + pdf_dict_put_rect(gctx, page->obj, PDF_NAME(MediaBox), mediabox); + pdf_dict_del(gctx, page->obj, PDF_NAME(CropBox)); + pdf_dict_del(gctx, page->obj, PDF_NAME(ArtBox)); + pdf_dict_del(gctx, page->obj, PDF_NAME(BleedBox)); + pdf_dict_del(gctx, page->obj, PDF_NAME(TrimBox)); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // Page.load_links() + //---------------------------------------------------------------- + PARENTCHECK(load_links, """Get first Link.""") + %pythonappend load_links %{ + if val: + val.thisown = True + val.parent = 
weakref.proxy(self) # owning page object + self._annot_refs[id(val)] = val + if self.parent.is_pdf: + link_id = [x for x in self.annot_xrefs() if x[1] == PDF_ANNOT_LINK][0] + val.xref = link_id[0] + val.id = link_id[2] + else: + val.xref = 0 + val.id = "" + %} + struct Link *load_links() + { + fz_link *l = NULL; + fz_try(gctx) { + l = fz_load_links(gctx, (fz_page *) $self); + } + fz_catch(gctx) { + return NULL; + } + return (struct Link *) l; + } + %pythoncode %{first_link = property(load_links, doc="First link on page")%} + + //---------------------------------------------------------------- + // Page.first_annot + //---------------------------------------------------------------- + PARENTCHECK(first_annot, """First annotation.""") + %pythonappend first_annot %{ + if val: + val.thisown = True + val.parent = weakref.proxy(self) # owning page object + self._annot_refs[id(val)] = val + %} + %pythoncode %{@property%} + struct Annot *first_annot() + { + pdf_annot *annot = NULL; + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (page) + { + annot = pdf_first_annot(gctx, page); + if (annot) pdf_keep_annot(gctx, annot); + } + return (struct Annot *) annot; + } + + //---------------------------------------------------------------- + // first_widget + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(first_widget, """First widget/field.""") + %pythonappend first_widget %{ + if val: + val.thisown = True + val.parent = weakref.proxy(self) # owning page object + self._annot_refs[id(val)] = val + widget = Widget() + TOOLS._fill_widget(val, widget) + val = widget + %} + struct Annot *first_widget() + { + pdf_annot *annot = NULL; + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (page) { + annot = pdf_first_widget(gctx, page); + if (annot) pdf_keep_annot(gctx, annot); + } + return (struct Annot *) annot; + } + + + //---------------------------------------------------------------- + // Page.delete_link() - delete link + //---------------------------------------------------------------- + PARENTCHECK(delete_link, """Delete a Link.""") + %pythonappend delete_link %{ + if linkdict["xref"] == 0: return + try: + linkid = linkdict["id"] + linkobj = self._annot_refs[linkid] + linkobj._erase() + except: + pass + %} + void delete_link(PyObject *linkdict) + { + if (!PyDict_Check(linkdict)) return; // have no dictionary + fz_try(gctx) { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (!page) goto finished; // have no PDF + int xref = (int) PyInt_AsLong(PyDict_GetItem(linkdict, dictkey_xref)); + if (xref < 1) goto finished; // invalid xref + pdf_obj *annots = pdf_dict_get(gctx, page->obj, PDF_NAME(Annots)); + if (!annots) goto finished; // have no annotations + int len = pdf_array_len(gctx, annots); + if (len == 0) goto finished; + int i, oxref = 0; + + for (i = 0; i < len; i++) { + oxref = pdf_to_num(gctx, pdf_array_get(gctx, annots, i)); + if (xref == oxref) break; // found xref in annotations + } + + if (xref != oxref) goto finished; // xref not in annotations + pdf_array_delete(gctx, annots, i); // delete entry in annotations + pdf_delete_object(gctx, page->doc, xref); // delete link obj + pdf_dict_put(gctx, page->obj, PDF_NAME(Annots), annots); + JM_refresh_links(gctx, page); + finished:; + + } + fz_catch(gctx) {;} + } + + //---------------------------------------------------------------- + // Page.delete_annot() - delete annotation and return the next one + 
//---------------------------------------------------------------- + %pythonprepend delete_annot %{ + """Delete annot and return next one.""" + CheckParent(self) + CheckParent(annot)%} + + %pythonappend delete_annot %{ + if val: + val.thisown = True + val.parent = weakref.proxy(self) # owning page object + val.parent._annot_refs[id(val)] = val + %} + + struct Annot *delete_annot(struct Annot *annot) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_annot *irt_annot = NULL; + while (1) { + // first loop through all /IRT annots and remove them + irt_annot = JM_find_annot_irt(gctx, (pdf_annot *) annot); + if (!irt_annot) // no more there + break; + pdf_delete_annot(gctx, page, irt_annot); + } + pdf_annot *nextannot = pdf_next_annot(gctx, (pdf_annot *) annot); // store next + pdf_delete_annot(gctx, page, (pdf_annot *) annot); + if (nextannot) { + nextannot = pdf_keep_annot(gctx, nextannot); + } + return (struct Annot *) nextannot; + } + + + //---------------------------------------------------------------- + // mediabox: get the /MediaBox (PDF only) + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(mediabox, """The MediaBox.""") + %pythonappend mediabox %{val = Rect(JM_TUPLE3(val))%} + PyObject *mediabox() + { + fz_rect rect = fz_infinite_rect; + fz_try(gctx) { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (!page) { + rect = fz_bound_page(gctx, (fz_page *) $self); + } else { + rect = JM_mediabox(gctx, page->obj); + } + } + fz_catch(gctx) {;} + return JM_py_from_rect(rect); + } + + + //---------------------------------------------------------------- + // cropbox: get the /CropBox (PDF only) + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(cropbox, """The CropBox.""") + %pythonappend cropbox %{val = Rect(JM_TUPLE3(val))%} + PyObject *cropbox() + { + fz_rect rect = fz_infinite_rect; + fz_try(gctx) { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (!page) { + rect = fz_bound_page(gctx, (fz_page *) $self); + } else { + rect = JM_cropbox(gctx, page->obj); + } + } + fz_catch(gctx) {;} + return JM_py_from_rect(rect); + } + + + PyObject *_other_box(const char *boxtype) + { + fz_rect rect = fz_infinite_rect; + fz_try(gctx) { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (page) { + pdf_obj *obj = pdf_dict_gets(gctx, page->obj, boxtype); + if (pdf_is_array(gctx, obj)) { + rect = pdf_to_rect(gctx, obj); + } + } + } + fz_catch(gctx) {;} + if (fz_is_infinite_rect(rect)) { + Py_RETURN_NONE; + } + return JM_py_from_rect(rect); + } + + + //---------------------------------------------------------------- + // CropBox position: x0, y0 of /CropBox + //---------------------------------------------------------------- + %pythoncode %{ + @property + def cropbox_position(self): + return self.cropbox.tl + + @property + def artbox(self): + """The ArtBox""" + rect = self._other_box("ArtBox") + if rect == None: + return self.cropbox + mb = self.mediabox + return Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1]) + + @property + def trimbox(self): + """The TrimBox""" + rect = self._other_box("TrimBox") + if rect == None: + return self.cropbox + mb = self.mediabox + return Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1]) + + @property + def bleedbox(self): + """The BleedBox""" + rect = self._other_box("BleedBox") + if rect == None: + return self.cropbox + mb = self.mediabox + return 
Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1]) + + def _set_pagebox(self, boxtype, rect): + doc = self.parent + if doc == None: + raise ValueError("orphaned object: parent is None") + + if not doc.is_pdf: + raise ValueError("is no PDF") + + valid_boxes = ("CropBox", "BleedBox", "TrimBox", "ArtBox") + + if boxtype not in valid_boxes: + raise ValueError("bad boxtype") + + rect = Rect(rect) + mb = self.mediabox + rect = Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1]) + if not (mb.x0 <= rect.x0 < rect.x1 <= mb.x1 and mb.y0 <= rect.y0 < rect.y1 <= mb.y1): + raise ValueError(f"{boxtype} not in MediaBox") + + doc.xref_set_key(self.xref, boxtype, "[%g %g %g %g]" % tuple(rect)) + + + def set_cropbox(self, rect): + """Set the CropBox. Will also change Page.rect.""" + return self._set_pagebox("CropBox", rect) + + def set_artbox(self, rect): + """Set the ArtBox.""" + return self._set_pagebox("ArtBox", rect) + + def set_bleedbox(self, rect): + """Set the BleedBox.""" + return self._set_pagebox("BleedBox", rect) + + def set_trimbox(self, rect): + """Set the TrimBox.""" + return self._set_pagebox("TrimBox", rect) + %} + + + //---------------------------------------------------------------- + // rotation - return page rotation + //---------------------------------------------------------------- + PARENTCHECK(rotation, """Page rotation.""") + %pythoncode %{@property%} + int rotation() + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (!page) return 0; + return JM_page_rotation(gctx, page); + } + + /*********************************************************************/ + // set_rotation() - set page rotation + /*********************************************************************/ + FITZEXCEPTION(set_rotation, !result) + PARENTCHECK(set_rotation, """Set page rotation.""") + PyObject *set_rotation(int rotation) + { + fz_try(gctx) { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + ASSERT_PDF(page); + int rot = JM_norm_rotation(rotation); + pdf_dict_put_int(gctx, page->obj, PDF_NAME(Rotate), (int64_t) rot); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + /*********************************************************************/ + // Page._addAnnot_FromString + // Add new links provided as an array of string object definitions. 
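+        // Note (illustrative): each list item is expected to be the PDF source of one
+        // annotation dictionary, e.g. "<</Type/Annot/Subtype/Link/Rect[36 36 144 72] ...>>".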
+ /*********************************************************************/ + FITZEXCEPTION(_addAnnot_FromString, !result) + PARENTCHECK(_addAnnot_FromString, """Add links from list of object sources.""") + PyObject *_addAnnot_FromString(PyObject *linklist) + { + pdf_obj *annots, *annot, *ind_obj; + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + PyObject *txtpy = NULL; + char *text = NULL; + Py_ssize_t lcount = PyTuple_Size(linklist); // link count + if (lcount < 1) Py_RETURN_NONE; + Py_ssize_t i = -1; + fz_var(text); + + // insert links from the provided sources + fz_try(gctx) { + ASSERT_PDF(page); + if (!PyTuple_Check(linklist)) { + RAISEPY(gctx, "bad 'linklist' argument", PyExc_ValueError); + } + if (!pdf_dict_get(gctx, page->obj, PDF_NAME(Annots))) { + pdf_dict_put_array(gctx, page->obj, PDF_NAME(Annots), lcount); + } + annots = pdf_dict_get(gctx, page->obj, PDF_NAME(Annots)); + for (i = 0; i < lcount; i++) { + fz_try(gctx) { + for (; i < lcount; i++) { + text = JM_StrAsChar(PyTuple_GET_ITEM(linklist, i)); + if (!text) { + PySys_WriteStderr("skipping bad link / annot item %zi.\n", i); + continue; + } + annot = pdf_add_object_drop(gctx, page->doc, + JM_pdf_obj_from_str(gctx, page->doc, text)); + ind_obj = pdf_new_indirect(gctx, page->doc, pdf_to_num(gctx, annot), 0); + pdf_array_push_drop(gctx, annots, ind_obj); + pdf_drop_obj(gctx, annot); + } + } + fz_catch(gctx) { + PySys_WriteStderr("skipping bad link / annot item %zi.\n", i); + } + } + } + fz_catch(gctx) { + PyErr_Clear(); + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------------------------------------- + // Page clean contents stream + //---------------------------------------------------------------- + FITZEXCEPTION(clean_contents, !result) + %pythonprepend clean_contents +%{"""Clean page /Contents into one object.""" +CheckParent(self) +if not sanitize and not self.is_wrapped: + self.wrap_contents()%} + PyObject *clean_contents(int sanitize=1) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (!page) { + Py_RETURN_NONE; + } + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR >= 22 + pdf_filter_factory list[2] = { 0 }; + pdf_sanitize_filter_options sopts = { 0 }; + pdf_filter_options filter = { + 1, // recurse: true + 0, // instance forms + 0, // do not ascii-escape binary data + 0, // no_update + NULL, // end_page_opaque + NULL, // end page + list, // filters + }; + if (sanitize) { + list[0].filter = pdf_new_sanitize_filter; + list[0].options = &sopts; + } + #else + pdf_filter_options filter = { + NULL, // opaque + NULL, // image filter + NULL, // text filter + NULL, // after text + NULL, // end page + 1, // recurse: true + 1, // instance forms + 1, // sanitize plus filtering + 0 // do not ascii-escape binary data + }; + filter.sanitize = sanitize; + #endif + fz_try(gctx) { + pdf_filter_page_contents(gctx, page->doc, page, &filter); + } + fz_catch(gctx) { + Py_RETURN_NONE; + } + Py_RETURN_NONE; + } + + //---------------------------------------------------------------- + // Show a PDF page + //---------------------------------------------------------------- + FITZEXCEPTION(_show_pdf_page, !result) + PyObject *_show_pdf_page(struct Page *fz_srcpage, int overlay=1, PyObject *matrix=NULL, int xref=0, int oc=0, PyObject *clip = NULL, struct Graftmap *graftmap = NULL, char *_imgname = NULL) + { + pdf_obj *xobj1=NULL, *xobj2=NULL, *resources; + fz_buffer *res=NULL, *nres=NULL; + fz_rect cropbox = JM_rect_from_py(clip); + fz_matrix mat = JM_matrix_from_py(matrix); + 
int rc_xref = xref; + fz_var(xobj1); + fz_var(xobj2); + fz_try(gctx) { + pdf_page *tpage = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_obj *tpageref = tpage->obj; + pdf_document *pdfout = tpage->doc; // target PDF + ENSURE_OPERATION(gctx, pdfout); + //------------------------------------------------------------- + // convert the source page to a Form XObject + //------------------------------------------------------------- + xobj1 = JM_xobject_from_page(gctx, pdfout, (fz_page *) fz_srcpage, + xref, (pdf_graft_map *) graftmap); + if (!rc_xref) rc_xref = pdf_to_num(gctx, xobj1); + + //------------------------------------------------------------- + // create referencing XObject (controls display on target page) + //------------------------------------------------------------- + // fill reference to xobj1 into the /Resources + //------------------------------------------------------------- + pdf_obj *subres1 = pdf_new_dict(gctx, pdfout, 5); + pdf_dict_puts(gctx, subres1, "fullpage", xobj1); + pdf_obj *subres = pdf_new_dict(gctx, pdfout, 5); + pdf_dict_put_drop(gctx, subres, PDF_NAME(XObject), subres1); + + res = fz_new_buffer(gctx, 20); + fz_append_string(gctx, res, "/fullpage Do"); + + xobj2 = pdf_new_xobject(gctx, pdfout, cropbox, mat, subres, res); + if (oc > 0) { + JM_add_oc_object(gctx, pdfout, pdf_resolve_indirect(gctx, xobj2), oc); + } + pdf_drop_obj(gctx, subres); + fz_drop_buffer(gctx, res); + + //------------------------------------------------------------- + // update target page with xobj2: + //------------------------------------------------------------- + // 1. insert Xobject in Resources + //------------------------------------------------------------- + resources = pdf_dict_get_inheritable(gctx, tpageref, PDF_NAME(Resources)); + subres = pdf_dict_get(gctx, resources, PDF_NAME(XObject)); + if (!subres) { + subres = pdf_dict_put_dict(gctx, resources, PDF_NAME(XObject), 5); + } + + pdf_dict_puts(gctx, subres, _imgname, xobj2); + + //------------------------------------------------------------- + // 2. 
make and insert new Contents object + //------------------------------------------------------------- + nres = fz_new_buffer(gctx, 50); // buffer for Do-command + fz_append_string(gctx, nres, " q /"); // Do-command + fz_append_string(gctx, nres, _imgname); + fz_append_string(gctx, nres, " Do Q "); + + JM_insert_contents(gctx, pdfout, tpageref, nres, overlay); + fz_drop_buffer(gctx, nres); + } + fz_always(gctx) { + pdf_drop_obj(gctx, xobj1); + pdf_drop_obj(gctx, xobj2); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", rc_xref); + } + + //---------------------------------------------------------------- + // insert an image + //---------------------------------------------------------------- + FITZEXCEPTION(_insert_image, !result) + PyObject * + _insert_image(char *filename=NULL, + struct Pixmap *pixmap=NULL, + PyObject *stream=NULL, + PyObject *imask=NULL, + PyObject *clip=NULL, + int overlay=1, + int rotate=0, + int keep_proportion=1, + int oc=0, + int width=0, + int height=0, + int xref=0, + int alpha=-1, + const char *_imgname=NULL, + PyObject *digests=NULL) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_document *pdf = page->doc; + float w = width, h = height; + fz_pixmap *pm = NULL; + fz_pixmap *pix = NULL; + fz_image *mask = NULL, *zimg = NULL, *image = NULL, *freethis = NULL; + pdf_obj *resources, *xobject, *ref; + fz_buffer *nres = NULL, *imgbuf = NULL, *maskbuf = NULL; + fz_compressed_buffer *cbuf1 = NULL; + int xres, yres, bpc, img_xref = xref, rc_digest = 0; + unsigned char digest[16]; + PyObject *md5_py = NULL, *temp; + const char *template = "\nq\n%g %g %g %g %g %g cm\n/%s Do\nQ\n"; + + fz_try(gctx) { + if (xref > 0) { + ref = pdf_new_indirect(gctx, pdf, xref, 0); + w = pdf_to_int(gctx, + pdf_dict_geta(gctx, ref, + PDF_NAME(Width), PDF_NAME(W))); + h = pdf_to_int(gctx, + pdf_dict_geta(gctx, ref, + PDF_NAME(Height), PDF_NAME(H))); + if ((w + h) == 0) { + RAISEPY(gctx, MSG_IS_NO_IMAGE, PyExc_ValueError); + } + goto have_xref; + } + if (EXISTS(stream)) { + imgbuf = JM_BufferFromBytes(gctx, stream); + goto have_stream; + } + if (filename) { + imgbuf = fz_read_file(gctx, filename); + goto have_stream; + } + // process pixmap --------------------------------- + fz_pixmap *arg_pix = (fz_pixmap *) pixmap; + w = arg_pix->w; + h = arg_pix->h; + fz_md5_pixmap(gctx, arg_pix, digest); + md5_py = PyBytes_FromStringAndSize(digest, 16); + temp = PyDict_GetItem(digests, md5_py); + if (temp) { + img_xref = (int) PyLong_AsLong(temp); + ref = pdf_new_indirect(gctx, page->doc, img_xref, 0); + goto have_xref; + } + if (arg_pix->alpha == 0) { + image = fz_new_image_from_pixmap(gctx, arg_pix, NULL); + } else { + pm = fz_convert_pixmap(gctx, arg_pix, NULL, NULL, NULL, + fz_default_color_params, 1); + pm->alpha = 0; + pm->colorspace = NULL; + mask = fz_new_image_from_pixmap(gctx, pm, NULL); + image = fz_new_image_from_pixmap(gctx, arg_pix, mask); + } + goto have_image; + + // process stream --------------------------------- + have_stream:; + fz_md5 state; + fz_md5_init(&state); + fz_md5_update(&state, imgbuf->data, imgbuf->len); + if (imask != Py_None) { + maskbuf = JM_BufferFromBytes(gctx, imask); + fz_md5_update(&state, maskbuf->data, maskbuf->len); + } + fz_md5_final(&state, digest); + md5_py = PyBytes_FromStringAndSize(digest, 16); + temp = PyDict_GetItem(digests, md5_py); + if (temp) { + img_xref = (int) PyLong_AsLong(temp); + ref = pdf_new_indirect(gctx, page->doc, img_xref, 0); + w = pdf_to_int(gctx, + pdf_dict_geta(gctx, ref, + PDF_NAME(Width), 
PDF_NAME(W))); + h = pdf_to_int(gctx, + pdf_dict_geta(gctx, ref, + PDF_NAME(Height), PDF_NAME(H))); + goto have_xref; + } + image = fz_new_image_from_buffer(gctx, imgbuf); + w = image->w; + h = image->h; + if (imask == Py_None) { + goto have_image; + } + + cbuf1 = fz_compressed_image_buffer(gctx, image); + if (!cbuf1) { + RAISEPY(gctx, "uncompressed image cannot have mask", PyExc_ValueError); + } + bpc = image->bpc; + fz_colorspace *colorspace = image->colorspace; + fz_image_resolution(image, &xres, &yres); + mask = fz_new_image_from_buffer(gctx, maskbuf); + zimg = fz_new_image_from_compressed_buffer(gctx, w, h, + bpc, colorspace, xres, yres, 1, 0, NULL, + NULL, cbuf1, mask); + freethis = image; + image = zimg; + zimg = NULL; + goto have_image; + + have_image:; + ref = pdf_add_image(gctx, pdf, image); + if (oc) { + JM_add_oc_object(gctx, pdf, ref, oc); + } + img_xref = pdf_to_num(gctx, ref); + DICT_SETITEM_DROP(digests, md5_py, Py_BuildValue("i", img_xref)); + rc_digest = 1; + have_xref:; + resources = pdf_dict_get_inheritable(gctx, page->obj, + PDF_NAME(Resources)); + if (!resources) { + resources = pdf_dict_put_dict(gctx, page->obj, + PDF_NAME(Resources), 2); + } + xobject = pdf_dict_get(gctx, resources, PDF_NAME(XObject)); + if (!xobject) { + xobject = pdf_dict_put_dict(gctx, resources, + PDF_NAME(XObject), 2); + } + fz_matrix mat = calc_image_matrix(w, h, clip, rotate, keep_proportion); + pdf_dict_puts_drop(gctx, xobject, _imgname, ref); + nres = fz_new_buffer(gctx, 50); + fz_append_printf(gctx, nres, template, + mat.a, mat.b, mat.c, mat.d, mat.e, mat.f, _imgname); + JM_insert_contents(gctx, pdf, page->obj, nres, overlay); + } + fz_always(gctx) { + if (freethis) { + fz_drop_image(gctx, freethis); + } else { + fz_drop_image(gctx, image); + } + fz_drop_image(gctx, mask); + fz_drop_image(gctx, zimg); + fz_drop_pixmap(gctx, pix); + fz_drop_pixmap(gctx, pm); + fz_drop_buffer(gctx, imgbuf); + fz_drop_buffer(gctx, maskbuf); + fz_drop_buffer(gctx, nres); + } + fz_catch(gctx) { + return NULL; + } + + if (rc_digest) { + return Py_BuildValue("iO", img_xref, digests); + } else { + return Py_BuildValue("iO", img_xref, Py_None); + } + } + + + //---------------------------------------------------------------- + // Page.refresh() + //---------------------------------------------------------------- + %pythoncode %{ + def refresh(self): + doc = self.parent + page = doc.reload_page(self) + self = page + %} + + + //---------------------------------------------------------------- + // insert font + //---------------------------------------------------------------- + %pythoncode +%{ +def insert_font(self, fontname="helv", fontfile=None, fontbuffer=None, + set_simple=False, wmode=0, encoding=0): + doc = self.parent + if doc is None: + raise ValueError("orphaned object: parent is None") + idx = 0 + + if fontname.startswith("/"): + fontname = fontname[1:] + inv_chars = INVALID_NAME_CHARS.intersection(fontname) + if inv_chars != set(): + raise ValueError(f"bad fontname chars {inv_chars}") + + font = CheckFont(self, fontname) + if font is not None: # font already in font list of page + xref = font[0] # this is the xref + if CheckFontInfo(doc, xref): # also in our document font list? 
+ return xref # yes: we are done + # need to build the doc FontInfo entry - done via get_char_widths + doc.get_char_widths(xref) + return xref + + #-------------------------------------------------------------------------- + # the font is not present for this page + #-------------------------------------------------------------------------- + + bfname = Base14_fontdict.get(fontname.lower(), None) # BaseFont if Base-14 font + + serif = 0 + CJK_number = -1 + CJK_list_n = ["china-t", "china-s", "japan", "korea"] + CJK_list_s = ["china-ts", "china-ss", "japan-s", "korea-s"] + + try: + CJK_number = CJK_list_n.index(fontname) + serif = 0 + except: + pass + + if CJK_number < 0: + try: + CJK_number = CJK_list_s.index(fontname) + serif = 1 + except: + pass + + if fontname.lower() in fitz_fontdescriptors.keys(): + import pymupdf_fonts + fontbuffer = pymupdf_fonts.myfont(fontname) # make a copy + del pymupdf_fonts + + # install the font for the page + if fontfile != None: + if type(fontfile) is str: + fontfile_str = fontfile + elif hasattr(fontfile, "absolute"): + fontfile_str = str(fontfile) + elif hasattr(fontfile, "name"): + fontfile_str = fontfile.name + else: + raise ValueError("bad fontfile") + else: + fontfile_str = None + val = self._insertFont(fontname, bfname, fontfile_str, fontbuffer, set_simple, idx, + wmode, serif, encoding, CJK_number) + + if not val: # did not work, error return + return val + + xref = val[0] # xref of installed font + fontdict = val[1] + + if CheckFontInfo(doc, xref): # check again: document already has this font + return xref # we are done + + # need to create document font info + doc.get_char_widths(xref, fontdict=fontdict) + return xref + +%} + + FITZEXCEPTION(_insertFont, !result) + PyObject *_insertFont(char *fontname, char *bfname, + char *fontfile, + PyObject *fontbuffer, + int set_simple, int idx, + int wmode, int serif, + int encoding, int ordering) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + pdf_document *pdf; + pdf_obj *resources, *fonts, *font_obj; + PyObject *value; + fz_try(gctx) { + ASSERT_PDF(page); + pdf = page->doc; + + value = JM_insert_font(gctx, pdf, bfname, fontfile,fontbuffer, + set_simple, idx, wmode, serif, encoding, ordering); + + // get the objects /Resources, /Resources/Font + resources = pdf_dict_get_inheritable(gctx, page->obj, PDF_NAME(Resources)); + fonts = pdf_dict_get(gctx, resources, PDF_NAME(Font)); + if (!fonts) { // page has no fonts yet + fonts = pdf_new_dict(gctx, pdf, 5); + pdf_dict_putl_drop(gctx, page->obj, fonts, PDF_NAME(Resources), PDF_NAME(Font), NULL); + } + // store font in resources and fonts objects will contain named reference to font + int xref = 0; + JM_INT_ITEM(value, 0, &xref); + if (!xref) { + RAISEPY(gctx, "cannot insert font", PyExc_RuntimeError); + } + font_obj = pdf_new_indirect(gctx, pdf, xref, 0); + pdf_dict_puts_drop(gctx, fonts, fontname, font_obj); + } + fz_always(gctx) { + ; + } + fz_catch(gctx) { + return NULL; + } + + return value; + } + + //---------------------------------------------------------------- + // Get page transformation matrix + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(transformation_matrix, """Page transformation matrix.""") + %pythonappend transformation_matrix %{ + if self.rotation % 360 == 0: + val = Matrix(val) + else: + val = Matrix(1, 0, 0, -1, 0, self.cropbox.height) + %} + PyObject *transformation_matrix() + { + fz_matrix ctm = fz_identity; + pdf_page *page = 
pdf_page_from_fz_page(gctx, (fz_page *) $self); + if (!page) return JM_py_from_matrix(ctm); + fz_try(gctx) { + pdf_page_transform(gctx, page, NULL, &ctm); + } + fz_catch(gctx) {;} + return JM_py_from_matrix(ctm); + } + + //---------------------------------------------------------------- + // Page Get list of contents objects + //---------------------------------------------------------------- + FITZEXCEPTION(get_contents, !result) + PARENTCHECK(get_contents, """Get xrefs of /Contents objects.""") + PyObject *get_contents() + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) $self); + PyObject *list = NULL; + pdf_obj *contents = NULL, *icont = NULL; + int i, xref; + size_t n = 0; + fz_try(gctx) { + ASSERT_PDF(page); + contents = pdf_dict_get(gctx, page->obj, PDF_NAME(Contents)); + if (pdf_is_array(gctx, contents)) { + n = pdf_array_len(gctx, contents); + list = PyList_New(n); + for (i = 0; i < n; i++) { + icont = pdf_array_get(gctx, contents, i); + xref = pdf_to_num(gctx, icont); + PyList_SET_ITEM(list, i, Py_BuildValue("i", xref)); + } + } + else if (contents) { + list = PyList_New(1); + xref = pdf_to_num(gctx, contents); + PyList_SET_ITEM(list, 0, Py_BuildValue("i", xref)); + } + } + fz_catch(gctx) { + return NULL; + } + if (list) { + return list; + } + return PyList_New(0); + } + + //---------------------------------------------------------------- + // + //---------------------------------------------------------------- + %pythoncode %{ + def set_contents(self, xref: int)->None: + """Set object at 'xref' as the page's /Contents.""" + CheckParent(self) + doc = self.parent + if doc.is_closed: + raise ValueError("document closed") + if not doc.is_pdf: + raise ValueError("is no PDF") + if not xref in range(1, doc.xref_length()): + raise ValueError("bad xref") + if not doc.xref_is_stream(xref): + raise ValueError("xref is no stream") + doc.xref_set_key(self.xref, "Contents", "%i 0 R" % xref) + + + @property + def is_wrapped(self): + """Check if /Contents is wrapped with string pair "q" / "Q".""" + if getattr(self, "was_wrapped", False): # costly checks only once + return True + cont = self.read_contents().split() + if cont == []: # no contents treated as okay + self.was_wrapped = True + return True + if cont[0] != b"q" or cont[-1] != b"Q": + return False # potential "geometry" issue + self.was_wrapped = True # cheap check next time + return True + + + def wrap_contents(self): + if self.is_wrapped: # avoid unnecessary wrapping + return + TOOLS._insert_contents(self, b"q\n", False) + TOOLS._insert_contents(self, b"\nQ", True) + self.was_wrapped = True # indicate not needed again + + + def links(self, kinds=None): + """ Generator over the links of a page. + + Args: + kinds: (list) link kinds to subselect from. If none, + all links are returned. E.g. kinds=[LINK_URI] + will only yield URI links. + """ + all_links = self.get_links() + for link in all_links: + if kinds is None or link["kind"] in kinds: + yield (link) + + + def annots(self, types=None): + """ Generator over the annotations of a page. + + Args: + types: (list) annotation types to subselect from. If none, + all annotations are returned. E.g. types=[PDF_ANNOT_LINE] + will only yield line annotations. 
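+
+            Example:
+                An illustrative sketch, assuming 'page' is a Page object that
+                carries highlight annotations::
+
+                    for annot in page.annots(types=[PDF_ANNOT_HIGHLIGHT]):
+                        print(annot.rect)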
+            """
+            skip_types = (PDF_ANNOT_LINK, PDF_ANNOT_POPUP, PDF_ANNOT_WIDGET)
+            if not hasattr(types, "__getitem__"):
+                annot_xrefs = [a[0] for a in self.annot_xrefs() if a[1] not in skip_types]
+            else:
+                annot_xrefs = [a[0] for a in self.annot_xrefs() if a[1] in types and a[1] not in skip_types]
+            for xref in annot_xrefs:
+                annot = self.load_annot(xref)
+                annot._yielded=True
+                yield annot
+
+
+        def widgets(self, types=None):
+            """ Generator over the widgets of a page.
+
+            Args:
+                types: (list) field types to subselect from. If none,
+                    all fields are returned. E.g. types=[PDF_WIDGET_TYPE_TEXT]
+                    will only yield text fields.
+            """
+            widget_xrefs = [a[0] for a in self.annot_xrefs() if a[1] == PDF_ANNOT_WIDGET]
+            for xref in widget_xrefs:
+                widget = self.load_widget(xref)
+                if types == None or widget.field_type in types:
+                    yield (widget)
+
+
+        def __str__(self):
+            CheckParent(self)
+            x = self.parent.name
+            if self.parent.stream is not None:
+                x = "<memory, doc# %i>" % (self.parent._graft_id,)
+            if x == "":
+                x = "<new PDF, doc# %i>" % self.parent._graft_id
+            return "page %s of %s" % (self.number, x)
+
+        def __repr__(self):
+            CheckParent(self)
+            x = self.parent.name
+            if self.parent.stream is not None:
+                x = "<memory, doc# %i>" % (self.parent._graft_id,)
+            if x == "":
+                x = "<new PDF, doc# %i>" % self.parent._graft_id
+            return "page %s of %s" % (self.number, x)
+
+        def _reset_annot_refs(self):
+            """Invalidate / delete all annots of this page."""
+            for annot in self._annot_refs.values():
+                if annot:
+                    annot._erase()
+            self._annot_refs.clear()
+
+        @property
+        def xref(self):
+            """PDF xref number of page."""
+            CheckParent(self)
+            return self.parent.page_xref(self.number)
+
+        def _erase(self):
+            self._reset_annot_refs()
+            self._image_infos = None
+            try:
+                self.parent._forget_page(self)
+            except:
+                pass
+            if getattr(self, "thisown", False):
+                self.__swig_destroy__(self)
+            self.parent = None
+            self.number = None
+
+
+        def __del__(self):
+            self._erase()
+
+
+        def get_fonts(self, full=False):
+            """List of fonts defined in the page object."""
+            CheckParent(self)
+            return self.parent.get_page_fonts(self.number, full=full)
+
+
+        def get_images(self, full=False):
+            """List of images defined in the page object."""
+            CheckParent(self)
+            ret = self.parent.get_page_images(self.number, full=full)
+            return ret
+
+
+        def get_xobjects(self):
+            """List of xobjects defined in the page object."""
+            CheckParent(self)
+            return self.parent.get_page_xobjects(self.number)
+
+
+        def read_contents(self):
+            """All /Contents streams concatenated to one bytes object."""
+            return TOOLS._get_all_contents(self)
+
+
+        @property
+        def mediabox_size(self):
+            return Point(self.mediabox.x1, self.mediabox.y1)
+        %}
+    }
+};
+%clearnodefaultctor;
+
+//------------------------------------------------------------------------
+// Pixmap
+//------------------------------------------------------------------------
+struct Pixmap
+{
+    %extend {
+        ~Pixmap() {
+            DEBUGMSG1("Pixmap");
+            fz_pixmap *this_pix = (fz_pixmap *) $self;
+            fz_drop_pixmap(gctx, this_pix);
+            DEBUGMSG2;
+        }
+        FITZEXCEPTION(Pixmap, !result)
+        %pythonprepend Pixmap
+%{"""Pixmap(colorspace, irect, alpha) - empty pixmap.
+Pixmap(colorspace, src) - copy changing colorspace.
+Pixmap(src, width, height,[clip]) - scaled copy, float dimensions.
+Pixmap(src, alpha=True) - copy adding / dropping alpha.
+Pixmap(source, mask) - from a non-alpha and a mask pixmap.
+Pixmap(file) - from an image file.
+Pixmap(memory) - from an image in memory (bytes).
+Pixmap(colorspace, width, height, samples, alpha) - from samples data.
+Pixmap(PDFdoc, xref) - from an image xref in a PDF document. +"""%} + //---------------------------------------------------------------- + // create empty pixmap with colorspace and IRect + //---------------------------------------------------------------- + Pixmap(struct Colorspace *cs, PyObject *bbox, int alpha = 0) + { + fz_pixmap *pm = NULL; + fz_try(gctx) { + pm = fz_new_pixmap_with_bbox(gctx, (fz_colorspace *) cs, JM_irect_from_py(bbox), NULL, alpha); + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) pm; + } + + //---------------------------------------------------------------- + // copy pixmap, converting colorspace + //---------------------------------------------------------------- + Pixmap(struct Colorspace *cs, struct Pixmap *spix) + { + fz_pixmap *pm = NULL; + fz_try(gctx) { + if (!fz_pixmap_colorspace(gctx, (fz_pixmap *) spix)) { + RAISEPY(gctx, "source colorspace must not be None", PyExc_ValueError); + } + fz_colorspace *cspace = NULL; + if (cs) { + cspace = (fz_colorspace *) cs; + } + if (cspace) { + pm = fz_convert_pixmap(gctx, (fz_pixmap *) spix, cspace, NULL, NULL, fz_default_color_params, 1); + } else { + pm = fz_new_pixmap_from_alpha_channel(gctx, (fz_pixmap *) spix); + if (!pm) { + RAISEPY(gctx, MSG_PIX_NOALPHA, PyExc_RuntimeError); + } + } + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) pm; + } + + + //---------------------------------------------------------------- + // add mask to a pixmap w/o alpha channel + //---------------------------------------------------------------- + Pixmap(struct Pixmap *spix, struct Pixmap *mpix) + { + fz_pixmap *dst = NULL; + fz_pixmap *spm = (fz_pixmap *) spix; + fz_pixmap *mpm = (fz_pixmap *) mpix; + fz_try(gctx) { + if (!spix) { // intercept NULL for spix: make alpha only pix + dst = fz_new_pixmap_from_alpha_channel(gctx, mpm); + if (!dst) { + RAISEPY(gctx, MSG_PIX_NOALPHA, PyExc_RuntimeError); + } + } else { + dst = fz_new_pixmap_from_color_and_mask(gctx, spm, mpm); + } + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) dst; + } + + + //---------------------------------------------------------------- + // create pixmap as scaled copy of another one + //---------------------------------------------------------------- + Pixmap(struct Pixmap *spix, float w, float h, PyObject *clip=NULL) + { + fz_pixmap *pm = NULL; + fz_pixmap *src_pix = (fz_pixmap *) spix; + fz_try(gctx) { + fz_irect bbox = JM_irect_from_py(clip); + if (clip != Py_None && (fz_is_infinite_irect(bbox) || fz_is_empty_irect(bbox))) { + RAISEPY(gctx, "bad clip parameter", PyExc_ValueError); + } + if (!fz_is_infinite_irect(bbox)) { + pm = fz_scale_pixmap(gctx, src_pix, src_pix->x, src_pix->y, w, h, &bbox); + } else { + pm = fz_scale_pixmap(gctx, src_pix, src_pix->x, src_pix->y, w, h, NULL); + } + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) pm; + } + + + //---------------------------------------------------------------- + // copy pixmap & add / drop the alpha channel + //---------------------------------------------------------------- + Pixmap(struct Pixmap *spix, int alpha=1) + { + fz_pixmap *pm = NULL, *src_pix = (fz_pixmap *) spix; + int n, w, h, i; + fz_separations *seps = NULL; + fz_try(gctx) { + if (!INRANGE(alpha, 0, 1)) { + RAISEPY(gctx, "bad alpha value", PyExc_ValueError); + } + fz_colorspace *cs = fz_pixmap_colorspace(gctx, src_pix); + if (!cs && !alpha) { + RAISEPY(gctx, "cannot drop alpha for 'NULL' colorspace", PyExc_ValueError); + } + n = fz_pixmap_colorants(gctx, src_pix); + w = 
fz_pixmap_width(gctx, src_pix); + h = fz_pixmap_height(gctx, src_pix); + pm = fz_new_pixmap(gctx, cs, w, h, seps, alpha); + pm->x = src_pix->x; + pm->y = src_pix->y; + pm->xres = src_pix->xres; + pm->yres = src_pix->yres; + + // copy samples data ------------------------------------------ + unsigned char *sptr = src_pix->samples; + unsigned char *tptr = pm->samples; + if (src_pix->alpha == pm->alpha) { // identical samples + memcpy(tptr, sptr, w * h * (n + alpha)); + } else { + for (i = 0; i < w * h; i++) { + memcpy(tptr, sptr, n); + tptr += n; + if (pm->alpha) { + tptr[0] = 255; + tptr++; + } + sptr += n + src_pix->alpha; + } + } + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) pm; + } + + //---------------------------------------------------------------- + // create pixmap from samples data + //---------------------------------------------------------------- + Pixmap(struct Colorspace *cs, int w, int h, PyObject *samples, int alpha=0) + { + int n = fz_colorspace_n(gctx, (fz_colorspace *) cs); + int stride = (n + alpha) * w; + fz_separations *seps = NULL; + fz_buffer *res = NULL; + fz_pixmap *pm = NULL; + fz_try(gctx) { + size_t size = 0; + unsigned char *c = NULL; + res = JM_BufferFromBytes(gctx, samples); + if (!res) { + RAISEPY(gctx, "bad samples data", PyExc_ValueError); + } + size = fz_buffer_storage(gctx, res, &c); + if (stride * h != size) { + RAISEPY(gctx, "bad samples length", PyExc_ValueError); + } + pm = fz_new_pixmap(gctx, (fz_colorspace *) cs, w, h, seps, alpha); + memcpy(pm->samples, c, size); + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) pm; + } + + + //---------------------------------------------------------------- + // create pixmap from filename, file object, pathlib.Path or memory + //---------------------------------------------------------------- + Pixmap(PyObject *imagedata) + { + fz_buffer *res = NULL; + fz_image *img = NULL; + fz_pixmap *pm = NULL; + PyObject *fname = NULL; + PyObject *name = PyUnicode_FromString("name"); + fz_try(gctx) { + if (PyObject_HasAttrString(imagedata, "resolve")) { + fname = PyObject_CallMethod(imagedata, "__str__", NULL); + if (fname) { + img = fz_new_image_from_file(gctx, JM_StrAsChar(fname)); + } + } else if (PyObject_HasAttr(imagedata, name)) { + fname = PyObject_GetAttr(imagedata, name); + if (fname) { + img = fz_new_image_from_file(gctx, JM_StrAsChar(fname)); + } + } else if (PyUnicode_Check(imagedata)) { + img = fz_new_image_from_file(gctx, JM_StrAsChar(imagedata)); + } else { + res = JM_BufferFromBytes(gctx, imagedata); + if (!res || !fz_buffer_storage(gctx, res, NULL)) { + RAISEPY(gctx, "bad image data", PyExc_ValueError); + } + img = fz_new_image_from_buffer(gctx, res); + } + pm = fz_get_pixmap_from_image(gctx, img, NULL, NULL, NULL, NULL); + int xres, yres; + fz_image_resolution(img, &xres, &yres); + pm->xres = xres; + pm->yres = yres; + } + fz_always(gctx) { + Py_CLEAR(fname); + Py_CLEAR(name); + fz_drop_image(gctx, img); + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) pm; + } + + + //---------------------------------------------------------------- + // Create pixmap from PDF image identified by XREF number + //---------------------------------------------------------------- + Pixmap(struct Document *doc, int xref) + { + fz_image *img = NULL; + fz_pixmap *pix = NULL; + pdf_obj *ref = NULL; + pdf_obj *type; + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) doc); + 
fz_try(gctx) { + ASSERT_PDF(pdf); + int xreflen = pdf_xref_len(gctx, pdf); + if (!INRANGE(xref, 1, xreflen-1)) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + ref = pdf_new_indirect(gctx, pdf, xref, 0); + type = pdf_dict_get(gctx, ref, PDF_NAME(Subtype)); + if (!pdf_name_eq(gctx, type, PDF_NAME(Image)) && + !pdf_name_eq(gctx, type, PDF_NAME(Alpha)) && + !pdf_name_eq(gctx, type, PDF_NAME(Luminosity))) { + RAISEPY(gctx, MSG_IS_NO_IMAGE, PyExc_ValueError); + } + img = pdf_load_image(gctx, pdf, ref); + pix = fz_get_pixmap_from_image(gctx, img, NULL, NULL, NULL, NULL); + } + fz_always(gctx) { + fz_drop_image(gctx, img); + pdf_drop_obj(gctx, ref); + } + fz_catch(gctx) { + fz_drop_pixmap(gctx, pix); + return NULL; + } + return (struct Pixmap *) pix; + } + + + //---------------------------------------------------------------- + // warp + //---------------------------------------------------------------- + FITZEXCEPTION(warp, !result) + %pythonprepend warp %{ + """Return pixmap from a warped quad.""" + EnsureOwnership(self) + if not quad.is_convex: raise ValueError("quad must be convex")%} + struct Pixmap *warp(PyObject *quad, int width, int height) + { + fz_point points[4]; + fz_quad q = JM_quad_from_py(quad); + fz_pixmap *dst = NULL; + points[0] = q.ul; + points[1] = q.ur; + points[2] = q.lr; + points[3] = q.ll; + + fz_try(gctx) { + dst = fz_warp_pixmap(gctx, (fz_pixmap *) $self, points, width, height); + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) dst; + } + + + //---------------------------------------------------------------- + // shrink + //---------------------------------------------------------------- + ENSURE_OWNERSHIP(shrink, """Divide width and height by 2**factor. + E.g. factor=1 shrinks to 25% of original size (in place).""") + void shrink(int factor) + { + if (factor < 1) + { + JM_Warning("ignoring shrink factor < 1"); + return; + } + fz_subsample_pixmap(gctx, (fz_pixmap *) $self, factor); + } + + //---------------------------------------------------------------- + // apply gamma correction + //---------------------------------------------------------------- + ENSURE_OWNERSHIP(gamma_with, """Apply correction with some float. 
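+Values below 1 lighten the pixmap, values above 1 darken it.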
+gamma=1 is a no-op.""") + void gamma_with(float gamma) + { + if (!fz_pixmap_colorspace(gctx, (fz_pixmap *) $self)) + { + JM_Warning("colorspace invalid for function"); + return; + } + fz_gamma_pixmap(gctx, (fz_pixmap *) $self, gamma); + } + + //---------------------------------------------------------------- + // tint pixmap with color + //---------------------------------------------------------------- + %pythonprepend tint_with +%{"""Tint colors with modifiers for black and white.""" +EnsureOwnership(self) +if not self.colorspace or self.colorspace.n > 3: + print("warning: colorspace invalid for function") + return%} + void tint_with(int black, int white) + { + fz_tint_pixmap(gctx, (fz_pixmap *) $self, black, white); + } + + //----------------------------------------------------------------- + // clear all of pixmap samples to 0x00 */ + //----------------------------------------------------------------- + ENSURE_OWNERSHIP(clear_with, """Fill all color components with same value.""") + void clear_with() + { + fz_clear_pixmap(gctx, (fz_pixmap *) $self); + } + + //----------------------------------------------------------------- + // clear total pixmap with value */ + //----------------------------------------------------------------- + void clear_with(int value) + { + fz_clear_pixmap_with_value(gctx, (fz_pixmap *) $self, value); + } + + //----------------------------------------------------------------- + // clear pixmap rectangle with value + //----------------------------------------------------------------- + void clear_with(int value, PyObject *bbox) + { + JM_clear_pixmap_rect_with_value(gctx, (fz_pixmap *) $self, value, JM_irect_from_py(bbox)); + } + + //----------------------------------------------------------------- + // copy pixmaps + //----------------------------------------------------------------- + FITZEXCEPTION(copy, !result) + ENSURE_OWNERSHIP(copy, """Copy bbox from another Pixmap.""") + PyObject *copy(struct Pixmap *src, PyObject *bbox) + { + fz_try(gctx) { + fz_pixmap *pm = (fz_pixmap *) $self, *src_pix = (fz_pixmap *) src; + if (!fz_pixmap_colorspace(gctx, src_pix)) { + RAISEPY(gctx, "cannot copy pixmap with NULL colorspace", PyExc_ValueError); + } + if (pm->alpha != src_pix->alpha) { + RAISEPY(gctx, "source and target alpha must be equal", PyExc_ValueError); + } + fz_copy_pixmap_rect(gctx, pm, src_pix, JM_irect_from_py(bbox), NULL); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //----------------------------------------------------------------- + // set alpha values + //----------------------------------------------------------------- + FITZEXCEPTION(set_alpha, !result) + ENSURE_OWNERSHIP(set_alpha, """Set alpha channel to values contained in a byte array. +If None, all alphas are 255. + +Args: + alphavalues: (bytes) with length (width * height) or 'None'. + premultiply: (bool, True) premultiply colors with alpha values. + opaque: (tuple, length colorspace.n) this color receives opacity 0. + matte: (tuple, length colorspace.n) preblending background color. 
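+
+Example (an illustrative sketch; assumes the pixmap was created with alpha):
+    pix.set_alpha(None)   # set every alpha byte to 255, i.e. fully opaque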
+""") + PyObject *set_alpha(PyObject *alphavalues=NULL, int premultiply=1, PyObject *opaque=NULL, PyObject *matte=NULL) + { + fz_buffer *res = NULL; + fz_pixmap *pix = (fz_pixmap *) $self; + unsigned char alpha = 0, m = 0; + fz_try(gctx) { + if (pix->alpha == 0) { + RAISEPY(gctx, MSG_PIX_NOALPHA, PyExc_ValueError); + } + size_t i, k, j; + size_t n = fz_pixmap_colorants(gctx, pix); + size_t w = (size_t) fz_pixmap_width(gctx, pix); + size_t h = (size_t) fz_pixmap_height(gctx, pix); + size_t balen = w * h * (n+1); + int colors[4]; // make this color opaque + int bgcolor[4]; // preblending background color + int zero_out = 0, bground = 0; + if (opaque && PySequence_Check(opaque) && PySequence_Size(opaque) == n) { + for (i = 0; i < n; i++) { + if (JM_INT_ITEM(opaque, i, &colors[i]) == 1) { + RAISEPY(gctx, "bad opaque components", PyExc_ValueError); + } + } + zero_out = 1; + } + if (matte && PySequence_Check(matte) && PySequence_Size(matte) == n) { + for (i = 0; i < n; i++) { + if (JM_INT_ITEM(matte, i, &bgcolor[i]) == 1) { + RAISEPY(gctx, "bad matte components", PyExc_ValueError); + } + } + bground = 1; + } + unsigned char *data = NULL; + size_t data_len = 0; + if (alphavalues && PyObject_IsTrue(alphavalues)) { + res = JM_BufferFromBytes(gctx, alphavalues); + data_len = fz_buffer_storage(gctx, res, &data); + if (data_len < w * h) { + RAISEPY(gctx, "bad alpha values", PyExc_ValueError); + } + } + i = k = j = 0; + int data_fix = 255; + while (i < balen) { + alpha = data[k]; + if (zero_out) { + for (j = i; j < i+n; j++) { + if (pix->samples[j] != (unsigned char) colors[j - i]) { + data_fix = 255; + break; + } else { + data_fix = 0; + } + } + } + if (data_len) { + if (data_fix == 0) { + pix->samples[i+n] = 0; + } else { + pix->samples[i+n] = alpha; + } + if (premultiply && !bground) { + for (j = i; j < i+n; j++) { + pix->samples[j] = fz_mul255(pix->samples[j], alpha); + } + } else if (bground) { + for (j = i; j < i+n; j++) { + m = (unsigned char) bgcolor[j - i]; + pix->samples[j] = m + fz_mul255((pix->samples[j] - m), alpha); + } + } + } else { + pix->samples[i+n] = data_fix; + } + i += n+1; + k += 1; + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //----------------------------------------------------------------- + // Pixmap._tobytes + //----------------------------------------------------------------- + FITZEXCEPTION(_tobytes, !result) + PyObject *_tobytes(int format, int jpg_quality) + { + fz_output *out = NULL; + fz_buffer *res = NULL; + PyObject *barray = NULL; + fz_pixmap *pm = (fz_pixmap *) $self; + fz_try(gctx) { + size_t size = fz_pixmap_stride(gctx, pm) * pm->h; + res = fz_new_buffer(gctx, size); + out = fz_new_output_with_buffer(gctx, res); + + switch(format) { + case(1): + fz_write_pixmap_as_png(gctx, out, pm); + break; + case(2): + fz_write_pixmap_as_pnm(gctx, out, pm); + break; + case(3): + fz_write_pixmap_as_pam(gctx, out, pm); + break; + case(5): // Adobe Photoshop Document + fz_write_pixmap_as_psd(gctx, out, pm); + break; + case(6): // Postscript format + fz_write_pixmap_as_ps(gctx, out, pm); + break; + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR >= 22 + case(7): // JPEG format + #if FZ_VERSION_MINOR < 24 + fz_write_pixmap_as_jpeg(gctx, out, pm, jpg_quality); + #else + fz_write_pixmap_as_jpeg(gctx, out, pm, jpg_quality, 0 /*invert_cmyk*/); + #endif + break; + #endif + default: + fz_write_pixmap_as_png(gctx, out, pm); + break; + } + barray = JM_BinFromBuffer(gctx, res); + } + fz_always(gctx) { + 
fz_drop_output(gctx, out); + fz_drop_buffer(gctx, res); + } + + fz_catch(gctx) { + return NULL; + } + return barray; + } + + %pythoncode %{ +def tobytes(self, output="png", jpg_quality=95): + """Convert to binary image stream of desired type. + + Can be used as input to GUI packages like tkinter. + + Args: + output: (str) image type, default is PNG. Others are JPG, JPEG, PNM, PGM, PPM, + PBM, PAM, PSD, PS. + Returns: + Bytes object. + """ + EnsureOwnership(self) + valid_formats = {"png": 1, "pnm": 2, "pgm": 2, "ppm": 2, "pbm": 2, + "pam": 3, "psd": 5, "ps": 6, "jpg": 7, "jpeg": 7} + + idx = valid_formats.get(output.lower(), None) + if idx==None: + raise ValueError(f"Image format {output} not in {tuple(valid_formats.keys())}") + if self.alpha and idx in (2, 6, 7): + raise ValueError("'%s' cannot have alpha" % output) + if self.colorspace and self.colorspace.n > 3 and idx in (1, 2, 4): + raise ValueError("unsupported colorspace for '%s'" % output) + if idx == 7: + self.set_dpi(self.xres, self.yres) + barray = self._tobytes(idx, jpg_quality) + return barray + %} + + + //----------------------------------------------------------------- + // output as PDF-OCR + //----------------------------------------------------------------- + FITZEXCEPTION(pdfocr_save, !result) + %pythonprepend pdfocr_save %{ + """Save pixmap as an OCR-ed PDF page.""" + EnsureOwnership(self) + if not os.getenv("TESSDATA_PREFIX") and not tessdata: + raise RuntimeError("No OCR support: TESSDATA_PREFIX not set") + %} + ENSURE_OWNERSHIP(pdfocr_save, ) + PyObject *pdfocr_save(PyObject *filename, int compress=1, char *language=NULL, char *tessdata=NULL) + { + fz_pdfocr_options opts; + memset(&opts, 0, sizeof opts); + opts.compress = compress; + if (language) { + fz_strlcpy(opts.language, language, sizeof(opts.language)); + } + if (tessdata) { + fz_strlcpy(opts.datadir, tessdata, sizeof(opts.language)); + } + fz_output *out = NULL; + fz_pixmap *pix = (fz_pixmap *) $self; + fz_try(gctx) { + if (PyUnicode_Check(filename)) { + fz_save_pixmap_as_pdfocr(gctx, pix, (char *) PyUnicode_AsUTF8(filename), 0, &opts); + } else { + out = JM_new_output_fileptr(gctx, filename); + fz_write_pixmap_as_pdfocr(gctx, out, pix, &opts); + } + } + fz_always(gctx) { + fz_drop_output(gctx, out); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + %pythoncode %{ + def pdfocr_tobytes(self, compress=True, language="eng", tessdata=None): + """Save pixmap as an OCR-ed PDF page. + + Args: + compress: (bool) compress, default 1 (True). + language: (str) language(s) occurring on page, default "eng" (English), + multiples like "eng+ger" for English and German. + tessdata: (str) folder name of Tesseract's language support. Must be + given if environment variable TESSDATA_PREFIX is not set. + Notes: + On failure, make sure Tesseract is installed and you have set the + environment variable "TESSDATA_PREFIX" to the folder containing your + Tesseract's language support data. 
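+
+        Example:
+            An illustrative sketch (the tessdata folder shown is an assumption):
+
+                pdfbytes = pix.pdfocr_tobytes(language="eng", tessdata="/usr/share/tessdata")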
+ """ + if not os.getenv("TESSDATA_PREFIX") and not tessdata: + raise RuntimeError("No OCR support: TESSDATA_PREFIX not set") + EnsureOwnership(self) + from io import BytesIO + bio = BytesIO() + self.pdfocr_save(bio, compress=compress, language=language, tessdata=tessdata) + return bio.getvalue() + %} + + + //----------------------------------------------------------------- + // _writeIMG + //----------------------------------------------------------------- + FITZEXCEPTION(_writeIMG, !result) + PyObject *_writeIMG(char *filename, int format, int jpg_quality) + { + fz_try(gctx) { + fz_pixmap *pm = (fz_pixmap *) $self; + switch(format) { + case(1): + fz_save_pixmap_as_png(gctx, pm, filename); + break; + case(2): + fz_save_pixmap_as_pnm(gctx, pm, filename); + break; + case(3): + fz_save_pixmap_as_pam(gctx, pm, filename); + break; + case(5): // Adobe Photoshop Document + fz_save_pixmap_as_psd(gctx, pm, filename); + break; + case(6): // Postscript + fz_save_pixmap_as_ps(gctx, pm, filename, 0); + break; + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR >= 22 + case(7): // JPEG + fz_save_pixmap_as_jpeg(gctx, pm, filename, jpg_quality); + break; + #endif + default: + fz_save_pixmap_as_png(gctx, pm, filename); + break; + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + %pythoncode %{ +def save(self, filename, output=None, jpg_quality=95): + """Output as image in format determined by filename extension. + + Args: + output: (str) only use to overrule filename extension. Default is PNG. + Others are JPEG, JPG, PNM, PGM, PPM, PBM, PAM, PSD, PS. + """ + EnsureOwnership(self) + valid_formats = {"png": 1, "pnm": 2, "pgm": 2, "ppm": 2, "pbm": 2, + "pam": 3, "psd": 5, "ps": 6, "jpg": 7, "jpeg": 7} + + if type(filename) is str: + pass + elif hasattr(filename, "absolute"): + filename = str(filename) + elif hasattr(filename, "name"): + filename = filename.name + if output is None: + _, ext = os.path.splitext(filename) + output = ext[1:] + + idx = valid_formats.get(output.lower(), None) + if idx == None: + raise ValueError(f"Image format {output} not in {tuple(valid_formats.keys())}") + if self.alpha and idx in (2, 6, 7): + raise ValueError("'%s' cannot have alpha" % output) + if self.colorspace and self.colorspace.n > 3 and idx in (1, 2, 4): + raise ValueError("unsupported colorspace for '%s'" % output) + if idx == 7: + self.set_dpi(self.xres, self.yres) + return self._writeIMG(filename, idx, jpg_quality) + +def pil_save(self, *args, unmultiply=False, **kwargs): + """Write to image file using Pillow. + + Args are passed to Pillow's Image.save method, see their documentation. + Use instead of save when other output formats are desired. + + :arg bool unmultiply: generates Pillow mode "RGBa" instead of "RGBA". + Relevant for colorspace RGB with alpha only. + """ + EnsureOwnership(self) + try: + from PIL import Image + except ImportError: + print("Pillow not installed") + raise + + cspace = self.colorspace + if cspace is None: + mode = "L" + elif cspace.n == 1: + mode = "L" if self.alpha == 0 else "LA" + elif cspace.n == 3: + mode = "RGB" if self.alpha == 0 else "RGBA" + if mode == "RGBA" and unmultiply: + mode = "RGBa" + else: + mode = "CMYK" + + img = Image.frombytes(mode, (self.width, self.height), self.samples) + + if "dpi" not in kwargs.keys(): + kwargs["dpi"] = (self.xres, self.yres) + + img.save(*args, **kwargs) + +def pil_tobytes(self, *args, unmultiply=False, **kwargs): + """Convert to binary image stream using pillow. 
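+    For example (an illustrative sketch), 'pix.pil_tobytes(format="WEBP")' would return
+    the image encoded as WEBP bytes, provided Pillow's WEBP support is available.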
+ + Args are passed to Pillow's Image.save method, see their documentation. + Use instead of 'tobytes' when other output formats are needed. + """ + EnsureOwnership(self) + from io import BytesIO + bytes_out = BytesIO() + self.pil_save(bytes_out, *args, unmultiply=unmultiply, **kwargs) + return bytes_out.getvalue() + + %} + //----------------------------------------------------------------- + // invert_irect + //----------------------------------------------------------------- + %pythonprepend invert_irect + %{"""Invert the colors inside a bbox."""%} + PyObject *invert_irect(PyObject *bbox = NULL) + { + fz_pixmap *pm = (fz_pixmap *) $self; + if (!fz_pixmap_colorspace(gctx, pm)) + { + JM_Warning("ignored for stencil pixmap"); + return JM_BOOL(0); + } + + fz_irect r = JM_irect_from_py(bbox); + if (fz_is_infinite_irect(r)) + r = fz_pixmap_bbox(gctx, pm); + + return JM_BOOL(JM_invert_pixmap_rect(gctx, pm, r)); + } + + //----------------------------------------------------------------- + // get one pixel as a list + //----------------------------------------------------------------- + FITZEXCEPTION(pixel, !result) + ENSURE_OWNERSHIP(pixel, """Get color tuple of pixel (x, y). +Includes alpha byte if applicable.""") + PyObject *pixel(int x, int y) + { + PyObject *p = NULL; + fz_try(gctx) { + fz_pixmap *pm = (fz_pixmap *) $self; + if (!INRANGE(x, 0, pm->w - 1) || !INRANGE(y, 0, pm->h - 1)) { + RAISEPY(gctx, MSG_PIXEL_OUTSIDE, PyExc_ValueError); + } + int n = pm->n; + int stride = fz_pixmap_stride(gctx, pm); + int j, i = stride * y + n * x; + p = PyTuple_New(n); + for (j = 0; j < n; j++) { + PyTuple_SET_ITEM(p, j, Py_BuildValue("i", pm->samples[i + j])); + } + } + fz_catch(gctx) { + return NULL; + } + return p; + } + + //----------------------------------------------------------------- + // Set one pixel to a given color tuple + //----------------------------------------------------------------- + FITZEXCEPTION(set_pixel, !result) + ENSURE_OWNERSHIP(set_pixel, """Set color of pixel (x, y).""") + PyObject *set_pixel(int x, int y, PyObject *color) + { + fz_try(gctx) { + fz_pixmap *pm = (fz_pixmap *) $self; + if (!INRANGE(x, 0, pm->w - 1) || !INRANGE(y, 0, pm->h - 1)) { + RAISEPY(gctx, MSG_PIXEL_OUTSIDE, PyExc_ValueError); + } + int n = pm->n; + if (!PySequence_Check(color) || PySequence_Size(color) != n) { + RAISEPY(gctx, MSG_BAD_COLOR_SEQ, PyExc_ValueError); + } + int i, j; + unsigned char c[5]; + for (j = 0; j < n; j++) { + if (JM_INT_ITEM(color, j, &i) == 1) { + RAISEPY(gctx, MSG_BAD_COLOR_SEQ, PyExc_ValueError); + } + if (!INRANGE(i, 0, 255)) { + RAISEPY(gctx, MSG_BAD_COLOR_SEQ, PyExc_ValueError); + } + c[j] = (unsigned char) i; + } + int stride = fz_pixmap_stride(gctx, pm); + i = stride * y + n * x; + for (j = 0; j < n; j++) { + pm->samples[i + j] = c[j]; + } + } + fz_catch(gctx) { + PyErr_Clear(); + return NULL; + } + Py_RETURN_NONE; + } + + + //----------------------------------------------------------------- + // Set Pixmap origin + //----------------------------------------------------------------- + ENSURE_OWNERSHIP(set_origin, """Set top-left coordinates.""") + PyObject *set_origin(int x, int y) + { + fz_pixmap *pm = (fz_pixmap *) $self; + pm->x = x; + pm->y = y; + Py_RETURN_NONE; + } + + ENSURE_OWNERSHIP(set_dpi, """Set resolution in both dimensions.""") + PyObject *set_dpi(int xres, int yres) + { + fz_pixmap *pm = (fz_pixmap *) $self; + pm->xres = xres; + pm->yres = yres; + Py_RETURN_NONE; + } + + //----------------------------------------------------------------- + // Set a rect to a 
given color tuple + //----------------------------------------------------------------- + FITZEXCEPTION(set_rect, !result) + ENSURE_OWNERSHIP(set_rect, """Set color of all pixels in bbox.""") + PyObject *set_rect(PyObject *bbox, PyObject *color) + { + PyObject *rc = NULL; + fz_try(gctx) { + fz_pixmap *pm = (fz_pixmap *) $self; + Py_ssize_t j, n = (Py_ssize_t) pm->n; + if (!PySequence_Check(color) || PySequence_Size(color) != n) { + RAISEPY(gctx, MSG_BAD_COLOR_SEQ, PyExc_ValueError); + } + unsigned char c[5]; + int i; + for (j = 0; j < n; j++) { + if (JM_INT_ITEM(color, j, &i) == 1) { + RAISEPY(gctx, MSG_BAD_COLOR_SEQ, PyExc_ValueError); + } + if (!INRANGE(i, 0, 255)) { + RAISEPY(gctx, MSG_BAD_COLOR_SEQ, PyExc_ValueError); + } + c[j] = (unsigned char) i; + } + i = JM_fill_pixmap_rect_with_color(gctx, pm, c, JM_irect_from_py(bbox)); + rc = JM_BOOL(i); + } + fz_catch(gctx) { + PyErr_Clear(); + return NULL; + } + return rc; + } + + //----------------------------------------------------------------- + // check if monochrome + //----------------------------------------------------------------- + %pythoncode %{@property%} + ENSURE_OWNERSHIP(is_monochrome, """Check if pixmap is monochrome.""") + PyObject *is_monochrome() + { + return JM_BOOL(fz_is_pixmap_monochrome(gctx, (fz_pixmap *) $self)); + } + + //----------------------------------------------------------------- + // check if unicolor (only one color there) + //----------------------------------------------------------------- + %pythoncode %{@property%} + ENSURE_OWNERSHIP(is_unicolor, """Check if pixmap has only one color.""") + PyObject *is_unicolor() + { + fz_pixmap *pm = (fz_pixmap *) $self; + size_t i, n = pm->n, count = pm->w * pm->h * n; + unsigned char *s = pm->samples; + for (i = n; i < count; i += n) { + if (memcmp(s, s + i, n) != 0) { + Py_RETURN_FALSE; + } + } + Py_RETURN_TRUE; + } + + + //----------------------------------------------------------------- + // count each pixmap color + //----------------------------------------------------------------- + FITZEXCEPTION(color_count, !result) + ENSURE_OWNERSHIP(color_count, """Return count of each color.""") + PyObject *color_count(int colors=0, PyObject *clip=NULL) + { + fz_pixmap *pm = (fz_pixmap *) $self; + PyObject *rc = NULL; + fz_try(gctx) { + rc = JM_color_count(gctx, pm, clip); + if (!rc) { + RAISEPY(gctx, MSG_COLOR_COUNT_FAILED, PyExc_RuntimeError); + } + } + fz_catch(gctx) { + return NULL; + } + if (!colors) { + Py_ssize_t len = PyDict_Size(rc); + Py_DECREF(rc); + return PyLong_FromSsize_t(len); + } + return rc; + } + + %pythoncode %{ + def color_topusage(self, clip=None): + """Return most frequent color and its usage ratio.""" + EnsureOwnership(self) + allpixels = 0 + cnt = 0 + if clip != None and self.irect in Rect(clip): + clip = self.irect + for pixel, count in self.color_count(colors=True,clip=clip).items(): + allpixels += count + if count > cnt: + cnt = count + maxpixel = pixel + if not allpixels: + return (1, bytes([255] * self.n)) + return (cnt / allpixels, maxpixel) + + %} + + //----------------------------------------------------------------- + // MD5 digest of pixmap + //----------------------------------------------------------------- + %pythoncode %{@property%} + ENSURE_OWNERSHIP(digest, """MD5 digest of pixmap (bytes).""") + PyObject *digest() + { + unsigned char digest[16]; + fz_md5_pixmap(gctx, (fz_pixmap *) $self, digest); + return PyBytes_FromStringAndSize(digest, 16); + } + + //----------------------------------------------------------------- + // get 
length of one image row + //----------------------------------------------------------------- + %pythoncode %{@property%} + ENSURE_OWNERSHIP(stride, """Length of one image line (width * n).""") + PyObject *stride() + { + return PyLong_FromSize_t((size_t) fz_pixmap_stride(gctx, (fz_pixmap *) $self)); + } + + //----------------------------------------------------------------- + // x, y, width, height, xres, yres, n + //----------------------------------------------------------------- + %pythoncode %{@property%} + ENSURE_OWNERSHIP(xres, """Resolution in x direction.""") + int xres() + { + fz_pixmap *this_pix = (fz_pixmap *) $self; + return this_pix->xres; + } + + %pythoncode %{@property%} + ENSURE_OWNERSHIP(yres, """Resolution in y direction.""") + int yres() + { + fz_pixmap *this_pix = (fz_pixmap *) $self; + return this_pix->yres; + } + + %pythoncode %{@property%} + ENSURE_OWNERSHIP(w, """The width.""") + PyObject *w() + { + return PyLong_FromSize_t((size_t) fz_pixmap_width(gctx, (fz_pixmap *) $self)); + } + + %pythoncode %{@property%} + ENSURE_OWNERSHIP(h, """The height.""") + PyObject *h() + { + return PyLong_FromSize_t((size_t) fz_pixmap_height(gctx, (fz_pixmap *) $self)); + } + + %pythoncode %{@property%} + ENSURE_OWNERSHIP(x, """x component of Pixmap origin.""") + int x() + { + return fz_pixmap_x(gctx, (fz_pixmap *) $self); + } + + %pythoncode %{@property%} + ENSURE_OWNERSHIP(y, """y component of Pixmap origin.""") + int y() + { + return fz_pixmap_y(gctx, (fz_pixmap *) $self); + } + + %pythoncode %{@property%} + ENSURE_OWNERSHIP(n, """The size of one pixel.""") + int n() + { + return fz_pixmap_components(gctx, (fz_pixmap *) $self); + } + + //----------------------------------------------------------------- + // check alpha channel + //----------------------------------------------------------------- + %pythoncode %{@property%} + ENSURE_OWNERSHIP(alpha, """Indicates presence of alpha channel.""") + int alpha() + { + return fz_pixmap_alpha(gctx, (fz_pixmap *) $self); + } + + //----------------------------------------------------------------- + // get colorspace of pixmap + //----------------------------------------------------------------- + %pythoncode %{@property%} + ENSURE_OWNERSHIP(colorspace, """Pixmap Colorspace.""") + struct Colorspace *colorspace() + { + return (struct Colorspace *) fz_pixmap_colorspace(gctx, (fz_pixmap *) $self); + } + + //----------------------------------------------------------------- + // return irect of pixmap + //----------------------------------------------------------------- + %pythoncode %{@property%} + ENSURE_OWNERSHIP(irect, """Pixmap bbox - an IRect object.""") + %pythonappend irect %{val = IRect(val)%} + PyObject *irect() + { + return JM_py_from_irect(fz_pixmap_bbox(gctx, (fz_pixmap *) $self)); + } + + //----------------------------------------------------------------- + // return size of pixmap + //----------------------------------------------------------------- + %pythoncode %{@property%} + ENSURE_OWNERSHIP(size, """Pixmap size.""") + PyObject *size() + { + return PyLong_FromSize_t(fz_pixmap_size(gctx, (fz_pixmap *) $self)); + } + + //----------------------------------------------------------------- + // samples + //----------------------------------------------------------------- + %pythoncode %{@property%} + ENSURE_OWNERSHIP(samples_mv, """Pixmap samples memoryview.""") + PyObject *samples_mv() + { + fz_pixmap *pm = (fz_pixmap *) $self; + Py_ssize_t s = (Py_ssize_t) pm->w; + s *= pm->h; + s *= pm->n; + return PyMemoryView_FromMemory((char *) 
pm->samples, s, PyBUF_READ); + } + + + %pythoncode %{@property%} + ENSURE_OWNERSHIP(samples_ptr, """Pixmap samples pointer.""") + PyObject *samples_ptr() + { + fz_pixmap *pm = (fz_pixmap *) $self; + return PyLong_FromVoidPtr((void *) pm->samples); + } + + %pythoncode %{ + @property + def samples(self)->bytes: + return bytes(self.samples_mv) + + width = w + height = h + + def __len__(self): + return self.size + + def __repr__(self): + EnsureOwnership(self) + if not type(self) is Pixmap: return + if self.colorspace: + return "Pixmap(%s, %s, %s)" % (self.colorspace.name, self.irect, self.alpha) + else: + return "Pixmap(%s, %s, %s)" % ('None', self.irect, self.alpha) + + def __enter__(self): + return self + + def __exit__(self, *args): + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + + def __del__(self): + if not type(self) is Pixmap: + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + + %} + } +}; + +/* fz_colorspace */ +struct Colorspace +{ + %extend { + ~Colorspace() + { + DEBUGMSG1("Colorspace"); + fz_colorspace *this_cs = (fz_colorspace *) $self; + fz_drop_colorspace(gctx, this_cs); + DEBUGMSG2; + } + + %pythonprepend Colorspace + %{"""Supported are GRAY, RGB and CMYK."""%} + Colorspace(int type) + { + fz_colorspace *cs = NULL; + switch(type) { + case CS_GRAY: + cs = fz_device_gray(gctx); + break; + case CS_CMYK: + cs = fz_device_cmyk(gctx); + break; + case CS_RGB: + default: + cs = fz_device_rgb(gctx); + break; + } + fz_keep_colorspace(gctx, cs); + return (struct Colorspace *) cs; + } + //----------------------------------------------------------------- + // number of bytes to define color of one pixel + //----------------------------------------------------------------- + %pythoncode %{@property%} + %pythonprepend n %{"""Size of one pixel."""%} + PyObject *n() + { + return Py_BuildValue("i", fz_colorspace_n(gctx, (fz_colorspace *) $self)); + } + + //----------------------------------------------------------------- + // name of colorspace + //----------------------------------------------------------------- + PyObject *_name() + { + return JM_UnicodeFromStr(fz_colorspace_name(gctx, (fz_colorspace *) $self)); + } + + %pythoncode %{ + @property + def name(self): + """Name of the Colorspace.""" + + if self.n == 1: + return csGRAY._name() + elif self.n == 3: + return csRGB._name() + elif self.n == 4: + return csCMYK._name() + return self._name() + + def __repr__(self): + x = ("", "GRAY", "", "RGB", "CMYK")[self.n] + return "Colorspace(CS_%s) - %s" % (x, self.name) + %} + } +}; + + +/* fz_device wrapper */ +%rename(Device) DeviceWrapper; +struct DeviceWrapper +{ + %extend { + FITZEXCEPTION(DeviceWrapper, !result) + DeviceWrapper(struct Pixmap *pm, PyObject *clip) { + struct DeviceWrapper *dw = NULL; + fz_try(gctx) { + dw = (struct DeviceWrapper *)calloc(1, sizeof(struct DeviceWrapper)); + fz_irect bbox = JM_irect_from_py(clip); + if (fz_is_infinite_irect(bbox)) + dw->device = fz_new_draw_device(gctx, fz_identity, (fz_pixmap *) pm); + else + dw->device = fz_new_draw_device_with_bbox(gctx, fz_identity, (fz_pixmap *) pm, &bbox); + } + fz_catch(gctx) { + return NULL; + } + return dw; + } + DeviceWrapper(struct DisplayList *dl) { + struct DeviceWrapper *dw = NULL; + fz_try(gctx) { + dw = (struct DeviceWrapper *)calloc(1, sizeof(struct DeviceWrapper)); + dw->device = fz_new_list_device(gctx, (fz_display_list *) dl); + dw->list = (fz_display_list *) dl; + fz_keep_display_list(gctx, (fz_display_list *) dl); + } + fz_catch(gctx) { + return NULL; + } + 
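+            // Note: the list device created above records all drawing commands
+            // into 'dl'; the display list is kept (reference counted) so that the
+            // DeviceWrapper destructor can drop it again safely.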
return dw; + } + DeviceWrapper(struct TextPage *tp, int flags = 0) { + struct DeviceWrapper *dw = NULL; + fz_try(gctx) { + dw = (struct DeviceWrapper *)calloc(1, sizeof(struct DeviceWrapper)); + fz_stext_options opts = { 0 }; + opts.flags = flags; + dw->device = fz_new_stext_device(gctx, (fz_stext_page *) tp, &opts); + } + fz_catch(gctx) { + return NULL; + } + return dw; + } + ~DeviceWrapper() { + fz_display_list *list = $self->list; + DEBUGMSG1("Device"); + fz_close_device(gctx, $self->device); + fz_drop_device(gctx, $self->device); + DEBUGMSG2; + if(list) + { + DEBUGMSG1("DisplayList after Device"); + fz_drop_display_list(gctx, list); + DEBUGMSG2; + } + } + } +}; + +//------------------------------------------------------------------------ +// fz_outline +//------------------------------------------------------------------------ +%nodefaultctor; +struct Outline { + %immutable; + %extend { + ~Outline() + { + DEBUGMSG1("Outline"); + fz_outline *this_ol = (fz_outline *) $self; + fz_drop_outline(gctx, this_ol); + DEBUGMSG2; + } + + %pythoncode %{@property%} + PyObject *uri() + { + fz_outline *ol = (fz_outline *) $self; + return JM_UnicodeFromStr(ol->uri); + } + + /* `%newobject foo;` is equivalent to wrapping C fn in python like: + ret = _foo() + ret.thisown=true + return ret. + */ + %newobject next; + %pythoncode %{@property%} + struct Outline *next() + { + fz_outline *ol = (fz_outline *) $self; + fz_outline *next_ol = ol->next; + if (!next_ol) return NULL; + next_ol = fz_keep_outline(gctx, next_ol); + return (struct Outline *) next_ol; + } + + %newobject down; + %pythoncode %{@property%} + struct Outline *down() + { + fz_outline *ol = (fz_outline *) $self; + fz_outline *down_ol = ol->down; + if (!down_ol) return NULL; + down_ol = fz_keep_outline(gctx, down_ol); + return (struct Outline *) down_ol; + } + + %pythoncode %{@property%} + PyObject *is_external() + { + fz_outline *ol = (fz_outline *) $self; + if (!ol->uri) Py_RETURN_FALSE; + return JM_BOOL(fz_is_external_link(gctx, ol->uri)); + } + + %pythoncode %{@property%} + int page() + { + fz_outline *ol = (fz_outline *) $self; + return ol->page.page; + } + + %pythoncode %{@property%} + float x() + { + fz_outline *ol = (fz_outline *) $self; + return ol->x; + } + + %pythoncode %{@property%} + float y() + { + fz_outline *ol = (fz_outline *) $self; + return ol->y; + } + + %pythoncode %{@property%} + PyObject *title() + { + fz_outline *ol = (fz_outline *) $self; + return JM_UnicodeFromStr(ol->title); + } + + %pythoncode %{@property%} + PyObject *is_open() + { + fz_outline *ol = (fz_outline *) $self; + return JM_BOOL(ol->is_open); + } + + %pythoncode %{ + @property + def dest(self): + '''outline destination details''' + return linkDest(self, None) + + def __del__(self): + if not isinstance(self, Outline): + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + %} + } +}; +%clearnodefaultctor; + + +//------------------------------------------------------------------------ +// Annotation +//------------------------------------------------------------------------ +%nodefaultctor; +struct Annot +{ + %extend + { + ~Annot() + { + DEBUGMSG1("Annot"); + pdf_annot *this_annot = (pdf_annot *) $self; + pdf_drop_annot(gctx, this_annot); + DEBUGMSG2; + } + //---------------------------------------------------------------- + // annotation rectangle + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(rect, """annotation rectangle""") + %pythonappend rect %{ + val = Rect(val) + val 
*= self.parent.derotation_matrix + %} + PyObject * + rect() + { + fz_rect r = pdf_bound_annot(gctx, (pdf_annot *) $self); + return JM_py_from_rect(r); + } + + %pythoncode %{@property%} + PARENTCHECK(rect_delta, """annotation delta values to rectangle""") + PyObject * + rect_delta() + { + PyObject *rc=NULL; + float d; + fz_try(gctx) { + pdf_obj *annot_obj = pdf_annot_obj(gctx, (pdf_annot *) $self); + pdf_obj *arr = pdf_dict_get(gctx, annot_obj, PDF_NAME(RD)); + int i, n = pdf_array_len(gctx, arr); + if (n != 4) { + rc = Py_BuildValue("s", NULL); + } else { + rc = PyTuple_New(4); + for (i = 0; i < n; i++) { + d = pdf_to_real(gctx, pdf_array_get(gctx, arr, i)); + if (i == 2 || i == 3) d *= -1; + PyTuple_SET_ITEM(rc, i, Py_BuildValue("f", d)); + } + } + } + fz_catch(gctx) { + Py_RETURN_NONE; + } + return rc; + } + + //---------------------------------------------------------------- + // annotation xref number + //---------------------------------------------------------------- + PARENTCHECK(xref, """annotation xref""") + %pythoncode %{@property%} + PyObject *xref() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + return Py_BuildValue("i", pdf_to_num(gctx, annot_obj)); + } + + //---------------------------------------------------------------- + // annotation get IRT xref number + //---------------------------------------------------------------- + PARENTCHECK(irt_xref, """annotation IRT xref""") + %pythoncode %{@property%} + PyObject *irt_xref() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_obj *irt = pdf_dict_get(gctx, annot_obj, PDF_NAME(IRT)); + if (!irt) return PyLong_FromLong(0); + return PyLong_FromLong((long) pdf_to_num(gctx, irt)); + } + + //---------------------------------------------------------------- + // annotation set IRT xref number + //---------------------------------------------------------------- + FITZEXCEPTION(set_irt_xref, !result) + PARENTCHECK(set_irt_xref, """Set annotation IRT xref""") + PyObject *set_irt_xref(int xref) + { + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_page *page = pdf_annot_page(gctx, annot); + if (!INRANGE(xref, 1, pdf_xref_len(gctx, page->doc) - 1)) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + pdf_obj *irt = pdf_new_indirect(gctx, page->doc, xref, 0); + pdf_obj *subt = pdf_dict_get(gctx, irt, PDF_NAME(Subtype)); + int irt_subt = pdf_annot_type_from_string(gctx, pdf_to_name(gctx, subt)); + if (irt_subt < 0) { + pdf_drop_obj(gctx, irt); + RAISEPY(gctx, MSG_IS_NO_ANNOT, PyExc_ValueError); + } + pdf_dict_put_drop(gctx, annot_obj, PDF_NAME(IRT), irt); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------------------------------------- + // annotation get AP/N Matrix + //---------------------------------------------------------------- + PARENTCHECK(apn_matrix, """annotation appearance matrix""") + %pythonappend apn_matrix %{val = Matrix(val)%} + %pythoncode %{@property%} + PyObject * + apn_matrix() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_obj *ap = pdf_dict_getl(gctx, annot_obj, PDF_NAME(AP), + PDF_NAME(N), NULL); + if (!ap) + return JM_py_from_matrix(fz_identity); + fz_matrix mat = pdf_dict_get_matrix(gctx, ap, PDF_NAME(Matrix)); + return JM_py_from_matrix(mat); + } + + + //---------------------------------------------------------------- + // annotation get 
AP/N BBox + //---------------------------------------------------------------- + PARENTCHECK(apn_bbox, """annotation appearance bbox""") + %pythonappend apn_bbox %{ + val = Rect(val) * self.parent.transformation_matrix + val *= self.parent.derotation_matrix%} + %pythoncode %{@property%} + PyObject * + apn_bbox() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_obj *ap = pdf_dict_getl(gctx, annot_obj, PDF_NAME(AP), + PDF_NAME(N), NULL); + if (!ap) + return JM_py_from_rect(fz_infinite_rect); + fz_rect rect = pdf_dict_get_rect(gctx, ap, PDF_NAME(BBox)); + return JM_py_from_rect(rect); + } + + + //---------------------------------------------------------------- + // annotation set AP/N Matrix + //---------------------------------------------------------------- + FITZEXCEPTION(set_apn_matrix, !result) + PARENTCHECK(set_apn_matrix, """Set annotation appearance matrix.""") + PyObject * + set_apn_matrix(PyObject *matrix) + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + fz_try(gctx) { + pdf_obj *ap = pdf_dict_getl(gctx, annot_obj, PDF_NAME(AP), + PDF_NAME(N), NULL); + if (!ap) { + RAISEPY(gctx, MSG_BAD_APN, PyExc_RuntimeError); + } + fz_matrix mat = JM_matrix_from_py(matrix); + pdf_dict_put_matrix(gctx, ap, PDF_NAME(Matrix), mat); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation set AP/N BBox + //---------------------------------------------------------------- + FITZEXCEPTION(set_apn_bbox, !result) + %pythonprepend set_apn_bbox %{ + """Set annotation appearance bbox.""" + + CheckParent(self) + page = self.parent + rot = page.rotation_matrix + mat = page.transformation_matrix + bbox *= rot * ~mat + %} + PyObject * + set_apn_bbox(PyObject *bbox) + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + fz_try(gctx) { + pdf_obj *ap = pdf_dict_getl(gctx, annot_obj, PDF_NAME(AP), + PDF_NAME(N), NULL); + if (!ap) { + RAISEPY(gctx, MSG_BAD_APN, PyExc_RuntimeError); + } + fz_rect rect = JM_rect_from_py(bbox); + pdf_dict_put_rect(gctx, ap, PDF_NAME(BBox), rect); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation show blend mode (/BM) + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(blendmode, """annotation BlendMode""") + PyObject *blendmode() + { + PyObject *blend_mode = NULL; + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_obj *obj, *obj1, *obj2; + obj = pdf_dict_get(gctx, annot_obj, PDF_NAME(BM)); + if (obj) { + blend_mode = JM_UnicodeFromStr(pdf_to_name(gctx, obj)); + goto finished; + } + // loop through the /AP/N/Resources/ExtGState objects + obj = pdf_dict_getl(gctx, annot_obj, PDF_NAME(AP), + PDF_NAME(N), + PDF_NAME(Resources), + PDF_NAME(ExtGState), + NULL); + + if (pdf_is_dict(gctx, obj)) { + int i, j, m, n = pdf_dict_len(gctx, obj); + for (i = 0; i < n; i++) { + obj1 = pdf_dict_get_val(gctx, obj, i); + if (pdf_is_dict(gctx, obj1)) { + m = pdf_dict_len(gctx, obj1); + for (j = 0; j < m; j++) { + obj2 = pdf_dict_get_key(gctx, obj1, j); + if (pdf_objcmp(gctx, obj2, PDF_NAME(BM)) == 0) { + blend_mode = JM_UnicodeFromStr(pdf_to_name(gctx, pdf_dict_get_val(gctx, obj1, j))); + goto finished; + } + } + } + } + } + finished:; + } + 
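+        // Note: if the annotation has no direct /BM key, the loop above scans the
+        // /AP/N/Resources/ExtGState dictionaries for a blend mode entry instead.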
fz_catch(gctx) { + Py_RETURN_NONE; + } + if (blend_mode) return blend_mode; + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation set blend mode (/BM) + //---------------------------------------------------------------- + FITZEXCEPTION(set_blendmode, !result) + PARENTCHECK(set_blendmode, """Set annotation BlendMode.""") + PyObject * + set_blendmode(char *blend_mode) + { + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_dict_put_name(gctx, annot_obj, PDF_NAME(BM), blend_mode); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation get optional content + //---------------------------------------------------------------- + FITZEXCEPTION(get_oc, !result) + PARENTCHECK(get_oc, """Get annotation optional content reference.""") + PyObject *get_oc() + { + int oc = 0; + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_obj *obj = pdf_dict_get(gctx, annot_obj, PDF_NAME(OC)); + if (obj) { + oc = pdf_to_num(gctx, obj); + } + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", oc); + } + + + //---------------------------------------------------------------- + // annotation set open + //---------------------------------------------------------------- + FITZEXCEPTION(set_open, !result) + PARENTCHECK(set_open, """Set 'open' status of annotation or its Popup.""") + PyObject *set_open(int is_open) + { + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_set_annot_is_open(gctx, annot, is_open); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation inquiry: is open + //---------------------------------------------------------------- + FITZEXCEPTION(is_open, !result) + PARENTCHECK(is_open, """Get 'open' status of annotation or its Popup.""") + %pythoncode %{@property%} + PyObject * + is_open() + { + int is_open; + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + is_open = pdf_annot_is_open(gctx, annot); + } + fz_catch(gctx) { + return NULL; + } + return JM_BOOL(is_open); + } + + + //---------------------------------------------------------------- + // annotation inquiry: has Popup + //---------------------------------------------------------------- + FITZEXCEPTION(has_popup, !result) + PARENTCHECK(has_popup, """Check if annotation has a Popup.""") + %pythoncode %{@property%} + PyObject * + has_popup() + { + int has_popup = 0; + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_obj *obj = pdf_dict_get(gctx, annot_obj, PDF_NAME(Popup)); + if (obj) has_popup = 1; + } + fz_catch(gctx) { + return NULL; + } + return JM_BOOL(has_popup); + } + + + //---------------------------------------------------------------- + // annotation set Popup + //---------------------------------------------------------------- + FITZEXCEPTION(set_popup, !result) + PARENTCHECK(set_popup, """Create annotation 'Popup' or update rectangle.""") + PyObject * + set_popup(PyObject *rect) + { + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_page *pdfpage = pdf_annot_page(gctx, annot); + fz_matrix rot = JM_rotate_page_matrix(gctx, pdfpage); + fz_rect r = fz_transform_rect(JM_rect_from_py(rect), rot); + pdf_set_annot_popup(gctx, annot, r); + } + 
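+            // Note: the rectangle is first mapped with the page rotation matrix
+            // (JM_rotate_page_matrix) before MuPDF stores it as the /Popup rectangle.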
fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------------------------------------- + // annotation Popup rectangle + //---------------------------------------------------------------- + FITZEXCEPTION(popup_rect, !result) + PARENTCHECK(popup_rect, """annotation 'Popup' rectangle""") + %pythoncode %{@property%} + %pythonappend popup_rect %{ + val = Rect(val) * self.parent.transformation_matrix + val *= self.parent.derotation_matrix%} + PyObject * + popup_rect() + { + fz_rect rect = fz_infinite_rect; + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_obj *obj = pdf_dict_get(gctx, annot_obj, PDF_NAME(Popup)); + if (obj) { + rect = pdf_dict_get_rect(gctx, obj, PDF_NAME(Rect)); + } + } + fz_catch(gctx) { + return NULL; + } + return JM_py_from_rect(rect); + } + + + //---------------------------------------------------------------- + // annotation Popup xref + //---------------------------------------------------------------- + FITZEXCEPTION(popup_xref, !result) + PARENTCHECK(popup_xref, """annotation 'Popup' xref""") + %pythoncode %{@property%} + PyObject * + popup_xref() + { + int xref = 0; + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_obj *obj = pdf_dict_get(gctx, annot_obj, PDF_NAME(Popup)); + if (obj) { + xref = pdf_to_num(gctx, obj); + } + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", xref); + } + + + //---------------------------------------------------------------- + // annotation set optional content + //---------------------------------------------------------------- + FITZEXCEPTION(set_oc, !result) + PARENTCHECK(set_oc, """Set / remove annotation OC xref.""") + PyObject * + set_oc(int oc=0) + { + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + if (!oc) { + pdf_dict_del(gctx, annot_obj, PDF_NAME(OC)); + } else { + JM_add_oc_object(gctx, pdf_get_bound_document(gctx, annot_obj), annot_obj, oc); + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + %pythoncode%{@property%} + %pythonprepend language %{"""annotation language"""%} + PyObject *language() + { + pdf_annot *this_annot = (pdf_annot *) $self; + fz_text_language lang = pdf_annot_language(gctx, this_annot); + char buf[8]; + if (lang == FZ_LANG_UNSET) Py_RETURN_NONE; + return Py_BuildValue("s", fz_string_from_text_language(buf, lang)); + } + + //---------------------------------------------------------------- + // annotation set language (/Lang) + //---------------------------------------------------------------- + FITZEXCEPTION(set_language, !result) + PARENTCHECK(set_language, """Set annotation language.""") + PyObject *set_language(char *language=NULL) + { + pdf_annot *this_annot = (pdf_annot *) $self; + fz_try(gctx) { + fz_text_language lang; + if (!language) + lang = FZ_LANG_UNSET; + else + lang = fz_text_language_from_string(language); + pdf_set_annot_language(gctx, this_annot, lang); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation get decompressed appearance stream source + //---------------------------------------------------------------- + FITZEXCEPTION(_getAP, !result) + PyObject * + _getAP() + { + PyObject *r = NULL; + fz_buffer *res = NULL; + fz_var(res); + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = 
pdf_annot_obj(gctx, annot); + pdf_obj *ap = pdf_dict_getl(gctx, annot_obj, PDF_NAME(AP), + PDF_NAME(N), NULL); + + if (pdf_is_stream(gctx, ap)) res = pdf_load_stream(gctx, ap); + if (res) { + r = JM_BinFromBuffer(gctx, res); + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + Py_RETURN_NONE; + } + if (!r) Py_RETURN_NONE; + return r; + } + + //---------------------------------------------------------------- + // annotation update /AP stream + //---------------------------------------------------------------- + FITZEXCEPTION(_setAP, !result) + PyObject * + _setAP(PyObject *buffer, int rect=0) + { + fz_buffer *res = NULL; + fz_var(res); + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_page *page = pdf_annot_page(gctx, annot); + pdf_obj *apobj = pdf_dict_getl(gctx, annot_obj, PDF_NAME(AP), + PDF_NAME(N), NULL); + if (!apobj) { + RAISEPY(gctx, MSG_BAD_APN, PyExc_RuntimeError); + } + if (!pdf_is_stream(gctx, apobj)) { + RAISEPY(gctx, MSG_BAD_APN, PyExc_RuntimeError); + } + res = JM_BufferFromBytes(gctx, buffer); + if (!res) { + RAISEPY(gctx, MSG_BAD_BUFFER, PyExc_ValueError); + } + JM_update_stream(gctx, page->doc, apobj, res, 1); + if (rect) { + fz_rect bbox = pdf_dict_get_rect(gctx, annot_obj, PDF_NAME(Rect)); + pdf_dict_put_rect(gctx, apobj, PDF_NAME(BBox), bbox); + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // redaction annotation get values + //---------------------------------------------------------------- + FITZEXCEPTION(_get_redact_values, !result) + %pythonappend _get_redact_values %{ + if not val: + return val + val["rect"] = self.rect + text_color, fontname, fontsize = TOOLS._parse_da(self) + val["text_color"] = text_color + val["fontname"] = fontname + val["fontsize"] = fontsize + fill = self.colors["fill"] + val["fill"] = fill + + %} + PyObject * + _get_redact_values() + { + pdf_annot *annot = (pdf_annot *) $self; + if (pdf_annot_type(gctx, annot) != PDF_ANNOT_REDACT) + Py_RETURN_NONE; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + PyObject *values = PyDict_New(); + pdf_obj *obj = NULL; + const char *text = NULL; + fz_try(gctx) { + obj = pdf_dict_gets(gctx, annot_obj, "RO"); + if (obj) { + JM_Warning("Ignoring redaction key '/RO'."); + int xref = pdf_to_num(gctx, obj); + DICT_SETITEM_DROP(values, dictkey_xref, Py_BuildValue("i", xref)); + } + obj = pdf_dict_gets(gctx, annot_obj, "OverlayText"); + if (obj) { + text = pdf_to_text_string(gctx, obj); + DICT_SETITEM_DROP(values, dictkey_text, JM_UnicodeFromStr(text)); + } else { + DICT_SETITEM_DROP(values, dictkey_text, Py_BuildValue("s", "")); + } + obj = pdf_dict_get(gctx, annot_obj, PDF_NAME(Q)); + int align = 0; + if (obj) { + align = pdf_to_int(gctx, obj); + } + DICT_SETITEM_DROP(values, dictkey_align, Py_BuildValue("i", align)); + } + fz_catch(gctx) { + Py_DECREF(values); + return NULL; + } + return values; + } + + //---------------------------------------------------------------- + // annotation get TextPage + //---------------------------------------------------------------- + %pythonappend get_textpage %{ + if val: + val.thisown = True + %} + FITZEXCEPTION(get_textpage, !result) + PARENTCHECK(get_textpage, """Make annotation TextPage.""") + struct TextPage * + get_textpage(PyObject *clip=NULL, int flags = 0) + { + fz_stext_page *textpage=NULL; + fz_stext_options options = { 0 }; 
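+            // 'flags' is the caller's bit mask of text extraction options (e.g. the
+            // TEXT_PRESERVE_* constants); it is copied into MuPDF's stext options below.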
+ options.flags = flags; + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + textpage = pdf_new_stext_page_from_annot(gctx, annot, &options); + } + fz_catch(gctx) { + return NULL; + } + return (struct TextPage *) textpage; + } + + + //---------------------------------------------------------------- + // annotation set name + //---------------------------------------------------------------- + FITZEXCEPTION(set_name, !result) + PARENTCHECK(set_name, """Set /Name (icon) of annotation.""") + PyObject * + set_name(char *name) + { + fz_try(gctx) { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_dict_put_name(gctx, annot_obj, PDF_NAME(Name), name); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation set rectangle + //---------------------------------------------------------------- + PARENTCHECK(set_rect, """Set annotation rectangle.""") + FITZEXCEPTION(set_rect, !result) + PyObject * + set_rect(PyObject *rect) + { + pdf_annot *annot = (pdf_annot *) $self; + int type = pdf_annot_type(gctx, annot); + int err_source = 0; // what raised the error + fz_var(err_source); + fz_try(gctx) { + pdf_page *pdfpage = pdf_annot_page(gctx, annot); + fz_matrix rot = JM_rotate_page_matrix(gctx, pdfpage); + fz_rect r = fz_transform_rect(JM_rect_from_py(rect), rot); + if (fz_is_empty_rect(r) || fz_is_infinite_rect(r)) { + RAISEPY(gctx, MSG_BAD_RECT, PyExc_ValueError); + } + err_source = 1; // indicate that error was from MuPDF + pdf_set_annot_rect(gctx, annot, r); + } + fz_catch(gctx) { + if (err_source == 0) { + return NULL; + } + PySys_WriteStderr("cannot set rect: '%s'\n", fz_caught_message(gctx)); + Py_RETURN_FALSE; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation set rotation + //---------------------------------------------------------------- + PARENTCHECK(set_rotation, """Set annotation rotation.""") + PyObject * + set_rotation(int rotate=0) + { + pdf_annot *annot = (pdf_annot *) $self; + int type = pdf_annot_type(gctx, annot); + switch (type) + { + case PDF_ANNOT_CARET: break; + case PDF_ANNOT_CIRCLE: break; + case PDF_ANNOT_FREE_TEXT: break; + case PDF_ANNOT_FILE_ATTACHMENT: break; + case PDF_ANNOT_INK: break; + case PDF_ANNOT_LINE: break; + case PDF_ANNOT_POLY_LINE: break; + case PDF_ANNOT_POLYGON: break; + case PDF_ANNOT_SQUARE: break; + case PDF_ANNOT_STAMP: break; + case PDF_ANNOT_TEXT: break; + default: Py_RETURN_NONE; + } + int rot = rotate; + while (rot < 0) rot += 360; + while (rot >= 360) rot -= 360; + if (type == PDF_ANNOT_FREE_TEXT && rot % 90 != 0) + rot = 0; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_dict_put_int(gctx, annot_obj, PDF_NAME(Rotate), rot); + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation get rotation + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(rotation, """annotation rotation""") + int rotation() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_obj *rotation = pdf_dict_get(gctx, annot_obj, PDF_NAME(Rotate)); + if (!rotation) return -1; + return pdf_to_int(gctx, rotation); + } + + + //---------------------------------------------------------------- + // annotation vertices (for "Line", "Polgon", "Ink", etc. 
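+        // (reads /Vertices, /L, /QuadPoints, /CL or /InkList and transforms each
+        // point pair with the page transformation and derotation matrices)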
+ //---------------------------------------------------------------- + PARENTCHECK(vertices, """annotation vertex points""") + %pythoncode %{@property%} + PyObject *vertices() + { + PyObject *res = NULL, *res1 = NULL; + pdf_obj *o, *o1; + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_page *page = pdf_annot_page(gctx, annot); + int i, j; + fz_point point; // point object to work with + fz_matrix page_ctm; // page transformation matrix + pdf_page_transform(gctx, page, NULL, &page_ctm); + fz_matrix derot = JM_derotate_page_matrix(gctx, page); + page_ctm = fz_concat(page_ctm, derot); + + //---------------------------------------------------------------- + // The following objects occur in different annotation types. + // So we are sure that (!o) occurs at most once. + // Every pair of floats is one point, that needs to be separately + // transformed with the page transformation matrix. + //---------------------------------------------------------------- + o = pdf_dict_get(gctx, annot_obj, PDF_NAME(Vertices)); + if (o) goto weiter; + o = pdf_dict_get(gctx, annot_obj, PDF_NAME(L)); + if (o) goto weiter; + o = pdf_dict_get(gctx, annot_obj, PDF_NAME(QuadPoints)); + if (o) goto weiter; + o = pdf_dict_gets(gctx, annot_obj, "CL"); + if (o) goto weiter; + o = pdf_dict_get(gctx, annot_obj, PDF_NAME(InkList)); + if (o) goto inklist; + Py_RETURN_NONE; + + // handle lists with 1-level depth -------------------------------- + weiter:; + res = PyList_New(0); // create Python list + for (i = 0; i < pdf_array_len(gctx, o); i += 2) + { + point.x = pdf_to_real(gctx, pdf_array_get(gctx, o, i)); + point.y = pdf_to_real(gctx, pdf_array_get(gctx, o, i+1)); + point = fz_transform_point(point, page_ctm); + LIST_APPEND_DROP(res, Py_BuildValue("ff", point.x, point.y)); + } + return res; + + // InkList has 2-level lists -------------------------------------- + inklist:; + res = PyList_New(0); + for (i = 0; i < pdf_array_len(gctx, o); i++) + { + res1 = PyList_New(0); + o1 = pdf_array_get(gctx, o, i); + for (j = 0; j < pdf_array_len(gctx, o1); j += 2) + { + point.x = pdf_to_real(gctx, pdf_array_get(gctx, o1, j)); + point.y = pdf_to_real(gctx, pdf_array_get(gctx, o1, j+1)); + point = fz_transform_point(point, page_ctm); + LIST_APPEND_DROP(res1, Py_BuildValue("ff", point.x, point.y)); + } + LIST_APPEND_DROP(res, res1); + } + return res; + } + + //---------------------------------------------------------------- + // annotation colors + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(colors, """Color definitions.""") + PyObject *colors() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + return JM_annot_colors(gctx, annot_obj); + } + + //---------------------------------------------------------------- + // annotation update appearance + //---------------------------------------------------------------- + PyObject *_update_appearance(float opacity=-1, + char *blend_mode=NULL, + PyObject *fill_color=NULL, + int rotate = -1) + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_page *page = pdf_annot_page(gctx, annot); + pdf_document *pdf = page->doc; + int type = pdf_annot_type(gctx, annot); + float fcol[4] = {1,1,1,1}; // std fill color: white + int i, nfcol = 0; // number of color components + JM_color_FromSequence(fill_color, &nfcol, fcol); + fz_try(gctx) { + // remove fill color from unsupported annots + // or if so 
requested + if ((type != PDF_ANNOT_SQUARE + && type != PDF_ANNOT_CIRCLE + && type != PDF_ANNOT_LINE + && type != PDF_ANNOT_POLY_LINE + && type != PDF_ANNOT_POLYGON + ) + || nfcol == 0 + ) { + pdf_dict_del(gctx, annot_obj, PDF_NAME(IC)); + } else if (nfcol > 0) { + pdf_set_annot_interior_color(gctx, annot, nfcol, fcol); + } + + int insert_rot = (rotate >= 0) ? 1 : 0; + switch (type) { + case PDF_ANNOT_CARET: + case PDF_ANNOT_CIRCLE: + case PDF_ANNOT_FREE_TEXT: + case PDF_ANNOT_FILE_ATTACHMENT: + case PDF_ANNOT_INK: + case PDF_ANNOT_LINE: + case PDF_ANNOT_POLY_LINE: + case PDF_ANNOT_POLYGON: + case PDF_ANNOT_SQUARE: + case PDF_ANNOT_STAMP: + case PDF_ANNOT_TEXT: break; + default: insert_rot = 0; + } + + if (insert_rot) { + pdf_dict_put_int(gctx, annot_obj, PDF_NAME(Rotate), rotate); + } + + pdf_dirty_annot(gctx, annot); + pdf_update_annot(gctx, annot); // let MuPDF update + pdf->resynth_required = 0; + // insert fill color + if (type == PDF_ANNOT_FREE_TEXT) { + if (nfcol > 0) { + pdf_set_annot_color(gctx, annot, nfcol, fcol); + } + } else if (nfcol > 0) { + pdf_obj *col = pdf_new_array(gctx, page->doc, nfcol); + for (i = 0; i < nfcol; i++) { + pdf_array_push_real(gctx, col, fcol[i]); + } + pdf_dict_put_drop(gctx,annot_obj, PDF_NAME(IC), col); + } + } + fz_catch(gctx) { + PySys_WriteStderr("cannot update annot: '%s'\n", fz_caught_message(gctx)); + Py_RETURN_FALSE; + } + + if ((opacity < 0 || opacity >= 1) && !blend_mode) // no opacity, no blend_mode + goto normal_exit; + + fz_try(gctx) { // create or update /ExtGState + pdf_obj *ap = pdf_dict_getl(gctx, annot_obj, PDF_NAME(AP), + PDF_NAME(N), NULL); + if (!ap) { // should never happen + RAISEPY(gctx, MSG_BAD_APN, PyExc_RuntimeError); + } + + pdf_obj *resources = pdf_dict_get(gctx, ap, PDF_NAME(Resources)); + if (!resources) { // no Resources yet: make one + resources = pdf_dict_put_dict(gctx, ap, PDF_NAME(Resources), 2); + } + pdf_obj *alp0 = pdf_new_dict(gctx, page->doc, 3); + if (opacity >= 0 && opacity < 1) { + pdf_dict_put_real(gctx, alp0, PDF_NAME(CA), (double) opacity); + pdf_dict_put_real(gctx, alp0, PDF_NAME(ca), (double) opacity); + pdf_dict_put_real(gctx, annot_obj, PDF_NAME(CA), (double) opacity); + } + if (blend_mode) { + pdf_dict_put_name(gctx, alp0, PDF_NAME(BM), blend_mode); + pdf_dict_put_name(gctx, annot_obj, PDF_NAME(BM), blend_mode); + } + pdf_obj *extg = pdf_dict_get(gctx, resources, PDF_NAME(ExtGState)); + if (!extg) { // no ExtGState yet: make one + extg = pdf_dict_put_dict(gctx, resources, PDF_NAME(ExtGState), 2); + } + pdf_dict_put_drop(gctx, extg, PDF_NAME(H), alp0); + } + + fz_catch(gctx) { + PySys_WriteStderr("cannot set opacity or blend mode\n"); + Py_RETURN_FALSE; + } + normal_exit:; + Py_RETURN_TRUE; + } + + + %pythoncode %{ + def update(self, + blend_mode: OptStr =None, + opacity: OptFloat =None, + fontsize: float =0, + fontname: OptStr =None, + text_color: OptSeq =None, + border_color: OptSeq =None, + fill_color: OptSeq =None, + cross_out: bool =True, + rotate: int =-1, + ): + + """Update annot appearance. + + Notes: + Depending on the annot type, some parameters make no sense, + while others are only available in this method to achieve the + desired result. This is especially true for 'FreeText' annots. + Args: + blend_mode: set the blend mode, all annotations. + opacity: set the opacity, all annotations. + fontsize: set fontsize, 'FreeText' only. + fontname: set the font, 'FreeText' only. + border_color: set border color, 'FreeText' only. + text_color: set text color, 'FreeText' only. 
+ fill_color: set fill color, all annotations. + cross_out: draw diagonal lines, 'Redact' only. + rotate: set rotation, 'FreeText' and some others. + """ + CheckParent(self) + def color_string(cs, code): + """Return valid PDF color operator for a given color sequence. + """ + cc = ColorCode(cs, code) + if not cc: + return b"" + return (cc + "\n").encode() + + annot_type = self.type[0] # get the annot type + dt = self.border.get("dashes", None) # get the dashes spec + bwidth = self.border.get("width", -1) # get border line width + stroke = self.colors["stroke"] # get the stroke color + if fill_color != None: # change of fill color requested + fill = fill_color + else: # put in current annot value + fill = self.colors["fill"] + + rect = None # self.rect # prevent MuPDF fiddling with it + apnmat = self.apn_matrix # prevent MuPDF fiddling with it + if rotate != -1: # sanitize rotation value + while rotate < 0: + rotate += 360 + while rotate >= 360: + rotate -= 360 + if annot_type == PDF_ANNOT_FREE_TEXT and rotate % 90 != 0: + rotate = 0 + + #------------------------------------------------------------------ + # handle opacity and blend mode + #------------------------------------------------------------------ + if blend_mode is None: + blend_mode = self.blendmode + if not hasattr(opacity, "__float__"): + opacity = self.opacity + + if 0 <= opacity < 1 or blend_mode is not None: + opa_code = "/H gs\n" # then we must reference this 'gs' + else: + opa_code = "" + + if annot_type == PDF_ANNOT_FREE_TEXT: + CheckColor(border_color) + CheckColor(text_color) + CheckColor(fill_color) + tcol, fname, fsize = TOOLS._parse_da(self) + + # read and update default appearance as necessary + update_default_appearance = False + if fsize <= 0: + fsize = 12 + update_default_appearance = True + if text_color is not None: + tcol = text_color + update_default_appearance = True + if fontname is not None: + fname = fontname + update_default_appearance = True + if fontsize > 0: + fsize = fontsize + update_default_appearance = True + + if update_default_appearance: + da_str = "" + if len(tcol) == 3: + fmt = "{:g} {:g} {:g} rg /{f:s} {s:g} Tf" + elif len(tcol) == 1: + fmt = "{:g} g /{f:s} {s:g} Tf" + elif len(tcol) == 4: + fmt = "{:g} {:g} {:g} {:g} k /{f:s} {s:g} Tf" + da_str = fmt.format(*tcol, f=fname, s=fsize) + TOOLS._update_da(self, da_str) + + #------------------------------------------------------------------ + # now invoke MuPDF to update the annot appearance + #------------------------------------------------------------------ + val = self._update_appearance( + opacity=opacity, + blend_mode=blend_mode, + fill_color=fill, + rotate=rotate, + ) + if val == False: + raise RuntimeError("Error updating annotation.") + + bfill = color_string(fill, "f") + bstroke = color_string(stroke, "c") + + p_ctm = self.parent.transformation_matrix + imat = ~p_ctm # inverse page transf. 
matrix + + if dt: + dashes = "[" + " ".join(map(str, dt)) + "] 0 d\n" + dashes = dashes.encode("utf-8") + else: + dashes = None + + if self.line_ends: + line_end_le, line_end_ri = self.line_ends + else: + line_end_le, line_end_ri = 0, 0 # init line end codes + + # read contents as created by MuPDF + ap = self._getAP() + ap_tab = ap.splitlines() # split in single lines + ap_updated = False # assume we did nothing + + if annot_type == PDF_ANNOT_REDACT: + if cross_out: # create crossed-out rect + ap_updated = True + ap_tab = ap_tab[:-1] + _, LL, LR, UR, UL = ap_tab + ap_tab.append(LR) + ap_tab.append(LL) + ap_tab.append(UR) + ap_tab.append(LL) + ap_tab.append(UL) + ap_tab.append(b"S") + + if bwidth > 0 or bstroke != b"": + ap_updated = True + ntab = [b"%g w" % bwidth] if bwidth > 0 else [] + for line in ap_tab: + if line.endswith(b"w"): + continue + if line.endswith(b"RG") and bstroke != b"": + line = bstroke[:-1] + ntab.append(line) + ap_tab = ntab + + ap = b"\n".join(ap_tab) + + if annot_type == PDF_ANNOT_FREE_TEXT: + BT = ap.find(b"BT") + ET = ap.find(b"ET") + 2 + ap = ap[BT:ET] + w, h = self.rect.width, self.rect.height + if rotate in (90, 270) or not (apnmat.b == apnmat.c == 0): + w, h = h, w + re = b"0 0 %g %g re" % (w, h) + ap = re + b"\nW\nn\n" + ap + ope = None + fill_string = color_string(fill, "f") + if fill_string: + ope = b"f" + stroke_string = color_string(border_color, "c") + if stroke_string and bwidth > 0: + ope = b"S" + bwidth = b"%g w\n" % bwidth + else: + bwidth = stroke_string = b"" + if fill_string and stroke_string: + ope = b"B" + if ope != None: + ap = bwidth + fill_string + stroke_string + re + b"\n" + ope + b"\n" + ap + + if dashes != None: # handle dashes + ap = dashes + b"\n" + ap + dashes = None + + ap_updated = True + + if annot_type in (PDF_ANNOT_POLYGON, PDF_ANNOT_POLY_LINE): + ap = b"\n".join(ap_tab[:-1]) + b"\n" + ap_updated = True + if bfill != b"": + if annot_type == PDF_ANNOT_POLYGON: + ap = ap + bfill + b"b" # close, fill, and stroke + elif annot_type == PDF_ANNOT_POLY_LINE: + ap = ap + b"S" # stroke + else: + if annot_type == PDF_ANNOT_POLYGON: + ap = ap + b"s" # close and stroke + elif annot_type == PDF_ANNOT_POLY_LINE: + ap = ap + b"S" # stroke + + if dashes is not None: # handle dashes + ap = dashes + ap + # reset dashing - only applies for LINE annots with line ends given + ap = ap.replace(b"\nS\n", b"\nS\n[] 0 d\n", 1) + ap_updated = True + + if opa_code: + ap = opa_code.encode("utf-8") + ap + ap_updated = True + + ap = b"q\n" + ap + b"\nQ\n" + #---------------------------------------------------------------------- + # the following handles line end symbols for 'Polygon' and 'Polyline' + #---------------------------------------------------------------------- + if line_end_le + line_end_ri > 0 and annot_type in (PDF_ANNOT_POLYGON, PDF_ANNOT_POLY_LINE): + + le_funcs = (None, TOOLS._le_square, TOOLS._le_circle, + TOOLS._le_diamond, TOOLS._le_openarrow, + TOOLS._le_closedarrow, TOOLS._le_butt, + TOOLS._le_ropenarrow, TOOLS._le_rclosedarrow, + TOOLS._le_slash) + le_funcs_range = range(1, len(le_funcs)) + d = 2 * max(1, self.border["width"]) + rect = self.rect + (-d, -d, d, d) + ap_updated = True + points = self.vertices + if line_end_le in le_funcs_range: + p1 = Point(points[0]) * imat + p2 = Point(points[1]) * imat + left = le_funcs[line_end_le](self, p1, p2, False, fill_color) + ap += left.encode() + if line_end_ri in le_funcs_range: + p1 = Point(points[-2]) * imat + p2 = Point(points[-1]) * imat + left = le_funcs[line_end_ri](self, p1, p2, True, 
fill_color) + ap += left.encode() + + if ap_updated: + if rect: # rect modified here? + self.set_rect(rect) + self._setAP(ap, rect=1) + else: + self._setAP(ap, rect=0) + + #------------------------------- + # handle annotation rotations + #------------------------------- + if annot_type not in ( # only these types are supported + PDF_ANNOT_CARET, + PDF_ANNOT_CIRCLE, + PDF_ANNOT_FILE_ATTACHMENT, + PDF_ANNOT_INK, + PDF_ANNOT_LINE, + PDF_ANNOT_POLY_LINE, + PDF_ANNOT_POLYGON, + PDF_ANNOT_SQUARE, + PDF_ANNOT_STAMP, + PDF_ANNOT_TEXT, + ): + return + + rot = self.rotation # get value from annot object + if rot == -1: # nothing to change + return + + M = (self.rect.tl + self.rect.br) / 2 # center of annot rect + + if rot == 0: # undo rotations + if abs(apnmat - Matrix(1, 1)) < 1e-5: + return # matrix already is a no-op + quad = self.rect.morph(M, ~apnmat) # derotate rect + self.set_rect(quad.rect) + self.set_apn_matrix(Matrix(1, 1)) # appearance matrix = no-op + return + + mat = Matrix(rot) + quad = self.rect.morph(M, mat) + self.set_rect(quad.rect) + self.set_apn_matrix(apnmat * mat) + %} + + //---------------------------------------------------------------- + // annotation set colors + //---------------------------------------------------------------- + %pythoncode %{ + def set_colors(self, colors=None, stroke=None, fill=None): + """Set 'stroke' and 'fill' colors. + + Use either a dict or the direct arguments. + """ + CheckParent(self) + doc = self.parent.parent + if type(colors) is not dict: + colors = {"fill": fill, "stroke": stroke} + fill = colors.get("fill") + stroke = colors.get("stroke") + fill_annots = (PDF_ANNOT_CIRCLE, PDF_ANNOT_SQUARE, PDF_ANNOT_LINE, PDF_ANNOT_POLY_LINE, PDF_ANNOT_POLYGON, + PDF_ANNOT_REDACT,) + if stroke in ([], ()): + doc.xref_set_key(self.xref, "C", "[]") + elif stroke is not None: + if hasattr(stroke, "__float__"): + stroke = [float(stroke)] + CheckColor(stroke) + if len(stroke) == 1: + s = "[%g]" % stroke[0] + elif len(stroke) == 3: + s = "[%g %g %g]" % tuple(stroke) + else: + s = "[%g %g %g %g]" % tuple(stroke) + doc.xref_set_key(self.xref, "C", s) + + if fill and self.type[0] not in fill_annots: + print("Warning: fill color ignored for annot type '%s'." 
% self.type[1]) + return + if fill in ([], ()): + doc.xref_set_key(self.xref, "IC", "[]") + elif fill is not None: + if hasattr(fill, "__float__"): + fill = [float(fill)] + CheckColor(fill) + if len(fill) == 1: + s = "[%g]" % fill[0] + elif len(fill) == 3: + s = "[%g %g %g]" % tuple(fill) + else: + s = "[%g %g %g %g]" % tuple(fill) + doc.xref_set_key(self.xref, "IC", s) + %} + + + //---------------------------------------------------------------- + // annotation line_ends + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(line_ends, """Line end codes.""") + PyObject * + line_ends() + { + pdf_annot *annot = (pdf_annot *) $self; + + // return nothing for invalid annot types + if (!pdf_annot_has_line_ending_styles(gctx, annot)) + Py_RETURN_NONE; + + int lstart = (int) pdf_annot_line_start_style(gctx, annot); + int lend = (int) pdf_annot_line_end_style(gctx, annot); + return Py_BuildValue("ii", lstart, lend); + } + + + //---------------------------------------------------------------- + // annotation set line ends + //---------------------------------------------------------------- + PARENTCHECK(set_line_ends, """Set line end codes.""") + void set_line_ends(int start, int end) + { + pdf_annot *annot = (pdf_annot *) $self; + if (pdf_annot_has_line_ending_styles(gctx, annot)) + pdf_set_annot_line_ending_styles(gctx, annot, start, end); + else + JM_Warning("bad annot type for line ends"); + } + + + //---------------------------------------------------------------- + // annotation type + //---------------------------------------------------------------- + PARENTCHECK(type, """annotation type""") + %pythoncode %{@property%} + PyObject *type() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + int type = pdf_annot_type(gctx, annot); + const char *c = pdf_string_from_annot_type(gctx, type); + pdf_obj *o = pdf_dict_gets(gctx, annot_obj, "IT"); + if (!o || !pdf_is_name(gctx, o)) + return Py_BuildValue("is", type, c); // no IT entry + const char *it = pdf_to_name(gctx, o); + return Py_BuildValue("iss", type, c, it); + } + + //---------------------------------------------------------------- + // annotation opacity + //---------------------------------------------------------------- + PARENTCHECK(opacity, """Opacity.""") + %pythoncode %{@property%} + PyObject *opacity() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + double opy = -1; + pdf_obj *ca = pdf_dict_get(gctx, annot_obj, PDF_NAME(CA)); + if (pdf_is_number(gctx, ca)) + opy = pdf_to_real(gctx, ca); + return Py_BuildValue("f", opy); + } + + //---------------------------------------------------------------- + // annotation set opacity + //---------------------------------------------------------------- + PARENTCHECK(set_opacity, """Set opacity.""") + void set_opacity(float opacity) + { + pdf_annot *annot = (pdf_annot *) $self; + if (!INRANGE(opacity, 0.0f, 1.0f)) + { + pdf_set_annot_opacity(gctx, annot, 1); + return; + } + pdf_set_annot_opacity(gctx, annot, opacity); + if (opacity < 1.0f) + { + pdf_page *page = pdf_annot_page(gctx, annot); + page->transparency = 1; + } + } + + + //---------------------------------------------------------------- + // annotation get attached file info + //---------------------------------------------------------------- + %pythoncode %{@property%} + FITZEXCEPTION(file_info, !result) + PARENTCHECK(file_info, """Attached file information.""") + PyObject *file_info() 
+ { + PyObject *res = PyDict_New(); // create Python dict + char *filename = NULL; + char *desc = NULL; + int length = -1, size = -1; + pdf_obj *stream = NULL, *o = NULL, *fs = NULL; + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + fz_try(gctx) { + int type = (int) pdf_annot_type(gctx, annot); + if (type != PDF_ANNOT_FILE_ATTACHMENT) { + RAISEPY(gctx, MSG_BAD_ANNOT_TYPE, PyExc_TypeError); + } + stream = pdf_dict_getl(gctx, annot_obj, PDF_NAME(FS), + PDF_NAME(EF), PDF_NAME(F), NULL); + if (!stream) { + RAISEPY(gctx, "bad PDF: file entry not found", JM_Exc_FileDataError); + } + } + fz_catch(gctx) { + return NULL; + } + + fs = pdf_dict_get(gctx, annot_obj, PDF_NAME(FS)); + + o = pdf_dict_get(gctx, fs, PDF_NAME(UF)); + if (o) { + filename = (char *) pdf_to_text_string(gctx, o); + } else { + o = pdf_dict_get(gctx, fs, PDF_NAME(F)); + if (o) filename = (char *) pdf_to_text_string(gctx, o); + } + + o = pdf_dict_get(gctx, fs, PDF_NAME(Desc)); + if (o) desc = (char *) pdf_to_text_string(gctx, o); + + o = pdf_dict_get(gctx, stream, PDF_NAME(Length)); + if (o) length = pdf_to_int(gctx, o); + + o = pdf_dict_getl(gctx, stream, PDF_NAME(Params), + PDF_NAME(Size), NULL); + if (o) size = pdf_to_int(gctx, o); + + DICT_SETITEM_DROP(res, dictkey_filename, JM_EscapeStrFromStr(filename)); + DICT_SETITEM_DROP(res, dictkey_desc, JM_UnicodeFromStr(desc)); + DICT_SETITEM_DROP(res, dictkey_length, Py_BuildValue("i", length)); + DICT_SETITEM_DROP(res, dictkey_size, Py_BuildValue("i", size)); + return res; + } + + + //---------------------------------------------------------------- + // annotation get attached file content + //---------------------------------------------------------------- + FITZEXCEPTION(get_file, !result) + PARENTCHECK(get_file, """Retrieve attached file content.""") + PyObject * + get_file() + { + PyObject *res = NULL; + pdf_obj *stream = NULL; + fz_buffer *buf = NULL; + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + fz_var(buf); + fz_try(gctx) { + int type = (int) pdf_annot_type(gctx, annot); + if (type != PDF_ANNOT_FILE_ATTACHMENT) { + RAISEPY(gctx, MSG_BAD_ANNOT_TYPE, PyExc_TypeError); + } + stream = pdf_dict_getl(gctx, annot_obj, PDF_NAME(FS), + PDF_NAME(EF), PDF_NAME(F), NULL); + if (!stream) { + RAISEPY(gctx, "bad PDF: file entry not found", JM_Exc_FileDataError); + } + buf = pdf_load_stream(gctx, stream); + res = JM_BinFromBuffer(gctx, buf); + } + fz_always(gctx) { + fz_drop_buffer(gctx, buf); + } + fz_catch(gctx) { + return NULL; + } + return res; + } + + + //---------------------------------------------------------------- + // annotation get attached sound stream + //---------------------------------------------------------------- + FITZEXCEPTION(get_sound, !result) + PARENTCHECK(get_sound, """Retrieve sound stream.""") + PyObject * + get_sound() + { + PyObject *res = NULL; + PyObject *stream = NULL; + fz_buffer *buf = NULL; + pdf_obj *obj = NULL; + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + fz_var(buf); + fz_try(gctx) { + int type = (int) pdf_annot_type(gctx, annot); + pdf_obj *sound = pdf_dict_get(gctx, annot_obj, PDF_NAME(Sound)); + if (type != PDF_ANNOT_SOUND || !sound) { + RAISEPY(gctx, MSG_BAD_ANNOT_TYPE, PyExc_TypeError); + } + if (pdf_dict_get(gctx, sound, PDF_NAME(F))) { + RAISEPY(gctx, "unsupported sound stream", JM_Exc_FileDataError); + } + res = PyDict_New(); + obj = pdf_dict_get(gctx, sound, PDF_NAME(R)); + if (obj) { + 
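// The keys read from the /Sound dictionary below are copied into the result dict:
// /R (sampling rate in Hz), /C (channel count), /B (bits per sample),
// /E (encoding name), /CO (compression); the decoded stream bytes are returned under "stream".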
DICT_SETITEMSTR_DROP(res, "rate", + Py_BuildValue("f", pdf_to_real(gctx, obj))); + } + obj = pdf_dict_get(gctx, sound, PDF_NAME(C)); + if (obj) { + DICT_SETITEMSTR_DROP(res, "channels", + Py_BuildValue("i", pdf_to_int(gctx, obj))); + } + obj = pdf_dict_get(gctx, sound, PDF_NAME(B)); + if (obj) { + DICT_SETITEMSTR_DROP(res, "bps", + Py_BuildValue("i", pdf_to_int(gctx, obj))); + } + obj = pdf_dict_get(gctx, sound, PDF_NAME(E)); + if (obj) { + DICT_SETITEMSTR_DROP(res, "encoding", + Py_BuildValue("s", pdf_to_name(gctx, obj))); + } + obj = pdf_dict_gets(gctx, sound, "CO"); + if (obj) { + DICT_SETITEMSTR_DROP(res, "compression", + Py_BuildValue("s", pdf_to_name(gctx, obj))); + } + buf = pdf_load_stream(gctx, sound); + stream = JM_BinFromBuffer(gctx, buf); + DICT_SETITEMSTR_DROP(res, "stream", stream); + } + fz_always(gctx) { + fz_drop_buffer(gctx, buf); + } + fz_catch(gctx) { + Py_CLEAR(res); + return NULL; + } + return res; + } + + + //---------------------------------------------------------------- + // annotation update attached file + //---------------------------------------------------------------- + FITZEXCEPTION(update_file, !result) + %pythonprepend update_file +%{"""Update attached file.""" +CheckParent(self)%} + + PyObject * + update_file(PyObject *buffer=NULL, char *filename=NULL, char *ufilename=NULL, char *desc=NULL) + { + pdf_document *pdf = NULL; // to be filled in + fz_buffer *res = NULL; // for compressed content + pdf_obj *stream = NULL, *fs = NULL; + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + fz_try(gctx) { + pdf = pdf_get_bound_document(gctx, annot_obj); // the owning PDF + int type = (int) pdf_annot_type(gctx, annot); + if (type != PDF_ANNOT_FILE_ATTACHMENT) { + RAISEPY(gctx, MSG_BAD_ANNOT_TYPE, PyExc_TypeError); + } + stream = pdf_dict_getl(gctx, annot_obj, PDF_NAME(FS), + PDF_NAME(EF), PDF_NAME(F), NULL); + // the object for file content + if (!stream) { + RAISEPY(gctx, "bad PDF: no /EF object", JM_Exc_FileDataError); + } + + fs = pdf_dict_get(gctx, annot_obj, PDF_NAME(FS)); + + // file content given + res = JM_BufferFromBytes(gctx, buffer); + if (buffer && !res) { + RAISEPY(gctx, MSG_BAD_BUFFER, PyExc_ValueError); + } + if (res) { + JM_update_stream(gctx, pdf, stream, res, 1); + // adjust /DL and /Size parameters + int64_t len = (int64_t) fz_buffer_storage(gctx, res, NULL); + pdf_obj *l = pdf_new_int(gctx, len); + pdf_dict_put(gctx, stream, PDF_NAME(DL), l); + pdf_dict_putl(gctx, stream, l, PDF_NAME(Params), PDF_NAME(Size), NULL); + } + + if (filename) { + pdf_dict_put_text_string(gctx, stream, PDF_NAME(F), filename); + pdf_dict_put_text_string(gctx, fs, PDF_NAME(F), filename); + pdf_dict_put_text_string(gctx, stream, PDF_NAME(UF), filename); + pdf_dict_put_text_string(gctx, fs, PDF_NAME(UF), filename); + pdf_dict_put_text_string(gctx, annot_obj, PDF_NAME(Contents), filename); + } + + if (ufilename) { + pdf_dict_put_text_string(gctx, stream, PDF_NAME(UF), ufilename); + pdf_dict_put_text_string(gctx, fs, PDF_NAME(UF), ufilename); + } + + if (desc) { + pdf_dict_put_text_string(gctx, stream, PDF_NAME(Desc), desc); + pdf_dict_put_text_string(gctx, fs, PDF_NAME(Desc), desc); + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + return NULL; + } + + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation info + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(info, """Various information 
details.""") + PyObject *info() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + PyObject *res = PyDict_New(); + pdf_obj *o; + + DICT_SETITEM_DROP(res, dictkey_content, + JM_UnicodeFromStr(pdf_annot_contents(gctx, annot))); + + o = pdf_dict_get(gctx, annot_obj, PDF_NAME(Name)); + DICT_SETITEM_DROP(res, dictkey_name, JM_UnicodeFromStr(pdf_to_name(gctx, o))); + + // Title (= author) + o = pdf_dict_get(gctx, annot_obj, PDF_NAME(T)); + DICT_SETITEM_DROP(res, dictkey_title, JM_UnicodeFromStr(pdf_to_text_string(gctx, o))); + + // CreationDate + o = pdf_dict_gets(gctx, annot_obj, "CreationDate"); + DICT_SETITEM_DROP(res, dictkey_creationDate, + JM_UnicodeFromStr(pdf_to_text_string(gctx, o))); + + // ModDate + o = pdf_dict_get(gctx, annot_obj, PDF_NAME(M)); + DICT_SETITEM_DROP(res, dictkey_modDate, JM_UnicodeFromStr(pdf_to_text_string(gctx, o))); + + // Subj + o = pdf_dict_gets(gctx, annot_obj, "Subj"); + DICT_SETITEM_DROP(res, dictkey_subject, + Py_BuildValue("s",pdf_to_text_string(gctx, o))); + + // Identification (PDF key /NM) + o = pdf_dict_gets(gctx, annot_obj, "NM"); + DICT_SETITEM_DROP(res, dictkey_id, + JM_UnicodeFromStr(pdf_to_text_string(gctx, o))); + + return res; + } + + //---------------------------------------------------------------- + // annotation set information + //---------------------------------------------------------------- + FITZEXCEPTION(set_info, !result) + %pythonprepend set_info %{ + """Set various properties.""" + CheckParent(self) + if type(info) is dict: # build the args from the dictionary + content = info.get("content", None) + title = info.get("title", None) + creationDate = info.get("creationDate", None) + modDate = info.get("modDate", None) + subject = info.get("subject", None) + info = None + %} + PyObject * + set_info(PyObject *info=NULL, char *content=NULL, char *title=NULL, + char *creationDate=NULL, char *modDate=NULL, char *subject=NULL) + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + // use this to indicate a 'markup' annot type + int is_markup = pdf_annot_has_author(gctx, annot); + fz_try(gctx) { + // contents + if (content) + pdf_set_annot_contents(gctx, annot, content); + + if (is_markup) { + // title (= author) + if (title) + pdf_set_annot_author(gctx, annot, title); + + // creation date + if (creationDate) + pdf_dict_put_text_string(gctx, annot_obj, + PDF_NAME(CreationDate), creationDate); + + // mod date + if (modDate) + pdf_dict_put_text_string(gctx, annot_obj, + PDF_NAME(M), modDate); + + // subject + if (subject) + pdf_dict_puts_drop(gctx, annot_obj, "Subj", + pdf_new_text_string(gctx, subject)); + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // annotation border + //---------------------------------------------------------------- + %pythoncode %{@property%} + %pythonprepend border %{ + """Border information.""" + CheckParent(self) + atype = self.type[0] + if atype not in (PDF_ANNOT_CIRCLE, PDF_ANNOT_FREE_TEXT, PDF_ANNOT_INK, PDF_ANNOT_LINE, PDF_ANNOT_POLY_LINE,PDF_ANNOT_POLYGON, PDF_ANNOT_SQUARE): + return {} + %} + PyObject *border() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + return JM_annot_border(gctx, annot_obj); + } + + //---------------------------------------------------------------- + // set annotation border + //---------------------------------------------------------------- + 
%pythonprepend set_border %{ + """Set border properties. + + Either a dict, or direct arguments width, style, dashes or clouds.""" + + CheckParent(self) + atype, atname = self.type[:2] # annotation type + if atype not in (PDF_ANNOT_CIRCLE, PDF_ANNOT_FREE_TEXT, PDF_ANNOT_INK, PDF_ANNOT_LINE, PDF_ANNOT_POLY_LINE,PDF_ANNOT_POLYGON, PDF_ANNOT_SQUARE): + print(f"Cannot set border for '{atname}'.") + return None + if not atype in (PDF_ANNOT_CIRCLE, PDF_ANNOT_FREE_TEXT,PDF_ANNOT_POLYGON, PDF_ANNOT_SQUARE): + if clouds > 0: + print(f"Cannot set cloudy border for '{atname}'.") + clouds = -1 # do not set border effect + if type(border) is not dict: + border = {"width": width, "style": style, "dashes": dashes, "clouds": clouds} + border.setdefault("width", -1) + border.setdefault("style", None) + border.setdefault("dashes", None) + border.setdefault("clouds", -1) + if border["width"] == None: + border["width"] = -1 + if border["clouds"] == None: + border["clouds"] = -1 + if hasattr(border["dashes"], "__getitem__"): # ensure sequence items are integers + border["dashes"] = tuple(border["dashes"]) + for item in border["dashes"]: + if not isinstance(item, int): + border["dashes"] = None + break + %} + PyObject * + set_border(PyObject *border=NULL, float width=-1, char *style=NULL, PyObject *dashes=NULL, int clouds=-1) + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_document *pdf = pdf_get_bound_document(gctx, annot_obj); + return JM_annot_set_border(gctx, border, pdf, annot_obj); + } + + + //---------------------------------------------------------------- + // annotation flags + //---------------------------------------------------------------- + %pythoncode %{@property%} + PARENTCHECK(flags, """Flags field.""") + int flags() + { + pdf_annot *annot = (pdf_annot *) $self; + return pdf_annot_flags(gctx, annot); + } + + //---------------------------------------------------------------- + // annotation clean contents + //---------------------------------------------------------------- + FITZEXCEPTION(clean_contents, !result) + PARENTCHECK(clean_contents, """Clean appearance contents stream.""") + PyObject *clean_contents(int sanitize=1) + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_document *pdf = pdf_get_bound_document(gctx, pdf_annot_obj(gctx, annot)); + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR >= 22 + pdf_filter_factory list[2] = { 0 }; + pdf_sanitize_filter_options sopts = { 0 }; + pdf_filter_options filter = { + 1, // recurse: true + 0, // instance forms + 0, // do not ascii-escape binary data + 0, // no_update + NULL, // end_page_opaque + NULL, // end page + list, // filters + }; + if (sanitize) { + list[0].filter = pdf_new_sanitize_filter; + list[0].options = &sopts; + } + #else + pdf_filter_options filter = { + NULL, // opaque + NULL, // image filter + NULL, // text filter + NULL, // after text + NULL, // end page + 1, // recurse: true + 1, // instance forms + 1, // sanitize, + 0 // do not ascii-escape binary data + }; + filter.sanitize = sanitize; + #endif + fz_try(gctx) { + pdf_filter_annot_contents(gctx, pdf, annot, &filter); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + //---------------------------------------------------------------- + // set annotation flags + //---------------------------------------------------------------- + PARENTCHECK(set_flags, """Set annotation flags.""") + void + set_flags(int flags) + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_set_annot_flags(gctx, annot, flags); + } + + 
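To illustrate the flags / set_flags pair and clean_contents above, a short sketch; it assumes a PDF that already contains annotations, and that the PDF_ANNOT_IS_* bit constants are exposed by the generated module (the file names are made up):

    import pymupdf

    doc = pymupdf.open("annotated.pdf")        # hypothetical sample file
    for annot in doc[0].annots():
        flags = annot.flags                    # integer bit field
        annot.set_flags(flags | pymupdf.PDF_ANNOT_IS_PRINT)  # ensure the annotation prints
        annot.clean_contents()                 # sanitize its appearance stream
    doc.save("annotated-print.pdf")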
+ //---------------------------------------------------------------- + // annotation delete responses + //---------------------------------------------------------------- + FITZEXCEPTION(delete_responses, !result) + PARENTCHECK(delete_responses, """Delete 'Popup' and responding annotations.""") + PyObject * + delete_responses() + { + pdf_annot *annot = (pdf_annot *) $self; + pdf_obj *annot_obj = pdf_annot_obj(gctx, annot); + pdf_page *page = pdf_annot_page(gctx, annot); + pdf_annot *irt_annot = NULL; + fz_try(gctx) { + while (1) { + irt_annot = JM_find_annot_irt(gctx, annot); + if (!irt_annot) + break; + pdf_delete_annot(gctx, page, irt_annot); + } + pdf_dict_del(gctx, annot_obj, PDF_NAME(Popup)); + + pdf_obj *annots = pdf_dict_get(gctx, page->obj, PDF_NAME(Annots)); + int i, n = pdf_array_len(gctx, annots), found = 0; + for (i = n - 1; i >= 0; i--) { + pdf_obj *o = pdf_array_get(gctx, annots, i); + pdf_obj *p = pdf_dict_get(gctx, o, PDF_NAME(Parent)); + if (!p) + continue; + if (!pdf_objcmp(gctx, p, annot_obj)) { + pdf_array_delete(gctx, annots, i); + found = 1; + } + } + if (found > 0) { + pdf_dict_put(gctx, page->obj, PDF_NAME(Annots), annots); + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------------------------------------- + // next annotation + //---------------------------------------------------------------- + PARENTCHECK(next, """Next annotation.""") + %pythonappend next %{ + if not val: + return None + val.thisown = True + val.parent = self.parent # copy owning page object from previous annot + val.parent._annot_refs[id(val)] = val + + if val.type[0] == PDF_ANNOT_WIDGET: + widget = Widget() + TOOLS._fill_widget(val, widget) + val = widget + %} + %pythoncode %{@property%} + struct Annot *next() + { + pdf_annot *this_annot = (pdf_annot *) $self; + int type = pdf_annot_type(gctx, this_annot); + pdf_annot *annot; + + if (type != PDF_ANNOT_WIDGET) { + annot = pdf_next_annot(gctx, this_annot); + } else { + annot = pdf_next_widget(gctx, this_annot); + } + + if (annot) + pdf_keep_annot(gctx, annot); + return (struct Annot *) annot; + } + + + //---------------------------------------------------------------- + // annotation pixmap + //---------------------------------------------------------------- + FITZEXCEPTION(get_pixmap, !result) + %pythonprepend get_pixmap +%{"""annotation Pixmap""" + +CheckParent(self) +cspaces = {"gray": csGRAY, "rgb": csRGB, "cmyk": csCMYK} +if type(colorspace) is str: + colorspace = cspaces.get(colorspace.lower(), None) +if dpi: + matrix = Matrix(dpi / 72, dpi / 72) +%} + %pythonappend get_pixmap +%{ + val.thisown = True + if dpi: + val.set_dpi(dpi, dpi) +%} + struct Pixmap * + get_pixmap(PyObject *matrix = NULL, PyObject *dpi=NULL, struct Colorspace *colorspace = NULL, int alpha = 0) + { + fz_matrix ctm = JM_matrix_from_py(matrix); + fz_colorspace *cs = (fz_colorspace *) colorspace; + fz_pixmap *pix = NULL; + if (!cs) { + cs = fz_device_rgb(gctx); + } + + fz_try(gctx) { + pix = pdf_new_pixmap_from_annot(gctx, (pdf_annot *) $self, ctm, cs, NULL, alpha); + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) pix; + } + %pythoncode %{ + def _erase(self): + self.__swig_destroy__(self) + self.parent = None + + def __str__(self): + CheckParent(self) + return "'%s' annotation on %s" % (self.type[1], str(self.parent)) + + def __repr__(self): + CheckParent(self) + return "'%s' annotation on %s" % (self.type[1], str(self.parent)) + + def __del__(self): + if self.parent is None: + return + 
self._erase()%} + } +}; +%clearnodefaultctor; + +//------------------------------------------------------------------------ +// fz_link +//------------------------------------------------------------------------ +%nodefaultctor; +struct Link +{ + %immutable; + %extend { + ~Link() { + DEBUGMSG1("Link"); + fz_link *this_link = (fz_link *) $self; + fz_drop_link(gctx, this_link); + DEBUGMSG2; + } + + PyObject *_border(struct Document *doc, int xref) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) doc); + if (!pdf) Py_RETURN_NONE; + pdf_obj *link_obj = pdf_new_indirect(gctx, pdf, xref, 0); + if (!link_obj) Py_RETURN_NONE; + PyObject *b = JM_annot_border(gctx, link_obj); + pdf_drop_obj(gctx, link_obj); + return b; + } + + PyObject *_setBorder(PyObject *border, struct Document *doc, int xref) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) doc); + if (!pdf) Py_RETURN_NONE; + pdf_obj *link_obj = pdf_new_indirect(gctx, pdf, xref, 0); + if (!link_obj) Py_RETURN_NONE; + PyObject *b = JM_annot_set_border(gctx, border, pdf, link_obj); + pdf_drop_obj(gctx, link_obj); + return b; + } + + FITZEXCEPTION(_colors, !result) + PyObject *_colors(struct Document *doc, int xref) + { + pdf_document *pdf = pdf_specifics(gctx, (fz_document *) doc); + if (!pdf) Py_RETURN_NONE; + PyObject *b = NULL; + pdf_obj *link_obj; + fz_try(gctx) { + link_obj = pdf_new_indirect(gctx, pdf, xref, 0); + if (!link_obj) { + RAISEPY(gctx, MSG_BAD_XREF, PyExc_ValueError); + } + b = JM_annot_colors(gctx, link_obj); + } + fz_always(gctx) { + pdf_drop_obj(gctx, link_obj); + } + fz_catch(gctx) { + return NULL; + } + return b; + } + + + %pythoncode %{ + @property + def border(self): + return self._border(self.parent.parent.this, self.xref) + + @property + def flags(self)->int: + CheckParent(self) + doc = self.parent.parent + if not doc.is_pdf: + return 0 + f = doc.xref_get_key(self.xref, "F") + if f[1] != "null": + return int(f[1]) + return 0 + + def set_flags(self, flags): + CheckParent(self) + doc = self.parent.parent + if not doc.is_pdf: + raise ValueError("is no PDF") + if not type(flags) is int: + raise ValueError("bad 'flags' value") + doc.xref_set_key(self.xref, "F", str(flags)) + return None + + def set_border(self, border=None, width=0, dashes=None, style=None): + if type(border) is not dict: + border = {"width": width, "style": style, "dashes": dashes} + return self._setBorder(border, self.parent.parent.this, self.xref) + + @property + def colors(self): + return self._colors(self.parent.parent.this, self.xref) + + def set_colors(self, colors=None, stroke=None, fill=None): + """Set border colors.""" + CheckParent(self) + doc = self.parent.parent + if type(colors) is not dict: + colors = {"fill": fill, "stroke": stroke} + fill = colors.get("fill") + stroke = colors.get("stroke") + if fill is not None: + print("warning: links have no fill color") + if stroke in ([], ()): + doc.xref_set_key(self.xref, "C", "[]") + return + if hasattr(stroke, "__float__"): + stroke = [float(stroke)] + CheckColor(stroke) + if len(stroke) == 1: + s = "[%g]" % stroke[0] + elif len(stroke) == 3: + s = "[%g %g %g]" % tuple(stroke) + else: + s = "[%g %g %g %g]" % tuple(stroke) + doc.xref_set_key(self.xref, "C", s) + %} + %pythoncode %{@property%} + PARENTCHECK(uri, """Uri string.""") + PyObject *uri() + { + fz_link *this_link = (fz_link *) $self; + return JM_UnicodeFromStr(this_link->uri); + } + + %pythoncode %{@property%} + PARENTCHECK(is_external, """Flag the link as external.""") + PyObject *is_external() + { + fz_link 
*this_link = (fz_link *) $self; + if (!this_link->uri) Py_RETURN_FALSE; + return JM_BOOL(fz_is_external_link(gctx, this_link->uri)); + } + + %pythoncode + %{ + page = -1 + @property + def dest(self): + """Create link destination details.""" + if hasattr(self, "parent") and self.parent is None: + raise ValueError("orphaned object: parent is None") + if self.parent.parent.is_closed or self.parent.parent.is_encrypted: + raise ValueError("document closed or encrypted") + doc = self.parent.parent + + if self.is_external or self.uri.startswith("#"): + uri = None + else: + uri = doc.resolve_link(self.uri) + + return linkDest(self, uri) + %} + + PARENTCHECK(rect, """Rectangle ('hot area').""") + %pythoncode %{@property%} + %pythonappend rect %{val = Rect(val)%} + PyObject *rect() + { + fz_link *this_link = (fz_link *) $self; + return JM_py_from_rect(this_link->rect); + } + + //---------------------------------------------------------------- + // next link + //---------------------------------------------------------------- + // we need to increase the link refs number + // so that it will not be freed when the head is dropped + PARENTCHECK(next, """Next link.""") + %pythonappend next %{ + if val: + val.thisown = True + val.parent = self.parent # copy owning page from prev link + val.parent._annot_refs[id(val)] = val + if self.xref > 0: # prev link has an xref + link_xrefs = [x[0] for x in self.parent.annot_xrefs() if x[1] == PDF_ANNOT_LINK] + link_ids = [x[2] for x in self.parent.annot_xrefs() if x[1] == PDF_ANNOT_LINK] + idx = link_xrefs.index(self.xref) + val.xref = link_xrefs[idx + 1] + val.id = link_ids[idx + 1] + else: + val.xref = 0 + val.id = "" + %} + %pythoncode %{@property%} + struct Link *next() + { + fz_link *this_link = (fz_link *) $self; + fz_link *next_link = this_link->next; + if (!next_link) return NULL; + next_link = fz_keep_link(gctx, next_link); + return (struct Link *) next_link; + } + + %pythoncode %{ + def _erase(self): + self.__swig_destroy__(self) + self.parent = None + + def __str__(self): + CheckParent(self) + return "link on " + str(self.parent) + + def __repr__(self): + CheckParent(self) + return "link on " + str(self.parent) + + def __del__(self): + self._erase()%} + } +}; +%clearnodefaultctor; + +//------------------------------------------------------------------------ +// fz_display_list +//------------------------------------------------------------------------ +struct DisplayList { + %extend + { + ~DisplayList() { + DEBUGMSG1("DisplayList"); + fz_display_list *this_dl = (fz_display_list *) $self; + fz_drop_display_list(gctx, this_dl); + DEBUGMSG2; + } + FITZEXCEPTION(DisplayList, !result) + DisplayList(PyObject *mediabox) + { + fz_display_list *dl = NULL; + fz_try(gctx) { + dl = fz_new_display_list(gctx, JM_rect_from_py(mediabox)); + } + fz_catch(gctx) { + return NULL; + } + return (struct DisplayList *) dl; + } + + FITZEXCEPTION(run, !result) + PyObject *run(struct DeviceWrapper *dw, PyObject *m, PyObject *area) { + fz_try(gctx) { + fz_run_display_list(gctx, (fz_display_list *) $self, dw->device, + JM_matrix_from_py(m), JM_rect_from_py(area), NULL); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------------------------------------- + // DisplayList.rect + //---------------------------------------------------------------- + %pythoncode%{@property%} + %pythonappend rect %{val = Rect(val)%} + PyObject *rect() + { + return JM_py_from_rect(fz_bound_display_list(gctx, (fz_display_list *) $self)); + } + + 
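A sketch of the typical DisplayList round trip, using the get_pixmap and get_textpage methods defined just below; Page.get_displaylist comes from the Page section of this interface, and the file name is illustrative:

    import pymupdf

    doc = pymupdf.open("input.pdf")                    # hypothetical sample file
    dl = doc[0].get_displaylist()                      # interpret the page once
    print(dl.rect)                                     # recorded media box
    pix = dl.get_pixmap(matrix=pymupdf.Matrix(2, 2))   # render from the recording
    tp = dl.get_textpage()                             # text extraction from the same recording
    print(tp.extractText()[:80])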
//---------------------------------------------------------------- + // DisplayList.get_pixmap + //---------------------------------------------------------------- + FITZEXCEPTION(get_pixmap, !result) + %pythonappend get_pixmap %{val.thisown = True%} + struct Pixmap *get_pixmap(PyObject *matrix=NULL, + struct Colorspace *colorspace=NULL, + int alpha=0, + PyObject *clip=NULL) + { + fz_colorspace *cs = NULL; + fz_pixmap *pix = NULL; + + if (colorspace) cs = (fz_colorspace *) colorspace; + else cs = fz_device_rgb(gctx); + + fz_try(gctx) { + pix = JM_pixmap_from_display_list(gctx, + (fz_display_list *) $self, matrix, cs, + alpha, clip, NULL); + } + fz_catch(gctx) { + return NULL; + } + return (struct Pixmap *) pix; + } + + //---------------------------------------------------------------- + // DisplayList.get_textpage + //---------------------------------------------------------------- + FITZEXCEPTION(get_textpage, !result) + %pythonappend get_textpage %{val.thisown = True%} + struct TextPage *get_textpage(int flags = 3) + { + fz_display_list *this_dl = (fz_display_list *) $self; + fz_stext_page *tp = NULL; + fz_try(gctx) { + fz_stext_options stext_options = { 0 }; + stext_options.flags = flags; + tp = fz_new_stext_page_from_display_list(gctx, this_dl, &stext_options); + } + fz_catch(gctx) { + return NULL; + } + return (struct TextPage *) tp; + } + %pythoncode %{ + def __del__(self): + if not type(self) is DisplayList: + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + %} + } +}; + +//------------------------------------------------------------------------ +// fz_stext_page +//------------------------------------------------------------------------ +struct TextPage { + %extend { + ~TextPage() + { + DEBUGMSG1("TextPage"); + fz_stext_page *this_tp = (fz_stext_page *) $self; + fz_drop_stext_page(gctx, this_tp); + DEBUGMSG2; + } + + FITZEXCEPTION(TextPage, !result) + %pythonappend TextPage %{self.thisown=True%} + TextPage(PyObject *mediabox) + { + fz_stext_page *tp = NULL; + fz_try(gctx) { + tp = fz_new_stext_page(gctx, JM_rect_from_py(mediabox)); + } + fz_catch(gctx) { + return NULL; + } + return (struct TextPage *) tp; + } + + //---------------------------------------------------------------- + // method search() + //---------------------------------------------------------------- + FITZEXCEPTION(search, !result) + %pythonprepend search + %{"""Locate 'needle' returning rects or quads."""%} + %pythonappend search %{ + if not val: + return val + items = len(val) + for i in range(items): # change entries to quads or rects + q = Quad(val[i]) + if quads: + val[i] = q + else: + val[i] = q.rect + if quads: + return val + i = 0 # join overlapping rects on the same line + while i < items - 1: + v1 = val[i] + v2 = val[i + 1] + if v1.y1 != v2.y1 or (v1 & v2).is_empty: + i += 1 + continue # no overlap on same line + val[i] = v1 | v2 # join rectangles + del val[i + 1] # remove v2 + items -= 1 # reduce item count + %} + PyObject *search(const char *needle, int hit_max=0, int quads=1) + { + PyObject *liste = NULL; + fz_try(gctx) { + liste = JM_search_stext_page(gctx, (fz_stext_page *) $self, needle); + } + fz_catch(gctx) { + return NULL; + } + return liste; + } + + + //---------------------------------------------------------------- + // Get list of all blocks with block type and bbox as a Python list + //---------------------------------------------------------------- + FITZEXCEPTION(_getNewBlockList, !result) + PyObject * + _getNewBlockList(PyObject *page_dict, int raw) + { + 
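// JM_make_textpage_dict fills 'page_dict' in place with a "blocks" list describing the
// page's text and image blocks; a non-zero 'raw' requests per-character detail
// (used by extractRAWDICT) instead of span-level detail only.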
fz_try(gctx) { + JM_make_textpage_dict(gctx, (fz_stext_page *) $self, page_dict, raw); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + %pythoncode %{ + def _textpage_dict(self, raw=False): + page_dict = {"width": self.rect.width, "height": self.rect.height} + self._getNewBlockList(page_dict, raw) + return page_dict + %} + + + //---------------------------------------------------------------- + // Get image meta information as a Python dictionary + //---------------------------------------------------------------- + FITZEXCEPTION(extractIMGINFO, !result) + %pythonprepend extractIMGINFO + %{"""Return a list with image meta information."""%} + PyObject * + extractIMGINFO(int hashes=0) + { + fz_stext_block *block; + int block_n = -1; + fz_stext_page *this_tpage = (fz_stext_page *) $self; + PyObject *rc = NULL, *block_dict = NULL; + fz_pixmap *pix = NULL; + fz_try(gctx) { + rc = PyList_New(0); + for (block = this_tpage->first_block; block; block = block->next) { + block_n++; + if (block->type == FZ_STEXT_BLOCK_TEXT) { + continue; + } + unsigned char digest[16]; + fz_image *img = block->u.i.image; + Py_ssize_t img_size = 0; + fz_compressed_buffer *cbuff = fz_compressed_image_buffer(gctx, img); + if (cbuff) { + img_size = (Py_ssize_t) cbuff->buffer->len; + } + if (hashes) { + pix = fz_get_pixmap_from_image(gctx, img, NULL, NULL, NULL, NULL); + if (img_size == 0) { + img_size = (Py_ssize_t) pix->w * pix->h * pix->n; + } + fz_md5_pixmap(gctx, pix, digest); + fz_drop_pixmap(gctx, pix); + pix = NULL; + } + fz_colorspace *cs = img->colorspace; + block_dict = PyDict_New(); + DICT_SETITEM_DROP(block_dict, dictkey_number, Py_BuildValue("i", block_n)); + DICT_SETITEM_DROP(block_dict, dictkey_bbox, + JM_py_from_rect(block->bbox)); + DICT_SETITEM_DROP(block_dict, dictkey_matrix, + JM_py_from_matrix(block->u.i.transform)); + DICT_SETITEM_DROP(block_dict, dictkey_width, + Py_BuildValue("i", img->w)); + DICT_SETITEM_DROP(block_dict, dictkey_height, + Py_BuildValue("i", img->h)); + DICT_SETITEM_DROP(block_dict, dictkey_colorspace, + Py_BuildValue("i", + fz_colorspace_n(gctx, cs))); + DICT_SETITEM_DROP(block_dict, dictkey_cs_name, + Py_BuildValue("s", + fz_colorspace_name(gctx, cs))); + DICT_SETITEM_DROP(block_dict, dictkey_xres, + Py_BuildValue("i", img->xres)); + DICT_SETITEM_DROP(block_dict, dictkey_yres, + Py_BuildValue("i", img->yres)); + DICT_SETITEM_DROP(block_dict, dictkey_bpc, + Py_BuildValue("i", (int) img->bpc)); + DICT_SETITEM_DROP(block_dict, dictkey_size, + Py_BuildValue("n", img_size)); + if (hashes) { + DICT_SETITEMSTR_DROP(block_dict, "digest", + PyBytes_FromStringAndSize(digest, 16)); + } + LIST_APPEND_DROP(rc, block_dict); + } + } + fz_always(gctx) { + } + fz_catch(gctx) { + Py_CLEAR(rc); + Py_CLEAR(block_dict); + fz_drop_pixmap(gctx, pix); + return NULL; + } + return rc; + } + + + //---------------------------------------------------------------- + // Get text blocks with their bbox and concatenated lines + // as a Python list + //---------------------------------------------------------------- + FITZEXCEPTION(extractBLOCKS, !result) + %pythonprepend extractBLOCKS + %{"""Return a list with text block information."""%} + PyObject * + extractBLOCKS() + { + fz_stext_block *block; + fz_stext_line *line; + fz_stext_char *ch; + int block_n = -1; + PyObject *text = NULL, *litem; + fz_buffer *res = NULL; + fz_var(res); + fz_stext_page *this_tpage = (fz_stext_page *) $self; + fz_rect tp_rect = this_tpage->mediabox; + PyObject *lines = NULL; + fz_try(gctx) { + res = 
fz_new_buffer(gctx, 1024); + lines = PyList_New(0); + for (block = this_tpage->first_block; block; block = block->next) { + block_n++; + fz_rect blockrect = fz_empty_rect; + if (block->type == FZ_STEXT_BLOCK_TEXT) { + fz_clear_buffer(gctx, res); // set text buffer to empty + int line_n = -1; + int last_char = 0; + for (line = block->u.t.first_line; line; line = line->next) { + line_n++; + fz_rect linerect = fz_empty_rect; + for (ch = line->first_char; ch; ch = ch->next) { + fz_rect cbbox = JM_char_bbox(gctx, line, ch); + if (!JM_rects_overlap(tp_rect, cbbox) && + !fz_is_infinite_rect(tp_rect)) { + continue; + } + JM_append_rune(gctx, res, ch->c); + last_char = ch->c; + linerect = fz_union_rect(linerect, cbbox); + } + if (last_char != 10 && !fz_is_empty_rect(linerect)) { + fz_append_byte(gctx, res, 10); + } + blockrect = fz_union_rect(blockrect, linerect); + } + text = JM_EscapeStrFromBuffer(gctx, res); + } else if (JM_rects_overlap(tp_rect, block->bbox) || fz_is_infinite_rect(tp_rect)) { + fz_image *img = block->u.i.image; + fz_colorspace *cs = img->colorspace; + text = PyUnicode_FromFormat("<image: %s, width: %d, height: %d, bpp: %d>", fz_colorspace_name(gctx, cs), img->w, img->h, img->bpc); + blockrect = fz_union_rect(blockrect, block->bbox); + } + if (!fz_is_empty_rect(blockrect)) { + litem = PyTuple_New(7); + PyTuple_SET_ITEM(litem, 0, Py_BuildValue("f", blockrect.x0)); + PyTuple_SET_ITEM(litem, 1, Py_BuildValue("f", blockrect.y0)); + PyTuple_SET_ITEM(litem, 2, Py_BuildValue("f", blockrect.x1)); + PyTuple_SET_ITEM(litem, 3, Py_BuildValue("f", blockrect.y1)); + PyTuple_SET_ITEM(litem, 4, Py_BuildValue("O", text)); + PyTuple_SET_ITEM(litem, 5, Py_BuildValue("i", block_n)); + PyTuple_SET_ITEM(litem, 6, Py_BuildValue("i", block->type)); + LIST_APPEND_DROP(lines, litem); + } + Py_CLEAR(text); + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + PyErr_Clear(); + } + fz_catch(gctx) { + Py_CLEAR(lines); + return NULL; + } + return lines; + } + + //---------------------------------------------------------------- + // Get text words with their bbox + //---------------------------------------------------------------- + FITZEXCEPTION(extractWORDS, !result) + %pythonprepend extractWORDS + %{"""Return a list with text word information."""%} + PyObject * + extractWORDS(PyObject *delimiters=NULL) + { + fz_stext_block *block; + fz_stext_line *line; + fz_stext_char *ch; + fz_buffer *buff = NULL; + fz_var(buff); + size_t buflen = 0; + int block_n = -1, line_n, word_n; + fz_rect wbbox = fz_empty_rect; // word bbox + fz_stext_page *this_tpage = (fz_stext_page *) $self; + fz_rect tp_rect = this_tpage->mediabox; + int word_delimiter = 0; + PyObject *lines = NULL; + fz_try(gctx) { + buff = fz_new_buffer(gctx, 64); + lines = PyList_New(0); + for (block = this_tpage->first_block; block; block = block->next) { + block_n++; + if (block->type != FZ_STEXT_BLOCK_TEXT) { + continue; + } + line_n = -1; + for (line = block->u.t.first_line; line; line = line->next) { + line_n++; + word_n = 0; // word counter per line + fz_clear_buffer(gctx, buff); // reset word buffer + buflen = 0; // reset char counter + for (ch = line->first_char; ch; ch = ch->next) { + fz_rect cbbox = JM_char_bbox(gctx, line, ch); + if (!JM_rects_overlap(tp_rect, cbbox) && + !fz_is_infinite_rect(tp_rect)) { + continue; + } + word_delimiter = JM_is_word_delimiter(ch->c, delimiters); + if (word_delimiter) { + if (buflen == 0) continue; // skip spaces at line start + if (!fz_is_empty_rect(wbbox)) { // output word + word_n = JM_append_word(gctx, lines, buff, &wbbox, + block_n, line_n, 
word_n); + } + fz_clear_buffer(gctx, buff); + buflen = 0; // reset char counter + continue; + } + // append one unicode character to the word + JM_append_rune(gctx, buff, ch->c); + buflen++; + // enlarge word bbox + wbbox = fz_union_rect(wbbox, JM_char_bbox(gctx, line, ch)); + } + if (buflen && !fz_is_empty_rect(wbbox)) { + word_n = JM_append_word(gctx, lines, buff, &wbbox, + block_n, line_n, word_n); + } + fz_clear_buffer(gctx, buff); + buflen = 0; + } + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, buff); + PyErr_Clear(); + } + fz_catch(gctx) { + return NULL; + } + return lines; + } + + //---------------------------------------------------------------- + // TextPage poolsize + //---------------------------------------------------------------- + %pythonprepend poolsize + %{"""TextPage current poolsize."""%} + PyObject *poolsize() + { + fz_stext_page *tpage = (fz_stext_page *) $self; + size_t size = fz_pool_size(gctx, tpage->pool); + return PyLong_FromSize_t(size); + } + + //---------------------------------------------------------------- + // TextPage rectangle + //---------------------------------------------------------------- + %pythoncode %{@property%} + %pythonprepend rect + %{"""TextPage rectangle."""%} + %pythonappend rect %{val = Rect(val)%} + PyObject *rect() + { + fz_stext_page *this_tpage = (fz_stext_page *) $self; + fz_rect mediabox = this_tpage->mediabox; + return JM_py_from_rect(mediabox); + } + + //---------------------------------------------------------------- + // method _extractText() + //---------------------------------------------------------------- + FITZEXCEPTION(_extractText, !result) + %newobject _extractText; + PyObject *_extractText(int format) + { + fz_buffer *res = NULL; + fz_output *out = NULL; + PyObject *text = NULL; + fz_var(res); + fz_var(out); + fz_stext_page *this_tpage = (fz_stext_page *) $self; + fz_try(gctx) { + res = fz_new_buffer(gctx, 1024); + out = fz_new_output_with_buffer(gctx, res); + switch(format) { + case(1): + fz_print_stext_page_as_html(gctx, out, this_tpage, 0); + break; + case(3): + fz_print_stext_page_as_xml(gctx, out, this_tpage, 0); + break; + case(4): + fz_print_stext_page_as_xhtml(gctx, out, this_tpage, 0); + break; + default: + JM_print_stext_page_as_text(gctx, res, this_tpage); + break; + } + text = JM_EscapeStrFromBuffer(gctx, res); + + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + fz_drop_output(gctx, out); + } + fz_catch(gctx) { + return NULL; + } + return text; + } + + + //---------------------------------------------------------------- + // method extractTextbox() + //---------------------------------------------------------------- + FITZEXCEPTION(extractTextbox, !result) + PyObject *extractTextbox(PyObject *rect) + { + fz_stext_page *this_tpage = (fz_stext_page *) $self; + fz_rect area = JM_rect_from_py(rect); + PyObject *rc = NULL; + fz_try(gctx) { + rc = JM_copy_rectangle(gctx, this_tpage, area); + } + fz_catch(gctx) { + return NULL; + } + return rc; + } + + //---------------------------------------------------------------- + // method extractSelection() + //---------------------------------------------------------------- + PyObject *extractSelection(PyObject *pointa, PyObject *pointb) + { + fz_stext_page *this_tpage = (fz_stext_page *) $self; + fz_point a = JM_point_from_py(pointa); + fz_point b = JM_point_from_py(pointb); + char *found = fz_copy_selection(gctx, this_tpage, a, b, 0); + PyObject *rc = NULL; + if (found) { + rc = PyUnicode_FromString(found); + JM_Free(found); + } else { + rc = EMPTY_STRING; + } 
+ return rc; + } + + %pythoncode %{ + def extractText(self, sort=False) -> str: + """Return simple, bare text on the page.""" + if sort is False: + return self._extractText(0) + blocks = self.extractBLOCKS()[:] + blocks.sort(key=lambda b: (b[3], b[0])) + return "".join([b[4] for b in blocks]) + + def extractHTML(self) -> str: + """Return page content as a HTML string.""" + return self._extractText(1) + + def extractJSON(self, cb=None, sort=False) -> str: + """Return 'extractDICT' converted to JSON format.""" + import base64, json + val = self._textpage_dict(raw=False) + + class b64encode(json.JSONEncoder): + def default(self, s): + if type(s) in (bytes, bytearray): + return base64.b64encode(s).decode() + + if cb is not None: + val["width"] = cb.width + val["height"] = cb.height + if sort is True: + blocks = val["blocks"] + blocks.sort(key=lambda b: (b["bbox"][3], b["bbox"][0])) + val["blocks"] = blocks + val = json.dumps(val, separators=(",", ":"), cls=b64encode, indent=1) + return val + + def extractRAWJSON(self, cb=None, sort=False) -> str: + """Return 'extractRAWDICT' converted to JSON format.""" + import base64, json + val = self._textpage_dict(raw=True) + + class b64encode(json.JSONEncoder): + def default(self,s): + if type(s) in (bytes, bytearray): + return base64.b64encode(s).decode() + + if cb is not None: + val["width"] = cb.width + val["height"] = cb.height + if sort is True: + blocks = val["blocks"] + blocks.sort(key=lambda b: (b["bbox"][3], b["bbox"][0])) + val["blocks"] = blocks + val = json.dumps(val, separators=(",", ":"), cls=b64encode, indent=1) + return val + + def extractXML(self) -> str: + """Return page content as a XML string.""" + return self._extractText(3) + + def extractXHTML(self) -> str: + """Return page content as a XHTML string.""" + return self._extractText(4) + + def extractDICT(self, cb=None, sort=False) -> dict: + """Return page content as a Python dict of images and text spans.""" + val = self._textpage_dict(raw=False) + if cb is not None: + val["width"] = cb.width + val["height"] = cb.height + if sort is True: + blocks = val["blocks"] + blocks.sort(key=lambda b: (b["bbox"][3], b["bbox"][0])) + val["blocks"] = blocks + return val + + def extractRAWDICT(self, cb=None, sort=False) -> dict: + """Return page content as a Python dict of images and text characters.""" + val = self._textpage_dict(raw=True) + if cb is not None: + val["width"] = cb.width + val["height"] = cb.height + if sort is True: + blocks = val["blocks"] + blocks.sort(key=lambda b: (b["bbox"][3], b["bbox"][0])) + val["blocks"] = blocks + return val + + def __del__(self): + if not type(self) is TextPage: + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + %} + } +}; + +//------------------------------------------------------------------------ +// Graftmap - only used internally for inter-PDF object copy operations +//------------------------------------------------------------------------ +struct Graftmap +{ + %extend + { + ~Graftmap() + { + DEBUGMSG1("Graftmap"); + pdf_graft_map *this_gm = (pdf_graft_map *) $self; + pdf_drop_graft_map(gctx, this_gm); + DEBUGMSG2; + } + + FITZEXCEPTION(Graftmap, !result) + Graftmap(struct Document *doc) + { + pdf_graft_map *map = NULL; + fz_try(gctx) { + pdf_document *dst = pdf_specifics(gctx, (fz_document *) doc); + ASSERT_PDF(dst); + map = pdf_new_graft_map(gctx, dst); + } + fz_catch(gctx) { + return NULL; + } + return (struct Graftmap *) map; + } + + %pythoncode %{ + def __del__(self): + if not type(self) is Graftmap: + return + if 
getattr(self, "thisown", False): + self.__swig_destroy__(self) + %} + } +}; + + +//------------------------------------------------------------------------ +// TextWriter +//------------------------------------------------------------------------ +struct TextWriter +{ + %extend { + ~TextWriter() + { + DEBUGMSG1("TextWriter"); + fz_text *this_tw = (fz_text *) $self; + fz_drop_text(gctx, this_tw); + DEBUGMSG2; + } + + FITZEXCEPTION(TextWriter, !result) + %pythonprepend TextWriter + %{"""Stores text spans for later output on compatible PDF pages."""%} + %pythonappend TextWriter %{ + self.opacity = opacity + self.color = color + self.rect = Rect(page_rect) + self.ctm = Matrix(1, 0, 0, -1, 0, self.rect.height) + self.ictm = ~self.ctm + self.last_point = Point() + self.last_point.__doc__ = "Position following last text insertion." + self.text_rect = Rect() + + self.text_rect.__doc__ = "Accumulated area of text spans." + self.used_fonts = set() + self.thisown = True + %} + TextWriter(PyObject *page_rect, float opacity=1, PyObject *color=NULL ) + { + fz_text *text = NULL; + fz_try(gctx) { + text = fz_new_text(gctx); + } + fz_catch(gctx) { + return NULL; + } + return (struct TextWriter *) text; + } + + FITZEXCEPTION(append, !result) + %pythonprepend append %{ + """Store 'text' at point 'pos' using 'font' and 'fontsize'.""" + + pos = Point(pos) * self.ictm + if font is None: + font = Font("helv") + if not font.is_writable: + raise ValueError("Unsupported font '%s'." % font.name) + if right_to_left: + text = self.clean_rtl(text) + text = "".join(reversed(text)) + right_to_left = 0 + %} + %pythonappend append %{ + self.last_point = Point(val[-2:]) * self.ctm + self.text_rect = self._bbox * self.ctm + val = self.text_rect, self.last_point + if font.flags["mono"] == 1: + self.used_fonts.add(font) + %} + PyObject * + append(PyObject *pos, char *text, struct Font *font=NULL, float fontsize=11, char *language=NULL, int right_to_left=0, int small_caps=0) + { + fz_text_language lang = fz_text_language_from_string(language); + fz_point p = JM_point_from_py(pos); + fz_matrix trm = fz_make_matrix(fontsize, 0, 0, fontsize, p.x, p.y); + int markup_dir = 0, wmode = 0; + fz_try(gctx) { + if (small_caps == 0) { + trm = fz_show_string(gctx, (fz_text *) $self, (fz_font *) font, + trm, text, wmode, right_to_left, markup_dir, lang); + } else { + trm = JM_show_string_cs(gctx, (fz_text *) $self, (fz_font *) font, + trm, text, wmode, right_to_left, markup_dir, lang); + } + } + fz_catch(gctx) { + return NULL; + } + return JM_py_from_matrix(trm); + } + + %pythoncode %{ + def appendv(self, pos, text, font=None, fontsize=11, + language=None, small_caps=False): + """Append text in vertical write mode.""" + lheight = fontsize * 1.2 + for c in text: + self.append(pos, c, font=font, fontsize=fontsize, + language=language, small_caps=small_caps) + pos.y += lheight + return self.text_rect, self.last_point + + + def clean_rtl(self, text): + """Revert the sequence of Latin text parts. + + Text with right-to-left writing direction (Arabic, Hebrew) often + contains Latin parts, which are written in left-to-right: numbers, names, + etc. For output as PDF text we need *everything* in right-to-left. + E.g. an input like " ABCDE FG HIJ KL " will be + converted to " JIH GF EDCBA LK ". The Arabic + parts remain untouched. + + Args: + text: str + Returns: + Massaged string. 
+ """ + if not text: + return text + # split into words at space boundaries + words = text.split(" ") + idx = [] + for i in range(len(words)): + w = words[i] + # revert character sequence for Latin only words + if not (len(w) < 2 or max([ord(c) for c in w]) > 255): + words[i] = "".join(reversed(w)) + idx.append(i) # stored index of Latin word + + # adjacent Latin words must revert their sequence, too + idx2 = [] # store indices of adjacent Latin words + for i in range(len(idx)): + if idx2 == []: # empty yet? + idx2.append(idx[i]) # store Latin word number + + elif idx[i] > idx2[-1] + 1: # large gap to last? + if len(idx2) > 1: # at least two consecutives? + words[idx2[0] : idx2[-1] + 1] = reversed( + words[idx2[0] : idx2[-1] + 1] + ) # revert their sequence + idx2 = [idx[i]] # re-initialize + + elif idx[i] == idx2[-1] + 1: # new adjacent Latin word + idx2.append(idx[i]) + + text = " ".join(words) + return text + %} + + + %pythoncode %{@property%} + %pythonappend _bbox%{val = Rect(val)%} + PyObject *_bbox() + { + return JM_py_from_rect(fz_bound_text(gctx, (fz_text *) $self, NULL, fz_identity)); + } + + FITZEXCEPTION(write_text, !result) + %pythonprepend write_text%{ + """Write the text to a PDF page having the TextWriter's page size. + + Args: + page: a PDF page having same size. + color: override text color. + opacity: override transparency. + overlay: put in foreground or background. + morph: tuple(Point, Matrix), apply a matrix with a fixpoint. + matrix: Matrix to be used instead of 'morph' argument. + render_mode: (int) PDF render mode operator 'Tr'. + """ + + CheckParent(page) + if abs(self.rect - page.rect) > 1e-3: + raise ValueError("incompatible page rect") + if morph != None: + if (type(morph) not in (tuple, list) + or type(morph[0]) is not Point + or type(morph[1]) is not Matrix + ): + raise ValueError("morph must be (Point, Matrix) or None") + if matrix != None and morph != None: + raise ValueError("only one of matrix, morph is allowed") + if getattr(opacity, "__float__", None) is None or opacity == -1: + opacity = self.opacity + if color is None: + color = self.color + %} + + %pythonappend write_text%{ + max_nums = val[0] + content = val[1] + max_alp, max_font = max_nums + old_cont_lines = content.splitlines() + + optcont = page._get_optional_content(oc) + if optcont != None: + bdc = "/OC /%s BDC" % optcont + emc = "EMC" + else: + bdc = emc = "" + + new_cont_lines = ["q"] + if bdc: + new_cont_lines.append(bdc) + + cb = page.cropbox_position + if page.rotation in (90, 270): + delta = page.rect.height - page.rect.width + else: + delta = 0 + mb = page.mediabox + if bool(cb) or mb.y0 != 0 or delta != 0: + new_cont_lines.append("1 0 0 1 %g %g cm" % (cb.x, cb.y + mb.y0 - delta)) + + if morph: + p = morph[0] * self.ictm + delta = Matrix(1, 1).pretranslate(p.x, p.y) + matrix = ~delta * morph[1] * delta + if morph or matrix: + new_cont_lines.append("%g %g %g %g %g %g cm" % JM_TUPLE(matrix)) + + for line in old_cont_lines: + if line.endswith(" cm"): + continue + if line == "BT": + new_cont_lines.append(line) + new_cont_lines.append("%i Tr" % render_mode) + continue + if line.endswith(" gs"): + alp = int(line.split()[0][4:]) + max_alp + line = "/Alp%i gs" % alp + elif line.endswith(" Tf"): + temp = line.split() + fsize = float(temp[1]) + if render_mode != 0: + w = fsize * 0.05 + else: + w = 1 + new_cont_lines.append("%g w" % w) + font = int(temp[0][2:]) + max_font + line = " ".join(["/F%i" % font] + temp[1:]) + elif line.endswith(" rg"): + new_cont_lines.append(line.replace("rg", "RG")) + 
elif line.endswith(" g"): + new_cont_lines.append(line.replace(" g", " G")) + elif line.endswith(" k"): + new_cont_lines.append(line.replace(" k", " K")) + new_cont_lines.append(line) + if emc: + new_cont_lines.append(emc) + new_cont_lines.append("Q\n") + content = "\n".join(new_cont_lines).encode("utf-8") + TOOLS._insert_contents(page, content, overlay=overlay) + val = None + for font in self.used_fonts: + repair_mono_font(page, font) + %} + PyObject *write_text(struct Page *page, PyObject *color=NULL, float opacity=-1, int overlay=1, + PyObject *morph=NULL, PyObject *matrix=NULL, int render_mode=0, int oc=0) + { + pdf_page *pdfpage = pdf_page_from_fz_page(gctx, (fz_page *) page); + pdf_obj *resources = NULL; + fz_buffer *contents = NULL; + fz_device *dev = NULL; + PyObject *result = NULL, *max_nums, *cont_string; + float alpha = 1; + if (opacity >= 0 && opacity < 1) + alpha = opacity; + fz_colorspace *colorspace; + int ncol = 1; + float dev_color[4] = {0, 0, 0, 0}; + if (EXISTS(color)) { + JM_color_FromSequence(color, &ncol, dev_color); + } + switch(ncol) { + case 3: colorspace = fz_device_rgb(gctx); break; + case 4: colorspace = fz_device_cmyk(gctx); break; + default: colorspace = fz_device_gray(gctx); break; + } + + fz_var(contents); + fz_var(resources); + fz_var(dev); + fz_try(gctx) { + ASSERT_PDF(pdfpage); + resources = pdf_new_dict(gctx, pdfpage->doc, 5); + contents = fz_new_buffer(gctx, 1024); + dev = pdf_new_pdf_device(gctx, pdfpage->doc, fz_identity, + resources, contents); + fz_fill_text(gctx, dev, (fz_text *) $self, fz_identity, + colorspace, dev_color, alpha, fz_default_color_params); + fz_close_device(gctx, dev); + + // copy generated resources into the one of the page + max_nums = JM_merge_resources(gctx, pdfpage, resources); + cont_string = JM_EscapeStrFromBuffer(gctx, contents); + result = Py_BuildValue("OO", max_nums, cont_string); + Py_DECREF(cont_string); + Py_DECREF(max_nums); + } + fz_always(gctx) { + fz_drop_buffer(gctx, contents); + pdf_drop_obj(gctx, resources); + fz_drop_device(gctx, dev); + } + fz_catch(gctx) { + return NULL; + } + return result; + } + %pythoncode %{ + def __del__(self): + if not type(self) is TextWriter: + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + %} + } +}; + + +//------------------------------------------------------------------------ +// Font +//------------------------------------------------------------------------ +struct Font +{ + %extend + { + ~Font() + { + DEBUGMSG1("Font"); + fz_font *this_font = (fz_font *) $self; + fz_drop_font(gctx, this_font); + DEBUGMSG2; + } + + FITZEXCEPTION(Font, !result) + %pythonprepend Font %{ + if fontbuffer: + if hasattr(fontbuffer, "getvalue"): + fontbuffer = fontbuffer.getvalue() + elif isinstance(fontbuffer, bytearray): + fontbuffer = bytes(fontbuffer) + if not isinstance(fontbuffer, bytes): + raise ValueError("bad type: 'fontbuffer'") + + if isinstance(fontname, str): + fname_lower = fontname.lower() + if "/" in fname_lower or "\\" in fname_lower or "." 
in fname_lower: + print("Warning: did you mean a fontfile?") + + if fname_lower in ("cjk", "china-t", "china-ts"): + ordering = 0 + elif fname_lower.startswith("china-s"): + ordering = 1 + elif fname_lower.startswith("korea"): + ordering = 3 + elif fname_lower.startswith("japan"): + ordering = 2 + elif fname_lower in fitz_fontdescriptors.keys(): + import pymupdf_fonts # optional fonts + fontbuffer = pymupdf_fonts.myfont(fname_lower) # make a copy + fontname = None # ensure using fontbuffer only + del pymupdf_fonts # remove package again + + elif ordering < 0: + fontname = Base14_fontdict.get(fontname, fontname) + %} + %pythonappend Font %{self.thisown = True%} + Font(char *fontname=NULL, char *fontfile=NULL, + PyObject *fontbuffer=NULL, int script=0, + char *language=NULL, int ordering=-1, int is_bold=0, + int is_italic=0, int is_serif=0, int embed=1) + { + fz_font *font = NULL; + fz_try(gctx) { + fz_text_language lang = fz_text_language_from_string(language); + font = JM_get_font(gctx, fontname, fontfile, + fontbuffer, script, lang, ordering, + is_bold, is_italic, is_serif, embed); + } + fz_catch(gctx) { + return NULL; + } + return (struct Font *) font; + } + + + %pythonprepend glyph_advance + %{"""Return the glyph width of a unicode (font size 1)."""%} + PyObject *glyph_advance(int chr, char *language=NULL, int script=0, int wmode=0, int small_caps=0) + { + fz_font *font, *thisfont = (fz_font *) $self; + int gid; + fz_text_language lang = fz_text_language_from_string(language); + if (small_caps) { + gid = fz_encode_character_sc(gctx, thisfont, chr); + if (gid >= 0) font = thisfont; + } else { + gid = fz_encode_character_with_fallback(gctx, thisfont, chr, script, lang, &font); + } + return PyFloat_FromDouble((double) fz_advance_glyph(gctx, font, gid, wmode)); + } + + + FITZEXCEPTION(text_length, !result) + %pythonprepend text_length + %{"""Return length of unicode 'text' under a fontsize."""%} + PyObject *text_length(PyObject *text, double fontsize=11, char *language=NULL, int script=0, int wmode=0, int small_caps=0) + { + fz_font *font=NULL, *thisfont = (fz_font *) $self; + fz_text_language lang = fz_text_language_from_string(language); + double rc = 0; + int gid; + fz_try(gctx) { + if (!PyUnicode_Check(text) || PyUnicode_READY(text) != 0) { + RAISEPY(gctx, MSG_BAD_TEXT, PyExc_TypeError); + } + Py_ssize_t i, len = PyUnicode_GET_LENGTH(text); + int kind = PyUnicode_KIND(text); + void *data = PyUnicode_DATA(text); + for (i = 0; i < len; i++) { + int c = PyUnicode_READ(kind, data, i); + if (small_caps) { + gid = fz_encode_character_sc(gctx, thisfont, c); + if (gid >= 0) font = thisfont; + } else { + gid = fz_encode_character_with_fallback(gctx,thisfont, c, script, lang, &font); + } + rc += (double) fz_advance_glyph(gctx, font, gid, wmode); + } + } + fz_catch(gctx) { + PyErr_Clear(); + return NULL; + } + rc *= fontsize; + return PyFloat_FromDouble(rc); + } + + + FITZEXCEPTION(char_lengths, !result) + %pythonprepend char_lengths + %{"""Return tuple of char lengths of unicode 'text' under a fontsize."""%} + PyObject *char_lengths(PyObject *text, double fontsize=11, char *language=NULL, int script=0, int wmode=0, int small_caps=0) + { + fz_font *font, *thisfont = (fz_font *) $self; + fz_text_language lang = fz_text_language_from_string(language); + PyObject *rc = NULL; + int gid; + fz_try(gctx) { + if (!PyUnicode_Check(text) || PyUnicode_READY(text) != 0) { + RAISEPY(gctx, MSG_BAD_TEXT, PyExc_TypeError); + } + Py_ssize_t i, len = PyUnicode_GET_LENGTH(text); + int kind = PyUnicode_KIND(text); + 
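// PEP 393 string access: with the kind and data pointer obtained here, PyUnicode_READ
// below yields each code point without creating intermediate Python string objects.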
void *data = PyUnicode_DATA(text); + rc = PyTuple_New(len); + for (i = 0; i < len; i++) { + int c = PyUnicode_READ(kind, data, i); + if (small_caps) { + gid = fz_encode_character_sc(gctx, thisfont, c); + if (gid >= 0) font = thisfont; + } else { + gid = fz_encode_character_with_fallback(gctx,thisfont, c, script, lang, &font); + } + PyTuple_SET_ITEM(rc, i, + PyFloat_FromDouble(fontsize * (double) fz_advance_glyph(gctx, font, gid, wmode))); + } + } + fz_catch(gctx) { + PyErr_Clear(); + Py_CLEAR(rc); + return NULL; + } + return rc; + } + + + %pythonprepend glyph_bbox + %{"""Return the glyph bbox of a unicode (font size 1)."""%} + %pythonappend glyph_bbox %{val = Rect(val)%} + PyObject *glyph_bbox(int chr, char *language=NULL, int script=0, int small_caps=0) + { + fz_font *font, *thisfont = (fz_font *) $self; + int gid; + fz_text_language lang = fz_text_language_from_string(language); + if (small_caps) { + gid = fz_encode_character_sc(gctx, thisfont, chr); + if (gid >= 0) font = thisfont; + } else { + gid = fz_encode_character_with_fallback(gctx, thisfont, chr, script, lang, &font); + } + return JM_py_from_rect(fz_bound_glyph(gctx, font, gid, fz_identity)); + } + + %pythonprepend has_glyph + %{"""Check whether font has a glyph for this unicode."""%} + PyObject *has_glyph(int chr, char *language=NULL, int script=0, int fallback=0, int small_caps=0) + { + fz_font *font, *thisfont = (fz_font *) $self; + fz_text_language lang; + int gid = 0; + if (fallback) { + lang = fz_text_language_from_string(language); + gid = fz_encode_character_with_fallback(gctx, (fz_font *) $self, chr, script, lang, &font); + } else { + if (!small_caps) { + gid = fz_encode_character(gctx, thisfont, chr); + } else { + gid = fz_encode_character_sc(gctx, thisfont, chr); + } + } + return Py_BuildValue("i", gid); + } + + + %pythoncode %{ + def valid_codepoints(self): + from array import array + gc = self.glyph_count + cp = array("l", (0,) * gc) + arr = cp.buffer_info() + self._valid_unicodes(arr) + return array("l", sorted(set(cp))[1:]) + %} + void _valid_unicodes(PyObject *arr) + { + fz_font *font = (fz_font *) $self; + PyObject *temp = PySequence_ITEM(arr, 0); + void *ptr = PyLong_AsVoidPtr(temp); + JM_valid_chars(gctx, font, ptr); + Py_DECREF(temp); + } + + + %pythoncode %{@property%} + PyObject *flags() + { + fz_font_flags_t *f = fz_font_flags((fz_font *) $self); + if (!f) Py_RETURN_NONE; + return Py_BuildValue( + "{s:N,s:N,s:N,s:N,s:N,s:N,s:N,s:N,s:N,s:N,s:N,s:N" + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR >= 22 + ",s:N,s:N" + #endif + "}", + "mono", JM_BOOL(f->is_mono), + "serif", JM_BOOL(f->is_serif), + "bold", JM_BOOL(f->is_bold), + "italic", JM_BOOL(f->is_italic), + "substitute", JM_BOOL(f->ft_substitute), + "stretch", JM_BOOL(f->ft_stretch), + "fake-bold", JM_BOOL(f->fake_bold), + "fake-italic", JM_BOOL(f->fake_italic), + "opentype", JM_BOOL(f->has_opentype), + "invalid-bbox", JM_BOOL(f->invalid_bbox), + "cjk", JM_BOOL(f->cjk), + "cjk-lang", (f->cjk ? 
PyLong_FromUnsignedLong((unsigned long) f->cjk_lang) : Py_BuildValue("s",NULL)) + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR >= 22 + , + "embed", JM_BOOL(f->embed), + "never-embed", JM_BOOL(f->never_embed) + #endif + ); + + } + + + %pythoncode %{@property%} + PyObject *is_bold() + { + fz_font *font = (fz_font *) $self; + if (fz_font_is_bold(gctx,font)) { + Py_RETURN_TRUE; + } + Py_RETURN_FALSE; + } + + + %pythoncode %{@property%} + PyObject *is_serif() + { + fz_font *font = (fz_font *) $self; + if (fz_font_is_serif(gctx,font)) { + Py_RETURN_TRUE; + } + Py_RETURN_FALSE; + } + + + %pythoncode %{@property%} + PyObject *is_italic() + { + fz_font *font = (fz_font *) $self; + if (fz_font_is_italic(gctx,font)) { + Py_RETURN_TRUE; + } + Py_RETURN_FALSE; + } + + + %pythoncode %{@property%} + PyObject *is_monospaced() + { + fz_font *font = (fz_font *) $self; + if (fz_font_is_monospaced(gctx,font)) { + Py_RETURN_TRUE; + } + Py_RETURN_FALSE; + } + + + /* temporarily disabled + * PyObject *is_writable() + * { + * fz_font *font = (fz_font *) $self; + * if (fz_font_t3_procs(gctx, font) || + * fz_font_flags(font)->ft_substitute || + * !pdf_font_writing_supported(font)) { + * Py_RETURN_FALSE; + * } + * Py_RETURN_TRUE; + * } + */ + + %pythoncode %{@property%} + PyObject *name() + { + return JM_UnicodeFromStr(fz_font_name(gctx, (fz_font *) $self)); + } + + %pythoncode %{@property%} + int glyph_count() + { + fz_font *this_font = (fz_font *) $self; + return this_font->glyph_count; + } + + %pythoncode %{@property%} + PyObject *buffer() + { + fz_font *this_font = (fz_font *) $self; + unsigned char *data = NULL; + size_t len = fz_buffer_storage(gctx, this_font->buffer, &data); + return JM_BinFromCharSize(data, len); + } + + %pythoncode %{@property%} + %pythonappend bbox%{val = Rect(val)%} + PyObject *bbox() + { + fz_font *this_font = (fz_font *) $self; + return JM_py_from_rect(fz_font_bbox(gctx, this_font)); + } + + %pythoncode %{@property%} + %pythonprepend ascender + %{"""Return the glyph ascender value."""%} + float ascender() + { + return fz_font_ascender(gctx, (fz_font *) $self); + } + + + %pythoncode %{@property%} + %pythonprepend descender + %{"""Return the glyph descender value."""%} + float descender() + { + return fz_font_descender(gctx, (fz_font *) $self); + } + + + %pythoncode %{ + + @property + def is_writable(self): + return True + + def glyph_name_to_unicode(self, name): + """Return the unicode for a glyph name.""" + return glyph_name_to_unicode(name) + + def unicode_to_glyph_name(self, ch): + """Return the glyph name for a unicode.""" + return unicode_to_glyph_name(ch) + + def __repr__(self): + return "Font('%s')" % self.name + + def __del__(self): + if not type(self) is Font: + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + %} + } +}; + + +//------------------------------------------------------------------------ +// DocumentWriter +//------------------------------------------------------------------------ + +struct DocumentWriter +{ + %extend + { + ~DocumentWriter() + { + // need this structure to free any fz_output the writer may have + typedef struct { // copied from pdf_write.c + fz_document_writer super; + pdf_document *pdf; + pdf_write_options opts; + fz_output *out; + fz_rect mediabox; + pdf_obj *resources; + fz_buffer *contents; + } pdf_writer; + + fz_document_writer *writer_fz = (fz_document_writer *) $self; + fz_output *out = NULL; + pdf_writer *writer_pdf = (pdf_writer *) writer_fz; + if (writer_pdf) { + out = writer_pdf->out; + if (out) { + 
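+                    /* Drop the writer's fz_output here and clear the pointer,
+                       so that the fz_drop_document_writer() call further down
+                       does not touch this output again. */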
DEBUGMSG1("Output of DocumentWriter"); + fz_drop_output(gctx, out); + writer_pdf->out = NULL; + DEBUGMSG2; + } + } + DEBUGMSG1("DocumentWriter"); + fz_drop_document_writer( gctx, writer_fz); + DEBUGMSG2; + } + + FITZEXCEPTION(DocumentWriter, !result) + %pythonprepend DocumentWriter + %{ + if type(path) is str: + pass + elif hasattr(path, "absolute"): + path = str(path) + elif hasattr(path, "name"): + path = path.name + if options==None: + options="" + %} + %pythonappend DocumentWriter + %{ + %} + DocumentWriter( PyObject* path, const char* options=NULL) + { + fz_output *out = NULL; + fz_document_writer* ret=NULL; + fz_try(gctx) { + if (PyUnicode_Check(path)) { + ret = fz_new_pdf_writer( gctx, PyUnicode_AsUTF8(path), options); + } else { + out = JM_new_output_fileptr(gctx, path); + ret = fz_new_pdf_writer_with_output(gctx, out, options); + } + } + + fz_catch(gctx) { + return NULL; + } + return (struct DocumentWriter*) ret; + } + + struct DeviceWrapper* begin_page( PyObject* mediabox) + { + fz_rect mediabox2 = JM_rect_from_py(mediabox); + fz_device* device = fz_begin_page( gctx, (fz_document_writer*) $self, mediabox2); + struct DeviceWrapper* device_wrapper + = (struct DeviceWrapper*) calloc(1, sizeof(struct DeviceWrapper)) + ; + device_wrapper->device = device; + device_wrapper->list = NULL; + return device_wrapper; + } + + void end_page() + { + fz_end_page( gctx, (fz_document_writer*) $self); + } + + void close() + { + fz_document_writer *writer = (fz_document_writer*) $self; + fz_close_document_writer( gctx, writer); + } + %pythoncode + %{ + def __del__(self): + if not type(self) is DocumentWriter: + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + + def __enter__(self): + return self + + def __exit__(self, *args): + self.close() + %} + } +}; + +//------------------------------------------------------------------------ +// Archive +//------------------------------------------------------------------------ +struct Archive +{ + %extend + { + ~Archive() + { + DEBUGMSG1("Archive"); + fz_drop_archive( gctx, (fz_archive *) $self); + DEBUGMSG2; + } + FITZEXCEPTION(Archive, !result) + %pythonprepend Archive %{ + self._subarchives = [] + %} + %pythonappend Archive %{ + self.thisown = True + if args != (): + self.add(*args) + %} + + //--------------------------------------- + // new empty archive + //--------------------------------------- + Archive(struct Archive *a0=NULL, const char *path=NULL) + { + fz_archive *arch=NULL; + fz_try(gctx) { + arch = fz_new_multi_archive(gctx); + } + fz_catch(gctx) { + return NULL; + } + return (struct Archive *) arch; + } + + Archive(PyObject *a0=NULL, const char *path=NULL) + { + fz_archive *arch=NULL; + fz_try(gctx) { + arch = fz_new_multi_archive(gctx); + } + fz_catch(gctx) { + return NULL; + } + return (struct Archive *) arch; + } + + FITZEXCEPTION(has_entry, !result) + PyObject *has_entry(const char *name) + { + fz_archive *arch = (fz_archive *) $self; + int ret = 0; + fz_try(gctx) { + ret = fz_has_archive_entry(gctx, arch, name); + } + fz_catch(gctx) { + return NULL; + } + return JM_BOOL(ret); + } + + FITZEXCEPTION(read_entry, !result) + PyObject *read_entry(const char *name) + { + fz_archive *arch = (fz_archive *) $self; + PyObject *ret = NULL; + fz_buffer *buff = NULL; + fz_try(gctx) { + buff = fz_read_archive_entry(gctx, arch, name); + ret = JM_BinFromBuffer(gctx, buff); + } + fz_always(gctx) { + fz_drop_buffer(gctx, buff); + } + fz_catch(gctx) { + return NULL; + } + return ret; + } + + //-------------------------------------- 
+ // add dir + //-------------------------------------- + FITZEXCEPTION(_add_dir, !result) + PyObject *_add_dir(const char *folder, const char *path=NULL) + { + fz_archive *arch = (fz_archive *) $self; + fz_archive *sub = NULL; + fz_try(gctx) { + sub = fz_open_directory(gctx, folder); + fz_mount_multi_archive(gctx, arch, sub, path); + } + fz_always(gctx) { + fz_drop_archive(gctx, sub); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------- + // add archive + //---------------------------------- + FITZEXCEPTION(_add_arch, !result) + PyObject *_add_arch(struct Archive *subarch, const char *path=NULL) + { + fz_archive *arch = (fz_archive *) $self; + fz_archive *sub = (fz_archive *) subarch; + fz_try(gctx) { + fz_mount_multi_archive(gctx, arch, sub, path); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------- + // add ZIP/TAR from file + //---------------------------------- + FITZEXCEPTION(_add_ziptarfile, !result) + PyObject *_add_ziptarfile(const char *filepath, int type, const char *path=NULL) + { + fz_archive *arch = (fz_archive *) $self; + fz_archive *sub = NULL; + fz_try(gctx) { + if (type==1) { + sub = fz_open_zip_archive(gctx, filepath); + } else { + sub = fz_open_tar_archive(gctx, filepath); + } + fz_mount_multi_archive(gctx, arch, sub, path); + } + fz_always(gctx) { + fz_drop_archive(gctx, sub); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------- + // add ZIP/TAR from memory + //---------------------------------- + FITZEXCEPTION(_add_ziptarmemory, !result) + PyObject *_add_ziptarmemory(PyObject *memory, int type, const char *path=NULL) + { + fz_archive *arch = (fz_archive *) $self; + fz_archive *sub = NULL; + fz_stream *stream = NULL; + fz_buffer *buff = NULL; + fz_try(gctx) { + buff = JM_BufferFromBytes(gctx, memory); + stream = fz_open_buffer(gctx, buff); + if (type==1) { + sub = fz_open_zip_archive_with_stream(gctx, stream); + } else { + sub = fz_open_tar_archive_with_stream(gctx, stream); + } + fz_mount_multi_archive(gctx, arch, sub, path); + } + fz_always(gctx) { + fz_drop_stream(gctx, stream); + fz_drop_buffer(gctx, buff); + fz_drop_archive(gctx, sub); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + //---------------------------------- + // add "tree" item + //---------------------------------- + FITZEXCEPTION(_add_treeitem, !result) + PyObject *_add_treeitem(PyObject *memory, const char *name, const char *path=NULL) + { + fz_archive *arch = (fz_archive *) $self; + fz_archive *sub = NULL; + fz_buffer *buff = NULL; + int drop_sub = 0; + fz_try(gctx) { + buff = JM_BufferFromBytes(gctx, memory); + sub = JM_last_tree(gctx, arch, path); + if (!sub) { + sub = fz_new_tree_archive(gctx, NULL); + drop_sub = 1; + } + fz_tree_archive_add_buffer(gctx, sub, name, buff); + if (drop_sub) { + fz_mount_multi_archive(gctx, arch, sub, path); + } + } + fz_always(gctx) { + fz_drop_buffer(gctx, buff); + if (drop_sub) { + fz_drop_archive(gctx, sub); + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + %pythoncode %{ + def add(self, content, path=None): + """Add a sub-archive. + + Args: + content: content to be added. May be one of Archive, folder + name, file name, raw bytes (bytes, bytearray), zipfile, + tarfile, or a sequence of any of these types. + path: (str) a "virtual" path name, under which the elements + of content can be retrieved. Use it to e.g. cope with + duplicate element names. 
+ """ + bin_ok = lambda x: isinstance(x, (bytes, bytearray, io.BytesIO)) + + entries = [] + mount = None + fmt = None + + def make_subarch(): + subarch = {"fmt": fmt, "entries": entries, "path": mount} + if fmt != "tree" or self._subarchives == []: + self._subarchives.append(subarch) + else: + ltree = self._subarchives[-1] + if ltree["fmt"] != "tree" or ltree["path"] != subarch["path"]: + self._subarchives.append(subarch) + else: + ltree["entries"].extend(subarch["entries"]) + self._subarchives[-1] = ltree + return + + if isinstance(content, zipfile.ZipFile): + fmt = "zip" + entries = content.namelist() + mount = path + filename = getattr(content, "filename", None) + fp = getattr(content, "fp", None) + if filename: + self._add_ziptarfile(filename, 1, path) + else: + self._add_ziptarmemory(fp.getvalue(), 1, path) + return make_subarch() + + if isinstance(content, tarfile.TarFile): + fmt = "tar" + entries = content.getnames() + mount = path + filename = getattr(content.fileobj, "name", None) + fp = content.fileobj + if not isinstance(fp, io.BytesIO) and not filename: + fp = fp.fileobj + if filename: + self._add_ziptarfile(filename, 0, path) + else: + self._add_ziptarmemory(fp.getvalue(), 0, path) + return make_subarch() + + if isinstance(content, Archive): + fmt = "multi" + mount = path + self._add_arch(content, path) + return make_subarch() + + if bin_ok(content): + if not (path and type(path) is str): + raise ValueError("need name for binary content") + fmt = "tree" + mount = None + entries = [path] + self._add_treeitem(content, path) + return make_subarch() + + if hasattr(content, "name"): + content = content.name + elif isinstance(content, pathlib.Path): + content = str(content) + + if os.path.isdir(str(content)): + a0 = str(content) + fmt = "dir" + mount = path + entries = os.listdir(a0) + self._add_dir(a0, path) + return make_subarch() + + if os.path.isfile(str(content)): + if not (path and type(path) is str): + raise ValueError("need name for binary content") + a0 = str(content) + _ = open(a0, "rb") + ff = _.read() + _.close() + fmt = "tree" + mount = None + entries = [path] + self._add_treeitem(ff, path) + return make_subarch() + + if type(content) is str or not getattr(content, "__getitem__", None): + raise ValueError("bad archive content") + + #---------------------------------------- + # handling sequence types here + #---------------------------------------- + + if len(content) == 2: # covers the tree item plus path + data, name = content + if bin_ok(data) or os.path.isfile(str(data)): + if not type(name) is str: + raise ValueError(f"bad item name {name}") + mount = path + fmt = "tree" + if bin_ok(data): + self._add_treeitem(data, name, path=mount) + else: + _ = open(str(data), "rb") + ff = _.read() + _.close() + seld._add_treeitem(ff, name, path=mount) + entries = [name] + return make_subarch() + + # deal with sequence of disparate items + for item in content: + self.add(item, path) + + __doc__ = """Archive(dirname [, path]) - from folder + Archive(file [, path]) - from file name or object + Archive(data, name) - from memory item + Archive() - empty archive + Archive(archive [, path]) - from archive + """ + + @property + def entry_list(self): + """List of sub archives.""" + return self._subarchives + + def __repr__(self): + return f"Archive, sub-archives: {len(self._subarchives)}" + + def __del__(self): + if not type(self) is Archive: + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + %} + } +}; 
+//------------------------------------------------------------------------ +// Xml +//------------------------------------------------------------------------ +struct Xml +{ + %extend + { + ~Xml() + { + DEBUGMSG1("Xml"); + fz_drop_xml( gctx, (fz_xml*) $self); + DEBUGMSG2; + } + + FITZEXCEPTION(Xml, !result) + Xml(fz_xml* xml) + { + fz_keep_xml( gctx, xml); + return (struct Xml*) xml; + } + + Xml(const char *html) + { + fz_buffer *buff = NULL; + fz_xml *ret = NULL; + fz_try(gctx) { + buff = fz_new_buffer_from_copied_data(gctx, html, strlen(html)+1); + ret = fz_parse_xml_from_html5(gctx, buff); + } + fz_always(gctx) { + fz_drop_buffer(gctx, buff); + } + fz_catch(gctx) { + return NULL; + } + fz_keep_xml(gctx, ret); + return (struct Xml*) ret; + } + + %pythoncode %{@property%} + FITZEXCEPTION (root, !result) + struct Xml* root() + { + fz_xml* ret = NULL; + fz_try(gctx) { + ret = fz_xml_root((fz_xml_doc *) $self); + } + fz_catch(gctx) { + return NULL; + } + return (struct Xml*) ret; + } + + FITZEXCEPTION (bodytag, !result) + struct Xml* bodytag() + { + fz_xml* ret = NULL; + fz_try(gctx) { + ret = fz_keep_xml( gctx, fz_dom_body( gctx, (fz_xml *) $self)); + } + fz_catch(gctx) { + return NULL; + } + return (struct Xml*) ret; + } + + FITZEXCEPTION (append_child, !result) + PyObject *append_child( struct Xml* child) + { + fz_try(gctx) { + fz_dom_append_child( gctx, (fz_xml *) $self, (fz_xml *) child); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + FITZEXCEPTION (create_text_node, !result) + struct Xml* create_text_node( const char *text) + { + fz_xml* ret = NULL; + fz_try(gctx) { + ret = fz_dom_create_text_node( gctx,(fz_xml *) $self, text); + } + fz_catch(gctx) { + return NULL; + } + fz_keep_xml( gctx, ret); + return (struct Xml*) ret; + } + + FITZEXCEPTION (create_element, !result) + struct Xml* create_element( const char *tag) + { + fz_xml* ret = NULL; + fz_try(gctx) { + ret = fz_dom_create_element( gctx, (fz_xml *)$self, tag); + } + fz_catch(gctx) { + return NULL; + } + fz_keep_xml( gctx, ret); + return (struct Xml*) ret; + } + + struct Xml *find(const char *tag, const char *att, const char *match) + { + fz_xml* ret=NULL; + ret = fz_dom_find( gctx, (fz_xml *)$self, tag, att, match); + if (!ret) { + return NULL; + } + fz_keep_xml( gctx, ret); + return (struct Xml*) ret; + } + + struct Xml *find_next( const char *tag, const char *att, const char *match) + { + fz_xml* ret=NULL; + ret = fz_dom_find_next( gctx, (fz_xml *)$self, tag, att, match); + if (!ret) { + return NULL; + } + fz_keep_xml( gctx, ret); + return (struct Xml*) ret; + } + + %pythoncode %{@property%} + struct Xml *next() + { + fz_xml* ret=NULL; + ret = fz_dom_next( gctx, (fz_xml *)$self); + if (!ret) { + return NULL; + } + fz_keep_xml( gctx, ret); + return (struct Xml*) ret; + } + + %pythoncode %{@property%} + struct Xml *previous() + { + fz_xml* ret=NULL; + ret = fz_dom_previous( gctx, (fz_xml *)$self); + if (!ret) { + return NULL; + } + fz_keep_xml( gctx, ret); + return (struct Xml*) ret; + } + + FITZEXCEPTION (set_attribute, !result) + PyObject *set_attribute(const char *key, const char *value) + { + fz_try(gctx) { + if (strlen(key)==0) { + RAISEPY(gctx, "key must not be empty", PyExc_ValueError); + } + fz_dom_add_attribute(gctx, (fz_xml *)$self, key, value); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + FITZEXCEPTION (remove_attribute, !result) + PyObject *remove_attribute(const char *key) + { + fz_try(gctx) { + if (strlen(key)==0) { + RAISEPY(gctx, "key must not be empty", 
PyExc_ValueError); + } + fz_xml *elt = (fz_xml *)$self; + fz_dom_remove_attribute(gctx, elt, key); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION (get_attribute_value, !result) + PyObject *get_attribute_value(const char *key) + { + const char *ret=NULL; + fz_try(gctx) { + if (strlen(key)==0) { + RAISEPY(gctx, "key must not be empty", PyExc_ValueError); + } + fz_xml *elt = (fz_xml *)$self; + ret=fz_dom_attribute(gctx, elt, key); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("s", ret); + } + + + FITZEXCEPTION (get_attributes, !result) + PyObject *get_attributes() + { + fz_xml *this = (fz_xml *) $self; + if (fz_xml_text(this)) { // text node has none + Py_RETURN_NONE; + } + PyObject *result=PyDict_New(); + fz_try(gctx) { + int i=0; + const char *key=NULL; + const char *val=NULL; + while (1) { + val = fz_dom_get_attribute(gctx, this, i, &key); + if (!val || !key) { + break; + } + PyObject *temp = Py_BuildValue("s",val); + PyDict_SetItemString(result, key, temp); + Py_DECREF(temp); + i += 1; + } + } + fz_catch(gctx) { + Py_DECREF(result); + return NULL; + } + return result; + } + + + FITZEXCEPTION (insert_before, !result) + PyObject *insert_before(struct Xml *node) + { + fz_xml *existing = (fz_xml *) $self; + fz_xml *what = (fz_xml *) node; + fz_try(gctx) + { + fz_dom_insert_before(gctx, existing, what); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + FITZEXCEPTION (insert_after, !result) + PyObject *insert_after(struct Xml *node) + { + fz_xml *existing = (fz_xml *) $self; + fz_xml *what = (fz_xml *) node; + fz_try(gctx) + { + fz_dom_insert_after(gctx, existing, what); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + FITZEXCEPTION (clone, !result) + struct Xml* clone() + { + fz_xml* ret = NULL; + fz_try(gctx) { + ret = fz_dom_clone( gctx, (fz_xml *)$self); + } + fz_catch(gctx) { + return NULL; + } + fz_keep_xml( gctx, ret); + return (struct Xml*) ret; + } + + %pythoncode %{@property%} + struct Xml *parent() + { + fz_xml* ret = NULL; + ret = fz_dom_parent( gctx, (fz_xml *)$self); + if (!ret) { + return NULL; + } + fz_keep_xml( gctx, ret); + return (struct Xml*) ret; + } + + %pythoncode %{@property%} + struct Xml *first_child() + { + fz_xml* ret = NULL; + fz_xml *this = (fz_xml *)$self; + if (fz_xml_text(this)) { // a text node has no child + return NULL; + } + ret = fz_dom_first_child( gctx, (fz_xml *)$self); + if (!ret) { + return NULL; + } + fz_keep_xml( gctx, ret); + return (struct Xml*) ret; + } + + + FITZEXCEPTION (remove, !result) + PyObject *remove() + { + fz_try(gctx) { + fz_dom_remove( gctx, (fz_xml *)$self); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + %pythoncode %{@property%} + PyObject *text() + { + return Py_BuildValue("s", fz_xml_text((fz_xml *)$self)); + } + + %pythoncode %{@property%} + PyObject *tagname() + { + return Py_BuildValue("s", fz_xml_tag((fz_xml *)$self)); + } + + + %pythoncode %{ + def _get_node_tree(self): + def show_node(node, items, shift): + while node != None: + if node.is_text: + items.append((shift, f'"{node.text}"')) + node = node.next + continue + items.append((shift, f"({node.tagname}")) + for k, v in node.get_attributes().items(): + items.append((shift, f"={k} '{v}'")) + child = node.first_child + if child: + items = show_node(child, items, shift + 1) + items.append((shift, f"){node.tagname}")) + node = node.next + return items + + shift = 0 + items = [] + items = show_node(self, items, shift) + return items + + def debug(self): + """Print 
a list of the node tree below self.""" + items = self._get_node_tree() + for item in items: + print(" " * item[0] + item[1].replace("\n", "\\n")) + + @property + def is_text(self): + """Check if this is a text node.""" + return self.text != None + + @property + def last_child(self): + """Return last child node.""" + child = self.first_child + if child==None: + return None + while True: + if child.next == None: + return child + child = child.next + + @staticmethod + def color_text(color): + if type(color) is str: + return color + if type(color) is int: + return f"rgb({sRGB_to_rgb(color)})" + if type(color) in (tuple, list): + return f"rgb{tuple(color)}" + return color + + def add_number_list(self, start=1, numtype=None): + """Add numbered list ("ol" tag)""" + child = self.create_element("ol") + if start > 1: + child.set_attribute("start", str(start)) + if numtype != None: + child.set_attribute("type", numtype) + self.append_child(child) + return child + + def add_description_list(self): + """Add description list ("dl" tag)""" + child = self.create_element("dl") + self.append_child(child) + return child + + def add_image(self, name, width=None, height=None, imgfloat=None, align=None): + """Add image node (tag "img").""" + child = self.create_element("img") + if width != None: + child.set_attribute("width", f"{width}") + if height != None: + child.set_attribute("height", f"{height}") + if imgfloat != None: + child.set_attribute("style", f"float: {imgfloat}") + if align != None: + child.set_attribute("align", f"{align}") + child.set_attribute("src", f"{name}") + self.append_child(child) + return child + + def add_bullet_list(self): + """Add bulleted list ("ul" tag)""" + child = self.create_element("ul") + self.append_child(child) + return child + + def add_list_item(self): + """Add item ("li" tag) under a (numbered or bulleted) list.""" + if self.tagname not in ("ol", "ul"): + raise ValueError("cannot add list item to", self.tagname) + child = self.create_element("li") + self.append_child(child) + return child + + def add_span(self): + child = self.create_element("span") + self.append_child(child) + return child + + def add_paragraph(self): + """Add "p" tag""" + child = self.create_element("p") + if self.tagname != "p": + self.append_child(child) + else: + self.parent.append_child(child) + return child + + def add_header(self, level=1): + """Add header tag""" + if level not in range(1, 7): + raise ValueError("Header level must be in [1, 6]") + this_tag = self.tagname + new_tag = f"h{level}" + child = self.create_element(new_tag) + prev = self + if this_tag not in ("h1", "h2", "h3", "h4", "h5", "h6", "p"): + self.append_child(child) + return child + self.parent.append_child(child) + return child + + def add_division(self): + """Add "div" tag""" + child = self.create_element("div") + self.append_child(child) + return child + + def add_horizontal_line(self): + """Add horizontal line ("hr" tag)""" + child = self.create_element("hr") + self.append_child(child) + return child + + def add_link(self, href, text=None): + """Add a hyperlink ("a" tag)""" + child = self.create_element("a") + if not isinstance(text, str): + text = href + child.set_attribute("href", href) + child.append_child(self.create_text_node(text)) + prev = self.span_bottom() + if prev == None: + prev = self + prev.append_child(child) + return self + + def add_code(self, text=None): + """Add a "code" tag""" + child = self.create_element("code") + if type(text) is str: + child.append_child(self.create_text_node(text)) + prev = 
self.span_bottom() + if prev == None: + prev = self + prev.append_child(child) + return self + + add_var = add_code + add_samp = add_code + add_kbd = add_code + + def add_superscript(self, text=None): + """Add a superscript ("sup" tag)""" + child = self.create_element("sup") + if type(text) is str: + child.append_child(self.create_text_node(text)) + prev = self.span_bottom() + if prev == None: + prev = self + prev.append_child(child) + return self + + def add_subscript(self, text=None): + """Add a subscript ("sub" tag)""" + child = self.create_element("sub") + if type(text) is str: + child.append_child(self.create_text_node(text)) + prev = self.span_bottom() + if prev == None: + prev = self + prev.append_child(child) + return self + + def add_codeblock(self): + """Add monospaced lines ("pre" node)""" + child = self.create_element("pre") + self.append_child(child) + return child + + def span_bottom(self): + """Find deepest level in stacked spans.""" + parent = self + child = self.last_child + if child == None: + return None + while child.is_text: + child = child.previous + if child == None: + break + if child == None or child.tagname != "span": + return None + + while True: + if child == None: + return parent + if child.tagname in ("a", "sub","sup","body") or child.is_text: + child = child.next + continue + if child.tagname == "span": + parent = child + child = child.first_child + else: + return parent + + def append_styled_span(self, style): + span = self.create_element("span") + span.add_style(style) + prev = self.span_bottom() + if prev == None: + prev = self + prev.append_child(span) + return prev + + def set_margins(self, val): + """Set margin values via CSS style""" + text = "margins: %s" % val + self.append_styled_span(text) + return self + + def set_font(self, font): + """Set font-family name via CSS style""" + text = "font-family: %s" % font + self.append_styled_span(text) + return self + + def set_color(self, color): + """Set text color via CSS style""" + text = f"color: %s" % self.color_text(color) + self.append_styled_span(text) + return self + + def set_columns(self, cols): + """Set number of text columns via CSS style""" + text = f"columns: {cols}" + self.append_styled_span(text) + return self + + def set_bgcolor(self, color): + """Set background color via CSS style""" + text = f"background-color: %s" % self.color_text(color) + self.add_style(text) # does not work on span level + return self + + def set_opacity(self, opacity): + """Set opacity via CSS style""" + text = f"opacity: {opacity}" + self.append_styled_span(text) + return self + + def set_align(self, align): + """Set text alignment via CSS style""" + text = "text-align: %s" + if isinstance( align, str): + t = align + elif align == TEXT_ALIGN_LEFT: + t = "left" + elif align == TEXT_ALIGN_CENTER: + t = "center" + elif align == TEXT_ALIGN_RIGHT: + t = "right" + elif align == TEXT_ALIGN_JUSTIFY: + t = "justify" + else: + raise ValueError(f"Unrecognised align={align}") + text = text % t + self.add_style(text) + return self + + def set_underline(self, val="underline"): + text = "text-decoration: %s" % val + self.append_styled_span(text) + return self + + def set_pagebreak_before(self): + """Insert a page break before this node.""" + text = "page-break-before: always" + self.add_style(text) + return self + + def set_pagebreak_after(self): + """Insert a page break after this node.""" + text = "page-break-after: always" + self.add_style(text) + return self + + def set_fontsize(self, fontsize): + """Set font size name via CSS 
style""" + if type(fontsize) is str: + px="" + else: + px="px" + text = f"font-size: {fontsize}{px}" + self.append_styled_span(text) + return self + + def set_lineheight(self, lineheight): + """Set line height name via CSS style - block-level only.""" + text = f"line-height: {lineheight}" + self.add_style(text) + return self + + def set_leading(self, leading): + """Set inter-line spacing value via CSS style - block-level only.""" + text = f"-mupdf-leading: {leading}" + self.add_style(text) + return self + + def set_word_spacing(self, spacing): + """Set inter-word spacing value via CSS style""" + text = f"word-spacing: {spacing}" + self.append_styled_span(text) + return self + + def set_letter_spacing(self, spacing): + """Set inter-letter spacing value via CSS style""" + text = f"letter-spacing: {spacing}" + self.append_styled_span(text) + return self + + def set_text_indent(self, indent): + """Set text indentation name via CSS style - block-level only.""" + text = f"text-indent: {indent}" + self.add_style(text) + return self + + def set_bold(self, val=True): + """Set bold on / off via CSS style""" + if val: + val="bold" + else: + val="normal" + text = "font-weight: %s" % val + self.append_styled_span(text) + return self + + def set_italic(self, val=True): + """Set italic on / off via CSS style""" + if val: + val="italic" + else: + val="normal" + text = "font-style: %s" % val + self.append_styled_span(text) + return self + + def set_properties( + self, + align=None, + bgcolor=None, + bold=None, + color=None, + columns=None, + font=None, + fontsize=None, + indent=None, + italic=None, + leading=None, + letter_spacing=None, + lineheight=None, + margins=None, + pagebreak_after=None, + pagebreak_before=None, + word_spacing=None, + unqid=None, + cls=None, + ): + """Set any or all properties of a node. + + To be used for existing nodes preferrably. + """ + root = self.root + temp = root.add_division() + if align is not None: + temp.set_align(align) + if bgcolor is not None: + temp.set_bgcolor(bgcolor) + if bold is not None: + temp.set_bold(bold) + if color is not None: + temp.set_color(color) + if columns is not None: + temp.set_columns(columns) + if font is not None: + temp.set_font(font) + if fontsize is not None: + temp.set_fontsize(fontsize) + if indent is not None: + temp.set_text_indent(indent) + if italic is not None: + temp.set_italic(italic) + if leading is not None: + temp.set_leading(leading) + if letter_spacing is not None: + temp.set_letter_spacing(letter_spacing) + if lineheight is not None: + temp.set_lineheight(lineheight) + if margins is not None: + temp.set_margins(margins) + if pagebreak_after is not None: + temp.set_pagebreak_after() + if pagebreak_before is not None: + temp.set_pagebreak_before() + if word_spacing is not None: + temp.set_word_spacing(word_spacing) + if unqid is not None: + self.set_id(unqid) + if cls is not None: + self.add_class(cls) + + styles = [] + top_style = temp.get_attribute_value("style") + if top_style is not None: + styles.append(top_style) + child = temp.first_child + while child: + styles.append(child.get_attribute_value("style")) + child = child.first_child + self.set_attribute("style", ";".join(styles)) + temp.remove() + return self + + def set_id(self, unique): + """Set a unique id.""" + # check uniqueness + tagname = self.tagname + root = self.root + if root.find(None, "id", unique): + raise ValueError(f"id '{unique}' already exists") + self.set_attribute("id", unique) + return self + + def add_text(self, text): + """Add text. 
Line breaks are honored.""" + lines = text.splitlines() + line_count = len(lines) + prev = self.span_bottom() + if prev == None: + prev = self + + for i, line in enumerate(lines): + prev.append_child(self.create_text_node(line)) + if i < line_count - 1: + prev.append_child(self.create_element("br")) + return self + + def add_style(self, text): + """Set some style via CSS style. Replaces complete style spec.""" + style = self.get_attribute_value("style") + if style != None and text in style: + return self + self.remove_attribute("style") + if style == None: + style = text + else: + style += ";" + text + self.set_attribute("style", style) + return self + + def add_class(self, text): + """Set some class via CSS. Replaces complete class spec.""" + cls = self.get_attribute_value("class") + if cls != None and text in cls: + return self + self.remove_attribute("class") + if cls == None: + cls = text + else: + cls += " " + text + self.set_attribute("class", cls) + return self + + def insert_text(self, text): + lines = text.splitlines() + line_count = len(lines) + for i, line in enumerate(lines): + self.append_child(self.create_text_node(line)) + if i < line_count - 1: + self.append_child(self.create_element("br")) + return self + + def __enter__(self): + return self + + def __exit__(self, *args): + pass + + def __del__(self): + if not type(self) is Xml: + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + %} + } +}; + +//------------------------------------------------------------------------ +// Story +//------------------------------------------------------------------------ +struct Story +{ + %extend + { + ~Story() + { + DEBUGMSG1("Story"); + fz_story *this_story = (fz_story *) $self; + fz_drop_story(gctx, this_story); + DEBUGMSG2; + } + + FITZEXCEPTION(Story, !result) + %pythonprepend Story %{ + if archive != None and isinstance(archive, Archive) == False: + archive = Archive(archive) + %} + Story(const char* html=NULL, const char *user_css=NULL, double em=12, struct Archive *archive=NULL) + { + fz_story* story = NULL; + fz_buffer *buffer = NULL; + fz_archive* arch = NULL; + fz_var(story); + fz_var(buffer); + const char *html2=""; + if (html) { + html2=html; + } + + fz_try(gctx) + { + buffer = fz_new_buffer_from_copied_data(gctx, html2, strlen(html2)+1); + if (archive) { + arch = (fz_archive *) archive; + } + story = fz_new_story(gctx, buffer, user_css, em, arch); + } + fz_always(gctx) + { + fz_drop_buffer(gctx, buffer); + } + fz_catch(gctx) + { + return NULL; + } + struct Story* ret = (struct Story *) story; + return ret; + } + + FITZEXCEPTION(reset, !result) + PyObject* reset() + { + fz_try(gctx) + { + fz_reset_story(gctx, (fz_story *)$self); + } + fz_catch(gctx) + { + return NULL; + } + Py_RETURN_NONE; + } + + FITZEXCEPTION(place, !result) + PyObject* place( PyObject* where) + { + PyObject* ret = NULL; + fz_try(gctx) + { + fz_rect where2 = JM_rect_from_py(where); + fz_rect filled; + int more = fz_place_story( gctx, (fz_story*) $self, where2, &filled); + ret = PyTuple_New(2); + PyTuple_SET_ITEM( ret, 0, Py_BuildValue( "i", more)); + PyTuple_SET_ITEM( ret, 1, JM_py_from_rect( filled)); + } + fz_catch(gctx) + { + return NULL; + } + return ret; + } + + FITZEXCEPTION(draw, !result) + PyObject* draw( struct DeviceWrapper* device, PyObject* matrix=NULL) + { + fz_try(gctx) + { + fz_matrix ctm2 = JM_matrix_from_py( matrix); + fz_device *dev = (device) ? 
                    device->device : NULL;
+                fz_draw_story( gctx, (fz_story*) $self, dev, ctm2);
+            }
+            fz_catch(gctx)
+            {
+                return NULL;
+            }
+            Py_RETURN_NONE;
+        }
+
+        FITZEXCEPTION(document, !result)
+        struct Xml* document()
+        {
+            fz_xml* dom=NULL;
+            fz_try(gctx) {
+                dom = fz_story_document( gctx, (fz_story*) $self);
+            }
+            fz_catch(gctx) {
+                return NULL;
+            }
+            fz_keep_xml( gctx, dom);
+            return (struct Xml*) dom;
+        }
+
+        FITZEXCEPTION(element_positions, !result)
+        %pythonprepend element_positions %{
+        """Trigger a callback function to record where items have been placed.
+
+        Args:
+            function: a function accepting exactly one argument.
+            args: an optional dictionary for passing additional data.
+        """
+        if type(args) is dict:
+            for k in args.keys():
+                if not (type(k) is str and k.isidentifier()):
+                    raise ValueError(f"invalid key '{k}'")
+        else:
+            args = {}
+        if not callable(function) or function.__code__.co_argcount != 1:
+            raise ValueError("callback 'function' must be a callable with exactly one argument")
+        %}
+        PyObject* element_positions(PyObject *function, PyObject *args)
+        {
+            PyObject *callarg=NULL;
+            fz_try(gctx) {
+                callarg = Py_BuildValue("OO", function, args);
+                fz_story_positions(gctx, (fz_story *) $self, Story_Callback, callarg);
+            }
+            fz_always(gctx) {
+                Py_CLEAR(callarg);
+            }
+            fz_catch(gctx) {
+                return NULL;
+            }
+            Py_RETURN_NONE;
+        }
+
+        %pythoncode
+        %{
+        def write(self, writer, rectfn, positionfn=None, pagefn=None):
+            dev = None
+            page_num = 0
+            rect_num = 0
+            filled = Rect(0, 0, 0, 0)
+            while 1:
+                mediabox, rect, ctm = rectfn(rect_num, filled)
+                rect_num += 1
+                if mediabox:
+                    # new page.
+                    page_num += 1
+                more, filled = self.place( rect)
+                #print(f"write(): positionfn={positionfn}")
+                if positionfn:
+                    def positionfn2(position):
+                        # We add a `.page_num` member to the
+                        # `ElementPosition` instance.
+                        position.page_num = page_num
+                        #print(f"write(): position={position}")
+                        positionfn(position)
+                    self.element_positions(positionfn2, {})
+                if writer:
+                    if mediabox:
+                        # new page.
+                        if dev:
+                            if pagefn:
+                                pagefn(page_num, mediabox, dev, 1)
+                            writer.end_page()
+                        dev = writer.begin_page( mediabox)
+                        if pagefn:
+                            pagefn(page_num, mediabox, dev, 0)
+                    self.draw( dev, ctm)
+                    if not more:
+                        if pagefn:
+                            pagefn( page_num, mediabox, dev, 1)
+                        writer.end_page()
+                else:
+                    self.draw(None, ctm)
+                if not more:
+                    break
+
+        @staticmethod
+        def write_stabilized(writer, contentfn, rectfn, user_css=None, em=12, positionfn=None, pagefn=None, archive=None, add_header_ids=True):
+            positions = list()
+            content = None
+            # Iterate until stable.
+            while 1:
+                content_prev = content
+                content = contentfn( positions)
+                stable = False
+                if content == content_prev:
+                    stable = True
+                content2 = content
+                story = Story(content2, user_css, em, archive)
+
+                if add_header_ids:
+                    story.add_header_ids()
+
+                positions = list()
+                def positionfn2(position):
+                    #print(f"write_stabilized(): stable={stable} positionfn={positionfn} position={position}")
+                    positions.append(position)
+                    if stable and positionfn:
+                        positionfn(position)
+                story.write(
+                        writer if stable else None,
+                        rectfn,
+                        positionfn2,
+                        pagefn,
+                        )
+                if stable:
+                    break
+
+        def add_header_ids(self):
+            '''
+            Look for heading (`h1` - `h6`) items in `self` and add unique `id`
+            attributes if not already present.
+ ''' + dom = self.body + i = 0 + x = dom.find(None, None, None) + while x: + name = x.tagname + if len(name) == 2 and name[0]=="h" and name[1] in "123456": + attr = x.get_attribute_value("id") + if not attr: + id_ = f"h_id_{i}" + #print(f"name={name}: setting id={id_}") + x.set_attribute("id", id_) + i += 1 + x = x.find_next(None, None, None) + + def write_with_links(self, rectfn, positionfn=None, pagefn=None): + #print("write_with_links()") + stream = io.BytesIO() + writer = DocumentWriter(stream) + positions = [] + def positionfn2(position): + #print(f"write_with_links(): position={position}") + positions.append(position) + if positionfn: + positionfn(position) + self.write(writer, rectfn, positionfn=positionfn2, pagefn=pagefn) + writer.close() + stream.seek(0) + return Story.add_pdf_links(stream, positions) + + @staticmethod + def write_stabilized_with_links(contentfn, rectfn, user_css=None, em=12, positionfn=None, pagefn=None, archive=None, add_header_ids=True): + #print("write_stabilized_with_links()") + stream = io.BytesIO() + writer = DocumentWriter(stream) + positions = [] + def positionfn2(position): + #print(f"write_stabilized_with_links(): position={position}") + positions.append(position) + if positionfn: + positionfn(position) + Story.write_stabilized(writer, contentfn, rectfn, user_css, em, positionfn2, pagefn, archive, add_header_ids) + writer.close() + stream.seek(0) + return Story.add_pdf_links(stream, positions) + + @staticmethod + def add_pdf_links(document_or_stream, positions): + """ + Adds links to PDF document. + Args: + document_or_stream: + A PDF `Document` or raw PDF content, for example an + `io.BytesIO` instance. + positions: + List of `ElementPosition`'s for `document_or_stream`, + typically from Story.element_positions(). We raise an + exception if two or more positions have same id. + Returns: + `document_or_stream` if a `Document` instance, otherwise a + new `Document` instance. + We raise an exception if an `href` in `positions` refers to an + internal position `#` but no item in `postions` has `id = + name`. + """ + if isinstance(document_or_stream, Document): + document = document_or_stream + else: + document = Document("pdf", document_or_stream) + + # Create dict from id to position, which we will use to find + # link destinations. + # + id_to_position = dict() + #print(f"positions: {positions}") + for position in positions: + #print(f"add_pdf_links(): position: {position}") + if (position.open_close & 1) and position.id: + #print(f"add_pdf_links(): position with id: {position}") + if position.id in id_to_position: + #print(f"Ignoring duplicate positions with id={position.id!r}") + pass + else: + id_to_position[ position.id] = position + + # Insert links for all positions that have an `href` starting + # with '#'. + # + for position_from in positions: + if ((position_from.open_close & 1) + and position_from.href + and position_from.href.startswith("#") + ): + # This is a `...` internal link. + #print(f"add_pdf_links(): position with href: {position}") + target_id = position_from.href[1:] + try: + position_to = id_to_position[ target_id] + except Exception as e: + raise RuntimeError(f"No destination with id={target_id}, required by position_from: {position_from}") + # Make link from `position_from`'s rect to top-left of + # `position_to`'s rect. 
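+                    # The dict built below uses the Page.insert_link() keys:
+                    # "kind", "from" (source rectangle), "to" (target point)
+                    # and "page" (0-based target page number).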
+ if 0: + print(f"add_pdf_links(): making link from:") + print(f"add_pdf_links(): {position_from}") + print(f"add_pdf_links(): to:") + print(f"add_pdf_links(): {position_to}") + link = dict() + link["kind"] = LINK_GOTO + link["from"] = Rect(position_from.rect) + x0, y0, x1, y1 = position_to.rect + # This appears to work well with viewers which scroll + # to make destination point top-left of window. + link["to"] = Point(x0, y0) + link["page"] = position_to.page_num - 1 + document[position_from.page_num - 1].insert_link(link) + return document + + @property + def body(self): + dom = self.document() + return dom.bodytag() + + def __del__(self): + if not type(self) is Story: + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + %} + } +}; + + +//------------------------------------------------------------------------ +// Tools - a collection of tools and utilities +//------------------------------------------------------------------------ +struct Tools +{ + %extend + { + Tools() + { + /* It looks like global objects are never destructed when running + with SWIG, so we use Memento_startLeaking()/Memento_stopLeaking(). + */ + Memento_startLeaking(); + void* p = malloc( sizeof(struct Tools)); + Memento_stopLeaking(); + //fprintf(stderr, "Tools constructor p=%p\n", p); + return (struct Tools*) p; + } + + ~Tools() + { + /* This is not called. */ + struct Tools* p = (struct Tools*) $self; + //fprintf(stderr, "~Tools() p=%p\n", p); + free(p); + } + + %pythonprepend gen_id + %{"""Return a unique positive integer."""%} + PyObject *gen_id() + { + JM_UNIQUE_ID += 1; + if (JM_UNIQUE_ID < 0) JM_UNIQUE_ID = 1; + return Py_BuildValue("i", JM_UNIQUE_ID); + } + + + FITZEXCEPTION(set_icc, !result) + %pythonprepend set_icc + %{"""Set ICC color handling on or off."""%} + PyObject *set_icc(int on=0) + { + fz_try(gctx) { + if (on) { + if (FZ_ENABLE_ICC) + fz_enable_icc(gctx); + else { + RAISEPY(gctx, "MuPDF built w/o ICC support",PyExc_ValueError); + } + } else if (FZ_ENABLE_ICC) { + fz_disable_icc(gctx); + } + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + %pythonprepend set_annot_stem + %{"""Get / set id prefix for annotations."""%} + char *set_annot_stem(char *stem=NULL) + { + if (!stem) { + return JM_annot_id_stem; + } + size_t len = strlen(stem) + 1; + if (len > 50) len = 50; + memcpy(&JM_annot_id_stem, stem, len); + return JM_annot_id_stem; + } + + + %pythonprepend set_small_glyph_heights + %{"""Set / unset small glyph heights."""%} + PyObject *set_small_glyph_heights(PyObject *on=NULL) + { + if (!on || on == Py_None) { + return JM_BOOL(small_glyph_heights); + } + if (PyObject_IsTrue(on)) { + small_glyph_heights = 1; + } else { + small_glyph_heights = 0; + } + return JM_BOOL(small_glyph_heights); + } + + + %pythonprepend set_subset_fontnames + %{"""Set / unset returning fontnames with their subset prefix."""%} + PyObject *set_subset_fontnames(PyObject *on=NULL) + { + if (!on || on == Py_None) { + return JM_BOOL(subset_fontnames); + } + if (PyObject_IsTrue(on)) { + subset_fontnames = 1; + } else { + subset_fontnames = 0; + } + return JM_BOOL(subset_fontnames); + } + + + %pythonprepend set_low_memory + %{"""Set / unset MuPDF device caching."""%} + PyObject *set_low_memory(PyObject *on=NULL) + { + if (!on || on == Py_None) { + return JM_BOOL(no_device_caching); + } + if (PyObject_IsTrue(on)) { + no_device_caching = 1; + } else { + no_device_caching = 0; + } + return JM_BOOL(no_device_caching); + } + + + %pythonprepend unset_quad_corrections + %{"""Set ascender / 
descender corrections on or off."""%} + PyObject *unset_quad_corrections(PyObject *on=NULL) + { + if (!on || on == Py_None) { + return JM_BOOL(skip_quad_corrections); + } + if (PyObject_IsTrue(on)) { + skip_quad_corrections = 1; + } else { + skip_quad_corrections = 0; + } + return JM_BOOL(skip_quad_corrections); + } + + + %pythonprepend store_shrink + %{"""Free 'percent' of current store size."""%} + PyObject *store_shrink(int percent) + { + if (percent >= 100) { + fz_empty_store(gctx); + return Py_BuildValue("i", 0); + } + if (percent > 0) fz_shrink_store(gctx, 100 - percent); + return Py_BuildValue("i", (int) gctx->store->size); + } + + + %pythoncode%{@property%} + %pythonprepend store_size + %{"""MuPDF current store size."""%} + PyObject *store_size() + { + return Py_BuildValue("i", (int) gctx->store->size); + } + + + %pythoncode%{@property%} + %pythonprepend store_maxsize + %{"""MuPDF store size limit."""%} + PyObject *store_maxsize() + { + return Py_BuildValue("i", (int) gctx->store->max); + } + + + %pythonprepend show_aa_level + %{"""Show anti-aliasing values."""%} + %pythonappend show_aa_level %{ + temp = {"graphics": val[0], "text": val[1], "graphics_min_line_width": val[2]} + val = temp%} + PyObject *show_aa_level() + { + return Py_BuildValue("iif", + fz_graphics_aa_level(gctx), + fz_text_aa_level(gctx), + fz_graphics_min_line_width(gctx)); + } + + + %pythonprepend set_aa_level + %{"""Set anti-aliasing level."""%} + void set_aa_level(int level) + { + fz_set_aa_level(gctx, level); + } + + + %pythonprepend set_graphics_min_line_width + %{"""Set the graphics minimum line width."""%} + void set_graphics_min_line_width(float min_line_width) + { + fz_set_graphics_min_line_width(gctx, min_line_width); + } + + + FITZEXCEPTION(image_profile, !result) + %pythonprepend image_profile + %{"""Metadata of an image binary stream."""%} + PyObject *image_profile(PyObject *stream, int keep_image=0) + { + PyObject *rc = NULL; + fz_try(gctx) { + rc = JM_image_profile(gctx, stream, keep_image); + } + fz_catch(gctx) { + return NULL; + } + return rc; + } + + + PyObject *_rotate_matrix(struct Page *page) + { + pdf_page *pdfpage = pdf_page_from_fz_page(gctx, (fz_page *) page); + if (!pdfpage) return JM_py_from_matrix(fz_identity); + return JM_py_from_matrix(JM_rotate_page_matrix(gctx, pdfpage)); + } + + + PyObject *_derotate_matrix(struct Page *page) + { + pdf_page *pdfpage = pdf_page_from_fz_page(gctx, (fz_page *) page); + if (!pdfpage) return JM_py_from_matrix(fz_identity); + return JM_py_from_matrix(JM_derotate_page_matrix(gctx, pdfpage)); + } + + + %pythoncode%{@property%} + %pythonprepend fitz_config + %{"""PyMuPDF configuration parameters."""%} + PyObject *fitz_config() + { + return JM_fitz_config(); + } + + + %pythonprepend glyph_cache_empty + %{"""Empty the glyph cache."""%} + void glyph_cache_empty() + { + fz_purge_glyph_cache(gctx); + } + + + FITZEXCEPTION(_fill_widget, !result) + %pythonappend _fill_widget %{ + widget.rect = Rect(annot.rect) + widget.xref = annot.xref + widget.parent = annot.parent + widget._annot = annot # backpointer to annot object + if not widget.script: + widget.script = None + if not widget.script_stroke: + widget.script_stroke = None + if not widget.script_format: + widget.script_format = None + if not widget.script_change: + widget.script_change = None + if not widget.script_calc: + widget.script_calc = None + if not widget.script_blur: + widget.script_blur = None + if not widget.script_focus: + widget.script_focus = None + %} + PyObject *_fill_widget(struct Annot 
*annot, PyObject *widget) + { + fz_try(gctx) { + JM_get_widget_properties(gctx, (pdf_annot *) annot, widget); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(_save_widget, !result) + PyObject *_save_widget(struct Annot *annot, PyObject *widget) + { + fz_try(gctx) { + JM_set_widget_properties(gctx, (pdf_annot *) annot, widget); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(_reset_widget, !result) + PyObject *_reset_widget(struct Annot *annot) + { + fz_try(gctx) { + pdf_annot *this_annot = (pdf_annot *) annot; + pdf_obj *this_annot_obj = pdf_annot_obj(gctx, this_annot); + pdf_document *pdf = pdf_get_bound_document(gctx, this_annot_obj); + pdf_field_reset(gctx, pdf, this_annot_obj); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + // Ensure that widgets with a /AA/C JavaScript are in AcroForm/CO + FITZEXCEPTION(_ensure_widget_calc, !result) + PyObject *_ensure_widget_calc(struct Annot *annot) + { + pdf_obj *PDFNAME_CO=NULL; + fz_try(gctx) { + pdf_obj *annot_obj = pdf_annot_obj(gctx, (pdf_annot *) annot); + pdf_document *pdf = pdf_get_bound_document(gctx, annot_obj); + PDFNAME_CO = pdf_new_name(gctx, "CO"); // = PDF_NAME(CO) + pdf_obj *acro = pdf_dict_getl(gctx, // get AcroForm dict + pdf_trailer(gctx, pdf), + PDF_NAME(Root), + PDF_NAME(AcroForm), + NULL); + + pdf_obj *CO = pdf_dict_get(gctx, acro, PDFNAME_CO); // = AcroForm/CO + if (!CO) { + CO = pdf_dict_put_array(gctx, acro, PDFNAME_CO, 2); + } + int i, n = pdf_array_len(gctx, CO); + int xref, nxref, found = 0; + xref = pdf_to_num(gctx, annot_obj); + for (i = 0; i < n; i++) { + nxref = pdf_to_num(gctx, pdf_array_get(gctx, CO, i)); + if (xref == nxref) { + found = 1; + break; + } + } + if (!found) { + pdf_array_push_drop(gctx, CO, pdf_new_indirect(gctx, pdf, xref, 0)); + } + } + fz_always(gctx) { + pdf_drop_obj(gctx, PDFNAME_CO); + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(_parse_da, !result) + %pythonappend _parse_da %{ + if not val: + return ((0,), "", 0) + font = "Helv" + fsize = 12 + col = (0, 0, 0) + dat = val.split() # split on any whitespace + for i, item in enumerate(dat): + if item == "Tf": + font = dat[i - 2][1:] + fsize = float(dat[i - 1]) + dat[i] = dat[i-1] = dat[i-2] = "" + continue + if item == "g": # unicolor text + col = [(float(dat[i - 1]))] + dat[i] = dat[i-1] = "" + continue + if item == "rg": # RGB colored text + col = [float(f) for f in dat[i - 3:i]] + dat[i] = dat[i-1] = dat[i-2] = dat[i-3] = "" + continue + if item == "k": # CMYK colored text + col = [float(f) for f in dat[i - 4:i]] + dat[i] = dat[i-1] = dat[i-2] = dat[i-3] = dat[i-4] = "" + continue + + val = (col, font, fsize) + %} + PyObject *_parse_da(struct Annot *annot) + { + char *da_str = NULL; + pdf_annot *this_annot = (pdf_annot *) annot; + pdf_obj *this_annot_obj = pdf_annot_obj(gctx, this_annot); + pdf_document *pdf = pdf_get_bound_document(gctx, this_annot_obj); + fz_try(gctx) { + pdf_obj *da = pdf_dict_get_inheritable(gctx, this_annot_obj, + PDF_NAME(DA)); + if (!da) { + pdf_obj *trailer = pdf_trailer(gctx, pdf); + da = pdf_dict_getl(gctx, trailer, PDF_NAME(Root), + PDF_NAME(AcroForm), + PDF_NAME(DA), + NULL); + } + da_str = (char *) pdf_to_text_string(gctx, da); + } + fz_catch(gctx) { + return NULL; + } + return JM_UnicodeFromStr(da_str); + } + + + FITZEXCEPTION(_update_da, !result) + PyObject *_update_da(struct Annot *annot, char *da_str) + { + fz_try(gctx) { + pdf_annot *this_annot = (pdf_annot *) annot; + pdf_obj 
*this_annot_obj = pdf_annot_obj(gctx, this_annot); + pdf_dict_put_text_string(gctx, this_annot_obj, PDF_NAME(DA), da_str); + pdf_dict_del(gctx, this_annot_obj, PDF_NAME(DS)); /* not supported */ + pdf_dict_del(gctx, this_annot_obj, PDF_NAME(RC)); /* not supported */ + } + fz_catch(gctx) { + return NULL; + } + Py_RETURN_NONE; + } + + + FITZEXCEPTION(_get_all_contents, !result) + %pythonprepend _get_all_contents + %{"""Concatenate all /Contents objects of a page into a bytes object."""%} + PyObject *_get_all_contents(struct Page *fzpage) + { + pdf_page *page = pdf_page_from_fz_page(gctx, (fz_page *) fzpage); + fz_buffer *res = NULL; + PyObject *result = NULL; + fz_try(gctx) { + ASSERT_PDF(page); + res = JM_read_contents(gctx, page->obj); + result = JM_BinFromBuffer(gctx, res); + } + fz_always(gctx) { + fz_drop_buffer(gctx, res); + } + fz_catch(gctx) { + return NULL; + } + return result; + } + + + FITZEXCEPTION(_insert_contents, !result) + %pythonprepend _insert_contents + %{"""Add bytes as a new /Contents object for a page, and return its xref."""%} + PyObject *_insert_contents(struct Page *page, PyObject *newcont, int overlay=1) + { + fz_buffer *contbuf = NULL; + int xref = 0; + pdf_page *pdfpage = pdf_page_from_fz_page(gctx, (fz_page *) page); + fz_try(gctx) { + ASSERT_PDF(pdfpage); + ENSURE_OPERATION(gctx, pdfpage->doc); + contbuf = JM_BufferFromBytes(gctx, newcont); + xref = JM_insert_contents(gctx, pdfpage->doc, pdfpage->obj, contbuf, overlay); + } + fz_always(gctx) { + fz_drop_buffer(gctx, contbuf); + } + fz_catch(gctx) { + return NULL; + } + return Py_BuildValue("i", xref); + } + + %pythonprepend mupdf_version + %{"""Get version of MuPDF binary build."""%} + PyObject *mupdf_version() + { + return Py_BuildValue("s", FZ_VERSION); + } + + %pythonprepend mupdf_warnings + %{"""Get the MuPDF warnings/errors with optional reset (default)."""%} + %pythonappend mupdf_warnings %{ + val = "\n".join(val) + if reset: + self.reset_mupdf_warnings()%} + PyObject *mupdf_warnings(int reset=1) + { + Py_INCREF(JM_mupdf_warnings_store); + return JM_mupdf_warnings_store; + } + + int _int_from_language(char *language) + { + return fz_text_language_from_string(language); + } + + %pythonprepend reset_mupdf_warnings + %{"""Empty the MuPDF warnings/errors store."""%} + void reset_mupdf_warnings() + { + Py_CLEAR(JM_mupdf_warnings_store); + JM_mupdf_warnings_store = PyList_New(0); + } + + %pythonprepend mupdf_display_errors + %{"""Set MuPDF error display to True or False."""%} + PyObject *mupdf_display_errors(PyObject *on=NULL) + { + if (!on || on == Py_None) { + return JM_BOOL(JM_mupdf_show_errors); + } + if (PyObject_IsTrue(on)) { + JM_mupdf_show_errors = 1; + } else { + JM_mupdf_show_errors = 0; + } + return JM_BOOL(JM_mupdf_show_errors); + } + + %pythonprepend mupdf_display_warnings + %{"""Set MuPDF warnings display to True or False."""%} + PyObject *mupdf_display_warnings(PyObject *on=NULL) + { + if (!on || on == Py_None) { + return JM_BOOL(JM_mupdf_show_warnings); + } + if (PyObject_IsTrue(on)) { + JM_mupdf_show_warnings = 1; + } else { + JM_mupdf_show_warnings = 0; + } + return JM_BOOL(JM_mupdf_show_warnings); + } + + %pythoncode %{ +def _le_annot_parms(self, annot, p1, p2, fill_color): + """Get common parameters for making annot line end symbols. 
+ + Returns: + m: matrix that maps p1, p2 to points L, P on the x-axis + im: its inverse + L, P: transformed p1, p2 + w: line width + scol: stroke color string + fcol: fill color store_shrink + opacity: opacity string (gs command) + """ + w = annot.border["width"] # line width + sc = annot.colors["stroke"] # stroke color + if not sc: # black if missing + sc = (0,0,0) + scol = " ".join(map(str, sc)) + " RG\n" + if fill_color: + fc = fill_color + else: + fc = annot.colors["fill"] # fill color + if not fc: + fc = (1,1,1) # white if missing + fcol = " ".join(map(str, fc)) + " rg\n" + # nr = annot.rect + np1 = p1 # point coord relative to annot rect + np2 = p2 # point coord relative to annot rect + m = Matrix(util_hor_matrix(np1, np2)) # matrix makes the line horizontal + im = ~m # inverted matrix + L = np1 * m # converted start (left) point + R = np2 * m # converted end (right) point + if 0 <= annot.opacity < 1: + opacity = "/H gs\n" + else: + opacity = "" + return m, im, L, R, w, scol, fcol, opacity + +def _oval_string(self, p1, p2, p3, p4): + """Return /AP string defining an oval within a 4-polygon provided as points + """ + def bezier(p, q, r): + f = "%f %f %f %f %f %f c\n" + return f % (p.x, p.y, q.x, q.y, r.x, r.y) + + kappa = 0.55228474983 # magic number + ml = p1 + (p4 - p1) * 0.5 # middle points ... + mo = p1 + (p2 - p1) * 0.5 # for each ... + mr = p2 + (p3 - p2) * 0.5 # polygon ... + mu = p4 + (p3 - p4) * 0.5 # side + ol1 = ml + (p1 - ml) * kappa # the 8 bezier + ol2 = mo + (p1 - mo) * kappa # helper points + or1 = mo + (p2 - mo) * kappa + or2 = mr + (p2 - mr) * kappa + ur1 = mr + (p3 - mr) * kappa + ur2 = mu + (p3 - mu) * kappa + ul1 = mu + (p4 - mu) * kappa + ul2 = ml + (p4 - ml) * kappa + # now draw, starting from middle point of left side + ap = "%f %f m\n" % (ml.x, ml.y) + ap += bezier(ol1, ol2, mo) + ap += bezier(or1, or2, mr) + ap += bezier(ur1, ur2, mu) + ap += bezier(ul1, ul2, ml) + return ap + +def _le_diamond(self, annot, p1, p2, lr, fill_color): + """Make stream commands for diamond line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = self._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 # 2*shift*width = length of square edge + d = shift * max(1, w) + M = R - (d/2., 0) if lr else L + (d/2., 0) + r = Rect(M, M) + (-d, -d, d, d) # the square + # the square makes line longer by (2*shift - 1)*width + p = (r.tl + (r.bl - r.tl) * 0.5) * im + ap = "q\n%s%f %f m\n" % (opacity, p.x, p.y) + p = (r.tl + (r.tr - r.tl) * 0.5) * im + ap += "%f %f l\n" % (p.x, p.y) + p = (r.tr + (r.br - r.tr) * 0.5) * im + ap += "%f %f l\n" % (p.x, p.y) + p = (r.br + (r.bl - r.br) * 0.5) * im + ap += "%f %f l\n" % (p.x, p.y) + ap += "%g w\n" % w + ap += scol + fcol + "b\nQ\n" + return ap + +def _le_square(self, annot, p1, p2, lr, fill_color): + """Make stream commands for square line end symbol. "lr" denotes left (False) or right point. 
+ """ + m, im, L, R, w, scol, fcol, opacity = self._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 # 2*shift*width = length of square edge + d = shift * max(1, w) + M = R - (d/2., 0) if lr else L + (d/2., 0) + r = Rect(M, M) + (-d, -d, d, d) # the square + # the square makes line longer by (2*shift - 1)*width + p = r.tl * im + ap = "q\n%s%f %f m\n" % (opacity, p.x, p.y) + p = r.tr * im + ap += "%f %f l\n" % (p.x, p.y) + p = r.br * im + ap += "%f %f l\n" % (p.x, p.y) + p = r.bl * im + ap += "%f %f l\n" % (p.x, p.y) + ap += "%g w\n" % w + ap += scol + fcol + "b\nQ\n" + return ap + +def _le_circle(self, annot, p1, p2, lr, fill_color): + """Make stream commands for circle line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = self._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 # 2*shift*width = length of square edge + d = shift * max(1, w) + M = R - (d/2., 0) if lr else L + (d/2., 0) + r = Rect(M, M) + (-d, -d, d, d) # the square + ap = "q\n" + opacity + self._oval_string(r.tl * im, r.tr * im, r.br * im, r.bl * im) + ap += "%g w\n" % w + ap += scol + fcol + "b\nQ\n" + return ap + +def _le_butt(self, annot, p1, p2, lr, fill_color): + """Make stream commands for butt line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = self._le_annot_parms(annot, p1, p2, fill_color) + shift = 3 + d = shift * max(1, w) + M = R if lr else L + top = (M + (0, -d/2.)) * im + bot = (M + (0, d/2.)) * im + ap = "\nq\n%s%f %f m\n" % (opacity, top.x, top.y) + ap += "%f %f l\n" % (bot.x, bot.y) + ap += "%g w\n" % w + ap += scol + "s\nQ\n" + return ap + +def _le_slash(self, annot, p1, p2, lr, fill_color): + """Make stream commands for slash line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = self._le_annot_parms(annot, p1, p2, fill_color) + rw = 1.1547 * max(1, w) * 1.0 # makes rect diagonal a 30 deg inclination + M = R if lr else L + r = Rect(M.x - rw, M.y - 2 * w, M.x + rw, M.y + 2 * w) + top = r.tl * im + bot = r.br * im + ap = "\nq\n%s%f %f m\n" % (opacity, top.x, top.y) + ap += "%f %f l\n" % (bot.x, bot.y) + ap += "%g w\n" % w + ap += scol + "s\nQ\n" + return ap + +def _le_openarrow(self, annot, p1, p2, lr, fill_color): + """Make stream commands for open arrow line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = self._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 + d = shift * max(1, w) + p2 = R + (d/2., 0) if lr else L - (d/2., 0) + p1 = p2 + (-2*d, -d) if lr else p2 + (2*d, -d) + p3 = p2 + (-2*d, d) if lr else p2 + (2*d, d) + p1 *= im + p2 *= im + p3 *= im + ap = "\nq\n%s%f %f m\n" % (opacity, p1.x, p1.y) + ap += "%f %f l\n" % (p2.x, p2.y) + ap += "%f %f l\n" % (p3.x, p3.y) + ap += "%g w\n" % w + ap += scol + "S\nQ\n" + return ap + +def _le_closedarrow(self, annot, p1, p2, lr, fill_color): + """Make stream commands for closed arrow line end symbol. "lr" denotes left (False) or right point. 
+ """ + m, im, L, R, w, scol, fcol, opacity = self._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 + d = shift * max(1, w) + p2 = R + (d/2., 0) if lr else L - (d/2., 0) + p1 = p2 + (-2*d, -d) if lr else p2 + (2*d, -d) + p3 = p2 + (-2*d, d) if lr else p2 + (2*d, d) + p1 *= im + p2 *= im + p3 *= im + ap = "\nq\n%s%f %f m\n" % (opacity, p1.x, p1.y) + ap += "%f %f l\n" % (p2.x, p2.y) + ap += "%f %f l\n" % (p3.x, p3.y) + ap += "%g w\n" % w + ap += scol + fcol + "b\nQ\n" + return ap + +def _le_ropenarrow(self, annot, p1, p2, lr, fill_color): + """Make stream commands for right open arrow line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = self._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 + d = shift * max(1, w) + p2 = R - (d/3., 0) if lr else L + (d/3., 0) + p1 = p2 + (2*d, -d) if lr else p2 + (-2*d, -d) + p3 = p2 + (2*d, d) if lr else p2 + (-2*d, d) + p1 *= im + p2 *= im + p3 *= im + ap = "\nq\n%s%f %f m\n" % (opacity, p1.x, p1.y) + ap += "%f %f l\n" % (p2.x, p2.y) + ap += "%f %f l\n" % (p3.x, p3.y) + ap += "%g w\n" % w + ap += scol + fcol + "S\nQ\n" + return ap + +def _le_rclosedarrow(self, annot, p1, p2, lr, fill_color): + """Make stream commands for right closed arrow line end symbol. "lr" denotes left (False) or right point. + """ + m, im, L, R, w, scol, fcol, opacity = self._le_annot_parms(annot, p1, p2, fill_color) + shift = 2.5 + d = shift * max(1, w) + p2 = R - (2*d, 0) if lr else L + (2*d, 0) + p1 = p2 + (2*d, -d) if lr else p2 + (-2*d, -d) + p3 = p2 + (2*d, d) if lr else p2 + (-2*d, d) + p1 *= im + p2 *= im + p3 *= im + ap = "\nq\n%s%f %f m\n" % (opacity, p1.x, p1.y) + ap += "%f %f l\n" % (p2.x, p2.y) + ap += "%f %f l\n" % (p3.x, p3.y) + ap += "%g w\n" % w + ap += scol + fcol + "b\nQ\n" + return ap + +def __del__(self): + if not type(self) is Tools: + return + if getattr(self, "thisown", False): + self.__swig_destroy__(self) + %} + } +}; diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-annot.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-annot.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,455 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ +*/ +//------------------------------------------------------------------------ +// return pdf_obj "border style" from Python str +//------------------------------------------------------------------------ +pdf_obj *JM_get_border_style(fz_context *ctx, PyObject *style) +{ + pdf_obj *val = PDF_NAME(S); + if (!style) return val; + char *s = JM_StrAsChar(style); + JM_PyErr_Clear; + if (!s) return val; + if (!strncmp(s, "b", 1) || !strncmp(s, "B", 1)) val = PDF_NAME(B); + else if (!strncmp(s, "d", 1) || !strncmp(s, "D", 1)) val = PDF_NAME(D); + else if (!strncmp(s, "i", 1) || !strncmp(s, "I", 1)) val = PDF_NAME(I); + else if (!strncmp(s, "u", 1) || !strncmp(s, "U", 1)) val = PDF_NAME(U); + else if (!strncmp(s, "s", 1) || !strncmp(s, "S", 1)) val = PDF_NAME(S); + return val; +} + +//------------------------------------------------------------------------ +// Make /DA string of annotation +//------------------------------------------------------------------------ +const char *JM_expand_fname(const char **name) +{ + if (!*name) return "Helv"; + if (!strncmp(*name, "Co", 2)) return "Cour"; + if (!strncmp(*name, "co", 2)) return "Cour"; + if (!strncmp(*name, "Ti", 2)) return "TiRo"; + if (!strncmp(*name, "ti", 2)) return "TiRo"; + if (!strncmp(*name, "Sy", 2)) return "Symb"; + if (!strncmp(*name, "sy", 2)) return "Symb"; + if (!strncmp(*name, "Za", 2)) return "ZaDb"; + if (!strncmp(*name, "za", 2)) return "ZaDb"; + return "Helv"; +} + +void JM_make_annot_DA(fz_context *ctx, pdf_annot *annot, int ncol, float col[4], const char *fontname, float fontsize) +{ + fz_buffer *buf = NULL; + fz_try(ctx) + { + buf = fz_new_buffer(ctx, 50); + if (ncol <= 1) + fz_append_printf(ctx, buf, "%g g ", col[0]); + else if (ncol < 4) + fz_append_printf(ctx, buf, "%g %g %g rg ", col[0], col[1], col[2]); + else + fz_append_printf(ctx, buf, "%g %g %g %g k ", col[0], col[1], col[2], col[3]); + fz_append_printf(ctx, buf, "/%s %g Tf", JM_expand_fname(&fontname), fontsize); + unsigned char *da = NULL; + size_t len = fz_buffer_storage(ctx, buf, &da); + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + pdf_dict_put_string(ctx, annot_obj, PDF_NAME(DA), (const char *) da, len); + } + fz_always(ctx) fz_drop_buffer(ctx, buf); + fz_catch(ctx) fz_rethrow(ctx); + return; +} + +//------------------------------------------------------------------------ +// refreshes the link and annotation tables of a page +//------------------------------------------------------------------------ +void JM_refresh_links(fz_context *ctx, pdf_page *page) +{ + if (!page) return; + fz_try(ctx) { + pdf_obj *obj = pdf_dict_get(ctx, page->obj, PDF_NAME(Annots)); + if (obj) + { + pdf_document *pdf = page->doc; + int number = pdf_lookup_page_number(ctx, pdf, page->obj); + fz_rect page_mediabox; + fz_matrix page_ctm; + pdf_page_transform(ctx, page, &page_mediabox, &page_ctm); + page->links = pdf_load_link_annots(ctx, pdf, page, obj, number, page_ctm); + } + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return; +} + + +PyObject *JM_annot_border(fz_context *ctx, pdf_obj *annot_obj) +{ + PyObject *res = PyDict_New(); + PyObject *dash_py = PyList_New(0); + PyObject *val; + int i; + const char *style = NULL; + float width = -1.0f; + int clouds = -1; + pdf_obj *obj = NULL; + + obj = pdf_dict_get(ctx, annot_obj, PDF_NAME(Border)); + if (pdf_is_array(ctx, obj)) { + width = pdf_to_real(ctx, pdf_array_get(ctx, obj, 2)); + if (pdf_array_len(ctx, obj) == 4) { + pdf_obj *dash = pdf_array_get(ctx, obj, 
3); + for (i = 0; i < pdf_array_len(ctx, dash); i++) { + val = Py_BuildValue("i", pdf_to_int(ctx, pdf_array_get(ctx, dash, i))); + LIST_APPEND_DROP(dash_py, val); + } + } + } + + pdf_obj *bs_o = pdf_dict_get(ctx, annot_obj, PDF_NAME(BS)); + if (bs_o) { + width = pdf_to_real(ctx, pdf_dict_get(ctx, bs_o, PDF_NAME(W))); + style = pdf_to_name(ctx, pdf_dict_get(ctx, bs_o, PDF_NAME(S))); + if (style && strcmp(style, "") == 0) { + style = NULL; + } + obj = pdf_dict_get(ctx, bs_o, PDF_NAME(D)); + if (obj) { + for (i = 0; i < pdf_array_len(ctx, obj); i++) { + val = Py_BuildValue("i", pdf_to_int(ctx, pdf_array_get(ctx, obj, i))); + LIST_APPEND_DROP(dash_py, val); + } + } + } + + obj = pdf_dict_get(ctx, annot_obj, PDF_NAME(BE)); + if (obj) { + clouds = pdf_to_int(ctx, pdf_dict_get(ctx, obj, PDF_NAME(I))); + } + val = PySequence_Tuple(dash_py); + Py_CLEAR(dash_py); + DICT_SETITEM_DROP(res, dictkey_width, Py_BuildValue("f", width)); + DICT_SETITEM_DROP(res, dictkey_dashes, val); + DICT_SETITEM_DROP(res, dictkey_style, Py_BuildValue("s", style)); + DICT_SETITEMSTR_DROP(res, "clouds", Py_BuildValue("i", clouds)); + return res; +} + +PyObject *JM_annot_set_border(fz_context *ctx, PyObject *border, pdf_document *doc, pdf_obj *annot_obj) +{ + if (!PyDict_Check(border)) { + JM_Warning("arg must be a dict"); + Py_RETURN_NONE; // not a dict + } + pdf_obj *obj = NULL; + Py_ssize_t i = 0, dashlen = 0; + int d; + double nwidth = PyFloat_AsDouble(PyDict_GetItem(border, dictkey_width)); // new width + PyObject *ndashes = PyDict_GetItem(border, dictkey_dashes); // new dashes + PyObject *nstyle = PyDict_GetItem(border, dictkey_style); // new style + int nclouds = (int) PyLong_AsLong(PyDict_GetItemString(border, "clouds")); // new clouds value + + // get old border properties + PyObject *oborder = JM_annot_border(ctx, annot_obj); + + // delete border-related entries + pdf_dict_del(ctx, annot_obj, PDF_NAME(BS)); + pdf_dict_del(ctx, annot_obj, PDF_NAME(BE)); + pdf_dict_del(ctx, annot_obj, PDF_NAME(Border)); + + // populate border items: keep old values for any omitted new ones + if (nwidth < 0) nwidth = PyFloat_AsDouble(PyDict_GetItem(oborder, dictkey_width)); // no new width: keep current + if (ndashes == Py_None) ndashes = PyDict_GetItem(oborder, dictkey_dashes); // no new dashes: keep old + if (nstyle == Py_None) nstyle = PyDict_GetItem(oborder, dictkey_style); // no new style: keep old + if (nclouds < 0) nclouds = (int) PyLong_AsLong(PyDict_GetItemString(oborder, "clouds")); // no new clouds: keep old + + if (ndashes && PyTuple_Check(ndashes) && PyTuple_Size(ndashes) > 0) { + dashlen = PyTuple_Size(ndashes); + pdf_obj *darr = pdf_new_array(ctx, doc, dashlen); + for (i = 0; i < dashlen; i++) { + d = (int) PyLong_AsLong(PyTuple_GetItem(ndashes, i)); + pdf_array_push_int(ctx, darr, (int64_t) d); + } + pdf_dict_putl_drop(ctx, annot_obj, darr, PDF_NAME(BS), PDF_NAME(D), NULL); + } + + pdf_dict_putl_drop(ctx, annot_obj, pdf_new_real(ctx, (float) nwidth), + PDF_NAME(BS), PDF_NAME(W), NULL); + + if (dashlen == 0) { + obj = JM_get_border_style(ctx, nstyle); + } else { + obj = PDF_NAME(D); + } + pdf_dict_putl_drop(ctx, annot_obj, obj, PDF_NAME(BS), PDF_NAME(S), NULL); + + if (nclouds > 0) { + pdf_dict_put_dict(ctx, annot_obj, PDF_NAME(BE), 2); + pdf_obj *obj = pdf_dict_get(ctx, annot_obj, PDF_NAME(BE)); + pdf_dict_put(ctx, obj, PDF_NAME(S), PDF_NAME(C)); + pdf_dict_put_int(ctx, obj, PDF_NAME(I), (int64_t) nclouds); + } + + PyErr_Clear(); + Py_RETURN_NONE; +} + +PyObject *JM_annot_colors(fz_context *ctx, pdf_obj *annot_obj) +{ 
+ PyObject *res = PyDict_New(); + PyObject *color = NULL; + int i, n; + float col; + pdf_obj *o = NULL; + + o = pdf_dict_get(ctx, annot_obj, PDF_NAME(C)); + if (pdf_is_array(ctx, o)) { + n = pdf_array_len(ctx, o); + color = PyTuple_New((Py_ssize_t) n); + for (i = 0; i < n; i++) { + col = pdf_to_real(ctx, pdf_array_get(ctx, o, i)); + PyTuple_SET_ITEM(color, i, Py_BuildValue("f", col)); + } + DICT_SETITEM_DROP(res, dictkey_stroke, color); + } else { + DICT_SETITEM_DROP(res, dictkey_stroke, Py_BuildValue("s", NULL)); + } + + o = pdf_dict_get(ctx, annot_obj, PDF_NAME(IC)); + if (pdf_is_array(ctx, o)) { + n = pdf_array_len(ctx, o); + color = PyTuple_New((Py_ssize_t) n); + for (i = 0; i < n; i++) { + col = pdf_to_real(ctx, pdf_array_get(ctx, o, i)); + PyTuple_SET_ITEM(color, i, Py_BuildValue("f", col)); + } + DICT_SETITEM_DROP(res, dictkey_fill, color); + } else { + DICT_SETITEM_DROP(res, dictkey_fill, Py_BuildValue("s", NULL)); + } + + return res; +} + + +//------------------------------------------------------------------------ +// Return the first annotation whose /IRT key ("In Response To") points to +// annot. Used to remove the response chain of a given annotation. +//------------------------------------------------------------------------ +pdf_annot *JM_find_annot_irt(fz_context *ctx, pdf_annot *annot) +{ + pdf_annot *irt_annot = NULL; // returning this + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + pdf_obj *o = NULL; + int found = 0; + fz_try(ctx) { // loop thru MuPDF's internal annots array + pdf_page *page = pdf_annot_page(ctx, annot); + irt_annot = pdf_first_annot(ctx, page); + while (irt_annot) { + pdf_obj *irt_annot_obj = pdf_annot_obj(ctx, irt_annot); + o = pdf_dict_gets(ctx, irt_annot_obj, "IRT"); + if (o) { + if (!pdf_objcmp(ctx, o, annot_obj)) { + found = 1; + break; + } + } + irt_annot = pdf_next_annot(ctx, irt_annot); + } + } + fz_catch(ctx) {;} + if (found) return pdf_keep_annot(ctx, irt_annot); + return NULL; +} + +//------------------------------------------------------------------------ +// return the annotation names (list of /NM entries) +//------------------------------------------------------------------------ +PyObject *JM_get_annot_id_list(fz_context *ctx, pdf_page *page) +{ + PyObject *names = PyList_New(0); + pdf_obj *annot_obj = NULL; + pdf_obj *annots = pdf_dict_get(ctx, page->obj, PDF_NAME(Annots)); + pdf_obj *name = NULL; + if (!annots) return names; + fz_try(ctx) { + int i, n = pdf_array_len(ctx, annots); + for (i = 0; i < n; i++) { + annot_obj = pdf_array_get(ctx, annots, i); + name = pdf_dict_gets(ctx, annot_obj, "NM"); + if (name) { + LIST_APPEND_DROP(names, Py_BuildValue("s", pdf_to_text_string(ctx, name))); + } + } + } + fz_catch(ctx) { + return names; + } + return names; +} + + +//------------------------------------------------------------------------ +// return the xrefs and /NM ids of a page's annots, links and fields +//------------------------------------------------------------------------ +PyObject *JM_get_annot_xref_list(fz_context *ctx, pdf_obj *page_obj) +{ + PyObject *names = PyList_New(0); + pdf_obj *id, *subtype, *annots, *annot_obj; + int xref, type, i, n; + fz_try(ctx) { + annots = pdf_dict_get(ctx, page_obj, PDF_NAME(Annots)); + n = pdf_array_len(ctx, annots); + for (i = 0; i < n; i++) { + annot_obj = pdf_array_get(ctx, annots, i); + xref = pdf_to_num(ctx, annot_obj); + subtype = pdf_dict_get(ctx, annot_obj, PDF_NAME(Subtype)); + if (!subtype) { + continue; // subtype is required + } + type = pdf_annot_type_from_string(ctx, 
pdf_to_name(ctx, subtype)); + if (type == PDF_ANNOT_UNKNOWN) { + continue; // only accept valid annot types + } + id = pdf_dict_gets(ctx, annot_obj, "NM"); + LIST_APPEND_DROP(names, Py_BuildValue("iis", xref, type, pdf_to_text_string(ctx, id))); + } + } + fz_catch(ctx) { + return names; + } + return names; +} + + +//------------------------------------------------------------------------ +// Add a unique /NM key to an annotation or widget. +// Append a number to 'stem' such that the result is a unique name. +//------------------------------------------------------------------------ +static char JM_annot_id_stem[50] = "fitz"; +void JM_add_annot_id(fz_context *ctx, pdf_annot *annot, char *stem) +{ + fz_try(ctx) { + PyObject *names = NULL; + pdf_page *page = pdf_annot_page(ctx, annot); + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + names = JM_get_annot_id_list(ctx, page); + int i = 0; + PyObject *stem_id = NULL; + while (1) { + stem_id = PyUnicode_FromFormat("%s-%s%d", JM_annot_id_stem, stem, i); + if (!PySequence_Contains(names, stem_id)) break; + i += 1; + Py_DECREF(stem_id); + } + char *response = JM_StrAsChar(stem_id); + pdf_obj *name = pdf_new_string(ctx, (const char *) response, strlen(response)); + pdf_dict_puts_drop(ctx, annot_obj, "NM", name); + Py_CLEAR(stem_id); + Py_CLEAR(names); + page->doc->resynth_required = 0; + } + fz_catch(ctx) { + fz_rethrow(ctx); + } +} + +//------------------------------------------------------------------------ +// retrieve annot by name (/NM key) +//------------------------------------------------------------------------ +pdf_annot *JM_get_annot_by_name(fz_context *ctx, pdf_page *page, char *name) +{ + if (!name || strlen(name) == 0) { + return NULL; + } + pdf_annot *annot = NULL; + int found = 0; + size_t len = 0; + + fz_try(ctx) { // loop thru MuPDF's internal annots and widget arrays + annot = pdf_first_annot(ctx, page); + while (annot) { + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + const char *response = pdf_to_string(ctx, pdf_dict_gets(ctx, annot_obj, "NM"), &len); + if (strcmp(name, response) == 0) { + found = 1; + break; + } + annot = pdf_next_annot(ctx, annot); + } + if (!found) { + fz_throw(ctx, FZ_ERROR_GENERIC, "'%s' is not an annot of this page", name); + } + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return pdf_keep_annot(ctx, annot); +} + +//------------------------------------------------------------------------ +// retrieve annot by its xref +//------------------------------------------------------------------------ +pdf_annot *JM_get_annot_by_xref(fz_context *ctx, pdf_page *page, int xref) +{ + pdf_annot *annot = NULL; + int found = 0; + + fz_try(ctx) { // loop thru MuPDF's internal annots array + annot = pdf_first_annot(ctx, page); + while (annot) { + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + if (xref == pdf_to_num(ctx, annot_obj)) { + found = 1; + break; + } + annot = pdf_next_annot(ctx, annot); + } + if (!found) { + fz_throw(ctx, FZ_ERROR_GENERIC, "xref %d is not an annot of this page", xref); + } + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return pdf_keep_annot(ctx, annot); +} + +//------------------------------------------------------------------------ +// retrieve widget by its xref +//------------------------------------------------------------------------ +pdf_annot *JM_get_widget_by_xref(fz_context *ctx, pdf_page *page, int xref) +{ + pdf_annot *annot = NULL; + int found = 0; + + fz_try(ctx) { // loop thru MuPDF's internal annots array + annot = pdf_first_widget(ctx, page); + while (annot) { + pdf_obj 
*annot_obj = pdf_annot_obj(ctx, annot); + if (xref == pdf_to_num(ctx, annot_obj)) { + found = 1; + break; + } + annot = pdf_next_widget(ctx, annot); + } + if (!found) { + fz_throw(ctx, FZ_ERROR_GENERIC, "xref %d is not a widget of this page", xref); + } + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return pdf_keep_annot(ctx, annot); +} + +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-convert.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-convert.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,98 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. +# ------------------------------------------------------------------------ +*/ +//----------------------------------------------------------------------------- +// Convert any MuPDF document to a PDF +// Returns bytes object containing the PDF, created via 'write' function. +//----------------------------------------------------------------------------- +PyObject *JM_convert_to_pdf(fz_context *ctx, fz_document *doc, int fp, int tp, int rotate) +{ + pdf_document *pdfout = pdf_create_document(ctx); // new PDF document + int i, incr = 1, s = fp, e = tp; + if (fp > tp) { + incr = -1; // count backwards + s = tp; // adjust ... + e = fp; // ... range + } + fz_rect mediabox; + int rot = JM_norm_rotation(rotate); + fz_device *dev = NULL; + fz_buffer *contents = NULL; + pdf_obj *resources = NULL; + fz_page *page=NULL; + fz_var(dev); + fz_var(contents); + fz_var(resources); + fz_var(page); + for (i = fp; INRANGE(i, s, e); i += incr) { // interpret & write document pages as PDF pages + fz_try(ctx) { + page = fz_load_page(ctx, doc, i); + mediabox = fz_bound_page(ctx, page); + dev = pdf_page_write(ctx, pdfout, mediabox, &resources, &contents); + fz_run_page(ctx, page, dev, fz_identity, NULL); + fz_close_device(ctx, dev); + fz_drop_device(ctx, dev); + dev = NULL; + pdf_obj *page_obj = pdf_add_page(ctx, pdfout, mediabox, rot, resources, contents); + pdf_insert_page(ctx, pdfout, -1, page_obj); + pdf_drop_obj(ctx, page_obj); + } + fz_always(ctx) { + pdf_drop_obj(ctx, resources); + fz_drop_buffer(ctx, contents); + fz_drop_device(ctx, dev); + fz_drop_page(ctx, page); + page = NULL; + dev = NULL; + contents = NULL; + resources = NULL; + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + } + // PDF created - now write it to Python bytearray + PyObject *r = NULL; + fz_output *out = NULL; + fz_buffer *res = NULL; + // prepare write options structure + pdf_write_options opts = { 0 }; + opts.do_garbage = 4; + opts.do_compress = 1; + opts.do_compress_images = 1; + opts.do_compress_fonts = 1; + opts.do_sanitize = 1; + opts.do_incremental = 0; + opts.do_ascii = 0; + opts.do_decompress = 0; + opts.do_linear = 0; + opts.do_clean = 1; + opts.do_pretty = 0; + + fz_try(ctx) { + res = fz_new_buffer(ctx, 8192); + out = fz_new_output_with_buffer(ctx, res); + pdf_write_document(ctx, pdfout, out, &opts); + unsigned char *c = NULL; + size_t len = fz_buffer_storage(ctx, res, &c); + r = PyBytes_FromStringAndSize((const char *) c, (Py_ssize_t) len); + } + fz_always(ctx) { + pdf_drop_document(ctx, pdfout); + fz_drop_output(ctx, out); + fz_drop_buffer(ctx, res); 
+ } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return r; +} +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-defines.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-defines.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,817 @@ +%inline %{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. +# ------------------------------------------------------------------------ +*/ +//---------------------------------------------------------------------------- +// general +//---------------------------------------------------------------------------- +#define EPSILON 1e-5 + +//---------------------------------------------------------------------------- +// annotation types +//---------------------------------------------------------------------------- +#define PDF_ANNOT_TEXT 0 +#define PDF_ANNOT_LINK 1 +#define PDF_ANNOT_FREE_TEXT 2 +#define PDF_ANNOT_LINE 3 +#define PDF_ANNOT_SQUARE 4 +#define PDF_ANNOT_CIRCLE 5 +#define PDF_ANNOT_POLYGON 6 +#define PDF_ANNOT_POLY_LINE 7 +#define PDF_ANNOT_HIGHLIGHT 8 +#define PDF_ANNOT_UNDERLINE 9 +#define PDF_ANNOT_SQUIGGLY 10 +#define PDF_ANNOT_STRIKE_OUT 11 +#define PDF_ANNOT_REDACT 12 +#define PDF_ANNOT_STAMP 13 +#define PDF_ANNOT_CARET 14 +#define PDF_ANNOT_INK 15 +#define PDF_ANNOT_POPUP 16 +#define PDF_ANNOT_FILE_ATTACHMENT 17 +#define PDF_ANNOT_SOUND 18 +#define PDF_ANNOT_MOVIE 19 +#define PDF_ANNOT_RICH_MEDIA 20 +#define PDF_ANNOT_WIDGET 21 +#define PDF_ANNOT_SCREEN 22 +#define PDF_ANNOT_PRINTER_MARK 23 +#define PDF_ANNOT_TRAP_NET 24 +#define PDF_ANNOT_WATERMARK 25 +#define PDF_ANNOT_3D 26 +#define PDF_ANNOT_PROJECTION 27 +#define PDF_ANNOT_UNKNOWN -1 + +//------------------------ +// redaction annot options +//------------------------ +#define PDF_REDACT_IMAGE_NONE 0 +#define PDF_REDACT_IMAGE_REMOVE 1 +#define PDF_REDACT_IMAGE_PIXELS 2 + +//---------------------------------------------------------------------------- +// annotation flag bits +//---------------------------------------------------------------------------- +#define PDF_ANNOT_IS_INVISIBLE 1 << (1-1) +#define PDF_ANNOT_IS_HIDDEN 1 << (2-1) +#define PDF_ANNOT_IS_PRINT 1 << (3-1) +#define PDF_ANNOT_IS_NO_ZOOM 1 << (4-1) +#define PDF_ANNOT_IS_NO_ROTATE 1 << (5-1) +#define PDF_ANNOT_IS_NO_VIEW 1 << (6-1) +#define PDF_ANNOT_IS_READ_ONLY 1 << (7-1) +#define PDF_ANNOT_IS_LOCKED 1 << (8-1) +#define PDF_ANNOT_IS_TOGGLE_NO_VIEW 1 << (9-1) +#define PDF_ANNOT_IS_LOCKED_CONTENTS 1 << (10-1) + + +//---------------------------------------------------------------------------- +// annotation line ending styles +//---------------------------------------------------------------------------- +#define PDF_ANNOT_LE_NONE 0 +#define PDF_ANNOT_LE_SQUARE 1 +#define PDF_ANNOT_LE_CIRCLE 2 +#define PDF_ANNOT_LE_DIAMOND 3 +#define PDF_ANNOT_LE_OPEN_ARROW 4 +#define PDF_ANNOT_LE_CLOSED_ARROW 5 +#define PDF_ANNOT_LE_BUTT 6 +#define PDF_ANNOT_LE_R_OPEN_ARROW 7 +#define PDF_ANNOT_LE_R_CLOSED_ARROW 8 +#define PDF_ANNOT_LE_SLASH 9 + + +//---------------------------------------------------------------------------- +// annotation field (widget) types 
+//---------------------------------------------------------------------------- +#define PDF_WIDGET_TYPE_UNKNOWN 0 +#define PDF_WIDGET_TYPE_BUTTON 1 +#define PDF_WIDGET_TYPE_CHECKBOX 2 +#define PDF_WIDGET_TYPE_COMBOBOX 3 +#define PDF_WIDGET_TYPE_LISTBOX 4 +#define PDF_WIDGET_TYPE_RADIOBUTTON 5 +#define PDF_WIDGET_TYPE_SIGNATURE 6 +#define PDF_WIDGET_TYPE_TEXT 7 + + +//---------------------------------------------------------------------------- +// annotation text widget subtypes +//---------------------------------------------------------------------------- +#define PDF_WIDGET_TX_FORMAT_NONE 0 +#define PDF_WIDGET_TX_FORMAT_NUMBER 1 +#define PDF_WIDGET_TX_FORMAT_SPECIAL 2 +#define PDF_WIDGET_TX_FORMAT_DATE 3 +#define PDF_WIDGET_TX_FORMAT_TIME 4 + + +//---------------------------------------------------------------------------- +// annotation widget flags +//---------------------------------------------------------------------------- +// Common to all field types +#define PDF_FIELD_IS_READ_ONLY 1 +#define PDF_FIELD_IS_REQUIRED 1 << 1 +#define PDF_FIELD_IS_NO_EXPORT 1 << 2 + + +// Text fields +#define PDF_TX_FIELD_IS_MULTILINE 1 << 12 +#define PDF_TX_FIELD_IS_PASSWORD 1 << 13 +#define PDF_TX_FIELD_IS_FILE_SELECT 1 << 20 +#define PDF_TX_FIELD_IS_DO_NOT_SPELL_CHECK 1 << 22 +#define PDF_TX_FIELD_IS_DO_NOT_SCROLL 1 << 23 +#define PDF_TX_FIELD_IS_COMB 1 << 24 +#define PDF_TX_FIELD_IS_RICH_TEXT 1 << 25 + + +// Button fields +#define PDF_BTN_FIELD_IS_NO_TOGGLE_TO_OFF 1 << 14 +#define PDF_BTN_FIELD_IS_RADIO 1 << 15 +#define PDF_BTN_FIELD_IS_PUSHBUTTON 1 << 16 +#define PDF_BTN_FIELD_IS_RADIOS_IN_UNISON 1 << 25 + + +// Choice fields +#define PDF_CH_FIELD_IS_COMBO 1 << 17 +#define PDF_CH_FIELD_IS_EDIT 1 << 18 +#define PDF_CH_FIELD_IS_SORT 1 << 19 +#define PDF_CH_FIELD_IS_MULTI_SELECT 1 << 21 +#define PDF_CH_FIELD_IS_DO_NOT_SPELL_CHECK 1 << 22 +#define PDF_CH_FIELD_IS_COMMIT_ON_SEL_CHANGE 1 << 25 + + +// Signature fields errors +#define PDF_SIGNATURE_ERROR_OKAY 0 +#define PDF_SIGNATURE_ERROR_NO_SIGNATURES 1 +#define PDF_SIGNATURE_ERROR_NO_CERTIFICATE 2 +#define PDF_SIGNATURE_ERROR_DIGEST_FAILURE 3 +#define PDF_SIGNATURE_ERROR_SELF_SIGNED 4 +#define PDF_SIGNATURE_ERROR_SELF_SIGNED_IN_CHAIN 5 +#define PDF_SIGNATURE_ERROR_NOT_TRUSTED 6 +#define PDF_SIGNATURE_ERROR_UNKNOWN 7 + +// Signature appearances + +#define PDF_SIGNATURE_SHOW_LABELS 1 +#define PDF_SIGNATURE_SHOW_DN 2 +#define PDF_SIGNATURE_SHOW_DATE 4 +#define PDF_SIGNATURE_SHOW_TEXT_NAME 8 +#define PDF_SIGNATURE_SHOW_GRAPHIC_NAME 16 +#define PDF_SIGNATURE_SHOW_LOGO 32 +#define PDF_SIGNATURE_DEFAULT_APPEARANCE ( \ + PDF_SIGNATURE_SHOW_LABELS | \ + PDF_SIGNATURE_SHOW_DN | \ + PDF_SIGNATURE_SHOW_DATE | \ + PDF_SIGNATURE_SHOW_TEXT_NAME | \ + PDF_SIGNATURE_SHOW_GRAPHIC_NAME | \ + PDF_SIGNATURE_SHOW_LOGO ) + +//---------------------------------------------------------------------------- +// colorspace identifiers +//---------------------------------------------------------------------------- +#define CS_RGB 1 +#define CS_GRAY 2 +#define CS_CMYK 3 + +//---------------------------------------------------------------------------- +// PDF encryption algorithms +//---------------------------------------------------------------------------- +#define PDF_ENCRYPT_KEEP 0 +#define PDF_ENCRYPT_NONE 1 +#define PDF_ENCRYPT_RC4_40 2 +#define PDF_ENCRYPT_RC4_128 3 +#define PDF_ENCRYPT_AES_128 4 +#define PDF_ENCRYPT_AES_256 5 +#define PDF_ENCRYPT_UNKNOWN 6 + +//---------------------------------------------------------------------------- +// PDF permission codes 
+//---------------------------------------------------------------------------- +#define PDF_PERM_PRINT 1 << 2 +#define PDF_PERM_MODIFY 1 << 3 +#define PDF_PERM_COPY 1 << 4 +#define PDF_PERM_ANNOTATE 1 << 5 +#define PDF_PERM_FORM 1 << 8 +#define PDF_PERM_ACCESSIBILITY 1 << 9 +#define PDF_PERM_ASSEMBLE 1 << 10 +#define PDF_PERM_PRINT_HQ 1 << 11 + +//---------------------------------------------------------------------------- +// PDF Blend Modes +//---------------------------------------------------------------------------- +#define PDF_BM_Color "Color" +#define PDF_BM_ColorBurn "ColorBurn" +#define PDF_BM_ColorDodge "ColorDodge" +#define PDF_BM_Darken "Darken" +#define PDF_BM_Difference "Difference" +#define PDF_BM_Exclusion "Exclusion" +#define PDF_BM_HardLight "HardLight" +#define PDF_BM_Hue "Hue" +#define PDF_BM_Lighten "Lighten" +#define PDF_BM_Luminosity "Luminosity" +#define PDF_BM_Multiply "Multiply" +#define PDF_BM_Normal "Normal" +#define PDF_BM_Overlay "Overlay" +#define PDF_BM_Saturation "Saturation" +#define PDF_BM_Screen "Screen" +#define PDF_BM_SoftLight "Softlight" + + +// General text flags +#define TEXT_FONT_SUPERSCRIPT 1 +#define TEXT_FONT_ITALIC 2 +#define TEXT_FONT_SERIFED 4 +#define TEXT_FONT_MONOSPACED 8 +#define TEXT_FONT_BOLD 16 + +// UCDN Script codes +#define UCDN_SCRIPT_COMMON 0 +#define UCDN_SCRIPT_LATIN 1 +#define UCDN_SCRIPT_GREEK 2 +#define UCDN_SCRIPT_CYRILLIC 3 +#define UCDN_SCRIPT_ARMENIAN 4 +#define UCDN_SCRIPT_HEBREW 5 +#define UCDN_SCRIPT_ARABIC 6 +#define UCDN_SCRIPT_SYRIAC 7 +#define UCDN_SCRIPT_THAANA 8 +#define UCDN_SCRIPT_DEVANAGARI 9 +#define UCDN_SCRIPT_BENGALI 10 +#define UCDN_SCRIPT_GURMUKHI 11 +#define UCDN_SCRIPT_GUJARATI 12 +#define UCDN_SCRIPT_ORIYA 13 +#define UCDN_SCRIPT_TAMIL 14 +#define UCDN_SCRIPT_TELUGU 15 +#define UCDN_SCRIPT_KANNADA 16 +#define UCDN_SCRIPT_MALAYALAM 17 +#define UCDN_SCRIPT_SINHALA 18 +#define UCDN_SCRIPT_THAI 19 +#define UCDN_SCRIPT_LAO 20 +#define UCDN_SCRIPT_TIBETAN 21 +#define UCDN_SCRIPT_MYANMAR 22 +#define UCDN_SCRIPT_GEORGIAN 23 +#define UCDN_SCRIPT_HANGUL 24 +#define UCDN_SCRIPT_ETHIOPIC 25 +#define UCDN_SCRIPT_CHEROKEE 26 +#define UCDN_SCRIPT_CANADIAN_ABORIGINAL 27 +#define UCDN_SCRIPT_OGHAM 28 +#define UCDN_SCRIPT_RUNIC 29 +#define UCDN_SCRIPT_KHMER 30 +#define UCDN_SCRIPT_MONGOLIAN 31 +#define UCDN_SCRIPT_HIRAGANA 32 +#define UCDN_SCRIPT_KATAKANA 33 +#define UCDN_SCRIPT_BOPOMOFO 34 +#define UCDN_SCRIPT_HAN 35 +#define UCDN_SCRIPT_YI 36 +#define UCDN_SCRIPT_OLD_ITALIC 37 +#define UCDN_SCRIPT_GOTHIC 38 +#define UCDN_SCRIPT_DESERET 39 +#define UCDN_SCRIPT_INHERITED 40 +#define UCDN_SCRIPT_TAGALOG 41 +#define UCDN_SCRIPT_HANUNOO 42 +#define UCDN_SCRIPT_BUHID 43 +#define UCDN_SCRIPT_TAGBANWA 44 +#define UCDN_SCRIPT_LIMBU 45 +#define UCDN_SCRIPT_TAI_LE 46 +#define UCDN_SCRIPT_LINEAR_B 47 +#define UCDN_SCRIPT_UGARITIC 48 +#define UCDN_SCRIPT_SHAVIAN 49 +#define UCDN_SCRIPT_OSMANYA 50 +#define UCDN_SCRIPT_CYPRIOT 51 +#define UCDN_SCRIPT_BRAILLE 52 +#define UCDN_SCRIPT_BUGINESE 53 +#define UCDN_SCRIPT_COPTIC 54 +#define UCDN_SCRIPT_NEW_TAI_LUE 55 +#define UCDN_SCRIPT_GLAGOLITIC 56 +#define UCDN_SCRIPT_TIFINAGH 57 +#define UCDN_SCRIPT_SYLOTI_NAGRI 58 +#define UCDN_SCRIPT_OLD_PERSIAN 59 +#define UCDN_SCRIPT_KHAROSHTHI 60 +#define UCDN_SCRIPT_BALINESE 61 +#define UCDN_SCRIPT_CUNEIFORM 62 +#define UCDN_SCRIPT_PHOENICIAN 63 +#define UCDN_SCRIPT_PHAGS_PA 64 +#define UCDN_SCRIPT_NKO 65 +#define UCDN_SCRIPT_SUNDANESE 66 +#define UCDN_SCRIPT_LEPCHA 67 +#define UCDN_SCRIPT_OL_CHIKI 68 +#define UCDN_SCRIPT_VAI 69 +#define 
UCDN_SCRIPT_SAURASHTRA 70 +#define UCDN_SCRIPT_KAYAH_LI 71 +#define UCDN_SCRIPT_REJANG 72 +#define UCDN_SCRIPT_LYCIAN 73 +#define UCDN_SCRIPT_CARIAN 74 +#define UCDN_SCRIPT_LYDIAN 75 +#define UCDN_SCRIPT_CHAM 76 +#define UCDN_SCRIPT_TAI_THAM 77 +#define UCDN_SCRIPT_TAI_VIET 78 +#define UCDN_SCRIPT_AVESTAN 79 +#define UCDN_SCRIPT_EGYPTIAN_HIEROGLYPHS 80 +#define UCDN_SCRIPT_SAMARITAN 81 +#define UCDN_SCRIPT_LISU 82 +#define UCDN_SCRIPT_BAMUM 83 +#define UCDN_SCRIPT_JAVANESE 84 +#define UCDN_SCRIPT_MEETEI_MAYEK 85 +#define UCDN_SCRIPT_IMPERIAL_ARAMAIC 86 +#define UCDN_SCRIPT_OLD_SOUTH_ARABIAN 87 +#define UCDN_SCRIPT_INSCRIPTIONAL_PARTHIAN 88 +#define UCDN_SCRIPT_INSCRIPTIONAL_PAHLAVI 89 +#define UCDN_SCRIPT_OLD_TURKIC 90 +#define UCDN_SCRIPT_KAITHI 91 +#define UCDN_SCRIPT_BATAK 92 +#define UCDN_SCRIPT_BRAHMI 93 +#define UCDN_SCRIPT_MANDAIC 94 +#define UCDN_SCRIPT_CHAKMA 95 +#define UCDN_SCRIPT_MEROITIC_CURSIVE 96 +#define UCDN_SCRIPT_MEROITIC_HIEROGLYPHS 97 +#define UCDN_SCRIPT_MIAO 98 +#define UCDN_SCRIPT_SHARADA 99 +#define UCDN_SCRIPT_SORA_SOMPENG 100 +#define UCDN_SCRIPT_TAKRI 101 +#define UCDN_SCRIPT_UNKNOWN 102 +#define UCDN_SCRIPT_BASSA_VAH 103 +#define UCDN_SCRIPT_CAUCASIAN_ALBANIAN 104 +#define UCDN_SCRIPT_DUPLOYAN 105 +#define UCDN_SCRIPT_ELBASAN 106 +#define UCDN_SCRIPT_GRANTHA 107 +#define UCDN_SCRIPT_KHOJKI 108 +#define UCDN_SCRIPT_KHUDAWADI 109 +#define UCDN_SCRIPT_LINEAR_A 110 +#define UCDN_SCRIPT_MAHAJANI 111 +#define UCDN_SCRIPT_MANICHAEAN 112 +#define UCDN_SCRIPT_MENDE_KIKAKUI 113 +#define UCDN_SCRIPT_MODI 114 +#define UCDN_SCRIPT_MRO 115 +#define UCDN_SCRIPT_NABATAEAN 116 +#define UCDN_SCRIPT_OLD_NORTH_ARABIAN 117 +#define UCDN_SCRIPT_OLD_PERMIC 118 +#define UCDN_SCRIPT_PAHAWH_HMONG 119 +#define UCDN_SCRIPT_PALMYRENE 120 +#define UCDN_SCRIPT_PAU_CIN_HAU 121 +#define UCDN_SCRIPT_PSALTER_PAHLAVI 122 +#define UCDN_SCRIPT_SIDDHAM 123 +#define UCDN_SCRIPT_TIRHUTA 124 +#define UCDN_SCRIPT_WARANG_CITI 125 +#define UCDN_SCRIPT_AHOM 126 +#define UCDN_SCRIPT_ANATOLIAN_HIEROGLYPHS 127 +#define UCDN_SCRIPT_HATRAN 128 +#define UCDN_SCRIPT_MULTANI 129 +#define UCDN_SCRIPT_OLD_HUNGARIAN 130 +#define UCDN_SCRIPT_SIGNWRITING 131 +#define UCDN_SCRIPT_ADLAM 132 +#define UCDN_SCRIPT_BHAIKSUKI 133 +#define UCDN_SCRIPT_MARCHEN 134 +#define UCDN_SCRIPT_NEWA 135 +#define UCDN_SCRIPT_OSAGE 136 +#define UCDN_SCRIPT_TANGUT 137 +#define UCDN_SCRIPT_MASARAM_GONDI 138 +#define UCDN_SCRIPT_NUSHU 139 +#define UCDN_SCRIPT_SOYOMBO 140 +#define UCDN_SCRIPT_ZANABAZAR_SQUARE 141 +#define UCDN_SCRIPT_DOGRA 142 +#define UCDN_SCRIPT_GUNJALA_GONDI 143 +#define UCDN_SCRIPT_HANIFI_ROHINGYA 144 +#define UCDN_SCRIPT_MAKASAR 145 +#define UCDN_SCRIPT_MEDEFAIDRIN 146 +#define UCDN_SCRIPT_OLD_SOGDIAN 147 +#define UCDN_SCRIPT_SOGDIAN 148 +#define UCDN_SCRIPT_ELYMAIC 149 +#define UCDN_SCRIPT_NANDINAGARI 150 +#define UCDN_SCRIPT_NYIAKENG_PUACHUE_HMONG 151 +#define UCDN_SCRIPT_WANCHO 152 + + +// exceptions +PyObject *_set_FileDataError(PyObject *value) +{ + if (!value) { + Py_RETURN_FALSE; + } + JM_Exc_FileDataError = value; + Py_RETURN_TRUE; +} + +//------------------------------------------------------------------- +// minor tools +//------------------------------------------------------------------- +PyObject *util_sine_between(PyObject *C, PyObject *P, PyObject *Q) +{ + // for points C, P, Q compute the sine between lines CP and QP + fz_point c = JM_point_from_py(C); + fz_point p = JM_point_from_py(P); + fz_point q = JM_point_from_py(Q); + fz_point s = JM_normalize_vector(q.x - p.x, q.y - p.y); + fz_matrix m1 = 
fz_make_matrix(1, 0, 0, 1, -p.x, -p.y); + fz_matrix m2 = fz_make_matrix(s.x, -s.y, s.y, s.x, 0, 0); + m1 = fz_concat(m1, m2); + c = fz_transform_point(c, m1); + c = JM_normalize_vector(c.x, c.y); + return Py_BuildValue("f", c.y); +} + + +// Return the matrix that maps two points C, P to the x-axis such that +// C -> (0,0) and the image of P have the same distance. +PyObject *util_hor_matrix(PyObject *C, PyObject *P) +{ + fz_point c = JM_point_from_py(C); + fz_point p = JM_point_from_py(P); + + // compute (cosine, sine) of vector P-C with double precision: + fz_point s = JM_normalize_vector(p.x - c.x, p.y - c.y); + + fz_matrix m1 = fz_make_matrix(1, 0, 0, 1, -c.x, -c.y); + fz_matrix m2 = fz_make_matrix(s.x, -s.y, s.y, s.x, 0, 0); + return JM_py_from_matrix(fz_concat(m1, m2)); +} + +struct Annot; + +// Ensure that widgets with /AA/C JavaScript are in array AcroForm/CO +struct Annot; +PyObject *util_ensure_widget_calc(struct Annot *annot) +{ + pdf_obj *PDFNAME_CO=NULL; + fz_try(gctx) { + pdf_obj *annot_obj = pdf_annot_obj(gctx, (pdf_annot *) annot); + pdf_document *pdf = pdf_get_bound_document(gctx, annot_obj); + PDFNAME_CO = pdf_new_name(gctx, "CO"); // = PDF_NAME(CO) + pdf_obj *acro = pdf_dict_getl(gctx, // get AcroForm dict + pdf_trailer(gctx, pdf), + PDF_NAME(Root), + PDF_NAME(AcroForm), + NULL); + + pdf_obj *CO = pdf_dict_get(gctx, acro, PDFNAME_CO); // = AcroForm/CO + if (!CO) { + CO = pdf_dict_put_array(gctx, acro, PDFNAME_CO, 2); + } + int i, n = pdf_array_len(gctx, CO); + int xref, nxref, found = 0; + xref = pdf_to_num(gctx, annot_obj); + for (i = 0; i < n; i++) { + nxref = pdf_to_num(gctx, pdf_array_get(gctx, CO, i)); + if (xref == nxref) { + found = 1; + break; + } + } + if (!found) { + pdf_array_push_drop(gctx, CO, pdf_new_indirect(gctx, pdf, xref, 0)); + } + } + fz_always(gctx) { + pdf_drop_obj(gctx, PDFNAME_CO); + } + fz_catch(gctx) { + PyErr_SetString(PyExc_RuntimeError, fz_caught_message(gctx)); + return NULL; + } + Py_RETURN_NONE; +} + + +//----------------------------------------------------------- +// Compute Rect coordinates using different alternatives +//----------------------------------------------------------- +PyObject *util_make_rect(PyObject *a) +{ + Py_ssize_t i, n = PyTuple_GET_SIZE(a); + PyObject *p1, *p2, *l = a; + char *msg = "Rect: bad args"; + double c[4] = { 0, 0, 0, 0 }; + switch (n) { + case 0: goto exit_normal; + case 1: goto size1; + case 2: goto size2; + case 3: goto size31; + case 4: goto size4; + default: + msg = "Rect: bad seq len"; + goto exit_error; + } + + size4:; + for (i = 0; i < 4; i++) { + if (JM_FLOAT_ITEM(l, i, &c[i]) == 1) { + goto exit_error; + } + } + goto exit_normal; + + size1:; + l = PyTuple_GET_ITEM(a, 0); + if (!PySequence_Check(l) || PySequence_Size(l) != 4) { + msg = "Rect: bad seq len"; + goto exit_error; + } + goto size4; + + size2:; + msg = "Rect: bad args"; + p1 = PyTuple_GET_ITEM(a, 0); + p2 = PyTuple_GET_ITEM(a, 1); + if (!PySequence_Check(p1) || PySequence_Size(p1) != 2) { + goto exit_error; + } + if (!PySequence_Check(p2) || PySequence_Size(p2) != 2) { + goto exit_error; + } + if (JM_FLOAT_ITEM(p1, 0, &c[0]) == 1) goto exit_error; + if (JM_FLOAT_ITEM(p1, 1, &c[1]) == 1) goto exit_error; + if (JM_FLOAT_ITEM(p2, 0, &c[2]) == 1) goto exit_error; + if (JM_FLOAT_ITEM(p2, 1, &c[3]) == 1) goto exit_error; + goto exit_normal; + + size31:; + p1 = PyTuple_GET_ITEM(a, 0); + if (PySequence_Check(p1)) goto size32; + if (JM_FLOAT_ITEM(a, 0, &c[0]) == 1) goto exit_error; + if (JM_FLOAT_ITEM(a, 1, &c[1]) == 1) goto exit_error; + p2 = 
PyTuple_GET_ITEM(a, 2); + if (!PySequence_Check(p2) || PySequence_Size(p2) != 2) { + goto exit_error; + } + if (JM_FLOAT_ITEM(p2, 0, &c[2]) == 1) goto exit_error; + if (JM_FLOAT_ITEM(p2, 1, &c[3]) == 1) goto exit_error; + goto exit_normal; + + size32:; + if (PySequence_Size(p1) != 2) goto exit_error; + if (JM_FLOAT_ITEM(p1, 0, &c[0]) == 1) goto exit_error; + if (JM_FLOAT_ITEM(p1, 1, &c[1]) == 1) goto exit_error; + if (JM_FLOAT_ITEM(a, 1, &c[2]) == 1) goto exit_error; + if (JM_FLOAT_ITEM(a, 2, &c[3]) == 1) goto exit_error; + goto exit_normal; + + exit_normal:; + for (i = 0; i < 4; i++) { + if (c[i] < FZ_MIN_INF_RECT) c[i] = FZ_MIN_INF_RECT; + if (c[i] > FZ_MAX_INF_RECT) c[i] = FZ_MAX_INF_RECT; + } + return Py_BuildValue("dddd", c[0], c[1], c[2], c[3]); + + exit_error:; + PyErr_SetString(PyExc_ValueError, msg); + return NULL; +} + + +//----------------------------------------------------------- +// Compute IRect coordinates using different alternatives +//----------------------------------------------------------- +PyObject *util_make_irect(PyObject *a) +{ + Py_ssize_t i, n = PyTuple_GET_SIZE(a); + PyObject *p1, *p2, *l = a; + char *msg = "IRect: bad args"; + int c[4] = { 0, 0, 0, 0 }; + switch (n) { + case 0: goto exit_normal; + case 1: goto size1; + case 2: goto size2; + case 3: goto size31; + case 4: goto size4; + default: + msg = "IRect: bad seq len"; + goto exit_error; + } + + size4:; + for (i = 0; i < 4; i++) { + if (JM_INT_ITEM(l, i, &c[i]) == 1) { + goto exit_error; + } + } + goto exit_normal; + + size1:; + l = PyTuple_GET_ITEM(a, 0); + if (!PySequence_Check(l) || PySequence_Size(l) != 4) { + msg = "IRect: bad seq len"; + goto exit_error; + } + goto size4; + + size2:; + p1 = PyTuple_GET_ITEM(a, 0); + p2 = PyTuple_GET_ITEM(a, 1); + if (!PySequence_Check(p1) || PySequence_Size(p1) != 2) { + goto exit_error; + } + if (!PySequence_Check(p2) || PySequence_Size(p2) != 2) { + goto exit_error; + } + msg = "IRect: bad int values"; + if (JM_INT_ITEM(p1, 0, &c[0]) == 1) goto exit_error; + if (JM_INT_ITEM(p1, 1, &c[1]) == 1) goto exit_error; + if (JM_INT_ITEM(p2, 0, &c[2]) == 1) goto exit_error; + if (JM_INT_ITEM(p2, 1, &c[3]) == 1) goto exit_error; + goto exit_normal; + + size31:; + p1 = PyTuple_GET_ITEM(a, 0); + if (PySequence_Check(p1)) goto size32; + if (JM_INT_ITEM(a, 0, &c[0]) == 1) goto exit_error; + if (JM_INT_ITEM(a, 1, &c[1]) == 1) goto exit_error; + p2 = PyTuple_GET_ITEM(a, 2); + if (!PySequence_Check(p2) || PySequence_Size(p2) != 2) { + goto exit_error; + } + if (JM_INT_ITEM(p2, 0, &c[2]) == 1) goto exit_error; + if (JM_INT_ITEM(p2, 1, &c[3]) == 1) goto exit_error; + goto exit_normal; + + size32:; + if (PySequence_Size(p1) != 2) goto exit_error; + if (JM_INT_ITEM(p1, 0, &c[0]) == 1) goto exit_error; + if (JM_INT_ITEM(p1, 1, &c[1]) == 1) goto exit_error; + if (JM_INT_ITEM(a, 1, &c[2]) == 1) goto exit_error; + if (JM_INT_ITEM(a, 2, &c[3]) == 1) goto exit_error; + goto exit_normal; + + exit_normal:; + for (i = 0; i < 4; i++) { + if (c[i] < FZ_MIN_INF_RECT) c[i] = FZ_MIN_INF_RECT; + if (c[i] > FZ_MAX_INF_RECT) c[i] = FZ_MAX_INF_RECT; + } + return Py_BuildValue("iiii", c[0], c[1], c[2], c[3]); + + exit_error:; + PyErr_SetString(PyExc_ValueError, msg); + return NULL; +} + + +PyObject *util_round_rect(PyObject *rect) +{ + return JM_py_from_irect(fz_round_rect(JM_rect_from_py(rect))); +} + + +PyObject *util_transform_rect(PyObject *rect, PyObject *matrix) +{ + return JM_py_from_rect(fz_transform_rect(JM_rect_from_py(rect), JM_matrix_from_py(matrix))); +} + + +PyObject 
*util_intersect_rect(PyObject *r1, PyObject *r2) +{ + return JM_py_from_rect(fz_intersect_rect(JM_rect_from_py(r1), + JM_rect_from_py(r2))); +} + + +PyObject *util_is_point_in_rect(PyObject *p, PyObject *r) +{ + return JM_BOOL(fz_is_point_inside_rect(JM_point_from_py(p), JM_rect_from_py(r))); +} + + +PyObject *util_include_point_in_rect(PyObject *r, PyObject *p) +{ + return JM_py_from_rect(fz_include_point_in_rect(JM_rect_from_py(r), + JM_point_from_py(p))); +} + + +PyObject *util_point_in_quad(PyObject *P, PyObject *Q) +{ + fz_point p = JM_point_from_py(P); + fz_quad q = JM_quad_from_py(Q); + return JM_BOOL(fz_is_point_inside_quad(p, q)); +} + + +PyObject *util_transform_point(PyObject *point, PyObject *matrix) +{ + return JM_py_from_point(fz_transform_point(JM_point_from_py(point), JM_matrix_from_py(matrix))); +} + + +PyObject *util_union_rect(PyObject *r1, PyObject *r2) +{ + return JM_py_from_rect(fz_union_rect(JM_rect_from_py(r1), + JM_rect_from_py(r2))); +} + + +PyObject *util_concat_matrix(PyObject *m1, PyObject *m2) +{ + return JM_py_from_matrix(fz_concat(JM_matrix_from_py(m1), + JM_matrix_from_py(m2))); +} + + +PyObject *util_invert_matrix(PyObject *matrix) +{ + fz_matrix src = JM_matrix_from_py(matrix); + float a = src.a; + float det = a * src.d - src.b * src.c; + if (det < -FLT_EPSILON || det > FLT_EPSILON) + { + fz_matrix dst; + float rdet = 1 / det; + dst.a = src.d * rdet; + dst.b = -src.b * rdet; + dst.c = -src.c * rdet; + dst.d = a * rdet; + a = -src.e * dst.a - src.f * dst.c; + dst.f = -src.e * dst.b - src.f * dst.d; + dst.e = a; + return Py_BuildValue("iN", 0, JM_py_from_matrix(dst)); + } + return Py_BuildValue("(i, ())", 1); +} + + +PyObject *util_measure_string(const char *text, const char *fontname, double fontsize, int encoding) +{ + double w = 0; + fz_font *font = NULL; + fz_try(gctx) { + font = fz_new_base14_font(gctx, fontname); + while (*text) + { + int c, g; + text += fz_chartorune(&c, text); + switch (encoding) + { + case PDF_SIMPLE_ENCODING_GREEK: + c = fz_iso8859_7_from_unicode(c); break; + case PDF_SIMPLE_ENCODING_CYRILLIC: + c = fz_windows_1251_from_unicode(c); break; + default: + c = fz_windows_1252_from_unicode(c); break; + } + if (c < 0) c = 0xB7; + g = fz_encode_character(gctx, font, c); + w += (double) fz_advance_glyph(gctx, font, g, 0); + } + } + fz_always(gctx) { + fz_drop_font(gctx, font); + } + fz_catch(gctx) { + return PyFloat_FromDouble(0); + } + return PyFloat_FromDouble(w * fontsize); +} + +%} + +%{ +// Global Constants - Python dictionary keys +PyObject *dictkey_align; +PyObject *dictkey_ascender; +PyObject *dictkey_bbox; +PyObject *dictkey_blocks; +PyObject *dictkey_bpc; +PyObject *dictkey_c; +PyObject *dictkey_chars; +PyObject *dictkey_color; +PyObject *dictkey_colorspace; +PyObject *dictkey_content; +PyObject *dictkey_creationDate; +PyObject *dictkey_cs_name; +PyObject *dictkey_da; +PyObject *dictkey_dashes; +PyObject *dictkey_desc; +PyObject *dictkey_descender; +PyObject *dictkey_dir; +PyObject *dictkey_effect; +PyObject *dictkey_ext; +PyObject *dictkey_filename; +PyObject *dictkey_fill; +PyObject *dictkey_flags; +PyObject *dictkey_font; +PyObject *dictkey_glyph; +PyObject *dictkey_height; +PyObject *dictkey_id; +PyObject *dictkey_image; +PyObject *dictkey_items; +PyObject *dictkey_length; +PyObject *dictkey_lines; +PyObject *dictkey_matrix; +PyObject *dictkey_modDate; +PyObject *dictkey_name; +PyObject *dictkey_number; +PyObject *dictkey_origin; +PyObject *dictkey_rect; +PyObject *dictkey_size; +PyObject *dictkey_smask; +PyObject 
*dictkey_spans; +PyObject *dictkey_stroke; +PyObject *dictkey_style; +PyObject *dictkey_subject; +PyObject *dictkey_text; +PyObject *dictkey_title; +PyObject *dictkey_type; +PyObject *dictkey_ufilename; +PyObject *dictkey_width; +PyObject *dictkey_wmode; +PyObject *dictkey_xref; +PyObject *dictkey_xres; +PyObject *dictkey_yres; +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-devices.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-devices.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1049 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. +# ------------------------------------------------------------------------ +*/ +typedef struct +{ + fz_device super; + PyObject *out; + size_t seqno; + long depth; + int clips; + PyObject *method; +} jm_lineart_device; + +static PyObject *dev_pathdict = NULL; +static PyObject *scissors = NULL; +static float dev_linewidth = 0; // border width if present +static fz_matrix trace_device_ptm; // page transformation matrix +static fz_matrix trace_device_ctm; // trace device matrix +static fz_matrix trace_device_rot; +static fz_point dev_lastpoint = {0, 0}; +static fz_point dev_firstpoint = {0, 0}; +static int dev_havemove = 0; +static fz_rect dev_pathrect; +static float dev_pathfactor = 0; +static int dev_linecount = 0; +static char *layer_name=NULL; // optional content name +static int path_type = 0; // one of the following values: +#define FILL_PATH 1 +#define STROKE_PATH 2 +#define CLIP_PATH 3 +#define CLIP_STROKE_PATH 4 + +static void trace_device_reset() +{ + Py_CLEAR(dev_pathdict); + Py_CLEAR(scissors); + layer_name = NULL; + dev_linewidth = 0; + trace_device_ptm = fz_identity; + trace_device_ctm = fz_identity; + trace_device_rot = fz_identity; + dev_lastpoint.x = 0; + dev_lastpoint.y = 0; + dev_firstpoint.x = 0; + dev_firstpoint.y = 0; + dev_pathrect.x0 = 0; + dev_pathrect.y0 = 0; + dev_pathrect.x1 = 0; + dev_pathrect.y1 = 0; + dev_pathfactor = 0; + dev_linecount = 0; + path_type = 0; +} + +// Every scissor of a clip is a sub rectangle of the preceeding clip +// scissor if the clip level is larger. +static fz_rect compute_scissor() +{ + PyObject *last_scissor = NULL; + fz_rect scissor; + if (!scissors) { + scissors = PyList_New(0); + } + Py_ssize_t num_scissors = PyList_Size(scissors); + if (num_scissors > 0) { + last_scissor = PyList_GET_ITEM(scissors, num_scissors-1); + scissor = JM_rect_from_py(last_scissor); + scissor = fz_intersect_rect(scissor, dev_pathrect); + } else { + scissor = dev_pathrect; + } + LIST_APPEND_DROP(scissors, JM_py_from_rect(scissor)); + return scissor; +} + + +static void +jm_increase_seqno(fz_context *ctx, fz_device *dev_, ...) +{ + jm_lineart_device *dev = (jm_lineart_device *) dev_; + dev->seqno += 1; +} + +/* +-------------------------------------------------------------------------- +Check whether the last 4 lines represent a quad. +Because of how we count, the lines are a polyline already, i.e. last point +of a line equals 1st point of next line. +So we check for a polygon (last line's end point equals start point). +If not true we return 0. 
+-------------------------------------------------------------------------- +*/ +static int +jm_checkquad() +{ + PyObject *items = PyDict_GetItem(dev_pathdict, dictkey_items); + Py_ssize_t i, len = PyList_Size(items); + float f[8]; // coordinates of the 4 corners + fz_point temp, lp; // line = (temp, lp) + PyObject *rect; + PyObject *line; + // fill the 8 floats in f, start from items[-4:] + for (i = 0; i < 4; i++) { // store line start points + line = PyList_GET_ITEM(items, len - 4 + i); + temp = JM_point_from_py(PyTuple_GET_ITEM(line, 1)); + f[i * 2] = temp.x; + f[i * 2 + 1] = temp.y; + lp = JM_point_from_py(PyTuple_GET_ITEM(line, 2)); + } + if (lp.x != f[0] || lp.y != f[1]) { + // not a polygon! + //dev_linecount -= 1; + return 0; + } + + // we have detected a quad + dev_linecount = 0; // reset this + // a quad item is ("qu", (ul, ur, ll, lr)), where the tuple items + // are pairs of floats representing a quad corner each. + rect = PyTuple_New(2); + PyTuple_SET_ITEM(rect, 0, PyUnicode_FromString("qu")); + /* ---------------------------------------------------- + * relationship of float array to quad points: + * (0, 1) = ul, (2, 3) = ll, (6, 7) = ur, (4, 5) = lr + ---------------------------------------------------- */ + fz_quad q = fz_make_quad(f[0], f[1], f[6], f[7], f[2], f[3], f[4], f[5]); + PyTuple_SET_ITEM(rect, 1, JM_py_from_quad(q)); + PyList_SetItem(items, len - 4, rect); // replace item -4 by rect + PyList_SetSlice(items, len - 3, len, NULL); // delete remaining 3 items + return 1; +} + + +/* +-------------------------------------------------------------------------- +Check whether the last 3 path items represent a rectangle. +Line 1 and 3 must be horizontal, line 2 must be vertical. +Returns 1 if we have modified the path, otherwise 0. +-------------------------------------------------------------------------- +*/ +static int +jm_checkrect() +{ + dev_linecount = 0; // reset line count + long orientation = 0; // area orientation of rectangle + fz_point ll, lr, ur, ul; + fz_rect r; + PyObject *rect; + PyObject *line0, *line2; + PyObject *items = PyDict_GetItem(dev_pathdict, dictkey_items); + Py_ssize_t len = PyList_Size(items); + + line0 = PyList_GET_ITEM(items, len - 3); + ll = JM_point_from_py(PyTuple_GET_ITEM(line0, 1)); + lr = JM_point_from_py(PyTuple_GET_ITEM(line0, 2)); + // no need to extract "line1"! + line2 = PyList_GET_ITEM(items, len - 1); + ur = JM_point_from_py(PyTuple_GET_ITEM(line2, 1)); + ul = JM_point_from_py(PyTuple_GET_ITEM(line2, 2)); + + /* + --------------------------------------------------------------------- + Assumption: + When decomposing rects, MuPDF always starts with a horizontal line, + followed by a vertical line, followed by a horizontal line. + First line: (ll, lr), third line: (ul, ur). + If 1st line is below 3rd line, we record anti-clockwise (+1), else + clockwise (-1) orientation. + --------------------------------------------------------------------- + */ + if (ll.y != lr.y || + ll.x != ul.x || + ur.y != ul.y || + ur.x != lr.x) { + goto drop_out; // not a rectangle + } + + // we have a rect, replace last 3 "l" items by one "re" item. 
+ if (ul.y < lr.y) { + r = fz_make_rect(ul.x, ul.y, lr.x, lr.y); + orientation = 1; + } else { + r = fz_make_rect(ll.x, ll.y, ur.x, ur.y); + orientation = -1; + } + rect = PyTuple_New(3); + PyTuple_SET_ITEM(rect, 0, PyUnicode_FromString("re")); + PyTuple_SET_ITEM(rect, 1, JM_py_from_rect(r)); + PyTuple_SET_ITEM(rect, 2, PyLong_FromLong(orientation)); + PyList_SetItem(items, len - 3, rect); // replace item -3 by rect + PyList_SetSlice(items, len - 2, len, NULL); // delete remaining 2 items + return 1; + drop_out:; + return 0; +} + +static PyObject * +jm_lineart_color(fz_context *ctx, fz_colorspace *colorspace, const float *color) +{ + float rgb[3]; + if (colorspace) { + fz_convert_color(ctx, colorspace, color, fz_device_rgb(ctx), + rgb, NULL, fz_default_color_params); + return Py_BuildValue("fff", rgb[0], rgb[1], rgb[2]); + } + return PyTuple_New(0); +} + +static void +trace_moveto(fz_context *ctx, void *dev_, float x, float y) +{ + dev_lastpoint = fz_transform_point(fz_make_point(x, y), trace_device_ctm); + if (fz_is_infinite_rect(dev_pathrect)) { + dev_pathrect = fz_make_rect(dev_lastpoint.x, dev_lastpoint.y, + dev_lastpoint.x, dev_lastpoint.y); + } + dev_firstpoint = dev_lastpoint; + dev_havemove = 1; + dev_linecount = 0; // reset # of consec. lines +} + +static void +trace_lineto(fz_context *ctx, void *dev_, float x, float y) +{ + fz_point p1 = fz_transform_point(fz_make_point(x, y), trace_device_ctm); + dev_pathrect = fz_include_point_in_rect(dev_pathrect, p1); + PyObject *list = PyTuple_New(3); + PyTuple_SET_ITEM(list, 0, PyUnicode_FromString("l")); + PyTuple_SET_ITEM(list, 1, JM_py_from_point(dev_lastpoint)); + PyTuple_SET_ITEM(list, 2, JM_py_from_point(p1)); + dev_lastpoint = p1; + PyObject *items = PyDict_GetItem(dev_pathdict, dictkey_items); + LIST_APPEND_DROP(items, list); + dev_linecount += 1; // counts consecutive lines + if (dev_linecount == 4 && path_type != FILL_PATH) { // shrink to "re" or "qu" item + jm_checkquad(); + } +} + +static void +trace_curveto(fz_context *ctx, void *dev_, float x1, float y1, float x2, float y2, float x3, float y3) +{ + dev_linecount = 0; // reset # of consec. lines + fz_point p1 = fz_make_point(x1, y1); + fz_point p2 = fz_make_point(x2, y2); + fz_point p3 = fz_make_point(x3, y3); + p1 = fz_transform_point(p1, trace_device_ctm); + p2 = fz_transform_point(p2, trace_device_ctm); + p3 = fz_transform_point(p3, trace_device_ctm); + dev_pathrect = fz_include_point_in_rect(dev_pathrect, p1); + dev_pathrect = fz_include_point_in_rect(dev_pathrect, p2); + dev_pathrect = fz_include_point_in_rect(dev_pathrect, p3); + + PyObject *list = PyTuple_New(5); + PyTuple_SET_ITEM(list, 0, PyUnicode_FromString("c")); + PyTuple_SET_ITEM(list, 1, JM_py_from_point(dev_lastpoint)); + PyTuple_SET_ITEM(list, 2, JM_py_from_point(p1)); + PyTuple_SET_ITEM(list, 3, JM_py_from_point(p2)); + PyTuple_SET_ITEM(list, 4, JM_py_from_point(p3)); + dev_lastpoint = p3; + PyObject *items = PyDict_GetItem(dev_pathdict, dictkey_items); + LIST_APPEND_DROP(items, list); +} + +static void +trace_close(fz_context *ctx, void *dev_) +{ + if (dev_linecount == 3) { + if (jm_checkrect()) { + return; + } + } + dev_linecount = 0; // reset # of consec. 
lines + if (dev_havemove) { + if (dev_firstpoint.x != dev_lastpoint.x || dev_firstpoint.y != dev_lastpoint.y) { + PyObject *list = PyTuple_New(3); + PyTuple_SET_ITEM(list, 0, PyUnicode_FromString("l")); + PyTuple_SET_ITEM(list, 1, JM_py_from_point(dev_lastpoint)); + PyTuple_SET_ITEM(list, 2, JM_py_from_point(dev_firstpoint)); + dev_lastpoint = dev_firstpoint; + PyObject *items = PyDict_GetItem(dev_pathdict, dictkey_items); + LIST_APPEND_DROP(items, list); + } + dev_havemove = 0; + DICT_SETITEMSTR_DROP(dev_pathdict, "closePath", JM_BOOL(0)); + } else { + DICT_SETITEMSTR_DROP(dev_pathdict, "closePath", JM_BOOL(1)); + } +} + +static const fz_path_walker trace_path_walker = + { + trace_moveto, + trace_lineto, + trace_curveto, + trace_close + }; + +/* +--------------------------------------------------------------------- +Create the "items" list of the path dictionary +* either create or empty the path dictionary +* reset the end point of the path +* reset count of consecutive lines +* invoke fz_walk_path(), which create the single items +* if no items detected, empty path dict again +--------------------------------------------------------------------- +*/ +static void +jm_lineart_path(fz_context *ctx, jm_lineart_device *dev, const fz_path *path) +{ + dev_pathrect = fz_infinite_rect; + dev_linecount = 0; + dev_lastpoint = fz_make_point(0, 0); + if (dev_pathdict) { + Py_CLEAR(dev_pathdict); + } + dev_pathdict = PyDict_New(); + DICT_SETITEM_DROP(dev_pathdict, dictkey_items, PyList_New(0)); + fz_walk_path(ctx, path, &trace_path_walker, dev); + // Check if any items were added ... + if (!PyDict_GetItem(dev_pathdict, dictkey_items) || !PyList_Size(PyDict_GetItem(dev_pathdict, dictkey_items))) { + Py_CLEAR(dev_pathdict); + } +} + +//--------------------------------------------------------------------------- +// Append current path to list or merge into last path of the list. +// (1) Append if first path, different item lists or not a 'stroke' version +// of previous path +// (2) If new path has the same items, merge its content into previous path +// and change path["type"] to "fs". +// (3) If "out" is callable, skip the previous and pass dictionary to it. +//--------------------------------------------------------------------------- +static void +jm_append_merge(PyObject *out, PyObject *method) +{ + if (PyCallable_Check(out) || method != Py_None) { // function or method + goto callback; + } + Py_ssize_t len = PyList_Size(out); // len of output list so far + if (len == 0) { // always append first path + goto append; + } + const char *thistype = PyUnicode_AsUTF8(PyDict_GetItem(dev_pathdict, dictkey_type)); + if (strcmp(thistype, "s") != 0) { // if not stroke, then append + goto append; + } + PyObject *prev = PyList_GET_ITEM(out, len - 1); // get prev path + const char *prevtype = PyUnicode_AsUTF8(PyDict_GetItem(prev, dictkey_type)); + if (strcmp(prevtype, "f") != 0) { // if previous not fill, append + goto append; + } + // last check: there must be the same list of items for "f" and "s". 
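+ // Hedged sketch of a successful merge (keys abbreviated, values illustrative):
+ //   prev:    {"type": "f", "items": I, "fill": ..., ...}
+ //   current: {"type": "s", "items": I, "color": ..., "width": ..., ...}
+ //   result:  one dict with "type" == "fs" carrying both fill and stroke keys.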
+ PyObject *previtems = PyDict_GetItem(prev, dictkey_items); + PyObject *thisitems = PyDict_GetItem(dev_pathdict, dictkey_items); + if (PyObject_RichCompareBool(previtems, thisitems, Py_NE)) { + goto append; + } + int rc = PyDict_Merge(prev, dev_pathdict, 0); // merge with no override + if (rc == 0) { + DICT_SETITEM_DROP(prev, dictkey_type, PyUnicode_FromString("fs")); + goto postappend; + } else { + PySys_WriteStderr("could not merge stroke and fill path"); + goto append; + } + append:; + PyList_Append(out, dev_pathdict); + postappend:; + Py_CLEAR(dev_pathdict); + return; + + callback:; // callback function or method + PyObject *resp = NULL; + if (method == Py_None) { + resp = PyObject_CallFunctionObjArgs(out, dev_pathdict, NULL); + } else { + resp = PyObject_CallMethodObjArgs(out, method, dev_pathdict, NULL); + } + if (resp) { + Py_DECREF(resp); + } else { + PySys_WriteStderr("calling cdrawings callback function/method failed!"); + PyErr_Clear(); + } + Py_CLEAR(dev_pathdict); + return; +} + + +static void +jm_lineart_fill_path(fz_context *ctx, fz_device *dev_, const fz_path *path, + int even_odd, fz_matrix ctm, fz_colorspace *colorspace, + const float *color, float alpha, fz_color_params color_params) +{ + jm_lineart_device *dev = (jm_lineart_device *) dev_; + PyObject *out = dev->out; + trace_device_ctm = ctm; //fz_concat(ctm, trace_device_ptm); + path_type = FILL_PATH; + jm_lineart_path(ctx, dev, path); + if (!dev_pathdict) { + return; + } + DICT_SETITEM_DROP(dev_pathdict, dictkey_type, PyUnicode_FromString("f")); + DICT_SETITEMSTR_DROP(dev_pathdict, "even_odd", JM_BOOL(even_odd)); + DICT_SETITEMSTR_DROP(dev_pathdict, "fill_opacity", Py_BuildValue("f", alpha)); + DICT_SETITEMSTR_DROP(dev_pathdict, "fill", jm_lineart_color(ctx, colorspace, color)); + DICT_SETITEM_DROP(dev_pathdict, dictkey_rect, JM_py_from_rect(dev_pathrect)); + DICT_SETITEMSTR_DROP(dev_pathdict, "seqno", PyLong_FromSize_t(dev->seqno)); + DICT_SETITEMSTR_DROP(dev_pathdict, "layer", JM_UnicodeFromStr(layer_name)); + if (dev->clips) { + DICT_SETITEMSTR_DROP(dev_pathdict, "level", PyLong_FromLong(dev->depth)); + } + jm_append_merge(out, dev->method); + dev->seqno += 1; +} + +static void +jm_lineart_stroke_path(fz_context *ctx, fz_device *dev_, const fz_path *path, + const fz_stroke_state *stroke, fz_matrix ctm, + fz_colorspace *colorspace, const float *color, float alpha, + fz_color_params color_params) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + PyObject *out = dev->out; + int i; + dev_pathfactor = 1; + if (fz_abs(ctm.a) == fz_abs(ctm.d)) { + dev_pathfactor = fz_abs(ctm.a); + } + trace_device_ctm = ctm; // fz_concat(ctm, trace_device_ptm); + path_type = STROKE_PATH; + + jm_lineart_path(ctx, dev, path); + if (!dev_pathdict) { + return; + } + DICT_SETITEM_DROP(dev_pathdict, dictkey_type, PyUnicode_FromString("s")); + DICT_SETITEMSTR_DROP(dev_pathdict, "stroke_opacity", Py_BuildValue("f", alpha)); + DICT_SETITEMSTR_DROP(dev_pathdict, "color", jm_lineart_color(ctx, colorspace, color)); + DICT_SETITEM_DROP(dev_pathdict, dictkey_width, Py_BuildValue("f", dev_pathfactor * stroke->linewidth)); + DICT_SETITEMSTR_DROP(dev_pathdict, "lineCap", Py_BuildValue("iii", stroke->start_cap, stroke->dash_cap, stroke->end_cap)); + DICT_SETITEMSTR_DROP(dev_pathdict, "lineJoin", Py_BuildValue("f", dev_pathfactor * stroke->linejoin)); + if (!PyDict_GetItemString(dev_pathdict, "closePath")) { + DICT_SETITEMSTR_DROP(dev_pathdict, "closePath", JM_BOOL(0)); + } + + // output the "dashes" string + if (stroke->dash_len) { + fz_buffer 
*buff = fz_new_buffer(ctx, 256); + fz_append_string(ctx, buff, "[ "); // left bracket + for (i = 0; i < stroke->dash_len; i++) { + fz_append_printf(ctx, buff, "%g ", dev_pathfactor * stroke->dash_list[i]); + } + fz_append_printf(ctx, buff, "] %g", dev_pathfactor * stroke->dash_phase); + DICT_SETITEMSTR_DROP(dev_pathdict, "dashes", JM_EscapeStrFromBuffer(ctx, buff)); + fz_drop_buffer(ctx, buff); + } else { + DICT_SETITEMSTR_DROP(dev_pathdict, "dashes", PyUnicode_FromString("[] 0")); + } + + DICT_SETITEM_DROP(dev_pathdict, dictkey_rect, JM_py_from_rect(dev_pathrect)); + DICT_SETITEMSTR_DROP(dev_pathdict, "layer", JM_UnicodeFromStr(layer_name)); + DICT_SETITEMSTR_DROP(dev_pathdict, "seqno", PyLong_FromSize_t(dev->seqno)); + if (dev->clips) { + DICT_SETITEMSTR_DROP(dev_pathdict, "level", PyLong_FromLong(dev->depth)); + } + // output the dict - potentially merging it with a previous fill_path twin + jm_append_merge(out, dev->method); + dev->seqno += 1; +} + +static void +jm_lineart_clip_path(fz_context *ctx, fz_device *dev_, const fz_path *path, int even_odd, fz_matrix ctm, fz_rect scissor) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + PyObject *out = dev->out; + trace_device_ctm = ctm; //fz_concat(ctm, trace_device_ptm); + path_type = CLIP_PATH; + jm_lineart_path(ctx, dev, path); + if (!dev_pathdict) { + return; + } + DICT_SETITEM_DROP(dev_pathdict, dictkey_type, PyUnicode_FromString("clip")); + DICT_SETITEMSTR_DROP(dev_pathdict, "even_odd", JM_BOOL(even_odd)); + if (!PyDict_GetItemString(dev_pathdict, "closePath")) { + DICT_SETITEMSTR_DROP(dev_pathdict, "closePath", JM_BOOL(0)); + } + DICT_SETITEMSTR_DROP(dev_pathdict, "scissor", JM_py_from_rect(compute_scissor())); + DICT_SETITEMSTR_DROP(dev_pathdict, "level", PyLong_FromLong(dev->depth)); + DICT_SETITEMSTR_DROP(dev_pathdict, "layer", JM_UnicodeFromStr(layer_name)); + jm_append_merge(out, dev->method); + dev->depth++; +} + +static void +jm_lineart_clip_stroke_path(fz_context *ctx, fz_device *dev_, const fz_path *path, const fz_stroke_state *stroke, fz_matrix ctm, fz_rect scissor) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + PyObject *out = dev->out; + trace_device_ctm = ctm; //fz_concat(ctm, trace_device_ptm); + path_type = CLIP_STROKE_PATH; + jm_lineart_path(ctx, dev, path); + if (!dev_pathdict) { + return; + } + DICT_SETITEM_DROP(dev_pathdict, dictkey_type, PyUnicode_FromString("clip")); + DICT_SETITEMSTR_DROP(dev_pathdict, "even_odd", Py_BuildValue("s", NULL)); + if (!PyDict_GetItemString(dev_pathdict, "closePath")) { + DICT_SETITEMSTR_DROP(dev_pathdict, "closePath", JM_BOOL(0)); + } + DICT_SETITEMSTR_DROP(dev_pathdict, "scissor", JM_py_from_rect(compute_scissor())); + DICT_SETITEMSTR_DROP(dev_pathdict, "level", PyLong_FromLong(dev->depth)); + DICT_SETITEMSTR_DROP(dev_pathdict, "layer", JM_UnicodeFromStr(layer_name)); + jm_append_merge(out, dev->method); + dev->depth++; +} + +static void +jm_lineart_clip_stroke_text(fz_context *ctx, fz_device *dev_, const fz_text *text, const fz_stroke_state *stroke, fz_matrix ctm, fz_rect scissor) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + PyObject *out = dev->out; + compute_scissor(); + dev->depth++; +} + +static void +jm_lineart_clip_text(fz_context *ctx, fz_device *dev_, const fz_text *text, fz_matrix ctm, fz_rect scissor) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + PyObject *out = dev->out; + compute_scissor(); + dev->depth++; +} + 
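+/*
+Hedged sketch of a "clip" dictionary as emitted by the clip path handlers above
+(values illustrative only):
+    {"type": "clip", "even_odd": ..., "closePath": ..., "scissor": <rect>,
+     "level": <int>, "layer": <str>, "items": [...]}
+The text and image-mask clip callbacks merely record the scissor rectangle and
+increase the nesting depth.
+*/
+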
+static void +jm_lineart_clip_image_mask(fz_context *ctx, fz_device *dev_, fz_image *image, fz_matrix ctm, fz_rect scissor) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + PyObject *out = dev->out; + compute_scissor(); + dev->depth++; +} + +static void +jm_lineart_pop_clip(fz_context *ctx, fz_device *dev_) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + if (!scissors) return; + Py_ssize_t len = PyList_Size(scissors); + if (len < 1) return; + PyList_SetSlice(scissors, len - 1, len, NULL); + dev->depth--; +} + + +static void +jm_lineart_begin_layer(fz_context *ctx, fz_device *dev_, const char *name) +{ + layer_name = fz_strdup(ctx, name); +} + +static void +jm_lineart_end_layer(fz_context *ctx, fz_device *dev_) +{ + fz_free(ctx, layer_name); + layer_name = NULL; +} + +static void +jm_lineart_begin_group(fz_context *ctx, fz_device *dev_, fz_rect bbox, fz_colorspace *cs, int isolated, int knockout, int blendmode, float alpha) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + PyObject *out = dev->out; + dev_pathdict = Py_BuildValue("{s:s,s:N,s:N,s:N,s:s,s:f,s:i,s:N}", + "type", "group", + "rect", JM_py_from_rect(bbox), + "isolated", JM_BOOL(isolated), + "knockout", JM_BOOL(knockout), + "blendmode", fz_blendmode_name(blendmode), + "opacity", alpha, + "level", dev->depth, + "layer", JM_UnicodeFromStr(layer_name) + ); + jm_append_merge(out, dev->method); + dev->depth++; +} + +static void +jm_lineart_end_group(fz_context *ctx, fz_device *dev_) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (!dev->clips) return; + dev->depth--; +} + + +static void +jm_dev_linewidth(fz_context *ctx, fz_device *dev_, const fz_path *path, const fz_stroke_state *stroke, fz_matrix ctm, fz_colorspace *colorspace, const float *color, float alpha, fz_color_params color_params) +{ + dev_linewidth = stroke->linewidth; + jm_increase_seqno(ctx, dev_); +} + + +static void +jm_trace_text_span(fz_context *ctx, PyObject *out, fz_text_span *span, int type, fz_matrix ctm, fz_colorspace *colorspace, const float *color, float alpha, size_t seqno) +{ + fz_font *out_font = NULL; + int i; + const char *fontname = JM_font_name(ctx, span->font); + float rgb[3]; + PyObject *chars = PyTuple_New(span->len); + fz_matrix mat = fz_concat(span->trm, ctm); // text transformation matrix + fz_point dir = fz_transform_vector(fz_make_point(1, 0), mat); // writing direction + double fsize = sqrt(dir.x * dir.x + dir.y * dir.y); + + dir = fz_normalize_vector(dir); + double linewidth, adv, asc, dsc; + double space_adv = 0; + float x0, y0, x1, y1; + asc = (double) JM_font_ascender(ctx, span->font); + dsc = (double) JM_font_descender(ctx, span->font); + if (asc < 1e-3) { // probably Tesseract font + dsc = -0.1; + asc = 0.9; + } + // compute effective ascender / descender + double ascsize = asc * fsize / (asc - dsc); + double dscsize = dsc * fsize / (asc - dsc); + + int fflags = 0; // font flags + int mono = fz_font_is_monospaced(ctx, span->font); + fflags += mono * TEXT_FONT_MONOSPACED; + fflags += fz_font_is_italic(ctx, span->font) * TEXT_FONT_ITALIC; + fflags += fz_font_is_serif(ctx, span->font) * TEXT_FONT_SERIFED; + fflags += fz_font_is_bold(ctx, span->font) * TEXT_FONT_BOLD; + + if (dev_linewidth > 0) { // width of character border + linewidth = (double) dev_linewidth; + } else { + linewidth = fsize * 0.05; // default: 5% of font size + } + fz_point char_orig; + double last_adv = 0; + + // walk through characters of span + 
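+ // Per glyph (hedged summary of the loop that follows): x0 is the transformed
+ // glyph origin, x1 = x0 + advance; the y-range spans the scaled descender to
+ // ascender; the box is then mapped by a rotation derived from direction "dir".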
fz_rect span_bbox; + fz_matrix rot = fz_make_matrix(dir.x, dir.y, -dir.y, dir.x, 0, 0); + if (dir.x == -1) { // left-right flip + rot.d = 1; + } + + //PySys_WriteStdout("mat: (%g, %g, %g, %g)\n", mat.a, mat.b, mat.c, mat.d); + //PySys_WriteStdout("rot: (%g, %g, %g, %g)\n", rot.a, rot.b, rot.c, rot.d); + + for (i = 0; i < span->len; i++) { + adv = 0; + if (span->items[i].gid >= 0) { + adv = (double) fz_advance_glyph(ctx, span->font, span->items[i].gid, span->wmode); + } + adv *= fsize; + last_adv = adv; + if (span->items[i].ucs == 32) { + space_adv = adv; + } + char_orig = fz_make_point(span->items[i].x, span->items[i].y); + char_orig = fz_transform_point(char_orig, ctm); + fz_matrix m1 = fz_make_matrix(1, 0, 0, 1, -char_orig.x, -char_orig.y); + m1 = fz_concat(m1, rot); + m1 = fz_concat(m1, fz_make_matrix(1, 0, 0, 1, char_orig.x, char_orig.y)); + x0 = char_orig.x; + x1 = x0 + adv; + if (mat.d > 0 && (dir.x == 1 || dir.x == -1) || + mat.b !=0 && mat.b == -mat.c) { // up-down flip + y0 = char_orig.y + dscsize; + y1 = char_orig.y + ascsize; + } else { + y0 = char_orig.y - ascsize; + y1 = char_orig.y - dscsize; + } + fz_rect char_bbox = fz_make_rect(x0, y0, x1, y1); + char_bbox = fz_transform_rect(char_bbox, m1); + PyTuple_SET_ITEM(chars, (Py_ssize_t) i, Py_BuildValue("ii(ff)(ffff)", + span->items[i].ucs, span->items[i].gid, + char_orig.x, char_orig.y, char_bbox.x0, char_bbox.y0, char_bbox.x1, char_bbox.y1)); + if (i > 0) { + span_bbox = fz_union_rect(span_bbox, char_bbox); + } else { + span_bbox = char_bbox; + } + } + if (!space_adv) { + if (!mono) { + space_adv = fz_advance_glyph(ctx, span->font, + fz_encode_character_with_fallback(ctx, span->font, 32, 0, 0, &out_font), + span->wmode); + space_adv *= fsize; + if (!space_adv) { + space_adv = last_adv; + } + } else { + space_adv = last_adv; // for mono, any char width suffices + } + } + // make the span dictionary + PyObject *span_dict = PyDict_New(); + DICT_SETITEMSTR_DROP(span_dict, "dir", JM_py_from_point(dir)); + DICT_SETITEM_DROP(span_dict, dictkey_font, JM_EscapeStrFromStr(fontname)); + DICT_SETITEM_DROP(span_dict, dictkey_wmode, PyLong_FromLong((long) span->wmode)); + DICT_SETITEM_DROP(span_dict, dictkey_flags, PyLong_FromLong((long) fflags)); + DICT_SETITEMSTR_DROP(span_dict, "bidi_lvl", PyLong_FromLong((long) span->bidi_level)); + DICT_SETITEMSTR_DROP(span_dict, "bidi_dir", PyLong_FromLong((long) span->markup_dir)); + DICT_SETITEM_DROP(span_dict, dictkey_ascender, PyFloat_FromDouble(asc)); + DICT_SETITEM_DROP(span_dict, dictkey_descender, PyFloat_FromDouble(dsc)); + DICT_SETITEM_DROP(span_dict, dictkey_colorspace, PyLong_FromLong(3)); + + if (colorspace) { + fz_convert_color(ctx, colorspace, color, fz_device_rgb(ctx), + rgb, NULL, fz_default_color_params); + } else { + rgb[0] = rgb[1] = rgb[2] = 0; + } + + DICT_SETITEM_DROP(span_dict, dictkey_color, Py_BuildValue("fff", rgb[0], rgb[1], rgb[2])); + DICT_SETITEM_DROP(span_dict, dictkey_size, PyFloat_FromDouble(fsize)); + DICT_SETITEMSTR_DROP(span_dict, "opacity", PyFloat_FromDouble((double) alpha)); + DICT_SETITEMSTR_DROP(span_dict, "linewidth", PyFloat_FromDouble((double) linewidth)); + DICT_SETITEMSTR_DROP(span_dict, "spacewidth", PyFloat_FromDouble(space_adv)); + DICT_SETITEM_DROP(span_dict, dictkey_type, PyLong_FromLong((long) type)); + DICT_SETITEM_DROP(span_dict, dictkey_bbox, JM_py_from_rect(span_bbox)); + DICT_SETITEMSTR_DROP(span_dict, "layer", JM_UnicodeFromStr(layer_name)); + DICT_SETITEMSTR_DROP(span_dict, "seqno", PyLong_FromSize_t(seqno)); + DICT_SETITEM_DROP(span_dict, 
dictkey_chars, chars); + LIST_APPEND_DROP(out, span_dict); +} + +static void +jm_trace_text(fz_context *ctx, PyObject *out, const fz_text *text, int type, fz_matrix ctm, fz_colorspace *colorspace, const float *color, float alpha, size_t seqno) +{ + fz_text_span *span; + for (span = text->head; span; span = span->next) + jm_trace_text_span(ctx, out, span, type, ctm, colorspace, color, alpha, seqno); +} + +/*--------------------------------------------------------- +There are 3 text trace types: +0 - fill text (PDF Tr 0) +1 - stroke text (PDF Tr 1) +3 - ignore text (PDF Tr 3) +---------------------------------------------------------*/ +static void +jm_lineart_fill_text(fz_context *ctx, fz_device *dev_, const fz_text *text, fz_matrix ctm, fz_colorspace *colorspace, const float *color, float alpha, fz_color_params color_params) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + PyObject *out = dev->out; + jm_trace_text(ctx, out, text, 0, ctm, colorspace, color, alpha, dev->seqno); + dev->seqno += 1; +} + +static void +jm_lineart_stroke_text(fz_context *ctx, fz_device *dev_, const fz_text *text, const fz_stroke_state *stroke, fz_matrix ctm, fz_colorspace *colorspace, const float *color, float alpha, fz_color_params color_params) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + PyObject *out = dev->out; + jm_trace_text(ctx, out, text, 1, ctm, colorspace, color, alpha, dev->seqno); + dev->seqno += 1; +} + + +static void +jm_lineart_ignore_text(fz_context *ctx, fz_device *dev_, const fz_text *text, fz_matrix ctm) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + PyObject *out = dev->out; + jm_trace_text(ctx, out, text, 3, ctm, NULL, NULL, 1, dev->seqno); + dev->seqno += 1; +} + +static void jm_lineart_drop_device(fz_context *ctx, fz_device *dev_) +{ + jm_lineart_device *dev = (jm_lineart_device *)dev_; + if (PyList_Check(dev->out)) { + Py_CLEAR(dev->out); + } + Py_CLEAR(dev->method); + Py_CLEAR(scissors); +} + +//------------------------------------------------------------------- +// LINEART device for Python method Page.get_cdrawings() +//------------------------------------------------------------------- +fz_device *JM_new_lineart_device(fz_context *ctx, PyObject *out, int clips, PyObject *method) +{ + jm_lineart_device *dev = fz_new_derived_device(ctx, jm_lineart_device); + + dev->super.close_device = NULL; + dev->super.drop_device = jm_lineart_drop_device; + dev->super.fill_path = jm_lineart_fill_path; + dev->super.stroke_path = jm_lineart_stroke_path; + dev->super.clip_path = jm_lineart_clip_path; + dev->super.clip_stroke_path = jm_lineart_clip_stroke_path; + + dev->super.fill_text = jm_increase_seqno; + dev->super.stroke_text = jm_increase_seqno; + dev->super.clip_text = jm_lineart_clip_text; + dev->super.clip_stroke_text = jm_lineart_clip_stroke_text; + dev->super.ignore_text = jm_increase_seqno; + + dev->super.fill_shade = jm_increase_seqno; + dev->super.fill_image = jm_increase_seqno; + dev->super.fill_image_mask = jm_increase_seqno; + dev->super.clip_image_mask = jm_lineart_clip_image_mask; + + dev->super.pop_clip = jm_lineart_pop_clip; + + dev->super.begin_mask = NULL; + dev->super.end_mask = NULL; + dev->super.begin_group = jm_lineart_begin_group; + dev->super.end_group = jm_lineart_end_group; + + dev->super.begin_tile = NULL; + dev->super.end_tile = NULL; + + dev->super.begin_layer = jm_lineart_begin_layer; + dev->super.end_layer = jm_lineart_end_layer; + + dev->super.begin_structure = NULL; + dev->super.end_structure = NULL; + + dev->super.begin_metatext = 
NULL; + dev->super.end_metatext = NULL; + + dev->super.render_flags = NULL; + dev->super.set_default_colorspaces = NULL; + + if (PyList_Check(out)) { + Py_INCREF(out); + } + Py_INCREF(method); + dev->out = out; + dev->seqno = 0; + dev->depth = 0; + dev->clips = clips; + dev->method = method; + trace_device_reset(); + return (fz_device *)dev; +} + +//------------------------------------------------------------------- +// Trace TEXT device for Python method Page.get_texttrace() +//------------------------------------------------------------------- +fz_device *JM_new_texttrace_device(fz_context *ctx, PyObject *out) +{ + jm_lineart_device *dev = fz_new_derived_device(ctx, jm_lineart_device); + + dev->super.close_device = NULL; + dev->super.drop_device = jm_lineart_drop_device; + dev->super.fill_path = jm_increase_seqno; + dev->super.stroke_path = jm_dev_linewidth; + dev->super.clip_path = NULL; + dev->super.clip_stroke_path = NULL; + + dev->super.fill_text = jm_lineart_fill_text; + dev->super.stroke_text = jm_lineart_stroke_text; + dev->super.clip_text = NULL; + dev->super.clip_stroke_text = NULL; + dev->super.ignore_text = jm_lineart_ignore_text; + + dev->super.fill_shade = jm_increase_seqno; + dev->super.fill_image = jm_increase_seqno; + dev->super.fill_image_mask = jm_increase_seqno; + dev->super.clip_image_mask = NULL; + + dev->super.pop_clip = NULL; + + dev->super.begin_mask = NULL; + dev->super.end_mask = NULL; + dev->super.begin_group = NULL; + dev->super.end_group = NULL; + + dev->super.begin_tile = NULL; + dev->super.end_tile = NULL; + + dev->super.begin_layer = jm_lineart_begin_layer; + dev->super.end_layer = jm_lineart_end_layer; + + dev->super.begin_structure = NULL; + dev->super.end_structure = NULL; + + dev->super.begin_metatext = NULL; + dev->super.end_metatext = NULL; + + dev->super.render_flags = NULL; + dev->super.set_default_colorspaces = NULL; + + if (PyList_Check(out)) { + Py_XINCREF(out); + } + dev->out = out; + dev->seqno = 0; + dev->depth = 0; + dev->clips = 0; + dev->method = NULL; + trace_device_reset(); + + return (fz_device *)dev; +} + +//------------------------------------------------------------------- +// BBOX device +//------------------------------------------------------------------- +typedef struct jm_bbox_device_s +{ + fz_device super; + PyObject *result; + int layers; +} jm_bbox_device; + +static void +jm_bbox_add_rect(fz_context *ctx, fz_device *dev, fz_rect rect, char *code) +{ + jm_bbox_device *bdev = (jm_bbox_device *)dev; + if (!bdev->layers) { + LIST_APPEND_DROP(bdev->result, Py_BuildValue("sN", code, JM_py_from_rect(rect))); + } else { + LIST_APPEND_DROP(bdev->result, Py_BuildValue("sNN", code, JM_py_from_rect(rect), JM_UnicodeFromStr(layer_name))); + } +} + +static void +jm_bbox_fill_path(fz_context *ctx, fz_device *dev, const fz_path *path, int even_odd, fz_matrix ctm, + fz_colorspace *colorspace, const float *color, float alpha, fz_color_params color_params) +{ + jm_bbox_add_rect(ctx, dev, fz_bound_path(ctx, path, NULL, ctm), "fill-path"); +} + +static void +jm_bbox_stroke_path(fz_context *ctx, fz_device *dev, const fz_path *path, const fz_stroke_state *stroke, + fz_matrix ctm, fz_colorspace *colorspace, const float *color, float alpha, fz_color_params color_params) +{ + jm_bbox_add_rect(ctx, dev, fz_bound_path(ctx, path, stroke, ctm), "stroke-path"); +} + +static void +jm_bbox_fill_text(fz_context *ctx, fz_device *dev, const fz_text *text, fz_matrix ctm, ...) 
+{ + jm_bbox_add_rect(ctx, dev, fz_bound_text(ctx, text, NULL, ctm), "fill-text"); +} + +static void +jm_bbox_ignore_text(fz_context *ctx, fz_device *dev, const fz_text *text, fz_matrix ctm) +{ + jm_bbox_add_rect(ctx, dev, fz_bound_text(ctx, text, NULL, ctm), "ignore-text"); +} + +static void +jm_bbox_stroke_text(fz_context *ctx, fz_device *dev, const fz_text *text, const fz_stroke_state *stroke, fz_matrix ctm, ...) +{ + jm_bbox_add_rect(ctx, dev, fz_bound_text(ctx, text, stroke, ctm), "stroke-text"); +} + +static void +jm_bbox_fill_shade(fz_context *ctx, fz_device *dev, fz_shade *shade, fz_matrix ctm, float alpha, fz_color_params color_params) +{ + jm_bbox_add_rect(ctx, dev, fz_bound_shade(ctx, shade, ctm), "fill-shade"); +} + +static void +jm_bbox_fill_image(fz_context *ctx, fz_device *dev, fz_image *image, fz_matrix ctm, float alpha, fz_color_params color_params) +{ + jm_bbox_add_rect(ctx, dev, fz_transform_rect(fz_unit_rect, ctm), "fill-image"); +} + +static void +jm_bbox_fill_image_mask(fz_context *ctx, fz_device *dev, fz_image *image, fz_matrix ctm, + fz_colorspace *colorspace, const float *color, float alpha, fz_color_params color_params) +{ + jm_bbox_add_rect(ctx, dev, fz_transform_rect(fz_unit_rect, ctm), "fill-imgmask"); +} + +fz_device * +JM_new_bbox_device(fz_context *ctx, PyObject *result, int layers) +{ + jm_bbox_device *dev = fz_new_derived_device(ctx, jm_bbox_device); + + dev->super.fill_path = jm_bbox_fill_path; + dev->super.stroke_path = jm_bbox_stroke_path; + dev->super.clip_path = NULL; + dev->super.clip_stroke_path = NULL; + + dev->super.fill_text = jm_bbox_fill_text; + dev->super.stroke_text = jm_bbox_stroke_text; + dev->super.clip_text = NULL; + dev->super.clip_stroke_text = NULL; + dev->super.ignore_text = jm_bbox_ignore_text; + + dev->super.fill_shade = jm_bbox_fill_shade; + dev->super.fill_image = jm_bbox_fill_image; + dev->super.fill_image_mask = jm_bbox_fill_image_mask; + dev->super.clip_image_mask = NULL; + + dev->super.pop_clip = NULL; + + dev->super.begin_mask = NULL; + dev->super.end_mask = NULL; + dev->super.begin_group = NULL; + dev->super.end_group = NULL; + + dev->super.begin_tile = NULL; + dev->super.end_tile = NULL; + + dev->super.begin_layer = jm_lineart_begin_layer; + dev->super.end_layer = jm_lineart_end_layer; + + dev->super.begin_structure = NULL; + dev->super.end_structure = NULL; + + dev->super.begin_metatext = NULL; + dev->super.end_metatext = NULL; + + dev->super.render_flags = NULL; + dev->super.set_default_colorspaces = NULL; + + dev->result = result; + dev->layers = layers; + trace_device_reset(); + + return (fz_device *)dev; +} + +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-fields.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-fields.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1164 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ +*/ +#define SETATTR(a, v) PyObject_SetAttrString(Widget, a, v) +#define GETATTR(a) PyObject_GetAttrString(Widget, a) +#define CALLATTR(m, p) PyObject_CallMethod(Widget, m, p) + +static void +SETATTR_DROP(PyObject *mod, const char *attr, PyObject *value) +{ + if (!value) + PyObject_DelAttrString(mod, attr); + else + { + PyObject_SetAttrString(mod, attr, value); + Py_DECREF(value); + } +} + +//----------------------------------------------------------------------------- +// Functions dealing with PDF form fields (widgets) +//----------------------------------------------------------------------------- +enum +{ + SigFlag_SignaturesExist = 1, + SigFlag_AppendOnly = 2 +}; + + +// make new PDF action object from JavaScript source +// Parameters are a PDF document and a Python string. +// Returns a PDF action object. +//----------------------------------------------------------------------------- +pdf_obj * +JM_new_javascript(fz_context *ctx, pdf_document *pdf, PyObject *value) +{ + fz_buffer *res = NULL; + if (!PyObject_IsTrue(value)) // no argument given + return NULL; + + char *data = JM_StrAsChar(value); + if (!data) // not convertible to char* + return NULL; + + res = fz_new_buffer_from_copied_data(ctx, data, strlen(data)); + pdf_obj *source = pdf_add_stream(ctx, pdf, res, NULL, 0); + pdf_obj *newaction = pdf_add_new_dict(ctx, pdf, 4); + pdf_dict_put(ctx, newaction, PDF_NAME(S), pdf_new_name(ctx, "JavaScript")); + pdf_dict_put(ctx, newaction, PDF_NAME(JS), source); + fz_drop_buffer(ctx, res); + return pdf_keep_obj(ctx, newaction); +} + + +// JavaScript extractor +// Returns either the script source or None. Parameter is a PDF action +// dictionary, which must have keys /S and /JS. The value of /S must be +// '/JavaScript'. The value of /JS is returned. +//----------------------------------------------------------------------------- +PyObject * +JM_get_script(fz_context *ctx, pdf_obj *key) +{ + pdf_obj *js = NULL; + fz_buffer *res = NULL; + PyObject *script = NULL; + if (!key) Py_RETURN_NONE; + + if (!strcmp(pdf_to_name(ctx, + pdf_dict_get(ctx, key, PDF_NAME(S))), "JavaScript")) { + js = pdf_dict_get(ctx, key, PDF_NAME(JS)); + } + if (!js) Py_RETURN_NONE; + + if (pdf_is_string(ctx, js)) { + script = JM_UnicodeFromStr(pdf_to_text_string(ctx, js)); + } else if (pdf_is_stream(ctx, js)) { + res = pdf_load_stream(ctx, js); + script = JM_EscapeStrFromBuffer(ctx, res); + fz_drop_buffer(ctx, res); + } else { + Py_RETURN_NONE; + } + if (PyObject_IsTrue(script)) { // do not return an empty script + return script; + } + Py_CLEAR(script); + Py_RETURN_NONE; +} + + +// Create a JavaScript PDF action. +// Usable for all object types which support PDF actions, even if the +// argument name suggests annotations. Up to 2 key values can be specified, so +// JavaScript actions can be stored for '/A' and '/AA/?' keys. 
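+// Hedged usage sketch (mirroring calls made further down in this file):
+//     JM_put_script(ctx, annot_obj, PDF_NAME(A), NULL, value);          // -> /A
+//     JM_put_script(ctx, annot_obj, PDF_NAME(AA), PDF_NAME(C), value);  // -> /AA/C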
+//----------------------------------------------------------------------------- +void JM_put_script(fz_context *ctx, pdf_obj *annot_obj, pdf_obj *key1, pdf_obj *key2, PyObject *value) +{ + PyObject *script = NULL; + pdf_obj *key1_obj = pdf_dict_get(ctx, annot_obj, key1); + pdf_document *pdf = pdf_get_bound_document(ctx, annot_obj); // owning PDF + + // if no new script given, just delete corresponding key + if (!value || !PyObject_IsTrue(value)) { + if (!key2) { + pdf_dict_del(ctx, annot_obj, key1); + } else if (key1_obj) { + pdf_dict_del(ctx, key1_obj, key2); + } + return; + } + + // read any existing script as a PyUnicode string + if (!key2 || !key1_obj) { + script = JM_get_script(ctx, key1_obj); + } else { + script = JM_get_script(ctx, pdf_dict_get(ctx, key1_obj, key2)); + } + + // replace old script, if different from new one + if (!PyObject_RichCompareBool(value, script, Py_EQ)) { + pdf_obj *newaction = JM_new_javascript(ctx, pdf, value); + if (!key2) { + pdf_dict_put_drop(ctx, annot_obj, key1, newaction); + } else { + pdf_dict_putl_drop(ctx, annot_obj, newaction, key1, key2, NULL); + } + } + Py_XDECREF(script); + return; +} + +/* +// Execute a JavaScript action for annot or field. +//----------------------------------------------------------------------------- +PyObject * +JM_exec_script(fz_context *ctx, pdf_obj *annot_obj, pdf_obj *key1, pdf_obj *key2) +{ + PyObject *script = NULL; + char *code = NULL; + fz_try(ctx) { + pdf_document *pdf = pdf_get_bound_document(ctx, annot_obj); + char buf[100]; + if (!key2) { + script = JM_get_script(ctx, key1_obj); + } else { + script = JM_get_script(ctx, pdf_dict_get(ctx, key1_obj, key2)); + } + code = JM_StrAsChar(script); + fz_snprintf(buf, sizeof buf, "%d/A", pdf_to_num(ctx, annot_obj)); + pdf_js_execute(pdf->js, buf, code); + } + fz_always(ctx) { + Py_XDECREF(string); + } + fz_catch(ctx) { + Py_RETURN_FALSE; + } + Py_RETURN_TRUE; +} +*/ + +// String from widget type +//----------------------------------------------------------------------------- +char *JM_field_type_text(int wtype) +{ + switch(wtype) { + case(PDF_WIDGET_TYPE_BUTTON): + return "Button"; + case(PDF_WIDGET_TYPE_CHECKBOX): + return "CheckBox"; + case(PDF_WIDGET_TYPE_RADIOBUTTON): + return "RadioButton"; + case(PDF_WIDGET_TYPE_TEXT): + return "Text"; + case(PDF_WIDGET_TYPE_LISTBOX): + return "ListBox"; + case(PDF_WIDGET_TYPE_COMBOBOX): + return "ComboBox"; + case(PDF_WIDGET_TYPE_SIGNATURE): + return "Signature"; + default: + return "unknown"; + } +} + +// Set the field type +//----------------------------------------------------------------------------- +void JM_set_field_type(fz_context *ctx, pdf_document *doc, pdf_obj *obj, int type) +{ + int setbits = 0; + int clearbits = 0; + pdf_obj *typename = NULL; + + switch(type) { + case PDF_WIDGET_TYPE_BUTTON: + typename = PDF_NAME(Btn); + setbits = PDF_BTN_FIELD_IS_PUSHBUTTON; + break; + case PDF_WIDGET_TYPE_RADIOBUTTON: + typename = PDF_NAME(Btn); + clearbits = PDF_BTN_FIELD_IS_PUSHBUTTON; + setbits = PDF_BTN_FIELD_IS_RADIO; + break; + case PDF_WIDGET_TYPE_CHECKBOX: + typename = PDF_NAME(Btn); + clearbits = (PDF_BTN_FIELD_IS_PUSHBUTTON|PDF_BTN_FIELD_IS_RADIO); + break; + case PDF_WIDGET_TYPE_TEXT: + typename = PDF_NAME(Tx); + break; + case PDF_WIDGET_TYPE_LISTBOX: + typename = PDF_NAME(Ch); + clearbits = PDF_CH_FIELD_IS_COMBO; + break; + case PDF_WIDGET_TYPE_COMBOBOX: + typename = PDF_NAME(Ch); + setbits = PDF_CH_FIELD_IS_COMBO; + break; + case PDF_WIDGET_TYPE_SIGNATURE: + typename = PDF_NAME(Sig); + break; + } + + if (typename) + 
pdf_dict_put_drop(ctx, obj, PDF_NAME(FT), typename); + + if (setbits != 0 || clearbits != 0) { + int bits = pdf_dict_get_int(ctx, obj, PDF_NAME(Ff)); + bits &= ~clearbits; + bits |= setbits; + pdf_dict_put_int(ctx, obj, PDF_NAME(Ff), bits); + } +} + +// Copied from MuPDF v1.14 +// Create widget. +// Returns a kept reference to a pdf_annot - caller must drop it. +//----------------------------------------------------------------------------- +pdf_annot *JM_create_widget(fz_context *ctx, pdf_document *doc, pdf_page *page, int type, char *fieldname) +{ + pdf_obj *form = NULL; + int old_sigflags = pdf_to_int(ctx, pdf_dict_getp(ctx, pdf_trailer(ctx, doc), "Root/AcroForm/SigFlags")); + pdf_annot *annot = pdf_create_annot_raw(ctx, page, PDF_ANNOT_WIDGET); // returns a kept reference. + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + fz_try(ctx) { + JM_set_field_type(ctx, doc, annot_obj, type); + pdf_dict_put_text_string(ctx, annot_obj, PDF_NAME(T), fieldname); + + if (type == PDF_WIDGET_TYPE_SIGNATURE) { + int sigflags = (old_sigflags | (SigFlag_SignaturesExist|SigFlag_AppendOnly)); + pdf_dict_putl_drop(ctx, pdf_trailer(ctx, doc), pdf_new_int(ctx, sigflags), PDF_NAME(Root), PDF_NAME(AcroForm), PDF_NAME(SigFlags), NULL); + } + + /* + pdf_create_annot will have linked the new widget into the page's + annot array. We also need it linked into the document's form + */ + form = pdf_dict_getp(ctx, pdf_trailer(ctx, doc), "Root/AcroForm/Fields"); + if (!form) { + form = pdf_new_array(ctx, doc, 1); + pdf_dict_putl_drop(ctx, pdf_trailer(ctx, doc), + form, + PDF_NAME(Root), + PDF_NAME(AcroForm), + PDF_NAME(Fields), + NULL); + } + + pdf_array_push(ctx, form, annot_obj); // Cleanup relies on this statement being last + } + fz_catch(ctx) { + pdf_delete_annot(ctx, page, annot); + + if (type == PDF_WIDGET_TYPE_SIGNATURE) { + pdf_dict_putl_drop(ctx, pdf_trailer(ctx, doc), pdf_new_int(ctx, old_sigflags), PDF_NAME(Root), PDF_NAME(AcroForm), PDF_NAME(SigFlags), NULL); + } + + fz_rethrow(ctx); + } + + return annot; +} + + + +// PushButton get state +//----------------------------------------------------------------------------- +PyObject *JM_pushbtn_state(fz_context *ctx, pdf_annot *annot) +{ // pushed buttons do not reflect status changes in the PDF + // always reflect them as untouched + Py_RETURN_FALSE; +} + + +// Text field retrieve value +//----------------------------------------------------------------------------- +PyObject *JM_text_value(fz_context *ctx, pdf_annot *annot) +{ + const char *text = NULL; + fz_var(text); + fz_try(ctx) { + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + text = pdf_field_value(ctx, annot_obj); + } + fz_catch(ctx) Py_RETURN_NONE; + return JM_UnicodeFromStr(text); +} + +// ListBox retrieve value +//----------------------------------------------------------------------------- +PyObject *JM_listbox_value(fz_context *ctx, pdf_annot *annot) +{ + int i = 0, n = 0; + // may be single value or array + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + pdf_obj *optarr = pdf_dict_get(ctx, annot_obj, PDF_NAME(V)); + if (pdf_is_string(ctx, optarr)) // a single string + return PyString_FromString(pdf_to_text_string(ctx, optarr)); + + // value is an array (may have len 0) + n = pdf_array_len(ctx, optarr); + PyObject *liste = PyList_New(0); + + // extract a list of strings + // each entry may again be an array: take second entry then + for (i = 0; i < n; i++) { + pdf_obj *elem = pdf_array_get(ctx, optarr, i); + if (pdf_is_array(ctx, elem)) + elem = pdf_array_get(ctx, elem, 1); + 
LIST_APPEND_DROP(liste, JM_UnicodeFromStr(pdf_to_text_string(ctx, elem))); + } + return liste; +} + +// ComboBox retrieve value +//----------------------------------------------------------------------------- +PyObject *JM_combobox_value(fz_context *ctx, pdf_annot *annot) +{ // combobox treated like listbox + return JM_listbox_value(ctx, annot); +} + +// Signature field retrieve value +PyObject *JM_signature_value(fz_context *ctx, pdf_annot *annot) +{ // signatures are currently not supported + Py_RETURN_NONE; +} + +// retrieve ListBox / ComboBox choice values +//----------------------------------------------------------------------------- +PyObject *JM_choice_options(fz_context *ctx, pdf_annot *annot) +{ // return list of choices for list or combo boxes + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + PyObject *val; + int n = pdf_choice_widget_options(ctx, annot, 0, NULL); + if (n == 0) Py_RETURN_NONE; // wrong widget type + + pdf_obj *optarr = pdf_dict_get(ctx, annot_obj, PDF_NAME(Opt)); + int i, m; + PyObject *liste = PyList_New(0); + + for (i = 0; i < n; i++) { + m = pdf_array_len(ctx, pdf_array_get(ctx, optarr, i)); + if (m == 2) { + val = Py_BuildValue("ss", + pdf_to_text_string(ctx, pdf_array_get(ctx, pdf_array_get(ctx, optarr, i), 0)), + pdf_to_text_string(ctx, pdf_array_get(ctx, pdf_array_get(ctx, optarr, i), 1))); + LIST_APPEND_DROP(liste, val); + } else { + val = JM_UnicodeFromStr(pdf_to_text_string(ctx, pdf_array_get(ctx, optarr, i))); + LIST_APPEND_DROP(liste, val); + } + } + return liste; +} + + +// set ListBox / ComboBox values +//----------------------------------------------------------------------------- +void JM_set_choice_options(fz_context *ctx, pdf_annot *annot, PyObject *liste) +{ + if (!liste) return; + if (!PySequence_Check(liste)) return; + Py_ssize_t i, n = PySequence_Size(liste); + if (n < 1) return; + PyObject *tuple = PySequence_Tuple(liste); + PyObject *val = NULL, *val1 = NULL, *val2 = NULL; + pdf_obj *optarrsub = NULL, *optarr = NULL, *annot_obj = NULL; + pdf_document *pdf = NULL; + const char *opt = NULL, *opt1 = NULL, *opt2 = NULL; + fz_try(ctx) { + annot_obj = pdf_annot_obj(ctx, annot); + pdf = pdf_get_bound_document(ctx, annot_obj); + optarr = pdf_new_array(ctx, pdf, (int) n); + for (i = 0; i < n; i++) { + val = PyTuple_GET_ITEM(tuple, i); + opt = PyUnicode_AsUTF8(val); + if (opt) { + pdf_array_push_text_string(ctx, optarr, opt); + } else { + if (!PySequence_Check(val) || PySequence_Size(val) != 2) { + RAISEPY(ctx, "bad choice field list", PyExc_ValueError); + } + val1 = PySequence_GetItem(val, 0); + opt1 = PyUnicode_AsUTF8(val1); + if (!opt1) { + RAISEPY(ctx, "bad choice field list", PyExc_ValueError); + } + val2 = PySequence_GetItem(val, 1); + opt2 = PyUnicode_AsUTF8(val2); + if (!opt2) { + RAISEPY(ctx, "bad choice field list", PyExc_ValueError); + }; + Py_CLEAR(val1); + Py_CLEAR(val2); + optarrsub = pdf_array_push_array(ctx, optarr, 2); + pdf_array_push_text_string(ctx, optarrsub, opt1); + pdf_array_push_text_string(ctx, optarrsub, opt2); + } + } + pdf_dict_put_drop(ctx, annot_obj, PDF_NAME(Opt), optarr); + } + fz_always(ctx) { + Py_CLEAR(tuple); + Py_CLEAR(val1); + Py_CLEAR(val2); + PyErr_Clear(); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return; +} + + +//----------------------------------------------------------------------------- +// Populate a Python Widget object with the values from a PDF form field. +// Called by "Page.firstWidget" and "Widget.next". 
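+// Hedged Python-side sketch of the flow this supports (object names assumed):
+//     w = page.firstWidget       # builds a Widget, then this helper fills it
+//     print(w.field_name, w.field_type_string, w.field_value)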
+//----------------------------------------------------------------------------- +void JM_get_widget_properties(fz_context *ctx, pdf_annot *annot, PyObject *Widget) +{ + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + pdf_page *page = pdf_annot_page(ctx, annot); + pdf_document *pdf = page->doc; + pdf_annot *tw = annot; + pdf_obj *obj = NULL; + Py_ssize_t i = 0, n = 0; + fz_try(ctx) { + int field_type = pdf_widget_type(ctx, tw); + SETATTR_DROP(Widget, "field_type", Py_BuildValue("i", field_type)); + if (field_type == PDF_WIDGET_TYPE_SIGNATURE) { + if (pdf_signature_is_signed(ctx, pdf, annot_obj)) { + SETATTR("is_signed", Py_True); + } else { + SETATTR("is_signed", Py_False); + } + } else { + SETATTR("is_signed", Py_None); + } + SETATTR_DROP(Widget, "border_style", + JM_UnicodeFromStr(pdf_field_border_style(ctx, annot_obj))); + SETATTR_DROP(Widget, "field_type_string", + JM_UnicodeFromStr(JM_field_type_text(field_type))); + + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR <= 22 + char *field_name = pdf_field_name(ctx, annot_obj); + #else + char *field_name = pdf_load_field_name(ctx, annot_obj); + #endif + SETATTR_DROP(Widget, "field_name", JM_UnicodeFromStr(field_name)); + JM_Free(field_name); + + const char *label = NULL; + obj = pdf_dict_get(ctx, annot_obj, PDF_NAME(TU)); + if (obj) label = pdf_to_text_string(ctx, obj); + SETATTR_DROP(Widget, "field_label", JM_UnicodeFromStr(label)); + + const char *fvalue = NULL; + if (field_type == PDF_WIDGET_TYPE_RADIOBUTTON) { + obj = pdf_dict_get(ctx, annot_obj, PDF_NAME(Parent)); // owning RB group + if (obj) { + SETATTR_DROP(Widget, "rb_parent", Py_BuildValue("i", pdf_to_num(ctx, obj))); + } + obj = pdf_dict_get(ctx, annot_obj, PDF_NAME(AS)); + if (obj) { + fvalue = pdf_to_name(ctx, obj); + } + } + if (!fvalue) { + fvalue = pdf_field_value(ctx, annot_obj); + } + SETATTR_DROP(Widget, "field_value", JM_UnicodeFromStr(fvalue)); + + SETATTR_DROP(Widget, "field_display", + Py_BuildValue("i", pdf_field_display(ctx, annot_obj))); + + float border_width = pdf_to_real(ctx, pdf_dict_getl(ctx, annot_obj, + PDF_NAME(BS), PDF_NAME(W), NULL)); + if (border_width == 0) border_width = 1; + SETATTR_DROP(Widget, "border_width", + Py_BuildValue("f", border_width)); + + obj = pdf_dict_getl(ctx, annot_obj, + PDF_NAME(BS), PDF_NAME(D), NULL); + if (pdf_is_array(ctx, obj)) { + n = (Py_ssize_t) pdf_array_len(ctx, obj); + PyObject *d = PyList_New(n); + for (i = 0; i < n; i++) { + PyList_SET_ITEM(d, i, Py_BuildValue("i", pdf_to_int(ctx, + pdf_array_get(ctx, obj, (int) i)))); + } + SETATTR_DROP(Widget, "border_dashes", d); + } + + SETATTR_DROP(Widget, "text_maxlen", + Py_BuildValue("i", pdf_text_widget_max_len(ctx, tw))); + + SETATTR_DROP(Widget, "text_format", + Py_BuildValue("i", pdf_text_widget_format(ctx, tw))); + + obj = pdf_dict_getl(ctx, annot_obj, PDF_NAME(MK), PDF_NAME(BG), NULL); + if (pdf_is_array(ctx, obj)) { + n = (Py_ssize_t) pdf_array_len(ctx, obj); + PyObject *col = PyList_New(n); + for (i = 0; i < n; i++) { + PyList_SET_ITEM(col, i, Py_BuildValue("f", + pdf_to_real(ctx, pdf_array_get(ctx, obj, (int) i)))); + } + SETATTR_DROP(Widget, "fill_color", col); + } + + obj = pdf_dict_getl(ctx, annot_obj, PDF_NAME(MK), PDF_NAME(BC), NULL); + if (pdf_is_array(ctx, obj)) { + n = (Py_ssize_t) pdf_array_len(ctx, obj); + PyObject *col = PyList_New(n); + for (i = 0; i < n; i++) { + PyList_SET_ITEM(col, i, Py_BuildValue("f", + pdf_to_real(ctx, pdf_array_get(ctx, obj, (int) i)))); + } + SETATTR_DROP(Widget, "border_color", col); + } + + SETATTR_DROP(Widget, "choice_values", 
JM_choice_options(ctx, annot)); + + const char *da = pdf_to_text_string(ctx, pdf_dict_get_inheritable(ctx, + annot_obj, PDF_NAME(DA))); + SETATTR_DROP(Widget, "_text_da", JM_UnicodeFromStr(da)); + + obj = pdf_dict_getl(ctx, annot_obj, PDF_NAME(MK), PDF_NAME(CA), NULL); + if (obj) { + SETATTR_DROP(Widget, "button_caption", + JM_UnicodeFromStr((char *)pdf_to_text_string(ctx, obj))); + } + + SETATTR_DROP(Widget, "field_flags", + Py_BuildValue("i", pdf_field_flags(ctx, annot_obj))); + + // call Py method to reconstruct text color, font name, size + PyObject *call = CALLATTR("_parse_da", NULL); + Py_XDECREF(call); + + // extract JavaScript action texts + SETATTR_DROP(Widget, "script", + JM_get_script(ctx, pdf_dict_get(ctx, annot_obj, PDF_NAME(A)))); + + SETATTR_DROP(Widget, "script_stroke", + JM_get_script(ctx, pdf_dict_getl(ctx, annot_obj, PDF_NAME(AA), PDF_NAME(K), NULL))); + + SETATTR_DROP(Widget, "script_format", + JM_get_script(ctx, pdf_dict_getl(ctx, annot_obj, PDF_NAME(AA), PDF_NAME(F), NULL))); + + SETATTR_DROP(Widget, "script_change", + JM_get_script(ctx, pdf_dict_getl(ctx, annot_obj, PDF_NAME(AA), PDF_NAME(V), NULL))); + + SETATTR_DROP(Widget, "script_calc", + JM_get_script(ctx, pdf_dict_getl(ctx, annot_obj, PDF_NAME(AA), PDF_NAME(C), NULL))); + + SETATTR_DROP(Widget, "script_blur", + JM_get_script(ctx, pdf_dict_getl(ctx, annot_obj, PDF_NAME(AA), pdf_new_name(ctx, "Bl"), NULL))); + + SETATTR_DROP(Widget, "script_focus", + JM_get_script(ctx, pdf_dict_getl(ctx, annot_obj, PDF_NAME(AA), pdf_new_name(ctx, "Fo"), NULL))); + } + fz_always(ctx) PyErr_Clear(); + fz_catch(ctx) fz_rethrow(ctx); + return; +} + + +//----------------------------------------------------------------------------- +// Update the PDF form field with the properties from a Python Widget object. +// Called by "Page.addWidget" and "Annot.updateWidget". 
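+// Hedged Python-side sketch (object names assumed, method name as stated above):
+//     page.addWidget(widget)   # the new field's properties get written via this helper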
+//----------------------------------------------------------------------------- +void JM_set_widget_properties(fz_context *ctx, pdf_annot *annot, PyObject *Widget) +{ + pdf_page *page = pdf_annot_page(ctx, annot); + pdf_obj *annot_obj = pdf_annot_obj(ctx, annot); + pdf_document *pdf = page->doc; + fz_rect rect; + pdf_obj *fill_col = NULL, *border_col = NULL; + pdf_obj *dashes = NULL; + Py_ssize_t i, n = 0; + int d; + PyObject *value = GETATTR("field_type"); + int field_type = (int) PyInt_AsLong(value); + Py_DECREF(value); + + // rectangle -------------------------------------------------------------- + value = GETATTR("rect"); + rect = JM_rect_from_py(value); + Py_XDECREF(value); + fz_matrix rot_mat = JM_rotate_page_matrix(ctx, page); + rect = fz_transform_rect(rect, rot_mat); + pdf_set_annot_rect(ctx, annot, rect); + + // fill color ------------------------------------------------------------- + value = GETATTR("fill_color"); + if (value && PySequence_Check(value)) { + n = PySequence_Size(value); + fill_col = pdf_new_array(ctx, pdf, n); + double col = 0; + for (i = 0; i < n; i++) { + JM_FLOAT_ITEM(value, i, &col); + pdf_array_push_real(ctx, fill_col, col); + } + pdf_field_set_fill_color(ctx, annot_obj, fill_col); + pdf_drop_obj(ctx, fill_col); + } + Py_XDECREF(value); + + // dashes ----------------------------------------------------------------- + value = GETATTR("border_dashes"); + if (value && PySequence_Check(value)) { + n = PySequence_Size(value); + dashes = pdf_new_array(ctx, pdf, n); + for (i = 0; i < n; i++) { + pdf_array_push_int(ctx, dashes, + (int64_t) PyInt_AsLong(PySequence_ITEM(value, i))); + } + pdf_dict_putl_drop(ctx, annot_obj, dashes, + PDF_NAME(BS), + PDF_NAME(D), + NULL); + } + Py_XDECREF(value); + + // border color ----------------------------------------------------------- + value = GETATTR("border_color"); + if (value && PySequence_Check(value)) { + n = PySequence_Size(value); + border_col = pdf_new_array(ctx, pdf, n); + double col = 0; + for (i = 0; i < n; i++) { + JM_FLOAT_ITEM(value, i, &col); + pdf_array_push_real(ctx, border_col, col); + } + pdf_dict_putl_drop(ctx, annot_obj, border_col, + PDF_NAME(MK), + PDF_NAME(BC), + NULL); + } + Py_XDECREF(value); + + // entry ignored - may be used later + /* + int text_format = (int) PyInt_AsLong(GETATTR("text_format")); + */ + + // field label ----------------------------------------------------------- + value = GETATTR("field_label"); + if (value != Py_None) { + char *label = JM_StrAsChar(value); + pdf_dict_put_text_string(ctx, annot_obj, PDF_NAME(TU), label); + } + Py_XDECREF(value); + + // field name ------------------------------------------------------------- + value = GETATTR("field_name"); + if (value != Py_None) { + char *name = JM_StrAsChar(value); + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR <= 22 + char *old_name = pdf_field_name(ctx, annot_obj); + #else + char *old_name = pdf_load_field_name(ctx, annot_obj); + #endif + if (strcmp(name, old_name) != 0) { + pdf_dict_put_text_string(ctx, annot_obj, PDF_NAME(T), name); + } + JM_Free(old_name); + } + Py_XDECREF(value); + + // max text len ----------------------------------------------------------- + if (field_type == PDF_WIDGET_TYPE_TEXT) + { + value = GETATTR("text_maxlen"); + int text_maxlen = (int) PyInt_AsLong(value); + if (text_maxlen) { + pdf_dict_put_int(ctx, annot_obj, PDF_NAME(MaxLen), text_maxlen); + } + Py_XDECREF(value); + } + value = GETATTR("field_display"); + d = (int) PyInt_AsLong(value); + Py_XDECREF(value); + pdf_field_set_display(ctx, 
annot_obj, d); + + // choice values ---------------------------------------------------------- + if (field_type == PDF_WIDGET_TYPE_LISTBOX || + field_type == PDF_WIDGET_TYPE_COMBOBOX) { + value = GETATTR("choice_values"); + JM_set_choice_options(ctx, annot, value); + Py_XDECREF(value); + } + + // border style ----------------------------------------------------------- + value = GETATTR("border_style"); + pdf_obj *val = JM_get_border_style(ctx, value); + Py_XDECREF(value); + pdf_dict_putl_drop(ctx, annot_obj, val, + PDF_NAME(BS), + PDF_NAME(S), + NULL); + + // border width ----------------------------------------------------------- + value = GETATTR("border_width"); + float border_width = (float) PyFloat_AsDouble(value); + Py_XDECREF(value); + pdf_dict_putl_drop(ctx, annot_obj, pdf_new_real(ctx, border_width), + PDF_NAME(BS), + PDF_NAME(W), + NULL); + + // /DA string ------------------------------------------------------------- + value = GETATTR("_text_da"); + char *da = JM_StrAsChar(value); + Py_XDECREF(value); + pdf_dict_put_text_string(ctx, annot_obj, PDF_NAME(DA), da); + pdf_dict_del(ctx, annot_obj, PDF_NAME(DS)); /* not supported by MuPDF */ + pdf_dict_del(ctx, annot_obj, PDF_NAME(RC)); /* not supported by MuPDF */ + + // field flags ------------------------------------------------------------ + value = GETATTR("field_flags"); + int field_flags = (int) PyInt_AsLong(value); + Py_XDECREF(value); + if (!PyErr_Occurred()) { + if (field_type == PDF_WIDGET_TYPE_COMBOBOX) { + field_flags |= PDF_CH_FIELD_IS_COMBO; + } else if (field_type == PDF_WIDGET_TYPE_RADIOBUTTON) { + field_flags |= PDF_BTN_FIELD_IS_RADIO; + } else if (field_type == PDF_WIDGET_TYPE_BUTTON) { + field_flags |= PDF_BTN_FIELD_IS_PUSHBUTTON; + } + pdf_dict_put_int(ctx, annot_obj, PDF_NAME(Ff), field_flags); + } + + // button caption --------------------------------------------------------- + value = GETATTR("button_caption"); + char *ca = JM_StrAsChar(value); + if (ca) { + pdf_field_set_button_caption(ctx, annot_obj, ca); + } + Py_XDECREF(value); + + // script (/A) ------------------------------------------------------- + value = GETATTR("script"); + JM_put_script(ctx, annot_obj, PDF_NAME(A), NULL, value); + Py_CLEAR(value); + + // script (/AA/K) ------------------------------------------------------- + value = GETATTR("script_stroke"); + JM_put_script(ctx, annot_obj, PDF_NAME(AA), PDF_NAME(K), value); + Py_CLEAR(value); + + // script (/AA/F) ------------------------------------------------------- + value = GETATTR("script_format"); + JM_put_script(ctx, annot_obj, PDF_NAME(AA), PDF_NAME(F), value); + Py_CLEAR(value); + + // script (/AA/V) ------------------------------------------------------- + value = GETATTR("script_change"); + JM_put_script(ctx, annot_obj, PDF_NAME(AA), PDF_NAME(V), value); + Py_CLEAR(value); + + // script (/AA/C) ------------------------------------------------------- + value = GETATTR("script_calc"); + JM_put_script(ctx, annot_obj, PDF_NAME(AA), PDF_NAME(C), value); + Py_CLEAR(value); + + // script (/AA/Bl) ------------------------------------------------------ + value = GETATTR("script_blur"); + JM_put_script(ctx, annot_obj, PDF_NAME(AA), pdf_new_name(ctx, "Bl"), value); + Py_CLEAR(value); + + // script (/AA/Fo) ------------------------------------------------------ + value = GETATTR("script_focus"); + JM_put_script(ctx, annot_obj, PDF_NAME(AA), pdf_new_name(ctx, "Fo"), value); + Py_CLEAR(value); + + // field value ------------------------------------------------------------ + value = 
GETATTR("field_value"); // field value + char *text = JM_StrAsChar(value); // convert to text (may fail!) + + switch(field_type) + { + case PDF_WIDGET_TYPE_RADIOBUTTON: + if (PyObject_RichCompareBool(value, Py_False, Py_EQ)) { + pdf_set_field_value(ctx, pdf, annot_obj, "Off", 1); + pdf_dict_put_name(gctx, annot_obj, PDF_NAME(AS), "Off"); + } else { + pdf_obj *onstate = pdf_button_field_on_state(ctx, annot_obj); + if (onstate) { + const char *on = pdf_to_name(ctx, onstate); + pdf_set_field_value(ctx, pdf, annot_obj, on, 1); + pdf_dict_put_name(gctx, annot_obj, PDF_NAME(AS), on); + } else if (text) { + pdf_dict_put_name(gctx, annot_obj, PDF_NAME(AS), text); + } + } + break; + + case PDF_WIDGET_TYPE_CHECKBOX: // will always be "Yes" or "Off" + if (PyObject_RichCompareBool(value, Py_True, Py_EQ) || text && strcmp(text, "Yes")==0) { + pdf_dict_put_name(gctx, annot_obj, PDF_NAME(AS), "Yes"); + pdf_dict_put_name(gctx, annot_obj, PDF_NAME(V), "Yes"); + } else { + pdf_dict_put_name(gctx, annot_obj, PDF_NAME(AS), "Off"); + pdf_dict_put_name(gctx, annot_obj, PDF_NAME(V), "Off"); + } + break; + + default: + if (text) { + pdf_set_field_value(ctx, pdf, annot_obj, (const char *)text, 1); + if (field_type == PDF_WIDGET_TYPE_COMBOBOX || field_type == PDF_WIDGET_TYPE_LISTBOX) { + pdf_dict_del(ctx, annot_obj, PDF_NAME(I)); + } + } + } + Py_CLEAR(value); + PyErr_Clear(); + pdf_dirty_annot(ctx, annot); + pdf_set_annot_hot(ctx, annot, 1); + pdf_set_annot_active(ctx, annot, 1); + pdf_update_annot(ctx, annot); +} +#undef SETATTR +#undef GETATTR +#undef CALLATTR +%} + +%pythoncode %{ +#------------------------------------------------------------------------------ +# Class describing a PDF form field ("widget") +#------------------------------------------------------------------------------ +class Widget(object): + def __init__(self): + self.thisown = True + self.border_color = None + self.border_style = "S" + self.border_width = 0 + self.border_dashes = None + self.choice_values = None # choice fields only + self.rb_parent = None # radio buttons only: xref of owning parent + + self.field_name = None # field name + self.field_label = None # field label + self.field_value = None + self.field_flags = 0 + self.field_display = 0 + self.field_type = 0 # valid range 1 through 7 + self.field_type_string = None # field type as string + + self.fill_color = None + self.button_caption = None # button caption + self.is_signed = None # True / False if signature + self.text_color = (0, 0, 0) + self.text_font = "Helv" + self.text_fontsize = 0 + self.text_maxlen = 0 # text fields only + self.text_format = 0 # text fields only + self._text_da = "" # /DA = default apparance + + self.script = None # JavaScript (/A) + self.script_stroke = None # JavaScript (/AA/K) + self.script_format = None # JavaScript (/AA/F) + self.script_change = None # JavaScript (/AA/V) + self.script_calc = None # JavaScript (/AA/C) + self.script_blur = None # JavaScript (/AA/Bl) + self.script_focus = None # JavaScript (/AA/Fo) + + self.rect = None # annot value + self.xref = 0 # annot value + + + def _validate(self): + """Validate the class entries. 
+ """ + if (self.rect.is_infinite + or self.rect.is_empty + ): + raise ValueError("bad rect") + + if not self.field_name: + raise ValueError("field name missing") + + if self.field_label == "Unnamed": + self.field_label = None + CheckColor(self.border_color) + CheckColor(self.fill_color) + if not self.text_color: + self.text_color = (0, 0, 0) + CheckColor(self.text_color) + + if not self.border_width: + self.border_width = 0 + + if not self.text_fontsize: + self.text_fontsize = 0 + + self.border_style = self.border_style.upper()[0:1] + + # standardize content of JavaScript entries + btn_type = self.field_type in ( + PDF_WIDGET_TYPE_BUTTON, + PDF_WIDGET_TYPE_CHECKBOX, + PDF_WIDGET_TYPE_RADIOBUTTON + ) + if not self.script: + self.script = None + elif type(self.script) is not str: + raise ValueError("script content must be a string") + + # buttons cannot have the following script actions + if btn_type or not self.script_calc: + self.script_calc = None + elif type(self.script_calc) is not str: + raise ValueError("script_calc content must be a string") + + if btn_type or not self.script_change: + self.script_change = None + elif type(self.script_change) is not str: + raise ValueError("script_change content must be a string") + + if btn_type or not self.script_format: + self.script_format = None + elif type(self.script_format) is not str: + raise ValueError("script_format content must be a string") + + if btn_type or not self.script_stroke: + self.script_stroke = None + elif type(self.script_stroke) is not str: + raise ValueError("script_stroke content must be a string") + + if btn_type or not self.script_blur: + self.script_blur = None + elif type(self.script_blur) is not str: + raise ValueError("script_blur content must be a string") + + if btn_type or not self.script_focus: + self.script_focus = None + elif type(self.script_focus) is not str: + raise ValueError("script_focus content must be a string") + + self._checker() # any field_type specific checks + + + def _adjust_font(self): + """Ensure text_font is correctly spelled if empty or from our list. + + Otherwise assume the font is in an existing field. + """ + if not self.text_font: + self.text_font = "Helv" + return + doc = self.parent.parent + for f in doc.FormFonts + ["Cour", "TiRo", "Helv", "ZaDb"]: + if self.text_font.lower() == f.lower(): + self.text_font = f + return + self.text_font = "Helv" + return + + + def _parse_da(self): + """Extract font name, size and color from default appearance string (/DA object). + + Equivalent to 'pdf_parse_default_appearance' function in MuPDF's 'pdf-annot.c'. + """ + if not self._text_da: + return + font = "Helv" + fsize = 0 + col = (0, 0, 0) + dat = self._text_da.split() # split on any whitespace + for i, item in enumerate(dat): + if item == "Tf": + font = dat[i - 2][1:] + fsize = float(dat[i - 1]) + dat[i] = dat[i-1] = dat[i-2] = "" + continue + if item == "g": # unicolor text + col = [(float(dat[i - 1]))] + dat[i] = dat[i-1] = "" + continue + if item == "rg": # RGB colored text + col = [float(f) for f in dat[i - 3:i]] + dat[i] = dat[i-1] = dat[i-2] = dat[i-3] = "" + continue + self.text_font = font + self.text_fontsize = fsize + self.text_color = col + self._text_da = "" + return + + + def _checker(self): + """Any widget type checks. 
+ """ + if self.field_type not in range(1, 8): + raise ValueError("bad field type") + + + # if setting a radio button to ON, first set Off all buttons + # in the group - this is not done by MuPDF: + if self.field_type == PDF_WIDGET_TYPE_RADIOBUTTON and self.field_value not in (False, "Off") and hasattr(self, "parent"): + # so we are about setting this button to ON/True + # check other buttons in same group and set them to 'Off' + doc = self.parent.parent + kids_type, kids_value = doc.xref_get_key(self.xref, "Parent/Kids") + if kids_type == "array": + xrefs = tuple(map(int, kids_value[1:-1].replace("0 R","").split())) + for xref in xrefs: + if xref != self.xref: + doc.xref_set_key(xref, "AS", "/Off") + # the calling method will now set the intended button to on and + # will find everything prepared for correct functioning. + + + def update(self): + """Reflect Python object in the PDF. + """ + doc = self.parent.parent + self._validate() + + self._adjust_font() # ensure valid text_font name + + # now create the /DA string + self._text_da = "" + if len(self.text_color) == 3: + fmt = "{:g} {:g} {:g} rg /{f:s} {s:g} Tf" + self._text_da + elif len(self.text_color) == 1: + fmt = "{:g} g /{f:s} {s:g} Tf" + self._text_da + elif len(self.text_color) == 4: + fmt = "{:g} {:g} {:g} {:g} k /{f:s} {s:g} Tf" + self._text_da + self._text_da = fmt.format(*self.text_color, f=self.text_font, + s=self.text_fontsize) + + # if widget has a '/AA/C' script, make sure it is in the '/CO' + # array of the '/AcroForm' dictionary. + if self.script_calc: # there is a "calculation" script: + # make sure we are in the /CO array + util_ensure_widget_calc(self._annot) + + # finally update the widget + TOOLS._save_widget(self._annot, self) + self._text_da = "" + + + def button_states(self): + """Return the on/off state names for button widgets. + + A button may have 'normal' or 'pressed down' appearances. While the 'Off' + state is usually called like this, the 'On' state is often given a name + relating to the functional context. + """ + if self.field_type not in (2, 5): + return None # no button type + if hasattr(self, "parent"): # field already exists on page + doc = self.parent.parent + else: + return None + xref = self.xref + states = {"normal": None, "down": None} + APN = doc.xref_get_key(xref, "AP/N") + if APN[0] == "dict": + nstates = [] + APN = APN[1][2:-2] + apnt = APN.split("/")[1:] + for x in apnt: + nstates.append(x.split()[0]) + states["normal"] = nstates + if APN[0] == "xref": + nstates = [] + nxref = int(APN[1].split(" ")[0]) + APN = doc.xref_object(nxref) + apnt = APN.split("/")[1:] + for x in apnt: + nstates.append(x.split()[0]) + states["normal"] = nstates + APD = doc.xref_get_key(xref, "AP/D") + if APD[0] == "dict": + dstates = [] + APD = APD[1][2:-2] + apdt = APD.split("/")[1:] + for x in apdt: + dstates.append(x.split()[0]) + states["down"] = dstates + if APD[0] == "xref": + dstates = [] + dxref = int(APD[1].split(" ")[0]) + APD = doc.xref_object(dxref) + apdt = APD.split("/")[1:] + for x in apdt: + dstates.append(x.split()[0]) + states["down"] = dstates + return states + + def on_state(self): + """Return the "On" value for button widgets. + + This is useful for radio buttons mainly. Checkboxes will always return + "Yes". Radio buttons will return the string that is unequal to "Off" + as returned by method button_states(). + If the radio button is new / being created, it does not yet have an + "On" value. In this case, a warning is shown and True is returned. 
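+
+        Illustration (widget and state names are hypothetical):
+
+            >>> w = page.first_widget  # assume: an existing radio button
+            >>> w.on_state()
+            'Choice1'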
+ """ + if self.field_type not in (2, 5): + return None # no checkbox or radio button + if self.field_type == 2: + return "Yes" + bstate = self.button_states() + if bstate==None: + bstate = {} + for k in bstate.keys(): + for v in bstate[k]: + if v != "Off": + return v + print("warning: radio button has no 'On' value.") + return True + + def reset(self): + """Reset the field value to its default. + """ + TOOLS._reset_widget(self._annot) + + def __repr__(self): + return "'%s' widget on %s" % (self.field_type_string, str(self.parent)) + + def __del__(self): + if hasattr(self, "_annot"): + del self._annot + + @property + def next(self): + return self._annot.next +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-fileobj.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-fileobj.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,113 @@ +%{ +//------------------------------------- +// fz_output for Python file objects +//------------------------------------- +static void +JM_bytesio_write(fz_context *ctx, void *opaque, const void *data, size_t len) +{ // bio.write(bytes object) + PyObject *bio = opaque, *b, *name, *rc; + fz_try(ctx){ + b = PyBytes_FromStringAndSize((const char *) data, (Py_ssize_t) len); + name = PyUnicode_FromString("write"); + PyObject_CallMethodObjArgs(bio, name, b, NULL); + rc = PyErr_Occurred(); + if (rc) { + RAISEPY(ctx, "could not write to Py file obj", rc); + } + } + fz_always(ctx) { + Py_XDECREF(b); + Py_XDECREF(name); + Py_XDECREF(rc); + PyErr_Clear(); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } +} + +static void +JM_bytesio_truncate(fz_context *ctx, void *opaque) +{ // bio.truncate(bio.tell()) !!! + PyObject *bio = opaque, *trunc = NULL, *tell = NULL, *rctell= NULL, *rc = NULL; + fz_try(ctx) { + trunc = PyUnicode_FromString("truncate"); + tell = PyUnicode_FromString("tell"); + rctell = PyObject_CallMethodObjArgs(bio, tell, NULL); + PyObject_CallMethodObjArgs(bio, trunc, rctell, NULL); + rc = PyErr_Occurred(); + if (rc) { + RAISEPY(ctx, "could not truncate Py file obj", rc); + } + } + fz_always(ctx) { + Py_XDECREF(tell); + Py_XDECREF(trunc); + Py_XDECREF(rc); + Py_XDECREF(rctell); + PyErr_Clear(); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } +} + +static int64_t +JM_bytesio_tell(fz_context *ctx, void *opaque) +{ // returns bio.tell() -> int + PyObject *bio = opaque, *rc = NULL, *name = NULL; + int64_t pos = 0; + fz_try(ctx) { + name = PyUnicode_FromString("tell"); + rc = PyObject_CallMethodObjArgs(bio, name, NULL); + if (!rc) { + RAISEPY(ctx, "could not tell Py file obj", PyErr_Occurred()); + } + pos = (int64_t) PyLong_AsUnsignedLongLong(rc); + } + fz_always(ctx) { + Py_XDECREF(name); + Py_XDECREF(rc); + PyErr_Clear(); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return pos; +} + + +static void +JM_bytesio_seek(fz_context *ctx, void *opaque, int64_t off, int whence) +{ // bio.seek(off, whence=0) + PyObject *bio = opaque, *rc = NULL, *name = NULL, *pos = NULL; + fz_try(ctx) { + name = PyUnicode_FromString("seek"); + pos = PyLong_FromUnsignedLongLong((unsigned long long) off); + PyObject_CallMethodObjArgs(bio, name, pos, whence, NULL); + rc = PyErr_Occurred(); + if (rc) { + RAISEPY(ctx, "could not seek Py file obj", rc); + } + } + fz_always(ctx) { + Py_XDECREF(rc); + Py_XDECREF(name); + Py_XDECREF(pos); + PyErr_Clear(); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } +} + +fz_output * +JM_new_output_fileptr(fz_context *ctx, PyObject *bio) +{ + fz_output *out = fz_new_output(ctx, 0, bio, JM_bytesio_write, NULL, NULL); + out->seek = JM_bytesio_seek; 
+    out->tell = JM_bytesio_tell;
+    out->truncate = JM_bytesio_truncate;
+    return out;
+}
+%}
diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-geo-c.i
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src_classic/helper-geo-c.i	Mon Sep 15 11:37:51 2025 +0200
@@ -0,0 +1,243 @@
+%{
+/*
+# ------------------------------------------------------------------------
+# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com
+# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html
+#
+# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a
+# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is
+# maintained and developed by Artifex Software, Inc. https://artifex.com.
+# ------------------------------------------------------------------------
+*/
+//-----------------------------------------------------------------------------
+// Functions converting between PySequences and fitz geometry objects
+//-----------------------------------------------------------------------------
+static int
+JM_INT_ITEM(PyObject *obj, Py_ssize_t idx, int *result)
+{
+    PyObject *temp = PySequence_ITEM(obj, idx);
+    if (!temp) return 1;
+    if (PyLong_Check(temp)) {
+        *result = (int) PyLong_AsLong(temp);
+        Py_DECREF(temp);
+    } else if (PyFloat_Check(temp)) {
+        *result = (int) PyFloat_AsDouble(temp);
+        Py_DECREF(temp);
+    } else {
+        Py_DECREF(temp);
+        return 1;
+    }
+    if (PyErr_Occurred()) {
+        PyErr_Clear();
+        return 1;
+    }
+    return 0;
+}
+
+static int
+JM_FLOAT_ITEM(PyObject *obj, Py_ssize_t idx, double *result)
+{
+    PyObject *temp = PySequence_ITEM(obj, idx);
+    if (!temp) return 1;
+    *result = PyFloat_AsDouble(temp);
+    Py_DECREF(temp);
+    if (PyErr_Occurred()) {
+        PyErr_Clear();
+        return 1;
+    }
+    return 0;
+}
+
+
+static fz_point
+JM_normalize_vector(float x, float y)
+{
+    double px = x, py = y, len = (double) (x * x + y * y);
+
+    if (len != 0) {
+        len = sqrt(len);
+        px /= len;
+        py /= len;
+    }
+    return fz_make_point((float) px, (float) py);
+}
+
+
+//-----------------------------------------------------------------------------
+// PySequence to fz_rect. Default: infinite rect
+//-----------------------------------------------------------------------------
+static fz_rect
+JM_rect_from_py(PyObject *r)
+{
+    if (!r || !PySequence_Check(r) || PySequence_Size(r) != 4)
+        return fz_infinite_rect;
+    Py_ssize_t i;
+    double f[4];
+
+    for (i = 0; i < 4; i++) {
+        if (JM_FLOAT_ITEM(r, i, &f[i]) == 1) return fz_infinite_rect;
+        if (f[i] < FZ_MIN_INF_RECT) f[i] = FZ_MIN_INF_RECT;
+        if (f[i] > FZ_MAX_INF_RECT) f[i] = FZ_MAX_INF_RECT;
+    }
+
+    return fz_make_rect((float) f[0], (float) f[1], (float) f[2], (float) f[3]);
+}
+
+//-----------------------------------------------------------------------------
+// PySequence from fz_rect
+//-----------------------------------------------------------------------------
+static PyObject *
+JM_py_from_rect(fz_rect r)
+{
+    return Py_BuildValue("ffff", r.x0, r.y0, r.x1, r.y1);
+}
+
+//-----------------------------------------------------------------------------
+// PySequence to fz_irect.
Default: infinite irect +//----------------------------------------------------------------------------- +static fz_irect +JM_irect_from_py(PyObject *r) +{ + if (!r || !PySequence_Check(r) || PySequence_Size(r) != 4) + return fz_infinite_irect; + int x[4]; + Py_ssize_t i; + + for (i = 0; i < 4; i++) { + if (JM_INT_ITEM(r, i, &x[i]) == 1) return fz_infinite_irect; + if (x[i] < FZ_MIN_INF_RECT) x[i] = FZ_MIN_INF_RECT; + if (x[i] > FZ_MAX_INF_RECT) x[i] = FZ_MAX_INF_RECT; + } + + return fz_make_irect(x[0], x[1], x[2], x[3]); +} + +//----------------------------------------------------------------------------- +// PySequence from fz_irect +//----------------------------------------------------------------------------- +static PyObject * +JM_py_from_irect(fz_irect r) +{ + return Py_BuildValue("iiii", r.x0, r.y0, r.x1, r.y1); +} + + +//----------------------------------------------------------------------------- +// PySequence to fz_point. Default: (FZ_MIN_INF_RECT, FZ_MIN_INF_RECT) +//----------------------------------------------------------------------------- +static fz_point +JM_point_from_py(PyObject *p) +{ + fz_point p0 = fz_make_point(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT); + double x, y; + + if (!p || !PySequence_Check(p) || PySequence_Size(p) != 2) + return p0; + + if (JM_FLOAT_ITEM(p, 0, &x) == 1) return p0; + if (JM_FLOAT_ITEM(p, 1, &y) == 1) return p0; + if (x < FZ_MIN_INF_RECT) x = FZ_MIN_INF_RECT; + if (y < FZ_MIN_INF_RECT) y = FZ_MIN_INF_RECT; + if (x > FZ_MAX_INF_RECT) x = FZ_MAX_INF_RECT; + if (y > FZ_MAX_INF_RECT) y = FZ_MAX_INF_RECT; + + return fz_make_point((float) x, (float) y); +} + +//----------------------------------------------------------------------------- +// PySequence from fz_point +//----------------------------------------------------------------------------- +static PyObject * +JM_py_from_point(fz_point p) +{ + return Py_BuildValue("ff", p.x, p.y); +} + + +//----------------------------------------------------------------------------- +// PySequence to fz_matrix. Default: fz_identity +//----------------------------------------------------------------------------- +static fz_matrix +JM_matrix_from_py(PyObject *m) +{ + Py_ssize_t i; + double a[6]; + + if (!m || !PySequence_Check(m) || PySequence_Size(m) != 6) + return fz_identity; + + for (i = 0; i < 6; i++) + if (JM_FLOAT_ITEM(m, i, &a[i]) == 1) return fz_identity; + + return fz_make_matrix((float) a[0], (float) a[1], (float) a[2], (float) a[3], (float) a[4], (float) a[5]); +} + +//----------------------------------------------------------------------------- +// PySequence from fz_matrix +//----------------------------------------------------------------------------- +static PyObject * +JM_py_from_matrix(fz_matrix m) +{ + return Py_BuildValue("ffffff", m.a, m.b, m.c, m.d, m.e, m.f); +} + +//----------------------------------------------------------------------------- +// fz_quad from PySequence. Four floats are treated as rect. +// Else must be four pairs of floats. 
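+// Out-of-range coordinates are clamped to FZ_MIN_INF_RECT / FZ_MAX_INF_RECT;
+// any other invalid input yields the default (infinite) quad.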
+//----------------------------------------------------------------------------- +static fz_quad +JM_quad_from_py(PyObject *r) +{ + fz_quad q = fz_make_quad(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, + FZ_MAX_INF_RECT, FZ_MIN_INF_RECT, + FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, + FZ_MAX_INF_RECT, FZ_MAX_INF_RECT); + fz_point p[4]; + double test, x, y; + Py_ssize_t i; + PyObject *obj = NULL; + + if (!r || !PySequence_Check(r) || PySequence_Size(r) != 4) + return q; + + if (JM_FLOAT_ITEM(r, 0, &test) == 0) + return fz_quad_from_rect(JM_rect_from_py(r)); + + for (i = 0; i < 4; i++) { + obj = PySequence_ITEM(r, i); // next point item + if (!obj || !PySequence_Check(obj) || PySequence_Size(obj) != 2) + goto exit_result; // invalid: cancel the rest + + if (JM_FLOAT_ITEM(obj, 0, &x) == 1) goto exit_result; + if (JM_FLOAT_ITEM(obj, 1, &y) == 1) goto exit_result; + if (x < FZ_MIN_INF_RECT) x = FZ_MIN_INF_RECT; + if (y < FZ_MIN_INF_RECT) y = FZ_MIN_INF_RECT; + if (x > FZ_MAX_INF_RECT) x = FZ_MAX_INF_RECT; + if (y > FZ_MAX_INF_RECT) y = FZ_MAX_INF_RECT; + p[i] = fz_make_point((float) x, (float) y); + + Py_CLEAR(obj); + } + q.ul = p[0]; + q.ur = p[1]; + q.ll = p[2]; + q.lr = p[3]; + return q; + + exit_result:; + Py_CLEAR(obj); + return q; +} + +//----------------------------------------------------------------------------- +// PySequence from fz_quad. +//----------------------------------------------------------------------------- +static PyObject * +JM_py_from_quad(fz_quad q) +{ + return Py_BuildValue("((f,f),(f,f),(f,f),(f,f))", + q.ul.x, q.ul.y, q.ur.x, q.ur.y, + q.ll.x, q.ll.y, q.lr.x, q.lr.y); +} + +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-geo-py.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-geo-py.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1155 @@ +%pythoncode %{ + +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ + +# largest 32bit integers surviving C float conversion roundtrips +# used by MuPDF to define infinite rectangles +FZ_MIN_INF_RECT = -0x80000000 +FZ_MAX_INF_RECT = 0x7fffff80 + + +class Matrix(object): + """Matrix() - all zeros + Matrix(a, b, c, d, e, f) + Matrix(zoom-x, zoom-y) - zoom + Matrix(shear-x, shear-y, 1) - shear + Matrix(degree) - rotate + Matrix(Matrix) - new copy + Matrix(sequence) - from 'sequence'""" + def __init__(self, *args): + if not args: + self.a = self.b = self.c = self.d = self.e = self.f = 0.0 + return None + if len(args) > 6: + raise ValueError("Matrix: bad seq len") + if len(args) == 6: # 6 numbers + self.a, self.b, self.c, self.d, self.e, self.f = map(float, args) + return None + if len(args) == 1: # either an angle or a sequ + if hasattr(args[0], "__float__"): + theta = math.radians(args[0]) + c = round(math.cos(theta), 12) + s = round(math.sin(theta), 12) + self.a = self.d = c + self.b = s + self.c = -s + self.e = self.f = 0.0 + return None + else: + self.a, self.b, self.c, self.d, self.e, self.f = map(float, args[0]) + return None + if len(args) == 2 or len(args) == 3 and args[2] == 0: + self.a, self.b, self.c, self.d, self.e, self.f = float(args[0]), \ + 0.0, 0.0, float(args[1]), 0.0, 0.0 + return None + if len(args) == 3 and args[2] == 1: + self.a, self.b, self.c, self.d, self.e, self.f = 1.0, \ + float(args[1]), float(args[0]), 1.0, 0.0, 0.0 + return None + raise ValueError("Matrix: bad args") + + def invert(self, src=None): + """Calculate the inverted matrix. Return 0 if successful and replace + current one. Else return 1 and do nothing. + """ + if src is None: + dst = util_invert_matrix(self) + else: + dst = util_invert_matrix(src) + if dst[0] == 1: + return 1 + self.a, self.b, self.c, self.d, self.e, self.f = dst[1] + return 0 + + def pretranslate(self, tx, ty): + """Calculate pre translation and replace current matrix.""" + tx = float(tx) + ty = float(ty) + self.e += tx * self.a + ty * self.c + self.f += tx * self.b + ty * self.d + return self + + def prescale(self, sx, sy): + """Calculate pre scaling and replace current matrix.""" + sx = float(sx) + sy = float(sy) + self.a *= sx + self.b *= sx + self.c *= sy + self.d *= sy + return self + + def preshear(self, h, v): + """Calculate pre shearing and replace current matrix.""" + h = float(h) + v = float(v) + a, b = self.a, self.b + self.a += v * self.c + self.b += v * self.d + self.c += h * a + self.d += h * b + return self + + def prerotate(self, theta): + """Calculate pre rotation and replace current matrix.""" + theta = float(theta) + while theta < 0: theta += 360 + while theta >= 360: theta -= 360 + if abs(0 - theta) < EPSILON: + pass + + elif abs(90.0 - theta) < EPSILON: + a = self.a + b = self.b + self.a = self.c + self.b = self.d + self.c = -a + self.d = -b + + elif abs(180.0 - theta) < EPSILON: + self.a = -self.a + self.b = -self.b + self.c = -self.c + self.d = -self.d + + elif abs(270.0 - theta) < EPSILON: + a = self.a + b = self.b + self.a = -self.c + self.b = -self.d + self.c = a + self.d = b + + else: + rad = math.radians(theta) + s = math.sin(rad) + c = math.cos(rad) + a = self.a + b = self.b + self.a = c * a + s * self.c + self.b = c * b + s * self.d + self.c =-s * a + c * self.c + self.d =-s * b + c * self.d + + return self + + def concat(self, one, two): + """Multiply two matrices and replace current one.""" + if not len(one) == len(two) == 6: + raise ValueError("Matrix: bad seq len") + self.a, self.b, self.c, 
self.d, self.e, self.f = util_concat_matrix(one, two) + return self + + def __getitem__(self, i): + return (self.a, self.b, self.c, self.d, self.e, self.f)[i] + + def __setitem__(self, i, v): + v = float(v) + if i == 0: self.a = v + elif i == 1: self.b = v + elif i == 2: self.c = v + elif i == 3: self.d = v + elif i == 4: self.e = v + elif i == 5: self.f = v + else: + raise IndexError("index out of range") + return + + def __len__(self): + return 6 + + def __repr__(self): + return "Matrix" + str(tuple(self)) + + def __invert__(self): + """Calculate inverted matrix.""" + m1 = Matrix() + m1.invert(self) + return m1 + __inv__ = __invert__ + + def __mul__(self, m): + if hasattr(m, "__float__"): + return Matrix(self.a * m, self.b * m, self.c * m, + self.d * m, self.e * m, self.f * m) + m1 = Matrix(1,1) + return m1.concat(self, m) + + def __truediv__(self, m): + if hasattr(m, "__float__"): + return Matrix(self.a * 1./m, self.b * 1./m, self.c * 1./m, + self.d * 1./m, self.e * 1./m, self.f * 1./m) + m1 = util_invert_matrix(m)[1] + if not m1: + raise ZeroDivisionError("matrix not invertible") + m2 = Matrix(1,1) + return m2.concat(self, m1) + __div__ = __truediv__ + + def __add__(self, m): + if hasattr(m, "__float__"): + return Matrix(self.a + m, self.b + m, self.c + m, + self.d + m, self.e + m, self.f + m) + if len(m) != 6: + raise ValueError("Matrix: bad seq len") + return Matrix(self.a + m[0], self.b + m[1], self.c + m[2], + self.d + m[3], self.e + m[4], self.f + m[5]) + + def __sub__(self, m): + if hasattr(m, "__float__"): + return Matrix(self.a - m, self.b - m, self.c - m, + self.d - m, self.e - m, self.f - m) + if len(m) != 6: + raise ValueError("Matrix: bad seq len") + return Matrix(self.a - m[0], self.b - m[1], self.c - m[2], + self.d - m[3], self.e - m[4], self.f - m[5]) + + def __pos__(self): + return Matrix(self) + + def __neg__(self): + return Matrix(-self.a, -self.b, -self.c, -self.d, -self.e, -self.f) + + def __bool__(self): + return not (max(self) == min(self) == 0) + + def __nonzero__(self): + return not (max(self) == min(self) == 0) + + def __eq__(self, mat): + if not hasattr(mat, "__len__"): + return False + return len(mat) == 6 and bool(self - mat) is False + + def __abs__(self): + return math.sqrt(sum([c*c for c in self])) + + norm = __abs__ + + @property + def is_rectilinear(self): + """True if rectangles are mapped to rectangles.""" + return (abs(self.b) < EPSILON and abs(self.c) < EPSILON) or \ + (abs(self.a) < EPSILON and abs(self.d) < EPSILON); + + +class IdentityMatrix(Matrix): + """Identity matrix [1, 0, 0, 1, 0, 0]""" + def __init__(self): + Matrix.__init__(self, 1.0, 1.0) + def __setattr__(self, name, value): + if name in "ad": + self.__dict__[name] = 1.0 + elif name in "bcef": + self.__dict__[name] = 0.0 + else: + self.__dict__[name] = value + + def checkargs(*args): + raise NotImplementedError("Identity is readonly") + + prerotate = checkargs + preshear = checkargs + prescale = checkargs + pretranslate = checkargs + concat = checkargs + invert = checkargs + + def __repr__(self): + return "IdentityMatrix(1.0, 0.0, 0.0, 1.0, 0.0, 0.0)" + + def __hash__(self): + return hash((1,0,0,1,0,0)) + + +Identity = IdentityMatrix() + +class Point(object): + """Point() - all zeros\nPoint(x, y)\nPoint(Point) - new copy\nPoint(sequence) - from 'sequence'""" + def __init__(self, *args): + if not args: + self.x = 0.0 + self.y = 0.0 + return None + + if len(args) > 2: + raise ValueError("Point: bad seq len") + if len(args) == 2: + self.x = float(args[0]) + self.y = float(args[1]) + return 
None + if len(args) == 1: + l = args[0] + if hasattr(l, "__getitem__") is False: + raise ValueError("Point: bad args") + if len(l) != 2: + raise ValueError("Point: bad seq len") + self.x = float(l[0]) + self.y = float(l[1]) + return None + raise ValueError("Point: bad args") + + def transform(self, m): + """Replace point by its transformation with matrix-like m.""" + if len(m) != 6: + raise ValueError("Matrix: bad seq len") + self.x, self.y = util_transform_point(self, m) + return self + + @property + def unit(self): + """Unit vector of the point.""" + s = self.x * self.x + self.y * self.y + if s < EPSILON: + return Point(0,0) + s = math.sqrt(s) + return Point(self.x / s, self.y / s) + + @property + def abs_unit(self): + """Unit vector with positive coordinates.""" + s = self.x * self.x + self.y * self.y + if s < EPSILON: + return Point(0,0) + s = math.sqrt(s) + return Point(abs(self.x) / s, abs(self.y) / s) + + def distance_to(self, *args): + """Return distance to rectangle or another point.""" + if not len(args) > 0: + raise ValueError("at least one parameter must be given") + + x = args[0] + if len(x) == 2: + x = Point(x) + elif len(x) == 4: + x = Rect(x) + else: + raise ValueError("arg1 must be point-like or rect-like") + + if len(args) > 1: + unit = args[1] + else: + unit = "px" + u = {"px": (1.,1.), "in": (1.,72.), "cm": (2.54, 72.), + "mm": (25.4, 72.)} + f = u[unit][0] / u[unit][1] + + if type(x) is Point: + return abs(self - x) * f + + # from here on, x is a rectangle + # as a safeguard, make a finite copy of it + r = Rect(x.top_left, x.top_left) + r = r | x.bottom_right + if self in r: + return 0.0 + if self.x > r.x1: + if self.y >= r.y1: + return self.distance_to(r.bottom_right, unit) + elif self.y <= r.y0: + return self.distance_to(r.top_right, unit) + else: + return (self.x - r.x1) * f + elif r.x0 <= self.x <= r.x1: + if self.y >= r.y1: + return (self.y - r.y1) * f + else: + return (r.y0 - self.y) * f + else: + if self.y >= r.y1: + return self.distance_to(r.bottom_left, unit) + elif self.y <= r.y0: + return self.distance_to(r.top_left, unit) + else: + return (r.x0 - self.x) * f + + def __getitem__(self, i): + return (self.x, self.y)[i] + + def __len__(self): + return 2 + + def __setitem__(self, i, v): + v = float(v) + if i == 0: self.x = v + elif i == 1: self.y = v + else: + raise IndexError("index out of range") + return None + + def __repr__(self): + return "Point" + str(tuple(self)) + + def __pos__(self): + return Point(self) + + def __neg__(self): + return Point(-self.x, -self.y) + + def __bool__(self): + return not (max(self) == min(self) == 0) + + def __nonzero__(self): + return not (max(self) == min(self) == 0) + + def __eq__(self, p): + if not hasattr(p, "__len__"): + return False + return len(p) == 2 and bool(self - p) is False + + def __abs__(self): + return math.sqrt(self.x * self.x + self.y * self.y) + + norm = __abs__ + + def __add__(self, p): + if hasattr(p, "__float__"): + return Point(self.x + p, self.y + p) + if len(p) != 2: + raise ValueError("Point: bad seq len") + return Point(self.x + p[0], self.y + p[1]) + + def __sub__(self, p): + if hasattr(p, "__float__"): + return Point(self.x - p, self.y - p) + if len(p) != 2: + raise ValueError("Point: bad seq len") + return Point(self.x - p[0], self.y - p[1]) + + def __mul__(self, m): + if hasattr(m, "__float__"): + return Point(self.x * m, self.y * m) + p = Point(self) + return p.transform(m) + + def __truediv__(self, m): + if hasattr(m, "__float__"): + return Point(self.x * 1./m, self.y * 1./m) + m1 = 
util_invert_matrix(m)[1] + if not m1: + raise ZeroDivisionError("matrix not invertible") + p = Point(self) + return p.transform(m1) + + __div__ = __truediv__ + + def __hash__(self): + return hash(tuple(self)) + +class Rect(object): + """Rect() - all zeros + Rect(x0, y0, x1, y1) - 4 coordinates + Rect(top-left, x1, y1) - point and 2 coordinates + Rect(x0, y0, bottom-right) - 2 coordinates and point + Rect(top-left, bottom-right) - 2 points + Rect(sequ) - new from sequence or rect-like + """ + def __init__(self, *args): + self.x0, self.y0, self.x1, self.y1 = util_make_rect(args) + return None + + def normalize(self): + """Replace rectangle with its valid version.""" + if self.x1 < self.x0: + self.x0, self.x1 = self.x1, self.x0 + if self.y1 < self.y0: + self.y0, self.y1 = self.y1, self.y0 + return self + + @property + def is_empty(self): + """True if rectangle area is empty.""" + return self.x0 >= self.x1 or self.y0 >= self.y1 + + @property + def is_valid(self): + """True if rectangle is valid.""" + return self.x0 <= self.x1 and self.y0 <= self.y1 + + @property + def is_infinite(self): + """True if this is the infinite rectangle.""" + return self.x0 == self.y0 == FZ_MIN_INF_RECT and self.x1 == self.y1 == FZ_MAX_INF_RECT + + @property + def top_left(self): + """Top-left corner.""" + return Point(self.x0, self.y0) + + @property + def top_right(self): + """Top-right corner.""" + return Point(self.x1, self.y0) + + @property + def bottom_left(self): + """Bottom-left corner.""" + return Point(self.x0, self.y1) + + @property + def bottom_right(self): + """Bottom-right corner.""" + return Point(self.x1, self.y1) + + tl = top_left + tr = top_right + bl = bottom_left + br = bottom_right + + @property + def quad(self): + """Return Quad version of rectangle.""" + return Quad(self.tl, self.tr, self.bl, self.br) + + def torect(self, r): + """Return matrix that converts to target rect.""" + + r = Rect(r) + if self.is_infinite or self.is_empty or r.is_infinite or r.is_empty: + raise ValueError("rectangles must be finite and not empty") + return ( + Matrix(1, 0, 0, 1, -self.x0, -self.y0) + * Matrix(r.width / self.width, r.height / self.height) + * Matrix(1, 0, 0, 1, r.x0, r.y0) + ) + + def morph(self, p, m): + """Morph with matrix-like m and point-like p. 
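+
+        Point-like p acts as a pivot: the rect is converted to a quad, which
+        is shifted so that p becomes the origin, transformed with m, and
+        shifted back again (see Quad.morph()).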
+ + Returns a new quad.""" + if self.is_infinite: + return INFINITE_QUAD() + return self.quad.morph(p, m) + + def round(self): + """Return the IRect.""" + return IRect(util_round_rect(self)) + + irect = property(round) + + width = property(lambda self: self.x1 - self.x0 if self.x1 > self.x0 else 0) + height = property(lambda self: self.y1 - self.y0 if self.y1 > self.y0 else 0) + + def include_point(self, p): + """Extend to include point-like p.""" + if len(p) != 2: + raise ValueError("Point: bad seq len") + self.x0, self.y0, self.x1, self.y1 = util_include_point_in_rect(self, p) + return self + + def include_rect(self, r): + """Extend to include rect-like r.""" + if len(r) != 4: + raise ValueError("Rect: bad seq len") + r = Rect(r) + if r.is_infinite or self.is_infinite: + self.x0, self.y0, self.x1, self.y1 = FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, FZ_MAX_INF_RECT + elif r.is_empty: + return self + elif self.is_empty: + self.x0, self.y0, self.x1, self.y1 = r.x0, r.y0, r.x1, r.y1 + else: + self.x0, self.y0, self.x1, self.y1 = util_union_rect(self, r) + return self + + def intersect(self, r): + """Restrict to common rect with rect-like r.""" + if not len(r) == 4: + raise ValueError("Rect: bad seq len") + r = Rect(r) + if r.is_infinite: + return self + elif self.is_infinite: + self.x0, self.y0, self.x1, self.y1 = r.x0, r.y0, r.x1, r.y1 + elif r.is_empty: + self.x0, self.y0, self.x1, self.y1 = r.x0, r.y0, r.x1, r.y1 + elif self.is_empty: + return self + else: + self.x0, self.y0, self.x1, self.y1 = util_intersect_rect(self, r) + return self + + def contains(self, x): + """Check if containing point-like or rect-like x.""" + return self.__contains__(x) + + def transform(self, m): + """Replace with the transformation by matrix-like m.""" + if not len(m) == 6: + raise ValueError("Matrix: bad seq len") + self.x0, self.y0, self.x1, self.y1 = util_transform_rect(self, m) + return self + + def __getitem__(self, i): + return (self.x0, self.y0, self.x1, self.y1)[i] + + def __len__(self): + return 4 + + def __setitem__(self, i, v): + v = float(v) + if i == 0: self.x0 = v + elif i == 1: self.y0 = v + elif i == 2: self.x1 = v + elif i == 3: self.y1 = v + else: + raise IndexError("index out of range") + return None + + def __repr__(self): + return "Rect" + str(tuple(self)) + + def __pos__(self): + return Rect(self) + + def __neg__(self): + return Rect(-self.x0, -self.y0, -self.x1, -self.y1) + + def __bool__(self): + return not self.x0 == self.y0 == self.x1 == self.y1 == 0 + + def __nonzero__(self): + return not self.x0 == self.y0 == self.x1 == self.y1 == 0 + + def __eq__(self, r): + if not hasattr(r, "__len__"): + return False + return len(r) == 4 and self.x0 == r[0] and self.y0 == r[1] and self.x1 == r[2] and self.y1 == r[3] + + def __abs__(self): + if self.is_infinite or not self.is_valid: + return 0.0 + return self.width * self.height + + def norm(self): + return math.sqrt(sum([c*c for c in self])) + + def __add__(self, p): + if hasattr(p, "__float__"): + return Rect(self.x0 + p, self.y0 + p, self.x1 + p, self.y1 + p) + if len(p) != 4: + raise ValueError("Rect: bad seq len") + return Rect(self.x0 + p[0], self.y0 + p[1], self.x1 + p[2], self.y1 + p[3]) + + + def __sub__(self, p): + if hasattr(p, "__float__"): + return Rect(self.x0 - p, self.y0 - p, self.x1 - p, self.y1 - p) + if len(p) != 4: + raise ValueError("Rect: bad seq len") + return Rect(self.x0 - p[0], self.y0 - p[1], self.x1 - p[2], self.y1 - p[3]) + + + def __mul__(self, m): + if hasattr(m, "__float__"): + return Rect(self.x0 * m, 
self.y0 * m, self.x1 * m, self.y1 * m) + r = Rect(self) + r = r.transform(m) + return r + + def __truediv__(self, m): + if hasattr(m, "__float__"): + return Rect(self.x0 * 1./m, self.y0 * 1./m, self.x1 * 1./m, self.y1 * 1./m) + im = util_invert_matrix(m)[1] + if not im: + raise ZeroDivisionError("Matrix not invertible") + r = Rect(self) + r = r.transform(im) + return r + + __div__ = __truediv__ + + def __contains__(self, x): + if hasattr(x, "__float__"): + return x in tuple(self) + l = len(x) + if l == 2: + return util_is_point_in_rect(x, self) + if l == 4: + r = INFINITE_RECT() + try: + r = Rect(x) + except: + r = Quad(x).rect + return (self.x0 <= r.x0 <= r.x1 <= self.x1 and + self.y0 <= r.y0 <= r.y1 <= self.y1) + return False + + + def __or__(self, x): + if not hasattr(x, "__len__"): + raise ValueError("bad type op 2") + + r = Rect(self) + if len(x) == 2: + return r.include_point(x) + if len(x) == 4: + return r.include_rect(x) + raise ValueError("bad type op 2") + + def __and__(self, x): + if not hasattr(x, "__len__") or len(x) != 4: + raise ValueError("bad type op 2") + r = Rect(self) + return r.intersect(x) + + def intersects(self, x): + """Check if intersection with rectangle x is not empty.""" + r1 = Rect(x) + if self.is_empty or self.is_infinite or r1.is_empty or r1.is_infinite: + return False + r = Rect(self) + if r.intersect(r1).is_empty: + return False + return True + + def __hash__(self): + return hash(tuple(self)) + +class IRect(object): + """IRect() - all zeros + IRect(x0, y0, x1, y1) - 4 coordinates + IRect(top-left, x1, y1) - point and 2 coordinates + IRect(x0, y0, bottom-right) - 2 coordinates and point + IRect(top-left, bottom-right) - 2 points + IRect(sequ) - new from sequence or rect-like + """ + def __init__(self, *args): + self.x0, self.y0, self.x1, self.y1 = util_make_irect(args) + return None + + def normalize(self): + """Replace rectangle with its valid version.""" + if self.x1 < self.x0: + self.x0, self.x1 = self.x1, self.x0 + if self.y1 < self.y0: + self.y0, self.y1 = self.y1, self.y0 + return self + + @property + def is_empty(self): + """True if rectangle area is empty.""" + return self.x0 >= self.x1 or self.y0 >= self.y1 + + @property + def is_valid(self): + """True if rectangle is valid.""" + return self.x0 <= self.x1 and self.y0 <= self.y1 + + @property + def is_infinite(self): + """True if rectangle is infinite.""" + return self.x0 == self.y0 == FZ_MIN_INF_RECT and self.x1 == self.y1 == FZ_MAX_INF_RECT + + @property + def top_left(self): + """Top-left corner.""" + return Point(self.x0, self.y0) + + @property + def top_right(self): + """Top-right corner.""" + return Point(self.x1, self.y0) + + @property + def bottom_left(self): + """Bottom-left corner.""" + return Point(self.x0, self.y1) + + @property + def bottom_right(self): + """Bottom-right corner.""" + return Point(self.x1, self.y1) + + tl = top_left + tr = top_right + bl = bottom_left + br = bottom_right + + @property + def quad(self): + """Return Quad version of rectangle.""" + return Quad(self.tl, self.tr, self.bl, self.br) + + + def torect(self, r): + """Return matrix that converts to target rect.""" + + r = Rect(r) + if self.is_infinite or self.is_empty or r.is_infinite or r.is_empty: + raise ValueError("rectangles must be finite and not empty") + return ( + Matrix(1, 0, 0, 1, -self.x0, -self.y0) + * Matrix(r.width / self.width, r.height / self.height) + * Matrix(1, 0, 0, 1, r.x0, r.y0) + ) + + def morph(self, p, m): + """Morph with matrix-like m and point-like p. 
+ + Returns a new quad.""" + if self.is_infinite: + return INFINITE_QUAD() + return self.quad.morph(p, m) + + @property + def rect(self): + return Rect(self) + + width = property(lambda self: self.x1 - self.x0 if self.x1 > self.x0 else 0) + height = property(lambda self: self.y1 - self.y0 if self.y1 > self.y0 else 0) + + def include_point(self, p): + """Extend rectangle to include point p.""" + rect = self.rect.include_point(p) + return rect.irect + + def include_rect(self, r): + """Extend rectangle to include rectangle r.""" + rect = self.rect.include_rect(r) + return rect.irect + + def intersect(self, r): + """Restrict rectangle to intersection with rectangle r.""" + rect = self.rect.intersect(r) + return rect.irect + + def __getitem__(self, i): + return (self.x0, self.y0, self.x1, self.y1)[i] + + def __len__(self): + return 4 + + def __setitem__(self, i, v): + v = int(v) + if i == 0: self.x0 = v + elif i == 1: self.y0 = v + elif i == 2: self.x1 = v + elif i == 3: self.y1 = v + else: + raise IndexError("index out of range") + return None + + def __repr__(self): + return "IRect" + str(tuple(self)) + + def __pos__(self): + return IRect(self) + + def __neg__(self): + return IRect(-self.x0, -self.y0, -self.x1, -self.y1) + + def __bool__(self): + return not self.x0 == self.y0 == self.x1 == self.y1 == 0 + + def __nonzero__(self): + return not self.x0 == self.y0 == self.x1 == self.y1 == 0 + + def __eq__(self, r): + if not hasattr(r, "__len__"): + return False + return len(r) == 4 and self.x0 == r[0] and self.y0 == r[1] and self.x1 == r[2] and self.y1 == r[3] + + def __abs__(self): + if self.is_infinite or not self.is_valid: + return 0 + return self.width * self.height + + def norm(self): + return math.sqrt(sum([c*c for c in self])) + + def __add__(self, p): + return Rect.__add__(self, p).round() + + def __sub__(self, p): + return Rect.__sub__(self, p).round() + + def transform(self, m): + return Rect.transform(self, m).round() + + def __mul__(self, m): + return Rect.__mul__(self, m).round() + + def __truediv__(self, m): + return Rect.__truediv__(self, m).round() + + __div__ = __truediv__ + + + def __contains__(self, x): + return Rect.__contains__(self, x) + + + def __or__(self, x): + return Rect.__or__(self, x).round() + + def __and__(self, x): + return Rect.__and__(self, x).round() + + def intersects(self, x): + return Rect.intersects(self, x) + + def __hash__(self): + return hash(tuple(self)) + + +class Quad(object): + """Quad() - all zero points\nQuad(ul, ur, ll, lr)\nQuad(quad) - new copy\nQuad(sequence) - from 'sequence'""" + def __init__(self, *args): + if not args: + self.ul = self.ur = self.ll = self.lr = Point() + return None + + if len(args) > 4: + raise ValueError("Quad: bad seq len") + if len(args) == 4: + self.ul, self.ur, self.ll, self.lr = map(Point, args) + return None + if len(args) == 1: + l = args[0] + if hasattr(l, "__getitem__") is False: + raise ValueError("Quad: bad args") + if len(l) != 4: + raise ValueError("Quad: bad seq len") + self.ul, self.ur, self.ll, self.lr = map(Point, l) + return None + raise ValueError("Quad: bad args") + + @property + def is_rectangular(self)->bool: + """Check if quad is rectangular. + + Notes: + Some rotation matrix can thus transform it into a rectangle. + This is equivalent to three corners enclose 90 degrees. + Returns: + True or False. 
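+
+            (The check computes the sine of the angle at three of the corners
+            and requires each to equal 1, i.e. a 90 degree angle.)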
+ """ + + sine = util_sine_between(self.ul, self.ur, self.lr) + if abs(sine - 1) > EPSILON: # the sine of the angle + return False + + sine = util_sine_between(self.ur, self.lr, self.ll) + if abs(sine - 1) > EPSILON: + return False + + sine = util_sine_between(self.lr, self.ll, self.ul) + if abs(sine - 1) > EPSILON: + return False + + return True + + + @property + def is_convex(self)->bool: + """Check if quad is convex and not degenerate. + + Notes: + Check that for the two diagonals, the other two corners are not + on the same side of the diagonal. + Returns: + True or False. + """ + m = planish_line(self.ul, self.lr) # puts this diagonal on x-axis + p1 = self.ll * m # transform the + p2 = self.ur * m # other two points + if p1.y * p2.y > 0: + return False + m = planish_line(self.ll, self.ur) # puts other diagonal on x-axis + p1 = self.lr * m # tranform the + p2 = self.ul * m # remaining points + if p1.y * p2.y > 0: + return False + return True + + + width = property(lambda self: max(abs(self.ul - self.ur), abs(self.ll - self.lr))) + height = property(lambda self: max(abs(self.ul - self.ll), abs(self.ur - self.lr))) + + @property + def is_empty(self): + """Check whether all quad corners are on the same line. + + This is the case if width or height is zero. + """ + return self.width < EPSILON or self.height < EPSILON + + @property + def is_infinite(self): + """Check whether this is the infinite quad.""" + return self.rect.is_infinite + + @property + def rect(self): + r = Rect() + r.x0 = min(self.ul.x, self.ur.x, self.lr.x, self.ll.x) + r.y0 = min(self.ul.y, self.ur.y, self.lr.y, self.ll.y) + r.x1 = max(self.ul.x, self.ur.x, self.lr.x, self.ll.x) + r.y1 = max(self.ul.y, self.ur.y, self.lr.y, self.ll.y) + return r + + + def __contains__(self, x): + try: + l = x.__len__() + except: + return False + if l == 2: + return util_point_in_quad(x, self) + if l != 4: + return False + if CheckRect(x): + if Rect(x).is_empty: + return True + return util_point_in_quad(x[:2], self) and util_point_in_quad(x[2:], self) + if CheckQuad(x): + for i in range(4): + if not util_point_in_quad(x[i], self): + return False + return True + return False + + + def __getitem__(self, i): + return (self.ul, self.ur, self.ll, self.lr)[i] + + def __len__(self): + return 4 + + def __setitem__(self, i, v): + if i == 0: self.ul = Point(v) + elif i == 1: self.ur = Point(v) + elif i == 2: self.ll = Point(v) + elif i == 3: self.lr = Point(v) + else: + raise IndexError("index out of range") + return None + + def __repr__(self): + return "Quad" + str(tuple(self)) + + def __pos__(self): + return Quad(self) + + def __neg__(self): + return Quad(-self.ul, -self.ur, -self.ll, -self.lr) + + def __bool__(self): + return not self.is_empty + + def __nonzero__(self): + return not self.is_empty + + def __eq__(self, quad): + if not hasattr(quad, "__len__"): + return False + return len(quad) == 4 and ( + self.ul == quad[0] and + self.ur == quad[1] and + self.ll == quad[2] and + self.lr == quad[3] + ) + + def __abs__(self): + if self.is_empty: + return 0.0 + return abs(self.ul - self.ur) * abs(self.ul - self.ll) + + + def morph(self, p, m): + """Morph the quad with matrix-like 'm' and point-like 'p'. 
+
+        Return a new quad."""
+        if self.is_infinite:
+            return INFINITE_QUAD()
+        delta = Matrix(1, 1).pretranslate(p.x, p.y)
+        q = self * ~delta * m * delta
+        return q
+
+
+    def transform(self, m):
+        """Replace quad by its transformation with matrix m."""
+        if hasattr(m, "__float__"):
+            pass
+        elif len(m) != 6:
+            raise ValueError("Matrix: bad seq len")
+        self.ul *= m
+        self.ur *= m
+        self.ll *= m
+        self.lr *= m
+        return self
+
+    def __mul__(self, m):
+        q = Quad(self)
+        q = q.transform(m)
+        return q
+
+    def __add__(self, q):
+        if hasattr(q, "__float__"):
+            return Quad(self.ul + q, self.ur + q, self.ll + q, self.lr + q)
+        if len(q) != 4:
+            raise ValueError("Quad: bad seq len")
+        return Quad(self.ul + q[0], self.ur + q[1], self.ll + q[2], self.lr + q[3])
+
+
+    def __sub__(self, q):
+        if hasattr(q, "__float__"):
+            return Quad(self.ul - q, self.ur - q, self.ll - q, self.lr - q)
+        if len(q) != 4:
+            raise ValueError("Quad: bad seq len")
+        return Quad(self.ul - q[0], self.ur - q[1], self.ll - q[2], self.lr - q[3])
+
+
+    def __truediv__(self, m):
+        if hasattr(m, "__float__"):
+            im = 1. / m
+        else:
+            im = util_invert_matrix(m)[1]
+            if not im:
+                raise ZeroDivisionError("Matrix not invertible")
+        q = Quad(self)
+        q = q.transform(im)
+        return q
+
+    __div__ = __truediv__
+
+
+    def __hash__(self):
+        return hash(tuple(self))
+
+
+# some special geometry objects
+def EMPTY_RECT():
+    return Rect(FZ_MAX_INF_RECT, FZ_MAX_INF_RECT, FZ_MIN_INF_RECT, FZ_MIN_INF_RECT)
+
+
+def INFINITE_RECT():
+    return Rect(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, FZ_MAX_INF_RECT)
+
+
+def EMPTY_IRECT():
+    return IRect(FZ_MAX_INF_RECT, FZ_MAX_INF_RECT, FZ_MIN_INF_RECT, FZ_MIN_INF_RECT)
+
+
+def INFINITE_IRECT():
+    return IRect(FZ_MIN_INF_RECT, FZ_MIN_INF_RECT, FZ_MAX_INF_RECT, FZ_MAX_INF_RECT)
+
+
+def INFINITE_QUAD():
+    return INFINITE_RECT().quad
+
+
+def EMPTY_QUAD():
+    return EMPTY_RECT().quad
+
+
+%}
diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-globals.i
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src_classic/helper-globals.i	Mon Sep 15 11:37:51 2025 +0200
@@ -0,0 +1,53 @@
+%{
+/*
+# ------------------------------------------------------------------------
+# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com
+# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html
+#
+# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a
+# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is
+# maintained and developed by Artifex Software, Inc. https://artifex.com.
+# ------------------------------------------------------------------------ +*/ +// Global switches +// Switch for device hints = no cache +static int no_device_caching = 0; + +// Switch for computing glyph of fontsize height +static int small_glyph_heights = 0; + +// Switch for returning fontnames including subset prefix +static int subset_fontnames = 0; + +// Unset ascender / descender corrections +static int skip_quad_corrections = 0; + +// constants: error messages +static const char MSG_BAD_ANNOT_TYPE[] = "bad annot type"; +static const char MSG_BAD_APN[] = "bad or missing annot AP/N"; +static const char MSG_BAD_ARG_INK_ANNOT[] = "arg must be seq of seq of float pairs"; +static const char MSG_BAD_ARG_POINTS[] = "bad seq of points"; +static const char MSG_BAD_BUFFER[] = "bad type: 'buffer'"; +static const char MSG_BAD_COLOR_SEQ[] = "bad color sequence"; +static const char MSG_BAD_DOCUMENT[] = "cannot open broken document"; +static const char MSG_BAD_FILETYPE[] = "bad filetype"; +static const char MSG_BAD_LOCATION[] = "bad location"; +static const char MSG_BAD_OC_CONFIG[] = "bad config number"; +static const char MSG_BAD_OC_LAYER[] = "bad layer number"; +static const char MSG_BAD_OC_REF[] = "bad 'oc' reference"; +static const char MSG_BAD_PAGEID[] = "bad page id"; +static const char MSG_BAD_PAGENO[] = "bad page number(s)"; +static const char MSG_BAD_PDFROOT[] = "PDF has no root"; +static const char MSG_BAD_RECT[] = "rect is infinite or empty"; +static const char MSG_BAD_TEXT[] = "bad type: 'text'"; +static const char MSG_BAD_XREF[] = "bad xref"; +static const char MSG_COLOR_COUNT_FAILED[] = "color count failed"; +static const char MSG_FILE_OR_BUFFER[] = "need font file or buffer"; +static const char MSG_FONT_FAILED[] = "cannot create font"; +static const char MSG_IS_NO_ANNOT[] = "is no annotation"; +static const char MSG_IS_NO_IMAGE[] = "is no image"; +static const char MSG_IS_NO_PDF[] = "is no PDF"; +static const char MSG_IS_NO_DICT[] = "object is no PDF dict"; +static const char MSG_PIX_NOALPHA[] = "source pixmap has no alpha"; +static const char MSG_PIXEL_OUTSIDE[] = "pixel(s) outside image"; +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-other.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-other.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1344 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ +*/ +fz_buffer *JM_object_to_buffer(fz_context *ctx, pdf_obj *val, int a, int b); +PyObject *JM_EscapeStrFromBuffer(fz_context *ctx, fz_buffer *buff); +pdf_obj *JM_pdf_obj_from_str(fz_context *ctx, pdf_document *doc, char *src); + +// exception handling +void *JM_ReturnException(fz_context *ctx) +{ + PyErr_SetString(JM_Exc_CurrentException, fz_caught_message(ctx)); + JM_Exc_CurrentException = PyExc_RuntimeError; + return NULL; +} + + +static int LIST_APPEND_DROP(PyObject *list, PyObject *item) +{ + if (!list || !PyList_Check(list) || !item) return -2; + int rc = PyList_Append(list, item); + Py_DECREF(item); + return rc; +} + +static int DICT_SETITEM_DROP(PyObject *dict, PyObject *key, PyObject *value) +{ + if (!dict || !PyDict_Check(dict) || !key || !value) return -2; + int rc = PyDict_SetItem(dict, key, value); + Py_DECREF(value); + return rc; +} + +static int DICT_SETITEMSTR_DROP(PyObject *dict, const char *key, PyObject *value) +{ + if (!dict || !PyDict_Check(dict) || !key || !value) return -2; + int rc = PyDict_SetItemString(dict, key, value); + Py_DECREF(value); + return rc; +} + + +//-------------------------------------- +// Ensure valid journalling state +//-------------------------------------- +int JM_have_operation(fz_context *ctx, pdf_document *pdf) +{ + if (pdf->journal && !pdf_undoredo_step(ctx, pdf, 0)) { + return 0; + } + return 1; +} + +//---------------------------------- +// Set a PDF dict key to some value +//---------------------------------- +static pdf_obj +*JM_set_object_value(fz_context *ctx, pdf_obj *obj, const char *key, char *value) +{ + fz_buffer *res = NULL; + pdf_obj *new_obj = NULL, *testkey = NULL; + PyObject *skey = PyUnicode_FromString(key); // Python version of dict key + PyObject *slash = PyUnicode_FromString("/"); // PDF path separator + PyObject *list = NULL, *newval=NULL, *newstr=NULL, *nullval=NULL; + const char eyecatcher[] = "fitz: replace me!"; + pdf_document *pdf = NULL; + fz_try(ctx) + { + pdf = pdf_get_bound_document(ctx, obj); + // split PDF key at path seps and take last key part + list = PyUnicode_Split(skey, slash, -1); + Py_ssize_t len = PySequence_Size(list); + Py_ssize_t i = len - 1; + Py_DECREF(skey); + skey = PySequence_GetItem(list, i); + + PySequence_DelItem(list, i); // del the last sub-key + len = PySequence_Size(list); // remaining length + testkey = pdf_dict_getp(ctx, obj, key); // check if key already exists + if (!testkey) { + /*----------------------------------------------------------------- + No, it will be created here. But we cannot allow this happening if + indirect objects are referenced. So we check all higher level + sub-paths for indirect references. + -----------------------------------------------------------------*/ + while (len > 0) { + PyObject *t = PyUnicode_Join(slash, list); // next high level + if (pdf_is_indirect(ctx, pdf_dict_getp(ctx, obj, JM_StrAsChar(t)))) { + Py_DECREF(t); + fz_throw(ctx, FZ_ERROR_GENERIC, "path to '%s' has indirects", JM_StrAsChar(skey)); + } + PySequence_DelItem(list, len - 1); // del last sub-key + len = PySequence_Size(list); // remaining length + Py_DECREF(t); + } + } + // Insert our eyecatcher. Will create all sub-paths in the chain, or + // respectively remove old value of key-path. 
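+        // The object is then serialized to a string, the eyecatcher text is
+        // replaced by the desired 'value', and the result is parsed back
+        // into a PDF object (see below).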
+ pdf_dict_putp_drop(ctx, obj, key, pdf_new_text_string(ctx, eyecatcher)); + testkey = pdf_dict_getp(ctx, obj, key); + if (!pdf_is_string(ctx, testkey)) { + fz_throw(ctx, FZ_ERROR_GENERIC, "cannot insert value for '%s'", key); + } + const char *temp = pdf_to_text_string(ctx, testkey); + if (strcmp(temp, eyecatcher) != 0) { + fz_throw(ctx, FZ_ERROR_GENERIC, "cannot insert value for '%s'", key); + } + // read the result as a string + res = JM_object_to_buffer(ctx, obj, 1, 0); + PyObject *objstr = JM_EscapeStrFromBuffer(ctx, res); + + // replace 'eyecatcher' by desired 'value' + nullval = PyUnicode_FromFormat("/%s(%s)", JM_StrAsChar(skey), eyecatcher); + newval = PyUnicode_FromFormat("/%s %s", JM_StrAsChar(skey), value); + newstr = PyUnicode_Replace(objstr, nullval, newval, 1); + + // make PDF object from resulting string + new_obj = JM_pdf_obj_from_str(ctx, pdf, JM_StrAsChar(newstr)); + } + fz_always(ctx) { + fz_drop_buffer(ctx, res); + Py_CLEAR(skey); + Py_CLEAR(slash); + Py_CLEAR(list); + Py_CLEAR(newval); + Py_CLEAR(newstr); + Py_CLEAR(nullval); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return new_obj; +} + + +static void +JM_get_page_labels(fz_context *ctx, PyObject *liste, pdf_obj *nums) +{ + int pno, i, n = pdf_array_len(ctx, nums); + char *c = NULL; + pdf_obj *val; + fz_buffer *res = NULL; + for (i = 0; i < n; i += 2) { + pdf_obj *key = pdf_resolve_indirect(ctx, pdf_array_get(ctx, nums, i)); + pno = pdf_to_int(ctx, key); + val = pdf_resolve_indirect(ctx, pdf_array_get(ctx, nums, i + 1)); + res = JM_object_to_buffer(ctx, val, 1, 0); + fz_buffer_storage(ctx, res, &c); + LIST_APPEND_DROP(liste, Py_BuildValue("is", pno, c)); + fz_drop_buffer(ctx, res); + } +} + + +PyObject *JM_EscapeStrFromBuffer(fz_context *ctx, fz_buffer *buff) +{ + if (!buff) return EMPTY_STRING; + unsigned char *s = NULL; + size_t len = fz_buffer_storage(ctx, buff, &s); + PyObject *val = PyUnicode_DecodeRawUnicodeEscape((const char *) s, (Py_ssize_t) len, "replace"); + if (!val) { + val = EMPTY_STRING; + PyErr_Clear(); + } + return val; +} + +PyObject *JM_UnicodeFromBuffer(fz_context *ctx, fz_buffer *buff) +{ + unsigned char *s = NULL; + Py_ssize_t len = (Py_ssize_t) fz_buffer_storage(ctx, buff, &s); + PyObject *val = PyUnicode_DecodeUTF8((const char *) s, len, "replace"); + if (!val) { + val = EMPTY_STRING; + PyErr_Clear(); + } + return val; +} + +PyObject *JM_UnicodeFromStr(const char *c) +{ + if (!c) return EMPTY_STRING; + PyObject *val = Py_BuildValue("s", c); + if (!val) { + val = EMPTY_STRING; + PyErr_Clear(); + } + return val; +} + +PyObject *JM_EscapeStrFromStr(const char *c) +{ + if (!c) return EMPTY_STRING; + PyObject *val = PyUnicode_DecodeRawUnicodeEscape(c, (Py_ssize_t) strlen(c), "replace"); + if (!val) { + val = EMPTY_STRING; + PyErr_Clear(); + } + return val; +} + + +// list of valid unicodes of a fz_font +void JM_valid_chars(fz_context *ctx, fz_font *font, void *arr) +{ + FT_Face face = font->ft_face; + FT_ULong ucs; + FT_UInt gid; + long *table = (long *)arr; + fz_lock(ctx, FZ_LOCK_FREETYPE); + ucs = FT_Get_First_Char(face, &gid); + while (gid > 0) + { + if (gid < (FT_ULong)face->num_glyphs && face->num_glyphs > 0) + table[gid] = (long)ucs; + ucs = FT_Get_Next_Char(face, ucs, &gid); + } + fz_unlock(ctx, FZ_LOCK_FREETYPE); + return; +} + + +// redirect MuPDF warnings +void JM_mupdf_warning(void *user, const char *message) +{ + LIST_APPEND_DROP(JM_mupdf_warnings_store, JM_EscapeStrFromStr(message)); + if (JM_mupdf_show_warnings) { + PySys_WriteStderr("mupdf: %s\n", message); + } +} + +// redirect 
MuPDF errors +void JM_mupdf_error(void *user, const char *message) +{ + LIST_APPEND_DROP(JM_mupdf_warnings_store, JM_EscapeStrFromStr(message)); + if (JM_mupdf_show_errors) { + PySys_WriteStderr("mupdf: %s\n", message); + } +} + +// a simple tracer +void JM_TRACE(const char *id) +{ + PySys_WriteStdout("%s\n", id); +} + + +// put a warning on Python-stdout +void JM_Warning(const char *id) +{ + PySys_WriteStdout("warning: %s\n", id); +} + +#if JM_MEMORY == 1 +//----------------------------------------------------------------------------- +// The following 3 functions replace MuPDF standard memory allocation. +// This will ensure, that MuPDF memory handling becomes part of Python's +// memory management. +//----------------------------------------------------------------------------- +static void *JM_Py_Malloc(void *opaque, size_t size) +{ + void *mem = PyMem_Malloc((Py_ssize_t) size); + if (mem) return mem; + fz_throw(gctx, FZ_ERROR_MEMORY, "malloc of %zu bytes failed", size); +} + +static void *JM_Py_Realloc(void *opaque, void *old, size_t size) +{ + void *mem = PyMem_Realloc(old, (Py_ssize_t) size); + if (mem) return mem; + fz_throw(gctx, FZ_ERROR_MEMORY, "realloc of %zu bytes failed", size); +} + +static void JM_PY_Free(void *opaque, void *ptr) +{ + PyMem_Free(ptr); +} + +const fz_alloc_context JM_Alloc_Context = +{ + NULL, + JM_Py_Malloc, + JM_Py_Realloc, + JM_PY_Free +}; +#endif + +PyObject *JM_fitz_config() +{ +#if defined(TOFU) +#define have_TOFU JM_BOOL(0) +#else +#define have_TOFU JM_BOOL(1) +#endif +#if defined(TOFU_CJK) +#define have_TOFU_CJK JM_BOOL(0) +#else +#define have_TOFU_CJK JM_BOOL(1) +#endif +#if defined(TOFU_CJK_EXT) +#define have_TOFU_CJK_EXT JM_BOOL(0) +#else +#define have_TOFU_CJK_EXT JM_BOOL(1) +#endif +#if defined(TOFU_CJK_LANG) +#define have_TOFU_CJK_LANG JM_BOOL(0) +#else +#define have_TOFU_CJK_LANG JM_BOOL(1) +#endif +#if defined(TOFU_EMOJI) +#define have_TOFU_EMOJI JM_BOOL(0) +#else +#define have_TOFU_EMOJI JM_BOOL(1) +#endif +#if defined(TOFU_HISTORIC) +#define have_TOFU_HISTORIC JM_BOOL(0) +#else +#define have_TOFU_HISTORIC JM_BOOL(1) +#endif +#if defined(TOFU_SYMBOL) +#define have_TOFU_SYMBOL JM_BOOL(0) +#else +#define have_TOFU_SYMBOL JM_BOOL(1) +#endif +#if defined(TOFU_SIL) +#define have_TOFU_SIL JM_BOOL(0) +#else +#define have_TOFU_SIL JM_BOOL(1) +#endif +#if defined(TOFU_BASE14) +#define have_TOFU_BASE14 JM_BOOL(0) +#else +#define have_TOFU_BASE14 JM_BOOL(1) +#endif + PyObject *dict = PyDict_New(); + DICT_SETITEMSTR_DROP(dict, "plotter-g", JM_BOOL(FZ_PLOTTERS_G)); + DICT_SETITEMSTR_DROP(dict, "plotter-rgb", JM_BOOL(FZ_PLOTTERS_RGB)); + DICT_SETITEMSTR_DROP(dict, "plotter-cmyk", JM_BOOL(FZ_PLOTTERS_CMYK)); + DICT_SETITEMSTR_DROP(dict, "plotter-n", JM_BOOL(FZ_PLOTTERS_N)); + DICT_SETITEMSTR_DROP(dict, "pdf", JM_BOOL(FZ_ENABLE_PDF)); + DICT_SETITEMSTR_DROP(dict, "xps", JM_BOOL(FZ_ENABLE_XPS)); + DICT_SETITEMSTR_DROP(dict, "svg", JM_BOOL(FZ_ENABLE_SVG)); + DICT_SETITEMSTR_DROP(dict, "cbz", JM_BOOL(FZ_ENABLE_CBZ)); + DICT_SETITEMSTR_DROP(dict, "img", JM_BOOL(FZ_ENABLE_IMG)); + DICT_SETITEMSTR_DROP(dict, "html", JM_BOOL(FZ_ENABLE_HTML)); + DICT_SETITEMSTR_DROP(dict, "epub", JM_BOOL(FZ_ENABLE_EPUB)); + DICT_SETITEMSTR_DROP(dict, "jpx", JM_BOOL(FZ_ENABLE_JPX)); + DICT_SETITEMSTR_DROP(dict, "js", JM_BOOL(FZ_ENABLE_JS)); + DICT_SETITEMSTR_DROP(dict, "tofu", have_TOFU); + DICT_SETITEMSTR_DROP(dict, "tofu-cjk", have_TOFU_CJK); + DICT_SETITEMSTR_DROP(dict, "tofu-cjk-ext", have_TOFU_CJK_EXT); + DICT_SETITEMSTR_DROP(dict, "tofu-cjk-lang", have_TOFU_CJK_LANG); + 
DICT_SETITEMSTR_DROP(dict, "tofu-emoji", have_TOFU_EMOJI); + DICT_SETITEMSTR_DROP(dict, "tofu-historic", have_TOFU_HISTORIC); + DICT_SETITEMSTR_DROP(dict, "tofu-symbol", have_TOFU_SYMBOL); + DICT_SETITEMSTR_DROP(dict, "tofu-sil", have_TOFU_SIL); + DICT_SETITEMSTR_DROP(dict, "icc", JM_BOOL(FZ_ENABLE_ICC)); + DICT_SETITEMSTR_DROP(dict, "base14", have_TOFU_BASE14); + DICT_SETITEMSTR_DROP(dict, "py-memory", JM_BOOL(JM_MEMORY)); + return dict; +} + +//---------------------------------------------------------------------------- +// Update a color float array with values from a Python sequence. +// Any error condition is treated as a no-op. +//---------------------------------------------------------------------------- +void JM_color_FromSequence(PyObject *color, int *n, float col[4]) +{ + if (!color || color == Py_None) { + *n = -1; + return; + } + if (PyFloat_Check(color)) { // maybe just a single float + *n = 1; + float c = (float) PyFloat_AsDouble(color); + if (!INRANGE(c, 0, 1)) { + c = 1; + } + col[0] = c; + return; + } + + if (!PySequence_Check(color)) { + *n = -1; + return; + } + int len = (int) PySequence_Size(color), rc; + if (len == 0) { + *n = 0; + return; + } + if (!INRANGE(len, 1, 4) || len == 2) { + *n = -1; + return; + } + + double mcol[4] = {0,0,0,0}; // local color storage + Py_ssize_t i; + for (i = 0; i < len; i++) { + rc = JM_FLOAT_ITEM(color, i, &mcol[i]); + if (!INRANGE(mcol[i], 0, 1) || rc == 1) mcol[i] = 1; + } + + *n = len; + for (i = 0; i < len; i++) + col[i] = (float) mcol[i]; + return; +} + +// return extension for fitz image type +const char *JM_image_extension(int type) +{ + switch (type) { + case(FZ_IMAGE_FAX): return "fax"; + case(FZ_IMAGE_RAW): return "raw"; + case(FZ_IMAGE_FLATE): return "flate"; + case(FZ_IMAGE_LZW): return "lzw"; + case(FZ_IMAGE_RLD): return "rld"; + case(FZ_IMAGE_BMP): return "bmp"; + case(FZ_IMAGE_GIF): return "gif"; + case(FZ_IMAGE_JBIG2): return "jb2"; + case(FZ_IMAGE_JPEG): return "jpeg"; + case(FZ_IMAGE_JPX): return "jpx"; + case(FZ_IMAGE_JXR): return "jxr"; + case(FZ_IMAGE_PNG): return "png"; + case(FZ_IMAGE_PNM): return "pnm"; + case(FZ_IMAGE_TIFF): return "tiff"; + // case(FZ_IMAGE_PSD): return "psd"; + case(FZ_IMAGE_UNKNOWN): return "n/a"; + default: return "n/a"; + } +} + +//---------------------------------------------------------------------------- +// Turn fz_buffer into a Python bytes object +//---------------------------------------------------------------------------- +PyObject *JM_BinFromBuffer(fz_context *ctx, fz_buffer *buffer) +{ + if (!buffer) { + return PyBytes_FromString(""); + } + unsigned char *c = NULL; + size_t len = fz_buffer_storage(ctx, buffer, &c); + return PyBytes_FromStringAndSize((const char *) c, (Py_ssize_t) len); +} + +//---------------------------------------------------------------------------- +// Turn fz_buffer into a Python bytearray object +//---------------------------------------------------------------------------- +PyObject *JM_BArrayFromBuffer(fz_context *ctx, fz_buffer *buffer) +{ + if (!buffer) { + return PyByteArray_FromStringAndSize("", 0); + } + unsigned char *c = NULL; + size_t len = fz_buffer_storage(ctx, buffer, &c); + return PyByteArray_FromStringAndSize((const char *) c, (Py_ssize_t) len); +} + + +//---------------------------------------------------------------------------- +// compress char* into a new buffer +//---------------------------------------------------------------------------- +fz_buffer *JM_compress_buffer(fz_context *ctx, fz_buffer *inbuffer) +{ + fz_buffer *buf = NULL; + 
fz_try(ctx) {
+        size_t compressed_length = 0;
+        unsigned char *data = fz_new_deflated_data_from_buffer(ctx,
+                &compressed_length, inbuffer, FZ_DEFLATE_BEST);
+        if (data == NULL || compressed_length == 0)
+            return NULL;
+        buf = fz_new_buffer_from_data(ctx, data, compressed_length);
+        fz_resize_buffer(ctx, buf, compressed_length);
+    }
+    fz_catch(ctx) {
+        fz_drop_buffer(ctx, buf);
+        fz_rethrow(ctx);
+    }
+    return buf;
+}
+
+//----------------------------------------------------------------------------
+// update a stream object
+// compress stream when beneficial
+//----------------------------------------------------------------------------
+void JM_update_stream(fz_context *ctx, pdf_document *doc, pdf_obj *obj, fz_buffer *buffer, int compress)
+{
+
+    fz_buffer *nres = NULL;
+    size_t len = fz_buffer_storage(ctx, buffer, NULL);
+    size_t nlen = len;
+
+    if (compress == 1 && len > 30) { // ignore small stuff
+        nres = JM_compress_buffer(ctx, buffer);
+        nlen = fz_buffer_storage(ctx, nres, NULL);
+    }
+
+    if (nlen < len && nres && compress==1) { // was it worth the effort?
+        pdf_dict_put(ctx, obj, PDF_NAME(Filter), PDF_NAME(FlateDecode));
+        pdf_update_stream(ctx, doc, obj, nres, 1);
+    } else {
+        pdf_update_stream(ctx, doc, obj, buffer, 0);
+    }
+    fz_drop_buffer(ctx, nres);
+}
+
+//-----------------------------------------------------------------------------
+// return hex characters for n characters in input 'in'
+//-----------------------------------------------------------------------------
+void hexlify(int n, unsigned char *in, unsigned char *out)
+{
+    const unsigned char hdigit[17] = "0123456789abcdef";
+    int i, i1, i2;
+    for (i = 0; i < n; i++) {
+        i1 = in[i]>>4;
+        i2 = in[i] - i1*16;
+        out[2*i] = hdigit[i1];
+        out[2*i + 1] = hdigit[i2];
+    }
+    out[2*n] = 0;
+}
+
+//----------------------------------------------------------------------------
+// Make fz_buffer from a PyBytes, PyByteArray, or io.BytesIO object
+//----------------------------------------------------------------------------
+fz_buffer *JM_BufferFromBytes(fz_context *ctx, PyObject *stream)
+{
+    char *c = NULL;
+    PyObject *mybytes = NULL;
+    size_t len = 0;
+    fz_buffer *res = NULL;
+    fz_var(res);
+    fz_try(ctx) {
+        if (PyBytes_Check(stream)) {
+            c = PyBytes_AS_STRING(stream);
+            len = (size_t) PyBytes_GET_SIZE(stream);
+        } else if (PyByteArray_Check(stream)) {
+            c = PyByteArray_AS_STRING(stream);
+            len = (size_t) PyByteArray_GET_SIZE(stream);
+        } else if (PyObject_HasAttrString(stream, "getvalue")) {
+            // we assume here that this delivers what we expect
+            mybytes = PyObject_CallMethod(stream, "getvalue", NULL);
+            c = PyBytes_AS_STRING(mybytes);
+            len = (size_t) PyBytes_GET_SIZE(mybytes);
+        }
+        // if none of the above, c stays NULL and we return a one-byte buffer (a single newline)
+        if (c) {
+            res = fz_new_buffer_from_copied_data(ctx, (const unsigned char *) c, len);
+        } else {
+            res = fz_new_buffer(ctx, 1);
+            fz_append_byte(ctx, res, 10);
+        }
+        fz_terminate_buffer(ctx, res);
+    }
+    fz_always(ctx) {
+        Py_CLEAR(mybytes);
+        PyErr_Clear();
+    }
+    fz_catch(ctx) {
+        fz_drop_buffer(ctx, res);
+        fz_rethrow(ctx);
+    }
+    return res;
+}
+
+
+//----------------------------------------------------------------------------
+// Deep-copies a source page to the target.
+// Modified version of function of pdfmerge.c: we also copy annotations, but
+// we skip some subtypes. In addition we rotate output.
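+// The graft map passed in ensures that objects shared by several source
+// pages (fonts, images, etc.) are grafted into the target document only once.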
+//---------------------------------------------------------------------------- +static void +page_merge(fz_context *ctx, pdf_document *doc_des, pdf_document *doc_src, int page_from, int page_to, int rotate, int links, int copy_annots, pdf_graft_map *graft_map) +{ + pdf_obj *page_ref = NULL; + pdf_obj *page_dict = NULL; + pdf_obj *obj = NULL, *ref = NULL; + + // list of object types (per page) we want to copy + static pdf_obj * const known_page_objs[] = { + PDF_NAME(Contents), + PDF_NAME(Resources), + PDF_NAME(MediaBox), + PDF_NAME(CropBox), + PDF_NAME(BleedBox), + PDF_NAME(TrimBox), + PDF_NAME(ArtBox), + PDF_NAME(Rotate), + PDF_NAME(UserUnit) + }; + + int i, n; + + fz_var(ref); + fz_var(page_dict); + + fz_try(ctx) { + page_ref = pdf_lookup_page_obj(ctx, doc_src, page_from); + + // make new page dict in dest doc + page_dict = pdf_new_dict(ctx, doc_des, 4); + pdf_dict_put(ctx, page_dict, PDF_NAME(Type), PDF_NAME(Page)); + + for (i = 0; i < (int) nelem(known_page_objs); i++) { + obj = pdf_dict_get_inheritable(ctx, page_ref, known_page_objs[i]); + if (obj != NULL) { + pdf_dict_put_drop(ctx, page_dict, known_page_objs[i], pdf_graft_mapped_object(ctx, graft_map, obj)); + } + } + + // Copy annotations, but skip Link, Popup, IRT, Widget types + // If selected, remove dict keys P (parent) and Popup + if (copy_annots) { + pdf_obj *old_annots = pdf_dict_get(ctx, page_ref, PDF_NAME(Annots)); + n = pdf_array_len(ctx, old_annots); + if (n > 0) { + pdf_obj *new_annots = pdf_dict_put_array(ctx, page_dict, PDF_NAME(Annots), n); + for (i = 0; i < n; i++) { + pdf_obj *o = pdf_array_get(ctx, old_annots, i); + if (!pdf_is_dict(ctx, o)) continue; // skip non-dict items + if (pdf_dict_get(ctx, o, PDF_NAME(IRT))) continue; + pdf_obj *subtype = pdf_dict_get(ctx, o, PDF_NAME(Subtype)); + if (pdf_name_eq(ctx, subtype, PDF_NAME(Link))) continue; + if (pdf_name_eq(ctx, subtype, PDF_NAME(Popup))) continue; + if (pdf_name_eq(ctx, subtype, PDF_NAME(Widget))) continue; + pdf_dict_del(ctx, o, PDF_NAME(Popup)); + pdf_dict_del(ctx, o, PDF_NAME(P)); + pdf_obj *copy_o = pdf_graft_mapped_object(ctx, graft_map, o); + pdf_obj *annot = pdf_new_indirect(ctx, doc_des, + pdf_to_num(ctx, copy_o), 0); + pdf_array_push_drop(ctx, new_annots, annot); + pdf_drop_obj(ctx, copy_o); + } + } + } + // rotate the page + if (rotate != -1) { + pdf_dict_put_int(ctx, page_dict, PDF_NAME(Rotate), (int64_t) rotate); + } + // Now add the page dictionary to dest PDF + ref = pdf_add_object(ctx, doc_des, page_dict); + + // Insert new page at specified location + pdf_insert_page(ctx, doc_des, page_to, ref); + + } + fz_always(ctx) { + pdf_drop_obj(ctx, page_dict); + pdf_drop_obj(ctx, ref); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } +} + +//----------------------------------------------------------------------------- +// Copy a range of pages (spage, epage) from a source PDF to a specified +// location (apage) of the target PDF. +// If spage > epage, the sequence of source pages is reversed. 
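+// A typical call might look like this (sketch only - 'dst', 'src' and 'gmap'
+// are assumed to exist, with 'gmap' built via pdf_new_graft_map(ctx, dst)):
+//     JM_merge_range(ctx, dst, src, 0, pdf_count_pages(ctx, src) - 1,
+//                    pdf_count_pages(ctx, dst), -1, 1, 1, 0, gmap);
+// This appends all pages of 'src' to 'dst'; rotate=-1 keeps the source
+// rotation and show_progress=0 suppresses progress messages.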
+//----------------------------------------------------------------------------- +void JM_merge_range(fz_context *ctx, pdf_document *doc_des, pdf_document *doc_src, int spage, int epage, int apage, int rotate, int links, int annots, int show_progress, pdf_graft_map *graft_map) +{ + int page, afterpage; + afterpage = apage; + int counter = 0; // copied pages counter + int total = fz_absi(epage - spage) + 1; // total pages to copy + + fz_try(ctx) { + if (spage < epage) { + for (page = spage; page <= epage; page++, afterpage++) { + page_merge(ctx, doc_des, doc_src, page, afterpage, rotate, links, annots, graft_map); + counter++; + if (show_progress > 0 && counter % show_progress == 0) { + PySys_WriteStdout("Inserted %i of %i pages.\n", counter, total); + } + } + } else { + for (page = spage; page >= epage; page--, afterpage++) { + page_merge(ctx, doc_des, doc_src, page, afterpage, rotate, links, annots, graft_map); + counter++; + if (show_progress > 0 && counter % show_progress == 0) { + PySys_WriteStdout("Inserted %i of %i pages.\n", counter, total); + } + } + } + } + + fz_catch(ctx) { + fz_rethrow(ctx); + } +} + +//---------------------------------------------------------------------------- +// Return list of outline xref numbers. Recursive function. Arguments: +// 'obj' first OL item +// 'xrefs' empty Python list +//---------------------------------------------------------------------------- +PyObject *JM_outline_xrefs(fz_context *ctx, pdf_obj *obj, PyObject *xrefs) +{ + pdf_obj *first, *parent, *thisobj; + if (!obj) return xrefs; + PyObject *newxref = NULL; + thisobj = obj; + while (thisobj) { + newxref = PyLong_FromLong((long) pdf_to_num(ctx, thisobj)); + if (PySequence_Contains(xrefs, newxref) || + pdf_dict_get(ctx, thisobj, PDF_NAME(Type))) { + // circular ref or top of chain: terminate + Py_DECREF(newxref); + break; + } + LIST_APPEND_DROP(xrefs, newxref); + first = pdf_dict_get(ctx, thisobj, PDF_NAME(First)); // try go down + if (pdf_is_dict(ctx, first)) xrefs = JM_outline_xrefs(ctx, first, xrefs); + thisobj = pdf_dict_get(ctx, thisobj, PDF_NAME(Next)); // try go next + parent = pdf_dict_get(ctx, thisobj, PDF_NAME(Parent)); // get parent + if (!pdf_is_dict(ctx, thisobj)) { + thisobj = parent; + } + } + return xrefs; +} + + +//------------------------------------------------------------------- +// Return the contents of a font file, identified by xref +//------------------------------------------------------------------- +fz_buffer *JM_get_fontbuffer(fz_context *ctx, pdf_document *doc, int xref) +{ + if (xref < 1) return NULL; + pdf_obj *o, *obj = NULL, *desft, *stream = NULL; + o = pdf_load_object(ctx, doc, xref); + desft = pdf_dict_get(ctx, o, PDF_NAME(DescendantFonts)); + if (desft) { + obj = pdf_resolve_indirect(ctx, pdf_array_get(ctx, desft, 0)); + obj = pdf_dict_get(ctx, obj, PDF_NAME(FontDescriptor)); + } else { + obj = pdf_dict_get(ctx, o, PDF_NAME(FontDescriptor)); + } + + if (!obj) { + pdf_drop_obj(ctx, o); + PySys_WriteStdout("invalid font - FontDescriptor missing"); + return NULL; + } + pdf_drop_obj(ctx, o); + o = obj; + + obj = pdf_dict_get(ctx, o, PDF_NAME(FontFile)); + if (obj) stream = obj; // ext = "pfa" + + obj = pdf_dict_get(ctx, o, PDF_NAME(FontFile2)); + if (obj) stream = obj; // ext = "ttf" + + obj = pdf_dict_get(ctx, o, PDF_NAME(FontFile3)); + if (obj) { + stream = obj; + + obj = pdf_dict_get(ctx, obj, PDF_NAME(Subtype)); + if (obj && !pdf_is_name(ctx, obj)) { + PySys_WriteStdout("invalid font descriptor subtype"); + return NULL; + } + + if (pdf_name_eq(ctx, obj, 
PDF_NAME(Type1C))) + ; /*Prev code did: ext = "cff", but this has no effect. */ + else if (pdf_name_eq(ctx, obj, PDF_NAME(CIDFontType0C))) + ; /*Prev code did: ext = "cid", but this has no effect. */ + else if (pdf_name_eq(ctx, obj, PDF_NAME(OpenType))) + ; /*Prev code did: ext = "otf", but this has no effect. */ + else + PySys_WriteStdout("warning: unhandled font type '%s'", pdf_to_name(ctx, obj)); + } + + if (!stream) { + PySys_WriteStdout("warning: unhandled font type"); + return NULL; + } + + return pdf_load_stream(ctx, stream); +} + +//----------------------------------------------------------------------------- +// Return the file extension of a font file, identified by xref +//----------------------------------------------------------------------------- +char *JM_get_fontextension(fz_context *ctx, pdf_document *doc, int xref) +{ + if (xref < 1) return "n/a"; + pdf_obj *o, *obj = NULL, *desft; + o = pdf_load_object(ctx, doc, xref); + desft = pdf_dict_get(ctx, o, PDF_NAME(DescendantFonts)); + if (desft) { + obj = pdf_resolve_indirect(ctx, pdf_array_get(ctx, desft, 0)); + obj = pdf_dict_get(ctx, obj, PDF_NAME(FontDescriptor)); + } else { + obj = pdf_dict_get(ctx, o, PDF_NAME(FontDescriptor)); + } + + pdf_drop_obj(ctx, o); + if (!obj) return "n/a"; // this is a base-14 font + + o = obj; // we have the FontDescriptor + + obj = pdf_dict_get(ctx, o, PDF_NAME(FontFile)); + if (obj) return "pfa"; + + obj = pdf_dict_get(ctx, o, PDF_NAME(FontFile2)); + if (obj) return "ttf"; + + obj = pdf_dict_get(ctx, o, PDF_NAME(FontFile3)); + if (obj) { + obj = pdf_dict_get(ctx, obj, PDF_NAME(Subtype)); + if (obj && !pdf_is_name(ctx, obj)) { + PySys_WriteStdout("invalid font descriptor subtype"); + return "n/a"; + } + if (pdf_name_eq(ctx, obj, PDF_NAME(Type1C))) + return "cff"; + else if (pdf_name_eq(ctx, obj, PDF_NAME(CIDFontType0C))) + return "cid"; + else if (pdf_name_eq(ctx, obj, PDF_NAME(OpenType))) + return "otf"; + else + PySys_WriteStdout("unhandled font type '%s'", pdf_to_name(ctx, obj)); + } + + return "n/a"; +} + + +//----------------------------------------------------------------------------- +// create PDF object from given string (new in v1.14.0: MuPDF dropped it) +//----------------------------------------------------------------------------- +pdf_obj *JM_pdf_obj_from_str(fz_context *ctx, pdf_document *doc, char *src) +{ + pdf_obj *result = NULL; + pdf_lexbuf lexbuf; + fz_stream *stream = fz_open_memory(ctx, (unsigned char *)src, strlen(src)); + + pdf_lexbuf_init(ctx, &lexbuf, PDF_LEXBUF_SMALL); + + fz_try(ctx) { + result = pdf_parse_stm_obj(ctx, doc, stream, &lexbuf); + } + + fz_always(ctx) { + pdf_lexbuf_fin(ctx, &lexbuf); + fz_drop_stream(ctx, stream); + } + + fz_catch(ctx) { + fz_rethrow(ctx); + } + + return result; + +} + +//---------------------------------------------------------------------------- +// return normalized /Rotate value:one of 0, 90, 180, 270 +//---------------------------------------------------------------------------- +int JM_norm_rotation(int rotate) +{ + while (rotate < 0) rotate += 360; + while (rotate >= 360) rotate -= 360; + if (rotate % 90 != 0) return 0; + return rotate; +} + + +//---------------------------------------------------------------------------- +// return a PDF page's /Rotate value: one of (0, 90, 180, 270) +//---------------------------------------------------------------------------- +int JM_page_rotation(fz_context *ctx, pdf_page *page) +{ + int rotate = 0; + fz_try(ctx) + { + rotate = pdf_to_int(ctx, + pdf_dict_get_inheritable(ctx, page->obj, 
PDF_NAME(Rotate))); + rotate = JM_norm_rotation(rotate); + } + fz_catch(ctx) return 0; + return rotate; +} + + +//---------------------------------------------------------------------------- +// return a PDF page's MediaBox +//---------------------------------------------------------------------------- +fz_rect JM_mediabox(fz_context *ctx, pdf_obj *page_obj) +{ + fz_rect mediabox, page_mediabox; + + mediabox = pdf_to_rect(ctx, pdf_dict_get_inheritable(ctx, page_obj, + PDF_NAME(MediaBox))); + if (fz_is_empty_rect(mediabox) || fz_is_infinite_rect(mediabox)) + { + mediabox.x0 = 0; + mediabox.y0 = 0; + mediabox.x1 = 612; + mediabox.y1 = 792; + } + + page_mediabox.x0 = fz_min(mediabox.x0, mediabox.x1); + page_mediabox.y0 = fz_min(mediabox.y0, mediabox.y1); + page_mediabox.x1 = fz_max(mediabox.x0, mediabox.x1); + page_mediabox.y1 = fz_max(mediabox.y0, mediabox.y1); + + if (page_mediabox.x1 - page_mediabox.x0 < 1 || + page_mediabox.y1 - page_mediabox.y0 < 1) + page_mediabox = fz_unit_rect; + + return page_mediabox; +} + + +//---------------------------------------------------------------------------- +// return a PDF page's CropBox +//---------------------------------------------------------------------------- +fz_rect JM_cropbox(fz_context *ctx, pdf_obj *page_obj) +{ + fz_rect mediabox = JM_mediabox(ctx, page_obj); + fz_rect cropbox = pdf_to_rect(ctx, + pdf_dict_get_inheritable(ctx, page_obj, PDF_NAME(CropBox))); + if (fz_is_infinite_rect(cropbox) || fz_is_empty_rect(cropbox)) + cropbox = mediabox; + float y0 = mediabox.y1 - cropbox.y1; + float y1 = mediabox.y1 - cropbox.y0; + cropbox.y0 = y0; + cropbox.y1 = y1; + return cropbox; +} + + +//---------------------------------------------------------------------------- +// calculate width and height of the UNROTATED page +//---------------------------------------------------------------------------- +fz_point JM_cropbox_size(fz_context *ctx, pdf_obj *page_obj) +{ + fz_point size; + fz_try(ctx) + { + fz_rect rect = JM_cropbox(ctx, page_obj); + float w = (rect.x0 < rect.x1 ? rect.x1 - rect.x0 : rect.x0 - rect.x1); + float h = (rect.y0 < rect.y1 ? 
rect.y1 - rect.y0 : rect.y0 - rect.y1); + size = fz_make_point(w, h); + } + fz_catch(ctx) fz_rethrow(ctx); + return size; +} + + +//---------------------------------------------------------------------------- +// calculate page rotation matrices +//---------------------------------------------------------------------------- +fz_matrix JM_rotate_page_matrix(fz_context *ctx, pdf_page *page) +{ + if (!page) return fz_identity; // no valid pdf page given + int rotation = JM_page_rotation(ctx, page); + if (rotation == 0) return fz_identity; // no rotation + fz_matrix m; + fz_point cb_size = JM_cropbox_size(ctx, page->obj); + float w = cb_size.x; + float h = cb_size.y; + if (rotation == 90) + m = fz_make_matrix(0, 1, -1, 0, h, 0); + else if (rotation == 180) + m = fz_make_matrix(-1, 0, 0, -1, w, h); + else + m = fz_make_matrix(0, -1, 1, 0, 0, w); + return m; +} + + +fz_matrix JM_derotate_page_matrix(fz_context *ctx, pdf_page *page) +{ // just the inverse of rotation + return fz_invert_matrix(JM_rotate_page_matrix(ctx, page)); +} + + +//----------------------------------------------------------------------------- +// Insert a font in a PDF +//----------------------------------------------------------------------------- +PyObject * +JM_insert_font(fz_context *ctx, pdf_document *pdf, char *bfname, char *fontfile, + PyObject *fontbuffer, int set_simple, int idx, int wmode, int serif, + int encoding, int ordering) +{ + pdf_obj *font_obj = NULL; + fz_font *font = NULL; + fz_buffer *res = NULL; + const unsigned char *data = NULL; + int size, ixref = 0, index = 0, simple = 0; + PyObject *value=NULL, *name=NULL, *subt=NULL, *exto = NULL; + + fz_var(exto); + fz_var(name); + fz_var(subt); + fz_var(res); + fz_var(font); + fz_var(font_obj); + fz_try(ctx) { + ENSURE_OPERATION(ctx, pdf); + //------------------------------------------------------------- + // check for CJK font + //------------------------------------------------------------- + if (ordering > -1) { + data = fz_lookup_cjk_font(ctx, ordering, &size, &index); + } + if (data) { + font = fz_new_font_from_memory(ctx, NULL, data, size, index, 0); + font_obj = pdf_add_cjk_font(ctx, pdf, font, ordering, wmode, serif); + exto = JM_UnicodeFromStr("n/a"); + simple = 0; + goto weiter; + } + + //------------------------------------------------------------- + // check for PDF Base-14 font + //------------------------------------------------------------- + if (bfname) { + data = fz_lookup_base14_font(ctx, bfname, &size); + } + if (data) { + font = fz_new_font_from_memory(ctx, bfname, data, size, 0, 0); + font_obj = pdf_add_simple_font(ctx, pdf, font, encoding); + exto = JM_UnicodeFromStr("n/a"); + simple = 1; + goto weiter; + } + + if (fontfile) { + font = fz_new_font_from_file(ctx, NULL, fontfile, idx, 0); + } else { + res = JM_BufferFromBytes(ctx, fontbuffer); + if (!res) { + RAISEPY(ctx, MSG_FILE_OR_BUFFER, PyExc_ValueError); + } + font = fz_new_font_from_buffer(ctx, NULL, res, idx, 0); + } + + if (!set_simple) { + font_obj = pdf_add_cid_font(ctx, pdf, font); + simple = 0; + } else { + font_obj = pdf_add_simple_font(ctx, pdf, font, encoding); + simple = 2; + } + + weiter: ; + ixref = pdf_to_num(ctx, font_obj); + name = JM_EscapeStrFromStr(pdf_to_name(ctx, + pdf_dict_get(ctx, font_obj, PDF_NAME(BaseFont)))); + + subt = JM_UnicodeFromStr(pdf_to_name(ctx, + pdf_dict_get(ctx, font_obj, PDF_NAME(Subtype)))); + + if (!exto) + exto = JM_UnicodeFromStr(JM_get_fontextension(ctx, pdf, ixref)); + + float asc = fz_font_ascender(ctx, font); + float dsc = 
fz_font_descender(ctx, font); + value = Py_BuildValue("[i,{s:O,s:O,s:O,s:O,s:i,s:f,s:f}]", + ixref, + "name", name, // base font name + "type", subt, // subtype + "ext", exto, // file extension + "simple", JM_BOOL(simple), // simple font? + "ordering", ordering, // CJK font? + "ascender", asc, + "descender", dsc + ); + } + fz_always(ctx) { + Py_CLEAR(exto); + Py_CLEAR(name); + Py_CLEAR(subt); + fz_drop_buffer(ctx, res); + fz_drop_font(ctx, font); + pdf_drop_obj(ctx, font_obj); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return value; +} + + +//----------------------------------------------------------------------------- +// compute image insertion matrix +//----------------------------------------------------------------------------- +fz_matrix +calc_image_matrix(int width, int height, PyObject *tr, int rotate, int keep) +{ + float large, small, fw, fh, trw, trh, f, w, h; + fz_rect trect = JM_rect_from_py(tr); + fz_matrix rot = fz_rotate((float) rotate); + trw = trect.x1 - trect.x0; + trh = trect.y1 - trect.y0; + w = trw; + h = trh; + if (keep) { + large = (float) Py_MAX(width, height); + fw = (float) width / large; + fh = (float) height / large; + } else { + fw = fh = 1; + } + small = Py_MIN(fw, fh); + if (rotate != 0 && rotate != 180) { + f = fw; + fw = fh; + fh = f; + } + if (fw < 1) { + if ((trw / fw) > (trh / fh)) { + w = trh * small; + h = trh; + } else { + w = trw; + h = trw / small; + } + } else if (fw != fh) { + if ((trw / fw) > (trh / fh)) { + w = trh / small; + h = trh; + } else { + w = trw; + h = trw * small; + } + } else { + w = trw; + h = trh; + } + fz_point tmp = fz_make_point((trect.x0 + trect.x1) / 2, + (trect.y0 + trect.y1) / 2); + fz_matrix mat = fz_make_matrix(1, 0, 0, 1, -0.5, -0.5); + mat = fz_concat(mat, rot); + mat = fz_concat(mat, fz_scale(w, h)); + mat = fz_concat(mat, fz_translate(tmp.x, tmp.y)); + return mat; +} + +// -------------------------------------------------------- +// Callback function for the Story class +// -------------------------------------------------------- +static PyObject *make_story_elpos = NULL; // Py function returning object +void Story_Callback(fz_context *ctx, void *opaque, fz_story_element_position *pos) +{ +#define SETATTR(a, v) PyObject_SetAttrString(arg, a, v);Py_DECREF(v) + // ------------------------------------------------------------------------ + // 'opaque' is a tuple (userfunc, userdict), where 'userfunc' is a function + // in the user's script and 'userdict' is a dictionary containing any + // additional parameters of the user + // userfunc will be called with the joined info of userdict and pos. 
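+    // The attributes set on the ElementPosition object below are depth,
+    // heading, id, rect, text, open_close, rect_num and href; every key of
+    // 'userdict' is added as a further attribute before userfunc is called.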
+ // ------------------------------------------------------------------------ + PyObject *callarg = (PyObject *) opaque; + PyObject *userfunc = PyTuple_GET_ITEM(callarg, 0); + PyObject *userdict = PyTuple_GET_ITEM(callarg, 1); + + PyObject *this_module = PyImport_AddModule("fitz"); // get our module + if (!make_story_elpos) { // locate ElementPosition maker once + make_story_elpos = Py_BuildValue("s", "make_story_elpos"); + } + // get access to ElementPosition() object + PyObject *arg = PyObject_CallMethodObjArgs(this_module, make_story_elpos, NULL); + Py_INCREF(arg); + SETATTR("depth", Py_BuildValue("i", pos->depth)); + SETATTR("heading", Py_BuildValue("i", pos->heading)); + SETATTR("id", Py_BuildValue("s", pos->id)); + SETATTR("rect", JM_py_from_rect(pos->rect)); + SETATTR("text", Py_BuildValue("s", pos->text)); + SETATTR("open_close", Py_BuildValue("i", pos->open_close)); + SETATTR("rect_num", Py_BuildValue("i", pos->rectangle_num)); + SETATTR("href", Py_BuildValue("s", pos->href)); + + // iterate over userdict items and set their attributes + PyObject *pkey = NULL; + PyObject *pval = NULL; + Py_ssize_t ppos = 0; + while (PyDict_Next(userdict, &ppos, &pkey, &pval)) { + PyObject_SetAttr(arg, pkey, pval); + } + PyObject_CallFunctionObjArgs(userfunc, arg, NULL); +#undef SETATTR +} + +// ----------------------------------------------------------- +// Return last archive if it is a tree and mount points match +// ----------------------------------------------------------- +fz_archive *JM_last_tree(fz_context *ctx, fz_archive *arch, const char *mount) +{ + typedef struct + { + fz_archive *arch; + char *dir; + } multi_archive_entry; + + typedef struct + { + fz_archive super; + int len; + int max; + multi_archive_entry *sub; + } fz_multi_archive; + + if (!arch) { + return NULL; + } + + fz_multi_archive *multi = (fz_multi_archive *) arch; + if (multi->len == 0) { // archive is empty + return NULL; + } + int i = multi->len - 1; // read last sub archive + multi_archive_entry *e = &multi->sub[i]; + fz_archive *arch_ = e->arch; + const char *mount_ = e->dir; + const char *fmt = fz_archive_format(ctx, arch_); + if (strcmp(fmt, "tree") != 0) { // not a tree archive + return NULL; + } + if ((mount_ && mount && strcmp(mount, mount_) == 0) || (!mount && !mount_)) { // last sub archive is eligible! 
+ return arch_; + } + return NULL; +} + +fz_archive *JM_archive_from_py(fz_context *ctx, fz_archive *arch, PyObject *path, const char *mount, int *drop_sub) +{ + fz_stream *stream = NULL; + fz_buffer *buff = NULL; + *drop_sub = 1; + fz_archive *sub = NULL; + const char *my_mount = mount; + fz_try(ctx) { + // tree archive: tuple of memory items + // check if we can add to last sub-archive + sub = JM_last_tree(ctx, arch, my_mount); + if (!sub) { + sub = fz_new_tree_archive(ctx, NULL); + } else { + *drop_sub = 0; // never drop last sub-archive + } + + // a single tree item + if (PyBytes_Check(path) || PyByteArray_Check(path) || PyObject_HasAttrString(path, "getvalue")) { + buff = JM_BufferFromBytes(ctx, path); + fz_tree_archive_add_buffer(ctx, sub, mount, buff); + goto finished; + } + + // a tuple of tree items + Py_ssize_t i, n = PyTuple_Size(path); + for (i = 0; i < n; i++) { + PyObject *item = PyTuple_GET_ITEM(path, i); + PyObject *i0 = PySequence_GetItem(item, 0); // data + PyObject *i1 = PySequence_GetItem(item, 1); // name + buff = JM_BufferFromBytes(ctx, i0); + fz_tree_archive_add_buffer(ctx, sub, PyUnicode_AsUTF8(i1), buff); + fz_drop_buffer(ctx, buff); + Py_DECREF(i0); + Py_DECREF(i1); + } + buff = NULL; + goto finished; + + finished:; + } + + fz_always(ctx) { + fz_drop_buffer(ctx, buff); + fz_drop_stream(ctx, stream); + } + + fz_catch(ctx) { + fz_rethrow(ctx); + } + + return sub; +} + + +int JM_rects_overlap(const fz_rect a, const fz_rect b) +{ + if (0 + || a.x0 >= b.x1 + || a.y0 >= b.y1 + || a.x1 <= b.x0 + || a.y1 <= b.y0 + ) + return 0; + return 1; +} + +//----------------------------------------------------------------------------- +// dummy structure for various tools and utilities +//----------------------------------------------------------------------------- +struct Tools {int index;}; + +typedef struct fz_item fz_item; + +struct fz_item +{ + void *key; + fz_storable *val; + size_t size; + fz_item *next; + fz_item *prev; + fz_store *store; + const fz_store_type *type; +}; + +struct fz_store +{ + int refs; + + /* Every item in the store is kept in a doubly linked list, ordered + * by usage (so LRU entries are at the end). */ + fz_item *head; + fz_item *tail; + + /* We have a hash table that allows to quickly find a subset of the + * entries (those whose keys are indirect objects). */ + fz_hash_table *hash; + + /* We keep track of the size of the store, and keep it below max. */ + size_t max; + size_t size; + + int defer_reap_count; + int needs_reaping; +}; + +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-pdfinfo.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-pdfinfo.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,612 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ +*/ +//------------------------------------------------------------------------ +// Store ID in PDF trailer +//------------------------------------------------------------------------ +void JM_ensure_identity(fz_context *ctx, pdf_document *pdf) +{ + unsigned char rnd[16]; + pdf_obj *id; + id = pdf_dict_get(ctx, pdf_trailer(ctx, pdf), PDF_NAME(ID)); + if (!id) { + fz_memrnd(ctx, rnd, nelem(rnd)); + id = pdf_dict_put_array(ctx, pdf_trailer(ctx, pdf), PDF_NAME(ID), 2); + pdf_array_push_drop(ctx, id, pdf_new_string(ctx, (char *) rnd + 0, nelem(rnd))); + pdf_array_push_drop(ctx, id, pdf_new_string(ctx, (char *) rnd + 0, nelem(rnd))); + } +} + + +//------------------------------------------------------------------------ +// Ensure OCProperties, return /OCProperties key +//------------------------------------------------------------------------ +pdf_obj * +JM_ensure_ocproperties(fz_context *ctx, pdf_document *pdf) +{ + pdf_obj *D, *ocp; + fz_try(ctx) { + ocp = pdf_dict_get(ctx, pdf_dict_get(ctx, pdf_trailer(ctx, pdf), PDF_NAME(Root)), PDF_NAME(OCProperties)); + if (ocp) goto finished; + pdf_obj *root = pdf_dict_get(ctx, pdf_trailer(ctx, pdf), PDF_NAME(Root)); + ocp = pdf_dict_put_dict(ctx, root, PDF_NAME(OCProperties), 2); + pdf_dict_put_array(ctx, ocp, PDF_NAME(OCGs), 0); + D = pdf_dict_put_dict(ctx, ocp, PDF_NAME(D), 5); + pdf_dict_put_array(ctx, D, PDF_NAME(ON), 0); + pdf_dict_put_array(ctx, D, PDF_NAME(OFF), 0); + pdf_dict_put_array(ctx, D, PDF_NAME(Order), 0); + pdf_dict_put_array(ctx, D, PDF_NAME(RBGroups), 0); + finished:; + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return ocp; +} + + +//------------------------------------------------------------------------ +// Add OC configuration to the PDF catalog +//------------------------------------------------------------------------ +void +JM_add_layer_config(fz_context *ctx, pdf_document *pdf, char *name, char *creator, PyObject *ON) +{ + pdf_obj *D, *ocp, *configs; + fz_try(ctx) { + ocp = JM_ensure_ocproperties(ctx, pdf); + configs = pdf_dict_get(ctx, ocp, PDF_NAME(Configs)); + if (!pdf_is_array(ctx, configs)) { + configs = pdf_dict_put_array(ctx,ocp, PDF_NAME(Configs), 1); + } + D = pdf_new_dict(ctx, pdf, 5); + pdf_dict_put_text_string(ctx, D, PDF_NAME(Name), name); + if (creator) { + pdf_dict_put_text_string(ctx, D, PDF_NAME(Creator), creator); + } + pdf_dict_put(ctx, D, PDF_NAME(BaseState), PDF_NAME(OFF)); + pdf_obj *onarray = pdf_dict_put_array(ctx, D, PDF_NAME(ON), 5); + if (!EXISTS(ON) || !PySequence_Check(ON) || !PySequence_Size(ON)) { + ; + } else { + pdf_obj *ocgs = pdf_dict_get(ctx, ocp, PDF_NAME(OCGs)); + int i, n = PySequence_Size(ON); + for (i = 0; i < n; i++) { + int xref = 0; + if (JM_INT_ITEM(ON, (Py_ssize_t) i, &xref) == 1) continue; + pdf_obj *ind = pdf_new_indirect(ctx, pdf, xref, 0); + if (pdf_array_contains(ctx, ocgs, ind)) { + pdf_array_push_drop(ctx, onarray, ind); + } else { + pdf_drop_obj(ctx, ind); + } + } + } + pdf_array_push_drop(ctx, configs, D); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } +} + + +//------------------------------------------------------------------------ +// Get OCG arrays from OC configuration +// Returns dict +// {"basestate":name, "on":list, "off":list, "rbg":list, "locked":list} +//------------------------------------------------------------------------ +static PyObject * +JM_get_ocg_arrays_imp(fz_context *ctx, pdf_obj *arr) +{ + int i, n; + PyObject *list = PyList_New(0), *item = NULL; + pdf_obj *obj = NULL; + if 
(pdf_is_array(ctx, arr)) { + n = pdf_array_len(ctx, arr); + for (i = 0; i < n; i++) { + obj = pdf_array_get(ctx, arr, i); + item = Py_BuildValue("i", pdf_to_num(ctx, obj)); + if (!PySequence_Contains(list, item)) { + LIST_APPEND_DROP(list, item); + } else { + Py_DECREF(item); + } + } + } + return list; +} + +PyObject * +JM_get_ocg_arrays(fz_context *ctx, pdf_obj *conf) +{ + PyObject *rc = PyDict_New(), *list = NULL, *list1 = NULL; + int i, n; + pdf_obj *arr = NULL, *obj = NULL; + fz_try(ctx) { + arr = pdf_dict_get(ctx, conf, PDF_NAME(ON)); + list = JM_get_ocg_arrays_imp(ctx, arr); + if (PySequence_Size(list)) { + PyDict_SetItemString(rc, "on", list); + } + Py_DECREF(list); + arr = pdf_dict_get(ctx, conf, PDF_NAME(OFF)); + list = JM_get_ocg_arrays_imp(ctx, arr); + if (PySequence_Size(list)) { + PyDict_SetItemString(rc, "off", list); + } + Py_DECREF(list); + arr = pdf_dict_get(ctx, conf, PDF_NAME(Locked)); + list = JM_get_ocg_arrays_imp(ctx, arr); + if (PySequence_Size(list)) { + PyDict_SetItemString(rc, "locked", list); + } + Py_DECREF(list); + list = PyList_New(0); + arr = pdf_dict_get(ctx, conf, PDF_NAME(RBGroups)); + if (pdf_is_array(ctx, arr)) { + n = pdf_array_len(ctx, arr); + for (i = 0; i < n; i++) { + obj = pdf_array_get(ctx, arr, i); + list1 = JM_get_ocg_arrays_imp(ctx, obj); + LIST_APPEND_DROP(list, list1); + } + } + if (PySequence_Size(list)) { + PyDict_SetItemString(rc, "rbgroups", list); + } + Py_DECREF(list); + obj = pdf_dict_get(ctx, conf, PDF_NAME(BaseState)); + + if (obj) { + PyObject *state = NULL; + state = Py_BuildValue("s", pdf_to_name(ctx, obj)); + PyDict_SetItemString(rc, "basestate", state); + Py_DECREF(state); + } + } + fz_always(ctx) { + } + fz_catch(ctx) { + Py_CLEAR(rc); + PyErr_Clear(); + fz_rethrow(ctx); + } + return rc; +} + + +//------------------------------------------------------------------------ +// Set OCG arrays from dict of Python lists +// Works with dict like {"basestate":name, "on":list, "off":list, "rbg":list} +//------------------------------------------------------------------------ +static void +JM_set_ocg_arrays_imp(fz_context *ctx, pdf_obj *arr, PyObject *list) +{ + int i, n = PySequence_Size(list); + pdf_obj *obj = NULL; + pdf_document *pdf = pdf_get_bound_document(ctx, arr); + for (i = 0; i < n; i++) { + int xref = 0; + if (JM_INT_ITEM(list, i, &xref) == 1) continue; + obj = pdf_new_indirect(ctx, pdf, xref, 0); + pdf_array_push_drop(ctx, arr, obj); + } + return; +} + +static void +JM_set_ocg_arrays(fz_context *ctx, pdf_obj *conf, const char *basestate, + PyObject *on, PyObject *off, PyObject *rbgroups, PyObject *locked) +{ + int i, n; + pdf_obj *arr = NULL, *obj = NULL; + fz_try(ctx) { + if (basestate) { + pdf_dict_put_name(ctx, conf, PDF_NAME(BaseState), basestate); + } + + if (on != Py_None) { + pdf_dict_del(ctx, conf, PDF_NAME(ON)); + if (PySequence_Size(on)) { + arr = pdf_dict_put_array(ctx, conf, PDF_NAME(ON), 1); + JM_set_ocg_arrays_imp(ctx, arr, on); + } + } + + if (off != Py_None) { + pdf_dict_del(ctx, conf, PDF_NAME(OFF)); + if (PySequence_Size(off)) { + arr = pdf_dict_put_array(ctx, conf, PDF_NAME(OFF), 1); + JM_set_ocg_arrays_imp(ctx, arr, off); + } + } + + if (locked != Py_None) { + pdf_dict_del(ctx, conf, PDF_NAME(Locked)); + if (PySequence_Size(locked)) { + arr = pdf_dict_put_array(ctx, conf, PDF_NAME(Locked), 1); + JM_set_ocg_arrays_imp(ctx, arr, locked); + } + } + + if (rbgroups != Py_None) { + pdf_dict_del(ctx, conf, PDF_NAME(RBGroups)); + if (PySequence_Size(rbgroups)) { + arr = pdf_dict_put_array(ctx, conf, 
PDF_NAME(RBGroups), 1); + n = PySequence_Size(rbgroups); + for (i = 0; i < n; i++) { + PyObject *item0 = PySequence_ITEM(rbgroups, i); + obj = pdf_array_push_array(ctx, arr, 1); + JM_set_ocg_arrays_imp(ctx, obj, item0); + Py_DECREF(item0); + } + } + } + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return; +} + + +//------------------------------------------------------------------------ +// Return the items of Resources/Properties (used for Marked Content) +// Argument may be e.g. a page object or a Form XObject +//------------------------------------------------------------------------ +PyObject * +JM_get_resource_properties(fz_context *ctx, pdf_obj *ref) +{ + PyObject *rc = NULL; + fz_try(ctx) { + pdf_obj *properties = pdf_dict_getl(ctx, ref, + PDF_NAME(Resources), + PDF_NAME(Properties), NULL); + if (!properties) { + rc = PyTuple_New(0); + } else { + int i, n = pdf_dict_len(ctx, properties); + if (n < 1) { + rc = PyTuple_New(0); + goto finished; + } + rc = PyTuple_New(n); + for (i = 0; i < n; i++) { + pdf_obj *key = pdf_dict_get_key(ctx, properties, i); + pdf_obj *val = pdf_dict_get_val(ctx, properties, i); + const char *c = pdf_to_name(ctx, key); + int xref = pdf_to_num(ctx, val); + PyTuple_SET_ITEM(rc, i, Py_BuildValue("si", c, xref)); + } + } + finished:; + } + fz_catch(ctx) { + Py_CLEAR(rc); + fz_rethrow(ctx); + } + return rc; +} + + +//------------------------------------------------------------------------ +// Insert an item into Resources/Properties (used for Marked Content) +// Arguments: +// (1) e.g. page object, Form XObject +// (2) marked content name +// (3) xref of the referenced object (insert as indirect reference) +//------------------------------------------------------------------------ +void +JM_set_resource_property(fz_context *ctx, pdf_obj *ref, const char *name, int xref) +{ + pdf_obj *ind = NULL; + pdf_obj *properties = NULL; + pdf_document *pdf = pdf_get_bound_document(ctx, ref); + pdf_obj *name2 = NULL; + fz_var(ind); + fz_var(name2); + fz_try(ctx) { + ind = pdf_new_indirect(ctx, pdf, xref, 0); + if (!ind) { + RAISEPY(ctx, MSG_BAD_XREF, PyExc_ValueError); + } + pdf_obj *resources = pdf_dict_get(ctx, ref, PDF_NAME(Resources)); + if (!resources) { + resources = pdf_dict_put_dict(ctx, ref, PDF_NAME(Resources), 1); + } + properties = pdf_dict_get(ctx, resources, PDF_NAME(Properties)); + if (!properties) { + properties = pdf_dict_put_dict(ctx, resources, PDF_NAME(Properties), 1); + } + name2 = pdf_new_name(ctx, name); + pdf_dict_put(ctx, properties, name2, ind); + } + fz_always(ctx) { + pdf_drop_obj(ctx, ind); + pdf_drop_obj(ctx, name2); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return; +} + + +//------------------------------------------------------------------------ +// Add OC object reference to a dictionary +//------------------------------------------------------------------------ +void +JM_add_oc_object(fz_context *ctx, pdf_document *pdf, pdf_obj *ref, int xref) +{ + pdf_obj *indobj = NULL; + fz_try(ctx) { + indobj = pdf_new_indirect(ctx, pdf, xref, 0); + if (!pdf_is_dict(ctx, indobj)) { + RAISEPY(ctx, MSG_BAD_OC_REF, PyExc_ValueError); + } + pdf_obj *type = pdf_dict_get(ctx, indobj, PDF_NAME(Type)); + if (pdf_objcmp(ctx, type, PDF_NAME(OCG)) == 0 || + pdf_objcmp(ctx, type, PDF_NAME(OCMD)) == 0) { + pdf_dict_put(ctx, ref, PDF_NAME(OC), indobj); + } else { + RAISEPY(ctx, MSG_BAD_OC_REF, PyExc_ValueError); + } + } + fz_always(ctx) { + pdf_drop_obj(ctx, indobj); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } +} + + 
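+// Note: the three JM_gather_* helpers below are driven by JM_scan_resources()
+// at the end of this file. Each appends one fixed-size tuple per resource to
+// the given Python list: 7 items per font, 10 per image, 4 per Form XObject.
+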
+//------------------------------------------------------------------------- +// Store info of a font in Python list +//------------------------------------------------------------------------- +int JM_gather_fonts(fz_context *ctx, pdf_document *pdf, pdf_obj *dict, + PyObject *fontlist, int stream_xref) +{ + int i, n, rc = 1; + n = pdf_dict_len(ctx, dict); + for (i = 0; i < n; i++) { + pdf_obj *fontdict = NULL; + pdf_obj *subtype = NULL; + pdf_obj *basefont = NULL; + pdf_obj *name = NULL; + pdf_obj *refname = NULL; + pdf_obj *encoding = NULL; + + refname = pdf_dict_get_key(ctx, dict, i); + fontdict = pdf_dict_get_val(ctx, dict, i); + if (!pdf_is_dict(ctx, fontdict)) { + fz_warn(ctx, "'%s' is no font dict (%d 0 R)", + pdf_to_name(ctx, refname), pdf_to_num(ctx, fontdict)); + continue; + } + + subtype = pdf_dict_get(ctx, fontdict, PDF_NAME(Subtype)); + basefont = pdf_dict_get(ctx, fontdict, PDF_NAME(BaseFont)); + if (!basefont || pdf_is_null(ctx, basefont)) { + name = pdf_dict_get(ctx, fontdict, PDF_NAME(Name)); + } else { + name = basefont; + } + encoding = pdf_dict_get(ctx, fontdict, PDF_NAME(Encoding)); + if (pdf_is_dict(ctx, encoding)) { + encoding = pdf_dict_get(ctx, encoding, PDF_NAME(BaseEncoding)); + } + int xref = pdf_to_num(ctx, fontdict); + char *ext = "n/a"; + if (xref) { + ext = JM_get_fontextension(ctx, pdf, xref); + } + PyObject *entry = PyTuple_New(7); + PyTuple_SET_ITEM(entry, 0, Py_BuildValue("i", xref)); + PyTuple_SET_ITEM(entry, 1, Py_BuildValue("s", ext)); + PyTuple_SET_ITEM(entry, 2, Py_BuildValue("s", pdf_to_name(ctx, subtype))); + PyTuple_SET_ITEM(entry, 3, JM_EscapeStrFromStr(pdf_to_name(ctx, name))); + PyTuple_SET_ITEM(entry, 4, Py_BuildValue("s", pdf_to_name(ctx, refname))); + PyTuple_SET_ITEM(entry, 5, Py_BuildValue("s", pdf_to_name(ctx, encoding))); + PyTuple_SET_ITEM(entry, 6, Py_BuildValue("i", stream_xref)); + LIST_APPEND_DROP(fontlist, entry); + } + return rc; +} + +//------------------------------------------------------------------------- +// Store info of an image in Python list +//------------------------------------------------------------------------- +int JM_gather_images(fz_context *ctx, pdf_document *doc, pdf_obj *dict, + PyObject *imagelist, int stream_xref) +{ + int i, n, rc = 1; + n = pdf_dict_len(ctx, dict); + for (i = 0; i < n; i++) { + pdf_obj *imagedict, *smask; + pdf_obj *refname = NULL; + pdf_obj *type; + pdf_obj *width; + pdf_obj *height; + pdf_obj *bpc = NULL; + pdf_obj *filter = NULL; + pdf_obj *cs = NULL; + pdf_obj *altcs; + + refname = pdf_dict_get_key(ctx, dict, i); + imagedict = pdf_dict_get_val(ctx, dict, i); + if (!pdf_is_dict(ctx, imagedict)) { + fz_warn(ctx, "'%s' is no image dict (%d 0 R)", + pdf_to_name(ctx, refname), pdf_to_num(ctx, imagedict)); + continue; + } + + type = pdf_dict_get(ctx, imagedict, PDF_NAME(Subtype)); + if (!pdf_name_eq(ctx, type, PDF_NAME(Image))) + continue; + + int xref = pdf_to_num(ctx, imagedict); + int gen = 0; + smask = pdf_dict_geta(ctx, imagedict, PDF_NAME(SMask), PDF_NAME(Mask)); + if (smask) + gen = pdf_to_num(ctx, smask); + + filter = pdf_dict_geta(ctx, imagedict, PDF_NAME(Filter), PDF_NAME(F)); + if (pdf_is_array(ctx, filter)) { + filter = pdf_array_get(ctx, filter, 0); + } + + altcs = NULL; + cs = pdf_dict_geta(ctx, imagedict, PDF_NAME(ColorSpace), PDF_NAME(CS)); + if (pdf_is_array(ctx, cs)) { + pdf_obj *cses = cs; + cs = pdf_array_get(ctx, cses, 0); + if (pdf_name_eq(ctx, cs, PDF_NAME(DeviceN)) || + pdf_name_eq(ctx, cs, PDF_NAME(Separation))) { + altcs = pdf_array_get(ctx, cses, 2); + if 
(pdf_is_array(ctx, altcs)) { + altcs = pdf_array_get(ctx, altcs, 0); + } + } + } + + width = pdf_dict_geta(ctx, imagedict, PDF_NAME(Width), PDF_NAME(W)); + height = pdf_dict_geta(ctx, imagedict, PDF_NAME(Height), PDF_NAME(H)); + bpc = pdf_dict_geta(ctx, imagedict, PDF_NAME(BitsPerComponent), PDF_NAME(BPC)); + + PyObject *entry = PyTuple_New(10); + PyTuple_SET_ITEM(entry, 0, Py_BuildValue("i", xref)); + PyTuple_SET_ITEM(entry, 1, Py_BuildValue("i", gen)); + PyTuple_SET_ITEM(entry, 2, Py_BuildValue("i", pdf_to_int(ctx, width))); + PyTuple_SET_ITEM(entry, 3, Py_BuildValue("i", pdf_to_int(ctx, height))); + PyTuple_SET_ITEM(entry, 4, Py_BuildValue("i", pdf_to_int(ctx, bpc))); + PyTuple_SET_ITEM(entry, 5, JM_EscapeStrFromStr(pdf_to_name(ctx, cs))); + PyTuple_SET_ITEM(entry, 6, JM_EscapeStrFromStr(pdf_to_name(ctx, altcs))); + PyTuple_SET_ITEM(entry, 7, JM_EscapeStrFromStr(pdf_to_name(ctx, refname))); + PyTuple_SET_ITEM(entry, 8, JM_EscapeStrFromStr(pdf_to_name(ctx, filter))); + PyTuple_SET_ITEM(entry, 9, Py_BuildValue("i", stream_xref)); + LIST_APPEND_DROP(imagelist, entry); + } + return rc; +} + +//------------------------------------------------------------------------- +// Store info of a /Form xobject in Python list +//------------------------------------------------------------------------- +int JM_gather_forms(fz_context *ctx, pdf_document *doc, pdf_obj *dict, + PyObject *imagelist, int stream_xref) +{ + int i, rc = 1, n = pdf_dict_len(ctx, dict); + fz_rect bbox; + fz_matrix mat; + pdf_obj *o = NULL, *m = NULL; + for (i = 0; i < n; i++) { + pdf_obj *imagedict; + pdf_obj *refname = NULL; + pdf_obj *type; + + refname = pdf_dict_get_key(ctx, dict, i); + imagedict = pdf_dict_get_val(ctx, dict, i); + if (!pdf_is_dict(ctx, imagedict)) { + fz_warn(ctx, "'%s' is no form dict (%d 0 R)", + pdf_to_name(ctx, refname), pdf_to_num(ctx, imagedict)); + continue; + } + + type = pdf_dict_get(ctx, imagedict, PDF_NAME(Subtype)); + if (!pdf_name_eq(ctx, type, PDF_NAME(Form))) + continue; + + o = pdf_dict_get(ctx, imagedict, PDF_NAME(BBox)); + m = pdf_dict_get(ctx, imagedict, PDF_NAME(Matrix)); + if (m) { + mat = pdf_to_matrix(ctx, m); + } else { + mat = fz_identity; + } + if (o) { + bbox = fz_transform_rect(pdf_to_rect(ctx, o), mat); + } else { + bbox = fz_infinite_rect; + } + int xref = pdf_to_num(ctx, imagedict); + + PyObject *entry = PyTuple_New(4); + PyTuple_SET_ITEM(entry, 0, Py_BuildValue("i", xref)); + PyTuple_SET_ITEM(entry, 1, Py_BuildValue("s", pdf_to_name(ctx, refname))); + PyTuple_SET_ITEM(entry, 2, Py_BuildValue("i", stream_xref)); + PyTuple_SET_ITEM(entry, 3, JM_py_from_rect(bbox)); + LIST_APPEND_DROP(imagelist, entry); + } + return rc; +} + +//------------------------------------------------------------------------- +// Step through /Resources, looking up image, xobject or font information +//------------------------------------------------------------------------- +void JM_scan_resources(fz_context *ctx, pdf_document *pdf, pdf_obj *rsrc, + PyObject *liste, int what, int stream_xref, + PyObject *tracer) +{ + pdf_obj *font, *xobj, *subrsrc; + int i, n, sxref; + if (pdf_mark_obj(ctx, rsrc)) { + fz_warn(ctx, "Circular dependencies! Consider page cleaning."); + return; // Circular dependencies! 
+ } + + fz_try(ctx) { + + xobj = pdf_dict_get(ctx, rsrc, PDF_NAME(XObject)); + + if (what == 1) { // lookup fonts + font = pdf_dict_get(ctx, rsrc, PDF_NAME(Font)); + JM_gather_fonts(ctx, pdf, font, liste, stream_xref); + } else if (what == 2) { // look up images + JM_gather_images(ctx, pdf, xobj, liste, stream_xref); + } else if (what == 3) { // look up form xobjects + JM_gather_forms(ctx, pdf, xobj, liste, stream_xref); + } else { // should never happen + goto finished; + } + + // check if we need to recurse into Form XObjects + n = pdf_dict_len(ctx, xobj); + for (i = 0; i < n; i++) { + pdf_obj *obj = pdf_dict_get_val(ctx, xobj, i); + if (pdf_is_stream(ctx, obj)) { + sxref = pdf_to_num(ctx, obj); + } else { + sxref = 0; + } + subrsrc = pdf_dict_get(ctx, obj, PDF_NAME(Resources)); + if (subrsrc) { + PyObject *sxref_t = Py_BuildValue("i", sxref); + if (PySequence_Contains(tracer, sxref_t) == 0) { + LIST_APPEND_DROP(tracer, sxref_t); + JM_scan_resources(ctx, pdf, subrsrc, liste, what, sxref, tracer); + } else { + Py_DECREF(sxref_t); + PyErr_Clear(); + fz_warn(ctx, "Circular dependencies! Consider page cleaning."); + goto finished; + } + } + } + finished:; + } + fz_always(ctx) { + pdf_unmark_obj(ctx, rsrc); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } +} +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-pixmap.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-pixmap.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,431 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ +*/ +//----------------------------------------------------------------------------- +// pixmap helper functions +//----------------------------------------------------------------------------- + +//----------------------------------------------------------------------------- +// Clear a pixmap rectangle - my version also supports non-alpha pixmaps +//----------------------------------------------------------------------------- +int +JM_clear_pixmap_rect_with_value(fz_context *ctx, fz_pixmap *dest, int value, fz_irect b) +{ + unsigned char *destp; + int x, y, w, k, destspan; + + b = fz_intersect_irect(b, fz_pixmap_bbox(ctx, dest)); + w = b.x1 - b.x0; + y = b.y1 - b.y0; + if (w <= 0 || y <= 0) + return 0; + + destspan = dest->stride; + destp = dest->samples + (unsigned int)(destspan * (b.y0 - dest->y) + dest->n * (b.x0 - dest->x)); + + /* CMYK needs special handling (and potentially any other subtractive colorspaces) */ + if (fz_colorspace_n(ctx, dest->colorspace) == 4) { + value = 255 - value; + do { + unsigned char *s = destp; + for (x = 0; x < w; x++) { + *s++ = 0; + *s++ = 0; + *s++ = 0; + *s++ = value; + if (dest->alpha) *s++ = 255; + } + destp += destspan; + } while (--y); + return 1; + } + + do { + unsigned char *s = destp; + for (x = 0; x < w; x++) { + for (k = 0; k < dest->n - 1; k++) + *s++ = value; + if (dest->alpha) *s++ = 255; + else *s++ = value; + } + destp += destspan; + } while (--y); + return 1; +} + +//----------------------------------------------------------------------------- +// fill a rect with a color tuple +//----------------------------------------------------------------------------- +int +JM_fill_pixmap_rect_with_color(fz_context *ctx, fz_pixmap *dest, unsigned char col[5], fz_irect b) +{ + unsigned char *destp; + int x, y, w, i, destspan; + + b = fz_intersect_irect(b, fz_pixmap_bbox(ctx, dest)); + w = b.x1 - b.x0; + y = b.y1 - b.y0; + if (w <= 0 || y <= 0) + return 0; + + destspan = dest->stride; + destp = dest->samples + (unsigned int)(destspan * (b.y0 - dest->y) + dest->n * (b.x0 - dest->x)); + + do { + unsigned char *s = destp; + for (x = 0; x < w; x++) { + for (i = 0; i < dest->n; i++) + *s++ = col[i]; + } + destp += destspan; + } while (--y); + return 1; +} + +//----------------------------------------------------------------------------- +// invert a rectangle - also supports non-alpha pixmaps +//----------------------------------------------------------------------------- +int +JM_invert_pixmap_rect(fz_context *ctx, fz_pixmap *dest, fz_irect b) +{ + unsigned char *destp; + int x, y, w, i, destspan; + + b = fz_intersect_irect(b, fz_pixmap_bbox(ctx, dest)); + w = b.x1 - b.x0; + y = b.y1 - b.y0; + if (w <= 0 || y <= 0) + return 0; + + destspan = dest->stride; + destp = dest->samples + (unsigned int)(destspan * (b.y0 - dest->y) + dest->n * (b.x0 - dest->x)); + int n0 = dest->n - dest->alpha; + do { + unsigned char *s = destp; + for (x = 0; x < w; x++) { + for (i = 0; i < n0; i++) { + *s = 255 - *s; + s++; + } + if (dest->alpha) s++; + } + destp += destspan; + } while (--y); + return 1; +} + +int +JM_is_jbig2_image(fz_context *ctx, pdf_obj *dict) +{ + // fixme: should we remove this function? 
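+    // Currently disabled: we always report 'not JBIG2'; the original /Filter
+    // inspection is kept below (commented out) only for reference.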
+ return 0; + /* + pdf_obj *filter; + int i, n; + + filter = pdf_dict_get(ctx, dict, PDF_NAME(Filter)); + if (pdf_name_eq(ctx, filter, PDF_NAME(JBIG2Decode))) + return 1; + n = pdf_array_len(ctx, filter); + for (i = 0; i < n; i++) + if (pdf_name_eq(ctx, pdf_array_get(ctx, filter, i), PDF_NAME(JBIG2Decode))) + return 1; + return 0; + */ +} + +//----------------------------------------------------------------------------- +// Return basic properties of an image provided as bytes or bytearray +// The function creates an fz_image and optionally returns it. +//----------------------------------------------------------------------------- +PyObject *JM_image_profile(fz_context *ctx, PyObject *imagedata, int keep_image) +{ + if (!EXISTS(imagedata)) { + Py_RETURN_NONE; // nothing given + } + fz_image *image = NULL; + fz_buffer *res = NULL; + PyObject *result = NULL; + unsigned char *c = NULL; + Py_ssize_t len = 0; + if (PyBytes_Check(imagedata)) { + c = PyBytes_AS_STRING(imagedata); + len = PyBytes_GET_SIZE(imagedata); + } else if (PyByteArray_Check(imagedata)) { + c = PyByteArray_AS_STRING(imagedata); + len = PyByteArray_GET_SIZE(imagedata); + } else { + PySys_WriteStderr("bad image data\n"); + Py_RETURN_NONE; + } + + if (len < 8) { + PySys_WriteStderr("bad image data\n"); + Py_RETURN_NONE; + } + int type = fz_recognize_image_format(ctx, c); + if (type == FZ_IMAGE_UNKNOWN) { + Py_RETURN_NONE; + } + + fz_try(ctx) { + if (keep_image) { + res = fz_new_buffer_from_copied_data(ctx, c, (size_t) len); + } else { + res = fz_new_buffer_from_shared_data(ctx, c, (size_t) len); + } + image = fz_new_image_from_buffer(ctx, res); + int xres, yres, orientation; + fz_matrix ctm = fz_image_orientation_matrix(ctx, image); + fz_image_resolution(image, &xres, &yres); + orientation = (int) fz_image_orientation(ctx, image); + const char *cs_name = fz_colorspace_name(ctx, image->colorspace); + result = PyDict_New(); + DICT_SETITEM_DROP(result, dictkey_width, + Py_BuildValue("i", image->w)); + DICT_SETITEM_DROP(result, dictkey_height, + Py_BuildValue("i", image->h)); + DICT_SETITEMSTR_DROP(result, "orientation", + Py_BuildValue("i", orientation)); + DICT_SETITEM_DROP(result, dictkey_matrix, + JM_py_from_matrix(ctm)); + DICT_SETITEM_DROP(result, dictkey_xres, + Py_BuildValue("i", xres)); + DICT_SETITEM_DROP(result, dictkey_yres, + Py_BuildValue("i", yres)); + DICT_SETITEM_DROP(result, dictkey_colorspace, + Py_BuildValue("i", image->n)); + DICT_SETITEM_DROP(result, dictkey_bpc, + Py_BuildValue("i", image->bpc)); + DICT_SETITEM_DROP(result, dictkey_ext, + Py_BuildValue("s", JM_image_extension(type))); + DICT_SETITEM_DROP(result, dictkey_cs_name, + Py_BuildValue("s", cs_name)); + + if (keep_image) { + DICT_SETITEM_DROP(result, dictkey_image, + PyLong_FromVoidPtr((void *) fz_keep_image(ctx, image))); + } + } + fz_always(ctx) { + if (!keep_image) { + fz_drop_image(ctx, image); + } else { + fz_drop_buffer(ctx, res); // drop the buffer copy + } + } + fz_catch(ctx) { + Py_CLEAR(result); + fz_rethrow(ctx); + } + PyErr_Clear(); + return result; +} + +//---------------------------------------------------------------------------- +// Version of fz_new_pixmap_from_display_list (util.c) to also support +// rendering of only the 'clip' part of the displaylist rectangle +//---------------------------------------------------------------------------- +fz_pixmap * +JM_pixmap_from_display_list(fz_context *ctx, + fz_display_list *list, + PyObject *ctm, + fz_colorspace *cs, + int alpha, + PyObject *clip, + fz_separations *seps + ) +{ + fz_rect 
rect = fz_bound_display_list(ctx, list); + fz_matrix matrix = JM_matrix_from_py(ctm); + fz_pixmap *pix = NULL; + fz_var(pix); + fz_device *dev = NULL; + fz_var(dev); + fz_rect rclip = JM_rect_from_py(clip); + rect = fz_intersect_rect(rect, rclip); // no-op if clip is not given + + rect = fz_transform_rect(rect, matrix); + fz_irect irect = fz_round_rect(rect); + + pix = fz_new_pixmap_with_bbox(ctx, cs, irect, seps, alpha); + if (alpha) + fz_clear_pixmap(ctx, pix); + else + fz_clear_pixmap_with_value(ctx, pix, 0xFF); + + fz_try(ctx) { + if (!fz_is_infinite_rect(rclip)) { + dev = fz_new_draw_device_with_bbox(ctx, matrix, pix, &irect); + fz_run_display_list(ctx, list, dev, fz_identity, rclip, NULL); + } else { + dev = fz_new_draw_device(ctx, matrix, pix); + fz_run_display_list(ctx, list, dev, fz_identity, fz_infinite_rect, NULL); + } + + fz_close_device(ctx, dev); + } + fz_always(ctx) { + fz_drop_device(ctx, dev); + } + fz_catch(ctx) { + fz_drop_pixmap(ctx, pix); + fz_rethrow(ctx); + } + return pix; +} + +//---------------------------------------------------------------------------- +// Pixmap creation directly using a short-lived displaylist, so we can support +// separations. +//---------------------------------------------------------------------------- +fz_pixmap * +JM_pixmap_from_page(fz_context *ctx, + fz_document *doc, + fz_page *page, + PyObject *ctm, + fz_colorspace *cs, + int alpha, + int annots, + PyObject *clip + ) +{ + enum { SPOTS_NONE, SPOTS_OVERPRINT_SIM, SPOTS_FULL }; + int spots; + if (FZ_ENABLE_SPOT_RENDERING) + spots = SPOTS_OVERPRINT_SIM; + else + spots = SPOTS_NONE; + + fz_separations *seps = NULL; + fz_pixmap *pix = NULL; + fz_colorspace *oi = NULL; + fz_var(oi); + fz_colorspace *colorspace = cs; + fz_rect rect; + fz_irect bbox; + fz_device *dev = NULL; + fz_var(dev); + fz_matrix matrix = JM_matrix_from_py(ctm); + rect = fz_bound_page(ctx, page); + fz_rect rclip = JM_rect_from_py(clip); + rect = fz_intersect_rect(rect, rclip); // no-op if clip is not given + rect = fz_transform_rect(rect, matrix); + bbox = fz_round_rect(rect); + + fz_try(ctx) { + // Pixmap of the document's /OutputIntents ("output intents") + oi = fz_document_output_intent(ctx, doc); + // if present and compatible, use it instead of the parameter + if (oi) { + if (fz_colorspace_n(ctx, oi) == fz_colorspace_n(ctx, cs)) { + colorspace = fz_keep_colorspace(ctx, oi); + } + } + + // check if spots rendering is available and if so use separations + if (spots != SPOTS_NONE) { + seps = fz_page_separations(ctx, page); + if (seps) { + int i, n = fz_count_separations(ctx, seps); + if (spots == SPOTS_FULL) + for (i = 0; i < n; i++) + fz_set_separation_behavior(ctx, seps, i, FZ_SEPARATION_SPOT); + else + for (i = 0; i < n; i++) + fz_set_separation_behavior(ctx, seps, i, FZ_SEPARATION_COMPOSITE); + } else if (fz_page_uses_overprint(ctx, page)) { + /* This page uses overprint, so we need an empty + * sep object to force the overprint simulation on. */ + seps = fz_new_separations(ctx, 0); + } else if (oi && fz_colorspace_n(ctx, oi) != fz_colorspace_n(ctx, colorspace)) { + /* We have an output intent, and it's incompatible + * with the colorspace our device needs. Force the + * overprint simulation on, because this ensures that + * we 'simulate' the output intent too. 
*/ + seps = fz_new_separations(ctx, 0); + } + } + + pix = fz_new_pixmap_with_bbox(ctx, colorspace, bbox, seps, alpha); + + if (alpha) { + fz_clear_pixmap(ctx, pix); + } else { + fz_clear_pixmap_with_value(ctx, pix, 0xFF); + } + + dev = fz_new_draw_device(ctx, matrix, pix); + if (annots) { + fz_run_page(ctx, page, dev, fz_identity, NULL); + } else { + fz_run_page_contents(ctx, page, dev, fz_identity, NULL); + } + fz_close_device(ctx, dev); + } + fz_always(ctx) { + fz_drop_device(ctx, dev); + fz_drop_separations(ctx, seps); + fz_drop_colorspace(ctx, oi); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return pix; +} + +PyObject *JM_color_count(fz_context *ctx, fz_pixmap *pm, PyObject *clip) +{ + PyObject *rc = PyDict_New(), *pixel=NULL, *c=NULL; + long cnt=0; + fz_irect irect = fz_pixmap_bbox(ctx, pm); + irect = fz_intersect_irect(irect, fz_round_rect(JM_rect_from_py(clip))); + size_t stride = pm->stride; + size_t width = irect.x1 - irect.x0, height = irect.y1 - irect.y0; + size_t i, j, n = (size_t) pm->n, substride = width * n; + unsigned char *s = pm->samples + stride * (irect.y0 - pm->y) + (irect.x0 - pm->x) * n; + unsigned char oldpix[10], newpix[10]; + memcpy(oldpix, s, n); + cnt = 0; + fz_try(ctx) { + if (fz_is_empty_irect(irect)) goto finished; + for (i = 0; i < height; i++) { + for (j = 0; j < substride; j += n) { + memcpy(newpix, s + j, n); + if (memcmp(oldpix, newpix,n) != 0) { + pixel = PyBytes_FromStringAndSize(oldpix, n); + c = PyDict_GetItem(rc, pixel); + if (c) cnt += PyLong_AsLong(c); + DICT_SETITEM_DROP(rc, pixel, PyLong_FromLong(cnt)); + Py_DECREF(pixel); + cnt = 1; + memcpy(oldpix, newpix, n); + } else { + cnt += 1; + } + } + s += stride; + } + pixel = PyBytes_FromStringAndSize(oldpix, n); + c = PyDict_GetItem(rc, pixel); + if (c) cnt += PyLong_AsLong(c); + DICT_SETITEM_DROP(rc, pixel, PyLong_FromLong(cnt)); + Py_DECREF(pixel); + finished:; + } + fz_catch(ctx) { + Py_CLEAR(rc); + fz_rethrow(ctx); + } + PyErr_Clear(); + return rc; +} +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-portfolio.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-portfolio.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,79 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ +*/ +//----------------------------------------------------------------------------- +// perform some cleaning if we have /EmbeddedFiles: +// (1) remove any /Limits if /Names exists +// (2) remove any empty /Collection +// (3) set /PageMode/UseAttachments +//----------------------------------------------------------------------------- +void JM_embedded_clean(fz_context *ctx, pdf_document *pdf) +{ + pdf_obj *root = pdf_dict_get(ctx, pdf_trailer(ctx, pdf), PDF_NAME(Root)); + + // remove any empty /Collection entry + pdf_obj *coll = pdf_dict_get(ctx, root, PDF_NAME(Collection)); + if (coll && pdf_dict_len(ctx, coll) == 0) + pdf_dict_del(ctx, root, PDF_NAME(Collection)); + + pdf_obj *efiles = pdf_dict_getl(ctx, root, + PDF_NAME(Names), + PDF_NAME(EmbeddedFiles), + PDF_NAME(Names), + NULL); + if (efiles) { + pdf_dict_put_name(ctx, root, PDF_NAME(PageMode), "UseAttachments"); + } + return; +} + +//----------------------------------------------------------------------------- +// embed a new file in a PDF (not only /EmbeddedFiles entries) +//----------------------------------------------------------------------------- +pdf_obj *JM_embed_file(fz_context *ctx, + pdf_document *pdf, + fz_buffer *buf, + char *filename, + char *ufilename, + char *desc, + int compress) +{ + size_t len = 0; + pdf_obj *ef, *f, *params, *val = NULL; + fz_buffer *buff2 = NULL; + fz_var(buff2); + fz_try(ctx) { + val = pdf_new_dict(ctx, pdf, 6); + pdf_dict_put_dict(ctx, val, PDF_NAME(CI), 4); + ef = pdf_dict_put_dict(ctx, val, PDF_NAME(EF), 4); + pdf_dict_put_text_string(ctx, val, PDF_NAME(F), filename); + pdf_dict_put_text_string(ctx, val, PDF_NAME(UF), ufilename); + pdf_dict_put_text_string(ctx, val, PDF_NAME(Desc), desc); + pdf_dict_put(ctx, val, PDF_NAME(Type), PDF_NAME(Filespec)); + buff2 = fz_new_buffer_from_copied_data(ctx, " ", 1); + f = pdf_add_stream(ctx, pdf, buff2, NULL, 0); + pdf_dict_put_drop(ctx, ef, PDF_NAME(F), f); + JM_update_stream(ctx, pdf, f, buf, compress); + len = fz_buffer_storage(ctx, buf, NULL); + pdf_dict_put_int(ctx, f, PDF_NAME(DL), len); + pdf_dict_put_int(ctx, f, PDF_NAME(Length), len); + params = pdf_dict_put_dict(ctx, f, PDF_NAME(Params), 4); + pdf_dict_put_int(ctx, params, PDF_NAME(Size), len); + } + fz_always(ctx) { + fz_drop_buffer(ctx, buff2); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return val; +} +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-python.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-python.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,2152 @@ +%pythoncode %{ +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ + +# ------------------------------------------------------------------------------ +# Various PDF Optional Content Flags +# ------------------------------------------------------------------------------ +PDF_OC_ON = 0 +PDF_OC_TOGGLE = 1 +PDF_OC_OFF = 2 + +# ------------------------------------------------------------------------------ +# link kinds and link flags +# ------------------------------------------------------------------------------ +LINK_NONE = 0 +LINK_GOTO = 1 +LINK_URI = 2 +LINK_LAUNCH = 3 +LINK_NAMED = 4 +LINK_GOTOR = 5 +LINK_FLAG_L_VALID = 1 +LINK_FLAG_T_VALID = 2 +LINK_FLAG_R_VALID = 4 +LINK_FLAG_B_VALID = 8 +LINK_FLAG_FIT_H = 16 +LINK_FLAG_FIT_V = 32 +LINK_FLAG_R_IS_ZOOM = 64 + +# ------------------------------------------------------------------------------ +# Text handling flags +# ------------------------------------------------------------------------------ +TEXT_ALIGN_LEFT = 0 +TEXT_ALIGN_CENTER = 1 +TEXT_ALIGN_RIGHT = 2 +TEXT_ALIGN_JUSTIFY = 3 + +TEXT_OUTPUT_TEXT = 0 +TEXT_OUTPUT_HTML = 1 +TEXT_OUTPUT_JSON = 2 +TEXT_OUTPUT_XML = 3 +TEXT_OUTPUT_XHTML = 4 + +TEXT_PRESERVE_LIGATURES = 1 +TEXT_PRESERVE_WHITESPACE = 2 +TEXT_PRESERVE_IMAGES = 4 +TEXT_INHIBIT_SPACES = 8 +TEXT_DEHYPHENATE = 16 +TEXT_PRESERVE_SPANS = 32 +TEXT_MEDIABOX_CLIP = 64 +TEXT_CID_FOR_UNKNOWN_UNICODE = 128 + +TEXTFLAGS_WORDS = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_BLOCKS = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_DICT = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_PRESERVE_IMAGES + | TEXT_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_RAWDICT = TEXTFLAGS_DICT + +TEXTFLAGS_SEARCH = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_DEHYPHENATE + | TEXT_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_HTML = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_PRESERVE_IMAGES + | TEXT_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_XHTML = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_PRESERVE_IMAGES + | TEXT_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_XML = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_CID_FOR_UNKNOWN_UNICODE + ) + +TEXTFLAGS_TEXT = (0 + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP + | TEXT_CID_FOR_UNKNOWN_UNICODE + ) + +# ------------------------------------------------------------------------------ +# Simple text encoding options +# ------------------------------------------------------------------------------ +TEXT_ENCODING_LATIN = 0 +TEXT_ENCODING_GREEK = 1 +TEXT_ENCODING_CYRILLIC = 2 +# ------------------------------------------------------------------------------ +# Stamp annotation icon numbers +# ------------------------------------------------------------------------------ +STAMP_Approved = 0 +STAMP_AsIs = 1 +STAMP_Confidential = 2 +STAMP_Departmental = 3 +STAMP_Experimental = 4 +STAMP_Expired = 5 +STAMP_Final = 6 +STAMP_ForComment = 7 +STAMP_ForPublicRelease = 8 +STAMP_NotApproved = 9 +STAMP_NotForPublicRelease = 10 +STAMP_Sold = 11 +STAMP_TopSecret = 12 +STAMP_Draft = 13 + +# ------------------------------------------------------------------------------ +# Base 14 font names and dictionary +# 
------------------------------------------------------------------------------ +Base14_fontnames = ( + "Courier", + "Courier-Oblique", + "Courier-Bold", + "Courier-BoldOblique", + "Helvetica", + "Helvetica-Oblique", + "Helvetica-Bold", + "Helvetica-BoldOblique", + "Times-Roman", + "Times-Italic", + "Times-Bold", + "Times-BoldItalic", + "Symbol", + "ZapfDingbats", +) + +Base14_fontdict = {} +for f in Base14_fontnames: + Base14_fontdict[f.lower()] = f + del f +Base14_fontdict["helv"] = "Helvetica" +Base14_fontdict["heit"] = "Helvetica-Oblique" +Base14_fontdict["hebo"] = "Helvetica-Bold" +Base14_fontdict["hebi"] = "Helvetica-BoldOblique" +Base14_fontdict["cour"] = "Courier" +Base14_fontdict["coit"] = "Courier-Oblique" +Base14_fontdict["cobo"] = "Courier-Bold" +Base14_fontdict["cobi"] = "Courier-BoldOblique" +Base14_fontdict["tiro"] = "Times-Roman" +Base14_fontdict["tibo"] = "Times-Bold" +Base14_fontdict["tiit"] = "Times-Italic" +Base14_fontdict["tibi"] = "Times-BoldItalic" +Base14_fontdict["symb"] = "Symbol" +Base14_fontdict["zadb"] = "ZapfDingbats" + +annot_skel = { + "goto1": "<>/Rect[%s]/BS<>/Subtype/Link>>", + "goto2": "<>/Rect[%s]/BS<>/Subtype/Link>>", + "gotor1": "<>>>/Rect[%s]/BS<>/Subtype/Link>>", + "gotor2": "<>/Rect[%s]/BS<>/Subtype/Link>>", + "launch": "<>>>/Rect[%s]/BS<>/Subtype/Link>>", + "uri": "<>/Rect[%s]/BS<>/Subtype/Link>>", + "named": "<>/Rect[%s]/BS<>/Subtype/Link>>", +} + +class FileDataError(RuntimeError): + """Raised for documents with file structure issues.""" + pass + +class FileNotFoundError(RuntimeError): + """Raised if file does not exist.""" + pass + +class EmptyFileError(FileDataError): + """Raised when creating documents from zero-length data.""" + pass + +# propagate exception class to C-level code +_set_FileDataError(FileDataError) + +def css_for_pymupdf_font( + fontcode: str, *, CSS: OptStr = None, archive: AnyType = None, name: OptStr = None +) -> str: + """Create @font-face items for the given fontcode of pymupdf-fonts. + + Adds @font-face support for fonts contained in package pymupdf-fonts. + + Creates a CSS font-family for all fonts starting with string 'fontcode'. + + Note: + The font naming convention in package pymupdf-fonts is "fontcode", + where the suffix "sf" is either empty or one of "it", "bo" or "bi". + These suffixes thus represent the regular, italic, bold or bold-italic + variants of a font. For example, font code "notos" refers to fonts + "notos" - "Noto Sans Regular" + "notosit" - "Noto Sans Italic" + "notosbo" - "Noto Sans Bold" + "notosbi" - "Noto Sans Bold Italic" + + This function creates four CSS @font-face definitions and collectively + assigns the font-family name "notos" to them (or the "name" value). + + All fitting font buffers of the pymupdf-fonts package are placed / added + to the archive provided as parameter. + To use the font in fitz.Story, execute 'set_font(fontcode)'. The correct + font weight (bold) or style (italic) will automatically be selected. + Expects and returns the CSS source, with the new CSS definitions appended. + + Args: + fontcode: (str) font code for naming the font variants to include. + E.g. "fig" adds notos, notosi, notosb, notosbi fonts. + A maximum of 4 font variants is accepted. + CSS: (str) CSS string to add @font-face definitions to. + archive: (Archive, mandatory) where to place the font buffers. + name: (str) use this as family-name instead of 'fontcode'. + Returns: + Modified CSS, with appended @font-face statements for each font variant + of fontcode. 
+ Fontbuffers associated with "fontcode" will be added to 'archive'. + """ + # @font-face template string + CSSFONT = "\n@font-face {font-family: %s; src: url(%s);%s%s}\n" + + if not type(archive) is Archive: + raise ValueError("'archive' must be an Archive") + if CSS == None: + CSS = "" + + # select font codes starting with the pass-in string + font_keys = [k for k in fitz_fontdescriptors.keys() if k.startswith(fontcode)] + if font_keys == []: + raise ValueError(f"No font code '{fontcode}' found in pymupdf-fonts.") + if len(font_keys) > 4: + raise ValueError("fontcode too short") + if name == None: # use this name for font-family + name = fontcode + + for fkey in font_keys: + font = fitz_fontdescriptors[fkey] + bold = font["bold"] # determine font property + italic = font["italic"] # determine font property + fbuff = font["loader"]() # load the fontbuffer + archive.add(fbuff, fkey) # update the archive + bold_text = "font-weight: bold;" if bold else "" + italic_text = "font-style: italic;" if italic else "" + CSS += CSSFONT % (name, fkey, bold_text, italic_text) + return CSS + + +def get_text_length(text: str, fontname: str ="helv", fontsize: float =11, encoding: int =0) -> float: + """Calculate length of a string for a built-in font. + + Args: + fontname: name of the font. + fontsize: font size points. + encoding: encoding to use, 0=Latin (default), 1=Greek, 2=Cyrillic. + Returns: + (float) length of text. + """ + fontname = fontname.lower() + basename = Base14_fontdict.get(fontname, None) + + glyphs = None + if basename == "Symbol": + glyphs = symbol_glyphs + if basename == "ZapfDingbats": + glyphs = zapf_glyphs + if glyphs is not None: + w = sum([glyphs[ord(c)][1] if ord(c) < 256 else glyphs[183][1] for c in text]) + return w * fontsize + + if fontname in Base14_fontdict.keys(): + return util_measure_string( + text, Base14_fontdict[fontname], fontsize, encoding + ) + + if fontname in ( + "china-t", + "china-s", + "china-ts", + "china-ss", + "japan", + "japan-s", + "korea", + "korea-s", + ): + return len(text) * fontsize + + raise ValueError("Font '%s' is unsupported" % fontname) + + +# ------------------------------------------------------------------------------ +# Glyph list for the built-in font 'ZapfDingbats' +# ------------------------------------------------------------------------------ +zapf_glyphs = ( + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (32, 0.278), + (33, 0.974), + (34, 0.961), + (35, 0.974), + (36, 0.98), + (37, 0.719), + (38, 0.789), + (39, 0.79), + (40, 0.791), + (41, 0.69), + (42, 0.96), + (43, 0.939), + (44, 0.549), + (45, 0.855), + (46, 0.911), + (47, 0.933), + (48, 0.911), + (49, 0.945), + (50, 0.974), + (51, 0.755), + (52, 0.846), + (53, 0.762), + (54, 0.761), + (55, 0.571), + (56, 0.677), + (57, 0.763), + (58, 0.76), + (59, 0.759), + (60, 0.754), + (61, 0.494), + (62, 0.552), + (63, 0.537), + (64, 0.577), + (65, 0.692), + (66, 0.786), + (67, 0.788), + (68, 0.788), + (69, 0.79), + (70, 0.793), + (71, 0.794), + (72, 0.816), + (73, 0.823), + (74, 0.789), + (75, 0.841), + (76, 0.823), + (77, 0.833), + 
(78, 0.816), + (79, 0.831), + (80, 0.923), + (81, 0.744), + (82, 0.723), + (83, 0.749), + (84, 0.79), + (85, 0.792), + (86, 0.695), + (87, 0.776), + (88, 0.768), + (89, 0.792), + (90, 0.759), + (91, 0.707), + (92, 0.708), + (93, 0.682), + (94, 0.701), + (95, 0.826), + (96, 0.815), + (97, 0.789), + (98, 0.789), + (99, 0.707), + (100, 0.687), + (101, 0.696), + (102, 0.689), + (103, 0.786), + (104, 0.787), + (105, 0.713), + (106, 0.791), + (107, 0.785), + (108, 0.791), + (109, 0.873), + (110, 0.761), + (111, 0.762), + (112, 0.762), + (113, 0.759), + (114, 0.759), + (115, 0.892), + (116, 0.892), + (117, 0.788), + (118, 0.784), + (119, 0.438), + (120, 0.138), + (121, 0.277), + (122, 0.415), + (123, 0.392), + (124, 0.392), + (125, 0.668), + (126, 0.668), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (183, 0.788), + (161, 0.732), + (162, 0.544), + (163, 0.544), + (164, 0.91), + (165, 0.667), + (166, 0.76), + (167, 0.76), + (168, 0.776), + (169, 0.595), + (170, 0.694), + (171, 0.626), + (172, 0.788), + (173, 0.788), + (174, 0.788), + (175, 0.788), + (176, 0.788), + (177, 0.788), + (178, 0.788), + (179, 0.788), + (180, 0.788), + (181, 0.788), + (182, 0.788), + (183, 0.788), + (184, 0.788), + (185, 0.788), + (186, 0.788), + (187, 0.788), + (188, 0.788), + (189, 0.788), + (190, 0.788), + (191, 0.788), + (192, 0.788), + (193, 0.788), + (194, 0.788), + (195, 0.788), + (196, 0.788), + (197, 0.788), + (198, 0.788), + (199, 0.788), + (200, 0.788), + (201, 0.788), + (202, 0.788), + (203, 0.788), + (204, 0.788), + (205, 0.788), + (206, 0.788), + (207, 0.788), + (208, 0.788), + (209, 0.788), + (210, 0.788), + (211, 0.788), + (212, 0.894), + (213, 0.838), + (214, 1.016), + (215, 0.458), + (216, 0.748), + (217, 0.924), + (218, 0.748), + (219, 0.918), + (220, 0.927), + (221, 0.928), + (222, 0.928), + (223, 0.834), + (224, 0.873), + (225, 0.828), + (226, 0.924), + (227, 0.924), + (228, 0.917), + (229, 0.93), + (230, 0.931), + (231, 0.463), + (232, 0.883), + (233, 0.836), + (234, 0.836), + (235, 0.867), + (236, 0.867), + (237, 0.696), + (238, 0.696), + (239, 0.874), + (183, 0.788), + (241, 0.874), + (242, 0.76), + (243, 0.946), + (244, 0.771), + (245, 0.865), + (246, 0.771), + (247, 0.888), + (248, 0.967), + (249, 0.888), + (250, 0.831), + (251, 0.873), + (252, 0.927), + (253, 0.97), + (183, 0.788), + (183, 0.788), +) + +# ------------------------------------------------------------------------------ +# Glyph list for the built-in font 'Symbol' +# ------------------------------------------------------------------------------ +symbol_glyphs = ( + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (32, 0.25), + (33, 
0.333), + (34, 0.713), + (35, 0.5), + (36, 0.549), + (37, 0.833), + (38, 0.778), + (39, 0.439), + (40, 0.333), + (41, 0.333), + (42, 0.5), + (43, 0.549), + (44, 0.25), + (45, 0.549), + (46, 0.25), + (47, 0.278), + (48, 0.5), + (49, 0.5), + (50, 0.5), + (51, 0.5), + (52, 0.5), + (53, 0.5), + (54, 0.5), + (55, 0.5), + (56, 0.5), + (57, 0.5), + (58, 0.278), + (59, 0.278), + (60, 0.549), + (61, 0.549), + (62, 0.549), + (63, 0.444), + (64, 0.549), + (65, 0.722), + (66, 0.667), + (67, 0.722), + (68, 0.612), + (69, 0.611), + (70, 0.763), + (71, 0.603), + (72, 0.722), + (73, 0.333), + (74, 0.631), + (75, 0.722), + (76, 0.686), + (77, 0.889), + (78, 0.722), + (79, 0.722), + (80, 0.768), + (81, 0.741), + (82, 0.556), + (83, 0.592), + (84, 0.611), + (85, 0.69), + (86, 0.439), + (87, 0.768), + (88, 0.645), + (89, 0.795), + (90, 0.611), + (91, 0.333), + (92, 0.863), + (93, 0.333), + (94, 0.658), + (95, 0.5), + (96, 0.5), + (97, 0.631), + (98, 0.549), + (99, 0.549), + (100, 0.494), + (101, 0.439), + (102, 0.521), + (103, 0.411), + (104, 0.603), + (105, 0.329), + (106, 0.603), + (107, 0.549), + (108, 0.549), + (109, 0.576), + (110, 0.521), + (111, 0.549), + (112, 0.549), + (113, 0.521), + (114, 0.549), + (115, 0.603), + (116, 0.439), + (117, 0.576), + (118, 0.713), + (119, 0.686), + (120, 0.493), + (121, 0.686), + (122, 0.494), + (123, 0.48), + (124, 0.2), + (125, 0.48), + (126, 0.549), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (183, 0.46), + (160, 0.25), + (161, 0.62), + (162, 0.247), + (163, 0.549), + (164, 0.167), + (165, 0.713), + (166, 0.5), + (167, 0.753), + (168, 0.753), + (169, 0.753), + (170, 0.753), + (171, 1.042), + (172, 0.713), + (173, 0.603), + (174, 0.987), + (175, 0.603), + (176, 0.4), + (177, 0.549), + (178, 0.411), + (179, 0.549), + (180, 0.549), + (181, 0.576), + (182, 0.494), + (183, 0.46), + (184, 0.549), + (185, 0.549), + (186, 0.549), + (187, 0.549), + (188, 1), + (189, 0.603), + (190, 1), + (191, 0.658), + (192, 0.823), + (193, 0.686), + (194, 0.795), + (195, 0.987), + (196, 0.768), + (197, 0.768), + (198, 0.823), + (199, 0.768), + (200, 0.768), + (201, 0.713), + (202, 0.713), + (203, 0.713), + (204, 0.713), + (205, 0.713), + (206, 0.713), + (207, 0.713), + (208, 0.768), + (209, 0.713), + (210, 0.79), + (211, 0.79), + (212, 0.89), + (213, 0.823), + (214, 0.549), + (215, 0.549), + (216, 0.713), + (217, 0.603), + (218, 0.603), + (219, 1.042), + (220, 0.987), + (221, 0.603), + (222, 0.987), + (223, 0.603), + (224, 0.494), + (225, 0.329), + (226, 0.79), + (227, 0.79), + (228, 0.786), + (229, 0.713), + (230, 0.384), + (231, 0.384), + (232, 0.384), + (233, 0.384), + (234, 0.384), + (235, 0.384), + (236, 0.494), + (237, 0.494), + (238, 0.494), + (239, 0.494), + (183, 0.46), + (241, 0.329), + (242, 0.274), + (243, 0.686), + (244, 0.686), + (245, 0.686), + (246, 0.384), + (247, 0.549), + (248, 0.384), + (249, 0.384), + (250, 0.384), + (251, 0.384), + (252, 0.494), + (253, 0.494), + (254, 0.494), + (183, 0.46), +) + + +class linkDest(object): + """link or outline destination details""" + + def __init__(self, obj, rlink): + isExt = obj.is_external + isInt = 
not isExt + self.dest = "" + self.fileSpec = "" + self.flags = 0 + self.isMap = False + self.isUri = False + self.kind = LINK_NONE + self.lt = Point(0, 0) + self.named = "" + self.newWindow = "" + self.page = obj.page + self.rb = Point(0, 0) + self.uri = obj.uri + if rlink and not self.uri.startswith("#"): + self.uri = "#page=%i&zoom=0,%g,%g" % (rlink[0] + 1, rlink[1], rlink[2]) + if obj.is_external: + self.page = -1 + self.kind = LINK_URI + if not self.uri: + self.page = -1 + self.kind = LINK_NONE + if isInt and self.uri: + self.uri = self.uri.replace("&zoom=nan", "&zoom=0") + if self.uri.startswith("#"): + self.named = "" + self.kind = LINK_GOTO + m = re.match('^#page=([0-9]+)&zoom=([0-9.]+),(-?[0-9.]+),(-?[0-9.]+)$', self.uri) + if m: + self.page = int(m.group(1)) - 1 + self.lt = Point(float((m.group(3))), float(m.group(4))) + self.flags = self.flags | LINK_FLAG_L_VALID | LINK_FLAG_T_VALID + else: + m = re.match('^#page=([0-9]+)$', self.uri) + if m: + self.page = int(m.group(1)) - 1 + else: + self.kind = LINK_NAMED + self.named = self.uri[1:] + else: + self.kind = LINK_NAMED + self.named = self.uri + if obj.is_external: + if self.uri.startswith(("http://", "https://", "mailto:", "ftp://")): + self.isUri = True + self.kind = LINK_URI + elif self.uri.startswith("file://"): + self.fileSpec = self.uri[7:] + self.isUri = False + self.uri = "" + self.kind = LINK_LAUNCH + ftab = self.fileSpec.split("#") + if len(ftab) == 2: + if ftab[1].startswith("page="): + self.kind = LINK_GOTOR + self.fileSpec = ftab[0] + self.page = int(ftab[1][5:]) - 1 + else: + self.isUri = True + self.kind = LINK_LAUNCH + + +# ------------------------------------------------------------------------------- +# "Now" timestamp in PDF Format +# ------------------------------------------------------------------------------- +def get_pdf_now() -> str: + import time + + tz = "%s'%s'" % ( + str(abs(time.altzone // 3600)).rjust(2, "0"), + str((abs(time.altzone // 60) % 60)).rjust(2, "0"), + ) + tstamp = time.strftime("D:%Y%m%d%H%M%S", time.localtime()) + if time.altzone > 0: + tstamp += "-" + tz + elif time.altzone < 0: + tstamp += "+" + tz + else: + pass + return tstamp + + +def get_pdf_str(s: str) -> str: + """ Return a PDF string depending on its coding. + + Notes: + Returns a string bracketed with either "()" or "<>" for hex values. + If only ascii then "(original)" is returned, else if only 8 bit chars + then "(original)" with interspersed octal strings \nnn is returned, + else a string "" is returned, where [hexstring] is the + UTF-16BE encoding of the original. + """ + if not bool(s): + return "()" + + def make_utf16be(s): + r = bytearray([254, 255]) + bytearray(s, "UTF-16BE") + return "<" + r.hex() + ">" # brackets indicate hex + + # The following either returns the original string with mixed-in + # octal numbers \nnn for chars outside the ASCII range, or returns + # the UTF-16BE BOM version of the string. 
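+ # For example, a pure ASCII input like "Hello" is returned unchanged as
+ # "(Hello)", while any character above 0xFF switches the whole string to
+ # the UTF-16BE hex form, e.g. "<feff...>".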
+ r = "" + for c in s: + oc = ord(c) + if oc > 255: # shortcut if beyond 8-bit code range + return make_utf16be(s) + + if oc > 31 and oc < 127: # in ASCII range + if c in ("(", ")", "\\"): # these need to be escaped + r += "\\" + r += c + continue + + if oc > 127: # beyond ASCII + r += "\\%03o" % oc + continue + + # now the white spaces + if oc == 8: # backspace + r += "\\b" + elif oc == 9: # tab + r += "\\t" + elif oc == 10: # line feed + r += "\\n" + elif oc == 12: # form feed + r += "\\f" + elif oc == 13: # carriage return + r += "\\r" + else: + r += "\\267" # unsupported: replace by 0xB7 + + return "(" + r + ")" + + +def getTJstr(text: str, glyphs: typing.Union[list, tuple, None], simple: bool, ordering: int) -> str: + """ Return a PDF string enclosed in [] brackets, suitable for the PDF TJ + operator. + + Notes: + The input string is converted to either 2 or 4 hex digits per character. + Args: + simple: no glyphs: 2-chars, use char codes as the glyph + glyphs: 2-chars, use glyphs instead of char codes (Symbol, + ZapfDingbats) + not simple: ordering < 0: 4-chars, use glyphs not char codes + ordering >=0: a CJK font! 4 chars, use char codes as glyphs + """ + if text.startswith("[<") and text.endswith(">]"): # already done + return text + + if not bool(text): + return "[<>]" + + if simple: # each char or its glyph is coded as a 2-byte hex + if glyphs is None: # not Symbol, not ZapfDingbats: use char code + otxt = "".join(["%02x" % ord(c) if ord(c) < 256 else "b7" for c in text]) + else: # Symbol or ZapfDingbats: use glyphs + otxt = "".join( + ["%02x" % glyphs[ord(c)][0] if ord(c) < 256 else "b7" for c in text] + ) + return "[<" + otxt + ">]" + + # non-simple fonts: each char or its glyph is coded as 4-byte hex + if ordering < 0: # not a CJK font: use the glyphs + otxt = "".join(["%04x" % glyphs[ord(c)][0] for c in text]) + else: # CJK: use the char codes + otxt = "".join(["%04x" % ord(c) for c in text]) + + return "[<" + otxt + ">]" + + +def paper_sizes(): + """Known paper formats @ 72 dpi as a dictionary. Key is the format string + like "a4" for ISO-A4. Value is the tuple (width, height). + + Information taken from the following web sites: + www.din-formate.de + www.din-formate.info/amerikanische-formate.html + www.directtools.de/wissen/normen/iso.htm + """ + return { + "a0": (2384, 3370), + "a1": (1684, 2384), + "a10": (74, 105), + "a2": (1191, 1684), + "a3": (842, 1191), + "a4": (595, 842), + "a5": (420, 595), + "a6": (298, 420), + "a7": (210, 298), + "a8": (147, 210), + "a9": (105, 147), + "b0": (2835, 4008), + "b1": (2004, 2835), + "b10": (88, 125), + "b2": (1417, 2004), + "b3": (1001, 1417), + "b4": (709, 1001), + "b5": (499, 709), + "b6": (354, 499), + "b7": (249, 354), + "b8": (176, 249), + "b9": (125, 176), + "c0": (2599, 3677), + "c1": (1837, 2599), + "c10": (79, 113), + "c2": (1298, 1837), + "c3": (918, 1298), + "c4": (649, 918), + "c5": (459, 649), + "c6": (323, 459), + "c7": (230, 323), + "c8": (162, 230), + "c9": (113, 162), + "card-4x6": (288, 432), + "card-5x7": (360, 504), + "commercial": (297, 684), + "executive": (522, 756), + "invoice": (396, 612), + "ledger": (792, 1224), + "legal": (612, 1008), + "legal-13": (612, 936), + "letter": (612, 792), + "monarch": (279, 540), + "tabloid-extra": (864, 1296), + } + + +def paper_size(s: str) -> tuple: + """Return a tuple (width, height) for a given paper format string. + + Notes: + 'A4-L' will return (842, 595), the values for A4 landscape. + Suffix '-P' and no suffix return the portrait tuple. 
+ """ + size = s.lower() + f = "p" + if size.endswith("-l"): + f = "l" + size = size[:-2] + if size.endswith("-p"): + size = size[:-2] + rc = paper_sizes().get(size, (-1, -1)) + if f == "p": + return rc + return (rc[1], rc[0]) + + +def paper_rect(s: str) -> Rect: + """Return a Rect for the paper size indicated in string 's'. Must conform to the argument of method 'PaperSize', which will be invoked. + """ + width, height = paper_size(s) + return Rect(0.0, 0.0, width, height) + + +def CheckParent(o: typing.Any): + if getattr(o, "parent", None) == None: + raise ValueError("orphaned object: parent is None") + + +def EnsureOwnership(o: typing.Any): + if not getattr(o, "thisown", False): + raise RuntimeError("object destroyed") + + +def CheckColor(c: OptSeq): + if c: + if ( + type(c) not in (list, tuple) + or len(c) not in (1, 3, 4) + or min(c) < 0 + or max(c) > 1 + ): + raise ValueError("need 1, 3 or 4 color components in range 0 to 1") + + +def ColorCode(c: typing.Union[list, tuple, float, None], f: str) -> str: + if not c: + return "" + if hasattr(c, "__float__"): + c = (c,) + CheckColor(c) + if len(c) == 1: + s = "%g " % c[0] + return s + "G " if f == "c" else s + "g " + + if len(c) == 3: + s = "%g %g %g " % tuple(c) + return s + "RG " if f == "c" else s + "rg " + + s = "%g %g %g %g " % tuple(c) + return s + "K " if f == "c" else s + "k " + + +def JM_TUPLE(o: typing.Sequence) -> tuple: + return tuple(map(lambda x: round(x, 5) if abs(x) >= 1e-4 else 0, o)) + + +def JM_TUPLE3(o: typing.Sequence) -> tuple: + return tuple(map(lambda x: round(x, 3) if abs(x) >= 1e-3 else 0, o)) + + +def CheckRect(r: typing.Any) -> bool: + """Check whether an object is non-degenerate rect-like. + + It must be a sequence of 4 numbers. + """ + try: + r = Rect(r) + except: + return False + return not (r.is_empty or r.is_infinite) + + +def CheckQuad(q: typing.Any) -> bool: + """Check whether an object is convex, not empty quad-like. + + It must be a sequence of 4 number pairs. + """ + try: + q0 = Quad(q) + except: + return False + return q0.is_convex + + +def CheckMarkerArg(quads: typing.Any) -> tuple: + if CheckRect(quads): + r = Rect(quads) + return (r.quad,) + if CheckQuad(quads): + return (quads,) + for q in quads: + if not (CheckRect(q) or CheckQuad(q)): + raise ValueError("bad quads entry") + return quads + + +def CheckMorph(o: typing.Any) -> bool: + if not bool(o): + return False + if not (type(o) in (list, tuple) and len(o) == 2): + raise ValueError("morph must be a sequence of length 2") + if not (len(o[0]) == 2 and len(o[1]) == 6): + raise ValueError("invalid morph parm 0") + if not o[1][4] == o[1][5] == 0: + raise ValueError("invalid morph parm 1") + return True + + +def CheckFont(page: "struct Page *", fontname: str) -> tuple: + """Return an entry in the page's font list if reference name matches. + """ + for f in page.get_fonts(): + if f[4] == fontname: + return f + + +def CheckFontInfo(doc: "struct Document *", xref: int) -> list: + """Return a font info if present in the document. 
+ """ + for f in doc.FontInfos: + if xref == f[0]: + return f + + +def UpdateFontInfo(doc: "struct Document *", info: typing.Sequence): + xref = info[0] + found = False + for i, fi in enumerate(doc.FontInfos): + if fi[0] == xref: + found = True + break + if found: + doc.FontInfos[i] = info + else: + doc.FontInfos.append(info) + + +def DUMMY(*args, **kw): + return + + +def planish_line(p1: point_like, p2: point_like) -> Matrix: + """Compute matrix which maps line from p1 to p2 to the x-axis, such that it + maintains its length and p1 * matrix = Point(0, 0). + + Args: + p1, p2: point_like + Returns: + Matrix which maps p1 to Point(0, 0) and p2 to a point on the x axis at + the same distance to Point(0,0). Will always combine a rotation and a + transformation. + """ + p1 = Point(p1) + p2 = Point(p2) + return Matrix(util_hor_matrix(p1, p2)) + + +def image_profile(img: ByteString) -> dict: + """ Return basic properties of an image. + + Args: + img: bytes, bytearray, io.BytesIO object or an opened image file. + Returns: + A dictionary with keys width, height, colorspace.n, bpc, type, ext and size, + where 'type' is the MuPDF image type (0 to 14) and 'ext' the suitable + file extension. + """ + if type(img) is io.BytesIO: + stream = img.getvalue() + elif hasattr(img, "read"): + stream = img.read() + elif type(img) in (bytes, bytearray): + stream = img + else: + raise ValueError("bad argument 'img'") + + return TOOLS.image_profile(stream) + + +def ConversionHeader(i: str, filename: OptStr ="unknown"): + t = i.lower() + html = """ + + + + +\n""" + + xml = ( + """ +\n""" + % filename + ) + + xhtml = """ + + + + + +\n""" + + text = "" + json = '{"document": "%s", "pages": [\n' % filename + if t == "html": + r = html + elif t == "json": + r = json + elif t == "xml": + r = xml + elif t == "xhtml": + r = xhtml + else: + r = text + + return r + + +def ConversionTrailer(i: str): + t = i.lower() + text = "" + json = "]\n}" + html = "\n\n" + xml = "\n" + xhtml = html + if t == "html": + r = html + elif t == "json": + r = json + elif t == "xml": + r = xml + elif t == "xhtml": + r = xhtml + else: + r = text + + return r + +class ElementPosition(object): + """Convert a dictionary with element position information to an object.""" + def __init__(self): + pass + def __str__(self): + ret = "" + for n, v in self.__dict__.items(): + ret += f" {n}={v!r}" + return ret + +def make_story_elpos(): + return ElementPosition() + + +def get_highlight_selection(page, start: point_like =None, stop: point_like =None, clip: rect_like =None) -> list: + """Return rectangles of text lines between two points. + + Notes: + The default of 'start' is top-left of 'clip'. The default of 'stop' + is bottom-reight of 'clip'. + + Args: + start: start point_like + stop: end point_like, must be 'below' start + clip: consider this rect_like only, default is page rectangle + Returns: + List of line bbox intersections with the area established by the + parameters. 
+ """ + # validate and normalize arguments + if clip is None: + clip = page.rect + clip = Rect(clip) + if start is None: + start = clip.tl + if stop is None: + stop = clip.br + clip.y0 = start.y + clip.y1 = stop.y + if clip.is_empty or clip.is_infinite: + return [] + + # extract text of page, clip only, no images, expand ligatures + blocks = page.get_text( + "dict", flags=0, clip=clip, + )["blocks"] + + lines = [] # will return this list of rectangles + for b in blocks: + bbox = Rect(b["bbox"]) + if bbox.is_infinite or bbox.is_empty: + continue + for line in b["lines"]: + bbox = Rect(line["bbox"]) + if bbox.is_infinite or bbox.is_empty: + continue + lines.append(bbox) + + if lines == []: # did not select anything + return lines + + lines.sort(key=lambda bbox: bbox.y1) # sort by vertical positions + + # cut off prefix from first line if start point is close to its top + bboxf = lines.pop(0) + if bboxf.y0 - start.y <= 0.1 * bboxf.height: # close enough? + r = Rect(start.x, bboxf.y0, bboxf.br) # intersection rectangle + if not (r.is_empty or r.is_infinite): + lines.insert(0, r) # insert again if not empty + else: + lines.insert(0, bboxf) # insert again + + if lines == []: # the list might have been emptied + return lines + + # cut off suffix from last line if stop point is close to its bottom + bboxl = lines.pop() + if stop.y - bboxl.y1 <= 0.1 * bboxl.height: # close enough? + r = Rect(bboxl.tl, stop.x, bboxl.y1) # intersection rectangle + if not (r.is_empty or r.is_infinite): + lines.append(r) # append if not empty + else: + lines.append(bboxl) # append again + + return lines + + +def annot_preprocess(page: "Page") -> int: + """Prepare for annotation insertion on the page. + + Returns: + Old page rotation value. Temporarily sets rotation to 0 when required. + """ + CheckParent(page) + if not page.parent.is_pdf: + raise ValueError("is no PDF") + old_rotation = page.rotation + if old_rotation != 0: + page.set_rotation(0) + return old_rotation + + +def annot_postprocess(page: "Page", annot: "Annot") -> None: + """Clean up after annotation inertion. + + Set ownership flag and store annotation in page annotation dictionary. + """ + annot.parent = weakref.proxy(page) + page._annot_refs[id(annot)] = annot + annot.thisown = True + + +def sRGB_to_rgb(srgb: int) -> tuple: + """Convert sRGB color code to an RGB color triple. + + There is **no error checking** for performance reasons! + + Args: + srgb: (int) RRGGBB (red, green, blue), each color in range(255). + Returns: + Tuple (red, green, blue) each item in intervall 0 <= item <= 255. + """ + r = srgb >> 16 + g = (srgb - (r << 16)) >> 8 + b = srgb - (r << 16) - (g << 8) + return (r, g, b) + + +def sRGB_to_pdf(srgb: int) -> tuple: + """Convert sRGB color code to a PDF color triple. + + There is **no error checking** for performance reasons! + + Args: + srgb: (int) RRGGBB (red, green, blue), each color in range(255). + Returns: + Tuple (red, green, blue) each item in intervall 0 <= item <= 1. + """ + t = sRGB_to_rgb(srgb) + return t[0] / 255.0, t[1] / 255.0, t[2] / 255.0 + + +def make_table(rect: rect_like =(0, 0, 1, 1), cols: int =1, rows: int =1) -> list: + """Return a list of (rows x cols) equal sized rectangles. + + Notes: + A utility to fill a given area with table cells of equal size. + Args: + rect: rect_like to use as the table area + rows: number of rows + cols: number of columns + Returns: + A list with items, where each item is a list of + PyMuPDF Rect objects of equal sizes. 
+ """ + rect = Rect(rect) # ensure this is a Rect + if rect.is_empty or rect.is_infinite: + raise ValueError("rect must be finite and not empty") + tl = rect.tl + + height = rect.height / rows # height of one table cell + width = rect.width / cols # width of one table cell + delta_h = (width, 0, width, 0) # diff to next right rect + delta_v = (0, height, 0, height) # diff to next lower rect + + r = Rect(tl, tl.x + width, tl.y + height) # first rectangle + + # make the first row + row = [r] + for i in range(1, cols): + r += delta_h # build next rect to the right + row.append(r) + + # make result, starts with first row + rects = [row] + for i in range(1, rows): + row = rects[i - 1] # take previously appended row + nrow = [] # the new row to append + for r in row: # for each previous cell add its downward copy + nrow.append(r + delta_v) + rects.append(nrow) # append new row to result + + return rects + + +def repair_mono_font(page: "Page", font: "Font") -> None: + """Repair character spacing for mono fonts. + + Notes: + Some mono-spaced fonts are displayed with a too large character + width, e.g. "a b c" instead of "abc". This utility adds an entry + "/DW w" to the descendent font of font. The int w is + taken to be the first width > 0 of the font's unicodes. + This should enforce viewers to use 'w' as the character width. + + Args: + page: fitz.Page object. + font: fitz.Font object. + """ + def set_font_width(doc, xref, width): + df = doc.xref_get_key(xref, "DescendantFonts") + if df[0] != "array": + return False + df_xref = int(df[1][1:-1].replace("0 R","")) + W = doc.xref_get_key(df_xref, "W") + if W[1] != "null": + doc.xref_set_key(df_xref, "W", "null") + doc.xref_set_key(df_xref, "DW", str(width)) + return True + + if not font.flags["mono"]: # font not flagged as monospaced + return None + doc = page.parent # the document + fontlist = page.get_fonts() # list of fonts on page + xrefs = [ # list of objects referring to font + f[0] + for f in fontlist + if (f[3] == font.name and f[4].startswith("F") and f[5].startswith("Identity")) + ] + if xrefs == []: # our font does not occur + return + xrefs = set(xrefs) # drop any double counts + maxadv = max([font.glyph_advance(cp) for cp in font.valid_codepoints()[:3]]) + width = int(round((maxadv * 1000))) + for xref in xrefs: + if not set_font_width(doc, xref, width): + print("Cannot set width for '%s' in xref %i" % (font.name, xref)) + + +# Adobe Glyph List functions +import base64, gzip + +_adobe_glyphs = {} +_adobe_unicodes = {} +def unicode_to_glyph_name(ch: int) -> str: + if _adobe_glyphs == {}: + for line in _get_glyph_text(): + if line.startswith("#"): + continue + name, unc = line.split(";") + uncl = unc.split() + for unc in uncl: + c = int(unc[:4], base=16) + _adobe_glyphs[c] = name + return _adobe_glyphs.get(ch, ".notdef") + + +def glyph_name_to_unicode(name: str) -> int: + if _adobe_unicodes == {}: + for line in _get_glyph_text(): + if line.startswith("#"): + continue + gname, unc = line.split(";") + c = int(unc[:4], base=16) + _adobe_unicodes[gname] = c + return _adobe_unicodes.get(name, 65533) + +def adobe_glyph_names() -> tuple: + if _adobe_unicodes == {}: + for line in _get_glyph_text(): + if line.startswith("#"): + continue + gname, unc = line.split(";") + c = int("0x" + unc[:4], base=16) + _adobe_unicodes[gname] = c + return tuple(_adobe_unicodes.keys()) + +def adobe_glyph_unicodes() -> tuple: + if _adobe_unicodes == {}: + for line in _get_glyph_text(): + if line.startswith("#"): + continue + gname, unc = line.split(";") + c = 
int("0x" + unc[:4], base=16) + _adobe_unicodes[gname] = c + return tuple(_adobe_unicodes.values()) + +def _get_glyph_text() -> bytes: + return gzip.decompress(base64.b64decode( + b'H4sIABmRaF8C/7W9SZfjRpI1useviPP15utzqroJgBjYWhEkKGWVlKnOoapVO0YQEYSCJE' + b'IcMhT569+9Ppibg8xevHdeSpmEXfPBfDZ3N3f/t7u//r//k/zb3WJ4eTv2T9vzXTaZZH/N' + b'Junsbr4Z7ru7/7s9n1/+6z//8/X19T/WRP7jYdj/57//R/Jv8Pax2/Sn87G/v5z74XC3Pm' + b'zuLqfurj/cnYbL8aEzyH1/WB/f7h6H4/70l7vX/ry9G47wzK/hcr7bD5v+sX9YM4i/3K2P' + b'3d1Ld9z353O3uXs5Dl/7DT7O2/UZ/3Tw9zjsdsNrf3i6exgOm57eTsbbvjv/1w2xTnfDo5' + b'fnYdjA3eV0vjt25zXkRJB36/vhKwN+kEw4DOf+ofsLuP3pboewGISO7bAxPkUU+EaUD7t1' + b'v++O/3FTCESmcsILgQRuLhDs/w857lz6NsPDZd8dzmtfSP85HO8GcI53+/W5O/br3QkeJa' + b'9NERmPKgE2Ue+73vgj97Ded5TH1pPDEFCT4/35RFFtAMORMezXb3dwiioCsYe77rABjjCO' + b'jHs/nLs7mx3wuYFYX+HsEQyTfHg/DY/nVxa0rzmnl+6BVQfeegTyemSlOdjqczqJ0J9/ev' + b'fp7tOH1ed/zj+2d/j+9eOHf7xbtsu75jcw27vFh19/+/jux58+3/304edl+/HT3fz9kq3i' + b'w/vPH981Xz5/APR/5p/g9/+Qhb+/3bX/8+vH9tOnuw8f79798uvP7xAcwv84f//5XfvpL/' + b'D97v3i5y/Ld+9//Msdgrh7/+Hz3c/vfnn3GQ4/f/iLifja492HFbz+0n5c/ARg3rz7+d3n' + b'30ycq3ef3zO+FSKc3/06//j53eLLz/OPd79++fjrh0/tHRIHr8t3nxY/z9/90i7/AxIg1r' + b'v2H+37z3effpr//PPN1CIF47Q2LUSdNz+3NjakdvnuY7v4/BcEGb4WyEPI+DMT++nXdvEO' + b'n8iWFomaf/ztL8wZhPqp/e8vcAbm3XL+y/xHpPH/xlnDejXKHJTQ4svH9hdK/mF19+lL8+' + b'nzu89fPrd3P374sDSZ/qn9+I93i/bTD/D+8wcWxOruy6f2L4jl89xEjkCQaZ9+4Hfz5dM7' + b'k33v3n9uP3788uvndx/e/zu8/vThn8ggSDqH56XJ6Q/vTZKRVx8+/sZgmRemIP5y98+fWu' + b'Ao8vc+z+bMjE/Iu8Vn7RBxIis/q7TevW9//Pndj+37RWuz/AND+ue7T+2/o+zefaKTdzbq' + b'f84R7xeTdJYYJLOf7z4xq11N/osp2bt3q7v58h/vKLxzjtrw6Z2rOSbzFj+5rEd7+P84UL' + b'xH8/6vO/lj2/6Pu7eX7d3P6C3Y2tb3u+7ua3dkA/yvu+w/JqyV6GeUt0/dy7nb36MjySZ/' + b'MUMO3Hz5+LNycsdx54SB5wmN/XJvRh0z/vz1/PaCf4Zhd/rP9dPur/j7eDDtfIV+dX3+r7' + b'vz63B36vb9w7AbDn/ddLseown7kr7bbU4YIhD6/03//e7JiM0O669/vbyg1/hPdKLd8WGN' + b'PmnXoSs52h5200OGk/WW/fvdl0NvhpHTw3q3Pt59Xe8uCOARA8ydCcX433Z/rjfonfbrnf' + b'hP5j9MJtM0mbf4XZT4XT9czt0Pk3S1ALFfPxyHA6g2A3WCz90Pq6qFO+dsskjdtzAB3B+7' + b'rwwDeWi/reu0nbcOeMBostv1Dz9MpsuJwzbD+b5DcuGuKR32dFx/pcfGO9oOw7MZlAj64M' + b'/9bmOAaTJ/WFuJF0t898eHXfdDNmV4JC77x133J8XONCDiTTWq5JkvNMMLNY9C1ZLNa82R' + b'rIki9ULP50AZ/6pczOyn92DSE3IqRSZs7nc2+gmqKMi+O3an/sQkTQOpszcLsBTnsg2gSE' + b'f/KskTQ4YaANrFPFn4b/ELIEo/Iu2jQkbg/QEtEJXe1Y6MtWP3sl3/MMlnqf08D4cBaclr' + b'5KzEzHTuyXhZPyCXVhkcD0/DoXsmEwEfoWVQqsJ+Sg2eW9qniOGQFqHh3n+XCNMWCMLJ3b' + b'c4BPB2vz5CYenXkKjI06Rhu8mSJlSxKmmQX+uHB6g1jC0ztEQ+TRqdISmC6A46TLiH/sfM' + b'wBczE0mo4WrXHzoJpUyaKCvglLnpJC1XiEWSBN55eIHcDChLFpQ4TxZrHWkL2mUXwl6Yto' + b'N6OLefEmyRLHy7mizwDT1yt1szryqhfCOa1AJJBtKVZFRtCd8WU3pATvFrbr5cHlo6Dome' + b'tzoF0xmAbn3/vF2fgKgcbhbkKCCrCKBYETp0uZt+2siJ5pSGc92+kOVgbLVIOREE/rw+jc' + b'JfNGSxGWBysYMmOzxrCU3qelSBOUV1VQCf456kXEGaqB4gykGJUKTJQupBnixZ9NNk+S+2' + b'ihS/0kkCjOoD6ccjhCO3niVLKfYW367Y0xY90TIU6MwSVkRfVdMM6HFYsxzpPGobc0NLrV' + b'4ky6htQIoOA9rLmWTeIupuh6aRZaij5vPp2LH15zO49PmEMH1niBrcCCWd60KgH00/Bmgp' + b'kM8t9NzL/mm930scS/j7XYuHlr2MGiXkiwoDQvnESoFVyfKEarx1uSGFA7ehkULobywiRP' + b'BNiqgAcbOCo9MFRwtGp1GVn6wSDuzTImllwJ65b2mcAPyAjZxvfcTpHN+2xC0bZboApKt6' + b'joBDPZhbIgyyEeD7B7Sx9kZ1qTWqKgeUkvZ66MUI1N4eejGytzeG3kgUP/QumFyVWyD1+E' + b'pSja9NICVYYqbrSkvzJV2Xo0WhQfIedV+EsGU0rd23hAogyuUKtNZ7kBjOxTEPBT9LS/Cv' + b'BlfE32OqDgVzo+JFfWt3uqkhATv4OEhYCFtGXrRhR/jCY7Is4kuCVWavQ0QdiVoDqoiute' + b'kS9K0eFjpDy3E8nc75EdVjKGbtgVmg+1KkWtQAVp/hpaPQM1SNl1O/YwryWeEJUS3gUkeb' + b'wTnzDLP+DdtgG0jtClLrXh86SHu6mQoIb1r5HM1KWjmksEN7xQ9VsjVpEQ1ezvA7gUqMD+' + b'97RcpruAv3Le0G8V2Oww/ZBDpq+40xQxPBh2/G6D1BqRSiKq7YJ5TJKjTdJlnpDjptk1U0' + 
b'phVwrbvkabJy/S5Ut1UPnyELqgwIovM1Cm6jCoGgMDERdp6sJJ/K5EeKViU/Nqc/Lutj90' + b'OeYwD8UVS6Kb7RNzMrc/sZhqsZmYenfh3EnCc/StfWJj9KniAe0WFSKFE/hpxYWEK0k5TA' + b'wIh806Z72+hRd37UjZ50NJBBxu16o3UD+N1iHrjZ7LpRfab42+5KJ5gZH5eX8+WomxFq+Y' + b'++BBALJnWqVgGIRywArlFjJgefUXkgf/142NpPKQ84le/KfdtYs1kD2gjLDJ0mP7Hg6uSn' + b'tEb8P2TFYmW+p/xGo+B3kfK7SX7CQF4ZPE1++lUKGh3sT+tbAx3G5J/WN5WyDIzj5tQ/ae' + b'cZYrMDKqraT6b8fWshK2gxGcINBb+0hBQ8uuifpPuHY4SlmwhqwU+qg6frKFcRttbIphPQ' + b'R9WCwJesxfcF85bjZb9bX84siFWEiBYBh98kv1AF3jHTZ8k7PUvMVsm7v0F+TCjefdF4m7' + b'wTJWDpvmXIAeBbSrZI3on2gcBCFrWWCAN8BEhYRFXlK5N3elStQapRdRVIP8hQ0huaNirZ' + b'u6sBmN5NW8wn5kvaoqNFjZgn77qrpQeIFrXXInn3eFw/o62hZ8IU7Z2M0Qv3LREDiNQOJK' + b'vXQZEej8mQoT9th+NZO0TxyYCL+ukInW4UZFS14AO1SrX3Jnk36ByH4DIyMjMHO/jMzJfq' + b'MEsDhNLI0VCJyIAEUiopfEt7xzj2zk2XU9T0d9GQxPrzbdufT9GgMPWgrwuaWSZ/Y02eJ3' + b'+L5nZp8rdQ+VaWkPaJucrfok6uTv42mog1yd+ijEP4kpx58ndG2SR/V0NNkfz976E/WiZ/' + b'X99DZ3/uoxF+AtjV1Nx8q8JEqDd7qhkZYwUmB/byYoqG7OuuvwX63cnibJH8XQa0Gt8yoO' + b'UlKJ9v0JT/Ho9fZKuWgX7i7/FYPwUQLU2skr9vdTKh0/19q9UBhOgHI0gSjz0QU8+WUGx/' + b'jwoFJTAgF5SXemIhmYEhH066cZUEfEE2yc8syEXyM3s9aIU//4yuEtXlZ6815DN87+83Jq' + b'fh3OdavsR3yDVyJNdSS8STlByRjPISnlz/szJfgWNp8VoGUoZiqH8/969RViOG35kMcOJs' + b'RBqibJwnP0fZCI9+gol2Y79l3IBnya9F8gvza5n8oip+mfxihVqVUD7tt0yJVwRchW+TX0' + b'ImZckvekjEGPeLSjJ0nV+iejSdJr9EMkMGEQvfVHGMioqq/cuFhbVI3lPWNnlvynaevPdl' + b'Os2T974coS++D+WIye77IGJuibgc0dG8j8uRnqKkTA0tHsrkPSv4rnuk69kyeY+yEBW2Tt' + b'6bQmvwGxUa4tGFBv3ofZQBSNjwqnMI8UiOgOmXJJep+5Y5AQCTQ8vkA3NolXzARD8tMvxK' + b'qc+TD37AX+buWwIAACXpGM1y0I048Nbwi+C8ioAS+eBzH7J9YK7Bw8aPCTPIE8pgaglRG5' + b'YR4KsW6t2HmysAy1oz/LxzmWlUD8Vx8JLgCPXzKWgAH3T/jXRhfPKVrJgYUlSXBcigutDv' + b'rXxSsEROTCkjCMiMz1JUDQCnajBhkaqxAhD1zwXoPeodVNIPkQ7Skj6yUDBImU/J3LmllR' + b'BtZiHJ0IWlo6x0IfrsahmsVlVtHvWMEcFdKTzwLroNeugP8WICa2u8mMDA9t3T2iWOn7rb' + b'd1w/LmCKbejjcDnoalzNLX7uzzutF1ULh3v1BrV031vx8pkQwqZz3VrhQjV6CCNKFtuGJc' + b'J+CXy7FQn0rh9c3zxhZTbfMqVtHSDFTRe+D0CUduDXzrX6WJH2vUThvn0GM8sNoOYxU+9B' + b'4iuSX+EZWf+rFMw0+TU0X/B111iUya+R0rwCHaldcwA3p7hzeLXr2/ywCsMccRkI8fevR1' + b'3P8+RXnf9Qtn49Gac1P3QmkOOSg+//ZnLS5L9DEsrkv6OQwBT3afKR7rPkY6R7LkD7bmCa' + b'fPS9XVHjW8Ya5MXHEEsFIhpVyFb9RzoBqXOyNrRvkMU8kKIiFJAj1s4QiJqjgL0dmCdIRt' + b'jbKlcLknFrTJFEPRoVbfIxyhXwJVf8tw8E/ut0hJ0uLx2tXMBryuQTczFPPq24YzeZYHqP' + b'/hJU5qh0Sir31ITU1FM1qcJRufFXOiozVOV5JpTa+zO8mXdJnoncxM4YUpElI+VdlimozL' + b'ssycu8SxQaKC81OltQXuqS6cu81IUJxUtdVKS81MWSlJe6oJyZl7poQOXisiUlLlekxOWc' + b'lJe6YPqmIvWMlJe6pNRTL3XJtE+91IWhvNQlZZl6qUtKPfWylCyHqZelNPF5WUrmxFRkYe' + b'yFl6Wgv0JykPlZSA4yzwrJQaa9EFmQPmll/ls3EYqw3r/0vsvHAPTJN8XSf0ceSgdKS0BB' + b'qAaLzH7YvvITvb/51OsBtYVubaNDutDSa0vIXJTlGzX9jDU6kmtiaN/2WOU8GTmDt7gzhf' + b'jR+jzSF2+AVgT05AxBbB9iCIUVzdcQ+zZy0SB5236vlk6Rov7JrLTOUYD9nyIAqkHUa4A7' + b'PJ7Ha3DwLn0JXJwZlszn5slndhbT5POaSiyGgM92wQ6p+yzFCzQUHDLsc8j/mSVirR49/+' + b'e4/6WnKHfnhpZCWCSfow1iOL+5+Tunw1AEiL07n6KNW8i6dbv3NT7d0LbgJ/WxCRQp8ymD' + b'Lmlkh4SJqNWgXJIfzwyh4n/WvTemB5+jcoAIesERk97PUEgee6OwNwtDnXrW1npqiPPrQC' + b'Gr5POxg47h1WhiCDtKH5Sxz6d4Z7EB4gsY4b12O7XkD+brIFSafGFxF8kXmY7M3bfkBwA/' + b'uUCxfJHJRY5vKfa5JcJEotGA1INSoxID3aoUIWCl6aPufNEj9RSk0vQXgfQ+llXAJOYsYJ' + b'KCmcKU2cAkwC7WlMm5NtUpAihpoTxKk4e0MnuYuW9xC0Cr9JiefPGThJX99Gofpn9fRpME' + b'iqknCVB0v4wnCegqvkSThBZ0PElg9mpIZwTy7EpTgYxab6wgmGQIGvGX6zXS1oNK1a3oUj' + b'cRZKWo7Cwr2SacF55I2T8Jy+QM03p6298PO+nAcnEgi6lN6jG9ntqMwRuBTb2bwIuEkPkI' + b'0mhNnVI0/i/jheQJMd8ikR7MG9bcJdb9WBvga+MTlJGfv2MY+hLNJCoPSFWfJv9goy6Tf4' + b'T22ST/UHUHU5N/RBOFDHS02gEHrsdpwIuKCuFG2yd18g9JHHi+rmFK90+KUSX/9KLWWfLP' + 
b'INLCEjJSQ+5/qipSk1QjBKZq/1RJqOvkn77q15Pkn5GIiFNEqpL/oRh18j8h6mXyPzqmBU' + b'gd0zz5n2ikz+Ges5tZm/xPFA8ClXjq5DfGM0t+k6506b6lwRPQpY6x5bcgVWuJkCFl8luo' + b'sSljuOpuVsC06K2hpY+YJr9hHqA714bI5Va3h+B9hqLl/+aLP7efvktZQSi9wzEtQOu6Xo' + b'GOhkfonL9FuYYsklzDt68wFOByuu+fdAbNHXbLYGJB3q4/n3e6LkNREfiWrzr5F8tpnvwr' + b'Mq8qQfsRZ5aIGVa1dN8y/K8ASJE5whVZ2s4myb/sonPVmC9ReBztS2aWJf+KWmAF+ub2RE' + b'3GDa23BW7VGoi+7XRa5gTGO2qLlKiO0vi7Gafl3Ih0kfxLazqzafKvqGgRsxQtv/2uVFMk' + b'tEmEvrFe33cYbXZoTzM06bVvLC1Zm+4rnM0mxJ8uv6+P6zPczWtLH/eXZ65RzA1/v0Z3qc' + b'C8BXi8yML5JAf9dYD2QwU4RNq0Gncx5hGooqbre2Zlb87D7NfHZ121VxFXBYhhVScUyb8f' + b'Xob98Dj8kNN+ay2G2Ln7FkvnlQN0vqcO03ZLlcPEENs7igySfPBipgJRZAsZiZO6vJxYQl' + b'Q4TEXWNwyxC41qq+SlZoghdqXRyBB5pjlict0kvkZAczefJoKH/T2qelpZyFKT1FFDRLoS' + b'KJx3LtkMXCRBYzUABm0XwJQ+Qi7nyAG9pgzuZrN+VnWsIuTqKPJB6aFQ9G7OTfMAB70Rgu' + b'iMSw0ZlidBmxaBWh4WF5G73fNw7FDvcq7srrvgAZE89v2EO/g/QOzCkvVsmtL4aGrIdII+' + b'yFqqe7K2xs6enFlFwJHZxFrJeDK11p+ezOyevCdzu7ftyantXjxZ2A7Ok6XdhPdkZbfaPV' + b'nbzVpPzqwpnCPzibVj82RqzdY8mdmNAk/mdg3Uk1NrU+bJwhqLebK000xPVnYm4snaWgZ6' + b'cma3Wh05ndiJmCdTa9LsycxO/T2Z22m/J6fWLsaThR2kPVnaGbsnK2vw5snaGo94cmZtTB' + b'xZTKwxkidTayDrycxaH3kyt1aWnpxao1VPFtZaxJOlHeg9Wdk9fk/WdlPUkzO73ebIcmKn' + b'qJ5M7Ua0JzOrLnsyp8WNSFVOSYpUZeEarSMpVS4FWlKqXNJbUqpc0ltSqlxCrihVLiFXlK' + b'qQoCpKlUvyK+ZVLsmvmFe5JL8yUknyKyOVJL8yUknyKyOVJL8yUkn51kYqyY2aUuVSvjWl' + b'mkrya0o1FZlrSjWV5NeUairJrynVVJJfU6qpJL+mVFNJb02pppLeGaWaSnpnlGoq6Z0ZqS' + b'S9MyOVpHdmpJL0zoxUkt6ZkUrSOzNSSXpnlGomCZxRqsInEADJXEhTglMhKVVRCEmpilJI' + b'SlVUQlKqohaSUhUzISlVMReSUhWNkEYqn8A0NVL5FKWmdU9WQpZ2DuDJyppoerK2xjmORM' + b'ai8ovMJmMLCcpkbCnJNxlbBZIRVT75NbpNBFUJaUL26a2NVEub3gy5nE1cg8y5MDxx4mO4' + b'JWHLrqhyVs6ynAsJ4UvXrkGyVpTlRMicZCrklGQmZEEyF7IkORWyIlkIyYjKUsgZycqRU9' + b'aKsqyFNELOhKQYbnAhyZDdeEGSQWVeyCmLsswyIRlUlgvJBGZTIRlyVgjJBGalkExgJkKm' + b'TGAmQnKYLjMRksN0mc2FNFKJzJmRaiGkkWoppJGqFdJIJQnkMF3mEyEpVS7p5TBd5pJeDt' + b'NlLunlMF3mkl4O02Uu6eUwXeaSXg7TZS7p5TBd5pJeDtNlLunNjVSSXo6t5VSE5NhaTkVI' + b'jq3lVITk2FpORUiOreVUhGTrK6ciJOt5ORUh2dzKqUjFwbScilSFEUOkKowYUgqFEUNKoT' + b'BiSCkURgwphcKIIaXAwbQsJIEcTMtCEsjBtCwkgZURw+dkwZ6qnE+FZFBVKySDqkshGdSs' + b'FpIJnHsxClOfq5mQTFEtjk19nqVCMkXNXEgGtfRCFqYElz6fUQ+ohXrHJUuhaLyQJRNYLH' + b'yRoZ2DXE6EpONlKmRJMhOyIhn8MqjlVMgZSRGDWVcsSyFTkpWQGclayJzkTEgjlSShMlI1' + b'QhqpFkIaqZZCGqkkvZWRymd7ySG+aCW97EWLVtLLIb5oJb0c4otW0sshvmglvRzii1bSyy' + b'G+aCW9HOKLVtLL/rloJb0c4otW0jszUkl60T+vmiyQBUmf/Ap97KqZBpJc6UUrdm7FaiIk' + b'xVilQlKMlU9ghQ5q1Ug3UnGYKJqpkExvE7imIpVCMqJGxOAwUTS1kIyoqYRkehsvVc1hom' + b'gyIVkKTSokS6HJhaRUi+CYUi2CYyPGTEgjhq8bdW7i9XWjnpqIVkIyooWXasZONXN+yzRD' + b'B5WlTicHiSLLUjdBK9McXVCWujlXmRY04p9kCyGnJJdCFiRbR7LRYSh3jvO0NCOsczydcS' + b'qUUWa/kcHqqldniiRanAG57Y/rp/Vh/UPOk7jraNoPifuwMsL5Sa+XRiBU76bYnKrGR5UR' + b'dK9iNp5V1MbDeF2IXTpvUlnfMwwz0PSHRyA7h61ogQ4M/517jTZE990mAhcER7ZUTNKNlS' + b'aqVP14pWkagSoxdP28PuOvybd5Fsjtevf42m/O2x9WKy5ByDoAR5Fd9+i6THxJMqldgN6s' + b'n7rT1iwGvrJpWVdx6uvWgNv1/tvalFIIJB9xRh6ngW0WM4LHYsQZeawt24olwu/WyGyR1a' + b'VtzzWYkVjZiDMK3bOfT5fjWnxxLA9w7GU10bxxRVjlmjuqECubCS8oqpDPmc3SP7hIeQqo' + b'SdHLFg2Vfdxu1/1xWe9+yDJqDu64PXsdfdx+DlY4bg+mXm6lHrR/6Y6n9WHzAxdWAqmdTR' + b'TuV2eN22BPjyw7qFbIHD48aWBK4Hm7PjxvL+ftGhWWRlHAuHaYcVWFn/fH9cNzdza2uJgt' + b'1FeoN5lHxnEiq7jmCiN6ml3DytfUxWSiyPLMuba+QRuZuOxsrDDRgg/DGY575m2NNnG4bN' + b'bns1/Eo2J1uJy+sjTDYm0A/VpfQHS/BzRcdoACfVmj2ML684TIsTv8kPFAwPploFgv0Uo9' + b's1Bwu0rJ/v7lBbm6qlcrfh6H9cO2OyGXqSSS/lPqTa2B4Yi+74nFwWQZnJ1ht3sT9xDyuO' + b'7UQiLbPpEAoJ8/PiAnuRJocpWdj9nbTNvZnJi50YF6RnSjQ2NpOXmNqnk8Dq/3w5n1fTa1' + 
b'5GZ92m6GV9oeUI/xkC1NXmQhkCtRXm8i2OWFgAt5c79zgS+ngriwl7kgLujlRBAf8jITyA' + b'S89AHbMGZ5IF0gs1mAfChUqD32uu2RGRDRuUNZb4i79ecioAzQoVlATZgOzgN8eXGYS+cW' + b'Jf2t+xM1hPocES/fJJBIlUq2Q9x+TMYrWARHB3r0qeH6gsclNQ6TFGeKjgJdKQYE//r2Q1' + b'bNWgUyKierT4zBJSqXmWfeCmSrxFQQqREuH02hzVJPbEyhFYG8PzHIeS0ISuJ+PQJ9zpUa' + b'GB5dHVhIcJL4yiMis0OMTmAKBWGdHvrebm5wr7HVQLRf5jjeTLjStHZogzj2LzRg4+zQEv' + b'5Yhmnx9gio0rxSh2mtYoxp1YLLJife8HZ65mgyF2q9456JjKRUDT3nBoY+B60yS0No0WAU' + b'gnVjUcuFIAuh0zYKo5ivrkq2pdPb/uU8mCFAdWZoIWcesEAV9/nHPuUcGYaTKfGgjwo5Bs' + b'5F6aFTkmrAI9vroeRptdPSQe0kvUNQ5y33B0OgnF5ervRRdPCXW9pihHttMQK1tgjGV2rk' + b'Wz9Icdk4ugqH2frWH9wM8o0KD4sxqCMTg4oWBlf33KPFjxoNoYDcYyT2RvKFIqOaTNxJkv' + b'FbyTq3tOSA4auKWk1In51aAb3gXivCS3KPbBz0doxaBRBVZhiD78N2ZprcRxeb5IaW8Qlu' + b'O+pyp/7PcwcnWyoKGGXLEoF2D+sLO4ospzO9RYhQaRriNdGaZKxLohMGNtYhZ8ajSvOM9E' + b'iXRM9qwG4/8r6YrYRzGnYY1DfCmhgZDsMQT2oWaJH3nc5HxqjtMljQ3dmur9xbU4LGQOuR' + b'FRQTdLYzCc4h0kCGiYUBg0JvSGjZobahJt9vdb1akvY1xhC6yjgg1BkC9nh7gZLsdVaS1g' + b'klvUMurHcPKDVzIh551B82eq4Ine6+V+YCTMEONdtXIJ6SNwBKCHVuQ6R0CAaHl6E/nKHv' + b'QEF1SjBn+YbNEcSzzW93pOfpNVd5xqzfscF5uKAYY106/d/4WqtuvuPO69dp+r850CH55P' + b'CWO8aipEU/G3jGo2ZmlnnsHs4em7vAjNvrzGnmN9g6a13Om57cFZm5u8Ch/Q7uH9kpZKXP' + b'geDMZd3pjG4kK9nySZrb98bpmireVbqCRyehEUeLOR270EyTLYdn9E0Zs09fU1SBHlBTsw' + b'JT4/toigdfwz1XNXrXP6ZI9aCrP7J20NUftMw70Gr+CLM8RIuy7oyWgnmrIey5yUnVBPL+' + b'TH4egH2/IZIpRPfCyqsfajV2fqHnNAC6klUWtrUTYiwVbeVoFeIE0Y4iSTRDRFko0MqiES' + b'1MnehGh8Gu0YAVZ6Ihq++tNBQNipF/E3fbJlGDRCTLCLGxNBFmC2weYVE8cRA2keju3frU' + b'sk7CVRvW8iVrLeQMaUpLycKWcriKWc4OJ43RzXCBwm55JXn95imKbu6wGzHk5GECcbCj/B' + b'yyiNlYjdzWuiCchiu5UEEvuh3A40W3A9KY/p251Jm5bxM/R3au9VtoQPCYtx+pss4Mdure' + b'TJfcJg/Uh/LkQVsKloDVOIY58YPc01fh2yuNxLXSaOmgNJLehWPeNcjDhoP3YaP00jrVuM' + b'v9icb8GkXkUC9TkPFysv0Lj0M+IMbh0a4lO0uwbFHZT11mCwu5KmIo9GZP3bGjEg3/Dfzr' + b'pVskQe6kW+JbriLEFOlhfBXhDJDoapklwr2D5F6OO472iMRdQdiYr3AFIenQucGdRNjUnn' + b'BpgQDGE5dV+dU/cXGHeZBb+vDoK9lyZRDdvtqJgYbd5nR+49JM5YLRdRNuotM/0PAetMIz' + b'a0j72mEIXT0cEOoHAZ27U9C3b1NckvPwzLkHJtxpbsjAn1YE/vfLFVeRE82xnm+YCxdkaC' + b'vpykR8+3LFBVnfv1yRWUUDa1bDbd9deEbKVA6/LpVVgWMGN2Gkwhj5KGeeEZbL5x6Kw2B1' + b'2w4ImlM4M8hO5h7xQG2BPjhxnobOA0yku/EQrhnPVSpKh4/S4OBxClwoQX4HjKR36GUUKM' + b'QRXbZx3/vL7ty/7N7Q2c0qh6FxgZo56mV34VrjrPD0AL1pZ+pWjs7dobxTnWMalw+MysMe' + b'daKYsnQo3DTRTTxblMnofJBrqkuFu74HjW3XUXkzDZk6/Xr3tcM8iOPAIrPQhnfW7whMLM' + b'Bp0tEiqUXkMBUx1Nbd5Z4TPvt1uvRnJ6yG3DIPbUoe9g/omUOXM0eTjHQ1+HJr6soRpNHH' + b'JdgdD+ZoywQjn/nc88TX+vjGbfJUIAk2dc64AqCciH5TWNqqmlTome12xXCZjnkOp1Dmsj' + b'buEdqTedxIceNLriBTkA4vEn2Ib1UuvEM/H574wNQS99JCqodtUwtFy0LOp78NT4szjVlu' + b'ndyFK9ngkqS75MxCds1HhxgxXHgNsRd0XZxDUJrD0/HCdJp1c75NMFyOnLA8Hc36E1Qo82' + b'DBAILG5o6YL3h5ETQqRzct78ChZuBoHsZmk7XkYs5rVNJA88Q7R09LLhcp2WmgM9JZoHPS' + b'eaCnpKdCm9irldA/89JRKhCWbnnhDNQeT77nAf1JIfQHngadSHDtJ15VzKHJ0Z952XJaBZ' + b'pnbUJmrHidoSlaSzLtqZA/GlLS+pOJS2T52fide/L9nPmaimgfjWcpg0+8b20i6fzEq1cm' + b'gWvTIdn2ycop2frpi0mHRPbpN1MqUohfTGQS+j9MaMwF9/QGFYtZIE/rw4m6voZQKR+pXR' + b'BDrRtN700ejeBoaTa75utdsTRmy2ba8gYehZvfcKADNvG+DEd7vsF3aqZCBdWL5Q9Pz08B' + b'QtbJJBTFcLx863p7FyZChALQnalWcGkGnqHpvXELM6ONvqGMOk4F/HJEIA9vzGDUwrejuV' + b'Ob+ZiSWrEvX9H0CMS9ZxmHj45VJNwaLafJJlLiSavFqBLkJtgIGNItTZnveImvaYmNl/ig' + b'RAEd2wtMErdyZsxAomUzjzxxDWSSTdy32bmZZClJtSJWGjosiJFW05+S3tX0x0S8CyuVFG' + b'5nl/ty+xlW9CIgrOk5eItA7f628XxnLGVGnLDyd8U/dU88Nek46Zgz8un5AXVAf+z/EFdT' + b'BY4C8CxoB3sBZwocuXesOH2VAkfuHctu7Qtaa3Tkw/Mu9xflo9HoyIfjxTlXKnDk3rO2ps' + b'o6cKLAkXvHYqfUCVgocOTesOImMJ8D00P/dGUBbQbisfP6MNpCmi4CJ8IOvApuZprn8SnI' + 
b'Pa8sYPrFCMRM4+XQcZdFjvKYQX5aQ+r7nb8/lfWIy2/XRgrzWwy9KrQcO5DetbnJ0X5b4+' + b'LIecP10or1rvZv0XN5RG1Sc1vb54tJ05NPUymUU5RXBLSOsiCAGLnayKNBlaLd8ovJGLMx' + b'GzATzsux33ujBJNJPmFcf8k4OiqMnpWGNWHC1c4MWtl9GBzQImShAFGpy+vR/MOqQG6J0W' + b'3kRP3l9XAedeOG9h23IXQP6oDQhRog9JGYtW3GFb2pIfpmIxP3Ajm6ifYxskSxM0vpWD0S' + b'oiWid6YaQ8tiMOqbfQrm1L2szdJU2GVtrni06zFjmmOqvSrUpo6bOFwQQZPvtn1oOktDh9' + b'EDFUPfQoJS0XtHC7LROYjZTeNosbspCdg9pKn9lCsDa8Z1GPbIVsiLn8sJXcHhsrfrbiEr' + b'V8j/jvdkZxjr40yuEpXHhtBZ7ICQwwTcZhE+MR6/nblD5E/rFyPMnQacJrLXwxMFjogmgS' + b'i6cOZvXifx1RNoklUS3TzhWvpUUNc8gk9pzAGK5NSFxNh1qZA+nwc3OYfaven5JhtEW1Xu' + b'm3P5zDL4wpLdxs0y6NGb6D7EAmE9n7ZmUayYwUO0P4HqEJYqobFtwj30aEPRHBhJPchmBg' + b'guomzWfokE3cKAmuW3MsjXCURb01sZC9I7M82fMA/Nt55I5g6LZpLeoVquE89iCuBD1tNF' + b'Ojo8UUdF9R7U3iBrd1h4zJazQLryrBLfgl2J5wEYFKISt2IkGGxOvDgtzVNP/c4rUluh7G' + b'KZq80mQ8/OwGJRkOCavCzzoHMyK/Fvw8YqNMYSO8ZEvzOc1wMS8qyP2LaCurUCRCOqPLzo' + b'HEMSzuveLNMii8LSPOTQS/MctvTSPCU3r2kgT75ZzYCNnpQcTS5J2CXgOZ3ffmcjJUdXYz' + b'qNVj+LVcIGARE6OWo+w/eReciTJJ1abIdbveS6SDq5ox7+7fq6X29fekCvtQt4ZchRXHG0' + b'NYfhuhbV4Hv0uAeD1UutTM3D9i2+Z6GuAMrgObVEOM0914C8+LHSqIyxM43q2zErzZAXP1' + b'KNRtde5pojb3tQelVCEFUfuwbX5zGk02eskTPuSY8q6aInPSwtR+Mhf6f3+hFOd2WHAz/6' + b'3Q/0XJ1YuNf4VsUK/1H2w2u0No/y0YZX8B2dwYfckY07gnOrBnltP8MI74BQKdvWIlK0jD' + b'0AbkeLSw52jSGrZql14HKxdAF0mEj7MKpUMN+2MdoIxAa+YXufWUzlhRdH5aSPYIs+4yoh' + b'XFT/th0uyJfMQzS1sdY3HFMbi2KwGpD/L9verRzkWeZSKl1+NqldGNECqcNUh+/z1Seucp' + b'FIyuqVAE59Wjkv/m6sykUu/V02qZwTbwBNcnwWgL5u3DqCzNVmeHUgI+N+1MHn4YBc1JcO' + b'GNCf/AehX4nJkbBdt7frlFArOvNkTKgrc4dIRrQekDLOHCIJp59d/8JGl9Go3FMyscky1o' + b'KgA+SekLdoKo/IWzTIAP0WTY6+db8xygiXK+23njmhgkZ6Bf2/cAA4je/gaMg5v506kwVw' + b'F1myQzY9YmA21x18vLn71vFmxG5dNEfH5g2chh86CkY5ehSH0PhOeRTOwSbHPGHZhRdy0M' + b'qGUMKIyN5OmzFp/HzYDSe7WDa3QHgzBoN+DInboo0ZXiFGBvjKMJ/g21+0hVl+F99qhUmC' + b'NbZEP+U+o2bnMNGpSkerBrMg1H/FvP3AdGclivWo8w5+dC5PIZFOXB1I7Qox671IjuK3n/' + b'xBBnLpLatzfjh9oi5JDEffQUIrtfTVoG0cegF2w/DCq9nmBKkbnpWk7D2vDHArh+mWP8ai' + b'1VgGfTZG+xseX6BcSttCZtoZVsUPNRzVpKXU4Ms8VbRCXsqtL0v3LUM8cuaM2M/rxwH9jE' + b'wMOXYoPFpvCbwb0LVLP/9bIu6LVG/WAHkVqbtlB1sp2BeExrTeBPzPB7PSxwVT+637hoXD' + b'7JpqLiTNuyfcSgu03KnvwWhS4UE5P0MAUzXaDpgeEbMvO3dlf6reeFoZyla8mXGjH3yaEb' + b'AqdNrMk0dqqmXyKKsNLb7VUGBoBHDYdj1XhyYz0OetWoVrLRCtwjksWmtrkke9PlMnj0F1' + b'LJLH6MWpVfKobF7R2B4jbQjN6XFsBLvMiI1XyJc50dEKOTTVR730gNgxdlASHvt+fMRMZc' + b'Lfnh8I4HHHD3gyAITpHyPVBtqIg0SzyQSRQQ8y0xq080MBnex2GMeHP63JoCVpw2jNF036' + b'nteP9iCwp8Ia+hgLy+iBE5ZVAxYWkud2sThmKC8xWxZ753ZFN8JHvhx33+3tyWRPBWcOO1' + b'wO9nSyp4ILh7109giyI4LxuIP4ikxvzyEHOrgiejydzRVMqB7diToTpvmPPeS2Vlck4kfL' + b'GLRRy/PCfAUd09JKV24MEOrCVNE3NOW6NXyvKFvfVkeF7pMWSwNo7bdxSFB+LRLrvoXDgu' + b'prkVs6rhVRq7jWbTTUWkgruBYRta62pKi3C0977da6Fx3PxqqHauvAq7agTDtDu+DBMvMm' + b'Eb4jlQxtKBwhxFThcXgUexl2GsOjX/eBqvAIXXAv7CnZR3alvM474XPYLN+p+Qr5aGlVvn' + b'MDhPLNFX2rfJeG78vX+tbF6ZFQnBaJi3PqsFCcFrlVnFYiXZzWbVScFrq1BFoZji5o61YK' + b'2joIBd142he0dS8FbeXRBW0dxH3mUjDpNNMASa9ZWMzVERfQdtSaIZEomAjkuH7g3jFP9k' + b'xJHR449ucJTxFiKvukTeRI+gOFBb69tRzxcLZ5viIZL9NjaH3iod5owGlmU6LxgNPMGLI2' + b'vasMHSzvSGs1bgFaq3Ck7UuHTW4/dwjJKRCYMDlQ3cHfTgDF7x82iZ5DTJYg/VITkifqA2' + b'RRzyEi5DBMl5YIzyEijNFziHDvnkNMzVfggI72CuBSL2EUGWiV5ob0sOcOV3QIq2A4x45v' + b'ZjDkoAAuHC7IKnfI/vLHRu3CzpbEUVl5kpCXpq5II8A33nkeB9oGVggXRQzt162BY0r3FB' + b'ld1qT1M49VZhBXsQxb1wUHhMpgAH1/wNwCoxsEWote3SGwsvhY50F9+N5bkwVZ10+KMWE3' + b'3ppE/m/D5tTcUFphJGInfiXjVE8UIkC9uQAt8UlvLsxJa12a1brfdzt7A4v5DNpPBATVx8' + b'FBiwAQbzsg0N1wxvRBXq6QK0NbzzqdOfHK2JgDoF6/gDKnGO6s7ERjaqLG/L1mOE/pLZ5u' + 
b'x5EIXtRsnl7DKso5Uh3e+ITbaBRFC9d7IOhVn/QeSANautOM38G0EI3syOsl7eJPlfjlSx' + b'Y1P/WyfpnojWLnwN+c6UhfjXJLhpszWwtEcjs/6jZNIh2NLjmUt57wXQWUIo0MR25vAF82' + b'Ho+GSPE/HGUJgcms8sBwIVSVQF9VfILKAgUkkEO0mIc+hUdSwdEbFgWScuEEYD/4syDzJk' + b'De5qux2Kk/PLlz5pN8FiC3OUo7zye9/dEw9ON6HzaY2Mu8hf3xWcL5O6b129uPrs7IiA0q' + b'UHV1v9fQyU177jwJJ0bpSN91a+lwoy5pddhxSXJkBpIRG/d689ygYf9nRXrUB86nAPuz2m' + b'WbJ9vIgmmlaL1MUtPhDrqkXs2ncLymRKRNLRBbqWTpnTFLCSw9K7bcheXGE2vLahXr2mNj' + b'udFFKKlgz+vTcRQeqlnEvQ7Spep0eb6MWAVznja9ZqJ65MoKM/Tqyd0pM+v4MgzmEoP79f' + b'HenJtvFh62p448vqBIoSbSs7L+ajJFm5udIiTLr5DHMRJs3zR6cJcd3OJRGLTi20zUie6K' + b'I3NqU9sFSO+voKy+gvLpFRQiiOCx0BHzSuqIG4vtWN7eq0kVbS7MipBsOkbyyRgJYWt0LL' + b'DmXcmrmbG44LhHnKtEb4NN0K7iN53RItSbzuhOgvZaWSK86VwkW/2mM/jRm865oSVkuO7s' + b'bW+8UOXMfaTCfkZ2/AoTGw6I3wXNZSpUUFuIbW90sHoVrCIpeo3xYbtG7W3VzCvNOb8O0v' + b'9h7rkdL5tZ7Dv3LTXzIuaOj4I3cyOG741HgtSaJxE2Bg2H6Iwr11OPApgplvhHNwI5OhRc' + b'6DUqBqpP4tWKjjryJRmXc3Rve14CPIjWyvw7XtQwwVHJ2rGSpSxFQXpPpf3Ur6Ch+Prucn' + b'2uqHH46PCMg8cncpYWDidyWguMTuTQmc5V9EvRCXVNRxnCaK2hK/Q+85lOFZGlmtgoIrRO' + b'B4zbuoOvmrnD4xYOMLrmH/kZ6X4oUH2mpcKgAR32xS0MsNlHJ5RJ6+RrOko+ctPZ7VIX4W' + b'c6U0RWKiLPFBFEd8A4+Q6+Sr7D4+QTPAzP24s3VMoomNvQ9zrzzEAPmnjhQgAUsG+xnWdq' + b'mHL4SLMysoJd/ZS0fop+ZuhvA482ObPLgpA7lclqOpxPL7x5ydxdwYIxN1fw0NRW5g3oPH' + b'VbQHHJPSjsIqNjtKT7Xl1klcN3dLC2UHRUfOgMoseFsuUyQlxmQeivXE9EOG8vW+508mpC' + b'+62tuzw/2ojxDkWpzz2gdspKh/EdrYzHXXrq07OkFxOgJb+VlrRK1KWEdZVoe42MpFucga' + b'C9vB+FcMOAVid9bHDTJvpdlKJMem3lAmH86qExRnIB5Vm9CpzH/tgFRpOoBUea3GJW0PmF' + b'x3yluWQLZx5xkCsqUIwpmsnNY5oSlhFqjorlPC8zRs2sZ7WC6hlxuO1/vuzMoRERo4rdHL' + b'm3EuTINdfkiCypRikzzxmjwp9CypcR/8+Hbse5ogQ9i/iP3GHFbNL7xqxVczHgHh54c4j4' + b'Lm/yJfIR+yhiZVFxbddfg8BZxIH+HbIhysieBxj9syMsgKiwduiOjkHO+oon8cUsFFmILy' + b'oU9kvCiRLGYf+B9uHCnsXsc8gSdJaaNYQqkEU18bDehyyJ0u0WnHOaSWiYx+9CgqNoMPI+' + b'SI2Z5jHrBVolaoRENovZJ24hBFHicJXpFVId5eSpe+A5JhFoFjN3jyJPlIzT8NB35zeJLx' + b'LW9nN8kjNGu6jSRfXgdB4enoWVxqzLJkQUVcjTJbTMOC72o191+1po9itXVKRAY9YwbIQT' + b'Nbpv3XFgolRtM1Um9G0q01ljAkNVGVaYkNuqxiAtAVeJMbKGoJSwFDUwjKzWFIQSKovDVS' + b'C9bVOmMG2KyjJRlpLI7KsnmKCiRvfZshw7jo9jpdTjI6XUwWOltLJwUEodMFJKgYp9I7JC' + b'2zeSpcwlQeqVYeR0ZNSJeq4HS7QJPdCxt5Hs5LeOyNIhJtJXhpkowSuzOmRnP35Wj+345r' + b'27E417E5II1DYkYPxOC2y0Q73+PU1uqujQ5ftgzAI/5ua5bIkc3V3ewgEL0GIgx6Hg+l3E' + b'PDH3dQ7Hm3d1FoY9euIKVS/Sw5EBB/RB3vwPXfbB7IHxfH+KJnXQL7WVkEIdDQrU/cBDBD' + b'zFkQbsHNP2CppCaC7Jw8EkAIo+ome0e35ZRhHPfbgVlUF89Rez8BYWkGLAvqTrr7zPqQu3' + b'OfX6ofgCIonhHJviYE2iZuZLve+4mEeIt45i9wDYbNhR+7X+xHYKAYrSjApw1JWVJX9l4p' + b'U7TNecMRaZeCHBp9N2rfd8IalsJRi+0mTRNXklQEU7U7A+UkDYvRPJjI8svtgjRzccwsFF' + b'q8CoL7eeS1slV20p15heQAb+bdufT5H5RuFBOaymmFXyO1XzefJ7dHdKClrt4i1A+i07fu' + b'sdO0uHDTvQ2tZ6kvzu9fUVv0Vfn1lCFqDQGf+OJno6df5MA3L5d3cMQ8qnWCXxBlYNutuH' + b'tdmFoUdXArYGvLoTcGXg8bo4pFQLTTNGsB2dSWuS36NdziVpn0GG0DnkgJBFBOKrWxAgWk' + b'3Oo/6/Rz0MCkYaBDJIzyKzhNeEolfByLA+bZ/7yPIyJRwkLEC6ATQnS3fjc9A3nyFsDMOm' + b'igE82mcXnpUtABpgZIbVJDcssAw4MlBjpMogyzi5slcz6HjvdkEwvttwCUjneGHokOGkda' + b'/BcMfmwVNguhdpFB0NQCUYLy+m15vbz/i+RlRzoG/dcDnsoQfsZbSqUmG8cNXqJaxj1dPA' + b'Iif4qYVxOq2hU8TcGbjH4dirDp55cdr2mzUm/EMop4mGUcF69kz2CunYzag3XTHvwjVZlF' + b'PvoxST5GrrxBTH9Q76KmGwLAYMtztjjnR8jnKWYX33kiI0o2e92N0mz9EFXjPSzmqD32K1' + b'gYnvc+h2UGSxkQbZSnGEGvIcm1dOCai9SZRiZJqh6Sg5kCK+8BM5cGWQvEJ1Ys057NaHDR' + b'OaQoF7jnqXkrQeKQoCvmEarq78Dgi13wBqH7E19Ggj0Tq62kmsDDzuIimhthmlq2AFMTOU' + b'toIggor7fL38WwtnpGsLY6xtzz0j6NuNh0YaN50Oz1u5uhHTWQMMcqtUYYHL2p8pmeQWeQ' + b'2epkT2Fzl1wtjsNVMzpgv647O+uYoZqcw8UDsiZR61OFJzNR3VHuRpfxzGG9WFQfddd9YH' + 
b'JFnEgAMNmXt0Gs/j/C5bzxhllcfH7icOl8zm6GGQUQDe4akfTsExcjMertF565VtDPrP6m' + b'QrCn18xxNSFg2IyP3rO55QrpENR05aPa8A4ZBkKdHUkKEF54qOygAVaECXE/IV2TSgw1cp' + b'qhkYk3s685KA48Y9U466vSJnOPhDxxwqZSwv+R0SgIhOehLHruIc5CflF4yhzDzrBeMpmH' + b'p5eK7pKDXI3a8SZgPqNVBtwmMm5SLZaSuGDKSzB4SWsBPDBeJa77R0mCeRfjat4m09eJPT' + b'IuHhgKvnT1YLj3/vnZNVfe1ivPfWrqrI0Y1XT1bzaxfXwcy8o2tW41nfe/kEffmVi+tgbD' + b'7IYDkleb8x+kTjvsUwZmYQljsfuDKfQdeKgKBtOTjoVh7wV7Is7L0rAZQbchzrztyMM+ar' + b'AG+6GvPJGil9LbHrYWaxMEVzpf6tiN7Q3BcLE/jzrZBMhhlptuOsX65YL8f6fjuxYHdDsG' + b'Vde+ZVRAvPuTW1WK7uEPL0zkwnnLtb46tyx5iOT2I7X7RIvd3mnyF3UFuN1RRi1UoQSK/0' + b'5MhcpfSQI0pPY4n4lHG+BBqrQvBk7VWhCu60vaqjxWsVSLGsy1Eo3aO9clpf9jY38PiYO5' + b'JL67EJDwXxS8zGpoEcjt6gLcuWc4NHNmrW59hALXNo8AuV3UDaOs1CsovFWM3xIYyQvDTR' + b'XaCAGKK9QzpAtqH3tS877+Ij4CwermWxfsbjHgC+Xo+RaBe60ZyE7kcJ6NER5aacI7rd1w' + b'FKb/+gTPLTgHo7ewXdWFFo8xts7xU8axbr1jEyzC+jU4dTJDGMrEukZ3jYcqvJ7dSCPTxR' + b'gbcXimWVpw+DMeNbKFpsNDPeqetwc/VYhuox7MJlnxk6zYF7rJMUw6q/QMfsRZmrdVbttE' + b'3ie3UyT/OIEeKAE5Tc8A35YM65oD7JaAwh3QML6RT+/NXlPFm706tBiOMsl3Qgl/1TTBlq' + b'01XJsPLEBTMJyK1yyZLvFgtYf4ZMzxMeuENF3Os7WtrEL3hSB7Df+p7n1GFuF3jqyGBlun' + b'RIdPVuTtAtHDBUfwkMY9N3wFg6XAFDmkq9Ots4nwoW3yNlcLUFTr/cskOn8UrjPNN/MKdX' + b'Nab2Me8oB8LBnGqm1zsaDYZb550Xpq/vnuNYUHQe1eHXjYV9yLUlx2HWc+LQfrh+oPGpwv' + b'1rGyyV/rzuMQnRTmcB9rFVBsJQG4u6CnAka+tw733m6Ctpl4aBrirO6CzAUR6nDvfhzh19' + b'lbMTMt7W+0HyqwSiDRlaRUeGDEyTPYFIKQ6nN22jwXz4Q60dNQzmePKu0fO7WU+oYAwvrB' + b'SgyPUYivDC3VhLlFEYN1ENRtMRVD9tFjdNDe07bKj4e70aCZ13f7UaiXZ+Q6FoW+t3rJ1M' + b'HXqtgSzTwBo/SsKqOZojovfb63WMmt77b7HlGLJSr220qaJ1CbF22NOM9LEPOqkig0ZqwK' + b'AektSjZsU0cikoFFjhkOfuEWNLwMsIj3sRz4tRhOSs0iokRs/MkQQz0qlrgaKdgsLwzajV' + b'oI5wKe9q+SJz+GjxwsHjyfQ0iRcEWXsIvKCK62lzNfF4NMV23uMlQOgrBo0CwPRxHxnAkd' + b'YtT9NRuTLmg7mB2iQCn9pcynF9A6FxhgHcTUWVpdwV1hg8SdLoE17xfezvI0tDdh0AA40u' + b'iqP8rnuS2S6zQi0QIL5xi0QskX6Can61QDBDevUCQZ2RVgsEKAi9IsAmenNFgMPFEORZQp' + b'5hL7oPQ6FGE4SrIkRJjfYp2of5DiwMMiEEqIR7rYEgIcF0DMSFtRM19ZL6D9XRIRWXh23Q' + b'g6HLEXDHNkpk/+UxuEZnd/Fr2I0hAg+ZqtccapSKXnNoNR3lF7LkosqPArob0CcT1peLOs' + b'FK6Q7KQp1FSyBu0ARPToE09sRzDZiLBkqTUGCP6BXttd18IM1A3Pt78RgzUOU180utkKBw' + b'L2qJBFnydd89hfzFFHevnCM1rzEfwSv/y4SqGdrrQWttNUlM2cwBooNfbZlO8e1VLTrRqp' + b'alg6pFWp/2mCeH6ByHpqNhtgBDnr9krDMAodDTRN/kMmlA2lYGBXOSHPzEE2PNIUw8MciH' + b'c63LpSXiiSc0skM88aSnaFgtDC0ekDPRbYkINroeUdNRCiFa9wr1/w+rTtuH0A+q0kOU6A' + b'TsjLRfWjeEXlp3QFhaJ4Aey+toLEK9TZwn5hYae4SJo8VhPJus4ITGIlcLtSuHj8YAB8fv' + b'EuSFR+MwUgvHJtN5adEATC0wHoXK2uORBC7Q2GllwXP/3F3OAWZUutyQ29EFipqOyo0ezX' + b'qJ1p+Z/Q71GiUKntO/Cc998SucGbe0ml2tDBCOXNeKvnWJV2b4fgJmfeuj6x4JR9ctEh9d' + b'nzksHF23yK2j61YifXTduo3WPCykD6hbRA6oLywpZ8YnnvYH1K17OaBuY9UH1K2D+L6yTD' + b'A5oF4GSCKbW8ztlCAgsxoCkeLVEDjTW2B5IKPBA6ULXcDMPqgXcCkMvadeIWGPFY3+4KsR' + b'BfFEnW1O2nerhtD9qgNCx0oguEdU0WWZiCq6LFPTUWWmxwOGr/UzzcRVD8prWP0NDTlJ34' + b'+wlIdB7aiWydUDg21rwaftBUKK02au0NEZ/ZVh3TqGUt2ZsyRkX/MMfGsZdpkF1tUMpDG8' + b'8XSmduiNwIrAugqsNbzrRxahmGDU57MA6/5ApWbCRJzVlWwzRfPVJY/4dUAWw1mpSCtFHw' + b'ZZL8TkIcL90VcTWL8xj/nZAJknZ69itZ7QQZkoeX3wbtcZU7DSAEdeO2kujK2Ni9Pl3t6p' + b'Vk8tidERKiSB1AJs1NYF8+5VT6kQpOiXkFEpOfCrGzvS619vXYF1ofKHTI2uD0WeRteHaj' + b'qq6RUZZ72DtLCIX8J0pF7zFChsHxHa37PHejKHE3JFR4cRNEMeIlkl9mIPax3lFFrMMRVq' + b'3k0UVmFZAxf8kG/mDh5otPiQee1UkcHsxIDhch2QSh1EqEr5Q2t403pGS9rrGYbQeoYDgp' + b'7RJgN1x1Uy+BMU6DSHsOucLZPhfn082jlT4Qlt7jjz4C3j2QbMIByC1iZcZLrjF1NIEF3D' + b'mqYe0PILeGUFOrviaFNQw3WHOzJ8ix7ZWkIOd6ymGvALlMtUo0qBXM40w9+JuMw1qk1s0R' + b'cN1/emYr6iTSFzCMXr4p3KXqSGlAMmKBGfR4hHGTWvykDqMkDo2oAZ/k2w8Kyun5wn3vqS' + 
b'B/ftt5uc18ng7YtXyDxdHggjMmlB8vQOMgKNDIxXpI8shXlqPyWHG0srQdvcQpKrS0tH+e' + b'lC9DnZMtjoqJLJPl7EjFF4uLI+hne9wz1Pbm/XI1khp5CdegkQgos9MNTGIb4wk7kcX5hJ' + b'efbeomWCb8zsaNY6s58pH+Yt7bfet08tZOxb5SrIqrLocUAfoq0vG4ufoebqmlUtHe7MYq' + b'FaDHtVnkvK09vEcJbpCHG+AKKVIriwSnKaRO+IG1KpyBXpoCFPAnnrbqc52V4/Nl5RKzpo' + b'bOgbzIMqU2L2Ni9e5tWQfOx5YzbvW1+Q1Ap1ZYGgTxsgVqdTC+14UR+GqSFWrQ33lmZtUq' + b'IVa+My0qsNcutGKJMKrW8bl6JuG3a4Dqp2pFe2jWN36pEym1SL7m3kCjadk2ZGwKvPqSX6' + b'Iy+jZA0Vw2v215aQOt0uCakhg+6vTPvpz91tCsFFQ0BRAhWrcGiWNO2iAXmeoVEdN49GXz' + b'OViI6Pm/369HDZWaQhct5SIKPgpKhv+n7PNHP01WgAj/5h81XtvuUCKoYyNveeOUz3BmMs' + b'WsRFgq0xRRRsWFBboQj0mQboQ4PoQ4X79r0E+w0DqIPybFyRWTdKzT3mwXXPVqh4t3KexE' + b'9+TAoBwn7lLGD3u9f11zeCCwE90hjk9DAcO7v3N9w6lNEo2Oe/xvQ43CQvfLZskrys1/uX' + b'oDzWBuFZrmATlcGxnmPNQfpetcC3nz4Rf+rMzZ9ZigGBlLnyAoP7SzQPMy7VNIy0XsxOQf' + b'dva0wH/CZUxuD0+jaduLPAxkh/9DTNlOzhYRvZQS+YuNFCPMNFxOxOWNHLRKvtTN2xO7gL' + b'ajD+Chkf3V/mbWCZ94XRWAWwbxgvAqD7KeUuUnxVXKL3zhSmFHwVhH0BuQmAvnjZpcbfrZ' + b'PNFD1Oz0rx7IPJtULsWZVKITpJrcKjNOkIJVFzDapU6VDse8ulQnS6DM6Z5qZ/NPO/DMCp' + b'Cyf2Tbmfolt1KUpYkCfl7l+p7GeaamKjiGytiLBF6YDxqXgHX52Kd3h8Kp7gN+UKutmLXp' + b'9FQoPCjBLSC6rQhuzNoaj50Qk4uAuXcUynQoVJDrHuW9ilyVF/rN3b2GUORjAzZhHFhxzm' + b'ib6wlOGOzlUYKceLE01RGzS0fxPO6FJB1v7ozgs6unnB25yRxMcHKOnRPVDMVm2JoHXMPR' + b'TVV3EoRkTGHRUBBNO6b612zxxmhwKqhtxZtFg0aqUO1KfxvcNIBh+LtJfMA2rPqDbYCTUF' + b'kphZrzNINY4x8G/6B75NisYxN4milcDJ2O9gYAJw4r3XGe/OflFL50ht9EZQQ9r39obQnb' + b'oDQq9OwLw5XPLD6NNF4s5FXO2zzoUz2mkVxnjte5GMz1hg9HbQaEXbOPUn0qqa1OEsdhe5' + b'iSI+4mEktTbgc/P5El4qxlzdABeZnKeMYDiteX++N8eASvpiUs9fyHSV4tzho/Q6OF7/r0' + b'qPxnlQWHhkwV1lSbyFPHXAKFucbzMgjkKYKpaEosDRPkDlgjoz+8+hRDAvsvjIOROpGzxD' + b'1m2b9KhAmAOvR93YEAj3odEUG/OljQ9XBgnb2IWh7c73hCc6DGk3tUtHqFZnA5Rmn1lSjU' + b'6oMtoD5o8vymYONSy6ngX1cuAhzcNTD83sT6pI/rIkSqp5HLSFt4h5ZuQTZhszLy/CYXQ6' + b'N0m/iAFfisTpJ6ehvAf60R6OZ+WVuQPch5VLphyasbnkz8wfUgqiHrKbWSpY/vFS6ZfjsL' + b'k8mOXaFYnfeXz1q7lFxTC5+N9t/G7BgtBLtzOWgjQkNeQxLJdmgoQF0txgmIPYY7F5pWg7' + b'aUE2nEyLrPmhpwQpgV3/nWcOUT/U6ipyJrrNBfFEd7eAVmuEqMhqjXCe/EGtO03+kKM0Nb' + b'/3ygCGgDp9l5EcGVmXxK4MjSui46N0DM1f1ea/00lErSPqQVNZFVEzTeW5pjidClRQaTwy' + b'1os8/gfPlX0H/l/9XGlUETfWq4T1PT/Xzo+Hjtc6KI1xlfyhl0xRhqKLtZPkD2eCNMdn1D' + b'HA3cBTlRjd8REUMUUGNcWA0X2AbWVfe43woGKNuP5+O4unMT7yZbkBM6S7Gsu6mAo08moZ' + b'7rCBhWYCjdwaRpyaSqCRW8OQ+mqxOmAj15bj33y1WBOwkWvDifOnFGjk1jLc9f8Wmgg0cm' + b'sY/p1XCxUCjdyCIZ3qInG10Ru5IKN8Wiis+U5rTWWFpvJUU6H2emTcejx+1Qg8I24ERHmR' + b'j7E2xiTCU9IzpRoL74G0gronQJpVhPjnPRQs2zTBb7RwF1x6z0YeZwuE4T8T6n59Mq+wto' + b'K4W2PThSDRQB+8mlGLw2EbQzKQ5XxJ3bP8zbMe8tHUgVQjYNpY+BbkA5op+mBNdQxgLrr1' + b'6ZorjEtBWaWBKGVVwvVGqILH6Nz/ArTavZuA9NsbRSKbPjnxjdvwRKyOsCsZxt3IDK4dYc' + b'oQbkVWIJcJp2asYqtETdIcrfcNJ0l8NwdpbaI2A61N1DQdWRkgK9ZmQxBjo1nCVIu/KXjO' + b'SvSayRj3J7tTQuNOcx8ElYsy0W8spSD9rhamqcdgK4X5bnhLoUVcsVUU2WpHCYPKMZrTzw' + b'zt92GKJpByJqdAfnaYQ/L5J6PQQd9qCKGwgsJUChIUJsTdPfGBHTtPZRE6mpsALOg6IGZL' + b'YFVi0n1UKwB5asmgk08IjA4eM2BdbgvSb52x49UH5fL0btWucvxTt3fm3NwxMlVeKDoqXw' + b'plTrcZiU/b8bBq0Xhcre3IGTNCfz1my8hR27EzZoz8OXYALe0H19qOoYKNfDuOH15rO4oK' + b'NnJtOXGyqoCNXFtOGGJrO5AGcOTesWSQre1QGsCRe8uKM6sM2Mi14/iBtrbjqWAj15YjQ2' + b'1tR1TBRq7JsZ2tXezPeIsdoF6pdJUFaBS7VuVlcXWoyRxeOvIFHW9o3gZSXUNfoQfTCyaY' + b'eB3DoXkSA6cfKT9sOEv7GYyhGw3ou0AKMkbXUJiAzv0Dfbi5LATDfHt3tdiQOny02ODg8b' + b'JCbuHRTawTi46Pi881HBsNzhxL3DogNpJnf0X0yjxx4fFo1cIJN178gU5g8WjlI18oNA7d' + b'xRofZ19acLyOkbt8HZs/urQj5cd+ZIVZMiiurJuh2uyZ2bXs0THJmYOPvXfJgVCvjtSMRX' + b'eEmo46QjTXnlZ0PEvJL23ZXxjE7UVZNv06y1UTZ0C0RjeLOFr0RcQJa57ZMheO223ImjaG' + 
b'9Lm1WczSAWVkxbYCKQM/RydfMMs6aqPBAqlx5wzYqBZChYaGHIjmaYgoOj+A0ovOC2g6yn' + b'NUI4giJwQgnOj48KOVreWCtNewUhL6Cg1y9bVEqaFH9xIxyOsTopOA+u16BekteAXf2kKc' + b'3mD7rcRbPL2lCL7edoX4Z3/KdoZoQ9bPPKH7N/iOzh8gW6PzB5qO8h+hIRij+yjNLbNonL' + b'xVTrTnq90l+2Y53InIrw93NskoTycB0TfuBfRWjubJdzP0BkvnZ55wqbLCj1bY6+QkCnvj' + b'vrXOWBYAN0GnMqSrcvS7iZWzZk5svJbUMOTNaC2pWQDU+nlt6KCfk9Z3dDBqfQmHpiOrHs' + b'YGfRn/b4cLYnzbdq9rA+3DyX4Kuu+ejZaTuu+wnBIjQfXzeNAOiGBK5Btsnlna22RMHb/f' + b'8/+dXCmC6h/wS3hmLbfw3gfnaE9ODCmBW7Lv9enM0mHeS2Fp7cRB3oUVRc592hRcuk57qT' + b'3oPVUO0I485t1YUWRfxIUh9Cw56VkPSD/rKVP3HVVFBK+mQitQ29c1LVNm9lNf3OmgG2Zz' + b'y8ay/PO6qAhhSpVZQu6Yg5Z1iuZYGcWMpEoN7YcK6DpCRs7grUP13u30SIUm0D0Mdt8sd9' + b'+jx9nmib+bccL9tFPXqaetckOPmmBmwKs2aN2OGyHK3j9iUdrPNNfEoyKyB0WEebYDxgtE' + b'Dr5aH3K43j3PkhuPVtBdtBu8JKD6A5RjdK2WpqP+oAVj3z8MO7v41AQyrD4pMFosUrhsmU' + b'4N9nXoURs5TjgBZosbeDS2oMp2+m7NLEtGpjEspK/mgnU2MH6GTWUHqHF6aZFggFdq4NYZ' + b'lYl14Ed1F4B6QLO1iB7jlx4KhnYOik3tKg8G+zoH3bKwc6JqQw/nOsp/h2lzOgeJQd3c0W' + b'JS1wrgjeqcFzGjc5HrHTjnJD7EMgmgnGKZKkyOsdQOdIZ4COzxLHflQ3E7baNVs4qAGoVL' + b'0vrCtpoAbwSSa/NSh+jnkVaLMoLDnXqrBUvScPSzSPAw0bC+hK9wTyJZtr60D74yDUfRrB' + b'K538I64ikMo6TlltzZFUlef2Fo9kCXvXJvlQmTBVodcEDQBwyww1R+px4RMbHoUQRj2/Yh' + b'zkx0vduo25xaYNRvlha96jgri497ThaRvtKOgvDYoD0yaL+dmB4x6xLNxH5CVE1pIss00S' + b'kidI8OGPe6Dr7qdR0ed7EEo6xiH7rlzceSKlbd3pxvmJmvoCJpOihIGjVfwxlwtriGxU/M' + b'FC/LKzT4cLwh1INFaqCgl1lBlAhzDYSgHCzOGkUHV0StvlCj1vZP5jFRqtT8pCnKwsGmTi' + b'l6dzmsz91ooYU8PZKhhukJeaPpaCRDTvW7i3o7ZmmB6MCzAfe9tc+hijHKKcY+nK6WdKYW' + b'Hq3oWHRkPdI6MF7lKZNblh/zJDb6KAwdHyilxt6zz48WZmx4o/tLl8ktcxEmkqc82Ef0f4' + b'YhyZBqwDTuwnBZBPKWvfqKbD9UGq96WHRAGBQNEA+JpYXCgGiAW8OhEUUPhsZlNBQaRA+E' + b'BpBhcGYoGQSXjvRDoHEsA6CJTg9/hh0/MbwS6HLkfsDbBuPwHvU7NnefeWcyQuaCyPhYGc' + b'iNjojL2XBnK/sZ7TQRs4c3K/epFekZ6oq+bhz1K1p4QeTcDT6pVrIwWDwec0d19O4eyi+6' + b'E5KudKvUdNQqIeWw6zcXI6uxtV6/OQW/9ixjzh7zkCdcdBKTZGQk2l+4GIt+T35WNmlIhX' + b'UhJNudC80m9lPXPAduzE6w+4yeWVOYPLM2TU6y1IQWbnRSPVlpHPbwwAswpp7a89zs0lF+' + b'08vcyw394mHL1w4x2M9nzkV4HslzfEjPTzQSXHnKhNsK9bB+6eGJUXtwd6BxVOqpgf6XmS' + b'P3JjTvFDWGzMKTJvCFp5zs3E70oYXzCddJKZ2bcIHRYLYDzWqjd1RpR3ZJ1rqiB++odo68' + b'+bHHvZymbF5RQ8zcw5Ueb7Q4HYN1GMolWtKpSHu1yhBarTIAn6TQPTqHbaLxkjPXCYjGj1' + b'XUE4uO1+0zC8c9e+mCGNkP5haNR4bSgqO+nU1IrwMiGnsqgs+RMyccFd1BhlI0ZziuG2Tp' + b'ODfaI0RVFmH2Wx38recOCwdz2UmHQ7YcxS4PW6rVNEwjpbsTZHH0pqymo+5kmcSvhxYUht' + b'q9tURLkbgLLyPh0B4ZrHlKC90IqsRGHQg2ZUsE8zZcXtfRvU6LhLbNUAr04dw5yYdneyQj' + b'c5Q1VeB7UHJqNyNH2/JaOpjyklbbvhXJ0fvcGbGr17nz5BytCa5IjzTzBUPvmaYoRcvkHC' + b'0frhQdnUmegHF+7bqdvuf8vOZBZxP0V6qXc34Y5ZRab6C2IzJoxgYM+ilIe1kn5s1nbZUP' + b'hiyDFfjG6Mu3DdBXnMPqV4mMeNDPW6IqGiBe30eVNOjYQp7F+3D1OGTDPLLw1Wl7eDEXjy' + b'bnsFiWWyK+q6VKgUZWCZRVnX+CLnCOVsYaQ8sCGmTQBw6mqAjdrccG5nSoLimfkxw941AS' + b'u3Hp6zzzjPHFAZMFOVcPP1QGDQfcTcC3bjjAAOI5V0E3ZO35cO9ZvSs8U+hI/KlhxbV7Vl' + b'vwRtRT4VxF3ZJ1fRtChaKJ7sUpFR01CjrcdS9bngvNeGZNSK9TmDh2PSft3WbQd7BNPOOP' + b'jksHgcGkK4XTkLeUY8MQRXdpKFEtKUpY2aFTqpZ8KO1sXx1lhp3DhXOKDBfOGTBcOGfIk6' + b'6GDZpi97UPM+pZY4Fo6kUwOuJQkPa9oiF0t+iA0C8aIPQ7+cTQI/uXBUEuNT1jpBndwViP' + b'eNFFjJVm+tX+KLSrKxlRH3QvkzWGHlXTuQGv2ox1O66+jA99Qfdnfzqb+zdyCzzyMGLGd+' + b'VA2ieCavtpTnqk9ntkxE/U7KxfzWZnwhlNaIUxnr42yXiX3uSNgUYzU+P0GM+WFoLJPGgS' + b'IKmtTB60SqOvhLs2UybEHQ9Z8vPFnCYRdkaMVmOTVZtYb+r8SOUgASYWGMKBktoi6ogJS9' + b'Ye2tF302eCnsx7cpzrhens4gY3TDENGyXDeXhuP4NXB6i5+MwiIQczDdyaj7vw/YzcBaAW' + b'r50DPUufeSjM0x0Uz9RzD4a5uoNudUhOVD1fd66jGbvDbh0SLy1LT+eda+nnnJMwpZ8L4C' + b'f1zotb7TNHUdoY4t2aJ7NB7RjSU7o06MPkLjg/Tyeprr9E1Y3u5kKdje7m0nQ0dhgGmtFV' + 
b'I514xqiNenzcRLNkPDmoHDJqoHQoz7yFR7Wcoj+xkLNdyR01RORmuNzvnJPSeeARERajXV' + b'azUDSDmFrQz+Yciozv9506PEShedIxDBulQ+LBxKAv0YtmlERd/eBOlFDm6FrxCsqtNmAp' + b'QUerJJBUvwfNNhFdVYX+IrqqStNR2TIgxIPs//NMc9qnrbUca4uIIXdGs0FaXLktPRac1R' + b'7a9xsHVQZ67M29Ms3SUGbZjxNVEnw8GB2o8WrutbDShd01hkAzRn+/8ATZwmlgj45m22GC' + b'fUSf0Jkb5GiePf0uV7YCl991ok8Uz266sqZMOR+I/i5bImq/70bHhC4CqrWMGwjZHWv3o0' + b'uTnGWRB6mn/ZA1803ZqXnSW+zOFeRNdhGC3Efo18SR5cd+/bRBsHziwRC7R16aPrXEkTtA' + b'zdwSPMRPa1jagPLZWr4013NO5D7DRCoCwlTKwWEyRSCaNBjAGHZSceNnmmlCc7J7RYRVdA' + b'eMN1gcfLXB4vB4g4XgNrrIDrmnVzPQcvUEe7Yi7W/BMIS+lccB4coOAvoE9czQ8RyQ88vr' + b'KU3DJn41u2jYEcQa7MQAXoW1lNZhPRKUWCLeOKtG5NHNYKgP0c1gmo46FlSPy/g2D47Sl/' + b'F1HosrMDoZjSx67XZflZ7ROEQGWu8kaGm5Q2SwNH4O57ewNZw7RDSGIp9OHSYaYOUBCZkB' + b'8WauPONH0D8MqbSjmnSQOQ3kLc3IhOr1IuN1dLNO4bDvIboPmZCjdajaAkGDMkCsP2UWCt' + b'qTAW7pTiYpWnMyLiO9ySC3tCYjtNaZjEspSMMO+tLMkV5bMo6lSI0c8m5OY7JQK0PGtVeF' + b'HNEfN0bRnCa8RhnxXeR2tXlyMes5GaK9KLM/UuqylxqkuxqtXCYXubwMIYaFFUeEy8saDc' + b'hKS5VEz4HmyWWzDt1HkYIOt41VlpSzIZDd2yFCRH3b2CKQ3jMmxIJJ9HnAJBlzhQXRVmmA' + b'nQDpUkUjdxItS4DqpjAIKTeUQUptJmnI8C4xSH3tD8LR14lBd7i4C8qaif30V860M0uraC' + b'muvqCsbSwdhbi0mFxQtgIdX1DGHNeQzhDk3ZUdMmTUtxSVye3lYXjVt1Ogz7+EO8yQqZKZ' + b'6Ogu148YrzyoluQq43J08xOkj1RGlAVX4PytQcVK0eYS7QlTIJD2m2u3uqvJFe4vJ6Jb9x' + b'TxnJ/s7cyy9QQlJxdaMRt8u2eRvsgLPCTQiqMtbzQonsg2158tCk/ox4ebMeh1SBO44fgL' + b'HzAPc4jcn4bK8DI2xPeYO0kBEaL8ZQKsdT0v37+Mn8qGwnc1/E2L5Gr0m4+xaPBD3UAPtz' + b'ZW8GrldBXgq1czG5S7f5KY/qP7rCoPSCeA6HVvh6yRboXfusVaOjRZ0le1LgN4y+45wr3F' + b'cwRqW2cwbgWSJtdhaEwHkSZf2cWXyVfZSyvwrbfSLB0MlEjrW4or0NwsWJIRtgdyRZbFCA' + b'hLkgYMS5KWNKe4oAE3QgWt2GDaz2pC5G0IL7uhZ/sahhkEqXo9qEHRS88YW78q3XI+JTlS' + b'LRtiV5rlguhYsVwC1JkzA23ejeDuiu8TzAg6qRYCcBKrngabLCOOPo8yizjhjaI4LAfWAK' + b'Pbb9vkq5/LIE16WWMFt2iC+uEkNHcL+TrkaV1/iJ3WR31XPObpDvNNRADdTgBGHS+qoJ6r' + b'VxDImJjefGe8HTN1UjxTG602yf9isEoPOoB58lU6XVQlP/hVSGxQ+ZHjeiyeoeLogW01TV' + b'5ZyFXy6rsVJPl1re4snYHUhzdWoPXhDU1H8i7IkGBqUOM+tG49qAMkeFZ2uAWF+2ou1uME' + b'ncF+fbs9hCE169ewU8g4R89ImtBfw0uUYTV9GjNib3WZvKpnhpbJa2i5pSXETB3d8Ksaz2' + b'uSaosN85BX1dKhO73q3axZChq+OSbwFuo0RSqixkoHIV+Rnk7dmwrJvKZUwyFNFvTFkAaQ' + b'Rwox0CrAzWWAL2cOh07VHeOFmEn7HZ4qB2i/1278Cstk9T2mDmFqHaHb2huT/GJRRYi7NJ' + b'zn4LjlZSqRclw7x8PrwV+kY5yEk3g8kn7lRrOXls2kfS+IRX7tRrNTz+b94ryja7SmVX6H' + b'L4tRLs2G/m46Zjccab4LxPjzb+PxRl2H9jTYCAZcFhVnLgmnMw0Yy4mTWG0/lr48/7fFu/' + b'r7TiStLhnQF7+X0GLsQjNRFHpBfDYBrVuNoaWZQOaoW0ce6SXXWQZa+9Z0pNQhQwbzMMmM' + b'H5HdC1noSf1GUIY4pL9GeEbfTLmF/KrPysFV6L1RB98OZqK0Sjj3xHDzpxqB82Xypza3zp' + b'JgT4lZ1p+6F4LTqBdqkj+jEx3QCf7kBUpNm0SWjui4xawRmfynkrXNEz4EBD30bb3ehA57' + b'2ib6tnRouG8yM18mcnF6Rlz1ZFkSXaNuvOmlLNJ68JiC1uOGpqOByDAkmhTUfs3h1e+6Ut' + b'yroSn3oI7iCozqwgJcrdqXcB7Ko7ZEGCaq5E3P9JG8qIAsLdPgInlTCuB0TtLcCB+GsGUW' + b'wFg3ZF6Od4pXxvWtkbCMGaORcB5zxzvNqFgRf7TlDIXk7Xp7GlPwt6vdaegmb7eNKzD+vn' + b'3HuALV9e2WccXMBGa3LIezXTcJGYc6oSoi029MU5nncZsmokZbQ16dDq8ZwHG9RRN4Q9sM' + b'JhbzCI8fxjI8fXHZlBl5vLmCgwYHKDYETAUbH7VnVXasGGcFOPdhijKDDF55YIm4bYpmaj' + b'/9agumUm+91oGRC1rwgvxgdIhY+sMb+mmMFWzD8eYYhYi6G6RtMA9mm48wT1NkmJYZMEzL' + b'DBlNsTKH6PsyVk0KMaID4ag0QxC5Zji62deKjnqWkgypDSiwqzuvoe29XV163V6BUT+C/s' + b'g8VmLPJ6AgBt1PGmFVh2ZieJNttIxJfgtv72KWJkvgLMmX4alDIe9ZAryXaR5D+oJRlCtt' + b'4uZIpR+skDN6sIIoftrBShkGLiQhOvGNIC4qg9EJRAfAS0VHGVyQIVVpAup03z/pPrZxWD' + b'+c+8c+ejQDQxp4u/4MPUTDVYBv+ZqRPS7GwoNa7CswKkbGrroVdowX3XuwJ9Xj5HJF2i8Y' + b'r5JvHFvnyTd9WA36xjdZRCbPO2/wrS8cIK2MOmuSI6NOBnVt1FkZNBh1Gldjo04G16szXJ' + b'mhR0e4JgC1jSdD+qN7xIRbHVhFCRs0visQvfW39fEPtSnPGN/M2adlaT9D1xABoXNwcOge' + 
b'AGhtCSn1S+VVi28ZqWeWcCM1an0KwBp+8tO+sV4tzJcYVjraj9ezPPkWLeAgtpuWk2hS37' + b'pbJ6NRAaITtgg/OmFL+mh2rybmK2z/WFrtX5UG8FtSltJ7Sh4Jm0oWiXeVbLB6s8gi0W6R' + b'hfSukEXUzo8F9HkXi/jtHUuZZvT7wLfOqAusAngYDg7PJpNFwK0MwFD3ndEakhGdR0ShbD' + b'vdnOYEzKK/vko+I6oLj+HcLr3KcG4U3zL5Fh0rQwWOjpWRPgzqPnBUQW0lwoYRDYwQNToR' + b'A/fRiRjQ0s/D79gsABOib2GDDQmK7OEReGQPP0/+7a59v0z+H+SUGTTsMAEA' + )).decode().splitlines() + + +def get_tessdata() -> str: + """Detect Tesseract-OCR and return its language support folder. + + This function can be used to enable OCR via Tesseract even if the + environment variable TESSDATA_PREFIX has not been set. + If the value of TESSDATA_PREFIX is None, the function tries to locate + Tesseract-OCR and fills the required variable. + + Returns: + Folder name of tessdata if Tesseract-OCR is available, otherwise False. + """ + TESSDATA_PREFIX = os.getenv("TESSDATA_PREFIX") + if TESSDATA_PREFIX != None: + return TESSDATA_PREFIX + + if sys.platform == "win32": + tessdata = "C:\\Program Files\\Tesseract-OCR\\tessdata" + else: + tessdata = "/usr/share/tesseract-ocr/4.00/tessdata" + + if os.path.exists(tessdata): + return tessdata + + """ + Try to locate the tesseract-ocr installation. + """ + # Windows systems: + if sys.platform == "win32": + try: + response = os.popen("where tesseract").read().strip() + except: + response = "" + if not response: + print("Tesseract-OCR is not installed") + return False + dirname = os.path.dirname(response) # path of tesseract.exe + tessdata = os.path.join(dirname, "tessdata") # language support + if os.path.exists(tessdata): # all ok? + return tessdata + else: # should not happen! + print("unexpected: Tesseract-OCR has no 'tessdata' folder", file=sys.stderr) + return False + + # Unix-like systems: + try: + response = os.popen("whereis tesseract-ocr").read().strip().split() + except: + response = "" + if len(response) != 2: # if not 2 tokens: no tesseract-ocr + print("Tesseract-OCR is not installed") + return False + + # determine tessdata via iteration over subfolders + tessdata = None + for sub_response in response.iterdir(): + for sub_sub in sub_response.iterdir(): + if str(sub_sub).endswith("tessdata"): + tessdata = sub_sub + break + if tessdata != None: + return tessdata + else: + print( + "unexpected: tesseract-ocr has no 'tessdata' folder", + file=sys.stderr, + ) + return False + return False + +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-select.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-select.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,71 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ +*/ +void remove_dest_range(fz_context *ctx, pdf_document *pdf, PyObject *numbers) +{ + fz_try(ctx) { + int i, j, pno, len, pagecount = pdf_count_pages(ctx, pdf); + PyObject *n1 = NULL; + pdf_obj *target, *annots, *pageref, *o, *action, *dest; + for (i = 0; i < pagecount; i++) { + n1 = PyLong_FromLong((long) i); + if (PySet_Contains(numbers, n1)) { + Py_DECREF(n1); + continue; + } + Py_DECREF(n1); + + pageref = pdf_lookup_page_obj(ctx, pdf, i); + annots = pdf_dict_get(ctx, pageref, PDF_NAME(Annots)); + if (!annots) continue; + len = pdf_array_len(ctx, annots); + for (j = len - 1; j >= 0; j -= 1) { + o = pdf_array_get(ctx, annots, j); + if (!pdf_name_eq(ctx, pdf_dict_get(ctx, o, PDF_NAME(Subtype)), PDF_NAME(Link))) { + continue; + } + action = pdf_dict_get(ctx, o, PDF_NAME(A)); + dest = pdf_dict_get(ctx, o, PDF_NAME(Dest)); + if (action) { + if (!pdf_name_eq(ctx, pdf_dict_get(ctx, action, + PDF_NAME(S)), PDF_NAME(GoTo))) + continue; + dest = pdf_dict_get(ctx, action, PDF_NAME(D)); + } + pno = -1; + if (pdf_is_array(ctx, dest)) { + target = pdf_array_get(ctx, dest, 0); + pno = pdf_lookup_page_number(ctx, pdf, target); + } + else if (pdf_is_string(ctx, dest)) { + fz_location location = fz_resolve_link(ctx, &pdf->super, + pdf_to_text_string(ctx, dest), + NULL, NULL); + pno = location.page; + } + if (pno < 0) { // page number lookup did not work + continue; + } + n1 = PyLong_FromLong((long) pno); + if (PySet_Contains(numbers, n1)) { + pdf_array_delete(ctx, annots, j); + } + Py_DECREF(n1); + } + } + } + + fz_catch(ctx) { + fz_rethrow(ctx); + } + return; +} +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-stext.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-stext.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1072 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ +*/ +// need own versions of ascender / descender +static const float +JM_font_ascender(fz_context *ctx, fz_font *font) +{ + if (skip_quad_corrections) { + return 0.8f; + } + return fz_font_ascender(ctx, font); +} + +static const float +JM_font_descender(fz_context *ctx, fz_font *font) +{ + if (skip_quad_corrections) { + return -0.2f; + } + return fz_font_descender(ctx, font); +} + + +//---------------------------------------------------------------- +// Return true if character is considered to be a word delimiter +//---------------------------------------------------------------- +static const int +JM_is_word_delimiter(int c, PyObject *delimiters) +{ + if (c <= 32 || c == 160) return 1; // a standard delimiter + + // extra delimiters must be a non-empty sequence + if (!delimiters || PyObject_Not(delimiters) || !PySequence_Check(delimiters)) { + return 0; + } + + // convert to tuple for easier looping + PyObject *delims = PySequence_Tuple(delimiters); + if (!delims) { + PyErr_Clear(); + return 0; + } + + // Make 1-char PyObject from character given as integer + PyObject *cchar = Py_BuildValue("C", c); // single character PyObject + Py_ssize_t i, len = PyTuple_Size(delims); + for (i = 0; i < len; i++) { + int rc = PyUnicode_Compare(cchar, PyTuple_GET_ITEM(delims, i)); + if (rc == 0) { // equal to a delimiter character + Py_DECREF(cchar); + Py_DECREF(delims); + PyErr_Clear(); + return 1; + } + } + + Py_DECREF(delims); + PyErr_Clear(); + return 0; +} + +/* inactive +//----------------------------------------------------------------------------- +// Make OCR text page directly from an fz_page +//----------------------------------------------------------------------------- +fz_stext_page * +JM_new_stext_page_ocr_from_page(fz_context *ctx, fz_page *page, fz_rect rect, int flags, + const char *lang, const char *tessdata) +{ + if (!page) return NULL; + int with_list = 1; + fz_stext_page *tp = NULL; + fz_device *dev = NULL, *ocr_dev = NULL; + fz_var(dev); + fz_var(ocr_dev); + fz_var(tp); + fz_stext_options options; + memset(&options, 0, sizeof options); + options.flags = flags; + //fz_matrix ctm = fz_identity; + fz_matrix ctm1 = fz_make_matrix(100/72, 0, 0, 100/72, 0, 0); + fz_matrix ctm2 = fz_make_matrix(400/72, 0, 0, 400/72, 0, 0); + + fz_try(ctx) { + tp = fz_new_stext_page(ctx, rect); + dev = fz_new_stext_device(ctx, tp, &options); + ocr_dev = fz_new_ocr_device(ctx, dev, fz_identity, rect, with_list, lang, tessdata, NULL); + fz_run_page(ctx, page, ocr_dev, fz_identity, NULL); + fz_close_device(ctx, ocr_dev); + fz_close_device(ctx, dev); + } + fz_always(ctx) { + fz_drop_device(ctx, dev); + fz_drop_device(ctx, ocr_dev); + } + fz_catch(ctx) { + fz_drop_stext_page(ctx, tp); + fz_rethrow(ctx); + } + return tp; +} +*/ + +//--------------------------------------------------------------------------- +// APPEND non-ascii runes in unicode escape format to fz_buffer +//--------------------------------------------------------------------------- +void JM_append_rune(fz_context *ctx, fz_buffer *buff, int ch) +{ + if (ch == 92) { // prevent accidental "\u" etc. 
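+        // Editorial note (not part of the original sources): JM_append_rune
+        // emits a literal backslash as "\u005c", copies bytes in the range
+        // 32..255 and newline unchanged, replaces lone surrogates with
+        // "\ufffd", and escapes all remaining code points as "\uxxxx" (BMP)
+        // or "\Uxxxxxxxx". This first branch covers the backslash so that a
+        // later unicode-escape decode cannot mistake it for the start of an
+        // escape sequence.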
+ fz_append_string(ctx, buff, "\\u005c"); + } else if ((ch >= 32 && ch <= 255) || ch == 10) { + fz_append_byte(ctx, buff, ch); + } else if (ch >= 0xd800 && ch <= 0xdfff) { // surrogate Unicode range + fz_append_string(ctx, buff, "\\ufffd"); + } else if (ch <= 0xffff) { // 4 hex digits + fz_append_printf(ctx, buff, "\\u%04x", ch); + } else { // 8 hex digits + fz_append_printf(ctx, buff, "\\U%08x", ch); + } +} + + +// re-compute char quad if ascender/descender values make no sense +static fz_quad +JM_char_quad(fz_context *ctx, fz_stext_line *line, fz_stext_char *ch) +{ + if (skip_quad_corrections) { // no special handling + return ch->quad; + } + if (line->wmode) { // never touch vertical write mode + return ch->quad; + } + fz_font *font = ch->font; + float asc = JM_font_ascender(ctx, font); + float dsc = JM_font_descender(ctx, font); + float c, s, fsize = ch->size; + float asc_dsc = asc - dsc + FLT_EPSILON; + if (asc_dsc >= 1 && small_glyph_heights == 0) { // no problem + return ch->quad; + } + if (asc < 1e-3) { // probably Tesseract glyphless font + dsc = -0.1f; + asc = 0.9f; + asc_dsc = 1.0f; + } + + if (small_glyph_heights || asc_dsc < 1) { + dsc = dsc / asc_dsc; + asc = asc / asc_dsc; + } + asc_dsc = asc - dsc; + asc = asc * fsize / asc_dsc; + dsc = dsc * fsize / asc_dsc; + + /* ------------------------------ + Re-compute quad with the adjusted ascender / descender values: + Move ch->origin to (0,0) and de-rotate quad, then adjust the corners, + re-rotate and move back to ch->origin location. + ------------------------------ */ + fz_matrix trm1, trm2, xlate1, xlate2; + fz_quad quad; + c = line->dir.x; // cosine + s = line->dir.y; // sine + trm1 = fz_make_matrix(c, -s, s, c, 0, 0); // derotate + trm2 = fz_make_matrix(c, s, -s, c, 0, 0); // rotate + if (c == -1) { // left-right flip + trm1.d = 1; + trm2.d = 1; + } + xlate1 = fz_make_matrix(1, 0, 0, 1, -ch->origin.x, -ch->origin.y); + xlate2 = fz_make_matrix(1, 0, 0, 1, ch->origin.x, ch->origin.y); + + quad = fz_transform_quad(ch->quad, xlate1); // move origin to (0,0) + quad = fz_transform_quad(quad, trm1); // de-rotate corners + + // adjust vertical coordinates + if (c == 1 && quad.ul.y > 0) { // up-down flip + quad.ul.y = asc; + quad.ur.y = asc; + quad.ll.y = dsc; + quad.lr.y = dsc; + } else { + quad.ul.y = -asc; + quad.ur.y = -asc; + quad.ll.y = -dsc; + quad.lr.y = -dsc; + } + + // adjust horizontal coordinates that are too crazy: + // (1) left x must be >= 0 + // (2) if bbox width is 0, lookup char advance in font. 
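+    // Illustrative example for case (2), with assumed numbers: if the glyph
+    // advance returned by fz_advance_glyph() is 0.5 and the font size is 10,
+    // the width reconstructed a few lines below becomes 0.5 * 10 = 5 units.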
+ if (quad.ll.x < 0) { + quad.ll.x = 0; + quad.ul.x = 0; + } + float cwidth = quad.lr.x - quad.ll.x; + if (cwidth < FLT_EPSILON) { + int glyph = fz_encode_character(ctx, font, ch->c); + if (glyph) { + float fwidth = fz_advance_glyph(ctx, font, glyph, line->wmode); + quad.lr.x = quad.ll.x + fwidth * fsize; + quad.ur.x = quad.lr.x; + } + } + + quad = fz_transform_quad(quad, trm2); // rotate back + quad = fz_transform_quad(quad, xlate2); // translate back + return quad; +} + + +// return rect of char quad +static fz_rect +JM_char_bbox(fz_context *ctx, fz_stext_line *line, fz_stext_char *ch) +{ + fz_rect r = fz_rect_from_quad(JM_char_quad(ctx, line, ch)); + if (!line->wmode) { + return r; + } + if (r.y1 < r.y0 + ch->size) { + r.y0 = r.y1 - ch->size; + } + return r; +} + + +//------------------------------------------- +// make a buffer from an stext_page's text +//------------------------------------------- +fz_buffer * +JM_new_buffer_from_stext_page(fz_context *ctx, fz_stext_page *page) +{ + fz_stext_block *block; + fz_stext_line *line; + fz_stext_char *ch; + fz_rect rect = page->mediabox; + fz_buffer *buf = NULL; + + fz_try(ctx) + { + buf = fz_new_buffer(ctx, 256); + for (block = page->first_block; block; block = block->next) { + if (block->type == FZ_STEXT_BLOCK_TEXT) { + for (line = block->u.t.first_line; line; line = line->next) { + for (ch = line->first_char; ch; ch = ch->next) { + if (!JM_rects_overlap(rect, JM_char_bbox(ctx, line, ch)) && + !fz_is_infinite_rect(rect)) { + continue; + } + fz_append_rune(ctx, buf, ch->c); + } + fz_append_byte(ctx, buf, '\n'); + } + fz_append_byte(ctx, buf, '\n'); + } + } + } + fz_catch(ctx) { + fz_drop_buffer(ctx, buf); + fz_rethrow(ctx); + } + return buf; +} + + +static float hdist(fz_point *dir, fz_point *a, fz_point *b) +{ + float dx = b->x - a->x; + float dy = b->y - a->y; + return fz_abs(dx * dir->x + dy * dir->y); +} + + +static float vdist(fz_point *dir, fz_point *a, fz_point *b) +{ + float dx = b->x - a->x; + float dy = b->y - a->y; + return fz_abs(dx * dir->y + dy * dir->x); +} + + +struct highlight +{ + Py_ssize_t len; + PyObject *quads; + float hfuzz, vfuzz; +}; + + +static void on_highlight_char(fz_context *ctx, void *arg, fz_stext_line *line, fz_stext_char *ch) +{ + struct highlight *hits = arg; + float vfuzz = ch->size * hits->vfuzz; + float hfuzz = ch->size * hits->hfuzz; + fz_quad ch_quad = JM_char_quad(ctx, line, ch); + if (hits->len > 0) { + PyObject *quad = PySequence_ITEM(hits->quads, hits->len - 1); + fz_quad end = JM_quad_from_py(quad); + Py_DECREF(quad); + if (hdist(&line->dir, &end.lr, &ch_quad.ll) < hfuzz + && vdist(&line->dir, &end.lr, &ch_quad.ll) < vfuzz + && hdist(&line->dir, &end.ur, &ch_quad.ul) < hfuzz + && vdist(&line->dir, &end.ur, &ch_quad.ul) < vfuzz) + { + end.ur = ch_quad.ur; + end.lr = ch_quad.lr; + quad = JM_py_from_quad(end); + PyList_SetItem(hits->quads, hits->len - 1, quad); + return; + } + } + LIST_APPEND_DROP(hits->quads, JM_py_from_quad(ch_quad)); + hits->len++; +} + + +static inline int canon(int c) +{ + /* TODO: proper unicode case folding */ + /* TODO: character equivalence (a matches ä, etc) */ + if (c == 0xA0 || c == 0x2028 || c == 0x2029) + return ' '; + if (c == '\r' || c == '\n' || c == '\t') + return ' '; + if (c >= 'A' && c <= 'Z') + return c - 'A' + 'a'; + return c; +} + + +static inline int chartocanon(int *c, const char *s) +{ + int n = fz_chartorune(c, s); + *c = canon(*c); + return n; +} + + +static const char *match_string(const char *h, const char *n) +{ + int hc, nc; + const char *e = h; + h 
+= chartocanon(&hc, h); + n += chartocanon(&nc, n); + while (hc == nc) + { + e = h; + if (hc == ' ') + do + h += chartocanon(&hc, h); + while (hc == ' '); + else + h += chartocanon(&hc, h); + if (nc == ' ') + do + n += chartocanon(&nc, n); + while (nc == ' '); + else + n += chartocanon(&nc, n); + } + return nc == 0 ? e : NULL; +} + + +static const char *find_string(const char *s, const char *needle, const char **endp) +{ + const char *end; + while (*s) + { + end = match_string(s, needle); + if (end) + return *endp = end, s; + ++s; + } + return *endp = NULL, NULL; +} + + +PyObject * +JM_search_stext_page(fz_context *ctx, fz_stext_page *page, const char *needle) +{ + struct highlight hits; + fz_stext_block *block; + fz_stext_line *line; + fz_stext_char *ch; + fz_buffer *buffer = NULL; + const char *haystack, *begin, *end; + fz_rect rect = page->mediabox; + int c, inside; + + if (strlen(needle) == 0) Py_RETURN_NONE; + PyObject *quads = PyList_New(0); + hits.len = 0; + hits.quads = quads; + hits.hfuzz = 0.2f; /* merge kerns but not large gaps */ + hits.vfuzz = 0.1f; + + fz_try(ctx) { + buffer = JM_new_buffer_from_stext_page(ctx, page); + haystack = fz_string_from_buffer(ctx, buffer); + begin = find_string(haystack, needle, &end); + if (!begin) goto no_more_matches; + + inside = 0; + for (block = page->first_block; block; block = block->next) { + if (block->type != FZ_STEXT_BLOCK_TEXT) { + continue; + } + for (line = block->u.t.first_line; line; line = line->next) { + for (ch = line->first_char; ch; ch = ch->next) { + if (!fz_is_infinite_rect(rect) && + !JM_rects_overlap(rect, JM_char_bbox(ctx, line, ch))) { + goto next_char; + } +try_new_match: + if (!inside) { + if (haystack >= begin) inside = 1; + } + if (inside) { + if (haystack < end) { + on_highlight_char(ctx, &hits, line, ch); + } else { + inside = 0; + begin = find_string(haystack, needle, &end); + if (!begin) goto no_more_matches; + else goto try_new_match; + } + } + haystack += fz_chartorune(&c, haystack); +next_char:; + } + assert(*haystack == '\n'); + ++haystack; + } + assert(*haystack == '\n'); + ++haystack; + } +no_more_matches:; + } + fz_always(ctx) + fz_drop_buffer(ctx, buffer); + fz_catch(ctx) + fz_rethrow(ctx); + + return quads; +} + + +//----------------------------------------------------------------------------- +// Plain text output. An identical copy of fz_print_stext_page_as_text, +// but lines within a block are concatenated by space instead a new-line +// character (which else leads to 2 new-lines). 
+//----------------------------------------------------------------------------- +void +JM_print_stext_page_as_text(fz_context *ctx, fz_buffer *buff, fz_stext_page *page) +{ + fz_stext_block *block; + fz_stext_line *line; + fz_stext_char *ch; + fz_rect rect = page->mediabox; + fz_rect chbbox; + int last_char = 0; + char utf[10]; + int i, n; + + for (block = page->first_block; block; block = block->next) { + if (block->type == FZ_STEXT_BLOCK_TEXT) { + for (line = block->u.t.first_line; line; line = line->next) { + last_char = 0; + for (ch = line->first_char; ch; ch = ch->next) { + chbbox = JM_char_bbox(ctx, line, ch); + if (fz_is_infinite_rect(rect) || + JM_rects_overlap(rect, chbbox)) { + last_char = ch->c; + JM_append_rune(ctx, buff, ch->c); + } + } + if (last_char != 10 && last_char > 0) { + fz_append_string(ctx, buff, "\n"); + } + } + } + } +} + +//----------------------------------------------------------------------------- +// Functions for wordlist output +//----------------------------------------------------------------------------- +int JM_append_word(fz_context *ctx, PyObject *lines, fz_buffer *buff, fz_rect *wbbox, + int block_n, int line_n, int word_n) +{ + PyObject *s = JM_EscapeStrFromBuffer(ctx, buff); + PyObject *litem = Py_BuildValue("ffffOiii", + wbbox->x0, + wbbox->y0, + wbbox->x1, + wbbox->y1, + s, + block_n, line_n, word_n); + LIST_APPEND_DROP(lines, litem); + Py_DECREF(s); + *wbbox = fz_empty_rect; + return word_n + 1; // word counter +} + +//----------------------------------------------------------------------------- +// Functions for dictionary output +//----------------------------------------------------------------------------- + +static int detect_super_script(fz_stext_line *line, fz_stext_char *ch) +{ + if (line->wmode == 0 && line->dir.x == 1 && line->dir.y == 0) + return ch->origin.y < line->first_char->origin.y - ch->size * 0.1f; + return 0; +} + +static int JM_char_font_flags(fz_context *ctx, fz_font *font, fz_stext_line *line, fz_stext_char *ch) +{ + int flags = detect_super_script(line, ch); + flags += fz_font_is_italic(ctx, font) * TEXT_FONT_ITALIC; + flags += fz_font_is_serif(ctx, font) * TEXT_FONT_SERIFED; + flags += fz_font_is_monospaced(ctx, font) * TEXT_FONT_MONOSPACED; + flags += fz_font_is_bold(ctx, font) * TEXT_FONT_BOLD; + return flags; +} + +static const char * +JM_font_name(fz_context *ctx, fz_font *font) +{ + const char *name = fz_font_name(ctx, font); + const char *s = strchr(name, '+'); + if (subset_fontnames || s == NULL || s-name != 6) { + return name; + } + return s + 1; +} + + +static fz_rect +JM_make_spanlist(fz_context *ctx, PyObject *line_dict, + fz_stext_line *line, int raw, fz_buffer *buff, + fz_rect tp_rect) +{ + PyObject *span = NULL, *char_list = NULL, *char_dict; + PyObject *span_list = PyList_New(0); + fz_clear_buffer(ctx, buff); + fz_stext_char *ch; + fz_rect span_rect = fz_empty_rect; + fz_rect line_rect = fz_empty_rect; + fz_point span_origin = {0, 0}; + typedef struct style_s { + float size; int flags; const char *font; int color; + float asc; float desc; + } char_style; + char_style old_style = { -1, -1, "", -1, 0, 0 }, style; + + for (ch = line->first_char; ch; ch = ch->next) { + fz_rect r = JM_char_bbox(ctx, line, ch); + if (!JM_rects_overlap(tp_rect, r) && + !fz_is_infinite_rect(tp_rect)) { + continue; + } + int flags = JM_char_font_flags(ctx, ch->font, line, ch); + fz_point origin = ch->origin; + style.size = ch->size; + style.flags = flags; + style.font = JM_font_name(ctx, ch->font); + style.color = ch->color; + 
style.asc = JM_font_ascender(ctx, ch->font); + style.desc = JM_font_descender(ctx, ch->font); + + if (style.size != old_style.size || + style.flags != old_style.flags || + style.color != old_style.color || + strcmp(style.font, old_style.font) != 0) { + + if (old_style.size >= 0) { + // not first one, output previous + if (raw) { + // put character list in the span + DICT_SETITEM_DROP(span, dictkey_chars, char_list); + char_list = NULL; + } else { + // put text string in the span + DICT_SETITEM_DROP(span, dictkey_text, JM_EscapeStrFromBuffer(ctx, buff)); + fz_clear_buffer(ctx, buff); + } + + DICT_SETITEM_DROP(span, dictkey_origin, + JM_py_from_point(span_origin)); + DICT_SETITEM_DROP(span, dictkey_bbox, + JM_py_from_rect(span_rect)); + line_rect = fz_union_rect(line_rect, span_rect); + LIST_APPEND_DROP(span_list, span); + span = NULL; + } + + span = PyDict_New(); + float asc = style.asc, desc = style.desc; + if (style.asc < 1e-3) { + asc = 0.9f; + desc = -0.1f; + } + + DICT_SETITEM_DROP(span, dictkey_size, Py_BuildValue("f", style.size)); + DICT_SETITEM_DROP(span, dictkey_flags, Py_BuildValue("i", style.flags)); + DICT_SETITEM_DROP(span, dictkey_font, JM_EscapeStrFromStr(style.font)); + DICT_SETITEM_DROP(span, dictkey_color, Py_BuildValue("i", style.color)); + DICT_SETITEMSTR_DROP(span, "ascender", Py_BuildValue("f", asc)); + DICT_SETITEMSTR_DROP(span, "descender", Py_BuildValue("f", desc)); + + old_style = style; + span_rect = r; + span_origin = origin; + + } + span_rect = fz_union_rect(span_rect, r); + + if (raw) { // make and append a char dict + char_dict = PyDict_New(); + DICT_SETITEM_DROP(char_dict, dictkey_origin, + JM_py_from_point(ch->origin)); + + DICT_SETITEM_DROP(char_dict, dictkey_bbox, + JM_py_from_rect(r)); + + DICT_SETITEM_DROP(char_dict, dictkey_c, + Py_BuildValue("C", ch->c)); + + if (!char_list) { + char_list = PyList_New(0); + } + LIST_APPEND_DROP(char_list, char_dict); + } else { // add character byte to buffer + JM_append_rune(ctx, buff, ch->c); + } + } + // all characters processed, now flush remaining span + if (span) { + if (raw) { + DICT_SETITEM_DROP(span, dictkey_chars, char_list); + char_list = NULL; + } else { + DICT_SETITEM_DROP(span, dictkey_text, JM_EscapeStrFromBuffer(ctx, buff)); + fz_clear_buffer(ctx, buff); + } + DICT_SETITEM_DROP(span, dictkey_origin, JM_py_from_point(span_origin)); + DICT_SETITEM_DROP(span, dictkey_bbox, JM_py_from_rect(span_rect)); + + if (!fz_is_empty_rect(span_rect)) { + LIST_APPEND_DROP(span_list, span); + line_rect = fz_union_rect(line_rect, span_rect); + } else { + Py_DECREF(span); + } + span = NULL; + } + if (!fz_is_empty_rect(line_rect)) { + DICT_SETITEM_DROP(line_dict, dictkey_spans, span_list); + } else { + DICT_SETITEM_DROP(line_dict, dictkey_spans, span_list); + } + return line_rect; +} + +static void JM_make_image_block(fz_context *ctx, fz_stext_block *block, PyObject *block_dict) +{ + fz_image *image = block->u.i.image; + fz_buffer *buf = NULL, *freebuf = NULL; + fz_compressed_buffer *buffer = fz_compressed_image_buffer(ctx, image); + fz_var(buf); + fz_var(freebuf); + int n = fz_colorspace_n(ctx, image->colorspace); + int w = image->w; + int h = image->h; + const char *ext = NULL; + int type = FZ_IMAGE_UNKNOWN; + if (buffer) + type = buffer->params.type; + if (type < FZ_IMAGE_BMP || type == FZ_IMAGE_JBIG2) + type = FZ_IMAGE_UNKNOWN; + PyObject *bytes = NULL; + fz_var(bytes); + fz_try(ctx) { + if (buffer && type != FZ_IMAGE_UNKNOWN) { + buf = buffer->buffer; + ext = JM_image_extension(type); + } else { + buf = freebuf = 
fz_new_buffer_from_image_as_png(ctx, image, fz_default_color_params); + ext = "png"; + } + bytes = JM_BinFromBuffer(ctx, buf); + } + fz_always(ctx) { + if (!bytes) + bytes = JM_BinFromChar(""); + + DICT_SETITEM_DROP(block_dict, dictkey_width, + Py_BuildValue("i", w)); + DICT_SETITEM_DROP(block_dict, dictkey_height, + Py_BuildValue("i", h)); + DICT_SETITEM_DROP(block_dict, dictkey_ext, + Py_BuildValue("s", ext)); + DICT_SETITEM_DROP(block_dict, dictkey_colorspace, + Py_BuildValue("i", n)); + DICT_SETITEM_DROP(block_dict, dictkey_xres, + Py_BuildValue("i", image->xres)); + DICT_SETITEM_DROP(block_dict, dictkey_yres, + Py_BuildValue("i", image->xres)); + DICT_SETITEM_DROP(block_dict, dictkey_bpc, + Py_BuildValue("i", (int) image->bpc)); + DICT_SETITEM_DROP(block_dict, dictkey_matrix, + JM_py_from_matrix(block->u.i.transform)); + DICT_SETITEM_DROP(block_dict, dictkey_size, + Py_BuildValue("n", PyBytes_Size(bytes))); + DICT_SETITEM_DROP(block_dict, dictkey_image, bytes); + + fz_drop_buffer(ctx, freebuf); + } + fz_catch(ctx) {;} + return; +} + +static void JM_make_text_block(fz_context *ctx, fz_stext_block *block, PyObject *block_dict, int raw, fz_buffer *buff, fz_rect tp_rect) +{ + fz_stext_line *line; + PyObject *line_list = PyList_New(0), *line_dict; + fz_rect block_rect = fz_empty_rect; + for (line = block->u.t.first_line; line; line = line->next) { + if (fz_is_empty_rect(fz_intersect_rect(tp_rect, line->bbox)) && + !fz_is_infinite_rect(tp_rect)) { + continue; + } + line_dict = PyDict_New(); + fz_rect line_rect = JM_make_spanlist(ctx, line_dict, line, raw, buff, tp_rect); + block_rect = fz_union_rect(block_rect, line_rect); + DICT_SETITEM_DROP(line_dict, dictkey_wmode, + Py_BuildValue("i", line->wmode)); + DICT_SETITEM_DROP(line_dict, dictkey_dir, JM_py_from_point(line->dir)); + DICT_SETITEM_DROP(line_dict, dictkey_bbox, + JM_py_from_rect(line_rect)); + LIST_APPEND_DROP(line_list, line_dict); + } + DICT_SETITEM_DROP(block_dict, dictkey_bbox, JM_py_from_rect(block_rect)); + DICT_SETITEM_DROP(block_dict, dictkey_lines, line_list); + return; +} + +void JM_make_textpage_dict(fz_context *ctx, fz_stext_page *tp, PyObject *page_dict, int raw) +{ + fz_stext_block *block; + fz_buffer *text_buffer = fz_new_buffer(ctx, 128); + PyObject *block_dict, *block_list = PyList_New(0); + fz_rect tp_rect = tp->mediabox; + int block_n = -1; + for (block = tp->first_block; block; block = block->next) { + block_n++; + if (!fz_contains_rect(tp_rect, block->bbox) && + !fz_is_infinite_rect(tp_rect) && + block->type == FZ_STEXT_BLOCK_IMAGE) { + continue; + } + if (!fz_is_infinite_rect(tp_rect) && + fz_is_empty_rect(fz_intersect_rect(tp_rect, block->bbox))) { + continue; + } + + block_dict = PyDict_New(); + DICT_SETITEM_DROP(block_dict, dictkey_number, Py_BuildValue("i", block_n)); + DICT_SETITEM_DROP(block_dict, dictkey_type, Py_BuildValue("i", block->type)); + if (block->type == FZ_STEXT_BLOCK_IMAGE) { + DICT_SETITEM_DROP(block_dict, dictkey_bbox, JM_py_from_rect(block->bbox)); + JM_make_image_block(ctx, block, block_dict); + } else { + JM_make_text_block(ctx, block, block_dict, raw, text_buffer, tp_rect); + } + + LIST_APPEND_DROP(block_list, block_dict); + } + DICT_SETITEM_DROP(page_dict, dictkey_blocks, block_list); + fz_drop_buffer(ctx, text_buffer); +} + + + +//--------------------------------------------------------------------- +PyObject * +JM_copy_rectangle(fz_context *ctx, fz_stext_page *page, fz_rect area) +{ + fz_stext_block *block; + fz_stext_line *line; + fz_stext_char *ch; + fz_buffer *buffer; + int 
need_new_line = 0; + PyObject *rc = NULL; + fz_try(ctx) { + buffer = fz_new_buffer(ctx, 1024); + for (block = page->first_block; block; block = block->next) { + if (block->type != FZ_STEXT_BLOCK_TEXT) + continue; + for (line = block->u.t.first_line; line; line = line->next) { + int line_had_text = 0; + for (ch = line->first_char; ch; ch = ch->next) { + fz_rect r = JM_char_bbox(ctx, line, ch); + if (JM_rects_overlap(area, r)) { + line_had_text = 1; + if (need_new_line) { + fz_append_string(ctx, buffer, "\n"); + need_new_line = 0; + } + JM_append_rune(ctx, buffer, ch->c); + } + } + if (line_had_text) + need_new_line = 1; + } + } + fz_terminate_buffer(ctx, buffer); + rc = JM_EscapeStrFromBuffer(ctx, buffer); + if (!rc) { + rc = EMPTY_STRING; + PyErr_Clear(); + } + } + fz_always(ctx) { + fz_drop_buffer(ctx, buffer); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return rc; +} +//--------------------------------------------------------------------- + + + + +fz_buffer *JM_object_to_buffer(fz_context *ctx, pdf_obj *what, int compress, int ascii) +{ + fz_buffer *res=NULL; + fz_output *out=NULL; + fz_try(ctx) { + res = fz_new_buffer(ctx, 512); + out = fz_new_output_with_buffer(ctx, res); + pdf_print_obj(ctx, out, what, compress, ascii); + } + fz_always(ctx) { + fz_drop_output(ctx, out); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + fz_terminate_buffer(ctx, res); + return res; +} + +//----------------------------------------------------------------------------- +// Merge the /Resources object created by a text pdf device into the page. +// The device may have created multiple /ExtGState/Alp? and /Font/F? objects. +// These need to be renamed (renumbered) to not overwrite existing page +// objects from previous executions. +// Returns the next available numbers n, m for objects /Alp, /F. +//----------------------------------------------------------------------------- +PyObject *JM_merge_resources(fz_context *ctx, pdf_page *page, pdf_obj *temp_res) +{ + // page objects /Resources, /Resources/ExtGState, /Resources/Font + pdf_obj *resources = pdf_dict_get(ctx, page->obj, PDF_NAME(Resources)); + pdf_obj *main_extg = pdf_dict_get(ctx, resources, PDF_NAME(ExtGState)); + pdf_obj *main_fonts = pdf_dict_get(ctx, resources, PDF_NAME(Font)); + + // text pdf device objects /ExtGState, /Font + pdf_obj *temp_extg = pdf_dict_get(ctx, temp_res, PDF_NAME(ExtGState)); + pdf_obj *temp_fonts = pdf_dict_get(ctx, temp_res, PDF_NAME(Font)); + + + int max_alp = -1, max_fonts = -1, i, n; + char text[20]; + + // Handle /Alp objects + if (pdf_is_dict(ctx, temp_extg)) // any created at all? + { + n = pdf_dict_len(ctx, temp_extg); + if (pdf_is_dict(ctx, main_extg)) { // does page have /ExtGState yet? 
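+            // Illustrative example (assumed names): if the page already owns
+            // /Alp0 and /Alp3, the loop below yields max_alp == 3, and the
+            // device's /Alp0, /Alp1, ... are copied in as /Alp4, /Alp5, ...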
+ for (i = 0; i < pdf_dict_len(ctx, main_extg); i++) { + // get highest number of objects named /Alpxxx + char *alp = (char *) pdf_to_name(ctx, pdf_dict_get_key(ctx, main_extg, i)); + if (strncmp(alp, "Alp", 3) != 0) continue; + int j = fz_atoi(alp + 3); + if (j > max_alp) max_alp = j; + } + } + else // create a /ExtGState for the page + main_extg = pdf_dict_put_dict(ctx, resources, PDF_NAME(ExtGState), n); + + max_alp += 1; + for (i = 0; i < n; i++) // copy over renumbered /Alp objects + { + char *alp = (char *) pdf_to_name(ctx, pdf_dict_get_key(ctx, temp_extg, i)); + int j = fz_atoi(alp + 3) + max_alp; + fz_snprintf(text, sizeof(text), "Alp%d", j); // new name + pdf_obj *val = pdf_dict_get_val(ctx, temp_extg, i); + pdf_dict_puts(ctx, main_extg, text, val); + } + } + + + if (pdf_is_dict(ctx, main_fonts)) { // has page any fonts yet? + for (i = 0; i < pdf_dict_len(ctx, main_fonts); i++) { // get max font number + char *font = (char *) pdf_to_name(ctx, pdf_dict_get_key(ctx, main_fonts, i)); + if (strncmp(font, "F", 1) != 0) continue; + int j = fz_atoi(font + 1); + if (j > max_fonts) max_fonts = j; + } + } + else // create a Resources/Font for the page + main_fonts = pdf_dict_put_dict(ctx, resources, PDF_NAME(Font), 2); + + max_fonts += 1; + for (i = 0; i < pdf_dict_len(ctx, temp_fonts); i++) { // copy renumbered fonts + char *font = (char *) pdf_to_name(ctx, pdf_dict_get_key(ctx, temp_fonts, i)); + int j = fz_atoi(font + 1) + max_fonts; + fz_snprintf(text, sizeof(text), "F%d", j); + pdf_obj *val = pdf_dict_get_val(ctx, temp_fonts, i); + pdf_dict_puts(ctx, main_fonts, text, val); + } + return Py_BuildValue("ii", max_alp, max_fonts); // next available numbers +} + + +//----------------------------------------------------------------------------- +// version of fz_show_string, which covers SMALL CAPS +//----------------------------------------------------------------------------- +fz_matrix +JM_show_string_cs(fz_context *ctx, fz_text *text, fz_font *user_font, fz_matrix trm, const char *s, + int wmode, int bidi_level, fz_bidi_direction markup_dir, fz_text_language language) +{ + fz_font *font=NULL; + int gid, ucs; + float adv; + + while (*s) + { + s += fz_chartorune(&ucs, s); + gid = fz_encode_character_sc(ctx, user_font, ucs); + if (gid == 0) { + gid = fz_encode_character_with_fallback(ctx, user_font, ucs, 0, language, &font); + } else { + font = user_font; + } + fz_show_glyph(ctx, text, font, trm, gid, ucs, wmode, bidi_level, markup_dir, language); + adv = fz_advance_glyph(ctx, font, gid, wmode); + if (wmode == 0) + trm = fz_pre_translate(trm, adv, 0); + else + trm = fz_pre_translate(trm, 0, -adv); + } + + return trm; +} + + +//----------------------------------------------------------------------------- +// version of fz_show_string, which also covers UCDN script +//----------------------------------------------------------------------------- +fz_matrix JM_show_string(fz_context *ctx, fz_text *text, fz_font *user_font, fz_matrix trm, const char *s, int wmode, int bidi_level, fz_bidi_direction markup_dir, fz_text_language language, int script) +{ + fz_font *font; + int gid, ucs; + float adv; + + while (*s) { + s += fz_chartorune(&ucs, s); + gid = fz_encode_character_with_fallback(ctx, user_font, ucs, script, language, &font); + fz_show_glyph(ctx, text, font, trm, gid, ucs, wmode, bidi_level, markup_dir, language); + adv = fz_advance_glyph(ctx, font, gid, wmode); + if (wmode == 0) + trm = fz_pre_translate(trm, adv, 0); + else + trm = fz_pre_translate(trm, 0, -adv); + } + return trm; +} + + 
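+/* inactive -- editorial usage sketch, not part of the original sources:
+   illustrates the argument precedence of JM_get_font() below, which is
+   fontfile > fontbuffer > ordering (CJK) > fontname (Base-14 / builtin)
+   > Noto fallback.
+
+static fz_font *example_load_helvetica(fz_context *ctx)
+{
+    // no font file, no buffer, ordering == -1 (no CJK): falls through to the
+    // Base-14 branch and loads MuPDF's built-in Helvetica, with embedding
+    // allowed (embed == 1).
+    return JM_get_font(ctx, "Helvetica", NULL, NULL, 0, 0, -1, 0, 0, 0, 1);
+}
+*/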
+//----------------------------------------------------------------------------- +// return a fz_font from a number of parameters +//----------------------------------------------------------------------------- +fz_font *JM_get_font(fz_context *ctx, + char *fontname, + char *fontfile, + PyObject *fontbuffer, + int script, + int lang, + int ordering, + int is_bold, + int is_italic, + int is_serif, + int embed) +{ + const unsigned char *data = NULL; + int size, index=0; + fz_buffer *res = NULL; + fz_font *font = NULL; + fz_try(ctx) { + if (fontfile) goto have_file; + if (EXISTS(fontbuffer)) goto have_buffer; + if (ordering > -1) goto have_cjk; + if (fontname) goto have_base14; + goto have_noto; + + // Base-14 or a MuPDF builtin font + have_base14:; + font = fz_new_base14_font(ctx, fontname); + if (font) { + goto fertig; + } + font = fz_new_builtin_font(ctx, fontname, is_bold, is_italic); + goto fertig; + + // CJK font + have_cjk:; + font = fz_new_cjk_font(ctx, ordering); + goto fertig; + + // fontfile + have_file:; + font = fz_new_font_from_file(ctx, NULL, fontfile, index, 0); + goto fertig; + + // fontbuffer + have_buffer:; + res = JM_BufferFromBytes(ctx, fontbuffer); + font = fz_new_font_from_buffer(ctx, NULL, res, index, 0); + goto fertig; + + // Check for NOTO font + have_noto:; + data = fz_lookup_noto_font(ctx, script, lang, &size, &index); + if (data) font = fz_new_font_from_memory(ctx, NULL, data, size, index, 0); + if (font) goto fertig; + font = fz_load_fallback_font(ctx, script, lang, is_serif, is_bold, is_italic); + goto fertig; + + fertig:; + if (!font) { + RAISEPY(ctx, MSG_FONT_FAILED, PyExc_RuntimeError); + } + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR >= 22 + // if font allows this, set embedding + if (!font->flags.never_embed) { + fz_set_font_embedding(ctx, font, embed); + } + #endif + } + fz_always(ctx) { + fz_drop_buffer(ctx, res); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return font; +} + +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/helper-xobject.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/helper-xobject.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,285 @@ +%{ +/* +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. 
+# ------------------------------------------------------------------------ +*/ +//----------------------------------------------------------------------------- +// Read and concatenate a PDF page's /Conents object(s) in a buffer +//----------------------------------------------------------------------------- +fz_buffer *JM_read_contents(fz_context * ctx, pdf_obj * pageref) +{ + fz_buffer *res = NULL, *nres = NULL; + int i; + fz_try(ctx) { + pdf_obj *contents = pdf_dict_get(ctx, pageref, PDF_NAME(Contents)); + if (pdf_is_array(ctx, contents)) { + res = fz_new_buffer(ctx, 1024); + for (i = 0; i < pdf_array_len(ctx, contents); i++) { + nres = pdf_load_stream(ctx, pdf_array_get(ctx, contents, i)); + fz_append_buffer(ctx, res, nres); + fz_drop_buffer(ctx, nres); + } + } + else if (contents) { + res = pdf_load_stream(ctx, contents); + } + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return res; +} + +//----------------------------------------------------------------------------- +// Make an XObject from a PDF page +// For a positive xref assume that its object can be used instead +//----------------------------------------------------------------------------- +pdf_obj *JM_xobject_from_page(fz_context * ctx, pdf_document * pdfout, fz_page * fsrcpage, int xref, pdf_graft_map *gmap) +{ + pdf_obj *xobj1, *resources = NULL, *o, *spageref; + fz_try(ctx) { + if (xref > 0) { + xobj1 = pdf_new_indirect(ctx, pdfout, xref, 0); + } else { + fz_buffer *res = NULL; + fz_rect mediabox; + pdf_page *srcpage = pdf_page_from_fz_page(ctx, fsrcpage); + spageref = srcpage->obj; + mediabox = pdf_to_rect(ctx, pdf_dict_get_inheritable(ctx, spageref, PDF_NAME(MediaBox))); + // Deep-copy resources object of source page + o = pdf_dict_get_inheritable(ctx, spageref, PDF_NAME(Resources)); + if (gmap) // use graftmap when possible + resources = pdf_graft_mapped_object(ctx, gmap, o); + else + resources = pdf_graft_object(ctx, pdfout, o); + + // get spgage contents source + res = JM_read_contents(ctx, spageref); + + //------------------------------------------------------------- + // create XObject representing the source page + //------------------------------------------------------------- + xobj1 = pdf_new_xobject(ctx, pdfout, mediabox, fz_identity, NULL, res); + // store spage contents + JM_update_stream(ctx, pdfout, xobj1, res, 1); + fz_drop_buffer(ctx, res); + + // store spage resources + pdf_dict_put_drop(ctx, xobj1, PDF_NAME(Resources), resources); + } + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return xobj1; +} + +//----------------------------------------------------------------------------- +// Insert a buffer as a new separate /Contents object of a page. +// 1. Create a new stream object from buffer 'newcont' +// 2. If /Contents already is an array, then just prepend or append this object +// 3. Else, create new array and put old content obj and this object into it. +// If the page had no /Contents before, just create a 1-item array. 
+//----------------------------------------------------------------------------- +int JM_insert_contents(fz_context * ctx, pdf_document * pdf, + pdf_obj * pageref, fz_buffer * newcont, int overlay) +{ + int xref = 0; + pdf_obj *newconts = NULL; + pdf_obj *carr = NULL; + fz_var(newconts); + fz_var(carr); + fz_try(ctx) { + pdf_obj *contents = pdf_dict_get(ctx, pageref, PDF_NAME(Contents)); + newconts = pdf_add_stream(ctx, pdf, newcont, NULL, 0); + xref = pdf_to_num(ctx, newconts); + if (pdf_is_array(ctx, contents)) { + if (overlay) // append new object + pdf_array_push(ctx, contents, newconts); + else // prepend new object + pdf_array_insert(ctx, contents, newconts, 0); + } else { + carr = pdf_new_array(ctx, pdf, 5); + if (overlay) { + if (contents) + pdf_array_push(ctx, carr, contents); + pdf_array_push(ctx, carr, newconts); + } else { + pdf_array_push(ctx, carr, newconts); + if (contents) + pdf_array_push(ctx, carr, contents); + } + pdf_dict_put(ctx, pageref, PDF_NAME(Contents), carr); + } + } + fz_always(ctx) { + pdf_drop_obj(ctx, newconts); + pdf_drop_obj(ctx, carr); + } + fz_catch(ctx) { + fz_rethrow(ctx); + } + return xref; +} + +static void show(const char* prefix, PyObject* obj) +{ + if (!obj) + { + printf( "%s \n", prefix); + return; + } + PyObject* obj_repr = PyObject_Repr( obj); + PyObject* obj_repr_u = PyUnicode_AsEncodedString( obj_repr, "utf-8", "~E~"); + const char* obj_repr_s = PyString_AsString( obj_repr_u); + printf( "%s%s\n", prefix, obj_repr_s); + fflush(stdout); +} + +static PyObject *g_img_info = NULL; +static fz_matrix g_img_info_matrix = {0}; + +static fz_image * +JM_image_filter(fz_context *ctx, void *opaque, fz_matrix ctm, const char *name, fz_image *image) +{ + fz_quad q = fz_transform_quad(fz_quad_from_rect(fz_unit_rect), ctm); + #if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR >= 22 + q = fz_transform_quad( q, g_img_info_matrix); + #endif + PyObject *temp = Py_BuildValue("sN", name, JM_py_from_quad(q)); + + LIST_APPEND_DROP(g_img_info, temp); + return image; +} + +#if FZ_VERSION_MAJOR == 1 && FZ_VERSION_MINOR >= 22 + +static PyObject * +JM_image_reporter(fz_context *ctx, pdf_page *page) +{ + pdf_document *doc = page->doc; + + pdf_page_transform(ctx, page, NULL, &g_img_info_matrix); + pdf_filter_options filter_options = {0}; + filter_options.recurse = 0; + filter_options.instance_forms = 1; + filter_options.ascii = 1; + filter_options.no_update = 1; + + pdf_sanitize_filter_options sanitize_filter_options = {0}; + sanitize_filter_options.opaque = page; + sanitize_filter_options.image_filter = JM_image_filter; + + pdf_filter_factory filter_factory[2] = {0}; + filter_factory[0].filter = pdf_new_sanitize_filter; + filter_factory[0].options = &sanitize_filter_options; + + filter_options.filters = filter_factory; // was & + + g_img_info = PyList_New(0); + + pdf_filter_page_contents(ctx, doc, page, &filter_options); + + PyObject *rc = PySequence_Tuple(g_img_info); + Py_CLEAR(g_img_info); + + return rc; +} + +#else + +void +JM_filter_content_stream( + fz_context * ctx, + pdf_document * doc, + pdf_obj * in_stm, + pdf_obj * in_res, + fz_matrix transform, + pdf_filter_options * filter, + int struct_parents, + fz_buffer **out_buf, + pdf_obj **out_res) +{ + pdf_processor *proc_buffer = NULL; + pdf_processor *proc_filter = NULL; + + fz_var(proc_buffer); + fz_var(proc_filter); + + *out_buf = NULL; + *out_res = NULL; + + fz_try(ctx) { + *out_buf = fz_new_buffer(ctx, 1024); + proc_buffer = pdf_new_buffer_processor(ctx, *out_buf, filter->ascii); + if (filter->sanitize) { + *out_res = 
pdf_new_dict(ctx, doc, 1); + proc_filter = pdf_new_filter_processor(ctx, doc, proc_buffer, in_res, *out_res, struct_parents, transform, filter); + pdf_process_contents(ctx, proc_filter, doc, in_res, in_stm, NULL); + pdf_close_processor(ctx, proc_filter); + } else { + *out_res = pdf_keep_obj(ctx, in_res); + pdf_process_contents(ctx, proc_buffer, doc, in_res, in_stm, NULL); + } + pdf_close_processor(ctx, proc_buffer); + } + fz_always(ctx) { + pdf_drop_processor(ctx, proc_filter); + pdf_drop_processor(ctx, proc_buffer); + } + fz_catch(ctx) { + fz_drop_buffer(ctx, *out_buf); + *out_buf = NULL; + pdf_drop_obj(ctx, *out_res); + *out_res = NULL; + fz_rethrow(ctx); + } +} + +PyObject * +JM_image_reporter(fz_context *ctx, pdf_page *page) +{ + pdf_document *doc = page->doc; + pdf_filter_options filter; + memset(&filter, 0, sizeof filter); + filter.opaque = page; + filter.text_filter = NULL; + filter.image_filter = JM_image_filter; + filter.end_page = NULL; + filter.recurse = 0; + filter.instance_forms = 1; + filter.sanitize = 1; + filter.ascii = 1; + + pdf_obj *contents, *old_res; + pdf_obj *struct_parents_obj; + pdf_obj *new_res; + fz_buffer *buffer; + int struct_parents; + fz_matrix ctm = fz_identity; + pdf_page_transform(ctx, page, NULL, &ctm); + struct_parents_obj = pdf_dict_get(ctx, page->obj, PDF_NAME(StructParents)); + struct_parents = -1; + if (pdf_is_number(ctx, struct_parents_obj)) + struct_parents = pdf_to_int(ctx, struct_parents_obj); + + contents = pdf_page_contents(ctx, page); + old_res = pdf_page_resources(ctx, page); + g_img_info = PyList_New(0); + JM_filter_content_stream(ctx, doc, contents, old_res, ctm, &filter, struct_parents, &buffer, &new_res); + fz_drop_buffer(ctx, buffer); + pdf_drop_obj(ctx, new_res); + PyObject *rc = PySequence_Tuple(g_img_info); + Py_CLEAR(g_img_info); + return rc; +} + +#endif + +%} diff -r 000000000000 -r 1d09e1dec1d9 src_classic/utils.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/utils.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,5507 @@ +# ------------------------------------------------------------------------ +# Copyright 2020-2022, Harald Lieder, mailto:harald.lieder@outlook.com +# License: GNU AFFERO GPL 3.0, https://www.gnu.org/licenses/agpl-3.0.html +# +# Part of "PyMuPDF", a Python binding for "MuPDF" (http://mupdf.com), a +# lightweight PDF, XPS, and E-book viewer, renderer and toolkit which is +# maintained and developed by Artifex Software, Inc. https://artifex.com. +# ------------------------------------------------------------------------ +import io +import json +import math +import os +import random +import string +import tempfile +import typing +import warnings + +from fitz_old import * + +TESSDATA_PREFIX = os.getenv("TESSDATA_PREFIX") +point_like = "point_like" +rect_like = "rect_like" +matrix_like = "matrix_like" +quad_like = "quad_like" + +# ByteString is gone from typing in 3.14. +# collections.abc.Buffer available from 3.12 only +try: + ByteString = typing.ByteString +except AttributeError: + ByteString = bytes | bytearray | memoryview + +AnyType = typing.Any +OptInt = typing.Union[int, None] +OptFloat = typing.Optional[float] +OptStr = typing.Optional[str] +OptDict = typing.Optional[dict] +OptBytes = typing.Optional[ByteString] +OptSeq = typing.Optional[typing.Sequence] + +""" +This is a collection of functions to extend PyMupdf. +""" + + +def write_text(page: Page, **kwargs) -> None: + """Write the text of one or more TextWriter objects. + + Args: + rect: target rectangle. 
If None, the union of the text writers is used. + writers: one or more TextWriter objects. + overlay: put in foreground or background. + keep_proportion: maintain aspect ratio of rectangle sides. + rotate: arbitrary rotation angle. + oc: the xref of an optional content object + """ + if type(page) is not Page: + raise ValueError("bad page parameter") + s = { + k + for k in kwargs.keys() + if k + not in { + "rect", + "writers", + "opacity", + "color", + "overlay", + "keep_proportion", + "rotate", + "oc", + } + } + if s != set(): + raise ValueError("bad keywords: " + str(s)) + + rect = kwargs.get("rect") + writers = kwargs.get("writers") + opacity = kwargs.get("opacity") + color = kwargs.get("color") + overlay = bool(kwargs.get("overlay", True)) + keep_proportion = bool(kwargs.get("keep_proportion", True)) + rotate = int(kwargs.get("rotate", 0)) + oc = int(kwargs.get("oc", 0)) + + if not writers: + raise ValueError("need at least one TextWriter") + if type(writers) is TextWriter: + if rotate == 0 and rect is None: + writers.write_text(page, opacity=opacity, color=color, overlay=overlay) + return None + else: + writers = (writers,) + clip = writers[0].text_rect + textdoc = Document() + tpage = textdoc.new_page(width=page.rect.width, height=page.rect.height) + for writer in writers: + clip |= writer.text_rect + writer.write_text(tpage, opacity=opacity, color=color) + if rect is None: + rect = clip + page.show_pdf_page( + rect, + textdoc, + 0, + overlay=overlay, + keep_proportion=keep_proportion, + rotate=rotate, + clip=clip, + oc=oc, + ) + textdoc = None + tpage = None + + +def show_pdf_page(*args, **kwargs) -> int: + """Show page number 'pno' of PDF 'src' in rectangle 'rect'. + + Args: + rect: (rect-like) where to place the source image + src: (document) source PDF + pno: (int) source page number + overlay: (bool) put in foreground + keep_proportion: (bool) do not change width-height-ratio + rotate: (int) degrees (multiple of 90) + clip: (rect-like) part of source page rectangle + Returns: + xref of inserted object (for reuse) + """ + if len(args) not in (3, 4): + raise ValueError("bad number of positional parameters") + pno = None + if len(args) == 3: + page, rect, src = args + else: + page, rect, src, pno = args + if pno == None: + pno = int(kwargs.get("pno", 0)) + overlay = bool(kwargs.get("overlay", True)) + keep_proportion = bool(kwargs.get("keep_proportion", True)) + rotate = float(kwargs.get("rotate", 0)) + oc = int(kwargs.get("oc", 0)) + clip = kwargs.get("clip") + + def calc_matrix(sr, tr, keep=True, rotate=0): + """Calculate transformation matrix from source to target rect. + + Notes: + The product of four matrices in this sequence: (1) translate correct + source corner to origin, (2) rotate, (3) scale, (4) translate to + target's top-left corner. + Args: + sr: source rect in PDF (!) coordinate system + tr: target rect in PDF coordinate system + keep: whether to keep source ratio of width to height + rotate: rotation angle in degrees + Returns: + Transformation matrix. 
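+
+        Example (illustrative):
+            With sr = Rect(0, 0, 100, 100), tr = Rect(0, 0, 50, 100),
+            keep=True and rotate=0: fw = 50/100 = 0.5, fh = 100/100 = 1.0,
+            so both scale factors become 0.5 and the source is mapped to a
+            50 x 50 area centered on the target's center point.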
+ """ + # calc center point of source rect + smp = (sr.tl + sr.br) / 2.0 + # calc center point of target rect + tmp = (tr.tl + tr.br) / 2.0 + + # m moves to (0, 0), then rotates + m = Matrix(1, 0, 0, 1, -smp.x, -smp.y) * Matrix(rotate) + + sr1 = sr * m # resulting source rect to calculate scale factors + + fw = tr.width / sr1.width # scale the width + fh = tr.height / sr1.height # scale the height + if keep: + fw = fh = min(fw, fh) # take min if keeping aspect ratio + + m *= Matrix(fw, fh) # concat scale matrix + m *= Matrix(1, 0, 0, 1, tmp.x, tmp.y) # concat move to target center + return JM_TUPLE(m) + + CheckParent(page) + doc = page.parent + + if not doc.is_pdf or not src.is_pdf: + raise ValueError("is no PDF") + + if rect.is_empty or rect.is_infinite: + raise ValueError("rect must be finite and not empty") + + while pno < 0: # support negative page numbers + pno += src.page_count + src_page = src[pno] # load source page + if src_page.get_contents() == []: + raise ValueError("nothing to show - source page empty") + + tar_rect = rect * ~page.transformation_matrix # target rect in PDF coordinates + + src_rect = src_page.rect if not clip else src_page.rect & clip # source rect + if src_rect.is_empty or src_rect.is_infinite: + raise ValueError("clip must be finite and not empty") + src_rect = src_rect * ~src_page.transformation_matrix # ... in PDF coord + + matrix = calc_matrix(src_rect, tar_rect, keep=keep_proportion, rotate=rotate) + + # list of existing /Form /XObjects + ilst = [i[1] for i in doc.get_page_xobjects(page.number)] + ilst += [i[7] for i in doc.get_page_images(page.number)] + ilst += [i[4] for i in doc.get_page_fonts(page.number)] + + # create a name not in that list + n = "fzFrm" + i = 0 + _imgname = n + "0" + while _imgname in ilst: + i += 1 + _imgname = n + str(i) + + isrc = src._graft_id # used as key for graftmaps + if doc._graft_id == isrc: + raise ValueError("source document must not equal target") + + # retrieve / make Graftmap for source PDF + gmap = doc.Graftmaps.get(isrc, None) + if gmap is None: + gmap = Graftmap(doc) + doc.Graftmaps[isrc] = gmap + + # take note of generated xref for automatic reuse + pno_id = (isrc, pno) # id of src[pno] + xref = doc.ShownPages.get(pno_id, 0) + + xref = page._show_pdf_page( + src_page, + overlay=overlay, + matrix=matrix, + xref=xref, + oc=oc, + clip=src_rect, + graftmap=gmap, + _imgname=_imgname, + ) + doc.ShownPages[pno_id] = xref + + return xref + + +def replace_image(page: Page, xref: int, *, filename=None, pixmap=None, stream=None): + """Replace the image referred to by xref. + + Replace the image by changing the object definition stored under xref. This + will leave the pages appearance instructions intact, so the new image is + being displayed with the same bbox, rotation etc. + By providing a small fully transparent image, an effect as if the image had + been deleted can be achieved. + A typical use may include replacing large images by a smaller version, + e.g. with a lower resolution or graylevel instead of colored. + + Args: + xref: the xref of the image to replace. + filename, pixmap, stream: exactly one of these must be provided. The + meaning being the same as in Page.insert_image. 
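+
+    Example (illustrative; assumes the page shows at least one image and that
+    'smaller.jpg' exists):
+        xref = page.get_images()[0][0]  # xref of the first image on the page
+        page.replace_image(xref, filename="smaller.jpg")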
+ """ + doc = page.parent # the owning document + if not doc.xref_is_image(xref): + raise ValueError("xref not an image") # insert new image anywhere in page + if bool(filename) + bool(stream) + bool(pixmap) != 1: + raise ValueError("Exactly one of filename/stream/pixmap must be given") + new_xref = page.insert_image( + page.rect, filename=filename, stream=stream, pixmap=pixmap + ) + doc.xref_copy(new_xref, xref) # copy over new to old + last_contents_xref = page.get_contents()[-1] + # new image insertion has created a new /Contents source, + # which we will set to spaces now + doc.update_stream(last_contents_xref, b" ") + + +def delete_image(page: Page, xref: int): + """Delete the image referred to by xef. + + Actually replaces by a small transparent Pixmap using method Page.replace_image. + + Args: + xref: xref of the image to delete. + """ + # make a small 100% transparent pixmap (of just any dimension) + pix = fitz_old.Pixmap(fitz_old.csGRAY, (0, 0, 1, 1), 1) + pix.clear_with() # clear all samples bytes to 0x00 + page.replace_image(xref, pixmap=pix) + + +def insert_image(page, rect, **kwargs): + """Insert an image for display in a rectangle. + + Args: + rect: (rect_like) position of image on the page. + alpha: (int, optional) set to 0 if image has no transparency. + filename: (str, Path, file object) image filename. + keep_proportion: (bool) keep width / height ratio (default). + mask: (bytes, optional) image consisting of alpha values to use. + oc: (int) xref of OCG or OCMD to declare as Optional Content. + overlay: (bool) put in foreground (default) or background. + pixmap: (Pixmap) use this as image. + rotate: (int) rotate by 0, 90, 180 or 270 degrees. + stream: (bytes) use this as image. + xref: (int) use this as image. + + 'page' and 'rect' are positional, all other parameters are keywords. + + If 'xref' is given, that image is used. Other input options are ignored. + Else, exactly one of pixmap, stream or filename must be given. + + 'alpha=0' for non-transparent images improves performance significantly. + Affects stream and filename only. + + Optimum transparent insertions are possible by using filename / stream in + conjunction with a 'mask' image of alpha values. + + Returns: + xref (int) of inserted image. Re-use as argument for multiple insertions. 
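+
+    Example (illustrative; assumes 'logo.png' exists):
+        rect = Rect(50, 50, 150, 150)
+        xref = page.insert_image(rect, filename="logo.png")
+        # re-insert the same image elsewhere without storing it again:
+        page.insert_image(Rect(200, 50, 300, 150), xref=xref)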
+ """ + CheckParent(page) + doc = page.parent + if not doc.is_pdf: + raise ValueError("is no PDF") + + valid_keys = { + "alpha", + "filename", + "height", + "keep_proportion", + "mask", + "oc", + "overlay", + "pixmap", + "rotate", + "stream", + "width", + "xref", + } + s = set(kwargs.keys()).difference(valid_keys) + if s != set(): + raise ValueError(f"bad key argument(s): {s}.") + filename = kwargs.get("filename") + pixmap = kwargs.get("pixmap") + stream = kwargs.get("stream") + mask = kwargs.get("mask") + rotate = int(kwargs.get("rotate", 0)) + width = int(kwargs.get("width", 0)) + height = int(kwargs.get("height", 0)) + alpha = int(kwargs.get("alpha", -1)) + oc = int(kwargs.get("oc", 0)) + xref = int(kwargs.get("xref", 0)) + keep_proportion = bool(kwargs.get("keep_proportion", True)) + overlay = bool(kwargs.get("overlay", True)) + + if xref == 0 and (bool(filename) + bool(stream) + bool(pixmap) != 1): + raise ValueError("xref=0 needs exactly one of filename, pixmap, stream") + + if filename: + if type(filename) is str: + pass + elif hasattr(filename, "absolute"): + filename = str(filename) + elif hasattr(filename, "name"): + filename = filename.name + else: + raise ValueError("bad filename") + + if filename and not os.path.exists(filename): + raise FileNotFoundError("No such file: '%s'" % filename) + elif stream and type(stream) not in (bytes, bytearray, io.BytesIO): + raise ValueError("stream must be bytes-like / BytesIO") + elif pixmap and type(pixmap) is not Pixmap: + raise ValueError("pixmap must be a Pixmap") + if mask and not (stream or filename): + raise ValueError("mask requires stream or filename") + if mask and type(mask) not in (bytes, bytearray, io.BytesIO): + raise ValueError("mask must be bytes-like / BytesIO") + while rotate < 0: + rotate += 360 + while rotate >= 360: + rotate -= 360 + if rotate not in (0, 90, 180, 270): + raise ValueError("bad rotate value") + + r = Rect(rect) + if r.is_empty or r.is_infinite: + raise ValueError("rect must be finite and not empty") + clip = r * ~page.transformation_matrix + + # Create a unique image reference name. + ilst = [i[7] for i in doc.get_page_images(page.number)] + ilst += [i[1] for i in doc.get_page_xobjects(page.number)] + ilst += [i[4] for i in doc.get_page_fonts(page.number)] + n = "fzImg" # 'fitz image' + i = 0 + _imgname = n + "0" # first name candidate + while _imgname in ilst: + i += 1 + _imgname = n + str(i) # try new name + + digests = doc.InsertedImages + xref, digests = page._insert_image( + filename=filename, + pixmap=pixmap, + stream=stream, + imask=mask, + clip=clip, + overlay=overlay, + oc=oc, + xref=xref, + rotate=rotate, + keep_proportion=keep_proportion, + width=width, + height=height, + alpha=alpha, + _imgname=_imgname, + digests=digests, + ) + if digests != None: + doc.InsertedImages = digests + + return xref + + +def search_for(*args, **kwargs) -> list: + """Search for a string on a page. + + Args: + text: string to be searched for + clip: restrict search to this rectangle + quads: (bool) return quads instead of rectangles + flags: bit switches, default: join hyphened words + textpage: a pre-created TextPage + Returns: + a list of rectangles or quads, each containing one occurrence. 
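+
+    Example (illustrative; highlight annotations require a PDF page):
+        for rect in page.search_for("needle"):
+            page.add_highlight_annot(rect)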
+ """ + if len(args) != 2: + raise ValueError("bad number of positional parameters") + page, text = args + quads = kwargs.get("quads", 0) + clip = kwargs.get("clip") + textpage = kwargs.get("textpage") + if clip != None: + clip = Rect(clip) + flags = kwargs.get( + "flags", + TEXT_DEHYPHENATE + | TEXT_PRESERVE_WHITESPACE + | TEXT_PRESERVE_LIGATURES + | TEXT_MEDIABOX_CLIP, + ) + + CheckParent(page) + tp = textpage + if tp is None: + tp = page.get_textpage(clip=clip, flags=flags) # create TextPage + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + rlist = tp.search(text, quads=quads) + if textpage is None: + del tp + return rlist + + +def search_page_for( + doc: Document, + pno: int, + text: str, + quads: bool = False, + clip: rect_like = None, + flags: int = TEXT_DEHYPHENATE + | TEXT_PRESERVE_LIGATURES + | TEXT_PRESERVE_WHITESPACE + | TEXT_MEDIABOX_CLIP, + textpage: TextPage = None, +) -> list: + """Search for a string on a page. + + Args: + pno: page number + text: string to be searched for + clip: restrict search to this rectangle + quads: (bool) return quads instead of rectangles + flags: bit switches, default: join hyphened words + textpage: reuse a prepared textpage + Returns: + a list of rectangles or quads, each containing an occurrence. + """ + + return doc[pno].search_for( + text, + quads=quads, + clip=clip, + flags=flags, + textpage=textpage, + ) + + +def get_text_blocks( + page: Page, + clip: rect_like = None, + flags: OptInt = None, + textpage: TextPage = None, + sort: bool = False, +) -> list: + """Return the text blocks on a page. + + Notes: + Lines in a block are concatenated with line breaks. + Args: + flags: (int) control the amount of data parsed into the textpage. + Returns: + A list of the blocks. Each item contains the containing rectangle + coordinates, text lines, block type and running block number. + """ + CheckParent(page) + if flags is None: + flags = ( + TEXT_PRESERVE_WHITESPACE + | TEXT_PRESERVE_IMAGES + | TEXT_PRESERVE_LIGATURES + | TEXT_MEDIABOX_CLIP + ) + tp = textpage + if tp is None: + tp = page.get_textpage(clip=clip, flags=flags) + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + + blocks = tp.extractBLOCKS() + if textpage is None: + del tp + if sort is True: + blocks.sort(key=lambda b: (b[3], b[0])) + return blocks + + +def get_text_words( + page: Page, + clip: rect_like = None, + flags: OptInt = None, + textpage: TextPage = None, + sort: bool = False, + delimiters=None, +) -> list: + """Return the text words as a list with the bbox for each word. + + Args: + flags: (int) control the amount of data parsed into the textpage. + delimiters: (str,list) characters to use as word delimiters + + Returns: + Word tuples (x0, y0, x1, y1, "word", bno, lno, wno). 
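+
+    Example (illustrative):
+        words = page.get_text_words(sort=True)
+        print(" ".join(w[4] for w in words))  # w[4] is the word string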
+ """ + CheckParent(page) + if flags is None: + flags = TEXT_PRESERVE_WHITESPACE | TEXT_PRESERVE_LIGATURES | TEXT_MEDIABOX_CLIP + + tp = textpage + if tp is None: + tp = page.get_textpage(clip=clip, flags=flags) + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + + words = tp.extractWORDS(delimiters) + if textpage is None: + del tp + if sort is True: + words.sort(key=lambda w: (w[3], w[0])) + + return words + + +def get_textbox( + page: Page, + rect: rect_like, + textpage: TextPage = None, +) -> str: + tp = textpage + if tp is None: + tp = page.get_textpage() + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + rc = tp.extractTextbox(rect) + if textpage is None: + del tp + return rc + + +def get_text_selection( + page: Page, + p1: point_like, + p2: point_like, + clip: rect_like = None, + textpage: TextPage = None, +): + CheckParent(page) + tp = textpage + if tp is None: + tp = page.get_textpage(clip=clip, flags=TEXT_DEHYPHENATE) + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + rc = tp.extractSelection(p1, p2) + if textpage is None: + del tp + return rc + + +def get_textpage_ocr( + page: Page, + flags: int = 0, + language: str = "eng", + dpi: int = 72, + full: bool = False, + tessdata: str = None, +) -> TextPage: + """Create a Textpage from combined results of normal and OCR text parsing. + + Args: + flags: (int) control content becoming part of the result. + language: (str) specify expected language(s). Deafault is "eng" (English). + dpi: (int) resolution in dpi, default 72. + full: (bool) whether to OCR the full page image, or only its images (default) + """ + CheckParent(page) + if not os.getenv("TESSDATA_PREFIX") and not tessdata: + raise RuntimeError("No OCR support: TESSDATA_PREFIX not set") + + def full_ocr(page, dpi, language, flags): + zoom = dpi / 72 + mat = Matrix(zoom, zoom) + pix = page.get_pixmap(matrix=mat) + ocr_pdf = Document( + "pdf", + pix.pdfocr_tobytes(compress=False, language=language, tessdata=tessdata), + ) + ocr_page = ocr_pdf.load_page(0) + unzoom = page.rect.width / ocr_page.rect.width + ctm = Matrix(unzoom, unzoom) * page.derotation_matrix + tpage = ocr_page.get_textpage(flags=flags, matrix=ctm) + ocr_pdf.close() + pix = None + tpage.parent = weakref.proxy(page) + return tpage + + # if OCR for the full page, OCR its pixmap @ desired dpi + if full is True: + return full_ocr(page, dpi, language, flags) + + # For partial OCR, make a normal textpage, then extend it with text that + # is OCRed from each image. + # Because of this, we need the images flag bit set ON. + tpage = page.get_textpage(flags=flags) + for block in page.get_text("dict", flags=TEXT_PRESERVE_IMAGES)["blocks"]: + if block["type"] != 1: # only look at images + continue + bbox = Rect(block["bbox"]) + if bbox.width <= 3 or bbox.height <= 3: # ignore tiny stuff + continue + try: + pix = Pixmap(block["image"]) # get image pixmap + if pix.n - pix.alpha != 3: # we need to convert this to RGB! 
+ pix = Pixmap(csRGB, pix) + if pix.alpha: # must remove alpha channel + pix = Pixmap(pix, 0) + imgdoc = Document( + "pdf", pix.pdfocr_tobytes(language=language, tessdata=tessdata) + ) # pdf with OCRed page + imgpage = imgdoc.load_page(0) # read image as a page + pix = None + # compute matrix to transform coordinates back to that of 'page' + imgrect = imgpage.rect # page size of image PDF + shrink = Matrix(1 / imgrect.width, 1 / imgrect.height) + mat = shrink * block["transform"] + imgpage.extend_textpage(tpage, flags=0, matrix=mat) + imgdoc.close() + except RuntimeError: + tpage = None + print("Falling back to full page OCR") + return full_ocr(page, dpi, language, flags) + + return tpage + + +def get_image_info(page: Page, hashes: bool = False, xrefs: bool = False) -> list: + """Extract image information only from a TextPage. + + Args: + hashes: (bool) include MD5 hash for each image. + xrefs: (bool) try to find the xref for each image. Sets hashes to true. + """ + doc = page.parent + if xrefs and doc.is_pdf: + hashes = True + if not doc.is_pdf: + xrefs = False + imginfo = getattr(page, "_image_info", None) + if imginfo and not xrefs: + return imginfo + if not imginfo: + tp = page.get_textpage(flags=TEXT_PRESERVE_IMAGES) + imginfo = tp.extractIMGINFO(hashes=hashes) + del tp + if hashes: + page._image_info = imginfo + if not xrefs or not doc.is_pdf: + return imginfo + imglist = page.get_images() + digests = {} + for item in imglist: + xref = item[0] + pix = Pixmap(doc, xref) + digests[pix.digest] = xref + del pix + for i in range(len(imginfo)): + item = imginfo[i] + xref = digests.get(item["digest"], 0) + item["xref"] = xref + imginfo[i] = item + return imginfo + + +def get_image_rects(page: Page, name, transform=False) -> list: + """Return list of image positions on a page. + + Args: + name: (str, list, int) image identification. May be reference name, an + item of the page's image list or an xref. + transform: (bool) whether to also return the transformation matrix. + Returns: + A list of Rect objects or tuples of (Rect, Matrix) for all image + locations on the page. + """ + if type(name) in (list, tuple): + xref = name[0] + elif type(name) is int: + xref = name + else: + imglist = [i for i in page.get_images() if i[7] == name] + if imglist == []: + raise ValueError("bad image name") + elif len(imglist) != 1: + raise ValueError("multiple image names found") + xref = imglist[0][0] + pix = Pixmap(page.parent, xref) # make pixmap of the image to compute MD5 + digest = pix.digest + del pix + infos = page.get_image_info(hashes=True) + if not transform: + bboxes = [Rect(im["bbox"]) for im in infos if im["digest"] == digest] + else: + bboxes = [ + (Rect(im["bbox"]), Matrix(im["transform"])) + for im in infos + if im["digest"] == digest + ] + return bboxes + + +def get_text( + page: Page, + option: str = "text", + clip: rect_like = None, + flags: OptInt = None, + textpage: TextPage = None, + sort: bool = False, + delimiters=None, +): + """Extract text from a page or an annotation. + + This is a unifying wrapper for various methods of the TextPage class. + + Args: + option: (str) text, words, blocks, html, dict, json, rawdict, xhtml or xml. + clip: (rect-like) restrict output to this area. + flags: bit switches to e.g. exclude images or decompose ligatures. + textpage: reuse this TextPage and make no new one. If specified, + 'flags' and 'clip' are ignored. 
+ + Returns: + the output of methods get_text_words / get_text_blocks or TextPage + methods extractText, extractHTML, extractDICT, extractJSON, extractRAWDICT, + extractXHTML or etractXML respectively. + Default and misspelling choice is "text". + """ + formats = { + "text": fitz.TEXTFLAGS_TEXT, + "html": fitz.TEXTFLAGS_HTML, + "json": fitz.TEXTFLAGS_DICT, + "rawjson": fitz.TEXTFLAGS_RAWDICT, + "xml": fitz.TEXTFLAGS_XML, + "xhtml": fitz.TEXTFLAGS_XHTML, + "dict": fitz.TEXTFLAGS_DICT, + "rawdict": fitz.TEXTFLAGS_RAWDICT, + "words": fitz.TEXTFLAGS_WORDS, + "blocks": fitz.TEXTFLAGS_BLOCKS, + } + option = option.lower() + if option not in formats: + option = "text" + if flags is None: + flags = formats[option] + + if option == "words": + return get_text_words( + page, + clip=clip, + flags=flags, + textpage=textpage, + sort=sort, + delimiters=delimiters, + ) + if option == "blocks": + return get_text_blocks( + page, clip=clip, flags=flags, textpage=textpage, sort=sort + ) + CheckParent(page) + cb = None + if option in ("html", "xml", "xhtml"): # no clipping for MuPDF functions + clip = page.cropbox + if clip != None: + clip = Rect(clip) + cb = None + elif type(page) is Page: + cb = page.cropbox + + # TextPage with or without images + tp = textpage + if tp is None: + tp = page.get_textpage(clip=clip, flags=flags) + elif getattr(tp, "parent") != page: + raise ValueError("not a textpage of this page") + + if option == "json": + t = tp.extractJSON(cb=cb, sort=sort) + elif option == "rawjson": + t = tp.extractRAWJSON(cb=cb, sort=sort) + elif option == "dict": + t = tp.extractDICT(cb=cb, sort=sort) + elif option == "rawdict": + t = tp.extractRAWDICT(cb=cb, sort=sort) + elif option == "html": + t = tp.extractHTML() + elif option == "xml": + t = tp.extractXML() + elif option == "xhtml": + t = tp.extractXHTML() + else: + t = tp.extractText(sort=sort) + + if textpage is None: + del tp + return t + + +def get_page_text( + doc: Document, + pno: int, + option: str = "text", + clip: rect_like = None, + flags: OptInt = None, + textpage: TextPage = None, + sort: bool = False, +) -> typing.Any: + """Extract a document page's text by page number. + + Notes: + Convenience function calling page.get_text(). + Args: + pno: page number + option: (str) text, words, blocks, html, dict, json, rawdict, xhtml or xml. + Returns: + output from page.TextPage(). + """ + return doc[pno].get_text(option, clip=clip, flags=flags, sort=sort) + + +def get_pixmap( + page: Page, + *, + matrix: matrix_like = Identity, + dpi=None, + colorspace: Colorspace = csRGB, + clip: rect_like = None, + alpha: bool = False, + annots: bool = True, +) -> Pixmap: + """Create pixmap of page. + + Keyword args: + matrix: Matrix for transformation (default: Identity). + dpi: desired dots per inch. If given, matrix is ignored. + colorspace: (str/Colorspace) cmyk, rgb, gray - case ignored, default csRGB. + clip: (irect-like) restrict rendering to this area. 
+ alpha: (bool) whether to include alpha channel + annots: (bool) whether to also render annotations + """ + CheckParent(page) + if dpi: + zoom = dpi / 72 + matrix = Matrix(zoom, zoom) + + if type(colorspace) is str: + if colorspace.upper() == "GRAY": + colorspace = csGRAY + elif colorspace.upper() == "CMYK": + colorspace = csCMYK + else: + colorspace = csRGB + if colorspace.n not in (1, 3, 4): + raise ValueError("unsupported colorspace") + + dl = page.get_displaylist(annots=annots) + pix = dl.get_pixmap(matrix=matrix, colorspace=colorspace, alpha=alpha, clip=clip) + dl = None + if dpi: + pix.set_dpi(dpi, dpi) + return pix + + +def get_page_pixmap( + doc: Document, + pno: int, + *, + matrix: matrix_like = Identity, + dpi=None, + colorspace: Colorspace = csRGB, + clip: rect_like = None, + alpha: bool = False, + annots: bool = True, +) -> Pixmap: + """Create pixmap of document page by page number. + + Notes: + Convenience function calling page.get_pixmap. + Args: + pno: (int) page number + matrix: Matrix for transformation (default: Identity). + colorspace: (str,Colorspace) rgb, rgb, gray - case ignored, default csRGB. + clip: (irect-like) restrict rendering to this area. + alpha: (bool) include alpha channel + annots: (bool) also render annotations + """ + return doc[pno].get_pixmap( + matrix=matrix, + dpi=dpi, + colorspace=colorspace, + clip=clip, + alpha=alpha, + annots=annots, + ) + + +def getLinkDict(ln) -> dict: + nl = {"kind": ln.dest.kind, "xref": 0} + try: + nl["from"] = ln.rect + except: + pass + pnt = Point(0, 0) + if ln.dest.flags & LINK_FLAG_L_VALID: + pnt.x = ln.dest.lt.x + if ln.dest.flags & LINK_FLAG_T_VALID: + pnt.y = ln.dest.lt.y + + if ln.dest.kind == LINK_URI: + nl["uri"] = ln.dest.uri + + elif ln.dest.kind == LINK_GOTO: + nl["page"] = ln.dest.page + nl["to"] = pnt + if ln.dest.flags & LINK_FLAG_R_IS_ZOOM: + nl["zoom"] = ln.dest.rb.x + else: + nl["zoom"] = 0.0 + + elif ln.dest.kind == LINK_GOTOR: + nl["file"] = ln.dest.fileSpec.replace("\\", "/") + nl["page"] = ln.dest.page + if ln.dest.page < 0: + nl["to"] = ln.dest.dest + else: + nl["to"] = pnt + if ln.dest.flags & LINK_FLAG_R_IS_ZOOM: + nl["zoom"] = ln.dest.rb.x + else: + nl["zoom"] = 0.0 + + elif ln.dest.kind == LINK_LAUNCH: + nl["file"] = ln.dest.fileSpec.replace("\\", "/") + + elif ln.dest.kind == LINK_NAMED: + nl["name"] = ln.dest.named + + else: + nl["page"] = ln.dest.page + + return nl + + +def get_links(page: Page) -> list: + """Create a list of all links contained in a PDF page. + + Notes: + see PyMuPDF ducmentation for details. + """ + + CheckParent(page) + ln = page.first_link + links = [] + while ln: + nl = getLinkDict(ln) + links.append(nl) + ln = ln.next + if links != [] and page.parent.is_pdf: + linkxrefs = [x for x in page.annot_xrefs() if x[1] == PDF_ANNOT_LINK] + if len(linkxrefs) == len(links): + for i in range(len(linkxrefs)): + links[i]["xref"] = linkxrefs[i][0] + links[i]["id"] = linkxrefs[i][2] + return links + + +def get_toc( + doc: Document, + simple: bool = True, +) -> list: + """Create a table of contents. + + Args: + simple: a bool to control output. Returns a list, where each entry consists of outline level, title, page number and link destination (if simple = False). For details see PyMuPDF's documentation. 
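+
+    Example (illustrative):
+        for lvl, title, pno in doc.get_toc(simple=True):
+            print("  " * (lvl - 1) + title, "-> page", pno)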
+    """
+
+    def recurse(olItem, liste, lvl):
+        """Recursively follow the outline item chain and record item information in a list."""
+        while olItem:
+            if olItem.title:
+                title = olItem.title
+            else:
+                title = " "
+
+            if not olItem.is_external:
+                if olItem.uri:
+                    if olItem.page == -1:
+                        resolve = doc.resolve_link(olItem.uri)
+                        page = resolve[0] + 1
+                    else:
+                        page = olItem.page + 1
+                else:
+                    page = -1
+            else:
+                page = -1
+
+            if not simple:
+                link = getLinkDict(olItem)
+                liste.append([lvl, title, page, link])
+            else:
+                liste.append([lvl, title, page])
+
+            if olItem.down:
+                liste = recurse(olItem.down, liste, lvl + 1)
+            olItem = olItem.next
+        return liste
+
+    # ensure document is open
+    if doc.is_closed:
+        raise ValueError("document closed")
+    doc.init_doc()
+    olItem = doc.outline
+    if not olItem:
+        return []
+    lvl = 1
+    liste = []
+    toc = recurse(olItem, liste, lvl)
+    if doc.is_pdf and simple is False:
+        doc._extend_toc_items(toc)
+    return toc
+
+
+def del_toc_item(
+    doc: Document,
+    idx: int,
+) -> None:
+    """Delete TOC / bookmark item by index."""
+    xref = doc.get_outline_xrefs()[idx]
+    doc._remove_toc_item(xref)
+
+
+def set_toc_item(
+    doc: Document,
+    idx: int,
+    dest_dict: OptDict = None,
+    kind: OptInt = None,
+    pno: OptInt = None,
+    uri: OptStr = None,
+    title: OptStr = None,
+    to: point_like = None,
+    filename: OptStr = None,
+    zoom: float = 0,
+) -> None:
+    """Update TOC item by index.
+
+    It allows changing the item's title and link destination.
+
+    Args:
+        idx: (int) desired index of the TOC list, as created by get_toc.
+        dest_dict: (dict) destination dictionary as created by get_toc(False).
+            Overrides all other parameters. If None, the remaining parameters
+            are used to make a dest dictionary.
+        kind: (int) kind of link (LINK_GOTO, etc.). If None, then only the
+            title will be updated. If LINK_NONE, the TOC item will be deleted.
+        pno: (int) page number (1-based like in get_toc). Required if LINK_GOTO.
+        uri: (str) the URL, required if LINK_URI.
+        title: (str) the new title. No change if None.
+        to: (point-like) destination on the target page. If omitted, (72, 36)
+            will be used as target coordinates.
+        filename: (str) destination filename, required for LINK_GOTOR and
+            LINK_LAUNCH.
+        name: (str) a destination name for LINK_NAMED.
+        zoom: (float) a zoom factor for the target location (LINK_GOTO).
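+
+    Example (illustrative; assumes the document has at least one TOC entry):
+        # retitle the first TOC entry and make it jump to page 1
+        doc.set_toc_item(0, kind=LINK_GOTO, pno=1, title="Cover")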
+ """ + xref = doc.get_outline_xrefs()[idx] + page_xref = 0 + if type(dest_dict) is dict: + if dest_dict["kind"] == LINK_GOTO: + pno = dest_dict["page"] + page_xref = doc.page_xref(pno) + page_height = doc.page_cropbox(pno).height + to = dest_dict.get("to", Point(72, 36)) + to.y = page_height - to.y + dest_dict["to"] = to + action = getDestStr(page_xref, dest_dict) + if not action.startswith("/A"): + raise ValueError("bad bookmark dest") + color = dest_dict.get("color") + if color: + color = list(map(float, color)) + if len(color) != 3 or min(color) < 0 or max(color) > 1: + raise ValueError("bad color value") + bold = dest_dict.get("bold", False) + italic = dest_dict.get("italic", False) + flags = italic + 2 * bold + collapse = dest_dict.get("collapse") + return doc._update_toc_item( + xref, + action=action[2:], + title=title, + color=color, + flags=flags, + collapse=collapse, + ) + + if kind == LINK_NONE: # delete bookmark item + return doc.del_toc_item(idx) + if kind is None and title is None: # treat as no-op + return None + if kind is None: # only update title text + return doc._update_toc_item(xref, action=None, title=title) + + if kind == LINK_GOTO: + if pno is None or pno not in range(1, doc.page_count + 1): + raise ValueError("bad page number") + page_xref = doc.page_xref(pno - 1) + page_height = doc.page_cropbox(pno - 1).height + if to is None: + to = Point(72, page_height - 36) + else: + to = Point(to) + to.y = page_height - to.y + + ddict = { + "kind": kind, + "to": to, + "uri": uri, + "page": pno, + "file": filename, + "zoom": zoom, + } + action = getDestStr(page_xref, ddict) + if action == "" or not action.startswith("/A"): + raise ValueError("bad bookmark dest") + + return doc._update_toc_item(xref, action=action[2:], title=title) + + +def get_area(*args) -> float: + """Calculate area of rectangle.\nparameter is one of 'px' (default), 'in', 'cm', or 'mm'.""" + rect = args[0] + if len(args) > 1: + unit = args[1] + else: + unit = "px" + u = {"px": (1, 1), "in": (1.0, 72.0), "cm": (2.54, 72.0), "mm": (25.4, 72.0)} + f = (u[unit][0] / u[unit][1]) ** 2 + return f * rect.width * rect.height + + +def set_metadata(doc: Document, m: dict) -> None: + """Update the PDF /Info object. + + Args: + m: a dictionary like doc.metadata. 
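+
+    Example (illustrative):
+        doc.set_metadata({"title": "Annual Report", "author": "Jane Doe"})
+        doc.set_metadata({})  # an empty dict removes existing metadata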
+ """ + if not doc.is_pdf: + raise ValueError("is no PDF") + if doc.is_closed or doc.is_encrypted: + raise ValueError("document closed or encrypted") + if type(m) is not dict: + raise ValueError("bad metadata") + keymap = { + "author": "Author", + "producer": "Producer", + "creator": "Creator", + "title": "Title", + "format": None, + "encryption": None, + "creationDate": "CreationDate", + "modDate": "ModDate", + "subject": "Subject", + "keywords": "Keywords", + "trapped": "Trapped", + } + valid_keys = set(keymap.keys()) + diff_set = set(m.keys()).difference(valid_keys) + if diff_set != set(): + msg = "bad dict key(s): %s" % diff_set + raise ValueError(msg) + + t, temp = doc.xref_get_key(-1, "Info") + if t != "xref": + info_xref = 0 + else: + info_xref = int(temp.replace("0 R", "")) + + if m == {} and info_xref == 0: # nothing to do + return + + if info_xref == 0: # no prev metadata: get new xref + info_xref = doc.get_new_xref() + doc.update_object(info_xref, "<<>>") # fill it with empty object + doc.xref_set_key(-1, "Info", "%i 0 R" % info_xref) + elif m == {}: # remove existing metadata + doc.xref_set_key(-1, "Info", "null") + return + + for key, val in [(k, v) for k, v in m.items() if keymap[k] != None]: + pdf_key = keymap[key] + if not bool(val) or val in ("none", "null"): + val = "null" + else: + val = get_pdf_str(val) + doc.xref_set_key(info_xref, pdf_key, val) + doc.init_doc() + return + + +def getDestStr(xref: int, ddict: dict) -> str: + """Calculate the PDF action string. + + Notes: + Supports Link annotations and outline items (bookmarks). + """ + if not ddict: + return "" + str_goto = "/A<>" + str_gotor1 = "/A<>>>" + str_gotor2 = "/A<>>>" + str_launch = "/A<>>>" + str_uri = "/A<>" + + if type(ddict) in (int, float): + dest = str_goto % (xref, 0, ddict, 0) + return dest + d_kind = ddict.get("kind", LINK_NONE) + + if d_kind == LINK_NONE: + return "" + + if ddict["kind"] == LINK_GOTO: + d_zoom = ddict.get("zoom", 0) + to = ddict.get("to", Point(0, 0)) + d_left, d_top = to + dest = str_goto % (xref, d_left, d_top, d_zoom) + return dest + + if ddict["kind"] == LINK_URI: + dest = str_uri % (get_pdf_str(ddict["uri"]),) + return dest + + if ddict["kind"] == LINK_LAUNCH: + fspec = get_pdf_str(ddict["file"]) + dest = str_launch % (fspec, fspec) + return dest + + if ddict["kind"] == LINK_GOTOR and ddict["page"] < 0: + fspec = get_pdf_str(ddict["file"]) + dest = str_gotor2 % (get_pdf_str(ddict["to"]), fspec, fspec) + return dest + + if ddict["kind"] == LINK_GOTOR and ddict["page"] >= 0: + fspec = get_pdf_str(ddict["file"]) + dest = str_gotor1 % ( + ddict["page"], + ddict["to"].x, + ddict["to"].y, + ddict["zoom"], + fspec, + fspec, + ) + return dest + + return "" + + +def set_toc( + doc: Document, + toc: list, + collapse: int = 1, +) -> int: + """Create new outline tree (table of contents, TOC). + + Args: + toc: (list, tuple) each entry must contain level, title, page and + optionally top margin on the page. None or '()' remove the TOC. + collapse: (int) collapses entries beyond this level. Zero or None + shows all entries unfolded. + Returns: + the number of inserted items, or the number of removed items respectively. 
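+
+    Example (illustrative; assumes a document with at least three pages):
+        toc = [
+            [1, "Chapter 1", 1],
+            [2, "Section 1.1", 2, 400],  # optional 4th item: top position on page
+            [1, "Chapter 2", 3],
+        ]
+        doc.set_toc(toc)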
+ """ + if doc.is_closed or doc.is_encrypted: + raise ValueError("document closed or encrypted") + if not doc.is_pdf: + raise ValueError("is no PDF") + if not toc: # remove all entries + return len(doc._delToC()) + + # validity checks -------------------------------------------------------- + if type(toc) not in (list, tuple): + raise ValueError("'toc' must be list or tuple") + toclen = len(toc) + page_count = doc.page_count + t0 = toc[0] + if type(t0) not in (list, tuple): + raise ValueError("items must be sequences of 3 or 4 items") + if t0[0] != 1: + raise ValueError("hierarchy level of item 0 must be 1") + for i in list(range(toclen - 1)): + t1 = toc[i] + t2 = toc[i + 1] + if not -1 <= t1[2] <= page_count: + raise ValueError("row %i: page number out of range" % i) + if (type(t2) not in (list, tuple)) or len(t2) not in (3, 4): + raise ValueError("bad row %i" % (i + 1)) + if (type(t2[0]) is not int) or t2[0] < 1: + raise ValueError("bad hierarchy level in row %i" % (i + 1)) + if t2[0] > t1[0] + 1: + raise ValueError("bad hierarchy level in row %i" % (i + 1)) + # no formal errors in toc -------------------------------------------------- + + # -------------------------------------------------------------------------- + # make a list of xref numbers, which we can use for our TOC entries + # -------------------------------------------------------------------------- + old_xrefs = doc._delToC() # del old outlines, get their xref numbers + + # prepare table of xrefs for new bookmarks + old_xrefs = [] + xref = [0] + old_xrefs + xref[0] = doc._getOLRootNumber() # entry zero is outline root xref number + if toclen > len(old_xrefs): # too few old xrefs? + for i in range((toclen - len(old_xrefs))): + xref.append(doc.get_new_xref()) # acquire new ones + + lvltab = {0: 0} # to store last entry per hierarchy level + + # ------------------------------------------------------------------------------ + # contains new outline objects as strings - first one is the outline root + # ------------------------------------------------------------------------------ + olitems = [{"count": 0, "first": -1, "last": -1, "xref": xref[0]}] + # ------------------------------------------------------------------------------ + # build olitems as a list of PDF-like connnected dictionaries + # ------------------------------------------------------------------------------ + for i in range(toclen): + o = toc[i] + lvl = o[0] # level + title = get_pdf_str(o[1]) # title + pno = min(doc.page_count - 1, max(0, o[2] - 1)) # page number + page_xref = doc.page_xref(pno) + page_height = doc.page_cropbox(pno).height + top = Point(72, page_height - 36) + dest_dict = {"to": top, "kind": LINK_GOTO} # fall back target + if o[2] < 0: + dest_dict["kind"] = LINK_NONE + if len(o) > 3: # some target is specified + if type(o[3]) in (int, float): # convert a number to a point + dest_dict["to"] = Point(72, page_height - o[3]) + else: # if something else, make sure we have a dict + dest_dict = o[3] if type(o[3]) is dict else dest_dict + if "to" not in dest_dict: # target point not in dict? 
+ dest_dict["to"] = top # put default in + else: # transform target to PDF coordinates + point = +dest_dict["to"] + point.y = page_height - point.y + dest_dict["to"] = point + d = {} + d["first"] = -1 + d["count"] = 0 + d["last"] = -1 + d["prev"] = -1 + d["next"] = -1 + d["dest"] = getDestStr(page_xref, dest_dict) + d["top"] = dest_dict["to"] + d["title"] = title + d["parent"] = lvltab[lvl - 1] + d["xref"] = xref[i + 1] + d["color"] = dest_dict.get("color") + d["flags"] = dest_dict.get("italic", 0) + 2 * dest_dict.get("bold", 0) + lvltab[lvl] = i + 1 + parent = olitems[lvltab[lvl - 1]] # the parent entry + + if ( + dest_dict.get("collapse") or collapse and lvl > collapse + ): # suppress expansion + parent["count"] -= 1 # make /Count negative + else: + parent["count"] += 1 # positive /Count + + if parent["first"] == -1: + parent["first"] = i + 1 + parent["last"] = i + 1 + else: + d["prev"] = parent["last"] + prev = olitems[parent["last"]] + prev["next"] = i + 1 + parent["last"] = i + 1 + olitems.append(d) + + # ------------------------------------------------------------------------------ + # now create each outline item as a string and insert it in the PDF + # ------------------------------------------------------------------------------ + for i, ol in enumerate(olitems): + txt = "<<" + if ol["count"] != 0: + txt += "/Count %i" % ol["count"] + try: + txt += ol["dest"] + except: + pass + try: + if ol["first"] > -1: + txt += "/First %i 0 R" % xref[ol["first"]] + except: + pass + try: + if ol["last"] > -1: + txt += "/Last %i 0 R" % xref[ol["last"]] + except: + pass + try: + if ol["next"] > -1: + txt += "/Next %i 0 R" % xref[ol["next"]] + except: + pass + try: + if ol["parent"] > -1: + txt += "/Parent %i 0 R" % xref[ol["parent"]] + except: + pass + try: + if ol["prev"] > -1: + txt += "/Prev %i 0 R" % xref[ol["prev"]] + except: + pass + try: + txt += "/Title" + ol["title"] + except: + pass + + if ol.get("color") and len(ol["color"]) == 3: + txt += "/C[ %g %g %g]" % tuple(ol["color"]) + if ol.get("flags", 0) > 0: + txt += "/F %i" % ol["flags"] + + if i == 0: # special: this is the outline root + txt += "/Type/Outlines" # so add the /Type entry + txt += ">>" + doc.update_object(xref[i], txt) # insert the PDF object + + doc.init_doc() + return toclen + + +def do_links( + doc1: Document, + doc2: Document, + from_page: int = -1, + to_page: int = -1, + start_at: int = -1, +) -> None: + """Insert links contained in copied page range into destination PDF. + + Parameter values **must** equal those of method insert_pdf(), which must + have been previously executed. 
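+
+    Example (illustrative; assumes 'doc2' has at least five pages):
+        doc1.insert_pdf(doc2, from_page=0, to_page=4, start_at=0)
+        doc1.do_links(doc2, from_page=0, to_page=4, start_at=0)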
+ """ + + # -------------------------------------------------------------------------- + # internal function to create the actual "/Annots" object string + # -------------------------------------------------------------------------- + def cre_annot(lnk, xref_dst, pno_src, ctm): + """Create annotation object string for a passed-in link.""" + + r = lnk["from"] * ctm # rect in PDF coordinates + rect = "%g %g %g %g" % tuple(r) + if lnk["kind"] == LINK_GOTO: + txt = annot_skel["goto1"] # annot_goto + idx = pno_src.index(lnk["page"]) + p = lnk["to"] * ctm # target point in PDF coordinates + annot = txt % (xref_dst[idx], p.x, p.y, lnk["zoom"], rect) + + elif lnk["kind"] == LINK_GOTOR: + if lnk["page"] >= 0: + txt = annot_skel["gotor1"] # annot_gotor + pnt = lnk.get("to", Point(0, 0)) # destination point + if type(pnt) is not Point: + pnt = Point(0, 0) + annot = txt % ( + lnk["page"], + pnt.x, + pnt.y, + lnk["zoom"], + lnk["file"], + lnk["file"], + rect, + ) + else: + txt = annot_skel["gotor2"] # annot_gotor_n + to = get_pdf_str(lnk["to"]) + to = to[1:-1] + f = lnk["file"] + annot = txt % (to, f, rect) + + elif lnk["kind"] == LINK_LAUNCH: + txt = annot_skel["launch"] # annot_launch + annot = txt % (lnk["file"], lnk["file"], rect) + + elif lnk["kind"] == LINK_URI: + txt = annot_skel["uri"] # annot_uri + annot = txt % (lnk["uri"], rect) + + else: + annot = "" + + return annot + + # -------------------------------------------------------------------------- + + # validate & normalize parameters + if from_page < 0: + fp = 0 + elif from_page >= doc2.page_count: + fp = doc2.page_count - 1 + else: + fp = from_page + + if to_page < 0 or to_page >= doc2.page_count: + tp = doc2.page_count - 1 + else: + tp = to_page + + if start_at < 0: + raise ValueError("'start_at' must be >= 0") + sa = start_at + + incr = 1 if fp <= tp else -1 # page range could be reversed + + # lists of source / destination page numbers + pno_src = list(range(fp, tp + incr, incr)) + pno_dst = [sa + i for i in range(len(pno_src))] + + # lists of source / destination page xrefs + xref_src = [] + xref_dst = [] + for i in range(len(pno_src)): + p_src = pno_src[i] + p_dst = pno_dst[i] + old_xref = doc2.page_xref(p_src) + new_xref = doc1.page_xref(p_dst) + xref_src.append(old_xref) + xref_dst.append(new_xref) + + # create the links for each copied page in destination PDF + for i in range(len(xref_src)): + page_src = doc2[pno_src[i]] # load source page + links = page_src.get_links() # get all its links + if len(links) == 0: # no links there + page_src = None + continue + ctm = ~page_src.transformation_matrix # calc page transformation matrix + page_dst = doc1[pno_dst[i]] # load destination page + link_tab = [] # store all link definitions here + for l in links: + if l["kind"] == LINK_GOTO and (l["page"] not in pno_src): + continue # GOTO link target not in copied pages + annot_text = cre_annot(l, xref_dst, pno_src, ctm) + if annot_text: + link_tab.append(annot_text) + if link_tab != []: + page_dst._addAnnot_FromString(tuple(link_tab)) + + return + + +def getLinkText(page: Page, lnk: dict) -> str: + # -------------------------------------------------------------------------- + # define skeletons for /Annots object texts + # -------------------------------------------------------------------------- + ctm = page.transformation_matrix + ictm = ~ctm + r = lnk["from"] + rect = "%g %g %g %g" % tuple(r * ictm) + + annot = "" + if lnk["kind"] == LINK_GOTO: + if lnk["page"] >= 0: + txt = annot_skel["goto1"] # annot_goto + pno = lnk["page"] + xref = 
page.parent.page_xref(pno) + pnt = lnk.get("to", Point(0, 0)) # destination point + ipnt = pnt * ictm + annot = txt % (xref, ipnt.x, ipnt.y, lnk.get("zoom", 0), rect) + else: + txt = annot_skel["goto2"] # annot_goto_n + annot = txt % (get_pdf_str(lnk["to"]), rect) + + elif lnk["kind"] == LINK_GOTOR: + if lnk["page"] >= 0: + txt = annot_skel["gotor1"] # annot_gotor + pnt = lnk.get("to", Point(0, 0)) # destination point + if type(pnt) is not Point: + pnt = Point(0, 0) + annot = txt % ( + lnk["page"], + pnt.x, + pnt.y, + lnk.get("zoom", 0), + lnk["file"], + lnk["file"], + rect, + ) + else: + txt = annot_skel["gotor2"] # annot_gotor_n + annot = txt % (get_pdf_str(lnk["to"]), lnk["file"], rect) + + elif lnk["kind"] == LINK_LAUNCH: + txt = annot_skel["launch"] # annot_launch + annot = txt % (lnk["file"], lnk["file"], rect) + + elif lnk["kind"] == LINK_URI: + txt = annot_skel["uri"] # txt = annot_uri + annot = txt % (lnk["uri"], rect) + + elif lnk["kind"] == LINK_NAMED: + txt = annot_skel["named"] # annot_named + annot = txt % (lnk["name"], rect) + if not annot: + return annot + + # add a /NM PDF key to the object definition + link_names = dict( # existing ids and their xref + [(x[0], x[2]) for x in page.annot_xrefs() if x[1] == PDF_ANNOT_LINK] + ) + + old_name = lnk.get("id", "") # id value in the argument + + if old_name and (lnk["xref"], old_name) in link_names.items(): + name = old_name # no new name if this is an update only + else: + i = 0 + stem = TOOLS.set_annot_stem() + "-L%i" + while True: + name = stem % i + if name not in link_names.values(): + break + i += 1 + # add /NM key to object definition + annot = annot.replace("/Link", "/Link/NM(%s)" % name) + return annot + + +def delete_widget(page: Page, widget: Widget) -> Widget: + """Delete widget from page and return the next one.""" + CheckParent(page) + annot = getattr(widget, "_annot", None) + if annot is None: + raise ValueError("bad type: widget") + nextwidget = widget.next + page.delete_annot(annot) + widget._annot.__del__() + widget._annot.parent = None + keylist = list(widget.__dict__.keys()) + for key in keylist: + del widget.__dict__[key] + return nextwidget + + +def update_link(page: Page, lnk: dict) -> None: + """Update a link on the current page.""" + CheckParent(page) + annot = getLinkText(page, lnk) + if annot == "": + raise ValueError("link kind not supported") + + page.parent.update_object(lnk["xref"], annot, page=page) + return + + +def insert_link(page: Page, lnk: dict, mark: bool = True) -> None: + """Insert a new link for the current page.""" + CheckParent(page) + annot = getLinkText(page, lnk) + if annot == "": + raise ValueError("link kind not supported") + page._addAnnot_FromString((annot,)) + return + + +def insert_textbox( + page: Page, + rect: rect_like, + buffer: typing.Union[str, list], + fontname: str = "helv", + fontfile: OptStr = None, + set_simple: int = 0, + encoding: int = 0, + fontsize: float = 11, + lineheight: OptFloat = None, + color: OptSeq = None, + fill: OptSeq = None, + expandtabs: int = 1, + align: int = 0, + rotate: int = 0, + render_mode: int = 0, + border_width: float = 0.05, + morph: OptSeq = None, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> float: + """Insert text into a given rectangle. + + Notes: + Creates a Shape object, uses its same-named method and commits it. + Parameters: + rect: (rect-like) area to use for text. 
+ buffer: text to be inserted + fontname: a Base-14 font, font name or '/name' + fontfile: name of a font file + fontsize: font size + lineheight: overwrite the font property + color: RGB color triple + expandtabs: handles tabulators with string function + align: left, center, right, justified + rotate: 0, 90, 180, or 270 degrees + morph: morph box with a matrix and a fixpoint + overlay: put text in foreground or background + Returns: + unused or deficit rectangle area (float) + """ + img = page.new_shape() + rc = img.insert_textbox( + rect, + buffer, + fontsize=fontsize, + lineheight=lineheight, + fontname=fontname, + fontfile=fontfile, + set_simple=set_simple, + encoding=encoding, + color=color, + fill=fill, + expandtabs=expandtabs, + render_mode=render_mode, + border_width=border_width, + align=align, + rotate=rotate, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + if rc >= 0: + img.commit(overlay) + return rc + + +def insert_text( + page: Page, + point: point_like, + text: typing.Union[str, list], + fontsize: float = 11, + lineheight: OptFloat = None, + fontname: str = "helv", + fontfile: OptStr = None, + set_simple: int = 0, + encoding: int = 0, + color: OptSeq = None, + fill: OptSeq = None, + border_width: float = 0.05, + render_mode: int = 0, + rotate: int = 0, + morph: OptSeq = None, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +): + img = page.new_shape() + rc = img.insert_text( + point, + text, + fontsize=fontsize, + lineheight=lineheight, + fontname=fontname, + fontfile=fontfile, + set_simple=set_simple, + encoding=encoding, + color=color, + fill=fill, + border_width=border_width, + render_mode=render_mode, + rotate=rotate, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + if rc >= 0: + img.commit(overlay) + return rc + + +def new_page( + doc: Document, + pno: int = -1, + width: float = 595, + height: float = 842, +) -> Page: + """Create and return a new page object. + + Args: + pno: (int) insert before this page. Default: after last page. + width: (float) page width in points. Default: 595 (ISO A4 width). + height: (float) page height in points. Default 842 (ISO A4 height). + Returns: + A Page object. + """ + doc._newPage(pno, width=width, height=height) + return doc[pno] + + +def insert_page( + doc: Document, + pno: int, + text: typing.Union[str, list, None] = None, + fontsize: float = 11, + width: float = 595, + height: float = 842, + fontname: str = "helv", + fontfile: OptStr = None, + color: OptSeq = (0,), +) -> int: + """Create a new PDF page and insert some text. + + Notes: + Function combining Document.new_page() and Page.insert_text(). + For parameter details see these methods. 
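+ Example:
+ A minimal sketch of combined use (the file name is illustrative only):
+ doc = Document()  # new, empty PDF
+ doc.insert_page(-1, text="Hello, world!", fontsize=12)
+ doc.save("hello.pdf")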
+ """ + page = doc.new_page(pno=pno, width=width, height=height) + if not bool(text): + return 0 + rc = page.insert_text( + (50, 72), + text, + fontsize=fontsize, + fontname=fontname, + fontfile=fontfile, + color=color, + ) + return rc + + +def draw_line( + page: Page, + p1: point_like, + p2: point_like, + color: OptSeq = (0,), + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + morph: OptSeq = None, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc=0, +) -> Point: + """Draw a line from point p1 to point p2.""" + img = page.new_shape() + p = img.draw_line(Point(p1), Point(p2)) + img.finish( + color=color, + dashes=dashes, + width=width, + closePath=False, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return p + + +def draw_squiggle( + page: Page, + p1: point_like, + p2: point_like, + breadth: float = 2, + color: OptSeq = (0,), + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + morph: OptSeq = None, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> Point: + """Draw a squiggly line from point p1 to point p2.""" + img = page.new_shape() + p = img.draw_squiggle(Point(p1), Point(p2), breadth=breadth) + img.finish( + color=color, + dashes=dashes, + width=width, + closePath=False, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return p + + +def draw_zigzag( + page: Page, + p1: point_like, + p2: point_like, + breadth: float = 2, + color: OptSeq = (0,), + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + morph: OptSeq = None, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> Point: + """Draw a zigzag line from point p1 to point p2.""" + img = page.new_shape() + p = img.draw_zigzag(Point(p1), Point(p2), breadth=breadth) + img.finish( + color=color, + dashes=dashes, + width=width, + closePath=False, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return p + + +def draw_rect( + page: Page, + rect: rect_like, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + morph: OptSeq = None, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, + radius=None, +) -> Point: + """Draw a rectangle. 
See Shape class method for details.""" + img = page.new_shape() + Q = img.draw_rect(Rect(rect), radius=radius) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_quad( + page: Page, + quad: quad_like, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + morph: OptSeq = None, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> Point: + """Draw a quadrilateral.""" + img = page.new_shape() + Q = img.draw_quad(Quad(quad)) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_polyline( + page: Page, + points: list, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + morph: OptSeq = None, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + closePath: bool = False, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> Point: + """Draw multiple connected line segments.""" + img = page.new_shape() + Q = img.draw_polyline(points) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + closePath=closePath, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_circle( + page: Page, + center: point_like, + radius: float, + color: OptSeq = (0,), + fill: OptSeq = None, + morph: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> Point: + """Draw a circle given its center and radius.""" + img = page.new_shape() + Q = img.draw_circle(Point(center), radius) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + return Q + + +def draw_oval( + page: Page, + rect: typing.Union[rect_like, quad_like], + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + morph: OptSeq = None, + width: float = 1, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> Point: + """Draw an oval given its containing rectangle or quad.""" + img = page.new_shape() + Q = img.draw_oval(rect) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_curve( + page: Page, + p1: point_like, + p2: point_like, + p3: point_like, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + morph: OptSeq = None, + closePath: bool = False, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> Point: + """Draw a special Bezier curve from p1 to p3, generating control points 
on lines p1 to p2 and p2 to p3.""" + img = page.new_shape() + Q = img.draw_curve(Point(p1), Point(p2), Point(p3)) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + closePath=closePath, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_bezier( + page: Page, + p1: point_like, + p2: point_like, + p3: point_like, + p4: point_like, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + width: float = 1, + morph: OptStr = None, + closePath: bool = False, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> Point: + """Draw a general cubic Bezier curve from p1 to p4 using control points p2 and p3.""" + img = page.new_shape() + Q = img.draw_bezier(Point(p1), Point(p2), Point(p3), Point(p4)) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + closePath=closePath, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +def draw_sector( + page: Page, + center: point_like, + point: point_like, + beta: float, + color: OptSeq = (0,), + fill: OptSeq = None, + dashes: OptStr = None, + fullSector: bool = True, + morph: OptSeq = None, + width: float = 1, + closePath: bool = False, + lineCap: int = 0, + lineJoin: int = 0, + overlay: bool = True, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, +) -> Point: + """Draw a circle sector given circle center, one arc end point and the angle of the arc. + + Parameters: + center -- center of circle + point -- arc end point + beta -- angle of arc (degrees) + fullSector -- connect arc ends with center + """ + img = page.new_shape() + Q = img.draw_sector(Point(center), Point(point), beta, fullSector=fullSector) + img.finish( + color=color, + fill=fill, + dashes=dashes, + width=width, + lineCap=lineCap, + lineJoin=lineJoin, + morph=morph, + closePath=closePath, + stroke_opacity=stroke_opacity, + fill_opacity=fill_opacity, + oc=oc, + ) + img.commit(overlay) + + return Q + + +# ---------------------------------------------------------------------- +# Name: wx.lib.colourdb.py +# Purpose: Adds a bunch of colour names and RGB values to the +# colour database so they can be found by name +# +# Author: Robin Dunn +# +# Created: 13-March-2001 +# Copyright: (c) 2001-2017 by Total Control Software +# Licence: wxWindows license +# Tags: phoenix-port, unittest, documented +# ---------------------------------------------------------------------- + + +def getColorList() -> list: + """ + Returns a list of just the colour names used by this module. + :rtype: list of strings + """ + + return [x[0] for x in getColorInfoList()] + + +def getColorInfoList() -> list: + """ + Returns the list of colour name/value tuples used by this module. 
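+ Each entry has the form (NAME, red, green, blue) with 8-bit channel
+ values, for example ("BLACK", 0, 0, 0).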
+ :rtype: list of tuples + """ + + return [ + ("ALICEBLUE", 240, 248, 255), + ("ANTIQUEWHITE", 250, 235, 215), + ("ANTIQUEWHITE1", 255, 239, 219), + ("ANTIQUEWHITE2", 238, 223, 204), + ("ANTIQUEWHITE3", 205, 192, 176), + ("ANTIQUEWHITE4", 139, 131, 120), + ("AQUAMARINE", 127, 255, 212), + ("AQUAMARINE1", 127, 255, 212), + ("AQUAMARINE2", 118, 238, 198), + ("AQUAMARINE3", 102, 205, 170), + ("AQUAMARINE4", 69, 139, 116), + ("AZURE", 240, 255, 255), + ("AZURE1", 240, 255, 255), + ("AZURE2", 224, 238, 238), + ("AZURE3", 193, 205, 205), + ("AZURE4", 131, 139, 139), + ("BEIGE", 245, 245, 220), + ("BISQUE", 255, 228, 196), + ("BISQUE1", 255, 228, 196), + ("BISQUE2", 238, 213, 183), + ("BISQUE3", 205, 183, 158), + ("BISQUE4", 139, 125, 107), + ("BLACK", 0, 0, 0), + ("BLANCHEDALMOND", 255, 235, 205), + ("BLUE", 0, 0, 255), + ("BLUE1", 0, 0, 255), + ("BLUE2", 0, 0, 238), + ("BLUE3", 0, 0, 205), + ("BLUE4", 0, 0, 139), + ("BLUEVIOLET", 138, 43, 226), + ("BROWN", 165, 42, 42), + ("BROWN1", 255, 64, 64), + ("BROWN2", 238, 59, 59), + ("BROWN3", 205, 51, 51), + ("BROWN4", 139, 35, 35), + ("BURLYWOOD", 222, 184, 135), + ("BURLYWOOD1", 255, 211, 155), + ("BURLYWOOD2", 238, 197, 145), + ("BURLYWOOD3", 205, 170, 125), + ("BURLYWOOD4", 139, 115, 85), + ("CADETBLUE", 95, 158, 160), + ("CADETBLUE1", 152, 245, 255), + ("CADETBLUE2", 142, 229, 238), + ("CADETBLUE3", 122, 197, 205), + ("CADETBLUE4", 83, 134, 139), + ("CHARTREUSE", 127, 255, 0), + ("CHARTREUSE1", 127, 255, 0), + ("CHARTREUSE2", 118, 238, 0), + ("CHARTREUSE3", 102, 205, 0), + ("CHARTREUSE4", 69, 139, 0), + ("CHOCOLATE", 210, 105, 30), + ("CHOCOLATE1", 255, 127, 36), + ("CHOCOLATE2", 238, 118, 33), + ("CHOCOLATE3", 205, 102, 29), + ("CHOCOLATE4", 139, 69, 19), + ("COFFEE", 156, 79, 0), + ("CORAL", 255, 127, 80), + ("CORAL1", 255, 114, 86), + ("CORAL2", 238, 106, 80), + ("CORAL3", 205, 91, 69), + ("CORAL4", 139, 62, 47), + ("CORNFLOWERBLUE", 100, 149, 237), + ("CORNSILK", 255, 248, 220), + ("CORNSILK1", 255, 248, 220), + ("CORNSILK2", 238, 232, 205), + ("CORNSILK3", 205, 200, 177), + ("CORNSILK4", 139, 136, 120), + ("CYAN", 0, 255, 255), + ("CYAN1", 0, 255, 255), + ("CYAN2", 0, 238, 238), + ("CYAN3", 0, 205, 205), + ("CYAN4", 0, 139, 139), + ("DARKBLUE", 0, 0, 139), + ("DARKCYAN", 0, 139, 139), + ("DARKGOLDENROD", 184, 134, 11), + ("DARKGOLDENROD1", 255, 185, 15), + ("DARKGOLDENROD2", 238, 173, 14), + ("DARKGOLDENROD3", 205, 149, 12), + ("DARKGOLDENROD4", 139, 101, 8), + ("DARKGREEN", 0, 100, 0), + ("DARKGRAY", 169, 169, 169), + ("DARKKHAKI", 189, 183, 107), + ("DARKMAGENTA", 139, 0, 139), + ("DARKOLIVEGREEN", 85, 107, 47), + ("DARKOLIVEGREEN1", 202, 255, 112), + ("DARKOLIVEGREEN2", 188, 238, 104), + ("DARKOLIVEGREEN3", 162, 205, 90), + ("DARKOLIVEGREEN4", 110, 139, 61), + ("DARKORANGE", 255, 140, 0), + ("DARKORANGE1", 255, 127, 0), + ("DARKORANGE2", 238, 118, 0), + ("DARKORANGE3", 205, 102, 0), + ("DARKORANGE4", 139, 69, 0), + ("DARKORCHID", 153, 50, 204), + ("DARKORCHID1", 191, 62, 255), + ("DARKORCHID2", 178, 58, 238), + ("DARKORCHID3", 154, 50, 205), + ("DARKORCHID4", 104, 34, 139), + ("DARKRED", 139, 0, 0), + ("DARKSALMON", 233, 150, 122), + ("DARKSEAGREEN", 143, 188, 143), + ("DARKSEAGREEN1", 193, 255, 193), + ("DARKSEAGREEN2", 180, 238, 180), + ("DARKSEAGREEN3", 155, 205, 155), + ("DARKSEAGREEN4", 105, 139, 105), + ("DARKSLATEBLUE", 72, 61, 139), + ("DARKSLATEGRAY", 47, 79, 79), + ("DARKTURQUOISE", 0, 206, 209), + ("DARKVIOLET", 148, 0, 211), + ("DEEPPINK", 255, 20, 147), + ("DEEPPINK1", 255, 20, 147), + ("DEEPPINK2", 238, 18, 137), + 
("DEEPPINK3", 205, 16, 118), + ("DEEPPINK4", 139, 10, 80), + ("DEEPSKYBLUE", 0, 191, 255), + ("DEEPSKYBLUE1", 0, 191, 255), + ("DEEPSKYBLUE2", 0, 178, 238), + ("DEEPSKYBLUE3", 0, 154, 205), + ("DEEPSKYBLUE4", 0, 104, 139), + ("DIMGRAY", 105, 105, 105), + ("DODGERBLUE", 30, 144, 255), + ("DODGERBLUE1", 30, 144, 255), + ("DODGERBLUE2", 28, 134, 238), + ("DODGERBLUE3", 24, 116, 205), + ("DODGERBLUE4", 16, 78, 139), + ("FIREBRICK", 178, 34, 34), + ("FIREBRICK1", 255, 48, 48), + ("FIREBRICK2", 238, 44, 44), + ("FIREBRICK3", 205, 38, 38), + ("FIREBRICK4", 139, 26, 26), + ("FLORALWHITE", 255, 250, 240), + ("FORESTGREEN", 34, 139, 34), + ("GAINSBORO", 220, 220, 220), + ("GHOSTWHITE", 248, 248, 255), + ("GOLD", 255, 215, 0), + ("GOLD1", 255, 215, 0), + ("GOLD2", 238, 201, 0), + ("GOLD3", 205, 173, 0), + ("GOLD4", 139, 117, 0), + ("GOLDENROD", 218, 165, 32), + ("GOLDENROD1", 255, 193, 37), + ("GOLDENROD2", 238, 180, 34), + ("GOLDENROD3", 205, 155, 29), + ("GOLDENROD4", 139, 105, 20), + ("GREEN YELLOW", 173, 255, 47), + ("GREEN", 0, 255, 0), + ("GREEN1", 0, 255, 0), + ("GREEN2", 0, 238, 0), + ("GREEN3", 0, 205, 0), + ("GREEN4", 0, 139, 0), + ("GREENYELLOW", 173, 255, 47), + ("GRAY", 190, 190, 190), + ("GRAY0", 0, 0, 0), + ("GRAY1", 3, 3, 3), + ("GRAY10", 26, 26, 26), + ("GRAY100", 255, 255, 255), + ("GRAY11", 28, 28, 28), + ("GRAY12", 31, 31, 31), + ("GRAY13", 33, 33, 33), + ("GRAY14", 36, 36, 36), + ("GRAY15", 38, 38, 38), + ("GRAY16", 41, 41, 41), + ("GRAY17", 43, 43, 43), + ("GRAY18", 46, 46, 46), + ("GRAY19", 48, 48, 48), + ("GRAY2", 5, 5, 5), + ("GRAY20", 51, 51, 51), + ("GRAY21", 54, 54, 54), + ("GRAY22", 56, 56, 56), + ("GRAY23", 59, 59, 59), + ("GRAY24", 61, 61, 61), + ("GRAY25", 64, 64, 64), + ("GRAY26", 66, 66, 66), + ("GRAY27", 69, 69, 69), + ("GRAY28", 71, 71, 71), + ("GRAY29", 74, 74, 74), + ("GRAY3", 8, 8, 8), + ("GRAY30", 77, 77, 77), + ("GRAY31", 79, 79, 79), + ("GRAY32", 82, 82, 82), + ("GRAY33", 84, 84, 84), + ("GRAY34", 87, 87, 87), + ("GRAY35", 89, 89, 89), + ("GRAY36", 92, 92, 92), + ("GRAY37", 94, 94, 94), + ("GRAY38", 97, 97, 97), + ("GRAY39", 99, 99, 99), + ("GRAY4", 10, 10, 10), + ("GRAY40", 102, 102, 102), + ("GRAY41", 105, 105, 105), + ("GRAY42", 107, 107, 107), + ("GRAY43", 110, 110, 110), + ("GRAY44", 112, 112, 112), + ("GRAY45", 115, 115, 115), + ("GRAY46", 117, 117, 117), + ("GRAY47", 120, 120, 120), + ("GRAY48", 122, 122, 122), + ("GRAY49", 125, 125, 125), + ("GRAY5", 13, 13, 13), + ("GRAY50", 127, 127, 127), + ("GRAY51", 130, 130, 130), + ("GRAY52", 133, 133, 133), + ("GRAY53", 135, 135, 135), + ("GRAY54", 138, 138, 138), + ("GRAY55", 140, 140, 140), + ("GRAY56", 143, 143, 143), + ("GRAY57", 145, 145, 145), + ("GRAY58", 148, 148, 148), + ("GRAY59", 150, 150, 150), + ("GRAY6", 15, 15, 15), + ("GRAY60", 153, 153, 153), + ("GRAY61", 156, 156, 156), + ("GRAY62", 158, 158, 158), + ("GRAY63", 161, 161, 161), + ("GRAY64", 163, 163, 163), + ("GRAY65", 166, 166, 166), + ("GRAY66", 168, 168, 168), + ("GRAY67", 171, 171, 171), + ("GRAY68", 173, 173, 173), + ("GRAY69", 176, 176, 176), + ("GRAY7", 18, 18, 18), + ("GRAY70", 179, 179, 179), + ("GRAY71", 181, 181, 181), + ("GRAY72", 184, 184, 184), + ("GRAY73", 186, 186, 186), + ("GRAY74", 189, 189, 189), + ("GRAY75", 191, 191, 191), + ("GRAY76", 194, 194, 194), + ("GRAY77", 196, 196, 196), + ("GRAY78", 199, 199, 199), + ("GRAY79", 201, 201, 201), + ("GRAY8", 20, 20, 20), + ("GRAY80", 204, 204, 204), + ("GRAY81", 207, 207, 207), + ("GRAY82", 209, 209, 209), + ("GRAY83", 212, 212, 212), + ("GRAY84", 214, 214, 214), + ("GRAY85", 217, 
217, 217), + ("GRAY86", 219, 219, 219), + ("GRAY87", 222, 222, 222), + ("GRAY88", 224, 224, 224), + ("GRAY89", 227, 227, 227), + ("GRAY9", 23, 23, 23), + ("GRAY90", 229, 229, 229), + ("GRAY91", 232, 232, 232), + ("GRAY92", 235, 235, 235), + ("GRAY93", 237, 237, 237), + ("GRAY94", 240, 240, 240), + ("GRAY95", 242, 242, 242), + ("GRAY96", 245, 245, 245), + ("GRAY97", 247, 247, 247), + ("GRAY98", 250, 250, 250), + ("GRAY99", 252, 252, 252), + ("HONEYDEW", 240, 255, 240), + ("HONEYDEW1", 240, 255, 240), + ("HONEYDEW2", 224, 238, 224), + ("HONEYDEW3", 193, 205, 193), + ("HONEYDEW4", 131, 139, 131), + ("HOTPINK", 255, 105, 180), + ("HOTPINK1", 255, 110, 180), + ("HOTPINK2", 238, 106, 167), + ("HOTPINK3", 205, 96, 144), + ("HOTPINK4", 139, 58, 98), + ("INDIANRED", 205, 92, 92), + ("INDIANRED1", 255, 106, 106), + ("INDIANRED2", 238, 99, 99), + ("INDIANRED3", 205, 85, 85), + ("INDIANRED4", 139, 58, 58), + ("IVORY", 255, 255, 240), + ("IVORY1", 255, 255, 240), + ("IVORY2", 238, 238, 224), + ("IVORY3", 205, 205, 193), + ("IVORY4", 139, 139, 131), + ("KHAKI", 240, 230, 140), + ("KHAKI1", 255, 246, 143), + ("KHAKI2", 238, 230, 133), + ("KHAKI3", 205, 198, 115), + ("KHAKI4", 139, 134, 78), + ("LAVENDER", 230, 230, 250), + ("LAVENDERBLUSH", 255, 240, 245), + ("LAVENDERBLUSH1", 255, 240, 245), + ("LAVENDERBLUSH2", 238, 224, 229), + ("LAVENDERBLUSH3", 205, 193, 197), + ("LAVENDERBLUSH4", 139, 131, 134), + ("LAWNGREEN", 124, 252, 0), + ("LEMONCHIFFON", 255, 250, 205), + ("LEMONCHIFFON1", 255, 250, 205), + ("LEMONCHIFFON2", 238, 233, 191), + ("LEMONCHIFFON3", 205, 201, 165), + ("LEMONCHIFFON4", 139, 137, 112), + ("LIGHTBLUE", 173, 216, 230), + ("LIGHTBLUE1", 191, 239, 255), + ("LIGHTBLUE2", 178, 223, 238), + ("LIGHTBLUE3", 154, 192, 205), + ("LIGHTBLUE4", 104, 131, 139), + ("LIGHTCORAL", 240, 128, 128), + ("LIGHTCYAN", 224, 255, 255), + ("LIGHTCYAN1", 224, 255, 255), + ("LIGHTCYAN2", 209, 238, 238), + ("LIGHTCYAN3", 180, 205, 205), + ("LIGHTCYAN4", 122, 139, 139), + ("LIGHTGOLDENROD", 238, 221, 130), + ("LIGHTGOLDENROD1", 255, 236, 139), + ("LIGHTGOLDENROD2", 238, 220, 130), + ("LIGHTGOLDENROD3", 205, 190, 112), + ("LIGHTGOLDENROD4", 139, 129, 76), + ("LIGHTGOLDENRODYELLOW", 250, 250, 210), + ("LIGHTGREEN", 144, 238, 144), + ("LIGHTGRAY", 211, 211, 211), + ("LIGHTPINK", 255, 182, 193), + ("LIGHTPINK1", 255, 174, 185), + ("LIGHTPINK2", 238, 162, 173), + ("LIGHTPINK3", 205, 140, 149), + ("LIGHTPINK4", 139, 95, 101), + ("LIGHTSALMON", 255, 160, 122), + ("LIGHTSALMON1", 255, 160, 122), + ("LIGHTSALMON2", 238, 149, 114), + ("LIGHTSALMON3", 205, 129, 98), + ("LIGHTSALMON4", 139, 87, 66), + ("LIGHTSEAGREEN", 32, 178, 170), + ("LIGHTSKYBLUE", 135, 206, 250), + ("LIGHTSKYBLUE1", 176, 226, 255), + ("LIGHTSKYBLUE2", 164, 211, 238), + ("LIGHTSKYBLUE3", 141, 182, 205), + ("LIGHTSKYBLUE4", 96, 123, 139), + ("LIGHTSLATEBLUE", 132, 112, 255), + ("LIGHTSLATEGRAY", 119, 136, 153), + ("LIGHTSTEELBLUE", 176, 196, 222), + ("LIGHTSTEELBLUE1", 202, 225, 255), + ("LIGHTSTEELBLUE2", 188, 210, 238), + ("LIGHTSTEELBLUE3", 162, 181, 205), + ("LIGHTSTEELBLUE4", 110, 123, 139), + ("LIGHTYELLOW", 255, 255, 224), + ("LIGHTYELLOW1", 255, 255, 224), + ("LIGHTYELLOW2", 238, 238, 209), + ("LIGHTYELLOW3", 205, 205, 180), + ("LIGHTYELLOW4", 139, 139, 122), + ("LIMEGREEN", 50, 205, 50), + ("LINEN", 250, 240, 230), + ("MAGENTA", 255, 0, 255), + ("MAGENTA1", 255, 0, 255), + ("MAGENTA2", 238, 0, 238), + ("MAGENTA3", 205, 0, 205), + ("MAGENTA4", 139, 0, 139), + ("MAROON", 176, 48, 96), + ("MAROON1", 255, 52, 179), + ("MAROON2", 238, 48, 167), + 
("MAROON3", 205, 41, 144), + ("MAROON4", 139, 28, 98), + ("MEDIUMAQUAMARINE", 102, 205, 170), + ("MEDIUMBLUE", 0, 0, 205), + ("MEDIUMORCHID", 186, 85, 211), + ("MEDIUMORCHID1", 224, 102, 255), + ("MEDIUMORCHID2", 209, 95, 238), + ("MEDIUMORCHID3", 180, 82, 205), + ("MEDIUMORCHID4", 122, 55, 139), + ("MEDIUMPURPLE", 147, 112, 219), + ("MEDIUMPURPLE1", 171, 130, 255), + ("MEDIUMPURPLE2", 159, 121, 238), + ("MEDIUMPURPLE3", 137, 104, 205), + ("MEDIUMPURPLE4", 93, 71, 139), + ("MEDIUMSEAGREEN", 60, 179, 113), + ("MEDIUMSLATEBLUE", 123, 104, 238), + ("MEDIUMSPRINGGREEN", 0, 250, 154), + ("MEDIUMTURQUOISE", 72, 209, 204), + ("MEDIUMVIOLETRED", 199, 21, 133), + ("MIDNIGHTBLUE", 25, 25, 112), + ("MINTCREAM", 245, 255, 250), + ("MISTYROSE", 255, 228, 225), + ("MISTYROSE1", 255, 228, 225), + ("MISTYROSE2", 238, 213, 210), + ("MISTYROSE3", 205, 183, 181), + ("MISTYROSE4", 139, 125, 123), + ("MOCCASIN", 255, 228, 181), + ("MUPDFBLUE", 37, 114, 172), + ("NAVAJOWHITE", 255, 222, 173), + ("NAVAJOWHITE1", 255, 222, 173), + ("NAVAJOWHITE2", 238, 207, 161), + ("NAVAJOWHITE3", 205, 179, 139), + ("NAVAJOWHITE4", 139, 121, 94), + ("NAVY", 0, 0, 128), + ("NAVYBLUE", 0, 0, 128), + ("OLDLACE", 253, 245, 230), + ("OLIVEDRAB", 107, 142, 35), + ("OLIVEDRAB1", 192, 255, 62), + ("OLIVEDRAB2", 179, 238, 58), + ("OLIVEDRAB3", 154, 205, 50), + ("OLIVEDRAB4", 105, 139, 34), + ("ORANGE", 255, 165, 0), + ("ORANGE1", 255, 165, 0), + ("ORANGE2", 238, 154, 0), + ("ORANGE3", 205, 133, 0), + ("ORANGE4", 139, 90, 0), + ("ORANGERED", 255, 69, 0), + ("ORANGERED1", 255, 69, 0), + ("ORANGERED2", 238, 64, 0), + ("ORANGERED3", 205, 55, 0), + ("ORANGERED4", 139, 37, 0), + ("ORCHID", 218, 112, 214), + ("ORCHID1", 255, 131, 250), + ("ORCHID2", 238, 122, 233), + ("ORCHID3", 205, 105, 201), + ("ORCHID4", 139, 71, 137), + ("PALEGOLDENROD", 238, 232, 170), + ("PALEGREEN", 152, 251, 152), + ("PALEGREEN1", 154, 255, 154), + ("PALEGREEN2", 144, 238, 144), + ("PALEGREEN3", 124, 205, 124), + ("PALEGREEN4", 84, 139, 84), + ("PALETURQUOISE", 175, 238, 238), + ("PALETURQUOISE1", 187, 255, 255), + ("PALETURQUOISE2", 174, 238, 238), + ("PALETURQUOISE3", 150, 205, 205), + ("PALETURQUOISE4", 102, 139, 139), + ("PALEVIOLETRED", 219, 112, 147), + ("PALEVIOLETRED1", 255, 130, 171), + ("PALEVIOLETRED2", 238, 121, 159), + ("PALEVIOLETRED3", 205, 104, 137), + ("PALEVIOLETRED4", 139, 71, 93), + ("PAPAYAWHIP", 255, 239, 213), + ("PEACHPUFF", 255, 218, 185), + ("PEACHPUFF1", 255, 218, 185), + ("PEACHPUFF2", 238, 203, 173), + ("PEACHPUFF3", 205, 175, 149), + ("PEACHPUFF4", 139, 119, 101), + ("PERU", 205, 133, 63), + ("PINK", 255, 192, 203), + ("PINK1", 255, 181, 197), + ("PINK2", 238, 169, 184), + ("PINK3", 205, 145, 158), + ("PINK4", 139, 99, 108), + ("PLUM", 221, 160, 221), + ("PLUM1", 255, 187, 255), + ("PLUM2", 238, 174, 238), + ("PLUM3", 205, 150, 205), + ("PLUM4", 139, 102, 139), + ("POWDERBLUE", 176, 224, 230), + ("PURPLE", 160, 32, 240), + ("PURPLE1", 155, 48, 255), + ("PURPLE2", 145, 44, 238), + ("PURPLE3", 125, 38, 205), + ("PURPLE4", 85, 26, 139), + ("PY_COLOR", 240, 255, 210), + ("RED", 255, 0, 0), + ("RED1", 255, 0, 0), + ("RED2", 238, 0, 0), + ("RED3", 205, 0, 0), + ("RED4", 139, 0, 0), + ("ROSYBROWN", 188, 143, 143), + ("ROSYBROWN1", 255, 193, 193), + ("ROSYBROWN2", 238, 180, 180), + ("ROSYBROWN3", 205, 155, 155), + ("ROSYBROWN4", 139, 105, 105), + ("ROYALBLUE", 65, 105, 225), + ("ROYALBLUE1", 72, 118, 255), + ("ROYALBLUE2", 67, 110, 238), + ("ROYALBLUE3", 58, 95, 205), + ("ROYALBLUE4", 39, 64, 139), + ("SADDLEBROWN", 139, 69, 19), + ("SALMON", 250, 
128, 114), + ("SALMON1", 255, 140, 105), + ("SALMON2", 238, 130, 98), + ("SALMON3", 205, 112, 84), + ("SALMON4", 139, 76, 57), + ("SANDYBROWN", 244, 164, 96), + ("SEAGREEN", 46, 139, 87), + ("SEAGREEN1", 84, 255, 159), + ("SEAGREEN2", 78, 238, 148), + ("SEAGREEN3", 67, 205, 128), + ("SEAGREEN4", 46, 139, 87), + ("SEASHELL", 255, 245, 238), + ("SEASHELL1", 255, 245, 238), + ("SEASHELL2", 238, 229, 222), + ("SEASHELL3", 205, 197, 191), + ("SEASHELL4", 139, 134, 130), + ("SIENNA", 160, 82, 45), + ("SIENNA1", 255, 130, 71), + ("SIENNA2", 238, 121, 66), + ("SIENNA3", 205, 104, 57), + ("SIENNA4", 139, 71, 38), + ("SKYBLUE", 135, 206, 235), + ("SKYBLUE1", 135, 206, 255), + ("SKYBLUE2", 126, 192, 238), + ("SKYBLUE3", 108, 166, 205), + ("SKYBLUE4", 74, 112, 139), + ("SLATEBLUE", 106, 90, 205), + ("SLATEBLUE1", 131, 111, 255), + ("SLATEBLUE2", 122, 103, 238), + ("SLATEBLUE3", 105, 89, 205), + ("SLATEBLUE4", 71, 60, 139), + ("SLATEGRAY", 112, 128, 144), + ("SNOW", 255, 250, 250), + ("SNOW1", 255, 250, 250), + ("SNOW2", 238, 233, 233), + ("SNOW3", 205, 201, 201), + ("SNOW4", 139, 137, 137), + ("SPRINGGREEN", 0, 255, 127), + ("SPRINGGREEN1", 0, 255, 127), + ("SPRINGGREEN2", 0, 238, 118), + ("SPRINGGREEN3", 0, 205, 102), + ("SPRINGGREEN4", 0, 139, 69), + ("STEELBLUE", 70, 130, 180), + ("STEELBLUE1", 99, 184, 255), + ("STEELBLUE2", 92, 172, 238), + ("STEELBLUE3", 79, 148, 205), + ("STEELBLUE4", 54, 100, 139), + ("TAN", 210, 180, 140), + ("TAN1", 255, 165, 79), + ("TAN2", 238, 154, 73), + ("TAN3", 205, 133, 63), + ("TAN4", 139, 90, 43), + ("THISTLE", 216, 191, 216), + ("THISTLE1", 255, 225, 255), + ("THISTLE2", 238, 210, 238), + ("THISTLE3", 205, 181, 205), + ("THISTLE4", 139, 123, 139), + ("TOMATO", 255, 99, 71), + ("TOMATO1", 255, 99, 71), + ("TOMATO2", 238, 92, 66), + ("TOMATO3", 205, 79, 57), + ("TOMATO4", 139, 54, 38), + ("TURQUOISE", 64, 224, 208), + ("TURQUOISE1", 0, 245, 255), + ("TURQUOISE2", 0, 229, 238), + ("TURQUOISE3", 0, 197, 205), + ("TURQUOISE4", 0, 134, 139), + ("VIOLET", 238, 130, 238), + ("VIOLETRED", 208, 32, 144), + ("VIOLETRED1", 255, 62, 150), + ("VIOLETRED2", 238, 58, 140), + ("VIOLETRED3", 205, 50, 120), + ("VIOLETRED4", 139, 34, 82), + ("WHEAT", 245, 222, 179), + ("WHEAT1", 255, 231, 186), + ("WHEAT2", 238, 216, 174), + ("WHEAT3", 205, 186, 150), + ("WHEAT4", 139, 126, 102), + ("WHITE", 255, 255, 255), + ("WHITESMOKE", 245, 245, 245), + ("YELLOW", 255, 255, 0), + ("YELLOW1", 255, 255, 0), + ("YELLOW2", 238, 238, 0), + ("YELLOW3", 205, 205, 0), + ("YELLOW4", 139, 139, 0), + ("YELLOWGREEN", 154, 205, 50), + ] + + +def getColorInfoDict() -> dict: + d = {} + for item in getColorInfoList(): + d[item[0].lower()] = item[1:] + return d + + +def getColor(name: str) -> tuple: + """Retrieve RGB color in PDF format by name. + + Returns: + a triple of floats in range 0 to 1. In case of name-not-found, "white" is returned. + """ + try: + c = getColorInfoList()[getColorList().index(name.upper())] + return (c[1] / 255.0, c[2] / 255.0, c[3] / 255.0) + except: + return (1, 1, 1) + + +def getColorHSV(name: str) -> tuple: + """Retrieve the hue, saturation, value triple of a color name. + + Returns: + a triple (degree, percent, percent). If not found (-1, -1, -1) is returned. 
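+ Example: getColorHSV("red") returns (0, 100, 100.0), because pure
+ red has hue 0 degrees, full saturation and full value.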
+ """ + try: + x = getColorInfoList()[getColorList().index(name.upper())] + except: + return (-1, -1, -1) + + r = x[1] / 255.0 + g = x[2] / 255.0 + b = x[3] / 255.0 + cmax = max(r, g, b) + V = round(cmax * 100, 1) + cmin = min(r, g, b) + delta = cmax - cmin + if delta == 0: + hue = 0 + elif cmax == r: + hue = 60.0 * (((g - b) / delta) % 6) + elif cmax == g: + hue = 60.0 * (((b - r) / delta) + 2) + else: + hue = 60.0 * (((r - g) / delta) + 4) + + H = int(round(hue)) + + if cmax == 0: + sat = 0 + else: + sat = delta / cmax + S = int(round(sat * 100)) + + return (H, S, V) + + +def _get_font_properties(doc: Document, xref: int) -> tuple: + fontname, ext, stype, buffer = doc.extract_font(xref) + asc = 0.8 + dsc = -0.2 + if ext == "": + return fontname, ext, stype, asc, dsc + + if buffer: + try: + font = Font(fontbuffer=buffer) + asc = font.ascender + dsc = font.descender + bbox = font.bbox + if asc - dsc < 1: + if bbox.y0 < dsc: + dsc = bbox.y0 + asc = 1 - dsc + except: + asc *= 1.2 + dsc *= 1.2 + return fontname, ext, stype, asc, dsc + if ext != "n/a": + try: + font = Font(fontname) + asc = font.ascender + dsc = font.descender + except: + asc *= 1.2 + dsc *= 1.2 + else: + asc *= 1.2 + dsc *= 1.2 + return fontname, ext, stype, asc, dsc + + +def get_char_widths( + doc: Document, xref: int, limit: int = 256, idx: int = 0, fontdict: OptDict = None +) -> list: + """Get list of glyph information of a font. + + Notes: + Must be provided by its XREF number. If we already dealt with the + font, it will be recorded in doc.FontInfos. Otherwise we insert an + entry there. + Finally we return the glyphs for the font. This is a list of + (glyph, width) where glyph is an integer controlling the char + appearance, and width is a float controlling the char's spacing: + width * fontsize is the actual space. + For 'simple' fonts, glyph == ord(char) will usually be true. + Exceptions are 'Symbol' and 'ZapfDingbats'. We are providing data for these directly here. 
+ """ + fontinfo = CheckFontInfo(doc, xref) + if fontinfo is None: # not recorded yet: create it + if fontdict is None: + name, ext, stype, asc, dsc = _get_font_properties(doc, xref) + fontdict = { + "name": name, + "type": stype, + "ext": ext, + "ascender": asc, + "descender": dsc, + } + else: + name = fontdict["name"] + ext = fontdict["ext"] + stype = fontdict["type"] + ordering = fontdict["ordering"] + simple = fontdict["simple"] + + if ext == "": + raise ValueError("xref is not a font") + + # check for 'simple' fonts + if stype in ("Type1", "MMType1", "TrueType"): + simple = True + else: + simple = False + + # check for CJK fonts + if name in ("Fangti", "Ming"): + ordering = 0 + elif name in ("Heiti", "Song"): + ordering = 1 + elif name in ("Gothic", "Mincho"): + ordering = 2 + elif name in ("Dotum", "Batang"): + ordering = 3 + else: + ordering = -1 + + fontdict["simple"] = simple + + if name == "ZapfDingbats": + glyphs = zapf_glyphs + elif name == "Symbol": + glyphs = symbol_glyphs + else: + glyphs = None + + fontdict["glyphs"] = glyphs + fontdict["ordering"] = ordering + fontinfo = [xref, fontdict] + doc.FontInfos.append(fontinfo) + else: + fontdict = fontinfo[1] + glyphs = fontdict["glyphs"] + simple = fontdict["simple"] + ordering = fontdict["ordering"] + + if glyphs is None: + oldlimit = 0 + else: + oldlimit = len(glyphs) + + mylimit = max(256, limit) + + if mylimit <= oldlimit: + return glyphs + + if ordering < 0: # not a CJK font + glyphs = doc._get_char_widths( + xref, fontdict["name"], fontdict["ext"], fontdict["ordering"], mylimit, idx + ) + else: # CJK fonts use char codes and width = 1 + glyphs = None + + fontdict["glyphs"] = glyphs + fontinfo[1] = fontdict + UpdateFontInfo(doc, fontinfo) + + return glyphs + + +class Shape(object): + """Create a new shape.""" + + @staticmethod + def horizontal_angle(C, P): + """Return the angle to the horizontal for the connection from C to P. + This uses the arcus sine function and resolves its inherent ambiguity by + looking up in which quadrant vector S = P - C is located. + """ + S = Point(P - C).unit # unit vector 'C' -> 'P' + alfa = math.asin(abs(S.y)) # absolute angle from horizontal + if S.x < 0: # make arcsin result unique + if S.y <= 0: # bottom-left + alfa = -(math.pi - alfa) + else: # top-left + alfa = math.pi - alfa + else: + if S.y >= 0: # top-right + pass + else: # bottom-right + alfa = -alfa + return alfa + + def __init__(self, page: Page): + CheckParent(page) + self.page = page + self.doc = page.parent + if not self.doc.is_pdf: + raise ValueError("is no PDF") + self.height = page.mediabox_size.y + self.width = page.mediabox_size.x + self.x = page.cropbox_position.x + self.y = page.cropbox_position.y + + self.pctm = page.transformation_matrix # page transf. matrix + self.ipctm = ~self.pctm # inverted transf. 
matrix + + self.draw_cont = "" + self.text_cont = "" + self.totalcont = "" + self.lastPoint = None + self.rect = None + + def updateRect(self, x): + if self.rect is None: + if len(x) == 2: + self.rect = Rect(x, x) + else: + self.rect = Rect(x) + + else: + if len(x) == 2: + x = Point(x) + self.rect.x0 = min(self.rect.x0, x.x) + self.rect.y0 = min(self.rect.y0, x.y) + self.rect.x1 = max(self.rect.x1, x.x) + self.rect.y1 = max(self.rect.y1, x.y) + else: + x = Rect(x) + self.rect.x0 = min(self.rect.x0, x.x0) + self.rect.y0 = min(self.rect.y0, x.y0) + self.rect.x1 = max(self.rect.x1, x.x1) + self.rect.y1 = max(self.rect.y1, x.y1) + + def draw_line(self, p1: point_like, p2: point_like) -> Point: + """Draw a line between two points.""" + p1 = Point(p1) + p2 = Point(p2) + if not (self.lastPoint == p1): + self.draw_cont += "%g %g m\n" % JM_TUPLE(p1 * self.ipctm) + self.lastPoint = p1 + self.updateRect(p1) + + self.draw_cont += "%g %g l\n" % JM_TUPLE(p2 * self.ipctm) + self.updateRect(p2) + self.lastPoint = p2 + return self.lastPoint + + def draw_polyline(self, points: list) -> Point: + """Draw several connected line segments.""" + for i, p in enumerate(points): + if i == 0: + if not (self.lastPoint == Point(p)): + self.draw_cont += "%g %g m\n" % JM_TUPLE(Point(p) * self.ipctm) + self.lastPoint = Point(p) + else: + self.draw_cont += "%g %g l\n" % JM_TUPLE(Point(p) * self.ipctm) + self.updateRect(p) + + self.lastPoint = Point(points[-1]) + return self.lastPoint + + def draw_bezier( + self, + p1: point_like, + p2: point_like, + p3: point_like, + p4: point_like, + ) -> Point: + """Draw a standard cubic Bezier curve.""" + p1 = Point(p1) + p2 = Point(p2) + p3 = Point(p3) + p4 = Point(p4) + if not (self.lastPoint == p1): + self.draw_cont += "%g %g m\n" % JM_TUPLE(p1 * self.ipctm) + self.draw_cont += "%g %g %g %g %g %g c\n" % JM_TUPLE( + list(p2 * self.ipctm) + list(p3 * self.ipctm) + list(p4 * self.ipctm) + ) + self.updateRect(p1) + self.updateRect(p2) + self.updateRect(p3) + self.updateRect(p4) + self.lastPoint = p4 + return self.lastPoint + + def draw_oval(self, tetra: typing.Union[quad_like, rect_like]) -> Point: + """Draw an ellipse inside a tetrapod.""" + if len(tetra) != 4: + raise ValueError("invalid arg length") + if hasattr(tetra[0], "__float__"): + q = Rect(tetra).quad + else: + q = Quad(tetra) + + mt = q.ul + (q.ur - q.ul) * 0.5 + mr = q.ur + (q.lr - q.ur) * 0.5 + mb = q.ll + (q.lr - q.ll) * 0.5 + ml = q.ul + (q.ll - q.ul) * 0.5 + if not (self.lastPoint == ml): + self.draw_cont += "%g %g m\n" % JM_TUPLE(ml * self.ipctm) + self.lastPoint = ml + self.draw_curve(ml, q.ll, mb) + self.draw_curve(mb, q.lr, mr) + self.draw_curve(mr, q.ur, mt) + self.draw_curve(mt, q.ul, ml) + self.updateRect(q.rect) + self.lastPoint = ml + return self.lastPoint + + def draw_circle(self, center: point_like, radius: float) -> Point: + """Draw a circle given its center and radius.""" + if not radius > EPSILON: + raise ValueError("radius must be positive") + center = Point(center) + p1 = center - (radius, 0) + return self.draw_sector(center, p1, 360, fullSector=False) + + def draw_curve( + self, + p1: point_like, + p2: point_like, + p3: point_like, + ) -> Point: + """Draw a curve between points using one control point.""" + kappa = 0.55228474983 + p1 = Point(p1) + p2 = Point(p2) + p3 = Point(p3) + k1 = p1 + (p2 - p1) * kappa + k2 = p3 + (p2 - p3) * kappa + return self.draw_bezier(p1, k1, k2, p3) + + def draw_sector( + self, + center: point_like, + point: point_like, + beta: float, + fullSector: bool = True, + ) -> Point: 
+ """Draw a circle sector.""" + center = Point(center) + point = Point(point) + l3 = "%g %g m\n" + l4 = "%g %g %g %g %g %g c\n" + l5 = "%g %g l\n" + betar = math.radians(-beta) + w360 = math.radians(math.copysign(360, betar)) * (-1) + w90 = math.radians(math.copysign(90, betar)) + w45 = w90 / 2 + while abs(betar) > 2 * math.pi: + betar += w360 # bring angle below 360 degrees + if not (self.lastPoint == point): + self.draw_cont += l3 % JM_TUPLE(point * self.ipctm) + self.lastPoint = point + Q = Point(0, 0) # just make sure it exists + C = center + P = point + S = P - C # vector 'center' -> 'point' + rad = abs(S) # circle radius + + if not rad > EPSILON: + raise ValueError("radius must be positive") + + alfa = self.horizontal_angle(center, point) + while abs(betar) > abs(w90): # draw 90 degree arcs + q1 = C.x + math.cos(alfa + w90) * rad + q2 = C.y + math.sin(alfa + w90) * rad + Q = Point(q1, q2) # the arc's end point + r1 = C.x + math.cos(alfa + w45) * rad / math.cos(w45) + r2 = C.y + math.sin(alfa + w45) * rad / math.cos(w45) + R = Point(r1, r2) # crossing point of tangents + kappah = (1 - math.cos(w45)) * 4 / 3 / abs(R - Q) + kappa = kappah * abs(P - Q) + cp1 = P + (R - P) * kappa # control point 1 + cp2 = Q + (R - Q) * kappa # control point 2 + self.draw_cont += l4 % JM_TUPLE( + list(cp1 * self.ipctm) + list(cp2 * self.ipctm) + list(Q * self.ipctm) + ) + + betar -= w90 # reduce parm angle by 90 deg + alfa += w90 # advance start angle by 90 deg + P = Q # advance to arc end point + # draw (remaining) arc + if abs(betar) > 1e-3: # significant degrees left? + beta2 = betar / 2 + q1 = C.x + math.cos(alfa + betar) * rad + q2 = C.y + math.sin(alfa + betar) * rad + Q = Point(q1, q2) # the arc's end point + r1 = C.x + math.cos(alfa + beta2) * rad / math.cos(beta2) + r2 = C.y + math.sin(alfa + beta2) * rad / math.cos(beta2) + R = Point(r1, r2) # crossing point of tangents + # kappa height is 4/3 of segment height + kappah = (1 - math.cos(beta2)) * 4 / 3 / abs(R - Q) # kappa height + kappa = kappah * abs(P - Q) / (1 - math.cos(betar)) + cp1 = P + (R - P) * kappa # control point 1 + cp2 = Q + (R - Q) * kappa # control point 2 + self.draw_cont += l4 % JM_TUPLE( + list(cp1 * self.ipctm) + list(cp2 * self.ipctm) + list(Q * self.ipctm) + ) + if fullSector: + self.draw_cont += l3 % JM_TUPLE(point * self.ipctm) + self.draw_cont += l5 % JM_TUPLE(center * self.ipctm) + self.draw_cont += l5 % JM_TUPLE(Q * self.ipctm) + self.lastPoint = Q + return self.lastPoint + + def draw_rect(self, rect: rect_like, *, radius=None) -> Point: + """Draw a rectangle. + + Args: + radius: if not None, the rectangle will have rounded corners. + This is the radius of the curvature, given as percentage of + the rectangle width or height. Valid are values 0 < v <= 0.5. + For a sequence of two values, the corners will have different + radii. Otherwise, the percentage will be computed from the + shorter side. A value of (0.5, 0.5) will draw an ellipse. + """ + r = Rect(rect) + if radius == None: # standard rectangle + self.draw_cont += "%g %g %g %g re\n" % JM_TUPLE( + list(r.bl * self.ipctm) + [r.width, r.height] + ) + self.updateRect(r) + self.lastPoint = r.tl + return self.lastPoint + # rounded corners requested. 
This requires 1 or 2 values, each + # with 0 < value <= 0.5 + if hasattr(radius, "__float__"): + if radius <= 0 or radius > 0.5: + raise ValueError(f"bad radius value {radius}.") + d = min(r.width, r.height) * radius + px = (d, 0) + py = (0, d) + elif hasattr(radius, "__len__") and len(radius) == 2: + rx, ry = radius + px = (rx * r.width, 0) + py = (0, ry * r.height) + if min(rx, ry) <= 0 or max(rx, ry) > 0.5: + raise ValueError(f"bad radius value {radius}.") + else: + raise ValueError(f"bad radius value {radius}.") + + lp = self.draw_line(r.tl + py, r.bl - py) + lp = self.draw_curve(lp, r.bl, r.bl + px) + + lp = self.draw_line(lp, r.br - px) + lp = self.draw_curve(lp, r.br, r.br - py) + + lp = self.draw_line(lp, r.tr + py) + lp = self.draw_curve(lp, r.tr, r.tr - px) + + lp = self.draw_line(lp, r.tl + px) + self.lastPoint = self.draw_curve(lp, r.tl, r.tl + py) + + self.updateRect(r) + return self.lastPoint + + def draw_quad(self, quad: quad_like) -> Point: + """Draw a Quad.""" + q = Quad(quad) + return self.draw_polyline([q.ul, q.ll, q.lr, q.ur, q.ul]) + + def draw_zigzag( + self, + p1: point_like, + p2: point_like, + breadth: float = 2, + ) -> Point: + """Draw a zig-zagged line from p1 to p2.""" + p1 = Point(p1) + p2 = Point(p2) + S = p2 - p1 # vector start - end + rad = abs(S) # distance of points + cnt = 4 * int(round(rad / (4 * breadth), 0)) # always take full phases + if cnt < 4: + raise ValueError("points too close") + mb = rad / cnt # revised breadth + matrix = Matrix(util_hor_matrix(p1, p2)) # normalize line to x-axis + i_mat = ~matrix # get original position + points = [] # stores edges + for i in range(1, cnt): + if i % 4 == 1: # point "above" connection + p = Point(i, -1) * mb + elif i % 4 == 3: # point "below" connection + p = Point(i, 1) * mb + else: # ignore others + continue + points.append(p * i_mat) + self.draw_polyline([p1] + points + [p2]) # add start and end points + return p2 + + def draw_squiggle( + self, + p1: point_like, + p2: point_like, + breadth=2, + ) -> Point: + """Draw a squiggly line from p1 to p2.""" + p1 = Point(p1) + p2 = Point(p2) + S = p2 - p1 # vector start - end + rad = abs(S) # distance of points + cnt = 4 * int(round(rad / (4 * breadth), 0)) # always take full phases + if cnt < 4: + raise ValueError("points too close") + mb = rad / cnt # revised breadth + matrix = Matrix(util_hor_matrix(p1, p2)) # normalize line to x-axis + i_mat = ~matrix # get original position + k = 2.4142135623765633 # y of draw_curve helper point + + points = [] # stores edges + for i in range(1, cnt): + if i % 4 == 1: # point "above" connection + p = Point(i, -k) * mb + elif i % 4 == 3: # point "below" connection + p = Point(i, k) * mb + else: # else on connection line + p = Point(i, 0) * mb + points.append(p * i_mat) + + points = [p1] + points + [p2] + cnt = len(points) + i = 0 + while i + 2 < cnt: + self.draw_curve(points[i], points[i + 1], points[i + 2]) + i += 2 + return p2 + + # ============================================================================== + # Shape.insert_text + # ============================================================================== + def insert_text( + self, + point: point_like, + buffer: typing.Union[str, list], + fontsize: float = 11, + lineheight: OptFloat = None, + fontname: str = "helv", + fontfile: OptStr = None, + set_simple: bool = 0, + encoding: int = 0, + color: OptSeq = None, + fill: OptSeq = None, + render_mode: int = 0, + border_width: float = 0.05, + rotate: int = 0, + morph: OptSeq = None, + stroke_opacity: float = 1, + 
fill_opacity: float = 1, + oc: int = 0, + ) -> int: + # ensure 'text' is a list of strings, worth dealing with + if not bool(buffer): + return 0 + + if type(buffer) not in (list, tuple): + text = buffer.splitlines() + else: + text = buffer + + if not len(text) > 0: + return 0 + + point = Point(point) + try: + maxcode = max([ord(c) for c in " ".join(text)]) + except: + return 0 + + # ensure valid 'fontname' + fname = fontname + if fname.startswith("/"): + fname = fname[1:] + + xref = self.page.insert_font( + fontname=fname, fontfile=fontfile, encoding=encoding, set_simple=set_simple + ) + fontinfo = CheckFontInfo(self.doc, xref) + + fontdict = fontinfo[1] + ordering = fontdict["ordering"] + simple = fontdict["simple"] + bfname = fontdict["name"] + ascender = fontdict["ascender"] + descender = fontdict["descender"] + if lineheight: + lheight = fontsize * lineheight + elif ascender - descender <= 1: + lheight = fontsize * 1.2 + else: + lheight = fontsize * (ascender - descender) + + if maxcode > 255: + glyphs = self.doc.get_char_widths(xref, maxcode + 1) + else: + glyphs = fontdict["glyphs"] + + tab = [] + for t in text: + if simple and bfname not in ("Symbol", "ZapfDingbats"): + g = None + else: + g = glyphs + tab.append(getTJstr(t, g, simple, ordering)) + text = tab + + color_str = ColorCode(color, "c") + fill_str = ColorCode(fill, "f") + if not fill and render_mode == 0: # ensure fill color when 0 Tr + fill = color + fill_str = ColorCode(color, "f") + + morphing = CheckMorph(morph) + rot = rotate + if rot % 90 != 0: + raise ValueError("bad rotate value") + + while rot < 0: + rot += 360 + rot = rot % 360 # text rotate = 0, 90, 270, 180 + + templ1 = "\nq\n%s%sBT\n%s1 0 0 1 %g %g Tm\n/%s %g Tf " + templ2 = "TJ\n0 -%g TD\n" + cmp90 = "0 1 -1 0 0 0 cm\n" # rotates 90 deg counter-clockwise + cmm90 = "0 -1 1 0 0 0 cm\n" # rotates 90 deg clockwise + cm180 = "-1 0 0 -1 0 0 cm\n" # rotates by 180 deg. + height = self.height + width = self.width + + # setting up for standard rotation directions + # case rotate = 0 + if morphing: + m1 = Matrix(1, 0, 0, 1, morph[0].x + self.x, height - morph[0].y - self.y) + mat = ~m1 * morph[1] * m1 + cm = "%g %g %g %g %g %g cm\n" % JM_TUPLE(mat) + else: + cm = "" + top = height - point.y - self.y # start of 1st char + left = point.x + self.x # start of 1. 
char + space = top # space available + headroom = point.y + self.y # distance to page border + if rot == 90: + left = height - point.y - self.y + top = -point.x - self.x + cm += cmp90 + space = width - abs(top) + headroom = point.x + self.x + + elif rot == 270: + left = -height + point.y + self.y + top = point.x + self.x + cm += cmm90 + space = abs(top) + headroom = width - point.x - self.x + + elif rot == 180: + left = -point.x - self.x + top = -height + point.y + self.y + cm += cm180 + space = abs(point.y + self.y) + headroom = height - point.y - self.y + + optcont = self.page._get_optional_content(oc) + if optcont != None: + bdc = "/OC /%s BDC\n" % optcont + emc = "EMC\n" + else: + bdc = emc = "" + + alpha = self.page._set_opacity(CA=stroke_opacity, ca=fill_opacity) + if alpha == None: + alpha = "" + else: + alpha = "/%s gs\n" % alpha + nres = templ1 % (bdc, alpha, cm, left, top, fname, fontsize) + + if render_mode > 0: + nres += "%i Tr " % render_mode + nres += "%g w " % (border_width * fontsize) + + if color is not None: + nres += color_str + if fill is not None: + nres += fill_str + + # ========================================================================= + # start text insertion + # ========================================================================= + nres += text[0] + nlines = 1 # set output line counter + if len(text) > 1: + nres += templ2 % lheight # line 1 + else: + nres += templ2[:2] + for i in range(1, len(text)): + if space < lheight: + break # no space left on page + if i > 1: + nres += "\nT* " + nres += text[i] + templ2[:2] + space -= lheight + nlines += 1 + + nres += "\nET\n%sQ\n" % emc + + # ===================================================================== + # end of text insertion + # ===================================================================== + # update the /Contents object + self.text_cont += nres + return nlines + + # ========================================================================= + # Shape.insert_textbox + # ========================================================================= + def insert_textbox( + self, + rect: rect_like, + buffer: typing.Union[str, list], + fontname: OptStr = "helv", + fontfile: OptStr = None, + fontsize: float = 11, + lineheight: OptFloat = None, + set_simple: bool = 0, + encoding: int = 0, + color: OptSeq = None, + fill: OptSeq = None, + expandtabs: int = 1, + border_width: float = 0.05, + align: int = 0, + render_mode: int = 0, + rotate: int = 0, + morph: OptSeq = None, + stroke_opacity: float = 1, + fill_opacity: float = 1, + oc: int = 0, + ) -> float: + """Insert text into a given rectangle. 
+ + Args: + rect -- the textbox to fill + buffer -- text to be inserted + fontname -- a Base-14 font, font name or '/name' + fontfile -- name of a font file + fontsize -- font size + lineheight -- overwrite the font property + color -- RGB stroke color triple + fill -- RGB fill color triple + render_mode -- text rendering control + border_width -- thickness of glyph borders as percentage of fontsize + expandtabs -- handles tabulators with string function + align -- left, center, right, justified + rotate -- 0, 90, 180, or 270 degrees + morph -- morph box with a matrix and a fixpoint + Returns: + unused or deficit rectangle area (float) + """ + rect = Rect(rect) + if rect.is_empty or rect.is_infinite: + raise ValueError("text box must be finite and not empty") + + color_str = ColorCode(color, "c") + fill_str = ColorCode(fill, "f") + if fill is None and render_mode == 0: # ensure fill color for 0 Tr + fill = color + fill_str = ColorCode(color, "f") + + optcont = self.page._get_optional_content(oc) + if optcont != None: + bdc = "/OC /%s BDC\n" % optcont + emc = "EMC\n" + else: + bdc = emc = "" + + # determine opacity / transparency + alpha = self.page._set_opacity(CA=stroke_opacity, ca=fill_opacity) + if alpha == None: + alpha = "" + else: + alpha = "/%s gs\n" % alpha + + if rotate % 90 != 0: + raise ValueError("rotate must be multiple of 90") + + rot = rotate + while rot < 0: + rot += 360 + rot = rot % 360 + + # is buffer worth of dealing with? + if not bool(buffer): + return rect.height if rot in (0, 180) else rect.width + + cmp90 = "0 1 -1 0 0 0 cm\n" # rotates counter-clockwise + cmm90 = "0 -1 1 0 0 0 cm\n" # rotates clockwise + cm180 = "-1 0 0 -1 0 0 cm\n" # rotates by 180 deg. + height = self.height + + fname = fontname + if fname.startswith("/"): + fname = fname[1:] + + xref = self.page.insert_font( + fontname=fname, fontfile=fontfile, encoding=encoding, set_simple=set_simple + ) + fontinfo = CheckFontInfo(self.doc, xref) + + fontdict = fontinfo[1] + ordering = fontdict["ordering"] + simple = fontdict["simple"] + glyphs = fontdict["glyphs"] + bfname = fontdict["name"] + ascender = fontdict["ascender"] + descender = fontdict["descender"] + + if lineheight: + lheight_factor = lineheight + elif ascender - descender <= 1: + lheight_factor = 1.2 + else: + lheight_factor = ascender - descender + lheight = fontsize * lheight_factor + + # create a list from buffer, split into its lines + if type(buffer) in (list, tuple): + t0 = "\n".join(buffer) + else: + t0 = buffer + + maxcode = max([ord(c) for c in t0]) + # replace invalid char codes for simple fonts + if simple and maxcode > 255: + t0 = "".join([c if ord(c) < 256 else "?" 
for c in t0]) + + t0 = t0.splitlines() + + glyphs = self.doc.get_char_widths(xref, maxcode + 1) + if simple and bfname not in ("Symbol", "ZapfDingbats"): + tj_glyphs = None + else: + tj_glyphs = glyphs + + # ---------------------------------------------------------------------- + # calculate pixel length of a string + # ---------------------------------------------------------------------- + def pixlen(x): + """Calculate pixel length of x.""" + if ordering < 0: + return sum([glyphs[ord(c)][1] for c in x]) * fontsize + else: + return len(x) * fontsize + + # --------------------------------------------------------------------- + + if ordering < 0: + blen = glyphs[32][1] * fontsize # pixel size of space character + else: + blen = fontsize + + text = "" # output buffer + + if CheckMorph(morph): + m1 = Matrix( + 1, 0, 0, 1, morph[0].x + self.x, self.height - morph[0].y - self.y + ) + mat = ~m1 * morph[1] * m1 + cm = "%g %g %g %g %g %g cm\n" % JM_TUPLE(mat) + else: + cm = "" + + # --------------------------------------------------------------------- + # adjust for text orientation / rotation + # --------------------------------------------------------------------- + progr = 1 # direction of line progress + c_pnt = Point(0, fontsize * ascender) # used for line progress + if rot == 0: # normal orientation + point = rect.tl + c_pnt # line 1 is 'lheight' below top + maxwidth = rect.width # pixels available in one line + maxheight = rect.height # available text height + + elif rot == 90: # rotate counter clockwise + c_pnt = Point(fontsize * ascender, 0) # progress in x-direction + point = rect.bl + c_pnt # line 1 'lheight' away from left + maxwidth = rect.height # pixels available in one line + maxheight = rect.width # available text height + cm += cmp90 + + elif rot == 180: # text upside down + # progress upwards in y direction + c_pnt = -Point(0, fontsize * ascender) + point = rect.br + c_pnt # line 1 'lheight' above bottom + maxwidth = rect.width # pixels available in one line + progr = -1 # subtract lheight for next line + maxheight = rect.height # available text height + cm += cm180 + + else: # rotate clockwise (270 or -90) + # progress from right to left + c_pnt = -Point(fontsize * ascender, 0) + point = rect.tr + c_pnt # line 1 'lheight' left of right + maxwidth = rect.height # pixels available in one line + progr = -1 # subtract lheight for next line + maxheight = rect.width # available text height + cm += cmm90 + + # ===================================================================== + # line loop + # ===================================================================== + just_tab = [] # 'justify' indicators per line + + for i, line in enumerate(t0): + line_t = line.expandtabs(expandtabs).split(" ") # split into words + num_words = len(line_t) + lbuff = "" # init line buffer + rest = maxwidth # available line pixels + # ================================================================= + # word loop + # ================================================================= + for j in range(num_words): + word = line_t[j] + pl_w = pixlen(word) # pixel len of word + if rest >= pl_w: # does it fit on the line? 
+ lbuff += word + " " # yes, append word + rest -= pl_w + blen # update available line space + continue # next word + + # word doesn't fit - output line (if not empty) + if lbuff: + lbuff = lbuff.rstrip() + "\n" # line full, append line break + text += lbuff # append to total text + just_tab.append(True) # can align-justify + + lbuff = "" # re-init line buffer + rest = maxwidth # re-init avail. space + + if pl_w <= maxwidth: # word shorter than 1 line? + lbuff = word + " " # start the line with it + rest = maxwidth - pl_w - blen # update free space + continue + + # long word: split across multiple lines - char by char ... + if len(just_tab) > 0: + just_tab[-1] = False # cannot align-justify + for c in word: + if pixlen(lbuff) <= maxwidth - pixlen(c): + lbuff += c + else: # line full + lbuff += "\n" # close line + text += lbuff # append to text + just_tab.append(False) # cannot align-justify + lbuff = c # start new line with this char + + lbuff += " " # finish long word + rest = maxwidth - pixlen(lbuff) # long word stored + + if lbuff: # unprocessed line content? + text += lbuff.rstrip() # append to text + just_tab.append(False) # cannot align-justify + + if i < len(t0) - 1: # not the last line? + text += "\n" # insert line break + + # compute used part of the textbox + if text.endswith("\n"): + text = text[:-1] + lb_count = text.count("\n") + 1 # number of lines written + + # text height = line count * line height plus one descender value + text_height = lheight * lb_count - descender * fontsize + + more = text_height - maxheight # difference to height limit + if more > EPSILON: # landed too much outside rect + return (-1) * more # return deficit, don't output + + more = abs(more) + if more < EPSILON: + more = 0 # don't bother with epsilons + nres = "\nq\n%s%sBT\n" % (bdc, alpha) + cm # initialize output buffer + templ = "1 0 0 1 %g %g Tm /%s %g Tf " + # center, right, justify: output each line with its own specifics + text_t = text.splitlines() # split text in lines again + just_tab[-1] = False # never justify last line + for i, t in enumerate(text_t): + pl = maxwidth - pixlen(t) # length of empty line part + pnt = point + c_pnt * (i * lheight_factor) # text start of line + if align == 1: # center: right shift by half width + if rot in (0, 180): + pnt = pnt + Point(pl / 2, 0) * progr + else: + pnt = pnt - Point(0, pl / 2) * progr + elif align == 2: # right: right shift by full width + if rot in (0, 180): + pnt = pnt + Point(pl, 0) * progr + else: + pnt = pnt - Point(0, pl) * progr + elif align == 3: # justify + spaces = t.count(" ") # number of spaces in line + if spaces > 0 and just_tab[i]: # if any, and we may justify + spacing = pl / spaces # make every space this much larger + else: + spacing = 0 # keep normal space length + top = height - pnt.y - self.y + left = pnt.x + self.x + if rot == 90: + left = height - pnt.y - self.y + top = -pnt.x - self.x + elif rot == 270: + left = -height + pnt.y + self.y + top = pnt.x + self.x + elif rot == 180: + left = -pnt.x - self.x + top = -height + pnt.y + self.y + + nres += templ % (left, top, fname, fontsize) + + if render_mode > 0: + nres += "%i Tr " % render_mode + nres += "%g w " % (border_width * fontsize) + + if align == 3: + nres += "%g Tw " % spacing + + if color is not None: + nres += color_str + if fill is not None: + nres += fill_str + nres += "%sTJ\n" % getTJstr(t, tj_glyphs, simple, ordering) + + nres += "ET\n%sQ\n" % emc + + self.text_cont += nres + self.updateRect(rect) + return more + + def finish( + self, + width: float = 1, + 
color: OptSeq = (0,), + fill: OptSeq = None, + lineCap: int = 0, + lineJoin: int = 0, + dashes: OptStr = None, + even_odd: bool = False, + morph: OptSeq = None, + closePath: bool = True, + fill_opacity: float = 1, + stroke_opacity: float = 1, + oc: int = 0, + ) -> None: + """Finish the current drawing segment. + + Notes: + Apply colors, opacity, dashes, line style and width, or + morphing. Also whether to close the path + by connecting last to first point. + """ + if self.draw_cont == "": # treat empty contents as no-op + return + + if width == 0: # border color makes no sense then + color = None + elif color == None: # vice versa + width = 0 + # if color == None and fill == None: + # raise ValueError("at least one of 'color' or 'fill' must be given") + color_str = ColorCode(color, "c") # ensure proper color string + fill_str = ColorCode(fill, "f") # ensure proper fill string + + optcont = self.page._get_optional_content(oc) + if optcont is not None: + self.draw_cont = "/OC /%s BDC\n" % optcont + self.draw_cont + emc = "EMC\n" + else: + emc = "" + + alpha = self.page._set_opacity(CA=stroke_opacity, ca=fill_opacity) + if alpha != None: + self.draw_cont = "/%s gs\n" % alpha + self.draw_cont + + if width != 1 and width != 0: + self.draw_cont += "%g w\n" % width + + if lineCap != 0: + self.draw_cont = "%i J\n" % lineCap + self.draw_cont + if lineJoin != 0: + self.draw_cont = "%i j\n" % lineJoin + self.draw_cont + + if dashes not in (None, "", "[] 0"): + self.draw_cont = "%s d\n" % dashes + self.draw_cont + + if closePath: + self.draw_cont += "h\n" + self.lastPoint = None + + if color is not None: + self.draw_cont += color_str + + if fill is not None: + self.draw_cont += fill_str + if color is not None: + if not even_odd: + self.draw_cont += "B\n" + else: + self.draw_cont += "B*\n" + else: + if not even_odd: + self.draw_cont += "f\n" + else: + self.draw_cont += "f*\n" + else: + self.draw_cont += "S\n" + + self.draw_cont += emc + if CheckMorph(morph): + m1 = Matrix( + 1, 0, 0, 1, morph[0].x + self.x, self.height - morph[0].y - self.y + ) + mat = ~m1 * morph[1] * m1 + self.draw_cont = "%g %g %g %g %g %g cm\n" % JM_TUPLE(mat) + self.draw_cont + + self.totalcont += "\nq\n" + self.draw_cont + "Q\n" + self.draw_cont = "" + self.lastPoint = None + return + + def commit(self, overlay: bool = True) -> None: + """Update the page's /Contents object with Shape data. The argument controls whether data appear in foreground (default) or background.""" + CheckParent(self.page) # doc may have died meanwhile + self.totalcont += self.text_cont + + self.totalcont = self.totalcont.encode() + + if self.totalcont != b"": + # make /Contents object with dummy stream + xref = TOOLS._insert_contents(self.page, b" ", overlay) + # update it with potential compression + self.doc.update_stream(xref, self.totalcont) + + self.lastPoint = None # clean up ... + self.rect = None # + self.draw_cont = "" # for potential ... + self.text_cont = "" # ... + self.totalcont = "" # re-use + return + + +def apply_redactions(page: Page, images: int = 2) -> bool: + """Apply the redaction annotations of the page. + + Args: + page: the PDF page. + images: 0 - ignore images, 1 - remove complete overlapping image, + 2 - blank out overlapping image parts. + """ + + def center_rect(annot_rect, text, font, fsize): + """Calculate minimal sub-rectangle for the overlay text. 
+ + Notes: + Because 'insert_textbox' supports no vertical text centering, + we calculate an approximate number of lines here and return a + sub-rect with smaller height, which should still be sufficient. + Args: + annot_rect: the annotation rectangle + text: the text to insert. + font: the fontname. Must be one of the CJK or Base-14 set, else + the rectangle is returned unchanged. + fsize: the fontsize + Returns: + A rectangle to use instead of the annot rectangle. + """ + if not text: + return annot_rect + try: + text_width = get_text_length(text, font, fsize) + except ValueError: # unsupported font + return annot_rect + line_height = fsize * 1.2 + limit = annot_rect.width + h = math.ceil(text_width / limit) * line_height # estimate rect height + if h >= annot_rect.height: + return annot_rect + r = annot_rect + y = (annot_rect.tl.y + annot_rect.bl.y - h) * 0.5 + r.y0 = y + return r + + CheckParent(page) + doc = page.parent + if doc.is_encrypted or doc.is_closed: + raise ValueError("document closed or encrypted") + if not doc.is_pdf: + raise ValueError("is no PDF") + + redact_annots = [] # storage of annot values + for annot in page.annots(types=(PDF_ANNOT_REDACT,)): # loop redactions + redact_annots.append(annot._get_redact_values()) # save annot values + + if redact_annots == []: # any redactions on this page? + return False # no redactions + + rc = page._apply_redactions(images) # call MuPDF redaction process step + if not rc: # should not happen really + raise ValueError("Error applying redactions.") + + # now write replacement text in old redact rectangles + shape = page.new_shape() + for redact in redact_annots: + annot_rect = redact["rect"] + fill = redact["fill"] + if fill: + shape.draw_rect(annot_rect) # colorize the rect background + shape.finish(fill=fill, color=fill) + if "text" in redact.keys(): # if we also have text + text = redact["text"] + align = redact.get("align", 0) + fname = redact["fontname"] + fsize = redact["fontsize"] + color = redact["text_color"] + # try finding vertical centered sub-rect + trect = center_rect(annot_rect, text, fname, fsize) + + rc = -1 + while rc < 0 and fsize >= 4: # while not enough room + # (re-) try insertion + rc = shape.insert_textbox( + trect, + text, + fontname=fname, + fontsize=fsize, + color=color, + align=align, + ) + fsize -= 0.5 # reduce font if unsuccessful + shape.commit() # append new contents object + return True + + +# ------------------------------------------------------------------------------ +# Remove potentially sensitive data from a PDF. Similar to the Adobe +# Acrobat 'sanitize' function +# ------------------------------------------------------------------------------ +def scrub( + doc: Document, + attached_files: bool = True, + clean_pages: bool = True, + embedded_files: bool = True, + hidden_text: bool = True, + javascript: bool = True, + metadata: bool = True, + redactions: bool = True, + redact_images: int = 0, + remove_links: bool = True, + reset_fields: bool = True, + reset_responses: bool = True, + thumbnails: bool = True, + xml_metadata: bool = True, +) -> None: + def remove_hidden(cont_lines): + """Remove hidden text from a PDF page. + + Args: + cont_lines: list of lines with /Contents content. Should have status + from after page.cleanContents(). + + Returns: + List of /Contents lines from which hidden text has been removed. + + Notes: + The input must have been created after the page's /Contents object(s) + have been cleaned with page.cleanContents(). 
This ensures a standard + formatting: one command per line, single spaces between operators. + This allows for drastic simplification of this code. + """ + out_lines = [] # will return this + in_text = False # indicate if within BT/ET object + suppress = False # indicate text suppression active + make_return = False + for line in cont_lines: + if line == b"BT": # start of text object + in_text = True # switch on + out_lines.append(line) # output it + continue + if line == b"ET": # end of text object + in_text = False # switch off + out_lines.append(line) # output it + continue + if line == b"3 Tr": # text suppression operator + suppress = True # switch on + make_return = True + continue + if line[-2:] == b"Tr" and line[0] != b"3": + suppress = False # text rendering changed + out_lines.append(line) + continue + if line == b"Q": # unstack command also switches off + suppress = False + out_lines.append(line) + continue + if suppress and in_text: # suppress hidden lines + continue + out_lines.append(line) + if make_return: + return out_lines + else: + return None + + if not doc.is_pdf: # only works for PDF + raise ValueError("is no PDF") + if doc.is_encrypted or doc.is_closed: + raise ValueError("closed or encrypted doc") + + if clean_pages is False: + hidden_text = False + redactions = False + + if metadata: + doc.set_metadata({}) # remove standard metadata + + for page in doc: + if reset_fields: + # reset form fields (widgets) + for widget in page.widgets(): + widget.reset() + + if remove_links: + links = page.get_links() # list of all links on page + for link in links: # remove all links + page.delete_link(link) + + found_redacts = False + for annot in page.annots(): + if annot.type[0] == PDF_ANNOT_FILE_ATTACHMENT and attached_files: + annot.update_file(buffer=b" ") # set file content to empty + if reset_responses: + annot.delete_responses() + if annot.type[0] == PDF_ANNOT_REDACT: + found_redacts = True + + if redactions and found_redacts: + page.apply_redactions(images=redact_images) + + if not (clean_pages or hidden_text): + continue # done with the page + + page.clean_contents() + if not page.get_contents(): + continue + if hidden_text: + xref = page.get_contents()[0] # only one b/o cleaning! + cont = doc.xref_stream(xref) + cont_lines = remove_hidden(cont.splitlines()) # remove hidden text + if cont_lines: # something was actually removed + cont = b"\n".join(cont_lines) + doc.update_stream(xref, cont) # rewrite the page /Contents + + if thumbnails: # remove page thumbnails? 
+            if doc.xref_get_key(page.xref, "Thumb")[0] != "null":
+                doc.xref_set_key(page.xref, "Thumb", "null")
+
+    # pages are scrubbed, now perform document-wide scrubbing
+    # remove embedded files
+    if embedded_files:
+        for name in doc.embfile_names():
+            doc.embfile_del(name)
+
+    if xml_metadata:
+        doc.del_xml_metadata()
+    if not (xml_metadata or javascript):
+        xref_limit = 0
+    else:
+        xref_limit = doc.xref_length()
+    for xref in range(1, xref_limit):
+        if not doc.xref_object(xref):
+            msg = "bad xref %i - clean PDF before scrubbing" % xref
+            raise ValueError(msg)
+        if javascript and doc.xref_get_key(xref, "S")[1] == "/JavaScript":
+            # a /JavaScript action object
+            obj = "<</S/JavaScript/JS()>>"  # replace with a null JavaScript
+            doc.update_object(xref, obj)  # update this object
+            continue  # no further handling
+
+        if not xml_metadata:
+            continue
+
+        if doc.xref_get_key(xref, "Type")[1] == "/Metadata":
+            # delete any metadata object directly
+            doc.update_object(xref, "<<>>")
+            doc.update_stream(xref, b"deleted", new=True)
+            continue
+
+        if doc.xref_get_key(xref, "Metadata")[0] != "null":
+            doc.xref_set_key(xref, "Metadata", "null")
+
+
+def fill_textbox(
+    writer: TextWriter,
+    rect: rect_like,
+    text: typing.Union[str, list],
+    pos: point_like = None,
+    font: typing.Optional[Font] = None,
+    fontsize: float = 11,
+    lineheight: OptFloat = None,
+    align: int = 0,
+    warn: bool = None,
+    right_to_left: bool = False,
+    small_caps: bool = False,
+) -> tuple:
+    """Fill a rectangle with text.
+
+    Args:
+        writer: TextWriter object (= "self")
+        rect: rect-like to receive the text.
+        text: string or list/tuple of strings.
+        pos: point-like start position of first word.
+        font: Font object (default Font('helv')).
+        fontsize: the fontsize.
+        lineheight: (float) overwrite the font's line height.
+        align: (int) 0 = left, 1 = center, 2 = right, 3 = justify
+        warn: (bool) text overflow action: None = ignore silently, True =
+            print a warning, False = raise an exception.
+        right_to_left: (bool) indicate right-to-left language.
+        small_caps: (bool) use small capitals where the font supports them.
+ """ + rect = Rect(rect) + if rect.is_empty: + raise ValueError("fill rect must not empty.") + if type(font) is not Font: + font = Font("helv") + + def textlen(x): + """Return length of a string.""" + return font.text_length( + x, fontsize=fontsize, small_caps=small_caps + ) # abbreviation + + def char_lengths(x): + """Return list of single character lengths for a string.""" + return font.char_lengths(x, fontsize=fontsize, small_caps=small_caps) + + def append_this(pos, text): + return writer.append( + pos, text, font=font, fontsize=fontsize, small_caps=small_caps + ) + + tolerance = fontsize * 0.2 # extra distance to left border + space_len = textlen(" ") + std_width = rect.width - tolerance + std_start = rect.x0 + tolerance + + def norm_words(width, words): + """Cut any word in pieces no longer than 'width'.""" + nwords = [] + word_lengths = [] + for w in words: + wl_lst = char_lengths(w) + wl = sum(wl_lst) + if wl <= width: # nothing to do - copy over + nwords.append(w) + word_lengths.append(wl) + continue + + # word longer than rect width - split it in parts + n = len(wl_lst) + while n > 0: + wl = sum(wl_lst[:n]) + if wl <= width: + nwords.append(w[:n]) + word_lengths.append(wl) + w = w[n:] + wl_lst = wl_lst[n:] + n = len(wl_lst) + else: + n -= 1 + return nwords, word_lengths + + def output_justify(start, line): + """Justified output of a line.""" + # ignore leading / trailing / multiple spaces + words = [w for w in line.split(" ") if w != ""] + nwords = len(words) + if nwords == 0: + return + if nwords == 1: # single word cannot be justified + append_this(start, words[0]) + return + tl = sum([textlen(w) for w in words]) # total word lengths + gaps = nwords - 1 # number of word gaps + gapl = (std_width - tl) / gaps # width of each gap + for w in words: + _, lp = append_this(start, w) # output one word + start.x = lp.x + gapl # next start at word end plus gap + return + + asc = font.ascender + dsc = font.descender + if not lineheight: + if asc - dsc <= 1: + lheight = 1.2 + else: + lheight = asc - dsc + else: + lheight = lineheight + + LINEHEIGHT = fontsize * lheight # effective line height + width = std_width # available horizontal space + + # starting point of text + if pos is not None: + pos = Point(pos) + else: # default is just below rect top-left + pos = rect.tl + (tolerance, fontsize * asc) + if not pos in rect: + raise ValueError("Text must start in rectangle.") + + # calculate displacement factor for alignment + if align == TEXT_ALIGN_CENTER: + factor = 0.5 + elif align == TEXT_ALIGN_RIGHT: + factor = 1.0 + else: + factor = 0 + + # split in lines if just a string was given + if type(text) is str: + textlines = text.splitlines() + else: + textlines = [] + for line in text: + textlines.extend(line.splitlines()) + + max_lines = int((rect.y1 - pos.y) / LINEHEIGHT) + 1 + + new_lines = [] # the final list of textbox lines + no_justify = [] # no justify for these line numbers + for i, line in enumerate(textlines): + if line in ("", " "): + new_lines.append((line, space_len)) + width = rect.width - tolerance + no_justify.append((len(new_lines) - 1)) + continue + if i == 0: + width = rect.x1 - pos.x + else: + width = rect.width - tolerance + + if right_to_left: # reverses Arabic / Hebrew text front to back + line = writer.clean_rtl(line) + tl = textlen(line) + if tl <= width: # line short enough + new_lines.append((line, tl)) + no_justify.append((len(new_lines) - 1)) + continue + + # we need to split the line in fitting parts + words = line.split(" ") # the words in the line + + # cut 
in parts any words that are longer than rect width + words, word_lengths = norm_words(std_width, words) + + n = len(words) + while True: + line0 = " ".join(words[:n]) + wl = sum(word_lengths[:n]) + space_len * (len(word_lengths[:n]) - 1) + if wl <= width: + new_lines.append((line0, wl)) + words = words[n:] + word_lengths = word_lengths[n:] + n = len(words) + line0 = None + else: + n -= 1 + + if len(words) == 0: + break + + # ------------------------------------------------------------------------- + # List of lines created. Each item is (text, tl), where 'tl' is the PDF + # output length (float) and 'text' is the text. Except for justified text, + # this is output-ready. + # ------------------------------------------------------------------------- + nlines = len(new_lines) + if nlines > max_lines: + msg = "Only fitting %i of %i lines." % (max_lines, nlines) + if warn == True: + print("Warning: " + msg) + elif warn == False: + raise ValueError(msg) + + start = Point() + no_justify += [len(new_lines) - 1] # no justifying of last line + for i in range(max_lines): + try: + line, tl = new_lines.pop(0) + except IndexError: + break + + if right_to_left: # Arabic, Hebrew + line = "".join(reversed(line)) + + if i == 0: # may have different start for first line + start = pos + + if align == TEXT_ALIGN_JUSTIFY and i not in no_justify and tl < std_width: + output_justify(start, line) + start.x = std_start + start.y += LINEHEIGHT + continue + + if i > 0 or pos.x == std_start: # left, center, right alignments + start.x += (width - tl) * factor + + append_this(start, line) + start.x = std_start + start.y += LINEHEIGHT + + return new_lines # return non-written lines + + +# ------------------------------------------------------------------------ +# Optional Content functions +# ------------------------------------------------------------------------ +def get_oc(doc: Document, xref: int) -> int: + """Return optional content object xref for an image or form xobject. + + Args: + xref: (int) xref number of an image or form xobject. + """ + if doc.is_closed or doc.is_encrypted: + raise ValueError("document close or encrypted") + t, name = doc.xref_get_key(xref, "Subtype") + if t != "name" or name not in ("/Image", "/Form"): + raise ValueError("bad object type at xref %i" % xref) + t, oc = doc.xref_get_key(xref, "OC") + if t != "xref": + return 0 + rc = int(oc.replace("0 R", "")) + return rc + + +def set_oc(doc: Document, xref: int, oc: int) -> None: + """Attach optional content object to image or form xobject. + + Args: + xref: (int) xref number of an image or form xobject + oc: (int) xref number of an OCG or OCMD + """ + if doc.is_closed or doc.is_encrypted: + raise ValueError("document close or encrypted") + t, name = doc.xref_get_key(xref, "Subtype") + if t != "name" or name not in ("/Image", "/Form"): + raise ValueError("bad object type at xref %i" % xref) + if oc > 0: + t, name = doc.xref_get_key(oc, "Type") + if t != "name" or name not in ("/OCG", "/OCMD"): + raise ValueError("bad object type at xref %i" % oc) + if oc == 0 and "OC" in doc.xref_get_keys(xref): + doc.xref_set_key(xref, "OC", "null") + return None + doc.xref_set_key(xref, "OC", "%i 0 R" % oc) + return None + + +def set_ocmd( + doc: Document, + xref: int = 0, + ocgs: typing.Union[list, None] = None, + policy: OptStr = None, + ve: typing.Union[list, None] = None, +) -> int: + """Create or update an OCMD object in a PDF document. + + Args: + xref: (int) 0 for creating a new object, otherwise update existing one. 
+        ocgs: (list) OCG xref numbers, which shall be subject to 'policy'.
+        policy: one of 'AllOn', 'AllOff', 'AnyOn', 'AnyOff' (any casing).
+        ve: (list) visibility expression. Use instead of 'ocgs' with 'policy'.
+
+    Returns:
+        Xref of the created or updated OCMD.
+    """
+
+    all_ocgs = set(doc.get_ocgs().keys())
+
+    def ve_maker(ve):
+        if type(ve) not in (list, tuple) or len(ve) < 2:
+            raise ValueError("bad 've' format: %s" % ve)
+        if ve[0].lower() not in ("and", "or", "not"):
+            raise ValueError("bad operand: %s" % ve[0])
+        if ve[0].lower() == "not" and len(ve) != 2:
+            raise ValueError("bad 've' format: %s" % ve)
+        item = "[/%s" % ve[0].title()
+        for x in ve[1:]:
+            if type(x) is int:
+                if x not in all_ocgs:
+                    raise ValueError("bad OCG %i" % x)
+                item += " %i 0 R" % x
+            else:
+                item += " %s" % ve_maker(x)
+        item += "]"
+        return item
+
+    text = "<</Type/OCMD"
+
+    if ocgs and type(ocgs) in (list, tuple):  # some OCGs are provided
+        s = set(ocgs).difference(all_ocgs)  # contains illegal xrefs
+        if s != set():
+            msg = "bad OCGs: %s" % s
+            raise ValueError(msg)
+        text += "/OCGs[" + " ".join(map(lambda x: "%i 0 R" % x, ocgs)) + "]"
+
+    if policy:
+        policy = str(policy).lower()
+        pols = {
+            "anyon": "AnyOn",
+            "allon": "AllOn",
+            "anyoff": "AnyOff",
+            "alloff": "AllOff",
+        }
+        if policy not in pols.keys():
+            raise ValueError("bad policy: %s" % policy)
+        text += "/P/%s" % pols[policy]
+
+    if ve:
+        text += "/VE%s" % ve_maker(ve)
+
+    text += ">>"
+
+    # make new object or replace old OCMD (check type first)
+    if xref == 0:
+        xref = doc.get_new_xref()
+    elif "/Type/OCMD" not in doc.xref_object(xref, compressed=True):
+        raise ValueError("bad xref or not an OCMD")
+
+    doc.update_object(xref, text)
+    return xref
+
+
+def get_ocmd(doc: Document, xref: int) -> dict:
+    """Return the definition of an OCMD (optional content membership dictionary).
+
+    Recognizes PDF dict keys /OCGs (PDF array of OCGs), /P (policy string) and
+    /VE (visibility expression, PDF array). Via string manipulation, this
+    info is converted to a Python dictionary with keys "xref", "ocgs", "policy"
+    and "ve" - ready to recycle as input for 'set_ocmd()'.
+    """
+
+    if xref not in range(doc.xref_length()):
+        raise ValueError("bad xref")
+    text = doc.xref_object(xref, compressed=True)
+    if "/Type/OCMD" not in text:
+        raise ValueError("bad object type")
+    textlen = len(text)
+
+    p0 = text.find("/OCGs[")  # look for /OCGs key
+    p1 = text.find("]", p0)
+    if p0 < 0 or p1 < 0:  # no OCGs found
+        ocgs = None
+    else:
+        ocgs = text[p0 + 6 : p1].replace("0 R", " ").split()
+        ocgs = list(map(int, ocgs))
+
+    p0 = text.find("/P/")  # look for /P policy key
+    if p0 < 0:
+        policy = None
+    else:
+        p1 = text.find("ff", p0)
+        if p1 < 0:
+            p1 = text.find("on", p0)
+        if p1 < 0:  # some irregular syntax
+            raise ValueError("bad object at xref")
+        else:
+            policy = text[p0 + 3 : p1 + 2]
+
+    p0 = text.find("/VE[")  # look for /VE visibility expression key
+    if p0 < 0:  # no visibility expression found
+        ve = None
+    else:
+        lp = rp = 0  # find end of /VE by finding last ']'.
+        p1 = p0
+        while lp < 1 or lp != rp:
+            p1 += 1
+            if not p1 < textlen:  # some irregular syntax
+                raise ValueError("bad object at xref")
+            if text[p1] == "[":
+                lp += 1
+            if text[p1] == "]":
+                rp += 1
+        # p1 now positioned at the last "]"
+        ve = text[p0 + 3 : p1 + 1]  # the PDF /VE array
+        ve = (
+            ve.replace("/And", '"and",')
+            .replace("/Not", '"not",')
+            .replace("/Or", '"or",')
+        )
+        ve = ve.replace(" 0 R]", "]").replace(" 0 R", ",").replace("][", "],[")
+        try:
+            ve = json.loads(ve)
+        except:
+            print("bad /VE key: ", ve)
+            raise
+    return {"xref": xref, "ocgs": ocgs, "policy": policy, "ve": ve}
+
+
+"""
+Handle page labels for PDF documents.
+
+Reading
+-------
+* compute the label of a page
+* find page number(s) having the given label.
+
+Writing
+-------
+Supports setting (defining) page labels for PDF documents.
+
+A big Thank You goes to WILLIAM CHAPMAN who contributed the idea and
+significant parts of the following code during late December 2020
+through early January 2021.
+"""
+
+
+def rule_dict(item):
+    """Make a Python dict from a PDF page label rule.
+
+    Args:
+        item -- a tuple (pno, rule) with the start page number and the rule
+                string like "<</S/D...>>".
+    Returns:
+        A dict like
+        {'startpage': int, 'prefix': str, 'style': str, 'firstpagenum': int}.
+ """ + # Jorj McKie, 2021-01-06 + + pno, rule = item + rule = rule[2:-2].split("/")[1:] # strip "<<" and ">>" + d = {"startpage": pno, "prefix": "", "firstpagenum": 1} + skip = False + for i, item in enumerate(rule): + if skip: # this item has already been processed + skip = False # deactivate skipping again + continue + if item == "S": # style specification + d["style"] = rule[i + 1] # next item has the style + skip = True # do not process next item again + continue + if item.startswith("P"): # prefix specification: extract the string + x = item[1:].replace("(", "").replace(")", "") + d["prefix"] = x + continue + if item.startswith("St"): # start page number specification + x = int(item[2:]) + d["firstpagenum"] = x + return d + + +def get_label_pno(pgNo, labels): + """Return the label for this page number. + + Args: + pgNo: page number, 0-based. + labels: result of doc._get_page_labels(). + Returns: + The label (str) of the page number. Errors return an empty string. + """ + # Jorj McKie, 2021-01-06 + + item = [x for x in labels if x[0] <= pgNo][-1] + rule = rule_dict(item) + prefix = rule.get("prefix", "") + style = rule.get("style", "") + pagenumber = pgNo - rule["startpage"] + rule["firstpagenum"] + return construct_label(style, prefix, pagenumber) + + +def get_label(page): + """Return the label for this PDF page. + + Args: + page: page object. + Returns: + The label (str) of the page. Errors return an empty string. + """ + # Jorj McKie, 2021-01-06 + + labels = page.parent._get_page_labels() + if not labels: + return "" + labels.sort() + return get_label_pno(page.number, labels) + + +def get_page_numbers(doc, label, only_one=False): + """Return a list of page numbers with the given label. + + Args: + doc: PDF document object (resp. 'self'). + label: (str) label. + only_one: (bool) stop searching after first hit. + Returns: + List of page numbers having this label. 
+ """ + # Jorj McKie, 2021-01-06 + + numbers = [] + if not label: + return numbers + labels = doc._get_page_labels() + if labels == []: + return numbers + for i in range(doc.page_count): + plabel = get_label_pno(i, labels) + if plabel == label: + numbers.append(i) + if only_one: + break + return numbers + + +def construct_label(style, prefix, pno) -> str: + """Construct a label based on style, prefix and page number.""" + # William Chapman, 2021-01-06 + + n_str = "" + if style == "D": + n_str = str(pno) + elif style == "r": + n_str = integerToRoman(pno).lower() + elif style == "R": + n_str = integerToRoman(pno).upper() + elif style == "a": + n_str = integerToLetter(pno).lower() + elif style == "A": + n_str = integerToLetter(pno).upper() + result = prefix + n_str + return result + + +def integerToLetter(i) -> str: + """Returns letter sequence string for integer i.""" + # William Chapman, Jorj McKie, 2021-01-06 + + ls = string.ascii_uppercase + n, a = 1, i + while pow(26, n) <= a: + a -= int(math.pow(26, n)) + n += 1 + + str_t = "" + for j in reversed(range(n)): + f, g = divmod(a, int(math.pow(26, j))) + str_t += ls[f] + a = g + return str_t + + +def integerToRoman(num: int) -> str: + """Return roman numeral for an integer.""" + # William Chapman, Jorj McKie, 2021-01-06 + + roman = ( + (1000, "M"), + (900, "CM"), + (500, "D"), + (400, "CD"), + (100, "C"), + (90, "XC"), + (50, "L"), + (40, "XL"), + (10, "X"), + (9, "IX"), + (5, "V"), + (4, "IV"), + (1, "I"), + ) + + def roman_num(num): + for r, ltr in roman: + x, _ = divmod(num, r) + yield ltr * x + num -= r * x + if num <= 0: + break + + return "".join([a for a in roman_num(num)]) + + +def get_page_labels(doc): + """Return page label definitions in PDF document. + + Args: + doc: PDF document (resp. 'self'). + Returns: + A list of dictionaries with the following format: + {'startpage': int, 'prefix': str, 'style': str, 'firstpagenum': int}. + """ + # Jorj McKie, 2021-01-10 + return [rule_dict(item) for item in doc._get_page_labels()] + + +def set_page_labels(doc, labels): + """Add / replace page label definitions in PDF document. + + Args: + doc: PDF document (resp. 'self'). + labels: list of label dictionaries like: + {'startpage': int, 'prefix': str, 'style': str, 'firstpagenum': int}, + as returned by get_page_labels(). + """ + # William Chapman, 2021-01-06 + + def create_label_str(label): + """Convert Python label dict to correspnding PDF rule string. + + Args: + label: (dict) build rule for the label. + Returns: + PDF label rule string wrapped in "<<", ">>". + """ + s = "%i<<" % label["startpage"] + if label.get("prefix", "") != "": + s += "/P(%s)" % label["prefix"] + if label.get("style", "") != "": + s += "/S/%s" % label["style"] + if label.get("firstpagenum", 1) > 1: + s += "/St %i" % label["firstpagenum"] + s += ">>" + return s + + def create_nums(labels): + """Return concatenated string of all labels rules. + + Args: + labels: (list) dictionaries as created by function 'rule_dict'. + Returns: + PDF compatible string for page label definitions, ready to be + enclosed in PDF array 'Nums[...]'. 
+ """ + labels.sort(key=lambda x: x["startpage"]) + s = "".join([create_label_str(label) for label in labels]) + return s + + doc._set_page_labels(create_nums(labels)) + + +# End of Page Label Code ------------------------------------------------- + + +def has_links(doc: Document) -> bool: + """Check whether there are links on any page.""" + if doc.is_closed: + raise ValueError("document closed") + if not doc.is_pdf: + raise ValueError("is no PDF") + for i in range(doc.page_count): + for item in doc.page_annot_xrefs(i): + if item[1] == PDF_ANNOT_LINK: + return True + return False + + +def has_annots(doc: Document) -> bool: + """Check whether there are annotations on any page.""" + if doc.is_closed: + raise ValueError("document closed") + if not doc.is_pdf: + raise ValueError("is no PDF") + for i in range(doc.page_count): + for item in doc.page_annot_xrefs(i): + if not (item[1] == PDF_ANNOT_LINK or item[1] == PDF_ANNOT_WIDGET): + return True + return False + + +# ------------------------------------------------------------------- +# Functions to recover the quad contained in a text extraction bbox +# ------------------------------------------------------------------- +def recover_bbox_quad(line_dir: tuple, span: dict, bbox: tuple) -> Quad: + """Compute the quad located inside the bbox. + + The bbox may be any of the resp. tuples occurring inside the given span. + + Args: + line_dir: (tuple) 'line["dir"]' of the owning line or None. + span: (dict) the span. May be from get_texttrace() method. + bbox: (tuple) the bbox of the span or any of its characters. + Returns: + The quad which is wrapped by the bbox. + """ + if line_dir == None: + line_dir = span["dir"] + cos, sin = line_dir + bbox = Rect(bbox) # make it a rect + if TOOLS.set_small_glyph_heights(): # ==> just fontsize as height + d = 1 + else: + d = span["ascender"] - span["descender"] + + height = d * span["size"] # the quad's rectangle height + # The following are distances from the bbox corners, at wich we find the + # respective quad points. The computation depends on in which quadrant + # the text writing angle is located. + hs = height * sin + hc = height * cos + if hc >= 0 and hs <= 0: # quadrant 1 + ul = bbox.bl - (0, hc) + ur = bbox.tr + (hs, 0) + ll = bbox.bl - (hs, 0) + lr = bbox.tr + (0, hc) + elif hc <= 0 and hs <= 0: # quadrant 2 + ul = bbox.br + (hs, 0) + ur = bbox.tl - (0, hc) + ll = bbox.br + (0, hc) + lr = bbox.tl - (hs, 0) + elif hc <= 0 and hs >= 0: # quadrant 3 + ul = bbox.tr - (0, hc) + ur = bbox.bl + (hs, 0) + ll = bbox.tr - (hs, 0) + lr = bbox.bl + (0, hc) + else: # quadrant 4 + ul = bbox.tl + (hs, 0) + ur = bbox.br - (0, hc) + ll = bbox.tl + (0, hc) + lr = bbox.br - (hs, 0) + return Quad(ul, ur, ll, lr) + + +def recover_quad(line_dir: tuple, span: dict) -> Quad: + """Recover the quadrilateral of a text span. + + Args: + line_dir: (tuple) 'line["dir"]' of the owning line. + span: the span. + Returns: + The quadrilateral enveloping the span's text. + """ + if type(line_dir) is not tuple or len(line_dir) != 2: + raise ValueError("bad line dir argument") + if type(span) is not dict: + raise ValueError("bad span argument") + return recover_bbox_quad(line_dir, span, span["bbox"]) + + +def recover_line_quad(line: dict, spans: list = None) -> Quad: + """Calculate the line quad for 'dict' / 'rawdict' text extractions. + + The lower quad points are those of the first, resp. last span quad. + The upper points are determined by the maximum span quad height. 
+ From this, compute a rect with bottom-left in (0, 0), convert this to a + quad and rotate and shift back to cover the text of the spans. + + Args: + spans: (list, optional) sub-list of spans to consider. + Returns: + Quad covering selected spans. + """ + if spans == None: # no sub-selection + spans = line["spans"] # all spans + if len(spans) == 0: + raise ValueError("bad span list") + line_dir = line["dir"] # text direction + cos, sin = line_dir + q0 = recover_quad(line_dir, spans[0]) # quad of first span + if len(spans) > 1: # get quad of last span + q1 = recover_quad(line_dir, spans[-1]) + else: + q1 = q0 # last = first + + line_ll = q0.ll # lower-left of line quad + line_lr = q1.lr # lower-right of line quad + + mat0 = planish_line(line_ll, line_lr) + + # map base line to x-axis such that line_ll goes to (0, 0) + x_lr = line_lr * mat0 + + small = TOOLS.set_small_glyph_heights() # small glyph heights? + + h = max( + [s["size"] * (1 if small else (s["ascender"] - s["descender"])) for s in spans] + ) + + line_rect = Rect(0, -h, x_lr.x, 0) # line rectangle + line_quad = line_rect.quad # make it a quad and: + line_quad *= ~mat0 + return line_quad + + +def recover_span_quad(line_dir: tuple, span: dict, chars: list = None) -> Quad: + """Calculate the span quad for 'dict' / 'rawdict' text extractions. + + Notes: + There are two execution paths: + 1. For the full span quad, the result of 'recover_quad' is returned. + 2. For the quad of a sub-list of characters, the char quads are + computed and joined. This is only supported for the "rawdict" + extraction option. + + Args: + line_dir: (tuple) 'line["dir"]' of the owning line. + span: (dict) the span. + chars: (list, optional) sub-list of characters to consider. + Returns: + Quad covering selected characters. + """ + if line_dir == None: # must be a span from get_texttrace() + line_dir = span["dir"] + if chars == None: # no sub-selection + return recover_quad(line_dir, span) + if not "chars" in span.keys(): + raise ValueError("need 'rawdict' option to sub-select chars") + + q0 = recover_char_quad(line_dir, span, chars[0]) # quad of first char + if len(chars) > 1: # get quad of last char + q1 = recover_char_quad(line_dir, span, chars[-1]) + else: + q1 = q0 # last = first + + span_ll = q0.ll # lower-left of span quad + span_lr = q1.lr # lower-right of span quad + mat0 = planish_line(span_ll, span_lr) + # map base line to x-axis such that span_ll goes to (0, 0) + x_lr = span_lr * mat0 + + small = TOOLS.set_small_glyph_heights() # small glyph heights? + h = span["size"] * (1 if small else (span["ascender"] - span["descender"])) + + span_rect = Rect(0, -h, x_lr.x, 0) # line rectangle + span_quad = span_rect.quad # make it a quad and: + span_quad *= ~mat0 # rotate back and shift back + return span_quad + + +def recover_char_quad(line_dir: tuple, span: dict, char: dict) -> Quad: + """Recover the quadrilateral of a text character. + + This requires the "rawdict" option of text extraction. + + Args: + line_dir: (tuple) 'line["dir"]' of the span's line. + span: (dict) the span dict. + char: (dict) the character dict. + Returns: + The quadrilateral enveloping the character. 
+ """ + if line_dir == None: + line_dir = span["dir"] + if type(line_dir) is not tuple or len(line_dir) != 2: + raise ValueError("bad line dir argument") + if type(span) is not dict: + raise ValueError("bad span argument") + if type(char) is dict: + bbox = Rect(char["bbox"]) + elif type(char) is tuple: + bbox = Rect(char[3]) + else: + raise ValueError("bad span argument") + + return recover_bbox_quad(line_dir, span, bbox) + + +# ------------------------------------------------------------------- +# Building font subsets using fontTools +# ------------------------------------------------------------------- +def subset_fonts(doc: Document, verbose: bool = False) -> None: + """Build font subsets of a PDF. Requires package 'fontTools'. + + Eligible fonts are potentially replaced by smaller versions. Page text is + NOT rewritten and thus should retain properties like being hidden or + controlled by optional content. + """ + # Font binaries: - "buffer" -> (names, xrefs, (unicodes, glyphs)) + # An embedded font is uniquely defined by its fontbuffer only. It may have + # multiple names and xrefs. + # Once the sets of used unicodes and glyphs are known, we compute a + # smaller version of the buffer user package fontTools. + font_buffers = {} + + def get_old_widths(xref): + """Retrieve old font '/W' and '/DW' values.""" + df = doc.xref_get_key(xref, "DescendantFonts") + if df[0] != "array": # only handle xref specifications + return None, None + df_xref = int(df[1][1:-1].replace("0 R", "")) + widths = doc.xref_get_key(df_xref, "W") + if widths[0] != "array": # no widths key found + widths = None + else: + widths = widths[1] + dwidths = doc.xref_get_key(df_xref, "DW") + if dwidths[0] != "int": + dwidths = None + else: + dwidths = dwidths[1] + return widths, dwidths + + def set_old_widths(xref, widths, dwidths): + """Restore the old '/W' and '/DW' in subsetted font. + + If either parameter is None or evaluates to False, the corresponding + dictionary key will be set to null. + """ + df = doc.xref_get_key(xref, "DescendantFonts") + if df[0] != "array": # only handle xref specs + return None + df_xref = int(df[1][1:-1].replace("0 R", "")) + if (type(widths) is not str or not widths) and doc.xref_get_key(df_xref, "W")[ + 0 + ] != "null": + doc.xref_set_key(df_xref, "W", "null") + else: + doc.xref_set_key(df_xref, "W", widths) + if (type(dwidths) is not str or not dwidths) and doc.xref_get_key( + df_xref, "DW" + )[0] != "null": + doc.xref_set_key(df_xref, "DW", "null") + else: + doc.xref_set_key(df_xref, "DW", dwidths) + return None + + def set_subset_fontname(new_xref): + """Generate a name prefix to tag a font as subset. + + We use a random generator to select 6 upper case ASCII characters. + The prefixed name must be put in the font xref as the "/BaseFont" value + and in the FontDescriptor object as the '/FontName' value. 
+ """ + # The following generates a prefix like 'ABCDEF+' + prefix = "".join(random.choices(tuple(string.ascii_uppercase), k=6)) + "+" + font_str = doc.xref_object(new_xref, compressed=True) + font_str = font_str.replace("/BaseFont/", "/BaseFont/" + prefix) + df = doc.xref_get_key(new_xref, "DescendantFonts") + if df[0] == "array": + df_xref = int(df[1][1:-1].replace("0 R", "")) + fd = doc.xref_get_key(df_xref, "FontDescriptor") + if fd[0] == "xref": + fd_xref = int(fd[1].replace("0 R", "")) + fd_str = doc.xref_object(fd_xref, compressed=True) + fd_str = fd_str.replace("/FontName/", "/FontName/" + prefix) + doc.update_object(fd_xref, fd_str) + doc.update_object(new_xref, font_str) + return None + + def build_subset(buffer, unc_set, gid_set): + """Build font subset using fontTools. + + Args: + buffer: (bytes) the font given as a binary buffer. + unc_set: (set) required glyph ids. + Returns: + Either None if subsetting is unsuccessful or the subset font buffer. + """ + try: + import fontTools.subset as fts + except ImportError: + print("This method requires fontTools to be installed.") + raise + tmp_dir = tempfile.gettempdir() + oldfont_path = f"{tmp_dir}/oldfont.ttf" + newfont_path = f"{tmp_dir}/newfont.ttf" + uncfile_path = f"{tmp_dir}/uncfile.txt" + args = [ + oldfont_path, + "--retain-gids", + f"--output-file={newfont_path}", + "--layout-features='*'", + "--passthrough-tables", + "--ignore-missing-glyphs", + "--ignore-missing-unicodes", + "--symbol-cmap", + ] + + unc_file = open( + f"{tmp_dir}/uncfile.txt", "w" + ) # store glyph ids or unicodes as file + if 0xFFFD in unc_set: # error unicode exists -> use glyphs + args.append(f"--gids-file={uncfile_path}") + gid_set.add(189) + unc_list = list(gid_set) + for unc in unc_list: + unc_file.write("%i\n" % unc) + else: + args.append(f"--unicodes-file={uncfile_path}") + unc_set.add(255) + unc_list = list(unc_set) + for unc in unc_list: + unc_file.write("%04x\n" % unc) + + unc_file.close() + fontfile = open(oldfont_path, "wb") # store fontbuffer as a file + fontfile.write(buffer) + fontfile.close() + try: + os.remove(newfont_path) # remove old file + except: + pass + try: # invoke fontTools subsetter + fts.main(args) + font = Font(fontfile=newfont_path) + new_buffer = font.buffer + if len(font.valid_codepoints()) == 0: + new_buffer = None + except: + new_buffer = None + try: + os.remove(uncfile_path) + except: + pass + try: + os.remove(oldfont_path) + except: + pass + try: + os.remove(newfont_path) + except: + pass + return new_buffer + + def repl_fontnames(doc): + """Populate 'font_buffers'. + + For each font candidate, store its xref and the list of names + by which PDF text may refer to it (there may be multiple). + """ + + def norm_name(name): + """Recreate font name that contains PDF hex codes. + + E.g. #20 -> space, chr(32) + """ + while "#" in name: + p = name.find("#") + c = int(name[p + 1 : p + 3], 16) + name = name.replace(name[p : p + 3], chr(c)) + return name + + def get_fontnames(doc, item): + """Return a list of fontnames for an item of page.get_fonts(). + + There may be multiple names e.g. for Type0 fonts. 
+ """ + fontname = item[3] + names = [fontname] + fontname = doc.xref_get_key(item[0], "BaseFont")[1][1:] + fontname = norm_name(fontname) + if fontname not in names: + names.append(fontname) + descendents = doc.xref_get_key(item[0], "DescendantFonts") + if descendents[0] != "array": + return names + descendents = descendents[1][1:-1] + if descendents.endswith(" 0 R"): + xref = int(descendents[:-4]) + descendents = doc.xref_object(xref, compressed=True) + p1 = descendents.find("/BaseFont") + if p1 >= 0: + p2 = descendents.find("/", p1 + 1) + p1 = min(descendents.find("/", p2 + 1), descendents.find(">>", p2 + 1)) + fontname = descendents[p2 + 1 : p1] + fontname = norm_name(fontname) + if fontname not in names: + names.append(fontname) + return names + + for i in range(doc.page_count): + for f in doc.get_page_fonts(i, full=True): + font_xref = f[0] # font xref + font_ext = f[1] # font file extension + basename = f[3] # font basename + + if font_ext not in ( # skip if not supported by fontTools + "otf", + "ttf", + "woff", + "woff2", + ): + continue + # skip fonts which already are subsets + if len(basename) > 6 and basename[6] == "+": + continue + + extr = doc.extract_font(font_xref) + fontbuffer = extr[-1] + names = get_fontnames(doc, f) + name_set, xref_set, subsets = font_buffers.get( + fontbuffer, (set(), set(), (set(), set())) + ) + xref_set.add(font_xref) + for name in names: + name_set.add(name) + font = Font(fontbuffer=fontbuffer) + name_set.add(font.name) + del font + font_buffers[fontbuffer] = (name_set, xref_set, subsets) + return None + + def find_buffer_by_name(name): + for buffer in font_buffers.keys(): + name_set, _, _ = font_buffers[buffer] + if name in name_set: + return buffer + return None + + # ----------------- + # main function + # ----------------- + repl_fontnames(doc) # populate font information + if not font_buffers: # nothing found to do + if verbose: + print("No fonts to subset.") + return 0 + + old_fontsize = 0 + new_fontsize = 0 + for fontbuffer in font_buffers.keys(): + old_fontsize += len(fontbuffer) + + # Scan page text for usage of subsettable fonts + for page in doc: + # go through the text and extend set of used glyphs by font + # we use a modified MuPDF trace device, which delivers us glyph ids. 
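+        # A rough sketch of the data consumed below (not the complete
+        # get_texttrace() output): each span is a dict, and each entry of
+        # span["chars"] is a tuple starting with (unicode, glyph id, ...),
+        # e.g. {"font": "SomeFont", "chars": [(97, 68, ...), ...], ...}.
+        # Only span["font"], c[0] (unicode) and c[1] (glyph id) are used here.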
+ for span in page.get_texttrace(): + if type(span) is not dict: # skip useless information + continue + fontname = span["font"][:33] # fontname for the span + buffer = find_buffer_by_name(fontname) + if buffer is None: + continue + name_set, xref_set, (set_ucs, set_gid) = font_buffers[buffer] + for c in span["chars"]: + set_ucs.add(c[0]) # unicode + set_gid.add(c[1]) # glyph id + font_buffers[buffer] = (name_set, xref_set, (set_ucs, set_gid)) + + # build the font subsets + for old_buffer in font_buffers.keys(): + name_set, xref_set, subsets = font_buffers[old_buffer] + new_buffer = build_subset(old_buffer, subsets[0], subsets[1]) + fontname = list(name_set)[0] + if new_buffer == None or len(new_buffer) >= len(old_buffer): + # subset was not created or did not get smaller + if verbose: + print(f"Cannot subset '{fontname}'.") + continue + if verbose: + print(f"Built subset of font '{fontname}'.") + val = doc._insert_font(fontbuffer=new_buffer) # store subset font in PDF + new_xref = val[0] # get its xref + set_subset_fontname(new_xref) # tag fontname as subset font + font_str = doc.xref_object( # get its object definition + new_xref, + compressed=True, + ) + # walk through the original font xrefs and replace each by the subset def + for font_xref in xref_set: + # we need the original '/W' and '/DW' width values + width_table, def_width = get_old_widths(font_xref) + # ... and replace original font definition at xref with it + doc.update_object(font_xref, font_str) + # now copy over old '/W' and '/DW' values + if width_table or def_width: + set_old_widths(font_xref, width_table, def_width) + # 'new_xref' remains unused in the PDF and must be removed + # by garbage collection. + new_fontsize += len(new_buffer) + + return old_fontsize - new_fontsize + + +# ------------------------------------------------------------------- +# Copy XREF object to another XREF +# ------------------------------------------------------------------- +def xref_copy(doc: Document, source: int, target: int, *, keep: list = None) -> None: + """Copy a PDF dictionary object to another one given their xref numbers. + + Args: + doc: PDF document object + source: source xref number + target: target xref number, the xref must already exist + keep: an optional list of 1st level keys in target that should not be + removed before copying. + Notes: + This works similar to the copy() method of dictionaries in Python. The + source may be a stream object. + """ + if doc.xref_is_stream(source): + # read new xref stream, maintaining compression + stream = doc.xref_stream_raw(source) + doc.update_stream( + target, + stream, + compress=False, # keeps source compression + new=True, # in case target is no stream + ) + + # empty the target completely, observe exceptions + if keep is None: + keep = [] + for key in doc.xref_get_keys(target): + if key in keep: + continue + doc.xref_set_key(target, key, "null") + # copy over all source dict items + for key in doc.xref_get_keys(source): + item = doc.xref_get_key(source, key) + doc.xref_set_key(target, key, item[1]) + return None diff -r 000000000000 -r 1d09e1dec1d9 src_classic/version.i --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src_classic/version.i Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,7 @@ +%pythoncode %{ +VersionFitz = "1.24.1" # MuPDF version. +VersionBind = "1.24.1" # PyMuPDF version. 
+VersionDate = "2024-04-02 00:00:01" +version = (VersionBind, VersionFitz, "20240402000001") +pymupdf_version_tuple = tuple( [int(i) for i in VersionFitz.split('.')]) +%} diff -r 000000000000 -r 1d09e1dec1d9 tests/README.md --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/README.md Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,81 @@ +# PyMuPDF tests + +To run these tests: + +* Create and enter a venv. +* Install PyMuPDF. +* Install the Python packages listed in + `PyMuPDF/scripts/gh_release.py:test_packages`. +* Run pytest on the PyMuPDF directory. + +For example, as of 2023-12-11: + +``` +> python -m pip install pytest fontTools psutil pymupdf-fonts pillow +> pytest PyMuPDF +============================= test session starts ============================== +platform linux -- Python 3.11.2, pytest-7.4.3, pluggy-1.3.0 +rootdir: /home/jules/artifex-remote/PyMuPDF +configfile: pytest.ini +collected 171 items + +PyMuPDF/tests/test_2548.py . [ 0%] +PyMuPDF/tests/test_2634.py . [ 1%] +PyMuPDF/tests/test_2736.py . [ 1%] +PyMuPDF/tests/test_2791.py . [ 2%] +PyMuPDF/tests/test_2861.py . [ 2%] +PyMuPDF/tests/test_annots.py .................. [ 13%] +PyMuPDF/tests/test_badfonts.py . [ 14%] +PyMuPDF/tests/test_crypting.py . [ 14%] +PyMuPDF/tests/test_docs_samples.py ............. [ 22%] +PyMuPDF/tests/test_drawings.py ...... [ 25%] +PyMuPDF/tests/test_embeddedfiles.py . [ 26%] +PyMuPDF/tests/test_extractimage.py .. [ 27%] +PyMuPDF/tests/test_flake8.py . [ 28%] +PyMuPDF/tests/test_font.py ..... [ 30%] +PyMuPDF/tests/test_general.py .......................................... [ 55%] +... [ 57%] +PyMuPDF/tests/test_geometry.py ........ [ 61%] +PyMuPDF/tests/test_imagebbox.py .. [ 63%] +PyMuPDF/tests/test_insertimage.py .. [ 64%] +PyMuPDF/tests/test_insertpdf.py .. [ 65%] +PyMuPDF/tests/test_linequad.py . [ 66%] +PyMuPDF/tests/test_metadata.py .. [ 67%] +PyMuPDF/tests/test_nonpdf.py ... [ 69%] +PyMuPDF/tests/test_object_manipulation.py .... [ 71%] +PyMuPDF/tests/test_optional_content.py .. [ 72%] +PyMuPDF/tests/test_pagedelete.py . [ 73%] +PyMuPDF/tests/test_pagelabels.py . [ 73%] +PyMuPDF/tests/test_pixmap.py .......... [ 79%] +PyMuPDF/tests/test_showpdfpage.py . [ 80%] +PyMuPDF/tests/test_story.py ... [ 81%] +PyMuPDF/tests/test_tables.py ... [ 83%] +PyMuPDF/tests/test_tesseract.py . [ 84%] +PyMuPDF/tests/test_textbox.py ...... [ 87%] +PyMuPDF/tests/test_textextract.py .. [ 88%] +PyMuPDF/tests/test_textsearch.py .. [ 90%] +PyMuPDF/tests/test_toc.py ........ [ 94%] +PyMuPDF/tests/test_widgets.py ........ [ 99%] +PyMuPDF/tests/test_word_delimiters.py . [100%] + +======================== 171 passed in 78.65s (0:01:18) ======================== +> +``` + +## Known test failure with non-default build of MuPDF + +If PyMuPDF has been built with a non-default build of MuPDF (using +environmental variable ``PYMUPDF_SETUP_MUPDF_BUILD``), it is possible that +``tests/test_textbox.py:test_textbox3()`` will fail, because it relies on MuPDF +having been built with PyMuPDF's customized configuration, ``fitz/_config.h``. + +One can skip this particular test by adding ``-k 'not test_textbox3'`` to the +pytest command line. + + +## Resuming at a particular test. + +To skip tests before a particular test, set PYMUPDF_PYTEST_RESUME to the name +of the function. + +For example PYMUPDF_PYTEST_RESUME=test_haslinks. 
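+
+A complete invocation might look like this (assuming a POSIX shell; on other
+shells, set the environment variable in whatever way is equivalent):
+
+```
+> PYMUPDF_PYTEST_RESUME=test_haslinks pytest PyMuPDF
+```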
diff -r 000000000000 -r 1d09e1dec1d9 tests/conftest.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/conftest.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,143 @@ +import copy +import os +import platform +import sys + +import pymupdf + +import pytest + +PYMUPDF_PYTEST_RESUME = os.environ.get('PYMUPDF_PYTEST_RESUME') + +@pytest.fixture(autouse=True) +def wrap(request): + ''' + Check that tests return with empty MuPDF warnings buffer. For example this + detects failure to call fz_close_output() before fz_drop_output(), which + (as of 2024-4-12) generates a warning from MuPDF. + + As of 2024-09-12 we also detect whether tests leave fds open; but for now + do not fail tests, because many tests need fixing. + ''' + global PYMUPDF_PYTEST_RESUME + if PYMUPDF_PYTEST_RESUME: + # Skip all tests until we reach a matching name. + if PYMUPDF_PYTEST_RESUME == request.function.__name__: + print(f'### {PYMUPDF_PYTEST_RESUME=}: resuming at {request.function.__name__=}.') + PYMUPDF_PYTEST_RESUME = None + else: + print(f'### {PYMUPDF_PYTEST_RESUME=}: Skipping {request.function.__name__=}.') + return + + wt = pymupdf.TOOLS.mupdf_warnings() + assert not wt, f'{wt=}' + if platform.python_implementation() == 'GraalVM': + pymupdf.TOOLS.set_small_glyph_heights() + else: + assert not pymupdf.TOOLS.set_small_glyph_heights() + next_fd_before = os.open(__file__, os.O_RDONLY) + os.close(next_fd_before) + + if platform.system() == 'Linux' and platform.python_implementation() != 'GraalVM': + test_fds = True + else: + test_fds = False + + if test_fds: + # Gather detailed information about leaked fds. + def get_fds(): + import subprocess + path = 'PyMuPDF-linx-fds' + path_l = 'PyMuPDF-linx-fds-l' + command = f'ls /proc/{os.getpid()}/fd > {path}' + command_l = f'ls -l /proc/{os.getpid()}/fd > {path_l}' + subprocess.run(command, shell=1) + subprocess.run(command_l, shell=1) + with open(path) as f: + ret = f.read() + ret = ret.replace('\n', ' ') + with open(path_l) as f: + ret_l = f.read() + return ret, ret_l + open_fds_before, open_fds_before_l = get_fds() + + pymupdf._log_items_clear() + pymupdf._log_items_active(True) + + JM_annot_id_stem = pymupdf.JM_annot_id_stem + + def get_members(a): + ret = dict() + for n in dir(a): + if not n.startswith('_'): + v = getattr(a, n) + ret[n] = v + return ret + + # Allow post-test checking that pymupdf._globals has not changed. + _globals_pre = get_members(pymupdf._globals) + + # Run the test. + rep = yield + + sys.stdout.flush() + + # Test has run; check it did not create any MuPDF warnings etc. + wt = pymupdf.TOOLS.mupdf_warnings() + if not hasattr(pymupdf, 'mupdf'): + print(f'Not checking mupdf_warnings on classic.') + else: + assert not wt, f'Warnings text not empty: {wt=}' + + assert not pymupdf.TOOLS.set_small_glyph_heights() + + _globals_post = get_members(pymupdf._globals) + if _globals_post != _globals_pre: + print(f'Test has changed pymupdf._globals from {_globals_pre=} to {_globals_post=}') + assert 0 + + log_items = pymupdf._log_items() + assert not log_items, f'log() was called; {len(log_items)=}.' + + assert pymupdf.JM_annot_id_stem == JM_annot_id_stem, \ + f'pymupdf.JM_annot_id_stem has changed from {JM_annot_id_stem!r} to {pymupdf.JM_annot_id_stem!r}' + + if test_fds: + # Show detailed information about leaked fds. 
+ open_fds_after, open_fds_after_l = get_fds() + if open_fds_after != open_fds_before: + import textwrap + print(f'Test has changed process fds:') + print(f' {open_fds_before=}') + print(f' {open_fds_after=}') + print(f'open_fds_before_l:') + print(textwrap.indent(open_fds_before_l, ' ')) + print(f'open_fds_after_l:') + print(textwrap.indent(open_fds_after_l, ' ')) + #assert 0 + + next_fd_after = os.open(__file__, os.O_RDONLY) + os.close(next_fd_after) + + if test_fds and next_fd_after != next_fd_before: + print(f'Test has leaked fds, {next_fd_before=} {next_fd_after=}.') + #assert 0, f'Test has leaked fds, {next_fd_before=} {next_fd_after=}. {args=} {kwargs=}.' + + if 0: + # This code can be useful to track down test failures caused by other + # tests modifying global state. + # + # We run a particular test menually after each test returns. + sys.path.insert(0, os.path.dirname(__file__)) + try: + import test_tables + finally: + del sys.path[0] + print(f'### Calling test_tables.test_md_styles().') + try: + test_tables.test_md_styles() + except Exception as e: + print(f'### test_tables.test_md_styles() failed: {e}') + raise + else: + print(f'### test_tables.test_md_styles() passed.') diff -r 000000000000 -r 1d09e1dec1d9 tests/gentle_compare.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/gentle_compare.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,98 @@ +import math + +import pymupdf + + +def gentle_compare(w0, w1): + """Check lists of "words" extractions for approximate equality. + + * both lists must have same length + * word items must contain same word strings + * word rectangles must be approximately equal + """ + tolerance = 1e-3 # maximum (Euclidean) norm of difference rectangle + word_count = len(w0) # number of words + if word_count != len(w1): + print(f"different number of words: {word_count}/{len(w1)}") + return False + for i in range(word_count): + if w0[i][4] != w1[i][4]: # word strings must be the same + print(f"word {i} mismatch") + return False + r0 = pymupdf.Rect(w0[i][:4]) # rect of first word + r1 = pymupdf.Rect(w1[i][:4]) # rect of second word + delta = (r1 - r0).norm() # norm of difference rectangle + if delta > tolerance: + print(f"word {i}: rectangle mismatch {delta}") + return False + return True + + +def rms(a, b, verbose=None, out_prefix=''): + ''' + Returns RMS diff of raw bytes of two sequences. + ''' + if verbose is True: + verbose = 100000 + assert len(a) == len(b) + e = 0 + for i, (aa, bb) in enumerate(zip(a, b)): + if verbose and (i % verbose == 0): + print(f'{out_prefix}rms(): {i=} {e=} {aa=} {aa=}.') + e += (aa - bb) ** 2 + rms = math.sqrt(e / len(a)) + return rms + + +def pixmaps_rms(a, b, out_prefix=''): + ''' + Returns RMS diff of raw bytes of two pixmaps. + + We assert that the pixmaps/sequences are the same size. + + and can each be a pymupdf.Pixmap or path of a bitmap file. + ''' + if isinstance(a, str): + print(f'{out_prefix}pixmaps_rms(): reading pixmap from {a=}.') + a = pymupdf.Pixmap(a) + if isinstance(b, str): + print(f'{out_prefix}pixmaps_rms(): reading pixmap from {b=}.') + b = pymupdf.Pixmap(b) + assert a.irect == b.irect, f'Differing rects: {a.irect=} {b.irect=}.' + a_mv = a.samples_mv + b_mv = b.samples_mv + assert len(a_mv) == len(b_mv) + ret = rms(a_mv, b_mv, verbose=True, out_prefix=out_prefix) + print(f'{out_prefix}pixmaps_rms(): {ret=}.') + return ret + + +def pixmaps_diff(a, b, out_prefix=''): + ''' + Returns a pymupdf.Pixmap that represents the difference between pixmaps + and . 
+ + Each byte in the returned pixmap is `128 + (b_byte - a_byte) // 2`. + ''' + if isinstance(a, str): + print(f'{out_prefix}pixmaps_rms(): reading pixmap from {a=}.') + a = pymupdf.Pixmap(a) + if isinstance(b, str): + print(f'{out_prefix}pixmaps_rms(): reading pixmap from {b=}.') + b = pymupdf.Pixmap(b) + assert a.irect == b.irect, f'Differing rects: {a.irect=} {b.irect=}.' + a_mv = a.samples_mv + b_mv = b.samples_mv + c = pymupdf.Pixmap(a.tobytes()) + c_mv = c.samples_mv + assert len(a_mv) == len(b_mv) == len(c_mv) + if 1: + print(f'{len(a_mv)=}') + for i, (a_byte, b_byte, c_byte) in enumerate(zip(a_mv, b_mv, c_mv)): + assert 0 <= a_byte < 256 + assert 0 <= b_byte < 256 + assert 0 <= c_byte < 256 + # Set byte to 128 plus half the diff so we represent the full + # -255..+255 range. + c_mv[i] = 128 + (b_byte - a_byte) // 2 + return c diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/001003ED.pdf Binary file tests/resources/001003ED.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/1.pdf Binary file tests/resources/1.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/2.pdf Binary file tests/resources/2.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/2201.00069.pdf Binary file tests/resources/2201.00069.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/3.pdf Binary file tests/resources/3.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/4.pdf Binary file tests/resources/4.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/Bezier.epub Binary file tests/resources/Bezier.epub has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/PragmaticaC.otf Binary file tests/resources/PragmaticaC.otf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/battery-file-22.pdf Binary file tests/resources/battery-file-22.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/bug1945.pdf Binary file tests/resources/bug1945.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/bug1971.pdf Binary file tests/resources/bug1971.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/chinese-tables.pdf Binary file tests/resources/chinese-tables.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/chinese-tables.pickle Binary file tests/resources/chinese-tables.pickle has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/circular-toc.pdf --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/circular-toc.pdf Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,74 @@ +%PDF-1.7 +%µ¶ + +1 0 obj +<> +endobj + +2 0 obj +<> +endobj + +3 0 obj +<>>>/Contents[9 0 R]/MediaBox[0 0 612 792]>> +endobj + +4 0 obj +<>>>/Contents[10 0 R]/MediaBox[0 0 612 792]>> +endobj + +5 0 obj +<> +endobj + +6 0 obj +<> +endobj + +7 0 obj +<> +endobj + +8 0 obj +<> +endobj + +9 0 obj +<> +stream +BT +/F1 20 Tf +100 600 TD (Page1)Tj +ET +endstream +endobj + +10 0 obj +<> +stream +BT +/F1 20 Tf +100 600 TD (Page2)Tj +ET +endstream +endobj + +xref +0 11 +0000000000 65536 f +0000000016 00000 n +0000000077 00000 n +0000000135 00000 n +0000000249 00000 n +0000000364 00000 n +0000000443 00000 n +0000000519 00000 n +0000000585 00000 n +0000000645 00000 n +0000000730 00000 n + +trailer +<> +startxref +816 +%%EOF diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/cms-etc-filled.pdf Binary file tests/resources/cms-etc-filled.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/cython.pdf Binary file tests/resources/cython.pdf has 
changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/cython.pickle Binary file tests/resources/cython.pickle has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/dotted-gridlines.pdf Binary file tests/resources/dotted-gridlines.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/full_toc.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/full_toc.txt Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,48 @@ +[1, 'HAUPTÜBERSICHT', -1, {'kind': 5, 'xref': 2, 'file': '../SDW2006.PDF', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[1, 'Januar 01/2006', -1, {'kind': 5, 'xref': 3, 'file': '01004INH.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0, 'collapse': False}] +[2, 'SPEKTROGRAMM', -1, {'kind': 0, 'xref': 4, 'page': -1, 'collapse': False, 'zoom': 0.0}] +[3, 'Urzeit-Godzilla', -1, {'kind': 5, 'xref': 87, 'file': '01008SP.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Frühchristliche Mosaike im Knast', -1, {'kind': 5, 'xref': 102, 'file': '01008SP.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Evolution auf Eis', -1, {'kind': 5, 'xref': 100, 'file': '01008SP.pdf', 'page': 1, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Entwarnung bei Kondensstreifen', -1, {'kind': 5, 'xref': 98, 'file': '01008SP.pdf', 'page': 1, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Spermatausch beim Schnecken-Sex', -1, {'kind': 5, 'xref': 96, 'file': '01008SP.pdf', 'page': 1, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Mehr Monde für Pluto', -1, {'kind': 5, 'xref': 94, 'file': '01008SP.pdf', 'page': 2, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Endlich ein Malaria-Impfstoff', -1, {'kind': 5, 'xref': 92, 'file': '01008SP.pdf', 'page': 2, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Spuren der ersten Sterne', -1, {'kind': 5, 'xref': 90, 'file': '01008SP.pdf', 'page': 2, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Bild des Monats', -1, {'kind': 5, 'xref': 88, 'file': '01008SP.pdf', 'page': 3, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[2, 'FORSCHUNG AKTUELL', -1, {'kind': 0, 'xref': 23, 'page': -1, 'collapse': False, 'zoom': 0.0}] +[3, 'Der Super-Teilchenfänger in der Pampa', -1, {'kind': 5, 'xref': 24, 'file': '01012FA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Auf der Fährte der Lepra', -1, {'kind': 5, 'xref': 29, 'file': '01012FA.pdf', 'page': 2, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Vampire gegen Schlaganfall', -1, {'kind': 5, 'xref': 27, 'file': '01012FA.pdf', 'page': 4, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Der Flug des Kolibris', -1, {'kind': 5, 'xref': 25, 'file': '01012FA.pdf', 'page': 7, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[2, 'THEMEN', -1, {'kind': 0, 'xref': 20, 'page': -1, 'collapse': False, 'zoom': 0.0}] +[3, 'Entwicklung von Spiralgalaxien', -1, {'kind': 5, 'xref': 21, 'file': '01022HA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Geschichtsträchtige Genspuren', -1, {'kind': 5, 'xref': 46, 'file': '01030HA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Was Sedimente verraten', -1, {'kind': 5, 'xref': 44, 'file': '01042HA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Von Baumringen und Regenmengen', -1, {'kind': 5, 'xref': 42, 'file': '01050HA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Software-Agenten in Not', -1, {'kind': 5, 'xref': 40, 'file': '01056HA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Künstlicher kalter Antiwasserstoff', -1, {'kind': 5, 'xref': 38, 'file': '01062HA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 
'Rüsten gegen eine Pandemie', -1, {'kind': 5, 'xref': 36, 'file': '01072HA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Satelliten zeigen Lawinengefahr', -1, {'kind': 5, 'xref': 34, 'file': '01084HA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Provokante Verheißung: Update für den Menschen', -1, {'kind': 5, 'xref': 22, 'file': '01100HA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[2, 'KOMMENTAR', -1, {'kind': 0, 'xref': 18, 'page': -1, 'collapse': False, 'zoom': 0.0}] +[3, 'Springers Einwüfe: Holland, die Hydrometropole', -1, {'kind': 5, 'xref': 19, 'file': '01012FA.pdf', 'page': 8, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[2, 'WISSENSCHAFT IM ...', -1, {'kind': 0, 'xref': 15, 'page': -1, 'collapse': False, 'zoom': 0.0}] +[3, 'Alltag: Eine Decke für die Straße', -1, {'kind': 5, 'xref': 16, 'file': '01040WA.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Rückblick: Mozarts Ohr • Per Auto zum Südpol u.a.', -1, {'kind': 5, 'xref': 17, 'file': '01081IR.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[2, 'JUNGE WISSENSCHAFT', -1, {'kind': 0, 'xref': 13, 'page': -1, 'collapse': False, 'zoom': 0.0}] +[3, 'Ein Putzroboter für die Mama', -1, {'kind': 5, 'xref': 14, 'file': '01082JW.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[2, 'REZENSIONEN', -1, {'kind': 0, 'xref': 10, 'page': -1, 'collapse': False, 'zoom': 0.0}] +[3, 'Vulkanismus verstehen und erleben', -1, {'kind': 5, 'xref': 11, 'file': '01090RE.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Warum der Mensch glaubt', -1, {'kind': 5, 'xref': 72, 'file': '01090RE.pdf', 'page': 1, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Biomedizin und Ethik', -1, {'kind': 5, 'xref': 70, 'file': '01090RE.pdf', 'page': 2, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Mythos Meer', -1, {'kind': 5, 'xref': 68, 'file': '01090RE.pdf', 'page': 3, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Warum Frauen nicht schwach ... 
sind', -1, {'kind': 5, 'xref': 66, 'file': '01090RE.pdf', 'page': 4, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'PISA, Bach, Pythagoras', -1, {'kind': 5, 'xref': 12, 'file': '01090RE.pdf', 'page': 5, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[2, 'MATHEMATISCHE UNTERHALTUNGEN', -1, {'kind': 0, 'xref': 8, 'page': -1, 'collapse': False, 'zoom': 0.0}] +[3, 'Himmliches Ballett', -1, {'kind': 5, 'xref': 9, 'file': '01098MU.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[2, 'WEITERE RUBRIKEN', -1, {'kind': 0, 'xref': 5, 'page': -1, 'collapse': False, 'zoom': 0.0}] +[3, 'Editorial', -1, {'kind': 5, 'xref': 6, 'file': '01003ED.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Leserbriefe/Impressum', -1, {'kind': 5, 'xref': 81, 'file': '01006LB.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Preisrätsel', -1, {'kind': 5, 'xref': 79, 'file': '01090RE.pdf', 'page': 6, 'to': Point(0.0, 0.0), 'zoom': 0.0}] +[3, 'Vorschau', -1, {'kind': 5, 'xref': 7, 'file': '01106VO.pdf', 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.0}] diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/github_sample.pdf Binary file tests/resources/github_sample.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/has-bad-fonts.pdf Binary file tests/resources/has-bad-fonts.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/image-file1.pdf Binary file tests/resources/image-file1.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/img-regular.pdf Binary file tests/resources/img-regular.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/img-transparent.pdf Binary file tests/resources/img-transparent.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/img-transparent.png Binary file tests/resources/img-transparent.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/interfield-calculation.pdf --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/interfield-calculation.pdf Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,669 @@ +%PDF-1.7 +%µ¶ + +1 0 obj +<< + /Type /Catalog + /Pages 2 0 R + /AcroForm << + /Fields [ 5 0 R 8 0 R 10 0 R 16 0 R 18 0 R 20 0 R 26 0 R 28 0 R + 30 0 R ] + /CO [ 10 0 R 20 0 R 30 0 R ] + >> +>> +endobj + +2 0 obj +<< + /Type /Pages + /Count 3 + /Kids [ 4 0 R 15 0 R 25 0 R ] +>> +endobj + +3 0 obj +<< +>> +endobj + +4 0 obj +<< + /Type /Page + /MediaBox [ 0 0 595 842 ] + /Rotate 0 + /Resources 3 0 R + /Parent 2 0 R + /Annots [ 5 0 R 8 0 R 10 0 R ] +>> +endobj + +5 0 obj +<< + /Type /Annot + /Subtype /Widget + /FT /Tx + /T (NUM10) + /NM (fitz-W0) + /Rect [ 100 722 300 742 ] + /F 4 + /BS << + /S /S + /W 0 + >> + /DA (0 0 0 rg /Helv 0 Tf) + /Ff 2 + /V (1) + /AP << + /N 7 0 R + >> +>> +endobj + +6 0 obj +<< + /Type /Font + /Subtype /Type1 + /BaseFont /Helvetica + /Encoding /WinAnsiEncoding +>> +endobj + +7 0 obj +<< + /Type /XObject + /Subtype /Form + /BBox [ 0 0 200 20 ] + /Matrix [ 1 0 0 1 0 0 ] + /Resources << + /Font << + /Helv 6 0 R + >> + >> + /Length 79 +>> +stream +/Tx BMC +q +0 w +0 0 200 20 re +W +n +BT +0 0 0 rg +0 4 Td +/Helv 20 Tf +(1) Tj +ET +Q +EMC + +endstream +endobj + +8 0 obj +<< + /Type /Annot + /Subtype /Widget + /FT /Tx + /T (NUM20) + /NM (fitz-W1) + /Rect [ 100 692 300 712 ] + /F 4 + /BS << + /S /S + /W 0 + >> + /DA (0 0 0 rg /Helv 0 Tf) + /Ff 2 + /V (200) + /AP << + /N 9 0 R + >> +>> +endobj + +9 0 obj +<< + /Type /XObject + /Subtype /Form + /BBox [ 0 0 200 20 ] + /Matrix [ 1 0 0 1 0 0 ] + /Resources << + /Font << + /Helv 6 0 R + >> + >> + /Length 81 +>> +stream +/Tx 
BMC +q +0 w +0 0 200 20 re +W +n +BT +0 0 0 rg +0 4 Td +/Helv 20 Tf +(200) Tj +ET +Q +EMC + +endstream +endobj + +10 0 obj +<< + /Type /Annot + /Subtype /Widget + /FT /Tx + /T (RESULT0) + /NM (fitz-W2) + /Rect [ 100 642 300 662 ] + /F 4 + /BS << + /S /S + /W 0 + >> + /DA (0 0 0 rg /Helv 0 Tf) + /Ff 0 + /AA << + /C 12 0 R + >> + /V (Resultat?) + /AP << + /N 13 0 R + >> +>> +endobj + +11 0 obj +<< + /Length 55 +>> +stream +AFSimple_Calculate("SUM", new Array("NUM10", "NUM20")); +endstream +endobj + +12 0 obj +<< + /S /JavaScript + /JS 11 0 R +>> +endobj + +13 0 obj +<< + /Type /XObject + /Subtype /Form + /BBox [ 0 0 200 20 ] + /Matrix [ 1 0 0 1 0 0 ] + /Resources << + /Font << + /Helv 6 0 R + >> + >> + /Length 87 +>> +stream +/Tx BMC +q +0 w +0 0 200 20 re +W +n +BT +0 0 0 rg +0 4 Td +/Helv 20 Tf +(Resultat?) Tj +ET +Q +EMC + +endstream +endobj + +14 0 obj +<< +>> +endobj + +15 0 obj +<< + /Type /Page + /MediaBox [ 0 0 595 842 ] + /Rotate 0 + /Resources 14 0 R + /Parent 2 0 R + /Annots [ 16 0 R 18 0 R 20 0 R ] +>> +endobj + +16 0 obj +<< + /Type /Annot + /Subtype /Widget + /FT /Tx + /T (NUM11) + /NM (fitz-W0) + /Rect [ 100 722 300 742 ] + /F 4 + /BS << + /S /S + /W 0 + >> + /DA (0 0 0 rg /Helv 0 Tf) + /Ff 2 + /V (101) + /AP << + /N 17 0 R + >> +>> +endobj + +17 0 obj +<< + /Type /XObject + /Subtype /Form + /BBox [ 0 0 200 20 ] + /Matrix [ 1 0 0 1 0 0 ] + /Resources << + /Font << + /Helv 6 0 R + >> + >> + /Length 81 +>> +stream +/Tx BMC +q +0 w +0 0 200 20 re +W +n +BT +0 0 0 rg +0 4 Td +/Helv 20 Tf +(101) Tj +ET +Q +EMC + +endstream +endobj + +18 0 obj +<< + /Type /Annot + /Subtype /Widget + /FT /Tx + /T (NUM21) + /NM (fitz-W1) + /Rect [ 100 692 300 712 ] + /F 4 + /BS << + /S /S + /W 0 + >> + /DA (0 0 0 rg /Helv 0 Tf) + /Ff 2 + /V (200) + /AP << + /N 19 0 R + >> +>> +endobj + +19 0 obj +<< + /Type /XObject + /Subtype /Form + /BBox [ 0 0 200 20 ] + /Matrix [ 1 0 0 1 0 0 ] + /Resources << + /Font << + /Helv 6 0 R + >> + >> + /Length 81 +>> +stream +/Tx BMC +q +0 w +0 0 200 20 re +W +n +BT +0 0 0 rg +0 4 Td +/Helv 20 Tf +(200) Tj +ET +Q +EMC + +endstream +endobj + +20 0 obj +<< + /Type /Annot + /Subtype /Widget + /FT /Tx + /T (RESULT1) + /NM (fitz-W2) + /Rect [ 100 642 300 662 ] + /F 4 + /BS << + /S /S + /W 0 + >> + /DA (0 0 0 rg /Helv 0 Tf) + /Ff 0 + /AA << + /C 22 0 R + >> + /V (Resultat?) + /AP << + /N 23 0 R + >> +>> +endobj + +21 0 obj +<< + /Length 55 +>> +stream +AFSimple_Calculate("SUM", new Array("NUM11", "NUM21")); +endstream +endobj + +22 0 obj +<< + /S /JavaScript + /JS 21 0 R +>> +endobj + +23 0 obj +<< + /Type /XObject + /Subtype /Form + /BBox [ 0 0 200 20 ] + /Matrix [ 1 0 0 1 0 0 ] + /Resources << + /Font << + /Helv 6 0 R + >> + >> + /Length 87 +>> +stream +/Tx BMC +q +0 w +0 0 200 20 re +W +n +BT +0 0 0 rg +0 4 Td +/Helv 20 Tf +(Resultat?) 
Tj +ET +Q +EMC + +endstream +endobj + +24 0 obj +<< +>> +endobj + +25 0 obj +<< + /Type /Page + /MediaBox [ 0 0 595 842 ] + /Rotate 0 + /Resources 24 0 R + /Parent 2 0 R + /Annots [ 26 0 R 28 0 R 30 0 R ] +>> +endobj + +26 0 obj +<< + /Type /Annot + /Subtype /Widget + /FT /Tx + /T (NUM12) + /NM (fitz-W0) + /Rect [ 100 722 300 742 ] + /F 4 + /BS << + /S /S + /W 0 + >> + /DA (0 0 0 rg /Helv 0 Tf) + /Ff 2 + /V (201) + /AP << + /N 27 0 R + >> +>> +endobj + +27 0 obj +<< + /Type /XObject + /Subtype /Form + /BBox [ 0 0 200 20 ] + /Matrix [ 1 0 0 1 0 0 ] + /Resources << + /Font << + /Helv 6 0 R + >> + >> + /Length 81 +>> +stream +/Tx BMC +q +0 w +0 0 200 20 re +W +n +BT +0 0 0 rg +0 4 Td +/Helv 20 Tf +(201) Tj +ET +Q +EMC + +endstream +endobj + +28 0 obj +<< + /Type /Annot + /Subtype /Widget + /FT /Tx + /T (NUM22) + /NM (fitz-W1) + /Rect [ 100 692 300 712 ] + /F 4 + /BS << + /S /S + /W 0 + >> + /DA (0 0 0 rg /Helv 0 Tf) + /Ff 2 + /V (200) + /AP << + /N 29 0 R + >> +>> +endobj + +29 0 obj +<< + /Type /XObject + /Subtype /Form + /BBox [ 0 0 200 20 ] + /Matrix [ 1 0 0 1 0 0 ] + /Resources << + /Font << + /Helv 6 0 R + >> + >> + /Length 81 +>> +stream +/Tx BMC +q +0 w +0 0 200 20 re +W +n +BT +0 0 0 rg +0 4 Td +/Helv 20 Tf +(200) Tj +ET +Q +EMC + +endstream +endobj + +30 0 obj +<< + /Type /Annot + /Subtype /Widget + /FT /Tx + /T (RESULT2) + /NM (fitz-W2) + /Rect [ 100 642 300 662 ] + /F 4 + /BS << + /S /S + /W 0 + >> + /DA (0 0 0 rg /Helv 0 Tf) + /Ff 0 + /AA << + /C 32 0 R + >> + /V (Resultat?) + /AP << + /N 33 0 R + >> +>> +endobj + +31 0 obj +<< + /Length 55 +>> +stream +AFSimple_Calculate("SUM", new Array("NUM12", "NUM22")); +endstream +endobj + +32 0 obj +<< + /S /JavaScript + /JS 31 0 R +>> +endobj + +33 0 obj +<< + /Type /XObject + /Subtype /Form + /BBox [ 0 0 200 20 ] + /Matrix [ 1 0 0 1 0 0 ] + /Resources << + /Font << + /Helv 6 0 R + >> + >> + /Length 87 +>> +stream +/Tx BMC +q +0 w +0 0 200 20 re +W +n +BT +0 0 0 rg +0 4 Td +/Helv 20 Tf +(Resultat?) 
Tj +ET +Q +EMC + +endstream +endobj + +xref +0 34 +0000000000 00001 f +0000000016 00000 n +0000000208 00000 n +0000000288 00000 n +0000000310 00000 n +0000000454 00000 n +0000000689 00000 n +0000000795 00000 n +0000001069 00000 n +0000001306 00000 n +0000001582 00000 n +0000001857 00000 n +0000001966 00000 n +0000002019 00000 n +0000002302 00000 n +0000002325 00000 n +0000002473 00000 n +0000002712 00000 n +0000002989 00000 n +0000003228 00000 n +0000003505 00000 n +0000003780 00000 n +0000003889 00000 n +0000003942 00000 n +0000004225 00000 n +0000004248 00000 n +0000004396 00000 n +0000004635 00000 n +0000004912 00000 n +0000005151 00000 n +0000005428 00000 n +0000005703 00000 n +0000005812 00000 n +0000005865 00000 n + +trailer +<< + /Size 34 + /Root 1 0 R + /ID [ <41C38EC38A1A5C58C2A1507EC3B17775> <37EE68FEBD7F417FDE1B6A6EBE1F12D0> ] +>> +startxref +6148 +%%EOF diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/joined.pdf Binary file tests/resources/joined.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/merge-form1.pdf Binary file tests/resources/merge-form1.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/merge-form2.pdf Binary file tests/resources/merge-form2.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/metadata.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/metadata.txt Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1 @@ +{"format": "PDF 1.6", "title": "RUBRIK_Editorial_01-06.indd", "author": "Natalie Schaefer", "subject": "", "keywords": "", "creator": "", "producer": "Acrobat Distiller 7.0.5 (Windows)", "creationDate": "D:20070113191400+01'00'", "modDate": "D:20070120104154+01'00'", "trapped": "", "encryption": null} \ No newline at end of file diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/mupdf_explored.pdf Binary file tests/resources/mupdf_explored.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/nur-ruhig.jpg Binary file tests/resources/nur-ruhig.jpg has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/quad-calc-0.pdf Binary file tests/resources/quad-calc-0.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/simple_toc.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/simple_toc.txt Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,1 @@ +[1, 'HAUPTÜBERSICHT', -1][1, 'Januar 01/2006', -1][2, 'SPEKTROGRAMM', -1][3, 'Urzeit-Godzilla', -1][3, 'Frühchristliche Mosaike im Knast', -1][3, 'Evolution auf Eis', -1][3, 'Entwarnung bei Kondensstreifen', -1][3, 'Spermatausch beim Schnecken-Sex', -1][3, 'Mehr Monde für Pluto', -1][3, 'Endlich ein Malaria-Impfstoff', -1][3, 'Spuren der ersten Sterne', -1][3, 'Bild des Monats', -1][2, 'FORSCHUNG AKTUELL', -1][3, 'Der Super-Teilchenfänger in der Pampa', -1][3, 'Auf der Fährte der Lepra', -1][3, 'Vampire gegen Schlaganfall', -1][3, 'Der Flug des Kolibris', -1][2, 'THEMEN', -1][3, 'Entwicklung von Spiralgalaxien', -1][3, 'Geschichtsträchtige Genspuren', -1][3, 'Was Sedimente verraten', -1][3, 'Von Baumringen und Regenmengen', -1][3, 'Software-Agenten in Not', -1][3, 'Künstlicher kalter Antiwasserstoff', -1][3, 'Rüsten gegen eine Pandemie', -1][3, 'Satelliten zeigen Lawinengefahr', -1][3, 'Provokante Verheißung: Update für den Menschen', -1][2, 'KOMMENTAR', -1][3, 'Springers Einwüfe: Holland, die Hydrometropole', -1][2, 'WISSENSCHAFT IM ...', -1][3, 'Alltag: Eine Decke für die Straße', -1][3, 'Rückblick: Mozarts Ohr • Per Auto zum Südpol u.a.', -1][2, 'JUNGE WISSENSCHAFT', -1][3, 'Ein 
Putzroboter für die Mama', -1][2, 'REZENSIONEN', -1][3, 'Vulkanismus verstehen und erleben', -1][3, 'Warum der Mensch glaubt', -1][3, 'Biomedizin und Ethik', -1][3, 'Mythos Meer', -1][3, 'Warum Frauen nicht schwach ... sind', -1][3, 'PISA, Bach, Pythagoras', -1][2, 'MATHEMATISCHE UNTERHALTUNGEN', -1][3, 'Himmliches Ballett', -1][2, 'WEITERE RUBRIKEN', -1][3, 'Editorial', -1][3, 'Leserbriefe/Impressum', -1][3, 'Preisrätsel', -1][3, 'Vorschau', -1] \ No newline at end of file diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/small-table.pdf Binary file tests/resources/small-table.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/strict-yes-no.pdf Binary file tests/resources/strict-yes-no.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/symbol-list.pdf Binary file tests/resources/symbol-list.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/symbols.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/symbols.txt Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,728 @@ +[{'closePath': False, + 'color': (1.0, 1.0, 1.0), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (1.0, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('l', (50.0, 50.0), (50.0, 100.0)), + ('l', (50.0, 100.0), (100.0, 75.0)), + ('l', (100.0, 75.0), (50.0, 50.0))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (50.0, 50.0, 100.0, 100.0), + 'seqno': 0, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 1.0}, + {'closePath': False, + 'color': (1.0, 1.0, 1.0), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (1.0, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('c', + (50.0, 135.0), + (63.807098388671875, 135.0), + (75.0, 123.8070068359375), + (75.0, 110.0)), + ('c', + (75.0, 110.0), + (75.0, 123.8070068359375), + (86.19290161132812, 135.0), + (100.0, 135.0)), + ('c', + (100.0, 135.0), + (86.19290161132812, 135.0), + (75.0, 146.1929931640625), + (75.0, 160.0)), + ('c', + (75.0, 160.0), + (75.0, 146.1929931640625), + (63.807098388671875, 135.0), + (50.0, 135.0))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (50.0, 110.0, 100.0, 160.0), + 'seqno': 2, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 1.0}, + {'closePath': False, + 'color': (0.0, 1.0, 0.0), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (0.0, 1.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('c', (75.0, 195.0), (50.0, 170.0), (100.0, 170.0), (75.0, 195.0)), + ('c', (75.0, 195.0), (100.0, 170.0), (100.0, 220.0), (75.0, 195.0)), + ('c', (75.0, 195.0), (50.0, 220.0), (50.0, 170.0), (75.0, 195.0)), + ('c', (75.0, 195.0), (100.0, 220.0), (50.0, 220.0), (75.0, 195.0))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (50.0, 170.0, 100.0, 220.0), + 'seqno': 4, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 0.30000001192092896}, + {'closePath': False, + 'color': (1.0, 1.0, 1.0), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (1.0, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('l', (75.0, 230.0), (100.0, 255.0)), + ('l', (100.0, 255.0), (75.0, 280.0)), + ('l', (75.0, 280.0), (50.0, 255.0)), + ('l', (50.0, 255.0), (75.0, 230.0))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (50.0, 230.0, 100.0, 280.0), + 'seqno': 6, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 1.0}, + {'closePath': False, + 'color': (1.0, 1.0, 1.0), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (0.8039219975471497, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('c', + (50.0, 315.0), + (50.0, 328.8070068359375), + 
(61.192901611328125, 340.0), + (75.0, 340.0)), + ('c', + (75.0, 340.0), + (88.80709838867188, 340.0), + (100.0, 328.8070068359375), + (100.0, 315.0)), + ('c', + (100.0, 315.0), + (100.0, 301.1929931640625), + (88.80709838867188, 290.0), + (75.0, 290.0)), + ('c', + (75.0, 290.0), + (61.192901611328125, 290.0), + (50.0, 301.1929931640625), + (50.0, 315.0))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (50.0, 290.0, 100.0, 340.0), + 'seqno': 8, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 2.0}, + {'closePath': False, + 'color': (0.0, 0.0, 0.0), + 'dashes': '[] 0', + 'items': [('c', + (50.0, 315.0), + (50.0, 328.8070068359375), + (61.192901611328125, 340.0), + (75.0, 340.0)), + ('c', + (75.0, 340.0), + (88.80709838867188, 340.0), + (100.0, 328.8070068359375), + (100.0, 315.0)), + ('c', + (100.0, 315.0), + (100.0, 301.1929931640625), + (88.80709838867188, 290.0), + (75.0, 290.0)), + ('c', + (75.0, 290.0), + (61.192901611328125, 290.0), + (50.0, 301.1929931640625), + (50.0, 315.0))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (50.0, 290.0, 100.0, 340.0), + 'seqno': 10, + 'stroke_opacity': 1.0, + 'type': 's', + 'width': 1.0}, + {'closePath': False, + 'color': (1.0, 1.0, 1.0), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (1.0, 1.0, 1.0), + 'fill_opacity': 1.0, + 'items': [('re', (56.5, 312.5, 93.5, 317.5), 1)], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (56.5, 312.5, 93.5, 317.5), + 'seqno': 11, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 3.0}, + {'closePath': False, + 'even_odd': False, + 'fill': (1.0, 1.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('c', + (50.0, 375.0), + (50.0, 388.8070068359375), + (61.192901611328125, 400.0), + (75.0, 400.0)), + ('c', + (75.0, 400.0), + (88.80709838867188, 400.0), + (100.0, 388.8070068359375), + (100.0, 375.0)), + ('c', + (100.0, 375.0), + (100.0, 361.1929931640625), + (88.80709838867188, 350.0), + (75.0, 350.0)), + ('c', + (75.0, 350.0), + (61.192901611328125, 350.0), + (50.0, 361.1929931640625), + (50.0, 375.0))], + 'layer': '', + 'rect': (50.0, 350.0, 100.0, 400.0), + 'seqno': 13, + 'type': 'f'}, + {'closePath': False, + 'even_odd': False, + 'fill': (0.0, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('c', + (60.0, 368.75), + (60.0, 372.2019958496094), + (62.23860168457031, 375.0), + (65.0, 375.0)), + ('c', + (65.0, 375.0), + (67.76139831542969, 375.0), + (70.0, 372.2019958496094), + (70.0, 368.75)), + ('c', + (70.0, 368.75), + (70.0, 365.2980041503906), + (67.76139831542969, 362.5), + (65.0, 362.5)), + ('c', + (65.0, 362.5), + (62.23860168457031, 362.5), + (60.0, 365.2980041503906), + (60.0, 368.75)), + ('c', + (80.0, 368.75), + (80.0, 372.2019958496094), + (82.23860168457031, 375.0), + (85.0, 375.0)), + ('c', + (85.0, 375.0), + (87.76139831542969, 375.0), + (90.0, 372.2019958496094), + (90.0, 368.75)), + ('c', + (90.0, 368.75), + (90.0, 365.2980041503906), + (87.76139831542969, 362.5), + (85.0, 362.5)), + ('c', + (85.0, 362.5), + (82.23860168457031, 362.5), + (80.0, 365.2980041503906), + (80.0, 368.75))], + 'layer': '', + 'rect': (60.0, 362.5, 90.0, 375.0), + 'seqno': 14, + 'type': 'f'}, + {'closePath': False, + 'color': (0.0, 0.0, 0.0), + 'dashes': '[] 0', + 'items': [('c', + (60.0, 387.5), + (68.2843017578125, 380.59600830078125), + (81.7156982421875, 380.59600830078125), + (90.0, 387.5))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (60.0, 380.59600830078125, 90.0, 387.5), + 'seqno': 15, + 'stroke_opacity': 1.0, + 'type': 's', + 
'width': 1.0}, + {'closePath': False, + 'color': (1.0, 0.6470590233802795, 0.0), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (1.0, 0.8274509906768799, 0.6078429818153381), + 'fill_opacity': 1.0, + 'items': [('c', + (50.0, 433.6669921875), + (60.30929946899414, 433.6669921875), + (68.66670227050781, 426.50299072265625), + (68.66670227050781, 417.6669921875)), + ('c', + (68.66670227050781, 417.6669921875), + (74.55770111083984, 416.1940002441406), + (74.55770111083984, 423.35699462890625), + (68.66670227050781, 433.6669921875)), + ('l', + (68.66670227050781, 433.6669921875), + (95.33329772949219, 433.6669921875)), + ('c', + (95.33329772949219, 433.6669921875), + (100.66699981689453, 433.6669921875), + (100.66699981689453, 439.0), + (95.33329772949219, 439.0)), + ('l', (95.33329772949219, 439.0), (79.33329772949219, 439.0)), + ('l', (79.33329772949219, 439.0), (87.33329772949219, 439.0)), + ('c', + (87.33329772949219, 439.0), + (92.66670227050781, 439.0), + (92.66670227050781, 444.3330078125), + (87.33329772949219, 444.3330078125)), + ('l', + (87.33329772949219, 444.3330078125), + (79.33329772949219, 444.3330078125)), + ('l', + (79.33329772949219, 444.3330078125), + (84.66670227050781, 444.3330078125)), + ('c', + (84.66670227050781, 444.3330078125), + (90.0, 444.3330078125), + (90.0, 449.6669921875), + (84.66670227050781, 449.6669921875)), + ('l', + (84.66670227050781, 449.6669921875), + (79.33329772949219, 449.6669921875)), + ('l', + (79.33329772949219, 449.6669921875), + (83.33329772949219, 449.6669921875)), + ('c', + (83.33329772949219, 449.6669921875), + (88.66670227050781, 449.6669921875), + (88.66670227050781, 455.0), + (83.33329772949219, 455.0)), + ('l', (83.33329772949219, 455.0), (50.0, 455.0))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (50.0, 416.1940002441406, 100.66699981689453, 455.0), + 'seqno': 16, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 1.0}, + {'closePath': False, + 'color': (1.0, 0.0, 0.0), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (1.0, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('c', (75.0, 485.0), (62.5, 470.0), (50.0, 490.0), (75.0, 510.0)), + ('c', (75.0, 485.0), (87.5, 470.0), (100.0, 490.0), (75.0, 510.0)), + ('l', (75.0, 510.0), (75.0, 485.0))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (50.0, 470.0, 100.0, 510.0), + 'seqno': 18, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 1.0}, + {'closePath': False, + 'color': (0.9333329796791077, 0.8470590114593506, 0.6823530197143555), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (0.7215690016746521, 0.5254899859428406, 0.04313730075955391), + 'fill_opacity': 1.0, + 'items': [('re', + (56.52170181274414, + 547.753173828125, + 85.5072021484375, + 562.2459716796875), + 1)], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (56.52170181274414, + 547.753173828125, + 85.5072021484375, + 562.2459716796875), + 'seqno': 20, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 0.07246380299329758}, + {'closePath': False, + 'color': (0.9333329796791077, 0.8470590114593506, 0.6823530197143555), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (0.9333329796791077, 0.8470590114593506, 0.6823530197143555), + 'fill_opacity': 1.0, + 'items': [('l', + (56.52170181274414, 547.7540283203125), + (59.4202995300293, 550.6519775390625)), + ('l', + (59.4202995300293, 550.6519775390625), + (59.4202995300293, 559.3480224609375)), + ('l', + (59.4202995300293, 559.3480224609375), + (56.52170181274414, 562.2459716796875)), + ('l', + 
(85.5072021484375, 547.7540283203125), + (82.60870361328125, 550.6519775390625)), + ('l', + (82.60870361328125, 550.6519775390625), + (82.60870361328125, 559.3480224609375)), + ('l', + (82.60870361328125, 559.3480224609375), + (85.5072021484375, 562.2459716796875))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (56.52170181274414, + 547.7540283203125, + 85.5072021484375, + 562.2459716796875), + 'seqno': 22, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 0.07246380299329758}, + {'closePath': False, + 'color': (0.8039219975471497, 0.7294120192527771, 0.5882350206375122), + 'dashes': '[] 0', + 'items': [('l', + (59.4202995300293, 550.6519775390625), + (82.60870361328125, 550.6519775390625)), + ('l', + (59.4202995300293, 559.3480224609375), + (82.60870361328125, 559.3480224609375))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (59.4202995300293, + 550.6519775390625, + 82.60870361328125, + 559.3480224609375), + 'seqno': 24, + 'stroke_opacity': 1.0, + 'type': 's', + 'width': 0.07246380299329758}, + {'even_odd': False, + 'fill': (0.0, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('re', + (56.52170181274414, + 547.753173828125, + 63.76808166503906, + 562.2459716796875), + 1)], + 'layer': '', + 'rect': (56.52170181274414, + 547.753173828125, + 63.76808166503906, + 562.2459716796875), + 'seqno': 25, + 'type': 'f'}, + {'even_odd': False, + 'fill': (1.0, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('c', + (56.52170181274414, 547.7540283203125), + (47.82609939575195, 547.7540283203125), + (47.82609939575195, 562.2459716796875), + (56.52170181274414, 562.2459716796875))], + 'layer': '', + 'rect': (47.82609939575195, + 547.7540283203125, + 56.52170181274414, + 562.2459716796875), + 'seqno': 26, + 'type': 'f'}, + {'even_odd': False, + 'fill': (0.9333329796791077, 0.8470590114593506, 0.6823530197143555), + 'fill_opacity': 1.0, + 'items': [('l', (85.5072021484375, 547.7540283203125), (100.0, 555.0)), + ('l', (100.0, 555.0), (85.5072021484375, 562.2459716796875))], + 'layer': '', + 'rect': (85.5072021484375, 547.7540283203125, 100.0, 562.2459716796875), + 'seqno': 27, + 'type': 'f'}, + {'closePath': False, + 'color': (0.7215690016746521, 0.5254899859428406, 0.04313730075955391), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (0.7215690016746521, 0.5254899859428406, 0.04313730075955391), + 'fill_opacity': 1.0, + 'items': [('c', + (85.5072021484375, 547.7540283203125), + (86.30770111083984, 548.553955078125), + (85.00990295410156, 549.8519897460938), + (82.60870361328125, 550.6519775390625)), + ('l', + (82.60870361328125, 550.6519775390625), + (85.5072021484375, 547.7540283203125))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (82.60870361328125, + 547.7540283203125, + 86.30770111083984, + 550.6519775390625), + 'seqno': 28, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 0.07246380299329758}, + {'closePath': False, + 'color': (0.7215690016746521, 0.5254899859428406, 0.04313730075955391), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (0.7215690016746521, 0.5254899859428406, 0.04313730075955391), + 'fill_opacity': 1.0, + 'items': [('c', + (82.60870361328125, 550.6519775390625), + (87.2510986328125, 553.052978515625), + (87.2510986328125, 556.947021484375), + (82.60870361328125, 559.3480224609375)), + ('l', + (82.60870361328125, 559.3480224609375), + (82.60870361328125, 550.6519775390625))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (82.60870361328125, + 550.6519775390625, + 87.2510986328125, + 
559.3480224609375), + 'seqno': 30, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 0.07246380299329758}, + {'closePath': False, + 'color': (0.7215690016746521, 0.5254899859428406, 0.04313730075955391), + 'dashes': '[] 0', + 'even_odd': False, + 'fill': (0.7215690016746521, 0.5254899859428406, 0.04313730075955391), + 'fill_opacity': 1.0, + 'items': [('c', + (82.60870361328125, 559.3480224609375), + (85.00990295410156, 560.1480102539062), + (86.30770111083984, 561.446044921875), + (85.5072021484375, 562.2459716796875)), + ('l', + (85.5072021484375, 562.2459716796875), + (82.60870361328125, 559.3480224609375))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (82.60870361328125, + 559.3480224609375, + 86.30770111083984, + 562.2459716796875), + 'seqno': 32, + 'stroke_opacity': 1.0, + 'type': 'fs', + 'width': 0.07246380299329758}, + {'even_odd': False, + 'fill': (0.0, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('l', (94.2029037475586, 552.1010131835938), (100.0, 555.0)), + ('l', (100.0, 555.0), (94.2029037475586, 557.8989868164062)), + ('c', + (94.2029037475586, 552.1010131835938), + (92.60209655761719, 553.7020263671875), + (92.60209655761719, 556.2979736328125), + (94.2029037475586, 557.8989868164062))], + 'layer': '', + 'rect': (92.60209655761719, 552.1010131835938, 100.0, 557.8989868164062), + 'seqno': 34, + 'type': 'f'}, + {'closePath': False, + 'color': (0.7215690016746521, 0.5254899859428406, 0.04313730075955391), + 'dashes': '[] 0', + 'items': [('l', + (85.5072021484375, 547.7540283203125), + (82.60870361328125, 550.6519775390625)), + ('l', + (82.60870361328125, 550.6519775390625), + (82.60870361328125, 559.3480224609375)), + ('l', + (82.60870361328125, 559.3480224609375), + (85.5072021484375, 562.2459716796875))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (82.60870361328125, + 547.7540283203125, + 85.5072021484375, + 562.2459716796875), + 'seqno': 35, + 'stroke_opacity': 1.0, + 'type': 's', + 'width': 0.07246380299329758}, + {'closePath': False, + 'color': (0.0, 0.0, 0.0), + 'dashes': '[] 0', + 'items': [('l', + (63.76810073852539, 547.7540283203125), + (85.5072021484375, 547.7540283203125)), + ('l', (85.5072021484375, 547.7540283203125), (100.0, 555.0)), + ('l', (100.0, 555.0), (85.5072021484375, 562.2459716796875)), + ('l', + (85.5072021484375, 562.2459716796875), + (63.76810073852539, 562.2459716796875))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (63.76810073852539, 547.7540283203125, 100.0, 562.2459716796875), + 'seqno': 36, + 'stroke_opacity': 1.0, + 'type': 's', + 'width': 1.0}, + {'even_odd': False, + 'fill': (0.0, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('re', + (65.94200134277344, + 552.826171875, + 73.18838500976562, + 557.1740112304688), + 1), + ('c', + (73.18840026855469, 552.8259887695312), + (75.18939971923828, 554.0269775390625), + (75.18939971923828, 555.9730224609375), + (73.18840026855469, 557.1740112304688)), + ('c', + (65.94200134277344, 552.8259887695312), + (63.941001892089844, 554.0269775390625), + (63.941001892089844, 555.9730224609375), + (65.94200134277344, 557.1740112304688))], + 'layer': '', + 'rect': (63.941001892089844, + 552.826171875, + 75.18939971923828, + 557.1740112304688), + 'seqno': 37, + 'type': 'f'}, + {'closePath': False, + 'color': (1.0, 1.0, 1.0), + 'dashes': '[] 0', + 'items': [('l', + (58.937198638916016, 548.47802734375), + (58.937198638916016, 561.52197265625)), + ('l', + (61.352699279785156, 548.47802734375), + (61.352699279785156, 561.52197265625)), + 
('l', + (61.352699279785156, 561.52197265625), + (61.352699279785156, 548.47802734375))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (58.937198638916016, + 548.47802734375, + 61.352699279785156, + 561.52197265625), + 'seqno': 38, + 'stroke_opacity': 1.0, + 'type': 's', + 'width': 1.1594200134277344}, + {'closePath': False, + 'even_odd': False, + 'fill': (1.0, 1.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('c', + (50.0, 615.0), + (50.0, 628.8070068359375), + (61.192901611328125, 640.0), + (75.0, 640.0)), + ('c', + (75.0, 640.0), + (88.80709838867188, 640.0), + (100.0, 628.8070068359375), + (100.0, 615.0)), + ('c', + (100.0, 615.0), + (100.0, 601.1929931640625), + (88.80709838867188, 590.0), + (75.0, 590.0)), + ('c', + (75.0, 590.0), + (61.192901611328125, 590.0), + (50.0, 601.1929931640625), + (50.0, 615.0))], + 'layer': '', + 'rect': (50.0, 590.0, 100.0, 640.0), + 'seqno': 39, + 'type': 'f'}, + {'closePath': False, + 'even_odd': False, + 'fill': (0.0, 0.0, 0.0), + 'fill_opacity': 1.0, + 'items': [('c', + (60.0, 608.75), + (60.0, 612.2020263671875), + (62.23860168457031, 615.0), + (65.0, 615.0)), + ('c', + (65.0, 615.0), + (67.76139831542969, 615.0), + (70.0, 612.2020263671875), + (70.0, 608.75)), + ('c', + (70.0, 608.75), + (70.0, 605.2979736328125), + (67.76139831542969, 602.5), + (65.0, 602.5)), + ('c', + (65.0, 602.5), + (62.23860168457031, 602.5), + (60.0, 605.2979736328125), + (60.0, 608.75)), + ('c', + (80.0, 608.75), + (80.0, 612.2020263671875), + (82.23860168457031, 615.0), + (85.0, 615.0)), + ('c', + (85.0, 615.0), + (87.76139831542969, 615.0), + (90.0, 612.2020263671875), + (90.0, 608.75)), + ('c', + (90.0, 608.75), + (90.0, 605.2979736328125), + (87.76139831542969, 602.5), + (85.0, 602.5)), + ('c', + (85.0, 602.5), + (82.23860168457031, 602.5), + (80.0, 605.2979736328125), + (80.0, 608.75))], + 'layer': '', + 'rect': (60.0, 602.5, 90.0, 615.0), + 'seqno': 40, + 'type': 'f'}, + {'closePath': False, + 'color': (0.0, 0.0, 0.0), + 'dashes': '[] 0', + 'items': [('c', + (60.0, 624.375), + (68.2843017578125, 633.0040283203125), + (81.7156982421875, 633.0040283203125), + (90.0, 624.375))], + 'layer': '', + 'lineCap': (0, 0, 0), + 'lineJoin': 0.0, + 'rect': (60.0, 624.375, 90.0, 633.0040283203125), + 'seqno': 41, + 'stroke_opacity': 1.0, + 'type': 's', + 'width': 1.0}] diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-2333.pdf Binary file tests/resources/test-2333.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-2462.pdf Binary file tests/resources/test-2462.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-2812.pdf Binary file tests/resources/test-2812.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-3143.pdf Binary file tests/resources/test-3143.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-3150.pdf Binary file tests/resources/test-3150.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-3207.pdf Binary file tests/resources/test-3207.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-3591.pdf Binary file tests/resources/test-3591.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-3820.pdf Binary file tests/resources/test-3820.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-4055.pdf Binary file tests/resources/test-4055.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-4503.pdf Binary file tests/resources/test-4503.pdf has changed diff -r 
000000000000 -r 1d09e1dec1d9 tests/resources/test-707448.pdf Binary file tests/resources/test-707448.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-707673.pdf Binary file tests/resources/test-707673.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-E+A.pdf Binary file tests/resources/test-E+A.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-linebreaks.pdf Binary file tests/resources/test-linebreaks.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-rewrite-images.pdf Binary file tests/resources/test-rewrite-images.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test-styled-table.pdf Binary file tests/resources/test-styled-table.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test2093.pdf Binary file tests/resources/test2093.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test2182.pdf Binary file tests/resources/test2182.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test2238.pdf --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test2238.pdf Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,68 @@ +Here should not be anything - it is not correct +But some readers can still read it +---------------------------------- +__%PDF-1.1 +1 0 obj + << + /Type /Catalog + /Pages 2 0 R + >> +endobj +2 0 obj + << + /Type /Pages + /Kids [3 0 R] + /Count 1 + /MediaBox [0 0 300 144] + >> +endobj +3 0 obj + << + /Type /Page + /Parent 2 0 R + /Resources + << + /Font + << + /F1 + << + /Type /Font + /Subtype /Type1 + /BaseFont /Times-Roman + >> + >> + >> + /Contents 4 0 R + >> +endobj +4 0 obj + << + /Length 55 + >> + stream + BT + /F1 18 Tf + 0 0 Td + (Hello World) Tj + ET + endstream +endobj +xref +0 5 +0000000000 65535 f +0000000107 00000 n +0000000162 00000 n +0000000253 00000 n +0000000456 00000 n +trailer + << + /Root 1 0 R + /Size 5 + >> +startxref +564 +%%EOF +----------------------------------- +Here should not be anything as well +But some readers can still read it +42 diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_1645_expected.pdf Binary file tests/resources/test_1645_expected.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_1824.pdf Binary file tests/resources/test_1824.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2108.pdf Binary file tests/resources/test_2108.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2270.pdf Binary file tests/resources/test_2270.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2533.pdf Binary file tests/resources/test_2533.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2548.pdf Binary file tests/resources/test_2548.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2553-2.pdf Binary file tests/resources/test_2553-2.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2553.pdf Binary file tests/resources/test_2553.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2596.pdf Binary file tests/resources/test_2596.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2608_expected --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_2608_expected Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,14 @@ +No significant gamma-ray excess above the expected background +is detected from the direction of FRB 20171019A, with 52 gamma +candidate events from the 
source region and 524 background event. +A second analysis using an independent event calibration and recon- +struction (Parsons & Hinton 2014) confirms this result. A search for +variable emission on timescales ranging from milliseconds to sev- +eral minutes with tools provided in (Brun et al. 2020) does not reveal +any variability above 2.2 𝜎. For the total data set of 1.8 h, 95% confi- +dence level (C. L.) upper limits on the photon flux are derived using +the method described by Rolke et al. (2005). The energy threshold +of the data is highly dependent on the zenith angle of the observa- +tions. For these observations, the zenith angles range from 15 to 25 +deg, which leads to an energy threshold for the stacked data set of +𝐸th = 120 GeV. The upper limit on the Very High Energy (VHE) diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2608_expected_1.26 --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_2608_expected_1.26 Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,10 @@ +No significant gamma-ray excess above the expected background +is detected from the direction of FRB 20171019A, with 52 gamma +candidate events from the source region and 524 background event. +A second analysis using an independent event calibration and reconstruction (Parsons & Hinton 2014) confirms this result. A search for +variable emission on timescales ranging from milliseconds to several minutes with tools provided in (Brun et al. 2020) does not reveal +any variability above 2.2 𝜎. For the total data set of 1.8 h, 95% confidence level (C. L.) upper limits on the photon flux are derived using +the method described by Rolke et al. (2005). The energy threshold +of the data is highly dependent on the zenith angle of the observations. For these observations, the zenith angles range from 15 to 25 +deg, which leads to an energy threshold for the stacked data set of +𝐸th = 120 GeV. 
The upper limit on the Very High Energy (VHE) diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2634.pdf Binary file tests/resources/test_2634.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2635.pdf Binary file tests/resources/test_2635.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2645_1.pdf Binary file tests/resources/test_2645_1.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2645_2.pdf Binary file tests/resources/test_2645_2.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2645_3.pdf Binary file tests/resources/test_2645_3.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2710.pdf Binary file tests/resources/test_2710.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2730.pdf Binary file tests/resources/test_2730.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2742.pdf --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_2742.pdf Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,360 @@ +%PDF-1.4 +% ReportLab Generated PDF document http://www.reportlab.com +1 0 obj +<< +/F1 2 0 R /F2 3 0 R +>> +endobj +2 0 obj +<< +/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font +>> +endobj +3 0 obj +<< +/BaseFont /Times-Roman /Encoding /WinAnsiEncoding /Name /F2 /Subtype /Type1 /Type /Font +>> +endobj +4 0 obj +<< +/Contents 23 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +5 0 obj +<< +/Contents 24 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +6 0 obj +<< +/Contents 25 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +7 0 obj +<< +/Contents 26 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +8 0 obj +<< +/Contents 27 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +9 0 obj +<< +/Contents 28 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +10 0 obj +<< +/Contents 29 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +11 0 obj +<< +/Contents 30 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +12 0 obj +<< +/Contents 31 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +13 0 obj +<< +/Contents 32 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + 
/Type /Page +>> +endobj +14 0 obj +<< +/Contents 33 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +15 0 obj +<< +/Contents 34 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +16 0 obj +<< +/Contents 35 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +17 0 obj +<< +/Contents 36 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +18 0 obj +<< +/Contents 37 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +19 0 obj +<< +/Contents 38 0 R /MediaBox [ 0 0 421.1008 597.5079 ] /Parent 22 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +20 0 obj +<< +/PageMode /UseNone /Pages 22 0 R /Type /Catalog +>> +endobj +21 0 obj +<< +/Author (Generated using dummypdf \204 http://framagit.org/spalax/dummypdf) /CreationDate (D:20231014190918+00'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:20231014190918+00'00') /Producer (ReportLab PDF Library - www.reportlab.com) + /Subject (unspecified) /Title (Dummy pdf) /Trapped /False +>> +endobj +22 0 obj +<< +/Count 16 /Kids [ 4 0 R 5 0 R 6 0 R 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R + 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R ] /Type /Pages +>> +endobj +23 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 222 +>> +stream +GarnQ]++lc&FB4M.FeLQ"V,3?eAb,V!u"Cf=P3EG0#ThI'!hmtGq>@,(`@)O\H&+Xoo;:rHiZ7pO:TuUV)^qV#s,;7$$mFJ3037q+*63fO-b0CAI9D'9u]SHgi3sO#0^2A]_3\J?8UV0oUUefg$&U9d2PQ`VX.F6i(6l#eq#B0XTWG"[tRqj8_3endstream +endobj +24 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 222 +>> +stream +GarnQ]++lc&FB4M.FeLQ"V,3?eAb,V!u"Cf=P3EG0#ThI'!hmtGq>@,(`@)O\H&+Xoo;:rHiZ7pO:TuUV)^qV#s,;7$$mFJ3037q+*63fO-b0CAI9D'9u]SHgi3sO#0^2A]_3\J?8UV0oUUefg$&U9d2PQ`VX.F6i(6l#eq#B0XTWG"[tRqj8_3endstream +endobj +25 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 222 +>> +stream +GarnQ]++lc&FB4M.FeLQ"V,3?eAb,V!u"Cf=P3EG0#ThI'!hmtGq>@,(`@)O\H&+Xoo;:rHiZ7pO:TuUV)^qV#s,;7$$mFJ3037q+*63fO-b0CAI9D'9u]SHgi3sO#0^2A]_3\J?8UV0oUUefg$&U9d2PQ`VX.F6i(6l#eq#B0XTWG"[tRqj8_3endstream +endobj +26 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 222 +>> +stream +GarnQ]++lc&FB4M.FeLQ"V,3?eAb,V!u"Cf=P3EG0#ThI'!hmtGq>@,(`@)O\H&+Xoo;:rHiZ7pO:TuUV)^qV#s,;7$$mFJ3037q+*63fO-b0CAI9D'9u]SHgi3sO#0^2A]_3\J?8UV0oUUefg$&U9d2PQ`VX.F6i(6l#eq#B0XTWG"[tRqj8_3endstream +endobj +27 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 222 +>> +stream +GarnQ]++lc&FB4M.FeK&?k6pFeAb,V!u"Cf=P3EG0#ThI'!hmtGq>@,(`@)O\H&+Xoo;9Qq#^-@,9O/79%soj,NrOg-g5!C_)%\7J';o9(bbQ4Ppu3e1`6tgEQ:Z/)),P\Z"=YI0:MASqG(RNg$&T3UGn-J:"ceK_3:b%XTM]?>7&ToD[Wg^PHendstream +endobj +28 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 222 +>> +stream 
+GarnQ]++lc&FB4M.FeLQ"V,3?eAb,V!u"Cf=P3EG0#ThI'!hmtGq>@,(`@)O\H&+Xoo;:rHiZ7pO:TuUV)^qV#s,;7$$mFJ3037q+*63fO-b0CAI9D'9u]SHgi3sO#0^2A]_3\J?8UV0oUUefg$&U9d2PQ`VX.F6i(6l#eq#B0XTWG"[tRqj8_3endstream +endobj +29 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 222 +>> +stream +GarnQ]++lc&FB4M.FeLQ"V,3?eAb,V!u"Cf=P3EG0#ThI'!hmtGq>@,(`@)O\H&+Xoo;:rHiZ7pO:TuUV)^qV#s,;7$$mFJ3037q+*63fO-b0CAI9D'9u]SHgi3sO#0^2A]_3\J?8UV0oUUefg$&U9d2PQ`VX.F6i(6l#eq#B0XTWG"[tRqj8_3endstream +endobj +30 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 222 +>> +stream +GarnQ]++lc&FB4M.FeLQ"V,3?eAb,V!u"Cf=P3EG0#ThI'!hmtGq>@,(`@)O\H&+Xoo;:rHiZ7pO:TuUV)^qV#s,;7$$mFJ3037q+*63fO-b0CAI9D'9u]SHgi3sO#0^2A]_3\J?8UV0oUUefg$&U9d2PQ`VX.F6i(6l#eq#B0XTWG"[tRqj8_3endstream +endobj +31 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 222 +>> +stream +GarnQ]++lc&FB4M.FeLQ"V,3?eAb,V!u"Cf=P3EG0#ThI'!hmtGq>@,(`@)O\H&+Xoo;:rHiZ7pO:TuUV)^qV#s,;7$$mFJ3037q+*63fO-b0CAI9D'9u]SHgi3sO#0^2A]_3\J?8UV0oUUefg$&U9d2PQ`VX.F6i(6l#eq#B0XTWG"[tRqj8_3endstream +endobj +32 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 221 +>> +stream +GarnQ]++lc&FB4M.FeLQ4ZK)l;&_%I%nnf7@U\KK'!hmtGq>@,(`B@:3endstream +endobj +33 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 221 +>> +stream +GarnQ]++lc&FB4M.FeLQ4ZK)l;&_%I%nnf7@U\KK'!hmtGq>@,(`B@:3endstream +endobj +34 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 221 +>> +stream +GarnQ]++lc&FB4M.FeLQ4ZK)l;&_%I%nnf7@U\KK'!hmtGq>@,(`B@:3endstream +endobj +35 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 221 +>> +stream +GarnQ]++lc&FB4M.FeLQ"V,3?eAb,V!u"Cf=P3EG0#ThI'!hmtGq>@,(`@)O\H&+Xoo;:rHiZ7pO:TuUV)^qV#s,;7$$mFJ3037q+*63fO-b0CAI9D'9u]SHgi3sO#0^2A]_3\J?8UV0oUUefg$&U9d2PQ`VX.F6i(6l#eq#B0XTWG"[tRqj8_3endstream +endobj +36 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 222 +>> +stream +GarnQ]*\To&FAj9VGrJ@h:l1t<_s_cbo[2u@Ou?nI5eu(c[[PlA0)\%#Zc`GH-X2P@D*ls,@*QD"R1GRl4!P44i)Ad&n-NG]f8-Xs-Vi\F*MgBIKd2?SFCdf#/5CmqcB[0AM9Vop68XdH&iN!7<9l%ZPp$dcQf+tZF\~>endstream +endobj +37 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 221 +>> +stream +GarnQ]++lc&FB4M.FeLQ4ZK)l;&_%I%nnf7@U\KK'!hmtGq>@,(`B@:3endstream +endobj +38 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 220 +>> +stream +GarnQ]*\To&FB!';lW:1mcE+/C1F+9"$o53KoNk9-@"\kOG/A3A&A_kaA=tE"k[@eG6DC9$cm(@I)%WjbotelTE@9FTY9Mr*M1l4b]0g+?KqQW.Bo$n)&)+Oa%>2,0OnC/g6n%Kd=`=`G9[#tV[o.aD]%/rZ=?U]-V\/V#I5bd&r%QY>DP/+AF4965tpq8dO29(3f0nF6%4Z<6XJ:H:i@OL9)&~>endstream +endobj +xref +0 39 +0000000000 65535 f +0000000073 00000 n +0000000114 00000 n +0000000221 00000 n +0000000330 00000 n +0000000535 00000 n +0000000740 00000 n +0000000945 00000 n +0000001150 00000 n +0000001355 00000 n +0000001560 00000 n +0000001766 00000 n +0000001972 00000 n +0000002178 00000 n +0000002384 00000 n +0000002590 00000 n +0000002796 00000 n +0000003002 00000 n +0000003208 00000 n +0000003414 00000 n +0000003620 00000 n +0000003690 00000 n +0000004044 00000 n +0000004208 00000 n +0000004521 00000 n +0000004834 00000 n +0000005147 00000 n +0000005460 00000 n +0000005773 00000 n +0000006086 00000 n +0000006399 00000 n +0000006712 00000 n +0000007025 00000 n +0000007337 00000 n +0000007649 00000 n +0000007961 00000 n +0000008273 00000 n +0000008586 00000 n +0000008898 00000 n +trailer +<< +/ID +[<86fda58a19e58a1296a6b887cfc81b23><86fda58a19e58a1296a6b887cfc81b23>] +% ReportLab generated PDF document -- digest (http://www.reportlab.com) + +/Info 21 0 R +/Root 20 0 R +/Size 39 +>> +startxref +9209 +%%EOF diff -r 
000000000000 -r 1d09e1dec1d9 tests/resources/test_2788.pdf --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_2788.pdf Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,177 @@ +%PDF-1.7 +%µ¶ + +1 0 obj +<< + /Type /Catalog + /Pages 2 0 R + /Names << + /Dests 11 0 R + >> + /Outlines 13 0 R +>> +endobj + +2 0 obj +<< + /Type /Pages + /Count 2 + /Kids [ 4 0 R 8 0 R ] +>> +endobj + +3 0 obj +<< + /Font << + /helv 5 0 R + >> +>> +endobj + +4 0 obj +<< + /Type /Page + /MediaBox [ 0 0 595 842 ] + /Rotate 0 + /Resources 3 0 R + /Parent 2 0 R + /Contents [ 6 0 R ] + /Annots [ 12 0 R ] +>> +endobj + +5 0 obj +<< + /Type /Font + /Subtype /Type1 + /BaseFont /Helvetica + /Encoding /WinAnsiEncoding +>> +endobj + +6 0 obj +<< + /Length 77 +>> +stream + +q +BT +1 0 0 1 100 742 Tm +/helv 11 Tf [<54657874206f6e20706167652031>]TJ +ET +Q + +endstream +endobj + +7 0 obj +<< + /Font << + /helv 5 0 R + >> +>> +endobj + +8 0 obj +<< + /Type /Page + /MediaBox [ 0 0 595 842 ] + /Rotate 0 + /Resources 7 0 R + /Parent 2 0 R + /Contents [ 9 0 R ] +>> +endobj + +9 0 obj +<< + /Length 77 +>> +stream + +q +BT +1 0 0 1 100 742 Tm +/helv 11 Tf [<54657874206f6e20706167652032>]TJ +ET +Q + +endstream +endobj + +10 0 obj +<< + /D [ 8 0 R /XYZ 100 760 null ] +>> +endobj + +11 0 obj +<< + /Names [ (page.2) 10 0 R ] +>> +endobj + +12 0 obj +<< + /A << + /S /GoTo + /D (page.2) + /Type /Action + >> + /Rect [ 100 738.711 121.395 753.825 ] + /BS << + /W 0 + >> + /Subtype /Link + /NM (fitz-L0) +>> +endobj + +13 0 obj +<< + /Count 1 + /First 14 0 R + /Last 14 0 R + /Type /Outlines +>> +endobj + +14 0 obj +<< + /A << + /S /GoTo + /D (page.2) + >> + /Parent 13 0 R + /Title (page2) +>> +endobj + +xref +0 15 +0000000000 00003 f +0000000016 00000 n +0000000124 00000 n +0000000196 00000 n +0000000250 00000 n +0000000404 00000 n +0000000510 00000 n +0000000640 00000 n +0000000694 00000 n +0000000827 00000 n +0000000957 00000 n +0000001013 00000 n +0000001065 00000 n +0000001244 00000 n +0000001327 00000 n + +trailer +<< + /Size 15 + /Root 1 0 R + /ID [ <45826EEEA04C856402259BC94B65F54B> <54474D7FAA6876109B94EF0F4B28E781> ] +>> +startxref +1441 +%%EOF diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2791_content.pdf Binary file tests/resources/test_2791_content.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2791_coverpage.pdf Binary file tests/resources/test_2791_coverpage.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2861.pdf Binary file tests/resources/test_2861.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2871.pdf Binary file tests/resources/test_2871.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2885.pdf Binary file tests/resources/test_2885.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2904.pdf Binary file tests/resources/test_2904.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2907.pdf Binary file tests/resources/test_2907.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2954.pdf Binary file tests/resources/test_2954.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2957_1.pdf Binary file tests/resources/test_2957_1.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2957_2.pdf --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_2957_2.pdf Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,68 @@ +%PDF-1.7 +%µ¶ + +1 0 obj +<> +endobj + +2 0 obj +<> +endobj + +3 0 obj 
+<>>>/Parent 2 0 R/Contents 5 0 R/Annots[6 0 R]>> +endobj + +4 0 obj +<> +endobj + +5 0 obj +<> +stream +q +BT +/helv 11 Tf +1 0 0 1 100 742 Tm +(This is some longer text for testing.) Tj +ET +Q +q +Q + +endstream +endobj + +6 0 obj +<>/NM(fitz-A0)>> +endobj + +7 0 obj +<> +stream +1 0 0 RG +165.78998 739.711 m +194.35897 739.711 l +194.35897 752.825 l +165.78998 752.825 l +s + +endstream +endobj + +xref +0 8 +0000000000 65536 f +0000000016 00000 n +0000000062 00000 n +0000000114 00000 n +0000000252 00000 n +0000000341 00000 n +0000000478 00000 n +0000000606 00000 n + +trailer +<]>> +startxref +834 +%%EOF diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2969.pdf Binary file tests/resources/test_2969.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_2979.pdf Binary file tests/resources/test_2979.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3050_expected.png Binary file tests/resources/test_3050_expected.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3058.pdf Binary file tests/resources/test_3058.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3062.pdf Binary file tests/resources/test_3062.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3070.pdf Binary file tests/resources/test_3070.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3072.pdf Binary file tests/resources/test_3072.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3087.pdf Binary file tests/resources/test_3087.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3179.pdf Binary file tests/resources/test_3179.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3186.pdf Binary file tests/resources/test_3186.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3197.pdf Binary file tests/resources/test_3197.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3357.pdf Binary file tests/resources/test_3357.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3362.pdf Binary file tests/resources/test_3362.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3376.pdf Binary file tests/resources/test_3376.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3448.pdf Binary file tests/resources/test_3448.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3448.pdf-expected.png Binary file tests/resources/test_3448.pdf-expected.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3450.pdf Binary file tests/resources/test_3450.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3493.epub Binary file tests/resources/test_3493.epub has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3569.pdf --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_3569.pdf Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,609 @@ +%PDF-1.4 +%µ¶ + +1 0 obj +<> +endobj + +2 0 obj +<> +endobj + +3 0 obj +<> +endobj + +4 0 obj +<>/Properties<>/Font<>>>/Rotate 270/VP[]>> +endobj + +5 0 obj +<> +endobj + +6 0 obj +<> +endobj + +7 0 obj +<> +endobj + +8 0 obj +<> +endobj + +9 0 obj +<> +endobj + +10 0 obj +<> +endobj + +11 0 obj +<> +endobj + +12 0 obj +<> +endobj + +13 0 obj +<> +endobj + +14 0 obj +<> +endobj + +15 0 obj +<> +endobj + +16 0 obj +<> +endobj + +17 0 obj +<> +endobj + +18 0 obj +<> +endobj + +19 0 obj +<> +endobj + +20 0 obj +<> 
+endobj + +21 0 obj +<> +endobj + +22 0 obj +<> +endobj + +23 0 obj +<> +endobj + +24 0 obj +<> +endobj + +25 0 obj +<> +endobj + +26 0 obj +<> +endobj + +27 0 obj +<> +endobj + +28 0 obj +<> +endobj + +29 0 obj +<> +endobj + +30 0 obj +<> +endobj + +31 0 obj +<> +endobj + +32 0 obj +<> +endobj + +33 0 obj +<> +endobj + +34 0 obj +<> +endobj + +35 0 obj +<> +endobj + +36 0 obj +<> +endobj + +37 0 obj +<> +endobj + +38 0 obj +<> +endobj + +39 0 obj +<> +endobj + +40 0 obj +<> +endobj + +41 0 obj +<> +endobj + +42 0 obj +<> +endobj + +43 0 obj +<> +endobj + +44 0 obj +<> +endobj + +45 0 obj +<> +endobj + +46 0 obj +<> +endobj + +47 0 obj +<> +endobj + +48 0 obj +<> +endobj + +49 0 obj +<> +endobj + +50 0 obj +<> +endobj + +51 0 obj +<> +endobj + +52 0 obj +<> +endobj + +53 0 obj +<> +endobj + +56 0 obj +<> +endobj + +57 0 obj +<> +endobj + +59 0 obj +<> +endobj + +60 0 obj +<> +endobj + +61 0 obj +<> +endobj + +62 0 obj +<> +endobj + +63 0 obj +<> +endobj + +64 0 obj +<> +endobj + +65 0 obj +<> +endobj + +66 0 obj +<> +endobj + +67 0 obj +<> +endobj + +68 0 obj +<> +endobj + +69 0 obj +<>/FontDescriptor 72 0 R/W 71 0 R>>]/Encoding/Identity-H/ToUnicode 70 0 R>> +endobj + +70 0 obj +<> +stream +/CIDInit /ProcSet findresource begin +12 dict begin +begincmap + /CIDSystemInfo + +<< /Registry (\(Adobe\)) + /Ordering (\(UCS\)) + /Supplement 0 + +>> def + /CMapName /Adobe-Identity-UCS (def) + /CMapType (2) def +1 begincodespacerange +<0000> +endcodespacerange +50 beginbfchar +<000d> <002a> +<002f> <004c> +<0014> <0031> +<0010> <002d> +<0016> <0033> +<0015> <0032> +<0013> <0030> +<0003> <0020> +<002e> <004b> +<0039> <0056> +<0024> <0041> +<0035> <0052> +<0028> <0045> +<0036> <0053> +<002c> <0049> +<0027> <0044> +<0031> <004e> +<0037> <0054> +<0038> <0055> +<0025> <0042> +<0032> <004f> +<0030> <004d> +<003a> <0057> +<0026> <0043> +<002b> <0048> +<000b> <0028> +<0029> <0046> +<000c> <0029> +<0033> <0050> +<0017> <0034> +<000a> <0027> +<0005> <0022> +<001c> <0039> +<0019> <0036> +<001b> <0038> +<0018> <0035> +<001a> <0037> +<000f> <002c> +<002a> <0047> +<003c> <0059> +<0011> <002e> +<001d> <003a> +<0012> <002f> +<0020> <003d> +<003b> <0058> +<005b> <0078> +<0009> <0026> +<003d> <005a> +<0006> <0023> +<0056> <0073> + +endbfchar + +endcmap +CMapName currentdict /CMap defineresource pop +end +endstream +endobj + +71 0 obj +[13[389]47[556]20[556]16[333]22[556]21[556]19[556]3[278]46[667]57[667]36[667]53[722]40[667]54[667]44[278]39[722]49[722]55[611]56[722]37[667]50[778]48[833]58[944]38[722]43[722]11[333]41[611]12[333]51[667]23[556]10[191]5[355]28[556]25[556]27[556]24[556]26[556]15[278]42[778]60[667]17[278]29[278]18[278]32[584]59[667]91[500]9[667]61[611]6[556]86[500]] +endobj + +72 0 obj +<> +endobj + +74 0 obj +<>/FontDescriptor 77 0 R/W 76 0 R>>]/Encoding/Identity-H/ToUnicode 75 0 R>> +endobj + +75 0 obj +<> +stream +/CIDInit /ProcSet findresource begin +12 dict begin +begincmap + /CIDSystemInfo + +<< /Registry (\(Adobe\)) + /Ordering (\(UCS\)) + /Supplement 0 + +>> def + /CMapName /Adobe-Identity-UCS (def) + /CMapType (2) def +1 begincodespacerange +<0000> +endcodespacerange +42 beginbfchar +<0028> <0045> +<002f> <004c> +<0026> <0043> +<0037> <0054> +<0035> <0052> +<002c> <0049> +<0024> <0041> +<0003> <0020> +<0032> <004f> +<0030> <004d> +<0013> <0030> +<0016> <0033> +<0014> <0031> +<0027> <0044> +<0031> <004e> +<0012> <002f> +<001b> <0038> +<002b> <0048> +<0025> <0042> +<0017> <0034> +<0015> <0032> +<0011> <002e> +<001c> <0039> +<003a> <0057> +<002a> <0047> +<0038> <0055> +<0033> <0050> +<004c> <0069> 
+<0018> <0035> +<0036> <0053> +<000f> <002c> +<001a> <0037> +<0029> <0046> +<002e> <004b> +<0010> <002d> +<003c> <0059> +<002d> <004a> +<001d> <003a> +<0019> <0036> +<0039> <0056> +<0008> <0025> +<0009> <0026> + +endbfchar + +endcmap +CMapName currentdict /CMap defineresource pop +end +endstream +endobj + +76 0 obj +[40[547]47[456]38[592]55[501]53[592]44[228]36[547]3[228]50[638]48[683]19[456]22[456]20[456]39[592]49[592]18[228]27[456]43[592]37[547]23[456]21[456]17[228]28[456]58[774]42[638]56[592]51[547]76[182]24[456]54[547]15[228]26[456]41[501]46[547]16[273]60[547]45[410]29[228]25[456]57[547]8[729]9[547]] +endobj + +77 0 obj +<> +endobj + +79 0 obj +<> +endobj + +80 0 obj +<> +stream +q + .06 0 0 .06 0 0 cm + 25432 10909 m + 29692 10909 l + 29692 15642 l + 25432 15642 l + 25432 10909 l + W + n + 0 0 0 rg + 0 G + BT + /F2 174.644 Tf + /GT255 gs + /OC/oc2 BDC + 0 G + 0 -1 1 0 28538 14909 Tm + <000d000d002f0014001000140016>Tj + ET +Q +q + .06 0 0 .06 0 0 cm + 28526 38017 m + 31807 40376 l + 31807 40379 l + 31312 41314 l + 31312 42889 l + 28202 42889 l + 25092 42888 l + 25092 42887 l + 28524 38017 l + 28526 38017 l + W + n + EMC + q + /OC/oc833 BDC + .49804 .49804 .49804 rg + .49804 .49804 .49804 RG + 0 w + 31130 41483 m + 31130 42083 l + 30530 41483 l + h + 31130 42083 m + 30530 41483 l + 30530 42083 l + h + b + Q + .49804 .49804 .49804 RG + 1 J + 1 j + 9 w + 30530 41483 m + 31130 41483 l + 31130 42083 l + 30530 42083 l + 30530 41483 l + S + EMC +Q + + +endstream +endobj + +82 0 obj +<> +endobj + +83 0 obj +<>]>> +endobj + +xref +0 84 +0000000054 65536 f +0000000016 00000 n +0000000184 00000 n +0000000248 00000 n +0000000301 00000 n +0000001267 00000 n +0000001324 00000 n +0000001376 00000 n +0000001433 00000 n +0000001485 00000 n +0000001537 00000 n +0000001590 00000 n +0000001643 00000 n +0000001696 00000 n +0000001754 00000 n +0000001812 00000 n +0000001865 00000 n +0000001923 00000 n +0000001985 00000 n +0000002043 00000 n +0000002114 00000 n +0000002167 00000 n +0000002225 00000 n +0000002283 00000 n +0000002336 00000 n +0000002389 00000 n +0000002442 00000 n +0000002495 00000 n +0000002548 00000 n +0000002606 00000 n +0000002664 00000 n +0000002717 00000 n +0000002775 00000 n +0000002831 00000 n +0000002894 00000 n +0000002956 00000 n +0000003014 00000 n +0000003067 00000 n +0000003125 00000 n +0000003178 00000 n +0000003236 00000 n +0000003289 00000 n +0000003347 00000 n +0000003405 00000 n +0000003458 00000 n +0000003511 00000 n +0000003564 00000 n +0000003617 00000 n +0000003670 00000 n +0000003723 00000 n +0000003781 00000 n +0000003847 00000 n +0000003918 00000 n +0000003990 00000 n +0000000055 00002 f +0000000058 00002 f +0000004039 00000 n +0000004092 00000 n +0000000073 00002 f +0000004140 00000 n +0000004188 00000 n +0000004245 00000 n +0000004303 00000 n +0000004356 00000 n +0000004407 00000 n +0000004457 00000 n +0000004519 00000 n +0000004575 00000 n +0000004659 00000 n +0000004708 00000 n +0000004982 00000 n +0000006077 00000 n +0000006443 00000 n +0000000078 00001 f +0000006646 00000 n +0000006928 00000 n +0000007910 00000 n +0000008221 00000 n +0000000081 00001 f +0000008423 00000 n +0000008480 00000 n +0000000000 00002 f +0000009260 00000 n +0000009326 00000 n + +trailer +<<93A573E7CE67A5B51F05FE9E29C3AB03>]>> +startxref +9395 +%%EOF diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3594.pdf Binary file tests/resources/test_3594.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3615.epub Binary file tests/resources/test_3615.epub has 
changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3624.pdf Binary file tests/resources/test_3624.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3624_expected.png Binary file tests/resources/test_3624_expected.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3650.pdf Binary file tests/resources/test_3650.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3654.docx Binary file tests/resources/test_3654.docx has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3677.pdf Binary file tests/resources/test_3677.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3687-3.epub Binary file tests/resources/test_3687-3.epub has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3687.epub Binary file tests/resources/test_3687.epub has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3705.pdf Binary file tests/resources/test_3705.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3725.pdf Binary file tests/resources/test_3725.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3727.pdf Binary file tests/resources/test_3727.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3780.pdf Binary file tests/resources/test_3780.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3789.pdf Binary file tests/resources/test_3789.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3806-expected.png Binary file tests/resources/test_3806-expected.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3806.pdf Binary file tests/resources/test_3806.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3842.pdf Binary file tests/resources/test_3842.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3848.pdf Binary file tests/resources/test_3848.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3854.pdf Binary file tests/resources/test_3854.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3854_expected.png Binary file tests/resources/test_3854_expected.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3863.pdf Binary file tests/resources/test_3863.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3863.pdf.pdf.0.png Binary file tests/resources/test_3863.pdf.pdf.0.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3863.pdf.pdf.1.png Binary file tests/resources/test_3863.pdf.pdf.1.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3863.pdf.pdf.2.png Binary file tests/resources/test_3863.pdf.pdf.2.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3863.pdf.pdf.3.png Binary file tests/resources/test_3863.pdf.pdf.3.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3863.pdf.pdf.4.png Binary file tests/resources/test_3863.pdf.pdf.4.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3863.pdf.pdf.5.png Binary file tests/resources/test_3863.pdf.pdf.5.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3863.pdf.pdf.6.png Binary file tests/resources/test_3863.pdf.pdf.6.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3863.pdf.pdf.7.png Binary file tests/resources/test_3863.pdf.pdf.7.png has changed diff -r 000000000000 -r 1d09e1dec1d9 
tests/resources/test_3886.pdf Binary file tests/resources/test_3886.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3887.pdf Binary file tests/resources/test_3887.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3933.pdf Binary file tests/resources/test_3933.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3950.pdf Binary file tests/resources/test_3950.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_3994.pdf Binary file tests/resources/test_3994.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4004.pdf Binary file tests/resources/test_4004.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4017.pdf Binary file tests/resources/test_4017.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4026.pdf Binary file tests/resources/test_4026.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4034.pdf Binary file tests/resources/test_4034.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4043.pdf --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_4043.pdf Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,77 @@ +%PDF-1.7 +%µ¶ + +1 0 obj +<> +endobj + +2 0 obj +<> +endobj + +3 0 obj +<>>> +endobj + +4 0 obj +<> +endobj + +5 0 obj +<> +endobj + +6 0 obj +<> +stream + +q +BT +1 0 0 1 30 770 Tm +/helv 11 Tf [<61626364>]TJ +ET +Q + +endstream +endobj + +7 0 obj +<> +stream + +q +BT +1 0 0 1 30 750 Tm +/helv 11 Tf [<6b6b6b>]TJ +ET +Q + +endstream +endobj + +8 0 obj +<<>> +endobj + +9 0 obj +<> +endobj + +xref +0 10 +0000000000 00001 f +0000000016 00000 n +0000000062 00000 n +0000000120 00000 n +0000000161 00000 n +0000000302 00000 n +0000000391 00000 n +0000000496 00000 n +0000000599 00000 n +0000000620 00000 n + +trailer +<<706AFDA60F32AE1EC7862F85C0C2B0DF>]>> +startxref +715 +%%EOF diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4047.pdf Binary file tests/resources/test_4047.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4079.pdf Binary file tests/resources/test_4079.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4079_after.pdf Binary file tests/resources/test_4079_after.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4079_after_1.25.pdf Binary file tests/resources/test_4079_after_1.25.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4090.pdf Binary file tests/resources/test_4090.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4125.pdf Binary file tests/resources/test_4125.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4139.pdf Binary file tests/resources/test_4139.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4141.pdf Binary file tests/resources/test_4141.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4147.pdf Binary file tests/resources/test_4147.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4179.pdf Binary file tests/resources/test_4179.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4179_expected.png Binary file tests/resources/test_4179_expected.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4180.pdf Binary file tests/resources/test_4180.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4180_expected.png Binary file 
tests/resources/test_4180_expected.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4182.pdf Binary file tests/resources/test_4182.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4182_expected.png Binary file tests/resources/test_4182_expected.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4224.pdf Binary file tests/resources/test_4224.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4245.pdf Binary file tests/resources/test_4245.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4245_expected.png Binary file tests/resources/test_4245_expected.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4263.pdf Binary file tests/resources/test_4263.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4363.pdf Binary file tests/resources/test_4363.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4388_BOZ1.pdf Binary file tests/resources/test_4388_BOZ1.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4388_BUL1.pdf Binary file tests/resources/test_4388_BUL1.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4412.pdf Binary file tests/resources/test_4412.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4415.pdf Binary file tests/resources/test_4415.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4415_out_expected.png Binary file tests/resources/test_4415_out_expected.png has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4423.pdf Binary file tests/resources/test_4423.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4435.pdf Binary file tests/resources/test_4435.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4479.pdf Binary file tests/resources/test_4479.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4496.hwpx Binary file tests/resources/test_4496.hwpx has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4503.pdf Binary file tests/resources/test_4503.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4505.pdf Binary file tests/resources/test_4505.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4546.pdf Binary file tests/resources/test_4546.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4564.pdf --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_4564.pdf Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,51 @@ +%PDF-1.7 +%µ¶ + +1 0 obj +<> +endobj + +2 0 obj +<> +endobj + +3 0 obj +<<>> +endobj + +4 0 obj +<> +endobj + +5 0 obj +<> +endobj + +xref +0 6 +0000000000 65535 f +0000000016 00000 n +0000000062 00000 n +0000000114 00000 n +0000000135 00000 n +0000000226 00000 n + +trailer +<<1B39208D6AF10175D235B24C7B616316>]>> +startxref +294 +%%EOF + +5 0 obj +<>> +endobj + +xref +5 1 +0000000560 00000 n + +trailer +<<53D89E9C1F45647C23FCE6596A7EDD32>]/Prev 294>> +startxref +729 +%%EOF diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4571.pdf Binary file tests/resources/test_4571.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4614.pdf Binary file tests/resources/test_4614.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_4639.pdf Binary file tests/resources/test_4639.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_annot_file_info.pdf 
Binary file tests/resources/test_annot_file_info.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_delete_image.pdf Binary file tests/resources/test_delete_image.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.cbz Binary file tests/resources/test_open2.cbz has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.doc Binary file tests/resources/test_open2.doc has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.docx Binary file tests/resources/test_open2.docx has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.epub Binary file tests/resources/test_open2.epub has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.fb2 --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_open2.fb2 Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,64 @@ + + + + + computers + + Chris + Clark + + Sample FB2 book + +

Short sample of a FictionBook2 book with simple metadata. Based on test_book.md from https://github.com/clach04/sample_reading_media

+
+ ebook,sample,markdown,fb2,FictionBook2 +
+ + + clach04 + https://github.com/clach04/sample_reading_media + + + vim and scite + https://github.com/clach04/sample_reading_media + 1.0 + +

Initial version, written by hand.

+
+
+
+ + + <p>This is a title</p> + + +
+ + <p>Test Header h1</p> + + +

A test paragraph.

+

Another test paragraph.

+
+ +
+ + <p>Another Test Header h1</p> + + +
+ + <p>A Test Header h2</p> + + +
+ + <p>A Test Header h3</p> + + +

Yet more copy

+
+
+
+ +
diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.html Binary file tests/resources/test_open2.html has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.jpg Binary file tests/resources/test_open2.jpg has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.mobi Binary file tests/resources/test_open2.mobi has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.pdf Binary file tests/resources/test_open2.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.svg --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_open2.svg Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,18 @@ + + + + + + + + + + + + + + + + + + diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.xhtml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_open2.xhtml Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,17 @@ + + + + + + + + +
+

Some text

+
+ + diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_open2.xml Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,21 @@ + + + + + + + + + + + + + + + + + + + + + diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2.xps Binary file tests/resources/test_open2.xps has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_open2_expected.json --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/resources/test_open2_expected.json Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,808 @@ +{ + "tests/resources/test_open2.cbz": { + "": { + "file": "zip", + "stream": "zip" + }, + ".cbz": { + "file": "zip", + "stream": "zip" + }, + ".doc": { + "file": "zip", + "stream": "zip" + }, + ".docx": { + "file": "zip", + "stream": "zip" + }, + ".epub": { + "file": "zip", + "stream": "zip" + }, + ".fb2": { + "file": "zip", + "stream": "zip" + }, + ".html": { + "file": "zip", + "stream": "zip" + }, + ".jpg": { + "file": "zip", + "stream": "zip" + }, + ".mobi": { + "file": "zip", + "stream": "zip" + }, + ".pdf": { + "file": "zip", + "stream": "zip" + }, + ".svg": { + "file": "zip", + "stream": "zip" + }, + ".txt": { + "file": "zip", + "stream": "zip" + }, + ".xhtml": { + "file": "zip", + "stream": "zip" + }, + ".xml": { + "file": "zip", + "stream": "zip" + }, + ".xps": { + "file": "zip", + "stream": "zip" + } + }, + "tests/resources/test_open2.doc": { + "": { + "file": "[error]", + "stream": "[error]" + }, + ".cbz": { + "file": "cfb", + "stream": "cfb" + }, + ".doc": { + "file": "[error]", + "stream": "[error]" + }, + ".docx": { + "file": "[error]", + "stream": "[error]" + }, + ".epub": { + "file": "[error]", + "stream": "[error]" + }, + ".fb2": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".html": { + "file": "HTML5", + "stream": "HTML5" + }, + ".jpg": { + "file": "Image", + "stream": "Image" + }, + ".mobi": { + "file": "[error]", + "stream": "[error]" + }, + ".pdf": { + "file": "[error]", + "stream": "[error]" + }, + ".svg": { + "file": "SVG", + "stream": "SVG" + }, + ".txt": { + "file": "Text", + "stream": "Text" + }, + ".xhtml": { + "file": "XHTML", + "stream": "XHTML" + }, + ".xml": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".xps": { + "file": "[error]", + "stream": "[error]" + } + }, + "tests/resources/test_open2.docx": { + "": { + "file": "Office document", + "stream": "Office document" + }, + ".cbz": { + "file": "Office document", + "stream": "Office document" + }, + ".doc": { + "file": "Office document", + "stream": "Office document" + }, + ".docx": { + "file": "Office document", + "stream": "Office document" + }, + ".epub": { + "file": "Office document", + "stream": "Office document" + }, + ".fb2": { + "file": "Office document", + "stream": "Office document" + }, + ".html": { + "file": "Office document", + "stream": "Office document" + }, + ".jpg": { + "file": "Office document", + "stream": "Office document" + }, + ".mobi": { + "file": "Office document", + "stream": "Office document" + }, + ".pdf": { + "file": "Office document", + "stream": "Office document" + }, + ".svg": { + "file": "Office document", + "stream": "Office document" + }, + ".txt": { + "file": "Office document", + "stream": "Office document" + }, + ".xhtml": { + "file": "Office document", + "stream": "Office document" + }, + ".xml": { + "file": "Office document", + "stream": "Office document" + }, + ".xps": { + "file": "Office document", + "stream": "Office document" + } + }, + 
"tests/resources/test_open2.epub": { + "": { + "file": "EPUB", + "stream": "EPUB" + }, + ".cbz": { + "file": "zip", + "stream": "zip" + }, + ".doc": { + "file": "EPUB", + "stream": "EPUB" + }, + ".docx": { + "file": "EPUB", + "stream": "EPUB" + }, + ".epub": { + "file": "EPUB", + "stream": "EPUB" + }, + ".fb2": { + "file": "EPUB", + "stream": "EPUB" + }, + ".html": { + "file": "EPUB", + "stream": "EPUB" + }, + ".jpg": { + "file": "EPUB", + "stream": "EPUB" + }, + ".mobi": { + "file": "EPUB", + "stream": "EPUB" + }, + ".pdf": { + "file": "EPUB", + "stream": "EPUB" + }, + ".svg": { + "file": "EPUB", + "stream": "EPUB" + }, + ".txt": { + "file": "EPUB", + "stream": "EPUB" + }, + ".xhtml": { + "file": "EPUB", + "stream": "EPUB" + }, + ".xml": { + "file": "EPUB", + "stream": "EPUB" + }, + ".xps": { + "file": "EPUB", + "stream": "EPUB" + } + }, + "tests/resources/test_open2.fb2": { + "": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".cbz": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".doc": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".docx": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".epub": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".fb2": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".html": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".jpg": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".mobi": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".pdf": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".svg": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".txt": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".xhtml": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".xml": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".xps": { + "file": "FictionBook2", + "stream": "FictionBook2" + } + }, + "tests/resources/test_open2.html": { + "": { + "file": "HTML5", + "stream": "HTML5" + }, + ".cbz": { + "file": "HTML5", + "stream": "HTML5" + }, + ".doc": { + "file": "HTML5", + "stream": "HTML5" + }, + ".docx": { + "file": "HTML5", + "stream": "HTML5" + }, + ".epub": { + "file": "HTML5", + "stream": "HTML5" + }, + ".fb2": { + "file": "HTML5", + "stream": "HTML5" + }, + ".html": { + "file": "HTML5", + "stream": "HTML5" + }, + ".jpg": { + "file": "HTML5", + "stream": "HTML5" + }, + ".mobi": { + "file": "HTML5", + "stream": "HTML5" + }, + ".pdf": { + "file": "HTML5", + "stream": "HTML5" + }, + ".svg": { + "file": "HTML5", + "stream": "HTML5" + }, + ".txt": { + "file": "HTML5", + "stream": "HTML5" + }, + ".xhtml": { + "file": "XHTML", + "stream": "XHTML" + }, + ".xml": { + "file": "HTML5", + "stream": "HTML5" + }, + ".xps": { + "file": "HTML5", + "stream": "HTML5" + } + }, + "tests/resources/test_open2.jpg": { + "": { + "file": "Image", + "stream": "Image" + }, + ".cbz": { + "file": "Image", + "stream": "Image" + }, + ".doc": { + "file": "Image", + "stream": "Image" + }, + ".docx": { + "file": "Image", + "stream": "Image" + }, + ".epub": { + "file": "Image", + "stream": "Image" + }, + ".fb2": { + "file": "Image", + "stream": "Image" + }, + ".html": { + "file": "Image", + "stream": "Image" + }, + ".jpg": { + "file": "Image", + "stream": "Image" + }, + ".mobi": { + "file": "Image", + "stream": "Image" + }, + ".pdf": { + "file": "Image", + "stream": "Image" + }, + ".svg": { + "file": "Image", + "stream": "Image" + }, + ".txt": { + "file": "Image", + "stream": "Image" + }, + ".xhtml": { 
+ "file": "Image", + "stream": "Image" + }, + ".xml": { + "file": "Image", + "stream": "Image" + }, + ".xps": { + "file": "Image", + "stream": "Image" + } + }, + "tests/resources/test_open2.mobi": { + "": { + "file": "MOBI", + "stream": "MOBI" + }, + ".cbz": { + "file": "MOBI", + "stream": "MOBI" + }, + ".doc": { + "file": "MOBI", + "stream": "MOBI" + }, + ".docx": { + "file": "MOBI", + "stream": "MOBI" + }, + ".epub": { + "file": "MOBI", + "stream": "MOBI" + }, + ".fb2": { + "file": "MOBI", + "stream": "MOBI" + }, + ".html": { + "file": "MOBI", + "stream": "MOBI" + }, + ".jpg": { + "file": "MOBI", + "stream": "MOBI" + }, + ".mobi": { + "file": "MOBI", + "stream": "MOBI" + }, + ".pdf": { + "file": "MOBI", + "stream": "MOBI" + }, + ".svg": { + "file": "MOBI", + "stream": "MOBI" + }, + ".txt": { + "file": "MOBI", + "stream": "MOBI" + }, + ".xhtml": { + "file": "MOBI", + "stream": "MOBI" + }, + ".xml": { + "file": "MOBI", + "stream": "MOBI" + }, + ".xps": { + "file": "MOBI", + "stream": "MOBI" + } + }, + "tests/resources/test_open2.pdf": { + "": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".cbz": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".doc": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".docx": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".epub": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".fb2": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".html": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".jpg": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".mobi": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".pdf": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".svg": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".txt": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".xhtml": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".xml": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + }, + ".xps": { + "file": "PDF 1.5", + "stream": "PDF 1.5" + } + }, + "tests/resources/test_open2.svg": { + "": { + "file": "SVG", + "stream": "SVG" + }, + ".cbz": { + "file": "SVG", + "stream": "SVG" + }, + ".doc": { + "file": "SVG", + "stream": "SVG" + }, + ".docx": { + "file": "SVG", + "stream": "SVG" + }, + ".epub": { + "file": "SVG", + "stream": "SVG" + }, + ".fb2": { + "file": "SVG", + "stream": "SVG" + }, + ".html": { + "file": "SVG", + "stream": "SVG" + }, + ".jpg": { + "file": "SVG", + "stream": "SVG" + }, + ".mobi": { + "file": "SVG", + "stream": "SVG" + }, + ".pdf": { + "file": "SVG", + "stream": "SVG" + }, + ".svg": { + "file": "SVG", + "stream": "SVG" + }, + ".txt": { + "file": "SVG", + "stream": "SVG" + }, + ".xhtml": { + "file": "SVG", + "stream": "SVG" + }, + ".xml": { + "file": "SVG", + "stream": "SVG" + }, + ".xps": { + "file": "SVG", + "stream": "SVG" + } + }, + "tests/resources/test_open2.xhtml": { + "": { + "file": "XHTML", + "stream": "XHTML" + }, + ".cbz": { + "file": "XHTML", + "stream": "XHTML" + }, + ".doc": { + "file": "XHTML", + "stream": "XHTML" + }, + ".docx": { + "file": "XHTML", + "stream": "XHTML" + }, + ".epub": { + "file": "XHTML", + "stream": "XHTML" + }, + ".fb2": { + "file": "XHTML", + "stream": "XHTML" + }, + ".html": { + "file": "HTML5", + "stream": "HTML5" + }, + ".jpg": { + "file": "XHTML", + "stream": "XHTML" + }, + ".mobi": { + "file": "XHTML", + "stream": "XHTML" + }, + ".pdf": { + "file": "XHTML", + "stream": "XHTML" + }, + ".svg": { + "file": "XHTML", + "stream": "XHTML" + }, + ".txt": { + "file": "XHTML", + "stream": "XHTML" + }, + ".xhtml": { + "file": "XHTML", + "stream": "XHTML" + }, + 
".xml": { + "file": "XHTML", + "stream": "XHTML" + }, + ".xps": { + "file": "XHTML", + "stream": "XHTML" + } + }, + "tests/resources/test_open2.xml": { + "": { + "file": "[error]", + "stream": "[error]" + }, + ".cbz": { + "file": "[error]", + "stream": "[error]" + }, + ".doc": { + "file": "[error]", + "stream": "[error]" + }, + ".docx": { + "file": "[error]", + "stream": "[error]" + }, + ".epub": { + "file": "[error]", + "stream": "[error]" + }, + ".fb2": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".html": { + "file": "HTML5", + "stream": "HTML5" + }, + ".jpg": { + "file": "Image", + "stream": "Image" + }, + ".mobi": { + "file": "[error]", + "stream": "[error]" + }, + ".pdf": { + "file": "[error]", + "stream": "[error]" + }, + ".svg": { + "file": "SVG", + "stream": "SVG" + }, + ".txt": { + "file": "Text", + "stream": "Text" + }, + ".xhtml": { + "file": "XHTML", + "stream": "XHTML" + }, + ".xml": { + "file": "FictionBook2", + "stream": "FictionBook2" + }, + ".xps": { + "file": "[error]", + "stream": "[error]" + } + }, + "tests/resources/test_open2.xps": { + "": { + "file": "XPS", + "stream": "XPS" + }, + ".cbz": { + "file": "zip", + "stream": "zip" + }, + ".doc": { + "file": "XPS", + "stream": "XPS" + }, + ".docx": { + "file": "XPS", + "stream": "XPS" + }, + ".epub": { + "file": "XPS", + "stream": "XPS" + }, + ".fb2": { + "file": "XPS", + "stream": "XPS" + }, + ".html": { + "file": "XPS", + "stream": "XPS" + }, + ".jpg": { + "file": "XPS", + "stream": "XPS" + }, + ".mobi": { + "file": "XPS", + "stream": "XPS" + }, + ".pdf": { + "file": "XPS", + "stream": "XPS" + }, + ".svg": { + "file": "XPS", + "stream": "XPS" + }, + ".txt": { + "file": "XPS", + "stream": "XPS" + }, + ".xhtml": { + "file": "XPS", + "stream": "XPS" + }, + ".xml": { + "file": "XPS", + "stream": "XPS" + }, + ".xps": { + "file": "XPS", + "stream": "XPS" + } + } +} \ No newline at end of file diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/test_toc_count.pdf Binary file tests/resources/test_toc_count.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/text-find-ligatures.pdf Binary file tests/resources/text-find-ligatures.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/type3font.pdf Binary file tests/resources/type3font.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/v110-changes.pdf Binary file tests/resources/v110-changes.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/resources/widgettest.pdf Binary file tests/resources/widgettest.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/run_compound.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/run_compound.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,139 @@ +#! /usr/bin/env python3 + +''' +Runs a command using different implementations of PyMuPDF: + +1. Run with rebased implementation of PyMuPDF. + +2. As 1 but also set PYMUPDF_USE_EXTRA=0 to disable use of C++ optimisations. + +Example usage: + + ./PyMuPDF/tests/run_compound.py python -m pytest -s PyMuPDF + +Use `-i ` to select which implementations to use. In +``, `r` means rebased, `R` means rebased without +optimisations. 
+ +For example use the rebased and unoptimised rebased implementations with: + + ./PyMuPDF/tests/run_compound.py python -m pytest -s PyMuPDF +''' + +import shlex +import os +import platform +import subprocess +import sys +import textwrap +import time + + +def log(text): + print(textwrap.indent(text, 'PyMuPDF:tests/run_compound.py: ')) + sys.stdout.flush() + + +def log_star(text): + log('#' * 40) + log(text) + log('#' * 40) + + +def main(): + + implementations = 'rR' + timeout = None + i = 1 + while i < len(sys.argv): + arg = sys.argv[i] + if arg == '-i': + i += 1 + implementations = sys.argv[i] + elif arg == '-t': + i += 1 + timeout = float(sys.argv[i]) + elif arg.startswith('-'): + raise Exception(f'Unrecognised {arg=}.') + else: + break + i += 1 + args = sys.argv[i:] + + e_rebased = None + e_rebased_unoptimised = None + + endtime = None + if timeout: + endtime = time.time() + timeout + + # Check `implementations`. + implementations_seen = set() + for i in implementations: + assert i not in implementations_seen, f'Duplicate implementation {i!r} in {implementations!r}.' + if i == 'r': + name = 'rebased' + elif i == 'R': + name = 'rebased (unoptimised)' + else: + assert 0, f'Unrecognised implementation {i!r} in {implementations!r}.' + log(f' {i!r}: will run with PyMuPDF {name}.') + implementations_seen.add(i) + + for i in implementations: + log(f'run_compound.py: {i=}') + + cpu_bits = int.bit_length(sys.maxsize+1) + log(f'{os.getcwd()=}') + log(f'{platform.machine()=}') + log(f'{platform.platform()=}') + log(f'{platform.python_version()=}') + log(f'{platform.system()=}') + if sys.implementation.name != 'graalpy': + log(f'{platform.uname()=}') + log(f'{sys.executable=}') + log(f'{sys.version=}') + log(f'{sys.version_info=}') + log(f'{list(sys.version_info)=}') + log(f'{cpu_bits=}') + + timeout = None + if endtime: + timeout = max(0, endtime - time.time()) + if i == 'r': + + # Run with default `pymupdf` (rebased). + # + log_star( f'Running using pymupdf (rebased): {shlex.join(args)}') + e_rebased = subprocess.run( args, shell=0, check=0, timeout=timeout).returncode + + elif i == 'R': + + # Run with `pymupdf` (rebased) again, this time with PYMUPDF_USE_EXTRA=0. + # + env = os.environ.copy() + env[ 'PYMUPDF_USE_EXTRA'] = '0' + log_star(f'Running using pymupdf (rebased) with PYMUPDF_USE_EXTRA=0: {shlex.join(args)}') + e_rebased_unoptimised = subprocess.run( args, shell=0, check=0, env=env, timeout=timeout).returncode + + else: + raise Exception(f'Unrecognised implementation {i!r}.') + + if e_rebased is not None: + log(f'{e_rebased=}') + if e_rebased_unoptimised is not None: + log(f'{e_rebased_unoptimised=}') + + if e_rebased or e_rebased_unoptimised: + log('Test(s) failed.') + return 1 + + +if __name__ == '__main__': + try: + sys.exit(main()) + except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e: + # Terminate relatively quietly, failed commands will usually have + # generated diagnostics. + log(str(e)) + sys.exit(1) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_2548.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_2548.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,42 @@ +import os + +import pymupdf + +root = os.path.abspath(f'{__file__}/../..') + +def test_2548(): + """Text extraction should fail because of PDF structure cycle. + + Old MuPDF version did not detect the loop. 
+ """ + print(f'test_2548(): {pymupdf.mupdf_version_tuple=}') + pymupdf.TOOLS.mupdf_warnings(reset=True) + doc = pymupdf.open(f'{root}/tests/resources/test_2548.pdf') + e = False + for page in doc: + try: + _ = page.get_text() + except Exception as ee: + print(f'test_2548: {ee=}') + if hasattr(pymupdf, 'mupdf'): + # Rebased. + expected = "RuntimeError('code=2: cycle in structure tree')" + else: + # Classic. + expected = "RuntimeError('cycle in structure tree')" + assert repr(ee) == expected, f'Expected {expected=} but got {repr(ee)=}.' + e = True + wt = pymupdf.TOOLS.mupdf_warnings() + print(f'test_2548(): {wt=}') + + # This checks that PyMuPDF 1.23.7 fixes this bug, and also that earlier + # versions with updated MuPDF also fix the bug. + rebased = hasattr(pymupdf, 'mupdf') + if pymupdf.mupdf_version_tuple >= (1, 27): + expected = 'format error: No common ancestor in structure tree\nstructure tree broken, assume tree is missing' + expected = '\n'.join([expected] * 5) + else: + expected = 'format error: cycle in structure tree\nstructure tree broken, assume tree is missing' + if rebased: + assert wt == expected, f'expected:\n {expected!r}\nwt:\n {wt!r}\n' + assert not e diff -r 000000000000 -r 1d09e1dec1d9 tests/test_2634.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_2634.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,65 @@ +import pymupdf + +import difflib +import json +import os +import pprint + + +def test_2634(): + if not hasattr(pymupdf, 'mupdf'): + print('test_2634(): Not running on classic.') + return + path = os.path.abspath(f'{__file__}/../../tests/resources/test_2634.pdf') + with pymupdf.open(path) as pdf, pymupdf.open() as new: + new.insert_pdf(pdf) + new.set_toc(pdf.get_toc(simple=False)) + toc_pdf = pdf.get_toc(simple=False) + toc_new = new.get_toc(simple=False) + + def clear_xref(toc): + ''' + Clear toc items that naturally differ. + ''' + for item in toc: + d = item[3] + if 'collapse' in d: + d['collapse'] = 'dummy' + if 'xref' in d: + d['xref'] = 'dummy' + + clear_xref(toc_pdf) + clear_xref(toc_new) + + print('toc_pdf') + for item in toc_pdf: print(item) + print() + print('toc_new') + for item in toc_new: print(item) + + toc_text_pdf = pprint.pformat(toc_pdf, indent=4).split('\n') + toc_text_new = pprint.pformat(toc_new, indent=4).split('\n') + + diff = difflib.unified_diff( + toc_text_pdf, + toc_text_new, + lineterm='', + ) + print('\n'.join(diff)) + + # Check 'to' points are identical apart from rounding errors. 
+ # + assert len(toc_new) == len(toc_pdf) + for a, b in zip(toc_pdf, toc_new): + a_dict = a[3] + b_dict = b[3] + if 'to' in a_dict: + assert 'to' in b_dict + a_to = a_dict['to'] + b_to = b_dict['to'] + assert isinstance(a_to, pymupdf.Point) + assert isinstance(b_to, pymupdf.Point) + if a_to != b_to: + print(f'Points not identical: {a_to=} {b_to=}.') + assert abs(a_to.x - b_to.x) < 0.01 + assert abs(a_to.y - b_to.y) < 0.01 diff -r 000000000000 -r 1d09e1dec1d9 tests/test_2904.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_2904.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,40 @@ +import pymupdf + +import os +import sys + +def test_2904(): + print(f'test_2904(): {pymupdf.mupdf_version_tuple=}.') + path = os.path.abspath(f'{__file__}/../../tests/resources/test_2904.pdf') + pdf_docs = pymupdf.open(path) + for page_id, page in enumerate(pdf_docs): + page_imgs = page.get_images() + for i, img in enumerate(page_imgs): + if page_id == 5: + #print(f'{page_id=} {i=} {type(img)=} {img=}') + sys.stdout.flush() + e = None + try: + recs = page.get_image_rects(img, transform=True) + except Exception as ee: + print(f'Exception: {page_id=} {i=} {img=}: {ee}') + if 0 and hasattr(pymupdf, 'mupdf'): + print(f'pymupdf.exception_info:') + pymupdf.exception_info() + sys.stdout.flush() + e = ee + if page_id == 5: + print(f'{pymupdf.mupdf_version_tuple=}: {page_id=} {i=} {e=} {img=}:') + if page_id == 5 and i==3: + assert e + if hasattr(pymupdf, 'mupdf'): + # rebased. + assert str(e) == 'code=8: Failed to read JPX header' + else: + # classic + assert str(e) == 'Failed to read JPX header' + else: + assert not e + + # Clear warnings, as we will have generated many. + pymupdf.TOOLS.mupdf_warnings() diff -r 000000000000 -r 1d09e1dec1d9 tests/test_2907.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_2907.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,18 @@ +import pymupdf + +import os.path +import pathlib + +def test_2907(): + # This test is for a bug in classic 'segfault trying to call clean_contents + # on certain pdfs with python 3.12', which we are not going to fix. 
+ if not hasattr(pymupdf, 'mupdf'): + print('test_2907(): not running on classic because known to fail.') + return + path = os.path.abspath(f'{__file__}/../../tests/resources/test_2907.pdf') + pdf_file = pathlib.Path(path).read_bytes() + fitz_document = pymupdf.open(stream=pdf_file, filetype="application/pdf") + + pdf_pages = list(fitz_document.pages()) + (page,) = pdf_pages + page.clean_contents() diff -r 000000000000 -r 1d09e1dec1d9 tests/test_4141.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_4141.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,19 @@ +import pymupdf + +import os.path + + +def test_4141(): + """survive missing /Resources object in a number of cases.""" + path = os.path.abspath(f"{__file__}/../../tests/resources/test_4141.pdf") + doc = pymupdf.open(path) + page = doc[0] + # make sure the right test file + assert doc.xref_get_key(page.xref, "Resources") == ("null", "null") + page.insert_htmlbox((100, 100, 200, 200), "Hallo") # will fail without the fix + doc.close() + doc = pymupdf.open(doc.name) + page = doc[0] + tw = pymupdf.TextWriter(page.rect) + tw.append((100, 100), "Hallo") + tw.write_text(page) # will fail without the fix diff -r 000000000000 -r 1d09e1dec1d9 tests/test_4466.pdf Binary file tests/test_4466.pdf has changed diff -r 000000000000 -r 1d09e1dec1d9 tests/test_4503.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_4503.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,38 @@ +""" +Test for issue #4503 in pymupdf: +Correct recognition of strikeout and underline styles in text spans. +""" + +import os +import pymupdf +from pymupdf import mupdf + +STRIKEOUT = mupdf.FZ_STEXT_STRIKEOUT +UNDERLINE = mupdf.FZ_STEXT_UNDERLINE + + +def test_4503(): + """ + Check that the text span with the specified text has the correct styling: + strikeout, but no underline. + Previously, the text was broken in multiple spans with span breaks at + every space. and some parts were not detected as strikeout at all. 
+ """ + scriptdir = os.path.dirname(os.path.abspath(__file__)) + text = "the right to request the state to review and, if appropriate," + filename = os.path.join(scriptdir, "resources", "test-4503.pdf") + doc = pymupdf.open(filename) + page = doc[0] + flags = pymupdf.TEXT_ACCURATE_BBOXES | pymupdf.TEXT_COLLECT_STYLES + spans = [ + s + for b in page.get_text("dict", flags=flags)["blocks"] + for l in b["lines"] + for s in l["spans"] + if s["text"] == text + ] + assert spans, "No spans found with the specified text" + span = spans[0] + + assert span["char_flags"] & STRIKEOUT + assert not span["char_flags"] & UNDERLINE diff -r 000000000000 -r 1d09e1dec1d9 tests/test_4505.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_4505.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,27 @@ +import pymupdf +import os.path + + +def test_4505(): + """Copy field flags to Parent widget and all of its kids.""" + path = os.path.abspath(f"{__file__}/../../tests/resources/test_4505.pdf") + doc = pymupdf.open(path) + page = doc[0] + text1_flags_before = {} + text1_flags_after = {} + # extract all widgets having the same field name + for w in page.widgets(): + if w.field_name != "text_1": + continue + text1_flags_before[w.xref] = w.field_flags + # expected exiting field flags + assert text1_flags_before == {8: 1, 10: 0, 33: 0} + w = page.load_widget(8) # first of these widgets + # give all connected widgets that field flags value + w.update(sync_flags=True) + # confirm that all connected widgets have the same field flags + for w in page.widgets(): + if w.field_name != "text_1": + continue + text1_flags_after[w.xref] = w.field_flags + assert text1_flags_after == {8: 1, 10: 1, 33: 1} diff -r 000000000000 -r 1d09e1dec1d9 tests/test_4520.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_4520.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,11 @@ +import pymupdf + + +def test_4520(): + """Accept source pages without /Contents object in show_pdf_page.""" + tar = pymupdf.open() + src = pymupdf.open() + src.new_page() + page = tar.new_page() + xref = page.show_pdf_page(page.rect, src, 0) + assert xref diff -r 000000000000 -r 1d09e1dec1d9 tests/test_4614.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_4614.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,10 @@ +import pymupdf +import os + + +def test_4614(): + script_dir = os.path.dirname(__file__) + filename = os.path.join(script_dir, "resources", "test_4614.pdf") + src = pymupdf.open(filename) + doc = pymupdf.open() + doc.insert_pdf(src) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_annots.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_annots.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,678 @@ +# -*- coding: utf-8 -*- +""" +Test PDF annotation insertions. +""" + +import os +import platform + +import pymupdf +import gentle_compare + + +red = (1, 0, 0) +blue = (0, 0, 1) +gold = (1, 1, 0) +green = (0, 1, 0) +scriptdir = os.path.dirname(__file__) + +displ = pymupdf.Rect(0, 50, 0, 50) +r = pymupdf.Rect(72, 72, 220, 100) +t1 = "têxt üsès Lätiñ charß,\nEUR: €, mu: µ, super scripts: ²³!" 
+rect = pymupdf.Rect(100, 100, 200, 200) + + +def test_caret(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_caret_annot(rect.tl) + assert annot.type == (14, "Caret") + annot.update(rotate=20) + page.annot_names() + page.annot_xrefs() + + +def test_freetext(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_freetext_annot( + rect, + t1, + fontsize=10, + rotate=90, + text_color=blue, + fill_color=gold, + align=pymupdf.TEXT_ALIGN_CENTER, + ) + annot.set_border(width=0.3, dashes=[2]) + annot.update(text_color=blue, fill_color=gold) + assert annot.type == (2, "FreeText") + + +def test_text(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_text_annot(r.tl, t1) + assert annot.type == (0, "Text") + + +def test_highlight(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_highlight_annot(rect) + assert annot.type == (8, "Highlight") + + +def test_underline(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_underline_annot(rect) + assert annot.type == (9, "Underline") + + +def test_squiggly(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_squiggly_annot(rect) + assert annot.type == (10, "Squiggly") + + +def test_strikeout(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_strikeout_annot(rect) + assert annot.type == (11, "StrikeOut") + page.delete_annot(annot) + + +def test_polyline(): + doc = pymupdf.open() + page = doc.new_page() + rect = page.rect + (100, 36, -100, -36) + cell = pymupdf.make_table(rect, rows=10) + for i in range(10): + annot = page.add_polyline_annot((cell[i][0].bl, cell[i][0].br)) + annot.set_line_ends(i, i) + annot.update() + for i, annot in enumerate(page.annots()): + assert annot.line_ends == (i, i) + assert annot.type == (7, "PolyLine") + + +def test_polygon(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_polygon_annot([rect.bl, rect.tr, rect.br, rect.tl]) + assert annot.type == (6, "Polygon") + + +def test_line(): + doc = pymupdf.open() + page = doc.new_page() + rect = page.rect + (100, 36, -100, -36) + cell = pymupdf.make_table(rect, rows=10) + for i in range(10): + annot = page.add_line_annot(cell[i][0].bl, cell[i][0].br) + annot.set_line_ends(i, i) + annot.update() + for i, annot in enumerate(page.annots()): + assert annot.line_ends == (i, i) + assert annot.type == (3, "Line") + + +def test_square(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_rect_annot(rect) + assert annot.type == (4, "Square") + + +def test_circle(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_circle_annot(rect) + assert annot.type == (5, "Circle") + + +def test_fileattachment(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_file_annot(rect.tl, b"just anything for testing", "testdata.txt") + assert annot.type == (17, "FileAttachment") + + +def test_stamp(): + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_stamp_annot(r, stamp=0) + assert annot.type == (13, "Stamp") + assert annot.info["content"] == "Approved" + annot_id = annot.info["id"] + annot_xref = annot.xref + page.load_annot(annot_id) + page.load_annot(annot_xref) + page = doc.reload_page(page) + + +def test_image_stamp(): + doc = pymupdf.open() + page = doc.new_page() + filename = os.path.join(scriptdir, "resources", "nur-ruhig.jpg") + annot = page.add_stamp_annot(r, stamp=filename) + assert annot.info["content"] == "Image Stamp" + + +def test_redact1(): + doc = pymupdf.open() + page = doc.new_page() + annot = 
page.add_redact_annot(r, text="Hello") + annot.update( + cross_out=True, + rotate=-1, + ) + assert annot.type == (12, "Redact") + annot.get_pixmap() + info = annot.info + annot.set_info(info) + assert not annot.has_popup + annot.set_popup(r) + s = annot.popup_rect + assert s == r + page.apply_redactions() + + +def test_redact2(): + """Test for keeping text and removing graphics.""" + if not hasattr(pymupdf, "mupdf"): + print("Not executing 'test_redact2' in classic") + return + filename = os.path.join(scriptdir, "resources", "symbol-list.pdf") + doc = pymupdf.open(filename) + page = doc[0] + all_text0 = page.get_text("words") + page.add_redact_annot(page.rect) + page.apply_redactions(text=1) + t = page.get_text("words") + assert t == all_text0 + assert not page.get_drawings() + + +def test_redact3(): + """Test for removing text and graphics.""" + if not hasattr(pymupdf, "mupdf"): + print("Not executing 'test_redact3' in classic") + return + filename = os.path.join(scriptdir, "resources", "symbol-list.pdf") + doc = pymupdf.open(filename) + page = doc[0] + page.add_redact_annot(page.rect) + page.apply_redactions() + assert not page.get_text("words") + assert not page.get_drawings() + + +def test_redact4(): + """Test for removing text and keeping graphics.""" + if not hasattr(pymupdf, "mupdf"): + print("Not executing 'test_redact4' in classic") + return + filename = os.path.join(scriptdir, "resources", "symbol-list.pdf") + doc = pymupdf.open(filename) + page = doc[0] + line_art = page.get_drawings() + page.add_redact_annot(page.rect) + page.apply_redactions(graphics=0) + assert not page.get_text("words") + assert line_art == page.get_drawings() + + +def test_1645(): + ''' + Test fix for #1645. + ''' + # The expected output files assume annot_stem is 'jorj'. We need to always + # restore this before returning (this is checked by conftest.py). + annot_stem = pymupdf.JM_annot_id_stem + pymupdf.TOOLS.set_annot_stem('jorj') + try: + path_in = os.path.abspath( f'{__file__}/../resources/symbol-list.pdf') + path_expected = os.path.abspath( f'{__file__}/../../tests/resources/test_1645_expected.pdf') + path_out = os.path.abspath( f'{__file__}/../test_1645_out.pdf') + doc = pymupdf.open(path_in) + page = doc[0] + page_bounds = page.bound() + annot_loc = pymupdf.Rect(page_bounds.x0, page_bounds.y0, page_bounds.x0 + 75, page_bounds.y0 + 15) + # Check type of page.derotation_matrix - this is #2911. + assert isinstance(page.derotation_matrix, pymupdf.Matrix), \ + f'Bad type for page.derotation_matrix: {type(page.derotation_matrix)=} {page.derotation_matrix=}.' + page.add_freetext_annot( + annot_loc * page.derotation_matrix, + "TEST", + fontsize=18, + fill_color=pymupdf.utils.getColor("FIREBRICK1"), + rotate=page.rotation, + ) + doc.save(path_out, garbage=1, deflate=True, no_new_id=True) + print(f'Have created {path_out}. comparing with {path_expected}.') + with open( path_out, 'rb') as f: + out = f.read() + with open( path_expected, 'rb') as f: + expected = f.read() + assert out == expected, f'Files differ: {path_out} {path_expected}' + finally: + # Restore annot_stem. + pymupdf.TOOLS.set_annot_stem(annot_stem) + +def test_1824(): + ''' + Test for fix for #1824: SegFault when applying redactions overlapping a + transparent image. 
+ ''' + path = os.path.abspath( f'{__file__}/../resources/test_1824.pdf') + doc=pymupdf.open(path) + page=doc[0] + page.apply_redactions() + +def test_2270(): + ''' + https://github.com/pymupdf/PyMuPDF/issues/2270 + ''' + path = os.path.abspath( f'{__file__}/../../tests/resources/test_2270.pdf') + with pymupdf.open(path) as document: + for page_number, page in enumerate(document): + for textBox in page.annots(types=(pymupdf.PDF_ANNOT_FREE_TEXT,pymupdf.PDF_ANNOT_TEXT)): + print("textBox.type :", textBox.type) + print(f"{textBox.rect=}") + print("textBox.get_text('words') : ", textBox.get_text('words')) + print("textBox.get_text('text') : ", textBox.get_text('text')) + print("textBox.get_textbox(textBox.rect) : ", textBox.get_textbox(textBox.rect)) + print("textBox.info['content'] : ", textBox.info['content']) + assert textBox.type == (2, 'FreeText') + assert textBox.get_text('words')[0][4] == 'abc123' + assert textBox.get_text('text') == 'abc123\n' + assert textBox.get_textbox(textBox.rect) == 'abc123' + assert textBox.info['content'] == 'abc123' + + # Additional check that Annot.get_textpage() returns a + # TextPage that works with page.get_text() - prior to + # 2024-01-30 the TextPage had no `.parent` member. + textpage = textBox.get_textpage() + text = page.get_text() + print(f'{text=}') + text = page.get_text(textpage=textpage) + print(f'{text=}') + print(f'{getattr(textpage, "parent")=}') + + if pymupdf.mupdf_version_tuple >= (1, 26): + # Check Annotation.get_textpage()'s arg. + clip = textBox.rect + clip.x1 = clip.x0 + (clip.x1 - clip.x0) / 3 + textpage2 = textBox.get_textpage(clip=clip) + text = textpage2.extractText() + print(f'With {clip=}: {text=}') + assert text == 'ab\n' + else: + assert not hasattr(pymupdf.mupdf, 'FZ_STEXT_CLIP_RECT') + + +def test_2934_add_redact_annot(): + ''' + Test fix for bug mentioned in #2934. 
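+ Sketch of what follows: the PDF is opened from a memory stream, the bbox of the first text span is taken from get_text("json"), and that area is redacted.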
+ ''' + path = os.path.abspath(f'{__file__}/../../tests/resources/mupdf_explored.pdf') + with open(path, 'rb') as f: + data = f.read() + doc = pymupdf.Document(stream=data) + print(f'Is PDF: {doc.is_pdf}') + print(f'Number of pages: {doc.page_count}') + + import json + page=doc[0] + page_json_str =doc[0].get_text("json") + page_json_data = json.loads(page_json_str) + span=page_json_data.get("blocks")[0].get("lines")[0].get("spans")[0] + page.add_redact_annot(span["bbox"], text="") + page.apply_redactions() + +def test_2969(): + ''' + https://github.com/pymupdf/PyMuPDF/issues/2969 + ''' + path = os.path.abspath(f'{__file__}/../../tests/resources/test_2969.pdf') + doc = pymupdf.open(path) + page = doc[0] + first_annot = list(page.annots())[0] + first_annot.next + +def test_file_info(): + path = os.path.abspath(f'{__file__}/../../tests/resources/test_annot_file_info.pdf') + document = pymupdf.open(path) + results = list() + for i, page in enumerate(document): + print(f'{i=}') + annotations = page.annots() + for j, annotation in enumerate(annotations): + print(f'{j=} {annotation=}') + t = annotation.type + print(f'{t=}') + if t[0] == pymupdf.PDF_ANNOT_FILE_ATTACHMENT: + file_info = annotation.file_info + print(f'{file_info=}') + results.append(file_info) + assert results == [ + {'filename': 'example.pdf', 'description': '', 'length': 8416, 'size': 8992}, + {'filename': 'photo1.jpeg', 'description': '', 'length': 10154, 'size': 8012}, + ] + +def test_3131(): + doc = pymupdf.open() + page = doc.new_page() + + page.add_line_annot((0, 0), (1, 1)) + page.add_line_annot((1, 0), (0, 1)) + + first_annot, _ = page.annots() + first_annot.next.type + +def test_3209(): + pdf = pymupdf.Document(filetype="pdf") + page = pdf.new_page() + page.add_ink_annot([[(300,300), (400, 380), (350, 350)]]) + n = 0 + for annot in page.annots(): + n += 1 + assert annot.vertices == [[(300.0, 300.0), (400.0, 380.0), (350.0, 350.0)]] + assert n == 1 + path = os.path.abspath(f'{__file__}/../../tests/test_3209_out.pdf') + pdf.save(path) # Check the output PDF that the annotation is correctly drawn + +def test_3863(): + path_in = os.path.normpath(f'{__file__}/../../tests/resources/test_3863.pdf') + path_out = os.path.normpath(f'{__file__}/../../tests/test_3863.pdf.pdf') + + # Create redacted PDF. + print(f'Loading {path_in=}.') + with pymupdf.open(path_in) as document: + + for num, page in enumerate(document): + print(f"Page {num + 1} - {page.rect}:") + + for image in page.get_images(full=True): + print(f" - Image: {image}") + + redact_rect = page.rect + + if page.rotation in (90, 270): + redact_rect = pymupdf.Rect(0, 0, page.rect.height, page.rect.width) + + page.add_redact_annot(redact_rect) + page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE) + + print(f'Writing to {path_out=}.') + document.save(path_out) + + with pymupdf.open(path_out) as document: + assert len(document) == 8 + + # Create PNG for each page of redacted PDF. + for num, page in enumerate(document): + path_png = f'{path_out}.{num}.png' + pixmap = page.get_pixmap() + print(f'Writing to {path_png=}.') + pixmap.save(path_png) + # Compare with expected png. + + print(f'Comparing page PNGs with expected PNGs.') + for num, _ in enumerate(document): + path_png = f'{path_out}.{num}.png' + path_png_expected = f'{path_in}.pdf.{num}.png' + print(f'{path_png=}.') + print(f'{path_png_expected=}.') + rms = gentle_compare.pixmaps_rms(path_png, path_png_expected, ' ') + # We get small differences in sysinstall tests, where some + # thirdparty libraries can differ. 
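+ # gentle_compare.pixmaps_rms() presumably returns a root-mean-square pixel difference; values below 1 are treated as visually identical here.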
+ assert rms < 1 + +def test_3758(): + # This test requires input file that is not public, so is usually not + # available. + path = os.path.normpath(f'{__file__}/../../../test_3758.pdf') + if not os.path.exists(path): + print(f'test_3758(): not running because does not exist: {path=}.') + return + import json + with pymupdf.open(path) as document: + for page in document: + info = json.loads(page.get_text('json', flags=pymupdf.TEXTFLAGS_TEXT)) + for block_ind, block in enumerate(info['blocks']): + for line_ind, line in enumerate(block['lines']): + for span_ind, span in enumerate(line['spans']): + # print(span) + page.add_redact_annot(pymupdf.Rect(*span['bbox'])) + page.apply_redactions() + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt + + +def test_parent(): + """Test invalidating parent on page re-assignment.""" + doc = pymupdf.open() + page = doc.new_page() + a = page.add_highlight_annot(page.rect) # insert annotation on page 0 + page = doc.new_page() # make a new page, should orphanate annotation + try: + print(a) # should raise + except Exception as e: + if platform.system() == 'OpenBSD': + assert isinstance(e, pymupdf.mupdf.FzErrorBase), f'Incorrect {type(e)=}.' + else: + assert isinstance(e, pymupdf.mupdf.FzErrorArgument), f'Incorrect {type(e)=}.' + assert str(e) == 'code=4: annotation not bound to any page', f'Incorrect error text {str(e)=}.' + else: + assert 0, f'Failed to get expected exception.' + +def test_4047(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4047.pdf') + with pymupdf.open(path) as document: + page = document[0] + fontname = page.get_fonts()[0][3] + if fontname not in pymupdf.Base14_fontnames: + fontname = "Courier" + hits = page.search_for("|") + for rect in hits: + page.add_redact_annot( + rect, " ", fontname=fontname, align=pymupdf.TEXT_ALIGN_CENTER, fontsize=10 + ) # Segmentation Fault... + page.apply_redactions() + +def test_4079(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4079.pdf') + if pymupdf.mupdf_version_tuple >= (1, 25, 5): + path_after = os.path.normpath(f'{__file__}/../../tests/resources/test_4079_after.pdf') + else: + # 2024-11-27 Expect incorrect behaviour. + path_after = os.path.normpath(f'{__file__}/../../tests/resources/test_4079_after_1.25.pdf') + + path_out = os.path.normpath(f'{__file__}/../../tests/test_4079_out') + with pymupdf.open(path_after) as document_after: + page = document_after[0] + pixmap_after_expected = page.get_pixmap() + with pymupdf.open(path) as document: + page = document[0] + rects = [ + [164,213,282,227], + [282,213,397,233], + [434,209,525,243], + [169,228,231,243], + [377,592,440,607], + [373,611,444,626], + ] + for rect in rects: + page.add_redact_annot(rect, fill=(1,0,0)) + page.draw_rect(rect, color=(0, 1, 0)) + document.save(f'{path_out}_before.pdf') + page.apply_redactions(images=0) + pixmap_after = page.get_pixmap() + document.save(f'{path_out}_after.pdf') + rms = gentle_compare.pixmaps_rms(pixmap_after_expected, pixmap_after) + diff = gentle_compare.pixmaps_diff(pixmap_after_expected, pixmap_after) + path = os.path.normpath(f'{__file__}/../../tests/test_4079_diff.png') + diff.save(path) + print(f'{rms=}') + assert rms == 0 + +def test_4254(): + """Ensure that both annotations are fully created + + We do this by asserting equal top-used colors in respective pixmaps. 
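+ (If either annotation were rendered incompletely, the dominant color of its pixmap would likely differ from the other's.)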
+ """ + doc = pymupdf.open() + page = doc.new_page() + + rect = pymupdf.Rect(100, 100, 200, 150) + annot = page.add_freetext_annot(rect, "Test Annotation from minimal example") + annot.set_border(width=1, dashes=(3, 3)) + annot.set_opacity(0.5) + try: + annot.set_colors(stroke=(1, 0, 0)) + except ValueError as e: + assert 'cannot be used for FreeText annotations' in str(e), f'{e}' + else: + assert 0 + annot.update() + + rect = pymupdf.Rect(200, 200, 400, 400) + annot2 = page.add_freetext_annot(rect, "Test Annotation from minimal example pt 2") + annot2.set_border(width=1, dashes=(3, 3)) + annot2.set_opacity(0.5) + try: + annot2.set_colors(stroke=(1, 0, 0)) + except ValueError as e: + assert 'cannot be used for FreeText annotations' in str(e), f'{e}' + else: + assert 0 + annot.update() + annot2.update() + + # stores top color for each pixmap + top_colors = set() + for annot in page.annots(): + pix = annot.get_pixmap() + top_colors.add(pix.color_topusage()[1]) + + # only one color must exist + assert len(top_colors) == 1 + +def test_richtext(): + """Test creation of rich text FreeText annotations. + + We create the same annotation on different pages in different ways, + with and without using Annotation.update(), and then assert equality + of the respective images. + """ + ds = """font-size: 11pt; font-family: sans-serif;""" + bullet = chr(0x2610) + chr(0x2611) + chr(0x2612) + text = f"""

+ PyMuPDF འདི་ ཡིག་ཆ་བཀྲམ་སྤེལ་གྱི་དོན་ལུ་ པའི་ཐོན་ཐུམ་སྒྲིལ་དྲག་ཤོས་དང་མགྱོགས་ཤོས་ཅིག་ཨིན། + Here is some bold and italic text, followed by bold-italic. Text-based check boxes: {bullet}. +

""" + gold = (1, 1, 0) + doc = pymupdf.open() + + # First page. + page = doc.new_page() + rect = pymupdf.Rect(100, 100, 350, 200) + p2 = rect.tr + (50, 30) + p3 = p2 + (0, 30) + annot = page.add_freetext_annot( + rect, + text, + fill_color=gold, + opacity=0.5, + rotate=90, + border_width=1, + dashes=None, + richtext=True, + callout=(p3, p2, rect.tr), + ) + + pix1 = page.get_pixmap() + + # Second page. + # the annotation is created with minimal parameters, which are supplied + # in a separate call to the .update() method. + page = doc.new_page() + annot = page.add_freetext_annot( + rect, + text, + border_width=1, + dashes=None, + richtext=True, + callout=(p3, p2, rect.tr), + ) + annot.update(fill_color=gold, opacity=0.5, rotate=90) + pix2 = page.get_pixmap() + assert pix1.samples == pix2.samples + + +def test_4447(): + document = pymupdf.open() + + page = document.new_page() + + text_color = (1, 0, 0) + fill_color = (0, 1, 0) + border_color = (0, 0, 1) + + annot_rect = pymupdf.Rect(90.1, 486.73, 139.26, 499.46) + + try: + annot = page.add_freetext_annot( + annot_rect, + "AETERM", + fontname="Arial", + fontsize=10, + text_color=text_color, + fill_color=fill_color, + border_color=border_color, + border_width=1, + ) + except ValueError as e: + assert 'cannot set border_color if rich_text is False' in str(e), str(e) + else: + assert 0 + + try: + annot = page.add_freetext_annot( + (30, 400, 100, 450), + "Two", + fontname="Arial", + fontsize=10, + text_color=text_color, + fill_color=fill_color, + border_color=border_color, + border_width=1, + ) + except ValueError as e: + assert 'cannot set border_color if rich_text is False' in str(e), str(e) + else: + assert 0 + + annot = page.add_freetext_annot( + (30, 500, 100, 550), + "Three", + fontname="Arial", + fontsize=10, + text_color=text_color, + border_width=1, + ) + annot.update(text_color=text_color, fill_color=fill_color) + try: + annot.update(border_color=border_color) + except ValueError as e: + assert 'cannot set border_color if rich_text is False' in str(e), str(e) + else: + assert 0 + + path_out = os.path.normpath(f'{__file__}/../../tests/test_4447.pdf') + document.save(path_out) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_badfonts.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_badfonts.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,15 @@ +""" +Ensure we can deal with non-Latin font names. +""" +import os + +import pymupdf + + +def test_survive_names(): + scriptdir = os.path.abspath(os.path.dirname(__file__)) + filename = os.path.join(scriptdir, "resources", "has-bad-fonts.pdf") + doc = pymupdf.open(filename) + print("File '%s' uses the following fonts on page 0:" % doc.name) + for f in doc.get_page_fonts(0): + print(f) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_balance_count.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_balance_count.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,52 @@ +import pymupdf + + +def test_q_count(): + """Testing graphics state balances and wrap_contents(). + + Take page's contents and generate various imbalanced graphics state + situations. Each time compare q-count with expected results. + Finally confirm we are out of balance using "is_wrapped", wrap the + contents object(s) via "wrap_contents()" and confirm success. + PDF commands "q" / "Q" stand for "push", respectively "pop". + """ + doc = pymupdf.open() + page = doc.new_page() + # the page has no /Contents objects at all yet. 
Create one causing + # an initial imbalance (so prepended "q" is needed) + pymupdf.TOOLS._insert_contents(page, b"Q", True) # append + assert page._count_q_balance() == (1, 0) + assert page.is_wrapped is False + + # Prepend more data that yield a different type of imbalanced contents: + # Although counts of q and Q are equal now, the unshielded 'cm' before + # the first 'q' makes the contents unusable for insertions. + pymupdf.TOOLS._insert_contents(page, b"1 0 0 -1 0 0 cm q ", False) # prepend + assert page.is_wrapped is False + if page._count_q_balance() == (0, 0): + print("imbalance undetected by q balance count") + + text = "Hello, World!" + page.insert_text((100, 100), text) # establishes balance! + + # this should have produced a balanced graphics state + assert page._count_q_balance() == (0, 0) + assert page.is_wrapped + + # an appended "pop" must be balanced by a prepended "push" + pymupdf.TOOLS._insert_contents(page, b"Q", True) # append + assert page._count_q_balance() == (1, 0) + + # a prepended "pop" yet needs another push + pymupdf.TOOLS._insert_contents(page, b"Q", False) # prepend + assert page._count_q_balance() == (2, 0) + + # an appended "push" needs an additional "pop" + pymupdf.TOOLS._insert_contents(page, b"q", True) # append + assert page._count_q_balance() == (2, 1) + + # wrapping the contents should yield a balanced state again + assert page.is_wrapped is False + page.wrap_contents() + assert page.is_wrapped is True + assert page._count_q_balance() == (0, 0) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_barcode.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_barcode.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,63 @@ +import os + +import pymupdf + + +def test_barcode(): + if pymupdf.mupdf_version_tuple < (1, 26): + print(f'Not testing barcode because {pymupdf.mupdf_version=} < 1.26') + return + path = os.path.normpath(f'{__file__}/../../tests/test_barcode_out.pdf') + + url = 'http://artifex.com' + text_in = '012345678901' + text_out = '123456789012' + # Create empty document and add a qrcode image. + with pymupdf.Document() as document: + page = document.new_page() + + pixmap = pymupdf.mupdf.fz_new_barcode_pixmap( + pymupdf.mupdf.FZ_BARCODE_QRCODE, + url, + 512, + 4, # ec_level + 0, # quiet + 1, # hrt + ) + pixmap = pymupdf.Pixmap('raw', pixmap) + page.insert_image( + (0, 0, 100, 100), + pixmap=pixmap, + ) + pixmap = pymupdf.mupdf.fz_new_barcode_pixmap( + pymupdf.mupdf.FZ_BARCODE_EAN13, + text_in, + 512, + 4, # ec_level + 0, # quiet + 1, # hrt + ) + pixmap = pymupdf.Pixmap('raw', pixmap) + page.insert_image( + (0, 200, 100, 300), + pixmap=pixmap, + ) + + document.save(path) + + with pymupdf.open(path) as document: + page = document[0] + for i, ii in enumerate(page.get_images()): + xref = ii[0] + pixmap = pymupdf.Pixmap(document, xref) + hrt, barcode_type = pymupdf.mupdf.fz_decode_barcode_from_pixmap2( + pixmap.this, + 0, # rotate. + ) + print(f'{hrt=}') + if i == 0: + assert hrt == url + elif i == 1: + assert hrt == text_out + else: + assert 0 diff -r 000000000000 -r 1d09e1dec1d9 tests/test_clip_page.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_clip_page.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,37 @@ +""" +Test Page method clip_to_rect. +""" + +import os +import pymupdf + + +def test_clip(): + """ + Clip a Page to a rectangle and confirm that no text has survived + that is completely outside the rectangle.. 
+ """ + scriptdir = os.path.dirname(os.path.abspath(__file__)) + rect = pymupdf.Rect(200, 200, 400, 500) + filename = os.path.join(scriptdir, "resources", "v110-changes.pdf") + doc = pymupdf.open(filename) + page = doc[0] + page.clip_to_rect(rect) # clip the page to the rectangle + # capture font warning message of MuPDF + assert pymupdf.TOOLS.mupdf_warnings() == "bogus font ascent/descent values (0 / 0)" + # extract all text characters and assert that each one + # has a non-empty intersection with the rectangle. + chars = [ + c + for b in page.get_text("rawdict")["blocks"] + for l in b["lines"] + for s in l["spans"] + for c in s["chars"] + ] + for char in chars: + bbox = pymupdf.Rect(char["bbox"]) + if bbox.is_empty: + continue + assert bbox.intersects( + rect + ), f"Character '{char['c']}' at {bbox} is outside of {rect}." diff -r 000000000000 -r 1d09e1dec1d9 tests/test_cluster_drawings.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_cluster_drawings.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,47 @@ +import os +import pymupdf + +scriptdir = os.path.dirname(__file__) + + +def test_cluster1(): + """Confirm correct identification of known examples.""" + if not hasattr(pymupdf, "mupdf"): + print("Not executing 'test_cluster1' in classic") + return + filename = os.path.join(scriptdir, "resources", "symbol-list.pdf") + doc = pymupdf.open(filename) + page = doc[0] + assert len(page.cluster_drawings()) == 10 + filename = os.path.join(scriptdir, "resources", "chinese-tables.pdf") + doc = pymupdf.open(filename) + page = doc[0] + assert len(page.cluster_drawings()) == 2 + + +def test_cluster2(): + """Join disjoint but neighbored drawings.""" + if not hasattr(pymupdf, "mupdf"): + print("Not executing 'test_cluster2' in classic") + return + doc = pymupdf.open() + page = doc.new_page() + r1 = pymupdf.Rect(100, 100, 200, 200) + r2 = pymupdf.Rect(203, 203, 400, 400) + page.draw_rect(r1) + page.draw_rect(r2) + assert page.cluster_drawings() == [r1 | r2] + + +def test_cluster3(): + """Confirm as separate if neighborhood threshold exceeded.""" + if not hasattr(pymupdf, "mupdf"): + print("Not executing 'test_cluster3' in classic") + return + doc = pymupdf.open() + page = doc.new_page() + r1 = pymupdf.Rect(100, 100, 200, 200) + r2 = pymupdf.Rect(204, 200, 400, 400) + page.draw_rect(r1) + page.draw_rect(r2) + assert page.cluster_drawings() == [r1, r2] diff -r 000000000000 -r 1d09e1dec1d9 tests/test_codespell.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_codespell.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,68 @@ +import pymupdf + +import os +import platform +import shlex +import subprocess +import sys +import textwrap + + +def test_codespell(): + ''' + Check rebased Python code with codespell. + ''' + if not hasattr(pymupdf, 'mupdf'): + print('Not running codespell with classic implementation.') + return + + if platform.system() == 'Windows': + # Git commands seem to fail on Github Windows runners. + print(f'test_codespell(): Not running on Widows') + return + + root = os.path.abspath(f'{__file__}/../..') + + # For now we ignore files that we would ideally still look at, because it + # is difficult to exclude some text sections. 
+ skips = textwrap.dedent(''' + *.pdf + docs/_static/prism/prism.js + docs/_static/prism/prism.js + docs/locales/ja/LC_MESSAGES/changes.po + docs/locales/ja/LC_MESSAGES/recipes-common-issues-and-their-solutions.po + docs/locales/ + src_classic/* + ''') + skips = skips.strip().replace('\n', ',') + + command = textwrap.dedent(f''' + cd {root} && codespell + --skip {shlex.quote(skips)} + --ignore-words-list re-use,flate,thirdparty,re-using + --ignore-regex 'https?://[a-z0-9/_.]+' + --ignore-multiline-regex 'codespell:ignore-begin.*codespell:ignore-end' + ''') + + sys.path.append(root) + try: + import pipcl + finally: + del sys.path[0] + git_files = pipcl.git_items(root) + + for p in git_files: + _, ext = os.path.splitext(p) + if ext in ('.png', '.pdf', '.jpg', '.svg'): + pass + else: + command += f' {p}\n' + + if platform.system() != 'Windows': + command = command.replace('\n', ' \\\n') + # Don't print entire command because very long, and will be displayed + # anyway if there is an error. + #print(f'test_codespell(): Running: {command}') + print(f'Running codespell.') + subprocess.run(command, shell=1, check=1) + print('test_codespell(): codespell succeeded.') diff -r 000000000000 -r 1d09e1dec1d9 tests/test_crypting.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_crypting.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,39 @@ +""" +Check PDF encryption: +* make a PDF with owber and user passwords +* open and decrypt as owner or user +""" +import pymupdf + + +def test_encryption(): + text = "some secret information" # keep this data secret + perm = int( + pymupdf.PDF_PERM_ACCESSIBILITY # always use this + | pymupdf.PDF_PERM_PRINT # permit printing + | pymupdf.PDF_PERM_COPY # permit copying + | pymupdf.PDF_PERM_ANNOTATE # permit annotations + ) + owner_pass = "owner" # owner password + user_pass = "user" # user password + encrypt_meth = pymupdf.PDF_ENCRYPT_AES_256 # strongest algorithm + doc = pymupdf.open() # empty pdf + page = doc.new_page() # empty page + page.insert_text((50, 72), text) # insert the data + tobytes = doc.tobytes( + encryption=encrypt_meth, # set the encryption method + owner_pw=owner_pass, # set the owner password + user_pw=user_pass, # set the user password + permissions=perm, # set permissions + ) + doc.close() + doc = pymupdf.open("pdf", tobytes) + assert doc.needs_pass + assert doc.is_encrypted + rc = doc.authenticate("owner") + assert rc == 4 + assert not doc.is_encrypted + doc.close() + doc = pymupdf.open("pdf", tobytes) + rc = doc.authenticate("user") + assert rc == 2 diff -r 000000000000 -r 1d09e1dec1d9 tests/test_docs_samples.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_docs_samples.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,59 @@ +''' +Test sample scripts in docs/samples/. +''' + +import glob +import os +import pytest +import runpy + +# We only look at sample scripts that can run standalone (i.e. don't require +# sys.argv). +# +root = os.path.abspath(f'{__file__}/../..') +samples = [] +for p in glob.glob(f'{root}/docs/samples/*.py'): + if os.path.basename(p) in ( + 'make-bold.py', # Needs sys.argv[1]. + 'multiprocess-gui.py', # GUI. + 'multiprocess-render.py', # Needs sys.argv[1]. + 'text-lister.py', # Needs sys.argv[1]. + ): + print(f'Not testing: {p}') + else: + p = os.path.relpath(p, root) + samples.append(p) + +def _test_all(): + # Allow runnings tests directly without pytest. 
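+ # Samples are run in-process via runpy.run_path(); failures are counted and raised as a single exception at the end.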
+ import subprocess + import sys + e = 0 + for sample in samples: + print( f'Running: {sample}', flush=1) + try: + if 0: + # Curiously this fails in an odd way when testing compound + # package with $PYTHONPATH set. + print( f'os.environ is:') + for n, v in os.environ.items(): + print( f' {n}: {v!r}') + command = f'{sys.executable} {sample}' + print( f'command is: {command!r}') + sys.stdout.flush() + subprocess.check_call( command, shell=1, text=1) + else: + runpy.run_path(sample) + except Exception: + print( f'Failed: {sample}') + e += 1 + if e: + raise Exception( f'Errors: {e}') + +# We use pytest.mark.parametrize() to run sample scripts via a fn, which +# ensures that pytest treats each script as a test. +# +@pytest.mark.parametrize('sample', samples) +def test_docs_samples(sample): + sample = f'{root}/{sample}' + runpy.run_path(sample) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_drawings.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_drawings.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,230 @@ +""" +Extract drawings of a PDF page and compare with stored expected result. +""" + +import io +import os +import sys +import pprint + +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "symbol-list.pdf") +symbols = os.path.join(scriptdir, "resources", "symbols.txt") + + +def test_drawings1(): + symbols_text = open(symbols).read() # expected result + doc = pymupdf.open(filename) + page = doc[0] + paths = page.get_cdrawings() + out = io.StringIO() # pprint output goes here + pprint.pprint(paths, stream=out) + assert symbols_text == out.getvalue() + + +def test_drawings2(): + delta = (0, 20, 0, 20) + doc = pymupdf.open() + page = doc.new_page() + + r = pymupdf.Rect(100, 100, 200, 200) + page.draw_circle(r.br, 2, color=0) + r += delta + + page.draw_line(r.tl, r.br, color=0) + r += delta + + page.draw_oval(r, color=0) + r += delta + + page.draw_rect(r, color=0) + r += delta + + page.draw_quad(r.quad, color=0) + r += delta + + page.draw_polyline((r.tl, r.tr, r.br), color=0) + r += delta + + page.draw_bezier(r.tl, r.tr, r.br, r.bl, color=0) + r += delta + + page.draw_curve(r.tl, r.tr, r.br, color=0) + r += delta + + page.draw_squiggle(r.tl, r.br, color=0) + r += delta + + rects = [p["rect"] for p in page.get_cdrawings()] + bboxes = [b[1] for b in page.get_bboxlog()] + for i, r in enumerate(rects): + assert pymupdf.Rect(r) in pymupdf.Rect(bboxes[i]) + + +def _dict_difference(a, b): + """ + Verifies that dictionaries "a", "b" + * have the same keys and values, except for key "items": + * the items list of "a" must be one shorter but otherwise equal the "b" items + + Returns last item of b["items"]. 
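+ For example, the drawings of two pages that differ only in closePath are compared, and the returned item is the implicit closing line added by closePath=True.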
+ """ + assert a.keys() == b.keys() + for k in a.keys(): + v1 = a[k] + v2 = b[k] + if k != "items": + assert v1 == v2 + else: + assert v1 == v2[:-1] + rc = v2[-1] + return rc + + +def test_drawings3(): + doc = pymupdf.open() + page1 = doc.new_page() + shape1 = page1.new_shape() + shape1.draw_line((10, 10), (10, 50)) + shape1.draw_line((10, 50), (100, 100)) + shape1.finish(closePath=False) + shape1.commit() + drawings1 = page1.get_drawings()[0] + + page2 = doc.new_page() + shape2 = page2.new_shape() + shape2.draw_line((10, 10), (10, 50)) + shape2.draw_line((10, 50), (100, 100)) + shape2.finish(closePath=True) + shape2.commit() + drawings2 = page2.get_drawings()[0] + + assert _dict_difference(drawings1, drawings2) == ("l", (100, 100), (10, 10)) + + page3 = doc.new_page() + shape3 = page3.new_shape() + shape3.draw_line((10, 10), (10, 50)) + shape3.draw_line((10, 50), (100, 100)) + shape3.draw_line((100, 100), (50, 70)) + shape3.finish(closePath=False) + shape3.commit() + drawings3 = page3.get_drawings()[0] + + page4 = doc.new_page() + shape4 = page4.new_shape() + shape4.draw_line((10, 10), (10, 50)) + shape4.draw_line((10, 50), (100, 100)) + shape4.draw_line((100, 100), (50, 70)) + shape4.finish(closePath=True) + shape4.commit() + drawings4 = page4.get_drawings()[0] + + assert _dict_difference(drawings3, drawings4) == ("l", (50, 70), (10, 10)) + + +def test_2365(): + """Draw a filled rectangle on a new page. + + Then extract the page's vector graphics and confirm that only one path + was generated which has all the right properties.""" + doc = pymupdf.open() + page = doc.new_page() + rect = pymupdf.Rect(100, 100, 200, 200) + page.draw_rect( + rect, color=pymupdf.pdfcolor["black"], fill=pymupdf.pdfcolor["yellow"], width=3 + ) + paths = page.get_drawings() + assert len(paths) == 1 + path = paths[0] + assert path["type"] == "fs" + assert path["fill"] == pymupdf.pdfcolor["yellow"] + assert path["fill_opacity"] == 1 + assert path["color"] == pymupdf.pdfcolor["black"] + assert path["stroke_opacity"] == 1 + assert path["width"] == 3 + assert path["rect"] == rect + + +def test_2462(): + """ + Assertion happens, if this code does NOT bring down the interpreter. + + Background: + We previously ignored clips for non-vector-graphics. However, ending + a clip does not refer back the object(s) that have been clipped. + In order to correctly compute the "scissor" rectangle, we now keep track + of the clipped object type. + """ + doc = pymupdf.open(f"{scriptdir}/resources/test-2462.pdf") + page = doc[0] + vg = page.get_drawings(extended=True) + + +def test_2556(): + """Ensure that incomplete clip paths will be properly ignored.""" + doc = pymupdf.open() # new empty PDF + page = doc.new_page() # new page + # following contains an incomplete clip + c = b"q 50 697.6 400 100.0 re W n q 0 0 m W n Q " + xref = doc.get_new_xref() # prepare /Contents object for page + doc.update_object(xref, "<<>>") # new xref now is a dictionary + doc.update_stream(xref, c) # store drawing commands + page.set_contents(xref) # give the page this xref as /Contents + # following will bring down interpreter if fix not installed + assert page.get_drawings(extended=True) + + +def test_3207(): + """Example graphics with multiple "close path" commands within same path. + + The fix translates a close-path commands into an additional line + which connects the current point with a preceding "move" target. + The example page has 2 paths which each contain 2 close-path + commands after 2 normal "line" commands, i.e. 
2 command sequences + "move-to, line-to, line-to, close-path". + This is converted into 3 connected lines, where the last end point + is connect to the start point of the first line. + So, in the sequence of lines / points + + (p0, p1), (p2, p3), (p4, p5), (p6, p7), (p8, p9), (p10, p11) + + point p5 must equal p0, and p11 must equal p6 (for each of the + two paths in the example). + """ + filename = os.path.join(scriptdir, "resources", "test-3207.pdf") + doc = pymupdf.open(filename) + page = doc[0] + paths = page.get_drawings() + assert len(paths) == 2 + + path0 = paths[0] + items = path0["items"] + assert len(items) == 6 + p0 = items[0][1] + p5 = items[2][2] + p6 = items[3][1] + p11 = items[5][2] + assert p0 == p5 + assert p6 == p11 + + path1 = paths[1] + items = path1["items"] + assert len(items) == 6 + p0 = items[0][1] + p5 = items[2][2] + p6 = items[3][1] + p11 = items[5][2] + assert p0 == p5 + assert p6 == p11 + + +def test_3591(): + """Confirm correct scaling factor for rotation matrices.""" + filename = os.path.join(scriptdir, "resources", "test-3591.pdf") + doc = pymupdf.open(filename) + page = doc[0] + paths = page.get_drawings() + for p in paths: + assert p["width"] == 15 diff -r 000000000000 -r 1d09e1dec1d9 tests/test_embeddedfiles.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_embeddedfiles.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,43 @@ +""" +Tests for PDF EmbeddedFiles functions. +""" +import pymupdf + + +def test_embedded1(): + doc = pymupdf.open() + buffer = b"123456678790qwexcvnmhofbnmfsdg4589754uiofjkb-" + doc.embfile_add( + "file1", + buffer, + filename="testfile.txt", + ufilename="testfile-u.txt", + desc="Description of some sort", + ) + assert doc.embfile_count() == 1 + assert doc.embfile_names() == ["file1"] + assert doc.embfile_info(0)["name"] == "file1" + doc.embfile_upd(0, filename="new-filename.txt") + assert doc.embfile_info(0)["filename"] == "new-filename.txt" + assert doc.embfile_get(0) == buffer + doc.embfile_del(0) + assert doc.embfile_count() == 0 + +def test_4050(): + with pymupdf.open() as document: + document.embfile_add('test', b'foobar', desc='some text') + d = document.embfile_info('test') + print(f'{d=}') + # Date is non-trivial to test for. + del d['creationDate'] + del d['modDate'] + assert d == { + 'name': 'test', + 'collection': 0, + 'filename': 'test', + 'ufilename': 'test', + 'description': 'some text', + 'size': 6, + 'length': 6, + } + diff -r 000000000000 -r 1d09e1dec1d9 tests/test_extractimage.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_extractimage.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,57 @@ +""" +Extract images from a PDF file, confirm number of images found. 
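+Images are found by walking all xref numbers and checking for /Subtype /Image.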
+""" +import os +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "joined.pdf") +known_image_count = 21 + + +def test_extract_image(): + doc = pymupdf.open(filename) + + image_count = 1 + for xref in range(1, doc.xref_length() - 1): + if doc.xref_get_key(xref, "Subtype")[1] != "/Image": + continue + img = doc.extract_image(xref) + if isinstance(img, dict): + image_count += 1 + + assert image_count == known_image_count # this number is know about the file + +def test_2348(): + + pdf_path = f'{scriptdir}/test_2348.pdf' + document = pymupdf.open() + page = document.new_page(width=500, height=842) + rect = pymupdf.Rect(20, 20, 480, 820) + page.insert_image(rect, filename=f'{scriptdir}/resources/nur-ruhig.jpg') + page = document.new_page(width=500, height=842) + page.insert_image(rect, filename=f'{scriptdir}/resources/img-transparent.png') + document.ez_save(pdf_path) + document.close() + + document = pymupdf.open(pdf_path) + page = document[0] + imlist = page.get_images() + image = document.extract_image(imlist[0][0]) + jpeg_extension = image['ext'] + + page = document[1] + imlist = page.get_images() + image = document.extract_image(imlist[0][0]) + png_extension = image['ext'] + + print(f'jpeg_extension={jpeg_extension!r} png_extension={png_extension!r}') + assert jpeg_extension == 'jpeg' + assert png_extension == 'png' + +def test_delete_image(): + + doc = pymupdf.open(os.path.abspath(f'{__file__}/../../tests/resources/test_delete_image.pdf')) + page = doc[0] + xref = page.get_images()[0][0] + page.delete_image(xref) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_flake8.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_flake8.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,54 @@ +import pymupdf + +import os +import subprocess +import sys + + +def test_flake8(): + ''' + Check rebased Python code with flake8. 
+ ''' + if not hasattr(pymupdf, 'mupdf'): + print(f'Not running flake8 with classic implementation.') + return + ignores = ( + 'E123', # closing bracket does not match indentation of opening bracket's line + 'E124', # closing bracket does not match visual indentation + 'E126', # continuation line over-indented for hanging indent + 'E127', # continuation line over-indented for visual indent + 'E128', # continuation line under-indented for visual indent + 'E131', # continuation line unaligned for hanging indent + 'E201', # whitespace after '(' + 'E203', # whitespace before ':' + 'E221', # E221 multiple spaces before operator + 'E225', # missing whitespace around operator + 'E226', # missing whitespace around arithmetic operator + 'E231', # missing whitespace after ',' + 'E241', # multiple spaces after ':' + 'E251', # unexpected spaces around keyword / parameter equals + 'E252', # missing whitespace around parameter equals + 'E261', # at least two spaces before inline comment + 'E265', # block comment should start with '# ' + 'E271', # multiple spaces after keyword + 'E272', # multiple spaces before keyword + 'E302', # expected 2 blank lines, found 1 + 'E305', # expected 2 blank lines after class or function definition, found 1 + 'E306', # expected 1 blank line before a nested definition, found 0 + 'E402', # module level import not at top of file + 'E501', # line too long (80 > 79 characters) + 'E701', # multiple statements on one line (colon) + 'E741', # ambiguous variable name 'l' + 'F541', # f-string is missing placeholders + 'W293', # blank line contains whitespace + 'W503', # line break before binary operator + 'W504', # line break after binary operator + 'E731', # do not assign a lambda expression, use a def + ) + ignores = ','.join(ignores) + root = os.path.abspath(f'{__file__}/../..') + def run(command): + print(f'test_flake8(): Running: {command}') + subprocess.run(command, shell=1, check=1) + run(f'flake8 --ignore={ignores} --statistics {root}/src/__init__.py {root}/src/utils.py {root}/src/table.py') + print(f'test_flake8(): flake8 succeeded.') diff -r 000000000000 -r 1d09e1dec1d9 tests/test_font.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_font.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,332 @@ +""" +Tests for the Font class. +""" +import os +import platform +import pymupdf +import subprocess +import textwrap + +import util + + +def test_font1(): + text = "PyMuPDF" + font = pymupdf.Font("helv") + assert font.name == "Helvetica" + tl = font.text_length(text, fontsize=20) + cl = font.char_lengths(text, fontsize=20) + assert len(text) == len(cl) + assert abs(sum(cl) - tl) < pymupdf.EPSILON + for i in range(len(cl)): + assert cl[i] == font.glyph_advance(ord(text[i])) * 20 + font2 = pymupdf.Font(fontbuffer=font.buffer) + codepoints1 = font.valid_codepoints() + codepoints2 = font2.valid_codepoints() + print('') + print(f'{len(codepoints1)=}') + print(f'{len(codepoints2)=}') + if 0: + for i, (ucs1, ucs2) in enumerate(zip(codepoints1, codepoints2)): + print(f' {i}: {ucs1=} {ucs2=} {"" if ucs2==ucs2 else "*"}') + assert font2.valid_codepoints() == font.valid_codepoints() + + # Also check we can get font's bbox. 
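+ # On rebased builds this should agree with MuPDF's own fz_font_bbox() (checked below).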
+ bbox1 = font.bbox + print(f'{bbox1=}') + if hasattr(pymupdf, 'mupdf'): + bbox2 = font.this.fz_font_bbox() + assert bbox2 == bbox1 + + +def test_font2(): + """Old and new length computation must be the same.""" + font = pymupdf.Font("helv") + text = "PyMuPDF" + assert font.text_length(text) == pymupdf.get_text_length(text) + + +def test_fontname(): + """Assert a valid PDF fontname.""" + doc = pymupdf.open() + page = doc.new_page() + assert page.insert_font() # assert: a valid fontname works! + detected = False # preset indicator + try: # fontname check will fail first - don't need a font at all here + page.insert_font(fontname="illegal/char", fontfile="unimportant") + except ValueError as e: + if str(e).startswith("bad fontname chars"): + detected = True # illegal fontname detected + assert detected + +def test_2608(): + flags = (pymupdf.TEXT_DEHYPHENATE | pymupdf.TEXT_MEDIABOX_CLIP) + with pymupdf.open(os.path.abspath(f'{__file__}/../../tests/resources/2201.00069.pdf')) as doc: + page = doc[0] + blocks = page.get_text_blocks(flags=flags) + text = blocks[10][4] + with open(os.path.abspath(f'{__file__}/../../tests/test_2608_out'), 'wb') as f: + f.write(text.encode('utf8')) + path_expected = os.path.normpath(f'{__file__}/../../tests/resources/test_2608_expected') + path_expected_1_26 = os.path.normpath(f'{__file__}/../../tests/resources/test_2608_expected_1.26') + if pymupdf.mupdf_version_tuple >= (1, 27): + path_expected2 = path_expected + else: + path_expected2 = path_expected_1_26 + with open(path_expected2, 'rb') as f: + expected = f.read().decode('utf8') + # Github windows x32 seems to insert \r characters; maybe something to + # do with the Python installation's line endings settings. + expected = expected.replace('\r', '') + print(f'test_2608(): {text.encode("utf8")=}') + print(f'test_2608(): {expected.encode("utf8")=}') + assert text == expected + +def test_fontarchive(): + import subprocess + arch = pymupdf.Archive() + css = pymupdf.css_for_pymupdf_font("notos", archive=arch, name="sans-serif") + print(css) + print(arch.entry_list) + assert arch.entry_list == \ + [ + { + 'fmt': 'tree', + 'entries': + [ + 'notosbo', 'notosbi', 'notosit', 'notos' + ], + 'path': None + } + ] + +def test_load_system_font(): + if not hasattr(pymupdf, 'mupdf'): + print(f'test_load_system_font(): Not running on classic.') + return + trace = list() + def font_f(name, bold, italic, needs_exact_metrics): + trace.append((name, bold, italic, needs_exact_metrics)) + #print(f'test_load_system_font():font_f(): Looking for font: {name=} {bold=} {italic=} {needs_exact_metrics=}.') + return None + def f_cjk(name, ordering, serif): + trace.append((name, ordering, serif)) + #print(f'test_load_system_font():f_cjk(): Looking for font: {name=} {ordering=} {serif=}.') + return None + def f_fallback(script, language, serif, bold, italic): + trace.append((script, language, serif, bold, italic)) + #print(f'test_load_system_font():f_fallback(): looking for font: {script=} {language=} {serif=} {bold=} {italic=}.') + return None + pymupdf.mupdf.fz_install_load_system_font_funcs(font_f, f_cjk, f_fallback) + f = pymupdf.mupdf.fz_load_system_font("some-font-name", 0, 0, 0) + assert trace == [ + ('some-font-name', 0, 0, 0), + ], f'Incorrect {trace=}.' 
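+ # All three callbacks return None, so no system font should be found here; f.m_internal is printed for information only.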
+ print(f'test_load_system_font(): {f.m_internal=}') + + +def test_mupdf_subset_fonts2(): + if not hasattr(pymupdf, 'mupdf'): + print('Not running on rebased.') + return + path = os.path.abspath(f'{__file__}/../../tests/resources/2.pdf') + with pymupdf.open(path) as doc: + n = len(doc) + pages = [i*2 for i in range(n//2)] + print(f'{pages=}.') + pymupdf.mupdf.pdf_subset_fonts2(pymupdf._as_pdf_document(doc), pages) + + +def test_3677(): + pymupdf.TOOLS.set_subset_fontnames(True) + try: + path = os.path.abspath(f'{__file__}/../../tests/resources/test_3677.pdf') + font_names_expected = [ + 'BCDEEE+Aptos', + 'BCDFEE+Aptos', + 'BCDGEE+Calibri-Light', + 'BCDHEE+Calibri-Light', + ] + font_names = list() + with pymupdf.open(path) as document: + for page in document: + for block in page.get_text('dict')['blocks']: + if block['type'] == 0: + if 'lines' in block.keys(): + for line in block['lines']: + for span in line['spans']: + font_name=span['font'] + print(font_name) + font_names.append(font_name) + assert font_names == font_names_expected, f'{font_names=}' + finally: + pymupdf.TOOLS.set_subset_fontnames(False) + + +def test_3933(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3933.pdf') + with pymupdf.open(path) as document: + page = document[0] + print(f'{len(page.get_fonts())=}') + + expected = { + 'BCDEEE+Calibri': 39, + 'BCDFEE+SwissReSan-Regu': 53, + 'BCDGEE+SwissReSan-Ital': 20, + 'BCDHEE+SwissReSan-Bold': 20, + 'BCDIEE+SwissReSan-Regu': 53, + 'BCDJEE+Calibri': 39, + } + + for xref, _, _, name, _, _ in page.get_fonts(): + _, _, _, content = document.extract_font(xref) + + if content: + font = pymupdf.Font(fontname=name, fontbuffer=content) + supported_symbols = font.valid_codepoints() + print(f'Font {name}: {len(supported_symbols)=}.', flush=1) + assert len(supported_symbols) == expected.get(name) + + +def test_3780(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3780.pdf') + with pymupdf.open(path) as document: + for page_i, page in enumerate(document): + for itm in page.get_fonts(): + buff=document.extract_font(itm[0])[-1] + font=pymupdf.Font(fontbuffer=buff) + print(f'{page_i=}: xref {itm[0]} {font.name=} {font.ascender=} {font.descender=}.') + if page_i == 0: + d = page.get_text('dict') + #for n, v in d.items(): + # print(f' {n}: {v!r}') + for i, block in enumerate(d['blocks']): + print(f'block {i}:') + for j, line in enumerate(block['lines']): + print(f' line {j}:') + for k, span in enumerate(line['spans']): + print(f' span {k}:') + for n, v in span.items(): + print(f' {n}: {v!r}') + + +def test_3887(): + print(f'{pymupdf.version=}') + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3887.pdf') + + path2 = os.path.normpath(f'{__file__}/../../tests/resources/test_3887.pdf.ez.pdf') + with pymupdf.open(path) as document: + document.subset_fonts(fallback=False) + document.ez_save(path2) + + with pymupdf.open(path2) as document: + text = f"\u0391\u3001\u0392\u3001\u0393\u3001\u0394\u3001\u0395\u3001\u0396\u3001\u0397\u3001\u0398\u3001\u0399\u3001\u039a\u3001\u039b\u3001\u039c\u3001\u039d\u3001\u039e\u3001\u039f\u3001\u03a0\u3001\u03a1\u3001\u03a3\u3001\u03a4\u3001\u03a5\u3001\u03a6\u3001\u03a7\u3001\u03a8\u3001\u03a9\u3002\u03b1\u3001\u03b2\u3001\u03b3\u3001\u03b4\u3001\u03b5\u3001\u03b6\u3001\u03b7\u3001\u03b8\u3001\u03b9\u3001\u03ba\u3001\u03bb\u3001\u03bc\u3001\u03bd\u3001\u03be\u3001\u03bf\u3001\u03c0\u3001\u03c1\u3001\u03c2\u3001\u03c4\u3001\u03c5\u3001\u03c6\u3001\u03c7\u3001\u03c8\u3001\u03c9\u3002" + page = document[0] + 
chars = [c for b in page.get_text("rawdict",flags=0)["blocks"] for l in b["lines"] for s in l["spans"] for c in s["chars"]] + output = [c["c"] for c in chars] + print(f'text:\n {text}') + print(f'output:\n {output}') + pixmap = page.get_pixmap() + path_pixmap = f'{path}.0.png' + pixmap.save(path_pixmap) + print(f'Have saved to: {path_pixmap=}') + assert set(output)==set(text) + + +def test_4457(): + print() + files = ( + ('https://github.com/user-attachments/files/20862923/test_4457_a.pdf', 'test_4457_a.pdf', None, 4), + ('https://github.com/user-attachments/files/20862922/test_4457_b.pdf', 'test_4457_b.pdf', None, 9), + ) + for url, name, size, rms_old_after_max in files: + path = util.download(url, name, size) + + with pymupdf.open(path) as document: + page = document[0] + + pixmap = document[0].get_pixmap() + path_pixmap = f'{path}.png' + pixmap.save(path_pixmap) + print(f'Have created: {path_pixmap=}') + + text = page.get_text() + path_before = f'{path}.before.pdf' + path_after = f'{path}.after.pdf' + document.ez_save(path_before, garbage=4) + print(f'Have created {path_before=}') + + document.subset_fonts() + document.ez_save(path_after, garbage=4) + print(f'Have created {path_after=}') + + with pymupdf.open(path_before) as document: + text_before = document[0].get_text() + pixmap_before = document[0].get_pixmap() + path_pixmap_before = f'{path_before}.png' + pixmap_before.save(path_pixmap_before) + print(f'Have created: {path_pixmap_before=}') + + with pymupdf.open(path_after) as document: + text_after = document[0].get_text() + pixmap_after = document[0].get_pixmap() + path_pixmap_after = f'{path_after}.png' + pixmap_after.save(path_pixmap_after) + print(f'Have created: {path_pixmap_after=}') + + import gentle_compare + rms_before = gentle_compare.pixmaps_rms(pixmap, pixmap_before) + rms_after = gentle_compare.pixmaps_rms(pixmap, pixmap_after) + print(f'{rms_before=}') + print(f'{rms_after=}') + + # Create .png file showing differences between and . + path_pixmap_after_diff = f'{path_after}.diff.png' + pixmap_after_diff = gentle_compare.pixmaps_diff(pixmap, pixmap_after) + pixmap_after_diff.save(path_pixmap_after_diff) + print(f'Have created: {path_pixmap_after_diff}') + + # Extract text from , and and write to + # files so we can show differences with `diff`. + path_text = os.path.normpath(f'{__file__}/../../tests/test_4457.txt') + path_text_before = f'{path_text}.before.txt' + path_text_after = f'{path_text}.after.txt' + with open(path_text, 'w', encoding='utf8') as f: + f.write(text) + with open(path_text_before, 'w', encoding='utf8') as f: + f.write(text_before) + with open(path_text_after, 'w', encoding='utf8') as f: + f.write(text_after) + + # Can't write text to stdout on Windows because of encoding errors. + if platform.system() != 'Windows': + print(f'text:\n{textwrap.indent(text, " ")}') + print(f'text_before:\n{textwrap.indent(text_before, " ")}') + print(f'text_after:\n{textwrap.indent(text_after, " ")}') + print(f'{path_text=}') + print(f'{path_text_before=}') + print(f'{path_text_after=}') + + command = f'diff -u {path_text} {path_text_before}' + print(f'Running: {command}', flush=1) + subprocess.run(command, shell=1) + + command = f'diff -u {path_text} {path_text_after}' + print(f'Running: {command}', flush=1) + subprocess.run(command, shell=1) + + assert text_before == text + assert rms_before == 0 + + if pymupdf.mupdf_version_tuple >= (1, 26, 6): + assert rms_after == 0 + else: + # As of 2025-05-20 there are some differences in some characters, + # e.g. 
the non-ascii characters in `Philipp Krahenbuhl`. See + # and . + assert abs(rms_after - rms_old_after_max) < 2 + + # Avoid test failure caused by mupdf warnings. + wt = pymupdf.TOOLS.mupdf_warnings() + print(f'{wt=}') + assert wt == 'bogus font ascent/descent values (0 / 0)\n... repeated 5 times...' diff -r 000000000000 -r 1d09e1dec1d9 tests/test_general.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_general.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,2053 @@ +# encoding utf-8 +""" +* Confirm sample doc has no links and no annots. +* Confirm proper release of file handles via Document.close() +* Confirm properly raising exceptions in document creation +""" +import io +import os + +import fnmatch +import json +import pymupdf +import pathlib +import pickle +import platform +import re +import shutil +import subprocess +import sys +import textwrap +import time +import util + +import gentle_compare + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "001003ED.pdf") + + +def test_haslinks(): + doc = pymupdf.open(filename) + assert doc.has_links() == False + + +def test_hasannots(): + doc = pymupdf.open(filename) + assert doc.has_annots() == False + + +def test_haswidgets(): + doc = pymupdf.open(filename) + assert doc.is_form_pdf == False + + +def test_isrepaired(): + doc = pymupdf.open(filename) + assert doc.is_repaired == False + pymupdf.TOOLS.mupdf_warnings() + + +def test_isdirty(): + doc = pymupdf.open(filename) + assert doc.is_dirty == False + + +def test_cansaveincrementally(): + doc = pymupdf.open(filename) + assert doc.can_save_incrementally() == True + + +def test_iswrapped(): + doc = pymupdf.open(filename) + page = doc[0] + assert page.is_wrapped + wt = pymupdf.TOOLS.mupdf_warnings() + if pymupdf.mupdf_version_tuple >= (1, 26, 0): + assert wt == 'bogus font ascent/descent values (0 / 0)' + else: + assert not wt + + +def test_wrapcontents(): + doc = pymupdf.open(filename) + page = doc[0] + page.wrap_contents() + xref = page.get_contents()[0] + cont = page.read_contents() + doc.update_stream(xref, cont) + page.set_contents(xref) + assert len(page.get_contents()) == 1 + page.clean_contents() + rebased = hasattr(pymupdf, 'mupdf') + if rebased: + wt = pymupdf.TOOLS.mupdf_warnings() + if pymupdf.mupdf_version_tuple >= (1, 26, 0): + assert wt == 'bogus font ascent/descent values (0 / 0)\nPDF stream Length incorrect' + else: + assert wt == 'PDF stream Length incorrect' + + +def test_page_clean_contents(): + """Assert that page contents cleaning actually is invoked.""" + doc = pymupdf.open() + page = doc.new_page() + + # draw two rectangles - will lead to two /Contents objects + page.draw_rect((10, 10, 20, 20)) + page.draw_rect((20, 20, 30, 30)) + assert len(page.get_contents()) == 2 + assert page.read_contents().startswith(b"q") == False + + # clean / consolidate into one /Contents object + page.clean_contents() + assert len(page.get_contents()) == 1 + assert page.read_contents().startswith(b"q") == True + + +def test_annot_clean_contents(): + """Assert that annot contents cleaning actually is invoked.""" + doc = pymupdf.open() + page = doc.new_page() + annot = page.add_highlight_annot((10, 10, 20, 20)) + + # the annotation appearance will not start with command b"q" + + + # invoke appearance stream cleaning and reformatting + annot.clean_contents() + + # appearance stream should now indeed start with command b"q" + assert annot._getAP().startswith(b"q") == True + + +def test_config(): + assert 
pymupdf.TOOLS.fitz_config["py-memory"] in (True, False) + + +def test_glyphnames(): + name = "INFINITY" + infinity = pymupdf.glyph_name_to_unicode(name) + assert pymupdf.unicode_to_glyph_name(infinity) == name + + +def test_rgbcodes(): + sRGB = 0xFFFFFF + assert pymupdf.sRGB_to_pdf(sRGB) == (1, 1, 1) + assert pymupdf.sRGB_to_rgb(sRGB) == (255, 255, 255) + + +def test_pdfstring(): + pymupdf.get_pdf_now() + pymupdf.get_pdf_str("Beijing, chinesisch 北京") + pymupdf.get_text_length("Beijing, chinesisch 北京", fontname="china-s") + pymupdf.get_pdf_str("Latin characters êßöäü") + + +def test_open_exceptions(): + path = os.path.normpath(f'{__file__}/../../tests/resources/001003ED.pdf') + doc = pymupdf.open(path, filetype="xps") + assert 'PDF' in doc.metadata["format"] + + doc = pymupdf.open(path, filetype="xxx") + assert 'PDF' in doc.metadata["format"] + + try: + pymupdf.open("x.y") + except Exception as e: + assert repr(e).startswith("FileNotFoundError") + else: + assert 0 + + try: + pymupdf.open(stream=b"", filetype="pdf") + except RuntimeError as e: + assert repr(e).startswith("EmptyFileError"), f'{repr(e)=}' + else: + print(f'{doc.metadata["format"]=}') + assert 0 + + +def test_bug1945(): + pdf = pymupdf.open(f'{scriptdir}/resources/bug1945.pdf') + buffer_ = io.BytesIO() + pdf.save(buffer_, clean=True) + + +def test_bug1971(): + for _ in range(2): + doc = pymupdf.Document(f'{scriptdir}/resources/bug1971.pdf') + page = next(doc.pages()) + page.get_drawings() + doc.close() + assert doc.is_closed + +def test_default_font(): + f = pymupdf.Font() + assert str(f) == "Font('Noto Serif Regular')" + assert repr(f) == "Font('Noto Serif Regular')" + +def test_add_ink_annot(): + import math + document = pymupdf.Document() + page = document.new_page() + line1 = [] + line2 = [] + for a in range( 0, 360*2, 15): + x = a + c = 300 + 200 * math.cos( a * math.pi/180) + s = 300 + 100 * math.sin( a * math.pi/180) + line1.append( (x, c)) + line2.append( (x, s)) + page.add_ink_annot( [line1, line2]) + page.insert_text((100, 72), 'Hello world') + page.add_text_annot((200,200), "Some Text") + page.get_bboxlog() + path = f'{scriptdir}/resources/test_add_ink_annot.pdf' + document.save( path) + print( f'Have saved to: path={path!r}') + +def test_techwriter_append(): + print(pymupdf.__doc__) + doc = pymupdf.open() + page = doc.new_page() + tw = pymupdf.TextWriter(page.rect) + text = "Red rectangle = TextWriter.text_rect, blue circle = .last_point" + r = tw.append((100, 100), text) + print(f'r={r!r}') + tw.write_text(page) + page.draw_rect(tw.text_rect, color=pymupdf.pdfcolor["red"]) + page.draw_circle(tw.last_point, 2, color=pymupdf.pdfcolor["blue"]) + path = f"{scriptdir}/resources/test_techwriter_append.pdf" + doc.ez_save(path) + print( f'Have saved to: {path}') + +def test_opacity(): + doc = pymupdf.open() + page = doc.new_page() + + annot1 = page.add_circle_annot((50, 50, 100, 100)) + annot1.set_colors(fill=(1, 0, 0), stroke=(1, 0, 0)) + annot1.set_opacity(2 / 3) + annot1.update(blend_mode="Multiply") + + annot2 = page.add_circle_annot((75, 75, 125, 125)) + annot2.set_colors(fill=(0, 0, 1), stroke=(0, 0, 1)) + annot2.set_opacity(1 / 3) + annot2.update(blend_mode="Multiply") + outfile = f'{scriptdir}/resources/opacity.pdf' + doc.save(outfile, expand=True, pretty=True) + print("saved", outfile) + +def test_get_text_dict(): + import json + doc=pymupdf.open(f'{scriptdir}/resources/v110-changes.pdf') + page=doc[0] + blocks=page.get_text("dict")["blocks"] + # Check no opaque types in `blocks`. 
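+ # json.dumps() raises TypeError for non-serializable values, so this call doubles as the assertion.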
+ json.dumps( blocks, indent=4) + wt = pymupdf.TOOLS.mupdf_warnings() + if pymupdf.mupdf_version_tuple >= (1, 26, 0): + assert wt == 'bogus font ascent/descent values (0 / 0)' + else: + assert not wt + +def test_font(): + font = pymupdf.Font() + print(repr(font)) + bbox = font.glyph_bbox( 65) + print( f'bbox={bbox!r}') + +def test_insert_font(): + doc=pymupdf.open(f'{scriptdir}/resources/v110-changes.pdf') + page = doc[0] + i = page.insert_font() + print( f'page.insert_font() => {i}') + +def test_2173(): + from pymupdf import IRect, Pixmap, CS_RGB, Colorspace + for i in range( 100): + #print( f'i={i!r}') + image = Pixmap(Colorspace(CS_RGB), IRect(0, 0, 13, 37)) + print( 'test_2173() finished') + +def test_texttrace(): + import time + document = pymupdf.Document( f'{scriptdir}/resources/joined.pdf') + t = time.time() + for page in document: + tt = page.get_texttrace() + t = time.time() - t + print( f'test_texttrace(): t={t!r}') + + # Repeat, this time writing data to file. + import json + path = f'{scriptdir}/resources/test_texttrace.txt' + print( f'test_texttrace(): Writing to: {path}') + with open( path, 'w') as f: + for i, page in enumerate(document): + tt = page.get_texttrace() + print( f'page {i} json:\n{json.dumps(tt, indent=" ")}', file=f) + + +def test_2533(): + """Assert correct char bbox in page.get_texttrace(). + + Search for a unique char on page and confirm that page.get_texttrace() + returns the same bbox as the search method. + """ + if hasattr(pymupdf, 'mupdf') and not pymupdf.g_use_extra: + print('Not running test_2533() because rebased with use_extra=0 known to fail') + return + pymupdf.TOOLS.set_small_glyph_heights(True) + try: + doc = pymupdf.open(os.path.join(scriptdir, "resources", "test_2533.pdf")) + page = doc[0] + NEEDLE = "民" + ord_NEEDLE = ord(NEEDLE) + for span in page.get_texttrace(): + for char in span["chars"]: + if char[0] == ord_NEEDLE: + bbox = pymupdf.Rect(char[3]) + break + bbox2 = page.search_for(NEEDLE)[0] + assert bbox2 == bbox, f'{bbox=} {bbox2=} {bbox2-bbox=}.' + finally: + pymupdf.TOOLS.set_small_glyph_heights(False) + + +def test_2645(): + """Assert same font size calculation in corner cases. 
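+ For each file, the size reported by page.get_texttrace() is compared with the span size from page.get_text("dict").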
+ """ + folder = os.path.join(scriptdir, "resources") + files = ("test_2645_1.pdf", "test_2645_2.pdf", "test_2645_3.pdf") + for f in files: + doc = pymupdf.open(os.path.join(folder, f)) + page = doc[0] + fontsize0 = page.get_texttrace()[0]["size"] + fontsize1 = page.get_text("dict", flags=pymupdf.TEXTFLAGS_TEXT)["blocks"][0]["lines"][ + 0 + ]["spans"][0]["size"] + assert abs(fontsize0 - fontsize1) < 1e-5 + + +def test_2506(): + """Ensure expected font size across text writing angles.""" + doc = pymupdf.open() + page = doc.new_page() + point = pymupdf.Point(100, 300) # insertion point + fontsize = 11 # fontsize + text = "Hello" # text + angles = (0, 30, 60, 90, 120) # some angles + + # write text with different angles + for angle in angles: + page.insert_text( + point, text, fontsize=fontsize, morph=(point, pymupdf.Matrix(angle)) + ) + + # ensure correct fontsize for get_texttrace() - forgiving rounding problems + for span in page.get_texttrace(): + print(span["dir"]) + assert round(span["size"]) == fontsize + + # ensure correct fontsize for get_text() - forgiving rounding problems + for block in page.get_text("dict")["blocks"]: + for line in block["lines"]: + print(line["dir"]) + for span in line["spans"]: + print(span["size"]) + assert round(span["size"]) == fontsize + + +def test_2108(): + doc = pymupdf.open(f'{scriptdir}/resources/test_2108.pdf') + page = doc[0] + areas = page.search_for("{sig}") + rect = areas[0] + page.add_redact_annot(rect) + page.apply_redactions() + text = page.get_text() + + text_expected = b'Frau\nClaire Dunphy\nTeststra\xc3\x9fe 5\n12345 Stadt\nVertragsnummer: 12345\nSehr geehrte Frau Dunphy,\nText\nMit freundlichen Gr\xc3\xbc\xc3\x9fen\nTestfirma\nVertrag:\n 12345\nAnsprechpartner:\nJay Pritchet\nTelefon:\n123456\nE-Mail:\ntest@test.de\nDatum:\n07.12.2022\n'.decode('utf8') + + if 1: + # Verbose info. 
+ print(f'test_2108(): text is:\n{text}') + print(f'') + print(f'test_2108(): repr(text) is:\n{text!r}') + print(f'') + print(f'test_2108(): repr(text.encode("utf8")) is:\n{text.encode("utf8")!r}') + print(f'') + print(f'test_2108(): text_expected is:\n{text_expected}') + print(f'') + print(f'test_2108(): repr(text_expected) is:\n{text_expected!r}') + print(f'') + print(f'test_2108(): repr(text_expected.encode("utf8")) is:\n{text_expected.encode("utf8")!r}') + + ok1 = (text == text_expected) + ok2 = (text.encode("utf8") == text_expected.encode("utf8")) + ok3 = (repr(text.encode("utf8")) == repr(text_expected.encode("utf8"))) + + print(f'') + print(f'ok1={ok1}') + print(f'ok2={ok2}') + print(f'ok3={ok3}') + + print(f'') + + print(f'{pymupdf.mupdf_version_tuple=}') + if pymupdf.mupdf_version_tuple >= (1, 21, 2): + print('Asserting text==text_expected') + assert text == text_expected + else: + print('Asserting text!=text_expected') + assert text != text_expected + + +def test_2238(): + filepath = f'{scriptdir}/resources/test2238.pdf' + doc = pymupdf.open(filepath) + rebased = hasattr(pymupdf, 'mupdf') + if rebased: + wt = pymupdf.TOOLS.mupdf_warnings() + wt_expected = '' + if pymupdf.mupdf_version_tuple >= (1, 26): + wt_expected += 'garbage bytes before version marker\n' + wt_expected += 'syntax error: expected \'obj\' keyword (6 0 ?)\n' + else: + wt_expected += 'format error: cannot recognize version marker\n' + wt_expected += 'trying to repair broken xref\n' + wt_expected += 'repairing PDF document' + assert wt == wt_expected, f'{wt=}' + first_page = doc.load_page(0).get_text('text', clip=pymupdf.INFINITE_RECT()) + last_page = doc.load_page(-1).get_text('text', clip=pymupdf.INFINITE_RECT()) + + print(f'first_page={first_page!r}') + print(f'last_page={last_page!r}') + assert first_page == 'Hello World\n' + assert last_page == 'Hello World\n' + + first_page = doc.load_page(0).get_text('text') + last_page = doc.load_page(-1).get_text('text') + + print(f'first_page={first_page!r}') + print(f'last_page={last_page!r}') + assert first_page == 'Hello World\n' + assert last_page == 'Hello World\n' + + +def test_2093(): + if platform.python_implementation() == 'GraalVM': + print(f'test_2093(): Not running because slow on GraalVM.') + return + + doc = pymupdf.open(f'{scriptdir}/resources/test2093.pdf') + + def average_color(page): + pixmap = page.get_pixmap() + p_average = [0] * pixmap.n + for y in range(pixmap.height): + for x in range(pixmap.width): + p = pixmap.pixel(x, y) + for i in range(pixmap.n): + p_average[i] += p[i] + for i in range(pixmap.n): + p_average[i] /= (pixmap.height * pixmap.width) + return p_average + + page = doc.load_page(0) + pixel_average_before = average_color(page) + + rx=135.123 + ry=123.56878 + rw=69.8409 + rh=9.46397 + + x0 = rx + y0 = ry + x1 = rx + rw + y1 = ry + rh + + rect = pymupdf.Rect(x0, y0, x1, y1) + + font = pymupdf.Font("Helvetica") + fill_color=(0,0,0) + page.add_redact_annot( + quad=rect, + #text="null", + fontname=font.name, + fontsize=12, + align=pymupdf.TEXT_ALIGN_CENTER, + fill=fill_color, + text_color=(1,1,1), + ) + + page.apply_redactions() + pixel_average_after = average_color(page) + + print(f'pixel_average_before={pixel_average_before!r}') + print(f'pixel_average_after={pixel_average_after!r}') + + # Before this bug was fixed (MuPDF-1.22): + # pixel_average_before=[130.864323120088, 115.23577810900859, 92.9268559996174] + # pixel_average_after=[138.68844553555772, 123.05687162237561, 100.74275056194105] + # After fix: + # 
pixel_average_before=[130.864323120088, 115.23577810900859, 92.9268559996174] + # pixel_average_after=[130.8889209934799, 115.25722751837269, 92.94327384463327] + # + for i in range(len(pixel_average_before)): + diff = pixel_average_before[i] - pixel_average_after[i] + assert abs(diff) < 0.1 + + out = f'{scriptdir}/resources/test2093-out.pdf' + doc.save(out) + print(f'Have written to: {out}') + + +def test_2182(): + print(f'test_2182() started') + doc = pymupdf.open(f'{scriptdir}/resources/test2182.pdf') + page = doc[0] + for annot in page.annots(): + print(annot) + print(f'test_2182() finished') + + +def test_2246(): + """ + Test / confirm identical text positions generated by + * page.insert_text() + versus + * TextWriter.write_text() + + ... under varying situations as follows: + + 1. MediaBox does not start at (0, 0) + 2. CropBox origin is different from that of MediaBox + 3. Check for all 4 possible page rotations + + The test writes the same text at the same positions using `page.insert_text()`, + respectively `TextWriter.write_text()`. + Then extracts the text spans and confirms that they all occupy the same bbox. + This ensures coincidence of text positions of page.of insert_text() + (which is assumed correct) and TextWriter.write_text(). + """ + def bbox_count(rot): + """Make a page and insert identical text via different methods. + + Desired page rotation is a parameter. MediaBox and CropBox are chosen + to be "awkward": MediaBox does not start at (0,0) and CropBox is a + true subset of MediaBox. + """ + # bboxes of spans on page: same text positions are represented by ONE bbox + bboxes = set() + doc = pymupdf.open() + # prepare a page with desired MediaBox / CropBox peculiarities + mediabox = pymupdf.paper_rect("letter") + page = doc.new_page(width=mediabox.width, height=mediabox.height) + xref = page.xref + newmbox = list(map(float, doc.xref_get_key(xref, "MediaBox")[1][1:-1].split())) + newmbox = pymupdf.Rect(newmbox) + mbox = newmbox + (10, 20, 10, 20) + cbox = mbox + (10, 10, -10, -10) + doc.xref_set_key(xref, "MediaBox", "[%g %g %g %g]" % tuple(mbox)) + doc.xref_set_key(xref, "CrobBox", "[%g %g %g %g]" % tuple(cbox)) + # set page to desired rotation + page.set_rotation(rot) + page.insert_text((50, 50), "Text inserted at (50,50)") + tw = pymupdf.TextWriter(page.rect) + tw.append((50, 50), "Text inserted at (50,50)") + tw.write_text(page) + blocks = page.get_text("dict")["blocks"] + for b in blocks: + for l in b["lines"]: + for s in l["spans"]: + # store bbox rounded to 3 decimal places + bboxes.add(pymupdf.Rect(pymupdf.JM_TUPLE3(s["bbox"]))) + return len(bboxes) # should be 1! 
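+
+    # If page.insert_text() and TextWriter.write_text() place the text
+    # identically, all extracted span bboxes coincide and the set built above
+    # collapses to a single element, hence the expected count of 1.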
+ + # the following tests must all pass + assert bbox_count(0) == 1 + assert bbox_count(90) == 1 + assert bbox_count(180) == 1 + assert bbox_count(270) == 1 + + +def test_2430(): + """Confirm that multiple font property checks will not destroy Py_None.""" + font = pymupdf.Font("helv") + for i in range(1000): + _ = font.flags + +def test_2692(): + document = pymupdf.Document(f'{scriptdir}/resources/2.pdf') + for page in document: + pix = page.get_pixmap(clip=pymupdf.Rect(0,0,10,10)) + dl = page.get_displaylist(annots=True) + pix = dl.get_pixmap( + matrix=pymupdf.Identity, + colorspace=pymupdf.csRGB, + alpha=False, + clip=pymupdf.Rect(0,0,10,10), + ) + pix = dl.get_pixmap( + matrix=pymupdf.Identity, + #colorspace=pymupdf.csRGB, + alpha=False, + clip=pymupdf.Rect(0,0,10,10), + ) + + +def test_2596(): + """Confirm correctly abandoning cache when reloading a page.""" + if platform.python_implementation() == 'GraalVM': + print(f'test_2596(): not running on Graal.') + return + doc = pymupdf.Document(f"{scriptdir}/resources/test_2596.pdf") + page = doc[0] + pix0 = page.get_pixmap() # render the page + _ = doc.tobytes(garbage=3) # save with garbage collection + + # Note this will invalidate cache content for this page. + # Reloading the page now empties the cache, so rendering + # will deliver the same pixmap + page = doc.reload_page(page) + pix1 = page.get_pixmap() + assert pix1.samples == pix0.samples + rebased = hasattr(pymupdf, 'mupdf') + if pymupdf.mupdf_version_tuple < (1, 26, 6): + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'too many indirections (possible indirection cycle involving 24 0 R)' + + +def test_2730(): + """Ensure identical output across text extractions.""" + doc = pymupdf.open(f"{scriptdir}/resources/test_2730.pdf") + page = doc[0] + s1 = set(page.get_text()) # plain text extraction + s2 = set(page.get_text(sort=True)) # uses "blocks" extraction + s3 = set(page.get_textbox(page.rect)) + assert s1 == s2 + assert s1 == s3 + + +def test_2553(): + """Ensure identical output across text extractions.""" + verbose = 0 + doc = pymupdf.open(f"{scriptdir}/resources/test_2553.pdf") + page = doc[0] + + # extract plain text, build set of all characters + list1 = page.get_text() + set1 = set(list1) + + # extract text blocks, build set of all characters + list2 = page.get_text(sort=True) # internally uses "blocks" + set2 = set(list2) + + # extract textbox content, build set of all characters + list3 = page.get_textbox(page.rect) + set3 = set(list3) + + def show(l): + ret = f'len={len(l)}\n' + for c in l: + cc = ord(c) + if (cc >= 32 and cc < 127) or c == '\n': + ret += c + else: + ret += f' [0x{hex(cc)}]' + return ret + + if verbose: + print(f'list1:\n{show(list1)}') + print(f'list2:\n{show(list2)}') + print(f'list3:\n{show(list3)}') + + # all sets must be equal + assert set1 == set2 + assert set1 == set3 + + # With mupdf later than 1.23.4, this special page contains no invalid + # Unicodes. 
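+    # (chr(0xFFFD) is the Unicode REPLACEMENT CHARACTER, which text extraction
+    # emits when a glyph cannot be mapped back to a valid code point.)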
+ # + print(f'Checking no occurrence of 0xFFFD, {pymupdf.mupdf_version_tuple=}.') + assert chr(0xFFFD) not in set1 + +def test_2553_2(): + doc = pymupdf.open(f"{scriptdir}/resources/test_2553-2.pdf") + page = doc[0] + + # extract plain text, ensure that there are no 0xFFFD characters + text = page.get_text() + assert chr(0xfffd) not in text + +def test_2635(): + """Rendering a page before and after cleaning it should yield the same pixmap.""" + doc = pymupdf.open(f"{scriptdir}/resources/test_2635.pdf") + page = doc[0] + pix1 = page.get_pixmap() # pixmap before cleaning + + page.clean_contents() # clean page + pix2 = page.get_pixmap() # pixmap after cleaning + assert pix1.samples == pix2.samples # assert equality + + +def test_resolve_names(): + """Test PDF name resolution.""" + # guard against wrong PyMuPDF architecture version + if not hasattr(pymupdf.Document, "resolve_names"): + print("PyMuPDF version does not support resolving PDF names") + return + pickle_in = open(f"{scriptdir}/resources/cython.pickle", "rb") + old_names = pickle.load(pickle_in) + doc = pymupdf.open(f"{scriptdir}/resources/cython.pdf") + new_names = doc.resolve_names() + assert new_names == old_names + +def test_2777(): + document = pymupdf.Document() + page = document.new_page() + print(page.mediabox.width) + +def test_2710(): + doc = pymupdf.open(f'{scriptdir}/resources/test_2710.pdf') + page = doc.load_page(0) + + print(f'test_2710(): {page.cropbox=}') + print(f'test_2710(): {page.mediabox=}') + print(f'test_2710(): {page.rect=}') + + def numbers_approx_eq(a, b): + return abs(a-b) < 0.001 + def points_approx_eq(a, b): + return numbers_approx_eq(a.x, b.x) and numbers_approx_eq(a.y, b.y) + def rects_approx_eq(a, b): + return points_approx_eq(a.bottom_left, b.bottom_left) and points_approx_eq(a.top_right, b.top_right) + def assert_rects_approx_eq(a, b): + assert rects_approx_eq(a, b), f'Not nearly identical: {a=} {b=}' + + blocks = page.get_text('blocks') + print(f'test_2710(): {blocks=}') + assert len(blocks) == 2 + block = blocks[1] + rect = pymupdf.Rect(block[:4]) + text = block[4] + print(f'test_2710(): {rect=}') + print(f'test_2710(): {text=}') + assert text == 'Text at left page border\n' + + assert_rects_approx_eq(page.cropbox, pymupdf.Rect(30.0, 30.0, 565.3200073242188, 811.9199829101562)) + assert_rects_approx_eq(page.mediabox, pymupdf.Rect(0.0, 0.0, 595.3200073242188, 841.9199829101562)) + print(f'test_2710(): {pymupdf.mupdf_version_tuple=}') + # 2023-11-05: Currently broken in mupdf master. + print(f'test_2710(): Not Checking page.rect and rect.') + rebased = hasattr(pymupdf, 'mupdf') + if rebased: + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == ( + "syntax error: cannot find ExtGState resource 'GS7'\n" + "syntax error: cannot find ExtGState resource 'GS8'\n" + "encountered syntax errors; page may not be correct" + ) + + +def test_2736(): + """Check handling of CropBox changes vis-a-vis a MediaBox with + negative coordinates.""" + doc = pymupdf.open() + page = doc.new_page() + + # fake a MediaBox for demo purposes + doc.xref_set_key(page.xref, "MediaBox", "[-30 -20 595 842]") + + assert page.cropbox == pymupdf.Rect(-30, 0, 595, 862) + assert page.rect == pymupdf.Rect(0, 0, 625, 862) + + # change the CropBox: shift by (10, 10) in both dimensions. Please note: + # To achieve this, 10 must be subtracted from 862! yo must never be negative! 
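+    # (set_cropbox() takes the rectangle in top-left page coordinates, i.e.
+    # y grows downwards from the top of the MediaBox; the MediaBox here is
+    # 862 points high, so shifting the bottom edge by 10 means passing
+    # 862 - 10 = 852 as y1. The resulting /CropBox entry is checked below.)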
+ page.set_cropbox(pymupdf.Rect(-20, 0, 595, 852)) + + # get CropBox from the page definition + assert doc.xref_get_key(page.xref, "CropBox")[1] == "[-20 -10 595 842]" + assert page.rect == pymupdf.Rect(0, 0, 615, 852) + + error = False + text = "" + try: # check error detection + page.set_cropbox((-35, -10, 595, 842)) + except Exception as e: + text = str(e) + error = True + assert error == True + assert text == "CropBox not in MediaBox" + + +def test_subset_fonts(): + """Confirm subset_fonts is working.""" + if not hasattr(pymupdf, "mupdf"): + print("Not testing 'test_subset_fonts' in classic.") + return + text = "Just some arbitrary text." + arch = pymupdf.Archive() + css = pymupdf.css_for_pymupdf_font("ubuntu", archive=arch) + css += "* {font-family: ubuntu;}" + doc = pymupdf.open() + page = doc.new_page() + page.insert_htmlbox(page.rect, text, css=css, archive=arch) + doc.subset_fonts(verbose=True) + found = False + for xref in range(1, doc.xref_length()): + if "+Ubuntu#20Regular" in doc.xref_object(xref): + found = True + break + assert found is True + + +def test_2957_1(): + """Text following a redaction must not change coordinates.""" + # test file with redactions + doc = pymupdf.open(os.path.join(scriptdir, "resources", "test_2957_1.pdf")) + page = doc[0] + # search for string that must not move by redactions + rects0 = page.search_for("6e9f73dfb4384a2b8af6ebba") + # sort rectangles vertically + rects0 = sorted(rects0, key=lambda r: r.y1) + assert len(rects0) == 2 # must be 2 redactions + page.apply_redactions() + + # reload page to finalize updates + page = doc.reload_page(page) + + # the two string must retain their positions (except rounding errors) + rects1 = page.search_for("6e9f73dfb4384a2b8af6ebba") + rects1 = sorted(rects1, key=lambda r: r.y1) + + assert page.first_annot is None # make sure annotations have disappeared + for i in range(2): + r0 = rects0[i].irect # take rounded rects + r1 = rects1[i].irect + assert r0 == r1 + + +def test_2957_2(): + """Redacted text must not change positions of remaining text.""" + doc = pymupdf.open(os.path.join(scriptdir, "resources", "test_2957_2.pdf")) + page = doc[0] + words0 = page.get_text("words") # all words before redacting + page.apply_redactions() # remove/redact the word "longer" + words1 = page.get_text("words") # extract words again + assert len(words1) == len(words0) - 1 # must be one word less + assert words0[3][4] == "longer" # just confirm test file is correct one + del words0[3] # remove the redacted word from first list + for i in range(len(words1)): # compare words + w1 = words1[i] # word after redaction + bbox1 = pymupdf.Rect(w1[:4]).irect # its IRect coordinates + w0 = words0[i] # word before redaction + bbox0 = pymupdf.Rect(w0[:4]).irect # its IRect coordinates + assert bbox0 == bbox1 # must be same coordinates + + +def test_707560(): + """https://bugs.ghostscript.com/show_bug.cgi?id=707560 + Ensure that redactions also remove characters with an empty width bbox. + """ + # Make text that will contain characters with an empty bbox. + + greetings = ( + "Hello, World!", # english + "Hallo, Welt!", # german + "سلام دنیا!", # persian + "வணக்கம், உலகம்!", # tamil + "สวัสดีชาวโลก!", # thai + "Привіт Світ!", # ucranian + "שלום עולם!", # hebrew + "ওহে বিশ্ব!", # bengali + "你好世界!", # chinese + "こんにちは世界!", # japanese + "안녕하세요, 월드!", # korean + "नमस्कार, विश्व !", # sanskrit + "हैलो वर्ल्ड!", # hindi + ) + text = " ... 
".join([g for g in greetings]) + where = (50, 50, 400, 500) + story = pymupdf.Story(text) + bio = io.BytesIO() + writer = pymupdf.DocumentWriter(bio) + more = True + while more: + dev = writer.begin_page(pymupdf.paper_rect("a4")) + more, _ = story.place(where) + story.draw(dev) + writer.end_page() + writer.close() + doc = pymupdf.open("pdf", bio) + page = doc[0] + text = page.get_text() + assert text, "Unexpected: test page has no text." + page.add_redact_annot(page.rect) + page.apply_redactions() + assert not page.get_text(), "Unexpected: text not fully redacted." + + +def test_3070(): + with pymupdf.open(os.path.abspath(f'{__file__}/../../tests/resources/test_3070.pdf')) as pdf: + links = pdf[0].get_links() + links[0]['uri'] = "https://www.ddg.gg" + pdf[0].update_link(links[0]) + pdf.save(os.path.abspath(f'{__file__}/../../tests/test_3070_out.pdf')) + +def test_bboxlog_2885(): + doc = pymupdf.open(os.path.abspath(f'{__file__}/../../tests/resources/test_2885.pdf')) + page=doc[0] + + bbl = page.get_bboxlog() + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'invalid marked content and clip nesting' + + bbl = page.get_bboxlog(layers=True) + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'invalid marked content and clip nesting' + +def test_3081(): + ''' + Check Document.close() closes file handles, even if a Page instance exists. + ''' + path1 = os.path.abspath(f'{__file__}/../../tests/resources/1.pdf') + path2 = os.path.abspath(f'{__file__}/../../tests/test_3081-2.pdf') + + rebased = hasattr(pymupdf, 'mupdf') + + import shutil + import sys + import traceback + shutil.copy2(path1, path2) + + # Find next two available fds. + next_fd_1 = os.open(path2, os.O_RDONLY) + next_fd_2 = os.open(path2, os.O_RDONLY) + os.close(next_fd_1) + os.close(next_fd_2) + + def next_fd(): + fd = os.open(path2, os.O_RDONLY) + os.close(fd) + return fd + + fd1 = next_fd() + document = pymupdf.open(path2) + page = document[0] + fd2 = next_fd() + document.close() + if rebased: + assert document.this is None + assert page.this is None + try: + document.page_count() + except Exception as e: + print(f'Received expected exception: {e}') + #traceback.print_exc(file=sys.stdout) + assert str(e) == 'document closed' + else: + assert 0, 'Did not receive expected exception.' + fd3 = next_fd() + try: + page.bound() + except Exception as e: + print(f'Received expected exception: {e}') + #traceback.print_exc(file=sys.stdout) + if rebased: + assert str(e) == 'page is None' + else: + assert str(e) == 'orphaned object: parent is None' + else: + assert 0, 'Did not receive expected exception.' + page = None + fd4 = next_fd() + print(f'{next_fd_1=} {next_fd_2=}') + print(f'{fd1=} {fd2=} {fd3=} {fd4=}') + print(f'{document=}') + assert fd1 == next_fd_1 + assert fd2 == next_fd_2 # Checks document only uses one fd. + assert fd3 == next_fd_1 # Checks no leaked fds after document close. + assert fd4 == next_fd_1 # Checks no leaked fds after failed page access. 
+ +def test_xml(): + path = os.path.abspath(f'{__file__}/../../tests/resources/2.pdf') + with pymupdf.open(path) as document: + document.get_xml_metadata() + +def test_3112_set_xml_metadata(): + document = pymupdf.Document() + document.set_xml_metadata('hello world') + +def test_archive_3126(): + if not hasattr(pymupdf, 'mupdf'): + print(f'Not running because known to fail with classic.') + return + p = os.path.abspath(f'{__file__}/../../tests/resources') + p = pathlib.Path(p) + archive = pymupdf.Archive(p) + +def test_3140(): + if not hasattr(pymupdf, 'mupdf'): + print(f'Not running test_3140 on classic, because Page.insert_htmlbox() not available.') + return + css2 = '' + path = os.path.abspath(f'{__file__}/../../tests/resources/2.pdf') + oldfile = os.path.abspath(f'{__file__}/../../tests/test_3140_old.pdf') + newfile = os.path.abspath(f'{__file__}/../../tests/test_3140_new.pdf') + import shutil + shutil.copy2(path, oldfile) + def next_fd(): + fd = os.open(path, os.O_RDONLY) + os.close(fd) + return fd + fd1 = next_fd() + with pymupdf.open(oldfile) as doc: # open document + page = doc[0] + rect = pymupdf.Rect(130, 400, 430, 600) + CELLS = pymupdf.make_table(rect, cols=3, rows=5) + shape = page.new_shape() # create Shape + for i in range(5): + for j in range(3): + qtext = "" + "Ques #" + str(i*3+j+1) + ": " + "" # codespell:ignore + atext = "" + "Ans:" + "" # codespell:ignore + qtext = qtext + '
' + atext + shape.draw_rect(CELLS[i][j]) # draw rectangle + page.insert_htmlbox(CELLS[i][j], qtext, css=css2, scale_low=0) + shape.finish(width=2.5, color=pymupdf.pdfcolor["blue"], ) + shape.commit() # write all stuff to the page + doc.subset_fonts() + doc.ez_save(newfile) + fd2 = next_fd() + assert fd2 == fd1, f'{fd1=} {fd2=}' + os.remove(oldfile) + +def test_cli(): + if not hasattr(pymupdf, 'mupdf'): + print('test_cli(): Not running on classic because of fitz_old.') + return + import subprocess + subprocess.run(f'pymupdf -h', shell=1, check=1) + + +def check_lines(expected_regexes, actual): + ''' + Checks lines in match regexes in . + ''' + print(f'check_lines():', flush=1) + print(f'{expected_regexes=}', flush=1) + print(f'{actual=}', flush=1) + def str_to_list(s): + if isinstance(s, str): + return s.split('\n') if s else list() + return s + expected_regexes = str_to_list(expected_regexes) + actual = str_to_list(actual) + if expected_regexes and expected_regexes[-1]: + expected_regexes.append('') # Always expect a trailing empty line. + # Remove `None` regexes and make all regexes match entire lines. + expected_regexes = [f'^{i}$' for i in expected_regexes if i is not None] + print(f'{expected_regexes=}', flush=1) + for expected_regex_line, actual_line in zip(expected_regexes, actual): + print(f' {expected_regex_line=}', flush=1) + print(f' {actual_line=}', flush=1) + assert re.match(expected_regex_line, actual_line) + assert len(expected_regexes) == len(actual), \ + f'expected/actual lines mismatch: {len(expected_regexes)=} {len(actual)=}.' + +def test_cli_out(): + ''' + Check redirection of messages and log diagnostics with environment + variables PYMUPDF_LOG and PYMUPDF_MESSAGE. + ''' + if not hasattr(pymupdf, 'mupdf'): + print('test_cli(): Not running on classic because of fitz_old.') + return + import platform + import re + import subprocess + log_prefix = None + if os.environ.get('PYMUPDF_USE_EXTRA') == '0': + log_prefix = f'.+Using non-default setting from PYMUPDF_USE_EXTRA: \'0\'' + + def check( + expect_out, + expect_err, + message=None, + log=None, + verbose=0, + ): + ''' + Sets PYMUPDF_MESSAGE to `message` and PYMUPDF_LOG to `log`, runs + `pymupdf internal`, and checks lines stdout and stderr match regexes in + `expect_out` and `expect_err`. Note that we enclose regexes in `^...$`. 
+ ''' + env = dict() + if log: + env['PYMUPDF_LOG'] = log + if message: + env['PYMUPDF_MESSAGE'] = message + env = os.environ | env + print(f'Running with {env=}: pymupdf internal', flush=1) + cp = subprocess.run(f'pymupdf internal', shell=1, check=1, capture_output=1, env=env, text=True) + + if verbose: + #print(f'{cp.stdout=}.', flush=1) + #print(f'{cp.stderr=}.', flush=1) + sys.stdout.write(f'stdout:\n{textwrap.indent(cp.stdout, " ")}') + sys.stdout.write(f'stderr:\n{textwrap.indent(cp.stderr, " ")}') + check_lines(expect_out, cp.stdout) + check_lines(expect_err, cp.stderr) + + # + print(f'Checking default, all output to stdout.') + check( + [ + log_prefix, + 'This is from PyMuPDF message[(][)][.]', + '.+This is from PyMuPDF log[(][)].', + ], + '', + ) + + # + if platform.system() != 'Windows': + print(f'Checking redirection of everything to /dev/null.') + check('', '', 'path:/dev/null', 'path:/dev/null') + + # + print(f'Checking redirection to files.') + path_out = os.path.abspath(f'{__file__}/../../tests/test_cli_out.out') + path_err = os.path.abspath(f'{__file__}/../../tests/test_cli_out.err') + check('', '', f'path:{path_out}', f'path:{path_err}') + def read(path): + with open(path) as f: + return f.read() + out = read(path_out) + err = read(path_err) + check_lines(['This is from PyMuPDF message[(][)][.]'], out) + check_lines([log_prefix, '.+This is from PyMuPDF log[(][)][.]'], err) + + # + print(f'Checking redirection to fds.') + check( + [ + 'This is from PyMuPDF message[(][)][.]', + ], + [ + log_prefix, + '.+This is from PyMuPDF log[(][)].', + ], + 'fd:1', + 'fd:2', + ) + + +def test_use_python_logging(): + ''' + Checks pymupdf.use_python_logging(). + ''' + log_prefix = None + if os.environ.get('PYMUPDF_USE_EXTRA') == '0': + log_prefix = f'.+Using non-default setting from PYMUPDF_USE_EXTRA: \'0\'' + + if os.path.basename(__file__).startswith(f'test_fitz_'): + # Do nothing, because command `pymupdf` outputs diagnostics containing + # `pymupdf` which are not renamed to `fitz`, which breaks our checking. 
+ print(f'Not testing with fitz alias.') + return + + def check( + code, + regexes_stdout, + regexes_stderr, + env = None, + ): + code = textwrap.dedent(code) + path = os.path.abspath(f'{__file__}/../../tests/resources_test_logging.py') + with open(path, 'w') as f: + f.write(code) + command = f'{sys.executable} {path}' + if env: + print(f'{env=}.') + env = os.environ | env + print(f'Running: {command}', flush=1) + try: + cp = subprocess.run(command, shell=1, check=1, capture_output=1, text=True, env=env) + except Exception as e: + print(f'Command failed: {command}.', flush=1) + print(f'Stdout\n{textwrap.indent(e.stdout, " ")}', flush=1) + print(f'Stderr\n{textwrap.indent(e.stderr, " ")}', flush=1) + raise + check_lines(regexes_stdout, cp.stdout) + check_lines(regexes_stderr, cp.stderr) + + print(f'## Basic use of `logging` sends output to stderr instead of default stdout.') + check( + ''' + import pymupdf + pymupdf.message('this is pymupdf.message()') + pymupdf.log('this is pymupdf.log()') + pymupdf.set_messages(pylogging=1) + pymupdf.set_log(pylogging=1) + pymupdf.message('this is pymupdf.message() 2') + pymupdf.log('this is pymupdf.log() 2') + ''', + [ + log_prefix, + 'this is pymupdf.message[(][)]', + '.+this is pymupdf.log[(][)]', + ], + [ + 'this is pymupdf.message[(][)] 2', + '.+this is pymupdf.log[(][)] 2', + ], + ) + + print(f'## Calling logging.basicConfig() makes logging output contain : prefixes.') + check( + ''' + import pymupdf + + import logging + logging.basicConfig() + pymupdf.set_messages(pylogging=1) + pymupdf.set_log(pylogging=1) + + pymupdf.message('this is pymupdf.message()') + pymupdf.log('this is pymupdf.log()') + ''', + [ + log_prefix, + ], + [ + 'WARNING:pymupdf:this is pymupdf.message[(][)]', + 'WARNING:pymupdf:.+this is pymupdf.log[(][)]', + ], + ) + + print(f'## Setting PYMUPDF_USE_PYTHON_LOGGING=1 makes PyMuPDF use logging on startup.') + check( + ''' + import pymupdf + pymupdf.message('this is pymupdf.message()') + pymupdf.log('this is pymupdf.log()') + ''', + '', + [ + log_prefix, + 'this is pymupdf.message[(][)]', + '.+this is pymupdf.log[(][)]', + ], + env = dict( + PYMUPDF_MESSAGE='logging:', + PYMUPDF_LOG='logging:', + ), + ) + + print(f'## Pass explicit logger to pymupdf.use_python_logging() with logging.basicConfig().') + check( + ''' + import pymupdf + + import logging + logging.basicConfig() + + logger = logging.getLogger('foo') + pymupdf.set_messages(pylogging_logger=logger, pylogging_level=logging.WARNING) + pymupdf.set_log(pylogging_logger=logger, pylogging_level=logging.ERROR) + + pymupdf.message('this is pymupdf.message()') + pymupdf.log('this is pymupdf.log()') + ''', + [ + log_prefix, + ], + [ + 'WARNING:foo:this is pymupdf.message[(][)]', + 'ERROR:foo:.+this is pymupdf.log[(][)]', + ], + ) + + print(f'## Check pymupdf.set_messages() pylogging_level args.') + check( + ''' + import pymupdf + + import logging + logging.basicConfig(level=logging.DEBUG) + logger = logging.getLogger('pymupdf') + + pymupdf.set_messages(pylogging_level=logging.CRITICAL) + pymupdf.set_log(pylogging_level=logging.INFO) + + pymupdf.message('this is pymupdf.message()') + pymupdf.log('this is pymupdf.log()') + ''', + [ + log_prefix, + ], + [ + 'CRITICAL:pymupdf:this is pymupdf.message[(][)]', + 'INFO:pymupdf:.+this is pymupdf.log[(][)]', + ], + ) + + print(f'## Check messages() with sys.stdout=None.') + check( + ''' + import sys + sys.stdout = None + import pymupdf + + pymupdf.message('this is pymupdf.message()') + pymupdf.log('this is pymupdf.log()') + ''', + [], + [], + 
) + + +def relpath(path, start=None): + ''' + A 'safe' alternative to os.path.relpath(). Avoids an exception on Windows + if the drive needs to change - in this case we use os.path.abspath(). + ''' + try: + return os.path.relpath(path, start) + except ValueError: + # os.path.relpath() fails if trying to change drives. + assert platform.system() == 'Windows' + return os.path.abspath(path) + + +def test_open(): + + if not hasattr(pymupdf, 'mupdf'): + print('test_open(): not running on classic.') + return + + import re + import textwrap + import traceback + + resources = relpath(os.path.abspath(f'{__file__}/../../tests/resources')) + + # We convert all strings to use `/` instead of os.sep, which avoids + # problems with regex's on windows. + resources = resources.replace(os.sep, '/') + + def check(filename=None, stream=None, filetype=None, exception=None): + ''' + Checks we receive expected exception if specified. + ''' + if isinstance(filename, str): + filename = filename.replace(os.sep, '/') + if exception: + etype, eregex = exception + if isinstance(eregex, (tuple, list)): + # Treat as sequence of regexes to look for. + eregex = '.*'.join(eregex) + try: + pymupdf.open(filename=filename, stream=stream, filetype=filetype) + except etype as e: + text = traceback.format_exc(limit=0) + text = text.replace(os.sep, '/') + text = textwrap.indent(text, ' ', lambda line: 1) + assert re.search(eregex, text, re.DOTALL), \ + f'Incorrect exception text, expected {eregex=}, received:\n{text}' + print(f'Received expected exception for {filename=} {stream=} {filetype=}:\n{text}') + except Exception as e: + assert 0, \ + f'Incorrect exception, expected {etype}, received {type(e)=}.' + else: + assert 0, f'Did not received exception, expected {etype=}. {filename=} {stream=} {filetype=} {exception=}' + else: + document = pymupdf.open(filename=filename, stream=stream, filetype=filetype) + return document + + check(f'{resources}/1.pdf') + + check(f'{resources}/Bezier.epub') + + path = 1234 + etype = TypeError + eregex = re.escape(f'bad filename: type(filename)= filename={path}.') + check(path, exception=(etype, eregex)) + + path = 'test_open-this-file-will-not-exist' + etype = pymupdf.FileNotFoundError + eregex = f'no such file: \'{path}\'' + check(path, exception=(etype, eregex)) + + path = resources + etype = pymupdf.FileDataError + eregex = re.escape(f'\'{path}\' is no file') + check(path, exception=(etype, eregex)) + + path = relpath(os.path.abspath(f'{resources}/../test_open_empty')) + path = path.replace(os.sep, '/') + with open(path, 'w') as f: + pass + etype = pymupdf.EmptyFileError + eregex = re.escape(f'Cannot open empty file: filename={path!r}.') + check(path, exception=(etype, eregex)) + + path = f'{resources}/1.pdf' + filetype = 'xps' + etype = pymupdf.FileDataError + # 2023-12-12: On OpenBSD, for some reason the SWIG catch code only catches + # the exception as FzErrorBase. + etype2 = 'FzErrorBase' if platform.system() == 'OpenBSD' else 'FzErrorFormat' + eregex = ( + # With a sysinstall with separate MuPDF install, we get + # `mupdf.FzErrorFormat` instead of `pymupdf.mupdf.FzErrorFormat`. So + # we just search for the former. 
+ re.escape(f'mupdf.{etype2}: code=7: cannot recognize zip archive'), + re.escape(f'pymupdf.FileDataError: Failed to open file {path!r} as type {filetype!r}.'), + ) + check(path, filetype=filetype, exception=None) + + path = f'{resources}/chinese-tables.pickle' + etype = pymupdf.FileDataError + etype2 = 'FzErrorBase' if platform.system() == 'OpenBSD' else 'FzErrorUnsupported' + etext = ( + re.escape(f'mupdf.{etype2}: code=6: cannot find document handler for file: {path}'), + re.escape(f'pymupdf.FileDataError: Failed to open file {path!r}.'), + ) + check(path, exception=(etype, etext)) + + stream = 123 + etype = TypeError + etext = re.escape('bad stream: type(stream)=.') + check(stream=stream, exception=(etype, etext)) + + check(stream=b'', exception=(pymupdf.EmptyFileError, re.escape('Cannot open empty stream.'))) + + +def test_open2(): + ''' + Checks behaviour of fz_open_document() and fz_open_document_with_stream() + with different filenames/magic values. + ''' + if platform.system() == 'Windows': + print(f'test_open2(): not running on Windows because `git ls-files` known fail on Github Windows runners.') + return + + root = os.path.normpath(f'{__file__}/../..') + root = relpath(root) + + # Find tests/resources/test_open2.* input files/streams. We calculate + # paths relative to the PyMuPDF checkout directory , to allow use + # of tests/resources/test_open2_expected.json regardless of the actual + # checkout directory. + print() + sys.path.append(root) + try: + import pipcl + finally: + del sys.path[0] + paths = pipcl.git_items(f'{root}/tests/resources') + paths = fnmatch.filter(paths, f'test_open2.*') + paths = [f'tests/resources/{i}' for i in paths] + + # Get list of extensions of input files. + extensions = set() + extensions.add('.txt') + extensions.add('') + for path in paths: + _, ext = os.path.splitext(path) + extensions.add(ext) + extensions = sorted(list(extensions)) + + def get_result(e, document): + ''' + Return fz_lookup_metadata(document, 'format') or [ERROR]. + ''' + if e: + return f'[error]' + else: + try: + return pymupdf.mupdf.fz_lookup_metadata2(document, 'format') + except Exception: + return '' + + def dict_set_path(dict_, *items): + for item in items[:-2]: + dict_ = dict_.setdefault(item, dict()) + dict_[items[-2]] = items[-1] + + results = dict() + + # Prevent warnings while we are running. + _g_out_message = pymupdf._g_out_message + pymupdf._g_out_message = None + try: + results = dict() + + for path in paths: + print(path) + for ext in extensions: + path2 = f'{root}/foo{ext}' + path3 = shutil.copy2(f'{root}/{path}', path2) + assert(path3 == path2) + + # Test fz_open_document(). + e = None + document = None + try: + document = pymupdf.mupdf.fz_open_document(path2) + except Exception as ee: + e = ee + wt = pymupdf.TOOLS.mupdf_warnings() + text = get_result(e, document) + print(f' fz_open_document({path2}) => {text}') + dict_set_path(results, path, ext, 'file', text) + + # Test fz_open_document_with_stream(). + e = None + document = None + with open(f'{root}/{path}', 'rb') as f: + data = f.read() + stream = pymupdf.mupdf.fz_open_memory(pymupdf.mupdf.python_buffer_data(data), len(data)) + try: + document = pymupdf.mupdf.fz_open_document_with_stream(ext, stream) + except Exception as ee: + e = ee + wt = pymupdf.TOOLS.mupdf_warnings() + text = get_result(e, document) + print(f' fz_open_document_with_stream(magic={ext!r}) => {text}') + dict_set_path(results, path, ext, 'stream', text) + + finally: + pymupdf._g_out_message = _g_out_message + + # Create html table. 
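+    # (One row per tests/resources/test_open2.* input file, one column per
+    # magic/extension tried; each cell shows the format string recorded above
+    # for the fz_open_document() / fz_open_document_with_stream() calls.)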
+ path_html = os.path.normpath(f'{__file__}/../../tests/test_open2.html') + with open(path_html, 'w') as f: + f.write(f'\n') + f.write(f'\n') + f.write(f'

We shall meet again at a place where there is no darkness.

+ """ + + MEDIABOX = pymupdf.paper_rect("letter") + WHERE = MEDIABOX + (36, 36, -36, -36) + # the font files are located in /home/chinese + arch = pymupdf.Archive(".") + # if not specified user_css, the output pdf has content + story = pymupdf.Story(HTML, user_css=CSS, archive=arch) + + writer = pymupdf.DocumentWriter("output.pdf") + + more = 1 + + while more: + device = writer.begin_page(MEDIABOX) + more, _ = story.place(WHERE) + story.draw(device) + writer.end_page() + + writer.close() + + +def test_2753(): + + def rectfn(rect_num, filled): + return pymupdf.Rect(0, 0, 200, 200), pymupdf.Rect(50, 50, 100, 150), None + + def make_pdf(html, path_out): + story = pymupdf.Story(html=html) + document = story.write_with_links(rectfn) + print(f'test_2753(): Writing to: {path_out=}.') + document.save(path_out) + return document + + doc_before = make_pdf( + textwrap.dedent(''' +

Before

+

+

After

+ '''), + os.path.abspath(f'{__file__}/../../tests/test_2753-out-before.pdf'), + ) + + doc_after = make_pdf( + textwrap.dedent(''' +

Before

+

+

After

+ '''), + os.path.abspath(f'{__file__}/../../tests/test_2753-out-after.pdf'), + ) + + path = os.path.normpath(f'{__file__}/../../tests/test_2753_out') + doc_before.save(f'{path}_before.pdf') + doc_after.save(f'{path}_after.pdf') + assert len(doc_before) == 2 + assert len(doc_after) == 2 + +# codespell:ignore-begin +springer_html = ''' +
+ +

SPRINGERS EINWÜRFE: INTIME VERBINDUNGEN

+ +

Wieso kann unsereins so vieles, was eine Maus nicht kann? Unser Gehirn ist nicht bloß größer, sondern vor allem überraschend vertrackt verdrahtet.

+ +

Der Heilige Gral der Neu­ro­wis­sen­schaft ist die komplette Kartierung des menschlichen Gehirns – die ge­treue Ab­bildung des Ge­strüpps der Nervenzellen mit den baum­för­mi­gen Ver­ästel­ungen der aus ihnen sprie­ßen­den Den­dri­ten und den viel län­ge­ren Axo­nen, wel­che oft der Sig­nal­über­tragung von einem Sin­nes­or­gan oder zu einer Mus­kel­fa­ser die­nen. Zum Gesamtbild gehören die winzigen Knötchen auf den Dendriten; dort sitzen die Synapsen. Das sind Kontakt- und Schalt­stel­len, leb­haf­te Ver­bin­dungen zu anderen Neuronen.

+ +

Dieses Dickicht bis zur Ebene einzelner Zel­len zu durchforsten und es räumlich dar­zu­stel­len, ist eine gigantische Aufgabe, die bis vor Kurzem utopisch anmuten musste. Neu­er­dings vermag der junge For­schungs­zweig der Konnektomik (von Englisch: con­nect für ver­bin­den) das Zusammenspiel der Neurone immer besser zu verstehen. Das gelingt mit dem Einsatz dreidimensionaler Elek­tro­nen­mik­ros­ko­pie. Aus Dünn­schicht­auf­nah­men von zerebralen Ge­we­be­pro­ben lassen sich plastische Bil­der ganzer Zellverbände zu­sam­men­setzen.

+ +

Da frisches menschliches Hirn­ge­we­be nicht ohne Wei­te­res zu­gäng­lich ist – in der Regel nur nach chirurgischen Eingriffen an Epi­lep­sie­pa­tien­ten –, hält die Maus als Mo­dell­or­ga­nis­mus her. Die evolutionäre Ver­wandt­schaft von Mensch und Nager macht die Wahl plau­sibel. Vor allem das Team um Moritz Helmstaedter am Max-Planck-Institut (MPI) für Hirnforschung in Frankfurt hat in den ver­gangenen Jahren Expertise bei der kon­nek­tomischen Analyse entwickelt.

+ +

Aber steckt in unserem Kopf bloß ein auf die tausendfache Neu­ro­nen­an­zahl auf­ge­bläh­tes Mäu­se­hirn? Oder ist menschliches Ner­ven­ge­we­be viel­leicht doch anders gestrickt? Zur Beantwortung dieser Frage unternahm die MPI-Gruppe einen detaillierten Vergleich von Maus, Makake und Mensch (Science 377, abo0924, 2022).

+ +

Menschliches Gewebe stammte diesmal nicht von Epileptikern, son­dern von zwei wegen Hirntumoren operierten Patienten. Die For­scher wollten damit vermeiden, dass die oft jahrelange Behandlung mit An­ti­epi­lep­ti­ka das Bild der synaptischen Verknüpfungen trübte. Sie verglichen die Proben mit denen eines Makaken und von fünf Mäusen.

+ +

Einerseits ergaben sich – einmal ab­ge­se­hen von den ganz of­fen­sicht­li­chen quan­titativen Unterschieden wie Hirngröße und Neu­ro­nen­anzahl – recht gute Über­ein­stim­mun­gen, die somit den Gebrauch von Tier­modellen recht­fer­ti­gen. Doch in einem Punkt erlebte das MPI-Team eine echte Über­raschung.

+ +

Gewisse Nervenzellen, die so genannten In­ter­neurone, zeichnen sich dadurch aus, dass sie aus­schließ­lich mit anderen Ner­ven­zel­len in­ter­agieren. Solche »Zwi­schen­neu­rone« mit meist kurzen Axonen sind nicht primär für das Verarbeiten externer Reize oder das Aus­lösen körperlicher Reaktionen zuständig; sie be­schäf­ti­gen sich bloß mit der Ver­stär­kung oder Dämpfung interner Signale.

+ +

Just dieser Neuronentyp ist nun bei Makaken und Menschen nicht nur mehr als doppelt so häufig wie bei Mäusen, sondern obendrein be­son­ders intensiv untereinander ver­flochten. Die meisten Interneurone kop­peln sich fast ausschließlich an ihresgleichen. Dadurch wirkt sich ihr konnektomisches Ge­wicht ver­gleichs­weise zehnmal so stark aus.

+ +

Vermutlich ist eine derart mit sich selbst be­schäf­tigte Sig­nal­ver­ar­beitung die Vor­be­ding­ung für ge­stei­gerte Hirn­leis­tungen. Um einen Ver­gleich mit verhältnismäßig pri­mi­ti­ver Tech­nik zu wagen: Bei küns­tli­chen neu­ro­na­len Netzen – Algorithmen nach dem Vor­bild verknüpfter Nervenzellen – ge­nü­gen schon ein, zwei so genannte ver­bor­ge­ne Schich­ten von selbst­be­züg­li­chen Schaltstellen zwischen Input und Output-Ebene, um die ver­blüf­fen­den Erfolge der künstlichen Intel­ligenz her­vor­zu­bringen.

+
+''' +#codespell:ignore-end + +def test_fit_springer(): + + if not hasattr(pymupdf, 'mupdf'): + print(f'test_fit_springer(): not running on classic.') + return + + verbose = 0 + story = pymupdf.Story(springer_html) + + def check(call, expected): + ''' + Checks that eval(call) returned parameter=expected. Also creates PDF + using path that contains `call` in its leafname, + ''' + fit_result = eval(call) + + print(f'test_fit_springer(): {call=} => {fit_result=}.') + if expected is None: + assert not fit_result.big_enough + else: + document = story.write_with_links(lambda rectnum, filled: (fit_result.rect, fit_result.rect, None)) + path = os.path.abspath(f'{__file__}/../../tests/test_fit_springer_{call}_{fit_result.parameter=}_{fit_result.rect=}.pdf') + document.save(path) + print(f'Have saved document to {path}.') + assert abs(fit_result.parameter-expected) < 0.001, f'{expected=} {fit_result.parameter=}' + + check(f'story.fit_scale(pymupdf.Rect(0, 0, 200, 200), scale_min=1, verbose={verbose})', 3.685728073120117) + check(f'story.fit_scale(pymupdf.Rect(0, 0, 595, 842), scale_min=1, verbose={verbose})', 1.0174560546875) + check(f'story.fit_scale(pymupdf.Rect(0, 0, 300, 421), scale_min=1, verbose={verbose})', 2.02752685546875) + check(f'story.fit_scale(pymupdf.Rect(0, 0, 600, 900), scale_min=1, scale_max=1, verbose={verbose})', 1) + + check(f'story.fit_height(20, verbose={verbose})', 10782.3291015625) + check(f'story.fit_height(200, verbose={verbose})', 2437.4990234375) + check(f'story.fit_height(2000, verbose={verbose})', 450.2998046875) + check(f'story.fit_height(5000, verbose={verbose})', 378.2998046875) + check(f'story.fit_height(5500, verbose={verbose})', 378.2998046875) + + check(f'story.fit_width(3000, verbose={verbose})', 167.30859375) + check(f'story.fit_width(2000, verbose={verbose})', 239.595703125) + check(f'story.fit_width(1000, verbose={verbose})', 510.85546875) + check(f'story.fit_width(500, verbose={verbose})', 1622.1272945404053) + check(f'story.fit_width(400, verbose={verbose})', 2837.507724761963) + check(f'story.fit_width(300, width_max=200000, verbose={verbose})', None) + check(f'story.fit_width(200, width_max=200000, verbose={verbose})', None) + + # Run without verbose to check no calls to log() - checked by assert. + check('story.fit_scale(pymupdf.Rect(0, 0, 600, 900), scale_min=1, scale_max=1, verbose=0)', 1) + check('story.fit_scale(pymupdf.Rect(0, 0, 300, 421), scale_min=1, verbose=0)', 2.02752685546875) + + +def test_write_stabilized_with_links(): + + def rectfn(rect_num, filled): + ''' + We return one rect per page. + ''' + rect = pymupdf.Rect(10, 20, 290, 380) + mediabox = pymupdf.Rect(0, 0, 300, 400) + #print(f'rectfn(): rect_num={rect_num} filled={filled}') + return mediabox, rect, None + + def contentfn(positions): + ret = '' + ret += textwrap.dedent(''' + + +

Contents

+
    + ''') + for position in positions: + if position.heading and (position.open_close & 1): + text = position.text if position.text else '' + if position.id: + ret += f'
<li><a href="#{position.id}">{text}</a>'
+                else:
+                    ret += f'<li>{text}'
+                ret += f' page={position.page_num}\n'
+        ret += '</ul>\n'
+        ret += textwrap.dedent(f'''

First section

+

Contents of first section. +

+ +

Second section

+

Contents of second section. +

Second section first subsection

+ +

Contents of second section first subsection. +

IDTEST + +

Third section

+

Contents of third section. +

NAMETEST. + + + ''') + return ret.strip() + + document = pymupdf.Story.write_stabilized_with_links(contentfn, rectfn) + + # Check links. + links = list() + for page in document: + links += page.get_links() + print(f'{len(links)=}.') + external_links = dict() + for i, link in enumerate(links): + print(f' {i}: {link=}') + if link.get('kind') == pymupdf.LINK_URI: + uri = link['uri'] + external_links.setdefault(uri, 0) + external_links[uri] += 1 + + # Check there is one external link. + print(f'{external_links=}') + if hasattr(pymupdf, 'mupdf'): + assert len(external_links) == 1 + assert 'https://artifex.com/' in external_links + + out_path = __file__.replace('.py', '.pdf') + document.save(out_path) + +def test_archive_creation(): + s = pymupdf.Story(archive=pymupdf.Archive('.')) + s = pymupdf.Story(archive='.') + + +def test_3813(): + import pymupdf + + HTML = """ +

+<h2>Count is fine:</h2>
+<ol>
+  <li>Lorem
+    <ol>
+      <li>Sub Lorem</li>
+      <li>Sub Lorem</li>
+    </ol>
+  </li>
+  <li>Lorem</li>
+  <li>Lorem</li>
+</ol>
+
+<h2>Broken count:</h2>
+<ol>
+  <li>Lorem
+    <ul>
+      <li>Sub Lorem</li>
+      <li>Sub Lorem</li>
+    </ul>
+  </li>
+  <li>Lorem</li>
+  <li>Lorem</li>
+</ol>
+ """ + MEDIABOX = pymupdf.paper_rect("A4") + WHERE = MEDIABOX + (36, 36, -36, -36) + + story = pymupdf.Story(html=HTML) + path = os.path.normpath(f'{__file__}/../../tests/test_3813_out.pdf') + writer = pymupdf.DocumentWriter(path) + + more = 1 + + while more: + device = writer.begin_page(MEDIABOX) + more, _ = story.place(WHERE) + story.draw(device) + writer.end_page() + + writer.close() + + with pymupdf.open(path) as document: + page = document[0] + text = page.get_text() + text_utf8 = text.encode() + + text_expected_utf8 = b'Count is \xef\xac\x81ne:\n1. Lorem\n1. Sub Lorem\n2. Sub Lorem\n2. Lorem\n3. Lorem\nBroken count:\n1. Lorem\n\xe2\x80\xa2 Sub Lorem\n\xe2\x80\xa2 Sub Lorem\n2. Lorem\n3. Lorem\n' + text_expected = text_expected_utf8.decode() + + print(f'text_utf8:\n {text_utf8!r}') + print(f'text_expected_utf8:\n {text_expected_utf8!r}') + print(f'text:\n {textwrap.indent(text, " ")}') + print(f'text_expected:\n {textwrap.indent(text_expected, " ")}') + + assert text == text_expected diff -r 000000000000 -r 1d09e1dec1d9 tests/test_tables.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_tables.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,460 @@ +import os +import io +from pprint import pprint +import textwrap +import pickle +import platform + +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "chinese-tables.pdf") +pickle_file = os.path.join(scriptdir, "resources", "chinese-tables.pickle") + + +def test_table1(): + """Compare pickled tables with those of the current run.""" + pickle_in = open(pickle_file, "rb") + doc = pymupdf.open(filename) + page = doc[0] + tabs = page.find_tables() + cells = tabs[0].cells + tabs[1].cells # all table cell tuples on page + extracts = [tabs[0].extract(), tabs[1].extract()] # all table cell content + old_data = pickle.load(pickle_in) # previously saved data + + # Compare cell contents + assert old_data["extracts"] == extracts # same cell contents + + # Compare cell coordinates. + # Cell rectangles may get somewhat larger due to more cautious border + # computations, but any differences must be small. + old_cells = old_data["cells"][0] + old_data["cells"][1] + assert len(cells) == len(old_cells) + for i in range(len(cells)): + c1 = pymupdf.Rect(cells[i]) # new cell coordinates + c0 = pymupdf.Rect(old_cells[i]) # old cell coordinates + assert c0 in c1 # always: old contained in new + assert abs(c1 - c0) < 0.2 # difference must be small + + +def test_table2(): + """Confirm header properties.""" + doc = pymupdf.open(filename) + page = doc[0] + tab1, tab2 = page.find_tables().tables + # both tables contain their header data + assert tab1.header.external == False + assert tab1.header.cells == tab1.rows[0].cells + assert tab2.header.external == False + assert tab2.header.cells == tab2.rows[0].cells + + +def test_2812(): + """Ensure table detection and extraction independent from page rotation. + + Make 4 pages with rotations 0, 90, 180 and 270 degrees respectively. + Each page shows the same 8x5 table. + We will check that each table is detected and delivers the same content. + """ + doc = pymupdf.open() + # Page 0: rotation 0 + page = doc.new_page(width=842, height=595) + rect = page.rect + (72, 72, -72, -72) + cols = 5 + rows = 8 + # define the cells, draw the grid and insert unique text in each cell. 
+ cells = pymupdf.make_table(rect, rows=rows, cols=cols) + for i in range(rows): + for j in range(cols): + page.draw_rect(cells[i][j]) + for i in range(rows): + for j in range(cols): + page.insert_textbox( + cells[i][j], + f"cell[{i}][{j}]", + align=pymupdf.TEXT_ALIGN_CENTER, + ) + page.clean_contents() + + # Page 1: rotation 90 degrees + page = doc.new_page() + rect = page.rect + (72, 72, -72, -72) + cols = 8 + rows = 5 + cells = pymupdf.make_table(rect, rows=rows, cols=cols) + for i in range(rows): + for j in range(cols): + page.draw_rect(cells[i][j]) + for i in range(rows): + for j in range(cols): + page.insert_textbox( + cells[i][j], + f"cell[{j}][{rows-i-1}]", + rotate=90, + align=pymupdf.TEXT_ALIGN_CENTER, + ) + page.set_rotation(90) + page.clean_contents() + + # Page 2: rotation 180 degrees + page = doc.new_page(width=842, height=595) + rect = page.rect + (72, 72, -72, -72) + cols = 5 + rows = 8 + cells = pymupdf.make_table(rect, rows=rows, cols=cols) + for i in range(rows): + for j in range(cols): + page.draw_rect(cells[i][j]) + for i in range(rows): + for j in range(cols): + page.insert_textbox( + cells[i][j], + f"cell[{rows-i-1}][{cols-j-1}]", + rotate=180, + align=pymupdf.TEXT_ALIGN_CENTER, + ) + page.set_rotation(180) + page.clean_contents() + + # Page 3: rotation 270 degrees + page = doc.new_page() + rect = page.rect + (72, 72, -72, -72) + cols = 8 + rows = 5 + cells = pymupdf.make_table(rect, rows=rows, cols=cols) + for i in range(rows): + for j in range(cols): + page.draw_rect(cells[i][j]) + for i in range(rows): + for j in range(cols): + page.insert_textbox( + cells[i][j], + f"cell[{cols-j-1}][{i}]", + rotate=270, + align=pymupdf.TEXT_ALIGN_CENTER, + ) + page.set_rotation(270) + page.clean_contents() + + pdfdata = doc.tobytes() + # doc.ez_save("test-2812.pdf") + doc.close() + + # ------------------------------------------------------------------------- + # Test PDF prepared. Extract table on each page and + # ensure identical extracted table data. + # ------------------------------------------------------------------------- + doc = pymupdf.open("pdf", pdfdata) + extracts = [] + for page in doc: + tabs = page.find_tables() + assert len(tabs.tables) == 1 + tab = tabs[0] + fp = io.StringIO() + pprint(tab.extract(), stream=fp) + extracts.append(fp.getvalue()) + fp = None + assert tab.row_count == 8 + assert tab.col_count == 5 + e0 = extracts[0] + for e in extracts[1:]: + assert e == e0 + + +def test_2979(): + """This tests fix #2979 and #3001. + + 2979: identical cell count for each row + 3001: no change of global glyph heights + """ + filename = os.path.join(scriptdir, "resources", "test_2979.pdf") + doc = pymupdf.open(filename) + page = doc[0] + tab = page.find_tables()[0] # extract the table + lengths = set() # stores all row cell counts + for e in tab.extract(): + lengths.add(len(e)) # store number of cells for row + + # test 2979 + assert len(lengths) == 1 + + # test 3001 + assert ( + pymupdf.TOOLS.set_small_glyph_heights() is False + ), f"{pymupdf.TOOLS.set_small_glyph_heights()=}" + + wt = pymupdf.TOOLS.mupdf_warnings() + if pymupdf.mupdf_version_tuple >= (1, 26, 0): + assert ( + wt + == "bogus font ascent/descent values (3117 / -2463)\n... repeated 2 times..." + ) + else: + assert not wt + + +def test_3062(): + """Tests the fix for #3062. 
+ After table extraction, a rotated page should behave and look + like as before.""" + if platform.python_implementation() == 'GraalVM': + print(f'test_3062(): Not running because slow on GraalVM.') + return + + filename = os.path.join(scriptdir, "resources", "test_3062.pdf") + doc = pymupdf.open(filename) + page = doc[0] + tab0 = page.find_tables()[0] + cells0 = tab0.cells + + page = None + page = doc[0] + tab1 = page.find_tables()[0] + cells1 = tab1.cells + assert cells1 == cells0 + + +def test_strict_lines(): + """Confirm that ignoring borderless rectangles improves table detection.""" + filename = os.path.join(scriptdir, "resources", "strict-yes-no.pdf") + doc = pymupdf.open(filename) + page = doc[0] + + tab1 = page.find_tables()[0] + tab2 = page.find_tables(strategy="lines_strict")[0] + assert tab2.row_count < tab1.row_count + assert tab2.col_count < tab1.col_count + + +def test_add_lines(): + """Test new parameter add_lines for table recognition.""" + if platform.python_implementation() == 'GraalVM': + print(f'test_add_lines(): Not running because breaks later tests on GraalVM.') + return + + filename = os.path.join(scriptdir, "resources", "small-table.pdf") + doc = pymupdf.open(filename) + page = doc[0] + assert page.find_tables().tables == [] + + more_lines = [ + ((238.9949951171875, 200.0), (238.9949951171875, 300.0)), + ((334.5559997558594, 200.0), (334.5559997558594, 300.0)), + ((433.1809997558594, 200.0), (433.1809997558594, 300.0)), + ] + + # these 3 additional vertical lines should additional 3 columns + tab2 = page.find_tables(add_lines=more_lines)[0] + assert tab2.col_count == 4 + assert tab2.row_count == 5 + + +def test_3148(): + """Ensure correct extraction text of rotated text.""" + doc = pymupdf.open() + page = doc.new_page() + rect = pymupdf.Rect(100, 100, 300, 300) + text = ( + "rotation 0 degrees", + "rotation 90 degrees", + "rotation 180 degrees", + "rotation 270 degrees", + ) + degrees = (0, 90, 180, 270) + delta = (2, 2, -2, -2) + cells = pymupdf.make_table(rect, cols=3, rows=4) + for i in range(3): + for j in range(4): + page.draw_rect(cells[j][i]) + k = (i + j) % 4 + page.insert_textbox(cells[j][i] + delta, text[k], rotate=degrees[k]) + # doc.save("multi-degree.pdf") + tabs = page.find_tables() + tab = tabs[0] + for extract in tab.extract(): + for item in extract: + item = item.replace("\n", " ") + assert item in text + + +def test_3179(): + """Test correct separation of multiple tables on page.""" + filename = os.path.join(scriptdir, "resources", "test_3179.pdf") + doc = pymupdf.open(filename) + page = doc[0] + tabs = page.find_tables() + assert len(tabs.tables) == 3 + + +def test_battery_file(): + """Tests correctly ignoring non-table suspects. + + Earlier versions erroneously tried to identify table headers + where there existed no table at all. + """ + filename = os.path.join(scriptdir, "resources", "battery-file-22.pdf") + doc = pymupdf.open(filename) + page = doc[0] + tabs = page.find_tables() + assert len(tabs.tables) == 0 + + +def test_markdown(): + """Confirm correct markdown output.""" + filename = os.path.join(scriptdir, "resources", "strict-yes-no.pdf") + doc = pymupdf.open(filename) + page = doc[0] + tab = page.find_tables(strategy="lines_strict")[0] + if pymupdf.mupdf_version_tuple < (1, 26, 3): + md_expected = textwrap.dedent(''' + |Header1|Header2|Header3| + |---|---|---| + |Col11
Col12|~~Col21~~
~~Col22~~|Col31
Col32
Col33| + |Col13|~~Col23~~|Col34
Col35| + |Col14|~~Col24~~|Col36| + |Col15|~~Col25~~
~~Col26~~|| + + ''').lstrip() + else: + md_expected = ( + "|Header1|Header2|Header3|\n" + "|---|---|---|\n" + "|Col11
Col12|Col21
Col22|Col31
Col32
Col33|\n" + "|Col13|Col23|Col34
Col35|\n" + "|Col14|Col24|Col36|\n" + "|Col15|Col25
Col26||\n\n" + ) + + + md = tab.to_markdown() + assert md == md_expected, f'Incorrect md:\n{textwrap.indent(md, " ")}' + + +def test_paths_param(): + """Confirm acceptance of supplied vector graphics list.""" + filename = os.path.join(scriptdir, "resources", "strict-yes-no.pdf") + doc = pymupdf.open(filename) + page = doc[0] + tabs = page.find_tables(paths=[]) # will cause all tables are missed + assert tabs.tables == [] + + +def test_boxes_param(): + """Confirm acceptance of supplied boxes list.""" + filename = os.path.join(scriptdir, "resources", "small-table.pdf") + doc = pymupdf.open(filename) + page = doc[0] + paths = page.get_drawings() + box0 = page.cluster_drawings(drawings=paths)[0] + boxes = [box0] + words = page.get_text("words") + x_vals = [w[0] - 5 for w in words if w[4] in ("min", "max", "avg")] + for x in x_vals: + r = +box0 + r.x1 = x + boxes.append(r) + + y_vals = sorted(set([round(w[3]) for w in words])) + for y in y_vals[:-1]: # skip last one to avoid empty row + r = +box0 + r.y1 = y + boxes.append(r) + + tabs = page.find_tables(paths=[], add_boxes=boxes) + tab = tabs.tables[0] + assert tab.extract() == [ + ["Boiling Points °C", "min", "max", "avg"], + ["Noble gases", "-269", "-62", "-170.5"], + ["Nonmetals", "-253", "4827", "414.1"], + ["Metalloids", "335", "3900", "741.5"], + ["Metals", "357", ">5000", "2755.9"], + ] + + +def test_dotted_grid(): + """Confirm dotted lines are detected as gridlines.""" + filename = os.path.join(scriptdir, "resources", "dotted-gridlines.pdf") + doc = pymupdf.open(filename) + page = doc[0] + tabs = page.find_tables() + assert len(tabs.tables) == 3 # must be 3 tables + t0, t1, t2 = tabs # extract them + # check that they have expected dimensions + assert t0.row_count, t0.col_count == (11, 12) + assert t1.row_count, t1.col_count == (25, 11) + assert t2.row_count, t2.col_count == (1, 10) + + +def test_4017(): + path = os.path.normpath(f"{__file__}/../../tests/resources/test_4017.pdf") + with pymupdf.open(path) as document: + page = document[0] + + tables = page.find_tables(add_lines=None) + print(f"{len(tables.tables)=}.") + tables_text = list() + for i, table in enumerate(tables): + print(f"## {i=}.") + t = table.extract() + for tt in t: + print(f" {tt}") + + # 2024-11-29: expect current incorrect output for last two tables. 
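+        # The expectations below pin that incorrect output: every second row
+        # comes back as None cells except for the PASS / N/A status column.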
+ + expected_a = [ + ["Class A/B Overcollateralization", "131.44%", ">=", "122.60%", "", "PASS"], + [None, None, None, None, None, "PASS"], + ["Class D Overcollateralization", "112.24%", ">=", "106.40%", "", "PASS"], + [None, None, None, None, None, "PASS"], + ["Event of Default", "156.08%", ">=", "102.50%", "", "PASS"], + [None, None, None, None, None, "PASS"], + ["Class A/B Interest Coverage", "N/A", ">=", "120.00%", "", "N/A"], + [None, None, None, None, None, "N/A"], + ["Class D Interest Coverage", "N/A", ">=", "105.00%", "", "N/A"], + ] + assert tables[-2].extract() == expected_a + + expected_b = [ + [ + "Moody's Maximum Rating Factor Test", + "2,577", + "<=", + "3,250", + "", + "PASS", + "2,581", + ], + [None, None, None, None, None, "PASS", None], + [ + "Minimum Floating Spread", + "3.5006%", + ">=", + "2.0000%", + "", + "PASS", + "3.4871%", + ], + [None, None, None, None, None, "PASS", None], + [ + "Minimum Weighted Average S&P Recovery\nRate Test", + "40.50%", + ">=", + "40.00%", + "", + "PASS", + "40.40%", + ], + [None, None, None, None, None, "PASS", None], + ["Weighted Average Life", "4.83", "<=", "9.00", "", "PASS", "4.92"], + ] + assert tables[-1].extract() == expected_b + + +def test_md_styles(): + """Test output of table with MD-styled cells.""" + filename = os.path.join(scriptdir, "resources", "test-styled-table.pdf") + doc = pymupdf.open(filename) + page = doc[0] + tabs = page.find_tables()[0] + text = """|Column 1|Column 2|Column 3|\n|---|---|---|\n|Zelle (0,0)|**Bold (0,1)**|Zelle (0,2)|\n|~~Strikeout (1,0), Zeile 1~~
~~Hier kommt Zeile 2.~~|Zelle (1,1)|~~Strikeout (1,2)~~|\n|**`Bold-monospaced`**
**`(2,0)`**|_Italic (2,1)_|**_Bold-italic_**
**_(2,2)_**|\n|Zelle (3,0)|~~**Bold-strikeout**~~
~~**(3,1)**~~|Zelle (3,2)|\n\n""" + assert tabs.to_markdown() == text diff -r 000000000000 -r 1d09e1dec1d9 tests/test_tesseract.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_tesseract.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,152 @@ +import os +import platform +import textwrap + +import pymupdf + +def test_tesseract(): + ''' + This checks that MuPDF has been built with tesseract support. + + By default we don't supply a valid `tessdata` directory, and just assert + that attempting to use Tesseract raises the expected error (which checks + that MuPDF is built with Tesseract support). + + But if TESSDATA_PREFIX is set in the environment, we assert that + FzPage.get_textpage_ocr() succeeds. + ''' + path = os.path.abspath( f'{__file__}/../resources/2.pdf') + doc = pymupdf.open( path) + page = doc[5] + if hasattr(pymupdf, 'mupdf'): + # rebased. + if pymupdf.mupdf_version_tuple < (1, 25, 4): + tail = 'OCR initialisation failed' + else: + tail = 'Tesseract language initialisation failed' + e_expected = f'code=3: {tail}' + if platform.system() == 'OpenBSD': + # 2023-12-12: For some reason the SWIG catch code only catches + # the exception as FzErrorBase. + e_expected_type = pymupdf.mupdf.FzErrorBase + print(f'OpenBSD workaround - expecting FzErrorBase, not FzErrorLibrary.') + else: + e_expected_type = pymupdf.mupdf.FzErrorLibrary + else: + # classic. + e_expected = 'OCR initialisation failed' + e_expected_type = None + tessdata_prefix = os.environ.get('TESSDATA_PREFIX') + if tessdata_prefix: + tp = page.get_textpage_ocr(full=True) + print(f'test_tesseract(): page.get_textpage_ocr() succeeded') + else: + try: + tp = page.get_textpage_ocr(full=True, tessdata='/foo/bar') + except Exception as e: + e_text = str(e) + print(f'Received exception as expected.') + print(f'{type(e)=}') + print(f'{e_text=}') + assert e_text == e_expected, f'Unexpected exception: {e_text!r}' + if e_expected_type: + print(f'{e_expected_type=}') + assert type(e) == e_expected_type, f'{type(e)=} != {e_expected_type=}.' + else: + assert 0, f'Expected exception {e_expected!r}' + rebased = hasattr(pymupdf, 'mupdf') + if rebased: + wt = pymupdf.TOOLS.mupdf_warnings() + if pymupdf.mupdf_version_tuple < (1, 25, 4): + assert wt == ( + 'UNHANDLED EXCEPTION!\n' + 'library error: Tesseract initialisation failed' + ) + else: + assert not wt + + +def test_3842b(): + # Check Tesseract failure when given a bogus languages. + # + # Note that Tesseract seems to output its own diagnostics. 
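+    # Both failure modes are accepted below: a missing tessdata/Tesseract
+    # installation, or a language-initialisation error for the bogus 'qwerty'.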
+ # + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3842.pdf') + with pymupdf.open(path) as document: + page = document[6] + try: + partial_tp = page.get_textpage_ocr(flags=0, full=False, language='qwerty') + except Exception as e: + print(f'test_3842b(): received exception: {e}') + if 'No tessdata specified and Tesseract is not installed' in str(e): + pass + else: + if pymupdf.mupdf_version_tuple < (1, 25, 4): + assert 'OCR initialisation failed' in str(e) + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'UNHANDLED EXCEPTION!\nlibrary error: Tesseract initialisation failed\nUNHANDLED EXCEPTION!\nlibrary error: Tesseract initialisation failed', \ + f'Unexpected {wt=}' + else: + assert 'Tesseract language initialisation failed' in str(e) + + +def test_3842(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3842.pdf') + with pymupdf.open(path) as document: + page = document[6] + try: + partial_tp = page.get_textpage_ocr(flags=0, full=False) + except Exception as e: + print(f'test_3842(): received exception: {e}', flush=1) + if 'No tessdata specified and Tesseract is not installed' in str(e): + pass + elif 'Tesseract language initialisation failed' in str(e): + pass + else: + assert 0, f'Unexpected exception text: {str(e)=}' + else: + text = page.get_text(textpage=partial_tp) + print() + print(text) + print(f'text:\n{text!r}') + + # 2024-11-29: This is the current incorrect output. We use + # underscores for lines containing entirely whitespace (which + # textwrap.dedent() unfortunately replaces with empty lines). + text_expected = textwrap.dedent(''' + NIST SP 800-223 + _ + High-Performance Computing Security + February 2024 + _ + __ + iii + Table of Contents + 1. Introduction ...................................................................................................................................1 + 2. HPC System Reference Architecture and Main Components ............................................................2 + 2.1.1. Components of the High-Performance Computing Zone ............................................................. 3 + 2.1.2. Components of the Data Storage Zone ........................................................................................ 4 + 2.1.3. Parallel File System ....................................................................................................................... 4 + 2.1.4. Archival and Campaign Storage .................................................................................................... 5 + 2.1.5. Burst Buffer .................................................................................................................................. 5 + 2.1.6. Components of the Access Zone .................................................................................................. 6 + 2.1.7. Components of the Management Zone ....................................................................................... 6 + 2.1.8. General Architecture and Characteristics .................................................................................... 6 + 2.1.9. Basic Services ................................................................................................................................ 7 + 2.1.10. Configuration Management ....................................................................................................... 7 + 2.1.11. HPC Scheduler and Workflow Management .............................................................................. 7 + 2.1.12. 
HPC Software .............................................................................................................................. 8 + 2.1.13. User Software ............................................................................................................................. 8 + 2.1.14. Site-Provided Software and Vendor Software ........................................................................... 8 + 2.1.15. Containerized Software in HPC .................................................................................................. 9 + 3. HPC Threat Analysis...................................................................................................................... 10 + 3.2.1. Access Zone Threats ................................................................................................................... 11 + 3.2.2. Management Zone Threats ........................................................................................................ 11 + 3.2.3. High-Performance Computing Zone Threats .............................................................................. 12 + 3.2.4. Data Storage Zone Threats ......................................................................................................... 12 + 4. HPC Security Posture, Challenges, and Recommendations ............................................................. 14 + 5. Conclusions .................................................................................................................................. 19 + ''', + )[1:].replace('_', ' ') + print(f'text_expected:\n{text_expected!r}') + assert text == text_expected diff -r 000000000000 -r 1d09e1dec1d9 tests/test_textbox.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_textbox.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,288 @@ +""" +Fill a given text in a rectangle on some PDF page using +1. TextWriter object +2. Basic text output + +Check text is indeed contained in given rectangle. +""" +import pymupdf + +# codespell:ignore-begin +text = """Der Kleine Schwertwal (Pseudorca crassidens), auch bekannt als Unechter oder Schwarzer Schwertwal, ist eine Art der Delfine (Delphinidae) und der einzige rezente Vertreter der Gattung Pseudorca. + +Er ähnelt dem Orca in Form und Proportionen, ist aber einfarbig schwarz und mit einer Maximallänge von etwa sechs Metern deutlich kleiner. + +Kleine Schwertwale bilden Schulen von durchschnittlich zehn bis fünfzig Tieren, wobei sie sich auch mit anderen Delfinen vergesellschaften und sich meistens abseits der Küsten aufhalten. 
+ +Sie sind in allen Ozeanen gemäßigter, subtropischer und tropischer Breiten beheimatet, sind jedoch vor allem in wärmeren Jahreszeiten auch bis in die gemäßigte bis subpolare Zone südlich der Südspitze Südamerikas, vor Nordeuropa und bis vor Kanada anzutreffen.""" +# codespell:ignore-end + +def test_textbox1(): + """Use TextWriter for text insertion.""" + doc = pymupdf.open() + page = doc.new_page() + rect = pymupdf.Rect(50, 50, 400, 400) + blue = (0, 0, 1) + tw = pymupdf.TextWriter(page.rect, color=blue) + tw.fill_textbox( + rect, + text, + align=pymupdf.TEXT_ALIGN_LEFT, + fontsize=12, + ) + tw.write_text(page, morph=(rect.tl, pymupdf.Matrix(1, 1))) + # check text containment + assert page.get_text() == page.get_text(clip=rect) + page.write_text(writers=tw) + + +def test_textbox2(): + """Use basic text insertion.""" + doc = pymupdf.open() + ocg = doc.add_ocg("ocg1") + page = doc.new_page() + rect = pymupdf.Rect(50, 50, 400, 400) + blue = pymupdf.utils.getColor("lightblue") + red = pymupdf.utils.getColorHSV("red") + page.insert_textbox( + rect, + text, + align=pymupdf.TEXT_ALIGN_LEFT, + fontsize=12, + color=blue, + oc=ocg, + ) + # check text containment + assert page.get_text() == page.get_text(clip=rect) + + +def test_textbox3(): + """Use TextWriter for text insertion.""" + doc = pymupdf.open() + page = doc.new_page() + font = pymupdf.Font("cjk") + rect = pymupdf.Rect(50, 50, 400, 400) + blue = (0, 0, 1) + tw = pymupdf.TextWriter(page.rect, color=blue) + tw.fill_textbox( + rect, + text, + align=pymupdf.TEXT_ALIGN_LEFT, + font=font, + fontsize=12, + right_to_left=True, + ) + tw.write_text(page, morph=(rect.tl, pymupdf.Matrix(1, 1))) + # check text containment + assert page.get_text() == page.get_text(clip=rect) + doc.scrub() + doc.subset_fonts() + + +def test_textbox4(): + """Use TextWriter for text insertion.""" + doc = pymupdf.open() + ocg = doc.add_ocg("ocg1") + page = doc.new_page() + rect = pymupdf.Rect(50, 50, 400, 600) + blue = (0, 0, 1) + tw = pymupdf.TextWriter(page.rect, color=blue) + tw.fill_textbox( + rect, + text, + align=pymupdf.TEXT_ALIGN_LEFT, + fontsize=12, + font=pymupdf.Font("cour"), + right_to_left=True, + ) + tw.write_text(page, oc=ocg, morph=(rect.tl, pymupdf.Matrix(1, 1))) + # check text containment + assert page.get_text() == page.get_text(clip=rect) + + +def test_textbox5(): + """Using basic text insertion.""" + small_glyph_heights0 = pymupdf.TOOLS.set_small_glyph_heights() + pymupdf.TOOLS.set_small_glyph_heights(True) + try: + doc = pymupdf.open() + page = doc.new_page() + r = pymupdf.Rect(100, 100, 150, 150) + text = "words and words and words and more words..." + rc = -1 + fontsize = 12 + page.draw_rect(r) + while rc < 0: + rc = page.insert_textbox( + r, + text, + fontsize=fontsize, + align=pymupdf.TEXT_ALIGN_JUSTIFY, + ) + fontsize -= 0.5 + + blocks = page.get_text("blocks") + bbox = pymupdf.Rect(blocks[0][:4]) + assert bbox in r + finally: + # Must restore small_glyph_heights, otherwise other tests can fail. + pymupdf.TOOLS.set_small_glyph_heights(small_glyph_heights0) + + +def test_2637(): + """Ensure correct calculation of fitting text.""" + doc = pymupdf.open() + page = doc.new_page() + text = ( + "The morning sun painted the sky with hues of orange and pink. " + "Birds chirped harmoniously, greeting the new day. " + "Nature awakened, filling the air with life and promise." 
+ ) + rect = pymupdf.Rect(50, 50, 500, 280) + fontsize = 50 + rc = -1 + while rc < 0: # look for largest font size that makes the text fit + rc = page.insert_textbox(rect, text, fontname="hebo", fontsize=fontsize) + fontsize -= 1 + + # confirm text won't lap outside rect + blocks = page.get_text("blocks") + bbox = pymupdf.Rect(blocks[0][:4]) + assert bbox in rect + + +def test_htmlbox1(): + """Write HTML-styled text into a rect with different rotations. + + The text is styled and contains a link. + Then extract the text again, and + - assert that text was written in the 4 different angles, + - assert that text properties are correct (bold, italic, color), + - assert that the link has been correctly inserted. + + We try to insert into a rectangle that is too small, setting + scale=False and confirming we have a negative return code. + """ + if not hasattr(pymupdf, "mupdf"): + print("'test_htmlbox1' not executed in classic.") + return + + rect = pymupdf.Rect(100, 100, 200, 200) # this only works with scale=True + + base_text = """Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.""" + + text = """Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.""" + + doc = pymupdf.Document() + + for rot in (0, 90, 180, 270): + wdirs = ((1, 0), (0, -1), (-1, 0), (0, 1)) # all writing directions + page = doc.new_page() + spare_height, scale = page.insert_htmlbox(rect, text, rotate=rot, scale_low=1) + assert spare_height < 0 + assert scale == 1 + spare_height, scale = page.insert_htmlbox(rect, text, rotate=rot, scale_low=0) + assert spare_height == 0 + assert 0 < scale < 1 + page = doc.reload_page(page) + link = page.get_links()[0] # extracts the links on the page + + assert link["uri"] == "https://www.artifex.com" + + # Assert plain text is complete. + # We must remove line breaks and any ligatures for this. 
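+        # (flags=0 disables ligature/whitespace preservation, and [:-1] strips
+        # the trailing newline that plain-text extraction appends.)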
+ assert base_text == page.get_text(flags=0)[:-1].replace("\n", " ") + + encounters = 0 # counts the words with selected properties + for b in page.get_text("dict")["blocks"]: + for l in b["lines"]: + wdir = l["dir"] # writing direction + assert wdir == wdirs[page.number] + for s in l["spans"]: + stext = s["text"] + color = pymupdf.sRGB_to_pdf(s["color"]) + bold = bool(s["flags"] & 16) + italic = bool(s["flags"] & 2) + if stext in ("ullamco", "laboris", "voluptate"): + encounters += 1 + if stext == "ullamco": + assert bold is True + assert italic is False + assert color == pymupdf.pdfcolor["black"] + elif stext == "laboris": + assert bold is False + assert italic is True + assert color == pymupdf.pdfcolor["black"] + elif stext == "voluptate": + assert bold is True + assert italic is False + assert color == pymupdf.pdfcolor["green"] + else: + assert bold is False + assert italic is False + # all 3 special special words were encountered + assert encounters == 3 + + +def test_htmlbox2(): + """Test insertion without scaling""" + if not hasattr(pymupdf, "mupdf"): + print("'test_htmlbox2' not executed in classic.") + return + + doc = pymupdf.open() + rect = pymupdf.Rect(100, 100, 200, 200) # large enough to hold text + page = doc.new_page() + bottoms = set() + for rot in (0, 90, 180, 270): + spare_height, scale = page.insert_htmlbox( + rect, "Hello, World!", scale_low=1, rotate=rot + ) + assert scale == 1 + assert 0 < spare_height < rect.height + bottoms.add(spare_height) + assert len(bottoms) == 1 # same result for all rotations + + +def test_htmlbox3(): + """Test insertion with opacity""" + if not hasattr(pymupdf, "mupdf"): + print("'test_htmlbox3' not executed in classic.") + return + + rect = pymupdf.Rect(100, 250, 300, 350) + text = """Just some text.""" + doc = pymupdf.open() + page = doc.new_page() + + # insert some text with opacity + page.insert_htmlbox(rect, text, opacity=0.5) + + # lowlevel-extract inserted text to access opacity + span = page.get_texttrace()[0] + assert span["opacity"] == 0.5 + + +def test_3559(): + doc = pymupdf.Document() + page = doc.new_page() + text_insert="""

""" + rect = pymupdf.Rect(100, 100, 200, 200) + page.insert_htmlbox(rect, text_insert) + + +def test_3916(): + doc = pymupdf.open() + rect = pymupdf.Rect(100, 100, 101, 101) # Too small for the text. + page = doc.new_page() + spare_height, scale = page.insert_htmlbox(rect, "Hello, World!", scale_low=0.5) + assert spare_height == -1 + + +def test_4400(): + with pymupdf.open() as document: + page = document.new_page() + writer = pymupdf.TextWriter(page.rect) + text = '111111111' + print(f'Calling writer.fill_textbox().', flush=1) + writer.fill_textbox(rect=pymupdf.Rect(0, 0, 100, 20), pos=(80, 0), text=text, fontsize=8) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_textextract.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_textextract.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,948 @@ +""" +Extract page text in various formats. +No checks performed - just contribute to code coverage. +""" +import os +import platform +import sys +import textwrap + +import pymupdf + +import gentle_compare + + +pymupdfdir = os.path.abspath(f'{__file__}/../..') +scriptdir = f'{pymupdfdir}/tests' +filename = os.path.join(scriptdir, "resources", "symbol-list.pdf") + + +def test_extract1(): + doc = pymupdf.open(filename) + page = doc[0] + text = page.get_text("text") + blocks = page.get_text("blocks") + words = page.get_text("words") + d1 = page.get_text("dict") + d2 = page.get_text("json") + d3 = page.get_text("rawdict") + d3 = page.get_text("rawjson") + text = page.get_text("html") + text = page.get_text("xhtml") + text = page.get_text("xml") + rects = pymupdf.get_highlight_selection(page, start=page.rect.tl, stop=page.rect.br) + text = pymupdf.ConversionHeader("xml") + text = pymupdf.ConversionTrailer("xml") + +def _test_extract2(): + import sys + import time + path = f'{scriptdir}/../../PyMuPDF-performance/adobe.pdf' + if not os.path.exists(path): + print(f'test_extract2(): not running because does not exist: {path}') + return + doc = pymupdf.open( path) + for opt in ( + 'dict', + 'dict2', + 'text', + 'blocks', + 'words', + 'html', + 'xhtml', + 'xml', + 'json', + 'rawdict', + 'rawjson', + ): + for flags in None, pymupdf.TEXTFLAGS_TEXT: + t0 = time.time() + for page in doc: + page.get_text(opt, flags=flags) + t = time.time() - t0 + print(f't={t:.02f}: opt={opt} flags={flags}') + sys.stdout.flush() + +def _test_extract3(): + import sys + import time + path = f'{scriptdir}/../../PyMuPDF-performance/adobe.pdf' + if not os.path.exists(path): + print(f'test_extract3(): not running because does not exist: {path}') + return + doc = pymupdf.open( path) + t0 = time.time() + for page in doc: + page.get_text('json') + t = time.time() - t0 + print(f't={t}') + sys.stdout.flush() + +def test_extract4(): + ''' + Rebased-specific. 
+ ''' + if not hasattr(pymupdf, 'mupdf'): + return + path = f'{pymupdfdir}/tests/resources/2.pdf' + document = pymupdf.open(path) + page = document[4] + + out = 'test_stext.html' + text = page.get_text('html') + with open(out, 'w') as f: + f.write(text) + print(f'Have written to: {out}') + + out = 'test_extract.html' + writer = pymupdf.mupdf.FzDocumentWriter( + out, + 'html', + pymupdf.mupdf.FzDocumentWriter.OutputType_DOCX, + ) + device = pymupdf.mupdf.fz_begin_page(writer, pymupdf.mupdf.fz_bound_page(page)) + pymupdf.mupdf.fz_run_page(page, device, pymupdf.mupdf.FzMatrix(), pymupdf.mupdf.FzCookie()) + pymupdf.mupdf.fz_end_page(writer) + pymupdf.mupdf.fz_close_document_writer(writer) + print(f'Have written to: {out}') + + def get_text(page, space_guess): + buffer_ = pymupdf.mupdf.FzBuffer( 10) + out = pymupdf.mupdf.FzOutput( buffer_) + writer = pymupdf.mupdf.FzDocumentWriter( + out, + 'text,space-guess={space_guess}', + pymupdf.mupdf.FzDocumentWriter.OutputType_DOCX, + ) + device = pymupdf.mupdf.fz_begin_page(writer, pymupdf.mupdf.fz_bound_page(page)) + pymupdf.mupdf.fz_run_page(page, device, pymupdf.mupdf.FzMatrix(), pymupdf.mupdf.FzCookie()) + pymupdf.mupdf.fz_end_page(writer) + pymupdf.mupdf.fz_close_document_writer(writer) + text = buffer_.fz_buffer_extract() + text = text.decode('utf8') + n = text.count(' ') + print(f'{space_guess=}: {n=}') + return text, n + page = document[4] + text0, n0 = get_text(page, 0) + text1, n1 = get_text(page, 0.5) + text2, n2 = get_text(page, 0.001) + text2, n2 = get_text(page, 0.1) + text2, n2 = get_text(page, 0.3) + text2, n2 = get_text(page, 0.9) + text2, n2 = get_text(page, 5.9) + assert text1 == text0 + +def test_2954(): + ''' + Check handling of unknown unicode characters, issue #2954, fixed in + mupdf-1.23.9 with addition of FZ_STEXT_USE_CID_FOR_UNKNOWN_UNICODE. + ''' + path = os.path.abspath(f'{__file__}/../../tests/resources/test_2954.pdf') + flags0 = (0 + | pymupdf.TEXT_PRESERVE_WHITESPACE + | pymupdf.TEXT_PRESERVE_LIGATURES + | pymupdf.TEXT_MEDIABOX_CLIP + ) + + document = pymupdf.Document(path) + + expected_good = ( + "IT-204-IP (2021) Page 3 of 5\nNYPA2514 12/06/21\nPartner's share of \n" + " modifications (see instructions)\n20\n State additions\nNumber\n" + "A ' Total amount\nB '\n State allocated amount\n" + "EA '\n20a\nEA '\n20b\nEA '\n20c\nEA '\n20d\nEA '\n20e\nEA '\n20f\n" + "Total addition modifications (total of column A, lines 20a through 20f)\n" + ". . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \n" + "21\n21\n22\n State subtractions\n" + "Number\nA ' Total amount\nB '\n State allocated amount\n" + "ES '\n22a\nES '\n22b\nES '\n22c\nES '\n22d\nES '\n22e\nES '\n22f\n23\n23\n" + "Total subtraction modifications (total of column A, lines 22a through 22f). . . . . . . . . . . . . . . . . . . . . . . . . . . . \n" + "Additions to itemized deductions\n24\nAmount\n" + "Letter\n" + "24a\n24b\n24c\n24d\n24e\n24f\n" + "Total additions to itemized deductions (add lines 24a through 24f)\n" + ". . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \n" + "25\n25\n" + "Subtractions from itemized deductions\n" + "26\nLetter\nAmount\n26a\n26b\n26c\n26d\n26e\n26f\n" + "Total subtractions from itemized deductions (add lines 26a through 26f) . . . . . . . . . . . . . . . . . . . . . . . . . . . . \n" + "27\n27\n" + "This line intentionally left blank. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
\n" + "28\n28\n118003213032\n" + ) + + def check_good(text): + ''' + Returns true if `text` is approximately the same as `expected_good`. + + 2024-01-09: MuPDF master and 1.23.x give slightly different 'good' + output, differing in a missing newline. So we compare without newlines. + ''' + return text.replace('\n', '') == expected_good.replace('\n', '') + + n_fffd_good = 0 + n_fffd_bad = 749 + + def get(flags=None): + text = [page.get_text(flags=flags) for page in document] + assert len(text) == 1 + text = text[0] + n_fffd = text.count(chr(0xfffd)) + if 0: + # This print() fails on Windows with UnicodeEncodeError. + print(f'{flags=} {n_fffd=} {text=}') + return text, n_fffd + + text_none, n_fffd_none = get() + text_0, n_fffd_0 = get(flags0) + + text_1, n_fffd_1 = get(flags0 | pymupdf.TEXT_USE_CID_FOR_UNKNOWN_UNICODE) + + assert n_fffd_none == n_fffd_good + assert n_fffd_0 == n_fffd_bad + assert n_fffd_1 == n_fffd_good + + assert check_good(text_none) + assert not check_good(text_0) + assert check_good(text_1) + + +def test_3027(): + path = path = f'{pymupdfdir}/tests/resources/2.pdf' + doc = pymupdf.open(path) + page = doc[0] + textpage = page.get_textpage() + pymupdf.utils.get_text(page=page, option="dict", textpage=textpage)["blocks"] + + +def test_3186(): + + # codespell:ignore-begin + texts_expected = [ + "Assicurazione sulla vita di tipo Unit Linked\nDocumento informativo precontrattuale aggiuntivo\nper i prodotti d\x00investimento assicurativi\n(DIP aggiuntivo IBIP)\nImpresa: AXA MPS Financial DAC \nProdotto: Progetto Protetto New - Global Dividends\nContratto Unit linked (Ramo III)\nData di realizzazione: Aprile 2023\nIl presente documento contiene informazioni aggiuntive e complementari rispetto a quelle presenti nel documento \ncontenente le informazioni chiave per i prodotti di investimento assicurativi (KID) per aiutare il potenziale \ncontraente a capire più nel dettaglio le caratteristiche del prodotto, gli obblighi contrattuali e la situazione \npatrimoniale dell\x00impresa.\nIl Contraente deve prendere visione delle condizioni d\x00assicurazione prima della sottoscrizione del Contratto.\nAXA MPS Financial DAC, Wolfe Tone House, Wolfe Tone Street, Dublin, DO1 HP90, Irlanda; Tel: 00353-1-6439100; \nsito internet: www.axa-mpsfinancial.ie; e-mail: supporto@axa-mpsfinancial.ie;\nAXA MPS Financial DAC, società del Gruppo Assicurativo AXA Italia, iscritta nell\x00Albo delle Imprese di assicurazione \ncon il numero II.00234. \nLa Compagnia mette a disposizione dei clienti i seguenti recapiti per richiedere eventuali informazioni sia in merito alla \nCompagnia sia in relazione al contratto proposto: Tel: 00353-1-6439100; sito internet: www.axa-mpsfinancial.ie; \ne-mail: supporto@axa-mpsfinancial.ie;\nAXA MPS Financial DAC è un\x00impresa di assicurazione di diritto Irlandese, Sede legale 33 Sir John Rogerson's Quay, \nDublino D02 XK09 Irlanda. L\x00Impresa di Assicurazione è stata autorizzata all\x00esercizio dell\x00attività assicurativa con \nprovvedimento n. C33602 emesso dalla Central Bank of Ireland (l\x00Autorità di vigilanza irlandese) in data 14/05/1999 \ned è iscritta in Irlanda presso il Companies Registration Office (registered nr. 293822). \nLa Compagnia opera in Italia esclusivamente in regime di libera prestazione di servizi ai sensi dell\x00art. 24 del D. Lgs. \n07/09/2005, n. 
209 e può investire in attivi non consentiti dalla normativa italiana in materia di assicurazione sulla \nvita, ma in conformità con la normativa irlandese di riferimento in quanto soggetta al controllo della Central Bank of \nIreland.\nCon riferimento all\x00ultimo bilancio d\x00esercizio (esercizio 2021) redatto ai sensi dei principi contabili vigenti, il patrimonio \nnetto di AXA MPS Financial DAC ammonta a 139,6 milioni di euro di cui 635 mila euro di capitale sociale interamente \nversato e 138,9 milioni di euro di riserve patrimoniali compreso il risultato di esercizio.\nAl 31 dicembre 2021 il Requisito patrimoniale di solvibilità è pari a 90 milioni di euro (Solvency Capital Requirement, \nSCR). Sulla base delle valutazioni effettuate della Compagnia coerentemente con gli esistenti dettami regolamentari, il \nRequisito patrimoniale minimo al 31 dicembre 2021 ammonta a 40 milioni di euro (Minimum Capital Requirement, \nMCR).\nL'indice di solvibilità di AXA MPS Financial DAC, ovvero l'indice che rappresenta il rapporto tra l'ammontare del margine \ndi solvibilità disponibile e l'ammontare del margine di solvibilità richiesto dalla normativa vigente, e relativo all'ultimo \nbilancio approvato, è pari al 304% (solvency ratio). L'importo dei fondi propri ammissibili a copertura dei requisiti \npatrimoniali è pari a 276 milioni di euro (Eligible Own Funds, EOF).\nPer informazioni patrimoniali sulla società è possibile consultare il sito: www.axa-mpsfinancial.ie/chi-siamo\nSi rinvia alla relazione sulla solvibilità e sulla condizione finanziaria dell\x00impresa (SFCR) disponibile sul sito internet \ndella Compagnia al seguente link www.axa-mpsfinancial.ie/comunicazioni \nAl contratto si applica la legge italiana\nDIP aggiuntivo IBIP - Progetto Protetto New - Global Dividends - Pag. 1 di 9\n", + "Quali sono le prestazioni?\nIl contratto prevede le seguenti prestazioni:\na)Prestazioni in caso di vita dell'assicurato\nPrestazione in caso di Riscatto Totale e parziale\nA condizione che siano trascorsi almeno 30 giorni dalla Data di Decorrenza (conclusione del Contratto) e fino all\x00ultimo \nGiorno Lavorativo della terzultima settimana precedente la data di scadenza, il Contraente può riscuotere, interamente \no parzialmente, il Valore di Riscatto. In caso di Riscatto totale, la liquidazione del Valore di Riscatto pone fine al \nContratto con effetto dalla data di ricezione della richiesta.\nIl Contraente ha inoltre la facoltà di esercitare parzialmente il diritto di Riscatto, nella misura minima di 500,00 euro, \nda esercitarsi con le stesse modalità previste per il Riscatto totale. In questo caso, il Contratto rimane in vigore per \nl\x00ammontare residuo, a condizione che il Controvalore delle Quote residue del Contratto non sia inferiore a 1.000,00 \neuro.\nb) Prestazione a Scadenza\nAlla data di scadenza, sempre che l\x00Assicurato sia in vita, l\x00Impresa di Assicurazione corrisponderà agli aventi diritto un \nammontare risultante dal Controvalore delle Quote collegate al Contratto alla scadenza, calcolato come prodotto tra il \nValore Unitario della Quota (rilevato in corrispondenza della data di scadenza) e il numero delle Quote attribuite al \nContratto alla medesima data.\nc) Prestazione in corso di Contratto\nPurché l\x00assicurato sia in vita, nel corso della durata del Contratto, il Fondo Interno mira alla corresponsione di due \nPrestazioni Periodiche. 
Le prestazioni saranno pari all\x00ammontare risultante dalla moltiplicazione tra il numero di Quote \nassegnate al Contratto il primo giorno Lavorativo della settimana successiva alla Data di Riferimento e 2,50% del \nValore Unitario della Quota registrato alla Data di istituzione del Fondo Interno.\nLe prestazioni verranno liquidate entro trenta giorni dalle Date di Riferimento.\nData di Riferimento\n 1° Prestazione Periodica\n24/04/2024\n 2° Prestazione Periodica\n23/04/2025\nLa corresponsione delle Prestazioni Periodiche non è collegata alla performance positiva o ai ricavi incassati dal Fondo \nInterno, pertanto, la corresponsione potrebbe comportare una riduzione del Controvalore delle Quote senza comportare \nalcuna riduzione del numero di Quote assegnate al Contratto.\nd) Prestazione assicurativa principale in caso di decesso dell'Assicurato\nIn caso di decesso dell\x00Assicurato nel corso della durata contrattuale, è previsto il pagamento ai Beneficiari di un \nimporto pari al Controvalore delle Quote attribuite al Contratto, calcolato come prodotto tra il Valore Unitario della \nQuota rilevato alla Data di Valorizzazione della settimana successiva alla data in cui la notifica di decesso \ndell\x00Assicurato perviene all\x00Impresa di Assicurazione e il numero delle Quote attribuite al Contratto alla medesima data, \nmaggiorato di una percentuale pari allo 0,1%.\nQualora il capitale così determinato fosse inferiore al Premio pagato, sarà liquidato un ulteriore importo pari alla \ndifferenza tra il Premio pagato, al netto della parte di Premio riferita a eventuali Riscatti parziali e l\x00importo caso morte \ncome sopra determinato. Tale importo non potrà essere in ogni caso superiore al 5% del Premio pagato.\nOpzioni contrattuali\nIl Contratto non prevede opzioni contrattuali.\nFondi Assicurativi\nLe prestazioni di cui sopra sono collegate, in base all\x00allocazione del premio come descritto alla sezione \x01Quando e \ncome devo pagare?\x02, al valore delle quote del Fondo Interno denominato PP27 Global Dividends.\nil Fondo interno mira al raggiungimento di un Obiettivo di Protezione del Valore Unitario di Quota, tramite il \nconseguimento di un Valore Unitario di Quota a scadenza almeno pari al 100% del valore di quota registrato alla Data \ndi istituzione dal Fondo Interno.\nIl regolamento di gestione del Fondo Interno è disponibile sul sito dell\x00Impresa di Assicurazione \nwww.axa-mpsfinancial.ie dove puo essere acquisito su supporto duraturo.\nDIP aggiuntivo IBIP - Progetto Protetto New - Global Dividends - Pag. 
2 di 9\n", + 'Che cosa NON è assicurato\nRischi esclusi\nIl rischio di decesso dell\x00Assicurato è coperto qualunque sia la causa, senza limiti territoriali e senza \ntenere conto dei cambiamenti di professione dell\x00Assicurato, ad eccezione dei seguenti casi:\n\x03 il decesso, entro i primi sette anni dalla data di decorrenza del Contratto, dovuto alla sindrome da \nimmunodeficienza acquisita (AIDS) o ad altra patologia ad essa associata;\n\x03 dolo del Contraente o del Beneficiario;\n\x03 partecipazione attiva dell\x00Assicurato a delitti dolosi;\n\x03 partecipazione dell\x00Assicurato a fatti di guerra, salvo che non derivi da obblighi verso lo Stato \nItaliano: in questo caso la garanzia può essere prestata su richiesta del Contraente, alle condizioni \nstabilite dal competente Ministero;\n\x03 incidente di volo, se l\x00Assicurato viaggia a bordo di un aeromobile non autorizzato al volo o con \npilota non titolare di brevetto idoneo e, in ogni caso, se viaggia in qualità di membro \ndell\x00equipaggio;\n\x03 suicidio, se avviene nei primi due anni dalla Data di Decorrenza del Contratto\nCi sono limiti di copertura?\nNon vi sono ulteriori informazioni rispetto al contenuto del KID.\nChe obblighi ho? Quali obblighi ha l\x00Impresa?\nCosa fare in caso \ndi evento?\nDenuncia\nCon riferimento alla liquidazione delle prestazioni dedotte in Contratto, il Contraente o, se del caso, \nil Beneficiario e il Referente Terzo, sono tenuti a recarsi presso la sede dell\x00intermediario presso il \nquale il Contratto è stato sottoscritto ovvero a inviare preventivamente, a mezzo di lettera \nraccomandata con avviso di ricevimento al seguente recapito:\n\x03 AXA MPS Financial DAC\n\x03 Wolfe Tone House, Wolfe Tone Street,\n\x03 Dublin, DO1 HP90 - Ireland\n\x03 Numero Verde: 800.231.187\n\x03 email: supporto@axa-mpsfinancial.ie\ni documenti di seguito elencati per ciascuna prestazione, al fine di consentire all\x00Impresa di \nAssicurazione di verificare l\x00effettiva esistenza dell\x00obbligo di pagamento.\nin caso di Riscatto totale, il Contraente deve inviare all\x00Impresa di Assicurazione:\n\x04 la richiesta di Riscatto totale firmata dal Contraente, indicando il conto corrente su cui il \npagamento deve essere effettuato. Nel caso il conto corrente sia intestato a persona diversa dal \nContraente o dai beneficiari o sia cointestato, il Contraente deve fornire anche I documenti del \ncointestatario e specificare la relazione con il terzo il cui conto viene indicato.\n\x04 copia di un valido documento di identità del Contraente o di un documento attestante i poteri di \nlegale rappresentante, nel caso in cui il Contraente sia una persona giuridica;\nin caso di Riscatto parziale, il Contraente deve inviare all\x00Impresa di Assicurazione:\n\x04 la richiesta di Riscatto parziale firmata dal Contraente, contenente l\x00indicazione dei Fondi \nInterni/OICR che intende riscattare e il relativo ammontare non ché l\x00indicazione del conto corrente \nbancario sul quale effettuare il pagamento;\n\x04 copia di un valido documento di identità del Contraente, o di un documento attestante i poteri di \nlegale rappresentante, nel caso in cui il Contraente sia una persona giuridica.\nIn caso di richiesta di Riscatto totale o parziale non corredata dalla sopra elencata documentazione, \nl\x00Impresa di Assicurazione effettuerà il disinvestimento delle Quote collegate al Contratto alla data \ndi ricezione della relativa richiesta. 
L\x00Impresa di Assicurazione provvederà tuttavia alla liquidazione \ndelle somme unicamente al momento di ricezione della documentazione mancante, prive degli \neventuali interessi che dovessero maturare;\nIn caso di decesso dell\x00Assicurato, il Beneficiario/i o il Referente Terzo deve inviare all\x00Impresa di \nAssicurazione:\nDIP aggiuntivo IBIP - Progetto Protetto New - Global Dividends - Pag. 3 di 9\n', + '\x04 la richiesta di pagamento sottoscritta da tutti i Beneficiari, con l\x00indicazione del conto corrente \nbancario sul quale effettuare il pagamento; Nel caso il conto corrente sia intestato a persona \ndiversa dal Contraente o dai beneficiari o sia cointestato, il Contraente deve fornire anche I \ndocumenti del cointestatario e specificare la relazione con il terzo il cui conto viene indicato.\n\x04 copia di un valido documento d\x00identità dei Beneficiari o di un documento attestante i poteri di \nlegale rappresentante, nel caso in cui il Beneficiario sia una persona giuridica;\n\x04 il certificato di morte dell\x00Assicurato;\n\x04 la relazione medica sulle cause del decesso;\n\x04 copia autenticata del testamento accompagnato da dichiarazione sostitutiva di atto di notorietà \ncon l\x00indicazione (i) della circostanza che il testamento è l\x00ultimo da considerarsi valido e non è \nstato impugnato e (ii) degli eredi testamentari, le relative età e capacità\ndi agire;\n\x04 in assenza di testamento, atto notorio (o dichiarazione sostitutiva di atto di notorietà) attestante \nche il decesso è avvenuto senza lasciare testamento e che non vi sono altri soggetti cui la legge \nriconosce diritti o quote di eredità;\n\x04 decreto del Giudice Tutelare nel caso di Beneficiari di minore età, con l\x00indicazione della persona \ndesignata alla riscossione;\n\x04 copia del Questionario KYC.\nPrescrizione: Alla data di redazione del presente documento, i diritti dei beneficiari dei contratti di \nassicurazione sulla vita si prescrivono nel termine di dieci anni dal giorno in cui si è verificato il fatto \nsu cui il diritto si fonda. Decorso tale termine e senza che la Compagnia abbia ricevuto alcuna \ncomunicazione e/o disposizione, gli importi derivanti dal contratto saranno devoluti al Fondo \ncostitutivo presso il Ministero dell\x00Economia e delle Finanze \x01depositi dormienti\x02.\nErogazione della prestazione\nL\x00Impresa di Assicurazione esegue il pagamento entro trenta giorni dal ricevimento della \ndocumentazione completa all\x00indirizzo sopra indicato.\n \nLe dichiarazioni del Contraente, e dell\x00Assicurato se diverso dal Contraente, devono essere esatte e \nveritiere. 
In caso di dichiarazioni inesatte o reticenti relative a circostanze tali che l\x00Impresa di \nAssicurazione non avrebbe dato il suo consenso, non lo avrebbe dato alle medesime condizioni se \navesse conosciuto il vero stato delle cose, l\x00Impresa di Assicurazione ha diritto a:\na) in caso di dolo o colpa grave:\n\x04 impugnare il Contratto dichiarando al Contraente di voler esercitare tale diritto entro tre mesi dal \ngiorno in cui ha conosciuto l\x00inesattezza della dichiarazione o le reticenze;\n\x04 trattenere il Premio relativo al periodo di assicurazione in corso al momento dell\x00impugnazione e, \nin ogni caso, il Premio corrispondente al primo anno;\n\x04 restituire, in caso di decesso dell\x00Assicurato, solo il Controvalore delle Quote acquisite al \nmomento del decesso, se l\x00evento si verifica prima che sia decorso il termine dianzi indicato per \nl\x00impugnazione;\nb) ove non sussista dolo o colpa grave:\n\x04 recedere dal Contratto, mediante dichiarazione da farsi al Contraente entro tre mesi dal giorno in \ncui ha conosciuto l\x00inesattezza della dichiarazione o le reticenze;\n\x04 se il decesso si verifica prima che l\x00inesattezza della dichiarazione o la reticenza sia conosciuta \ndall\x00Impresa di Assicurazione, o prima che l\x00Impresa abbia dichiarato di recedere dal Contratto, di \nridurre la somma dovuta in proporzione alla differenza tra il Premio convenuto e quello che sarebbe \nstato applicato se si fosse conosciuto il vero stato delle cose.\nIl Contraente è tenuto a inoltrare per iscritto alla Compagnia (posta ordinaria e mail) eventuali \ncomunicazioni inerenti:\n-modifiche dell\x00indirizzo presso il quale intende ricevere le comunicazioni relative al contratto;\n-variazione della residenza Europea nel corso della durata del contratto, presso altro Paese \nmembro della Unione Europea;\n-variazione degli estremi di conto corrente bancario.\nIn tal caso è necessario inoltrare la richiesta attraverso l\x00invio del modulo del mandato, compilato e \nsottoscritto dal contraente, reperibile nella sezione \x01comunicazioni\x02 sul sito internet della \ncompagnia all\x00indirizzo www.axa-mpsfinancial.ie\nFATCA (Foreign Account Tax Compliance Act) e CRS (Common Standard Reporting)\nLa normativa denominata rispettivamente FATCA (Foreign Account Tax Compliance Act - \nIntergovernmental Agreement sottoscritto tra Italia e Stati Uniti in data 10 gennaio 2014 e Legge n. \n95 del 18 giugno 2015) e CRS (Common Reporting Standard - Decreto Ministeriale del 28 \ndicembre 2015) impone agli operatori commerciali, al fine di contrastare la frode fiscale e \nl\x00evasione fiscale transfrontaliera, di eseguire la puntuale identificazione della propria clientela al \nfine di determinarne l\x00effettivo status di contribuente estero.\nDichiarazioni \ninesatte o \nreticenti\nDIP aggiuntivo IBIP - Progetto Protetto New - Global Dividends - Pag. 4 di 9\n', + "I dati anagrafici e patrimoniali dei Contraenti identificati come fiscalmente residenti negli USA e/o \nin uno o più Paesi aderenti al CRS, dovranno essere trasmessi all\x00autorità fiscale locale, tramite \nl\x00Agenzia delle Entrate.\nL\x00identificazione avviene in fase di stipula del contratto e deve essere ripetuta in caso di \ncambiamento delle condizioni originarie durante tutta la sua durata, mediante l\x00acquisizione di \nautocertificazione rilasciata dai Contraenti. 
Ogni contraente è tenuto a comunicare \ntempestivamente eventuali variazioni rispetto a quanto dichiarato o rilevato in fase di sottoscrizione \ndel contratto di assicurazione. La Società si riserva inoltre di verificare i dati raccolti e di richiedere \nulteriori informazioni. In caso di autocertificazione che risulti compilata parzialmente o in maniera \nerrata, nonché in caso di mancata/non corretta comunicazione dei propri dati anagrafici, la società \nqualora abbia rilevato indizi di americanità e/o residenze fiscali estere nelle informazioni in suo \npossesso, assocerà al cliente la condizione di contribuente estero, provvedendo alla comunicazione \ndovuta.\nAntiriciclaggio\nIl Contraente è tenuto a fornire alla Compagnia tutte le informazioni necessarie al fine \ndell\x00assolvimento dell\x00adeguata verifica ai fini antiriciclaggio. Qualora la Compagnia, in ragione \ndella mancata collaborazione del Contraente, non sia in grado di portare a compimento l\x00adeguata \nverifica, la stessa non potrà concludere il Contratto o dovrà porre fine allo stesso. In tali ipotesi le \nsomme dovute al Contraente dovranno essere allo stesso versate mediante bonifico a valere un \nconto corrente intestato al Contraente stesso. In tali ipotesi le disponibilità finanziarie \neventualmente già acquisite dalla Compagnia dovranno essere restituite al Contraente liquidando il \nrelativo importo tramite bonifico bancario su un conto corrente bancario indicato dal Contraente e \nallo stesso intestato.\nIn nessun caso l'Impresa di Assicurazione sarà tenuta a fornire alcuna copertura assicurativa, \nsoddisfare richieste di risarcimento o garantire alcuna indennità in virtù del presente contratto, \nqualora tale copertura, pagamento o indennità possa esporla a divieti, sanzioni economiche o \nrestrizioni ai sensi di Risoluzioni delle Nazioni Unite o sanzioni economiche o commerciali, leggi o \nnorme dell\x00Unione Europea, del Regno Unito o degli Stati Uniti d\x00America, ove applicabili in Italia.\nQuando e come devo pagare?\nPremio\nIl Contratto prevede il pagamento di un Premio Unico il cui ammontare minimo è pari a 2.500,00 \neuro, incrementabile di importo pari o in multiplo di 50,00 euro, da corrispondersi in un\x00unica \nsoluzione prima della conclusione del Contratto.\nNon è prevista la possibilità di effettuare versamenti aggiuntivi successivi.\nIl versamento del Premio Unico può essere effettuato mediante addebito su conto corrente \nbancario, indicato nel Modulo di Proposta, previa autorizzazione del titolare del conto corrente.\nIl pagamento dei Premio Unico può essere eseguito mediante addebito su conto corrente bancario, \nprevia autorizzazione, intestato al Contraente oppure tramite bonifico bancario sul conto corrente \ndell\x00Impresa di Assicurazione.\nRimborso\nIl rimborso del Premio Versato è previsto nel caso in cui il Contraente decida di revocare la proposta \nfinché il contratto non è concluso.\nSconti\nAl verificarsi di condizioni particolari ed eccezionali che potrebbero riguardare \x03 a titolo \nesemplificativo ma non esaustivo \x03 il Contraente e la relativa situazione assicurativo/finanziaria, \nl\x00ammontare del Premio pagato e gli investimenti selezionati dal Contraente, l\x00Impresa di \nAssicurazione si riserva la facoltà di applicare sconti sugli oneri previsti dal contratto, concordando \ntale agevolazione con il Contraente.\nQuando comincia la copertura e quando finisce?\nDurata\nIl Contratto ha una durata massima pari a 5 anni 11 mesi e 27 giorni, sino alla data di 
scadenza \n(11/04/2029, la \x01data di scadenza\x02).\nSospensione\nNon sono possibili delle sospensioni della copertura assicurativa\nDIP aggiuntivo IBIP - Progetto Protetto New - Global Dividends - Pag. 5 di 9\n", + 'Come posso revocare la proposta, recedere dal contratto o risolvere il contratto? \nRevoca\nLa Proposta di assicurazione può essere revocata fino alle ore 24:00 del giorno in cui il Contratto è \nconcluso. In tal caso, l\x00Impresa di Assicurazione restituirà al Contraente il Premio pagato entro \ntrenta giorni dal ricevimento della comunicazione di Revoca.\nRecesso\nIl Contraente può recedere dal Contratto entro trenta giorni dalla sua conclusione. Il Recesso dovrà \nessere comunicato all\x00Impresa di Assicurazione mediante lettera raccomandata con avviso di \nricevimento.\nL\x00Impresa di Assicurazione, entro trenta giorni dal ricevimento della comunicazione relativa al \nRecesso, rimborserà al Contraente il Controvalore delle Quote attribuite al Contratto alla data di \nricevimento della richiesta di recesso incrementato dai caricamenti, ove previsti, e dedotte \neventuali agevolazioni.\nRisoluzione\nLa risoluzione del contratto è prevista tramite la richiesta di riscatto totale esercitabile in qualsiasi \nmomento della durata contrattuale\nSono previsti riscatti o riduzioni? Si\n no\nValori di\nriscatto e\nriduzione\nA condizione che siano trascorsi almeno 30 giorni dalla Data di Decorrenza (conclusione del \nContratto) e fino all\x00ultimo Giorno Lavorativo della terzultima settimana precedente la data di \nscadenza, il Contraente può riscuotere, interamente o parzialmente, il Valore di Riscatto. In caso di \nRiscatto totale, la liquidazione del Valore di Riscatto pone fine al Contratto con effetto dalla data di \nricezione della richiesta.\nL\x00importo che sarà corrisposto al Contraente in caso di Riscatto sarà pari al Controvalore delle \nQuote del Fondo Interno attribuite al Contratto alla data di Riscatto, al netto dei costi di Riscatto.\nIn caso di Riscatto, ai fini del calcolo del Valore Unitario della Quota, si farà riferimento alla Data di \nValorizzazione della settimana successiva alla data in cui la comunicazione di Riscatto del \nContraente perviene all\x00Impresa di Assicurazione, corredata di tutta la documentazione, al netto dei \ncosti di Riscatto, salvo il verificarsi di Eventi di Turbativa.\nIl Contraente assume il rischio connesso all\x00andamento negativo del valore delle Quote e, pertanto, \nesiste la possibilità di ricevere un ammontare inferiore all\x00investimento finanziario.\nIn caso di Riscatto del Contratto (totale o parziale), l\x00Impresa di Assicurazione non offre alcuna \ngaranzia finanziaria di rendimento minimo e pertanto il Contraente sopporta il rischio di ottenere un \nValore Unitario di Quota inferiore al 100% del Valore Unitario di Quota del Fondo Interno registrato \nalla Data di Istituzione in considerazione dei rischi connessi alla fluttuazione del valore di mercato \ndegli attivi in cui investe, direttamente o indirettamente, il Fondo Interno.\nRichiesta di\ninformazioni\nPer eventuali richieste di informazioni sul valore di riscatto, il Contraente può rivolgersi alla \nCompagnia AXA MPS Financial DAC \x03 Wolfe Tone House, Wolfe Tone Street, Dublin, DO1 HP90 \x03 \nIreland, Numero Verde 800.231.187, e-mail: supporto@axa-mpsfinancial.ie\nA chi è rivolto questo prodotto?\nL\x00investitore al dettaglio a cui è destinato il prodotto varia in funzione dell\x00opzione di investimento sottostante e \nillustrata nel relativo 
KID.\nIl prodotto è indirizzato a Contraenti persone fisiche e persone giuridiche a condizione che il Contraente (persona fisica) \ne l\x00Assicurato, al momento della sottoscrizione stessa, abbiano un\x00età compresa tra i 18 anni e i 85 anni.\nQuali costi devo sostenere?\nPer l\x00informativa dettagliata sui costi fare riferimento alle indicazioni del KID.\nIn aggiunta rispetto alle informazioni del KID , indicare i seguenti costi a carico del contraente.\nSpese di emissione:\nIl Contratto prevede una spesa fissa di emissione pari a 25 Euro.\nLa deduzione di tale importo avverrà contestualmente alla deduzione del premio.\nDIP aggiuntivo IBIP - Progetto Protetto New - Global Dividends - Pag. 6 di 9\n', + "L\x00obiettivo di protezione è da considerarsi al netto delle spese di emissione.\nCosti per riscatto\nIl Riscatto (totale e parziale) prevede un costo che varia in funzione della data di richiesta e secondo le percentuali di \nseguito indicate:\n1°Anno 5,00%; 2°Anno 3,50%; 3°Anno 2,00%; dal quarto anno in poi 0%;\nCosti di intermediazione\nla quota parte massima percepita dall\x00intermediario con riferimento all\x00intero flusso commissionale relativo al prodotto \nè pari al 35,17%.\nQuali sono i rischi e qual è il potenziale rendimento?\nSia con riferimento alla prestazione in caso di vita dell\x00assicurato, sia con riferimento al capitale caso morte riferito ai \nFondi Assicurativi Interni, la Compagnia non presta alcuna garanzia di rendimento minimo o di conservazione del \ncapitale. Pertanto il controvalore della prestazione della Compagnia potrebbe essere inferiore all\x00importo dei premi \nversati, in considerazione dei rischi connessi alla fluttuazione del valore di mercato degli attivi in cui investe, \ndirettamente o indirettamente il Fondo Interno.\nCOME POSSO PRESENTARE I RECLAMI E RISOLVERE LE CONTROVERSIE?\nAll\x00IVASS\nNel caso in cui il reclamo presentato all\x00impesa assicuratrice abbia esito insoddisfacente o risposta \ntardiva, è possibile rivolgersi all\x00IVASS, Via del Quirinale, 21 - 00187 Roma, fax 06.42133206, Info \nsu: www.ivass.it.\nEventuali reclami potranno inoltre essere indirizzati all\x00Autorità Irlandese competente al seguente \nindirizzo:\nFinancial Services Ombudsman 3rd Floor, Lincoln House, Lincoln Place, Dublin 2, D02 VH29 \x03 \nIreland\nPRIMA DI RICORRERE ALL\x00AUTORITÀ GIUDIZIARIA è possibile, in alcuni casi necessario, \navvalersi di sistemi alternativi di risoluzione delle controversie, quali:\nMediazione\nInterpellando un Organismo di Mediazione tra quelli presenti nell'elenco del Ministero della \nGiustizia, consultabile sul sito www.giustizia.it (Legge 9/8/2013, n.98)\nNegoziazione \nassistita\nTramite richiesta del proprio avvocato all\x00impresa\nAltri Sistemi \nalternative di \nrisoluzione delle \ncontroversie\nEventuali reclami relativi ad un contratto o servizio assicurativo nei confronti dell'Impresa di \nassicurazione o dell'Intermediario assicurativo con cui si entra in contatto, nonché qualsiasi \nrichiesta di informazioni, devono essere preliminarmente presentati per iscritto (posta, email) ad \nAXA MPS Financial DAC - Ufficio Reclami secondo seguenti modalità:\nEmail: reclami@axa-mpsfinancial.ie\nPosta: AXA MPS Financial DAC - Ufficio Reclami\nWolfe Tone House, Wolfe Tone Street,\nDublin DO1 HP90 - Ireland\nNumero Verde 800.231.187\navendo cura di indicare:\n-nome, cognome, indirizzo completo e recapito telefonico del reclamante;\n-numero della polizza e nominativo del contraente;\n-breve ed esaustiva descrizione del motivo 
di lamentela;\n-ogni altra indicazione e documento utile per descrivere le circostanze.\nSarà cura della Compagnia fornire risposta entro 45 giorni dalla data di ricevimento del reclamo, \ncome previsto dalla normativa vigente.\nNel caso di mancato o parziale accoglimento del reclamo, nella risposta verrà fornita una chiara \nspiegazione della posizione assunta dalla Compagnia in relazione al reclamo stesso ovvero della \nsua mancata risposta.\nQualora il reclamante non abbia ricevuto risposta oppure ritenga la stessa non soddisfacente, \nprima di rivolgersi all'Autorità Giudiziaria, può scrivere all'IVASS (Via del Quirinale, 21 - 00187 \nRoma; fax 06.42.133.745 o 06.42.133.353, ivass@pec.ivass.it) fornendo copia del reclamo già \nDIP aggiuntivo IBIP - Progetto Protetto New - Global Dividends - Pag. 7 di 9\n", + "inoltrato all'impresa ed il relativo riscontro anche utilizzando il modello presente nel sito dell'IVASS \nalla sezione per il Consumatore - come presentare un reclamo.\nEventuali reclami potranno inoltre essere indirizzati all'Autorità Irlandese competente al seguente \nindirizzo:\nFinancial Services Ombudsman\n3rd Floor, Lincoln House,\nLincoln Place, Dublin 2, D02 VH29 Ireland\nIl reclamante può ricorrere ai sistemi alternativi per la risoluzione delle controversie previsti a livello \nnormativo o convenzionale, quali:\n\x04 Mediazione: (Decreto Legislativo n.28/2010 e ss.mm.) puo' essere avviata presentando istanza \nad un Organismo di Mediazione tra quelle presenti nell'elenco del Ministero della Giustizia, \nconsultabile sul sito www.giustizia.it. La legge ne prevede l'obbligatorieta' nel caso in cui si intenda \nesercitare in giudizio i propri diritti in materia di contratti assicurativi o finanziari e di risarcimento \nda responsabilita' medica e sanitaria, costituendo condizione di procedibilita' della domanda.\n\x04 Negoziazione Assistita: (Legge n.162/2014) tramite richiesta del proprio Avvocato all'Impresa. E' \nun accordo mediante il quale le parti convengono di cooperare in buona fede e con lealta' per \nrisolvere in via amichevole la controversia tramite l'assistenza di avvocati. Fine del procedimento e' \nla composizione bonaria della lite, con la sottoscrizione delle parti - assistite dai rispettivi difensori - \ndi un accordo detto convenzione di negoziazione. Viene prevista la sua obbligatorieta' nel caso in \ncui si intenda esercitare in giudizio i propri diritti per ogni controversia in materia di risarcimento del \ndanno da circolazione di veicoli e natanti, ovverosia e' condizione di procedibilita' per l'eventuale \ngiudizio civile. Invece e' facoltativa per ogni altra controversia in materia di risarcimenti o di contratti \nassicurativi o finanziari.\nIn caso di controversia relativa alla determinazione dei danni si puo' ricorrere alla perizia \ncontrattuale prevista dalle Condizioni di Assicurazione per la risoluzione di tale tipologia di \ncontroversie. L'istanza di attivazione della perizia contrattuale dovra' essere indirizzata alla \nCompagnia all' indirizzo\nAXA MPS Financial DAC \nWolfe Tone House, Wolfe Tone Street\nDublin DO1 HP90 - Ireland\nPer maggiori informazioni si rimanda a quanto presente nell'area Reclami del sito \nwww.axa-mpsfinancial.ie. 
\nPer la risoluzione delle liti transfrontaliere è possibile presentare reclamo all'IVASS o direttamente \nal sistema estero http://ec.europa.eu/internal_market/fin-net/members_en.htm competente \nchiedendo l'attivazione della procedura FIN-NET.\nEventuali reclami relativi la mancata osservanza da parte della Compagnia, degli intermediari e dei \nperiti assicurativi, delle disposizioni del Codice delle assicurazioni, delle relative norme di \nattuazione nonché delle norme sulla commercializzazione a distanza dei prodotti assicurativi \npossono essere presentati direttamente all'IVASS, secondo le modalità sopra indicate.\nSi ricorda che resta salva la facoltà di adire l'autorità giudiziaria.\nREGIME FISCALE\nTrattamento \nfiscale applicabile \nal contratto\nLe seguenti informazioni sintetizzano alcuni aspetti del regime fiscale applicabile al Contratto, ai \nsensi della legislazione tributaria italiana e della prassi vigente alla data di pubblicazione del \npresente documento, fermo restando che le stesse rimangono soggette a possibili cambiamenti che \npotrebbero avere altresì effetti retroattivi. Quanto segue non intende rappresentare un\x00analisi \nesauriente di tutte le conseguenze fiscali del Contratto. I Contraenti sono tenuti a consultare i loro \nconsulenti in merito al regime fiscale proprio del Contratto.\nTasse e imposte\nLe imposte e tasse presenti e future applicabili per legge al Contratto sono a carico del Contraente \no dei Beneficiari e aventi diritto e non è prevista la corresponsione al Contraente di alcuna somma \naggiuntiva volta a compensare eventuali riduzioni dei pagamenti relativi al Contratto.\nTassazione delle somme corrisposte a soggetti non esercenti attività d\x00impresa\n1. In caso di decesso dell\x00Assicurato\nLe somme corrisposte dall\x00Impresa di Assicurazione in caso di decesso dell\x00Assicurato non sono \nsoggette a tassazione IRPEF in capo al percettore e sono esenti dall\x00imposta sulle successioni. Si \nricorda tuttavia che, per effetto della legge 23 dicembre 2014 n. 190 (c.d.\x02Legge di Stabilità\x02), i \nDIP aggiuntivo IBIP - Progetto Protetto New - Global Dividends - Pag. 8 di 9\n", + 'capitali percepiti in caso di morte, a decorrere dal 1 gennaio 2015, in dipendenza di contratti di \nassicurazione sulla vita, a copertura del rischio demografico, sono esenti dall\x00imposta sul reddito \ndelle persone fisiche.\n2. In caso di Riscatto totale o di Riscatto parziale.\nLe somme corrisposte dall\x05Impresa di Assicurazione in caso di Riscatto totale sono soggette ad \nun\x00imposta sostitutiva dell\x00imposta sui redditi nella misura prevista di volta in volta dalla legge. Tale \nimposta, al momento della redazione del presente documento, è pari al 26% sulla differenza \n(plusvalenza) tra il capitale maturato e l\x00ammontare dei premi versati (al netto di eventuali riscatti \nparziali), con l\x00eccezione dei proventi riferibili ai titoli di stato italiani ed equiparati (Paesi facenti \nparte della white list), per i quali l\x00imposta è pari al 12,5%.\nIn caso di Riscatto parziale, ai fini del computo del reddito di capitale da assoggettare alla predetta \nimposta sostitutiva, l\x00ammontare dei premi va rettificato in funzione del rapporto tra il capitale \nerogato ed il valore economico della polizza alla data del Riscatto parziale.\n3. 
In caso di Recesso\nLe somme corrisposte in caso di Recesso sono soggette all\x00imposta sostitutiva delle imposte sui \nredditi nella misura e con gli stessi criteri indicati per il Riscatto totale del Contratto.\nTassazione delle somme corrisposte a soggetti esercenti attività d\x00impresa\nLe somme corrisposte a soggetti che esercitano l\x00attività d\x00impresa non costituiscono redditi di \ncapitale, bensì redditi d\x00impresa. Su tali somme l\x00Impresa non applica l\x00imposta sostitutiva di cui \nall\x00art. 26-ter del D.P.R. 29 settembre 1973, n. 600.\nSe le somme sono corrisposte a persone fisiche o enti non commerciali in relazione a contratti \nstipulati nell\x00ambito dell\x00attività commerciale, l\x00Impresa non applica l\x00imposta sostitutiva, qualora gli \ninteressati presentino una dichiarazione in merito alla sussistenza di tale requisito.\nL\x00IMPRESA HA L\x00OBBLIGO DI TRASMETTERTI, ENTRO IL 31 MAGGIO DI OGNI ANNO, IL DOCUMENTO \nUNICO DI RENDICONTAZIONE ANNUALE DELLA TUA POSIZIONE ASSICURATIVA\nPER QUESTO CONTRATTO L\x00IMPRESA NON DISPONE DI UN\x00AREA INTERNET DISPOSITIVA RISERVATA \nAL CONTRAENTE (c.d. HOME INSURANCE), PERTANTO DOPO LA SOTTOSCRIZIONE NON POTRAI \nGESTIRE TELEMATICAMENTE IL CONTRATTO MEDESIMO.\nDIP aggiuntivo IBIP - Progetto Protetto New - Global Dividends - Pag. 9 di 9\n', + ] + # codespell:ignore-end + + path = os.path.abspath(f'{__file__}/../../tests/resources/test_3186.pdf') + fitz_doc = pymupdf.open(path) + texts = list() + for page in fitz_doc: + t = page.get_text() + texts.append(t) + assert texts == texts_expected, f'Unexpected output: {texts=}' + + +def test_3197(): + ''' + MuPDF's ActualText support fixes handling of test_3197.pdf. + ''' + path = os.path.abspath(f'{__file__}/../../tests/resources/test_3197.pdf') + + text_utf8_expected = [ + b'NYSE - Nasdaq Real Time Price \xe2\x80\xa2 USD\nFord Motor Company (F)\n12.14 -0.11 (-0.90%)\nAt close: 4:00 PM EST\nAfter hours: 7:43 PM EST\nAll numbers in thousands\nAnnual\nQuarterly\nDownload\nSummary\nNews\nChart\nConversations\nStatistics\nHistorical Data\nProfile\nFinancials\nAnalysis\nOptions\nHolders\nSustainability\nInsights\nFollow\n12.15 +0.01 (+0.08%)\nIncome Statement\nBalance Sheet\nCash Flow\nSearch for news, symbols or companies\nNews\nFinance\nSports\nSign in\nMy Portfolio\nNews\nMarkets\nSectors\nScreeners\nPersonal Finance\nVideos\nFinance Plus\nBack to classic\nMore\n', + b'Related Tickers\nTTM\n12/31/2023\n12/31/2022\n12/31/2021\n12/31/2020\n14,918,000\n14,918,000\n6,853,000\n15,787,000\n24,269,000\n-17,628,000\n-17,628,000\n-4,347,000\n2,745,000\n-18,615,000\n2,584,000\n2,584,000\n2,511,000\n-23,498,000\n2,315,000\n25,110,000\n25,110,000\n25,340,000\n20,737,000\n25,935,000\n-8,236,000\n-8,236,000\n-6,866,000\n-6,227,000\n-5,742,000\n51,659,000\n51,659,000\n45,470,000\n27,901,000\n65,900,000\n-41,965,000\n-41,965,000\n-45,655,000\n-54,164,000\n-60,514,000\n-335,000\n-335,000\n-484,000\n--\n--\n6,682,000\n6,682,000\n-13,000\n9,560,000\n18,527,000\n \nYahoo Finance Plus Essential\naccess required.\nUnlock Access\nBreakdown\nOperating Cash\nFlow\nInvesting Cash\nFlow\nFinancing Cash\nFlow\nEnd Cash Position\nCapital Expenditure\nIssuance of Debt\nRepayment of Debt\nRepurchase of\nCapital Stock\nFree Cash Flow\n12/31/2020 - 6/1/1972\nGM\nGeneral Motors Compa\xe2\x80\xa6\n39.49 +1.23%\n\xc2\xa0\nRIVN\nRivian Automotive, Inc.\n15.39 -3.15%\n\xc2\xa0\nNIO\nNIO Inc.\n5.97 +0.17%\n\xc2\xa0\nSTLA\nStellantis N.V.\n25.63 +0.91%\n\xc2\xa0\nLCID\nLucid Group, Inc.\n3.7000 
+0.54%\n\xc2\xa0\nTSLA\nTesla, Inc.\n194.77 +0.52%\n\xc2\xa0\nTM\nToyota Motor Corporati\xe2\x80\xa6\n227.09 +0.14%\n\xc2\xa0\nXPEV\nXPeng Inc.\n9.08 +0.89%\n\xc2\xa0\nFSR\nFisker Inc.\n0.5579 -11.46%\n\xc2\xa0\nCopyright \xc2\xa9 2024 Yahoo.\nAll rights reserved.\nPOPULAR QUOTES\nTesla\nDAX Index\nKOSPI\nDow Jones\nS&P BSE SENSEX\nSPDR S&P 500 ETF Trust\nEXPLORE MORE\nCredit Score Management\nHousing Market\nActive vs. Passive Investing\nShort Selling\nToday\xe2\x80\x99s Mortgage Rates\nHow Much Mortgage Can You Afford\nABOUT\nData Disclaimer\nHelp\nSuggestions\nSitemap\n', + ] + + with pymupdf.open(path) as document: + for i, page in enumerate(document): + text = page.get_text() + #print(f'{i=}:') + text_utf8 = text.encode('utf8') + #print(f' {text_utf8=}') + #print(f' {text_utf8_expected[i]=}') + assert text_utf8 == text_utf8_expected[i] + + +def test_document_text(): + import platform + import time + + path = os.path.abspath(f'{__file__}/../../tests/resources/mupdf_explored.pdf') + concurrency = None + + def llen(texts): + l = 0 + for text in texts: + l += len(text) if isinstance(text, str) else text + return l + + results = dict() + _stats = 1 + + print('') + method = 'single' + t = time.time() + document = pymupdf.Document(path) + texts0 = pymupdf.get_text(path, _stats=_stats) + t0 = time.time() - t + print(f'{method}: {t0=} {llen(texts0)=}', flush=1) + + # Dummy run seems to avoid misleading stats with slow first run. + method = 'mp' + texts = pymupdf.get_text(path, concurrency=concurrency, method=method, _stats=_stats) + + method = 'mp' + t = time.time() + texts = pymupdf.get_text(path, concurrency=concurrency, method=method, _stats=_stats) + t = time.time() - t + print(f'{method}: {concurrency=} {t=} ({t0/t:.2f}x) {llen(texts)=}', flush=1) + assert texts == texts0 + + if platform.system() != 'Windows': + method = 'fork' + t = time.time() + texts = pymupdf.get_text(path, concurrency=concurrency, method='fork', _stats=_stats) + t = time.time() - t + print(f'{method}: {concurrency=} {t=} ({t0/t:.2f}x) {llen(texts)=}', flush=1) + assert texts == texts0 + + if _stats: + pymupdf._log_items_clear() + + +def test_4524(): + path = os.path.abspath(f'{__file__}/../../tests/resources/mupdf_explored.pdf') + print('') + document = pymupdf.Document(path) + texts_single = pymupdf.get_text(path, method='single', pages=[1, 3, 5]) + texts_mp = pymupdf.get_text(path, method='mp', pages=[1, 3, 5]) + print(f'{len(texts_single)=}') + print(f'{len(texts_mp)=}') + assert texts_mp == texts_single + + +def test_3594(): + verbose = 0 + print() + d = pymupdf.open(os.path.abspath(f'{__file__}/../../tests/resources/test_3594.pdf')) + for i, p in enumerate(d.pages()): + text = p.get_text() + print(f'Page {i}:') + if verbose: + for line in text.split('\n'): + print(f' {line!r}') + print('='*40) + + +def test_3687(): + path1 = pymupdf.open(os.path.normpath(f'{__file__}/../../tests/resources/test_3687.epub')) + path2 = pymupdf.open(os.path.normpath(f'{__file__}/../../tests/resources/test_3687-3.epub')) + for path in path1, path2: + print(f'Looking at {path=}.') + with pymupdf.open(path) as document: + page = document[0] + text = page.get_text("text") + print(f'{text=!s}') + wt = pymupdf.TOOLS.mupdf_warnings() + print(f'{wt=}') + assert wt == 'unknown epub version: 3.0' + +def test_3705(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3705.pdf') + def get_all_page_from_pdf(document, last_page=None): + if last_page: + document.select(list(range(0, last_page))) + if document.page_count > 30: + 
document.select(list(range(0, 30))) + return iter(page for page in document) + + filename = os.path.basename(path) + + doc = pymupdf.open(path) + texts0 = list() + for i, page in enumerate(get_all_page_from_pdf(doc)): + text = page.get_text() + print(i, text) + texts0.append(text) + + texts1 = list() + doc = pymupdf.open(path) + for page in doc: + if page.number >= 30: # leave the iterator immediately + break + text = page.get_text() + texts1.append(text) + + assert texts1 == texts0 + + wt = pymupdf.TOOLS.mupdf_warnings() + if pymupdf.mupdf_version_tuple < (1, 27): + assert wt == 'Actualtext with no position. Text may be lost or mispositioned.\n... repeated 434 times...' + else: + expected = 'format error: No common ancestor in structure tree\nstructure tree broken, assume tree is missing' + expected = '\n'.join([expected] * 56) + assert wt == expected + +def test_3650(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3650.pdf') + doc = pymupdf.Document(path) + blocks = doc[0].get_text("blocks") + t = [block[4] for block in blocks] + print(f'{t=}') + assert t == [ + 'RECUEIL DES ACTES ADMINISTRATIFS\n', + 'n° 78 du 28 avril 2023\n', + ] + +def test_4026(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4026.pdf') + with pymupdf.open(path) as document: + page = document[4] + blocks = page.get_text('blocks') + for i, block in enumerate(blocks): + print(f'block {i}: {block}') + assert len(blocks) == 5 + +def test_3725(): + # This currently just shows the extracted text. We don't check it is as expected. + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3725.pdf') + with pymupdf.open(path) as document: + page = document[0] + text = page.get_text() + if 0: + print(textwrap.indent(text, ' ')) + +def test_4147(): + print() + items = list() + for expect_visible, path in ( + (False, os.path.normpath(f'{__file__}/../../tests/resources/test_4147.pdf')), + (True, os.path.normpath(f'{__file__}/../../tests/resources/symbol-list.pdf')), + ): + print(f'{expect_visible=} {path=}') + with pymupdf.open(path) as document: + page = document[0] + text = page.get_text('rawdict') + for block in text['blocks']: + if block['type'] == 0: + #print(f' block') + for line in block['lines']: + #print(f' line') + for span in line['spans']: + #print(f' span') + if pymupdf.mupdf_version_tuple >= (1, 25, 2): + #print(f' span: {span["flags"]=:#x} {span["char_flags"]=:#x}') + if expect_visible: + assert span['char_flags'] & pymupdf.mupdf.FZ_STEXT_FILLED + else: + assert not (span['char_flags'] & pymupdf.mupdf.FZ_STEXT_FILLED) + assert not (span['char_flags'] & pymupdf.mupdf.FZ_STEXT_STROKED) + else: + #print(f' span: {span["flags"]=:#x}') + assert 'char_flags' not in span + # Check commit `add 'bidi' to span dict, add 'synthetic' to char dict.` + assert span['bidi'] == 0 + for ch in span['chars']: + assert isinstance(ch['synthetic'], bool) + + +def test_4139(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4139.pdf') + flags = (0 + | pymupdf.TEXT_PRESERVE_IMAGES + | pymupdf.TEXT_PRESERVE_WHITESPACE + | pymupdf.TEXT_USE_CID_FOR_UNKNOWN_UNICODE + ) + with pymupdf.open(path) as document: + page = document[0] + dicts = page.get_text('dict', flags=flags, sort=True) + seen = set() + for b_ctr, b in enumerate(dicts['blocks']): + for l_ctr, l in enumerate(b.get('lines', [])): + for s_ctr, s in enumerate(l['spans']): + color = s.get('color') + if color is not None and color not in seen: + seen.add(color) + print(f"B{b_ctr}.L{l_ctr}.S{s_ctr}: {color=} 
{hex(color)=} {s=}") + assert color == 0, f'{s=}' + assert s['alpha'] == 255 + + +def test_4245(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4245.pdf') + with pymupdf.open(path) as document: + page = document[0] + regions = page.search_for('Bart Simpson') + print(f'{regions=}') + page.add_highlight_annot(regions) + with pymupdf.open(path) as document: + page = document[0] + regions = page.search_for('Bart Simpson') + for region in regions: + highlight = page.add_highlight_annot(region) + highlight.update() + pixmap = page.get_pixmap() + path_out = os.path.normpath(f'{__file__}/../../tests/resources/test_4245_out.png') + pixmap.save(path_out) + + path_expected = os.path.normpath(f'{__file__}/../../tests/resources/test_4245_expected.png') + rms = gentle_compare.pixmaps_rms(path_expected, pixmap) + pixmap_diff = gentle_compare.pixmaps_diff(path_expected, pixmap) + path_diff = os.path.normpath(f'{__file__}/../../tests/resources/test_4245_diff.png') + pixmap_diff.save(path_diff) + print(f'{rms=}') + if pymupdf.mupdf_version_tuple < (1, 25, 5): + # Prior to fix for mupdf bug 708274. + assert 0.1 < rms < 0.2 + else: + assert rms < 0.01 + + +def test_4180(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4180.pdf') + with pymupdf.open(path) as document: + page = document[0] + regions = page.search_for('Reference is made') + for region in regions: + page.add_redact_annot(region, fill=(0, 0, 0)) + page.apply_redactions() + pixmap = page.get_pixmap() + path_out = os.path.normpath(f'{__file__}/../../tests/resources/test_4180_out.png') + pixmap.save(path_out) + + path_expected = os.path.normpath(f'{__file__}/../../tests/resources/test_4180_expected.png') + rms = gentle_compare.pixmaps_rms(path_expected, pixmap) + pixmap_diff = gentle_compare.pixmaps_diff(path_expected, pixmap) + path_diff = os.path.normpath(f'{__file__}/../../tests/resources/test_4180_diff.png') + pixmap_diff.save(path_diff) + print(f'{rms=}') + if pymupdf.mupdf_version_tuple < (1, 25, 5): + # Prior to fix for mupdf bug 708274. + assert 0.2 < rms < 0.3 + else: + assert rms < 0.01 + + +def test_4182(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4182.pdf') + with pymupdf.open(path) as document: + page = document[0] + dict_ = page.get_text('dict') + linelist = [] + for block in dict_['blocks']: + if block['type'] == 0: + paranum = block['number'] + if 'lines' in block: + for line in block.get('lines', ()): + for span in line['spans']: + if span['text'].strip(): + page.draw_rect(span['bbox'], color=(1, 0, 0)) + linelist.append([paranum, span['bbox'], repr(span['text'])]) + pixmap = page.get_pixmap() + path_out = os.path.normpath(f'{__file__}/../../tests/resources/test_4182_out.png') + pixmap.save(path_out) + if platform.system() != 'Windows': # Output on Windows can fail due to non-utf8 stdout. + for l in linelist: + print(l) + path_expected = os.path.normpath(f'{__file__}/../../tests/resources/test_4182_expected.png') + pixmap_diff = gentle_compare.pixmaps_diff(path_expected, pixmap) + path_diff = os.path.normpath(f'{__file__}/../../tests/resources/test_4182_diff.png') + pixmap_diff.save(path_diff) + rms = gentle_compare.pixmaps_rms(path_expected, pixmap) + print(f'{rms=}') + if pymupdf.mupdf_version_tuple < (1, 25, 5): + # Prior to fix for mupdf bug 708274. 
+ assert 3 < rms < 3.5 + else: + assert rms < 0.01 + + +def test_4179(): + if os.environ.get('PYMUPDF_USE_EXTRA') == '0': + # Looks like Python code doesn't behave same as C++, probably because + # of the code not being correct for Python's native unicode strings. + # + print(f'test_4179(): Not running with PYMUPDF_USE_EXTRA=0 because known to fail.') + return + # We check that using TEXT_ACCURATE_BBOXES gives the correct boxes. But + # this also requires that we disable PyMuPDF quad corrections. + # + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4179.pdf') + + # Disable anti-aliasing to avoid our drawing of multiple identical bboxes + # (from normal/accurate bboxes) giving slightly different results. + aa = pymupdf.mupdf.fz_aa_level() + uqc = pymupdf._globals.skip_quad_corrections + pymupdf.TOOLS.set_aa_level(0) + pymupdf.TOOLS.unset_quad_corrections(True) + assert pymupdf._globals.skip_quad_corrections + try: + with pymupdf.open(path) as document: + page = document[0] + + char_sqrt = b'\xe2\x88\x9a'.decode() + + # Search with defaults. + bboxes_search = page.search_for(char_sqrt) + assert len(bboxes_search) == 1 + print(f'bboxes_search[0]:\n {bboxes_search[0]!r}') + page.draw_rect(bboxes_search[0], color=(1, 0, 0)) + rms = gentle_compare.rms(bboxes_search[0], (250.0489959716797, 91.93604278564453, 258.34783935546875, 101.34073638916016)) + assert rms < 0.01 + + # Search with TEXT_ACCURATE_BBOXES. + bboxes_search_accurate = page.search_for( + char_sqrt, + flags = (0 + | pymupdf.TEXT_DEHYPHENATE + | pymupdf.TEXT_PRESERVE_WHITESPACE + | pymupdf.TEXT_PRESERVE_LIGATURES + | pymupdf.TEXT_MEDIABOX_CLIP + | pymupdf.TEXT_ACCURATE_BBOXES + ), + ) + assert len(bboxes_search_accurate) == 1 + print(f'bboxes_search_accurate[0]\n {bboxes_search_accurate[0]!r}') + page.draw_rect(bboxes_search_accurate[0], color=(0, 1, 0)) + rms = gentle_compare.rms(bboxes_search_accurate[0], (250.0489959716797, 99.00948333740234, 258.34783935546875, 108.97208404541016)) + assert rms < 0.01 + + # Iterate with TEXT_ACCURATE_BBOXES. + bboxes_iterate_accurate = list() + dict_ = page.get_text( + 'rawdict', + flags = pymupdf.TEXT_ACCURATE_BBOXES, + ) + linelist = [] + for block in dict_['blocks']: + if block['type'] == 0: + if 'lines' in block: + for line in block.get('lines', ()): + for span in line['spans']: + for ch in span['chars']: + if ch['c'] == char_sqrt: + bbox_iterate_accurate = ch['bbox'] + bboxes_iterate_accurate.append(bbox_iterate_accurate) + print(f'bbox_iterate_accurate:\n {bbox_iterate_accurate!r}') + page.draw_rect(bbox_iterate_accurate, color=(0, 0, 1)) + + assert bboxes_search_accurate != bboxes_search + assert bboxes_iterate_accurate == bboxes_search_accurate + pixmap = page.get_pixmap() + + path_out = os.path.normpath(f'{__file__}/../../tests/resources/test_4179_out.png') + pixmap.save(path_out) + path_expected = os.path.normpath(f'{__file__}/../../tests/resources/test_4179_expected.png') + rms = gentle_compare.pixmaps_rms(path_expected, pixmap) + pixmap_diff = gentle_compare.pixmaps_diff(path_expected, pixmap) + path_out_diff = os.path.normpath(f'{__file__}/../../tests/resources/test_4179_diff.png') + pixmap_diff.save(path_out_diff) + print(f'Have saved to: {path_out_diff=}') + print(f'{rms=}') + if pymupdf.mupdf_version_tuple < (1, 25, 5): + # Prior to fix for mupdf bug 708274, our rects are rendered slightly incorrectly. 
+ assert 3.5 < rms < 4.5 + else: + assert rms < 0.01 + + finally: + pymupdf.TOOLS.set_aa_level(aa) + pymupdf.TOOLS.unset_quad_corrections(uqc) + + +def test_extendable_textpage(): + + # 2025-01-28: + # + # We can create a pdf with two pages whose text is adjacent when stitched + # together vertically: + # + # Page 1: + # + # aaaa + # + # bbbb + # cccc + # + # dddd + # + # Page 2: + # + # eeee + # + # ffff + # gggg + # + # hhhh + # + # + # Create a textpage for both of these pages. Then when extracting text, + # we need to get (specifically the `dddd` and `eeee` sequences need to be + # treated as the same block): + # + # aaaa + # + # bbbb + # cccc + # + # dddd + # eeee + # + # ffff + # gggg + # + # hhhh + # + print() + + path = os.path.normpath(f'{__file__}/../../tests/test_extendable_textpage.pdf') + with pymupdf.open(filetype='pdf') as document: + document.new_page() + document.new_page() + page0 = document[0] + page1 = document[1] + y = 100 + line_height = 9.6 + for i in range(4): + page0.insert_text((100, y+line_height), 'abcd'[i] * 16) + page1.insert_text((100, y+line_height), 'efgh'[i] * 16) + y += line_height + if i%2 == 0: + y += line_height + rect = pymupdf.mupdf.FzRect(100, 100, 200, y) + document[0].draw_rect(rect, (1, 0, 0)) + document[1].draw_rect(rect, (1, 0, 0)) + document.save(path) + + # Create a stext page for the text regions in both pages of our document, + # using direct calls to MuPDF. + # + + with pymupdf.Document(path) as document: + + # Notes: + # + # We need to reuse the stext device for second page. Otherwise if we + # create a new device, the first text in second page will always be in + # a new block, because pen position for new device is (0, 0) and this + # will usually be treated as a paragraph gap to the first text. + # + # At the moment we use infinite mediabox when creating the + # fz_stext_page. I don't know what a non-infinite mediabox would be + # useful for. + # + # FZ_STEXT_CLIP_RECT isn't useful at the moment, because we would need + # to modify it to be in stext pagae coordinates (i.e. adding ctm.f + # to y0 and y1) when we append the second page. But it's internal + # data and there's no api to modify it. So for now we don't specify + # FZ_STEXT_CLIP_RECT when creating the stext device, so we always + # include each page's entire contents. + # + + # We use our knowledge of the text rect in each page to manipulate ctm + # so that the stext contains text starting at (0, 0) and extending + # downwards. + # + y = 0 + cookie = pymupdf.mupdf.FzCookie() + + stext_page = pymupdf.mupdf.FzStextPage( + pymupdf.mupdf.FzRect(pymupdf.mupdf.FzRect.Fixed_INFINITE), # mediabox + ) + stext_options = pymupdf.mupdf.FzStextOptions() + #stext_options.flags |= pymupdf.mupdf.FZ_STEXT_CLIP_RECT + #stext_options.clip = rect.internal() + device = pymupdf.mupdf.fz_new_stext_device(stext_page, stext_options) + + # Add first page to stext_page at (0, y), and update for the next + # page. + page = document[0] + ctm = pymupdf.mupdf.FzMatrix(1, 0, 0, 1, -rect.x0, -rect.y0 + y) + pymupdf.mupdf.fz_run_page(page.this, device, ctm, cookie) + y += rect.y1 - rect.y0 + + # Add second page to stext_page at (0, y), and update for the next + # page. + page = document[1] + ctm = pymupdf.mupdf.FzMatrix(1, 0, 0, 1, -rect.x0, -rect.y0 + y) + pymupdf.mupdf.fz_run_page(page.this, device, ctm, cookie) + y += rect.y1 - rect.y0 + + # We've finished adding text to stext_page. + pymupdf.mupdf.fz_close_device(device) + + # Create a pymupdf.TextPage() for so we can use + # text_page.extractDICT() etc. 
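+        # (The TextPage constructor here appears to simply wrap the already-populated
+        # FzStextPage built above; no further text extraction is performed at this point.)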
+ text_page = pymupdf.TextPage(stext_page) + + # Read text from stext_page using text_page.extractDICT(). + print(f'Using text_page.extractDICT().') + print(f'{text_page.this.m_internal.mediabox=}') + d = text_page.extractDICT(sort=True) + y0_prev = None + pno = 0 + ydelta = 0 + for block in d['blocks']: + print(f'block {block["bbox"]=}') + for line in block['lines']: + print(f' line {line["bbox"]=}') + for span in line['spans']: + print(f' span {span["bbox"]=}') + bbox = span['bbox'] + x0, y0, x1, y1 = bbox + dy = y0 - y0_prev if y0_prev else 0 + y0_prev = y0 + print(f' {dy=: 5.2f} height={y1-y0:.02f} {x0:.02f} {y0:.02f} {x1:.02f} {y1:.02f} {span["text"]=}') + if 'eee' in span['text']: + pno = 1 + ydelta = rect.y1 - rect.y0 + y0 -= ydelta + y1 -= ydelta + # Debugging - add green lines on original document + # translating final blocks info into original coors. + document[pno].draw_rect((x0, y0, x1, y1), (0, 1, 0)) + + print('\n\n') + + print(f'Using text_page.extractText()') + text = text_page.extractText(True) + print(f'{text}') + + print('\n\n') + print(f'Using extractBLOCKS') + text = list() + for x0, y0, x1, y1, line, no, type_ in text_page.extractBLOCKS(): + print(f'block:') + print(f' bbox={x0, y0, x1, y1} {no=}') + print(f' {line=}') + text.append(line) + + print("\n\n") + print(f'extractBLOCKS joined by newlines:') + print('\n'.join(text)) + + # This checks that lines before/after pages break are treated as a + # single paragraph. + assert text == [ + 'aaaaaaaaaaaaaaaa\n', + 'bbbbbbbbbbbbbbbb\ncccccccccccccccc\n', + 'dddddddddddddddd\neeeeeeeeeeeeeeee\n', + 'ffffffffffffffff\ngggggggggggggggg\n', + 'hhhhhhhhhhhhhhhh\n', + ] + + path3 = os.path.normpath(f'{__file__}/../../tests/test_extendable_textpage3.pdf') + document.save(path3) + + +def test_4363(): + print() + print(f'{pymupdf.version=}') + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4363.pdf') + n = 0 + texts = list() + with pymupdf.open(path) as document: + assert len(document) == 1 + page = document[0] + t = page.search_for('tour') + print(f'{t=}') + n += len(t) + text = page.get_text() + texts.append(text) + print(f'{n=}') + print(f'{len(texts)=}') + text = texts[0] + print('text:') + print(f'{text=}') + text_expected = ( + 'Deal Roadshow SiteTour\n' + 'We know your process. We know your standard.\n' + 'Professional Site Tour Video Productions for the Capital Markets.\n' + '1\n' + ) + if text != text_expected: + print(f'Expected:\n {text_expected!r}') + print(f'Found:\n {text!r}') + assert 0 + + +def test_4546(): + # This issue will not be fixed (in mupdf) because the test input is faulty. + # + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4546.pdf') + with pymupdf.open(path) as document: + page = document[0] + text = page.get_text()[:200] + + # We can't actually test with 1.23.5 because it uses `fitz.` not `pymupdf.`. + expected_1_23_5 = b'JOB No.: \nShipper (complete name and address) \xe5\x8f\x91\xe8\xb4\xa7\xe4\xba\xba(\xe5\x90\x8d\xe7\xa7\xb0\xe5\x8f\x8a\xe5\x9c\xb0\n\xe5\x9d\x80) \nSINORICH TRANSPORT LIMITED\nADD:7C,WEST BLDG.,ZHONGQU\nMANSION,211 ZHONGSHAN\nRD. SHANTOU,515041 CN\nTEL:0754-88570001 FAX:0754-88572709\nS/O No. '.decode() + + # This output is different from expected_1_23_5. + expected_mupdf_1_26_1 = b'JOB No.: Shipper (complete name and address) \xe5\x8f\x91\xe8\xb4\xa7\xe4\xba\xba(\xe5\x90\x8d\xe7\xa7\xb0\xe5\x8f\x8a\xe5\x9c\xb0\xe5\x9d\x80) Tel: Fax: \n \nS/O No. 
\xe6\x89\x98\xe8\xbf\x90\xe5\x8d\x95\xe5\x8f\xb7\xe7\xa0\x81 \nSINORICH TRANSPORT LIMITED \nSHIPPING ORDER \n\xe6\x89\x98\xe8\xbf\x90\xe5\x8d\x95 \n \xe5\xb8\x82\xe5\x9c\xba\xe9\x83\xa8: \n88570009 \n88577019 \n88'.decode() + + print(f'expected_1_23_5\n{textwrap.indent(expected_1_23_5, " ")}') + print(f'expected_mupdf_1_26_1\n{textwrap.indent(expected_mupdf_1_26_1, " ")}') + + print(f'{pymupdf.version=}') + print(f'text is:\n{textwrap.indent(text, " ")}') + print(f'{text=}') + print(f'{text.encode()=}') + + if pymupdf.mupdf_version_tuple >= (1, 26, 1): + assert text == expected_mupdf_1_26_1 + else: + print(f'No expected output for {pymupdf.mupdf_version_tuple=}') + + +def test_4503(): + # Check detection of strikeout text. Behaviour is improved with + # mupdf>=1.26.2, and fixed with mupdf>=1.26.3. + # + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4503.pdf') + span_0 = None + text_0 = None + print() + print(f'{pymupdf.mupdf_version_tuple=}') + with pymupdf.open(path) as document: + page = document[0] + # Specify TEXT_COLLECT_STYLES so we collect char_flags, which contains + # FZ_STEXT_STRIKEOUT etc. + # + text = page.get_text('rawdict', flags=pymupdf.TEXTFLAGS_RAWDICT | pymupdf.TEXT_COLLECT_STYLES) + for i, block in enumerate(text['blocks']): + print(f'block {i}:') + for j, line in enumerate(block['lines']): + print(f' line {j}:') + for k, span in enumerate(line['spans']): + text = '' + for char in span['chars']: + text += char['c'] + print(f' span {k}: {span["flags"]=:#x} {span["char_flags"]=:#x}: {text!r}') + if 'the right to request the state to review' in text: + span_0 = span + text_0 = text + assert span_0 + #print(f'{span_0=}') + print(f'{span_0["flags"]=:#x}') + print(f'{span_0["char_flags"]=:#x}') + print(f'{text_0=}') + strikeout = span_0['char_flags'] & pymupdf.mupdf.FZ_STEXT_STRIKEOUT + print(f'{strikeout=}') + + if pymupdf.mupdf_version_tuple >= (1, 26, 3): + assert strikeout, f'Expected bit 0 (FZ_STEXT_STRIKEOUT) to be set in {span_0["char_flags"]=:#x}.' + assert text_0 == 'the right to request the state to review and, if appropriate,' + elif pymupdf.mupdf_version_tuple >= (1, 26, 2): + # 2025-06-09: This is still incorrect - the span should include the + # following text 'and, if appropriate,'. It looks like following spans + # are: + # strikeout=0: 'and, ' + # strikeout=1: 'if ' + # strikeout=0: 'appropri' + # strikeout=1: 'ate,' + # + assert strikeout, f'Expected bit 0 (FZ_STEXT_STRIKEOUT) to be set in {span_0["char_flags"]=:#x}.' + assert text_0 == 'the right to request the state to review ' + else: + # Expecting the bug. + assert not strikeout, f'Expected bit 0 (FZ_STEXT_STRIKEOUT) to be unset in {span_0["char_flags"]=:#x}.' + assert text_0 == 'notice the right to request the state to review and, if appropriate,' diff -r 000000000000 -r 1d09e1dec1d9 tests/test_textsearch.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_textsearch.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,52 @@ +""" +"test_search1": +Search for some text on a PDF page, and compare content of returned hit +rectangle with the searched text. + +"test_search2": +Text search with 'clip' parameter - clip rectangle contains two occurrences +of searched text. Confirm search locations are inside clip. 
+""" + +import os + +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename1 = os.path.join(scriptdir, "resources", "2.pdf") +filename2 = os.path.join(scriptdir, "resources", "github_sample.pdf") +filename3 = os.path.join(scriptdir, "resources", "text-find-ligatures.pdf") + + +def test_search1(): + doc = pymupdf.open(filename1) + page = doc[0] + needle = "mupdf" + rlist = page.search_for(needle) + assert rlist != [] + for rect in rlist: + assert needle in page.get_textbox(rect).lower() + + +def test_search2(): + doc = pymupdf.open(filename2) + page = doc[0] + needle = "the" + clip = pymupdf.Rect(40.5, 228.31436157226562, 346.5226135253906, 239.5338592529297) + rl = page.search_for(needle, clip=clip) + assert len(rl) == 2 + for r in rl: + assert r in clip + + +def test_search3(): + """Ensure we find text whether or not it contains ligatures.""" + doc = pymupdf.open(filename3) + page = doc[0] + needle = "flag" + hits = page.search_for(needle, flags=pymupdf.TEXTFLAGS_SEARCH) + assert len(hits) == 2 # all occurrences found + hits = page.search_for( + needle, flags=pymupdf.TEXTFLAGS_SEARCH | pymupdf.TEXT_PRESERVE_LIGATURES + ) + assert len(hits) == 1 # only found text without ligatures diff -r 000000000000 -r 1d09e1dec1d9 tests/test_toc.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_toc.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,288 @@ +""" +* Verify equality of generated TOCs and expected results. +* Verify TOC deletion works +* Verify manipulation of single TOC item works +* Verify stability against circular TOC items +""" + +import os +import sys +import pymupdf +import pathlib + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "001003ED.pdf") +filename2 = os.path.join(scriptdir, "resources", "2.pdf") +circular = os.path.join(scriptdir, "resources", "circular-toc.pdf") +full_toc = os.path.join(scriptdir, "resources", "full_toc.txt") +simple_toc = os.path.join(scriptdir, "resources", "simple_toc.txt") +file_3820 = os.path.join(scriptdir, "resources", "test-3820.pdf") +doc = pymupdf.open(filename) + + +def test_simple_toc(): + simple_lines = open(simple_toc, "rb").read() + toc = b"".join([str(t).encode() for t in doc.get_toc(True)]) + assert toc == simple_lines + + +def test_full_toc(): + if not hasattr(pymupdf, "mupdf"): + # Classic implementation does not have fix for this test. + print(f"Not running test_full_toc on classic implementation.") + return + expected_path = f"{scriptdir}/resources/full_toc.txt" + expected = pathlib.Path(expected_path).read_bytes() + # Github windows x32 seems to insert \r characters; maybe something to + # do with the Python installation's line endings settings. 
+ expected = expected.decode("utf8") + expected = expected.replace('\r', '') + toc = "\n".join([str(t) for t in doc.get_toc(False)]) + toc += "\n" + assert toc == expected + + +def test_erase_toc(): + doc.set_toc([]) + assert doc.get_toc() == [] + + +def test_replace_toc(): + toc = doc.get_toc(False) + doc.set_toc(toc) + + +def test_setcolors(): + doc = pymupdf.open(filename2) + toc = doc.get_toc(False) + for i in range(len(toc)): + d = toc[i][3] + d["color"] = (1, 0, 0) + d["bold"] = True + d["italic"] = True + doc.set_toc_item(i, dest_dict=d) + + toc2 = doc.get_toc(False) + assert len(toc2) == len(toc) + + for t in toc2: + d = t[3] + assert d["bold"] + assert d["italic"] + assert d["color"] == (1, 0, 0) + + +def test_circular(): + """The test file contains circular bookmarks.""" + doc = pymupdf.open(circular) + toc = doc.get_toc(False) # this must not loop + rebased = hasattr(pymupdf, 'mupdf') + if rebased: + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'Bad or missing prev pointer in outline tree, repairing', \ + f'{wt=}' + +def test_2355(): + + # Create a test PDF with toc. + doc = pymupdf.Document() + for _ in range(10): + doc.new_page(doc.page_count) + doc.set_toc([[1, 'test', 1], [1, 'test2', 5]]) + + path = 'test_2355.pdf' + doc.save(path) + + # Open many times + for i in range(10): + with pymupdf.open(path) as new_doc: + new_doc.get_toc() + + # Open once and read many times + with pymupdf.open(path) as new_doc: + for i in range(10): + new_doc.get_toc() + +def test_2788(): + ''' + Check handling of Document.get_toc() when toc item has kind=4. + ''' + if not hasattr(pymupdf, 'mupdf'): + # Classic implementation does not have fix for this test. + print(f'Not running test_2788 on classic implementation.') + return + path = os.path.abspath(f'{__file__}/../../tests/resources/test_2788.pdf') + document = pymupdf.open(path) + toc0 = [[1, 'page2', 2, {'kind': 4, 'xref': 14, 'page': 1, 'to': pymupdf.Point(100.0, 760.0), 'zoom': 0.0, 'nameddest': 'page.2'}]] + toc1 = document.get_toc(simple=False) + print(f'{toc0=}') + print(f'{toc1=}') + assert toc1 == toc0 + + doc.set_toc(toc0) + toc2 = document.get_toc(simple=False) + print(f'{toc0=}') + print(f'{toc2=}') + assert toc2 == toc0 + + # Also test Page.get_links() bugfix from #2817. + for page in document: + page.get_links() + rebased = hasattr(pymupdf, 'mupdf') + if rebased: + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == ( + "syntax error: expected 'obj' keyword (0 3 ?)\n" + "trying to repair broken xref\n" + "repairing PDF document" + ), f'{wt=}' + + +def test_toc_count(): + file_in = os.path.abspath(f'{__file__}/../../tests/resources/test_toc_count.pdf') + file_out = os.path.abspath(f'{__file__}/../../tests/test_toc_count_out.pdf') + + def get(doc): + outlines = doc.xref_get_key(doc.pdf_catalog(), "Outlines") + ret = doc.xref_object(int(outlines[1].split()[0])) + return ret + print() + with pymupdf.open(file_in) as doc: + print(f'1: {get(doc)}') + toc = doc.get_toc(simple=False) + doc.set_toc([]) + #print(f'2: {get(doc)}') + doc.set_toc(toc) + print(f'3: {get(doc)}') + doc.save(file_out, garbage=4) + with pymupdf.open(file_out) as doc: + print(f'4: {get(doc)}') + pymupdf._log_items_clear() + + +def test_3347(): + ''' + Check fix for #3347 - link destination rectangles when source/destination + pages have different sizes. 
+ ''' + doc = pymupdf.open() + doc.new_page(width=500, height=800) + doc.new_page(width=800, height=500) + rects = [ + (0, pymupdf.Rect(10, 20, 50, 40), pymupdf.utils.getColor('red')), + (0, pymupdf.Rect(300, 350, 400, 450), pymupdf.utils.getColor('green')), + (1, pymupdf.Rect(20, 30, 40, 50), pymupdf.utils.getColor('blue')), + (1, pymupdf.Rect(350, 300, 450, 400), pymupdf.utils.getColor('black')) + ] + + for page, rect, color in rects: + doc[page].draw_rect(rect, color=color) + + for (from_page, from_rect, _), (to_page, to_rect, _) in zip(rects, rects[1:] + rects[:1]): + doc[from_page].insert_link({ + 'kind': 1, + 'from': from_rect, + 'page': to_page, + 'to': to_rect.top_left, + }) + + links_expected = [ + (0, {'kind': 1, 'xref': 11, 'from': pymupdf.Rect(10.0, 20.0, 50.0, 40.0), 'page': 0, 'to': pymupdf.Point(300.0, 350.0), 'zoom': 0.0, 'id': 'fitz-L0'}), + (0, {'kind': 1, 'xref': 12, 'from': pymupdf.Rect(300.0, 350.0, 400.0, 450.0), 'page': 1, 'to': pymupdf.Point(20.0, 30.0), 'zoom': 0.0, 'id': 'fitz-L1'}), + (1, {'kind': 1, 'xref': 13, 'from': pymupdf.Rect(20.0, 30.0, 40.0, 50.0), 'page': 1, 'to': pymupdf.Point(350.0, 300.0), 'zoom': 0.0, 'id': 'fitz-L0'}), + (1, {'kind': 1, 'xref': 14, 'from': pymupdf.Rect(350.0, 300.0, 450.0, 400.0), 'page': 0, 'to': pymupdf.Point(10.0, 20.0), 'zoom': 0.0, 'id': 'fitz-L1'}), + ] + + path = os.path.normpath(f'{__file__}/../../tests/test_3347_out.pdf') + doc.save(path) + print(f'Have saved to {path=}.') + + links_actual = list() + for page_i, page in enumerate(doc): + links = page.get_links() + for link_i, link in enumerate(links): + print(f'{page_i=} {link_i=}: {link!r}') + links_actual.append( (page_i, link) ) + + assert links_actual == links_expected + + +def test_3400(): + ''' + Check fix for #3400 - link destination rectangles when source/destination + pages have different rotations. 
+ ''' + width = 750 + height = 1110 + circle_middle_point = pymupdf.Point(height / 4, width / 4) + print(f'{circle_middle_point=}') + with pymupdf.open() as doc: + + page = doc.new_page(width=width, height=height) + page.set_rotation(270) + # draw a circle at the middle point to facilitate debugging + page.draw_circle(circle_middle_point, color=(0, 0, 1), radius=5, width=2) + + for i in range(10): + for j in range(10): + x = i/10 * width + y = j/10 * height + page.draw_circle(pymupdf.Point(x, y), color=(0,0,0), radius=0.2, width=0.1) + page.insert_htmlbox(pymupdf.Rect(x, y, x+width/10, y+height/20), f'({x=:.1f},{y=:.1f})', ) + + # rotate the middle point by the page rotation for the new toc entry + toc_link_coords = circle_middle_point + print(f'{toc_link_coords=}') + + toc = [ + ( + 1, + "Link to circle", + 1, + { + "kind": pymupdf.LINK_GOTO, + "page": 1, + "to": toc_link_coords, + "from": pymupdf.Rect(0, 0, height / 4, width / 4), + }, + ) + ] + doc.set_toc(toc, 0) # set the toc + + page = doc.new_page(width=200, height=300) + from_rect = pymupdf.Rect(10, 10, 100, 50) + page.insert_htmlbox(from_rect, 'link') + link = dict() + link['from'] = from_rect + link['kind'] = pymupdf.LINK_GOTO + link['to'] = toc_link_coords + link['page'] = 0 + page.insert_link(link) + + path = os.path.normpath(f'{__file__}/../../tests/test_3400.pdf') + doc.save(path) + print(f'Saved to {path=}.') + + links_expected = [ + (1, {'kind': 1, 'xref': 1120, 'from': pymupdf.Rect(10.0, 10.0, 100.0, 50.0), 'page': 0, 'to': pymupdf.Point(187.5, 472.5), 'zoom': 0.0, 'id': 'fitz-L0'}) + ] + + links_actual = list() + for page_i, page in enumerate(doc): + links = page.get_links() + for link_i, link in enumerate(links): + print(f'({page_i}, {link!r})') + links_actual.append( (page_i, link) ) + + assert links_actual == links_expected + + + +def test_3820(): + """Ensure all extended TOC items point to pages.""" + doc = pymupdf.open(file_3820) + toc = doc.get_toc(simple=False) + for _, _, epage, dest in toc: + assert epage == dest["page"] + 1 + + diff -r 000000000000 -r 1d09e1dec1d9 tests/test_widgets.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_widgets.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,432 @@ +# -*- coding: utf-8 -*- +""" +Test PDF field (widget) insertion. +""" +import pymupdf +import os + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "widgettest.pdf") +file_2333 = os.path.join(scriptdir, "resources", "test-2333.pdf") +file_4055 = os.path.join(scriptdir, "resources", "test-4055.pdf") + + +doc = pymupdf.open() +page = doc.new_page() +gold = (1, 1, 0) # define some colors +blue = (0, 0, 1) +gray = (0.9, 0.9, 0.9) +fontsize = 11.0 # define a fontsize +lineheight = fontsize + 4.0 +rect = pymupdf.Rect(50, 72, 400, 200) + + +def test_text(): + doc = pymupdf.open() + page = doc.new_page() + widget = pymupdf.Widget() # create a widget object + widget.border_color = blue # border color + widget.border_width = 0.3 # border width + widget.border_style = "d" + widget.border_dashes = (2, 3) + widget.field_name = "Textfield-1" # field name + widget.field_label = "arbitrary text - e.g. 
to help filling the field" + widget.field_type = pymupdf.PDF_WIDGET_TYPE_TEXT # field type + widget.fill_color = gold # field background + widget.rect = rect # set field rectangle + widget.text_color = blue # rext color + widget.text_font = "TiRo" # use font Times-Roman + widget.text_fontsize = fontsize # set fontsize + widget.text_maxlen = 50 # restrict number of characters + widget.field_value = "Times-Roman" + page.add_widget(widget) # create the field + field = page.first_widget + assert field.field_type_string == "Text" + + +def test_checkbox(): + doc = pymupdf.open() + page = doc.new_page() + widget = pymupdf.Widget() + widget.border_style = "b" + widget.field_name = "Button-1" + widget.field_label = "a simple check box button" + widget.field_type = pymupdf.PDF_WIDGET_TYPE_CHECKBOX + widget.fill_color = gold + widget.rect = rect + widget.text_color = blue + widget.text_font = "ZaDb" + widget.field_value = True + page.add_widget(widget) # create the field + field = page.first_widget + assert field.field_type_string == "CheckBox" + + # Check #2350 - setting checkbox to readonly. + # + widget.field_flags |= pymupdf.PDF_FIELD_IS_READ_ONLY + widget.update() + path = f"{scriptdir}/test_checkbox.pdf" + doc.save(path) + + doc = pymupdf.open(path) + page = doc[0] + widget = page.first_widget + assert widget + assert widget.field_flags == pymupdf.PDF_FIELD_IS_READ_ONLY + + +def test_listbox(): + doc = pymupdf.open() + page = doc.new_page() + widget = pymupdf.Widget() + widget.field_name = "ListBox-1" + widget.field_label = "is not a drop down: scroll with cursor in field" + widget.field_type = pymupdf.PDF_WIDGET_TYPE_LISTBOX + widget.field_flags = pymupdf.PDF_CH_FIELD_IS_COMMIT_ON_SEL_CHANGE + widget.fill_color = gold + widget.choice_values = ( + "Frankfurt", + "Hamburg", + "Stuttgart", + "Hannover", + "Berlin", + "München", + "Köln", + "Potsdam", + ) + widget.rect = rect + widget.text_color = blue + widget.text_fontsize = fontsize + widget.field_value = widget.choice_values[-1] + print("About to add '%s'" % widget.field_name) + page.add_widget(widget) # create the field + field = page.first_widget + assert field.field_type_string == "ListBox" + + +def test_combobox(): + doc = pymupdf.open() + page = doc.new_page() + widget = pymupdf.Widget() + widget.field_name = "ComboBox-1" + widget.field_label = "an editable combo box ..." + widget.field_type = pymupdf.PDF_WIDGET_TYPE_COMBOBOX + widget.field_flags = ( + pymupdf.PDF_CH_FIELD_IS_COMMIT_ON_SEL_CHANGE | pymupdf.PDF_CH_FIELD_IS_EDIT + ) + widget.fill_color = gold + widget.choice_values = ( + "Spanien", + "Frankreich", + "Holland", + "Dänemark", + "Schweden", + "Norwegen", + "England", + "Polen", + "Russland", + "Italien", + "Portugal", + "Griechenland", + ) + widget.rect = rect + widget.text_color = blue + widget.text_fontsize = fontsize + widget.field_value = widget.choice_values[-1] + page.add_widget(widget) # create the field + field = page.first_widget + assert field.field_type_string == "ComboBox" + + +def test_text2(): + doc = pymupdf.open() + doc.new_page() + page = [p for p in doc.pages()][0] + widget = pymupdf.Widget() + widget.field_name = "textfield-2" + widget.field_label = "multi-line text with tabs is also possible!" 
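+    # PDF_TX_FIELD_IS_MULTILINE is needed for the embedded '\n' characters in the
+    # field value below to be rendered as separate lines.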
+ widget.field_flags = pymupdf.PDF_TX_FIELD_IS_MULTILINE + widget.field_type = pymupdf.PDF_WIDGET_TYPE_TEXT + widget.fill_color = gray + widget.rect = rect + widget.text_color = blue + widget.text_font = "TiRo" + widget.text_fontsize = fontsize + widget.field_value = "This\n\tis\n\t\ta\n\t\t\tmulti-\n\t\tline\n\ttext." + page.add_widget(widget) # create the field + widgets = [w for w in page.widgets()] + field = widgets[0] + assert field.field_type_string == "Text" + + +def test_2333(): + doc = pymupdf.open(file_2333) + page = doc[0] + + def values(): + return set( + ( + doc.xref_get_key(635, "AS")[1], + doc.xref_get_key(636, "AS")[1], + doc.xref_get_key(637, "AS")[1], + doc.xref_get_key(638, "AS")[1], + doc.xref_get_key(127, "V")[1], + ) + ) + + for i, xref in enumerate((635, 636, 637, 638)): + w = page.load_widget(xref) + w.field_value = True + w.update() + assert values() == set(("/Off", f"{i}", f"/{i}")) + w.field_value = False + w.update() + assert values() == set(("Off", "/Off")) + + +def test_2411(): + """Add combobox values in different formats.""" + doc = pymupdf.open() + page = doc.new_page() + rect = pymupdf.Rect(100, 100, 300, 200) + + widget = pymupdf.Widget() + widget.field_flags = ( + pymupdf.PDF_CH_FIELD_IS_COMBO + | pymupdf.PDF_CH_FIELD_IS_EDIT + | pymupdf.PDF_CH_FIELD_IS_COMMIT_ON_SEL_CHANGE + ) + widget.field_name = "ComboBox-1" + widget.field_label = "an editable combo box ..." + widget.field_type = pymupdf.PDF_WIDGET_TYPE_COMBOBOX + widget.fill_color = pymupdf.pdfcolor["gold"] + widget.rect = rect + widget.choice_values = [ + ["Spain", "ES"], # double value as list + ("Italy", "I"), # double value as tuple + "Portugal", # single value + ] + page.add_widget(widget) + + +def test_2391(): + """Confirm that multiple times setting a checkbox to ON/True/Yes will work.""" + doc = pymupdf.open(f"{scriptdir}/resources/widgettest.pdf") + page = doc[0] + # its work when we update first-time + for field in page.widgets(types=[pymupdf.PDF_WIDGET_TYPE_CHECKBOX]): + field.field_value = True + field.update() + + for i in range(5): + pdfdata = doc.tobytes() + doc.close() + doc = pymupdf.open("pdf", pdfdata) + page = doc[0] + for field in page.widgets(types=[pymupdf.PDF_WIDGET_TYPE_CHECKBOX]): + assert field.field_value == field.on_state() + field_field_value = field.on_state() + field.update() + + +def test_3216(): + document = pymupdf.open(filename) + for page in document: + while 1: + w = page.first_widget + print(f"{w=}") + if not w: + break + page.delete_widget(w) + + +def test_add_widget(): + doc = pymupdf.open() + page = doc.new_page() + w = pymupdf.Widget() + w.field_type = pymupdf.PDF_WIDGET_TYPE_BUTTON + w.rect = pymupdf.Rect(5, 5, 20, 20) + w.field_flags = pymupdf.PDF_BTN_FIELD_IS_PUSHBUTTON + w.field_name = "button" + w.fill_color = (0, 0, 1) + w.script = "app.alert('Hello, PDF!');" + page.add_widget(w) + + +def test_interfield_calculation(): + """Confirm correct working of interfield calculations. + + We are going to create three pages with a computed result field each. + + Tests the fix for https://github.com/pymupdf/PyMuPDF/issues/3402. + """ + # Field bboxes (same on each page) + r1 = pymupdf.Rect(100, 100, 300, 120) + r2 = pymupdf.Rect(100, 130, 300, 150) + r3 = pymupdf.Rect(100, 180, 300, 200) + + doc = pymupdf.open() + pdf = pymupdf._as_pdf_document(doc) # we need underlying PDF document + + # Make PDF name object for "CO" because it is not defined in MuPDF. 
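+    # (pdf_new_name() builds the name object at runtime; pymupdf.PDF_NAME() only
+    # covers the names that MuPDF itself predefines.)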
+ CO_name = pymupdf.mupdf.pdf_new_name("CO") # = PDF_NAME(CO) + for i in range(3): + page = doc.new_page() + w = pymupdf.Widget() + w.field_name = f"NUM1{page.number}" + w.rect = r1 + w.field_type = pymupdf.PDF_WIDGET_TYPE_TEXT + w.field_value = f"{i*100+1}" + w.field_flags = 2 + page.add_widget(w) + + w = pymupdf.Widget() + w.field_name = f"NUM2{page.number}" + w.rect = r2 + w.field_type = pymupdf.PDF_WIDGET_TYPE_TEXT + w.field_value = "200" + w.field_flags = 2 + page.add_widget(w) + + w = pymupdf.Widget() + w.field_name = f"RESULT{page.number}" + w.rect = r3 + w.field_type = pymupdf.PDF_WIDGET_TYPE_TEXT + w.field_value = "Result?" + # Script that adds previous two fields. + w.script_calc = f"""AFSimple_Calculate("SUM", + new Array("NUM1{page.number}", "NUM2{page.number}"));""" + page.add_widget(w) + + # Access the inter-field calculation array. It contains a reference to + # all fields which have a JavaScript stored in their "script_calc" + # property, i.e. an "AA/C" entry. + # Every iteration adds another such field, so this array's length must + # always equal the loop index. + if i == 0: # only need to execute this on first time through + CO = pymupdf.mupdf.pdf_dict_getl( + pymupdf.mupdf.pdf_trailer(pdf), + pymupdf.PDF_NAME("Root"), + pymupdf.PDF_NAME("AcroForm"), + CO_name, + ) + # we confirm CO is an array of foreseeable length + assert pymupdf.mupdf.pdf_array_len(CO) == i + 1 + + # the xref of the i-th item must equal that of the last widget + assert ( + pymupdf.mupdf.pdf_to_num(pymupdf.mupdf.pdf_array_get(CO, i)) + == list(page.widgets())[-1].xref + ) + + +def test_3950(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3950.pdf') + items = list() + with pymupdf.open(path) as document: + for page in document: + for widget in page.widgets(): + items.append(widget.field_label) + print(f'test_3950(): {widget.field_label=}.') + assert items == [ + '{{ named_insured }}', + '{{ policy_period_start_date }}', + '{{ policy_period_end_date }}', + '{{ insurance_line }}', + ] + + +def test_4004(): + import collections + + def get_widgets_by_name(doc): + """ + Extracts and returns a dictionary of widgets indexed by their names. + """ + widgets_by_name = collections.defaultdict(list) + for page_num in range(len(doc)): + page = doc.load_page(page_num) + for field in page.widgets(): + widgets_by_name[field.field_name].append({ + "page_num": page_num, + "widget": field + }) + return widgets_by_name + + # Open document and get widgets + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4004.pdf') + doc = pymupdf.open(path) + widgets_by_name = get_widgets_by_name(doc) + + # Print widget information + for name, widgets in widgets_by_name.items(): + print(f"Widget Name: {name}") + for entry in widgets: + widget = entry["widget"] + page_num = entry["page_num"] + print(f" Page: {page_num + 1}, Type: {widget.field_type}, Value: {widget.field_value}, Rect: {widget.rect}") + + # Attempt to update field value + w = widgets_by_name["Text1"][0] + field = w['widget'] + field.value = "1234567890" + try: + field.update() + except Exception as e: + assert str(e) == 'Annot is not bound to a page' + + doc.close() + + +def test_4055(): + """Check correct setting of CheckBox "Yes" values. + + Test scope: + * setting on with any of 'True' / 'Yes' / built-in values works + * setting off with any of 'False' or 'Off' works + """ + + # this PDF has digits as "Yes" values. 
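+    # i.e. the check boxes' on-state names are digits (e.g. "/1") rather than the
+    # conventional "Yes".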
+ doc = pymupdf.open(file_4055) + page = doc[0] + + # Round 1: confirm all check boxes are off + for w in page.widgets(types=[2]): + # check that this file doesn't use the "Yes" standard + assert w.on_state() != "Yes" + assert w.field_value == "Off" # all check boxes are off + w.field_value = w.on_state() + w.update() + + page = doc.reload_page(page) # reload page to make sure we start fresh + + # Round 2: confirm that fields contain the PDF's own on values + for w in page.widgets(types=[2]): + # confirm each value coincides with the "Yes" value + assert w.field_value == w.on_state() + w.field_value = False # switch to "Off" using False + w.update() + + page = doc.reload_page(page) + + # Round 3: confirm that 'False' achieved "Off" values + for w in page.widgets(types=[2]): + assert w.field_value == "Off" + w.field_value = True # use True for the next round + w.update() + + page = doc.reload_page(page) + + # Round 4: confirm that setting to True also worked + for w in page.widgets(types=[2]): + assert w.field_value == w.on_state() + w.field_value = "Off" # set off again + w.update() + w.field_value = "Yes" + w.update() + + page = doc.reload_page(page) + + # Round 5: final check: setting to "Yes" also does work + for w in page.widgets(types=[2]): + assert w.field_value == w.on_state() diff -r 000000000000 -r 1d09e1dec1d9 tests/test_word_delimiters.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_word_delimiters.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,23 @@ +import pymupdf +import string + + +def test_delimiters(): + """Test changing word delimiting characters.""" + doc = pymupdf.open() + page = doc.new_page() + text = "word1,word2 - word3. word4?word5." + page.insert_text((50, 50), text) + + # Standard words extraction: + # only spaces and line breaks start a new word + words0 = [w[4] for w in page.get_text("words")] + assert words0 == ["word1,word2", "-", "word3.", "word4?word5."] + + # extract words again + words1 = [w[4] for w in page.get_text("words", delimiters=string.punctuation)] + assert words0 != words1 + assert " ".join(words1) == "word1 word2 word3 word4 word5" + + # confirm we will be getting old extraction + assert [w[4] for w in page.get_text("words")] == words0 diff -r 000000000000 -r 1d09e1dec1d9 tests/util.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/util.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,28 @@ +import os +import subprocess + + +def download(url, name, size=None): + ''' + Downloads from to a local file and returns its path. + + If file already exists and matches we do not re-download it. + + We put local files within a `cache/` directory so that it is not deleted by + `git clean` (unless `-d` is specified). + ''' + path = os.path.normpath(f'{__file__}/../../tests/cache/{name}') + if os.path.isfile(path) and (not size or os.stat(path).st_size == size): + print(f'Using existing file {path=}.') + else: + print(f'Downloading from {url=}.') + subprocess.run(f'pip install -U requests', check=1, shell=1) + import requests + r = requests.get(url, path, timeout=10) + r.raise_for_status() + if size is not None: + assert len(r.content) == size + os.makedirs(os.path.dirname(path), exist_ok=1) + with open(path, 'wb') as f: + f.write(r.content) + return path diff -r 000000000000 -r 1d09e1dec1d9 valgrind.supp --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/valgrind.supp Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,17 @@ +# Valgrind suppression for false-positives from use of shared-libraries. 
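+# Example usage (assuming the test suite is run under pytest):
+#   valgrind --suppressions=valgrind.supp python -m pytest tests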
+# +{ + sharedlibrary-read + Memcheck:Addr8 + fun:strncmp + fun:is_dst + ... + fun:fillin_rpath.isra.0 + fun:decompose_rpath + ... + fun:openaux + fun:_dl_catch_exception + fun:_dl_map_object_deps + fun:dl_open_worker_begin + fun:_dl_catch_exception +} diff -r 000000000000 -r 1d09e1dec1d9 wdev.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/wdev.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,424 @@ +''' +Finds locations of Windows command-line development tools. +''' + +import os +import platform +import glob +import re +import subprocess +import sys +import sysconfig +import textwrap + +import pipcl + + +class WindowsVS: + r''' + Windows only. Finds locations of Visual Studio command-line tools. Assumes + VS2019-style paths. + + Members and example values:: + + .year: 2019 + .grade: Community + .version: 14.28.29910 + .directory: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community + .vcvars: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat + .cl: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64\cl.exe + .link: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64\link.exe + .csc: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Current\Bin\Roslyn\csc.exe + .msbuild: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Current\Bin\MSBuild.exe + .devenv: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\devenv.com + + `.csc` is C# compiler; will be None if not found. + ''' + def __init__( + self, + *, + year=None, + grade=None, + version=None, + cpu=None, + directory=None, + verbose=False, + ): + ''' + Args: + year: + None or, for example, `2019`. If None we use environment + variable WDEV_VS_YEAR if set. + grade: + None or, for example, one of: + + * `Community` + * `Professional` + * `Enterprise` + + If None we use environment variable WDEV_VS_GRADE if set. + version: + None or, for example: `14.28.29910`. If None we use environment + variable WDEV_VS_VERSION if set. + cpu: + None or a `WindowsCpu` instance. + directory: + Ignore year, grade, version and cpu and use this directory + directly. + verbose: + . + + ''' + if year is not None: + year = str(year) # Allow specification as a number. + def default(value, name): + if value is None: + name2 = f'WDEV_VS_{name.upper()}' + value = os.environ.get(name2) + if value is not None: + _log(f'Setting {name} from environment variable {name2}: {value!r}') + return value + try: + year = default(year, 'year') + grade = default(grade, 'grade') + version = default(version, 'version') + + if not cpu: + cpu = WindowsCpu() + + if not directory: + # Find `directory`. + # + pattern = _vs_pattern(year, grade) + directories = glob.glob( pattern) + if verbose: + _log( f'Matches for: {pattern=}') + _log( f'{directories=}') + assert directories, f'No match found for {pattern=}.' + directories.sort() + directory = directories[-1] + + # Find `devenv`. + # + devenv = f'{directory}\\Common7\\IDE\\devenv.com' + assert os.path.isfile( devenv), f'Does not exist: {devenv}' + + # Extract `year` and `grade` from `directory`. + # + # We use r'...' for regex strings because an extra level of escaping is + # required for backslashes. 
+ # + regex = rf'^C:\\Program Files.*\\Microsoft Visual Studio\\([^\\]+)\\([^\\]+)' + m = re.match( regex, directory) + assert m, f'No match: {regex=} {directory=}' + year2 = m.group(1) + grade2 = m.group(2) + if year: + assert year2 == year + else: + year = year2 + if grade: + assert grade2 == grade + else: + grade = grade2 + + # Find vcvars.bat. + # + vcvars = f'{directory}\\VC\\Auxiliary\\Build\\vcvars{cpu.bits}.bat' + assert os.path.isfile( vcvars), f'No match for: {vcvars}' + + # Find cl.exe. + # + cl_pattern = f'{directory}\\VC\\Tools\\MSVC\\{version if version else "*"}\\bin\\Host{cpu.windows_name}\\{cpu.windows_name}\\cl.exe' + cl_s = glob.glob( cl_pattern) + assert cl_s, f'No match for: {cl_pattern}' + cl_s.sort() + cl = cl_s[ -1] + + # Extract `version` from cl.exe's path. + # + m = re.search( rf'\\VC\\Tools\\MSVC\\([^\\]+)\\bin\\Host{cpu.windows_name}\\{cpu.windows_name}\\cl.exe$', cl) + assert m + version2 = m.group(1) + if version: + assert version2 == version + else: + version = version2 + assert version + + # Find link.exe. + # + link_pattern = f'{directory}\\VC\\Tools\\MSVC\\{version}\\bin\\Host{cpu.windows_name}\\{cpu.windows_name}\\link.exe' + link_s = glob.glob( link_pattern) + assert link_s, f'No match for: {link_pattern}' + link_s.sort() + link = link_s[ -1] + + # Find csc.exe. + # + csc = None + for dirpath, dirnames, filenames in os.walk(directory): + for filename in filenames: + if filename == 'csc.exe': + csc = os.path.join(dirpath, filename) + #_log(f'{csc=}') + #break + + # Find MSBuild.exe. + # + msbuild = None + for dirpath, dirnames, filenames in os.walk(directory): + for filename in filenames: + if filename == 'MSBuild.exe': + msbuild = os.path.join(dirpath, filename) + #_log(f'{csc=}') + #break + + self.cl = cl + self.devenv = devenv + self.directory = directory + self.grade = grade + self.link = link + self.csc = csc + self.msbuild = msbuild + self.vcvars = vcvars + self.version = version + self.year = year + self.cpu = cpu + except Exception as e: + raise Exception( f'Unable to find Visual Studio {year=} {grade=} {version=} {cpu=} {directory=}') from e + + def description_ml( self, indent=''): + ''' + Return multiline description of `self`. + ''' + ret = textwrap.dedent(f''' + year: {self.year} + grade: {self.grade} + version: {self.version} + directory: {self.directory} + vcvars: {self.vcvars} + cl: {self.cl} + link: {self.link} + csc: {self.csc} + msbuild: {self.msbuild} + devenv: {self.devenv} + cpu: {self.cpu} + ''') + return textwrap.indent( ret, indent) + + def __repr__( self): + items = list() + for name in ( + 'year', + 'grade', + 'version', + 'directory', + 'vcvars', + 'cl', + 'link', + 'csc', + 'msbuild', + 'devenv', + 'cpu', + ): + items.append(f'{name}={getattr(self, name)!r}') + return ' '.join(items) + + +def _vs_pattern(year=None, grade=None): + return f'C:\\Program Files*\\Microsoft Visual Studio\\{year if year else "2*"}\\{grade if grade else "*"}' + + +def windows_vs_multiple(year=None, grade=None, verbose=0): + ''' + Returns list of WindowsVS instances. + ''' + ret = list() + directories = glob.glob(_vs_pattern(year, grade)) + for directory in directories: + vs = WindowsVS(directory=directory) + if verbose: + _log(vs.description_ml()) + ret.append(vs) + return ret + + +class WindowsCpu: + ''' + For Windows only. Paths and names that depend on cpu. + + Members: + .bits + 32 or 64. + .windows_subdir + Empty string or `x64/`. + .windows_name + `x86` or `x64`. + .windows_config + `x64` or `Win32`, e.g. for use in `/Build Release|x64`. 
+ .windows_suffix + `64` or empty string. + ''' + def __init__(self, name=None): + if not name: + name = _cpu_name() + self.name = name + if name == 'x32': + self.bits = 32 + self.windows_subdir = '' + self.windows_name = 'x86' + self.windows_config = 'Win32' + self.windows_suffix = '' + elif name == 'x64': + self.bits = 64 + self.windows_subdir = 'x64/' + self.windows_name = 'x64' + self.windows_config = 'x64' + self.windows_suffix = '64' + else: + assert 0, f'Unrecognised cpu name: {name}' + + def __repr__(self): + return self.name + + +class WindowsPython: + ''' + Windows only. Information about installed Python with specific word size + and version. Defaults to the currently-running Python. + + Members: + + .path: + Path of python binary. + .version: + `{major}.{minor}`, e.g. `3.9` or `3.11`. Same as `version` passed + to `__init__()` if not None, otherwise the inferred version. + .include: + Python include path. + .cpu: + A `WindowsCpu` instance, same as `cpu` passed to `__init__()` if + not None, otherwise the inferred cpu. + + We parse the output from `py -0p` to find all available python + installations. + ''' + + def __init__( self, cpu=None, version=None, verbose=True): + ''' + Args: + + cpu: + A WindowsCpu instance. If None, we use whatever we are running + on. + version: + Two-digit Python version as a string such as `3.8`. If None we + use current Python's version. + verbose: + If true we show diagnostics. + ''' + if cpu is None: + cpu = WindowsCpu(_cpu_name()) + if version is None: + version = '.'.join(platform.python_version().split('.')[:2]) + _log(f'Looking for Python {version=} {cpu.bits=}.') + + if '.'.join(platform.python_version().split('.')[:2]) == version: + # Current python matches, so use it directly. This avoids problems + # on Github where experimental python-3.13 was not available via + # `py`, and is kept here in case a similar problems happens with + # future Python versions. + _log(f'{cpu=} {version=}: using {sys.executable=}.') + self.path = sys.executable + self.version = version + self.cpu = cpu + self.include = sysconfig.get_path('include') + + else: + command = 'py -0p' + if verbose: + _log(f'{cpu=} {version=}: Running: {command}') + text = subprocess.check_output( command, shell=True, text=True) + for line in text.split('\n'): + #_log( f' {line}') + if m := re.match( '^ *-V:([0-9.]+)(-32)? ([*])? +(.+)$', line): + version2 = m.group(1) + bits = 32 if m.group(2) else 64 + current = m.group(3) + path = m.group(4).strip() + elif m := re.match( '^ *-([0-9.]+)-((32)|(64)) +(.+)$', line): + version2 = m.group(1) + bits = int(m.group(2)) + path = m.group(5).strip() + else: + if verbose: + _log( f'No match for {line=}') + continue + if verbose: + _log( f'{version2=} {bits=} {path=} from {line=}.') + if bits != cpu.bits or version2 != version: + continue + root = os.path.dirname(path) + if not os.path.exists(path): + # Sometimes it seems that the specified .../python.exe does not exist, + # and we have to change it to .../python.exe. 
+ # + assert path.endswith('.exe'), f'path={path!r}' + path2 = f'{path[:-4]}{version}.exe' + _log( f'Python {path!r} does not exist; changed to: {path2!r}') + assert os.path.exists( path2) + path = path2 + + self.path = path + self.version = version + self.cpu = cpu + command = f'{self.path} -c "import sysconfig; print(sysconfig.get_path(\'include\'))"' + _log(f'Finding Python include path by running {command=}.') + self.include = subprocess.check_output(command, shell=True, text=True).strip() + _log(f'Python include path is {self.include=}.') + #_log( f'pipcl.py:WindowsPython():\n{self.description_ml(" ")}') + break + else: + _log(f'Failed to find python matching cpu={cpu}.') + _log(f'Output from {command!r} was:\n{text}') + raise Exception( f'Failed to find python matching cpu={cpu} {version=}.') + + # Oddly there doesn't seem to be a + # `sysconfig.get_path('libs')`, but it seems to be next + # to `includes`: + self.libs = os.path.abspath(f'{self.include}/../libs') + + _log( f'WindowsPython:\n{self.description_ml(" ")}') + + def description_ml(self, indent=''): + ret = textwrap.dedent(f''' + path: {self.path} + version: {self.version} + cpu: {self.cpu} + include: {self.include} + libs: {self.libs} + ''') + return textwrap.indent( ret, indent) + + def __repr__(self): + return f'path={self.path!r} version={self.version!r} cpu={self.cpu!r} include={self.include!r} libs={self.libs!r}' + + +# Internal helpers. +# + +def _cpu_name(): + ''' + Returns `x32` or `x64` depending on Python build. + ''' + #log(f'sys.maxsize={hex(sys.maxsize)}') + return f'x{32 if sys.maxsize == 2**31 - 1 else 64}' + + + +def _log(text='', caller=1): + ''' + Logs lines with prefix. + ''' + pipcl.log1(text, caller+1)
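+
+
+if __name__ == '__main__':
+    # Minimal usage sketch (illustrative only; assumes a Windows machine with
+    # Visual Studio and a matching Python installed). It merely exercises the
+    # classes defined above and logs what they found.
+    if platform.system() == 'Windows':
+        cpu = WindowsCpu()
+        vs = WindowsVS(cpu=cpu, verbose=True)
+        _log(f'Visual Studio found:\n{vs.description_ml("    ")}')
+        py = WindowsPython(cpu=cpu, verbose=True)
+        _log(f'Python found:\n{py.description_ml("    ")}')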

{time.strftime("%F-%T")}\n')
+        f.write(f'<table>\n')
+        f.write(f'<tr><th>Extension/magic<br>Data file</th>')
+        for ext in extensions:
+            f.write(f'<th>{ext}</th>')
+        f.write('</tr>\n')
+        for path in sorted(results.keys()):
+            _, ext = os.path.splitext(path)
+            f.write(f'<tr><td>{os.path.basename(path)}</td>')
+            for ext2 in sorted(results[path].keys()):
+                text_file = results[path][ext2]['file']
+                text_stream = results[path][ext2]['stream']
+                b1, b2 = ('', '') if ext2==ext else ('<i>', '</i>')
+                if text_file == text_stream:
+                    if text_file == '[error]':
+                        f.write(f'<td>{b1}{text_file}{b2}</td>')
+                    else:
+                        f.write(f'<td>{b1}{text_file}{b2}</td>')
+                else:
+                    f.write(f'<td>file: {b1}{text_file}{b2}<br>stream: {b1}{text_stream}{b2}</td>')
+            f.write('</tr>\n')
+        f.write(f'</table>
\n') + f.write(f'/\n') + f.write(f'\n') + print(f'Have created: {path_html}') + + path_out = os.path.normpath(f'{__file__}/../../tests/test_open2.json') + with open(path_out, 'w') as f: + json.dump(results, f, indent=4, sort_keys=1) + + if pymupdf.mupdf_version_tuple >= (1, 26): + with open(os.path.normpath(f'{__file__}/../../tests/resources/test_open2_expected.json')) as f: + results_expected = json.load(f) + if results != results_expected: + print(f'results != results_expected:') + def show(r, name): + text = json.dumps(r, indent=4, sort_keys=1) + print(f'{name}:') + print(textwrap.indent(text, ' ')) + show(results_expected, 'results_expected') + show(results, 'results') + assert 0 + + +def test_533(): + if not hasattr(pymupdf, 'mupdf'): + print('test_533(): Not running on classic.') + return + path = os.path.abspath(f'{__file__}/../../tests/resources/2.pdf') + doc = pymupdf.open(path) + print() + for p in doc: + print(f'test_533(): for p in doc: {p=}.') + for p in list(doc)[:]: + print(f'test_533(): for p in list(doc)[:]: {p=}.') + for p in doc[:]: + print(f'test_533(): for p in doc[:]: {p=}.') + +def test_3354(): + document = pymupdf.open(filename) + v = dict(foo='bar') + document.metadata = v + assert document.metadata == v + +def test_scientific_numbers(): + ''' + This is #3381. + ''' + doc = pymupdf.open() + page = doc.new_page(width=595, height=842) + point = pymupdf.Point(1e-11, -1e-10) + page.insert_text(point, "Test") + contents = page.read_contents() + print(f'{contents=}') + assert b" 1e-" not in contents + +def test_3615(): + print('') + print(f'{pymupdf.pymupdf_version=}', flush=1) + print(f'{pymupdf.VersionBind=}', flush=1) + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3615.epub') + doc = pymupdf.open(path) + print(doc.pagemode) + print(doc.pagelayout) + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt + +def test_3654(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3654.docx') + content = "" + with pymupdf.open(path) as document: + for page in document: + content += page.get_text() + '\n\n' + content = content.strip() + +def test_3727(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3727.pdf') + doc = pymupdf.open(path) + for page in doc: + page.get_pixmap(matrix = pymupdf.Matrix(2,2)) + +def test_3569(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3569.pdf') + document = pymupdf.open(path) + page = document[0] + svg = page.get_svg_image(text_as_path=False) + print(f'{svg=}') + if pymupdf.mupdf_version_tuple >= (1, 27): + assert svg == ( + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '**L1-13\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + ) + else: + assert svg == ( + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '**L1-13\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + '\n' + ) + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'unknown cid collection: PDFAUTOCAD-Indentity0\nnon-embedded font using identity encoding: ArialMT (mapping via )\ninvalid marked content and clip nesting' + +def test_3450(): + # This issue is a slow-down, so we just show time taken - it's not safe + # to fail if test takes too long because that can give spurious failures + # depending on hardware etc. + # + # On a mac-mini, PyMuPDF-1.24.8 takes 60s, PyMuPDF-1.24.9 takes 4s. 
+ # + if os.environ.get('PYMUPDF_RUNNING_ON_VALGRIND') == '1': + print(f'test_3450(): not running on valgrind because very slow.', flush=1) + return + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3450.pdf') + pdf = pymupdf.open(path) + page = pdf[0] + t = time.time() + pix = page.get_pixmap(alpha=False, dpi=150) + t = time.time() - t + print(f'test_3450(): {t=}') + +def test_3859(): + print(f'{pymupdf.mupdf.PDF_NULL=}.') + print(f'{pymupdf.mupdf.PDF_TRUE=}.') + print(f'{pymupdf.mupdf.PDF_FALSE=}.') + for name in ('NULL', 'TRUE', 'FALSE'): + name2 = f'PDF_{name}' + v = getattr(pymupdf.mupdf, name2) + print(f'{name=} {name2=} {v=} {type(v)=}') + assert type(v)==pymupdf.mupdf.PdfObj, f'`v` is not a pymupdf.mupdf.PdfObj.' + +def test_3905(): + data = b'A,B,C,D\r\n1,2,1,2\r\n2,2,1,2\r\n' + try: + document = pymupdf.open(stream=data, filetype='pdf') + except pymupdf.FileDataError as e: + print(f'test_3905(): e: {e}') + else: + assert 0 + wt = pymupdf.TOOLS.mupdf_warnings() + if pymupdf.mupdf_version_tuple >= (1, 26): + assert wt == 'format error: cannot find version marker\ntrying to repair broken xref\nrepairing PDF document' + else: + assert wt == 'format error: cannot recognize version marker\ntrying to repair broken xref\nrepairing PDF document' + +def test_3624(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3624.pdf') + path_png_expected = os.path.normpath(f'{__file__}/../../tests/resources/test_3624_expected.png') + path_png = os.path.normpath(f'{__file__}/../../tests/test_3624.png') + with pymupdf.open(path) as document: + page = document[0] + pixmap = page.get_pixmap(matrix=pymupdf.Matrix(2, 2)) + print(f'Saving to {path_png=}.') + pixmap.save(path_png) + rms = gentle_compare.pixmaps_rms(path_png_expected, path_png) + print(f'{rms=}') + # We get small differences in sysinstall tests, where some thirdparty + # libraries can differ. + if rms > 1: + pixmap_diff = gentle_compare.pixmaps_diff(path_png_expected, path_png) + path_png_diff = os.path.normpath(f'{__file__}/../../tests/test_3624_diff.png') + pixmap_diff.save(path_png_diff) + assert 0, f'{rms=}' + + +def test_4043(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4043.pdf') + doc = pymupdf.open(path) + doc.fullcopy_page(1) + + +def test_4018(): + document = pymupdf.open() + for page in document.pages(-1, -1): + pass + +def test_4034(): + # tests/resources/test_4034.pdf is first two pages of input file in + # https://github.com/pymupdf/PyMuPDF/issues/4034. 
+ path = os.path.normpath(f'{__file__}/../../tests/resources/test_4034.pdf') + path_clean = os.path.normpath(f'{__file__}/../../tests/test_4034_out.pdf') + with pymupdf.open(path) as document: + pixmap1 = document[0].get_pixmap() + document.save(path_clean, clean=1) + with pymupdf.open(path_clean) as document: + page = document[0] + pixmap2 = document[0].get_pixmap() + rms = gentle_compare.pixmaps_rms(pixmap1, pixmap2) + print(f'test_4034(): Comparison of original/cleaned page 0 pixmaps: {rms=}.') + if pymupdf.mupdf_version_tuple < (1, 25, 2): + assert 30 < rms < 50 + else: + assert rms == 0 + +def test_4309(): + document = pymupdf.open() + page = document.new_page() + document.delete_page() + +def test_4263(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4263.pdf') + path_out = f'{path}.linerarized.pdf' + command = f'pymupdf clean -linear {path} {path_out}' + print(f'Running: {command}') + cp = subprocess.run(command, shell=1, check=0) + if pymupdf.mupdf_version_tuple < (1, 26): + assert cp.returncode == 0 + else: + # Support for linerarisation dropped in MuPDF-1.26. + assert cp.returncode + +def test_4224(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4224.pdf') + with pymupdf.open(path) as document: + for page in document.pages(): + pixmap = page.get_pixmap(dpi=150) + path_pixmap = f'{path}.{page.number}.png' + pixmap.save(path_pixmap) + print(f'Have created: {path_pixmap}') + if pymupdf.mupdf_version_tuple < (1, 25, 5): + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'format error: negative code in 1d faxd\npadding truncated image' + +def test_4319(): + # Have not seen this test reproduce issue #4319, but keeping it anyway. + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4319.pdf') + doc = pymupdf.open() + page = doc.new_page() + page.insert_text((10, 100), "some text") + doc.save(path) + doc.close() + doc = pymupdf.open(path) + page = doc[0] + pc = doc.page_count + doc.close() + os.remove(path) + print(f"removed {doc.name=}") + +def test_3886(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3886.pdf') + path_clean0 = os.path.normpath(f'{__file__}/../../tests/resources/test_3886_clean0.pdf') + path_clean1 = os.path.normpath(f'{__file__}/../../tests/resources/test_3886_clean1.pdf') + + with pymupdf.open(path) as document: + pixmap = document[0].get_pixmap() + document.save(path_clean0, clean=0) + + with pymupdf.open(path) as document: + document.save(path_clean1, clean=1) + + with pymupdf.open(path_clean0) as document: + pixmap_clean0 = document[0].get_pixmap() + + with pymupdf.open(path_clean1) as document: + pixmap_clean1 = document[0].get_pixmap() + + rms_0 = gentle_compare.pixmaps_rms(pixmap, pixmap_clean0) + rms_1 = gentle_compare.pixmaps_rms(pixmap, pixmap_clean1) + print(f'test_3886(): {rms_0=} {rms_1=}') + +def test_4415(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4415.pdf') + path_out = os.path.normpath(f'{__file__}/../../tests/resources/test_4415_out.png') + path_out_expected = os.path.normpath(f'{__file__}/../../tests/resources/test_4415_out_expected.png') + with pymupdf.open(path) as document: + page = document[0] + rot = page.rotation + orig = pymupdf.Point(100, 100) # apparent insertion point + text = 'Text at Top-Left' + mrot = page.derotation_matrix # matrix annihilating page rotation + page.insert_text(orig * mrot, text, fontsize=60, rotate=rot) + pixmap = page.get_pixmap() + pixmap.save(path_out) + rms = gentle_compare.pixmaps_rms(path_out_expected, path_out) 
+ assert rms == 0, f'{rms=}' + +def test_4466(): + path = os.path.normpath(f'{__file__}/../../tests/test_4466.pdf') + with pymupdf.Document(path) as document: + for page in document: + print(f'{page=}', flush=1) + pixmap = page.get_pixmap(clip=(0, 0, 10, 10)) + print(f'{pixmap.n=} {pixmap.size=} {pixmap.stride=} {pixmap.width=} {pixmap.height=} {pixmap.x=} {pixmap.y=}', flush=1) + pixmap.is_unicolor # Used to crash. + + +def test_4479(): + # This passes with pymupdf-1.24.14, fails with pymupdf==1.25.*, passes with + # pymupdf-1.26.0. + print() + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4479.pdf') + with pymupdf.open(path) as document: + + def show(items): + for item in items: + print(f' {repr(item)}') + + items = document.layer_ui_configs() + show(items) + assert items == [ + {'depth': 0, 'locked': 0, 'number': 0, 'on': 1, 'text': 'layer_0', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 1, 'on': 1, 'text': 'layer_1', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 2, 'on': 0, 'text': 'layer_2', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 3, 'on': 1, 'text': 'layer_3', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 4, 'on': 1, 'text': 'layer_4', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 5, 'on': 1, 'text': 'layer_5', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 6, 'on': 1, 'text': 'layer_6', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 7, 'on': 1, 'text': 'layer_7', 'type': 'checkbox'}, + ] + + document.set_layer_ui_config(0, pymupdf.PDF_OC_OFF) + items = document.layer_ui_configs() + show(items) + assert items == [ + {'depth': 0, 'locked': 0, 'number': 0, 'on': 0, 'text': 'layer_0', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 1, 'on': 1, 'text': 'layer_1', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 2, 'on': 0, 'text': 'layer_2', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 3, 'on': 1, 'text': 'layer_3', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 4, 'on': 1, 'text': 'layer_4', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 5, 'on': 1, 'text': 'layer_5', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 6, 'on': 1, 'text': 'layer_6', 'type': 'checkbox'}, + {'depth': 0, 'locked': 0, 'number': 7, 'on': 1, 'text': 'layer_7', 'type': 'checkbox'}, + ] + + +def test_4533(): + print() + path = util.download( + 'https://github.com/user-attachments/files/20497146/NineData_user_manual_V3.0.5.pdf', + 'test_4533.pdf', + size=16864501, + ) + # This bug is a segv so we run the test in a child process. 
+ command = f'{sys.executable} -c "import pymupdf; document = pymupdf.open({path!r}); print(len(document))"' + print(f'Running: {command}') + cp = subprocess.run(command, shell=1, check=0) + e = cp.returncode + print(f'{e=}') + if pymupdf.mupdf_version_tuple >= (1, 26, 6): + assert e == 0 + else: + assert e != 0 + + +def test_4564(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4564.pdf') + print() + with pymupdf.open(path) as document: + for key in sorted(document.metadata.keys()): + value = document.metadata[key] + print(f'{key}: {value!r}') + if pymupdf.mupdf_version_tuple >= (1, 27): + assert document.metadata['producer'] == 'Adobe PSL 1.3e for Canon\x00' + else: + assert document.metadata['producer'] == 'Adobe PSL 1.3e for Canon\udcc0\udc80' + + +def test_4496(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4496.hwpx') + with pymupdf.open(path) as document: + print(document.page_count) + + +def test_gitinfo(): + # This doesn't really test very much, but can be useful to see the current + # values. + print('') + print(f'test_4496():') + print(f'{pymupdf.mupdf_location=}') + print(f'{pymupdf.mupdf_version=}') + print(f'{pymupdf.pymupdf_git_branch=}') + print(f'{pymupdf.pymupdf_git_sha=}') + print(f'{pymupdf.pymupdf_version=}') + print(f'pymupdf.pymupdf_git_diff:\n{textwrap.indent(pymupdf.pymupdf_git_diff, " ")}') + + +def test_4392(): + print() + path = os.path.normpath(f'{__file__}/../../tests/test_4392.py') + with open(path, 'w') as f: + f.write('import pymupdf\n') + + command = f'pytest {path}' + print(f'Running: {command}', flush=1) + e1 = subprocess.run(command, shell=1, check=0).returncode + print(f'{e1=}') + + command = f'pytest -Werror {path}' + print(f'Running: {command}', flush=1) + e2 = subprocess.run(command, shell=1, check=0).returncode + print(f'{e2=}') + + command = f'{sys.executable} -Werror -c "import pymupdf"' + print(f'Running: {command}', flush=1) + e3 = subprocess.run(command, shell=1, check=0).returncode + print(f'{e3=}') + + print(f'{e1=} {e2=} {e3=}') + + print(f'{pymupdf.swig_version=}') + print(f'{pymupdf.swig_version_tuple=}') + + assert e1 == 5 + if pymupdf.swig_version_tuple >= (4, 4): + assert e2 == 5 + assert e3 == 0 + else: + # We get SEGV's etc with older swig. + if platform.system() == 'Windows': + assert (e2, e3) == (0xc0000005, 0xc0000005) + else: + # On plain linux we get (139, 139). On manylinux we get (-11, + # -11). On MacOS we get (-11, -11). + assert (e2, e3) == (139, 139) or (e2, e3) == (-11, -11) + + +def test_4639(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4639.pdf') + with pymupdf.open(path) as document: + page = document[-1] + page.get_bboxlog(layers=True) + + +def test_4590(): + + # Create test PDF. + path = os.path.normpath(f'{__file__}/../../tests/test_4590.pdf') + with pymupdf.open() as document: + page = document.new_page() + + # Add some text + text = 'This PDF contains a file attachment annotation.' + page.insert_text((72, 72), text, fontsize=12) + + # Create a sample file. 
+ path_sample = os.path.normpath(f'{__file__}/../../tests/test_4590_annotation_sample.txt') + with open(path_sample, 'w') as f: + f.write('This is a sample attachment file.') + + # Read file as bytes + with open(path_sample, 'rb') as f: + sample = f.read() + + # Define annotation position (rect or point) + annot_pos = pymupdf.Rect(72, 100, 92, 120) # PushPin icon rectangle + + # Add the file attachment annotation + page.add_file_annot( + point = annot_pos, + buffer_ = sample, + filename = 'sample.txt', + ufilename = 'sample.txt', + desc = 'A test attachment file.', + icon = 'PushPin', + ) + + # Save the PDF + document.save(path) + + # Check pymupdf.Document.scrub() works. + with pymupdf.open(path) as document: + document.scrub() diff -r 000000000000 -r 1d09e1dec1d9 tests/test_geometry.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_geometry.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,348 @@ +""" +* Check various construction methods of rects, points, matrices +* Check matrix inversions in variations +* Check algebra constructs +""" +import os + +import pymupdf + + +def test_rect(): + assert tuple(pymupdf.Rect()) == (0, 0, 0, 0) + if hasattr(pymupdf, 'mupdf'): + assert tuple(pymupdf.Rect(y0=12)) == (0, 12, 0, 0) + assert tuple(pymupdf.Rect(10, 20, 100, 200, x1=12)) == (10, 20, 12, 200) + p1 = pymupdf.Point(10, 20) + p2 = pymupdf.Point(100, 200) + p3 = pymupdf.Point(150, 250) + r = pymupdf.Rect(10, 20, 100, 200) + r_tuple = tuple(r) + assert tuple(pymupdf.Rect(p1, p2)) == r_tuple + assert tuple(pymupdf.Rect(p1, 100, 200)) == r_tuple + assert tuple(pymupdf.Rect(10, 20, p2)) == r_tuple + assert tuple(r.include_point(p3)) == (10, 20, 150, 250) + r = pymupdf.Rect(10, 20, 100, 200) + assert tuple(r.include_rect((100, 200, 110, 220))) == (10, 20, 110, 220) + r = pymupdf.Rect(10, 20, 100, 200) + # include empty rect makes no change + assert tuple(r.include_rect((0, 0, 0, 0))) == r_tuple + # include invalid rect makes no change + assert tuple(r.include_rect((1, 1, -1, -1))) == r_tuple + r = pymupdf.Rect() + for i in range(4): + r[i] = i + 1 + assert r == pymupdf.Rect(1, 2, 3, 4) + assert pymupdf.Rect() / 5 == pymupdf.Rect() + assert pymupdf.Rect(1, 1, 2, 2) / pymupdf.Identity == pymupdf.Rect(1, 1, 2, 2) + failed = False + try: + r = pymupdf.Rect(1) + except: + failed = True + assert failed + failed = False + try: + r = pymupdf.Rect(1, 2, 3, 4, 5) + except: + failed = True + assert failed + failed = False + try: + r = pymupdf.Rect((1, 2, 3, 4, 5)) + except: + failed = True + assert failed + failed = False + try: + r = pymupdf.Rect(1, 2, 3, "x") + except: + failed = True + assert failed + failed = False + try: + r = pymupdf.Rect() + r[5] = 1 + except: + failed = True + assert failed + + +def test_irect(): + p1 = pymupdf.Point(10, 20) + p2 = pymupdf.Point(100, 200) + p3 = pymupdf.Point(150, 250) + r = pymupdf.IRect(10, 20, 100, 200) + r_tuple = tuple(r) + assert tuple(pymupdf.IRect(p1, p2)) == r_tuple + assert tuple(pymupdf.IRect(p1, 100, 200)) == r_tuple + assert tuple(pymupdf.IRect(10, 20, p2)) == r_tuple + assert tuple(r.include_point(p3)) == (10, 20, 150, 250) + r = pymupdf.IRect(10, 20, 100, 200) + assert tuple(r.include_rect((100, 200, 110, 220))) == (10, 20, 110, 220) + r = pymupdf.IRect(10, 20, 100, 200) + # include empty rect makes no change + assert tuple(r.include_rect((0, 0, 0, 0))) == r_tuple + r = pymupdf.IRect() + for i in range(4): + r[i] = i + 1 + assert r == pymupdf.IRect(1, 2, 3, 4) + + failed = False + try: + r = pymupdf.IRect(1) + except: + failed = True + assert 
failed + failed = False + try: + r = pymupdf.IRect(1, 2, 3, 4, 5) + except: + failed = True + assert failed + failed = False + try: + r = pymupdf.IRect((1, 2, 3, 4, 5)) + except: + failed = True + assert failed + failed = False + try: + r = pymupdf.IRect(1, 2, 3, "x") + except: + failed = True + assert failed + failed = False + try: + r = pymupdf.IRect() + r[5] = 1 + except: + failed = True + assert failed + + +def test_inversion(): + alpha = 255 + m1 = pymupdf.Matrix(alpha) + m2 = pymupdf.Matrix(-alpha) + m3 = m1 * m2 # should equal identity matrix + assert abs(m3 - pymupdf.Identity) < pymupdf.EPSILON + m = pymupdf.Matrix(1, 0, 1, 0, 1, 0) # not invertible! + # inverted matrix must be zero + assert ~m == pymupdf.Matrix() + + +def test_matrix(): + assert tuple(pymupdf.Matrix()) == (0, 0, 0, 0, 0, 0) + assert tuple(pymupdf.Matrix(90)) == (0, 1, -1, 0, 0, 0) + if hasattr(pymupdf, 'mupdf'): + assert tuple(pymupdf.Matrix(c=1)) == (0, 0, 1, 0, 0, 0) + assert tuple(pymupdf.Matrix(90, e=5)) == (0, 1, -1, 0, 5, 0) + m45p = pymupdf.Matrix(45) + m45m = pymupdf.Matrix(-45) + m90 = pymupdf.Matrix(90) + assert abs(m90 - m45p * m45p) < pymupdf.EPSILON + assert abs(pymupdf.Identity - m45p * m45m) < pymupdf.EPSILON + assert abs(m45p - ~m45m) < pymupdf.EPSILON + assert pymupdf.Matrix(2, 3, 1) == pymupdf.Matrix(1, 3, 2, 1, 0, 0) + m = pymupdf.Matrix(2, 3, 1) + m.invert() + assert abs(m * pymupdf.Matrix(2, 3, 1) - pymupdf.Identity) < pymupdf.EPSILON + assert pymupdf.Matrix(1, 1).pretranslate(2, 3) == pymupdf.Matrix(1, 0, 0, 1, 2, 3) + assert pymupdf.Matrix(1, 1).prescale(2, 3) == pymupdf.Matrix(2, 0, 0, 3, 0, 0) + assert pymupdf.Matrix(1, 1).preshear(2, 3) == pymupdf.Matrix(1, 3, 2, 1, 0, 0) + assert abs(pymupdf.Matrix(1, 1).prerotate(30) - pymupdf.Matrix(30)) < pymupdf.EPSILON + small = 1e-6 + assert pymupdf.Matrix(1, 1).prerotate(90 + small) == pymupdf.Matrix(90) + assert pymupdf.Matrix(1, 1).prerotate(180 + small) == pymupdf.Matrix(180) + assert pymupdf.Matrix(1, 1).prerotate(270 + small) == pymupdf.Matrix(270) + assert pymupdf.Matrix(1, 1).prerotate(small) == pymupdf.Matrix(0) + assert pymupdf.Matrix(1, 1).concat( + pymupdf.Matrix(1, 2), pymupdf.Matrix(3, 4) + ) == pymupdf.Matrix(3, 0, 0, 8, 0, 0) + assert pymupdf.Matrix(1, 2, 3, 4, 5, 6) / 1 == pymupdf.Matrix(1, 2, 3, 4, 5, 6) + assert m[0] == m.a + assert m[1] == m.b + assert m[2] == m.c + assert m[3] == m.d + assert m[4] == m.e + assert m[5] == m.f + m = pymupdf.Matrix() + for i in range(6): + m[i] = i + 1 + assert m == pymupdf.Matrix(1, 2, 3, 4, 5, 6) + failed = False + try: + m = pymupdf.Matrix(1, 2, 3) + except: + failed = True + assert failed + failed = False + try: + m = pymupdf.Matrix(1, 2, 3, 4, 5, 6, 7) + except: + failed = True + assert failed + + failed = False + try: + m = pymupdf.Matrix((1, 2, 3, 4, 5, 6, 7)) + except: + failed = True + assert failed + + failed = False + try: + m = pymupdf.Matrix(1, 2, 3, 4, 5, "x") + except: + failed = True + assert failed + + failed = False + try: + m = pymupdf.Matrix(1, 0, 1, 0, 1, 0) + n = pymupdf.Matrix(1, 1) / m + except: + failed = True + assert failed + + +def test_point(): + assert tuple(pymupdf.Point()) == (0, 0) + assert pymupdf.Point(1, -1).unit == pymupdf.Point(5, -5).unit + assert pymupdf.Point(-1, -1).abs_unit == pymupdf.Point(1, 1).unit + assert pymupdf.Point(1, 1).distance_to(pymupdf.Point(1, 1)) == 0 + assert pymupdf.Point(1, 1).distance_to(pymupdf.Rect(1, 1, 2, 2)) == 0 + assert pymupdf.Point().distance_to((1, 1, 2, 2)) > 0 + failed = False + try: + p = pymupdf.Point(1, 2, 3) + except: + 
failed = True + assert failed + + failed = False + try: + p = pymupdf.Point((1, 2, 3)) + except: + failed = True + assert failed + + failed = False + try: + p = pymupdf.Point(1, "x") + except: + failed = True + assert failed + + failed = False + try: + p = pymupdf.Point() + p[3] = 1 + except: + failed = True + assert failed + + +def test_algebra(): + p = pymupdf.Point(1, 2) + m = pymupdf.Matrix(1, 2, 3, 4, 5, 6) + r = pymupdf.Rect(1, 1, 2, 2) + assert p + p == p * 2 + assert p - p == pymupdf.Point() + assert m + m == m * 2 + assert m - m == pymupdf.Matrix() + assert r + r == r * 2 + assert r - r == pymupdf.Rect() + assert p + 5 == pymupdf.Point(6, 7) + assert m + 5 == pymupdf.Matrix(6, 7, 8, 9, 10, 11) + assert r.tl in r + assert r.tr not in r + assert r.br not in r + assert r.bl not in r + assert p * m == pymupdf.Point(12, 16) + assert r * m == pymupdf.Rect(9, 12, 13, 18) + assert (pymupdf.Rect(1, 1, 2, 2) & pymupdf.Rect(2, 2, 3, 3)).is_empty + assert not pymupdf.Rect(1, 1, 2, 2).intersects((2, 2, 4, 4)) + failed = False + try: + x = m + p + except: + failed = True + assert failed + failed = False + try: + x = m + r + except: + failed = True + assert failed + failed = False + try: + x = p + r + except: + failed = True + assert failed + failed = False + try: + x = r + m + except: + failed = True + assert failed + assert m not in r + + +def test_quad(): + r = pymupdf.Rect(10, 10, 20, 20) + q = r.quad + assert q.is_rectangular + assert not q.is_empty + assert q.is_convex + q *= pymupdf.Matrix(1, 1).preshear(2, 3) + assert not q.is_rectangular + assert not q.is_empty + assert q.is_convex + assert r.tl not in q + assert r not in q + assert r.quad not in q + failed = False + try: + q[5] = pymupdf.Point() + except: + failed = True + assert failed + + failed = False + try: + q /= (1, 0, 1, 0, 1, 0) + except: + failed = True + assert failed + + +def test_pageboxes(): + """Tests concerning ArtBox, TrimBox, BleedBox.""" + doc = pymupdf.open() + page = doc.new_page() + assert page.cropbox == page.artbox == page.bleedbox == page.trimbox + rect_methods = ( + page.set_cropbox, + page.set_artbox, + page.set_bleedbox, + page.set_trimbox, + ) + keys = ("CropBox", "ArtBox", "BleedBox", "TrimBox") + rect = pymupdf.Rect(100, 200, 400, 700) + for f in rect_methods: + f(rect) + for key in keys: + assert doc.xref_get_key(page.xref, key) == ("array", "[100 142 400 642]") + assert page.cropbox == page.artbox == page.bleedbox == page.trimbox + +def test_3163(): + b = {'number': 0, 'type': 0, 'bbox': (403.3577880859375, 330.8871765136719, 541.2731323242188, 349.5766296386719), 'lines': [{'spans': [{'size': 14.0, 'flags': 4, 'font': 'SFHello-Medium', 'color': 1907995, 'ascender': 1.07373046875, 'descender': -0.26123046875, 'text': 'Inclusion and diversity', 'origin': (403.3577880859375, 345.9194030761719), 'bbox': (403.3577880859375, 330.8871765136719, 541.2731323242188, 349.5766296386719)}], 'wmode': 0, 'dir': (1.0, 0.0), 'bbox': (403.3577880859375, 330.8871765136719, 541.2731323242188, 349.5766296386719)}]} + bbox = pymupdf.IRect(b["bbox"]) + +def test_3182(): + pix = pymupdf.Pixmap(os.path.abspath(f'{__file__}/../../tests/resources/img-transparent.png')) + rect = pymupdf.Rect(0, 0, 100, 100) + pix.invert_irect(rect) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_imagebbox.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_imagebbox.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,49 @@ +""" +Ensure equality of bboxes computed via +* page.get_image_bbox() +* page.get_image_info() +* page.get_bboxlog() + 
+""" +import os + +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "image-file1.pdf") +image = os.path.join(scriptdir, "resources", "img-transparent.png") +doc = pymupdf.open(filename) + + +def test_image_bbox(): + page = doc[0] + imglist = page.get_images(True) + bbox_list = [] + for item in imglist: + bbox_list.append(page.get_image_bbox(item, transform=False)) + infos = page.get_image_info(xrefs=True) + match = False + for im in infos: + bbox1 = im["bbox"] + match = False + for bbox2 in bbox_list: + abs_bbox = (bbox2 - bbox1).norm() + if abs_bbox < 1e-4: + match = True + break + assert match + + +def test_bboxlog(): + doc = pymupdf.open() + page = doc.new_page() + xref = page.insert_image(page.rect, filename=image) + img_info = page.get_image_info(xrefs=True) + assert len(img_info) == 1 + info = img_info[0] + assert info["xref"] == xref + bbox_log = page.get_bboxlog() + assert len(bbox_log) == 1 + box_type, bbox = bbox_log[0] + assert box_type == "fill-image" + assert bbox == info["bbox"] diff -r 000000000000 -r 1d09e1dec1d9 tests/test_imagemasks.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_imagemasks.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,31 @@ +""" +Confirm image mask detection in TextPage extractions. +""" + +import os + +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename1 = os.path.join(scriptdir, "resources", "img-regular.pdf") +filename2 = os.path.join(scriptdir, "resources", "img-transparent.pdf") + + +def test_imagemask1(): + doc = pymupdf.open(filename1) + page = doc[0] + blocks = page.get_text("dict")["blocks"] + img = blocks[0] + assert img["mask"] is None + img = page.get_image_info()[0] + assert img["has-mask"] is False + + +def test_imagemask2(): + doc = pymupdf.open(filename2) + page = doc[0] + blocks = page.get_text("dict")["blocks"] + img = blocks[0] + assert type(img["mask"]) is bytes + img = page.get_image_info()[0] + assert img["has-mask"] is True diff -r 000000000000 -r 1d09e1dec1d9 tests/test_import.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_import.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,18 @@ +import os +import subprocess +import sys +import textwrap + + +def test_import(): + root = os.path.abspath(f'{__file__}/../../') + p = f'{root}/tests/resources_test_import.py' + with open(p, 'w') as f: + f.write(textwrap.dedent( + ''' + from pymupdf.utils import * + from pymupdf.table import * + from pymupdf import * + ''' + )) + subprocess.run(f'{sys.executable} {p}', shell=1, check=1) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_insertimage.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_insertimage.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,66 @@ +""" +* Insert same image with different rotations in two places of a page. 
+* Extract bboxes and transformation matrices +* Assert image locations are inside given rectangles +""" +import json +import os + +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +imgfile = os.path.join(scriptdir, "resources", "nur-ruhig.jpg") + + +def test_insert(): + doc = pymupdf.open() + page = doc.new_page() + r1 = pymupdf.Rect(50, 50, 100, 100) + r2 = pymupdf.Rect(50, 150, 200, 400) + page.insert_image(r1, filename=imgfile) + page.insert_image(r2, filename=imgfile, rotate=270) + info_list = page.get_image_info() + assert len(info_list) == 2 + bbox1 = pymupdf.Rect(info_list[0]["bbox"]) + bbox2 = pymupdf.Rect(info_list[1]["bbox"]) + assert bbox1 in r1 + assert bbox2 in r2 + +def test_compress(): + document = pymupdf.open(f'{scriptdir}/resources/2.pdf') + document_new = pymupdf.open() + for page in document: + pixmap = page.get_pixmap( + colorspace=pymupdf.csRGB, + dpi=72, + annots=False, + ) + page_new = document_new.new_page(-1) + page_new.insert_image(rect=page_new.bound(), pixmap=pixmap) + document_new.save( + f'{scriptdir}/resources/2.pdf.compress.pdf', + garbage=3, + deflate=True, + deflate_images=True, + deflate_fonts=True, + pretty=True, + ) + +def test_3087(): + path = os.path.abspath(f'{__file__}/../../tests/resources/test_3087.pdf') + + doc = pymupdf.open(path) + page = doc[0] + print(page.get_images()) + base = doc.extract_image(5)["image"] + mask = doc.extract_image(5)["image"] + page = doc.new_page() + page.insert_image(page.rect, stream=base, mask=mask) + + doc = pymupdf.open(path) + page = doc[0] + print(page.get_images()) + base = doc.extract_image(5)["image"] + mask = doc.extract_image(6)["image"] + page = doc.new_page() + page.insert_image(page.rect, stream=base, mask=mask) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_insertpdf.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_insertpdf.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,335 @@ +""" +* Join multiple PDFs into a new one. +* Compare with stored earlier result: + - must have identical object definitions + - must have different trailers +* Try inserting files in a loop. +""" + +import io +import os +import re +import pymupdf +from pymupdf import mupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +resources = os.path.join(scriptdir, "resources") + +def approx_parse( text): + ''' + Splits into sequence of (text, number) pairs. Where sequence of + [0-9.] is not convertible to a number (e.g. '4.5.6'), will be + None. + ''' + ret = [] + for m in re.finditer('([^0-9]+)([0-9.]*)', text): + text = m.group(1) + try: + number = float( m.group(2)) + except Exception: + text += m.group(2) + number = None + ret.append( (text, number)) + return ret + +def approx_compare( a, b, max_delta): + ''' + Compares
and , allowing numbers to differ by up to . + ''' + aa = approx_parse( a) + bb = approx_parse( b) + if len(aa) != len(bb): + return 1 + ret = 1 + for (at, an), (bt, bn) in zip( aa, bb): + if at != bt: + break + if an is not None and bn is not None: + if abs( an - bn) >= max_delta: + print( f'diff={an-bn}: an={an} bn={bn}') + break + elif (an is None) != (bn is None): + break + else: + ret = 0 + if ret: + print( f'Differ:\n a={a!r}\n b={b!r}') + return ret + + +def test_insert(): + all_text_original = [] # text on input pages + all_text_combined = [] # text on resulting output pages + # prepare input PDFs + doc1 = pymupdf.open() + for i in range(5): # just arbitrary number of pages + text = f"doc 1, page {i}" # the 'globally' unique text + page = doc1.new_page() + page.insert_text((100, 72), text) + all_text_original.append(text) + + doc2 = pymupdf.open() + for i in range(4): + text = f"doc 2, page {i}" + page = doc2.new_page() + page.insert_text((100, 72), text) + all_text_original.append(text) + + doc3 = pymupdf.open() + for i in range(3): + text = f"doc 3, page {i}" + page = doc3.new_page() + page.insert_text((100, 72), text) + all_text_original.append(text) + + doc4 = pymupdf.open() + for i in range(6): + text = f"doc 4, page {i}" + page = doc4.new_page() + page.insert_text((100, 72), text) + all_text_original.append(text) + + new_doc = pymupdf.open() # make combined PDF of input files + new_doc.insert_pdf(doc1) + new_doc.insert_pdf(doc2) + new_doc.insert_pdf(doc3) + new_doc.insert_pdf(doc4) + # read text from all pages and store in list + for page in new_doc: + all_text_combined.append(page.get_text().replace("\n", "")) + # the lists must be equal + assert all_text_combined == all_text_original + + +def test_issue1417_insertpdf_in_loop(): + """Using a context manager instead of explicitly closing files""" + f = os.path.join(resources, "1.pdf") + big_doc = pymupdf.open() + fd1 = os.open( f, os.O_RDONLY) + os.close( fd1) + for n in range(0, 1025): + with pymupdf.open(f) as pdf: + big_doc.insert_pdf(pdf) + # Create a raw file descriptor. If the above pymupdf.open() context leaks + # a file descriptor, fd will be seen to increment. 
+ fd2 = os.open( f, os.O_RDONLY) + assert fd2 == fd1 + os.close( fd2) + big_doc.close() + + +def _test_insert_adobe(): + path = os.path.abspath( f'{__file__}/../../../PyMuPDF-performance/adobe.pdf') + if not os.path.exists(path): + print(f'Not running test_insert_adobe() because does not exist: {os.path.relpath(path)}') + return + a = pymupdf.Document() + b = pymupdf.Document(path) + a.insert_pdf(b) + + +def _2861_2871_merge_pdf(content: bytes, coverpage: bytes): + with pymupdf.Document(stream=coverpage, filetype="pdf") as coverpage_pdf: + with pymupdf.Document(stream=content, filetype="pdf") as content_pdf: + coverpage_pdf.insert_pdf(content_pdf) + doc = coverpage_pdf.write() + return doc + +def test_2861(): + path = os.path.abspath(f'{__file__}/../../tests/resources/test_2861.pdf') + with open(path, "rb") as content_pdf: + with open(path, "rb") as coverpage_pdf: + content = content_pdf.read() + coverpage = coverpage_pdf.read() + _2861_2871_merge_pdf(content, coverpage) + +def test_2871(): + path = os.path.abspath(f'{__file__}/../../tests/resources/test_2871.pdf') + with open(path, "rb") as content_pdf: + with open(path, "rb") as coverpage_pdf: + content = content_pdf.read() + coverpage = coverpage_pdf.read() + _2861_2871_merge_pdf(content, coverpage) + + +def test_3789(): + + file_path = os.path.abspath(f'{__file__}/../../tests/resources/test_3789.pdf') + result_path = os.path.abspath(f'{__file__}/../../tests/test_3789_out') + pages_per_split = 5 + + # Clean pdf + doc = pymupdf.open(file_path) + tmp = io.BytesIO() + tmp.write(doc.write(garbage=4, deflate=True)) + + source_doc = pymupdf.Document('pdf', tmp.getvalue()) + tmp.close() + + # Calculate the number of pages per split file and the number of split files + page_range = pages_per_split - 1 + split_range = range(0, source_doc.page_count, pages_per_split) + num_splits = len(split_range) + + # Loop through each split range and create a new PDF file + for i, start in enumerate(split_range): + output_doc = pymupdf.open() + + # Determine the ending page for this split file + to_page = start + page_range if i < num_splits - 1 else -1 + output_doc.insert_pdf(source_doc, from_page=start, to_page=to_page) + + # Save the output document to a file and add the path to the list of split files + path = f'{result_path}_{i}.pdf' + output_doc.save(path, garbage=2) + print(f'Have saved to {path=}.') + + # If this is the last split file, exit the loop + if to_page == -1: + break + + +def test_widget_insert(): + """Confirm copy of form fields / widgets.""" + tar = pymupdf.open(os.path.join(resources, "merge-form1.pdf")) + pc0 = tar.page_count # for later assertion + src = pymupdf.open(os.path.join(resources, "interfield-calculation.pdf")) + pc1 = src.page_count # for later assertion + + tarpdf = pymupdf._as_pdf_document(tar) + tar_field_count = mupdf.pdf_array_len( + mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/Fields") + ) + tar_co_count = mupdf.pdf_array_len( + mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/CO") + ) + srcpdf = pymupdf._as_pdf_document(src) + src_field_count = mupdf.pdf_array_len( + mupdf.pdf_dict_getp(mupdf.pdf_trailer(srcpdf), "Root/AcroForm/Fields") + ) + src_co_count = mupdf.pdf_array_len( + mupdf.pdf_dict_getp(mupdf.pdf_trailer(srcpdf), "Root/AcroForm/CO") + ) + + tar.insert_pdf(src) + new_field_count = mupdf.pdf_array_len( + mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), "Root/AcroForm/Fields") + ) + new_co_count = mupdf.pdf_array_len( + mupdf.pdf_dict_getp(mupdf.pdf_trailer(tarpdf), 
"Root/AcroForm/CO") + ) + assert tar.page_count == pc0 + pc1 + assert new_field_count == tar_field_count + src_field_count + assert new_co_count == tar_co_count + src_co_count + + +def names_and_kids(doc): + """Return a list of dictionaries with keys "name" and "kids". + + "name" is the name of a root field in "Root/AcroForm/Fields", and + "kids" is the count of its immediate children. + """ + rc = [] + pdf = pymupdf._as_pdf_document(doc) + fields = mupdf.pdf_dict_getl( + mupdf.pdf_trailer(pdf), + pymupdf.PDF_NAME("Root"), + pymupdf.PDF_NAME("AcroForm"), + pymupdf.PDF_NAME("Fields"), + ) + if not fields.pdf_is_array(): + return rc + root_count = fields.pdf_array_len() + if not root_count: + return rc + for i in range(root_count): + field = fields.pdf_array_get(i) + kids = field.pdf_dict_get(pymupdf.PDF_NAME("Kids")) + kid_count = kids.pdf_array_len() + T = field.pdf_dict_get_text_string(pymupdf.PDF_NAME("T")) + field_dict = {"name": T, "kids": kid_count} + rc.append(field_dict) + return rc + + +def test_merge_checks1(): + """Merge Form PDFs making any duplicate names unique.""" + merge_file1 = os.path.join(resources, "merge-form1.pdf") + merge_file2 = os.path.join(resources, "merge-form2.pdf") + tar = pymupdf.open(merge_file1) + rc0 = names_and_kids(tar) + src = pymupdf.open(merge_file2) + rc1 = names_and_kids(src) + tar.insert_pdf(src, join_duplicates=False) + rc2 = names_and_kids(tar) + assert len(rc2) == len(rc0) + len(rc1) + + +def test_merge_checks2(): + # Join / merge Form PDFs joining any duplicate names in the src PDF. + merge_file1 = os.path.join(resources, "merge-form1.pdf") + merge_file2 = os.path.join(resources, "merge-form2.pdf") + tar = pymupdf.open(merge_file1) + rc0 = names_and_kids(tar) # list of root names and kid counts + names0 = [itm["name"] for itm in rc0] # root names in target + kids0 = sum([itm["kids"] for itm in rc0]) # number of kids in target + + src = pymupdf.open(merge_file2) + rc1 = names_and_kids(src) # list of root namesand kids in source PDF + dup_count = 0 # counts duplicate names in source PDF + dup_kids = 0 # counts the expected kids after merge + + for itm in rc1: # walk root fields of source pdf + if itm["name"] not in names0: # not a duplicate name + continue + # if target field has kids, add their count, else add 1 + dup_kids0 = sum([i["kids"] for i in rc0 if i["name"] == itm["name"]]) + dup_kids += dup_kids0 if dup_kids0 else 1 + # if source field has kids add their count, else add 1 + dup_kids += itm["kids"] if itm["kids"] else 1 + + names1 = [itm["name"] for itm in rc1] # names in source + + tar.insert_pdf(src, join_duplicates=True) # join merging any duplicate names + + rc2 = names_and_kids(tar) # get names and kid counts in resulting PDF + names2 = [itm["name"] for itm in rc2] # resulting names in target + kids2 = sum([itm["kids"] for itm in rc2]) # total resulting kid count + + assert len(set(names0 + names1)) == len(names2) + assert kids2 == dup_kids + + +test_4412_path = os.path.normpath(f'{__file__}/../../tests/resources/test_4412.pdf') + +def test_4412(): + # This tests whether a page from a PDF containing widgets found in the wild + # can be inserted into a new document with default options (widget=True) + # and widget=False. 
+ print() + for widget in True, False: + print(f'{widget=}', flush=1) + with pymupdf.open(test_4412_path) as doc, pymupdf.open() as new_doc: + buf = io.BytesIO() + new_doc.insert_pdf(doc, from_page=1, to_page=1) + new_doc.save(buf) + assert len(new_doc)==1 + + +def test_4571(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4571.pdf') + path_out = os.path.normpath(f'{__file__}/../../tests/resources/test_4571_out.pdf') + with pymupdf.open() as newdocument: + with pymupdf.open(path) as document: + newdocument.insert_pdf(document) + newdocument.save(path_out, garbage=4, clean=False) + print(f'Have saved to: {path_out=}') + with open(path_out, 'rb') as f: + content = f.read() + if pymupdf.mupdf_version_tuple >= (1, 26, 6): + # Correct. + assert b'<>' in content + else: + # Incorrect. + assert b'<>' in content + diff -r 000000000000 -r 1d09e1dec1d9 tests/test_linebreaks.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_linebreaks.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,16 @@ +import pymupdf + +import os.path + + +def test_linebreaks(): + """Test avoidance of linebreaks.""" + path = os.path.abspath(f"{__file__}/../../tests/resources/test-linebreaks.pdf") + doc = pymupdf.open(path) + page = doc[0] + tp = page.get_textpage(flags=pymupdf.TEXTFLAGS_WORDS) + word_count = len(page.get_text("words", textpage=tp)) + line_count1 = len(page.get_text(textpage=tp).splitlines()) + line_count2 = len(page.get_text(sort=True, textpage=tp).splitlines()) + assert word_count == line_count1 + assert line_count2 < line_count1 / 2 diff -r 000000000000 -r 1d09e1dec1d9 tests/test_linequad.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_linequad.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,30 @@ +""" +Check approx. equality of search quads versus quads recovered from +text extractions. +""" +import os + +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "quad-calc-0.pdf") + + +def test_quadcalc(): + text = " angle 327" # search for this text + doc = pymupdf.open(filename) + page = doc[0] + # This special page has one block with one line, and + # its last span contains the searched text. + block = page.get_text("dict", flags=0)["blocks"][0] + line = block["lines"][0] + # compute quad of last span in line + lineq = pymupdf.recover_line_quad(line, spans=line["spans"][-1:]) + + # let text search find the text returning quad coordinates + rl = page.search_for(text, quads=True) + searchq = rl[0] + assert abs(searchq.ul - lineq.ul) <= 1e-4 + assert abs(searchq.ur - lineq.ur) <= 1e-4 + assert abs(searchq.ll - lineq.ll) <= 1e-4 + assert abs(searchq.lr - lineq.lr) <= 1e-4 diff -r 000000000000 -r 1d09e1dec1d9 tests/test_memory.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_memory.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,228 @@ +import pymupdf + +import gc +import os +import platform +import sys + + +def merge_pdf(content: bytes, coverpage: bytes): + with pymupdf.Document(stream=coverpage, filetype='pdf') as coverpage_pdf: + with pymupdf.Document(stream=content, filetype='pdf') as content_pdf: + coverpage_pdf.insert_pdf(content_pdf) + doc = coverpage_pdf.write() + return doc + +def test_2791(): + ''' + Check for memory leaks. 
+ ''' + if os.environ.get('PYMUPDF_RUNNING_ON_VALGRIND') == '1': + print(f'test_2791(): not running because PYMUPDF_RUNNING_ON_VALGRIND=1.') + return + if platform.system().startswith('MSYS_NT-'): + print(f'test_2791(): not running on msys2 - psutil not available.') + return + #stat_type = 'tracemalloc' + stat_type = 'psutil' + if stat_type == 'tracemalloc': + import tracemalloc + tracemalloc.start(10) + def get_stat(): + current, peak = tracemalloc.get_traced_memory() + return current + elif stat_type == 'psutil': + # We use RSS, as used by mprof. + import psutil + process = psutil.Process() + def get_stat(): + return process.memory_info().rss + else: + def get_stat(): + return 0 + n = 1000 + verbose = False + if platform.python_implementation() == 'GraalVM': + n = 10 + verbose = True + stats = [1] * n + for i in range(n): + if verbose: + print(f'{i+1}/{n}.', flush=1) + root = os.path.abspath(f'{__file__}/../../tests/resources') + with open(f'{root}/test_2791_content.pdf', 'rb') as content_pdf: + with open(f'{root}/test_2791_coverpage.pdf', 'rb') as coverpage_pdf: + content = content_pdf.read() + coverpage = coverpage_pdf.read() + merge_pdf(content, coverpage) + sys.stdout.flush() + + gc.collect() + stats[i] = get_stat() + + print(f'Memory usage {stat_type=}.') + for i, stat in enumerate(stats): + sys.stdout.write(f' {stat}') + #print(f' {i}: {stat}') + sys.stdout.write('\n') + first = stats[2] + last = stats[-1] + ratio = last / first + print(f'{first=} {last=} {ratio=}') + + if platform.system() != 'Linux': + # Values from psutil indicate larger memory leaks on non-Linux. Don't + # yet know whether this is because rss is measured differently or a + # genuine leak is being exposed. + print(f'test_2791(): not asserting ratio because not running on Linux.') + elif not hasattr(pymupdf, 'mupdf'): + # Classic implementation has unfixed leaks. + print(f'test_2791(): not asserting ratio because using classic implementation.') + elif [int(x) for x in platform.python_version_tuple()[:2]] < [3, 11]: + print(f'test_2791(): not asserting ratio because python version less than 3.11: {platform.python_version()=}.') + elif stat_type == 'tracemalloc': + # With tracemalloc Before fix to src/extra.i's calls to + # PyObject_CallMethodObjArgs, ratio was 4.26; after it was 1.40. + assert ratio > 1 and ratio < 1.6 + elif stat_type == 'psutil': + # Prior to fix, ratio was 1.043. After the fix, improved to 1.005, but + # varies and sometimes as high as 1.010. + # 2024-06-03: have seen 0.99919 on musl linux, and sebras reports .025. + assert ratio >= 0.990 and ratio < 1.027, f'{ratio=}' + else: + pass + + +def test_4090(): + print(f'test_4090(): {os.environ.get("PYTHONMALLOC")=}.') + import psutil + process = psutil.Process() + rsss = list() + def rss(): + ret = process.memory_info().rss + rsss.append(ret) + return ret + + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4090.pdf') + for i in range(100): + d = dict() + d[i] = dict() + with pymupdf.open(path) as document: + for j, page in enumerate(document): + d[i][j] = page.get_text('rawdict') + print(f'test_4090(): {i}: {rss()=}') + print(f'test_4090(): {rss()=}') + gc.collect() + print(f'test_4090(): {rss()=}') + r1 = rsss[2] + r2 = rsss[-1] + r = r2 / r1 + if platform.system() == 'Windows': + assert 0.93 <= r < 1.05, f'{r1=} {r2=} {r=}.' + else: + assert 0.95 <= r < 1.05, f'{r1=} {r2=} {r=}.' 
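+
+
+def _rss_samples(fn, n=10):
+    # Illustrative helper sketch (the name `_rss_samples` is hypothetical and
+    # not referenced by the tests above; it assumes psutil is installed, as
+    # those tests already require): call `fn` n times and return the RSS
+    # sampled after each iteration, so early and late values can be compared
+    # when hunting for leaks.
+    import psutil
+    process = psutil.Process()
+    rsss = list()
+    for _ in range(n):
+        fn()
+        gc.collect()
+        rsss.append(process.memory_info().rss)
+    return rsss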
+ + +def show_tracemalloc_diff(snapshot1, snapshot2): + top_stats = snapshot2.compare_to(snapshot1, 'lineno') + n = 0 + mem = 0 + for i in top_stats: + n += i.count + mem += i.size + print(f'{n=}') + print(f'{mem=}') + print("Top 10:") + for stat in top_stats[:10]: + print(f' {stat}') + snapshot_diff = snapshot2.compare_to(snapshot1, key_type='lineno') + print(f'snapshot_diff:') + count_diff = 0 + size_diff = 0 + for i, s in enumerate(snapshot_diff): + print(f' {i}: {s.count=} {s.count_diff=} {s.size=} {s.size_diff=} {s.traceback=}') + count_diff += s.count_diff + size_diff += s.size_diff + print(f'{count_diff=} {size_diff=}') + + + +def test_4125(): + if os.environ.get('PYMUPDF_RUNNING_ON_VALGRIND') == '1': + print(f'test_4125(): not running because PYMUPDF_RUNNING_ON_VALGRIND=1.') + return + if platform.system().startswith('MSYS_NT-'): + print(f'test_4125(): not running on msys2 - psutil not available.') + return + + print('') + print(f'test_4125(): {platform.python_version()=}.') + + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4125.pdf') + import gc + import psutil + + root = os.path.normpath(f'{__file__}/../..') + sys.path.insert(0, root) + try: + import pipcl + finally: + del sys.path[0] + + process = psutil.Process() + + class State: pass + state = State() + state.rsss = list() + state.prev = None + + def get_stat(): + rss = process.memory_info().rss + if not state.rsss: + state.prev = rss + state.rsss.append(rss) + drss = rss - state.prev + state.prev = rss + print(f'test_4125():' + f' {rss=:,}' + f' rss-rss0={rss-state.rsss[0]:,}' + f' drss={drss:,}' + f'.' + ) + + for i in range(10): + with pymupdf.open(path) as document: + for page in document: + for image_info in page.get_images(full=True): + xref, smask, width, height, bpc, colorspace, alt_colorspace, name, filter_, referencer = image_info + pixmap = pymupdf.Pixmap(document, xref) + if pixmap.colorspace != pymupdf.csRGB: + pixmap2 = pymupdf.Pixmap(pymupdf.csRGB, pixmap) + del pixmap2 + del pixmap + pymupdf.TOOLS.store_shrink(100) + pymupdf.TOOLS.glyph_cache_empty() + gc.collect() + get_stat() + + if platform.system() == 'Linux': + rss_delta = state.rsss[-1] - state.rsss[3] + print(f'{rss_delta=}') + pv = platform.python_version_tuple() + pv = (int(pv[0]), int(pv[1])) + if pv < (3, 11): + # Python < 3.11 has less reliable memory usage so we exclude. + print(f'test_4125(): Not checking on {platform.python_version()=} because < 3.11.') + elif pymupdf.mupdf_version_tuple < (1, 25, 2): + rss_delta_expected = 4915200 * (len(state.rsss) - 3) + assert abs(1 - rss_delta / rss_delta_expected) < 0.15, f'{rss_delta_expected=}' + else: + # Before the fix, each iteration would leak 4.9MB. + rss_delta_max = 100*1000 * (len(state.rsss) - 3) + assert rss_delta < rss_delta_max + else: + # Unfortunately on non-Linux Github test machines the RSS values seem + # to vary a lot, which causes spurious test failures. So for at least + # we don't actually check. + # + print(f'Not checking results because non-Linux behaviour is too variable.') diff -r 000000000000 -r 1d09e1dec1d9 tests/test_metadata.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_metadata.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,43 @@ +""" +1. Read metadata and compare with stored expected result. +2. Erase metadata and assert object has indeed been deleted. 
+""" +import json +import os +import sys + +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "001003ED.pdf") +metafile = os.path.join(scriptdir, "resources", "metadata.txt") +doc = pymupdf.open(filename) + + +def test_metadata(): + assert json.dumps(doc.metadata) == open(metafile).read() + + +def test_erase_meta(): + doc.set_metadata({}) + # Check PDF trailer and assert that there is no more /Info object + # or is set to "null". + statement1 = doc.xref_get_key(-1, "Info")[1] == "null" + statement2 = "Info" not in doc.xref_get_keys(-1) + assert statement2 or statement1 + + +def test_3237(): + filename = os.path.abspath(f'{__file__}/../../tests/resources/001003ED.pdf') + with pymupdf.open(filename) as doc: + # We need to explicitly encode in utf8 on windows. + metadata1 = doc.metadata + metadata1 = repr(metadata1).encode('utf8') + doc.set_metadata({}) + + metadata2 = doc.metadata + metadata2 = repr(metadata2).encode('utf8') + print(f'{metadata1=}') + print(f'{metadata2=}') + assert metadata1 == b'{\'format\': \'PDF 1.6\', \'title\': \'RUBRIK_Editorial_01-06.indd\', \'author\': \'Natalie Schaefer\', \'subject\': \'\', \'keywords\': \'\', \'creator\': \'\', \'producer\': \'Acrobat Distiller 7.0.5 (Windows)\', \'creationDate\': "D:20070113191400+01\'00\'", \'modDate\': "D:20070120104154+01\'00\'", \'trapped\': \'\', \'encryption\': None}' + assert metadata2 == b"{'format': 'PDF 1.6', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': '', 'encryption': None}" diff -r 000000000000 -r 1d09e1dec1d9 tests/test_mupdf_regressions.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_mupdf_regressions.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,98 @@ +import pymupdf +import os +import gentle_compare + +scriptdir = os.path.abspath(os.path.dirname(__file__)) + + +def test_707448(): + """Confirm page content cleaning does not destroy page appearance.""" + filename = os.path.join(scriptdir, "resources", "test-707448.pdf") + doc = pymupdf.open(filename) + page = doc[0] + words0 = page.get_text("words") + page.clean_contents(sanitize=True) + words1 = page.get_text("words") + assert gentle_compare.gentle_compare(words0, words1) + + +def test_707673(): + """Confirm page content cleaning does not destroy page appearance. + + Fails starting with MuPDF v1.23.9. + + Fixed in: + commit 779b8234529cb82aa1e92826854c7bb98b19e44b (golden/master) + """ + filename = os.path.join(scriptdir, "resources", "test-707673.pdf") + doc = pymupdf.open(filename) + page = doc[0] + words0 = page.get_text("words") + page.clean_contents(sanitize=True) + words1 = page.get_text("words") + ok = gentle_compare.gentle_compare(words0, words1) + assert ok + + +def test_707727(): + """Confirm page content cleaning does not destroy page appearance. 
+ + MuPDF issue: https://bugs.ghostscript.com/show_bug.cgi?id=707727 + """ + filename = os.path.join(scriptdir, "resources", "test_3362.pdf") + doc = pymupdf.open(filename) + page = doc[0] + pix0 = page.get_pixmap() + page.clean_contents(sanitize=True) + page = doc.reload_page(page) # required to prevent re-use + pix1 = page.get_pixmap() + rms = gentle_compare.pixmaps_rms(pix0, pix1) + print(f'{rms=}', flush=1) + pix0.save(os.path.normpath(f'{__file__}/../../tests/test_707727_pix0.png')) + pix1.save(os.path.normpath(f'{__file__}/../../tests/test_707727_pix1.png')) + if pymupdf.mupdf_version_tuple >= (1, 25, 2): + # New sanitising gives small fp rounding errors. + assert rms < 0.05 + else: + assert rms == 0 + + +def test_707721(): + """Confirm text extraction works for nested MCID with Type 3 fonts. + PyMuPDF issue https://github.com/pymupdf/PyMuPDF/issues/3357 + MuPDF issue: https://bugs.ghostscript.com/show_bug.cgi?id=707721 + """ + filename = os.path.join(scriptdir, "resources", "test_3357.pdf") + doc = pymupdf.open(filename) + page = doc[0] + ok = page.get_text() + assert ok + + +def test_3376(): + """Check fix of MuPDF bug 707733. + + https://bugs.ghostscript.com/show_bug.cgi?id=707733 + PyMuPDF issue https://github.com/pymupdf/PyMuPDF/issues/3376 + + Test file contains a redaction for the first 3 words: "Table of Contents". + Test strategy: + - extract all words (sorted) + - apply redactions + - extract words again + - confirm: we now have 3 words less and remaining words are equal. + """ + filename = os.path.join(scriptdir, "resources", "test_3376.pdf") + doc = pymupdf.open(filename) + page = doc[0] + words0 = page.get_text("words", sort=True) + words0_s = words0[:3] # first 3 words + words0_e = words0[3:] # remaining words + assert " ".join([w[4] for w in words0_s]) == "Table of Contents" + + page.apply_redactions() + + words1 = page.get_text("words", sort=True) + + ok = gentle_compare.gentle_compare(words0_e, words1) + assert ok diff -r 000000000000 -r 1d09e1dec1d9 tests/test_named_links.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_named_links.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,105 @@ +import pymupdf + +import os + + +def test_2886(): + """Confirm correct insertion of a 'named' link.""" + if not hasattr(pymupdf, "mupdf"): + print(f"test_2886(): not running on classic.") + return + + path = os.path.abspath(f"{__file__}/../../tests/resources/cython.pdf") + doc = pymupdf.open(path) + # name "Doc-Start" is a valid named destination in that file + link = { + "kind": pymupdf.LINK_NAMED, + "from": pymupdf.Rect(0, 0, 50, 50), + "name": "Doc-Start", + } + # insert this link in an arbitrary page & rect + page = doc[-1] + page.insert_link(link) + # need this to update the internal MuPDF annotations array + page = doc.reload_page(page) + + # our new link must be the last in the following list + links = page.get_links() + l_dict = links[-1] + assert l_dict["kind"] == pymupdf.LINK_NAMED + assert l_dict["nameddest"] == link["name"] + assert l_dict["from"] == link["from"] + + +def test_2922(): + """Confirm correct recycling of a 'named' link. + + Re-insertion of a named link item in 'Page.get_links()' does not have + the required "name" key. We test the fallback here that uses key + "nameddest" instead. 
+ """ + if not hasattr(pymupdf, "mupdf"): + print(f"test_2922(): not running on classic.") + return + + path = os.path.abspath(f"{__file__}/../../tests/resources/cython.pdf") + doc = pymupdf.open(path) + page = doc[2] # page has a few links, all are named + links = page.get_links() # list of all links + link0 = links[0] # take arbitrary link (1st one is ok) + page.insert_link(link0) # insert it again + page = doc.reload_page(page) # ensure page updates + links = page.get_links() # access all links again + link1 = links[-1] # re-inserted link + + # confirm equality of relevant key-values + assert link0["nameddest"] == link1["nameddest"] + assert link0["page"] == link1["page"] + assert link0["to"] == link1["to"] + assert link0["from"] == link1["from"] + + +def test_3301(): + """Test correct differentiation between URI and LAUNCH links. + + Links encoded as /URI in PDF are converted to either LINK_URI or + LINK_LAUNCH in PyMuPDF. + This function ensures that the 'Link.uri' containing a ':' colon + is converted to a URI if not explicitly starting with "file://". + """ + if not hasattr(pymupdf, "mupdf"): + print(f"test_3301(): not running on classic.") + return + + # list of links and their expected link "kind" upon extraction + text = { + "https://www.google.de": pymupdf.LINK_URI, + "http://www.google.de": pymupdf.LINK_URI, + "mailto:jorj.x.mckie@outlook.de": pymupdf.LINK_URI, + "www.wikipedia.de": pymupdf.LINK_LAUNCH, + "awkward:resource": pymupdf.LINK_URI, + "ftp://www.google.de": pymupdf.LINK_URI, + "some.program": pymupdf.LINK_LAUNCH, + "file://some.program": pymupdf.LINK_LAUNCH, + "another.exe": pymupdf.LINK_LAUNCH, + } + + # make enough "from" rectangles + r = pymupdf.Rect(0, 0, 50, 20) + rects = [r + (0, r.height * i, 0, r.height * i) for i in range(len(text.keys()))] + + # make test page and insert above links as kind=LINK_URI + doc = pymupdf.open() + page = doc.new_page() + for i, k in enumerate(text.keys()): + link = {"kind": pymupdf.LINK_URI, "uri": k, "from": rects[i]} + page.insert_link(link) + + # re-cycle the PDF preparing for link extraction + pdfdata = doc.write() + doc = pymupdf.open("pdf", pdfdata) + page = doc[0] + for link in page.get_links(): + # Extract the link text. Must be 'file' or 'uri'. 
+ t = link["uri"] if (_ := link.get("file")) is None else _ + assert text[t] == link["kind"] diff -r 000000000000 -r 1d09e1dec1d9 tests/test_nonpdf.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_nonpdf.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,35 @@ +""" +* Check EPUB document is no PDF +* Check page access using (chapter, page) notation +* Re-layout EPUB ensuring a previous location is memorized +""" +import os + +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "Bezier.epub") +doc = pymupdf.open(filename) + + +def test_isnopdf(): + assert not doc.is_pdf + + +def test_pageids(): + assert doc.chapter_count == 7 + assert doc.last_location == (6, 1) + assert doc.prev_location((6, 0)) == (5, 11) + assert doc.next_location((5, 11)) == (6, 0) + # Check page numbers have no gaps: + i = 0 + for chapter in range(doc.chapter_count): + for cpno in range(doc.chapter_page_count(chapter)): + assert doc.page_number_from_location((chapter, cpno)) == i + i += 1 + +def test_layout(): + """Memorize a page location, re-layout with ISO-A4, assert pre-determined location.""" + loc = doc.make_bookmark((5, 11)) + doc.layout(pymupdf.Rect(pymupdf.paper_rect("a4"))) + assert doc.find_bookmark(loc) == (5, 6) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_object_manipulation.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_object_manipulation.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,74 @@ +""" +Check some low-level PDF object manipulations: +1. Set page rotation and compare with string in object definition. +2. Set page rotation via string manipulation and compare with result of + proper page property. +3. Read the PDF trailer and verify it has the keys "/Root", "/ID", etc. 
+""" +import pymupdf +import os + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +resources = os.path.join(scriptdir, "resources") +filename = os.path.join(resources, "001003ED.pdf") + + +def test_rotation1(): + doc = pymupdf.open() + page = doc.new_page() + page.set_rotation(270) + assert doc.xref_get_key(page.xref, "Rotate") == ("int", "270") + + +def test_rotation2(): + doc = pymupdf.open() + page = doc.new_page() + doc.xref_set_key(page.xref, "Rotate", "270") + assert page.rotation == 270 + + +def test_trailer(): + """Access PDF trailer information.""" + doc = pymupdf.open(filename) + xreflen = doc.xref_length() + _, xreflen_str = doc.xref_get_key(-1, "Size") + assert xreflen == int(xreflen_str) + trailer_keys = doc.xref_get_keys(-1) + assert "ID" in trailer_keys + assert "Root" in trailer_keys + + +def test_valid_name(): + """Verify correct PDF names in method xref_set_key.""" + doc = pymupdf.open() + page = doc.new_page() + + # testing name in "key": confirm correct spec is accepted + doc.xref_set_key(page.xref, "Rotate", "90") + assert page.rotation == 90 + + # check wrong spec is detected + error_generated = False + try: + # illegal char in name (white space) + doc.xref_set_key(page.xref, "my rotate", "90") + except ValueError as e: + assert str(e) == "bad 'key'" + error_generated = True + assert error_generated + + # test name in "value": confirm correct spec is accepted + doc.xref_set_key(page.xref, "my_rotate/something", "90") + assert doc.xref_get_key(page.xref, "my_rotate/something") == ("int", "90") + doc.xref_set_key(page.xref, "my_rotate", "/90") + assert doc.xref_get_key(page.xref, "my_rotate") == ("name", "/90") + + # check wrong spec is detected + error_generated = False + try: + # no slash inside name allowed + doc.xref_set_key(page.xref, "my_rotate", "/9/0") + except ValueError as e: + assert str(e) == "bad 'value'" + error_generated = True + assert error_generated diff -r 000000000000 -r 1d09e1dec1d9 tests/test_objectstreams.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_objectstreams.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,83 @@ +import pymupdf + + +def test_objectstream1(): + """Test save option "use_objstms". + This option compresses PDF object definitions into a special object type + "ObjStm". We test its presence by searching for that /Type. + """ + if not hasattr(pymupdf, "mupdf"): + # only implemented for rebased + return + + # make some arbitrary page with content + text = "Hello, World! Hallo, Welt!" + doc = pymupdf.open() + page = doc.new_page() + rect = (50, 50, 200, 500) + + page.insert_htmlbox(rect, text) # place into the rectangle + _ = doc.write(use_objstms=True) + found = False + for xref in range(1, doc.xref_length()): + objstring = doc.xref_object(xref, compressed=True) + if "/Type/ObjStm" in objstring: + found = True + break + assert found, "No object stream found" + + +def test_objectstream2(): + """Test save option "use_objstms". + This option compresses PDF object definitions into a special object type + "ObjStm". We test its presence by searching for that /Type. + """ + if not hasattr(pymupdf, "mupdf"): + # only implemented for rebased + return + + # make some arbitrary page with content + text = "Hello, World! Hallo, Welt!" 
+ doc = pymupdf.open() + page = doc.new_page() + rect = (50, 50, 200, 500) + + page.insert_htmlbox(rect, text) # place into the rectangle + _ = doc.write(use_objstms=False) + + found = False + for xref in range(1, doc.xref_length()): + objstring = doc.xref_object(xref, compressed=True) + if "/Type/ObjStm" in objstring: + found = True + break + assert not found, "Unexpected: Object stream found!" + + +def test_objectstream3(): + """Test ez_save(). + Should automatically use object streams + """ + if not hasattr(pymupdf, "mupdf"): + # only implemented for rebased + return + import io + + fp = io.BytesIO() + + # make some arbitrary page with content + text = "Hello, World! Hallo, Welt!" + doc = pymupdf.open() + page = doc.new_page() + rect = (50, 50, 200, 500) + + page.insert_htmlbox(rect, text) # place into the rectangle + + doc.ez_save(fp) # save PDF to memory + found = False + for xref in range(1, doc.xref_length()): + objstring = doc.xref_object(xref, compressed=True) + if "/Type/ObjStm" in objstring: + found = True + break + assert found, "No object stream found!" diff -r 000000000000 -r 1d09e1dec1d9 tests/test_optional_content.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_optional_content.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,139 @@ +""" +Test of Optional Content code. +""" + +import os + +import pymupdf + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +filename = os.path.join(scriptdir, "resources", "joined.pdf") + + +def test_oc1(): + """Arbitrary calls to OC code to get coverage.""" + doc = pymupdf.open() + ocg1 = doc.add_ocg("ocg1") + ocg2 = doc.add_ocg("ocg2") + ocg3 = doc.add_ocg("ocg3") + ocmd1 = doc.set_ocmd(xref=0, ocgs=(ocg1, ocg2)) + doc.set_layer(-1) + doc.add_layer("layer1") + test = doc.get_layer() + test = doc.get_layers() + test = doc.get_ocgs() + test = doc.layer_ui_configs() + doc.switch_layer(0) + + +def test_oc2(): + # source file with at least 4 pages + src = pymupdf.open(filename) + + # new PDF with one page + doc = pymupdf.open() + page = doc.new_page() + + # define the 4 rectangle quadrants to receive the source pages + r0 = page.rect / 2 + r1 = r0 + (r0.width, 0, r0.width, 0) + r2 = r0 + (0, r0.height, 0, r0.height) + r3 = r2 + (r2.width, 0, r2.width, 0) + + # make 4 OCGs - one for each source page image. 
+ # only first is ON initially + ocg0 = doc.add_ocg("ocg0", on=True) + ocg1 = doc.add_ocg("ocg1", on=False) + ocg2 = doc.add_ocg("ocg2", on=False) + ocg3 = doc.add_ocg("ocg3", on=False) + + ocmd0 = doc.set_ocmd(ve=["and", ocg0, ["not", ["or", ocg1, ocg2, ocg3]]]) + ocmd1 = doc.set_ocmd(ve=["and", ocg1, ["not", ["or", ocg0, ocg2, ocg3]]]) + ocmd2 = doc.set_ocmd(ve=["and", ocg2, ["not", ["or", ocg1, ocg0, ocg3]]]) + ocmd3 = doc.set_ocmd(ve=["and", ocg3, ["not", ["or", ocg1, ocg2, ocg0]]]) + ocmds = (ocmd0, ocmd1, ocmd2, ocmd3) + # insert the 4 source page images, each connected to one OCG + page.show_pdf_page(r0, src, 0, oc=ocmd0) + page.show_pdf_page(r1, src, 1, oc=ocmd1) + page.show_pdf_page(r2, src, 2, oc=ocmd2) + page.show_pdf_page(r3, src, 3, oc=ocmd3) + xobj_ocmds = [doc.get_oc(item[0]) for item in page.get_xobjects() if item[1] != 0] + assert set(ocmds) <= set(xobj_ocmds) + assert set((ocg0, ocg1, ocg2, ocg3)) == set(tuple(doc.get_ocgs().keys())) + doc.get_ocmd(ocmd0) + page.get_oc_items() + + +def test_3143(): + """Support for non-ascii layer names.""" + doc = pymupdf.open(os.path.join(scriptdir, "resources", "test-3143.pdf")) + page = doc[0] + set0 = set([l["text"] for l in doc.layer_ui_configs()]) + set1 = set([p["layer"] for p in page.get_drawings()]) + set2 = set([b[2] for b in page.get_bboxlog(layers=True)]) + assert set0 == set1 == set2 + + +def test_3180(): + doc = pymupdf.open() + page = doc.new_page() + + # Define the items for the combo box + combo_items = ['first', 'second', 'third'] + + # Create a combo box field + combo_box = pymupdf.Widget() # create a new widget + combo_box.field_type = pymupdf.PDF_WIDGET_TYPE_COMBOBOX + combo_box.field_name = "myComboBox" + combo_box.field_value = combo_items[0] + combo_box.choice_values = combo_items + combo_box.rect = pymupdf.Rect(50, 50, 200, 75) # position of the combo box + combo_box.script_change = """ + var value = event.value; + app.alert('You selected: ' + value); + + //var group_id = optional_content_group_ids[value]; + + """ + + # Insert the combo box into the page + # https://pymupdf.readthedocs.io/en/latest/page.html#Page.add_widget + page.add_widget(combo_box) + + # Create optional content groups + # https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/jupyter-notebooks/optional-content.ipynb + + + # Load images and create OCGs for each + optional_content_group_ids = {} + for i, item in enumerate(combo_items): + optional_content_group_id = doc.add_ocg(item, on=False) + optional_content_group_ids[item] = optional_content_group_id + rect = pymupdf.Rect(50, 100, 250, 300) + image_file_name = f'{item}.png' + # xref = page.insert_image( + # rect, + # filename=image_file_name, + # oc=optional_content_group_id, + # ) + + + first_id = optional_content_group_ids['first'] + second_id = optional_content_group_ids['second'] + third_id = optional_content_group_ids['third'] + + # https://pymupdf.readthedocs.io/en/latest/document.html#Document.set_layer + + + doc.set_layer(-1, basestate="OFF") + layers = doc.get_layer() + doc.set_layer(config=-1, on=[first_id]) + + # https://pymupdf.readthedocs.io/en/latest/document.html#Document.set_layer_ui_config + # configs = doc.layer_ui_configs() + # doc.set_layer_ui_config(0, pymupdf.PDF_OC_ON) + # doc.set_layer_ui_config('third', action=2) + + # Save the PDF + doc.save(os.path.abspath(f'{__file__}/../../tests/test_3180.pdf')) + doc.close() diff -r 000000000000 -r 1d09e1dec1d9 tests/test_page_links.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_page_links.py Mon Sep 15 
11:37:51 2025 +0200 @@ -0,0 +1,17 @@ +import pymupdf + +import os + + +def test_page_links_generator(): + # open some arbitrary PDF + path = os.path.abspath(f"{__file__}/../../tests/resources/2.pdf") + doc = pymupdf.open(path) + + # select an arbitrary page + page = doc[-1] + + # iterate over pages.links + link_generator = page.links() + links = list(link_generator) + assert len(links) == 7 diff -r 000000000000 -r 1d09e1dec1d9 tests/test_pagedelete.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_pagedelete.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,115 @@ +""" +---------------------------------------------------- +This tests correct functioning of multi-page delete +---------------------------------------------------- +Create a PDF in memory with 100 pages with a unique text each. +Also create a TOC with a bookmark per page. +On every page after the first to-be-deleted page, also insert a link, which +points to this page. +The bookmark text equals the text on the page for easy verification. + +Then delete some pages and verify: +- the new TOC has empty items exactly for every deleted page +- the remaining TOC items still point to the correct page +- the document has no more links at all +""" + +import os + +import pymupdf + +scriptdir = os.path.dirname(__file__) +page_count = 100 # initial document length +r = range(5, 35, 5) # contains page numbers we will delete +# insert this link on pages after first deleted one +link = { + "from": pymupdf.Rect(100, 100, 120, 120), + "kind": pymupdf.LINK_GOTO, + "page": r[0], + "to": pymupdf.Point(100, 100), +} + + +def test_deletion(): + # First prepare the document. + doc = pymupdf.open() + toc = [] + for i in range(page_count): + page = doc.new_page() # make a page + page.insert_text((100, 100), "%i" % i) # insert unique text + if i > r[0]: # insert a link + page.insert_link(link) + toc.append([1, "%i" % i, i + 1]) # TOC bookmark to this page + + doc.set_toc(toc) # insert the TOC + assert doc.has_links() # check we did insert links + + # Test page deletion. + # Delete pages in range and verify result + del doc[r] + assert not doc.has_links() # verify all links have gone + assert doc.page_count == page_count - len(r) # correct number deleted? + toc_new = doc.get_toc() # this is the modified TOC + # verify number of emptied items (have page number -1) + assert len([item for item in toc_new if item[-1] == -1]) == len(r) + # Deleted page numbers must correspond to TOC items with page number -1. + for i in r: + assert toc_new[i][-1] == -1 + # Remaining pages must be correctly pointed to by the non-empty TOC items + for item in toc_new: + pno = item[-1] + if pno == -1: # one of the emptied items + continue + pno -= 1 # PDF page number + text = doc[pno].get_text().replace("\n", "") + # toc text must equal text on page + assert text == item[1] + + doc.delete_page(0) # just for the coverage stats + del doc[5:10] + doc.select(range(doc.page_count)) + doc.copy_page(0) + doc.move_page(0) + doc.fullcopy_page(0) + + +def test_3094(): + path = os.path.abspath(f"{__file__}/../../tests/resources/test_2871.pdf") + document = pymupdf.open(path) + pnos = [i for i in range(0, document.page_count, 2)] + document.delete_pages(pnos) + + +def test_3150(): + """Assert correct functioning for problem file. + + Implicitly also check use of new MuPDF function + pdf_rearrange_pages() since version 1.23.9. 
+ """ + filename = os.path.join(scriptdir, "resources", "test-3150.pdf") + pages = [3, 3, 3, 2, 3, 1, 0, 0] + doc = pymupdf.open(filename) + doc.select(pages) + assert doc.page_count == len(pages) + + +def test_4462(): + path0 = os.path.normpath(f'{__file__}/../../tests/resources/test_4462_0.pdf') + path1 = os.path.normpath(f'{__file__}/../../tests/resources/test_4462_1.pdf') + path2 = os.path.normpath(f'{__file__}/../../tests/resources/test_4462_2.pdf') + with pymupdf.open() as document: + document.new_page() + document.new_page() + document.new_page() + document.new_page() + document.save(path0) + with pymupdf.open(path0) as document: + assert len(document) == 4 + document.delete_page(-1) + document.save(path1) + with pymupdf.open(path1) as document: + assert len(document) == 3 + document.delete_pages(-1) + document.save(path2) + with pymupdf.open(path2) as document: + assert len(document) == 2 diff -r 000000000000 -r 1d09e1dec1d9 tests/test_pagelabels.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_pagelabels.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,58 @@ +""" +Define some page labels in a PDF. +Check success in various aspects. +""" + +import pymupdf + + +def make_doc(): + """Makes a PDF with 10 pages.""" + doc = pymupdf.open() + for i in range(10): + page = doc.new_page() + return doc + + +def make_labels(): + """Return page label range rules. + - Rule 1: labels like "A-n", page 0 is first and has "A-1". + - Rule 2: labels as capital Roman numbers, page 4 is first and has "I". + """ + return [ + {"startpage": 0, "prefix": "A-", "style": "D", "firstpagenum": 1}, + {"startpage": 4, "prefix": "", "style": "R", "firstpagenum": 1}, + ] + + +def test_setlabels(): + """Check setting and inquiring page labels. + - Make a PDF with 10 pages + - Label pages + - Inquire labels of pages + - Get list of page numbers for a given label. 
+ """ + doc = make_doc() + doc.set_page_labels(make_labels()) + page_labels = [p.get_label() for p in doc] + answer = ["A-1", "A-2", "A-3", "A-4", "I", "II", "III", "IV", "V", "VI"] + assert page_labels == answer, f"page_labels={page_labels}" + assert doc.get_page_numbers("V") == [8] + assert doc.get_page_labels() == make_labels() + + +def test_labels_styleA(): + """Test correct indexing for styles "a", "A".""" + doc = make_doc() + labels = [ + {"startpage": 0, "prefix": "", "style": "a", "firstpagenum": 1}, + {"startpage": 5, "prefix": "", "style": "A", "firstpagenum": 1}, + ] + doc.set_page_labels(labels) + pdfdata = doc.tobytes() + doc.close() + doc = pymupdf.open("pdf", pdfdata) + answer = ["a", "b", "c", "d", "e", "A", "B", "C", "D", "E"] + page_labels = [page.get_label() for page in doc] + assert page_labels == answer + assert doc.get_page_labels() == labels diff -r 000000000000 -r 1d09e1dec1d9 tests/test_pixmap.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_pixmap.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,630 @@ +""" +Pixmap tests +* make pixmap of a page and assert bbox size +* make pixmap from a PDF xref and compare with extracted image +* pixmap from file and from binary image and compare +""" + +import pymupdf +import gentle_compare + +import os +import platform +import subprocess +import sys +import tempfile +import pytest +import textwrap +import time + + +scriptdir = os.path.abspath(os.path.dirname(__file__)) +epub = os.path.join(scriptdir, "resources", "Bezier.epub") +pdf = os.path.join(scriptdir, "resources", "001003ED.pdf") +imgfile = os.path.join(scriptdir, "resources", "nur-ruhig.jpg") + + +def test_pagepixmap(): + # pixmap from an EPUB page + doc = pymupdf.open(epub) + page = doc[0] + pix = page.get_pixmap() + assert pix.irect == page.rect.irect + pix = page.get_pixmap(alpha=True) + assert pix.alpha + assert pix.n == pix.colorspace.n + pix.alpha + + +def test_pdfpixmap(): + # pixmap from xref in a PDF + doc = pymupdf.open(pdf) + # take first image item of first page + img = doc.get_page_images(0)[0] + # make pixmap of it + pix = pymupdf.Pixmap(doc, img[0]) + # assert pixmap properties + assert pix.width == img[2] + assert pix.height == img[3] + # extract image and compare metadata + extractimg = doc.extract_image(img[0]) + assert extractimg["width"] == pix.width + assert extractimg["height"] == pix.height + + +def test_filepixmap(): + # pixmaps from file and from stream + # should lead to same result + pix1 = pymupdf.Pixmap(imgfile) + stream = open(imgfile, "rb").read() + pix2 = pymupdf.Pixmap(stream) + assert repr(pix1) == repr(pix2) + assert pix1.digest == pix2.digest + + +def test_pilsave(): + # pixmaps from file then save to pillow image + # make pixmap from this and confirm equality + try: + pix1 = pymupdf.Pixmap(imgfile) + stream = pix1.pil_tobytes("JPEG") + pix2 = pymupdf.Pixmap(stream) + assert repr(pix1) == repr(pix2) + except ModuleNotFoundError: + assert platform.system() == 'Windows' and sys.maxsize == 2**31 - 1 + + +def test_save(tmpdir): + # pixmaps from file then save to image + # make pixmap from this and confirm equality + pix1 = pymupdf.Pixmap(imgfile) + outfile = os.path.join(tmpdir, "foo.png") + pix1.save(outfile, output="png") + # read it back + pix2 = pymupdf.Pixmap(outfile) + assert repr(pix1) == repr(pix2) + + +def test_setalpha(): + # pixmap from JPEG file, then add an alpha channel + # with 30% transparency + pix1 = pymupdf.Pixmap(imgfile) + opa = int(255 * 0.3) # corresponding to 30% transparency + alphas = [opa] * 
(pix1.width * pix1.height) + alphas = bytearray(alphas) + pix2 = pymupdf.Pixmap(pix1, 1) # add alpha channel + pix2.set_alpha(alphas) # make image 30% transparent + samples = pix2.samples # copy of samples + # confirm correct the alpha bytes + t = bytearray([samples[i] for i in range(3, len(samples), 4)]) + assert t == alphas + +def test_color_count(): + ''' + This is known to fail if MuPDF is built without PyMuPDF's custom config.h, + e.g. in Linux system installs. + ''' + pm = pymupdf.Pixmap(imgfile) + assert pm.color_count() == 40624 + +def test_memoryview(): + pm = pymupdf.Pixmap(imgfile) + samples = pm.samples_mv + assert isinstance( samples, memoryview) + print( f'samples={samples} samples.itemsize={samples.itemsize} samples.nbytes={samples.nbytes} samples.ndim={samples.ndim} samples.shape={samples.shape} samples.strides={samples.strides}') + assert samples.itemsize == 1 + assert samples.nbytes == 659817 + assert samples.ndim == 1 + assert samples.shape == (659817,) + assert samples.strides == (1,) + + color = pm.pixel( 100, 100) + print( f'color={color}') + assert color == (83, 66, 40) + +def test_samples_ptr(): + pm = pymupdf.Pixmap(imgfile) + samples = pm.samples_ptr + print( f'samples={samples}') + assert isinstance( samples, int) + +def test_2369(): + + width, height = 13, 37 + image = pymupdf.Pixmap(pymupdf.csGRAY, width, height, b"\x00" * (width * height), False) + + with pymupdf.Document(stream=image.tobytes(output="pam"), filetype="pam") as doc: + test_pdf_bytes = doc.convert_to_pdf() + + with pymupdf.Document(stream=test_pdf_bytes) as doc: + page = doc[0] + img_xref = page.get_images()[0][0] + img = doc.extract_image(img_xref) + img_bytes = img["image"] + pymupdf.Pixmap(img_bytes) + +def test_page_idx_int(): + doc = pymupdf.open(pdf) + with pytest.raises(AssertionError): + doc["0"] + assert doc[0] + assert doc[(0,0)] + +def test_fz_write_pixmap_as_jpeg(): + width, height = 13, 37 + image = pymupdf.Pixmap(pymupdf.csGRAY, width, height, b"\x00" * (width * height), False) + + with pymupdf.Document(stream=image.tobytes(output="jpeg"), filetype="jpeg") as doc: + test_pdf_bytes = doc.convert_to_pdf() + +def test_3020(): + pm = pymupdf.Pixmap(imgfile) + pm2 = pymupdf.Pixmap(pm, 20, 30, None) + pm3 = pymupdf.Pixmap(pymupdf.csGRAY, pm) + pm4 = pymupdf.Pixmap(pm, pm3) + +def test_3050(): + ''' + This is known to fail if MuPDF is built without it's default third-party + libraries, e.g. in Linux system installs. + ''' + path = os.path.normpath(f'{__file__}/../../tests/resources/001003ED.pdf') + with pymupdf.open(path) as pdf_file: + page_no = 0 + page = pdf_file[page_no] + zoom_x = 4.0 + zoom_y = 4.0 + matrix = pymupdf.Matrix(zoom_x, zoom_y) + pix = page.get_pixmap(matrix=matrix) + path_out = os.path.normpath(f'{__file__}/../../tests/test_3050_out.png') + pix.save(path_out) + print(f'{pix.width=} {pix.height=}') + def product(x, y): + for yy in y: + for xx in x: + yield (xx, yy) + n = 0 + # We use a small subset of the image because non-optimised rebase gets + # very slow. 
+ for pos in product(range(100), range(100)): + if sum(pix.pixel(pos[0], pos[1])) >= 600: + n += 1 + pix.set_pixel(pos[0], pos[1], (255, 255, 255)) + path_out2 = os.path.normpath(f'{__file__}/../../tests/test_3050_out2.png') + pix.save(path_out2) + path_expected = os.path.normpath(f'{__file__}/../../tests/resources/test_3050_expected.png') + rms = gentle_compare.pixmaps_rms(path_expected, path_out2) + print(f'{rms=}') + if pymupdf.mupdf_version_tuple < (1, 26): + # Slight differences in rendering from fix for mupdf bug 708274. + assert rms < 0.2 + else: + assert rms == 0 + wt = pymupdf.TOOLS.mupdf_warnings() + if pymupdf.mupdf_version_tuple >= (1, 26, 0): + assert wt == 'bogus font ascent/descent values (0 / 0)\nPDF stream Length incorrect' + else: + assert wt == 'PDF stream Length incorrect' + +def test_3058(): + doc = pymupdf.Document(os.path.abspath(f'{__file__}/../../tests/resources/test_3058.pdf')) + images = doc[0].get_images(full=True) + pix = pymupdf.Pixmap(doc, 17) + + # First bug was that `pix.colorspace` was DeviceRGB. + assert str(pix.colorspace) == 'Colorspace(CS_CMYK) - DeviceCMYK' + + pix = pymupdf.Pixmap(pymupdf.csRGB, pix) + assert str(pix.colorspace) == 'Colorspace(CS_RGB) - DeviceRGB' + + # Second bug was that the image was converted to RGB via greyscale proofing + # color space, so image contained only shades of grey. This compressed + # easily to a .png file, so we crudely check the bug is fixed by looking at + # size of .png file. + path = os.path.abspath(f'{__file__}/../../tests/test_3058_out.png') + pix.save(path) + s = os.path.getsize(path) + assert 1800000 < s < 2600000, f'Unexpected size of {path}: {s}' + +def test_3072(): + path = os.path.abspath(f'{__file__}/../../tests/resources/test_3072.pdf') + out = os.path.abspath(f'{__file__}/../../tests') + + doc = pymupdf.open(path) + page_48 = doc[0] + bbox = [147, 300, 447, 699] + rect = pymupdf.Rect(*bbox) + zoom = pymupdf.Matrix(3, 3) + pix = page_48.get_pixmap(clip=rect, matrix=zoom) + image_save_path = f'{out}/1.jpg' + pix.save(image_save_path, jpg_quality=95) + + doc = pymupdf.open(path) + page_49 = doc[1] + bbox = [147, 543, 447, 768] + rect = pymupdf.Rect(*bbox) + zoom = pymupdf.Matrix(3, 3) + pix = page_49.get_pixmap(clip=rect, matrix=zoom) + image_save_path = f'{out}/2.jpg' + pix.save(image_save_path, jpg_quality=95) + rebase = hasattr(pymupdf, 'mupdf') + if rebase: + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == ( + "syntax error: cannot find ExtGState resource 'BlendMode0'\n" + "encountered syntax errors; page may not be correct\n" + "syntax error: cannot find ExtGState resource 'BlendMode0'\n" + "encountered syntax errors; page may not be correct" + ) + +def test_3134(): + doc = pymupdf.Document() + page = doc.new_page() + page.get_pixmap(clip=pymupdf.Rect(0, 0, 100, 100)).save("test_3134_rect.jpg") + page.get_pixmap(clip=pymupdf.IRect(0, 0, 100, 100)).save("test_3134_irect.jpg") + stat_rect = os.stat('test_3134_rect.jpg') + stat_irect = os.stat('test_3134_irect.jpg') + print(f' {stat_rect=}') + print(f'{stat_irect=}') + assert stat_rect.st_size == stat_irect.st_size + +def test_3177(): + path = os.path.abspath(f'{__file__}/../../tests/resources/img-transparent.png') + pixmap = pymupdf.Pixmap(path) + pixmap2 = pymupdf.Pixmap(None, pixmap) + + +def test_3493(): + ''' + If python3-gi is installed, we check fix for #3493, where importing gi + would load an older version of libjpeg than is used in MuPDF, and break + MuPDF. 
+ + This test is excluded by default in sysinstall tests, because running + commands in a new venv does not seem to pick up pymupdf as expected. + ''' + if platform.system() != 'Linux': + print(f'Not running because not Linux: {platform.system()=}') + return + + import subprocess + + root = os.path.abspath(f'{__file__}/../..') + in_path = f'{root}/tests/resources/test_3493.epub' + + def run(command, check=1, stdout=None): + print(f'Running with {check=}: {command}') + return subprocess.run(command, shell=1, check=check, stdout=stdout, text=1) + + def run_code(code, code_path, *, check=True, venv=None, venv_args='', pythonpath=None, stdout=None): + code = textwrap.dedent(code) + with open(code_path, 'w') as f: + f.write(code) + prefix = f'PYTHONPATH={pythonpath} ' if pythonpath else '' + if venv: + # Have seen this fail on Github in a curious way: + # + # Running: /tmp/tmp.fBeKNLJQKk/venv/bin/python -m venv --system-site-packages /project/tests/resources/test_3493_venv + # Error: [Errno 2] No such file or directory: '/project/tests/resources/test_3493_venv/bin/python' + # + r = run(f'{sys.executable} -m venv {venv_args} {venv}', check=check) + if r.returncode: + return r + r = run(f'. {venv}/bin/activate && {prefix}python {code_path}', check=check, stdout=stdout) + else: + r = run(f'{prefix}{sys.executable} {code_path}', check=check, stdout=stdout) + return r + + # Find location of system install of `gi`. + r = run_code( + ''' + from gi.repository import GdkPixbuf + import gi + print(gi.__file__) + ''' + , + f'{root}/tests/resources/test_3493_gi.py', + check=0, + venv=f'{root}/tests/resources/test_3493_venv', + venv_args='--system-site-packages', + stdout=subprocess.PIPE, + ) + if r.returncode: + print(f'test_3493(): Not running test because --system-site-packages venv cannot import gi.') + return + gi = r.stdout.strip() + gi_pythonpath = os.path.abspath(f'{gi}/../..') + + def do(gi): + # Run code that will import gi and pymupdf in different orders, and + # return contents of generated .png file as a bytes. + out = f'{root}/tests/resources/test_3493_{gi}.png' + run_code( + f''' + if {gi}==0: + import pymupdf + elif {gi}==1: + from gi.repository import GdkPixbuf + import pymupdf + elif {gi}==2: + import pymupdf + from gi.repository import GdkPixbuf + else: + assert 0 + document = pymupdf.Document('{in_path}') + page = document[0] + print(f'{gi=}: saving to: {out}') + page.get_pixmap().save('{out}') + ''' + , + os.path.abspath(f'{root}/tests/resources/test_3493_{gi}.py'), + pythonpath=gi_pythonpath, + ) + with open(out, 'rb') as f: + return f.read() + + out0 = do(0) + out1 = do(1) + out2 = do(2) + print(f'{len(out0)=} {len(out1)=} {len(out2)=}.') + assert out1 == out0 + assert out2 == out0 + + +def test_3848(): + if os.environ.get('PYMUPDF_RUNNING_ON_VALGRIND') == '1': + # Takes 40m on Github. 
+ print(f'test_3848(): not running on valgrind because very slow.', flush=1) + return + if platform.python_implementation() == 'GraalVM': + print(f'test_3848(): Not running because slow on GraalVM.') + return + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3848.pdf') + with pymupdf.open(path) as document: + for i in range(len(document)): + page = document.load_page(i) + print(f'{page=}.') + for annot in page.get_drawings(): + if page.get_textbox(annot['rect']): + rect = annot['rect'] + pixmap = page.get_pixmap(clip=rect) + color_bytes = pixmap.color_topusage() + + +def test_3994(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3994.pdf') + with pymupdf.open(path) as document: + page = document[0] + txt_blocks = [blk for blk in page.get_text('dict')['blocks'] if blk['type']==0] + for blk in txt_blocks: + pix = page.get_pixmap(clip=pymupdf.Rect([int(v) for v in blk['bbox']]), colorspace=pymupdf.csRGB, alpha=False) + percent, color = pix.color_topusage() + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'premature end of data in flate filter\n... repeated 2 times...' + + +def test_3448(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3448.pdf') + with pymupdf.open(path) as document: + page = document[0] + pixmap = page.get_pixmap(alpha=False, dpi=150) + path_out = f'{path}.png' + pixmap.save(path_out) + print(f'Have written to: {path_out}') + path_expected = os.path.normpath(f'{__file__}/../../tests/resources/test_3448.pdf-expected.png') + pixmap_expected = pymupdf.Pixmap(path_expected) + rms = gentle_compare.pixmaps_rms(pixmap, pixmap_expected) + diff = gentle_compare.pixmaps_diff(pixmap_expected, pixmap) + path_diff = os.path.normpath(f'{__file__}/../../tests/test_3448-diff.png') + diff.save(path_diff) + print(f'{rms=}') + if pymupdf.mupdf_version_tuple < (1, 25, 5): + # Prior to fix for mupdf bug 708274. + assert 1 < rms < 2 + else: + assert rms == 0 + + +def test_3854(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3854.pdf') + with pymupdf.open(path) as document: + page = document[0] + pixmap = page.get_pixmap() + pixmap.save(os.path.normpath(f'{__file__}/../../tests/test_3854_out.png')) + + # 2024-11-29: this is the incorrect expected output. + path_expected_png = os.path.normpath(f'{__file__}/../../tests/resources/test_3854_expected.png') + pixmap_expected = pymupdf.Pixmap(path_expected_png) + pixmap_diff = gentle_compare.pixmaps_diff(pixmap_expected, pixmap) + path_diff = os.path.normpath(f'{__file__}/../../tests/resources/test_3854_diff.png') + pixmap_diff.save(path_diff) + rms = gentle_compare.pixmaps_rms(pixmap, pixmap_expected) + print(f'{rms=}.') + if os.environ.get('PYMUPDF_SYSINSTALL_TEST') == '1': + # MuPDF using external third-party libs gives slightly different + # behaviour. + assert rms < 2 + elif pymupdf.mupdf_version_tuple < (1, 25, 5): + # # Prior to fix for mupdf bug 708274. + assert 0.5 < rms < 2 + else: + assert rms == 0 + + +def test_4155(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3854.pdf') + with pymupdf.open(path) as document: + page = document[0] + pixmap = page.get_pixmap() + mv = pixmap.samples_mv + mvb1 = mv.tobytes() + del page + del pixmap + try: + mvb2 = mv.tobytes() + except ValueError as e: + print(f'Received exception: {e}') + assert 'operation forbidden on released memoryview object' in str(e) + else: + assert 0, f'Did not receive expected exception when using defunct memoryview.' 
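+
+
+# Usage sketch for the behaviour test_4155() above verifies: Pixmap.samples_mv
+# is a zero-copy view into the pixmap's sample buffer and becomes invalid once
+# the Pixmap is released, so copy it (e.g. with .tobytes()) while the Pixmap is
+# still alive. The helper name and its argument are illustrative assumptions.
+def _copy_samples_safely(path):
+    with pymupdf.open(path) as document:
+        pixmap = document[0].get_pixmap()
+        return pixmap.samples_mv.tobytes()  # copy before the Pixmap goes away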
+ + +def test_4336(): + if 0: + # Compare with last classic release. + import pickle + path_out = os.path.normpath(f'{__file__}/../../tests/resources/test_4336_cc') + code = textwrap.dedent(f''' + import fitz + import os + import time + import pickle + + path = os.path.normpath(f'{__file__}/../../tests/resources/nur-ruhig.jpg') + pixmap = fitz.Pixmap(path) + t = time.time() + for i in range(10): + cc = pixmap.color_count() + t = time.time() - t + print(f'test_4336(): {{t=}}') + with open({path_out!r}, 'wb') as f: + pickle.dump(cc, f) + ''') + path_code = os.path.normpath(f'{__file__}/../../tests/resources/test_4336.py') + with open(path_code, 'w') as f: + f.write(code) + venv = os.path.normpath(f'{__file__}/../../tests/resources/test_4336_venv') + command = f'{sys.executable} -m venv {venv}' + command += f' && . {venv}/bin/activate' + command += f' && pip install --force-reinstall pymupdf==1.23.8' + command += f' && python {path_code}' + print(f'Running: {command}', flush=1) + subprocess.run(command, shell=1, check=1) + with open(path_out, 'rb') as f: + cc_old = pickle.load(f) + else: + cc_old = None + path = os.path.normpath(f'{__file__}/../../tests/resources/nur-ruhig.jpg') + pixmap = pymupdf.Pixmap(path) + t = time.time() + for i in range(10): + cc = pixmap.color_count() + t = time.time() - t + print(f'test_4336(): {t=}') + + if cc_old: + assert cc == cc_old + + +def test_4435(): + print(f'{pymupdf.version=}') + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4435.pdf') + with pymupdf.open(path) as document: + page = document[2] + print(f'Calling page.get_pixmap().', flush=1) + pixmap = page.get_pixmap(alpha=False, dpi=120) + print(f'Called page.get_pixmap().', flush=1) + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'bogus font ascent/descent values (0 / 0)\n... repeated 9 times...' + + +def test_4423(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test_4423.pdf') + with pymupdf.open(path) as document: + path2 = f'{path}.pdf' + ee = None + try: + document.save( + path2, + garbage=4, + expand=1, + deflate=True, + pretty=True, + no_new_id=True, + ) + except Exception as e: + print(f'Exception: {e}') + ee = e + + if (1, 25, 5) <= pymupdf.mupdf_version_tuple < (1, 26): + assert ee, f'Did not receive the expected exception.' + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'dropping unclosed output' + else: + assert not ee, f'Received unexpected exception: {e}' + wt = pymupdf.TOOLS.mupdf_warnings() + assert wt == 'format error: cannot find object in xref (56 0 R)\nformat error: cannot find object in xref (68 0 R)' + + +def test_4445(): + print() + # Test case is large so we download it instead of having it in PyMuPDF + # git. We put it in `cache/` directory do it is not removed by `git clean` + # (unless `-d` is specified). + import util + path = util.download( + 'https://github.com/user-attachments/files/19738242/ss.pdf', + 'test_4445.pdf', + size=2671185, + ) + with pymupdf.open(path) as document: + page = document[0] + pixmap = page.get_pixmap() + print(f'{pixmap.width=}') + print(f'{pixmap.height=}') + if pymupdf.mupdf_version_tuple >= (1, 26): + assert (pixmap.width, pixmap.height) == (792, 612) + else: + assert (pixmap.width, pixmap.height) == (612, 792) + if 0: + path_pixmap = f'{path}.png' + pixmap.save(path_pixmap) + print(f'Have created {path_pixmap=}') + wt = pymupdf.TOOLS.mupdf_warnings() + print(f'{wt=}') + assert wt == 'broken xref subsection, proceeding anyway.\nTrailer Size is off-by-one. Ignoring.' 
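+
+
+# Sketch of the render-and-compare pattern used by test_3448() and test_3854()
+# above and test_3806() below: rasterise page 0 and measure the RMS difference
+# against a stored reference image via gentle_compare.pixmaps_rms(). The
+# function name, its parameters and the default dpi are illustrative
+# assumptions.
+def _rms_against_reference(pdf_path, expected_png_path, dpi=150):
+    with pymupdf.open(pdf_path) as document:
+        pixmap = document[0].get_pixmap(dpi=dpi)
+        return gentle_compare.pixmaps_rms(expected_png_path, pixmap)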
+ + +def test_3806(): + print() + print(f'{pymupdf.mupdf_version=}') + path = os.path.normpath(f'{__file__}/../../tests/resources/test_3806.pdf') + path_png_expected = os.path.normpath(f'{__file__}/../../tests/resources/test_3806-expected.png') + path_png = os.path.normpath(f'{__file__}/../../tests/test_3806.png') + + with pymupdf.open(path) as document: + pixmap = document[0].get_pixmap() + pixmap.save(path_png) + rms = gentle_compare.pixmaps_rms(path_png_expected, pixmap) + print(f'{rms=}') + if pymupdf.mupdf_version_tuple >= (1, 26, 6): + assert rms < 0.1 + else: + assert rms > 50 + + +def test_4388(): + print() + path_BOZ1 = os.path.normpath(f'{__file__}/../../tests/resources/test_4388_BOZ1.pdf') + path_BUL1 = os.path.normpath(f'{__file__}/../../tests/resources/test_4388_BUL1.pdf') + path_correct = os.path.normpath(f'{__file__}/../../tests/resources/test_4388_BUL1.pdf.correct.png') + path_test = os.path.normpath(f'{__file__}/../../tests/resources/test_4388_BUL1.pdf.test.png') + + with pymupdf.open(path_BUL1) as bul: + pixmap_correct = bul.load_page(0).get_pixmap() + pixmap_correct.save(path_correct) + + pymupdf.TOOLS.store_shrink(100) + + with pymupdf.open(path_BOZ1) as boz: + boz.load_page(0).get_pixmap() + + with pymupdf.open(path_BUL1) as bul: + pixmap_test = bul.load_page(0).get_pixmap() + pixmap_test.save(path_test) + + rms = gentle_compare.pixmaps_rms(pixmap_correct, pixmap_test) + print(f'{rms=}') + if pymupdf.mupdf_version_tuple >= (1, 26, 6): + assert rms == 0 + else: + assert rms >= 10 diff -r 000000000000 -r 1d09e1dec1d9 tests/test_pylint.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_pylint.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,141 @@ +import pymupdf +import os +import re +import subprocess +import sys +import textwrap + +def test_pylint(): + + if not hasattr(pymupdf, 'mupdf'): + print(f'test_pylint(): Not running with classic implementation.') + return + + ignores = '' + + ignores += textwrap.dedent( + ''' + C0103: Constant name "g_exceptions_verbose" doesn't conform to UPPER_CASE naming style (invalid-name) + C0115: Missing class docstring (missing-class-docstring) + C0116: Missing function or method docstring (missing-function-docstring) + C0301: Line too long (142/100) (line-too-long) + C0302: Too many lines in module (23586/1000) (too-many-lines) + C0303: Trailing whitespace (trailing-whitespace) + C0325: Unnecessary parens after 'not' keyword (superfluous-parens) + C0415: Import outside toplevel (traceback) (import-outside-toplevel) + R0902: Too many instance attributes (9/7) (too-many-instance-attributes) + R0903: Too few public methods (1/2) (too-few-public-methods) + R0911: Too many return statements (9/6) (too-many-return-statements) + R0913: Too many arguments (6/5) (too-many-arguments) + R1705: Unnecessary "elif" after "return", remove the leading "el" from "elif" (no-else-return) + R1720: Unnecessary "elif" after "raise", remove the leading "el" from "elif" (no-else-raise) + R1724: Unnecessary "elif" after "continue", remove the leading "el" from "elif" (no-else-continue) + R1735: Consider using '{}' instead of a call to 'dict'. (use-dict-literal) + W0511: Fixme: we don't support JM_MEMORY=1. 
(fixme) + W0622: Redefining built-in 'FileNotFoundError' (redefined-builtin) + W0622: Redefining built-in 'open' (redefined-builtin) + W1309: Using an f-string that does not have any interpolated variables (f-string-without-interpolation) + R1734: Consider using [] instead of list() (use-list-literal) + R1727: Boolean condition '0 and g_exceptions_verbose' will always evaluate to '0' (condition-evals-to-constant) + R1726: (simplifiable-condition) + ''' + ) + + # Items that we might want to fix. + ignores += textwrap.dedent( + ''' + C0114: Missing module docstring (missing-module-docstring) + C0117: Consider changing "not rotate % 90 == 0" to "rotate % 90 != 0" (unnecessary-negation) + C0123: Use isinstance() rather than type() for a typecheck. (unidiomatic-typecheck) + C0200: Consider using enumerate instead of iterating with range and len (consider-using-enumerate) + C0201: Consider iterating the dictionary directly instead of calling .keys() (consider-iterating-dictionary) + C0209: Formatting a regular string which could be an f-string (consider-using-f-string) + C0305: Trailing newlines (trailing-newlines) + C0321: More than one statement on a single line (multiple-statements) + C1802: Do not use `len(SEQUENCE)` without comparison to determine if a sequence is empty (use-implicit-booleaness-not-len) + C1803: "select == []" can be simplified to "not select", if it is strictly a sequence, as an empty list is falsey (use-implicit-booleaness-not-comparison) + R0912: Too many branches (18/12) (too-many-branches) + R0914: Too many local variables (20/15) (too-many-locals) + R0915: Too many statements (58/50) (too-many-statements) + R1702: Too many nested blocks (7/5) (too-many-nested-blocks) + R1703: The if statement can be replaced with 'var = bool(test)' (simplifiable-if-statement) + R1710: Either all return statements in a function should return an expression, or none of them should. (inconsistent-return-statements) + R1714: Consider merging these comparisons with 'in' by using 'width not in (1, 0)'. Use a set instead if elements are hashable. (consider-using-in) + R1716: Simplify chained comparison between the operands (chained-comparison) + R1717: Consider using a dictionary comprehension (consider-using-dict-comprehension) + R1718: Consider using a set comprehension (consider-using-set-comprehension) + R1719: The if expression can be replaced with 'bool(test)' (simplifiable-if-expression) + R1721: Unnecessary use of a comprehension, use list(roman_num(num)) instead. 
(unnecessary-comprehension) + R1728: Consider using a generator instead 'max(len(k) for k in item.keys())' (consider-using-generator) + R1728: Consider using a generator instead 'max(len(r.cells) for r in self.rows)' (consider-using-generator) + R1730: Consider using 'rowheight = min(rowheight, height)' instead of unnecessary if block (consider-using-min-builtin) + R1731: Consider using 'right = max(right, x1)' instead of unnecessary if block (consider-using-max-builtin) + W0105: String statement has no effect (pointless-string-statement) + W0107: Unnecessary pass statement (unnecessary-pass) + W0212: Access to a protected member _graft_id of a client class (protected-access) + W0602: Using global for 'CHARS' but no assignment is done (global-variable-not-assigned) + W0602: Using global for 'EDGES' but no assignment is done (global-variable-not-assigned) + W0603: Using the global statement (global-statement) + W0612: Unused variable 'keyvals' (unused-variable) + W0613: Unused argument 'kwargs' (unused-argument) + W0621: Redefining name 'show' from outer scope (line 159) (redefined-outer-name) + W0640: Cell variable o defined in loop (cell-var-from-loop) + W0718: Catching too general exception Exception (broad-exception-caught) + W0719: Raising too general exception: Exception (broad-exception-raised) + C3001: Lambda expression assigned to a variable. Define a function using the "def" keyword instead. (unnecessary-lambda-assignment) + R0801: Similar lines in 2 files + R0917: Too many positional arguments (7/5) (too-many-positional-arguments) + ''' + ) + ignores_list = list() + for line in ignores.split('\n'): + if not line or line.startswith('#'): + continue + m = re.match('^(.....): ', line) + assert m, f'Failed to parse {line=}' + ignores_list.append(m.group(1)) + ignores = ','.join(ignores_list) + + root = os.path.abspath(f'{__file__}/../..') + + sys.path.insert(0, root) + import pipcl + del sys.path[0] + + # We want to run pylist on all of our src/*.py files so we find them with + # `pipcl.git_items()`. However this seems to fail on github windows with + # `fatal: not a git repository (or any of the parent directories): .git` so + # we also hard-code the list and verify it matches `git ls-files` on other + # platforms. This ensures that we will always pick up new .py files in the + # future. 
+ # + command = f'pylint -d {ignores}' + directory = f'{root}/src' + directory = directory.replace('/', os.sep) + leafs = [ + '__init__.py', + '__main__.py', + '_apply_pages.py', + '_wxcolors.py', + 'fitz___init__.py', + 'fitz_table.py', + 'fitz_utils.py', + 'pymupdf.py', + 'table.py', + 'utils.py', + ] + leafs.sort() + try: + leafs_git = pipcl.git_items(directory) + except Exception as e: + import platform + assert platform.system() == 'Windows' + else: + leafs_git = [i for i in leafs_git if i.endswith('.py')] + leafs_git.sort() + assert leafs_git == leafs, f'leafs:\n {leafs!r}\nleafs_git:\n {leafs_git!r}' + for leaf in leafs: + command += f' {directory}/{leaf}' + print(f'Running: {command}') + subprocess.run(command, shell=1, check=1) + diff -r 000000000000 -r 1d09e1dec1d9 tests/test_remove-rotation.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_remove-rotation.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,30 @@ +import os +import pymupdf +from gentle_compare import gentle_compare + +scriptdir = os.path.dirname(__file__) + + +def test_remove_rotation(): + """Remove rotation verifying identical appearance and text.""" + filename = os.path.join(scriptdir, "resources", "test-2812.pdf") + doc = pymupdf.open(filename) + + # We always create fresh pages to avoid false positives from cache content. + # Text on these pages consists of pairwise different strings, sorting by + # these strings must therefore yield identical bounding boxes. + for i in range(1, doc.page_count): + assert doc[i].rotation # must be a rotated page + pix0 = doc[i].get_pixmap() # make image + words0 = [] + for w in doc[i].get_text("words"): + words0.append(list(pymupdf.Rect(w[:4]) * doc[i].rotation_matrix) + [w[4]]) + words0.sort(key=lambda w: w[4]) # sort by word strings + # derotate page and confirm nothing else has changed + doc[i].remove_rotation() + assert doc[i].rotation == 0 + pix1 = doc[i].get_pixmap() + words1 = doc[i].get_text("words") + words1.sort(key=lambda w: w[4]) # sort by word strings + assert pix1.digest == pix0.digest, f"{pix1.digest}/{pix0.digest}" + assert gentle_compare(words0, words1) diff -r 000000000000 -r 1d09e1dec1d9 tests/test_rewrite_images.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_rewrite_images.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,15 @@ +import pymupdf +import os + +scriptdir = os.path.dirname(__file__) + + +def test_rewrite_images(): + """Example for decreasing file size by more than 30%.""" + filename = os.path.join(scriptdir, "resources", "test-rewrite-images.pdf") + doc = pymupdf.open(filename) + size0 = os.path.getsize(doc.name) + doc.rewrite_images(dpi_threshold=100, dpi_target=72, quality=33) + data = doc.tobytes(garbage=3, deflate=True) + size1 = len(data) + assert (1 - (size1 / size0)) > 0.3 diff -r 000000000000 -r 1d09e1dec1d9 tests/test_rtl.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tests/test_rtl.py Mon Sep 15 11:37:51 2025 +0200 @@ -0,0 +1,18 @@ +import pymupdf + +import os + + +def test_rtl(): + path = os.path.normpath(f'{__file__}/../../tests/resources/test-E+A.pdf') + doc = pymupdf.open(path) + page = doc[0] + # set of all RTL characters + rtl_chars = set([chr(i) for i in range(0x590, 0x901)]) + + for w in page.get_text("words"): + # every word string must either ONLY contain RTL chars + cond1 = rtl_chars.issuperset(w[4]) + # ... or NONE. 
+        cond2 = rtl_chars.intersection(w[4]) == set()
+        assert cond1 or cond2
diff -r 000000000000 -r 1d09e1dec1d9 tests/test_showpdfpage.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tests/test_showpdfpage.py Mon Sep 15 11:37:51 2025 +0200
@@ -0,0 +1,55 @@
+"""
+Tests:
+    * Convert some image to a PDF
+    * Insert it rotated in some rectangle of a PDF page
+    * Assert PDF Form XObject has been created
+    * Assert that image contained in inserted PDF is inside given rectangle
+"""
+import os
+
+import pymupdf
+
+scriptdir = os.path.abspath(os.path.dirname(__file__))
+imgfile = os.path.join(scriptdir, "resources", "nur-ruhig.jpg")
+
+
+def test_insert():
+    doc = pymupdf.open()
+    page = doc.new_page()
+    rect = pymupdf.Rect(50, 50, 100, 100)  # insert in here
+    img = pymupdf.open(imgfile)  # open image
+    tobytes = img.convert_to_pdf()  # get its PDF version (bytes object)
+    src = pymupdf.open("pdf", tobytes)  # open as PDF
+    xref = page.show_pdf_page(rect, src, 0, rotate=-23)  # insert in rectangle
+    # extract just inserted image info
+    img = page.get_images(True)[0]
+    assert img[-1] == xref  # xref of Form XObject!
+    img = page.get_image_info()[0]  # read the page's images
+
+    # Multiple computations may have led to rounding deviations, so we need
+    # some generosity here: enlarge rect by 1 point in each direction.
+    assert img["bbox"] in rect + (-1, -1, 1, 1)
+
+def test_2742():
+    dest = pymupdf.open()
+    destpage = dest.new_page(width=842, height=595)
+
+    a5 = pymupdf.Rect(0, 0, destpage.rect.width / 3, destpage.rect.height)
+    shiftright = pymupdf.Rect(destpage.rect.width/3, 0, destpage.rect.width/3, 0)
+
+    src = pymupdf.open(os.path.abspath(f'{__file__}/../../tests/resources/test_2742.pdf'))
+
+    destpage.show_pdf_page(a5, src, 0)
+    destpage.show_pdf_page(a5 + shiftright, src, 0)
+    destpage.show_pdf_page(a5 + shiftright + shiftright, src, 0)
+
+    dest.save(os.path.abspath(f'{__file__}/../../tests/test_2742-out.pdf'))
+    print("The end!")
+
+    rebased = hasattr(pymupdf, 'mupdf')
+    if rebased:
+        wt = pymupdf.TOOLS.mupdf_warnings()
+        assert wt == (
+            'Circular dependencies! Consider page cleaning.\n'
+            '... repeated 3 times...'
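+            # (presumably the warnings buffer collapses the identical warnings
+            # raised by the repeated show_pdf_page() calls above into a single
+            # 'repeated' line)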
+            ), f'{wt=}'
diff -r 000000000000 -r 1d09e1dec1d9 tests/test_spikes.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tests/test_spikes.py Mon Sep 15 11:37:51 2025 +0200
@@ -0,0 +1,42 @@
+import pymupdf
+import pathlib
+import os
+
+
+def test_spikes():
+    """Check suppression of text spikes caused by long miters."""
+    root = os.path.abspath(f"{__file__}/../..")
+    spikes_yes = pathlib.Path(f"{root}/docs/images/spikes-yes.png")
+    spikes_no = pathlib.Path(f"{root}/docs/images/spikes-no.png")
+    doc = pymupdf.open()
+    text = "NATO MEMBERS"  # some text provoking spikes ("N", "M")
+    point = (10, 35)  # insert point
+
+    # make text provoking spikes
+    page = doc.new_page(width=200, height=50)  # small page
+    page.insert_text(
+        point,
+        text,
+        fontsize=20,
+        render_mode=1,  # stroke text only
+        border_width=0.3,  # causes thick border lines
+        miter_limit=None,  # do not care about miter spikes
+    )
+    # write same text in white over the previous for better demo purpose
+    page.insert_text(point, text, fontsize=20, color=(1, 1, 1))
+    pix1 = page.get_pixmap()
+    assert pix1.tobytes() == spikes_yes.read_bytes()
+
+    # make text suppressing spikes
+    page = doc.new_page(width=200, height=50)
+    page.insert_text(
+        point,
+        text,
+        fontsize=20,
+        render_mode=1,
+        border_width=0.3,
+        miter_limit=1,  # suppress each and every miter spike
+    )
+    page.insert_text(point, text, fontsize=20, color=(1, 1, 1))
+    pix2 = page.get_pixmap()
+    assert pix2.tobytes() == spikes_no.read_bytes()
diff -r 000000000000 -r 1d09e1dec1d9 tests/test_story.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tests/test_story.py Mon Sep 15 11:37:51 2025 +0200
@@ -0,0 +1,295 @@
+import pymupdf
+import os
+import textwrap
+
+
+def test_story():
+    otf = os.path.abspath(f'{__file__}/../resources/PragmaticaC.otf')
+    # 2023-12-06: latest mupdf throws exception if path uses back-slashes.
+    otf = otf.replace('\\', '/')
+    CSS = f"""
+    @font-face {{font-family: test; src: url({otf});}}
+    """
+
+    HTML = """
+