comparison mupdf-source/thirdparty/gumbo-parser/setup.py @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| parents | |
| children |
| 1:1d09e1dec1d9 | 2:b50eed0cc0ef |
|---|---|
| 1 #!/usr/bin/env python | |
| 2 import sys | |
| 3 from setuptools import setup | |
| 4 from setuptools.command.sdist import sdist | |
| 5 | |
| 6 _name_of_lib = 'libgumbo.so' | |
| 7 if sys.platform.startswith('darwin'): | |
| 8 _name_of_lib = 'libgumbo.dylib' | |
| 9 elif sys.platform.startswith('win'): | |
| 10 _name_of_lib = 'gumbo.dll' | |
| 11 | |
| 12 class CustomSdistCommand(sdist): | |
| 13 """Customized Sdist command, to copy libgumbo.so into the Python directory | |
| 14 so that it can be installed with `pip install`.""" | |
| 15 def run(self): | |
| 16 try: | |
| 17 import shutil | |
| 18 shutil.copyfile('.libs/' + _name_of_lib, | |
| 19 'python/gumbo/' + _name_of_lib) | |
| 20 sdist.run(self) | |
| 21 except IOError as e: | |
| 22 print(e) | |
| 23 | |
| 24 | |
| 25 README = '''Gumbo - A pure-C HTML5 parser. | |
| 26 ============================== | |
| 27 | |
| 28 Gumbo is an implementation of the `HTML5 parsing algorithm <http://www.whatwg.org/specs/web-apps/current-work/multipage/#auto-toc-12>`_ implemented | |
| 29 as a pure C99 library with no outside dependencies. It's designed to serve | |
| 30 as a building block for other tools and libraries such as linters, | |
| 31 validators, templating languages, and refactoring and analysis tools. This | |
| 32 package contains the library itself, Python ctypes bindings for the library, and | |
| 33 adapters for html5lib and BeautifulSoup (3.2) that give it the same API as those | |
| 34 libraries. | |
| 35 | |
| 36 Goals & features: | |
| 37 ----------------- | |
| 38 | |
| 39 - Robust and resilient to bad input. | |
| 40 | |
| 41 - Simple API that can be easily wrapped by other languages. | |
| 42 | |
| 43 - Support for source locations and pointers back to the original text. | |
| 44 | |
| 45 - Relatively lightweight, with no outside dependencies. | |
| 46 | |
| 47 - Passes all `html5lib-0.95 tests <https://github.com/html5lib/html5lib-tests>`_. | |
| 48 | |
| 49 - Tested on over 2.5 billion pages from Google's index. | |
| 50 | |
| 51 Non-goals: | |
| 52 ---------- | |
| 53 | |
| 54 - Execution speed. Gumbo gains some of this by virtue of being written in | |
| 55 C, but it is not an important consideration for the intended use-case, and | |
| 56 was not a major design factor. | |
| 57 | |
| 58 - Support for encodings other than UTF-8. For the most part, client code | |
| 59 can convert the input stream to UTF-8 text using another library before | |
| 60 processing. | |
| 61 | |
| 62 - Security. Gumbo was initially designed for a product that worked with | |
| 63 trusted input files only. We're working to harden this and make sure that it | |
| 64 behaves as expected even on malicious input, but for now, Gumbo should only be | |
| 65 run on trusted input or within a sandbox. | |
| 66 | |
| 67 - C89 support. Most major compilers support C99 by now; the major exception | |
| 68 (Microsoft Visual Studio) should be able to compile this in C++ mode with | |
| 69 relatively few changes. (Bug reports welcome.) | |
| 70 | |
| 71 Wishlist (aka "We couldn't get these into the original release, but are | |
| 72 hoping to add them soon"): | |
| 73 | |
| 74 - Support for recent HTML5 spec changes to support the template tag. | |
| 75 | |
| 76 - Support for fragment parsing. | |
| 77 | |
| 78 - Full-featured error reporting. | |
| 79 | |
| 80 - Bindings in other languages. | |
| 81 | |
| 82 Installation | |
| 83 ------------ | |
| 84 | |
| 85 ```pip install gumbo``` should do it. If you have a local copy, ```python | |
| 86 setup.py install``` from the root directory. | |
| 87 | |
| 88 The `html5lib <https://pypi.python.org/pypi/html5lib/0.999>`_ and | |
| 89 `BeautifulSoup <https://pypi.python.org/pypi/BeautifulSoup/3.2.1>`_ adapters | |
| 90 require that their respective libraries be installed separately to work. | |
| 91 | |
| 92 Basic Usage | |
| 93 ----------- | |
| 94 | |
| 95 For the ctypes bindings: | |
| 96 | |
| 97 .. code-block:: python | |
| 98 | |
| 99 import gumbo | |
| 100 | |
| 101 with gumbo.parse(text) as output: | |
| 102 root = output.contents.root.contents | |
| 103 # root is a Node object representing the root of the parse tree | |
| 104 # tree-walk over it as necessary. | |
| 105 | |
| 106 For the BeautifulSoup bindings: | |
| 107 | |
| 108 .. code-block:: python | |
| 109 | |
| 110 import gumbo | |
| 111 | |
| 112 soup = gumbo.soup_parse(text) | |
| 113 # soup is a BeautifulSoup object representing the parse tree. | |
| 114 | |
| 115 For the html5lib bindings: | |
| 116 | |
| 117 .. code-block:: python | |
| 118 | |
| 119 from gumbo import html5lib | |
| 120 | |
| 121 doc = html5lib.parse(text[, treebuilder='lxml']) | |
| 122 | |
| 123 Recommended best-practice for Python usage is to use one of the adapters to | |
| 124 an existing API (personally, I prefer BeautifulSoup) and write your program | |
| 125 in terms of those. The raw CTypes bindings should be considered building | |
| 126 blocks for higher-level libraries and rarely referenced directly. | |
| 127 | |
| 128 See the source code, Pydoc, and implementation of soup_adapter and | |
| 129 html5lib_adapter for more information. | |
| 130 | |
| 131 A note on API/ABI compatibility | |
| 132 ------------------------------- | |
| 133 | |
| 134 We'll make a best effort to preserve API compatibility between releases. | |
| 135 The initial release is a 0.9 (beta) release to solicit comments from early | |
| 136 adopters, but if no major problems are found with the API, a 1.0 release | |
| 137 will follow shortly, and the API of that should be considered stable. If | |
| 138 changes are necessary, we follow [semantic versioning][]. | |
| 139 | |
| 140 We make no such guarantees about the ABI, and it's very likely that | |
| 141 subsequent versions may require a recompile of client code. For this | |
| 142 reason, we recommend NOT using Gumbo data structures throughout a program, | |
| 143 and instead limiting them to a translation layer that picks out whatever | |
| 144 data is needed from the parse tree and then converts that to persistent | |
| 145 data structures more appropriate for the application. The API is | |
| 146 structured to encourage this use, with a single delete function for the | |
| 147 whole parse tree, and is not designed with mutation in mind. | |
| 148 | |
| 149 Most of this is transparent to Python usage, as the Python adapters are all | |
| 150 built with this in mind. However, since ctypes requires ABI compatibility, it | |
| 151 does mean you'll have to re-deploy the gumboc library and C extension when | |
| 152 upgrading to a new version. | |
| 153 ''' | |
| 154 | |
| 155 CLASSIFIERS = [ | |
| 156 'Development Status :: 4 - Beta', | |
| 157 'Intended Audience :: Developers', | |
| 158 'License :: OSI Approved :: Apache Software License', | |
| 159 'Operating System :: Unix', | |
| 160 'Operating System :: POSIX :: Linux', | |
| 161 'Programming Language :: C', | |
| 162 'Programming Language :: Python', | |
| 163 'Programming Language :: Python :: 2', | |
| 164 'Programming Language :: Python :: 2.7', | |
| 165 'Programming Language :: Python :: 3', | |
| 166 'Programming Language :: Python :: 3.4', | |
| 167 'Topic :: Software Development :: Libraries :: Python Modules', | |
| 168 'Topic :: Text Processing :: Markup :: HTML' | |
| 169 ] | |
| 170 | |
| 171 setup(name='gumbo', | |
| 172 version='0.10.1', | |
| 173 description='Python bindings for Gumbo HTML parser', | |
| 174 long_description=README, | |
| 175 url='http://github.com/google/gumbo-parser', | |
| 176 keywords='gumbo html html5 parser google html5lib beautifulsoup', | |
| 177 author='Jonathan Tang', | |
| 178 author_email='jonathan.d.tang@gmail.com', | |
| 179 license='Apache 2.0', | |
| 180 classifiers=CLASSIFIERS, | |
| 181 packages=['gumbo'], | |
| 182 package_dir={'': 'python'}, | |
| 183 package_data={'gumbo': [_name_of_lib]}, | |
| 184 cmdclass={ 'sdist': CustomSdistCommand }, | |
| 185 zip_safe=False) |
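
The platform check at the top of the script and the copy step in `CustomSdistCommand.run` can be sketched as two standalone functions. This is an illustration only, not part of the upstream file: `shared_library_name` and `copy_built_library` are hypothetical names. Unlike the listing, which prints the `IOError` and carries on, this sketch returns `None` when the library has not been built, so the caller can decide what to do.

```python
import os
import shutil
import sys


def shared_library_name(platform=None):
    """Return the Gumbo shared-library file name for a platform string.

    Hypothetical helper; it mirrors the sys.platform checks in the listing.
    """
    if platform is None:
        platform = sys.platform
    if platform.startswith('darwin'):
        return 'libgumbo.dylib'
    if platform.startswith('win'):
        return 'gumbo.dll'
    # Default used by the listing for every other platform.
    return 'libgumbo.so'


def copy_built_library(build_dir, package_dir, platform=None):
    """Copy the built library from build_dir into package_dir.

    Returns the destination path, or None if the library file is missing.
    Hypothetical helper; the listing hard-codes '.libs/' and 'python/gumbo/'.
    """
    name = shared_library_name(platform)
    source = os.path.join(build_dir, name)
    if not os.path.isfile(source):
        return None
    destination = os.path.join(package_dir, name)
    shutil.copyfile(source, destination)
    return destination
```

With the directories from the listing, the call would be `copy_built_library('.libs', os.path.join('python', 'gumbo'))`, run before `sdist.run(self)`.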
