comparison mupdf-source/thirdparty/gumbo-parser/README.md @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| parents | |
| children | |
| 1:1d09e1dec1d9 | 2:b50eed0cc0ef |
|---|---|
| 1 Gumbo - A pure-C HTML5 parser. | |
| 2 ============ | |
| 3 | |
| 4 [Travis CI build status](https://travis-ci.org/google/gumbo-parser) [AppVeyor build status](https://ci.appveyor.com/project/nostrademons/gumbo-parser) | |
| 5 | |
| 6 Gumbo is an implementation of the [HTML5 parsing algorithm][], written | |
| 7 as a pure C99 library with no outside dependencies. It's designed to serve | |
| 8 as a building block for other tools and libraries such as linters, | |
| 9 validators, templating languages, and refactoring and analysis tools. | |
| 10 | |
| 11 Goals & features: | |
| 12 | |
| 13 * Fully conformant with the [HTML5 spec][]. | |
| 14 * Robust and resilient to bad input. | |
| 15 * Simple API that can be easily wrapped by other languages. | |
| 16 * Support for source locations and pointers back to the original text. | |
| 17 * Support for fragment parsing. | |
| 18 * Relatively lightweight, with no outside dependencies. | |
| 19 * Passes all [html5lib tests][], including the template tag. | |
| 20 * Tested on over 2.5 billion pages from Google's index. | |
| 21 | |
| 22 Non-goals: | |
| 23 | |
| 24 * Execution speed. Gumbo gains some of this by virtue of being written in | |
| 25 C, but it is not an important consideration for the intended use-case, and | |
| 26 was not a major design factor. | |
| 27 * Support for encodings other than UTF-8. For the most part, client code | |
| 28 can convert the input stream to UTF-8 text using another library before | |
| 29 processing. | |
| 30 * Mutability. Gumbo is intentionally designed to turn an HTML document into a | |
| 31 parse tree, and free that parse tree all at once. It's not designed to | |
| 32 persistently store nodes or subtrees outside of the parse tree, or to perform | |
| 33 arbitrary DOM mutations within your program. If you need this functionality, | |
| 34 we recommend translating the Gumbo parse tree into a mutable DOM | |
| 35 representation more suited for the particular needs of your program before | |
| 36 operating on it. | |
| 37 * C89 support. Most major compilers support C99 by now; the major exception | |
| 38 (Microsoft Visual Studio) should be able to compile this in C++ mode with | |
| 39 relatively few changes. (Bug reports welcome.) | |
| 40 * ~~Security. Gumbo was initially designed for a product that worked with | |
| 41 trusted input files only. We're working to harden this and make sure that it | |
| 42 behaves as expected even on malicious input, but for now, Gumbo should only be | |
| 43 run on trusted input or within a sandbox.~~ Gumbo underwent a number of | |
| 44 security fixes and passed Google's security review as of version 0.9.1. | |
| 45 | |
| 46 Wishlist (aka "We couldn't get these into the original release, but are | |
| 47 hoping to add them soon"): | |
| 48 | |
| 49 * Full-featured error reporting. | |
| 50 * Additional performance improvements. | |
| 51 * DOM wrapper library/libraries (possibly within other language bindings) | |
| 52 * Query libraries, to extract information from parse trees using CSS or XPath. | |
| 53 | |
| 54 Installation | |
| 55 ============ | |
| 56 | |
| 57 To build and install the library, issue the standard UNIX incantation from | |
| 58 the root of the distribution: | |
| 59 | |
| 60 ```bash | |
| 61 $ ./autogen.sh | |
| 62 $ ./configure | |
| 63 $ make | |
| 64 $ sudo make install | |
| 65 ``` | |
| 66 | |
| 67 Gumbo comes with full pkg-config support, so you can use pkg-config to | |
| 68 print the flags needed to link your program against it: | |
| 69 | |
| 70 ```bash | |
| 71 $ pkg-config --cflags gumbo # print compiler flags | |
| 72 $ pkg-config --libs gumbo # print linker flags | |
| 73 $ pkg-config --cflags --libs gumbo # print both | |
| 74 ``` | |
| 75 | |
| 76 For example: | |
| 77 | |
| 78 ```bash | |
| 79 $ gcc my_program.c `pkg-config --cflags --libs gumbo` | |
| 80 ``` | |
| 81 | |
| 82 See the pkg-config man page for more info. | |
| 83 | |
| 84 There are a number of sample programs in the examples/ directory. They're | |
| 85 built automatically by 'make', but can also be made individually with | |
| 86 `make <programname>` (e.g. `make clean_text`). | |
| 87 | |
| 88 To run the unit tests, you'll need to have [googletest][] downloaded and | |
| 89 unzipped. The googletest maintainers recommend against using | |
| 90 `make install`; instead, symlink the root googletest directory to 'gtest' | |
| 91 inside gumbo's root directory, and then `make check`: | |
| 92 | |
| 93 ```bash | |
| 94 $ unzip gtest-1.6.0.zip | |
| 95 $ cd gumbo-* | |
| 96 $ ln -s ../gtest-1.6.0 gtest | |
| 97 $ make check | |
| 98 ``` | |
| 99 | |
| 100 Gumbo's `make check` has code to automatically configure & build gtest and | |
| 101 then link in the library. | |
| 102 | |
| 103 Debian and Fedora users can install libgtest with: | |
| 104 | |
| 105 ```bash | |
| 106 $ apt-get install libgtest-dev # Debian/Ubuntu | |
| 107 $ yum install gtest-devel # CentOS/Fedora | |
| 108 ``` | |
| 109 | |
| 110 Note for Ubuntu users: the libgtest-dev package only installs source files. | |
| 111 You have to build the libraries yourself using cmake: | |
| 112 | |
| 113 $ sudo apt-get install cmake | |
| 114 $ cd /usr/src/gtest | |
| 115 $ sudo cmake CMakeLists.txt | |
| 116 $ sudo make | |
| 117 $ sudo cp *.a /usr/lib | |
| 118 | |
| 119 The configure script will detect the presence of the library and use that | |
| 120 instead. | |
| 121 | |
| 122 Note that you need to have super user privileges to execute these commands. | |
| 123 On most distros, you can prefix the commands above with `sudo` to execute | |
| 124 them as the super user. | |
| 125 | |
| 126 Debian installs usually don't have `sudo` installed (Ubuntu, however, does). | |
| 127 Switch users first with `su -`, then run `apt-get`. | |
| 128 | |
| 129 Basic Usage | |
| 130 =========== | |
| 131 | |
| 132 Within your program, you need to include "gumbo.h" and then issue a call to | |
| 133 `gumbo_parse`: | |
| 134 | |
| 135 ```C | |
| 136 #include "gumbo.h" | |
| 137 | |
| 138 int main() { | |
| 139 GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>"); | |
| 140 // Do stuff with output->root | |
| 141 gumbo_destroy_output(&kGumboDefaultOptions, output); | |
| 142 } | |
| 143 ``` | |
| 144 | |
| 145 See the API documentation and sample programs for more details. | |
| 146 | |
| 147 A note on API/ABI compatibility | |
| 148 =============================== | |
| 149 | |
| 150 We'll make a best effort to preserve API compatibility between releases. | |
| 151 The initial release is a 0.9 (beta) release to solicit comments from early | |
| 152 adopters, but if no major problems are found with the API, a 1.0 release | |
| 153 will follow shortly, and the API of that should be considered stable. If | |
| 154 changes are necessary, we follow [semantic versioning][]. | |
| 155 | |
| 156 We make no such guarantees about the ABI, and it's very likely that | |
| 157 subsequent versions will require a recompile of client code. For this | |
| 158 reason, we recommend NOT using Gumbo data structures throughout a program, | |
| 159 and instead limiting them to a translation layer that picks out whatever | |
| 160 data is needed from the parse tree and then converts that to persistent | |
| 161 data structures more appropriate for the application. The API is | |
| 162 structured to encourage this use, with a single delete function for the | |
| 163 whole parse tree, and is not designed with mutation in mind. | |
| 164 | |
| 165 Python usage | |
| 166 ============ | |
| 167 | |
| 168 To install the Python bindings, make sure that the | |
| 169 C library is installed first, and then run `sudo python setup.py install` from | |
| 170 the root of the distro. This installs a 'gumbo' module; `pydoc gumbo` | |
| 171 should tell you about it. | |
| 172 | |
| 173 Recommended best-practice for Python usage is to use one of the adapters to | |
| 174 an existing API (personally, I prefer BeautifulSoup) and write your program | |
| 175 in terms of those. The raw CTypes bindings should be considered building | |
| 176 blocks for higher-level libraries and rarely referenced directly. | |
| 177 | |
| 178 External Bindings and other wrappers | |
| 179 ==================================== | |
| 180 | |
| 181 The following language bindings or other tools/wrappers are maintained by | |
| 182 various contributors in other repositories: | |
| 183 | |
| 184 * C++: [gumbo-query] by lazytiger | |
| 185 * Ruby: | |
| 186 * [ruby-gumbo] by Nicolas Martyanoff | |
| 187 * [nokogumbo] by Sam Ruby | |
| 188 * Node.js: [node-gumbo-parser] by Karl Westin | |
| 189 * D: [gumbo-d] by Christopher Bertels | |
| 190 * Lua: [lua-gumbo] by Craig Barnes | |
| 191 * Objective-C: | |
| 192 * [ObjectiveGumbo] by Programming Thomas | |
| 193 * [OCGumbo] by TracyYih | |
| 194 * C#: [GumboBindings] by Vladimir Zotov | |
| 195 * PHP: [GumboPHP] by Paul Preece | |
| 196 * Perl: [HTML::Gumbo] by Ruslan Zakirov | |
| 197 * Julia: [Gumbo.jl] by James Porter | |
| 198 * C/Libxml: [gumbo-libxml] by Jonathan Tang | |
| 199 | |
| 200 [gumbo-query]: https://github.com/lazytiger/gumbo-query | |
| 201 [ruby-gumbo]: https://github.com/nevir/ruby-gumbo | |
| 202 [nokogumbo]: https://github.com/rubys/nokogumbo | |
| 203 [node-gumbo-parser]: https://github.com/karlwestin/node-gumbo-parser | |
| 204 [gumbo-d]: https://github.com/bakkdoor/gumbo-d | |
| 205 [lua-gumbo]: https://github.com/craigbarnes/lua-gumbo | |
| 206 [OCGumbo]: https://github.com/tracy-e/OCGumbo | |
| 207 [ObjectiveGumbo]: https://github.com/programmingthomas/ObjectiveGumbo | |
| 208 [GumboBindings]: https://github.com/rgripper/GumboBindings | |
| 209 [GumboPHP]: https://github.com/BipSync/gumbo | |
| 210 [Gumbo.jl]: https://github.com/porterjamesj/Gumbo.jl | |
| 211 [gumbo-libxml]: https://github.com/nostrademons/gumbo-libxml | |
| 212 | |
| 213 [HTML5 parsing algorithm]: http://www.whatwg.org/specs/web-apps/current-work/multipage/#auto-toc-12 | |
| 214 [HTML5 spec]: http://www.whatwg.org/specs/web-apps/current-work/multipage/ | |
| 215 [html5lib tests]: https://github.com/html5lib/html5lib-tests | |
| 216 [googletest]: https://code.google.com/p/googletest/ | |
| 217 [semantic versioning]: http://semver.org/ | |
| 218 [HTML::Gumbo]: https://metacpan.org/pod/HTML::Gumbo |
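
The ABI note in the README above recommends confining parser data structures to a translation layer that copies out whatever the application needs before the parse tree is freed. A minimal sketch of that pattern in C is given here. To stay compilable without the library, it walks a hypothetical `DemoNode` stand-in rather than Gumbo's own node type; `DemoNode`, `ExtractedText`, and `extract_text` are illustrative names and are not part of the Gumbo API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parser-owned tree node (NOT Gumbo's type).
 * A node either carries text or has children. */
typedef struct DemoNode {
  const char* text;            /* NULL for element-like nodes */
  struct DemoNode** children;  /* child pointers, may be NULL */
  size_t child_count;
} DemoNode;

/* Application-owned result: it holds its own heap copy of the text,
 * so it stays valid after the parse tree has been freed. */
typedef struct {
  char* text;     /* NUL-terminated, caller must free() */
  size_t length;  /* length without the terminator */
} ExtractedText;

/* Append s to the growing buffer, copying the bytes. */
static void append_text(ExtractedText* out, const char* s) {
  size_t n = strlen(s);
  char* grown = realloc(out->text, out->length + n + 1);
  if (grown == NULL) abort();  /* out of memory: give up in this sketch */
  memcpy(grown + out->length, s, n + 1);
  out->text = grown;
  out->length += n;
}

/* Depth-first walk that copies every text node into out. */
static void collect_text(const DemoNode* node, ExtractedText* out) {
  if (node->text != NULL) append_text(out, node->text);
  for (size_t i = 0; i < node->child_count; ++i) {
    collect_text(node->children[i], out);
  }
}

/* The translation layer: the only function that touches tree nodes. */
ExtractedText extract_text(const DemoNode* root) {
  ExtractedText out = { NULL, 0 };
  append_text(&out, "");  /* guarantees a valid empty string */
  collect_text(root, &out);
  return out;
}
```

In a real program the walk would presumably run between `gumbo_parse` and `gumbo_destroy_output`, so that only the application-owned `ExtractedText` outlives the parse tree and the rest of the code never depends on the library's struct layout.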
