comparison mupdf-source/thirdparty/gumbo-parser/README.md @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:43:07 +0200
1:1d09e1dec1d9 2:b50eed0cc0ef
1 Gumbo - A pure-C HTML5 parser.
2 ============
3
4 [![Build Status](https://travis-ci.org/google/gumbo-parser.svg?branch=master)](https://travis-ci.org/google/gumbo-parser) [![Build status](https://ci.appveyor.com/api/projects/status/k5xxn4bxf62ao2cp?svg=true)](https://ci.appveyor.com/project/nostrademons/gumbo-parser)
5
6 Gumbo is an implementation of the [HTML5 parsing algorithm][], written
7 as a pure C99 library with no outside dependencies. It's designed to serve
8 as a building block for other tools and libraries such as linters,
9 validators, templating languages, and refactoring and analysis tools.
10
11 Goals & features:
12
13 * Fully conformant with the [HTML5 spec][].
14 * Robust and resilient to bad input.
15 * Simple API that can be easily wrapped by other languages.
16 * Support for source locations and pointers back to the original text.
17 * Support for fragment parsing.
18 * Relatively lightweight, with no outside dependencies.
19 * Passes all [html5lib tests][], including the template tag.
20 * Tested on over 2.5 billion pages from Google's index.
21
22 Non-goals:
23
24 * Execution speed. Gumbo gains some of this by virtue of being written in
25 C, but it is not an important consideration for the intended use-case, and
26 was not a major design factor.
27 * Support for encodings other than UTF-8. For the most part, client code
28 can convert the input stream to UTF-8 text using another library before
29 processing.
30 * Mutability. Gumbo is intentionally designed to turn an HTML document into a
31 parse tree, and free that parse tree all at once. It's not designed to
32 persistently store nodes or subtrees outside of the parse tree, or to perform
33 arbitrary DOM mutations within your program. If you need this functionality,
34 we recommend translating the Gumbo parse tree into a mutable DOM
35 representation more suited for the particular needs of your program before
36 operating on it.
37 * C89 support. Most major compilers support C99 by now; the major exception
38 (Microsoft Visual Studio) should be able to compile this in C++ mode with
39 relatively few changes. (Bug reports welcome.)
40 * ~~Security. Gumbo was initially designed for a product that worked with
41 trusted input files only. We're working to harden this and make sure that it
42 behaves as expected even on malicious input, but for now, Gumbo should only be
43 run on trusted input or within a sandbox.~~ Gumbo underwent a number of
44 security fixes and passed Google's security review as of version 0.9.1.
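
The encoding non-goal above leaves conversion to UTF-8 to client code. As a minimal sketch of that step, assuming the input is known to be ISO-8859-1 (Latin-1), the conversion can be done by hand because every Latin-1 byte maps to the Unicode code point of the same value. The function name `latin1_to_utf8` is hypothetical and not part of Gumbo; for other encodings a real conversion library such as iconv or ICU would be the usual choice.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Gumbo: convert a NUL-terminated
 * ISO-8859-1 string to a newly allocated UTF-8 string.
 * Returns NULL if allocation fails; the caller frees the result. */
char* latin1_to_utf8(const char* input) {
  size_t length = strlen(input);
  /* Each Latin-1 byte becomes at most two UTF-8 bytes. */
  char* output = malloc(2 * length + 1);
  if (output == NULL) {
    return NULL;
  }
  char* out = output;
  for (size_t i = 0; i < length; ++i) {
    unsigned char c = (unsigned char) input[i];
    if (c < 0x80) {
      *out++ = (char) c;                    /* ASCII maps to itself. */
    } else {
      *out++ = (char) (0xC0 | (c >> 6));    /* Lead byte: 110000xx. */
      *out++ = (char) (0x80 | (c & 0x3F));  /* Continuation byte: 10xxxxxx. */
    }
  }
  *out = '\0';
  return output;
}
```

The returned buffer can be passed to `gumbo_parse` and freed once the parse tree has been destroyed.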
45
46 Wishlist (aka "We couldn't get these into the original release, but are
47 hoping to add them soon"):
48
49 * Full-featured error reporting.
50 * Additional performance improvements.
51 * DOM wrapper library/libraries (possibly within other language bindings).
52 * Query libraries, to extract information from parse trees using CSS or XPath.
53
54 Installation
55 ============
56
57 To build and install the library, issue the standard UNIX incantation from
58 the root of the distribution:
59
60 ```bash
61 $ ./autogen.sh
62 $ ./configure
63 $ make
64 $ sudo make install
65 ```
66
67 Gumbo comes with full pkg-config support, so you can use `pkg-config` to
68 print the flags needed to link your program against it:
69
70 ```bash
71 $ pkg-config --cflags gumbo # print compiler flags
72 $ pkg-config --libs gumbo # print linker flags
73 $ pkg-config --cflags --libs gumbo # print both
74 ```
75
76 For example:
77
78 ```bash
79 $ gcc my_program.c `pkg-config --cflags --libs gumbo`
80 ```
81
82 See the pkg-config man page for more info.
83
84 There are a number of sample programs in the examples/ directory. They're
85 built automatically by 'make', but can also be made individually with
86 `make <programname>` (e.g. `make clean_text`).
87
88 To run the unit tests, you'll need to have [googletest][] downloaded and
89 unzipped. The googletest maintainers recommend against using
90 `make install`; instead, symlink the root googletest directory to 'gtest'
91 inside gumbo's root directory, and then `make check`:
92
93 ```bash
94 $ unzip gtest-1.6.0.zip
95 $ cd gumbo-*
96 $ ln -s ../gtest-1.6.0 gtest
97 $ make check
98 ```
99
100 Gumbo's `make check` has code to automatically configure & build gtest and
101 then link in the library.
102
103 Debian and Fedora users can install libgtest with:
104
105 ```bash
106 $ apt-get install libgtest-dev # Debian/Ubuntu
107 $ yum install gtest-devel # CentOS/Fedora
108 ```
109
110 Note for Ubuntu users: the libgtest-dev package only installs source files.
111 You have to build the libraries yourself using cmake:
112
113 $ sudo apt-get install cmake
114 $ cd /usr/src/gtest
115 $ sudo cmake CMakeLists.txt
116 $ sudo make
117 $ sudo cp *.a /usr/lib
118
119 The configure script will detect the presence of the library and use that
120 instead.
121
122 Note that you need superuser privileges to execute these commands.
123 On most distros, you can prefix the commands above with `sudo` to execute
124 them as the superuser.
125
126 Debian installs usually don't have `sudo` installed (Ubuntu, however, does).
127 Switch users first with `su -`, then run `apt-get`.
128
129 Basic Usage
130 ===========
131
132 Within your program, you need to include "gumbo.h" and then issue a call to
133 `gumbo_parse`:
134
135 ```C
136 #include "gumbo.h"
137
138 int main(void) {
139   GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
140   // Do stuff with output->root
141   gumbo_destroy_output(&kGumboDefaultOptions, output);
142 }
143 ```
144
145 See the API documentation and sample programs for more details.
146
147 A note on API/ABI compatibility
148 ===============================
149
150 We'll make a best effort to preserve API compatibility between releases.
151 The initial release is a 0.9 (beta) release to solicit comments from early
152 adopters, but if no major problems are found with the API, a 1.0 release
153 will follow shortly, and the API of that should be considered stable. If
154 changes are necessary, we follow [semantic versioning][].
155
156 We make no such guarantees about the ABI, and it's very likely that
157 subsequent versions may require a recompile of client code. For this
158 reason, we recommend NOT using Gumbo data structures throughout a program,
159 and instead limiting them to a translation layer that picks out whatever
160 data is needed from the parse tree and then converts that to persistent
161 data structures more appropriate for the application. The API is
162 structured to encourage this use, with a single delete function for the
163 whole parse tree, and is not designed with mutation in mind.
164
165 Python usage
166 ============
167
168 To install the Python bindings, make sure that the
169 C library is installed first, and then run `sudo python setup.py install` from
170 the root of the distro. This installs a 'gumbo' module; `pydoc gumbo`
171 should tell you about it.
172
173 The recommended practice for Python usage is to use one of the adapters to
174 an existing API (personally, I prefer BeautifulSoup) and write your program
175 in terms of that API. The raw ctypes bindings should be considered building
176 blocks for higher-level libraries and rarely referenced directly.
177
178 External Bindings and other wrappers
179 ====================================
180
181 The following language bindings or other tools/wrappers are maintained by
182 various contributors in other repositories:
183
184 * C++: [gumbo-query] by lazytiger
185 * Ruby:
186   * [ruby-gumbo] by Nicolas Martyanoff
187   * [nokogumbo] by Sam Ruby
188 * Node.js: [node-gumbo-parser] by Karl Westin
189 * D: [gumbo-d] by Christopher Bertels
190 * Lua: [lua-gumbo] by Craig Barnes
191 * Objective-C:
192   * [ObjectiveGumbo] by Programming Thomas
193   * [OCGumbo] by TracyYih
194 * C#: [GumboBindings] by Vladimir Zotov
195 * PHP: [GumboPHP] by Paul Preece
196 * Perl: [HTML::Gumbo] by Ruslan Zakirov
197 * Julia: [Gumbo.jl] by James Porter
198 * C/Libxml: [gumbo-libxml] by Jonathan Tang
199
200 [gumbo-query]: https://github.com/lazytiger/gumbo-query
201 [ruby-gumbo]: https://github.com/nevir/ruby-gumbo
202 [nokogumbo]: https://github.com/rubys/nokogumbo
203 [node-gumbo-parser]: https://github.com/karlwestin/node-gumbo-parser
204 [gumbo-d]: https://github.com/bakkdoor/gumbo-d
205 [lua-gumbo]: https://github.com/craigbarnes/lua-gumbo
206 [OCGumbo]: https://github.com/tracy-e/OCGumbo
207 [ObjectiveGumbo]: https://github.com/programmingthomas/ObjectiveGumbo
208 [GumboBindings]: https://github.com/rgripper/GumboBindings
209 [GumboPHP]: https://github.com/BipSync/gumbo
210 [Gumbo.jl]: https://github.com/porterjamesj/Gumbo.jl
211 [gumbo-libxml]: https://github.com/nostrademons/gumbo-libxml
212
213 [HTML5 parsing algorithm]: http://www.whatwg.org/specs/web-apps/current-work/multipage/#auto-toc-12
214 [HTML5 spec]: http://www.whatwg.org/specs/web-apps/current-work/multipage/
215 [html5lib tests]: https://github.com/html5lib/html5lib-tests
216 [googletest]: https://code.google.com/p/googletest/
217 [semantic versioning]: http://semver.org/
218 [HTML::Gumbo]: https://metacpan.org/pod/HTML::Gumbo