comparison mupdf-source/thirdparty/harfbuzz/docs/usermanual-what-is-harfbuzz.xml @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| parents | |
| children | |
| 1:1d09e1dec1d9 | 2:b50eed0cc0ef |
|---|---|
| 1 <?xml version="1.0"?> | |
| 2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" | |
| 3 "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [ | |
| 4 <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'"> | |
| 5 <!ENTITY version SYSTEM "version.xml"> | |
| 6 ]> | |
| 7 <chapter id="what-is-harfbuzz"> | |
| 8 <title>What is HarfBuzz?</title> | |
| 9 <para> | |
| 10 HarfBuzz is a <emphasis>text-shaping engine</emphasis>. If you | |
| 11 give HarfBuzz a font and a string containing a sequence of Unicode | |
| 12 codepoints, HarfBuzz selects and positions the corresponding | |
| 13 glyphs from the font, applying all of the necessary layout rules | |
| 14 and font features. HarfBuzz then returns the string to you in the | |
| 15 form that is correctly arranged for the language and writing | |
| 16 system. | |
| 17 </para> | |
| 18 <para> | |
| 19 HarfBuzz can properly shape all of the world's major writing | |
| 20 systems. It runs on all major operating systems and software | |
| 21 platforms and it supports the major font formats in use | |
| 22 today. | |
| 23 </para> | |
| 24 <section id="what-is-text-shaping"> | |
| 25 <title>What is text shaping?</title> | |
| 26 <para> | |
| 27 Text shaping is the process of translating a string of character | |
| 28 codes (such as Unicode codepoints) into a properly arranged | |
| 29 sequence of glyphs that can be rendered onto a screen or into | |
| 30 final output form for inclusion in a document. | |
| 31 </para> | |
| 32 <para> | |
| 33 The shaping process is dependent on the input string, the active | |
| 34 font, the script (or writing system) that the string is in, and | |
| 35 the language that the string is in. | |
| 36 </para> | |
| 37 <para> | |
| 38 Modern software systems generally only deal with strings in the | |
| 39 Unicode encoding scheme (although legacy systems and documents may | |
| 40 involve other encodings). | |
| 41 </para> | |
| 42 <para> | |
| 43 There are several font formats that a program might | |
| 44 encounter, each of which has a set of standard text-shaping | |
| 45 rules. | |
| 46 </para> | |
| 47 <para>The dominant format is <ulink | |
| 48 url="http://www.microsoft.com/typography/otspec/">OpenType</ulink>. The | |
| 49 OpenType specification defines a series of <ulink url="https://github.com/n8willis/opentype-shaping-documents">shaping models</ulink> for | |
| 50 various scripts from around the world. These shaping models depend on | |
| 51 the font incorporating certain features as | |
| 52 <emphasis>lookups</emphasis> in its <literal>GSUB</literal> | |
| 53 and <literal>GPOS</literal> tables. | |
| 54 </para> | |
| 55 <para> | |
| 56 Alternatively, OpenType fonts can include shaping features for | |
| 57 the <ulink url="https://graphite.sil.org/">Graphite</ulink> shaping model. | |
| 58 </para> | |
| 59 <para> | |
| 60 TrueType fonts can also include OpenType shaping | |
| 61 features. Alternatively, TrueType fonts can also include <ulink url="https://developer.apple.com/fonts/TrueType-Reference-Manual/RM09/AppendixF.html">Apple | |
| 62 Advanced Typography</ulink> (AAT) tables to implement shaping | |
| 63 support. AAT fonts are generally only found on macOS and iOS systems. | |
| 64 </para> | |
| 65 <para> | |
| 66 Text strings will usually be tagged with a script and language | |
| 67 tag that provide the context needed to perform text shaping | |
| 68 correctly. The necessary <ulink | |
| 69 url="https://docs.microsoft.com/en-us/typography/opentype/spec/scripttags">script</ulink> | |
| 70 and <ulink | |
| 71 url="https://docs.microsoft.com/en-us/typography/opentype/spec/languagetags">language</ulink> | |
| 72 tags are defined by OpenType. | |
| 73 </para> | |
| 74 </section> | |
| 75 | |
| 76 <section id="why-do-i-need-a-shaping-engine"> | |
| 77 <title>Why do I need a shaping engine?</title> | |
| 78 <para> | |
| 79 Text shaping is an integral part of preparing text for | |
| 80 display. Before a Unicode sequence can be rendered, the | |
| 81 codepoints in the sequence must be mapped to the corresponding | |
| 82 glyphs provided in the font, and those glyphs must be positioned | |
| 83 correctly relative to each other. For many of the scripts | |
| 84 supported in Unicode, these steps involve script-specific layout | |
| 85 rules, including complex joining, reordering, and positioning | |
| 86 behavior. Implementing these rules is the job of the shaping engine. | |
| 87 </para> | |
| 88 <para> | |
| 89 Text shaping is a fairly low-level operation. HarfBuzz is | |
| 90 used directly by text-handling libraries like <ulink | |
| 91 url="https://www.pango.org/">Pango</ulink>, as well as by the layout | |
| 92 engines in Firefox, LibreOffice, and Chromium. Unless you are | |
| 93 <emphasis>writing</emphasis> one of these layout engines | |
| 94 yourself, you will probably not need to use HarfBuzz: normally, | |
| 95 a layout engine, toolkit, or other library will turn text into | |
| 96 glyphs for you. | |
| 97 </para> | |
| 98 <para> | |
| 99 However, if you <emphasis>are</emphasis> writing a layout engine | |
| 100 or graphics library yourself, then you will need to perform text | |
| 101 shaping, and this is where HarfBuzz can help you. | |
| 102 </para> | |
| 103 <para> | |
| 104 Here are some specific scenarios where a text-shaping engine | |
| 105 like HarfBuzz helps you: | |
| 106 </para> | |
| 107 <itemizedlist> | |
| 108 <listitem> | |
| 109 <para> | |
| 110 OpenType fonts contain a set of glyphs (that is, shapes | |
| 111 to represent the letters, numbers, punctuation marks, and | |
| 112 all other symbols), which are indexed by a <literal>glyph ID</literal>. | |
| 113 </para> | |
| 114 <para> | |
| 115 A particular glyph ID within the font does not necessarily | |
| 116 correlate to a predictable Unicode codepoint. For instance, | |
| 117 some fonts have the letter "a" as glyph ID 1, but | |
| 118 many others do not. In order to retrieve the right glyph | |
| 119 from the font to display "a", you need to consult | |
| 120 the table inside the font (the <literal>cmap</literal> | |
| 121 table) that maps Unicode codepoints to glyph IDs. In other | |
| 122 words, <emphasis>text shaping turns codepoints into glyph | |
| 123 IDs</emphasis>. | |
| 124 </para> | |
| 125 </listitem> | |
| 126 <listitem> | |
| 127 <para> | |
| 128 Many OpenType fonts contain ligatures: combinations of | |
| 129 characters that are rendered as a single unit. For instance, | |
| 130 it is common for the "f, i" letter | |
| 131 sequence to appear in print as the single ligature glyph | |
| 132 "ﬁ". | |
| 133 </para> | |
| 134 <para> | |
| 135 Whether you should render an "f, i" sequence | |
| 136 as <literal>ﬁ</literal> or as "fi" does not | |
| 137 depend on the input text. Instead, it depends on whether | |
| 138 or not the font includes an "ﬁ" glyph and on the | |
| 139 level of ligature application you wish to perform. The font | |
| 140 and the amount of ligature application used are under your | |
| 141 control. In other words, <emphasis>text shaping involves | |
| 142 querying the font's ligature tables and determining what | |
| 143 substitutions should be made</emphasis>. | |
| 144 </para> | |
| 145 </listitem> | |
| 146 <listitem> | |
| 147 <para> | |
| 148 While ligatures like "ﬁ" are optional typographic | |
| 149 refinements, some languages <emphasis>require</emphasis> certain | |
| 150 substitutions to be made in order to display text correctly. | |
| 151 </para> | |
| 152 <para> | |
| 153 For example, in Tamil, when the letter "TTA" (ட) | |
| 154 is followed by the vowel sign "U" (ு), the pair | |
| 155 must be replaced by the single glyph "டு". The | |
| 156 sequence of Unicode characters "ட,ு" needs to be | |
| 157 substituted with a single "டு" glyph from the | |
| 158 font. | |
| 159 </para> | |
| 160 <para> | |
| 161 But "டு" does not have a Unicode codepoint. To | |
| 162 find this glyph, you need to consult the table inside | |
| 163 the font (the <literal>GSUB</literal> table) that contains | |
| 164 substitution information. In other words, <emphasis>text shaping | |
| 165 chooses the correct glyph for a sequence of characters | |
| 166 provided</emphasis>. | |
| 167 </para> | |
| 168 </listitem> | |
| 169 <listitem> | |
| 170 <para> | |
| 171 Similarly, each Arabic character has four different variants | |
| 172 corresponding to the different positions it might appear in | |
| 173 within a sequence. Inside a font, there will be separate | |
| 174 glyphs for the initial, medial, final, and isolated forms of | |
| 175 each letter, each at a different glyph ID. | |
| 176 </para> | |
| 177 <para> | |
| 178 Unicode only assigns one codepoint per character, so a | |
| 179 Unicode string will not tell you which glyph variant to use | |
| 180 for each character. To decide, you need to analyze the whole | |
| 181 string and determine the appropriate glyph for each character | |
| 182 based on its position. In other words, <emphasis>text | |
| 183 shaping chooses the correct form of the letter by its | |
| 184 position and returns the correct glyph from the font</emphasis>. | |
| 185 </para> | |
| 186 </listitem> | |
| 187 <listitem> | |
| 188 <para> | |
| 189 Other languages involve marks and accents that need to be | |
| 190 rendered in specific positions relative to a base character. For | |
| 191 instance, the Moldovan language includes the Cyrillic letter | |
| 192 "zhe" (ж) with a breve accent, like so: "ӂ". | |
| 193 </para> | |
| 194 <para> | |
| 195 Some fonts will provide this character as a single | |
| 196 zhe-with-breve glyph, but other fonts will not and, instead, | |
| 197 will expect the rendering engine to form the character by | |
| 198 superimposing the separate "ж" and "˘" | |
| 199 glyphs. | |
| 200 </para> | |
| 201 <para> | |
| 202 But exactly where you should draw the breve depends on the | |
| 203 height and width of the preceding zhe glyph. To find the | |
| 204 right position, you need to consult the table inside | |
| 205 the font (the <literal>GPOS</literal> table) that contains | |
| 206 positioning information. | |
| 207 In other words, <emphasis>text shaping tells you whether you | |
| 208 have a precomposed glyph within your font or if you need to | |
| 209 compose a glyph yourself out of combining marks—and, | |
| 210 if so, where to position those marks.</emphasis> | |
| 211 </para> | |
| 212 </listitem> | |
| 213 </itemizedlist> | |
| 214 <para> | |
| 215 If tasks like these are something that you need to do, then you | |
| 216 need a text shaping engine. You could use Uniscribe if you are | |
| 217 writing Windows software; you could use CoreText on macOS; or | |
| 218 you could use HarfBuzz. | |
| 219 </para> | |
| 220 <note> | |
| 221 <para> | |
| 222 In the rest of this manual, the text will assume that the reader | |
| 223 is the implementor of a text-layout engine. | |
| 224 </para> | |
| 225 </note> | |
| 226 </section> | |
| 227 | |
| 228 | |
| 229 <section id="what-does-harfbuzz-do"> | |
| 230 <title>What does HarfBuzz do?</title> | |
| 231 <para> | |
| 232 HarfBuzz provides text shaping through a cross-platform | |
| 233 C API that accepts sequences of Unicode codepoints as input. Currently, | |
| 234 the following OpenType shaping models are supported: | |
| 235 </para> | |
| 236 <itemizedlist> | |
| 237 <listitem> | |
| 238 <para> | |
| 239 Indic (covering Devanagari, Bengali, Gujarati, | |
| 240 Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and Telugu) | |
| 241 </para> | |
| 242 </listitem> | |
| 243 <listitem> | |
| 244 <para> | |
| 245 Arabic (covering Arabic, N'Ko, Syriac, and Mongolian) | |
| 246 </para> | |
| 247 </listitem> | |
| 248 <listitem> | |
| 249 <para> | |
| 250 Thai and Lao | |
| 251 </para> | |
| 252 </listitem> | |
| 253 <listitem> | |
| 254 <para> | |
| 255 Khmer | |
| 256 </para> | |
| 257 </listitem> | |
| 258 <listitem> | |
| 259 <para> | |
| 260 Myanmar | |
| 261 </para> | |
| 262 </listitem> | |
| 263 | |
| 264 <listitem> | |
| 265 <para> | |
| 266 Tibetan | |
| 267 </para> | |
| 268 </listitem> | |
| 269 | |
| 270 <listitem> | |
| 271 <para> | |
| 272 Hangul | |
| 273 </para> | |
| 274 </listitem> | |
| 275 | |
| 276 <listitem> | |
| 277 <para> | |
| 278 Hebrew | |
| 279 </para> | |
| 280 </listitem> | |
| 281 <listitem> | |
| 282 <para> | |
| 283 The Universal Shaping Engine or <emphasis>USE</emphasis> | |
| 284 (covering complex scripts not covered by the above shaping | |
| 285 models) | |
| 286 </para> | |
| 287 </listitem> | |
| 288 <listitem> | |
| 289 <para> | |
| 290 A default shaping model for non-complex scripts | |
| 291 (covering Latin, Cyrillic, Greek, Armenian, Georgian, Tifinagh, | |
| 292 and many others) | |
| 293 </para> | |
| 294 </listitem> | |
| 295 <listitem> | |
| 296 <para> | |
| 297 Emoji (including emoji modifier sequences, flag sequences, | |
| 298 and ZWJ sequences) | |
| 299 </para> | |
| 300 </listitem> | |
| 301 </itemizedlist> | |
| 302 | |
| 303 <para> | |
| 304 In addition to OpenType shaping, HarfBuzz supports the latest | |
| 305 version of Graphite shaping (the "Graphite 2" model) and AAT | |
| 306 shaping. | |
| 307 </para> | |
| 308 | |
| 309 <para> | |
| 310 HarfBuzz can read and understand TrueType fonts (.ttf), TrueType | |
| 311 collections (.ttc), and OpenType fonts (.otf, including those | |
| 312 fonts that contain TrueType-style outlines and those that | |
| 313 contain PostScript CFF or CFF2 outlines). | |
| 314 </para> | |
| 315 | |
| 316 <para> | |
| 317 HarfBuzz is designed and tested to run on top of the FreeType | |
| 318 font renderer. It can run on Linux, Android, Windows, macOS, and | |
| 319 iOS systems. | |
| 320 </para> | |
| 321 | |
| 322 <para> | |
| 323 In addition to its core shaping functionality, HarfBuzz provides | |
| 324 functions for accessing other font features, including optional | |
| 325 GSUB and GPOS OpenType features, as well as | |
| 326 all color-font formats (<literal>CBDT</literal>, | |
| 327 <literal>sbix</literal>, <literal>COLR/CPAL</literal>, and | |
| 328 <literal>SVG-OT</literal>) and OpenType variable fonts. HarfBuzz | |
| 329 also includes a font-subsetting feature. HarfBuzz can perform | |
| 330 some low-level math-shaping operations, although it does not | |
| 331 currently perform full shaping for mathematical typesetting. | |
| 332 </para> | |
| 333 | |
| 334 <para> | |
| 335 A suite of command-line utilities is also provided in the | |
| 336 source-code tree, designed to help users test and debug | |
| 337 HarfBuzz's features on real-world fonts and input. | |
| 338 </para> | |
| 339 </section> | |
| 340 | |
| 341 <section id="what-harfbuzz-doesnt-do"> | |
| 342 <title>What HarfBuzz doesn't do</title> | |
| 343 <para> | |
| 344 HarfBuzz will take a Unicode string, shape it, and give you the | |
| 345 information required to lay it out correctly on a single | |
| 346 horizontal (or vertical) line using the font provided. That is the | |
| 347 extent of HarfBuzz's responsibility. | |
| 348 </para> | |
| 349 <para> | |
| 350 It is important to note that if you are implementing a complete | |
| 351 text-layout engine you may have other responsibilities that | |
| 352 HarfBuzz will <emphasis>not</emphasis> help you with. For example: | |
| 353 </para> | |
| 354 <itemizedlist> | |
| 355 <listitem> | |
| 356 <para> | |
| 357 HarfBuzz won't help you with bidirectionality. If you want to | |
| 358 lay out text that includes a mix of Hebrew and English, you | |
| 359 will need to ensure that each buffer provided to HarfBuzz | |
| 360 has all of its characters in the same order and that the | |
| 361 directionality of the buffer is set correctly. This may mean | |
| 362 segmenting the text before it is placed into HarfBuzz buffers. For | |
| 363 example, the user will hit the keys in the following | |
| 364 sequence: | |
| 365 </para> | |
| 366 <programlisting> | |
| 367 A B C [space] ג ב א [space] D E F | |
| 368 </programlisting> | |
| 369 <para> | |
| 370 but will expect to see in the output: | |
| 371 </para> | |
| 372 <programlisting> | |
| 373 ABC אבג DEF | |
| 374 </programlisting> | |
| 375 <para> | |
| 376 This reordering is called <emphasis>bidi processing</emphasis> | |
| 377 ("bidi" is short for bidirectional), and the Unicode | |
| 378 Bidirectional Algorithm (Unicode Standard Annex #9) tells you how | |
| 379 to process a string of mixed directionality. | |
| 380 Before sending your string to HarfBuzz, you may need to apply the | |
| 381 bidi algorithm to it. Libraries such as <ulink | |
| 382 url="http://icu-project.org/">ICU</ulink> and <ulink | |
| 383 url="http://fribidi.org/">fribidi</ulink> can do this for you. | |
| 384 </para> | |
| 385 </listitem> | |
| 386 <listitem> | |
| 387 <para> | |
| 388 HarfBuzz won't help you with text that contains different font | |
| 389 properties. For instance, if you have the string "a | |
| 390 <emphasis>huge</emphasis> breakfast", and you expect | |
| 391 "huge" to be italic, then you will need to send three | |
| 392 strings to HarfBuzz: <literal>a</literal>, in your Roman font; | |
| 393 <literal>huge</literal> using your italic font; and | |
| 394 <literal>breakfast</literal> using your Roman font again. | |
| 395 </para> | |
| 396 <para> | |
| 397 Similarly, if you change the font, font size, script, | |
| 398 language, or direction within your string, then you will | |
| 399 need to shape each run independently and output them | |
| 400 independently. HarfBuzz expects to shape a run of characters | |
| 401 that all share the same properties. | |
| 402 </para> | |
| 403 </listitem> | |
| 404 <listitem> | |
| 405 <para> | |
| 406 HarfBuzz won't help you with line breaking, hyphenation, or | |
| 407 justification. As mentioned above, HarfBuzz lays out the string | |
| 408 along a <emphasis>single line</emphasis> of, notionally, | |
| 409 infinite length. If you want to find out where the potential | |
| 410 word, sentence and line break points are in your text, you | |
| 411 could use the ICU library's break iterator functions. | |
| 412 </para> | |
| 413 <para> | |
| 414 HarfBuzz can tell you how wide a shaped piece of text is, which is | |
| 415 useful input to a justification algorithm, but it knows nothing | |
| 416 about paragraphs, lines or line lengths. Nor will it adjust the | |
| 417 space between words to fit them proportionally into a line. | |
| 418 </para> | |
| 419 </listitem> | |
| 420 </itemizedlist> | |
| 421 <para> | |
| 422 If you are a layout-engine implementor, HarfBuzz will help you with the | |
| 423 interface between your text and your font, and that's something | |
| 424 that you'll need—what you then do with the glyphs that your font | |
| 425 returns is up to you. | |
| 426 </para> | |
| 427 </section> | |
| 428 | |
| 429 <section id="why-is-it-called-harfbuzz"> | |
| 430 <title>Why is it called HarfBuzz?</title> | |
| 431 <para> | |
| 432 HarfBuzz began its life as text-shaping code within the FreeType | |
| 433 project (and you will see references to the FreeType authors | |
| 434 within the source code copyright declarations), but was then | |
| 435 extracted out to its own project. This project is maintained by | |
| 436 Behdad Esfahbod, who named it HarfBuzz. Originally, it was a | |
| 437 shaping engine for OpenType fonts—"HarfBuzz" is | |
| 438 the Persian for "open type". | |
| 439 </para> | |
| 440 </section> | |
| 441 </chapter> |
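
The chapter above describes shaping as mapping codepoints to glyph IDs through a `cmap` table, applying substitutions such as ligatures from `GSUB`, and then positioning the glyphs along a single line. The following is a minimal sketch of those three steps on a toy "font" made of plain dictionaries. It is not the HarfBuzz API: `TOY_CMAP`, `TOY_LIGATURES`, `TOY_ADVANCES`, and `shape()` are hypothetical names, and the glyph IDs and advance widths are invented for illustration.

```python
# Toy model of the shaping steps described in the chapter.
# All names and numbers here are hypothetical; this is not HarfBuzz.

# cmap-like table: Unicode codepoint -> glyph ID (IDs are arbitrary).
TOY_CMAP = {ord("f"): 10, ord("i"): 11, ord("n"): 12}

# GSUB-like ligature table: pair of glyph IDs -> single ligature glyph ID.
TOY_LIGATURES = {(10, 11): 50}  # "f" followed by "i" -> one ligature glyph

# Horizontal advance per glyph ID, in made-up font units.
TOY_ADVANCES = {0: 500, 10: 300, 11: 250, 12: 550, 50: 520}

NOTDEF = 0  # glyph used when the toy font has no mapping for a codepoint


def shape(text, use_ligatures=True):
    """Return (glyph_id, x_offset) pairs for one left-to-right run."""
    # Step 1: map each codepoint to a glyph ID through the cmap.
    glyphs = [TOY_CMAP.get(ord(ch), NOTDEF) for ch in text]

    # Step 2: optionally replace glyph pairs with ligature glyphs.
    if use_ligatures:
        substituted = []
        i = 0
        while i < len(glyphs):
            pair = tuple(glyphs[i:i + 2])
            if pair in TOY_LIGATURES:
                substituted.append(TOY_LIGATURES[pair])
                i += 2  # both input glyphs were consumed
            else:
                substituted.append(glyphs[i])
                i += 1
        glyphs = substituted

    # Step 3: place glyphs on a single line by accumulating advances.
    positioned = []
    x = 0
    for gid in glyphs:
        positioned.append((gid, x))
        x += TOY_ADVANCES[gid]
    return positioned


if __name__ == "__main__":
    print(shape("fin"))
    print(shape("fin", use_ligatures=False))
```

The same input yields two glyphs with ligatures enabled and three without, which mirrors the point made above that ligature application depends on the font and on the caller's choice, not on the input text. A real shaper also handles script-specific reordering, contextual forms, and mark positioning, none of which this sketch attempts.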
