comparison mupdf-source/thirdparty/harfbuzz/docs/usermanual-buffers-language-script-and-direction.xml @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| parents | |
| children | |
comparison
| 1:1d09e1dec1d9 | 2:b50eed0cc0ef |
|---|---|
| 1 <?xml version="1.0"?> | |
| 2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" | |
| 3 "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [ | |
| 4 <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'"> | |
| 5 <!ENTITY version SYSTEM "version.xml"> | |
| 6 ]> | |
| 7 <chapter id="buffers-language-script-and-direction"> | |
| 8 <title>Buffers, language, script and direction</title> | |
| 9 <para> | |
| 10 The input to the HarfBuzz shaper is a series of Unicode characters, stored in a | |
| 11 buffer. In this chapter, we'll look at how to set up a buffer with | |
| 12 the text that we want and how to customize the properties of the | |
| 13 buffer. We'll also look at a piece of lower-level machinery that | |
| 14 you will need to understand before proceeding: the functions that | |
| 15 HarfBuzz uses to retrieve Unicode information. | |
| 16 </para> | |
| 17 <para> | |
| 18 After shaping is complete, HarfBuzz puts its output back | |
| 19 into the buffer. But getting that output requires setting up a | |
| 20 face and a font first, so we will look at that in the next chapter | |
| 21 instead of here. | |
| 22 </para> | |
| 23 <section id="creating-and-destroying-buffers"> | |
| 24 <title>Creating and destroying buffers</title> | |
| 25 <para> | |
| 26 As we saw in our <emphasis>Getting Started</emphasis> example, a | |
| 27 buffer is created and | |
| 28 initialized with <function>hb_buffer_create()</function>. This | |
| 29 produces a new, empty buffer object, instantiated with some | |
| 30 default values and ready to accept your Unicode strings. | |
| 31 </para> | |
| 32 <para> | |
| 33 HarfBuzz manages the memory of objects (such as buffers) that it | |
| 34 creates, so you don't have to. When you have finished working on | |
| 35 a buffer, you can call <function>hb_buffer_destroy()</function>: | |
| 36 </para> | |
| 37 <programlisting language="C"> | |
| 38 hb_buffer_t *buf = hb_buffer_create(); | |
| 39 ... | |
| 40 hb_buffer_destroy(buf); | |
| 41 </programlisting> | |
| 42 <para> | |
| 43 This will destroy the object and free its associated memory - | |
| 44 unless some other part of the program holds a reference to this | |
| 45 buffer. If you acquire a HarfBuzz buffer from another subsystem | |
| 46 and want to ensure that it is not freed when other code | |
| 47 destroys it, you should increase its reference count: | |
| 48 </para> | |
| 49 <programlisting language="C"> | |
| 50 void somefunc(hb_buffer_t *buf) { | |
| 51 buf = hb_buffer_reference(buf); | |
| 52 ... | |
| 53 </programlisting> | |
| 54 <para> | |
| 55 And then decrease it once you're done with it: | |
| 56 </para> | |
| 57 <programlisting language="C"> | |
| 58 hb_buffer_destroy(buf); | |
| 59 } | |
| 60 </programlisting> | |
| 61 <para> | |
| 62 While we are on the subject of reference-counting buffers, it is | |
| 63 worth noting that an individual buffer can only meaningfully be | |
| 64 used by one thread at a time. | |
| 65 </para> | |
| 66 <para> | |
| 67 To throw away all the data in your buffer and start from scratch, | |
| 68 call <function>hb_buffer_reset(buf)</function>. If you want to | |
| 69 throw away the string in the buffer but keep the options, you can | |
| 70 instead call <function>hb_buffer_clear_contents(buf)</function>. | |
| 71 </para> | |
| 72 </section> | |
| 73 | |
| 74 <section id="adding-text-to-the-buffer"> | |
| 75 <title>Adding text to the buffer</title> | |
| 76 <para> | |
| 77 Now we have a brand new HarfBuzz buffer. Let's start filling it | |
| 78 with text! From HarfBuzz's perspective, a buffer is just a stream | |
| 79 of Unicode code points, but your input string is probably in one of | |
| 80 the standard Unicode character encodings (UTF-8, UTF-16, or | |
| 81 UTF-32). HarfBuzz provides convenience functions that accept | |
| 82 each of these encodings: | |
| 83 <function>hb_buffer_add_utf8()</function>, | |
| 84 <function>hb_buffer_add_utf16()</function>, and | |
| 85 <function>hb_buffer_add_utf32()</function>. Other than the | |
| 86 character encoding they accept, they function identically. | |
| 87 </para> | |
| 88 <para> | |
| 89 You can add UTF-8 text to a buffer by passing in the text array, | |
| 90 the array's length, an offset into the array for the first | |
| 91 character to add, and the length of the segment to add: | |
| 92 </para> | |
| 93 <programlisting language="C"> | |
| 94 hb_buffer_add_utf8 (hb_buffer_t *buf, | |
| 95 const char *text, | |
| 96 int text_length, | |
| 97 unsigned int item_offset, | |
| 98 int item_length) | |
| 99 </programlisting> | |
| 100 <para> | |
| 101 So, in practice, you can say: | |
| 102 </para> | |
| 103 <programlisting language="C"> | |
| 104 hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text)); | |
| 105 </programlisting> | |
| 106 <para> | |
| 107 This will append your new characters to | |
| 108 <parameter>buf</parameter>, not replace its existing | |
| 109 contents. Also, note that you can use <literal>-1</literal> in | |
| 110 place of the first instance of <function>strlen(text)</function> | |
| 111 if your text array is NUL-terminated. Similarly, you can also use | |
| 112 <literal>-1</literal> as the final argument if you want to add the | |
| 113 array's full contents. | |
| 114 </para> | |
| 115 <para> | |
| 116 Whatever <parameter>item_offset</parameter> and | |
| 117 <parameter>item_length</parameter> you provide, HarfBuzz will also | |
| 118 attempt to grab the five characters <emphasis>before</emphasis> | |
| 119 the offset point and the five characters | |
| 120 <emphasis>after</emphasis> the designated end. These are the | |
| 121 before and after "context" segments, which are used internally | |
| 122 for HarfBuzz to make shaping decisions. They will not be part of | |
| 123 the final output, but they ensure that HarfBuzz's | |
| 124 script-specific shaping operations are correct. If there are | |
| 125 fewer than five characters available for the before or after | |
| 126 contexts, HarfBuzz will just grab what is there. | |
| 127 </para> | |
| 128 <para> | |
| 129 For longer text runs, such as full paragraphs, it might be | |
| 130 tempting to only add smaller sub-segments to a buffer and | |
| 131 shape them in piecemeal fashion. Generally, this is not a good | |
| 132 idea, however, because a lot of shaping decisions are | |
| 133 dependent on this context information. For example, in Arabic | |
| 134 and other connected scripts, HarfBuzz needs to know the code | |
| 135 points before and after each character in order to correctly | |
| 136 determine which glyph to return. | |
| 137 </para> | |
| 138 <para> | |
| 139 The safest approach is to add all of the text available (even | |
| 140 if your text contains a mix of scripts, directions, languages | |
| 141 and fonts), then use <parameter>item_offset</parameter> and | |
| 142 <parameter>item_length</parameter> to indicate which characters you | |
| 143 want shaped (which must all have the same script, direction, | |
| 144 language and font), so that HarfBuzz has access to any context. | |
| 145 </para> | |
| 146 <para> | |
| 147 You can also add Unicode code points directly with | |
| 148 <function>hb_buffer_add_codepoints()</function>. The arguments | |
| 149 to this function are the same as those for the UTF | |
| 150 encodings. Note, however, that this function does not | |
| 151 check the validity of the code points it is given. The UTF functions | |
| 152 replace invalid input with the replacement code point, but with | |
| 153 <function>hb_buffer_add_codepoints()</function> any such checking is up to you. | |
| 154 </para> | |
| 155 | |
| 156 </section> | |
| 157 | |
| 158 <section id="setting-buffer-properties"> | |
| 159 <title>Setting buffer properties</title> | |
| 160 <para> | |
| 161 Buffers containing input characters still need several | |
| 162 properties set before HarfBuzz can shape their text correctly. | |
| 163 </para> | |
| 164 <para> | |
| 165 Initially, all buffers are set to the | |
| 166 <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content | |
| 167 type. After adding text, the buffer should be set to | |
| 168 <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which | |
| 169 indicates that it contains un-shaped input | |
| 170 characters. After shaping, the buffer will have the | |
| 171 <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type. | |
| 172 </para> | |
| 173 <para> | |
| 174 <function>hb_buffer_add_utf8()</function> and the | |
| 175 other UTF functions set the content type of their buffer | |
| 176 automatically. But if you are reusing a buffer you may want to | |
| 177 check its state with | |
| 178 <function>hb_buffer_get_content_type(buffer)</function>. If | |
| 179 necessary you can set the content type with | |
| 180 </para> | |
| 181 <programlisting language="C"> | |
| 182 hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE); | |
| 183 </programlisting> | |
| 184 <para> | |
| 185 to prepare for shaping. | |
| 186 </para> | |
| 187 <para> | |
| 188 Buffers also need to carry information about the script, | |
| 189 language, and text direction of their contents. You can set | |
| 190 these properties individually: | |
| 191 </para> | |
| 192 <programlisting language="C"> | |
| 193 hb_buffer_set_direction(buf, HB_DIRECTION_LTR); | |
| 194 hb_buffer_set_script(buf, HB_SCRIPT_LATIN); | |
| 195 hb_buffer_set_language(buf, hb_language_from_string("en", -1)); | |
| 196 </programlisting> | |
| 197 <para> | |
| 198 However, since these properties are often repeated for | |
| 199 multiple text runs, you can also save them in a | |
| 200 <literal>hb_segment_properties_t</literal> for reuse: | |
| 201 </para> | |
| 202 <programlisting language="C"> | |
| 203 hb_segment_properties_t savedprops; | |
| 204 hb_buffer_get_segment_properties (buf, &amp;savedprops); | |
| 205 ... | |
| 206 hb_buffer_set_segment_properties (buf2, &amp;savedprops); | |
| 207 </programlisting> | |
| 208 <para> | |
| 209 HarfBuzz also provides getter functions to retrieve a buffer's | |
| 210 direction, script, and language properties individually. | |
| 211 </para> | |
| 212 <para> | |
| 213 HarfBuzz recognizes four text directions in | |
| 214 <type>hb_direction_t</type>: left-to-right | |
| 215 (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>), | |
| 216 top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and | |
| 217 bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the | |
| 218 script property, HarfBuzz uses identifiers based on the | |
| 219 <ulink | |
| 220 url="https://unicode.org/iso15924/">ISO 15924 | |
| 221 standard</ulink>. For languages, HarfBuzz uses tags based on the | |
| 222 <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard. | |
| 223 </para> | |
| 224 <para> | |
| 225 Helper functions are provided to convert character strings into | |
| 226 the necessary script and language tag types. | |
| 227 </para> | |
| 228 <para> | |
| 229 Two additional buffer properties to be aware of are the | |
| 230 "invisible glyph" and the replacement code point. The | |
| 231 replacement code point is inserted into buffer output in place of | |
| 232 any invalid code points encountered in the input. By default, it | |
| 233 is the Unicode <literal>REPLACEMENT CHARACTER</literal> code | |
| 234 point, <literal>U+FFFD</literal> "�". You can change this with | |
| 235 </para> | |
| 236 <programlisting language="C"> | |
| 237 hb_buffer_set_replacement_codepoint(buf, replacement); | |
| 238 </programlisting> | |
| 239 <para> | |
| 240 passing in the replacement Unicode code point as the | |
| 241 <parameter>replacement</parameter> parameter. | |
| 242 </para> | |
| 243 <para> | |
| 244 The invisible glyph is used to replace all output glyphs that | |
| 245 are invisible. By default, the standard space character | |
| 246 <literal>U+0020</literal> is used; you can replace this (for | |
| 247 example, when using a font that provides script-specific | |
| 248 spaces) with | |
| 249 </para> | |
| 250 <programlisting language="C"> | |
| 251 hb_buffer_set_invisible_glyph(buf, replacement_glyph); | |
| 252 </programlisting> | |
| 253 <para> | |
| 254 Do note that in the <parameter>replacement_glyph</parameter> | |
| 255 parameter, you must provide the glyph ID of the replacement you | |
| 256 wish to use, not the Unicode code point. | |
| 257 </para> | |
| 258 <para> | |
| 259 HarfBuzz supports a few additional flags you might want to set | |
| 260 on your buffer under certain circumstances. The | |
| 261 <literal>HB_BUFFER_FLAG_BOT</literal> and | |
| 262 <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz | |
| 263 that the buffer represents the beginning or end (respectively) | |
| 264 of a text element (such as a paragraph or other block). Knowing | |
| 265 this allows HarfBuzz to apply certain contextual font features | |
| 266 when shaping, such as initial or final variants in connected | |
| 267 scripts. | |
| 268 </para> | |
| 269 <para> | |
| 270 <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal> | |
| 271 tells HarfBuzz not to hide glyphs with the | |
| 272 <literal>Default_Ignorable</literal> property in Unicode. This | |
| 273 property designates control characters and other non-printing | |
| 274 code points, such as joiners and variation selectors. Normally | |
| 275 HarfBuzz replaces them in the output buffer with zero-width | |
| 276 space glyphs (using the "invisible glyph" property discussed | |
| 277 above); setting this flag causes them to be printed, which can | |
| 278 be helpful for troubleshooting. | |
| 279 </para> | |
| 280 <para> | |
| 281 Conversely, setting the | |
| 282 <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag | |
| 283 tells HarfBuzz to remove <literal>Default_Ignorable</literal> | |
| 284 glyphs from the output buffer entirely. Finally, setting the | |
| 285 <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal> | |
| 286 flag tells HarfBuzz not to insert the dotted-circle glyph | |
| 287 (<literal>U+25CC</literal>, "◌"), which is normally | |
| 288 inserted into buffer output when broken character sequences are | |
| 289 encountered (such as combining marks that are not attached to a | |
| 290 base character). | |
| 291 </para> | |
| 292 </section> | |
| 293 | |
| 294 <section id="customizing-unicode-functions"> | |
| 295 <title>Customizing Unicode functions</title> | |
| 296 <para> | |
| 297 HarfBuzz requires some simple functions for accessing | |
| 298 information from the Unicode Character Database (such as the | |
| 299 <literal>General_Category</literal> (gc) and | |
| 300 <literal>Script</literal> (sc) properties) that is useful | |
| 301 for shaping, as well as some useful operations like composing and | |
| 302 decomposing code points. | |
| 303 </para> | |
| 304 <para> | |
| 305 HarfBuzz includes its own internal, lightweight set of Unicode | |
| 306 functions. At build time, it is also possible to compile support | |
| 307 for some other options, such as the Unicode functions provided | |
| 308 by GLib or the International Components for Unicode (ICU) | |
| 309 library. Generally, this option is only of interest for client | |
| 310 programs that have specific integration requirements or that do | |
| 311 a significant amount of customization. | |
| 312 </para> | |
| 313 <para> | |
| 314 If your program has access to other Unicode functions, however, | |
| 315 such as through a system library or application framework, you | |
| 316 might prefer to use those instead of the built-in | |
| 317 options. HarfBuzz supports this by implementing its Unicode | |
| 318 functions as a set of virtual methods that you can replace — | |
| 319 without otherwise affecting HarfBuzz's functionality. | |
| 320 </para> | |
| 321 <para> | |
| 322 The Unicode functions are specified in a structure called | |
| 323 <literal>unicode_funcs</literal> which is attached to each | |
| 324 buffer. But even though <literal>unicode_funcs</literal> is | |
| 325 associated with a <type>hb_buffer_t</type>, the functions | |
| 326 themselves are called by other HarfBuzz APIs that access | |
| 327 buffers, so it would be unwise for you to hook different | |
| 328 functions into different buffers. | |
| 329 </para> | |
| 330 <para> | |
| 331 In addition, you can mark your <literal>unicode_funcs</literal> | |
| 332 as immutable by calling | |
| 333 <function>hb_unicode_funcs_make_immutable (ufuncs)</function>. | |
| 334 This is especially useful if your code is a | |
| 335 library or framework that will have its own client programs. By | |
| 336 marking your Unicode function choices as immutable, you prevent | |
| 337 your own client programs from changing the | |
| 338 <literal>unicode_funcs</literal> configuration and introducing | |
| 339 inconsistencies and errors downstream. | |
| 340 </para> | |
| 341 <para> | |
| 342 You can retrieve the Unicode-functions configuration for | |
| 343 your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>: | |
| 344 </para> | |
| 345 <programlisting language="C"> | |
| 346 hb_unicode_funcs_t *ufunctions; | |
| 347 ufunctions = hb_buffer_get_unicode_funcs(buf); | |
| 348 </programlisting> | |
| 349 <para> | |
| 350 The current version of <literal>unicode_funcs</literal> uses six functions: | |
| 351 </para> | |
| 352 <itemizedlist> | |
| 353 <listitem> | |
| 354 <para> | |
| 355 <function>hb_unicode_combining_class_func_t</function>: | |
| 356 returns the Canonical Combining Class of a code point. | |
| 357 </para> | |
| 358 </listitem> | |
| 359 <listitem> | |
| 360 <para> | |
| 361 <function>hb_unicode_general_category_func_t</function>: | |
| 362 returns the General Category (gc) of a code point. | |
| 363 </para> | |
| 364 </listitem> | |
| 365 <listitem> | |
| 366 <para> | |
| 367 <function>hb_unicode_mirroring_func_t</function>: returns | |
| 368 the Mirroring Glyph code point (for bi-directional | |
| 369 replacement) of a code point. | |
| 370 </para> | |
| 371 </listitem> | |
| 372 <listitem> | |
| 373 <para> | |
| 374 <function>hb_unicode_script_func_t</function>: returns the | |
| 375 Script (sc) property of a code point. | |
| 376 </para> | |
| 377 </listitem> | |
| 378 <listitem> | |
| 379 <para> | |
| 380 <function>hb_unicode_compose_func_t</function>: returns the | |
| 381 canonical composition of a sequence of two code points. | |
| 382 </para> | |
| 383 </listitem> | |
| 384 <listitem> | |
| 385 <para> | |
| 386 <function>hb_unicode_decompose_func_t</function>: returns | |
| 387 the canonical decomposition of a code point. | |
| 388 </para> | |
| 389 </listitem> | |
| 390 </itemizedlist> | |
| 391 <para> | |
| 392 Note, however, that future HarfBuzz releases may alter this set. | |
| 393 </para> | |
| 394 <para> | |
| 395 Each Unicode function has a corresponding setter, with which you | |
| 396 can assign a callback to your replacement function. For example, | |
| 397 to replace | |
| 398 <function>hb_unicode_general_category_func_t</function>, you can call | |
| 399 </para> | |
| 400 <programlisting language="C"> | |
| 401 hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy); | |
| 402 </programlisting> | |
| 403 <para> | |
| 404 Virtualizing this set of Unicode functions is primarily intended | |
| 405 to improve portability. There is no need for every client | |
| 406 program to make the effort to replace the default options, so if | |
| 407 you are unsure, do not feel any pressure to customize | |
| 408 <literal>unicode_funcs</literal>. | |
| 409 </para> | |
| 410 </section> | |
| 411 | |
| 412 </chapter> |
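
The context rule from the "Adding text to the buffer" section (up to five characters taken on either side of the item, fewer if the text runs out) can be sketched in plain C. This is only a simplified model of the behaviour the chapter describes, not HarfBuzz's own code: `CONTEXT_LEN`, `context_range_t`, and `compute_context()` are hypothetical names introduced for illustration, and offsets are counted in code points rather than in UTF-8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant: the chapter says up to five characters of
 * context are taken on each side of the item. */
#define CONTEXT_LEN 5

/* Hypothetical result type: half-open index ranges [start, end). */
typedef struct {
    size_t pre_start, pre_end;   /* context before the item */
    size_t post_start, post_end; /* context after the item  */
} context_range_t;

/* Given a text of text_length code points and an item that starts at
 * item_offset and spans item_length code points, work out which code
 * points would serve as the before and after context. */
static context_range_t
compute_context(size_t text_length, size_t item_offset, size_t item_length)
{
    context_range_t r;
    size_t item_end;

    /* The item must lie entirely inside the text. */
    assert(item_offset <= text_length);
    assert(item_length <= text_length - item_offset);
    item_end = item_offset + item_length;

    /* Before-context: up to CONTEXT_LEN code points, clamped at 0. */
    r.pre_end = item_offset;
    r.pre_start = item_offset > CONTEXT_LEN ? item_offset - CONTEXT_LEN : 0;

    /* After-context: up to CONTEXT_LEN code points, clamped at the end. */
    r.post_start = item_end;
    r.post_end = text_length - item_end > CONTEXT_LEN
                     ? item_end + CONTEXT_LEN
                     : text_length;
    return r;
}
```

For a 20-code-point text with an `item_offset` of 8 and an `item_length` of 4, this model gives the before-context range [3, 8) and the after-context range [12, 17). When fewer than five code points are available on a side, the range is clamped to what is there, which is why passing the whole text and selecting a sub-range gives the shaper more to work with than passing the sub-range alone.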
