comparison mupdf-source/thirdparty/harfbuzz/docs/usermanual-buffers-language-script-and-direction.xml @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:43:07 +0200
1:1d09e1dec1d9 2:b50eed0cc0ef
1 <?xml version="1.0"?>
2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
3 "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
4 <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
5 <!ENTITY version SYSTEM "version.xml">
6 ]>
7 <chapter id="buffers-language-script-and-direction">
8 <title>Buffers, language, script and direction</title>
9 <para>
10 The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
11 buffer. In this chapter, we'll look at how to set up a buffer with
12 the text that we want and how to customize the properties of the
13 buffer. We'll also look at a piece of lower-level machinery that
14 you will need to understand before proceeding: the functions that
15 HarfBuzz uses to retrieve Unicode information.
16 </para>
17 <para>
18 After shaping is complete, HarfBuzz puts its output back
19 into the buffer. But getting that output requires setting up a
20 face and a font first, so we will look at that in the next chapter
21 instead of here.
22 </para>
23 <section id="creating-and-destroying-buffers">
24 <title>Creating and destroying buffers</title>
25 <para>
26 As we saw in our <emphasis>Getting Started</emphasis> example, a
27 buffer is created and
28 initialized with <function>hb_buffer_create()</function>. This
29 produces a new, empty buffer object, instantiated with some
30 default values and ready to accept your Unicode strings.
31 </para>
32 <para>
33 HarfBuzz manages the memory of objects (such as buffers) that it
34 creates, so you don't have to. When you have finished working on
35 a buffer, you can call <function>hb_buffer_destroy()</function>:
36 </para>
37 <programlisting language="C">
38 hb_buffer_t *buf = hb_buffer_create();
39 ...
40 hb_buffer_destroy(buf);
41 </programlisting>
42 <para>
43 This will destroy the object and free its associated memory,
44 unless some other part of the program holds a reference to this
45 buffer. If you acquire a HarfBuzz buffer from another subsystem
46 and want to ensure that it is not freed by someone
47 else destroying it, you should increase its reference count:
48 </para>
49 <programlisting language="C">
50 void somefunc(hb_buffer_t *buf) {
51 buf = hb_buffer_reference(buf);
52 ...
53 </programlisting>
54 <para>
55 And then decrease it once you're done with it:
56 </para>
57 <programlisting language="C">
58 hb_buffer_destroy(buf);
59 }
60 </programlisting>
61 <para>
62 While we are on the subject of reference-counting buffers, it is
63 worth noting that an individual buffer can only meaningfully be
64 used by one thread at a time.
65 </para>
66 <para>
67 To throw away all the data in your buffer and start from scratch,
68 call <function>hb_buffer_reset(buf)</function>. If you want to
69 throw away the string in the buffer but keep the options, you can
70 instead call <function>hb_buffer_clear_contents(buf)</function>.
71 </para>
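<para>
As a minimal sketch of the difference, the hypothetical helper
<function>flags_survive_clear_but_not_reset()</function> below sets
two buffer flags, adds some text, and checks what each call leaves
behind. It assumes, following the HarfBuzz API reference rather than
anything stated in this chapter, that buffer flags are among the
settings kept by <function>hb_buffer_clear_contents()</function> and
restored to their defaults by <function>hb_buffer_reset()</function>.
</para>
<programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper (not part of HarfBuzz): returns 1 if
 * hb_buffer_clear_contents() emptied the buffer but kept its flags,
 * and hb_buffer_reset() then restored the default flags. */
int flags_survive_clear_but_not_reset (void)
{
  hb_buffer_t *buf = hb_buffer_create ();
  hb_buffer_flags_t flags =
    (hb_buffer_flags_t) (HB_BUFFER_FLAG_BOT | HB_BUFFER_FLAG_EOT);
  int ok;

  hb_buffer_set_flags (buf, flags);
  hb_buffer_add_utf8 (buf, "abc", -1, 0, -1);

  /* Throw away the text only. */
  hb_buffer_clear_contents (buf);
  ok = hb_buffer_get_length (buf) == 0
    && hb_buffer_get_flags (buf) == flags;

  /* Throw away everything and start from scratch. */
  hb_buffer_reset (buf);
  ok = ok && hb_buffer_get_flags (buf) == HB_BUFFER_FLAG_DEFAULT;

  hb_buffer_destroy (buf);
  return ok;
}
]]></programlisting>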
72 </section>
73
74 <section id="adding-text-to-the-buffer">
75 <title>Adding text to the buffer</title>
76 <para>
77 Now we have a brand new HarfBuzz buffer. Let's start filling it
78 with text! From HarfBuzz's perspective, a buffer is just a stream
79 of Unicode code points, but your input string is probably in one of
80 the standard Unicode character encodings (UTF-8, UTF-16, or
81 UTF-32). HarfBuzz provides convenience functions that accept
82 each of these encodings:
83 <function>hb_buffer_add_utf8()</function>,
84 <function>hb_buffer_add_utf16()</function>, and
85 <function>hb_buffer_add_utf32()</function>. Other than the
86 character encoding they accept, they function identically.
87 </para>
88 <para>
89 You can add UTF-8 text to a buffer by passing in the text array,
90 the array's length, an offset into the array for the first
91 character to add, and the length of the segment to add:
92 </para>
93 <programlisting language="C">
94 void hb_buffer_add_utf8 (hb_buffer_t *buf,
95 const char *text,
96 int text_length,
97 unsigned int item_offset,
98 int item_length);
99 </programlisting>
100 <para>
101 So, in practice, you can say:
102 </para>
103 <programlisting language="C">
104 hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
105 </programlisting>
106 <para>
107 This will append your new characters to
108 <parameter>buf</parameter>, not replace its existing
109 contents. Also, note that you can use <literal>-1</literal> in
110 place of the first instance of <function>strlen(text)</function>
111 if your text array is NUL-terminated. Similarly, you can also use
112 <literal>-1</literal> as the final argument if you want to add its
113 full contents.
114 </para>
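<para>
A short sketch of the <literal>-1</literal> shorthand and of the
appending behavior. The helper name
<function>add_whole_string()</function> is made up for this example;
<function>hb_buffer_get_length()</function> is the HarfBuzz call that
reports how many items the buffer currently holds.
</para>
<programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: appends a NUL-terminated UTF-8 string to buf
 * and returns the number of items the buffer holds afterwards. */
unsigned int add_whole_string (hb_buffer_t *buf, const char *text)
{
  /* text_length = -1: let HarfBuzz find the terminating NUL.
   * item_length = -1: add everything from item_offset to the end. */
  hb_buffer_add_utf8 (buf, text, -1, 0, -1);
  return hb_buffer_get_length (buf);
}
]]></programlisting>
<para>
Calling the helper twice on the same buffer, first with
<literal>"abc"</literal> and then with <literal>"de"</literal>,
leaves five items in the buffer, because the second call appends.
</para>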
115 <para>
116 Whatever start <parameter>item_offset</parameter> and
117 <parameter>item_length</parameter> you provide, HarfBuzz will also
118 attempt to grab the five characters <emphasis>before</emphasis>
119 the offset point and the five characters
120 <emphasis>after</emphasis> the designated end. These are the
121 before and after "context" segments, which are used internally
122 for HarfBuzz to make shaping decisions. They will not be part of
123 the final output, but they ensure that HarfBuzz's
124 script-specific shaping operations are correct. If there are
125 fewer than five characters available for the before or after
126 contexts, HarfBuzz will just grab what is there.
127 </para>
128 <para>
129 For longer text runs, such as full paragraphs, it might be
130 tempting to only add smaller sub-segments to a buffer and
131 shape them in piecemeal fashion. Generally, this is not a good
132 idea, however, because a lot of shaping decisions are
133 dependent on this context information. For example, in Arabic
134 and other connected scripts, HarfBuzz needs to know the code
135 points before and after each character in order to correctly
136 determine which glyph to return.
137 </para>
138 <para>
139 The safest approach is to add all of the text available (even
140 if your text contains a mix of scripts, directions, languages
141 and fonts), then use <parameter>item_offset</parameter> and
142 <parameter>item_length</parameter> to indicate which characters you
143 want shaped (which must all have the same script, direction,
144 language and font), so that HarfBuzz has access to any context.
145 </para>
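<para>
The approach can be sketched as follows. The helper
<function>add_segment()</function> is hypothetical; it passes the
whole paragraph as <parameter>text</parameter> and selects one run
with <parameter>item_offset</parameter> and
<parameter>item_length</parameter>, both counted in bytes for UTF-8
input. The sketch relies on HarfBuzz's default behavior of recording
each item's offset from the start of <parameter>text</parameter> as
its cluster value.
</para>
<programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: adds bytes [offset, offset + length) of a
 * longer NUL-terminated UTF-8 paragraph to buf, giving HarfBuzz the
 * whole paragraph so it can pick up context on either side.
 * Returns the cluster value of the first item in the buffer,
 * or (unsigned int) -1 if the buffer is empty. */
unsigned int add_segment (hb_buffer_t *buf, const char *paragraph,
                          unsigned int offset, int length)
{
  unsigned int count = 0;
  hb_glyph_info_t *info;

  hb_buffer_add_utf8 (buf, paragraph, -1, offset, length);

  info = hb_buffer_get_glyph_infos (buf, &count);
  return count ? info[0].cluster : (unsigned int) -1;
}
]]></programlisting>
<para>
For the paragraph <literal>"one two three"</literal>, an offset of 4
and a length of 3 add only the word <literal>two</literal>; the
surrounding characters are available to HarfBuzz as context but do
not become buffer items.
</para>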
146 <para>
147 You can also add Unicode code points directly with
148 <function>hb_buffer_add_codepoints()</function>. The arguments
149 to this function are the same as those for the UTF
150 encodings. But it is particularly important to note that, unlike
151 the UTF functions (which replace invalid input), this function
152 does no validity checking on the code points added to a buffer;
153 it is up to you to do any sanity checking necessary.
154 </para>
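<para>
A minimal sketch of adding code points directly. The helper name and
the sample text (three Devanagari code points) are chosen only for
illustration.
</para>
<programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: appends three code points that are already
 * decoded and returns the resulting buffer length. */
unsigned int add_codepoint_array (hb_buffer_t *buf)
{
  /* U+0915 KA, U+094D VIRAMA, U+0937 SSA */
  static const hb_codepoint_t text[] = { 0x0915, 0x094D, 0x0937 };
  const int n = (int) (sizeof text / sizeof text[0]);

  /* Same argument order as the UTF variants; lengths and offsets
   * count array elements, which here are whole code points. */
  hb_buffer_add_codepoints (buf, text, n, 0, n);
  return hb_buffer_get_length (buf);
}
]]></programlisting>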
155
156 </section>
157
158 <section id="setting-buffer-properties">
159 <title>Setting buffer properties</title>
160 <para>
161 Buffers containing input characters still need several
162 properties set before HarfBuzz can shape their text correctly.
163 </para>
164 <para>
165 Initially, all buffers are set to the
166 <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
167 type. After adding text, the buffer should be set to
168 <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
169 indicates that it contains un-shaped input
170 characters. After shaping, the buffer will have the
171 <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
172 </para>
173 <para>
174 <function>hb_buffer_add_utf8()</function> and the
175 other UTF functions set the content type of their buffer
176 automatically. But if you are reusing a buffer you may want to
177 check its state with
178 <function>hb_buffer_get_content_type(buf)</function>. If
179 necessary you can set the content type with
180 </para>
181 <programlisting language="C">
182 hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
183 </programlisting>
184 <para>
185 to prepare for shaping.
186 </para>
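<para>
The content-type check can be sketched as a small predicate. The
name <function>ready_for_shaping()</function> is hypothetical; it
tests only the content type, not the other properties a buffer needs
before shaping.
</para>
<programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: returns 1 if buf currently holds un-shaped
 * Unicode input, 0 if it is empty/invalid or already holds glyphs. */
int ready_for_shaping (hb_buffer_t *buf)
{
  return hb_buffer_get_content_type (buf)
      == HB_BUFFER_CONTENT_TYPE_UNICODE;
}
]]></programlisting>
<para>
On a freshly created buffer the predicate is false; after a call to
<function>hb_buffer_add_utf8()</function> it becomes true.
</para>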
187 <para>
188 Buffers also need to carry information about the script,
189 language, and text direction of their contents. You can set
190 these properties individually:
191 </para>
192 <programlisting language="C">
193 hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
194 hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
195 hb_buffer_set_language(buf, hb_language_from_string("en", -1));
196 </programlisting>
197 <para>
198 However, since these properties are often repeated for
199 multiple text runs, you can also save them in a
200 <literal>hb_segment_properties_t</literal> for reuse:
201 </para>
202 <programlisting language="C">
203 hb_segment_properties_t savedprops;
204 hb_buffer_get_segment_properties (buf, &amp;savedprops);
205 ...
206 hb_buffer_set_segment_properties (buf2, &amp;savedprops);
207 </programlisting>
208 <para>
209 HarfBuzz also provides getter functions to retrieve a buffer's
210 direction, script, and language properties individually.
211 </para>
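<para>
As an illustration of the individual getters, the hypothetical helper
below checks whether two buffers carry the same direction, script,
and language. It assumes that <type>hb_language_t</type> values can be
compared with <literal>==</literal>, which the HarfBuzz API reference
describes as the intended use.
</para>
<programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: returns 1 if both buffers have identical
 * direction, script, and language properties, 0 otherwise. */
int same_run_properties (hb_buffer_t *a, hb_buffer_t *b)
{
  return hb_buffer_get_direction (a) == hb_buffer_get_direction (b)
      && hb_buffer_get_script (a)    == hb_buffer_get_script (b)
      && hb_buffer_get_language (a)  == hb_buffer_get_language (b);
}
]]></programlisting>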
212 <para>
213 HarfBuzz recognizes four text directions in
214 <type>hb_direction_t</type>: left-to-right
215 (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
216 top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
217 bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the
218 script property, HarfBuzz uses identifiers based on the
219 <ulink
220 url="https://unicode.org/iso15924/">ISO 15924
221 standard</ulink>. For languages, HarfBuzz uses tags based on the
222 <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
223 </para>
224 <para>
225 Helper functions are provided to convert character strings into
226 the necessary script and language tag types.
227 </para>
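<para>
A sketch of those helpers in use. The wrapper
<function>set_properties_from_strings()</function> is hypothetical;
the three <function>hb_*_from_string()</function> calls are HarfBuzz
functions, each taking a string and its length
(<literal>-1</literal> for NUL-terminated input).
</para>
<programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: sets direction, script, and language from
 * strings such as "rtl", "Arab" (an ISO 15924 tag), and "ar"
 * (a BCP 47 tag). Unrecognized direction or script strings yield
 * the corresponding INVALID or UNKNOWN values. */
void set_properties_from_strings (hb_buffer_t *buf,
                                  const char *direction,
                                  const char *script,
                                  const char *language)
{
  hb_buffer_set_direction (buf, hb_direction_from_string (direction, -1));
  hb_buffer_set_script (buf, hb_script_from_string (script, -1));
  hb_buffer_set_language (buf, hb_language_from_string (language, -1));
}
]]></programlisting>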
228 <para>
229 Two additional buffer properties to be aware of are the
230 "invisible glyph" and the replacement code point. The
231 replacement code point is inserted into buffer output in place of
232 any invalid code points encountered in the input. By default, it
233 is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
234 point, <literal>U+FFFD</literal> "&#xFFFD;". You can change this with
235 </para>
236 <programlisting language="C">
237 hb_buffer_set_replacement_codepoint(buf, replacement);
238 </programlisting>
239 <para>
240 passing in the replacement Unicode code point as the
241 <parameter>replacement</parameter> parameter.
242 </para>
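<para>
To see the replacement code point at work, the hypothetical helper
below sets a custom replacement, adds UTF-8 text, and returns the
first code point stored in the buffer. The sketch assumes that a lone
<literal>0xFF</literal> byte, which can never occur in well-formed
UTF-8, is treated as one invalid sequence.
</para>
<programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: adds NUL-terminated UTF-8 text to a fresh
 * buffer that uses the given replacement code point, and returns the
 * first code point stored (0 if nothing was added). */
hb_codepoint_t first_codepoint_with_replacement (const char *text,
                                                 hb_codepoint_t replacement)
{
  hb_buffer_t *buf = hb_buffer_create ();
  unsigned int count = 0;
  hb_glyph_info_t *info;
  hb_codepoint_t first;

  /* Must be set before the text is added. */
  hb_buffer_set_replacement_codepoint (buf, replacement);
  hb_buffer_add_utf8 (buf, text, -1, 0, -1);

  info = hb_buffer_get_glyph_infos (buf, &count);
  first = count ? info[0].codepoint : 0;

  hb_buffer_destroy (buf);
  return first;
}
]]></programlisting>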
243 <para>
244 The invisible glyph is used to replace all output glyphs that
245 are invisible. By default, the standard space character
246 <literal>U+0020</literal> is used; you can replace this (for
247 example, when using a font that provides script-specific
248 spaces) with
249 </para>
250 <programlisting language="C">
251 hb_buffer_set_invisible_glyph(buf, replacement_glyph);
252 </programlisting>
253 <para>
254 Do note that in the <parameter>replacement_glyph</parameter>
255 parameter, you must provide the glyph ID of the replacement you
256 wish to use, not the Unicode code point.
257 </para>
258 <para>
259 HarfBuzz supports a few additional flags you might want to set
260 on your buffer under certain circumstances. The
261 <literal>HB_BUFFER_FLAG_BOT</literal> and
262 <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
263 that the buffer represents the beginning or end (respectively)
264 of a text element (such as a paragraph or other block). Knowing
265 this allows HarfBuzz to apply certain contextual font features
266 when shaping, such as initial or final variants in connected
267 scripts.
268 </para>
269 <para>
270 <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
271 tells HarfBuzz not to hide glyphs with the
272 <literal>Default_Ignorable</literal> property in Unicode. This
273 property designates control characters and other non-printing
274 code points, such as joiners and variation selectors. Normally
275 HarfBuzz replaces them in the output buffer with zero-width
276 space glyphs (using the "invisible glyph" property discussed
277 above); setting this flag causes them to be printed, which can
278 be helpful for troubleshooting.
279 </para>
280 <para>
281 Conversely, setting the
282 <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
283 tells HarfBuzz to remove <literal>Default_Ignorable</literal>
284 glyphs from the output buffer entirely. Finally, setting the
285 <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
286 flag tells HarfBuzz not to insert the dotted-circle glyph
287 (<literal>U+25CC</literal>, "&#x25CC;"), which is normally
288 inserted into buffer output when broken character sequences are
289 encountered (such as combining marks that are not attached to a
290 base character).
291 </para>
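<para>
Flags are combined with bitwise OR and applied with
<function>hb_buffer_set_flags()</function>. The sketch below, with a
made-up helper name, marks a buffer as holding a complete paragraph
while leaving any flags that are already set untouched.
</para>
<programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: tells HarfBuzz that buf holds both the
 * beginning and the end of a text element, preserving other flags. */
void mark_whole_paragraph (hb_buffer_t *buf)
{
  hb_buffer_flags_t flags = (hb_buffer_flags_t)
    (hb_buffer_get_flags (buf) | HB_BUFFER_FLAG_BOT | HB_BUFFER_FLAG_EOT);

  hb_buffer_set_flags (buf, flags);
}
]]></programlisting>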
292 </section>
293
294 <section id="customizing-unicode-functions">
295 <title>Customizing Unicode functions</title>
296 <para>
297 HarfBuzz requires some simple functions for accessing
298 information from the Unicode Character Database (such as the
299 <literal>General_Category</literal> (gc) and
300 <literal>Script</literal> (sc) properties) that is useful
301 for shaping, as well as some useful operations like composing and
302 decomposing code points.
303 </para>
304 <para>
305 HarfBuzz includes its own internal, lightweight set of Unicode
306 functions. At build time, it is also possible to compile support
307 for some other options, such as the Unicode functions provided
308 by GLib or the International Components for Unicode (ICU)
309 library. Generally, this option is only of interest for client
310 programs that have specific integration requirements or that do
311 a significant amount of customization.
312 </para>
313 <para>
314 If your program has access to other Unicode functions, however,
315 such as through a system library or application framework, you
316 might prefer to use those instead of the built-in
317 options. HarfBuzz supports this by implementing its Unicode
318 functions as a set of virtual methods that you can replace —
319 without otherwise affecting HarfBuzz's functionality.
320 </para>
321 <para>
322 The Unicode functions are specified in a structure called
323 <literal>unicode_funcs</literal> which is attached to each
324 buffer. But even though <literal>unicode_funcs</literal> is
325 associated with a <type>hb_buffer_t</type>, the functions
326 themselves are called by other HarfBuzz APIs that access
327 buffers, so it would be unwise for you to hook different
328 functions into different buffers.
329 </para>
330 <para>
331 In addition, you can mark your <literal>unicode_funcs</literal>
332 as immutable by calling
333 <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
334 This is especially useful if your code is a
335 library or framework that will have its own client programs. By
336 marking your Unicode function choices as immutable, you prevent
337 your own client programs from changing the
338 <literal>unicode_funcs</literal> configuration and introducing
339 inconsistencies and errors downstream.
340 </para>
341 <para>
342 You can retrieve the Unicode-functions configuration for
343 your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
344 </para>
345 <programlisting language="C">
346 hb_unicode_funcs_t *ufunctions;
347 ufunctions = hb_buffer_get_unicode_funcs(buf);
348 </programlisting>
349 <para>
350 The current version of <literal>unicode_funcs</literal> uses six functions:
351 </para>
352 <itemizedlist>
353 <listitem>
354 <para>
355 <function>hb_unicode_combining_class_func_t</function>:
356 returns the Canonical Combining Class of a code point.
357 </para>
358 </listitem>
359 <listitem>
360 <para>
361 <function>hb_unicode_general_category_func_t</function>:
362 returns the General Category (gc) of a code point.
363 </para>
364 </listitem>
365 <listitem>
366 <para>
367 <function>hb_unicode_mirroring_func_t</function>: returns
368 the Mirroring Glyph code point (for bi-directional
369 replacement) of a code point.
370 </para>
371 </listitem>
372 <listitem>
373 <para>
374 <function>hb_unicode_script_func_t</function>: returns the
375 Script (sc) property of a code point.
376 </para>
377 </listitem>
378 <listitem>
379 <para>
380 <function>hb_unicode_compose_func_t</function>: returns the
381 canonical composition of a sequence of two code points.
382 </para>
383 </listitem>
384 <listitem>
385 <para>
386 <function>hb_unicode_decompose_func_t</function>: returns
387 the canonical decomposition of a code point.
388 </para>
389 </listitem>
390 </itemizedlist>
391 <para>
392 Note, however, that future HarfBuzz releases may alter this set.
393 </para>
394 <para>
395 Each Unicode function has a corresponding setter, with which you
396 can install your replacement function as a callback. For example,
397 to replace
398 <function>hb_unicode_general_category_func_t</function>, you can call
399 </para>
400 <programlisting language="C">
401 hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
402 </programlisting>
403 <para>
404 Virtualizing this set of Unicode functions is primarily intended
405 to improve portability. There is no need for every client
406 program to make the effort to replace the default options, so if
407 you are unsure, do not feel any pressure to customize
408 <literal>unicode_funcs</literal>.
409 </para>
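<para>
For readers who do need to customize, here is a minimal sketch of the
whole sequence. The callback and its special case for
<literal>U+E000</literal> are invented for illustration.
<function>hb_unicode_funcs_create()</function>,
<function>hb_unicode_funcs_get_default()</function>, and
<function>hb_unicode_funcs_get_parent()</function> come from the
HarfBuzz API rather than from this chapter; the sketch assumes the
default functions are backed by real Unicode data, as in a standard
build.
</para>
<programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical callback: reports the private-use code point U+E000
 * as a letter and defers to the parent functions for all others. */
static hb_unicode_general_category_t
my_general_category (hb_unicode_funcs_t *ufuncs,
                     hb_codepoint_t      unicode,
                     void               *user_data)
{
  (void) user_data; /* unused in this sketch */

  if (unicode == 0xE000)
    return HB_UNICODE_GENERAL_CATEGORY_OTHER_LETTER;

  return hb_unicode_general_category (hb_unicode_funcs_get_parent (ufuncs),
                                      unicode);
}

/* Hypothetical helper: builds a unicode_funcs object that inherits
 * every function from the defaults except General_Category, then
 * locks it. The caller owns the returned reference. */
hb_unicode_funcs_t *create_custom_unicode_funcs (void)
{
  hb_unicode_funcs_t *ufuncs =
    hb_unicode_funcs_create (hb_unicode_funcs_get_default ());

  /* No user data and no destroy callback are needed here. */
  hb_unicode_funcs_set_general_category_func (ufuncs, my_general_category,
                                              NULL, NULL);
  hb_unicode_funcs_make_immutable (ufuncs);
  return ufuncs;
}
]]></programlisting>
<para>
The result can be attached to a buffer with
<function>hb_buffer_set_unicode_funcs(buf, ufuncs)</function>; the
buffer takes its own reference, so the caller can then release its
reference with <function>hb_unicode_funcs_destroy(ufuncs)</function>.
</para>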
410 </section>
411
412 </chapter>