diff mupdf-source/thirdparty/harfbuzz/docs/usermanual-buffers-language-script-and-direction.xml @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| parents | |
| children | |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/mupdf-source/thirdparty/harfbuzz/docs/usermanual-buffers-language-script-and-direction.xml Mon Sep 15 11:43:07 2025 +0200
@@ -0,0 +1,412 @@
+<?xml version="1.0"?>
+<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
+                  "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
+  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
+  <!ENTITY version SYSTEM "version.xml">
+]>
+<chapter id="buffers-language-script-and-direction">
+  <title>Buffers, language, script and direction</title>
+  <para>
+    The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
+    buffer. In this chapter, we'll look at how to set up a buffer with
+    the text that we want and how to customize the properties of the
+    buffer. We'll also look at a piece of lower-level machinery that
+    you will need to understand before proceeding: the functions that
+    HarfBuzz uses to retrieve Unicode information.
+  </para>
+  <para>
+    After shaping is complete, HarfBuzz puts its output back
+    into the buffer. But getting that output requires setting up a
+    face and a font first, so we will look at that in the next chapter
+    instead of here.
+  </para>
+  <section id="creating-and-destroying-buffers">
+    <title>Creating and destroying buffers</title>
+    <para>
+      As we saw in our <emphasis>Getting Started</emphasis> example, a
+      buffer is created and
+      initialized with <function>hb_buffer_create()</function>. This
+      produces a new, empty buffer object, instantiated with some
+      default values and ready to accept your Unicode strings.
+    </para>
+    <para>
+      HarfBuzz manages the memory of objects (such as buffers) that it
+      creates, so you don't have to. When you have finished working on
+      a buffer, you can call <function>hb_buffer_destroy()</function>:
+    </para>
+    <programlisting language="C">
+      hb_buffer_t *buf = hb_buffer_create();
+      ...
+      hb_buffer_destroy(buf);
+    </programlisting>
+    <para>
+      This will destroy the object and free its associated memory,
+      unless some other part of the program holds a reference to this
+      buffer. If you acquire a HarfBuzz buffer from another subsystem
+      and want to ensure that it is not freed when some other code
+      destroys it, you should increase its reference count:
+    </para>
+    <programlisting language="C">
+      void somefunc(hb_buffer_t *buf) {
+        buf = hb_buffer_reference(buf);
+        ...
+    </programlisting>
+    <para>
+      And then decrease it once you're done with it:
+    </para>
+    <programlisting language="C">
+        hb_buffer_destroy(buf);
+      }
+    </programlisting>
+    <para>
+      While we are on the subject of reference-counting buffers, it is
+      worth noting that an individual buffer can only meaningfully be
+      used by one thread at a time.
+    </para>
+    <para>
+      To throw away all the data in your buffer and start from scratch,
+      call <function>hb_buffer_reset(buf)</function>. If you want to
+      throw away the string in the buffer but keep the options, you can
+      instead call <function>hb_buffer_clear_contents(buf)</function>.
+    </para>
+  </section>
+
+  <section id="adding-text-to-the-buffer">
+    <title>Adding text to the buffer</title>
+    <para>
+      Now we have a brand new HarfBuzz buffer. Let's start filling it
+      with text! From HarfBuzz's perspective, a buffer is just a stream
+      of Unicode code points, but your input string is probably in one of
+      the standard Unicode character encodings (UTF-8, UTF-16, or
+      UTF-32). HarfBuzz provides convenience functions that accept
+      each of these encodings:
+      <function>hb_buffer_add_utf8()</function>,
+      <function>hb_buffer_add_utf16()</function>, and
+      <function>hb_buffer_add_utf32()</function>. Other than the
+      character encoding they accept, they function identically.
+    </para>
+    <para>
+      You can add UTF-8 text to a buffer by passing in the text array,
+      the array's length, an offset into the array for the first
+      character to add, and the length of the segment to add:
+    </para>
+    <programlisting language="C">
+      void hb_buffer_add_utf8 (hb_buffer_t *buf,
+                               const char *text,
+                               int text_length,
+                               unsigned int item_offset,
+                               int item_length)
+    </programlisting>
+    <para>
+      So, in practice, you can say:
+    </para>
+    <programlisting language="C">
+      hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
+    </programlisting>
+    <para>
+      This will append your new characters to
+      <parameter>buf</parameter>, not replace its existing
+      contents. Also, note that you can use <literal>-1</literal> in
+      place of the first instance of <function>strlen(text)</function>
+      if your text array is NUL-terminated. Similarly, you can also use
+      <literal>-1</literal> as the final argument if you want to add its full
+      contents.
+    </para>
+    <para>
+      Whatever <parameter>item_offset</parameter> and
+      <parameter>item_length</parameter> you provide, HarfBuzz will also
+      attempt to grab the five characters <emphasis>before</emphasis>
+      the offset point and the five characters
+      <emphasis>after</emphasis> the designated end. These are the
+      before and after "context" segments, which are used internally
+      for HarfBuzz to make shaping decisions. They will not be part of
+      the final output, but they ensure that HarfBuzz's
+      script-specific shaping operations are correct. If there are
+      fewer than five characters available for the before or after
+      contexts, HarfBuzz will just grab what is there.
+    </para>
+    <para>
+      For longer text runs, such as full paragraphs, it might be
+      tempting to only add smaller sub-segments to a buffer and
+      shape them in piecemeal fashion. Generally, this is not a good
+      idea, however, because a lot of shaping decisions are
+      dependent on this context information. For example, in Arabic
+      and other connected scripts, HarfBuzz needs to know the code
+      points before and after each character in order to correctly
+      determine which glyph to return.
+    </para>
+    <para>
+      The safest approach is to add all of the text available (even
+      if your text contains a mix of scripts, directions, languages
+      and fonts), then use <parameter>item_offset</parameter> and
+      <parameter>item_length</parameter> to indicate which characters you
+      want shaped (which must all have the same script, direction,
+      language and font), so that HarfBuzz has access to any context.
+    </para>
+    <para>
+      You can also add Unicode code points directly with
+      <function>hb_buffer_add_codepoints()</function>. The arguments
+      to this function are the same as those for the UTF
+      encodings. But it is particularly important to note that
+      HarfBuzz does not do validity checking on the text that is added
+      to a buffer. Invalid code points will be replaced, but it is up
+      to you to do any deeper sanity checking that is necessary.
+    </para>
+
+  </section>
+
+  <section id="setting-buffer-properties">
+    <title>Setting buffer properties</title>
+    <para>
+      Buffers containing input characters still need several
+      properties set before HarfBuzz can shape their text correctly.
+    </para>
+    <para>
+      Initially, all buffers are set to the
+      <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
+      type. After adding text, the buffer should be set to
+      <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
+      indicates that it contains un-shaped input
+      characters. After shaping, the buffer will have the
+      <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
+    </para>
+    <para>
+      <function>hb_buffer_add_utf8()</function> and the
+      other UTF functions set the content type of their buffer
+      automatically. But if you are reusing a buffer you may want to
+      check its state with
+      <function>hb_buffer_get_content_type(buf)</function>. If
+      necessary, you can set the content type with
+    </para>
+    <programlisting language="C">
+      hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
+    </programlisting>
+    <para>
+      to prepare for shaping.
+    </para>
+    <para>
+      Buffers also need to carry information about the script,
+      language, and text direction of their contents. You can set
+      these properties individually:
+    </para>
+    <programlisting language="C">
+      hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
+      hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
+      hb_buffer_set_language(buf, hb_language_from_string("en", -1));
+    </programlisting>
+    <para>
+      However, since these properties are often repeated for
+      multiple text runs, you can also save them in a
+      <literal>hb_segment_properties_t</literal> for reuse:
+    </para>
+    <programlisting language="C">
+      hb_segment_properties_t savedprops;
+      hb_buffer_get_segment_properties (buf, &amp;savedprops);
+      ...
+      hb_buffer_set_segment_properties (buf2, &amp;savedprops);
+    </programlisting>
+    <para>
+      HarfBuzz also provides getter functions to retrieve a buffer's
+      direction, script, and language properties individually.
+    </para>
+    <para>
+      HarfBuzz recognizes four text directions in
+      <type>hb_direction_t</type>: left-to-right
+      (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
+      top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
+      bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the
+      script property, HarfBuzz uses identifiers based on the
+      <ulink
+        url="https://unicode.org/iso15924/">ISO 15924
+      standard</ulink>. For languages, HarfBuzz uses tags based on the
+      <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
+    </para>
+    <para>
+      Helper functions are provided to convert character strings into
+      the necessary script and language tag types.
+    </para>
+    <para>
+      Two additional buffer properties to be aware of are the
+      "invisible glyph" and the replacement code point. The
+      replacement code point is inserted into buffer output in place of
+      any invalid code points encountered in the input. By default, it
+      is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
+      point, <literal>U+FFFD</literal> "�". You can change this with
+    </para>
+    <programlisting language="C">
+      hb_buffer_set_replacement_codepoint(buf, replacement);
+    </programlisting>
+    <para>
+      passing in the replacement Unicode code point as the
+      <parameter>replacement</parameter> parameter.
+    </para>
+    <para>
+      The invisible glyph is used to replace all output glyphs that
+      are invisible. By default, the standard space character
+      <literal>U+0020</literal> is used; you can replace this (for
+      example, when using a font that provides script-specific
+      spaces) with
+    </para>
+    <programlisting language="C">
+      hb_buffer_set_invisible_glyph(buf, replacement_glyph);
+    </programlisting>
+    <para>
+      Do note that in the <parameter>replacement_glyph</parameter>
+      parameter, you must provide the glyph ID of the replacement you
+      wish to use, not the Unicode code point.
+    </para>
+    <para>
+      HarfBuzz supports a few additional flags you might want to set
+      on your buffer under certain circumstances. The
+      <literal>HB_BUFFER_FLAG_BOT</literal> and
+      <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
+      that the buffer represents the beginning or end (respectively)
+      of a text element (such as a paragraph or other block). Knowing
+      this allows HarfBuzz to apply certain contextual font features
+      when shaping, such as initial or final variants in connected
+      scripts.
+    </para>
+    <para>
+      <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
+      tells HarfBuzz not to hide glyphs with the
+      <literal>Default_Ignorable</literal> property in Unicode. This
+      property designates control characters and other non-printing
+      code points, such as joiners and variation selectors. Normally
+      HarfBuzz replaces them in the output buffer with zero-width
+      space glyphs (using the "invisible glyph" property discussed
+      above); setting this flag causes them to be printed, which can
+      be helpful for troubleshooting.
+    </para>
+    <para>
+      Conversely, setting the
+      <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
+      tells HarfBuzz to remove <literal>Default_Ignorable</literal>
+      glyphs from the output buffer entirely. Finally, setting the
+      <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
+      flag tells HarfBuzz not to insert the dotted-circle glyph
+      (<literal>U+25CC</literal>, "◌"), which is normally
+      inserted into buffer output when broken character sequences are
+      encountered (such as combining marks that are not attached to a
+      base character).
+    </para>
+  </section>
+
+  <section id="customizing-unicode-functions">
+    <title>Customizing Unicode functions</title>
+    <para>
+      HarfBuzz requires some simple functions for accessing
+      information from the Unicode Character Database (such as the
+      <literal>General_Category</literal> (gc) and
+      <literal>Script</literal> (sc) properties) that is useful
+      for shaping, as well as some useful operations like composing and
+      decomposing code points.
+    </para>
+    <para>
+      HarfBuzz includes its own internal, lightweight set of Unicode
+      functions. At build time, it is also possible to compile support
+      for some other options, such as the Unicode functions provided
+      by GLib or the International Components for Unicode (ICU)
+      library. Generally, this option is only of interest for client
+      programs that have specific integration requirements or that do
+      a significant amount of customization.
+    </para>
+    <para>
+      If your program has access to other Unicode functions, however,
+      such as through a system library or application framework, you
+      might prefer to use those instead of the built-in
+      options. HarfBuzz supports this by implementing its Unicode
+      functions as a set of virtual methods that you can replace
+      without otherwise affecting HarfBuzz's functionality.
+    </para>
+    <para>
+      The Unicode functions are specified in a structure called
+      <literal>unicode_funcs</literal> which is attached to each
+      buffer. But even though <literal>unicode_funcs</literal> is
+      associated with a <type>hb_buffer_t</type>, the functions
+      themselves are called by other HarfBuzz APIs that access
+      buffers, so it would be unwise for you to hook different
+      functions into different buffers.
+    </para>
+    <para>
+      In addition, you can mark your <literal>unicode_funcs</literal>
+      as immutable by calling
+      <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
+      This is especially useful if your code is a
+      library or framework that will have its own client programs. By
+      marking your Unicode function choices as immutable, you prevent
+      your own client programs from changing the
+      <literal>unicode_funcs</literal> configuration and introducing
+      inconsistencies and errors downstream.
+    </para>
+    <para>
+      You can retrieve the Unicode-functions configuration for
+      your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
+    </para>
+    <programlisting language="C">
+      hb_unicode_funcs_t *ufunctions;
+      ufunctions = hb_buffer_get_unicode_funcs(buf);
+    </programlisting>
+    <para>
+      The current version of <literal>unicode_funcs</literal> uses six functions:
+    </para>
+    <itemizedlist>
+      <listitem>
+        <para>
+          <function>hb_unicode_combining_class_func_t</function>:
+          returns the Canonical Combining Class of a code point.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          <function>hb_unicode_general_category_func_t</function>:
+          returns the General Category (gc) of a code point.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          <function>hb_unicode_mirroring_func_t</function>: returns
+          the Mirroring Glyph code point (for bi-directional
+          replacement) of a code point.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          <function>hb_unicode_script_func_t</function>: returns the
+          Script (sc) property of a code point.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          <function>hb_unicode_compose_func_t</function>: returns the
+          canonical composition of a sequence of two code points.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          <function>hb_unicode_decompose_func_t</function>: returns
+          the canonical decomposition of a code point.
+        </para>
+      </listitem>
+    </itemizedlist>
+    <para>
+      Note, however, that future HarfBuzz releases may alter this set.
+    </para>
+    <para>
+      Each Unicode function has a corresponding setter, with which you
+      can assign a callback to your replacement function. For example,
+      to replace
+      <function>hb_unicode_general_category_func_t</function>, you can call
+    </para>
+    <programlisting language="C">
+      hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
+    </programlisting>
+    <para>
+      Virtualizing this set of Unicode functions is primarily intended
+      to improve portability. There is no need for every client
+      program to make the effort to replace the default options, so if
+      you are unsure, do not feel any pressure to customize
+      <literal>unicode_funcs</literal>.
+    </para>
+  </section>
+
+</chapter>
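
Editorial note on the "Creating and destroying buffers" section: the reference/destroy pairing described there can be illustrated with a small stand-alone model. The sketch below is not HarfBuzz code. `toy_buffer_t`, `toy_buffer_create()`, `toy_buffer_reference()`, `toy_buffer_destroy()` and `toy_live_objects` are hypothetical names invented for this illustration, and the real library's internals may well differ. It only shows the general idea that each "destroy" call drops one reference and the memory is released when the last reference goes away.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy object for illustration; NOT a HarfBuzz type. */
typedef struct {
  int ref_count;
} toy_buffer_t;

/* Counts objects that have been created and not yet freed. */
static int toy_live_objects = 0;

/* Create an object that starts with a single reference (the caller's). */
static toy_buffer_t *
toy_buffer_create (void)
{
  toy_buffer_t *buf = malloc (sizeof *buf);
  if (buf == NULL)
    return NULL;
  buf->ref_count = 1;
  toy_live_objects++;
  return buf;
}

/* Take an additional reference and return the same pointer. */
static toy_buffer_t *
toy_buffer_reference (toy_buffer_t *buf)
{
  assert (buf != NULL && buf->ref_count > 0);
  buf->ref_count++;
  return buf;
}

/* Drop one reference; free the object only when none remain. */
static void
toy_buffer_destroy (toy_buffer_t *buf)
{
  if (buf == NULL)
    return;
  assert (buf->ref_count > 0);
  if (--buf->ref_count == 0)
    {
      toy_live_objects--;
      free (buf);
    }
}
```

In this model, a function that receives a buffer from elsewhere would call `toy_buffer_reference()` on entry and `toy_buffer_destroy()` on exit, so the object survives even if the original owner destroys it in between. The toy counter is not atomic, so the sketch is single-threaded only.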

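
Editorial note on the "Adding text to the buffer" section: the way `item_offset`, `item_length`, the `-1` shortcuts and the five-character context interact can be sketched as plain index arithmetic. This is a simplified model under stated assumptions, not the library's implementation. `segment_t`, `resolve_segment()` and `CONTEXT_LENGTH` are hypothetical names, and the model counts bytes of ASCII text, whereas the chapter describes context in terms of characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Context size taken from the chapter text ("five characters"). */
#define CONTEXT_LENGTH 5

/* Hypothetical result type for this sketch; NOT a HarfBuzz type. */
typedef struct {
  size_t item_start; /* first unit that would be shaped */
  size_t item_end;   /* one past the last unit that would be shaped */
  size_t pre_start;  /* start of the "before" context */
  size_t post_end;   /* one past the end of the "after" context */
} segment_t;

/* Resolve the item range and its surrounding context.
 * Assumes ASCII input, so one byte equals one character. */
static segment_t
resolve_segment (const char *text, int text_length,
                 unsigned int item_offset, int item_length)
{
  segment_t seg;
  size_t total;

  assert (text != NULL);
  /* -1 as text_length means "text is NUL-terminated". */
  total = (text_length < 0) ? strlen (text) : (size_t) text_length;
  assert (item_offset <= total);

  seg.item_start = item_offset;
  /* -1 as item_length means "everything up to the end of the text". */
  seg.item_end = (item_length < 0) ? total
                                   : seg.item_start + (size_t) item_length;
  assert (seg.item_end <= total);

  /* Take up to CONTEXT_LENGTH units on each side, or whatever exists. */
  seg.pre_start = (seg.item_start > CONTEXT_LENGTH)
                    ? seg.item_start - CONTEXT_LENGTH : 0;
  seg.post_end = (total - seg.item_end > CONTEXT_LENGTH)
                   ? seg.item_end + CONTEXT_LENGTH : total;
  return seg;
}
```

For example, `resolve_segment ("The quick brown fox", -1, 4, 5)` selects the word "quick" as the item. Only four characters precede it, so the before-context starts at index 0, while the after-context takes the full five characters that follow.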