Mercurial > hgrepos > Python2 > PyMuPDF

diff mupdf-source/thirdparty/harfbuzz/docs/usermanual-buffers-language-script-and-direction.xml @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author: Franz Glasner <fzglas.hg@dom66.de>
date: Mon, 15 Sep 2025 11:43:07 +0200
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/mupdf-source/thirdparty/harfbuzz/docs/usermanual-buffers-language-script-and-direction.xml	Mon Sep 15 11:43:07 2025 +0200
@@ -0,0 +1,412 @@
+<?xml version="1.0"?>
+<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
+               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
+  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
+  <!ENTITY version SYSTEM "version.xml">
+]>
+<chapter id="buffers-language-script-and-direction">
+  <title>Buffers, language, script and direction</title>
+  <para>
+    The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
+    buffer. In this chapter, we'll look at how to set up a buffer with
+    the text that we want and how to customize the properties of the
+    buffer. We'll also look at a piece of lower-level machinery that
+    you will need to understand before proceeding: the functions that
+    HarfBuzz uses to retrieve Unicode information.
+  </para>
+  <para>
+    After shaping is complete, HarfBuzz puts its output back
+    into the buffer. But getting that output requires setting up a
+    face and a font first, so we will look at that in the next chapter
+    instead of here.
+  </para>
+  <section id="creating-and-destroying-buffers">
+    <title>Creating and destroying buffers</title>
+    <para>
+      As we saw in our <emphasis>Getting Started</emphasis> example, a
+      buffer is created and 
+      initialized with <function>hb_buffer_create()</function>. This
+      produces a new, empty buffer object, instantiated with some
+      default values and ready to accept your Unicode strings.
+    </para>
+    <para>
+      HarfBuzz manages the memory of objects (such as buffers) that it
+      creates, so you don't have to. When you have finished working on 
+      a buffer, you can call <function>hb_buffer_destroy()</function>:
+    </para>
+    <programlisting language="C">
+      hb_buffer_t *buf = hb_buffer_create();
+      ...
+      hb_buffer_destroy(buf);
+    </programlisting>
+    <para>
+      This will destroy the object and free its associated memory -
+      unless some other part of the program holds a reference to this
+      buffer. If you acquire a HarfBuzz buffer from another subsystem
+      and want to ensure that it is not garbage collected by someone
+      else destroying it, you should increase its reference count:
+    </para>
+    <programlisting language="C">
+      void somefunc(hb_buffer_t *buf) {
+      buf = hb_buffer_reference(buf);
+      ...
+    </programlisting>
+    <para>
+      And then decrease it once you're done with it:
+    </para>
+    <programlisting language="C">
+      hb_buffer_destroy(buf);
+      }
+    </programlisting>
+    <para>
+      While we are on the subject of reference-counting buffers, it is
+      worth noting that an individual buffer can only meaningfully be
+      used by one thread at a time.
+    </para>
+    <para>
+      To throw away all the data in your buffer and start from scratch,
+      call <function>hb_buffer_reset(buf)</function>. If you want to
+      throw away the string in the buffer but keep the options, you can
+      instead call <function>hb_buffer_clear_contents(buf)</function>.
+    </para>
+  </section>
+  
+  <section id="adding-text-to-the-buffer">
+    <title>Adding text to the buffer</title>
+    <para>
+      Now we have a brand new HarfBuzz buffer. Let's start filling it
+      with text! From HarfBuzz's perspective, a buffer is just a stream
+      of Unicode code points, but your input string is probably in one of
+      the standard Unicode character encodings (UTF-8, UTF-16, or
+      UTF-32). HarfBuzz provides convenience functions that accept
+      each of these encodings:
+      <function>hb_buffer_add_utf8()</function>,
+      <function>hb_buffer_add_utf16()</function>, and
+      <function>hb_buffer_add_utf32()</function>. Other than the
+      character encoding they accept, they function identically.
+    </para>
+    <para>
+      You can add UTF-8 text to a buffer by passing in the text array,
+      the array's length, an offset into the array for the first
+      character to add, and the length of the segment to add:
+    </para>
+    <programlisting language="C">
+    hb_buffer_add_utf8 (hb_buffer_t *buf,
+                    const char *text,
+                    int text_length,
+                    unsigned int item_offset,
+                    int item_length)
+    </programlisting>
+    <para>
+      So, in practice, you can say:
+    </para>
+    <programlisting language="C">
+      hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
+    </programlisting>
+    <para>
+      This will append your new characters to
+      <parameter>buf</parameter>, not replace its existing
+      contents. Also, note that you can use <literal>-1</literal> in
+      place of the first instance of <function>strlen(text)</function>
+      if your text array is NULL-terminated. Similarly, you can also use
+      <literal>-1</literal> as the final argument want to add its full
+      contents.
+    </para>
+    <para>
+      Whatever start <parameter>item_offset</parameter> and
+      <parameter>item_length</parameter> you provide, HarfBuzz will also
+      attempt to grab the five characters <emphasis>before</emphasis>
+      the offset point and the five characters
+      <emphasis>after</emphasis> the designated end. These are the
+      before and after "context" segments, which are used internally
+      for HarfBuzz to make shaping decisions. They will not be part of
+      the final output, but they ensure that HarfBuzz's
+      script-specific shaping operations are correct. If there are
+      fewer than five characters available for the before or after
+      contexts, HarfBuzz will just grab what is there.
+    </para>
+    <para>
+      For longer text runs, such as full paragraphs, it might be
+      tempting to only add smaller sub-segments to a buffer and
+      shape them in piecemeal fashion. Generally, this is not a good
+      idea, however, because a lot of shaping decisions are
+      dependent on this context information. For example, in Arabic
+      and other connected scripts, HarfBuzz needs to know the code
+      points before and after each character in order to correctly
+      determine which glyph to return.
+    </para>
+    <para>
+      The safest approach is to add all of the text available (even
+      if your text contains a mix of scripts, directions, languages
+      and fonts), then use <parameter>item_offset</parameter> and
+      <parameter>item_length</parameter> to indicate which characters you
+      want shaped (which must all have the same script, direction,
+      language and font), so that HarfBuzz has access to any context.
+    </para>
+    <para>
+      You can also add Unicode code points directly with
+      <function>hb_buffer_add_codepoints()</function>. The arguments
+      to this function are the same as those for the UTF
+      encodings. But it is particularly important to note that
+      HarfBuzz does not do validity checking on the text that is added
+      to a buffer. Invalid code points will be replaced, but it is up
+      to you to do any deep-sanity checking necessary.
+    </para>
+    
+  </section>
+  
+  <section id="setting-buffer-properties">
+    <title>Setting buffer properties</title>
+    <para>
+      Buffers containing input characters still need several
+      properties set before HarfBuzz can shape their text correctly.
+    </para>
+    <para>
+      Initially, all buffers are set to the
+      <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
+      type. After adding text, the buffer should be set to
+      <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
+      indicates that it contains un-shaped input
+      characters. After shaping, the buffer will have the
+      <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
+    </para>
+    <para>
+      <function>hb_buffer_add_utf8()</function> and the
+      other UTF functions set the content type of their buffer
+      automatically. But if you are reusing a buffer you may want to
+      check its state with
+      <function>hb_buffer_get_content_type(buffer)</function>. If
+      necessary you can set the content type with
+    </para>
+    <programlisting language="C">
+      hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
+    </programlisting>
+    <para>
+      to prepare for shaping.
+    </para>
+    <para>
+      Buffers also need to carry information about the script,
+      language, and text direction of their contents. You can set
+      these properties individually:
+    </para>
+    <programlisting language="C">
+      hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
+      hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
+      hb_buffer_set_language(buf, hb_language_from_string("en", -1));
+    </programlisting>
+    <para>
+      However, since these properties are often repeated for
+      multiple text runs, you can also save them in a
+      <literal>hb_segment_properties_t</literal> for reuse:
+    </para>
+    <programlisting language="C">
+      hb_segment_properties_t *savedprops;
+      hb_buffer_get_segment_properties (buf, savedprops);
+      ...
+      hb_buffer_set_segment_properties (buf2, savedprops);
+    </programlisting>
+    <para>
+      HarfBuzz also provides getter functions to retrieve a buffer's
+      direction, script, and language properties individually.
+    </para>
+    <para>
+      HarfBuzz recognizes four text directions in
+      <type>hb_direction_t</type>: left-to-right
+      (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
+      top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
+      bottom-to-top (<literal>HB_DIRECTION_BTT</literal>).  For the
+      script property, HarfBuzz uses identifiers based on the
+      <ulink
+      url="https://unicode.org/iso15924/">ISO 15924
+      standard</ulink>. For languages, HarfBuzz uses tags based on the
+      <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
+    </para>
+    <para>
+      Helper functions are provided to convert character strings into
+      the necessary script and language tag types.
+    </para>
+    <para>
+      Two additional buffer properties to be aware of are the
+      "invisible glyph" and the replacement code point. The
+      replacement code point is inserted into buffer output in place of
+      any invalid code points encountered in the input. By default, it
+      is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
+      point, <literal>U+FFFD</literal> "&#xFFFD;". You can change this with
+    </para>
+    <programlisting language="C">
+      hb_buffer_set_replacement_codepoint(buf, replacement);
+    </programlisting>
+    <para>
+      passing in the replacement Unicode code point as the
+      <parameter>replacement</parameter> parameter.
+    </para>
+    <para>
+      The invisible glyph is used to replace all output glyphs that
+      are invisible. By default, the standard space character
+      <literal>U+0020</literal> is used; you can replace this (for
+      example, when using a font that provides script-specific
+      spaces) with 
+    </para>
+    <programlisting language="C">
+      hb_buffer_set_invisible_glyph(buf, replacement_glyph);
+    </programlisting>
+    <para>
+      Do note that in the <parameter>replacement_glyph</parameter>
+      parameter, you must provide the glyph ID of the replacement you
+      wish to use, not the Unicode code point.
+    </para>
+    <para>
+      HarfBuzz supports a few additional flags you might want to set
+      on your buffer under certain circumstances. The
+      <literal>HB_BUFFER_FLAG_BOT</literal> and
+      <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
+      that the buffer represents the beginning or end (respectively)
+      of a text element (such as a paragraph or other block). Knowing
+      this allows HarfBuzz to apply certain contextual font features
+      when shaping, such as initial or final variants in connected
+      scripts.
+    </para>
+    <para>
+      <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
+      tells HarfBuzz not to hide glyphs with the
+      <literal>Default_Ignorable</literal> property in Unicode. This 
+      property designates control characters and other non-printing
+      code points, such as joiners and variation selectors. Normally
+      HarfBuzz replaces them in the output buffer with zero-width
+      space glyphs (using the "invisible glyph" property discussed
+      above); setting this flag causes them to be printed, which can
+      be helpful for troubleshooting.
+    </para>
+    <para>
+      Conversely, setting the
+      <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
+      tells HarfBuzz to remove <literal>Default_Ignorable</literal>
+      glyphs from the output buffer entirely. Finally, setting the
+      <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
+      flag tells HarfBuzz not to insert the dotted-circle glyph
+      (<literal>U+25CC</literal>, "&#x25CC;"), which is normally
+      inserted into buffer output when broken character sequences are
+      encountered (such as combining marks that are not attached to a
+      base character).
+    </para>
+  </section>
+  
+  <section id="customizing-unicode-functions">
+    <title>Customizing Unicode functions</title>
+    <para>
+      HarfBuzz requires some simple functions for accessing
+      information from the Unicode Character Database (such as the
+      <literal>General_Category</literal> (gc) and
+      <literal>Script</literal> (sc) properties) that is useful
+      for shaping, as well as some useful operations like composing and
+      decomposing code points.
+    </para>
+    <para>
+      HarfBuzz includes its own internal, lightweight set of Unicode
+      functions. At build time, it is also possible to compile support
+      for some other options, such as the Unicode functions provided
+      by GLib or the International Components for Unicode (ICU)
+      library. Generally, this option is only of interest for client
+      programs that have specific integration requirements or that do
+      a significant amount of customization.
+    </para>
+    <para>
+      If your program has access to other Unicode functions, however,
+      such as through a system library or application framework, you
+      might prefer to use those instead of the built-in
+      options. HarfBuzz supports this by implementing its Unicode
+      functions as a set of virtual methods that you can replace —
+      without otherwise affecting HarfBuzz's functionality.
+    </para>
+    <para>
+      The Unicode functions are specified in a structure called
+      <literal>unicode_funcs</literal> which is attached to each
+      buffer. But even though <literal>unicode_funcs</literal> is
+      associated with a <type>hb_buffer_t</type>, the functions
+      themselves are called by other HarfBuzz APIs that access
+      buffers, so it would be unwise for you to hook different
+      functions into different buffers.
+    </para>
+    <para>
+      In addition, you can mark your <literal>unicode_funcs</literal>
+      as immutable by calling
+      <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
+      This is especially useful if your code is a
+      library or framework that will have its own client programs. By
+      marking your Unicode function choices as immutable, you prevent
+      your own client programs from changing the
+      <literal>unicode_funcs</literal> configuration and introducing
+      inconsistencies and errors downstream.
+    </para>
+    <para>
+      You can retrieve the Unicode-functions configuration for
+      your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
+    </para>
+    <programlisting language="C">
+      hb_unicode_funcs_t *ufunctions;
+      ufunctions = hb_buffer_get_unicode_funcs(buf);
+    </programlisting>
+    <para>
+      The current version of <literal>unicode_funcs</literal> uses six functions:
+    </para>
+    <itemizedlist>
+      <listitem>
+	<para>
+	  <function>hb_unicode_combining_class_func_t</function>:
+	  returns the Canonical Combining Class of a code point.
+      	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  <function>hb_unicode_general_category_func_t</function>:
+	  returns the General Category (gc) of a code point.
+      	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  <function>hb_unicode_mirroring_func_t</function>: returns
+	  the Mirroring Glyph code point (for bi-directional
+	  replacement) of a code point.
+      	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  <function>hb_unicode_script_func_t</function>: returns the
+	  Script (sc) property of a code point.
+      	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  <function>hb_unicode_compose_func_t</function>: returns the
+	  canonical composition of a sequence of two code points.
+	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  <function>hb_unicode_decompose_func_t</function>: returns
+	  the canonical decomposition of a code point.
+	</para>
+      </listitem>
+    </itemizedlist>
+    <para>
+      Note, however, that future HarfBuzz releases may alter this set.
+    </para>
+    <para>
+      Each Unicode function has a corresponding setter, with which you
+      can assign a callback to your replacement function. For example,
+      to replace
+      <function>hb_unicode_general_category_func_t</function>, you can call
+    </para>
+    <programlisting language="C">
+      hb_unicode_funcs_set_general_category_func (*ufuncs, func, *user_data, destroy)	    
+    </programlisting>
+    <para>
+      Virtualizing this set of Unicode functions is primarily intended
+      to improve portability. There is no need for every client
+      program to make the effort to replace the default options, so if
+      you are unsure, do not feel any pressure to customize
+      <literal>unicode_funcs</literal>. 
+    </para>
+  </section>
+  
+</chapter>
author	Franz Glasner <fzglas.hg@dom66.de>
date	Mon, 15 Sep 2025 11:43:07 +0200
parents
children