diff mupdf-source/thirdparty/harfbuzz/docs/usermanual-clusters.xml @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| parents | |
| children | |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/mupdf-source/thirdparty/harfbuzz/docs/usermanual-clusters.xml Mon Sep 15 11:43:07 2025 +0200
@@ -0,0 +1,697 @@
+<?xml version="1.0"?>
+<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
+                  "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
+  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
+  <!ENTITY version SYSTEM "version.xml">
+]>
+<chapter id="clusters">
+  <title>Clusters</title>
+  <section id="clusters-and-shaping">
+    <title>Clusters and shaping</title>
+    <para>
+      In text shaping, a <emphasis>cluster</emphasis> is a sequence of
+      characters that needs to be treated as a single, indivisible
+      unit. A single letter or symbol can be a cluster of its
+      own. Other clusters correspond to longer subsequences of the
+      input code points — such as a ligature or conjunct form
+      — and require the shaper to ensure that the cluster is not
+      broken during the shaping process.
+    </para>
+    <para>
+      A cluster is distinct from a <emphasis>grapheme</emphasis>,
+      which is the smallest unit of meaning in a writing system or
+      script.
+    </para>
+    <para>
+      The definitions of the two terms are similar. However, clusters
+      are only relevant for script shaping and glyph layout. In
+      contrast, graphemes are a property of the underlying script, and
+      are of interest when client programs implement orthographic
+      or linguistic functionality.
+    </para>
+    <para>
+      For example, two individual letters are often two separate
+      graphemes. When two letters form a ligature, however, they
+      combine into a single glyph. They are then part of the same
+      cluster and are treated as a unit by the shaping engine —
+      even though the two original, underlying letters remain separate
+      graphemes.
+    </para>
+    <para>
+      HarfBuzz is concerned with clusters, <emphasis>not</emphasis>
+      with graphemes — although client programs using HarfBuzz
+      may still care about graphemes for other reasons from time to time.
+    </para>
+    <para>
+      During the shaping process, there are several shaping operations
+      that may merge adjacent characters (for example, when two code
+      points form a ligature or a conjunct form and are replaced by a
+      single glyph) or split one character into several (for example,
+      when decomposing a code point through the
+      <literal>ccmp</literal> feature). Operations like these alter
+      clusters; HarfBuzz tracks the changes to ensure that no clusters
+      get lost or broken during shaping.
+    </para>
+    <para>
+      HarfBuzz records cluster information independently from how
+      shaping operations affect the individual glyphs returned in an
+      output buffer. Consequently, a client program using HarfBuzz can
+      utilize the cluster information to implement features such as:
+    </para>
+    <itemizedlist>
+      <listitem>
+        <para>
+          Correctly positioning the cursor within a shaped text run,
+          even when characters have formed ligatures, composed or
+          decomposed, reordered, or undergone other shaping operations.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          Correctly highlighting a text selection that includes some,
+          but not all, of the characters in a word.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          Applying text attributes (such as color or underlining) to
+          part, but not all, of a word.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          Generating output document formats (such as PDF) with
+          embedded text that can be fully extracted.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          Determining the mapping between input characters and output
+          glyphs, such as which glyphs are ligatures.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          Performing line-breaking, justification, and other
+          line-level or paragraph-level operations that must be done
+          after shaping is complete, but which require examining
+          character-level properties.
+        </para>
+      </listitem>
+    </itemizedlist>
+  </section>
+  <section id="working-with-harfbuzz-clusters">
+    <title>Working with HarfBuzz clusters</title>
+    <para>
+      When you add text to a HarfBuzz buffer, each code point must be
+      assigned a <emphasis>cluster value</emphasis>.
+    </para>
+    <para>
+      This cluster value is an arbitrary number; HarfBuzz uses it only
+      to distinguish between clusters. Many client programs will use
+      the index of each code point in the input text stream as the
+      cluster value. This is for the sake of convenience; the actual
+      value does not matter.
+    </para>
+    <para>
+      Some of the shaping operations performed by HarfBuzz —
+      such as reordering, composition, decomposition, and substitution
+      — may alter the cluster values of some characters. The
+      final cluster values in the buffer at the end of the shaping
+      process will indicate to client programs which subsequences of
+      glyphs represent a cluster and, therefore, must not be
+      separated.
+    </para>
+    <para>
+      In addition, client programs can query the final cluster values
+      to discern other potentially important information about the
+      glyphs in the output buffer (such as whether or not a ligature
+      was formed).
+    </para>
+    <para>
+      For example, if the initial sequence of cluster values was:
+    </para>
+    <programlisting>
+      0,1,2,3,4
+    </programlisting>
+    <para>
+      and the final sequence of cluster values is:
+    </para>
+    <programlisting>
+      0,0,3,3
+    </programlisting>
+    <para>
+      then there are two clusters in the output buffer: the first
+      cluster includes the first two glyphs, and the second cluster
+      includes the third and fourth glyphs. It is also evident that a
+      ligature or conjunct has been formed, because there are fewer
+      glyphs in the output buffer (four) than there were code points
+      in the input buffer (five).
+    </para>
+    <para>
+      Although client programs using HarfBuzz are free to assign
+      initial cluster values in any manner they choose to, HarfBuzz
+      does offer some useful guarantees if the cluster values are
+      assigned in a monotonic (either non-decreasing or non-increasing)
+      order.
+    </para>
+    <para>
+      For buffers in the left-to-right (LTR)
+      or top-to-bottom (TTB) text flow direction,
+      HarfBuzz will preserve the monotonic property: client programs
+      are guaranteed that monotonically increasing initial cluster
+      values will be returned as monotonically increasing final
+      cluster values.
+    </para>
+    <para>
+      For buffers in the right-to-left (RTL)
+      or bottom-to-top (BTT) text flow direction,
+      the directionality of the buffer itself is reversed for final
+      output as a matter of design. Therefore, HarfBuzz inverts the
+      monotonic property: client programs are guaranteed that
+      monotonically increasing initial cluster values will be
+      returned as monotonically <emphasis>decreasing</emphasis> final
+      cluster values.
+    </para>
+    <para>
+      Client programs can adjust how HarfBuzz handles clusters during
+      shaping by setting the
+      <literal>cluster_level</literal> of the
+      buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
+      clustering support for this property:
+    </para>
+    <itemizedlist>
+      <listitem>
+        <para><emphasis>Level 0</emphasis> is the default and
+          reproduces the behavior of the old HarfBuzz library.
+        </para>
+        <para>
+          The distinguishing feature of level 0 behavior is that, at
+          the beginning of processing the buffer, all code points that
+          are categorized as <emphasis>marks</emphasis>,
+          <emphasis>modifier symbols</emphasis>, or
+          <emphasis>Emoji extended pictographic</emphasis> modifiers,
+          as well as the <emphasis>Zero Width Joiner</emphasis> and
+          <emphasis>Zero Width Non-Joiner</emphasis> code points, are
+          assigned the cluster value of the closest preceding code
+          point from a <emphasis>different</emphasis> category.
+        </para>
+        <para>
+          In essence, whenever a base character is followed by a mark
+          character or a sequence of mark characters, those marks are
+          reassigned to the same initial cluster value as the base
+          character. This reassignment is referred to as
+          "merging" the affected clusters. This behavior is based on
+          the Grapheme Cluster Boundary specification in <ulink
+          url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode
+          Technical Report 29</ulink>.
+        </para>
+        <para>
+          Client programs can specify level 0 behavior for a buffer by
+          setting its <literal>cluster_level</literal> to
+          <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          <emphasis>Level 1</emphasis> tweaks the old behavior
+          slightly to produce better results. Therefore, level 1
+          clustering is recommended for code that is not required to
+          implement backward compatibility with the old HarfBuzz.
+        </para>
+        <para>
+          Level 1 differs from level 0 by not merging the
+          clusters of marks and other modifier code points with the
+          preceding "base" code point's cluster. By preserving the
+          separate cluster values of these marks and modifier code
+          points, script shapers can perform additional operations
+          that might lead to improved results (for example, reordering
+          a sequence of marks).
+        </para>
+        <para>
+          Client programs can specify level 1 behavior for a buffer by
+          setting its <literal>cluster_level</literal> to
+          <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          <emphasis>Level 2</emphasis> differs significantly in how it
+          treats cluster values. In level 2, HarfBuzz never merges
+          clusters.
+        </para>
+        <para>
+          This difference can be seen most clearly when HarfBuzz processes
+          ligature substitutions and glyph decompositions. In level 0
+          and level 1, ligatures and glyph decomposition both involve
+          merging clusters; in level 2, neither of these operations
+          triggers a merge.
+        </para>
+        <para>
+          Client programs can specify level 2 behavior for a buffer by
+          setting its <literal>cluster_level</literal> to
+          <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>.
+        </para>
+      </listitem>
+    </itemizedlist>
+    <para>
+      As mentioned earlier, client programs using HarfBuzz often
+      assign initial cluster values in a buffer by reusing the indices
+      of the code points in the input text. This gives a sequence of
+      cluster values that is monotonically increasing (for example,
+      0,1,2,3,4).
+    </para>
+    <para>
+      It is not <emphasis>required</emphasis> that the cluster values
+      in a buffer be monotonically increasing. However, if the initial
+      cluster values in a buffer are monotonic and the buffer is
+      configured to use cluster level 0 or 1, then HarfBuzz
+      guarantees that the final cluster values in the shaped buffer
+      will also be monotonic. No such guarantee is made for cluster
+      level 2.
+    </para>
+    <para>
+      In levels 0 and 1, HarfBuzz implements the following conceptual
+      model for cluster values:
+    </para>
+    <itemizedlist spacing="compact">
+      <listitem>
+        <para>
+          If the sequence of input cluster values is monotonic, the
+          sequence of cluster values will remain monotonic.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          Each cluster value represents a single cluster.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          Each cluster contains one or more glyphs and one or more
+          characters.
+        </para>
+      </listitem>
+    </itemizedlist>
+    <para>
+      In practice, this model offers several benefits. Assuming that
+      the initial cluster values were monotonically increasing
+      and distinct before shaping began, then, in the final output:
+    </para>
+    <itemizedlist spacing="compact">
+      <listitem>
+        <para>
+          All adjacent glyphs having the same final cluster
+          value belong to the same cluster.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          Each character belongs to the cluster that has the highest
+          cluster value <emphasis>not larger than</emphasis> its
+          initial cluster value.
+        </para>
+      </listitem>
+    </itemizedlist>
+  </section>
+
+  <section id="a-clustering-example-for-levels-0-and-1">
+    <title>A clustering example for levels 0 and 1</title>
+    <para>
+      The basic shaping operations affect clusters in a predictable
+      manner when using level 0 or level 1:
+    </para>
+    <itemizedlist>
+      <listitem>
+        <para>
+          When two or more clusters <emphasis>merge</emphasis>, the
+          resulting merged cluster takes as its cluster value the
+          <emphasis>minimum</emphasis> of the incoming cluster values.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          When a cluster <emphasis>decomposes</emphasis>, all of the
+          resulting child clusters inherit as their cluster value the
+          cluster value of the parent cluster.
+        </para>
+      </listitem>
+      <listitem>
+        <para>
+          When a character is <emphasis>reordered</emphasis>, the
+          reordered character and all clusters that the character
+          moves past as part of the reordering are merged into one cluster.
+        </para>
+      </listitem>
+    </itemizedlist>
+    <para>
+      The functionality, guarantees, and benefits of level 0 and level
+      1 behavior can be seen with some examples. First, let us examine
+      what happens with cluster values when shaping involves cluster
+      merging with ligatures and decomposition.
+    </para>
+
+    <para>
+      Let's say we start with the following character sequence (top row) and
+      initial cluster values (bottom row):
+    </para>
+    <programlisting>
+      A,B,C,D,E
+      0,1,2,3,4
+    </programlisting>
+    <para>
+      During shaping, HarfBuzz maps these characters to glyphs from
+      the font. For simplicity, let us assume that each character maps
+      to the corresponding, identical-looking glyph:
+    </para>
+    <programlisting>
+      A,B,C,D,E
+      0,1,2,3,4
+    </programlisting>
+    <para>
+      Now if, for example, <literal>B</literal> and <literal>C</literal>
+      form a ligature, then the clusters to which they belong
+      "merge". This merged cluster takes for its cluster
+      value the minimum of all the cluster values of the clusters that
+      went into the ligature. In this case, we get:
+    </para>
+    <programlisting>
+      A,BC,D,E
+      0,1 ,3,4
+    </programlisting>
+    <para>
+      because 1 is the minimum of the set {1,2}, which were the
+      cluster values of <literal>B</literal> and
+      <literal>C</literal>.
+    </para>
+    <para>
+      Next, let us say that the <literal>BC</literal> ligature glyph
+      decomposes into three components, and <literal>D</literal> also
+      decomposes into two components. Whenever a cluster decomposes,
+      its components each inherit the cluster value of their parent:
+    </para>
+    <programlisting>
+      A,BC0,BC1,BC2,D0,D1,E
+      0,1  ,1  ,1  ,3 ,3 ,4
+    </programlisting>
+    <para>
+      Next, if <literal>BC2</literal> and <literal>D0</literal> form a
+      ligature, then their clusters (cluster values 1 and 3) merge into
+      <literal>min(1,3) = 1</literal>:
+    </para>
+    <programlisting>
+      A,BC0,BC1,BC2D0,D1,E
+      0,1  ,1  ,1    ,1 ,4
+    </programlisting>
+    <para>
+      Note that the entirety of cluster 3 merges into cluster 1, not
+      just the <literal>D0</literal> glyph. This reflects the fact
+      that the cluster <emphasis>must</emphasis> be treated as an
+      indivisible unit.
+    </para>
+    <para>
+      At this point, cluster 1 means: the character sequence
+      <literal>BCD</literal> is represented by glyphs
+      <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
+      further.
+    </para>
+  </section>
+  <section id="reordering-in-levels-0-and-1">
+    <title>Reordering in levels 0 and 1</title>
+    <para>
+      Another common operation in some shapers is glyph
+      reordering. In order to maintain a monotonic cluster sequence
+      when glyph reordering takes place, HarfBuzz merges the clusters
+      of everything in the reordering sequence.
+    </para>
+    <para>
+      For example, let us again start with the character sequence (top
+      row) and initial cluster values (bottom row):
+    </para>
+    <programlisting>
+      A,B,C,D,E
+      0,1,2,3,4
+    </programlisting>
+    <para>
+      If <literal>D</literal> is reordered to the position immediately
+      before <literal>B</literal>, then HarfBuzz merges the
+      <literal>B</literal>, <literal>C</literal>, and
+      <literal>D</literal> clusters — all the clusters between
+      the final position of the reordered glyph and its original
+      position. This means that we get:
+    </para>
+    <programlisting>
+      A,D,B,C,E
+      0,1,1,1,4
+    </programlisting>
+    <para>
+      as the final cluster sequence.
+    </para>
+    <para>
+      Merging this many clusters is not ideal, but it is the only
+      sensible way for HarfBuzz to maintain the guarantee that the
+      sequence of cluster values remains monotonic and to retain the
+      true relationship between glyphs and characters.
+    </para>
+  </section>
+  <section id="the-distinction-between-levels-0-and-1">
+    <title>The distinction between levels 0 and 1</title>
+    <para>
+      The preceding examples demonstrate the main effects of using
+      cluster levels 0 and 1. The only difference between the two
+      levels is this: in level 0, at the very beginning of the shaping
+      process, HarfBuzz merges the cluster of each base character
+      with the clusters of all Unicode marks (combining or not) and
+      modifiers that follow it.
+    </para>
+    <para>
+      For example, let us start with the following character sequence
+      (top row) and accompanying initial cluster values (bottom row):
+    </para>
+    <programlisting>
+      A,acute,B
+      0,1    ,2
+    </programlisting>
+    <para>
+      The <literal>acute</literal> is a Unicode mark. If HarfBuzz is
+      using cluster level 0 on this sequence, then the
+      <literal>A</literal> and <literal>acute</literal> clusters will
+      merge, and the result will become:
+    </para>
+    <programlisting>
+      A,acute,B
+      0,0    ,2
+    </programlisting>
+    <para>
+      This merger is performed before any other script-shaping
+      steps.
+    </para>
+    <para>
+      This initial cluster merging is the default behavior of the
+      Windows shaping engine, and the old HarfBuzz codebase copied
+      that behavior to maintain compatibility. Consequently, it has
+      remained the default behavior in the new HarfBuzz codebase.
+    </para>
+    <para>
+      But this initial cluster-merging behavior makes it impossible
+      for client programs to implement some features (such as to
+      color diacritic marks differently from their base
+      characters). That is why, in level 1, HarfBuzz does not perform
+      the initial merging step.
+    </para>
+    <para>
+      For client programs that rely on HarfBuzz cluster values to
+      perform cursor positioning, level 0 is more convenient. But
+      relying on cluster boundaries for cursor positioning is wrong: cursor
+      positions should be determined based on Unicode grapheme
+      boundaries, not on shaping-cluster boundaries. As such, using
+      level 1 clustering behavior is recommended.
+    </para>
+    <para>
+      One final facet of levels 0 and 1 is worth noting. HarfBuzz
+      currently does not allow any
+      <emphasis>multiple-substitution</emphasis> GSUB lookups to
+      replace a glyph with zero glyphs (in other words, to delete a
+      glyph).
+    </para>
+    <para>
+      But, in some other situations, glyphs can be deleted. In
+      those cases, if the glyph being deleted is the last glyph of its
+      cluster, HarfBuzz makes sure to merge the deleted glyph's
+      cluster with a neighboring cluster.
+    </para>
+    <para>
+      This is done primarily to make sure that the starting cluster of the
+      text always has the cluster index pointing to the start of the text
+      for the run; more than one client program currently relies on this
+      guarantee.
+    </para>
+    <para>
+      Incidentally, Apple's CoreText does something different to
+      maintain the same promise: it inserts a glyph with id 65535 at
+      the beginning of the glyph string if the glyph corresponding to
+      the first character in the run was deleted. HarfBuzz might do
+      something similar in the future.
+    </para>
+  </section>
+  <section id="level-2">
+    <title>Level 2</title>
+    <para>
+      HarfBuzz's level 2 cluster behavior uses a significantly
+      different model than that of level 0 and level 1.
+    </para>
+    <para>
+      The level 2 behavior is easy to describe, but it may be
+      difficult to understand in practical terms. In brief, level 2
+      performs no merging of clusters whatsoever.
+    </para>
+    <para>
+      This means that there is no initial base-and-mark merging step
+      (as is done in level 0), and it means that reordering moves and
+      ligature substitutions do not trigger a cluster merge.
+    </para>
+    <para>
+      Only one shaping operation directly affects clusters when using
+      level 2:
+    </para>
+    <itemizedlist>
+      <listitem>
+        <para>
+          When a cluster <emphasis>decomposes</emphasis>, all of the
+          resulting child clusters inherit as their cluster value the
+          cluster value of the parent cluster.
+        </para>
+      </listitem>
+    </itemizedlist>
+    <para>
+      When glyphs do form a ligature (or when some other feature
+      substitutes multiple glyphs with one glyph) the cluster value
+      of the first glyph is retained as the cluster value for the
+      resulting ligature.
+    </para>
+    <para>
+      This occurrence sounds similar to a cluster merge, but it is
+      different. In particular, no subsequent characters —
+      including marks and modifiers — are affected. They retain
+      their previous cluster values.
+    </para>
+    <para>
+      Level 2 cluster behavior is ultimately less complex than level 0
+      or level 1, but there are several cases for which processing
+      cluster values produced at level 2 may be tricky.
+    </para>
+    <section id="ligatures-with-combining-marks-in-level-2">
+      <title>Ligatures with combining marks in level 2</title>
+      <para>
+        The first example of how HarfBuzz's level 2 cluster behavior
+        can be tricky is when the text to be shaped includes combining
+        marks attached to ligatures.
+      </para>
+      <para>
+        Let us start with an input sequence with the following
+        characters (top row) and initial cluster values (bottom row):
+      </para>
+      <programlisting>
+        A,acute,B,breve,C,circumflex
+        0,1    ,2,3    ,4,5
+      </programlisting>
+      <para>
+        If the sequence <literal>A,B,C</literal> forms a ligature,
+        then these are the cluster values HarfBuzz will return under
+        the various cluster levels:
+      </para>
+      <para>
+        Level 0:
+      </para>
+      <programlisting>
+        ABC,acute,breve,circumflex
+        0  ,0    ,0    ,0
+      </programlisting>
+      <para>
+        Level 1:
+      </para>
+      <programlisting>
+        ABC,acute,breve,circumflex
+        0  ,0    ,0    ,5
+      </programlisting>
+      <para>
+        Level 2:
+      </para>
+      <programlisting>
+        ABC,acute,breve,circumflex
+        0  ,1    ,3    ,5
+      </programlisting>
+      <para>
+        Making sense of the level 2 result is the hardest for a client
+        program, because there is nothing in the cluster values that
+        indicates that <literal>B</literal> and <literal>C</literal>
+        formed a ligature with <literal>A</literal>.
+      </para>
+      <para>
+        In contrast, the "merged" cluster values of the mark glyphs
+        that are seen in the level 0 and level 1 output are evidence
+        that a ligature substitution took place.
+      </para>
+    </section>
+    <section id="reordering-in-level-2">
+      <title>Reordering in level 2</title>
+      <para>
+        Another example of how HarfBuzz's level 2 cluster behavior
+        can be tricky is when glyphs reorder. Consider an input sequence
+        with the following characters (top row) and initial cluster
+        values (bottom row):
+      </para>
+      <programlisting>
+        A,B,C,D,E
+        0,1,2,3,4
+      </programlisting>
+      <para>
+        Now imagine <literal>D</literal> moves before
+        <literal>B</literal> in a reordering operation. The cluster
+        values will then be:
+      </para>
+      <programlisting>
+        A,D,B,C,E
+        0,3,1,2,4
+      </programlisting>
+      <para>
+        Next, if <literal>D</literal> forms a ligature with
+        <literal>B</literal>, the output is:
+      </para>
+      <programlisting>
+        A,DB,C,E
+        0,3 ,2,4
+      </programlisting>
+      <para>
+        However, in a different scenario, in which the shaping rules
+        of the script instead caused <literal>A</literal> and
+        <literal>B</literal> to form a ligature
+        <emphasis>before</emphasis> the <literal>D</literal> reordered, the
+        result would be:
+      </para>
+      <programlisting>
+        AB,D,C,E
+        0 ,3,2,4
+      </programlisting>
+      <para>
+        There is no way for a client program to differentiate between
+        these two scenarios based on the cluster values
+        alone. Consequently, client programs that use level 2 might
+        need to undertake additional work in order to manage cursor
+        positioning, text attributes, or other desired features.
+      </para>
+    </section>
+    <section id="other-considerations-in-level-2">
+      <title>Other considerations in level 2</title>
+      <para>
+        There may be other problems encountered with ligatures under
+        level 2, such as if the direction of the text is forced to
+        the opposite of its natural direction (for example, Arabic text
+        that is forced into left-to-right directionality). But,
+        generally speaking, these other scenarios are minor corner
+        cases that are too obscure for most client programs to need to
+        worry about.
+      </para>
+    </section>
+  </section>
+</chapter>
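
The merge, decompose, and reorder rules that the added chapter gives for cluster levels 0 and 1 can be checked with a small toy model. The sketch below is plain Python and does not call HarfBuzz; the function names (`level0_initial_clusters`, `merge_clusters`, `ligate`, `decompose`, `reorder`) and the `(glyph name, cluster value)` tuple representation are hypothetical, chosen only for illustration. It also assumes that "mark" means Unicode general category M*, which is narrower than the set of code points the chapter lists for level 0.

```python
import unicodedata
from typing import List, Tuple

Glyph = Tuple[str, int]  # (glyph name, cluster value); toy representation


def level0_initial_clusters(text: str) -> List[int]:
    """Index-based cluster values, with each mark folded into the
    preceding code point's cluster (simplified level 0 pre-merge)."""
    clusters: List[int] = []
    for index, char in enumerate(text):
        is_mark = unicodedata.category(char).startswith("M")
        clusters.append(clusters[-1] if is_mark and clusters else index)
    return clusters


def merge_clusters(buf: List[Glyph], start: int, end: int) -> List[Glyph]:
    """Merge the clusters touched by buf[start:end] into one cluster
    that takes the minimum of the incoming cluster values."""
    if start >= end:
        return list(buf)
    # Widen the range so that whole clusters are merged, never parts.
    while start > 0 and buf[start - 1][1] == buf[start][1]:
        start -= 1
    while end < len(buf) and buf[end][1] == buf[end - 1][1]:
        end += 1
    low = min(cluster for _, cluster in buf[start:end])
    merged = [(name, low) for name, _ in buf[start:end]]
    return buf[:start] + merged + buf[end:]


def ligate(buf: List[Glyph], start: int, end: int, name: str) -> List[Glyph]:
    """Replace buf[start:end] with one ligature glyph after merging."""
    merged = merge_clusters(buf, start, end)
    return merged[:start] + [(name, merged[start][1])] + merged[end:]


def decompose(buf: List[Glyph], index: int, parts: List[str]) -> List[Glyph]:
    """Split one glyph; every part inherits the parent's cluster value."""
    cluster = buf[index][1]
    return buf[:index] + [(part, cluster) for part in parts] + buf[index + 1:]


def reorder(buf: List[Glyph], src: int, dst: int) -> List[Glyph]:
    """Move the glyph at src to dst, merging every cluster it passes."""
    low, high = min(src, dst), max(src, dst)
    merged = merge_clusters(buf, low, high + 1)
    glyph = merged.pop(src)
    merged.insert(dst, glyph)
    return merged


if __name__ == "__main__":
    buf = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
    buf = ligate(buf, 1, 3, "BC")
    buf = decompose(buf, 1, ["BC0", "BC1", "BC2"])
    buf = decompose(buf, 4, ["D0", "D1"])
    buf = ligate(buf, 3, 5, "BC2D0")
    print(buf)
```

Run as a script, the demo replays the ligature and decomposition walk-through from the "A clustering example for levels 0 and 1" section. Widening the range inside `merge_clusters` is what makes the whole of cluster 3 (including `D1`) collapse into cluster 1, rather than only the glyph that took part in the ligature.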
