comparison mupdf-source/thirdparty/harfbuzz/docs/usermanual-clusters.xml @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
comparison
| 1:1d09e1dec1d9 | 2:b50eed0cc0ef |
|---|---|
| 1 <?xml version="1.0"?> | |
| 2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" | |
| 3 "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [ | |
| 4 <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'"> | |
| 5 <!ENTITY version SYSTEM "version.xml"> | |
| 6 ]> | |
| 7 <chapter id="clusters"> | |
| 8 <title>Clusters</title> | |
| 9 <section id="clusters-and-shaping"> | |
| 10 <title>Clusters and shaping</title> | |
| 11 <para> | |
| 12 In text shaping, a <emphasis>cluster</emphasis> is a sequence of | |
| 13 characters that needs to be treated as a single, indivisible | |
| 14 unit. A single letter or symbol can be a cluster of its | |
| 15 own. Other clusters correspond to longer subsequences of the | |
| 16 input code points — such as a ligature or conjunct form | |
| 17 — and require the shaper to ensure that the cluster is not | |
| 18 broken during the shaping process. | |
| 19 </para> | |
| 20 <para> | |
| 21 A cluster is distinct from a <emphasis>grapheme</emphasis>, | |
| 22 which is the smallest unit of meaning in a writing system or | |
| 23 script. | |
| 24 </para> | |
| 25 <para> | |
| 26 The definitions of the two terms are similar. However, clusters | |
| 27 are only relevant for script shaping and glyph layout. In | |
| 28 contrast, graphemes are a property of the underlying script, and | |
| 29 are of interest when client programs implement orthographic | |
| 30 or linguistic functionality. | |
| 31 </para> | |
| 32 <para> | |
| 33 For example, two individual letters are often two separate | |
| 34 graphemes. When two letters form a ligature, however, they | |
| 35 combine into a single glyph. They are then part of the same | |
| 36 cluster and are treated as a unit by the shaping engine — | |
| 37 even though the two original, underlying letters remain separate | |
| 38 graphemes. | |
| 39 </para> | |
| 40 <para> | |
| 41 HarfBuzz is concerned with clusters, <emphasis>not</emphasis> | |
| 42 with graphemes — although client programs using HarfBuzz | |
| 43 may still care about graphemes for other reasons from time to time. | |
| 44 </para> | |
| 45 <para> | |
| 46 During the shaping process, there are several shaping operations | |
| 47 that may merge adjacent characters (for example, when two code | |
| 48 points form a ligature or a conjunct form and are replaced by a | |
| 49 single glyph) or split one character into several (for example, | |
| 50 when decomposing a code point through the | |
| 51 <literal>ccmp</literal> feature). Operations like these alter | |
| 52 clusters; HarfBuzz tracks the changes to ensure that no clusters | |
| 53 get lost or broken during shaping. | |
| 54 </para> | |
| 55 <para> | |
| 56 HarfBuzz records cluster information independently from how | |
| 57 shaping operations affect the individual glyphs returned in an | |
| 58 output buffer. Consequently, a client program using HarfBuzz can | |
| 59 utilize the cluster information to implement features such as: | |
| 60 </para> | |
| 61 <itemizedlist> | |
| 62 <listitem> | |
| 63 <para> | |
| 64 Correctly positioning the cursor within a shaped text run, | |
| 65 even when characters have formed ligatures, composed or | |
| 66 decomposed, reordered, or undergone other shaping operations. | |
| 67 </para> | |
| 68 </listitem> | |
| 69 <listitem> | |
| 70 <para> | |
| 71 Correctly highlighting a text selection that includes some, | |
| 72 but not all, of the characters in a word. | |
| 73 </para> | |
| 74 </listitem> | |
| 75 <listitem> | |
| 76 <para> | |
| 77 Applying text attributes (such as color or underlining) to | |
| 78 part, but not all, of a word. | |
| 79 </para> | |
| 80 </listitem> | |
| 81 <listitem> | |
| 82 <para> | |
| 83 Generating output document formats (such as PDF) with | |
| 84 embedded text that can be fully extracted. | |
| 85 </para> | |
| 86 </listitem> | |
| 87 <listitem> | |
| 88 <para> | |
| 89 Determining the mapping between input characters and output | |
| 90 glyphs, such as which glyphs are ligatures. | |
| 91 </para> | |
| 92 </listitem> | |
| 93 <listitem> | |
| 94 <para> | |
| 95 Performing line-breaking, justification, and other | |
| 96 line-level or paragraph-level operations that must be done | |
| 97 after shaping is complete, but which require examining | |
| 98 character-level properties. | |
| 99 </para> | |
| 100 </listitem> | |
| 101 </itemizedlist> | |
| 102 </section> | |
| 103 <section id="working-with-harfbuzz-clusters"> | |
| 104 <title>Working with HarfBuzz clusters</title> | |
| 105 <para> | |
| 106 When you add text to a HarfBuzz buffer, each code point must be | |
| 107 assigned a <emphasis>cluster value</emphasis>. | |
| 108 </para> | |
| 109 <para> | |
| 110 This cluster value is an arbitrary number; HarfBuzz uses it only | |
| 111 to distinguish between clusters. Many client programs will use | |
| 112 the index of each code point in the input text stream as the | |
| 113 cluster value. This is for the sake of convenience; the actual | |
| 114 value does not matter. | |
| 115 </para> | |
| 116 <para> | |
| 117 Some of the shaping operations performed by HarfBuzz — | |
| 118 such as reordering, composition, decomposition, and substitution | |
| 119 — may alter the cluster values of some characters. The | |
| 120 final cluster values in the buffer at the end of the shaping | |
| 121 process will indicate to client programs which subsequences of | |
| 122 glyphs represent a cluster and, therefore, must not be | |
| 123 separated. | |
| 124 </para> | |
| 125 <para> | |
| 126 In addition, client programs can query the final cluster values | |
| 127 to discern other potentially important information about the | |
| 128 glyphs in the output buffer (such as whether or not a ligature | |
| 129 was formed). | |
| 130 </para> | |
| 131 <para> | |
| 132 For example, if the initial sequence of cluster values was: | |
| 133 </para> | |
| 134 <programlisting> | |
| 135 0,1,2,3,4 | |
| 136 </programlisting> | |
| 137 <para> | |
| 138 and the final sequence of cluster values is: | |
| 139 </para> | |
| 140 <programlisting> | |
| 141 0,0,3,3 | |
| 142 </programlisting> | |
| 143 <para> | |
| 144 then there are two clusters in the output buffer: the first | |
| 145 cluster includes the first two glyphs, and the second cluster | |
| 146 includes the third and fourth glyphs. It is also evident that a | |
| 147 ligature or conjunct has been formed, because there are fewer | |
| 148 glyphs in the output buffer (four) than there were code points | |
| 149 in the input buffer (five). | |
| 150 </para> | |
| 151 <para> | |
| 152 Although client programs using HarfBuzz are free to assign | |
| 153 initial cluster values in any manner they choose, HarfBuzz | |
| 154 does offer some useful guarantees if the cluster values are | |
| 155 assigned in a monotonic (either non-decreasing or non-increasing) | |
| 156 order. | |
| 157 </para> | |
| 158 <para> | |
| 159 For buffers in the left-to-right (LTR) | |
| 160 or top-to-bottom (TTB) text flow direction, | |
| 161 HarfBuzz will preserve the monotonic property: client programs | |
| 162 are guaranteed that monotonically increasing initial cluster | |
| 163 values will be returned as monotonically increasing final | |
| 164 cluster values. | |
| 165 </para> | |
| 166 <para> | |
| 167 For buffers in the right-to-left (RTL) | |
| 168 or bottom-to-top (BTT) text flow direction, | |
| 169 the directionality of the buffer itself is reversed for final | |
| 170 output as a matter of design. Therefore, HarfBuzz inverts the | |
| 171 monotonic property: client programs are guaranteed that | |
| 172 monotonically increasing initial cluster values will be | |
| 173 returned as monotonically <emphasis>decreasing</emphasis> final | |
| 174 cluster values. | |
| 175 </para> | |
| 176 <para> | |
| 177 Client programs can adjust how HarfBuzz handles clusters during | |
| 178 shaping by setting the | |
| 179 <literal>cluster_level</literal> of the | |
| 180 buffer. HarfBuzz offers three <emphasis>levels</emphasis> of | |
| 181 clustering support for this property: | |
| 182 </para> | |
| 183 <itemizedlist> | |
| 184 <listitem> | |
| 185 <para><emphasis>Level 0</emphasis> is the default and | |
| 186 reproduces the behavior of the old HarfBuzz library. | |
| 187 </para> | |
| 188 <para> | |
| 189 The distinguishing feature of level 0 behavior is that, at | |
| 190 the beginning of processing the buffer, all code points that | |
| 191 are categorized as <emphasis>marks</emphasis>, | |
| 192 <emphasis>modifier symbols</emphasis>, or | |
| 193 <emphasis>Emoji extended pictographic</emphasis> modifiers, | |
| 194 as well as the <emphasis>Zero Width Joiner</emphasis> and | |
| 195 <emphasis>Zero Width Non-Joiner</emphasis> code points, are | |
| 196 assigned the cluster value of the closest preceding code | |
| 197 point from a <emphasis>different</emphasis> category. | |
| 198 </para> | |
| 199 <para> | |
| 200 In essence, whenever a base character is followed by a mark | |
| 201 character or a sequence of mark characters, those marks are | |
| 202 reassigned to the same initial cluster value as the base | |
| 203 character. This reassignment is referred to as | |
| 204 "merging" the affected clusters. This behavior is based on | |
| 205 the Grapheme Cluster Boundary specification in <ulink | |
| 206 url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode | |
| 207 Technical Report 29</ulink>. | |
| 208 </para> | |
| 209 <para> | |
| 210 Client programs can specify level 0 behavior for a buffer by | |
| 211 setting its <literal>cluster_level</literal> to | |
| 212 <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>. | |
| 213 </para> | |
| 214 </listitem> | |
| 215 <listitem> | |
| 216 <para> | |
| 217 <emphasis>Level 1</emphasis> tweaks the old behavior | |
| 218 slightly to produce better results. Therefore, level 1 | |
| 219 clustering is recommended for code that is not required to | |
| 220 implement backward compatibility with the old HarfBuzz. | |
| 221 </para> | |
| 222 <para> | |
| 223 Level 1 differs from level 0 by not merging the | |
| 224 clusters of marks and other modifier code points with the | |
| 225 preceding "base" code point's cluster. By preserving the | |
| 226 separate cluster values of these marks and modifier code | |
| 227 points, script shapers can perform additional operations | |
| 228 that might lead to improved results (for example, reordering | |
| 229 a sequence of marks). | |
| 230 </para> | |
| 231 <para> | |
| 232 Client programs can specify level 1 behavior for a buffer by | |
| 233 setting its <literal>cluster_level</literal> to | |
| 234 <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>. | |
| 235 </para> | |
| 236 </listitem> | |
| 237 <listitem> | |
| 238 <para> | |
| 239 <emphasis>Level 2</emphasis> differs significantly in how it | |
| 240 treats cluster values. In level 2, HarfBuzz never merges | |
| 241 clusters. | |
| 242 </para> | |
| 243 <para> | |
| 244 This difference can be seen most clearly when HarfBuzz processes | |
| 245 ligature substitutions and glyph decompositions. In level 0 | |
| 246 and level 1, ligatures and glyph decomposition both involve | |
| 247 merging clusters; in level 2, neither of these operations | |
| 248 triggers a merge. | |
| 249 </para> | |
| 250 <para> | |
| 251 Client programs can specify level 2 behavior for a buffer by | |
| 252 setting its <literal>cluster_level</literal> to | |
| 253 <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>. | |
| 254 </para> | |
| 255 </listitem> | |
| 256 </itemizedlist> | |
| 257 <para> | |
| 258 As mentioned earlier, client programs using HarfBuzz often | |
| 259 assign initial cluster values in a buffer by reusing the indices | |
| 260 of the code points in the input text. This gives a sequence of | |
| 261 cluster values that is monotonically increasing (for example, | |
| 262 0,1,2,3,4). | |
| 263 </para> | |
| 264 <para> | |
| 265 It is not <emphasis>required</emphasis> that the cluster values | |
| 266 in a buffer be monotonically increasing. However, if the initial | |
| 267 cluster values in a buffer are monotonic and the buffer is | |
| 268 configured to use cluster level 0 or 1, then HarfBuzz | |
| 269 guarantees that the final cluster values in the shaped buffer | |
| 270 will also be monotonic. No such guarantee is made for cluster | |
| 271 level 2. | |
| 272 </para> | |
| 273 <para> | |
| 274 In levels 0 and 1, HarfBuzz implements the following conceptual | |
| 275 model for cluster values: | |
| 276 </para> | |
| 277 <itemizedlist spacing="compact"> | |
| 278 <listitem> | |
| 279 <para> | |
| 280 If the sequence of input cluster values is monotonic, the | |
| 281 sequence of cluster values will remain monotonic. | |
| 282 </para> | |
| 283 </listitem> | |
| 284 <listitem> | |
| 285 <para> | |
| 286 Each cluster value represents a single cluster. | |
| 287 </para> | |
| 288 </listitem> | |
| 289 <listitem> | |
| 290 <para> | |
| 291 Each cluster contains one or more glyphs and one or more | |
| 292 characters. | |
| 293 </para> | |
| 294 </listitem> | |
| 295 </itemizedlist> | |
| 296 <para> | |
| 297 In practice, this model offers several benefits. Assuming that | |
| 298 the initial cluster values were monotonically increasing | |
| 299 and distinct before shaping began, then, in the final output: | |
| 300 </para> | |
| 301 <itemizedlist spacing="compact"> | |
| 302 <listitem> | |
| 303 <para> | |
| 304 All adjacent glyphs having the same final cluster | |
| 305 value belong to the same cluster. | |
| 306 </para> | |
| 307 </listitem> | |
| 308 <listitem> | |
| 309 <para> | |
| 310 Each character belongs to the cluster that has the highest | |
| 311 cluster value <emphasis>not larger than</emphasis> its | |
| 312 initial cluster value. | |
| 313 </para> | |
| 314 </listitem> | |
| 315 </itemizedlist> | |
| 316 </section> | |
| 317 | |
| 318 <section id="a-clustering-example-for-levels-0-and-1"> | |
| 319 <title>A clustering example for levels 0 and 1</title> | |
| 320 <para> | |
| 321 The basic shaping operations affect clusters in a predictable | |
| 322 manner when using level 0 or level 1: | |
| 323 </para> | |
| 324 <itemizedlist> | |
| 325 <listitem> | |
| 326 <para> | |
| 327 When two or more clusters <emphasis>merge</emphasis>, the | |
| 328 resulting merged cluster takes as its cluster value the | |
| 329 <emphasis>minimum</emphasis> of the incoming cluster values. | |
| 330 </para> | |
| 331 </listitem> | |
| 332 <listitem> | |
| 333 <para> | |
| 334 When a cluster <emphasis>decomposes</emphasis>, all of the | |
| 335 resulting child clusters inherit as their cluster value the | |
| 336 cluster value of the parent cluster. | |
| 337 </para> | |
| 338 </listitem> | |
| 339 <listitem> | |
| 340 <para> | |
| 341 When a character is <emphasis>reordered</emphasis>, the | |
| 342 reordered character and all clusters that the character | |
| 343 moves past as part of the reordering are merged into one cluster. | |
| 344 </para> | |
| 345 </listitem> | |
| 346 </itemizedlist> | |
| 347 <para> | |
| 348 The functionality, guarantees, and benefits of level 0 and level | |
| 349 1 behavior can be seen with some examples. First, let us examine | |
| 350 what happens with cluster values when shaping involves cluster | |
| 351 merging with ligatures and decomposition. | |
| 352 </para> | |
| 353 | |
| 354 <para> | |
| 355 Let us say we start with the following character sequence (top row) and | |
| 356 initial cluster values (bottom row): | |
| 357 </para> | |
| 358 <programlisting> | |
| 359 A,B,C,D,E | |
| 360 0,1,2,3,4 | |
| 361 </programlisting> | |
| 362 <para> | |
| 363 During shaping, HarfBuzz maps these characters to glyphs from | |
| 364 the font. For simplicity, let us assume that each character maps | |
| 365 to the corresponding, identical-looking glyph: | |
| 366 </para> | |
| 367 <programlisting> | |
| 368 A,B,C,D,E | |
| 369 0,1,2,3,4 | |
| 370 </programlisting> | |
| 371 <para> | |
| 372 Now if, for example, <literal>B</literal> and <literal>C</literal> | |
| 373 form a ligature, then the clusters to which they belong | |
| 374 "merge". This merged cluster takes for its cluster | |
| 375 value the minimum of all the cluster values of the clusters that | |
| 376 went into the ligature. In this case, we get: | |
| 377 </para> | |
| 378 <programlisting> | |
| 379 A,BC,D,E | |
| 380 0,1 ,3,4 | |
| 381 </programlisting> | |
| 382 <para> | |
| 383 because 1 is the minimum of the set {1,2}, which were the | |
| 384 cluster values of <literal>B</literal> and | |
| 385 <literal>C</literal>. | |
| 386 </para> | |
| 387 <para> | |
| 388 Next, let us say that the <literal>BC</literal> ligature glyph | |
| 389 decomposes into three components, and <literal>D</literal> also | |
| 390 decomposes into two components. Whenever a cluster decomposes, | |
| 391 its components each inherit the cluster value of their parent: | |
| 392 </para> | |
| 393 <programlisting> | |
| 394 A,BC0,BC1,BC2,D0,D1,E | |
| 395 0,1 ,1 ,1 ,3 ,3 ,4 | |
| 396 </programlisting> | |
| 397 <para> | |
| 398 Next, if <literal>BC2</literal> and <literal>D0</literal> form a | |
| 399 ligature, then their clusters (cluster values 1 and 3) merge into | |
| 400 <literal>min(1,3) = 1</literal>: | |
| 401 </para> | |
| 402 <programlisting> | |
| 403 A,BC0,BC1,BC2D0,D1,E | |
| 404 0,1 ,1 ,1 ,1 ,4 | |
| 405 </programlisting> | |
| 406 <para> | |
| 407 Note that the entirety of cluster 3 merges into cluster 1, not | |
| 408 just the <literal>D0</literal> glyph. This reflects the fact | |
| 409 that the cluster <emphasis>must</emphasis> be treated as an | |
| 410 indivisible unit. | |
| 411 </para> | |
| 412 <para> | |
| 413 At this point, cluster 1 means: the character sequence | |
| 414 <literal>BCD</literal> is represented by glyphs | |
| 415 <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any | |
| 416 further. | |
| 417 </para> | |
| 418 </section> | |
| 419 <section id="reordering-in-levels-0-and-1"> | |
| 420 <title>Reordering in levels 0 and 1</title> | |
| 421 <para> | |
| 422 Another common operation in some shapers is glyph | |
| 423 reordering. In order to maintain a monotonic cluster sequence | |
| 424 when glyph reordering takes place, HarfBuzz merges the clusters | |
| 425 of everything in the reordering sequence. | |
| 426 </para> | |
| 427 <para> | |
| 428 For example, let us again start with the character sequence (top | |
| 429 row) and initial cluster values (bottom row): | |
| 430 </para> | |
| 431 <programlisting> | |
| 432 A,B,C,D,E | |
| 433 0,1,2,3,4 | |
| 434 </programlisting> | |
| 435 <para> | |
| 436 If <literal>D</literal> is reordered to the position immediately | |
| 437 before <literal>B</literal>, then HarfBuzz merges the | |
| 438 <literal>B</literal>, <literal>C</literal>, and | |
| 439 <literal>D</literal> clusters — all the clusters between | |
| 440 the final position of the reordered glyph and its original | |
| 441 position. This means that we get: | |
| 442 </para> | |
| 443 <programlisting> | |
| 444 A,D,B,C,E | |
| 445 0,1,1,1,4 | |
| 446 </programlisting> | |
| 447 <para> | |
| 448 as the final cluster sequence. | |
| 449 </para> | |
| 450 <para> | |
| 451 Merging this many clusters is not ideal, but it is the only | |
| 452 sensible way for HarfBuzz to maintain the guarantee that the | |
| 453 sequence of cluster values remains monotonic and to retain the | |
| 454 true relationship between glyphs and characters. | |
| 455 </para> | |
| 456 </section> | |
| 457 <section id="the-distinction-between-levels-0-and-1"> | |
| 458 <title>The distinction between levels 0 and 1</title> | |
| 459 <para> | |
| 460 The preceding examples demonstrate the main effects of using | |
| 461 cluster levels 0 and 1. The only difference between the two | |
| 462 levels is this: in level 0, at the very beginning of the shaping | |
| 463 process, HarfBuzz merges the cluster of each base character | |
| 464 with the clusters of all Unicode marks (combining or not) and | |
| 465 modifiers that follow it. | |
| 466 </para> | |
| 467 <para> | |
| 468 For example, let us start with the following character sequence | |
| 469 (top row) and accompanying initial cluster values (bottom row): | |
| 470 </para> | |
| 471 <programlisting> | |
| 472 A,acute,B | |
| 473 0,1 ,2 | |
| 474 </programlisting> | |
| 475 <para> | |
| 476 The <literal>acute</literal> is a Unicode mark. If HarfBuzz is | |
| 477 using cluster level 0 on this sequence, then the | |
| 478 <literal>A</literal> and <literal>acute</literal> clusters will | |
| 479 merge, and the result will become: | |
| 480 </para> | |
| 481 <programlisting> | |
| 482 A,acute,B | |
| 483 0,0 ,2 | |
| 484 </programlisting> | |
| 485 <para> | |
| 486 This merger is performed before any other script-shaping | |
| 487 steps. | |
| 488 </para> | |
| 489 <para> | |
| 490 This initial cluster merging is the default behavior of the | |
| 491 Windows shaping engine, and the old HarfBuzz codebase copied | |
| 492 that behavior to maintain compatibility. Consequently, it has | |
| 493 remained the default behavior in the new HarfBuzz codebase. | |
| 494 </para> | |
| 495 <para> | |
| 496 But this initial cluster-merging behavior makes it impossible | |
| 497 for client programs to implement some features (such as | |
| 498 coloring diacritic marks differently from their base | |
| 499 characters). That is why, in level 1, HarfBuzz does not perform | |
| 500 the initial merging step. | |
| 501 </para> | |
| 502 <para> | |
| 503 For client programs that rely on HarfBuzz cluster values to | |
| 504 perform cursor positioning, level 0 is more convenient. But | |
| 505 relying on cluster boundaries for cursor positioning is wrong: cursor | |
| 506 positions should be determined based on Unicode grapheme | |
| 507 boundaries, not on shaping-cluster boundaries. As such, using | |
| 508 level 1 clustering behavior is recommended. | |
| 509 </para> | |
| 510 <para> | |
| 511 One final facet of levels 0 and 1 is worth noting. HarfBuzz | |
| 512 currently does not allow any | |
| 513 <emphasis>multiple-substitution</emphasis> GSUB lookups to | |
| 514 replace a glyph with zero glyphs (in other words, to delete a | |
| 515 glyph). | |
| 516 </para> | |
| 517 <para> | |
| 518 But, in some other situations, glyphs can be deleted. In | |
| 519 those cases, if the glyph being deleted is the last glyph of its | |
| 520 cluster, HarfBuzz makes sure to merge the deleted glyph's | |
| 521 cluster with a neighboring cluster. | |
| 522 </para> | |
| 523 <para> | |
| 524 This is done primarily to make sure that the starting cluster of the | |
| 525 text always has the cluster index pointing to the start of the text | |
| 526 for the run; more than one client program currently relies on this | |
| 527 guarantee. | |
| 528 </para> | |
| 529 <para> | |
| 530 Incidentally, Apple's CoreText does something different to | |
| 531 maintain the same promise: it inserts a glyph with id 65535 at | |
| 532 the beginning of the glyph string if the glyph corresponding to | |
| 533 the first character in the run was deleted. HarfBuzz might do | |
| 534 something similar in the future. | |
| 535 </para> | |
| 536 </section> | |
| 537 <section id="level-2"> | |
| 538 <title>Level 2</title> | |
| 539 <para> | |
| 540 HarfBuzz's level 2 cluster behavior uses a significantly | |
| 541 different model from that of level 0 and level 1. | |
| 542 </para> | |
| 543 <para> | |
| 544 The level 2 behavior is easy to describe, but it may be | |
| 545 difficult to understand in practical terms. In brief, level 2 | |
| 546 performs no merging of clusters whatsoever. | |
| 547 </para> | |
| 548 <para> | |
| 549 This means that there is no initial base-and-mark merging step | |
| 550 (as is done in level 0), and it means that reordering moves and | |
| 551 ligature substitutions do not trigger a cluster merge. | |
| 552 </para> | |
| 553 <para> | |
| 554 Only one shaping operation directly affects clusters when using | |
| 555 level 2: | |
| 556 </para> | |
| 557 <itemizedlist> | |
| 558 <listitem> | |
| 559 <para> | |
| 560 When a cluster <emphasis>decomposes</emphasis>, all of the | |
| 561 resulting child clusters inherit as their cluster value the | |
| 562 cluster value of the parent cluster. | |
| 563 </para> | |
| 564 </listitem> | |
| 565 </itemizedlist> | |
| 566 <para> | |
| 567 When glyphs do form a ligature (or when some other feature | |
| 568 substitutes multiple glyphs with one glyph) the cluster value | |
| 569 of the first glyph is retained as the cluster value for the | |
| 570 resulting ligature. | |
| 571 </para> | |
| 572 <para> | |
| 573 This occurrence sounds similar to a cluster merge, but it is | |
| 574 different. In particular, no subsequent characters — | |
| 575 including marks and modifiers — are affected. They retain | |
| 576 their previous cluster values. | |
| 577 </para> | |
| 578 <para> | |
| 579 Level 2 cluster behavior is ultimately less complex than level 0 | |
| 580 or level 1, but there are several cases for which processing | |
| 581 cluster values produced at level 2 may be tricky. | |
| 582 </para> | |
| 583 <section id="ligatures-with-combining-marks-in-level-2"> | |
| 584 <title>Ligatures with combining marks in level 2</title> | |
| 585 <para> | |
| 586 The first example of how HarfBuzz's level 2 cluster behavior | |
| 587 can be tricky is when the text to be shaped includes combining | |
| 588 marks attached to ligatures. | |
| 589 </para> | |
| 590 <para> | |
| 591 Let us start with an input sequence with the following | |
| 592 characters (top row) and initial cluster values (bottom row): | |
| 593 </para> | |
| 594 <programlisting> | |
| 595 A,acute,B,breve,C,circumflex | |
| 596 0,1 ,2,3 ,4,5 | |
| 597 </programlisting> | |
| 598 <para> | |
| 599 If the sequence <literal>A,B,C</literal> forms a ligature, | |
| 600 then these are the cluster values HarfBuzz will return under | |
| 601 the various cluster levels: | |
| 602 </para> | |
| 603 <para> | |
| 604 Level 0: | |
| 605 </para> | |
| 606 <programlisting> | |
| 607 ABC,acute,breve,circumflex | |
| 608 0 ,0 ,0 ,0 | |
| 609 </programlisting> | |
| 610 <para> | |
| 611 Level 1: | |
| 612 </para> | |
| 613 <programlisting> | |
| 614 ABC,acute,breve,circumflex | |
| 615 0 ,0 ,0 ,5 | |
| 616 </programlisting> | |
| 617 <para> | |
| 618 Level 2: | |
| 619 </para> | |
| 620 <programlisting> | |
| 621 ABC,acute,breve,circumflex | |
| 622 0 ,1 ,3 ,5 | |
| 623 </programlisting> | |
| 624 <para> | |
| 625 The level 2 result is the hardest for a client program to | |
| 626 interpret, because there is nothing in the cluster values that | |
| 627 indicates that <literal>B</literal> and <literal>C</literal> | |
| 628 formed a ligature with <literal>A</literal>. | |
| 629 </para> | |
| 630 <para> | |
| 631 In contrast, the "merged" cluster values of the mark glyphs | |
| 632 that are seen in the level 0 and level 1 output are evidence | |
| 633 that a ligature substitution took place. | |
| 634 </para> | |
| 635 </section> | |
| 636 <section id="reordering-in-level-2"> | |
| 637 <title>Reordering in level 2</title> | |
| 638 <para> | |
| 639 Another example of how HarfBuzz's level 2 cluster behavior | |
| 640 can be tricky is when glyphs reorder. Consider an input sequence | |
| 641 with the following characters (top row) and initial cluster | |
| 642 values (bottom row): | |
| 643 </para> | |
| 644 <programlisting> | |
| 645 A,B,C,D,E | |
| 646 0,1,2,3,4 | |
| 647 </programlisting> | |
| 648 <para> | |
| 649 Now imagine <literal>D</literal> moves before | |
| 650 <literal>B</literal> in a reordering operation. The cluster | |
| 651 values will then be: | |
| 652 </para> | |
| 653 <programlisting> | |
| 654 A,D,B,C,E | |
| 655 0,3,1,2,4 | |
| 656 </programlisting> | |
| 657 <para> | |
| 658 Next, if <literal>D</literal> forms a ligature with | |
| 659 <literal>B</literal>, the output is: | |
| 660 </para> | |
| 661 <programlisting> | |
| 662 A,DB,C,E | |
| 663 0,3 ,2,4 | |
| 664 </programlisting> | |
| 665 <para> | |
| 666 However, in a different scenario, in which the shaping rules | |
| 667 of the script instead caused <literal>A</literal> and | |
| 668 <literal>B</literal> to form a ligature | |
| 669 <emphasis>before</emphasis> the <literal>D</literal> reordered, the | |
| 670 result would be: | |
| 671 </para> | |
| 672 <programlisting> | |
| 673 AB,D,C,E | |
| 674 0 ,3,2,4 | |
| 675 </programlisting> | |
| 676 <para> | |
| 677 There is no way for a client program to differentiate between | |
| 678 these two scenarios based on the cluster values | |
| 679 alone. Consequently, client programs that use level 2 might | |
| 680 need to undertake additional work in order to manage cursor | |
| 681 positioning, text attributes, or other desired features. | |
| 682 </para> | |
| 683 </section> | |
| 684 <section id="other-considerations-in-level-2"> | |
| 685 <title>Other considerations in level 2</title> | |
| 686 <para> | |
| 687 There may be other problems encountered with ligatures under | |
| 688 level 2, such as when the direction of the text is forced to | |
| 689 the opposite of its natural direction (for example, Arabic text | |
| 690 that is forced into left-to-right directionality). But, | |
| 691 generally speaking, these other scenarios are minor corner | |
| 692 cases that are too obscure for most client programs to need to | |
| 693 worry about. | |
| 694 </para> | |
| 695 </section> | |
| 696 </section> | |
| 697 </chapter> |
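
The level 0 and level 1 rules described in the chapter above (a merge takes the minimum cluster value, a decomposition inherits the parent's value, and a reordering merges everything the glyph moves past) can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not HarfBuzz code: glyphs are plain `(name, cluster)` tuples, and the helper names `merge_clusters`, `ligate`, `decompose`, and `reorder` are hypothetical, chosen only for this illustration.

```python
# Toy model of level 0/1 cluster bookkeeping. Not HarfBuzz API: all helper
# names are hypothetical and glyphs are simple (name, cluster) tuples.

def merge_clusters(buf, start, end):
    """Give every glyph in buf[start:end] the minimum cluster value in range.

    The range is widened first so that any cluster straddling either edge is
    merged as a whole, because a cluster must remain indivisible.
    """
    if end - start < 2:
        return
    lowest = min(cluster for _, cluster in buf[start:end])
    while start > 0 and buf[start - 1][1] == buf[start][1]:
        start -= 1
    while end < len(buf) and buf[end][1] == buf[end - 1][1]:
        end += 1
    for i in range(start, end):
        buf[i] = (buf[i][0], lowest)


def ligate(buf, start, end, name):
    """Replace buf[start:end] with one ligature glyph after merging clusters."""
    merge_clusters(buf, start, end)
    buf[start:end] = [(name, buf[start][1])]


def decompose(buf, index, names):
    """Split one glyph into several; each part inherits the parent's cluster."""
    cluster = buf[index][1]
    buf[index:index + 1] = [(n, cluster) for n in names]


def reorder(buf, src, dst):
    """Move the glyph at src to dst, merging every cluster it moves past."""
    lo, hi = min(src, dst), max(src, dst)
    merge_clusters(buf, lo, hi + 1)
    buf.insert(dst, buf.pop(src))


# Replay the chapter's merging example: A,B,C,D,E with clusters 0,1,2,3,4.
buf = list(zip("ABCDE", range(5)))
ligate(buf, 1, 3, "BC")                     # B and C form a ligature
decompose(buf, 1, ["BC0", "BC1", "BC2"])    # BC splits into three parts
decompose(buf, 4, ["D0", "D1"])             # D splits into two parts
ligate(buf, 3, 5, "BC2D0")                  # BC2 and D0 form a ligature
print(buf)
```

Replaying the steps this way makes it easier to see why `D1` ends up in cluster 1 even though it was never part of a ligature: widening the merge range pulls in the whole of cluster 3, not just `D0`.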
