comparison mupdf-source/thirdparty/harfbuzz/docs/usermanual-shaping-concepts.xml @ 2:b50eed0cc0ef upstream
ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4.
The directory name has changed: no version number in the expanded directory now.
| author | Franz Glasner <fzglas.hg@dom66.de> |
|---|---|
| date | Mon, 15 Sep 2025 11:43:07 +0200 |
| parents | |
| children |
| 1:1d09e1dec1d9 | 2:b50eed0cc0ef |
|---|---|
| 1 <?xml version="1.0"?> | |
| 2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" | |
| 3 "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [ | |
| 4 <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'"> | |
| 5 <!ENTITY version SYSTEM "version.xml"> | |
| 6 ]> | |
| 7 <chapter id="shaping-concepts"> | |
| 8 <title>Shaping concepts</title> | |
| 9 <section id="text-shaping-concepts"> | |
| 10 <title>Text shaping</title> | |
| 11 <para> | |
| 12 Text shaping is the process of transforming a sequence of Unicode | |
| 13 codepoints that represent individual characters (letters, | |
| 14 diacritics, tone marks, numbers, symbols, etc.) into the | |
| 15 orthographically and linguistically correct two-dimensional layout | |
| 16 of glyph shapes taken from a specified font. | |
| 17 </para> | |
| 18 <para> | |
| 19 For some writing systems (or <emphasis>scripts</emphasis>) and | |
| 20 languages, the process is simple, requiring the shaper to do | |
| 21 little more than advance the horizontal position forward by the | |
| 22 correct amount for each successive glyph. | |
| 23 </para> | |
| 24 <para> | |
| 25 For other scripts (often unceremoniously called <emphasis>complex scripts</emphasis>), however, any combination of | |
| 26 several shaping operations may be required, and the rules for how | |
| 27 and when they are applied vary from script to script. HarfBuzz and | |
| 28 other shaping engines implement these rules. | |
| 29 </para> | |
| 30 <para> | |
| 31 The exact rules and necessary operations for a particular script | |
| 32 constitute a shaping <emphasis>model</emphasis>. OpenType | |
| 33 specifies a set of shaping models that covers all of | |
| 34 Unicode. Other shaping models are available, however, including | |
| 35 Graphite and Apple Advanced Typography (AAT). | |
| 36 </para> | |
| 37 </section> | |
| 38 | |
| 39 <section id="script-specific-shaping"> | |
| 40 <title>Script-specific shaping</title> | |
| 41 <para> | |
| 42 In many scripts, transforming the input | |
| 43 sequence into the final layout requires some combination of | |
| 44 operations—such as context-dependent substitutions, | |
| 45 context-dependent mark positioning, glyph-to-glyph joining, | |
| 46 glyph reordering, or glyph stacking. | |
| 47 </para> | |
| 48 <para> | |
| 49 In some scripts, the shaping rules require that a text | |
| 50 run be divided into syllables before the operations can be | |
| 51 applied. Other scripts may apply shaping operations over | |
| 52 entire words or over the entire text run, with no subdivision | |
| 53 required. | |
| 54 </para> | |
| 55 <para> | |
| 56 Other scripts do not require these | |
| 57 operations. However, correctly shaping a text run in | |
| 58 any script may still involve Unicode normalization, | |
| 59 ligature substitutions, mark positioning, kerning, and applying | |
| 60 other font features. | |
| 61 </para> | |
| 62 </section> | |
| 63 | |
| 64 <section id="shaping-operations"> | |
| 65 <title>Shaping operations</title> | |
| 66 <para> | |
| 67 Shaping a text run involves transforming the | |
| 68 input sequence of Unicode codepoints with some combination of | |
| 69 operations that is specified in the shaping model for the | |
| 70 script. | |
| 71 </para> | |
| 72 <para> | |
| 73 The specific conditions that trigger a given operation for a | |
| 74 text run vary from script to script, as do the order in which the | |
| 75 operations are performed and which codepoints are | |
| 76 affected. However, the same general set of shaping operations is | |
| 77 common to all of the script shaping models. | |
| 78 </para> | |
| 79 | |
| 80 <itemizedlist> | |
| 81 <listitem> | |
| 82 <para> | |
| 83 A <emphasis>reordering</emphasis> operation moves a glyph | |
| 84 from its original ("logical") position in the sequence to | |
| 85 some other ("visual") position. | |
| 86 </para> | |
| 87 <para> | |
| 88 The shaping model for a given script might involve | |
| 89 more than one reordering step. | |
| 90 </para> | |
| 91 </listitem> | |
| 92 | |
| 93 <listitem> | |
| 94 <para> | |
| 95 A <emphasis>joining</emphasis> operation replaces a glyph | |
| 96 with an alternate form that is designed to connect with one | |
| 97 or more of the adjacent glyphs in the sequence. | |
| 98 </para> | |
| 99 </listitem> | |
| 100 | |
| 101 <listitem> | |
| 102 <para> | |
| 103 A contextual <emphasis>substitution</emphasis> operation | |
| 104 replaces either a single glyph or a subsequence of several | |
| 105 glyphs with an alternate glyph. This substitution is | |
| 106 performed when the original glyph or subsequence of glyphs | |
| 107 occurs in a specified position with respect to the | |
| 108 surrounding sequence. For example, one substitution might be | |
| 109 performed only when the target glyph is the first glyph in | |
| 110 the sequence, while another substitution is performed only | |
| 111 when a different target glyph occurs immediately after a | |
| 112 particular string pattern. | |
| 113 </para> | |
| 114 <para> | |
| 115 The shaping model for a given script might involve | |
| 116 multiple contextual-substitution operations, each applying | |
| 117 to different target glyphs and patterns, and which are | |
| 118 performed in separate steps. | |
| 119 </para> | |
| 120 </listitem> | |
| 121 | |
| 122 <listitem> | |
| 123 <para> | |
| 124 A contextual <emphasis>positioning</emphasis> operation | |
| 125 moves the horizontal and/or vertical position of a | |
| 126 glyph. This positioning move is performed when the glyph | |
| 127 occurs in a specified position with respect to the | |
| 128 surrounding sequence. | |
| 129 </para> | |
| 130 <para> | |
| 131 Many contextual positioning operations are used to place | |
| 132 <emphasis>mark</emphasis> glyphs (such as diacritics, vowel | |
| 133 signs, and tone markers) with respect to | |
| 134 <emphasis>base</emphasis> glyphs. However, some | |
| 135 scripts may use contextual positioning operations to | |
| 136 correctly place base glyphs as well, such as | |
| 137 when the script uses <emphasis>stacking</emphasis> characters. | |
| 138 </para> | |
| 139 </listitem> | |
| 140 | |
| 141 </itemizedlist> | |
| 142 </section> | |
| 143 | |
| 144 <section id="unicode-character-categories"> | |
| 145 <title>Unicode character categories</title> | |
| 146 <para> | |
| 147 Shaping models are typically specified with respect to how | |
| 148 scripts are defined in the Unicode standard. | |
| 149 </para> | |
| 150 <para> | |
| 151 Every codepoint in the Unicode Character Database (UCD) is | |
| 152 assigned a <emphasis>Unicode General Category</emphasis> (UGC), | |
| 153 which provides the most fundamental information about the | |
| 154 codepoint: whether the codepoint represents a | |
| 155 <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a | |
| 156 <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a | |
| 157 <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>, | |
| 158 or something else (<emphasis>Other</emphasis>). | |
| 159 </para> | |
| 160 <para> | |
| 161 These UGC properties are "Major" categories. Each codepoint is | |
| 162 further assigned to a "minor" category within its Major | |
| 163 category, such as "Letter, uppercase" (<literal>Lu</literal>) or | |
| 164 "Letter, modifier" (<literal>Lm</literal>). | |
| 165 </para> | |
| 166 <para> | |
| 167 Shaping models are concerned primarily with Letter and Mark | |
| 168 codepoints. The minor categories of Mark codepoints are | |
| 169 particularly important for shaping. Marks can be nonspacing | |
| 170 (<literal>Mn</literal>), spacing combining | |
| 171 (<literal>Mc</literal>), or enclosing (<literal>Me</literal>). | |
| 172 </para> | |
| 173 <para> | |
| 174 In addition to the UGC property, codepoints in the Indic and | |
| 175 Southeast Asian scripts are also assigned | |
| 176 <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and | |
| 177 <emphasis>Unicode Indic Positional Category</emphasis> (UIPC) | |
| 178 properties that provide more detailed information needed for | |
| 179 shaping. | |
| 180 </para> | |
| 181 <para> | |
| 182 The UISC property sub-categorizes Letters and Marks according to | |
| 183 common script-shaping behaviors. For example, UISC distinguishes | |
| 184 between consonant letters, vowel letters, and vowel marks. The | |
| 185 UIPC property sub-categorizes Mark codepoints by the relative visual | |
| 186 position that they occupy (above, below, right, left, or in | |
| 187 multiple positions). | |
| 188 </para> | |
| 189 <para> | |
| 190 Some scripts require that the text run be split into | |
| 191 syllables. What constitutes a valid syllable in these | |
| 192 scripts is specified in regular expressions, formed from the | |
| 193 Letter and Mark codepoints, that take the UISC and UIPC | |
| 194 properties into account. | |
| 195 </para> | |
| 196 | |
| 197 </section> | |
| 198 | |
| 199 <section id="text-runs"> | |
| 200 <title>Text runs</title> | |
| 201 <para> | |
| 202 Real-world text usually contains codepoints from a mixture of | |
| 203 different Unicode scripts (including punctuation, numbers, symbols, | |
| 204 white-space characters, and other codepoints that do not belong | |
| 205 to any script). Real-world text may also be marked up with | |
| 206 formatting that changes font properties (including the font, | |
| 207 font style, and font size). | |
| 208 </para> | |
| 209 <para> | |
| 210 For shaping purposes, all real-world text streams must be first | |
| 211 segmented into runs that have a uniform set of properties. | |
| 212 </para> | |
| 213 <para> | |
| 214 In particular, shaping models always assume that every codepoint | |
| 215 in a text run has the same <emphasis>direction</emphasis>, | |
| 216 <emphasis>script</emphasis> tag, and | |
| 217 <emphasis>language</emphasis> tag. | |
| 218 </para> | |
| 219 </section> | |
| 220 | |
| 221 <section id="opentype-shaping-models"> | |
| 222 <title>OpenType shaping models</title> | |
| 223 <para> | |
| 224 OpenType provides shaping models for the following scripts: | |
| 225 </para> | |
| 226 | |
| 227 <itemizedlist> | |
| 228 <listitem> | |
| 229 <para> | |
| 230 The <emphasis>default</emphasis> shaping model handles all | |
| 231 scripts with no script-specific shaping model, and may also be used as a fallback for | |
| 232 handling unrecognized scripts. | |
| 233 </para> | |
| 234 </listitem> | |
| 235 | |
| 236 <listitem> | |
| 237 <para> | |
| 238 The <emphasis>Indic</emphasis> shaping model handles the Indic | |
| 239 scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, | |
| 240 Malayalam, Oriya, Tamil, and Telugu. | |
| 241 </para> | |
| 242 <para> | |
| 243 The Indic shaping model was revised significantly in | |
| 244 2005. To denote the change, a new set of <emphasis>script | |
| 245 tags</emphasis> was assigned for Bengali, Devanagari, | |
| 246 Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and | |
| 247 Telugu. For the sake of clarity, the term "Indic2" is | |
| 248 sometimes used to refer to the current, revised shaping | |
| 249 model. | |
| 250 </para> | |
| 251 </listitem> | |
| 252 | |
| 253 <listitem> | |
| 254 <para> | |
| 255 The <emphasis>Arabic</emphasis> shaping model supports | |
| 256 Arabic, Mongolian, N'Ko, Syriac, and several other connected | |
| 257 or cursive scripts. | |
| 258 </para> | |
| 259 </listitem> | |
| 260 | |
| 261 <listitem> | |
| 262 <para> | |
| 263 The <emphasis>Thai/Lao</emphasis> shaping model supports | |
| 264 the Thai and Lao scripts. | |
| 265 </para> | |
| 266 </listitem> | |
| 267 | |
| 268 <listitem> | |
| 269 <para> | |
| 270 The <emphasis>Khmer</emphasis> shaping model supports the | |
| 271 Khmer script. | |
| 272 </para> | |
| 273 </listitem> | |
| 274 | |
| 275 <listitem> | |
| 276 <para> | |
| 277 The <emphasis>Myanmar</emphasis> shaping model supports the | |
| 278 Myanmar (or Burmese) script. | |
| 279 </para> | |
| 280 </listitem> | |
| 281 | |
| 282 <listitem> | |
| 283 <para> | |
| 284 The <emphasis>Tibetan</emphasis> shaping model supports the | |
| 285 Tibetan script. | |
| 286 </para> | |
| 287 </listitem> | |
| 288 | |
| 289 <listitem> | |
| 290 <para> | |
| 291 The <emphasis>Hangul</emphasis> shaping model supports the | |
| 292 Hangul script. | |
| 293 </para> | |
| 294 </listitem> | |
| 295 | |
| 296 <listitem> | |
| 297 <para> | |
| 298 The <emphasis>Hebrew</emphasis> shaping model supports the | |
| 299 Hebrew script. | |
| 300 </para> | |
| 301 </listitem> | |
| 302 | |
| 303 <listitem> | |
| 304 <para> | |
| 305 The <emphasis>Universal Shaping Engine</emphasis> (USE) | |
| 306 shaping model supports scripts not covered by one of | |
| 307 the above script-specific shaping models, including | |
| 308 Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi, | |
| 309 Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai | |
| 310 Viet, and many others. | |
| 311 </para> | |
| 312 </listitem> | |
| 313 | |
| 314 <listitem> | |
| 315 <para> | |
| 316 Text runs that do not fall under one of the above shaping | |
| 317 models may still require processing by a shaping engine. Of | |
| 318 particular note is <emphasis>Emoji</emphasis> shaping, which | |
| 319 may involve variation-selector sequences and glyph | |
| 320 substitution. Emoji shaping is handled by the default | |
| 321 shaping model. | |
| 322 </para> | |
| 323 </listitem> | |
| 324 | |
| 325 </itemizedlist> | |
| 326 | |
| 327 </section> | |
| 328 | |
| 329 <section id="graphite-shaping"> | |
| 330 <title>Graphite shaping</title> | |
| 331 <para> | |
| 332 In contrast to OpenType shaping, Graphite shaping does not | |
| 333 specify a predefined set of shaping models or a set of supported | |
| 334 scripts. | |
| 335 </para> | |
| 336 <para> | |
| 337 Instead, each Graphite font contains a complete set of rules that | |
| 338 implement the required shaping model for the intended | |
| 339 script. These rules include finite-state machines to match | |
| 340 sequences of codepoints to the shaping operations to perform. | |
| 341 </para> | |
| 342 <para> | |
| 343 Graphite shaping can perform the same shaping operations used in | |
| 344 OpenType shaping, as well as other functions that have not been | |
| 345 defined for OpenType shaping. | |
| 346 </para> | |
| 347 </section> | |
| 348 | |
| 349 <section id="aat-shaping"> | |
| 350 <title>AAT shaping</title> | |
| 351 <para> | |
| 352 In contrast to OpenType shaping, AAT shaping does not specify a | |
| 353 predefined set of shaping models or a set of supported scripts. | |
| 354 </para> | |
| 355 <para> | |
| 356 Instead, each AAT font includes a complete set of rules that | |
| 357 implement the desired shaping model for the intended | |
| 358 script. These rules include finite-state machines that match glyph | |
| 359 sequences to the shaping operations to perform. | |
| 360 </para> | |
| 361 <para> | |
| 362 Notably, AAT shaping rules are expressed for glyphs in the font, | |
| 363 not for Unicode codepoints. AAT shaping can perform the same | |
| 364 shaping operations used in OpenType shaping, as well as other | |
| 365 functions that have not been defined for OpenType shaping. | |
| 366 </para> | |
| 367 </section> | |
| 368 </chapter> |
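
The "Unicode character categories" section of the chapter describes Major and minor General Category values and singles out the three kinds of Mark. As a rough illustration only, the sketch below looks those values up with Python's standard `unicodedata` module. The helper name `describe` and the `MARK_KINDS` table are assumptions made for this example and are not part of HarfBuzz or of the manual. The UISC and UIPC properties are not exposed by the standard library and are not covered here.

```python
import unicodedata

# Labels for the three minor categories of Mark named in the chapter.
# This table is an illustrative assumption, not a HarfBuzz data structure.
MARK_KINDS = {
    "Mn": "nonspacing",
    "Mc": "spacing combining",
    "Me": "enclosing",
}


def describe(ch):
    """Return (major, minor, mark_kind) for a single character.

    minor is the two-letter General Category (for example "Lu"),
    major is its first letter (for example "L"), and mark_kind is
    None unless the character is a Mark.
    """
    minor = unicodedata.category(ch)
    major = minor[0]
    return major, minor, MARK_KINDS.get(minor)


if __name__ == "__main__":
    samples = ["A", "\u02B0", "\u0301", "\u093E", "\u20DD"]
    for ch in samples:
        major, minor, mark_kind = describe(ch)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"major={major} minor={minor} mark={mark_kind}")
```

A shaper would consult properties like these before deciding, for instance, whether a codepoint should be positioned relative to a base glyph.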

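The "Text runs" section of the chapter says that text must be segmented into runs with uniform properties before shaping. The following is a minimal sketch of segmentation by script alone, under loud assumptions: the `RANGES` table is a hypothetical, block-based approximation of the Unicode Script property (which Python's standard library does not expose), characters outside the table are treated as script-neutral and attached to the neighboring run, and direction and language are ignored entirely. Real itemizers are considerably more careful.

```python
# Hypothetical, deliberately tiny range table. It approximates scripts by
# codepoint ranges, so a few shared characters inside these ranges are
# misclassified. It is for illustration only.
RANGES = [
    (0x0041, 0x005A, "Latin"),       # A-Z
    (0x0061, 0x007A, "Latin"),       # a-z
    (0x00C0, 0x024F, "Latin"),       # Latin-1 letters and Latin Extended-A/B
    (0x0600, 0x06FF, "Arabic"),      # Arabic block
    (0x0900, 0x097F, "Devanagari"),  # Devanagari block
]


def script_of(ch):
    """Return an approximate script name, or None for neutral characters."""
    cp = ord(ch)
    for lo, hi, name in RANGES:
        if lo <= cp <= hi:
            return name
    return None


def split_runs(text):
    """Split text into (script, substring) runs.

    Neutral characters (spaces, digits, punctuation) join the current run;
    a run that starts with neutrals adopts the first real script it meets.
    """
    runs = []  # each entry is a mutable [script, substring] pair
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] is None:
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


if __name__ == "__main__":
    for script, chunk in split_runs("abc \u0645\u0631\u062D\u0628\u0627!"):
        print(script, repr(chunk))
```

Each resulting run could then be handed to a shaper together with a single direction, script tag, and language tag, which is the uniformity the chapter says shaping models assume.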