diff mupdf-source/thirdparty/harfbuzz/docs/usermanual-shaping-concepts.xml @ 2:b50eed0cc0ef upstream

ADD: MuPDF v1.26.7: the MuPDF source as downloaded by a default build of PyMuPDF 1.26.4. The directory name has changed: no version number in the expanded directory now.
author Franz Glasner <fzglas.hg@dom66.de>
date Mon, 15 Sep 2025 11:43:07 +0200
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/mupdf-source/thirdparty/harfbuzz/docs/usermanual-shaping-concepts.xml	Mon Sep 15 11:43:07 2025 +0200
@@ -0,0 +1,368 @@
+<?xml version="1.0"?>
+<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
+               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
+  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
+  <!ENTITY version SYSTEM "version.xml">
+]>
+<chapter id="shaping-concepts">
+  <title>Shaping concepts</title>
+  <section id="text-shaping-concepts">
+    <title>Text shaping</title>
+    <para>
+      Text shaping is the process of transforming a sequence of Unicode
+      codepoints that represent individual characters (letters,
+      diacritics, tone marks, numbers, symbols, etc.) into the
+      orthographically and linguistically correct two-dimensional layout
+      of glyph shapes taken from a specified font.
+    </para>
+    <para>
+      For some writing systems (or <emphasis>scripts</emphasis>) and
+      languages, the process is simple, requiring the shaper to do
+      little more than advance the horizontal position forward by the
+      correct amount for each successive glyph.
+    </para>
+    <para>
+      But, for other scripts (often unceremoniously called <emphasis>complex scripts</emphasis>), any combination of
+      several shaping operations may be required, and the rules for how
+      and when they are applied vary from script to script. HarfBuzz and
+      other shaping engines implement these rules.
+    </para>
+    <para>
+      The exact rules and necessary operations for a particular script
+      constitute a shaping <emphasis>model</emphasis>. OpenType
+      specifies a set of shaping models that covers all of
+      Unicode. Other shaping models are available, however, including
+      Graphite and Apple Advanced Typography (AAT). 
+    </para>
+  </section>
+  
+  <section id="script-specific-shaping">
+    <title>Script-specific shaping</title>
+    <para>
+      In many scripts, transforming the input
+      sequence into the final layout often requires some combination of
+      operations&mdash;such as context-dependent substitutions,
+      context-dependent mark positioning, glyph-to-glyph joining,
+      glyph reordering, or glyph stacking.
+    </para>
+    <para>
+      In some scripts, the shaping rules require that a text
+      run be divided into syllables before the operations can be
+      applied. Other scripts may apply shaping operations over
+      entire words or over the entire text run, with no subdivision
+      required.
+    </para>
+    <para>
+      Other scripts, do not require these
+      operations. However, correctly shaping a text run in
+      any script may still involve Unicode normalization,
+      ligature substitutions, mark positioning, kerning, and applying
+      other font features.
+    </para>
+  </section>
+  
+  <section id="shaping-operations">
+    <title>Shaping operations</title>
+    <para>
+      Shaping a text run involves transforming the
+      input sequence of Unicode codepoints with some combination of
+      operations that is specified in the shaping model for the
+      script.
+    </para>
+    <para>
+      The specific conditions that trigger a given operation for a
+      text run varies from script to script, as do the order that the
+      operations are performed in and which codepoints are
+      affected. However, the same general set of shaping operations is
+      common to all of the script shaping models. 
+    </para>
+    
+    <itemizedlist>
+      <listitem>
+	<para>
+	  A <emphasis>reordering</emphasis> operation moves a glyph
+	  from its original ("logical") position in the sequence to
+	  some other ("visual") position.
+	</para>
+	<para>
+	  The shaping model for a given script might involve
+	  more than one reordering step.
+	</para>
+      </listitem>
+      
+      <listitem>
+	<para>
+	  A <emphasis>joining</emphasis> operation replaces a glyph
+	  with an alternate form that is designed to connect with one
+	  or more of the adjacent glyphs in the sequence.
+	</para>
+      </listitem>
+      
+      <listitem>
+	<para>
+	  A contextual <emphasis>substitution</emphasis> operation
+	  replaces either a single glyph or a subsequence of several
+	  glyphs with an alternate glyph. This substitution is
+	  performed when the original glyph or subsequence of glyphs
+	  occurs in a specified position with respect to the
+	  surrounding sequence. For example, one substitution might be
+	  performed only when the target glyph is the first glyph in
+	  the sequence, while another substitution is performed only
+	  when a different target glyph occurs immediately after a
+	  particular string pattern.
+	</para>
+	<para>
+	  The shaping model for a given script might involve
+	  multiple contextual-substitution operations, each applying
+	  to different target glyphs and patterns, and which are
+	  performed in separate steps.
+	</para>
+      </listitem>
+      
+      <listitem>
+	<para>
+	  A contextual <emphasis>positioning</emphasis> operation
+	  moves the horizontal and/or vertical position of a
+	  glyph. This positioning move is performed when the glyph
+	  occurs in a specified position with respect to the
+	  surrounding sequence.
+	</para>
+	<para>
+	  Many contextual positioning operations are used to place
+	  <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
+	  signs, and tone markers) with respect to
+	  <emphasis>base</emphasis> glyphs. However, some
+	  scripts may use contextual positioning operations to
+	  correctly place base glyphs as well, such as
+	  when the script uses <emphasis>stacking</emphasis> characters.
+	</para>
+      </listitem>
+      
+    </itemizedlist>
+  </section>
+  
+  <section id="unicode-character-categories">
+    <title>Unicode character categories</title>
+    <para>
+      Shaping models are typically specified with respect to how
+      scripts are defined in the Unicode standard.
+    </para>
+    <para>
+      Every codepoint in the Unicode Character Database (UCD) is
+      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
+      which provides the most fundamental information about the
+      codepoint: whether the codepoint represents a
+      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
+      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
+      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
+      or something else (<emphasis>Other</emphasis>).
+    </para>
+    <para>
+      These UGC properties are "Major" categories. Each codepoint is
+      further assigned to a "minor" category within its Major
+      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
+      "Letter, modifier" (<literal>Lm</literal>).
+    </para>
+    <para>
+      Shaping models are concerned primarily with Letter and Mark
+      codepoints. The minor categories of Mark codepoints are
+      particularly important for shaping. Marks can be nonspacing
+      (<literal>Mn</literal>), spacing combining
+      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
+    </para>
+    <para>
+      In addition to the UGC property, codepoints in the Indic and
+      Southeast Asian scripts are also assigned
+      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
+      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
+      properties that provide more detailed information needed for
+      shaping.
+    </para>
+    <para>
+      The UISC property sub-categorizes Letters and Marks according to
+      common script-shaping behaviors. For example, UISC distinguishes
+      between consonant letters, vowel letters, and vowel marks. The
+      UIPC property sub-categorizes Mark codepoints by the relative visual
+      position that they occupy (above, below, right, left, or in
+      multiple positions).
+    </para>
+    <para>
+      Some scripts require that the text run be split into
+      syllables. What constitutes a valid syllable in these
+      scripts is specified in regular expressions, formed from the
+      Letter and Mark codepoints, that take the UISC and UIPC
+      properties into account.
+    </para>
+
+  </section>
+  
+  <section id="text-runs">
+    <title>Text runs</title>
+    <para>
+      Real-world text usually contains codepoints from a mixture of
+      different Unicode scripts (including punctuation, numbers, symbols,
+      white-space characters, and other codepoints that do not belong
+      to any script). Real-world text may also be marked up with
+      formatting that changes font properties (including the font,
+      font style, and font size).
+    </para>
+    <para>
+      For shaping purposes, all real-world text streams must be first
+      segmented into runs that have a uniform set of properties. 
+    </para>
+    <para>
+      In particular, shaping models always assume that every codepoint
+      in a text run has the same <emphasis>direction</emphasis>,
+      <emphasis>script</emphasis> tag, and
+      <emphasis>language</emphasis> tag.
+    </para>
+  </section>
+  
+  <section id="opentype-shaping-models">
+    <title>OpenType shaping models</title>
+    <para>
+      OpenType provides shaping models for the following scripts:
+    </para>
+
+    <itemizedlist>
+      <listitem>
+	<para>
+	  The <emphasis>default</emphasis> shaping model handles all
+	  scripts with no script-specific shaping model, and may also be used as a fallback for
+	  handling unrecognized scripts.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Indic</emphasis> shaping model handles the Indic
+	  scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
+	  Malayalam, Oriya, Tamil, and Telugu.
+	</para>
+	<para>
+	  The Indic shaping model was revised significantly in
+	  2005. To denote the change, a new set of <emphasis>script
+	  tags</emphasis> was assigned for Bengali, Devanagari,
+	  Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
+	  Telugu. For the sake of clarity, the term "Indic2" is
+	  sometimes used to refer to the current, revised shaping
+	  model.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Arabic</emphasis> shaping model supports
+	  Arabic, Mongolian, N'Ko, Syriac, and several other connected
+	  or cursive scripts.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Thai/Lao</emphasis> shaping model supports
+	  the Thai and Lao scripts.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Khmer</emphasis> shaping model supports the
+	  Khmer script.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Myanmar</emphasis> shaping model supports the
+	  Myanmar (or Burmese) script.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Tibetan</emphasis> shaping model supports the
+	  Tibetan script.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Hangul</emphasis> shaping model supports the
+	  Hangul script.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Hebrew</emphasis> shaping model supports the
+	  Hebrew script.
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  The <emphasis>Universal Shaping Engine</emphasis> (USE)
+	  shaping model supports scripts not covered by one of
+	  the above, script-specific shaping models, including
+	  Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
+	  Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
+	  Viet, and many others. 
+	</para>
+      </listitem>
+
+      <listitem>
+	<para>
+	  Text runs that do not fall under one of the above shaping
+	  models may still require processing by a shaping engine. Of
+	  particular note is <emphasis>Emoji</emphasis> shaping, which
+	  may involve variation-selector sequences and glyph
+	  substitution. Emoji shaping is handled by the default
+	  shaping model.
+	</para>
+      </listitem>
+
+    </itemizedlist>
+    
+  </section>
+  
+  <section id="graphite-shaping">
+    <title>Graphite shaping</title>
+    <para>
+      In contrast to OpenType shaping, Graphite shaping does not
+      specify a predefined set of shaping models or a set of supported
+      scripts.
+    </para>
+    <para>
+      Instead, each Graphite font contains a complete set of rules that
+      implement the required shaping model for the intended
+      script. These rules include finite-state machines to match
+      sequences of codepoints to the shaping operations to perform.
+    </para>
+    <para>
+      Graphite shaping can perform the same shaping operations used in
+      OpenType shaping, as well as other functions that have not been
+      defined for OpenType shaping.
+    </para>
+  </section>
+  
+  <section id="aat-shaping">
+    <title>AAT shaping</title>
+    <para>
+      In contrast to OpenType shaping, AAT shaping does not specify a 
+      predefined set of shaping models or a set of supported scripts.
+    </para>
+    <para>
+      Instead, each AAT font includes a complete set of rules that
+      implement the desired shaping model for the intended
+      script. These rules include finite-state machines to match glyph
+      sequences and the shaping operations to perform.
+    </para>
+    <para>
+      Notably, AAT shaping rules are expressed for glyphs in the font,
+      not for Unicode codepoints. AAT shaping can perform the same
+      shaping operations used in OpenType shaping, as well as other
+      functions that have not been defined for OpenType shaping.
+    </para>
+  </section>
+</chapter>