Line Break Transformation Rules v1
When line breaks are collapsible, they are
transformed into either a space (U+0020), a zero-width space (U+200B),
or no character, depending on the script of the first character on each
side of the line break. The script of each character is determined by
Unicode [[UNICODE]]. Characters such as punctuation that belong to
the COMMON and INHERITED scripts are ignored; the next character is
examined instead.
- If a character on either side of the line break belongs to a script
in which the space character (U+0020) is used as a word separator,
then the line break is converted to a space (U+0020).
Examples of such scripts include Latin, Arabic, and Hangul.
- Otherwise, if a character on either side of the line break belongs
to a script (other than Han, Hiragana, and Katakana) in which there
is no visible word separator, then the line break is converted to a
zero-width space (U+200B).
Examples of such scripts include Thai and Khmer.
- Otherwise, if a character on either side of the line break belongs
to the Han, Hiragana, or Katakana scripts, in which there is no word
separator, then the line break is removed.
- Otherwise, the line break is converted to a space (U+0020).
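
The rules above can be sketched in Python as follows. This is an illustration only, not a conforming implementation. The function names, the three script tables, and the script lookup are all assumptions made for the example: Python's standard library does not expose the Unicode Script property, so the sketch guesses the script from the first word of the character name and treats every non-letter character as ignored, which only approximates the COMMON and INHERITED scripts.

```python
import unicodedata

# Hypothetical, deliberately incomplete script tables, keyed by the first
# word of the Unicode character name. A real implementation would consult
# the Unicode Script property instead of character names.
SPACE_SEPARATED = {"LATIN", "ARABIC", "HANGUL", "GREEK", "CYRILLIC", "HEBREW"}
NO_VISIBLE_SEPARATOR = {"THAI", "KHMER", "LAO", "MYANMAR"}
CJK = {"CJK", "HIRAGANA", "KATAKANA"}


def script_class(ch):
    """Classify one character; return None if it should be ignored."""
    # Assumption: anything that is not a letter (punctuation, digits, marks,
    # spaces) stands in for the COMMON and INHERITED scripts and is skipped.
    if not unicodedata.category(ch).startswith("L"):
        return None
    name = unicodedata.name(ch, "")
    prefix = name.split(" ")[0] if name else ""
    if prefix in SPACE_SEPARATED:
        return "space"
    if prefix in NO_VISIBLE_SEPARATOR:
        return "zwsp"
    if prefix in CJK:
        return "cjk"
    return "other"


def first_significant(chars):
    """Return the class of the first non-ignored character, or None."""
    for ch in chars:
        cls = script_class(ch)
        if cls is not None:
            return cls
    return None


def transform_line_break(before, after):
    """Return the replacement for a collapsible line break between two runs."""
    # Examine the nearest non-ignored character on each side of the break.
    classes = {first_significant(reversed(before)), first_significant(after)}
    if "space" in classes:
        return " "        # a space-separated script on either side
    if "zwsp" in classes:
        return "\u200b"   # a script with no visible word separator
    if "cjk" in classes:
        return ""         # Han, Hiragana, or Katakana: remove the break
    return " "            # fallback: convert to a space


def join_lines(before, after):
    """Join two lines, applying the line break transformation between them."""
    return before + transform_line_break(before, after) + after
```

Because the rules are tested in order, a break between Han text and Latin text becomes a space, and a break between Thai text and Han text becomes a zero-width space; the break is removed only when no space-separated or Thai-like script appears on either side.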