A Touch of Class

fantasai
2012-07-27

Approaching UTR50

There's been a lot of Japanese discussion lately on Twitter, and hopefully on other media, about Unicode Technical Report #50: Properties for Horizontal and Vertical Text Layout. I apologize for being absent from these discussions; I cannot read Japanese, so I have not been able to follow any of it. I can only see it streaming by, knowing that it's a conversation I ought to be participating in.

I thought I'd take some time to write up my thoughts on what should be the purpose of the Unicode data and its corresponding modes in CSS3 Writing Modes. I hope that this, together with Taro Yamamoto's proposal, can bring some clarity to our discussions and help us reach consensus on the controversial elements of UTR50.

But first, I am very grateful for the meticulous and principled work that Microsoft's Laurentiu Iancu and Dwayne Robinson have put into UTR50, which has helped us bring the draft to a point where we can have these detailed discussions. I hope that Taro Yamamoto's effort can similarly guide us through this next difficult part.

Goals of Mixed and Stacked

UTR50, paralleling CSS's text-orientation property, defines two modes for its character orientation data: stacked and mixed. Mixed is intended to be used for mixed-orientation typesetting, and stacked for an upright-only presentation.

In my mind, the purpose of Mixed Vertical Orientation (MVO) is to provide sensible, readable default orientations for plain, unmarked text.

I don't believe the goal of MVO should be to minimize the need for markup, or to assume that markup will be available.

Where markup is available, it can be used; but where markup is unavailable, the defaults must serve. The fundamental domain of the Unicode standards is plain text, and a key characteristic of the Web is the rampant creation and repurposing of content, most of which is also unmarked text. Therefore UTR50 must solve the problem of plain text layout. Above all, any such unmarked text must remain understandable no matter what, even if it is ugly. Ideally, it should also be typographically self-consistent, such that related characters, when placed in the context of each other, have consistent orientations. Imho, this should be true even if some of them, when placed individually and without context, might prefer a different posture.

Assuming the availability of markup, e.g. that numbers and units will be typeset as tate-chu-yoko, will result in very different defaults than assuming the unavailability of markup. Therefore minimizing markup, or otherwise assuming it is there, is incompatible with the goal of handling plain text. On the other hand, I believe the goal of handling unmarked text is compatible with the goal of providing a usable baseline orientation because the design principle of both is the same: consistency.

Principles of Categorization

The principles I have used thus far in breaking down Unicode into "upright" (U) and "sideways" (R) categories are as follows:

For stacked mode (SVO)
  • If it's at all sensible to put upright, then the codepoint should be upright (U).
  • If it's only sensible to put sideways, then the codepoint should be sideways (R).

For most characters, it is sensible to be able to typeset them upright. For some characters (mainly dashes and enclosing punctuation like brackets), it never makes sense to set them upright, so they should be sideways.

For mixed mode (MVO)
  1. If the codepoint has definite East Asian usage and not much (if any) non-East Asian usage, it should follow East Asian usage patterns.
  2. If the codepoint is rarely used in East Asian usage, but often used in non-East Asian contexts, it should follow non-East Asian patterns.
  3. If the codepoint is conflicted between U and R, its orientation should follow from consistency with similar characters and/or characters with which it is most often used.
  4. If the codepoint is conflicted, and consistency does not resolve, the MVO should be chosen to minimize brokenness.
  5. If the codepoint is conflicted and no arguments can be made that bias one way or another, then it should follow East Asian patterns.
  6. If the codepoint's usage is unknown, it should be R until we have further information.

Consistency is important to me, because it minimizes confusion for the reader, increases predictability for the author, and reduces the chances of blatantly inconsistent typesetting.
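
To make the cascade concrete, here is a minimal sketch of how these principles might be applied in order. It is purely illustrative: the CharInfo fields and the mvo function are hypothetical, and are not part of UTR50, of CSS, or of any existing library.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CharInfo:
        east_asian: Optional[str]   # "U"/"R" per East Asian usage, or None if unused there
        western: Optional[str]      # "U"/"R" per non-East Asian usage, or None if unused there
        related: Optional[str]      # orientation of similar or co-occurring characters
        less_broken: Optional[str]  # orientation that minimizes brokenness

    def mvo(c: CharInfo) -> str:
        """Return "U" (upright) or "R" (rotated sideways) for mixed mode."""
        if c.east_asian is not None and c.western is None:
            return c.east_asian          # 1. definite East Asian usage only
        if c.western is not None and c.east_asian is None:
            return c.western             # 2. rare in East Asian contexts
        if c.east_asian is None and c.western is None:
            return "R"                   # 6. usage unknown: R until we know more
        if c.east_asian == c.western:
            return c.east_asian          # both usages agree, so no conflict
        if c.related is not None:
            return c.related             # 3. resolve the conflict by consistency
        if c.less_broken is not None:
            return c.less_broken         # 4. choose whichever minimizes brokenness
        return c.east_asian              # 5. otherwise follow East Asian patterns

The point is not the code itself but its shape: the rules form a strict cascade, where each later rule applies only when the earlier ones fail to decide.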

Taro Yamamoto takes a slightly different approach in his principles of MVO categorization, breaking the characters down in more detail according to their origins:

  • Upright: Chinese characters, Japanese kana characters and their related marks that need neither special glyph shapes nor different orientations in Japanese vertical lines.
  • Upright: Full-width characters separately encoded from the proportional characters (in the Unicode Half-width and Full-width Forms section)
  • Upright: Symbols and abbreviations that are mere pictures or geometric shapes without any directionality.
  • Upright: Western-origin ligatures and abbreviations whose decomposed forms can be represented with ordinary Latin alphabet characters or Arabic numbers or symbols that are −90 degrees rotated in vertical lines.
  • Ambiguous: If a character has one or more similar characters, and its category is narrower or more specialized than theirs, the posture of the others should be adopted. For example, if an emoji picture and a mathematical symbol resemble each other but could have different vertical orientations due to their original source fields (emoji and mathematics, etc.), choose the vertical orientation of the greater, more popular field.
  • Sideways: Latin alphabet characters, Arabic numbers, and related proportional punctuation and marks.
  • Sideways: Proportional characters separately encoded from the full-width characters (in the Unicode Half-width and Full-width Forms section and others)
  • Sideways: Standard multi-purpose parentheses, punctuation, and symbols with directionality (arrows, symbols for musical notation, etc.).
  • Sideways: Symbols and abbreviations that originated in Western typography or writing systems.

However, his principles also fulfill the goals I have, and are consistent with the principles I outlined above. Thus I support his proposal and hope that bringing our three approaches together can result in an agreeable draft for UTR50.

Details of Contentious Categories

Looking at the details of Yamamoto's proposal, the differences from the current UTR50 data are not great. To understand these differences, and indeed to have any sense of logic rather than arbitrariness in what we are trying to create, it is useful to break them down into thematic categories.

Mathematical Alphanumerics

The first thematic category is the mathematical alphanumerics. These are stylized Latin, and sometimes Greek, letters. They are encoded separately in Unicode not to provide font styling in plaintext but because in mathematics sometimes different styles of a letter are semantically distinct. The Unicode Standard is clear that these letters are intended to be used in conjunction with—and in contrast to—the normal Latin and Greek letters. Together they all form the mathematical set of symbolic constants and variables.

Most of these letters are encoded in the "Mathematical Alphanumeric Symbols" block, U+1D400–U+1D7FF. However, for historical reasons, some of the more commonly-used letters are encoded in the "Letterlike Symbols" block, U+2100–U+214F. These, unlike regular Latin or Greek, unlike the stylized letters in the "Mathematical Alphanumeric Symbols" block, and unlike all of the mathematical symbols (general category Sm), are given a Mixed Vertical Orientation of upright, rather than sideways. Yamamoto-san's proposal puts these sideways, and I strongly agree with this: the accident of encoding order should not split apart what is meant to be a homogeneous set.
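
As an aside, and only to illustrate the point above rather than as an argument UTR50 itself makes, the compatibility decompositions in the Unicode Character Database already treat both blocks as styled variants of the ordinary letters. A quick check in Python, on a few arbitrarily chosen codepoints:

    import unicodedata

    # Letterlike Symbols (U+2100–214F) and Mathematical Alphanumeric Symbols
    # (U+1D400–1D7FF) both compatibility-decompose to plain Latin letters.
    for ch in ["\u2115", "\u211D", "\U0001D400", "\U0001D552"]:
        plain = unicodedata.normalize("NFKC", ch)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {plain!r}")

    # U+2115 DOUBLE-STRUCK CAPITAL N -> 'N'
    # U+211D DOUBLE-STRUCK CAPITAL R -> 'R'
    # U+1D400 MATHEMATICAL BOLD CAPITAL A -> 'A'
    # U+1D552 MATHEMATICAL DOUBLE-STRUCK SMALL A -> 'a'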

IPR Symbols

The second thematic category is the IPR (intellectual property rights) symbols: © ® ℗ ™ ℠. Usage of these is highly context-dependent, and both orientations are common. Yamamoto's proposal biases to sideways, because these symbols derive from Western letters. UTR50 currently biases to upright, because in purely East Asian usage (no Latin), they would be upright. I personally have no strong opinion, except that this set should remain internally consistent.

Footnote Symbols

The third thematic category is footnote symbols and similar characters: § ¶ ‖ † ‡ ⁂ ⁑. Yamamoto's proposal puts these sideways. At first I was unsure of this; however, the single asterisk (*) is already sideways; the double vertical line (‖) has, in almost all fonts, a rotated 'vert' alternate glyph that effectively prevents it from ever being upright in vertical text; the dagger can be used as a mathematical operator; and the section sign (§) is also used with numbers (which are sideways). Yamamoto asserts that these footnote symbols are not especially common in Japanese typography, where superscript numbers are more frequently used. So altogether, this makes me lean towards making this set sideways.

Letter-derived Symbols

The fourth thematic category is letter-derived symbols: the slashed abbreviations (℀ ℁ ℅ ℆ ⅍), the per sign (⅌), musical verse notation (℣ ℟), architectural symbols (℄ ⅊), and some packaging-related symbols (℞ ℮). Yamamoto's principles place these sideways because they derive from Latin letters and have primarily Western usage.

Units

The fifth thematic category is units: the apothecary measures (℈ ℔ ℥), the scientific units (℧ Ω K Å), and script small l (ℓ), which is also used in place of L for liters. Yamamoto puts these sideways because they are Western-derived; Murakami puts these sideways because they are often used with European digits. I can agree for both of these reasons, and also because K and Å normalize (under both NFC and NFD) to the letters K and Å, which then makes them consistent with other units that are letters (like the apothecary measure "ʒ" or the scientific unit "m").
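
The normalization claim is easy to verify; here is a quick Python check, which demonstrates only the decomposition fact mentioned above, nothing about orientation data:

    import unicodedata

    # U+212A KELVIN SIGN and U+212B ANGSTROM SIGN canonically decompose to the
    # ordinary letters, so they survive neither NFC nor NFD as distinct characters.
    print(unicodedata.normalize("NFC", "\u212A"))   # 'K' (U+004B)
    print(unicodedata.normalize("NFC", "\u212B"))   # 'Å' (U+00C5)
    print(unicodedata.normalize("NFD", "\u212B"))   # 'A' followed by U+030A COMBINING RING ABOVE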

Per 10ⁿ

The sixth thematic category in my analysis is the per-mille and per-ten-thousand signs: ‰ ‱. These are, in usage, similar to the units. Murakami argues that these are most often used together with Western digits, and so should follow their orientation. This makes sense to me. (Yamamoto further argues that their glyphs are often too wide to be typeset upright anyway.)

Curly Quotes

The last, and most contentious, thematic category is the curly quotes: ‘’‚‛“”„‟. UTR50 puts all but the single quotes sideways. Yamamoto puts all of them sideways because they derive from Western typography. They are not often used in vertical East Asian typography, because other forms of quote marks are used instead; and although quote marks are all originally Western, one could argue that, as punctuation in vertical lines, they are more appropriately upright. However, the single right quote is also used as the apostrophe, in which role it appears in the middle of words and names. Since foreign names and titles are a common use case for mixing Latin into East Asian text, I feel strongly that the apostrophe should match the orientation of the Latin letters around it (i.e. be sideways), and thus that all single quotes should be sideways for consistency, so that such proper nouns work well.

(Double quotes have no such double usage, however, and could safely be made upright if the inconsistency with single quotes did not confuse people.)

Conclusion

These results might not match ideal expectations in East Asian text when the characters are used in isolation. However, aside from the curly quotes, none of these characters are particularly common. Placing them sideways is sometimes correct and sometimes awkward, but it is never incomprehensible.

In conclusion, I think Yamamoto's principles, together with the ones I outlined above for resolving ambiguous cases, create a predictable framework for determining the orientation of Unicode characters. I might not be 100% comfortable with all of his categorizations, but I recognize that if we followed his principles straight through, we would have a solid foundation in UTR50 and an easy way to categorize codepoints added in the future.