This module describes the typesetting controls of CSS; that is, the features of CSS that control the translation of source text to formatted, line-wrapped text. Various CSS properties provide control over case transformation, white space collapsing, text wrapping, line breaking rules and hyphenation, alignment and justification, spacing, and indentation. Further information about the typesetting requirements of various languages and writing systems around the world can be found in the Internationalization Working Group’s Typography Index.

Authors should language-tag their content accurately for the best typographic behavior.

The content language of an element is the (human) language the element is declared to be in, according to the rules of the document language. For example, the rules for determining the content language of an HTML element are defined in [HTML], and the rules for determining the content language of an XML element are defined in [XML10]. Note that the content language of an element can be unknown: for example, untagged content, or content in a document language that does not have a language-tagging facility, is considered to have an unknown content language.

Note: Authors can tag content using the global lang attribute in HTML, the universal xml:lang attribute in XML, and the HTTP Content-Language header for content served over HTTP.

The content language an element is declared to be in also identifies the specific written form of that language used in that element, known as the content writing system.

Note: Depending on the document language's facilities for identifying the content language, information about the writing system may only be carried implicitly. That is typically the case with the [BCP47] language tag used in [HTML], although it can optionally indicate the writing system explicitly using a script subtag.
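To illustrate the difference between an implicit and an explicit writing system, the following is a minimal sketch of pulling the optional script subtag out of a language tag. The function name `script_subtag` is hypothetical, and the parsing is deliberately simplified: it assumes a well-formed tag and only handles the language, extended-language, and script positions.

```python
from typing import Optional


def script_subtag(tag: str) -> Optional[str]:
    """Return the explicit script subtag of a language tag, or None.

    Simplified sketch: assumes a well-formed tag and does not validate
    the subtags against any registry.
    """
    subtags = tag.split("-")
    # Tags starting with a single-letter subtag (e.g. private use "x-...")
    # are not handled by this sketch.
    if len(subtags[0]) == 1:
        return None
    rest = subtags[1:]
    # Skip up to three extended-language subtags (three ASCII letters each).
    skipped = 0
    while rest and skipped < 3 and len(rest[0]) == 3 and rest[0].isascii() and rest[0].isalpha():
        rest = rest[1:]
        skipped += 1
    # A script subtag, if present, comes next and is exactly four ASCII letters.
    if rest and len(rest[0]) == 4 and rest[0].isascii() and rest[0].isalpha():
        return rest[0].title()
    return None


print(script_subtag("zh-Hant-TW"))  # Hant
print(script_subtag("en-US"))       # None
```

When the function returns `None`, the writing system is carried only implicitly by the language (and possibly region) subtags.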

Language and writing system conventions can affect line breaking, hyphenation, justification, glyph selection, and many other typographic effects. In CSS, language-specific typographic tailorings are only applied when the content language is known (declared). Therefore, higher quality typography requires authors to communicate to the UA the correct linguistic context of the text in the document.

More information about language tags and their interpretation, particularly the use of script tags for atypical language + writing-system combinations, can be found in Appendix F. Tagging Content by Writing System.

1.4. Characters and Letters

The basic unit of typesetting is the character. However, because writing systems are not always as simple as the basic English alphabet, what a character actually is depends on the context in which the term is used. For example, in Hangul (the Korean writing system), each square representation of a syllable (e.g. 한=Han) can be considered a character. However, the square symbol is really composed of multiple letters each representing a phoneme (e.g. ㅎ=h, ㅏ=a, ㄴ=n) and these also could each be considered a character.

A basic unit of computer text encoding, for any given encoding, is also called a character, and depending on the encoding, a single encoding character might correspond to the entire pre-composed syllabic character (e.g. 한), to the individual phonemic character (e.g. ㅎ), or to smaller units such as a base letterform (e.g. ㅇ) and any combining marks that vary it (e.g. extra strokes that represent aspiration).

In turn, a single encoding character can be represented in the data stream as one or more bytes; and in programming environments one byte is sometimes also called a character.
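The three senses of "character" described above can be made concrete with a short sketch using Python's standard `unicodedata` module. It is only an illustration: the choice of NFD normalization and of the UTF-8 and UTF-16 encodings is an assumption made for the example, not something this specification prescribes.

```python
import unicodedata

precomposed = "한"  # one pre-composed syllable, U+D55C
decomposed = unicodedata.normalize("NFD", precomposed)

# One syllabic character, encoded as a single code point.
print(len(precomposed))  # 1

# The same syllable as three phonemic letters (conjoining jamo).
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+1112', 'U+1161', 'U+11AB']

# The same text counted in bytes, which depends on the encoding.
print(len(precomposed.encode("utf-8")))      # 3
print(len(precomposed.encode("utf-16-le")))  # 2
print(len(decomposed.encode("utf-8")))       # 9
```

Note that the decomposed form uses conjoining jamo code points, whereas standalone letters such as ㅎ are encoded separately as compatibility jamo.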

Therefore the term character is fairly ambiguous where technical precision is required.

For text layout, we will refer to the typographic character unit as the basic unit of text. Even within the realm of text layout, the relevant character unit depends on the operation. For example, line-breaking and letter-spacing will segment a sequence of Thai characters that include U+0E33 THAI CHARACTER SARA AM differently; or the behaviour of a conjunct consonant in a script such as Devanagari may depend on the font in use. So the typographic character represents a unit of the writing system, such as a Latin alphabetic letter (including its diacritics), a Hangul syllable, a Chinese ideographic character, or a Myanmar syllable cluster, that is indivisible with respect to a particular typographic operation (line-breaking, first-letter effects, tracking, justification, vertical arrangement, etc.).

Unicode Standard Annex #29: Text Segmentation defines a unit called the grapheme cluster which approximates the typographic character. A UA must use the extended grapheme cluster (not legacy grapheme cluster), as defined in [UAX29], as the basis for its typographic character unit. However, the UA should tailor the definitions as required by typographic tradition, since the default rules are not always appropriate or ideal, and is expected to tailor them differently depending on the operation as needed.
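As a rough illustration of grouping code points into larger units, the sketch below attaches each combining mark to the preceding base character. This is explicitly not the [UAX29] extended grapheme cluster algorithm: the function name `approximate_clusters` is hypothetical, and the rule used (General Category starting with "M") is a simplifying assumption that covers only the most basic case.

```python
import unicodedata
from typing import List


def approximate_clusters(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it.

    Simplified sketch only; a conforming UA would start from the extended
    grapheme cluster rules in UAX #29 and tailor them per operation.
    """
    clusters: List[str] = []
    for ch in text:
        # Categories Mn, Mc and Me are the Unicode "Mark" categories.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a".
print(len(approximate_clusters("e\u0301a")))  # 2
```

The sketch shows why tailoring matters: it would treat U+0E33 THAI CHARACTER SARA AM as a separate unit because that character is not in a Mark category, whereas the extended grapheme cluster rules keep it with the preceding character.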

A typographic letter unit or letter for the purpose of this specification is a typographic character unit belonging to one of the Letter or Number general categories in Unicode [UAX44]. See Character Properties for how to determine the Unicode properties of a typographic character unit.
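The definition above can be sketched as a simple check on the General Category. The function name `is_typographic_letter_unit` is hypothetical, and the sketch assumes that the properties of a multi-code-point unit are taken from its first code point; the Character Properties section is the authority on how the properties are actually determined.

```python
import unicodedata


def is_typographic_letter_unit(unit: str) -> bool:
    """Return True if the unit falls in a Letter (L*) or Number (N*) category.

    Assumption for this sketch: the first code point of the unit decides.
    """
    if not unit:
        return False
    major_class = unicodedata.category(unit[0])[0]
    return major_class in ("L", "N")


print(is_typographic_letter_unit("e\u0301"))  # True  (Ll base + combining mark)
print(is_typographic_letter_unit("7"))        # True  (Nd)
print(is_typographic_letter_unit("!"))        # False (Po)
```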

The rendering characteristics of a typographic character unit divided by an element boundary are undefined. Ideally each component should be rendered according to the formatting requirements of its respective element’s properties while maintaining correct shaping and positioning of the typographic character unit as a whole. However, depending on the nature of the formatting differences between its parts and the capabilities of the font technology in use, this is not always possible. Therefore such a typographic character unit may be rendered as belonging to either side of the boundary, or as some approximation of belonging to both. Authors are forewarned that dividing grapheme clusters by element boundaries may give inconsistent or undesired results.

1.5. Text Processing

CSS is built on [UNICODE]. UAs that support Unicode must adhere to all normative requirements of the Unicode Core Standard, except where explicitly overridden by CSS. UAs that use a different encoding are not explicitly supported by the CSS specifications; they are, however, expected to fulfill the same text handling requirements by assuming an appropriate mapping between that encoding and Unicode.

A block container element that directly contains inline-level content—such as inline boxes, atomic inlines, and text runs—establishes an inline formatting context. The block container also generates a root inline box, which is an anonymous inline box that holds all of its inline-level contents. The root inline box inherits from its parent block container, but is otherwise unstyleable.

In an inline formatting context, content is laid out along the inline axis, ordered according to the Unicode bidirectional algorithm and its controls [CSS-WRITING-MODES-3] and distributed according to the typesetting controls in [CSS-TEXT-3]. Inline-axis margins, borders, and padding are respected between inline-level boxes (and their margins do not collapse). The rectangular area that contains the boxes that form a line of inline-level content is called a line box.

Note: Line boxes and inline boxes and inline-level boxes are each different things! See [CSS-DISPLAY-3] for an in-depth discussion of box types and related terminology.

Line boxes are created as needed to hold inline-level content within an inline formatting context. When an inline box exceeds the logical width of a line box, or contains a forced line break, it is split (see CSS Text 3 §5 Line Breaking and Word Boundaries) into several fragments [CSS-BREAK-3], which are distributed across multiple line boxes. Like column boxes in multi-column layout [CSS-MULTICOL-1], line boxes are fragmentation containers generated by their formatting context, and are not part of the CSS box tree.
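The distribution of fragments across line boxes can be pictured with a toy model. The sketch below is a greedy fill under strong assumptions: the content has already been split into unbreakable fragments of known width, every line box has the same logical width, and forced line breaks are ignored. The function name `distribute` is hypothetical, and real line breaking follows the rules in CSS Text 3 rather than this simplification.

```python
from typing import List


def distribute(fragment_widths: List[float], line_width: float) -> List[List[float]]:
    """Greedily place unbreakable fragments into successive line boxes.

    Toy model: a fragment wider than the line box is placed alone on its
    own line and overflows it.
    """
    lines: List[List[float]] = []
    used = 0.0
    for width in fragment_widths:
        # Open a new line box when there is none yet or the fragment
        # would not fit in the remaining space of the current one.
        if not lines or used + width > line_width:
            lines.append([])
            used = 0.0
        lines[-1].append(width)
        used += width
    return lines


print(distribute([40, 40, 40, 40], 100))  # [[40, 40], [40, 40]]
```

Line boxes appear only as the content demands them, which mirrors the statement that they are created as needed rather than being part of the box tree.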

Note: Inline boxes can also be split into several fragments within the same line box due to bidirectional text processing. See [CSS-WRITING-MODES-3].

Line boxes are stacked as the direct contents of the block container box in the block flow direction and aligned within this container as specified by align-content [CSS-ALIGN-3]. Thus, an inline formatting context consists of a stack of line boxes. Line boxes are stacked with no separation (except as specified elsewhere, e.g. for float clearance) and they never overlap.

In general, the line-left edge of a line box touches the line-left edge of its containing block and the line-right edge touches the line-right edge of its containing block, and thus the logical width of a line box is equal to the inner logical width of its containing block (i.e. the block container’s content box). However, floating boxes or initial letter boxes can come between the containing block edge and the line box edge, reducing the space available to, and thus the logical width of, any such impacted line boxes. (See CSS2 §9.4.2, CSS2 §9.5, and §5 Initial Letters.)
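The arithmetic in the paragraph above can be written out as a small sketch. The function and parameter names are hypothetical, and clamping the result at zero is a simplifying assumption for the example; how a UA handles a line box with insufficient room is governed by the float placement rules, not by this sketch.

```python
def line_box_logical_width(
    containing_block_inner_width: float,
    line_left_intrusion: float = 0.0,
    line_right_intrusion: float = 0.0,
) -> float:
    """Logical width left for a line box after floats or initial letters.

    The intrusions are the widths taken up on each side between the
    containing block edge and the line box edge.
    """
    available = containing_block_inner_width - line_left_intrusion - line_right_intrusion
    # Assumption for this sketch: never report a negative width.
    return max(0.0, available)


# No intrusions: the line box spans the full inner width.
print(line_box_logical_width(400))       # 400
# A 120-unit-wide float on the line-left side shortens the line box.
print(line_box_logical_width(400, 120))  # 280.0
```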

Within the line box, inline-level boxes can be aligned along the block axis in different ways: their over or under edges can be aligned, or the baselines of text within them can be aligned. See vertical-align and its longhands. The logical height of a line box is fitted to its contents by the rules given in § 3.4 Line Sizing Containment: the line-sizing property.

Line boxes that contain no text, no preserved white space, no inline boxes with non-zero margins, padding, or borders, and no other in-flow content (such as atomic inlines or ruby annotations), and that do not end with a preserved newline, must be treated as zero-height line boxes for the purposes of determining the positions of any elements inside of them (such as absolutely positioned boxes), and must be treated as not existing for any other purpose (such as collapsing margins).
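The conditions in the rule above amount to a conjunction of "nothing here" checks, which can be sketched as a predicate. The `LineBoxContents` record and its field names are hypothetical stand-ins for whatever a UA actually tracks about a line box; only the combination of conditions is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class LineBoxContents:
    """Hypothetical summary of what a line box holds."""
    has_text: bool = False
    has_preserved_white_space: bool = False
    has_inline_box_with_margin_border_padding: bool = False
    has_other_in_flow_content: bool = False  # e.g. atomic inlines, ruby annotations
    ends_with_preserved_newline: bool = False


def is_treated_as_zero_height(contents: LineBoxContents) -> bool:
    """True when every condition listed in the rule holds."""
    return not (
        contents.has_text
        or contents.has_preserved_white_space
        or contents.has_inline_box_with_margin_border_padding
        or contents.has_other_in_flow_content
        or contents.ends_with_preserved_newline
    )


print(is_treated_as_zero_height(LineBoxContents()))               # True
print(is_treated_as_zero_height(LineBoxContents(has_text=True)))  # False
```

A line box for which the predicate is true would still provide a position for elements inside it, such as absolutely positioned boxes, but would be ignored for purposes such as margin collapsing.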