Many Unicode characters are used to control the interpretation or display of text, but these characters themselves have no visual or spatial representation. For example, the null character (U+0000 NULL) is used in C-programming application environments to indicate the end of a string of characters. In this way, these programs only require a single starting memory address for a string (as opposed to a starting address and a length), since the string ends once the program reads the null character.
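The null-terminated convention can be illustrated with a short Python sketch. It is only a model of the idea: `read_c_string` is a hypothetical helper, and a `bytes` object stands in for raw memory.

```python
def read_c_string(buffer: bytes, start: int = 0) -> bytes:
    """Return the bytes from `start` up to, but not including, the first NUL.

    Only a starting offset is needed; the length is discovered by scanning
    for the terminator, as a C program would.
    """
    end = buffer.index(b"\x00", start)  # raises ValueError if no terminator exists
    return buffer[start:end]


# Two strings stored back to back, each ended by U+0000.
memory = b"hello\x00world\x00"
print(read_c_string(memory))     # b'hello'
print(read_c_string(memory, 6))  # b'world'
```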
In the narrowest sense, a control code is a character with the general category Cc, which comprises the C0 and C1 control codes, a concept defined in ISO/IEC 2022 and inherited by Unicode, with the most common set being defined in ISO/IEC 6429. Control codes are handled distinctly from ordinary Unicode characters, for example, by not being assigned character names (although they are assigned normative formal aliases).[1] In a broader sense, other non-printing format characters, such as those used in bidirectional text, are also referred to as control characters by software;[2] these are mostly assigned to the general category Cf (format), used for format effectors introduced and defined by Unicode itself.
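As a rough illustration of the narrow and broad senses, the general category can be inspected with Python's standard `unicodedata` module. The helper name `control_kind` is an assumption made for this sketch.

```python
import unicodedata


def control_kind(ch: str) -> str:
    """Classify a character by the two categories discussed above."""
    category = unicodedata.category(ch)
    if category == "Cc":
        return "control code (Cc)"
    if category == "Cf":
        return "format character (Cf)"
    return "other (" + category + ")"


print(control_kind("\u0000"))  # control code (Cc), a C0 control
print(control_kind("\u0085"))  # control code (Cc), a C1 control
print(control_kind("\u200e"))  # format character (Cf), LEFT-TO-RIGHT MARK
print(control_kind("A"))       # other (Lu)

# Cc characters have no character name, so a default must be supplied.
print(unicodedata.name("\u0000", None))  # None
```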
Category "Cc" control codes can serve a variety of purposes, not limited to format effectors: for example, the default ASCII C0 set includes six format effectors (BS, HT, LF, VT, FF and CR), ten transmission controls, four device controls, four information separators and eight other control codes.[4] Most of these characters play no explicit role in Unicode text handling, and are used only by higher-level protocols such as those used by terminal emulators. Certain characters are commonly used for formatting or sentinel purposes:
In an attempt to simplify the several newline characters used in legacy text, Unicode introduces its own newline characters to separate either lines or paragraphs: U+2028 LINE SEPARATOR (abbreviated LS or LSEP) and U+2029 PARAGRAPH SEPARATOR (abbreviated PS or PSEP).
Like CR and LF, LS and PS are effectors for text formatting; unlike CR and LF, they are not treated as "control codes" for ECMA-35/ECMA-48 purposes (category Cc), rather having semantics defined entirely by Unicode itself. They are assigned to sui generis Unicode categories Zl and Zp respectively, under the major category Z (separator) used for certain whitespace characters.
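A small Python sketch shows the categories of LS and PS and one way a program might use them to split text. The function `paragraphs_and_lines` is a hypothetical helper, not part of any standard.

```python
import unicodedata

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def paragraphs_and_lines(text: str) -> list:
    """Split text into paragraphs on PS, then each paragraph into lines on LS."""
    return [paragraph.split(LS) for paragraph in text.split(PS)]


print(unicodedata.category(LS))  # Zl
print(unicodedata.category(PS))  # Zp
print(unicodedata.category("\n"))  # Cc, for comparison

sample = "line one" + LS + "line two" + PS + "second paragraph"
print(paragraphs_and_lines(sample))
# [['line one', 'line two'], ['second paragraph']]
```

Python's built-in `str.splitlines()` also treats both characters as line boundaries, alongside CR, LF and several other legacy separators.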
Unicode previously included 128 characters, now deprecated, for language tags. These characters essentially mirrored the 128 ASCII characters but were used to identify the subsequent text as belonging to a particular language according to BCP 47. For example, to indicate subsequent text as the variant of English as written in the United States, the sequence U+E0001 LANGUAGE TAG, U+E0065 TAG LATIN SMALL LETTER E, U+E006E TAG LATIN SMALL LETTER N, U+E002D TAG HYPHEN-MINUS, U+E0075 TAG LATIN SMALL LETTER U and U+E0073 TAG LATIN SMALL LETTER S would have been used.
These language tag characters would not be displayed themselves. However, they would provide information for text processing or even for the display of other characters. For example, the display of Unihan ideographs might have substituted different glyphs if the language tags indicated Korean than if the tags indicated Japanese. As another example, language tags might have influenced the display of the decimal digits 0 through 9, depending on the language in which they appeared.
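Because the tag characters mirror ASCII at a fixed offset, the en-us sequence described above can be reproduced by adding 0xE0000 to each ASCII code point. The following is a minimal sketch of that now-deprecated mechanism; `encode_language_tag` is a hypothetical name chosen for illustration.

```python
TAG_BASE = 0xE0000            # tag characters sit at ASCII code point + 0xE0000
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG (deprecated)


def encode_language_tag(tag: str) -> str:
    """Build the invisible tag sequence for a language tag such as 'en-us'."""
    for ch in tag:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError("only printable ASCII can be mirrored as tag characters")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(ch)) for ch in tag)


sequence = encode_language_tag("en-us")
print(["U+%04X" % ord(ch) for ch in sequence])
# ['U+E0001', 'U+E0065', 'U+E006E', 'U+E002D', 'U+E0075', 'U+E0073']
```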
Three formatting characters provide support for interlinear annotation (U+FFF9 INTERLINEAR ANNOTATION ANCHOR, U+FFFA INTERLINEAR ANNOTATION SEPARATOR, U+FFFB INTERLINEAR ANNOTATION TERMINATOR). This may be used for providing notes that would typically be displayed between the lines of other text. Unicode considers such annotation to be rich text and recommends using other protocols for such annotation. The W3C Ruby markup recommendation is an example of an alternate protocol supporting more advanced interlinear annotation.
However, directionality may not be detected correctly if left-to-right text is quoted at the beginning of a right-to-left paragraph (or vice versa),[2] and the support for bidirectional text becomes even more complicated when text flowing in opposite directions is embedded hierarchically, for example if an English text quotes an Arabic phrase that in turn quotes an English phrase. Other situations may also complicate this, such as when an author wants the left-to-right characters overridden so that they flow from right-to-left. While these situations are fairly rare, Unicode provides twelve characters to help control these embedded bidirectional text levels up to 125 levels deep:[9]
Many characters map to alternate glyphs depending on the context. For example, Arabic and Latin cursive characters substitute different glyphs to connect glyphs together depending on whether the character is the initial character in a word, the final character, a medial character or an isolated character. These types of glyph substitution are easily handled by the context of the character with no other authoring input involved. Authors may also use special-purpose characters such as joiners and non-joiners to force an alternate form of glyph where it would not otherwise appear. Ligatures are similar instances where glyphs may be substituted simply by turning ligatures on or off as a rich text attribute.
However, for other glyph substitution, the author's intent may need to be encoded with the text and cannot be determined contextually. This is the case with characters or glyphs referred to as gaiji, where different glyphs are used for the same character, either historically or for ideographs in family names. This is one of the gray areas in distinguishing between a glyph and a character: if a family name differs slightly from the ideograph character it derives from, is that a simple glyph variant or a character variant? As of Unicode 3.2 and 4.0, the character set includes 256 variation selectors so that these combining mark characters can select from 256 possible character/glyph variations for the preceding character.
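The 256 variation selectors occupy two ranges: U+FE00 to U+FE0F (VS1 to VS16) and U+E0100 to U+E01EF (VS17 to VS256). A brief sketch of appending a selector to a base character follows; `with_variation` is a hypothetical helper, and whether a given sequence actually changes the rendered glyph depends on the font and on the sequences Unicode has registered.

```python
# VS1..VS16 followed by VS17..VS256, so index n - 1 holds selector number n.
VARIATION_SELECTORS = (
    [chr(cp) for cp in range(0xFE00, 0xFE10)]
    + [chr(cp) for cp in range(0xE0100, 0xE01F0)]
)


def with_variation(base: str, number: int) -> str:
    """Append variation selector `number` (1 to 256) after a base character."""
    if not 1 <= number <= 256:
        raise ValueError("variation selector number must be between 1 and 256")
    return base + VARIATION_SELECTORS[number - 1]


print(len(VARIATION_SELECTORS))  # 256
print(["U+%04X" % ord(ch) for ch in with_variation("\u2764", 16)])
# ['U+2764', 'U+FE0F']
```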
Unicode provides graphic characters for representing C0 control codes (and space and a generic newline) in the Control Pictures block. They are visual representations, not the actual control codes themselves. There are no equivalent characters for the C1 control codes.
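Since the Control Pictures block begins at U+2400 and follows the order of the C0 set, a visible stand-in for a C0 control code can be computed by adding 0x2400 to its code point. The sketch below uses a hypothetical function name, `visualize_controls`, and also maps DEL to U+2421 SYMBOL FOR DELETE.

```python
def visualize_controls(text: str) -> str:
    """Replace C0 control codes and DEL with their Control Pictures symbols."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x1F:
            out.append(chr(0x2400 + cp))  # U+2400..U+241F mirror U+0000..U+001F
        elif cp == 0x7F:
            out.append("\u2421")          # SYMBOL FOR DELETE
        else:
            out.append(ch)                # C1 controls have no picture; keep as-is
    return "".join(out)


print(visualize_controls("a\tb\n"))  # a␉b␊
```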
In particular, I want to implement something for the web, i.e., I would like to display correctly formatted hieroglyphs without having to generate an image on the backend. Ideally, something like ??? would display correctly.
Peter Constable's comments are correct: rendering Egyptian hieroglyphs in fonts is very complex and takes time. There is a working group which combines the talents of Egyptologists and Unicode experts to ensure that the desired rendering capabilities are developed. We expect fonts to start becoming available during 2023 after platforms have had time to update to Unicode 15. As Peter pointed out, github.com/microsoft/font-tools is a good place to watch for progress.
One reason for that may have been the ongoing requests to add further control characters to provide richer support for combining hieroglyphics, so there was a reluctance to commit to doing anything while those proposed enhancements were still being discussed. However, those changes have now been approved (see Item 4 "Egyptian Hieroglyphs", pages 6 through 10).
Hopefully those changes will be included in Unicode 15 which is due to be released this September, but I couldn't find any documentation confirming that. If not, you may face a long wait! In the meantime I think you are stuck with using images, although that approach may not be practicable, depending on your requirements.
See this thread, Support Egyptian Hieroglyph Format Controls #1469, which confirms that there is no current support for the existing (Unicode 12) control characters with Google's Noto Sans Egyptian Hieroglyphs font. A comment near the end of the thread discussing support for Egyptian hieroglyph Format controls once Unicode 15 is released states "we will consider it along with all of the other Unicode changes, bug fixes, etc...".
This annex describes specifications for recommended defaults for the use of Unicode in the definitions of general-purpose identifiers, immutable identifiers, hashtag identifiers, and in pattern-based syntax. It also supplies guidelines for use of normalization with identifiers.
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.
A common task facing an implementer of the Unicode Standard is the provision of a parsing and/or lexing engine for identifiers, such as programming language variables or domain names. There are also realms where identifiers need to be defined with an extended set of characters to align better with what end users expect, such as in hashtags.
To assist in the standard treatment of identifiers in Unicode character-based parsers and lexical analyzers, a set of specifications is provided here as a basis for parsing identifiers that contain Unicode characters. These specifications include:
These guidelines follow the typical pattern of identifier syntax rules in common programming languages, by defining an ID_Start class and an ID_Continue class and using a simple BNF rule for identifiers based on those classes; however, the composition of those classes is more complex and contains additional types of characters, due to the universal scope of the Unicode Standard.
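The start-then-continue pattern can be sketched in Python as follows. This is an approximation rather than an implementation of the annex: the Python standard library does not expose ID_Start and ID_Continue directly, so the sketch borrows `str.isidentifier()`, which follows the closely related XID_Start and XID_Continue classes and additionally accepts an underscore as a start character. The function names are hypothetical.

```python
def is_id_start(ch: str) -> bool:
    """Approximate ID_Start: a single character that can begin an identifier."""
    return ch.isidentifier()


def is_id_continue(ch: str) -> bool:
    """Approximate ID_Continue: a character that is valid after a start character."""
    return ("a" + ch).isidentifier()


def is_identifier(text: str) -> bool:
    """Apply the rule: identifier := start continue*"""
    if not text:
        return False
    return is_id_start(text[0]) and all(is_id_continue(ch) for ch in text[1:])


print(is_identifier("naïve"))  # True
print(is_identifier("变量"))    # True
print(is_identifier("x2"))     # True
print(is_identifier("2fast"))  # False, a digit continues but does not start
print(is_identifier("a-b"))    # False, a hyphen is in neither class
```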