The Unicode Consortium
Industry: Computer; Software
Number of terms: 11048
Number of blossaries: 0
Company Profile:
The Unicode Consortium or Unicode Inc. is a not-for-profit organization that coordinates the development of the Unicode standard. Its stated goal is to eventually enable computers to operate in all languages from around the world. The consortium develops and publishes a list of freely-available ...
The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
* Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units, that is, octets.
* A code unit is also referred to as a code value in the information industry.
* In the Unicode Standard, specific values of some code units cannot be used to represent an encoded character in isolation. This restriction applies to isolated surrogate code units in UTF-16 and to the bytes 80–FF in UTF-8. Similar restrictions apply for the implementations of other character encoding standards; for example, the bytes 81–9F and E0–FC in SJIS (Shift-JIS) cannot represent an encoded character by themselves.
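As a rough illustration of the code unit sizes described above, the following Python sketch counts the code units a piece of text needs in each encoding form and checks which single bytes can be decoded as UTF-8 on their own. The helper names `code_unit_counts` and `decodes_in_isolation` are made up for this example, and the little-endian codecs are chosen only so that no byte order mark is added to the output.

```python
def code_unit_counts(text: str) -> dict:
    """Return how many code units `text` needs in each Unicode encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


def decodes_in_isolation(byte_value: int) -> bool:
    """Return True if a single byte is, by itself, valid UTF-8."""
    try:
        bytes([byte_value]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+0061, U+00E9, U+6F22 and U+1F600 need different numbers of code units.
for ch in ["a", "\u00e9", "\u6f22", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))

# Bytes 00-7F stand alone; bytes 80-FF never represent a character in isolation.
print(decodes_in_isolation(0x41), decodes_in_isolation(0x80))
```

Only the UTF-32 count stays constant across these examples; the UTF-8 and UTF-16 counts grow with the code point value.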
An ordered sequence of one or more code units.
* When the code unit is an 8-bit unit, a code unit sequence may also be referred to as a byte sequence.
* A code unit sequence may consist of a single code unit.
* In the context of programming languages, the value of a string data type basically consists of a code unit sequence. Informally, a code unit sequence is itself just referred to as a string, and a byte sequence is referred to as a byte string. Care must be taken in making this terminological equivalence, however, because the formally defined concept of a string may have additional requirements or complications in programming languages. For example, a string in the C language is accessed through a pointer to char and is conventionally terminated with a null character. In object-oriented languages, a string is a complex object, with associated methods, and its value may or may not consist of merely a code unit sequence.
* Depending on the structure of a character encoding standard, it may be necessary to use a code unit sequence (of more than one unit) to represent a single encoded character. For example, the code unit in SJIS is a byte: encoded characters such as “a” can be represented with a single byte in SJIS, whereas ideographs require a sequence of two code units. The Unicode Standard also makes use of code unit sequences whose length is greater than one code unit.
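The SJIS example above can be reproduced with Python's built-in `shift_jis` codec. This is only a sketch: the helper names `byte_sequence` and `utf16_code_units` are invented for the illustration, and the second helper shows a Unicode code unit sequence longer than one unit (a UTF-16 surrogate pair).

```python
import struct


def byte_sequence(text: str, encoding: str) -> list:
    """Return the code unit sequence of `text` for a byte-oriented encoding."""
    return list(text.encode(encoding))


def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of `text` in the UTF-16 encoding form."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


# "a" is a single byte in SJIS; an ideograph needs a two-byte sequence.
print([hex(b) for b in byte_sequence("a", "shift_jis")])
print([hex(b) for b in byte_sequence("\u6f22", "shift_jis")])

# U+1F600 lies outside the BMP, so UTF-16 uses two code units for it.
print([hex(u) for u in utf16_code_units("\U0001F600")])
```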
Obsolete synonym for code unit.
Synonym for coded character sequence.
An ordered sequence of one or more code points.
* A coded character sequence is also known as a coded character representation.
* Normally a coded character sequence consists of a sequence of encoded characters, but it may also include noncharacters or reserved code points.
* Internally, a process may choose to make use of noncharacter code points in its coded character sequences. However, such noncharacter code points may not be interpreted as abstract characters (see conformance clause C2). Their removal by a conformant process constitutes modification of interpretation of the coded character sequence (see conformance clause C7).
* Reserved code points are included in coded character sequences, so that the conformance requirements regarding interpretation and modification are properly defined when a Unicode-conformant implementation encounters coded character sequences produced under a future version of the standard.
Unless specified otherwise for clarity, in the text of the Unicode Standard the term character alone designates an encoded character. Similarly, the term character sequence alone designates a coded character sequence.
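A minimal Python sketch of these ideas follows. It lists the code points of a string and flags noncharacters using the ranges the Unicode Standard designates as noncharacters (U+FDD0 to U+FDEF, plus the last two code points of every plane). The function names are hypothetical and chosen only for this example.

```python
def coded_character_sequence(text: str) -> list:
    """Return the ordered sequence of code points that make up `text`."""
    return [ord(ch) for ch in text]


def is_noncharacter(code_point: int) -> bool:
    """Return True for the 66 noncharacter code points."""
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # The last two code points of each plane: U+nFFFE and U+nFFFF.
    return 0 <= code_point <= 0x10FFFF and (code_point & 0xFFFE) == 0xFFFE


# A sequence may carry a noncharacter; a conformant process must not
# interpret it as an abstract character.
sequence = coded_character_sequence("A\uffff")
for cp in sequence:
    print(f"U+{cp:04X}", "noncharacter" if is_noncharacter(cp) else "other")
```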
A character set in which each character is assigned a numeric code point. Frequently abbreviated as character set, charset, or code set.
See encoded character.
(1) A range of numerical values available for encoding characters. (2) For the Unicode Standard, a range of integers from 0 to 10FFFF (hexadecimal).
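The bounds of the Unicode codespace can be checked in Python, whose `chr` function accepts exactly this range. The constant `CODESPACE_MAX` and the helper `in_codespace` are names introduced for this sketch.

```python
import sys

CODESPACE_MAX = 0x10FFFF  # upper bound of the Unicode codespace


def in_codespace(value: int) -> bool:
    """Return True if `value` is an integer inside the Unicode codespace."""
    return 0 <= value <= CODESPACE_MAX


# Python 3 reports the same upper bound for its str type.
print(sys.maxunicode == CODESPACE_MAX)

# One past the end of the codespace is rejected by chr().
try:
    chr(CODESPACE_MAX + 1)
except ValueError as error:
    print("outside the codespace:", error)
```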
The process of ordering units of textual information. Collation is usually specific to a particular language. Also known as alphabetizing or alphabetic sorting. Unicode Technical Standard #10, “Unicode Collation Algorithm,” defines a complete, unambiguous, specified ordering for all characters in the Unicode Standard.
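To see why collation is more than comparing code points, consider the sketch below. It is not an implementation of the Unicode Collation Algorithm, which Python's standard library does not provide; `str.casefold` is used here only as a crude stand-in for a language-aware sort key, and the word list is made up for the example.

```python
words = ["banana", "Zebra", "apple"]

# Default string comparison orders by code point, so "Z" (U+005A)
# sorts before every lowercase letter.
by_code_point = sorted(words)

# A case-insensitive key is closer to what a reader of English expects,
# but it is still far from a full, language-specific collation.
by_casefold = sorted(words, key=str.casefold)

print(by_code_point)
print(by_casefold)
```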
A character with the General Category of Combining Mark (M).
* Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing Mark (Me).
* All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
* The interpretation of private-use characters (Co) as combining characters or not is determined by the implementation.
* These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
* The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width nonjoiner. The combining character is said to apply to that base character.
* There may be no such base character, such as when a combining character is at the start of text or follows a control or format character (for example, a carriage return, tab, or right-to-left mark). In such cases, the combining characters are called isolated combining characters.
* With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
* The representative images of combining characters are depicted with a dotted circle in the code charts. When presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.
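The General Category test described above can be sketched with Python's `unicodedata` module. The helper name `is_combining_character` is an assumption made for this example. The sample code points show one mark with a non-zero canonical combining class and two marks whose class is zero.

```python
import unicodedata

COMBINING_CATEGORIES = {"Mn", "Mc", "Me"}


def is_combining_character(ch: str) -> bool:
    """Return True if `ch` has General Category Mn, Mc, or Me."""
    return unicodedata.category(ch) in COMBINING_CATEGORIES


samples = [
    "\u0301",  # COMBINING ACUTE ACCENT (Mn)
    "\u0903",  # DEVANAGARI SIGN VISARGA (Mc)
    "\u20dd",  # COMBINING ENCLOSING CIRCLE (Me)
    "A",       # LATIN CAPITAL LETTER A (Lu), a base character
]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        "ccc =", unicodedata.combining(ch),
        "combining" if is_combining_character(ch) else "not combining",
    )
```

Note that `unicodedata.combining` returns the canonical combining class, so a result of 0 does not by itself mean the character is not a combining character.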