UNICODE 2.1 CHARACTER DATABASE

Copyright (c) 1991-1998 Unicode, Inc.
All Rights reserved.

DISCLAIMER

The Unicode Character Database "UNIDAT21.TXT" is provided as-is by
Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any
particular purpose. No warranties of any kind are expressed or implied. The
recipient agrees to determine applicability of information provided. If this
file has been purchased on magnetic or optical media from Unicode, Inc.,
the sole remedy for any claim will be exchange of defective media within
90 days of receipt.

This disclaimer is applicable for all other data files accompanying the
Unicode Character Database, some of which have been compiled by the
Unicode Consortium, and some of which have been supplied by other vendors.

LIMITATIONS ON RIGHTS TO REDISTRIBUTE THIS DATA

Recipient is granted the right to make copies in any form for internal
distribution and to freely use the information supplied in the creation of
products supporting the Unicode (TM) Standard. This file can be redistributed
to third parties or other organizations (whether for profit or not) as long
as this notice and the disclaimer notice are retained.

EXPLANATORY INFORMATION

The Unicode Character Database defines the default Unicode character
properties, and internal mappings. Particular implementations may choose to
override the properties and mappings that are not normative. If that is done,
it is up to the implementer to establish a protocol to convey that
information. For more information about character properties and mappings,
see "The Unicode Standard, Worldwide Character Encoding, Version 2.0",
published by Addison-Wesley. For information about other data files
accompanying the Unicode Character Database, see the section of the
Unicode Standard they were extracted from, or the explanatory readme
files and/or header sections with those files.

The Unicode Character Database has been updated to reflect Version 2.1
of the Unicode Standard, with two additional characters added to those
published in Version 2.0:

    U+20AC EURO SIGN
    U+FFFC OBJECT REPLACEMENT CHARACTER

A number of corrections have also been made to case mappings or other
errors in the database noted since the publication of Version 2.0. And
a few normative bidirectional properties have been modified to reflect
decisions of the Unicode Technical Committee.

The Unicode Character Database is a plain ASCII text file consisting of lines
containing fields terminated by semicolons. Each line represents the data for
one encoded character in the Unicode Standard, Version 2.1. Every encoded
character has a data entry, with the exception of certain special ranges, as
detailed below.

There are five special ranges of characters that are represented only by
their start and end characters, since the properties in the file are uniform,
except for code values (which are all sequential and assigned). The names of CJK
ideograph characters and Hangul syllable characters are algorithmically
derivable. (See the Unicode Standard for more information.) Surrogate
characters and private use characters have no names.

The exact ranges represented by start and end characters are:

    The CJK Ideographs Area (U+4E00 - U+9FFF)
    The Hangul Syllables Area (U+AC00 - U+D7A3)
    The Surrogates Area (U+D800 - U+DFFF)
    The Private Use Area (U+E000 - U+F8FF)
    CJK Compatibility Ideographs (U+F900 - U+FAFF)

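A consumer of the file needs some way to answer "is this code value covered
by one of the special ranges?", because such characters have no individual
data entry. The following is a minimal sketch of that lookup, not part of the
database itself. The range limits are copied from the list above; the names
SPECIAL_RANGES and special_range are hypothetical.

```python
# Special ranges as listed in this file: (first, last, label).
SPECIAL_RANGES = [
    (0x4E00, 0x9FFF, "CJK Ideographs Area"),
    (0xAC00, 0xD7A3, "Hangul Syllables Area"),
    (0xD800, 0xDFFF, "Surrogates Area"),
    (0xE000, 0xF8FF, "Private Use Area"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
]


def special_range(code_point):
    """Return the label of the special range containing code_point, or None."""
    for first, last, label in SPECIAL_RANGES:
        if first <= code_point <= last:
            return label
    return None


print(special_range(0x4E2D))
print(special_range(0x0041))
```

A linear scan is enough for five ranges. An implementation would typically
consult this table only after failing to find an individual data entry.
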
The following table describes the format and meaning of each field in a
data entry in the Unicode Character Database. Fields which contain
normative information are so indicated.

Field   Explanation
-----   -----------

  0     Code value in 4-digit hexadecimal format.
        This field is normative.

  1     Unicode 2.1 Character Name. These names match exactly the
        names published in Chapter 7 of the Unicode Standard, Version
        2.0, except for the two additional characters.
        This field is normative.

  2     General Category. This is a useful breakdown into various "character
        types" which can be used as a default categorization in implementations.
        Some of the values are normative, and some are informative.
        See below for a brief explanation.

  3     Canonical Combining Classes. The classes used for the
        Canonical Ordering Algorithm in the Unicode Standard. These
        classes are also printed in Chapter 4 of the Unicode Standard.
        This field is normative. See below for a brief explanation.

  4     Bidirectional Category. See the list below for an explanation of the
        abbreviations used in this field. These are the categories required
        by the Bidirectional Behavior Algorithm in the Unicode Standard.
        These categories are summarized in Chapter 4 of the Unicode Standard.
        This field is normative.

  5     Character Decomposition. In the Unicode Standard, not all of
        the decompositions are full decompositions. Recursive
        application of look-up for decompositions will, in all cases, lead to
        a maximal decomposition. The decompositions match exactly the
        decompositions published with the character names in Chapter 7
        of the Unicode Standard. This field is normative.

  6     Decimal digit value. This is a numeric field. If the character
        has the decimal digit property, as specified in Chapter 4 of
        the Unicode Standard, the value of that digit is represented
        with an integer value in this field. This field is normative.

  7     Digit value. This is a numeric field. If the character represents a
        digit, not necessarily a decimal digit, the value is here. This
        covers digits which do not form decimal radix forms, such as the
        compatibility superscript digits. This field is informative.

  8     Numeric value. This is a numeric field. If the character has the
        numeric property, as specified in Chapter 4 of the Unicode
        Standard, the value of that character is represented with an
        integer or rational number in this field. This includes fractions,
        e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.
        Also included are numerical values for compatibility characters
        such as circled numbers. This field is normative.

  9     If the character has been identified as a "mirrored" character in
        bidirectional text, this field has the value "Y"; otherwise "N".
        The list of mirrored characters is also printed in Chapter 4 of
        the Unicode Standard. This field is normative.

 10     Unicode 1.0 Name. This is the old name as published in Unicode 1.0.
        This name is only provided when it is significantly different from
        the Unicode 2.1 name for the character. This field is informative.

 11     10646 Comment field. This field is informative.

 12     Upper case equivalent mapping. If a character is part of an
        alphabet with case distinctions, and has an upper case equivalent,
        then the upper case equivalent is in this field. See the explanation
        below on case distinctions. These mappings are always one-to-one,
        not one-to-many or many-to-one. This field is informative.

 13     Lower case equivalent mapping. Similar to 12. This field is informative.

 14     Title case equivalent mapping. Similar to 12. This field is informative.

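The field layout above maps naturally onto a small record type. The sketch
below splits one line on semicolons and converts each field, assuming that
field 0 holds the code value in hexadecimal, that the case mapping fields
hold hexadecimal code values, and that empty fields mean "no value". The
names Record and parse_record are hypothetical, and the sample line is
written by hand in the format described here; check it against the actual
data file before relying on it.

```python
from fractions import Fraction
from typing import NamedTuple, Optional

FIELD_COUNT = 15  # fields 0 through 14


class Record(NamedTuple):
    code: int                      # field 0 (assumed: hexadecimal code value)
    name: str                      # field 1
    category: str                  # field 2
    combining_class: int           # field 3
    bidi: str                      # field 4
    decomposition: str             # field 5, kept as raw text
    decimal_digit: Optional[int]   # field 6
    digit: Optional[int]           # field 7
    numeric: Optional[Fraction]    # field 8, integer or rational such as "1/5"
    mirrored: bool                 # field 9, "Y" or "N"
    old_name: str                  # field 10
    comment: str                   # field 11
    upper: Optional[int]           # field 12
    lower: Optional[int]           # field 13
    title: Optional[int]           # field 14


def _hex(text: str) -> Optional[int]:
    """Convert a hexadecimal field, treating an empty field as no value."""
    return int(text, 16) if text else None


def _dec(text: str) -> Optional[int]:
    """Convert a decimal field, treating an empty field as no value."""
    return int(text) if text else None


def parse_record(line: str) -> Record:
    parts = line.rstrip("\r\n").split(";")
    # Tolerate one extra empty item, which appears if the last field is
    # itself followed by a semicolon.
    if len(parts) == FIELD_COUNT + 1 and parts[-1] == "":
        parts.pop()
    if len(parts) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(parts)}")
    return Record(
        code=int(parts[0], 16),
        name=parts[1],
        category=parts[2],
        combining_class=int(parts[3]),
        bidi=parts[4],
        decomposition=parts[5],
        decimal_digit=_dec(parts[6]),
        digit=_dec(parts[7]),
        numeric=Fraction(parts[8]) if parts[8] else None,
        mirrored=parts[9] == "Y",
        old_name=parts[10],
        comment=parts[11],
        upper=_hex(parts[12]),
        lower=_hex(parts[13]),
        title=_hex(parts[14]),
    )


# Illustrative line in the documented format (an assumption, not quoted data).
SAMPLE = ("2155;VULGAR FRACTION ONE FIFTH;No;0;ON;"
          "<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;")
record = parse_record(SAMPLE)
print(record.name, record.numeric)
```

Storing the numeric value as a Fraction keeps rational values such as "1/5"
exact, which a float would not.
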
GENERAL CATEGORY

The values in this field are abbreviations for the following. Some of the
values are normative, and some are informative. For more information, see
the Unicode Standard. Note: the standard does not assign information to
control characters (except for TAB in the Bidirectional Algorithm).
Implementations will generally also assign categories to certain control
characters, notably CR and LF, according to platform conventions.

Normative
    Mn = Mark, Non-Spacing
    Mc = Mark, Spacing Combining
    Me = Mark, Enclosing

    Nd = Number, Decimal Digit
    Nl = Number, Letter
    No = Number, Other

    Zs = Separator, Space
    Zl = Separator, Line
    Zp = Separator, Paragraph

    Cc = Other, Control
    Cf = Other, Format
    Cs = Other, Surrogate
    Co = Other, Private Use
    Cn = Other, Not Assigned

Informative
    Lu = Letter, Uppercase
    Ll = Letter, Lowercase
    Lt = Letter, Titlecase
    Lm = Letter, Modifier
    Lo = Letter, Other

    Pc = Punctuation, Connector
    Pd = Punctuation, Dash
    Ps = Punctuation, Open
    Pe = Punctuation, Close
    Po = Punctuation, Other

    Sm = Symbol, Math
    Sc = Symbol, Currency
    Sk = Symbol, Modifier
    So = Symbol, Other

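Because only some General Category values are normative, an implementation
that allows overrides may want to know which kind it is dealing with. One
possible way to record the split is sketched below. The abbreviations and
descriptions are copied from the list above; the names NORMATIVE,
INFORMATIVE and describe_category are hypothetical.

```python
NORMATIVE = {
    "Mn": "Mark, Non-Spacing", "Mc": "Mark, Spacing Combining",
    "Me": "Mark, Enclosing",
    "Nd": "Number, Decimal Digit", "Nl": "Number, Letter",
    "No": "Number, Other",
    "Zs": "Separator, Space", "Zl": "Separator, Line",
    "Zp": "Separator, Paragraph",
    "Cc": "Other, Control", "Cf": "Other, Format", "Cs": "Other, Surrogate",
    "Co": "Other, Private Use", "Cn": "Other, Not Assigned",
}

INFORMATIVE = {
    "Lu": "Letter, Uppercase", "Ll": "Letter, Lowercase",
    "Lt": "Letter, Titlecase", "Lm": "Letter, Modifier",
    "Lo": "Letter, Other",
    "Pc": "Punctuation, Connector", "Pd": "Punctuation, Dash",
    "Ps": "Punctuation, Open", "Pe": "Punctuation, Close",
    "Po": "Punctuation, Other",
    "Sm": "Symbol, Math", "Sc": "Symbol, Currency",
    "Sk": "Symbol, Modifier", "So": "Symbol, Other",
}


def describe_category(abbreviation):
    """Return (description, status) for a General Category abbreviation."""
    if abbreviation in NORMATIVE:
        return NORMATIVE[abbreviation], "normative"
    if abbreviation in INFORMATIVE:
        return INFORMATIVE[abbreviation], "informative"
    raise KeyError(abbreviation)


print(describe_category("Mn"))
print(describe_category("Lu"))
```
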
BIDIRECTIONAL PROPERTIES

Please refer to the Unicode Standard for an explanation of the algorithm for
Bidirectional Behavior and an explanation of the significance of these categories.
These values are normative.

Strong types:
    L       Left-Right; Most alphabetic, syllabic, and logographic
            characters (e.g., CJK ideographs)
    R       Right-Left; Arabic, Hebrew, and
            punctuation specific to those scripts
Weak types:
    EN      European Number
    ES      European Number Separator
    ET      European Number Terminator
    AN      Arabic Number
    CS      Common Number Separator

Separators:
    B       Block Separator
    S       Segment Separator

Neutrals:
    WS      Whitespace
    ON      Other Neutrals ; All other characters: punctuation, symbols

CHARACTER DECOMPOSITION TAGS

The decomposition is a normative property of a character. The tags supplied
with certain decompositions generally indicate formatting information.
Where no such tag is given, the decomposition is designated as canonical.
Conversely, the presence of a formatting tag also indicates
that the decomposition is a compatibility decomposition and not a canonical
decomposition. In the absence of other formatting information in a
compatibility decomposition, the tag <compat> is used to distinguish it from
canonical decompositions.

In some instances a canonical decomposition or a compatibility decomposition
may consist of a single character. For a canonical decomposition, this
indicates that the character is a canonical equivalent of another single
character. For a compatibility decomposition, this indicates that the
character is a compatibility equivalent of another single character.

The compatibility formatting tags used are:

    <font>      A font variant (e.g. a blackletter form).
    <noBreak>   A no-break version of a space or hyphen.
    <initial>   An initial presentation form (Arabic).
    <medial>    A medial presentation form (Arabic).
    <final>     A final presentation form (Arabic).
    <isolated>  An isolated presentation form (Arabic).
    <circle>    An encircled form.
    <super>     A superscript form.
    <sub>       A subscript form.
    <vertical>  A vertical layout presentation form.
    <wide>      A wide (or zenkaku) compatibility character.
    <narrow>    A narrow (or hankaku) compatibility character.
    <small>     A small variant form (CNS compatibility).
    <square>    A CJK squared font variant.
    <fraction>  A vulgar fraction form.
    <compat>    Otherwise unspecified compatibility character.

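The rule that an untagged decomposition is canonical, together with the
recursive look-up mentioned for field 5, can be sketched as follows. This
assumes a tag is written in angle brackets before the code values, as in
"<fraction> 0031 2044 0035", and that code values are separated by spaces.
The function names and the tiny TABLE are hypothetical stand-ins for data
read from the file, and the sketch does not apply canonical ordering to the
result.

```python
import re

_TAG = re.compile(r"^<([A-Za-z]+)>\s*")


def parse_decomposition(field):
    """Split a decomposition field into (tag, code_points).

    tag is None when no formatting tag is present, which marks the
    decomposition as canonical.
    """
    if not field:
        return None, []
    tag = None
    match = _TAG.match(field)
    if match:
        tag = match.group(1)
        field = field[match.end():]
    return tag, [int(cp, 16) for cp in field.split()]


def full_decomposition(code_point, table, canonical_only=True):
    """Recursively expand a code point using first-level decompositions."""
    tag, parts = parse_decomposition(table.get(code_point, ""))
    if not parts or (canonical_only and tag is not None):
        return [code_point]
    result = []
    for part in parts:
        result.extend(full_decomposition(part, table, canonical_only))
    return result


# Hypothetical excerpt: code value -> raw decomposition field.
TABLE = {
    0x00C2: "0041 0302",                  # A WITH CIRCUMFLEX
    0x1EA4: "00C2 0301",                  # A WITH CIRCUMFLEX AND ACUTE
    0x2155: "<fraction> 0031 2044 0035",  # VULGAR FRACTION ONE FIFTH
}

print([f"{cp:04X}" for cp in full_decomposition(0x1EA4, TABLE)])
```

Passing canonical_only=False also follows tagged entries, which yields a
compatibility decomposition instead.
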
CANONICAL COMBINING CLASSES

      0: Spacing, enclosing, reordrant, and surrounding
      1: Overlays and interior
      6: Tibetan subjoined Letters
      7: Nuktas
      8: Hiragana/Katakana voiced marks
      9: Viramas
     10: Start of fixed position classes
    199: End of fixed position classes
    200: Below left attached
    202: Below attached
    204: Below right attached
    208: Left attached (reordrant around single base character)
    210: Right attached
    212: Above left attached
    214: Above attached
    216: Above right attached
    218: Below left
    220: Below
    222: Below right
    224: Left (reordrant around single base character)
    226: Right
    228: Above left
    230: Above
    232: Above right
    234: Double above

Note: some of the combining classes in this list do not currently have
members but are specified here for completeness.

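The combining classes exist to drive canonical ordering: within a run of
characters whose class is not 0, characters are arranged by ascending class,
and characters of equal class keep their relative order. The sketch below
illustrates that idea only; see the Unicode Standard for the authoritative
statement of the algorithm. It takes classes from Python's unicodedata
module, which reflects whatever Unicode version ships with the interpreter
rather than Version 2.1, and the name canonical_order is hypothetical.

```python
import unicodedata


def canonical_order(text):
    """Reorder each run of combining marks by ascending combining class."""
    result = []
    run = []  # pending characters with a non-zero combining class
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A class 0 character ends the run; sorted() is stable, so
            # marks of equal class keep their original order.
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)


# U+0301 (class 230, Above) followed by U+0323 (class 220, Below).
ordered = canonical_order("a\u0301\u0323")
print([f"{ord(ch):04X}" for ch in ordered])
```
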
CASE MAPPINGS

In addition to uppercase and lowercase, because of the inclusion of certain
composite characters for compatibility, such as "01F1;LATIN CAPITAL LETTER
DZ", there is a third case, called titlecase, which is used where the first
character of a word is to be capitalized (e.g. UPPERCASE, Titlecase,
lowercase). An example of such a character is "01F2;LATIN CAPITAL LETTER D
WITH SMALL LETTER Z".

The uppercase, titlecase and lowercase fields are only included for characters
that have a single corresponding character of that type. Composite characters
(such as "339D;SQUARE CM") that do not have a single corresponding character
of that type can be cased by decomposition.

The case mapping is an informative, default mapping. Certain languages, such
as Turkish, German, French, or Greek may have small deviations from the
default mappings listed in the Unicode Character Database.

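Since the case mappings are informative defaults, one way an implementation
might accommodate language-specific deviations is to consult an override
table before the default one. The sketch below shows that layering. All
table contents are illustrative assumptions rather than entries quoted from
the database, the Turkish entry reflects the well-known dotted capital I,
and the names used are hypothetical.

```python
# Default mappings, as they might be loaded from fields 12 and 14.
DEFAULT_UPPER = {
    0x0069: 0x0049,  # i -> I
    0x01F3: 0x01F1,  # dz -> DZ
}
DEFAULT_TITLE = {
    0x01F1: 0x01F2,  # DZ -> Dz
    0x01F3: 0x01F2,  # dz -> Dz
}

# Language-specific overrides, checked before the defaults.
TURKISH_UPPER = {
    0x0069: 0x0130,  # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
}


def map_case(code_point, defaults, overrides=None):
    """Map one code point, preferring an override when one is supplied."""
    if overrides and code_point in overrides:
        return overrides[code_point]
    # Characters without a mapping are returned unchanged.
    return defaults.get(code_point, code_point)


print(f"{map_case(0x0069, DEFAULT_UPPER):04X}")
print(f"{map_case(0x0069, DEFAULT_UPPER, TURKISH_UPPER):04X}")
print(f"{map_case(0x01F3, DEFAULT_TITLE):04X}")
```

Keeping the overrides in a separate table leaves the default data untouched,
so the same loaded database can serve several languages.
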
MODIFICATION HISTORY

Modifications made in updating the Unicode Character Database for
the Unicode Standard, Version 2.1 (from Version 2.0) are:
* Added two characters (U+20AC and U+FFFC).
* Amended bidi properties for U+0026, U+002E, U+0040, U+2007.
* Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275,
  U+03C2, U+1E9B.
* Changed combining order class for U+0F71.
* Corrected canonical decompositions for U+0F73, U+1FBE.
* Changed decomposition for U+FB1F from compatibility to canonical.
* Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB.
* Corrected compatibility decompositions for U+2469, U+246A, U+3358.


Some of the modifications made in updating the Unicode Character Database
for the Unicode Standard, Version 2.0 are:
* Fixed decompositions with TONOS to use correct NSM: 030D.
* Removed old Hangul Syllables; mappings to the new characters are
  in a separate table.
* Marked compatibility decompositions with additional tags.
* Changed old tag names for clarity.
* Revision of decompositions to use first-level decomposition, instead
  of maximal decomposition.
* Correction of all known errors in decompositions from earlier versions.
* Added control code names (as old Unicode names).
* Added Hangul Jamo decompositions.
* Added Number category to match properties list in book.
* Fixed categories of Koranic Arabic marks.
* Fixed categories of precomposed characters to match decomposition where possible.
* Added Hebrew cantillation marks and the Tibetan script.
* Added place holders for ranges such as CJK Ideographic Area and the
  Private Use Area.
* Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of mistakes in the
  database.