<html>
2
 
3
 
4
 
5
<head>
6
 
7
<meta NAME="GENERATOR" CONTENT="Microsoft FrontPage 4.0">
8
 
9
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
10
 
11
<link REL="stylesheet" HREF="http://www.unicode.org/unicode.css" TYPE="text/css">
12
 
13
<title>UnicodeData File Format</title>
14
 
15
</head>
16
 
17
 
18
 
19
<body>
20
 
21
 
22
 
23
<h1>UnicodeData File Format<br>
24
Version 3.0.0</h1>
25
 
26
 
27
 
28
<table BORDER="1" CELLSPACING="2" CELLPADDING="0" HEIGHT="87" WIDTH="100%">
29
 
30
  <tr>
31
 
32
    <td VALIGN="TOP" width="144">Revision</td>
33
 
34
    <td VALIGN="TOP">3.0.0</td>
35
 
36
  </tr>
37
 
38
  <tr>
39
 
40
    <td VALIGN="TOP" width="144">Authors</td>
41
 
42
    <td VALIGN="TOP">Mark Davis and Ken Whistler</td>
43
 
44
  </tr>
45
 
46
  <tr>
47
 
48
    <td VALIGN="TOP" width="144">Date</td>
49
 
50
    <td VALIGN="TOP">1999-09-12</td>
51
 
52
  </tr>
53
 
54
  <tr>
55
 
56
    <td VALIGN="TOP" width="144">This Version</td>
57
 
58
    <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td>
59
 
60
  </tr>
61
 
62
  <tr>
63
 
64
    <td VALIGN="TOP" width="144">Previous Version</td>
65
 
66
    <td VALIGN="TOP">n/a</td>
67
 
68
  </tr>
69
 
70
  <tr>
71
 
72
    <td VALIGN="TOP" width="144">Latest Version</td>
73
 
74
    <td VALIGN="TOP"><a href="ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td>
75
 
76
  </tr>
77
 
78
</table>
79
 
80
 
81
 
82
<p align="center">Copyright © 1995-1999 Unicode, Inc. All rights reserved.<br>
83
 
84
<i>For more information, including Disclaimer and Limitations, see <a HREF="UnicodeCharacterDatabase-3.0.0.html">UnicodeCharacterDatabase-3.0.0.html</a> </i></p>
85
 
86
 
87
 
88
<p>This document describes the format of the UnicodeData.txt file, which is one of the
89
 
90
files in the Unicode Character Database. The document is divided into the following
91
 
92
sections:
93
 
94
 
95
 
96
<ul>
97
 
98
  <li><a HREF="#Field Formats">Field Formats</a> <ul>
99
 
100
      <li><a HREF="#General Category">General Category</a> </li>
101
 
102
      <li><a HREF="#Bidirectional Category">Bidirectional Category</a> </li>
103
 
104
      <li><a HREF="#Character Decomposition">Character Decomposition Mapping</a> </li>
105
 
106
      <li><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </li>
107
 
108
      <li><a HREF="#Decompositions and Normalization">Decompositions and Normalization</a> </li>
109
 
110
      <li><a HREF="#Case Mappings">Case Mappings</a> </li>
111
 
112
    </ul>
113
 
114
  </li>
115
 
116
  <li><a HREF="#Property Invariants">Property Invariants</a> </li>
117
 
118
  <li><a HREF="#Modification History">Modification History</a> </li>
119
 
120
</ul>
121
 
122
 
123
 
124
<p><b>Warning: </b>the information in this file does not completely describe the use and
125
 
126
interpretation of Unicode character properties and behavior. It must be used in
127
 
128
conjunction with the data in the other files in the Unicode Character Database, and relies
129
 
130
on the notation and definitions supplied in <i><a href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html"> The Unicode
131
Standard</a></i>. All chapter references
132
 
133
are to Version 3.0 of the standard.</p>
134
 
135
 
136
 
137
<h2><a NAME="Field Formats"></a>Field Formats</h2>
138
 
139
 
140
 
141
<p>The file consists of lines containing fields terminated by semicolons. Each line
142
 
143
represents the data for one encoded character in the Unicode Standard. Every encoded
144
 
145
character has a data entry, with the exception of certain special ranges, as detailed
146
 
147
below.
148
 
149
 
150
 
151
<ul>
152
 
153
  <li>There are seven special ranges of characters that are represented only by their start and
154
 
155
    end characters, since the properties in the file are uniform, except for code values
156
 
157
    (which are all sequential and assigned). </li>
158
 
159
  <li>The names of CJK ideograph characters and the names and decompositions of Hangul
160
 
161
    syllable characters are algorithmically derivable. (See the Unicode Standard and <a
162
 
163
    HREF="http://www.unicode.org/unicode/reports/tr15/">Unicode Technical Report #15</a> for
164
 
165
    more information). </li>
166
 
167
  <li>Surrogate code values and private use characters have no names. </li>
168
 
169
  <li>The Private Use characters outside of the BMP (U+F0000..U+FFFFD, U+100000..U+10FFFD) are
170
 
171
    not listed. These correspond to surrogate pairs where the first surrogate is in the High
172
 
173
    Surrogate Private Use section. </li>
174
 
175
</ul>
176
 
177
 
178
 
179
<p>The exact ranges represented by start and end characters are:
180
 
181
 
182
 
183
<ul>
184
 
185
  <li>CJK Ideographs Extension A (U+3400 - U+4DB5) </li>
186
 
187
  <li>CJK Ideographs (U+4E00 - U+9FA5) </li>
188
 
189
  <li>Hangul Syllables (U+AC00 - U+D7A3) </li>
190
 
191
  <li>Non-Private Use High Surrogates (U+D800 - U+DB7F) </li>
192
 
193
  <li>Private Use High Surrogates (U+DB80 - U+DBFF) </li>
194
 
195
  <li>Low Surrogates (U+DC00 - U+DFFF) </li>
196
 
197
  <li>The Private Use Area (U+E000 - U+F8FF) </li>
198
 
199
</ul>
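<p>As a rough illustration of how a reader of the file might handle these ranges, the
following Python sketch pairs start and end entries and answers lookups for any code value
inside a range. It assumes that start and end entries carry names ending in
<code>, First&gt;</code> and <code>, Last&gt;</code> (for example
<code>&lt;CJK Ideograph, First&gt;</code>); that naming is an assumption about the data
file and is not specified in this document. The function names <code>expand_ranges</code>
and <code>lookup</code> are hypothetical.</p>

```python
def expand_ranges(records):
    """Separate ordinary entries from start/end range entries.

    `records` is an iterable of field lists (one line split on ";").
    Assumption: range entries are named "<..., First>" and "<..., Last>".
    Returns (singles, ranges) where ranges holds (start, end, fields).
    """
    singles = {}
    ranges = []
    pending = None  # (start code, fields) of a range that is still open
    for fields in records:
        code = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            pending = (code, fields)
        elif name.endswith(", Last>") and pending is not None:
            start, first_fields = pending
            ranges.append((start, code, first_fields))
            pending = None
        else:
            singles[code] = fields
    return singles, ranges


def lookup(code, singles, ranges):
    """Return the field list that applies to `code`, or None if unlisted."""
    if code in singles:
        return singles[code]
    for start, end, fields in ranges:
        if start <= code <= end:
            return fields  # properties are uniform across the range
    return None


sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FA5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]
singles, ranges = expand_ranges(line.split(";") for line in sample)
```

<p>With this sample, <code>lookup(0x4E2D, singles, ranges)</code> returns the fields of the
range's start entry, while a code value outside every listed entry yields
<code>None</code>.</p>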
200
 
201
 
202
 
203
<p>The following table describes the format and meaning of each field in a data entry in
204
 
205
the UnicodeData file. Fields which contain normative information are so indicated.</p>
206
 
207
 
208
 
209
<table BORDER="1" CELLSPACING="2" CELLPADDING="2">
210
 
211
  <tr>
212
 
213
    <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Field</th>
214
 
215
    <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Name</th>
216
 
217
    <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Status</th>
218
 
219
    <th VALIGN="top" ALIGN="LEFT"><p ALIGN="LEFT">Explanation</th>
220
 
221
  </tr>
222
 
223
  <tr>
224
 
225
    <th VALIGN="top">0</th>
226
 
227
    <td VALIGN="top">Code value</td>
228
 
229
    <td VALIGN="top">normative</td>
230
 
231
    <td VALIGN="top">Code value in 4-digit hexadecimal format.</td>
232
 
233
  </tr>
234
 
235
  <tr>
236
 
237
    <th VALIGN="top">1</th>
238
 
239
    <td VALIGN="top">Character name</td>
240
 
241
    <td VALIGN="top">normative</td>
242
 
243
    <td VALIGN="top">These names match exactly the names published in Chapter 14 of the
244
 
245
    Unicode Standard, Version 3.0.</td>
246
 
247
  </tr>
248
 
249
  <tr>
250
 
251
    <th VALIGN="top">2</th>
252
 
253
    <td VALIGN="top"><a HREF="#General Category">General Category</a> </td>
254
 
255
    <td VALIGN="top">normative / informative<br>
256
 
257
    (see below)</td>
258
 
259
    <td VALIGN="top">This is a useful breakdown into various &quot;character types&quot; which
260
 
261
    can be used as a default categorization in implementations. See below for a brief
262
 
263
    explanation.</td>
264
 
265
  </tr>
266
 
267
  <tr>
268
 
269
    <th VALIGN="top">3</th>
270
 
271
    <td VALIGN="top"><a HREF="#Canonical Combining Classes">Canonical Combining Classes</a> </td>
272
 
273
    <td VALIGN="top">normative</td>
274
 
275
    <td VALIGN="top">The classes used for the Canonical Ordering Algorithm in the Unicode
276
 
277
    Standard. These classes are also printed in Chapter 4 of the Unicode Standard.</td>
278
 
279
  </tr>
280
 
281
  <tr>
282
 
283
    <th VALIGN="top">4</th>
284
 
285
    <td VALIGN="top"><a HREF="#Bidirectional Category">Bidirectional Category</a> </td>
286
 
287
    <td VALIGN="top">normative</td>
288
 
289
    <td VALIGN="top">See the list below for an explanation of the abbreviations used in this
290
 
291
    field. These are the categories required by the Bidirectional Behavior Algorithm in the
292
 
293
    Unicode Standard. These categories are summarized in Chapter 3 of the Unicode Standard.</td>
294
 
295
  </tr>
296
 
297
  <tr>
298
 
299
    <th VALIGN="top">5</th>
300
 
301
    <td VALIGN="top"><a HREF="#Character Decomposition">Character Decomposition
302
      Mapping</a></td>
303
 
304
    <td VALIGN="top">normative</td>
305
 
306
    <td VALIGN="top">In the Unicode Standard, not all of the mappings are full (maximal)
307
 
308
    decompositions. Recursive application of look-up for decompositions will, in all cases,
309
 
310
    lead to a maximal decomposition. The decomposition mappings match exactly the
311
 
312
    decomposition mappings published with the character names in the Unicode Standard.</td>
313
 
314
  </tr>
315
 
316
  <tr>
317
 
318
    <th VALIGN="top">6</th>
319
 
320
    <td VALIGN="top">Decimal digit value</td>
321
 
322
    <td VALIGN="top">normative</td>
323
 
324
    <td VALIGN="top">This is a numeric field. If the character has the decimal digit property,
325
 
326
    as specified in Chapter 4 of the Unicode Standard, the value of that digit is represented
327
 
328
    with an integer value in this field.</td>
329
 
330
  </tr>
331
 
332
  <tr>
333
 
334
    <th VALIGN="top">7</th>
335
 
336
    <td VALIGN="top">Digit value</td>
337
 
338
    <td VALIGN="top">normative</td>
339
 
340
    <td VALIGN="top">This is a numeric field. If the character represents a digit, not
341
 
342
    necessarily a decimal digit, the value is here. This covers digits which do not form
343
 
344
    decimal radix forms, such as the compatibility superscript digits.</td>
345
 
346
  </tr>
347
 
348
  <tr>
349
 
350
    <th VALIGN="top">8</th>
351
 
352
    <td VALIGN="top">Numeric value</td>
353
 
354
    <td VALIGN="top">normative</td>
355
 
356
    <td VALIGN="top">This is a numeric field. If the character has the numeric property, as
357
 
358
    specified in Chapter 4 of the Unicode Standard, the value of that character is represented
359
 
360
    with an integer or rational number in this field. This includes fractions as, e.g.,
361
 
362
    &quot;1/5&quot; for U+2155 VULGAR FRACTION ONE FIFTH. Also included are numerical values
363
 
364
    for compatibility characters such as circled numbers.</td>
365
 
366
  </tr>
367
 
368
  <tr>
369
 
370
    <th VALIGN="top">9</th>
371
 
372
    <td VALIGN="top">Mirrored</td>
373
 
374
    <td VALIGN="top">normative</td>
375
 
376
    <td VALIGN="top">If the character has been identified as a &quot;mirrored&quot; character
377
 
378
    in bidirectional text, this field has the value &quot;Y&quot;; otherwise &quot;N&quot;.
379
 
380
    The list of mirrored characters is also printed in Chapter 4 of the Unicode Standard.</td>
381
 
382
  </tr>
383
 
384
  <tr>
385
 
386
    <th VALIGN="top">10</th>
387
 
388
    <td VALIGN="top">Unicode 1.0 Name</td>
389
 
390
    <td VALIGN="top">informative</td>
391
 
392
    <td VALIGN="top">This is the old name as published in Unicode 1.0. This name is only
393
 
394
    provided when it is significantly different from the Unicode 3.0 name for the character.</td>
395
 
396
  </tr>
397
 
398
  <tr>
399
 
400
    <th VALIGN="top">11</th>
401
 
402
    <td VALIGN="top">10646 comment field</td>
403
 
404
    <td VALIGN="top">informative</td>
405
 
406
    <td VALIGN="top">This is the ISO 10646 comment field. It is in parentheses in the 10646
407
 
408
    names list.</td>
409
 
410
  </tr>
411
 
412
  <tr>
413
 
414
    <th VALIGN="top">12</th>
415
 
416
    <td VALIGN="top"><a HREF="#Case Mappings">Uppercase Mapping</a></td>
417
 
418
    <td VALIGN="top">informative</td>
419
 
420
    <td VALIGN="top">Upper case equivalent mapping. If a character is part of an alphabet with
421
 
422
    case distinctions, and has an upper case equivalent, then the upper case equivalent is in
423
 
424
    this field. See the explanation below on case distinctions. These mappings are always
425
 
426
    one-to-one, not one-to-many or many-to-one. This field is informative.</td>
427
 
428
  </tr>
429
 
430
  <tr>
431
 
432
    <th VALIGN="top">13</th>
433
 
434
    <td VALIGN="top"><a HREF="#Case Mappings">Lowercase Mapping</a></td>
435
 
436
    <td VALIGN="top">informative</td>
437
 
438
    <td VALIGN="top">Similar to Uppercase mapping</td>
439
 
440
  </tr>
441
 
442
  <tr>
443
 
444
    <th VALIGN="top">14</th>
445
 
446
    <td VALIGN="top"><a HREF="#Case Mappings">Titlecase Mapping</a></td>
447
 
448
    <td VALIGN="top">informative</td>
449
 
450
    <td VALIGN="top">Similar to Uppercase mapping</td>
451
 
452
  </tr>
453
 
454
</table>
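<p>The field layout in the table above can be sketched in Python as follows. This is a
minimal illustration, not a reference parser: the names <code>Record</code> and
<code>parse_line</code> are hypothetical, and the sketch assumes each line splits into
exactly fifteen fields (0 through 14) with empty strings for absent values.</p>

```python
from fractions import Fraction
from typing import NamedTuple, Optional


class Record(NamedTuple):
    code: int                        # field 0
    name: str                        # field 1
    general_category: str            # field 2
    combining_class: int             # field 3
    bidi_category: str               # field 4
    decomposition: str               # field 5 (raw text, tag included)
    decimal_digit: Optional[int]     # field 6
    digit: Optional[int]             # field 7
    numeric: Optional[Fraction]      # field 8, integer or rational
    mirrored: bool                   # field 9, "Y" or "N"
    unicode_1_name: str              # field 10
    iso_comment: str                 # field 11
    uppercase: Optional[int]         # field 12
    lowercase: Optional[int]         # field 13
    titlecase: Optional[int]         # field 14


def parse_line(line: str) -> Record:
    f = line.rstrip("\r\n").split(";")
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")

    def opt_int(s: str) -> Optional[int]:
        return int(s) if s else None

    def opt_hex(s: str) -> Optional[int]:
        return int(s, 16) if s else None

    return Record(
        int(f[0], 16), f[1], f[2], int(f[3]), f[4], f[5],
        opt_int(f[6]), opt_int(f[7]),
        Fraction(f[8]) if f[8] else None,  # accepts "7" as well as "1/5"
        f[9] == "Y", f[10], f[11],
        opt_hex(f[12]), opt_hex(f[13]), opt_hex(f[14]),
    )


one_fifth = parse_line(
    "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;"
)
```

<p>Keeping the numeric value as a <code>Fraction</code> preserves entries such as
&quot;1/5&quot; exactly instead of rounding them to a floating-point number.</p>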
455
 
456
 
457
 
458
<h3><a NAME="General Category"></a>General Category</h3>
459
 
460
 
461
 
462
<p>The values in this field are abbreviations for the following. Some of the values are
463
 
464
normative, and some are informative. For more information, see the Unicode Standard.</p>
465
 
466
 
467
 
468
<p><b>Note:</b> the standard does not assign information to control characters (except for
469
 
470
certain cases in the Bidirectional Algorithm). Implementations will generally also assign
471
 
472
categories to certain control characters, notably CR and LF, according to platform
473
 
474
conventions.</p>
475
 
476
 
477
 
478
<h4>Normative Categories</h4>
479
 
480
 
481
 
482
<table BORDER="0" CELLSPACING="2" CELLPADDING="0">
483
 
484
  <tr>
485
 
486
    <th><p ALIGN="LEFT">Abbr.</th>
487
 
488
    <th><p ALIGN="LEFT">Description</th>
489
 
490
  </tr>
491
 
492
  <tr>
493
 
494
    <td ALIGN="CENTER">Lu</td>
495
 
496
    <td>Letter, Uppercase</td>
497
 
498
  </tr>
499
 
500
  <tr>
501
 
502
    <td ALIGN="CENTER">Ll</td>
503
 
504
    <td>Letter, Lowercase</td>
505
 
506
  </tr>
507
 
508
  <tr>
509
 
510
    <td ALIGN="CENTER">Lt</td>
511
 
512
    <td>Letter, Titlecase</td>
513
 
514
  </tr>
515
 
516
  <tr>
517
 
518
    <td ALIGN="CENTER">Mn</td>
519
 
520
    <td>Mark, Non-Spacing</td>
521
 
522
  </tr>
523
 
524
  <tr>
525
 
526
    <td ALIGN="CENTER">Mc</td>
527
 
528
    <td>Mark, Spacing Combining</td>
529
 
530
  </tr>
531
 
532
  <tr>
533
 
534
    <td ALIGN="CENTER">Me</td>
535
 
536
    <td>Mark, Enclosing</td>
537
 
538
  </tr>
539
 
540
  <tr>
541
 
542
    <td ALIGN="CENTER">Nd</td>
543
 
544
    <td>Number, Decimal Digit</td>
545
 
546
  </tr>
547
 
548
  <tr>
549
 
550
    <td ALIGN="CENTER">Nl</td>
551
 
552
    <td>Number, Letter</td>
553
 
554
  </tr>
555
 
556
  <tr>
557
 
558
    <td ALIGN="CENTER">No</td>
559
 
560
    <td>Number, Other</td>
561
 
562
  </tr>
563
 
564
  <tr>
565
 
566
    <td ALIGN="CENTER">Zs</td>
567
 
568
    <td>Separator, Space</td>
569
 
570
  </tr>
571
 
572
  <tr>
573
 
574
    <td ALIGN="CENTER">Zl</td>
575
 
576
    <td>Separator, Line</td>
577
 
578
  </tr>
579
 
580
  <tr>
581
 
582
    <td ALIGN="CENTER">Zp</td>
583
 
584
    <td>Separator, Paragraph</td>
585
 
586
  </tr>
587
 
588
  <tr>
589
 
590
    <td ALIGN="CENTER">Cc</td>
591
 
592
    <td>Other, Control</td>
593
 
594
  </tr>
595
 
596
  <tr>
597
 
598
    <td ALIGN="CENTER">Cf</td>
599
 
600
    <td>Other, Format</td>
601
 
602
  </tr>
603
 
604
  <tr>
605
 
606
    <td ALIGN="CENTER">Cs</td>
607
 
608
    <td>Other, Surrogate</td>
609
 
610
  </tr>
611
 
612
  <tr>
613
 
614
    <td ALIGN="CENTER">Co</td>
615
 
616
    <td>Other, Private Use</td>
617
 
618
  </tr>
619
 
620
  <tr>
621
 
622
    <td ALIGN="CENTER">Cn</td>
623
 
624
    <td>Other, Not Assigned (no characters in the file have this property)</td>
625
 
626
  </tr>
627
 
628
</table>
629
 
630
 
631
 
632
<h4>Informative Categories</h4>
633
 
634
 
635
 
636
<table BORDER="0" CELLSPACING="2" CELLPADDING="0">
637
 
638
  <tr>
639
 
640
    <th><p ALIGN="LEFT">Abbr.</th>
641
 
642
    <th><p ALIGN="LEFT">Description</th>
643
 
644
  </tr>
645
 
646
  <tr>
647
 
648
    <td ALIGN="CENTER">Lm</td>
649
 
650
    <td>Letter, Modifier</td>
651
 
652
  </tr>
653
 
654
  <tr>
655
 
656
    <td ALIGN="CENTER">Lo</td>
657
 
658
    <td>Letter, Other</td>
659
 
660
  </tr>
661
 
662
  <tr>
663
 
664
    <td ALIGN="CENTER">Pc</td>
665
 
666
    <td>Punctuation, Connector</td>
667
 
668
  </tr>
669
 
670
  <tr>
671
 
672
    <td ALIGN="CENTER">Pd</td>
673
 
674
    <td>Punctuation, Dash</td>
675
 
676
  </tr>
677
 
678
  <tr>
679
 
680
    <td ALIGN="CENTER">Ps</td>
681
 
682
    <td>Punctuation, Open</td>
683
 
684
  </tr>
685
 
686
  <tr>
687
 
688
    <td ALIGN="CENTER">Pe</td>
689
 
690
    <td>Punctuation, Close</td>
691
 
692
  </tr>
693
 
694
  <tr>
695
 
696
    <td ALIGN="CENTER">Pi</td>
697
 
698
    <td>Punctuation, Initial quote (may behave like Ps or Pe depending on usage)</td>
699
 
700
  </tr>
701
 
702
  <tr>
703
 
704
    <td ALIGN="CENTER">Pf</td>
705
 
706
    <td>Punctuation, Final quote (may behave like Ps or Pe depending on usage)</td>
707
 
708
  </tr>
709
 
710
  <tr>
711
 
712
    <td ALIGN="CENTER">Po</td>
713
 
714
    <td>Punctuation, Other</td>
715
 
716
  </tr>
717
 
718
  <tr>
719
 
720
    <td ALIGN="CENTER">Sm</td>
721
 
722
    <td>Symbol, Math</td>
723
 
724
  </tr>
725
 
726
  <tr>
727
 
728
    <td ALIGN="CENTER">Sc</td>
729
 
730
    <td>Symbol, Currency</td>
731
 
732
  </tr>
733
 
734
  <tr>
735
 
736
    <td ALIGN="CENTER">Sk</td>
737
 
738
    <td>Symbol, Modifier</td>
739
 
740
  </tr>
741
 
742
  <tr>
743
 
744
    <td ALIGN="CENTER">So</td>
745
 
746
    <td>Symbol, Other</td>
747
 
748
  </tr>
749
 
750
</table>
751
 
752
 
753
 
754
<h3><a NAME="Bidirectional Category"></a>Bidirectional Category</h3>
755
 
756
 
757
 
758
<p>Please refer to Chapter 3 for an explanation of the algorithm for Bidirectional
759
 
760
Behavior and an explanation of the significance of these categories. An up-to-date version
761
 
762
can be found on <a HREF="http://www.unicode.org/unicode/reports/tr9/">Unicode Technical
763
 
764
Report #9: The Bidirectional Algorithm</a>. These values are normative.</p>
765
 
766
 
767
 
768
<table BORDER="0" CELLPADDING="2">
769
 
770
  <tr>
771
 
772
    <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Type</th>
773
 
774
    <th VALIGN="TOP" ALIGN="LEFT"><p ALIGN="LEFT">Description</th>
775
 
776
  </tr>
777
 
778
  <tr>
779
 
780
    <td VALIGN="TOP"><b>L</b></td>
781
 
782
    <td VALIGN="TOP">Left-to-Right</td>
783
 
784
  </tr>
785
 
786
  <tr>
787
 
788
    <td VALIGN="TOP"><b>LRE</b></td>
789
 
790
    <td VALIGN="TOP">Left-to-Right Embedding</td>
791
 
792
  </tr>
793
 
794
  <tr>
795
 
796
    <td VALIGN="TOP"><b>LRO</b></td>
797
 
798
    <td VALIGN="TOP">Left-to-Right Override</td>
799
 
800
  </tr>
801
 
802
  <tr>
803
 
804
    <td VALIGN="TOP"><b>R</b></td>
805
 
806
    <td VALIGN="TOP">Right-to-Left</td>
807
 
808
  </tr>
809
 
810
  <tr>
811
 
812
    <td VALIGN="TOP"><b>AL</b></td>
813
 
814
    <td VALIGN="TOP">Right-to-Left Arabic</td>
815
 
816
  </tr>
817
 
818
  <tr>
819
 
820
    <td VALIGN="TOP"><b>RLE</b></td>
821
 
822
    <td VALIGN="TOP">Right-to-Left Embedding</td>
823
 
824
  </tr>
825
 
826
  <tr>
827
 
828
    <td VALIGN="TOP"><b>RLO</b></td>
829
 
830
    <td VALIGN="TOP">Right-to-Left Override</td>
831
 
832
  </tr>
833
 
834
  <tr>
835
 
836
    <td VALIGN="TOP"><b>PDF</b></td>
837
 
838
    <td VALIGN="TOP">Pop Directional Format</td>
839
 
840
  </tr>
841
 
842
  <tr>
843
 
844
    <td VALIGN="TOP"><b>EN</b></td>
845
 
846
    <td VALIGN="TOP">European Number</td>
847
 
848
  </tr>
849
 
850
  <tr>
851
 
852
    <td VALIGN="TOP"><b>ES</b></td>
853
 
854
    <td VALIGN="TOP">European Number Separator</td>
855
 
856
  </tr>
857
 
858
  <tr>
859
 
860
    <td VALIGN="TOP"><b>ET</b></td>
861
 
862
    <td VALIGN="TOP">European Number Terminator</td>
863
 
864
  </tr>
865
 
866
  <tr>
867
 
868
    <td VALIGN="TOP"><b>AN</b></td>
869
 
870
    <td VALIGN="TOP">Arabic Number</td>
871
 
872
  </tr>
873
 
874
  <tr>
875
 
876
    <td VALIGN="TOP"><b>CS</b></td>
877
 
878
    <td VALIGN="TOP">Common Number Separator</td>
879
 
880
  </tr>
881
 
882
  <tr>
883
 
884
    <td VALIGN="TOP"><b>NSM</b></td>
885
 
886
    <td VALIGN="TOP">Non-Spacing Mark</td>
887
 
888
  </tr>
889
 
890
  <tr>
891
 
892
    <td VALIGN="TOP"><b>BN</b></td>
893
 
894
    <td VALIGN="TOP">Boundary Neutral</td>
895
 
896
  </tr>
897
 
898
  <tr>
899
 
900
    <td VALIGN="TOP"><b>B</b></td>
901
 
902
    <td VALIGN="TOP">Paragraph Separator</td>
903
 
904
  </tr>
905
 
906
  <tr>
907
 
908
    <td VALIGN="TOP"><b>S</b></td>
909
 
910
    <td VALIGN="TOP">Segment Separator</td>
911
 
912
  </tr>
913
 
914
  <tr>
915
 
916
    <td VALIGN="TOP"><b>WS</b></td>
917
 
918
    <td VALIGN="TOP">Whitespace</td>
919
 
920
  </tr>
921
 
922
  <tr>
923
 
924
    <td VALIGN="TOP"><b>ON</b></td>
925
 
926
    <td VALIGN="TOP">Other Neutrals</td>
927
 
928
  </tr>
929
 
930
</table>
931
 
932
 
933
 
934
<h3><a NAME="Character Decomposition"></a>Character Decomposition Mapping</h3>
935
 
936
 
937
 
938
<p>The decomposition is a normative property of a character. The tags supplied with
939
 
940
certain decomposition mappings generally indicate formatting information. Where no such
941
 
942
tag is given, the mapping is designated as canonical. Conversely, the presence of a
943
 
944
formatting tag also indicates that the mapping is a compatibility mapping and not a
945
 
946
canonical mapping. In the absence of other formatting information in a compatibility
947
 
948
mapping, the tag is used to distinguish it from canonical mappings.</p>
949
 
950
 
951
 
952
<p>In some instances a canonical mapping or a compatibility mapping may consist of a
953
 
954
single character. For a canonical mapping, this indicates that the character is a
955
 
956
canonical equivalent of another single character. For a compatibility mapping, this
957
 
958
indicates that the character is a compatibility equivalent of another single character.
959
 
960
The compatibility formatting tags used are:</p>
961
 
962
 
963
 
964
<table BORDER="0" CELLSPACING="2" CELLPADDING="0">
965
 
966
  <tr>
967
 
968
    <th>Tag</th>
969
 
970
    <th><p ALIGN="LEFT">Description</th>
971
 
972
  </tr>
973
 
974
  <tr>
975
 
976
    <td ALIGN="CENTER">&lt;font&gt;&nbsp;&nbsp;</td>
977
 
978
    <td>A font variant (e.g. a blackletter form).</td>
979
 
980
  </tr>
981
 
982
  <tr>
983
 
984
    <td ALIGN="CENTER">&lt;noBreak&gt;&nbsp;&nbsp;</td>
985
 
986
    <td>A no-break version of a space or hyphen.</td>
987
 
988
  </tr>
989
 
990
  <tr>
991
 
992
    <td ALIGN="CENTER">&lt;initial&gt;&nbsp;&nbsp;</td>
993
 
994
    <td>An initial presentation form (Arabic).</td>
995
 
996
  </tr>
997
 
998
  <tr>
999
 
1000
    <td ALIGN="CENTER">&lt;medial&gt;&nbsp;&nbsp;</td>
1001
 
1002
    <td>A medial presentation form (Arabic).</td>
1003
 
1004
  </tr>
1005
 
1006
  <tr>
1007
 
1008
    <td ALIGN="CENTER">&lt;final&gt;&nbsp;&nbsp;</td>
1009
 
1010
    <td>A final presentation form (Arabic).</td>
1011
 
1012
  </tr>
1013
 
1014
  <tr>
1015
 
1016
    <td ALIGN="CENTER">&lt;isolated&gt;&nbsp;&nbsp;</td>
1017
 
1018
    <td>An isolated presentation form (Arabic).</td>
1019
 
1020
  </tr>
1021
 
1022
  <tr>
1023
 
1024
    <td ALIGN="CENTER">&lt;circle&gt;&nbsp;&nbsp;</td>
1025
 
1026
    <td>An encircled form.</td>
1027
 
1028
  </tr>
1029
 
1030
  <tr>
1031
 
1032
    <td ALIGN="CENTER">&lt;super&gt;&nbsp;&nbsp;</td>
1033
 
1034
    <td>A superscript form.</td>
1035
 
1036
  </tr>
1037
 
1038
  <tr>
1039
 
1040
    <td ALIGN="CENTER">&lt;sub&gt;&nbsp;&nbsp;</td>
1041
 
1042
    <td>A subscript form.</td>
1043
 
1044
  </tr>
1045
 
1046
  <tr>
1047
 
1048
    <td ALIGN="CENTER">&lt;vertical&gt;&nbsp;&nbsp;</td>
1049
 
1050
    <td>A vertical layout presentation form.</td>
1051
 
1052
  </tr>
1053
 
1054
  <tr>
1055
 
1056
    <td ALIGN="CENTER">&lt;wide&gt;&nbsp;&nbsp;</td>
1057
 
1058
    <td>A wide (or zenkaku) compatibility character.</td>
1059
 
1060
  </tr>
1061
 
1062
  <tr>
1063
 
1064
    <td ALIGN="CENTER">&lt;narrow&gt;&nbsp;&nbsp;</td>
1065
 
1066
    <td>A narrow (or hankaku) compatibility character.</td>
1067
 
1068
  </tr>
1069
 
1070
  <tr>
1071
 
1072
    <td ALIGN="CENTER">&lt;small&gt;&nbsp;&nbsp;</td>
1073
 
1074
    <td>A small variant form (CNS compatibility).</td>
1075
 
1076
  </tr>
1077
 
1078
  <tr>
1079
 
1080
    <td ALIGN="CENTER">&lt;square&gt;&nbsp;&nbsp;</td>
1081
 
1082
    <td>A CJK squared font variant.</td>
1083
 
1084
  </tr>
1085
 
1086
  <tr>
1087
 
1088
    <td ALIGN="CENTER">&lt;fraction&gt;&nbsp;&nbsp;</td>
1089
 
1090
    <td>A vulgar fraction form.</td>
1091
 
1092
  </tr>
1093
 
1094
  <tr>
1095
 
1096
    <td ALIGN="CENTER">&lt;compat&gt;&nbsp;&nbsp;</td>
1097
 
1098
    <td>Otherwise unspecified compatibility character.</td>
1099
 
1100
  </tr>
1101
 
1102
</table>
1103
 
1104
 
1105
 
1106
<p><b>Reminder: </b>There is a difference between decomposition and decomposition mapping.
1107
 
1108
The decomposition mappings are defined in the UnicodeData, while the decomposition (also
1109
 
1110
termed &quot;full decomposition&quot;) is defined in Chapter 3 to use those mappings
1111
<i>
1112
 
1113
recursively.</i>
1114
 
1115
 
1116
 
1117
<ul>
1118
 
1119
  <li>The canonical decomposition is formed by recursively applying the canonical mappings,
1120
 
1121
    then applying the canonical reordering algorithm. </li>
1122
 
1123
  <li>The compatibility decomposition is formed by recursively applying the canonical <em>and</em>
1124
 
1125
    compatibility mappings, then applying the canonical reordering algorithm. </li>
1126
 
1127
</ul>
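<p>The distinction between a decomposition mapping and a full decomposition can be
sketched in Python. The helper names <code>parse_mapping</code> and
<code>decompose</code> are hypothetical, the <code>SAMPLE</code> table holds only three
mappings chosen for illustration, and the canonical reordering step is deliberately left
out of this sketch.</p>

```python
def parse_mapping(field):
    """Split a decomposition field into (tag, list of code values).

    The tag is None for a canonical mapping, e.g. "00DC 0304",
    and a string such as "circle" for "<circle> 0031".
    """
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts[0][1:-1]  # strip the angle brackets
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]


def decompose(code, mappings, compatibility=False):
    """Apply mappings recursively; canonical reordering is NOT applied here."""
    tag, targets = parse_mapping(mappings.get(code, ""))
    # Stop if there is no mapping, or if the mapping is a compatibility
    # mapping and only canonical mappings were requested.
    if not targets or (tag is not None and not compatibility):
        return [code]
    result = []
    for target in targets:
        result.extend(decompose(target, mappings, compatibility))
    return result


# Illustrative subset of field 5 values, keyed by code value.
SAMPLE = {
    0x00DC: "0055 0308",        # U WITH DIAERESIS
    0x01D5: "00DC 0304",        # U WITH DIAERESIS AND MACRON
    0x2460: "<circle> 0031",    # CIRCLED DIGIT ONE
}

canonical = decompose(0x01D5, SAMPLE)                     # [0x55, 0x308, 0x304]
compat = decompose(0x2460, SAMPLE, compatibility=True)    # [0x31]
```

<p>Note that U+01D5 needs two look-ups to reach its full decomposition, which is why the
mappings have to be applied recursively.</p>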
1128
 
1129
 
1130
 
1131
<h3><a NAME="Canonical Combining Classes"></a>Canonical Combining Classes</h3>
1132
 
1133
 
1134
 
1135
<table BORDER="0" CELLSPACING="2" CELLPADDING="0">
1136
 
1137
  <tr>
1138
 
1139
    <th><p ALIGN="LEFT">Value</th>
1140
 
1141
    <th><p ALIGN="LEFT">Description</th>
1142
 
1143
  </tr>
1144
 
1145
  <tr>
1146
 
1147
    <td ALIGN="RIGHT">0:</td>
1148
 
1149
    <td>Spacing, split, enclosing, reordrant, and Tibetan subjoined</td>
1150
 
1151
  </tr>
1152
 
1153
  <tr>
1154
 
1155
    <td ALIGN="RIGHT">1:</td>
1156
 
1157
    <td>Overlays and interior</td>
1158
 
1159
  </tr>
1160
 
1161
  <tr>
1162
 
1163
    <td ALIGN="RIGHT">7:</td>
1164
 
1165
    <td>Nuktas</td>
1166
 
1167
  </tr>
1168
 
1169
  <tr>
1170
 
1171
    <td ALIGN="RIGHT">8:</td>
1172
 
1173
    <td>Hiragana/Katakana voicing marks</td>
1174
 
1175
  </tr>
1176
 
1177
  <tr>
1178
 
1179
    <td ALIGN="RIGHT">9:</td>
1180
 
1181
    <td>Viramas</td>
1182
 
1183
  </tr>
1184
 
1185
  <tr>
1186
 
1187
    <td ALIGN="RIGHT">10:</td>
1188
 
1189
    <td>Start of fixed position classes</td>
1190
 
1191
  </tr>
1192
 
1193
  <tr>
1194
 
1195
    <td ALIGN="RIGHT">199:</td>
1196
 
1197
    <td>End of fixed position classes</td>
1198
 
1199
  </tr>
1200
 
1201
  <tr>
1202
 
1203
    <td ALIGN="RIGHT">200:</td>
1204
 
1205
    <td>Below left attached</td>
1206
 
1207
  </tr>
1208
 
1209
  <tr>
1210
 
1211
    <td ALIGN="RIGHT">202:</td>
1212
 
1213
    <td>Below attached</td>
1214
 
1215
  </tr>
1216
 
1217
  <tr>
1218
 
1219
    <td ALIGN="RIGHT">204:</td>
1220
 
1221
    <td>Below right attached</td>
1222
 
1223
  </tr>
1224
 
1225
  <tr>
1226
 
1227
    <td ALIGN="RIGHT">208:</td>
1228
 
1229
    <td>Left attached (reordrant around single base character)</td>
1230
 
1231
  </tr>
1232
 
1233
  <tr>
1234
 
1235
    <td ALIGN="RIGHT">210:</td>
1236
 
1237
    <td>Right attached</td>
1238
 
1239
  </tr>
1240
 
1241
  <tr>
1242
 
1243
    <td ALIGN="RIGHT">212:</td>
1244
 
1245
    <td>Above left attached</td>
1246
 
1247
  </tr>
1248
 
1249
  <tr>
1250
 
1251
    <td ALIGN="RIGHT">214:</td>
1252
 
1253
    <td>Above attached</td>
1254
 
1255
  </tr>
1256
 
1257
  <tr>
1258
 
1259
    <td ALIGN="RIGHT">216:</td>
1260
 
1261
    <td>Above right attached</td>
1262
 
1263
  </tr>
1264
 
1265
  <tr>
1266
 
1267
    <td ALIGN="RIGHT">218:</td>
1268
 
1269
    <td>Below left</td>
1270
 
1271
  </tr>
1272
 
1273
  <tr>
1274
 
1275
    <td ALIGN="RIGHT">220:</td>
1276
 
1277
    <td>Below</td>
1278
 
1279
  </tr>
1280
 
1281
  <tr>
1282
 
1283
    <td ALIGN="RIGHT">222:</td>
1284
 
1285
    <td>Below right</td>
1286
 
1287
  </tr>
1288
 
1289
  <tr>
1290
 
1291
    <td ALIGN="RIGHT">224:</td>
1292
 
1293
    <td>Left (reordrant around single base character)</td>
1294
 
1295
  </tr>
1296
 
1297
  <tr>
1298
 
1299
    <td ALIGN="RIGHT">226:</td>
1300
 
1301
    <td>Right</td>
1302
 
1303
  </tr>
1304
 
1305
  <tr>
1306
 
1307
    <td ALIGN="RIGHT">228:</td>
1308
 
1309
    <td>Above left</td>
1310
 
1311
  </tr>
1312
 
1313
  <tr>
1314
 
1315
    <td ALIGN="RIGHT">230:</td>
1316
 
1317
    <td>Above</td>
1318
 
1319
  </tr>
1320
 
1321
  <tr>
1322
 
1323
    <td ALIGN="RIGHT">232:</td>
1324
 
1325
    <td>Above right</td>
1326
 
1327
  </tr>
1328
 
1329
  <tr>
1330
 
1331
    <td ALIGN="RIGHT">233:</td>
1332
 
1333
    <td>Double below</td>
1334
 
1335
  </tr>
1336
 
1337
  <tr>
1338
 
1339
    <td ALIGN="RIGHT">234:</td>
1340
 
1341
    <td>Double above</td>
1342
 
1343
  </tr>
1344
 
1345
  <tr>
1346
 
1347
    <td ALIGN="RIGHT">240:</td>
1348
 
1349
    <td>Below (iota subscript)</td>
1350
 
1351
  </tr>
1352
 
1353
</table>
1354
 
1355
 
1356
 
1357
<p><strong>Note: </strong>some of the combining classes in this list do not currently have
1358
 
1359
members but are specified here for completeness.</p>
1360
 
1361
 
1362
 
1363
<h3><a NAME="Decompositions and Normalization"></a>Decompositions and Normalization</h3>
1364
 
1365
 
1366
 
1367
<p>Decomposition is specified in Chapter 3. <a href="http://www.unicode.org/unicode/reports/tr15/"><i>Unicode Technical Report #15:
1368
 
1369
Normalization Forms</i></a> specifies the interaction between decomposition and normalization. The
1370
 
1371
most up-to-date version is found on <a HREF="http://www.unicode.org/unicode/reports/tr15/">http://www.unicode.org/unicode/reports/tr15/</a>.
1372
 
1373
That report specifies how the decompositions defined in UnicodeData.txt are used to derive
1374
 
1375
normalized forms of Unicode text.</p>
1376
 
1377
 
1378
 
1379
<p>Note that as of the 2.1.9 update of the Unicode Character Database, the decompositions
1380
 
1381
in the UnicodeData.txt file can be used to recursively derive the full decomposition in
1382
 
1383
canonical order, without the need to separately apply canonical reordering. However,
1384
 
1385
canonical reordering of combining character sequences must still be applied in
1386
 
1387
decomposition when normalizing source text which contains any combining marks.</p>
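<p>Canonical reordering of a combining character sequence can be sketched as a stable sort
of each run of characters with a non-zero combining class. The function name
<code>canonical_reorder</code> and the small <code>CCC</code> table are hypothetical; a
real implementation would take the classes from field 3 of UnicodeData.txt and should
follow the definition in the Unicode Standard.</p>

```python
def canonical_reorder(codes, combining_class):
    """Return `codes` with each run of non-zero-class marks stably sorted.

    `combining_class` is a function from a code value to its canonical
    combining class (field 3); characters with class 0 never move.
    """
    out = list(codes)
    i = 0
    while i < len(out):
        if combining_class(out[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and combining_class(out[j]) != 0:
            j += 1
        # sorted() is stable, so marks of equal class keep their order.
        out[i:j] = sorted(out[i:j], key=combining_class)
        i = j
    return out


# Illustrative classes: COMBINING ACUTE ACCENT is 230 (above),
# COMBINING DOT BELOW is 220 (below).
CCC = {0x0301: 230, 0x0323: 220}
reordered = canonical_reorder([0x0061, 0x0301, 0x0323],
                              lambda c: CCC.get(c, 0))
# reordered == [0x0061, 0x0323, 0x0301]
```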
1388
 
1389
 
1390
 
1391
<h3><a NAME="Case Mappings"></a>Case Mappings</h3>
1392
 
1393
 
1394
 
1395
<p>The case mapping is an informative, default mapping. Case itself, on the other hand,
1396
 
1397
has normative status. Thus, for example, 0041 LATIN CAPITAL LETTER A is normatively
1398
 
1399
uppercase, but its lowercase mapping to 0061 LATIN SMALL LETTER A is informative. The
1400
 
1401
reason for this is that case can be considered to be an inherent property of a particular
1402
 
1403
character (and is usually, but not always, derivable from the presence of the terms
1404
 
1405
&quot;CAPITAL&quot; or &quot;SMALL&quot; in the character name), but case mappings between
1406
 
1407
characters are occasionally influenced by local conventions. For example, certain
1408
 
1409
languages, such as Turkish, German, French, or Greek, may have small deviations from the
1410
 
1411
default mappings listed in UnicodeData.</p>

<p>In addition to uppercase and lowercase, because of the inclusion of certain composite
characters for compatibility, such as 01F1 LATIN CAPITAL LETTER DZ, there is a third case,
called <i>titlecase</i>, which is used where the first letter of a word is to be
capitalized (e.g. UPPERCASE, Titlecase, lowercase). An example of such a titlecase letter
is 01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.</p>

<p>The uppercase, titlecase and lowercase fields are only included for characters that
have a single corresponding character of that type. Composite characters (such as
&quot;339D SQUARE CM&quot;) that do not have a single corresponding character of that type
can be cased by decomposition.</p>

<p>For compatibility with existing parsers, UnicodeData only contains case mappings for
characters where they are one-to-one mappings; it also omits information about
context-sensitive case mappings. Information about these special cases can be found in a
separate data file, SpecialCasing.txt,
which has been added starting with the 2.1.8 update to the Unicode data files.
SpecialCasing.txt contains additional informative case mappings that are either not
one-to-one or context-sensitive.</p>
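<p>The three cases, and the difference between a one-to-one mapping and a special casing,
can be observed with a short sketch. It assumes Python's built-in string methods, which
draw on a newer Unicode Character Database than 3.0.0 and apply full (SpecialCasing-style)
mappings; the helper name <tt>case_forms</tt> is hypothetical.</p>

```python
import unicodedata

# The DZ digraph exists in all three cases.
DZ_UPPER, DZ_TITLE, DZ_LOWER = "\u01f1", "\u01f2", "\u01f3"


def case_forms(ch):
    """Return the (uppercase, titlecase, lowercase) forms of a single character."""
    return ch.upper(), ch.title(), ch.lower()


def has_one_to_one_upper(ch):
    """True when uppercasing yields exactly one character."""
    return len(ch.upper()) == 1


dz_forms = case_forms(DZ_LOWER)

# U+00DF LATIN SMALL LETTER SHARP S uppercases to two characters, so this
# mapping cannot be carried in the single-character case fields.
sharp_s_upper = "\u00df".upper()
```

<p>Here <tt>has_one_to_one_upper</tt> is true for the DZ digraph and false for U+00DF,
which is the kind of mapping the text assigns to SpecialCasing.txt.</p>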

<h2><a NAME="Property Invariants"></a>Property Invariants</h2>

<p>Values in UnicodeData.txt are subject to correction as errors are found; however, some
characteristics of the categories themselves can be considered invariants. Applications
may wish to take these invariants into account when choosing how to implement character
properties. The following is a partial list of known invariants for the Unicode Character
Database.</p>

<h4>Database Fields</h4>

<ul>
  <li>The number of fields in UnicodeData.txt is fixed. </li>
  <li>The order of the fields is also fixed. <ul>
      <li>Any additional information about character properties to be added in the future will
        appear in separate data tables, rather than being added on to the existing table or by
        subdivision or reinterpretation of existing fields. </li>
    </ul>
  </li>
</ul>
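<p>Because the number and order of fields are fixed, a parser can address fields by
position and reject any record with a different field count. The sketch below assumes the
semicolon-separated layout of 15 fields, with the uppercase, lowercase and titlecase
mappings in the last three positions; the names <tt>FIELD_COUNT</tt>, <tt>Record</tt> and
<tt>parse_record</tt> are hypothetical.</p>

```python
from typing import NamedTuple

# Assumption: every UnicodeData.txt record has exactly 15 fields.
FIELD_COUNT = 15


class Record(NamedTuple):
    code: int
    name: str
    category: str
    combining_class: int
    uppercase: str
    lowercase: str
    titlecase: str


def parse_record(line):
    """Split one record by position, relying on the fixed field order."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(fields)}")
    return Record(
        code=int(fields[0], 16),
        name=fields[1],
        category=fields[2],
        combining_class=int(fields[3]),
        uppercase=fields[12],
        lowercase=fields[13],
        titlecase=fields[14],
    )


record = parse_record("0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041")
```

<p>An empty string in a case field means that no single-character mapping is listed for
that case.</p>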

<h4>General Category</h4>

<ul>
  <li>There will never be more than 32 General Category values. <ul>
      <li>It is very unlikely that the Unicode Technical Committee will subdivide the General
        Category partition any further, since that can cause implementations to misbehave. Because
        the General Category is limited to 32 values, 5 bits can be used to represent the
        information, and a 32-bit integer can be used as a bitmask to represent arbitrary sets of
        categories. </li>
    </ul>
  </li>
</ul>
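<p>The bitmask representation mentioned above can be sketched as follows. The ordering of
the category abbreviations, and therefore the bit assigned to each, is an arbitrary choice
made for this illustration; the names <tt>CATEGORIES</tt>, <tt>mask_of</tt> and
<tt>in_set</tt> are hypothetical.</p>

```python
# General Category abbreviations; the position in this list is the 5-bit index.
CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
    "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
]
INDEX = {name: i for i, name in enumerate(CATEGORIES)}


def mask_of(*names):
    """Build a bitmask with one bit set per listed category."""
    mask = 0
    for name in names:
        mask |= 1 << INDEX[name]
    return mask


def in_set(category, mask):
    """Test whether a category belongs to the set encoded by mask."""
    return ((mask >> INDEX[category]) & 1) == 1


CASED = mask_of("Lu", "Ll", "Lt")
MARKS = mask_of("Mn", "Mc", "Me")
```

<p>Set union and intersection then reduce to the bitwise operators, for example
<tt>CASED | MARKS</tt>.</p>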

<h4>Combining Classes</h4>

<ul>
  <li>Combining classes are limited to the values 0 to 255. <ul>
      <li>In practice, there are far fewer than 256 values used. Implementations may take
        advantage of this fact for compression, since only the ordering of the non-zero values
        matters for the Canonical Reordering Algorithm. It is possible for up to 256 values to be
        used in the future; however, UTC decisions in the future may restrict the number of values
        to 128, since this has implementation advantages. [Signed bytes can be used without
        widening to ints in Java, for example.] </li>
    </ul>
  </li>
  <li>All characters other than those of General Category M* have the combining class 0. <ul>
      <li>Currently, all characters other than those of General Category Mn have the value 0.
        However, some characters of General Category Me or Mc may be given non-zero values in the
        future. </li>
      <li>The precise values above the value 0 are not invariant: only the relative ordering is
        considered normative. For example, it is not guaranteed in future versions that the class
        of U+05B4 will be precisely 14. </li>
    </ul>
  </li>
</ul>
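<p>Since only the relative ordering of the non-zero classes matters, the classes in use can
be remapped to small consecutive ranks without changing the result of canonical reordering.
The following is a sketch of that compression idea, assuming Python's <tt>unicodedata</tt>
module (whose class values may differ from those of version 3.0.0); the names
<tt>build_rank_table</tt>, <tt>sort_by_class</tt> and <tt>sort_by_rank</tt> are
hypothetical.</p>

```python
import unicodedata


def build_rank_table(classes):
    """Map each distinct non-zero class to a dense rank that preserves order."""
    distinct = sorted({c for c in classes if c != 0})
    table = {0: 0}  # class 0 keeps the value 0
    for rank, value in enumerate(distinct, start=1):
        table[value] = rank
    return table


# A few combining marks whose classes are read from the bundled database.
MARKS = ["\u0302", "\u0323", "\u05b4", "\u0345"]
RANKS = build_rank_table(unicodedata.combining(m) for m in MARKS)


def sort_by_class(marks):
    """Order marks by their original combining class."""
    return sorted(marks, key=unicodedata.combining)


def sort_by_rank(marks):
    """Order marks by the compressed rank instead."""
    return sorted(marks, key=lambda m: RANKS[unicodedata.combining(m)])
```

<p>With far fewer than 128 distinct classes in use, every rank fits in a signed byte, which
is the implementation advantage the bracketed remark refers to.</p>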

<h4>Case</h4>

<ul>
  <li>Characters of type Lu, Lt, or Ll are called <i>cased</i>. All characters with an Upper,
    Lower, or Titlecase mapping are cased characters. <ul>
      <li>However, characters with the General Categories of Lu, Ll, or Lt may not always have
        case mappings, and case mappings may vary by locale. (See
        ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt). </li>
    </ul>
  </li>
</ul>

<h4>Canonical Decomposition</h4>

<ul>
  <li>Canonical mappings are always in canonical order. </li>
  <li>Canonical mappings have only the first of a pair possibly further decomposing. </li>
  <li>Canonical decompositions are &quot;transparent&quot; to other character data: <ul>
      <li><tt>BIDI(a) = BIDI(principal(canonicalDecomposition(a)))</tt> </li>
      <li><tt>Category(a) = Category(principal(canonicalDecomposition(a)))</tt> </li>
      <li><tt>CombiningClass(a) = CombiningClass(principal(canonicalDecomposition(a)))</tt><br>
        where principal(a) is the first character not of type Mn, or the first character if all
        characters are of type Mn. </li>
    </ul>
  </li>
  <li>However, because there are sometimes missing case pairs, and because of some legacy
    characters, it is only generally true that: <ul>
      <li><tt>upper(canonicalDecomposition(a)) = canonicalDecomposition(upper(a))</tt> </li>
      <li><tt>lower(canonicalDecomposition(a)) = canonicalDecomposition(lower(a))</tt> </li>
      <li><tt>title(canonicalDecomposition(a)) = canonicalDecomposition(title(a))</tt> </li>
    </ul>
  </li>
</ul>
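<p>The transparency relations listed above can be spot-checked for individual characters.
This sketch assumes Python's <tt>unicodedata</tt> module (a newer database than 3.0.0) and
single-level canonical mappings; the function names mirror the notation of the list but are
otherwise hypothetical.</p>

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the single-level canonical mapping of ch, or ch itself."""
    mapping = unicodedata.decomposition(ch)
    # Skip characters without a mapping and compatibility ("<tag>") mappings.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(chr(int(code, 16)) for code in mapping.split())


def principal(text):
    """First character not of type Mn, or the first character if all are Mn."""
    for ch in text:
        if unicodedata.category(ch) != "Mn":
            return ch
    return text[0]


def is_transparent(ch):
    """Check the BIDI, Category and CombiningClass relations for one character."""
    p = principal(canonical_decomposition(ch))
    return (
        unicodedata.bidirectional(ch) == unicodedata.bidirectional(p)
        and unicodedata.category(ch) == unicodedata.category(p)
        and unicodedata.combining(ch) == unicodedata.combining(p)
    )


def upper_commutes(ch):
    """Check upper(canonicalDecomposition(a)) = canonicalDecomposition(upper(a))."""
    return canonical_decomposition(ch).upper() == canonical_decomposition(ch.upper())
```

<p>For example, <tt>is_transparent("\u00c1")</tt> compares U+00C1 with its base letter A,
and <tt>is_transparent("\u0344")</tt> exercises the case where every character of the
decomposition is of type Mn. As the text notes, the case relations hold only in general,
so <tt>upper_commutes</tt> may be false for some characters.</p>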

<h2><a NAME="Modification History"></a>Modification History</h2>

<p>This section provides a summary of the changes between update versions of the Unicode
Standard.</p>

<h3><a href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.0.0"> Unicode 3.0.0</a></h3>

<p>Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and
a number of property changes. These are summarized in Appendix D of <em>The Unicode
Standard, Version 3.0.</em></p>

<h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.9">Unicode 2.1.9</a> </h3>

<p>Modifications made for Version 2.1.9 of UnicodeData.txt include:</p>

<ul>
  <li>Corrected combining class for U+05AE HEBREW ACCENT ZINOR. </li>
  <li>Corrected combining class for U+20E1 COMBINING LEFT RIGHT ARROW ABOVE. </li>
  <li>Corrected combining class for U+0F35 and U+0F37 to 220. </li>
  <li>Corrected combining class for U+0F71 to 129. </li>
  <li>Added a decomposition for U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR. </li>
  <li>Added decompositions for several Greek symbol letters: U+03D0..U+03D2, U+03D5,
    U+03D6, U+03F0..U+03F2. </li>
  <li>Removed decompositions from the conjoining jamo block: U+1100..U+11F8. </li>
  <li>Changed decomposition mappings for some Tibetan vowels for consistency in
    normalization. (U+0F71, U+0F73, U+0F77, U+0F79, U+0F81) </li>
  <li>Updated the decomposition mappings for several Vietnamese characters with two diacritics
    (U+1EAC, U+1EAD, U+1EB6, U+1EB7, U+1EC6, U+1EC7, U+1ED8, U+1ED9), so that the recursive
    decomposition can be generated directly in canonically reordered form (not a normative
    change). </li>
  <li>Updated the decomposition mappings for several Arabic compatibility characters involving
    shadda (U+FC5E..U+FC62, U+FCF2..U+FCF4), and two Latin characters (U+1E1C, U+1E1D), so
    that the decompositions are generated directly in canonically reordered form (not a
    normative change). </li>
  <li>Changed BIDI category for: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+2028 LINE
    SEPARATOR. </li>
  <li>Changed BIDI category for extenders of General Category Lm: U+3005, U+3021..U+3035,
    U+FF9E, U+FF9F. </li>
  <li>Changed General Category and BIDI category for the Greek numeral signs: U+0374, U+0375. </li>
  <li>Corrected General Category for U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL. </li>
  <li>Added Unicode 1.0 names for many Tibetan characters (informative). </li>
</ul>

<h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.8">Unicode 2.1.8</a> </h3>

<p>Modifications made for Version 2.1.8 of UnicodeData.txt include:</p>

<ul>
  <li>Added combining class 240 for U+0345 COMBINING GREEK YPOGEGRAMMENI so that
    decompositions involving iota subscript are derivable directly in canonically reordered
    form; this also has a bearing on simplification of casing of polytonic Greek. </li>
  <li>Changes in decompositions related to Greek tonos. These result from the clarification
    that monotonic Greek &quot;tonos&quot; should be equated with U+0301 COMBINING ACUTE,
    rather than with U+030D COMBINING VERTICAL LINE ABOVE. (All Greek characters in the Greek
    block involving &quot;tonos&quot;; some Greek characters in the polytonic Greek in the
    1FXX block.) </li>
  <li>Changed decompositions involving dialytika tonos. (U+0390, U+03B0) </li>
  <li>Changed ternary decompositions to binary. (U+0CCB, U+FB2C, U+FB2D) These changes
    simplify normalization. </li>
  <li>Removed canonical decomposition for Latin Candrabindu. (U+0310) </li>
  <li>Corrected error in canonical decomposition for U+1FF4. </li>
  <li>Added compatibility decompositions to clarify collation tables. (U+2100, U+2101, U+2105,
    U+2106, U+1E9A) </li>
  <li>A series of general category changes to assist the convergence of the Unicode definition
    of identifier with ISO TR 10176: <ul>
      <li>So &gt; Lo: U+0950, U+0AD0, U+0F00, U+0F88..U+0F8B </li>
      <li>Po &gt; Lo: U+0E2F, U+0EAF, U+3006 </li>
      <li>Lm &gt; Sk: U+309B, U+309C </li>
      <li>Po &gt; Pc: U+30FB, U+FF65 </li>
      <li>Ps/Pe &gt; Mn: U+0F3E, U+0F3F </li>
    </ul>
  </li>
  <li>A series of bidi property changes for consistency. <ul>
      <li>L &gt; ET: U+09F2, U+09F3 </li>
      <li>ON &gt; L: U+3007 </li>
      <li>L &gt; ON: U+0F3A..U+0F3D, U+037E, U+0387 </li>
    </ul>
  </li>
  <li>Added case mapping: U+01A6 &lt;-&gt; U+0280 </li>
  <li>Updated symmetric swapping value for guillemets: U+00AB, U+00BB, U+2039, U+203A. </li>
  <li>Changes to combining class values. Most Indic fixed position class non-spacing marks
    were changed to combining class 0. This fixes some inconsistencies in how canonical
    reordering would apply to Indic scripts, including Tibetan. Indic interacting top/bottom
    fixed position classes were merged into single (non-zero) classes as part of this change.
    Tibetan subjoined consonants are changed from combining class 6 to combining class 0. Thai
    pinthu (U+0E3A) moved to combining class 9. Moved two Devanagari stress marks into generic
    above and below combining classes (U+0951, U+0952). </li>
  <li>Corrected placement of semicolon near symmetric swapping field. (U+FA0E, etc., scattered
    positions to U+FA29) </li>
</ul>

<h3>Version 2.1.7</h3>

<p><i>This version was for internal change tracking only, and never publicly released.</i></p>

<h3>Version 2.1.6</h3>

<p><i>This version was for internal change tracking only, and never publicly released.</i></p>

<h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.5">Unicode 2.1.5</a> </h3>

<p>Modifications made for Version 2.1.5 of UnicodeData.txt include:</p>

<ul>
  <li>Changed decomposition for U+FF9E and U+FF9F so that correct collation weighting will
    automatically result from the canonical equivalences. </li>
  <li>Removed canonical decompositions for U+04D4, U+04D5, U+04D8, U+04D9, U+04E0, U+04E1,
    U+04E8, U+04E9 (the implication being that no canonical equivalence is claimed between
    these 8 characters and similar Latin letters), and updated 4 canonical decompositions for
    U+04DB, U+04DC, U+04EA, U+04EB to reflect the implied difference in the base character. </li>
  <li>Added Pi and Pf categories and assigned the relevant quotation marks to those
    categories, based on the Unicode Technical Corrigendum on Quotation Characters. </li>
  <li>Updated many bidi properties, following the advice of the ad hoc committee on bidi,
    and to make the bidi properties of compatibility characters more consistent. </li>
  <li>Changed category of several Tibetan characters: U+0F3E, U+0F3F, U+0F88..U+0F8B to make
    them non-combining, reflecting the combined opinion of Tibetan experts. </li>
  <li>Added case mapping for U+03F2. </li>
  <li>Corrected case mapping for U+0275. </li>
  <li>Added titlecase mappings for U+03D0, U+03D1, U+03D5, U+03D6, U+03F0..U+03F2. </li>
  <li>Corrected compatibility label for U+2121. </li>
  <li>Added specific entries for all the CJK compatibility ideographs, U+F900..U+FA2D, so the
    canonical decomposition for each (the URO character it is equivalent to) can be carried in
    the database. </li>
</ul>

<h3>Version 2.1.4</h3>

<p><i>This version was for internal change tracking only, and never publicly released.</i></p>

<h3>Version 2.1.3</h3>

<p><i>This version was for internal change tracking only, and never publicly released.</i></p>

<h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.2">Unicode 2.1.2</a> </h3>

<p>Modifications made in updating UnicodeData.txt to Version 2.1.2 for the Unicode
Standard, Version 2.1 (from Version 2.0) include:</p>

<ul>
  <li>Added two characters (U+20AC and U+FFFC). </li>
  <li>Amended bidi properties for U+0026, U+002E, U+0040, U+2007. </li>
  <li>Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275, U+03C2, U+1E9B. </li>
  <li>Changed combining order class for U+0F71. </li>
  <li>Corrected canonical decompositions for U+0F73, U+1FBE. </li>
  <li>Changed decomposition for U+FB1F from compatibility to canonical. </li>
  <li>Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB. </li>
  <li>Corrected compatibility decompositions for U+2469, U+246A, U+3358. </li>
</ul>

<h3>Version 2.1.1</h3>

<p><i>This version was for internal change tracking only, and never publicly released.</i></p>

<h3><a HREF="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.0.0">Unicode 2.0.0</a> </h3>

<p>The modifications made in updating UnicodeData.txt for the Unicode
Standard, Version 2.0 include:</p>

<ul>
  <li>Fixed decompositions with TONOS to use correct NSM: 030D. </li>
  <li>Removed old Hangul Syllables; mappings to new characters are in a separate table. </li>
  <li>Marked compatibility decompositions with additional tags. </li>
  <li>Changed old tag names for clarity. </li>
  <li>Revised decompositions to use first-level decomposition, instead of maximal
    decomposition. </li>
  <li>Corrected all known errors in decompositions from earlier versions. </li>
  <li>Added control code names (as old Unicode names). </li>
  <li>Added Hangul Jamo decompositions. </li>
  <li>Added Number category to match properties list in book. </li>
  <li>Fixed categories of Koranic Arabic marks. </li>
  <li>Fixed categories of precomposed characters to match decomposition where possible. </li>
  <li>Added Hebrew cantillation marks and the Tibetan script. </li>
  <li>Added place holders for ranges such as CJK Ideographic Area and the Private Use Area. </li>
  <li>Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of mistakes in the
    database. </li>
</ul>

</body>
</html>