@node Iconv
@chapter Encoding conversions (@file{iconv.h})

This chapter describes the Newlib iconv library.
The iconv function declarations are in
@file{iconv.h}.

@menu
* iconv::                           Encoding conversion routines
* Introduction::                    Introduction to iconv and encodings
* Supported encodings::             The list of currently supported encodings
* iconv design decisions::          General iconv library design issues
* iconv configuration::             iconv-related configure script options
* Encoding names::                  How encodings are named.
* CCS tables::                      CCS tables format and 'mktbl.pl' Perl script
* CES converters::                  CES converters description
* The encodings description file::  The 'encoding.deps' file and 'mkdeps.pl'
* How to add new encoding::         The steps to add new encoding support
* The locale support interfaces::   Locale-related iconv interfaces
* Contact::                         The author contact
@end menu

@page
@include iconv/iconv.def

@page
@node Introduction
@section Introduction
@findex encoding
@findex character set
@findex charset
@findex CES
@findex CCS
@*
The iconv library is intended to convert characters from one encoding to
another. It implements the iconv(), iconv_open() and iconv_close()
calls, which are defined by the Single Unix Specification.
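
@*
As a rough illustration of how the three calls fit together, the following
sketch converts a whole buffer in one step. The helper name
@code{convert_buffer} and its signature are assumptions made for this example
and are not part of the library; only @code{iconv_open}, @code{iconv} and
@code{iconv_close} are the standard calls. The encoding names accepted at run
time depend on how the library was configured.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert `inlen` bytes of `in` (encoded as `from`)
 * into `out` (encoded as `to`).  Returns the number of bytes written to
 * `out`, or (size_t)-1 on any failure.
 */
size_t convert_buffer(const char *from, const char *to,
                      const char *in, size_t inlen,
                      char *out, size_t outlen)
{
    /* Note the argument order: the target encoding comes first. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;   /* iconv() takes a pointer to non-const data */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outlen;

    /* iconv() advances both pointers and decrements both counters. */
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1 || inleft != 0)
        return (size_t)-1;
    return outlen - outleft;
}
```

A caller that needs to handle partial input or a full output buffer would
inspect @code{errno} after a failed @code{iconv} call instead of giving up, as
this sketch does.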

@*
In addition to these user-level interfaces, the iconv library also has
several useful interfaces which are needed to support the coding
capabilities of the Newlib Locale infrastructure.  Since Locale
support also needs to
convert various character sets to and from the @emph{wide character
set}, the iconv library shares its capabilities with the Newlib Locale
subsystem. Moreover, the iconv library supports several features which are
only needed for the Locale infrastructure (for example, the MB_CUR_MAX value).

@*
The Newlib iconv library was created using concepts from another iconv
library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library
was rewritten from scratch and contains a lot of improvements with respect to
the original iconv library.

@*
Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
are often used with various meanings. The following are the definitions of terms
which are used in this documentation as well as in the iconv library
implementation:

@itemize @bullet
@item
@dfn{encoding} - a machine representation of characters by means of bits;

@item
@dfn{Character Set} or @dfn{Charset} - just a collection of
characters, i.e. the encoding is the machine representation of the character set;

@item
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
set of integers (@dfn{character codes});

@item
@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
codes to a sequence of bytes.
@end itemize

@*
Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
ASCII, etc. Encodings are formed by the following chain of steps:

@enumerate
@item
The user has a set of characters which are specific to his or her language (character set).

@item
Each character from this set is uniquely numbered, resulting in a CCS.

@item
Each number from the CCS is converted to a sequence of bits or bytes by means
of a CES, forming some encoding. Thus, a CES may be considered as a
function of a CCS which produces some encoding. Note that a CES may be
applied to more than one CCS.
@end enumerate

@*
Thus, an encoding may be considered as one or more CCS + CES.

@*
Sometimes, there is no CES and in such cases the encoding is equivalent
to the CCS, e.g. KOI8-R or ASCII.

@*
An example of a more complicated encoding is UTF-8 which is the UCS
(or Unicode) CCS plus the UTF-8 CES.
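
@*
The split between a CCS and a CES can be made concrete with a small sketch.
The first function below plays the role of a CCS table lookup (a KOI8-R byte
to its UCS code; only three Cyrillic entries are filled in for brevity), and
the second plays the role of the UTF-8 CES as defined by RFC 3629. The
function names and the @code{UCS_INVALID} marker are invented for this
example and are not the names used inside the library.

```c
#include <stddef.h>
#include <stdint.h>

#define UCS_INVALID 0xFFFFFFFFu   /* hypothetical "no mapping" marker */

/* CCS step: one KOI8-R byte -> UCS code (tiny excerpt of the real table). */
uint32_t koi8r_to_ucs(uint8_t c)
{
    if (c < 0x80)
        return c;                  /* the lower half of KOI8-R is ASCII */
    switch (c) {
    case 0xC1: return 0x0430;      /* CYRILLIC SMALL LETTER A */
    case 0xE1: return 0x0410;      /* CYRILLIC CAPITAL LETTER A */
    case 0xF6: return 0x0416;      /* CYRILLIC CAPITAL LETTER ZHE */
    default:   return UCS_INVALID; /* not present in this excerpt */
    }
}

/* CES step: one UCS code -> UTF-8 bytes.  Returns the byte count (1..4),
 * or 0 if the code cannot be encoded. */
size_t ucs_to_utf8(uint32_t ucs, uint8_t out[4])
{
    if (ucs < 0x80) {
        out[0] = (uint8_t)ucs;
        return 1;
    }
    if (ucs < 0x800) {
        out[0] = (uint8_t)(0xC0 | (ucs >> 6));
        out[1] = (uint8_t)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs >= 0xD800 && ucs <= 0xDFFF)
        return 0;                  /* surrogates are not characters */
    if (ucs < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (ucs >> 12));
        out[1] = (uint8_t)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (ucs >> 18));
        out[1] = (uint8_t)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;
}

/* Chaining both steps converts a KOI8-R byte to UTF-8. */
size_t koi8r_byte_to_utf8(uint8_t c, uint8_t out[4])
{
    uint32_t ucs = koi8r_to_ucs(c);
    return ucs == UCS_INVALID ? 0 : ucs_to_utf8(ucs, out);
}
```

For instance, the KOI8-R byte 0xF6 maps to the UCS code 0x0416, which the
UTF-8 CES serializes as the two bytes 0xD0 0x96.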

@*
The following is a brief list of iconv library features:
@itemize
@item
Generic architecture;
@item
Locale infrastructure support;
@item
Automatic generation of the program code which handles
CES/CCS/Encoding/Names/Aliases dependencies;
@item
The ability to choose a size- or speed-optimized
configuration;
@item
The ability to exclude a lot of unneeded code and data from the linking step.
@end itemize

@page
@node Supported encodings
@section Supported encodings
@findex big5
@findex cp775
@findex cp850
@findex cp852
@findex cp855
@findex cp866
@findex euc_jp
@findex euc_kr
@findex euc_tw
@findex iso_8859_1
@findex iso_8859_10
@findex iso_8859_11
@findex iso_8859_13
@findex iso_8859_14
@findex iso_8859_15
@findex iso_8859_2
@findex iso_8859_3
@findex iso_8859_4
@findex iso_8859_5
@findex iso_8859_6
@findex iso_8859_7
@findex iso_8859_8
@findex iso_8859_9
@findex iso_ir_111
@findex koi8_r
@findex koi8_ru
@findex koi8_u
@findex koi8_uni
@findex ucs_2
@findex ucs_2_internal
@findex ucs_2be
@findex ucs_2le
@findex ucs_4
@findex ucs_4_internal
@findex ucs_4be
@findex ucs_4le
@findex us_ascii
@findex utf_16
@findex utf_16be
@findex utf_16le
@findex utf_8
@findex win_1250
@findex win_1251
@findex win_1252
@findex win_1253
@findex win_1254
@findex win_1255
@findex win_1256
@findex win_1257
@findex win_1258
@*
The following is the list of currently supported encodings. The first column
is the encoding name, the second column is the list of aliases,
the third column gives the names of its CES and CCS components, and the fourth column
is a short description.

@multitable @columnfractions .20 .26 .24 .30
@item
Name
@tab
Aliases
@tab
CES/CCS
@tab
Short description
@item
@tab
@tab
@tab

@item
big5
@tab
csbig5, big_five, bigfive, cn_big5, cp950
@tab
table_pcs / big5, us_ascii
@tab
The encoding for Traditional Chinese.

@item
cp775
@tab
ibm775, cspc775baltic
@tab
table / cp775
@tab
The updated version of CP 437 that supports the Baltic languages.

@item
cp850
@tab
ibm850, 850, cspc850multilingual
@tab
table / cp850
@tab
IBM 850 - the updated version of CP 437 where several Latin 1 characters have been
added instead of some less-often used characters like the line-drawing
and the Greek ones.

@item
cp852
@tab
ibm852, 852, cspcp852
@tab
table / cp852
@tab
IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added
instead of some less-often used characters like the line-drawing and the Greek ones.

@item
cp855
@tab
ibm855, 855, csibm855
@tab
table / cp855
@tab
IBM 855 - the updated version of CP 437 that supports Cyrillic.

@item
cp866
@tab
866, IBM866, CSIBM866
@tab
table / cp866
@tab
IBM 866 - the updated version of CP 855 which more closely follows the logical Russian alphabet
ordering of the alternative variant that is preferred by many Russian users.

@item
euc_jp
@tab
eucjp
@tab
euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
@tab
EUC-JP - The EUC for Japanese.

@item
euc_kr
@tab
euckr
@tab
euc / ksx1001
@tab
EUC-KR - The EUC for Korean.

@item
euc_tw
@tab
euctw
@tab
euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
@tab
EUC-TW - The EUC for Traditional Chinese.

@item
iso_8859_1
@tab
iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
@tab
table / iso_8859_1
@tab
ISO 8859-1:1987 - Latin 1, West European.

@item
iso_8859_10
@tab
iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
@tab
table / iso_8859_10
@tab
ISO 8859-10:1992 - Latin 6, Nordic.

@item
iso_8859_11
@tab
iso8859_11, iso885911
@tab
table / iso_8859_11
@tab
ISO 8859-11 - Thai.

@item
iso_8859_13
@tab
iso_8859_13:1998, iso8859_13, iso885913
@tab
table / iso_8859_13
@tab
ISO 8859-13:1998 - Latin 7, Baltic Rim.

@item
iso_8859_14
@tab
iso_8859_14:1998, iso885914, iso8859_14
@tab
table / iso_8859_14
@tab
ISO 8859-14:1998 - Latin 8, Celtic.

@item
iso_8859_15
@tab
iso885915, iso_8859_15:1998, iso8859_15
@tab
table / iso_8859_15
@tab
ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.

@item
iso_8859_2
@tab
iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
@tab
table / iso_8859_2
@tab
ISO 8859-2:1987 - Latin 2, East European.

@item
iso_8859_3
@tab
iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
@tab
table / iso_8859_3
@tab
ISO 8859-3:1988 - Latin 3, South European.

@item
iso_8859_4
@tab
iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
@tab
table / iso_8859_4
@tab
ISO 8859-4:1988 - Latin 4, North European.

@item
iso_8859_5
@tab
iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
@tab
table / iso_8859_5
@tab
ISO 8859-5:1988 - Cyrillic.

@item
iso_8859_6
@tab
iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
@tab
table / iso_8859_6
@tab
ISO 8859-6:1987 - Arabic.

@item
iso_8859_7
@tab
iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
@tab
table / iso_8859_7
@tab
ISO 8859-7:1987 - Greek.

@item
iso_8859_8
@tab
iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
@tab
table / iso_8859_8
@tab
ISO 8859-8:1988 - Hebrew.

@item
iso_8859_9
@tab
iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
@tab
table / iso_8859_9
@tab
ISO 8859-9:1989 - Latin 5, Turkish.

@item
iso_ir_111
@tab
ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
@tab
table / iso_ir_111
@tab
ISO IR 111/ECMA Cyrillic.

@item
koi8_r
@tab
cskoi8r, koi8r, koi8
@tab
table / koi8_r
@tab
RFC 1489 Cyrillic.

@item
koi8_ru
@tab
koi8ru
@tab
table / koi8_ru
@tab
The obsolete Ukrainian.

@item
koi8_u
@tab
koi8u
@tab
table / koi8_u
@tab
RFC 2319 Ukrainian.

@item
koi8_uni
@tab
koi8uni
@tab
table / koi8_uni
@tab
KOI8 Unified.

@item
ucs_2
@tab
ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
@tab
ucs_2 / (UCS)
@tab
ISO-10646-UCS-2. Big Endian, ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).

@item
ucs_2_internal
@tab
ucs2_internal, ucs_2internal, ucs2internal
@tab
ucs_2_internal / (UCS)
@tab
ISO-10646-UCS-2 in system byte order.
ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).

@item
ucs_2be
@tab
ucs2be
@tab
ucs_2 / (UCS)
@tab
Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).

@item
ucs_2le
@tab
ucs2le
@tab
ucs_2 / (UCS)
@tab
Little Endian version of ISO-10646-UCS-2.
ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).

@item
ucs_4
@tab
ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
@tab
ucs_4 / (UCS)
@tab
ISO-10646-UCS-4. Big Endian, ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).

@item
ucs_4_internal
@tab
ucs4_internal, ucs_4internal, ucs4internal
@tab
ucs_4_internal / (UCS)
@tab
ISO-10646-UCS-4 in system byte order.
ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).

@item
ucs_4be
@tab
ucs4be
@tab
ucs_4 / (UCS)
@tab
Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).

@item
ucs_4le
@tab
ucs4le
@tab
ucs_4 / (UCS)
@tab
Little Endian version of ISO-10646-UCS-4.
ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).

@item
us_ascii
@tab
ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
@tab
us_ascii / (ASCII)
@tab
7-bit ASCII.

@item
utf_16
@tab
utf16
@tab
utf_16 / (UCS)
@tab
RFC 2781 UTF-16. The very first ZWNBSP (U+FEFF) code in the stream is interpreted as BOM.

@item
utf_16be
@tab
utf16be
@tab
utf_16 / (UCS)
@tab
Big Endian version of RFC 2781 UTF-16.
ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).

@item
utf_16le
@tab
utf16le
@tab
utf_16 / (UCS)
@tab
Little Endian version of RFC 2781 UTF-16.
ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).

@item
utf_8
@tab
utf8
@tab
utf_8 / (UCS)
@tab
RFC 3629 UTF-8.

@item
win_1250
@tab
cp1250
@tab
table / win_1250
@tab
Win-1250 - Croatian.

@item
win_1251
@tab
cp1251
@tab
table / win_1251
@tab
Win-1251 - Cyrillic.

@item
win_1252
@tab
cp1252
@tab
table / win_1252
@tab
Win-1252 - Latin 1.

@item
win_1253
@tab
cp1253
@tab
table / win_1253
@tab
Win-1253 - Greek.

@item
win_1254
@tab
cp1254
@tab
table / win_1254
@tab
Win-1254 - Turkish.

@item
win_1255
@tab
cp1255
@tab
table / win_1255
@tab
Win-1255 - Hebrew.

@item
win_1256
@tab
cp1256
@tab
table / win_1256
@tab
Win-1256 - Arabic.

@item
win_1257
@tab
cp1257
@tab
table / win_1257
@tab
Win-1257 - Baltic.

@item
win_1258
@tab
cp1258
@tab
table / win_1258
@tab
Win-1258 - Vietnamese.
@end multitable

@page
@node iconv design decisions
@section iconv design decisions
@findex CCS table
@findex CES converter
@findex Speed-optimized tables
@findex Size-optimized tables
@*
The first iconv library design issue arises when considering the
following two design approaches:

@enumerate
@item
Have modules which implement conversion from encoding A to encoding B
and vice versa, i.e., one conversion module relates to any two encodings.
@item
Have modules which implement conversion from encoding A to a fixed
encoding C and vice versa, i.e., one conversion module relates to
one encoding A and the fixed encoding C. In this case, to convert from
encoding A to encoding B, two modules are needed (in order to convert
from A to C and then from C to B).
@end enumerate

@*
Obviously, there is a tradeoff between commonality/flexibility and
efficiency: the first method is more efficient since it converts
directly; however, it is less flexible since a distinct module is
needed for each encoding pair.

@*
The Newlib iconv model uses the second method and always converts through the 32-bit
UCS, but its design also allows one to write specialized conversion
modules if the conversion speed is critical.

@*
The second design issue is how to break down (decompose) encodings.
The Newlib iconv library uses the fact that any encoding may be
considered as one or more CCS plus a CES. It also decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map CCS to UCS and vice versa; the CES converters
map CCS to the encoding and vice versa.

@*
As an example, let's consider the conversion between the EUC-TW and
big5 encodings. The big5 encoding may be decomposed into the ASCII and BIG5
CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCS-es plus the EUC CES.

@*
The euc_tw -> big5 conversion is performed as follows:

@enumerate
@item
The EUC converter transforms the EUC-TW encoding to the corresponding CCS-es
(the CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14
CCS-es);
@item
The obtained CCS codes are transformed to UCS codes using the CNS11643_PLANE1,
CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
@item
The resulting UCS codes are transformed to ASCII and BIG5 codes using
the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the big5 encoding using the corresponding
CES converter.
@end enumerate

@*
Analogously, the backward (big5 -> euc_tw) conversion is performed as follows:

@enumerate
@item
The BIG5 converter transforms the big5 encoding to the corresponding CCS-es
(the ASCII and BIG5 CCS-es);
@item
The obtained CCS codes are transformed to UCS codes using the ASCII and BIG5 CCS tables;
@item
The resulting UCS codes are transformed to CNS11643_PLANE1, CNS11643_PLANE2
and CNS11643_PLANE14 codes using the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
CES converter.
@end enumerate

@*
Note that the above is just an example; the real names of the CES converters and
the CCS tables implemented in the Newlib iconv are slightly different.
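
@*
The idea of converting through a fixed intermediate encoding can be sketched
in a few lines. In the sketch below each encoding supplies only two functions,
one towards UCS and one away from it, and any pair of encodings is converted
by chaining them. The structure @code{enc_module}, the @code{UCS_INVALID}
marker and all function names are hypothetical; ASCII and ISO 8859-1 are used
only because their mappings to UCS are trivial, and the further split of each
module into a CES converter and CCS tables is left out.

```c
#include <stdint.h>

#define UCS_INVALID 0xFFFFFFFFu   /* hypothetical "no mapping" marker */

/* One module per encoding: it only knows how to reach the UCS pivot. */
struct enc_module {
    uint32_t (*to_ucs)(uint8_t c);       /* encoding -> UCS */
    int      (*from_ucs)(uint32_t ucs);  /* UCS -> encoding, -1 if unmappable */
};

static uint32_t ascii_to_ucs(uint8_t c)   { return c < 0x80 ? c : UCS_INVALID; }
static int      ascii_from_ucs(uint32_t u) { return u < 0x80 ? (int)u : -1; }

/* ISO 8859-1 codes coincide with the first 256 UCS codes. */
static uint32_t latin1_to_ucs(uint8_t c)   { return c; }
static int      latin1_from_ucs(uint32_t u) { return u < 0x100 ? (int)u : -1; }

const struct enc_module ascii_module  = { ascii_to_ucs,  ascii_from_ucs  };
const struct enc_module latin1_module = { latin1_to_ucs, latin1_from_ucs };

/* Convert one single-byte character from `src` to `dst` through UCS.
 * Returns the converted code, or -1 if either step has no mapping. */
int convert_char(const struct enc_module *src,
                 const struct enc_module *dst, uint8_t c)
{
    uint32_t ucs = src->to_ucs(c);
    if (ucs == UCS_INVALID)
        return -1;
    return dst->from_ucs(ucs);
}
```

With this layout, adding an encoding means adding one module rather than one
converter per existing encoding, at the cost of two steps per character.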

@*
The third design issue also relates to flexibility. Obviously, it isn't
desirable to always link all the CES converters and the CCS tables into the library;
instead, we want to be able to load the needed converters and tables
dynamically on demand. This isn't a problem on "big" machines such as
a PC, but it may be very problematic on "small" embedded systems.

@*
Since the CCS tables are just data, it is possible to load them
dynamically from external files.  The CES converters, on the other hand,
are algorithms with some code, so a dynamic library loading
capability is required.

@*
Apart from possible restrictions imposed by embedded systems (small
RAM for example), Newlib itself has no dynamic library support and
therefore, all the CES converters which will ever be used must be linked into
the library.  However, dynamic loading of the CCS tables is possible and is
implemented in the Newlib iconv library.  It may be enabled via the Newlib
configure script options.

@*
The next design issue is fine-tuning the iconv library
configuration.  One important ability is for iconv to not link all its
converters and tables (if dynamic loading is not enabled) but instead
enable only those encodings which are specified at configuration
time (see the section about the configure script options).

@*
In addition, the Newlib iconv library configure options distinguish between
conversion directions. This means that not only are the supported encodings
selectable, but the conversion direction is as well. For example, if a user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can
enable only
this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique allows
one half of a CCS table, which may be quite big, to be excluded from linking).

@*
One more design aspect is the speed- and size-optimized tables. Users can
select between them using configure script options. The
speed-optimized CCS tables are the same as the size-optimized ones in
the case of 8-bit CCS (e.g. KOI8-R), but for 16-bit CCS-es the size-optimized
CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
other hand, conversion with speed-optimized tables is several times faster.

@*
It's worth stressing that new encoding support can't be
dynamically added to an already compiled Newlib library, even if it
needs only an additional CCS table and iconv is configured to use
the external files with CCS tables (this isn't a fundamental restriction,
and the possibility to add new table-based encoding support dynamically, by
means of just adding a new .cct file, may be easily added).

@*
Theoretically, the compiled-in CCS tables should be more appropriate for
embedded systems than dynamically loaded CCS tables.  This is because the
compiled-in tables are read-only and can be placed in ROM
whereas dynamic loading requires RAM.  Moreover, in the current iconv
implementation, a distinct copy of the dynamic CCS file is loaded for each
opened iconv descriptor, even in the case of the same encoding.
This means, for example, that if two iconv descriptors for
"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
of these files).  On the other hand, in the case of compiled-in CCS tables,
there will always be only one copy.

@page
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@findex --enable-newlib-iconv-encodings
@findex --enable-newlib-iconv-from-encodings
@findex --enable-newlib-iconv-to-encodings
@findex --enable-newlib-iconv-external-ccs
@findex NLSPATH
@*
To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
script option should be used. This option accepts a comma-separated list
of @emph{encodings} that should be enabled. The option enables each encoding in both
("to" and "from") directions.

@*
The @option{--enable-newlib-iconv-from-encodings} configure script option enables
"from" support for each encoding that was passed to it.

@*
The @option{--enable-newlib-iconv-to-encodings} configure script option enables
"to" support for each encoding that was passed to it.

@*
Example: if a user plans only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
"KOI8-R -> UCS-2" conversions, the optimal way (minimal iconv
code and data will be linked) is to configure Newlib with the following
options:
@*
@code{--enable-newlib-iconv-encodings=UTF-8
--enable-newlib-iconv-from-encodings=KOI8-R
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
@*
which is the same as
@*
@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
@*
The user may also just use the
@*
@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
@*
configure script option, but this is less optimal since some
unneeded data and code will be linked.

@*
The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
capability to work with external CCS files.

@*
The @option{--enable-target-optspace} Newlib configure script option also affects
the iconv library. If this option is present, the library uses the
size-optimized CCS tables. This means that only the size-optimized CCS
tables will be linked or, if the
@option{--enable-newlib-iconv-external-ccs} configure script option was used,
the iconv library will load the size-optimized tables. If the
@option{--enable-target-optspace} configure script option is disabled,
the speed-optimized CCS tables are used.

@*
Note: .cct files are searched for by iconv_open in the $NLSPATH/iconv_data/ directory.
Thus, the NLSPATH environment variable should be set.
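
@*
As an illustration of the lookup location only, the sketch below builds the
path of an external table from @code{NLSPATH}. The file naming scheme (the CCS
name followed by @file{.cct}) and both function names are assumptions made
for this example; the text above only states the directory that is searched.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: write "<nlspath>/iconv_data/<ccs_name>.cct" to buf.
 * Returns 0 on success, -1 if nlspath is unset/empty or buf is too small. */
int cct_path(const char *nlspath, const char *ccs_name,
             char *buf, size_t buflen)
{
    if (nlspath == NULL || strlen(nlspath) == 0)
        return -1;

    int n = snprintf(buf, buflen, "%s/iconv_data/%s.cct", nlspath, ccs_name);
    if (n < 0 || (size_t)n >= buflen)
        return -1;            /* encoding error or truncated output */
    return 0;
}

/* Same as above, taking the directory from the NLSPATH environment variable. */
int cct_path_from_env(const char *ccs_name, char *buf, size_t buflen)
{
    return cct_path(getenv("NLSPATH"), ccs_name, buf, buflen);
}
```

If @code{NLSPATH} is not set, the helper fails rather than guessing a default
directory, which mirrors the note that the variable should be set.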

@page
@node Encoding names
@section Encoding names
@findex encoding name
@findex encoding alias
@findex normalized name
@*
Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
a user works with the iconv library (i.e., when the @code{iconv_open} call
is used), either the name or any of the aliases may be used. The same applies when encoding
names are used in configure script options.

@*
Names and aliases may be specified in any case (small or capital
letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.
Also, when working with the iconv library,

@*
Internally the Newlib iconv library always converts aliases to names. It
also converts names and aliases to the @dfn{normalized} form, which means
that all capital letters are converted to small letters and the @kbd{-}
symbols are converted to @kbd{_} symbols.
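
@*
The normalization rule is simple enough to sketch directly. The two functions
below are illustrative only and their names are invented for this example:
the first writes the normalized form of a name into a caller-supplied buffer,
and the second compares two names as if both had been normalized. Resolving
an alias to its encoding name is a separate step that is not shown.

```c
#include <ctype.h>
#include <stddef.h>

/* Normalize one character: lower-case letters, and '-' becomes '_'. */
static char norm_char(char ch)
{
    unsigned char c = (unsigned char)ch;
    return c == '-' ? '_' : (char)tolower(c);
}

/* Write the normalized form of `name` into `out` (capacity `outlen`,
 * including the terminating NUL).  Returns 0 on success, -1 if too small. */
int normalize_encoding_name(const char *name, char *out, size_t outlen)
{
    size_t i;

    if (outlen == 0)
        return -1;
    for (i = 0; name[i] != '\0'; i++) {
        if (i + 1 >= outlen)
            return -1;        /* no room left for the terminating NUL */
        out[i] = norm_char(name[i]);
    }
    out[i] = '\0';
    return 0;
}

/* Return 1 if both names have the same normalized form, 0 otherwise. */
int encoding_names_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (norm_char(*a) != norm_char(*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;          /* true only if both strings ended together */
}
```

Under this rule @code{KOI8-R}, @code{koi8-r} and @code{koi8_r} all denote the
same encoding.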

@page
@node CCS tables
@section CCS tables
@findex Size-optimized CCS table
@findex Speed-optimized CCS table
@findex mktbl.pl Perl script
@findex .cct files
@findex The CCT tables source files
@findex CCS source files
@*
The iconv library stores files with CCS tables in the @emph{ccs/}
subdirectory. The CCS tables for any CCS may be kept in two forms - in the binary form
(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
of compilable .c source files. The .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
The .c files are linked into the Newlib library if the corresponding
encoding is enabled.

@*
As stated earlier, the Newlib iconv library performs all
conversions through the 32-bit UCS, but the codes which are used
in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set.
Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
used instead of the 32-bit UCS-4.

@*
CCS tables may be 8 or 16 bits wide. 8-bit CCS tables map 8-bit CCS to
16-bit UCS-2 and vice versa while 16-bit CCS tables map
16-bit CCS to 16-bit UCS-2 and vice versa.
8-bit tables are small while 16-bit tables may be quite big.
Because of this, 16-bit CCS tables may be
either speed- or size-optimized. Size-optimized CCS tables are
smaller than speed-optimized ones, but the conversion process is
slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
the speed-optimized variant.

Each CCS table (both speed- and size-optimized) consists of
@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
UCS-2 codes.

@*
Almost all 16-bit CCS tables contain fewer than 0xFFFF codes and
a lot of gaps exist.

@subsection Speed-optimized tables format
@*
In the case of 8-bit speed-optimized CCS tables the "to_ucs" subtable format is
trivial - it is just an array of 256 16-bit UCS codes. Therefore, the
UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
as @emph{Y = to_ucs[X]}.

@*
Obviously, the simplest way to create the "from_ucs" table or the
16-bit "to_ucs" table is to use a huge 16-bit array, as in the case
of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
fewer than 0xFFFF code maps and this fact may be exploited to reduce
the size of the CCS tables.

@*
In this section the "UCS-2 -> CCS" 8-bit CCS table format is described. The
16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
direction and the number of CCS bits.

@*
In the case of the 8-bit speed-optimized table the "from_ucs" subtable
corresponds to the "from_ucs" array and has the following layout:

1029
@*
1030
from_ucs array:
1031
@*
1032
-------------------------------------
1033
@*
1034
0xFF mapping (2 bytes) (only for
1035
8-bit table).
1036
@*
1037
-------------------------------------
1038
@*
1039
Heading block
1040
@*
1041
-------------------------------------
1042
@*
1043
Block 1
1044
@*
1045
-------------------------------------
1046
@*
1047
Block 2
1048
@*
1049
-------------------------------------
1050
@*
1051
  ...
1052
@*
1053
-------------------------------------
1054
@*
1055
Block N
1056
@*
1057
-------------------------------------
1058
 
1059
@*
1060
The 0x0000-0xFFFF 16-bit code range is divided to 256 code subranges. Each
1061
subrange is represented by an 256-element @dfn{block} (256 1-byte
1062
elements or 256 2-byte element in case of 16-bit CCS table) with
1063
elements which are equivalent to the CCS codes of this subrange.
1064
If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
1065
absent and there will be less then 256 blocks.
1066
 
1067
@*
1068
Any element number @emph{m} of @dfn{the heading block} (which contains
1069
256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
1070
If the subrange contains some codes, the value of the @emph{m}-th element of
1071
the heading block contains the offset of the corresponding block in the
1072
"from_ucs" array. If there is no codes in the subrange, the heading
1073
block element contains 0xFFFF.
1074
 
1075
@*
1076
If there are some gaps in a block, the corresponding block elements have
1077
the 0xFF value. If there is an 0xFF code present in the CCS, it's mapping
1078
is defined in the first 2-byte element of the "from_ucs" array.
1079
 
@*
With this table format, the algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.

@*
@enumerate
@item If @emph{Y} is equal to the value of the first 2-byte element
of the "from_ucs" array, @emph{X} is 0xFF. Otherwise, continue the search.

@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.

@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
is no corresponding CCS code (error, wrong input data). Otherwise, fetch the
"from_ucs" array index of the @emph{BlkN}-th block from that element
(@emph{BlkIdx}).

@item Calculate the offset of the @emph{X} code in its block:
@emph{Xindex = Y & 0xFF}

@item If the @emph{Xindex}-th element of the block (which is
@emph{from_ucs[BlkIdx+Xindex]}) is 0xFF, there is no corresponding
CCS code (error, wrong input data). Otherwise, @emph{X = from_ucs[BlkIdx+Xindex]}.
@end enumerate
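@*
The lookup steps above can be sketched in Python. This is only a sketch under stated assumptions, not the Newlib code: the names @code{build_from_ucs} and @code{find_ccs} are hypothetical, every list slot stands for one element regardless of its real width (1 or 2 bytes), and the heading block values are assumed to be absolute indexes into the list.

```python
INVBLK = 0xFFFF  # heading-block value for a subrange with no codes
INVALC = 0xFF    # block element value for a gap (8-bit CCS)
HEAD = 1         # slot 0 holds the UCS-2 code mapped to CCS code 0xFF


def build_from_ucs(mapping):
    """Build a simplified "from_ucs" list from a UCS-2 -> CCS dict.

    Simplification: every list slot is one element, so the 2-byte and
    1-byte element widths of the real array are not modelled.
    """
    table = [INVBLK] * (HEAD + 256)       # slot 0 plus the heading block
    for ucs, ccs in sorted(mapping.items()):
        if ccs == 0xFF:
            table[0] = ucs                # special slot for CCS code 0xFF
            continue
        blk = ucs >> 8
        if table[HEAD + blk] == INVBLK:   # first code of this subrange
            table[HEAD + blk] = len(table)
            table.extend([INVALC] * 256)  # new block, all gaps so far
        table[table[HEAD + blk] + (ucs & 0xFF)] = ccs
    return table


def find_ccs(table, y):
    """Return the CCS code for UCS-2 code y, or None if there is none."""
    if y == table[0]:                     # step 1: the 0xFF special case
        return 0xFF
    blk_n = (y & 0xFF00) >> 8             # step 2: block number
    blk_idx = table[HEAD + blk_n]         # step 3: index of that block
    if blk_idx == INVBLK:
        return None
    x = table[blk_idx + (y & 0xFF)]       # steps 4 and 5
    return None if x == INVALC else x


# Three sample mappings taken from KOI8-R.
table = build_from_ucs(dict([(0x0041, 0x41), (0x0410, 0xE1), (0x042A, 0xFF)]))
print(hex(find_ccs(table, 0x0410)))  # 0xe1
print(find_ccs(table, 0x0411))       # None (gap inside an existing block)
```

Only two blocks are allocated for this sample (subranges 0x00 and 0x04), which is how gaps in the mapping save space.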
 
@subsection Size-optimized tables format
@*
As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because the difference between the speed-optimized
and the size-optimized table sizes is too small in the case of 8-bit CCS-es.

@*
The formats of the "to_ucs" and "from_ucs" subtables are the same in the case of
size-optimized tables.

This section describes the format of the "UCS-2 -> CCS" size-optimized
CCS table. The format of the "CCS -> UCS-2" table is the same.

The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a sequence of consecutive UCS-2 codes).
The CCS codes ("to" codes) are then stored only for the codes in these
ranges. Individual "from" codes which belong to no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.
 
@*
The following is the layout of the size-optimized table array:

@*
size_arr array:
@*
-------------------------------------
@*
Ranges number (2 bytes)
@*
-------------------------------------
@*
Unranged codes number (2 bytes)
@*
-------------------------------------
@*
Unranged codes array index (2 bytes)
@*
-------------------------------------
@*
Ranges indexes (triads)
@*
-------------------------------------
@*
Ranges
@*
-------------------------------------
@*
Unranged codes array
@*
-------------------------------------
 
@*
The @dfn{Ranges indexes} section of @emph{size_arr} helps to find
the offset of the needed range in @emph{size_arr} and consists of
triads of the following format:
@*
the first code in range, the last code in range, range offset.

@*
The array of these triads is sorted by the first element, therefore it is
possible to quickly find the needed range index.

@*
Each range has a corresponding sub-array containing the "to" codes. These
sub-arrays are stored in the place marked as "Ranges" in the layout
diagram.

@*
The "Unranged codes array" contains ("from" code, "to" code) pairs for
each unranged code. The array of these pairs is sorted by "from" code
values, therefore it is possible to find the needed pair quickly.
 
@*
Note that each range requires 6 bytes to form its index. If, for
example, there are two ranges (1 - 5 and 9 - 10) and one unranged code
(7), 12 bytes are needed for the two range indexes and 4 bytes for the unranged
code (16 in total). But it is better to join both ranges as 1 - 10 and
mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
range index and 4 bytes to mark codes 6 and 8 as absent are needed
(10 bytes in total). This optimization is done in the size-optimized tables.
Thus, ranges may contain small gaps. The absent codes in ranges are marked
as 0xFFFF.

@*
Note that a pair of consecutive "from" codes is stored as two unranged codes, since
a range would need more 2-byte elements than two unranged codes do
(5 against 4).
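@*
The trade-off can be checked with a little arithmetic. The sketch below uses a hypothetical helper, @code{layout_bytes}, which counts every stored 2-byte element (the triads, the "to" codes inside ranges including gap markers, and the unranged pairs) but not the three header elements. Because it also counts the "to" codes themselves, the absolute numbers are larger than the overhead-only figures quoted above, but the comparison comes out the same way.

```python
ELEM = 2  # every element of the size-optimized array is 2 bytes wide


def layout_bytes(ranges, unranged_count):
    """Bytes used by range triads, range sub-arrays and unranged pairs.

    ranges is a list of (first, last) pairs; a gap inside a range still
    occupies one element, because it holds the 0xFFFF marker.
    """
    triads = 3 * ELEM * len(ranges)
    to_codes = sum((last - first + 1) * ELEM for first, last in ranges)
    pairs = 2 * ELEM * unranged_count
    return triads + to_codes + pairs


# Two ranges plus one unranged code, versus one joined range with gaps.
split = layout_bytes([(1, 5), (9, 10)], unranged_count=1)
joined = layout_bytes([(1, 10)], unranged_count=0)
print(split, joined)                    # 30 26

# Two consecutive codes: stored as a range, versus as two unranged codes.
pair_as_range = layout_bytes([(20, 21)], unranged_count=0)
pair_as_unranged = layout_bytes([], unranged_count=2)
print(pair_as_range, pair_as_unranged)  # 10 8
```

The last two values correspond to 5 and 4 two-byte elements respectively.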
 
@*
The algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
CCS" size-optimized table is as follows.

@*
@enumerate
@item Try to find the corresponding triad in the "Ranges indexes"
section. Since the triads are sorted, a binary search can be used
(divide by 2, compare, etc).

@item If the triad is found, fetch the @emph{X} code from the corresponding
range array. If it is 0xFFFF, return an error.

@item If there is no corresponding triad, search for the @emph{X} code among the
sorted unranged codes. Return an error if nothing was found.
@end enumerate
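@*
The following Python sketch walks through the same three steps. It is an illustration under stated assumptions rather than the Newlib implementation: @code{build_size_arr} and @code{find_ccs} are hypothetical names, the array is modelled as a list with one slot per 2-byte element, and range offsets as well as the unranged array index are assumed to be absolute indexes into that list.

```python
INVALC = 0xFFFF  # marks an absent code inside a range


def build_size_arr(ranges, unranged):
    """Pack ranges and unranged pairs into one flat list.

    ranges: list of (first_code, list_of_to_codes), sorted by first_code.
    unranged: list of (from_code, to_code), sorted by from_code.
    Assumption: offsets are absolute indexes into the returned list.
    """
    arr = [len(ranges), len(unranged), 0]
    offset = 3 + 3 * len(ranges)           # first slot after the triads
    for first, to_codes in ranges:
        arr += [first, first + len(to_codes) - 1, offset]
        offset += len(to_codes)
    for _, to_codes in ranges:
        arr += to_codes
    arr[2] = len(arr)                      # where the unranged pairs start
    for pair in unranged:
        arr += pair
    return arr


def find_ccs(arr, y):
    """Return the CCS code for UCS-2 code y, or None on error."""
    lo, hi = 0, arr[0] - 1                 # binary search over the triads
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last, offset = arr[3 + 3 * mid: 6 + 3 * mid]
        if y < first:
            hi = mid - 1
        elif y > last:
            lo = mid + 1
        else:
            x = arr[offset + (y - first)]
            return None if x == INVALC else x
    lo, hi = 0, arr[1] - 1                 # binary search over the pairs
    while lo <= hi:
        mid = (lo + hi) // 2
        code = arr[arr[2] + 2 * mid]
        if y < code:
            hi = mid - 1
        elif y > code:
            lo = mid + 1
        else:
            return arr[arr[2] + 2 * mid + 1]
    return None


# One range (1 - 10 with codes 6 and 8 absent) and two unranged codes.
arr = build_size_arr(
    [(1, [0xA1, 0xA2, 0xA3, 0xA4, 0xA5, INVALC, 0xA7, INVALC, 0xA9, 0xAA])],
    [(0x20, 0xB0), (0x30, 0xB1)],
)
print(find_ccs(arr, 7))     # 167 (0xA7, found through the range)
print(find_ccs(arr, 6))     # None (gap inside the range)
print(find_ccs(arr, 0x30))  # 177 (0xB1, found among the unranged codes)
```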
 
@subsection .cct and .c CCS Table files
@*
The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
speed-optimized tables. The .c source files for 16-bit CCS tables have
"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
tables.

@*
When .c files are compiled and used, all the 16-bit and 32-bit values
have the native endian format (Big Endian for BE systems and Little
Endian for LE systems) since they are compiled for the system before
they are used.

@*
The .cct files, which are intended for dynamic CCS table
loading, store the CCS tables in either LE or BE format. Since the
.cct files are generated by the 'mktbl.pl' Perl script, it is possible
to choose the endianness of the tables. It is also possible to store two
copies (both LE and BE) of the CCS tables in one .cct file. The default
.cct files (which come with the Newlib sources) have both LE and BE CCS
tables. The Newlib iconv library automatically chooses the needed CCS tables
(with the appropriate endianness).
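@*
The reason a loader must pick the right copy can be shown with a small Python sketch. It does not reproduce the real .cct file layout, which is not described here; @code{unpack_u16} and @code{pick_native} are hypothetical helpers that only demonstrate how the same 16-bit values read differently in the two byte orders.

```python
import struct
import sys


def unpack_u16(data, byteorder):
    """Decode bytes into 16-bit unsigned values ('little' or 'big')."""
    prefix = "<" if byteorder == "little" else ">"
    return list(struct.unpack("%s%dH" % (prefix, len(data) // 2), data))


def pick_native(le_copy, be_copy):
    """Choose the copy matching the host, as a loader holding both might."""
    return le_copy if sys.byteorder == "little" else be_copy


values = [0x0410, 0x00E1]
le_copy = struct.pack("<2H", *values)   # the table stored little-endian
be_copy = struct.pack(">2H", *values)   # the same table stored big-endian

# Reading a copy with the wrong byte order swaps the bytes of each value.
print([hex(v) for v in unpack_u16(le_copy, "big")])  # ['0x1004', '0xe100']
print(unpack_u16(pick_native(le_copy, be_copy), sys.byteorder) == values)  # True
```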
 
@*
Note that the .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure option is used.

@subsection The 'mktbl.pl' Perl script
@*
The 'mktbl.pl' script is intended to generate .cct and .c CCS table
files from the @dfn{CCS source files}.

@*
The CCS source files are just text files which have one or more columns
with the CCS <-> UCS-2 code mapping. For an example of a CCS table
source file, see one of the URLs given below.
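@*
To make the file format concrete, here is a minimal Python sketch of reading such a file. It is not how 'mktbl.pl' works internally; @code{parse_ccs_source} is a hypothetical helper, and it assumes the usual Unicode.org mapping file conventions (whitespace-separated hexadecimal columns, with everything after a hash sign treated as a comment). The column positions are parameters because some files, such as JIS0208.TXT, hold more than two columns.

```python
def parse_ccs_source(text, ccs_col=0, ucs_col=1):
    """Return a CCS -> UCS-2 dict parsed from mapping-file text.

    Columns are counted from 0; everything after '#' is a comment.
    Lines that lack either column or hold non-hex fields are skipped.
    """
    mapping = dict()
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) <= max(ccs_col, ucs_col):
            continue
        try:
            mapping[int(fields[ccs_col], 16)] = int(fields[ucs_col], 16)
        except ValueError:
            continue
    return mapping


SAMPLE = """\
# a fragment in the style of KOI8-R.TXT
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0x80\t0x2500\t# BOX DRAWINGS LIGHT HORIZONTAL
0xE1\t0x0410\t# CYRILLIC CAPITAL LETTER A
"""

mapping = parse_ccs_source(SAMPLE)
print(hex(mapping[0xE1]))  # 0x410
```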
 
@*
The following table describes where the source files for CCS table files
provided by the Newlib distribution are located.

@multitable @columnfractions .25 .75
@item
Name
@tab
URL

@item
@tab

@item
big5
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

@item
cns11643_plane1
cns11643_plane14
cns11643_plane2
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT

@item
cp775
cp850
cp852
cp855
cp866
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/

@item
iso_8859_1
iso_8859_2
iso_8859_3
iso_8859_4
iso_8859_5
iso_8859_6
iso_8859_7
iso_8859_8
iso_8859_9
iso_8859_10
iso_8859_11
iso_8859_13
iso_8859_14
iso_8859_15
@tab
http://www.unicode.org/Public/MAPPINGS/ISO8859/

@item
iso_ir_111
@tab
http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT

@item
jis_x0201_1976
jis_x0208_1990
jis_x0212_1990
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT

@item
koi8_r
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT

@item
koi8_ru
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT

@item
koi8_u
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT

@item
koi8_uni
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT

@item
ksx1001
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT

@item
win_1250
win_1251
win_1252
win_1253
win_1254
win_1255
win_1256
win_1257
win_1258
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
@end multitable
 
The CCS source files aren't distributed with Newlib because of license
restrictions in most Unicode.org's files.

The following are the 'mktbl.pl' options which were used to generate the .cct
files. Note that to generate the CCS table source (.c) files, the @option{-s} option
should be added.

@enumerate
@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
win_1256.cct, win_1258.cct, win_1251.cct,
win_1253.cct, win_1255.cct, win_1257.cct,
koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
files, only the @option{-i <SRC_FILE_NAME>} option was used.

@item To generate the jis_x0208_1990.cct file, the
@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.

@item To generate the cns11643_plane1.cct file, the
@option{-i cns11643.txt -p1 -N cns11643_plane1  -o cns11643_plane1.cct}
options were used.

@item To generate the cns11643_plane2.cct file, the
@option{-i cns11643.txt -p2 -N cns11643_plane2  -o cns11643_plane2.cct}
options were used.

@item To generate the cns11643_plane14.cct file, the
@option{-i cns11643.txt -p0xE -N cns11643_plane14  -o cns11643_plane14.cct}
options were used.
@end enumerate
 
@*
For more information about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.

@*
It is assumed that CCS codes are 16 or fewer bits wide. If there are wider CCS codes
in the CCS source file, the bits above the low 16 define the plane (see the
cns11643.txt CCS source file).
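@*
The plane rule amounts to a shift and a mask, as the following Python sketch shows. The helper names @code{split_plane} and @code{select_plane} are hypothetical and the sample mapping values are illustrative only; the point is just that a code such as 0xE2121 splits into plane 0xE (14) and the 16-bit code 0x2121.

```python
def split_plane(code):
    """Split a wide CCS code into (plane, 16-bit code)."""
    return code >> 16, code & 0xFFFF


def select_plane(mapping, plane):
    """Keep only the entries of one plane, keyed by their 16-bit code."""
    result = dict()
    for code, ucs in mapping.items():
        p, low = split_plane(code)
        if p == plane:
            result[low] = ucs
    return result


# Illustrative CCS -> UCS-2 entries with the plane in the high bits.
wide = dict([(0x12121, 0x3000), (0x22121, 0x4E42), (0xE2121, 0x4E28)])
print(split_plane(0xE2121) == (0xE, 0x2121))             # True
print(select_plane(wide, 1) == dict([(0x2121, 0x3000)])) # True
```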
 
@*
Sometimes it is impossible to map some CCS codes to 16-bit UCS-2 if, for example,
several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
a pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
codes}) aren't simply rejected; instead, they are mapped to the default
UCS-2 code (which is currently the code of the @kbd{?} character).
 
@page
@node CES converters
@section CES converters
@findex PCS
@*
Similar to the CCS tables, CES converters are also split into "from UCS"
and "to UCS" parts. Depending on the iconv library configuration, these
parts are enabled or disabled.

@*
The following is the list of CES converters which are currently present
in the Newlib iconv library.
 
@itemize @bullet
@item
@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
encodings. The @emph{euc} CES converter uses the @emph{table} and the
@emph{us_ascii} CES converters.

@item
@emph{table} - this CES converter corresponds to "null" and just performs
table-based conversion using 8- and 16-bit CCS tables. This converter
is also used by any other CES converter which needs CCS table-based
conversions. The @emph{table} converter is also responsible for loading
.cct files.

@item
@emph{table_pcs} - this is a wrapper over the @emph{table} converter
which is intended for 16-bit encodings which also use the @dfn{Portable
Character Set} (@dfn{PCS}), which is the same as @emph{US-ASCII}.
This means that if the first byte of the CCS code is in the range [0x00-0x7f],
it is a 7-bit PCS code. Otherwise, it is a 16-bit CCS code. Of course,
the 16-bit codes must not contain bytes in the range of [0x00-0x7f].
The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the
@emph{table_pcs} CES converter depends on the @emph{table} CES converter.

@item
@emph{ucs_2} - intended for @emph{ucs_2}, @emph{ucs_2be} and
@emph{ucs_2le} encoding support.

@item
@emph{ucs_4} - intended for @emph{ucs_4}, @emph{ucs_4be} and
@emph{ucs_4le} encoding support.

@item
@emph{ucs_2_internal} - intended for @emph{ucs_2_internal} encoding support.

@item
@emph{ucs_4_internal} - intended for @emph{ucs_4_internal} encoding support.

@item
@emph{us_ascii} - intended for @emph{us_ascii} encoding support. In
principle, the most natural way to support the @emph{us_ascii} encoding
is to define the @emph{us_ascii} CCS and use the @emph{table} CES
converter. But for optimization purposes, the specialized
@emph{us_ascii} CES converter was created.

@item
@emph{utf_16} - intended for @emph{utf_16}, @emph{utf_16be} and
@emph{utf_16le} encoding support.

@item
@emph{utf_8} - intended for @emph{utf_8} encoding support.
@end itemize
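@*
The first-byte rule used by the @emph{table_pcs} converter can be sketched in Python as follows. This is an illustration only, not the converter's code: @code{split_pcs} is a hypothetical helper that merely separates the input bytes into 1-byte PCS codes and 2-byte CCS codes, leaving the table lookup itself aside.

```python
def split_pcs(data):
    """Split bytes into PCS (1-byte) and CCS (2-byte) codes.

    Returns a list of ("pcs", code) or ("ccs", code) tuples; raises
    ValueError if the input ends in the middle of a 2-byte code.
    """
    out = []
    i = 0
    while i < len(data):
        if data[i] <= 0x7F:               # first byte in [0x00-0x7f]: PCS
            out.append(("pcs", data[i]))
            i += 1
        else:                             # otherwise a 16-bit CCS code
            if i + 1 >= len(data):
                raise ValueError("truncated 2-byte code")
            out.append(("ccs", (data[i] << 8) | data[i + 1]))
            i += 2
    return out


codes = split_pcs(b"A\xa4\xa1B")
print(codes == [("pcs", 0x41), ("ccs", 0xA4A1), ("pcs", 0x42)])  # True
```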
 
@page
@node The encodings description file
@section The encodings description file
@findex encoding.deps description file
@findex mkdeps.pl Perl script
@*
To simplify the process of adding support for new encodings, a lot of
"glue" files are generated automatically.

@*
The 'encoding.deps' file in the @emph{lib/} subdirectory
is used to describe the properties of each encoding. The 'mkdeps.pl' Perl script
uses 'encoding.deps' to generate the "glue" files.

@*
The 'encoding.deps' file is composed of sections. Each section consists
of entries, and each entry contains an encoding, CES or CCS description.

@*
The syntax of the 'encoding.deps' file is very simple. Currently only two
sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.

@*
Each entry of the @emph{ENCODINGS} section describes one encoding and
contains the following information.
 
@itemize @bullet
@item
Encoding name (the @emph{ENCODING} field). The name should
be unique and only one name is possible.

@item
The encoding's CES converter name (the @emph{CES} field). Only one CES
converter is allowed.

@item
The whitespace-separated list of CCS table names which are used by the
encoding (the @emph{CCS} field).

@item
The whitespace-separated list of alias names (the @emph{ALIASES}
field).
@end itemize
 
@*
Note that all names in the 'encoding.deps' file must be in the normalized
form.

@*
Each entry of the @emph{CES_DEPENDENCIES} section describes the dependencies of
one CES converter. For example, the @emph{euc} CES converter depends on
the @emph{table} and the @emph{us_ascii} CES converters since the
@emph{euc} CES converter uses them. This means that both the @emph{table}
and @emph{us_ascii} CES converters should be linked if the @emph{euc}
CES converter is enabled.

@*
Each entry of the @emph{CES_DEPENDENCIES} section defines the following:

@itemize @bullet
@item
the CES converter name for which the dependencies are defined in this
entry (the @emph{CES} field);

@item
the whitespace-separated list of CES converters which are needed for
this CES converter (the @emph{USED_CES} field).
@end itemize
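@*
What the two sections express together can be sketched in Python. The data below is a hypothetical in-memory form of a few entries, not the real file syntax, and the helper names @code{ces_closure} and @code{required_parts} are assumptions made for illustration; only the @emph{euc}, @emph{table_pcs} and @emph{big5} relationships are taken from this chapter, the rest is sample data.

```python
# Hypothetical in-memory form of a few 'encoding.deps' entries.
ENCODINGS = dict(
    big5=dict(ces="table_pcs", ccs=["big5"]),
    koi8_r=dict(ces="table", ccs=["koi8_r"]),
    utf_8=dict(ces="utf_8", ccs=[]),
)
CES_DEPENDENCIES = dict(
    euc=["table", "us_ascii"],
    table_pcs=["table"],
)


def ces_closure(ces):
    """Return ces plus every CES converter it needs, directly or not."""
    needed, stack = set(), [ces]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(CES_DEPENDENCIES.get(name, []))
    return needed


def required_parts(enabled):
    """Return (CES converters, CCS tables) needed by the enabled encodings."""
    ces, ccs = set(), set()
    for enc in enabled:
        entry = ENCODINGS[enc]
        ces |= ces_closure(entry["ces"])
        ccs |= set(entry["ccs"])
    return ces, ccs


ces, ccs = required_parts(["big5", "utf_8"])
print(sorted(ces))  # ['table', 'table_pcs', 'utf_8']
print(sorted(ccs))  # ['big5']
```

Enabling @emph{big5} alone therefore pulls in the @emph{table} converter as well, without the user naming it.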
 
@*
The 'mkdeps.pl' Perl script automatically solves the following tasks.

@itemize @bullet
@item
The user works with the iconv library in terms of encodings and doesn't need to know
anything about CES converters and CCS tables. The script automatically
generates code which enables all the CES converters and CCS tables needed
for the encodings which were enabled by the user.

@item
The CES converters may have dependencies and the script automatically
generates the code which handles these dependencies.

@item
The list of encoding aliases is also automatically generated.

@item
The script uses a lot of macros in order to enable only the minimum set
of code/data which is needed to support the requested encodings in the
requested directions.
@end itemize
 
@*
The 'mkdeps.pl' Perl script interprets the 'encoding.deps'
file and generates the following files.

@itemize @bullet
@item
@emph{lib/encnames.h} - this header file contains macro definitions for all
encoding names.

@item
@emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
is used to find the name of the requested encoding by its alias.

@item
@emph{ces/cesbi.c} - this file defines two arrays
(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
descriptions of the enabled "from UCS" and "to UCS" CES converters and the
names of the encodings which are supported by these CES converters.

@item
@emph{ces/cesbi.h} - this file contains the set of macros which define
the set of CES converters which should be enabled when only the set of
enabled encodings is given (through macros defined in the
@emph{newlib.h} file). Note that one CES converter may handle several
encodings.

@item
@emph{ces/cesdeps.h} - the CES converter dependencies are handled in
this file.

@item
@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
here.

@item
@emph{ccs/ccsnames.h} - this header file contains macro definitions for all
CCS names.

@item
@emph{encoding.aliases} - the list of supported encodings and their
aliases, which is intended for the Newlib configure scripts in order to
handle the iconv-related configure script options.
@end itemize
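@*
The purpose of the generated alias array, finding an encoding name by one of its aliases, can be sketched in Python. This is not the format of @emph{lib/aliasesbi.c}; @code{build_alias_index} is a hypothetical helper and the entries are sample data in normalized form.

```python
def build_alias_index(entries):
    """Map every name and alias to its encoding name.

    entries is a list of (name, aliases) pairs; a duplicate name or
    alias raises ValueError because the lookup would be ambiguous.
    """
    index = dict()
    for name, aliases in entries:
        for key in [name] + list(aliases):
            if key in index:
                raise ValueError("duplicate name: " + key)
            index[key] = name
    return index


# Sample entries; the alias lists are illustrative, not complete.
ENTRIES = [
    ("big5", ["csbig5", "big_five"]),
    ("koi8_r", ["cskoi8r"]),
]
INDEX = build_alias_index(ENTRIES)
print(INDEX["csbig5"])      # big5
print(INDEX.get("latin1"))  # None (unknown name)
```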
 
@page
@node How to add new encoding
@section How to add new encoding
@*
First, the new encoding should be broken down into a CCS and a CES. Then,
the process of adding the new encoding is split into the following activities.

@enumerate
@item Generate the .cct CCS file and the .c source file for the new
encoding's CCS (if it isn't already present). To do this, the CCS source
file is needed and the 'mktbl.pl' script should be used.

@item Write the corresponding CES converter (if it isn't already
present). Use the existing CES converters as an example.

@item
Add the corresponding entries to the 'encoding.deps' file and regenerate
the autogenerated "glue" files using the 'mkdeps.pl' script.

@item
Don't forget to add entries to the newlib/newlib.hin file.

@item
Of course, the 'Makefile.am'-s should also be updated (if new files were
added) and the 'Makefile.in'-s should be regenerated using the correct
version of 'automake'.

@item
Don't forget to update the documentation (the list of
supported encodings and CES converters).
@end enumerate

If a new encoding doesn't fit the CES/CCS decomposition model, or
specialized (non UCS-based) conversion support is desired,
the Newlib iconv library code should be upgraded.
 
@page
@node The locale support interfaces
@section The locale support interfaces
@*
The Newlib iconv library also has some interface functions (besides the
@code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
are intended for the Locale subsystem. All the locale-related code is
placed in the @emph{lib/iconvnls.c} file.

@*
The following is the description of the locale-related interfaces:
 
@itemize @bullet
@item
@code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
passed in the function parameters. The @emph{wchar_t} character encoding is
either ucs_2_internal or ucs_4_internal depending on the size of
@emph{wchar_t}.

@item
@code{_iconv_nls_conv} - the function is similar to the @code{iconv}
function, but if there is no character in the output encoding which
corresponds to the character in the input encoding, the default
conversion isn't performed (the @code{iconv} function sets such output
characters to the @kbd{?} symbol, and this is the behavior which is
specified in SUSv3).

@item
@code{_iconv_nls_get_state} - returns the current encoding's shift state
(the @code{mbstate_t} object).

@item
@code{_iconv_nls_set_state} - sets the current encoding's shift state (the
@code{mbstate_t} object).

@item
@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
or stateless.

@item
@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
maximum number of bytes) of the encoding's characters.
@end itemize
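@*
The difference between converting with and without the default conversion can be illustrated by analogy with Python's built-in codecs. The codecs are unrelated to Newlib and the helper @code{convert} is hypothetical; the sketch only shows the two policies described above (substituting the question mark versus reporting an error), not the signature or behavior of @code{_iconv_nls_conv} itself.

```python
def convert(text, encoding, default_conversion):
    """Encode text; unmappable characters become '?' or raise an error."""
    errors = "replace" if default_conversion else "strict"
    return text.encode(encoding, errors)


# With the default conversion, an unmappable character becomes '?'.
print(convert("a\u0416", "ascii", True))  # b'a?'

# Without it, the caller is told that the character has no counterpart.
try:
    convert("a\u0416", "ascii", False)
except UnicodeEncodeError:
    print("no corresponding character")
```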
 
@page
@node Contact
@section Contact
@*
The author of the original BSD iconv library (Alexander Chuguev) no longer
supports that code.

@*
Any questions regarding the iconv library may be forwarded to
Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
well as to the public Newlib mailing list.
 
