@node Iconv
@chapter Encoding conversions (@file{iconv.h})

This chapter describes the Newlib iconv library.
The iconv function declarations are in
@file{iconv.h}.

@menu
* iconv:: Encoding conversion routines
* Introduction:: Introduction to iconv and encodings
* Supported encodings:: The list of currently supported encodings
* iconv design decisions:: General iconv library design issues
* iconv configuration:: iconv-related configure script options
* Encoding names:: How encodings are named.
* CCS tables:: CCS tables format and 'mktbl.pl' Perl script
* CES converters:: CES converters description
* The encodings description file:: The 'encoding.deps' file and 'mkdeps.pl'
* How to add new encoding:: The steps to add new encoding support
* The locale support interfaces:: Locale-related iconv interfaces
* Contact:: The author contact
@end menu

@page
@include iconv/iconv.def

@page
@node Introduction
@section Introduction
@findex encoding
@findex character set
@findex charset
@findex CES
@findex CCS
@*
The iconv library is intended to convert characters from one encoding to
another. It implements the @code{iconv()}, @code{iconv_open()} and @code{iconv_close()}
calls, which are defined by the Single Unix Specification.
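
@*
As an illustration of how the three calls fit together, the following is a minimal sketch of a one-shot string conversion. The helper name @code{convert_string} is invented for this example and is not part of the library; the sketch assumes the whole input and output fit in the supplied buffers and does not try to recover from partial conversions.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated string `in` from encoding `from` to encoding
   `to`, writing at most `outsz` bytes into `out`.  Returns the number of
   bytes written, or (size_t)-1 on failure.  `convert_string` is a
   hypothetical helper for this sketch, not an iconv interface. */
size_t convert_string(const char *from, const char *to,
                      const char *in, char *out, size_t outsz)
{
    /* Note the argument order: the target encoding comes first. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* conversion is not available */

    char *inp = (char *)in;             /* iconv() takes char **; input is only read */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsz;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush any shift state */

    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outsz - outleft;
}
```

A call such as @code{convert_string("KOI8-R", "UTF-8", text, buf, sizeof buf)} would then convert @code{text} in one step, provided both encodings are enabled in the library.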

@*
In addition to these user-level interfaces, the iconv library also has
several useful interfaces which are needed to support the coding
capabilities of the Newlib Locale infrastructure. Since Locale
support also needs to
convert various character sets to and from the @emph{wide character
set}, the iconv library shares its capabilities with the Newlib Locale
subsystem. Moreover, the iconv library supports several features which are
only needed for the Locale infrastructure (for example, the MB_CUR_MAX value).

@*
The Newlib iconv library was created using concepts from another iconv
library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library
was rewritten from scratch and contains many improvements over
the original iconv library.

@*
Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
are often used with various meanings. The following are the definitions of terms
which are used in this documentation as well as in the iconv library
implementation:

@itemize @bullet
@item
@dfn{encoding} - a machine representation of characters by means of bits;

@item
@dfn{Character Set} or @dfn{Charset} - just a collection of
characters, i.e. the encoding is the machine representation of the character set;

@item
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
set of integers (@dfn{character codes});

@item
@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
codes to a sequence of bytes.
@end itemize

@*
Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
ASCII, etc. Encodings are formed by the following chain of steps:

@enumerate
@item
The user has a set of characters which are specific to his or her language (the character set).

@item
Each character from this set is uniquely numbered, resulting in a CCS.

@item
Each number from the CCS is converted to a sequence of bits or bytes by means
of a CES, forming an encoding. Thus, a CES may be considered as a
function of a CCS which produces some encoding. Note that a CES may be
applied to more than one CCS.
@end enumerate

@*
Thus, an encoding may be considered as one or more CCS + CES.

@*
Sometimes there is no CES, and in such cases the encoding is equivalent
to the CCS, e.g. KOI8-R or ASCII.

@*
An example of a more complicated encoding is UTF-8, which is the UCS
(or Unicode) CCS plus the UTF-8 CES.
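
@*
To make the CES notion concrete, here is a small sketch of the UTF-8 CES as a function from a UCS character code to a byte sequence, following RFC 3629. The function name @code{utf8_ces_encode} is illustrative only and does not correspond to any Newlib symbol.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS character code as UTF-8 (RFC 3629).  Returns the number
   of bytes stored in `out` (1..4), or 0 if `ucs` cannot be encoded.
   `utf8_ces_encode` is an illustrative name, not a Newlib function. */
size_t utf8_ces_encode(uint32_t ucs, unsigned char out[4])
{
    if (ucs < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (ucs >= 0xD800 && ucs <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (ucs >> 18));
        out[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

The CCS part (which character has which number) is fixed by UCS; the function above only decides how that number is laid out in bytes.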

@*
The following is a brief list of iconv library features:
@itemize
@item
Generic architecture;
@item
Locale infrastructure support;
@item
Automatic generation of the program code which handles
CES/CCS/Encoding/Names/Aliases dependencies;
@item
The ability to choose a size- or speed-optimized
configuration;
@item
The ability to exclude a lot of unneeded code and data from the linking step.
@end itemize




@page
@node Supported encodings
@section Supported encodings
@findex big5
@findex cp775
@findex cp850
@findex cp852
@findex cp855
@findex cp866
@findex euc_jp
@findex euc_kr
@findex euc_tw
@findex iso_8859_1
@findex iso_8859_10
@findex iso_8859_11
@findex iso_8859_13
@findex iso_8859_14
@findex iso_8859_15
@findex iso_8859_2
@findex iso_8859_3
@findex iso_8859_4
@findex iso_8859_5
@findex iso_8859_6
@findex iso_8859_7
@findex iso_8859_8
@findex iso_8859_9
@findex iso_ir_111
@findex koi8_r
@findex koi8_ru
@findex koi8_u
@findex koi8_uni
@findex ucs_2
@findex ucs_2_internal
@findex ucs_2be
@findex ucs_2le
@findex ucs_4
@findex ucs_4_internal
@findex ucs_4be
@findex ucs_4le
@findex us_ascii
@findex utf_16
@findex utf_16be
@findex utf_16le
@findex utf_8
@findex win_1250
@findex win_1251
@findex win_1252
@findex win_1253
@findex win_1254
@findex win_1255
@findex win_1256
@findex win_1257
@findex win_1258
@*
The following is the list of currently supported encodings. The first column
gives the encoding name, the second column is the list of aliases,
the third column gives the names of its CES and CCS components, and the fourth column
is a short description.

@multitable @columnfractions .20 .26 .24 .30
@item
Name
@tab
Aliases
@tab
CES/CCS
@tab
Short description
@item
@tab
@tab
@tab


@item
big5
@tab
csbig5, big_five, bigfive, cn_big5, cp950
@tab
table_pcs / big5, us_ascii
@tab
The encoding for Traditional Chinese.


@item
cp775
@tab
ibm775, cspc775baltic
@tab
table / cp775
@tab
The updated version of CP 437 that supports the Baltic languages.


@item
cp850
@tab
ibm850, 850, cspc850multilingual
@tab
table / cp850
@tab
IBM 850 - the updated version of CP 437 where several Latin 1 characters have been
added instead of some less-often used characters like the line-drawing
and the Greek ones.


@item
cp852
@tab
ibm852, 852, cspcp852
@tab
table / cp852
@tab
IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added
instead of some less-often used characters like the line-drawing and the Greek ones.


@item
cp855
@tab
ibm855, 855, csibm855
@tab
table / cp855
@tab
IBM 855 - the updated version of CP 437 that supports Cyrillic.


@item
cp866
@tab
866, IBM866, CSIBM866
@tab
table / cp866
@tab
IBM 866 - the updated version of CP 855 which more closely follows the logical Russian alphabet
ordering of the alternative variant that is preferred by many Russian users.


@item
euc_jp
@tab
eucjp
@tab
euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
@tab
EUC-JP - The EUC for Japanese.


@item
euc_kr
@tab
euckr
@tab
euc / ksx1001
@tab
EUC-KR - The EUC for Korean.


@item
euc_tw
@tab
euctw
@tab
euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
@tab
EUC-TW - The EUC for Traditional Chinese.


@item
iso_8859_1
@tab
iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
@tab
table / iso_8859_1
@tab
ISO 8859-1:1987 - Latin 1, West European.


@item
iso_8859_10
@tab
iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
@tab
table / iso_8859_10
@tab
ISO 8859-10:1992 - Latin 6, Nordic.


@item
iso_8859_11
@tab
iso8859_11, iso885911
@tab
table / iso_8859_11
@tab
ISO 8859-11 - Thai.


@item
iso_8859_13
@tab
iso_8859_13:1998, iso8859_13, iso885913
@tab
table / iso_8859_13
@tab
ISO 8859-13:1998 - Latin 7, Baltic Rim.


@item
iso_8859_14
@tab
iso_8859_14:1998, iso885914, iso8859_14
@tab
table / iso_8859_14
@tab
ISO 8859-14:1998 - Latin 8, Celtic.


@item
iso_8859_15
@tab
iso885915, iso_8859_15:1998, iso8859_15
@tab
table / iso_8859_15
@tab
ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.


@item
iso_8859_2
@tab
iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
@tab
table / iso_8859_2
@tab
ISO 8859-2:1987 - Latin 2, East European.


@item
iso_8859_3
@tab
iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
@tab
table / iso_8859_3
@tab
ISO 8859-3:1988 - Latin 3, South European.


@item
iso_8859_4
@tab
iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
@tab
table / iso_8859_4
@tab
ISO 8859-4:1988 - Latin 4, North European.


@item
iso_8859_5
@tab
iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
@tab
table / iso_8859_5
@tab
ISO 8859-5:1988 - Cyrillic.


@item
iso_8859_6
@tab
iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
@tab
table / iso_8859_6
@tab
ISO 8859-6:1987 - Arabic.


@item
iso_8859_7
@tab
iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
@tab
table / iso_8859_7
@tab
ISO 8859-7:1987 - Greek.


@item
iso_8859_8
@tab
iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
@tab
table / iso_8859_8
@tab
ISO 8859-8:1988 - Hebrew.


@item
iso_8859_9
@tab
iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
@tab
table / iso_8859_9
@tab
ISO 8859-9:1989 - Latin 5, Turkish.


@item
iso_ir_111
@tab
ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
@tab
table / iso_ir_111
@tab
ISO IR 111/ECMA Cyrillic.


@item
koi8_r
@tab
cskoi8r, koi8r, koi8
@tab
table / koi8_r
@tab
RFC 1489 Cyrillic.


@item
koi8_ru
@tab
koi8ru
@tab
table / koi8_ru
@tab
The obsolete Ukrainian.


@item
koi8_u
@tab
koi8u
@tab
table / koi8_u
@tab
RFC 2319 Ukrainian.


@item
koi8_uni
@tab
koi8uni
@tab
table / koi8_uni
@tab
KOI8 Unified.


@item
ucs_2
@tab
ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
@tab
ucs_2 / (UCS)
@tab
ISO-10646-UCS-2. Big Endian, U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_2_internal
@tab
ucs2_internal, ucs_2internal, ucs2internal
@tab
ucs_2_internal / (UCS)
@tab
ISO-10646-UCS-2 in system byte order.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_2be
@tab
ucs2be
@tab
ucs_2 / (UCS)
@tab
Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_2le
@tab
ucs2le
@tab
ucs_2 / (UCS)
@tab
Little Endian version of ISO-10646-UCS-2.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_4
@tab
ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
@tab
ucs_4 / (UCS)
@tab
ISO-10646-UCS-4. Big Endian, U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_4_internal
@tab
ucs4_internal, ucs_4internal, ucs4internal
@tab
ucs_4_internal / (UCS)
@tab
ISO-10646-UCS-4 in system byte order.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_4be
@tab
ucs4be
@tab
ucs_4 / (UCS)
@tab
Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_4le
@tab
ucs4le
@tab
ucs_4 / (UCS)
@tab
Little Endian version of ISO-10646-UCS-4.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
us_ascii
@tab
ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
@tab
us_ascii / (ASCII)
@tab
7-bit ASCII.


@item
utf_16
@tab
utf16
@tab
utf_16 / (UCS)
@tab
RFC 2781 UTF-16. The very first U+FEFF code in the stream is interpreted as BOM.


@item
utf_16be
@tab
utf16be
@tab
utf_16 / (UCS)
@tab
Big Endian version of RFC 2781 UTF-16.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
utf_16le
@tab
utf16le
@tab
utf_16 / (UCS)
@tab
Little Endian version of RFC 2781 UTF-16.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
utf_8
@tab
utf8
@tab
utf_8 / (UCS)
@tab
RFC 3629 UTF-8.


@item
win_1250
@tab
cp1250
@tab
table / win_1250
@tab
Win-1250 - Croatian.


@item
win_1251
@tab
cp1251
@tab
table / win_1251
@tab
Win-1251 - Cyrillic.


@item
win_1252
@tab
cp1252
@tab
table / win_1252
@tab
Win-1252 - Latin 1.


@item
win_1253
@tab
cp1253
@tab
table / win_1253
@tab
Win-1253 - Greek.


@item
win_1254
@tab
cp1254
@tab
table / win_1254
@tab
Win-1254 - Turkish.


@item
win_1255
@tab
cp1255
@tab
table / win_1255
@tab
Win-1255 - Hebrew.


@item
win_1256
@tab
cp1256
@tab
table / win_1256
@tab
Win-1256 - Arabic.


@item
win_1257
@tab
cp1257
@tab
table / win_1257
@tab
Win-1257 - Baltic.


@item
win_1258
@tab
cp1258
@tab
table / win_1258
@tab
Win-1258 - Vietnamese.
@end multitable
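
@*
The table distinguishes utf_16, where a leading U+FEFF code is treated as a byte order mark, from the fixed-order variants, where it is an ordinary character. The sketch below shows what BOM detection amounts to at the byte level. The names @code{byte_order} and @code{utf16_detect_bom} are invented for this example, and what the library does when no BOM is present is not stated here, so the sketch simply reports "unknown" in that case.

```c
#include <stddef.h>

enum byte_order { ORDER_UNKNOWN, ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN };

/* Inspect the first two bytes of a UTF-16 stream.  U+FEFF serialized
   big endian is FE FF, little endian is FF FE.  Hypothetical helper,
   not a Newlib function. */
enum byte_order utf16_detect_bom(const unsigned char *buf, size_t len)
{
    if (len < 2)
        return ORDER_UNKNOWN;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return ORDER_BIG_ENDIAN;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return ORDER_LITTLE_ENDIAN;
    return ORDER_UNKNOWN;               /* no BOM: the caller must decide */
}
```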





@page
@node iconv design decisions
@section iconv design decisions
@findex CCS table
@findex CES converter
@findex Speed-optimized tables
@findex Size-optimized tables
@*
The first iconv library design issue arises when considering the
following two design approaches:

@enumerate
@item
Have modules which implement conversion from the encoding A to the encoding B
and vice versa, i.e., one conversion module relates to any two encodings.
@item
Have modules which implement conversion from the encoding A to the fixed
encoding C and vice versa, i.e., one conversion module relates to any
one encoding A and one fixed encoding C. In this case, to convert from
the encoding A to the encoding B, two modules are needed (in order to convert
from A to C and then from C to B).
@end enumerate

@*
Obviously, there is a tradeoff between commonality/flexibility and
efficiency: the first method is more efficient since it converts
directly; however, it is less flexible since a distinct module is needed
for each encoding pair.
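
@*
To put rough numbers on this tradeoff, the following sketch counts the modules each approach needs for @var{n} encodings, assuming (as in the list above) that one module handles both directions. The function names are made up for this illustration.

```c
/* First approach: one bidirectional module per unordered pair of
   encodings, i.e. n * (n - 1) / 2 modules. */
unsigned long pairwise_modules(unsigned long n)
{
    return n < 2 ? 0 : n * (n - 1) / 2;
}

/* Second approach: one bidirectional module per encoding, each converting
   to and from the fixed intermediate encoding, i.e. n modules. */
unsigned long pivot_modules(unsigned long n)
{
    return n;
}
```

For the 50 encodings listed in the previous section this gives 1225 modules for the first approach against 50 for the second.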

@*
The Newlib iconv model uses the second method and always converts through the 32-bit
UCS, but its design also allows one to write specialized conversion
modules if the conversion speed is critical.

@*
The second design issue is how to break down (decompose) encodings.
The Newlib iconv library uses the fact that any encoding may be
considered as one or more CCS plus a CES. It also decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map CCS to UCS and vice versa; the CES converters
map CCS to the encoding and vice versa.

@*
As an example, let's consider the conversion between the EUC-TW encoding and
the big5 encoding. The big5 encoding may be decomposed into the ASCII and BIG5
CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCS-es plus the EUC CES.

@*
The EUC-TW -> big5 conversion is performed as follows:

@enumerate
@item
The EUC converter transforms the EUC-TW encoding to the corresponding CCS-es
(the CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14
CCS-es);
@item
The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1,
CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
@item
The resulting UCS codes are transformed to the ASCII and BIG5 codes using
the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the big5 encoding using the corresponding
CES converter.
@end enumerate

@*
Analogously, the backward conversion is performed as follows:

@enumerate
@item
The BIG5 converter transforms the big5 encoding to the corresponding CCS-es
(the ASCII and BIG5 CCS-es);
@item
The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables;
@item
The resulting UCS codes are transformed to the CNS11643_PLANE1, CNS11643_PLANE2 and
CNS11643_PLANE14 codes using the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
CES converter.
@end enumerate
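
@*
The middle two steps of such a conversion (source CCS codes to UCS, then UCS to target CCS codes) can be sketched in a few lines. Because the real EUC-TW and big5 tables are far too large to show, this example stands in two 8-bit table-based encodings, KOI8-R and ISO 8859-5, with only four Cyrillic letters each; for such encodings there is no separate CES step. All type and function names are invented for the sketch, and the linear search is chosen for brevity, not as a description of the Newlib tables.

```c
#include <stddef.h>
#include <stdint.h>

struct ccs_entry { unsigned char code; uint32_t ucs; };

/* Tiny excerpts of two CCS tables: Cyrillic a, be, A, Be only. */
static const struct ccs_entry koi8_r_excerpt[] = {
    { 0xC1, 0x0430 }, { 0xC2, 0x0431 }, { 0xE1, 0x0410 }, { 0xE2, 0x0411 },
};
static const struct ccs_entry iso_8859_5_excerpt[] = {
    { 0xB0, 0x0410 }, { 0xB1, 0x0411 }, { 0xD0, 0x0430 }, { 0xD1, 0x0431 },
};

#define INVALID_UCS 0xFFFFFFFFu

/* CCS code -> UCS code. */
static uint32_t ccs_to_ucs(const struct ccs_entry *t, size_t n, unsigned char code)
{
    if (code < 0x80)
        return code;                    /* both encodings share the ASCII half */
    for (size_t i = 0; i < n; i++)
        if (t[i].code == code)
            return t[i].ucs;
    return INVALID_UCS;
}

/* UCS code -> CCS code, or -1 if the character is not in the table. */
static int ucs_to_ccs(const struct ccs_entry *t, size_t n, uint32_t ucs)
{
    if (ucs < 0x80)
        return (int)ucs;
    for (size_t i = 0; i < n; i++)
        if (t[i].ucs == ucs)
            return t[i].code;
    return -1;
}

/* Convert one KOI8-R byte to ISO 8859-5 through the UCS intermediate. */
int koi8_r_to_iso_8859_5(unsigned char code)
{
    uint32_t ucs = ccs_to_ucs(koi8_r_excerpt,
                              sizeof koi8_r_excerpt / sizeof koi8_r_excerpt[0], code);
    if (ucs == INVALID_UCS)
        return -1;
    return ucs_to_ccs(iso_8859_5_excerpt,
                      sizeof iso_8859_5_excerpt / sizeof iso_8859_5_excerpt[0], ucs);
}
```

The backward direction reuses the same two tables with their roles swapped, which is why each encoding needs only one table pair regardless of how many other encodings are enabled.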

@*
Note that the above is just an example, and the real names (as implemented
in the Newlib iconv) of the CES converters and the CCS tables are slightly different.

@*
The third design issue also relates to flexibility. Obviously, it isn't
desirable to always link all the CES converters and the CCS tables to the library;
instead, we want to be able to load the needed converters and tables
dynamically on demand. This isn't a problem on "big" machines such as
a PC, but it may be very problematic within "small" embedded systems.

@*
Since the CCS tables are just data, it is possible to load them
dynamically from external files. The CES converters, on the other hand,
are algorithms with some code, so a dynamic library loading
capability is required.

@*
Apart from possible restrictions imposed by embedded systems (small
RAM for example), Newlib itself has no dynamic library support and
therefore, all the CES converters which will ever be used must be linked into
the library. However, dynamic loading of the CCS tables is possible and is
implemented in the Newlib iconv library. It may be enabled via the Newlib
configure script options.

@*
The next design issue is fine-tuning the iconv library
configuration. One important ability is for iconv to not link all its
converters and tables (if dynamic loading is not enabled) but instead
enable only those encodings which are specified at configuration
time (see the section about the configure script options).

@*
In addition, the Newlib iconv library configure options distinguish between
conversion directions. This means that not only are supported encodings
selectable, the conversion direction is as well. For example, if the user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can
enable only
this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique allows
one half of a CCS table, which may be quite big, to be excluded from linking).

@*
One more design aspect is the speed- and size-optimized tables. Users can
select between them using configure script options. The
speed-optimized CCS tables are the same as the size-optimized ones in
the case of 8-bit CCS (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
other hand, conversion with speed tables is several times faster.
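
@*
The general shape of this size/speed tradeoff can be illustrated with two hypothetical layouts for the same UCS-to-code mapping. This is only a sketch of the idea: it does not reproduce the actual Newlib table formats, and every name in it is invented for the example. The compact layout stores only the mapped codes and searches them; the direct layout reserves one slot per possible UCS value in a range and indexes it.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct pair { uint16_t ucs; unsigned char code; };

/* Size-oriented layout: only mapped characters, sorted by UCS value. */
static const struct pair compact[] = {
    { 0x0410, 0xE1 }, { 0x0411, 0xE2 }, { 0x0430, 0xC1 }, { 0x0431, 0xC2 },
};

static int cmp_pair(const void *key, const void *elem)
{
    uint16_t k = *(const uint16_t *)key;
    uint16_t e = ((const struct pair *)elem)->ucs;
    return (k > e) - (k < e);
}

/* Binary search: small table, several comparisons per lookup. */
int lookup_compact(uint16_t ucs)
{
    const struct pair *p = bsearch(&ucs, compact,
                                   sizeof compact / sizeof compact[0],
                                   sizeof compact[0], cmp_pair);
    return p ? p->code : -1;
}

/* Speed-oriented layout: one slot per UCS value in 0x0400..0x04FF,
   with 0 meaning "unmapped". */
static unsigned char direct[256];

void build_direct(void)
{
    for (size_t i = 0; i < sizeof compact / sizeof compact[0]; i++)
        direct[compact[i].ucs - 0x0400] = compact[i].code;
}

/* Direct indexing: larger table, one array access per lookup. */
int lookup_direct(uint16_t ucs)
{
    if (ucs < 0x0400 || ucs > 0x04FF)
        return -1;
    unsigned char c = direct[ucs - 0x0400];
    return c ? c : -1;
}
```

Both lookups return the same results after @code{build_direct()} has been called; the difference lies only in the memory used and the work done per character.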

@*
It's worth stressing that new encoding support can't be
dynamically added into an already compiled Newlib library, even if it
needs only an additional CCS table and iconv is configured to use
the external files with CCS tables (this isn't a fundamental restriction,
and the possibility to add new table-based encoding support dynamically, by
means of just adding a new .cct file, may be easily added).

@*
Theoretically, the compiled-in CCS tables should be more appropriate for
embedded systems than dynamically loaded CCS tables. This is because the compiled-in tables are read-only and can be placed in ROM,
whereas dynamic loading requires RAM. Moreover, in the current iconv
implementation, a distinct copy of the dynamic CCS file is loaded for each opened iconv descriptor, even in the case of the same encoding.
This means, for example, that if two iconv descriptors for
"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
of these files). On the other hand, in the case of compiled-in CCS tables, there will always be only one copy.

@page
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@findex --enable-newlib-iconv-encodings
@findex --enable-newlib-iconv-from-encodings
@findex --enable-newlib-iconv-to-encodings
@findex --enable-newlib-iconv-external-ccs
@findex NLSPATH
@*
To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
script option should be used. This option accepts a comma-separated list
of @emph{encodings} that should be enabled. The option enables each encoding in both
("to" and "from") directions.

@*
The @option{--enable-newlib-iconv-from-encodings} configure script option enables
"from" support for each encoding that was passed to it.

@*
The @option{--enable-newlib-iconv-to-encodings} configure script option enables
"to" support for each encoding that was passed to it.

@*
Example: if the user plans only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
"KOI8-R -> UCS-2" conversions, the optimal way (minimal iconv
code and data will be linked) is to configure Newlib with the following
options:
@*
@code{--enable-newlib-iconv-encodings=UTF-8
--enable-newlib-iconv-from-encodings=KOI8-R
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
@*
which is the same as
@*
@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
@*
The user may also just use the
@*
@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
@*
configure script option, but this is less optimal since some
unneeded data and code will be linked.
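
@*
Since the set of enabled conversions is fixed at configuration time, a program may want to check at run time whether a given direction is available. A minimal sketch follows; it relies only on the standard behaviour that @code{iconv_open} returns @code{(iconv_t)-1} when the requested conversion is not supported. The helper name @code{conversion_available} is invented for this example.

```c
#include <iconv.h>

/* Return 1 if the linked iconv can convert from `from` to `to`,
   0 otherwise.  Hypothetical helper, not a library interface. */
int conversion_available(const char *from, const char *to)
{
    iconv_t cd = iconv_open(to, from);  /* target encoding first */
    if (cd == (iconv_t)-1)
        return 0;
    iconv_close(cd);
    return 1;
}
```

With the first configuration shown above, @code{conversion_available("KOI8-R", "UTF-8")} would be expected to succeed, while the reverse direction would be expected to fail, because only "from" support was enabled for KOI8-R.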

@*
The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
capability to work with external CCS files.

@*
The @option{--enable-target-optspace} Newlib configure script option also affects
the iconv library. If this option is present, the library uses the size-optimized
CCS tables. This means that only the size-optimized CCS
tables will be linked or, if the
@option{--enable-newlib-iconv-external-ccs} configure script option was used,
the iconv library will load the size-optimized tables. If the
@option{--enable-target-optspace} configure script option is disabled,
the speed-optimized CCS tables are used.

@*
Note: .cct files are searched for by @code{iconv_open} in the $NLSPATH/iconv_data/ directory.
Thus, the NLSPATH environment variable should be set.
|
930 |
|
|
|
931 |
|
|
|
932 |
|
|
|
933 |
|
|
|
934 |
|
|
|
935 |
|
|
@page
@node Encoding names
@section Encoding names
@findex encoding name
@findex encoding alias
@findex normalized name
@*
Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
the user works with the iconv library (i.e., when the @code{iconv_open} call
is used), either the name or any of the aliases may be used. The same applies
when encoding names are used in configure script options.

@*
Names and aliases may be specified in any case (small or capital
letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.
Also, when working with the iconv library,

@*
Internally the Newlib iconv library always converts aliases to names. It
also converts names and aliases to the @dfn{normalized} form, which means
that all capital letters are converted to small letters and the @kbd{-}
symbols are converted to @kbd{_} symbols.
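
@*
The normalization rule can be sketched in a few lines of C. This is an
illustration only, not the code used by Newlib; the function name
@code{normalize_encoding_name} and its interface are assumptions made for
the example.

@verbatim
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Illustrative only (not Newlib's code): writes the normalized form of
   `name` into `out`, whose capacity `outcap` includes the terminating NUL.
   Returns 0 on success, -1 if `out` is too small. */
int normalize_encoding_name(const char *name, char *out, size_t outcap)
{
    assert(name != NULL && out != NULL);

    size_t i;
    for (i = 0; name[i] != '\0'; i++) {
        if (i + 1 >= outcap)
            return -1;
        unsigned char c = (unsigned char)name[i];
        /* '-' becomes '_', capital letters become small letters. */
        out[i] = (c == '-') ? '_' : (char)tolower(c);
    }
    if (i >= outcap)
        return -1;
    out[i] = '\0';
    return 0;
}
```
@end verbatim

@*
Under this rule "KOI8-R", "koi8-r" and "Koi8_R" all normalize to the same
string, which is why they can be used interchangeably.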

@page
@node CCS tables
@section CCS tables
@findex Size-optimized CCS table
@findex Speed-optimized CCS table
@findex mktbl.pl Perl script
@findex .cct files
@findex The CCT tables source files
@findex CCS source files
@*
The iconv library stores files with CCS tables in the @emph{ccs/}
subdirectory. The CCS tables for any CCS may be kept in two forms: in binary form
(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
of compilable .c source files. The .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
The .c files are linked to the Newlib library if the corresponding
encoding is enabled.

@*
As stated earlier, the Newlib iconv library performs all
conversions through the 32-bit UCS, but the codes which are used
in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set.
Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
used instead of the 32-bit UCS-4.

@*
CCS tables may be 8 or 16 bits wide. 8-bit CCS tables map 8-bit CCS to
16-bit UCS-2 and vice versa, while 16-bit CCS tables map
16-bit CCS to 16-bit UCS-2 and vice versa.
8-bit tables are small, while 16-bit tables may be quite large.
Because of this, 16-bit CCS tables may be
either speed- or size-optimized. Size-optimized CCS tables are
smaller than speed-optimized ones, but the conversion process is
slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
the speed-optimized variant.

Each CCS table (both speed- and size-optimized) consists of
@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
UCS-2 codes.

@*
Almost all 16-bit CCS tables contain fewer than 0xFFFF codes and
have a lot of gaps.

@subsection Speed-optimized tables format
@*
In the case of 8-bit speed-optimized CCS tables, the "to_ucs" subtable format is
trivial: it is just an array of 256 16-bit UCS codes. Therefore, the
UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
as @emph{Y = to_ucs[X]}.

@*
Obviously, the simplest way to create the "from_ucs" table or the
16-bit "to_ucs" table is to use a huge 16-bit array, as in the case
of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
fewer than 0xFFFF code maps, and this fact may be exploited to reduce
the size of the CCS tables.

@*
This subsection describes the "UCS-2 -> CCS" 8-bit CCS table format. The
16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
direction and the number of CCS bits.

@*
In the case of the 8-bit speed-optimized table, the "from_ucs" subtable
corresponds to the "from_ucs" array and has the following layout:

@*
from_ucs array:
@*
-------------------------------------
@*
0xFF mapping (2 bytes) (only for 8-bit table).
@*
-------------------------------------
@*
Heading block
@*
-------------------------------------
@*
Block 1
@*
-------------------------------------
@*
Block 2
@*
-------------------------------------
@*
 ...
@*
-------------------------------------
@*
Block N
@*
-------------------------------------

@*
The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each
subrange is represented by a 256-element @dfn{block} (256 1-byte
elements, or 256 2-byte elements in the case of a 16-bit CCS table) whose
elements are the CCS codes of this subrange.
If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
absent and there will be fewer than 256 blocks.

@*
Element number @emph{m} of @dfn{the heading block} (which contains
256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
If the subrange contains some codes, the @emph{m}-th element of
the heading block contains the offset of the corresponding block in the
"from_ucs" array. If there are no codes in the subrange, the heading
block element contains 0xFFFF.

@*
If there are some gaps in a block, the corresponding block elements have
the 0xFF value. If the 0xFF code is present in the CCS, its mapping
is defined in the first 2-byte element of the "from_ucs" array.

@*
Given this table format, the algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.

@*
@enumerate
@item If @emph{Y} is equal to the value of the first 2-byte element
of the "from_ucs" array, @emph{X} is 0xFF. Otherwise, continue the search.

@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.

@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
is no corresponding CCS code (error, wrong input data). Otherwise, fetch from
this element the "from_ucs" array index of the @emph{BlkN}-th block,
@emph{BlkIdx}.

@item Calculate the offset of the @emph{X} code in its block:
@emph{Xindex = Y & 0xFF}

@item If the value of the @emph{Xindex}-th element of the block (which is
@emph{from_ucs[BlkIdx+Xindex]}) is 0xFF, there is no corresponding
CCS code (error, wrong input data). Otherwise, @emph{X = from_ucs[BlkIdx+Xindex]}.
@end enumerate
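
@*
The lookup steps above can be sketched in C as follows. This is a simplified
illustration and not the Newlib implementation: the @code{from_ucs_8bit}
structure, its field names and the convention that heading entries are element
offsets into a separate block area are all assumptions made for the example.
The point to note is that the value fetched from the heading block, and not
the block number itself, is used as the base index of the block.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_BLOCK 0xFFFFu  /* heading entry: the subrange has no codes */
#define NO_CODE  0xFFu    /* block element: gap, no CCS code */

/* Assumed in-memory view of an 8-bit "from_ucs" subtable (hypothetical
   field names). Heading entries hold element offsets into `blocks`. */
struct from_ucs_8bit {
    uint16_t ff_mapping;     /* UCS-2 code that maps to CCS code 0xFF */
    uint16_t heading[256];   /* per-subrange block offset or NO_BLOCK */
    const uint8_t *blocks;   /* concatenated 256-element blocks */
};

/* "to_ucs" direction: a plain 256-element array lookup, Y = to_ucs[X]. */
uint16_t ccs_to_ucs2(const uint16_t to_ucs[256], uint8_t x)
{
    return to_ucs[x];
}

/* "from_ucs" direction. Returns the CCS code (0..255),
   or -1 if the UCS-2 code y has no mapping. */
int ucs2_to_ccs(const struct from_ucs_8bit *t, uint16_t y)
{
    assert(t != NULL && t->blocks != NULL);

    if (y == t->ff_mapping)                    /* step 1 */
        return 0xFF;
    unsigned blk_n = (y & 0xFF00u) >> 8;       /* step 2 */
    uint16_t blk_idx = t->heading[blk_n];      /* step 3 */
    if (blk_idx == NO_BLOCK)
        return -1;
    unsigned x_index = y & 0xFFu;              /* step 4 */
    uint8_t x = t->blocks[blk_idx + x_index];  /* step 5 */
    return x == NO_CODE ? -1 : x;
}
```
@end verbatim

@*
Both directions cost a constant number of array accesses, which is what makes
this format the speed-optimized one.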

@subsection Size-optimized tables format
@*
As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because the difference between the speed-optimized
and the size-optimized table sizes is too small in the case of 8-bit CCS-es.

@*
The formats of the "to_ucs" and "from_ucs" subtables are equivalent in the case of
size-optimized tables.

This section describes the format of the "UCS-2 -> CCS" size-optimized
CCS table. The format of the "CCS -> UCS-2" table is the same.

The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a number of consecutive UCS-2 codes).
Then CCS codes ("to" codes) are stored only for the codes from these
ranges. Distinct "from" codes which have no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.

@*
The following is the layout of the size-optimized table array:

@*
size_arr array:
@*
-------------------------------------
@*
Ranges number (2 bytes)
@*
-------------------------------------
@*
Unranged codes number (2 bytes)
@*
-------------------------------------
@*
Unranged codes array index (2 bytes)
@*
-------------------------------------
@*
Ranges indexes (triads)
@*
-------------------------------------
@*
Ranges
@*
-------------------------------------
@*
Unranged codes array
@*
-------------------------------------

@*
The @dfn{Ranges indexes} section of @emph{size_arr} helps to find
the offset of the needed range in the @emph{size_arr} and has
the following format (triads):
@*
the first code in range, the last code in range, range offset.

@*
The array of these triads is sorted by the first element, therefore it is
possible to quickly find the needed range index.

@*
Each range has a corresponding sub-array containing the "to" codes. These
sub-arrays are stored in the place marked as "Ranges" in the layout
diagram.

@*
The "Unranged codes array" contains pairs ("from" code, "to" code) for
each unranged code. The array of these pairs is sorted by "from" code
values, therefore it is possible to find the needed pair quickly.

@*
Note that each range requires 6 bytes to form its index. If, for
example, there are two ranges (1 - 5 and 9 - 10), and one unranged code
(7), 12 bytes are needed for two range indexes and 4 bytes for the unranged
code (total 16). But it is better to join both ranges as 1 - 10 and
mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
range index and 4 bytes to mark codes 6 and 8 as absent are needed
(total 10 bytes). This optimization is done in the size-optimized tables.
Thus, ranges may contain small gaps. The absent codes in ranges are marked
as 0xFFFF.

@*
Note that a pair of "from" codes is stored by means of unranged codes, since
forming a range for them takes more space than
storing two unranged codes (five 2-byte elements against four).

@*
The algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
CCS" size-optimized table is as follows.

@*
@enumerate
@item Try to find the corresponding triad in the "Ranges indexes"
section. Since the search is done in a sorted array, it can be done quickly
(binary search: divide by 2, compare, etc.).

@item If the triad is found, fetch the @emph{X} code from the corresponding
range array. If it is 0xFFFF, return an error.

@item If there is no corresponding triad, search for the @emph{X} code among the
sorted unranged codes. Return an error if nothing was found.

@end enumerate
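
@*
The search above can be sketched in C as two binary searches. This is a
simplified illustration, not the Newlib implementation: the structures below
are a hypothetical decoded view of the table (the real table is a single
array), all names are assumptions made for the example, and the ranges are
assumed not to overlap.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ABSENT 0xFFFFu  /* gap marker inside a range */

/* Hypothetical decoded view of the size-optimized table. */
struct range_index { uint16_t first, last, offset; };  /* one triad */
struct unranged    { uint16_t from, to; };             /* one pair  */

struct size_table {
    size_t n_ranges;
    const struct range_index *ranges;  /* sorted by `first` */
    const uint16_t *range_codes;       /* "Ranges" area, indexed by offset */
    size_t n_unranged;
    const struct unranged *pairs;      /* sorted by `from` */
};

/* Returns the "to" code for y, or -1 if there is no mapping. */
long size_table_lookup(const struct size_table *t, uint16_t y)
{
    assert(t != NULL);

    /* Step 1: binary search for a triad with first <= y <= last. */
    size_t lo = 0, hi = t->n_ranges;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct range_index *r = &t->ranges[mid];
        if (y < r->first) {
            hi = mid;
        } else if (y > r->last) {
            lo = mid + 1;
        } else {
            /* Step 2: fetch the code from the range's sub-array. */
            uint16_t x = t->range_codes[r->offset + (y - r->first)];
            return x == ABSENT ? -1 : (long)x;
        }
    }

    /* Step 3: binary search among the sorted unranged pairs. */
    lo = 0;
    hi = t->n_unranged;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (y < t->pairs[mid].from)
            hi = mid;
        else if (y > t->pairs[mid].from)
            lo = mid + 1;
        else
            return t->pairs[mid].to;
    }
    return -1;
}
```
@end verbatim

@*
Each lookup costs a logarithmic number of comparisons instead of the constant
cost of the speed-optimized format, which is the price paid for the smaller
table.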

@subsection .cct and .c CCS Table files
@*
The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
speed-optimized tables. The .c source files for 16-bit CCS tables have
"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
tables.

@*
When .c files are compiled and used, all the 16-bit and 32-bit values
have the native endian format (Big Endian for the BE systems and Little
Endian for the LE systems) since they are compiled for the system before
they are used.

@*
In the case of .cct files, which are intended for dynamic loading of CCS
tables, the CCS tables are stored in either LE or BE format. Since the
.cct files are generated by the 'mktbl.pl' Perl script, it is possible
to choose the endianness of the tables. It is also possible to store two
copies (both LE and BE) of the CCS tables in one .cct file. The default
.cct files (which come with the Newlib sources) have both LE and BE CCS
tables. The Newlib iconv library automatically chooses the needed CCS tables
(with the appropriate endianness).
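
@*
To make the endianness issue concrete, the following sketch shows one way to
detect the byte order of the host (which is what selecting the matching copy
requires) and a portable way to read a 16-bit value stored in a known byte
order. It is an illustration under stated assumptions, not a description of
how Newlib loads .cct files; both function names are hypothetical.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero when the host stores the least significant byte first. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Reads one 16-bit table value stored at `p` in the given byte order.
   Works on any host because the value is assembled byte by byte. */
uint16_t read_u16(const unsigned char *p, int stored_little_endian)
{
    assert(p != NULL);
    return stored_little_endian
        ? (uint16_t)(p[0] | (p[1] << 8))
        : (uint16_t)((p[0] << 8) | p[1]);
}
```
@end verbatim

@*
Storing both copies in one file trades file size for the ability to use the
table data directly, without a per-value conversion such as
@code{read_u16}.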

@*
Note that the .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure script option is used.

@subsection The 'mktbl.pl' Perl script
@*
The 'mktbl.pl' script is intended to generate .cct and .c CCS table
files from the @dfn{CCS source files}.

@*
The CCS source files are just text files which have one or more columns
with the CCS <-> UCS-2 code mapping. For an example of a CCS
source file, follow one of the URLs given below.

@*
The following table describes where the source files for the CCS table files
provided by the Newlib distribution are located.

@multitable @columnfractions .25 .75
@item
Name
@tab
URL

@item
@tab

@item
big5
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

@item
cns11643_plane1
cns11643_plane14
cns11643_plane2
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT

@item
cp775
cp850
cp852
cp855
cp866
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/

@item
iso_8859_1
iso_8859_2
iso_8859_3
iso_8859_4
iso_8859_5
iso_8859_6
iso_8859_7
iso_8859_8
iso_8859_9
iso_8859_10
iso_8859_11
iso_8859_13
iso_8859_14
iso_8859_15
@tab
http://www.unicode.org/Public/MAPPINGS/ISO8859/

@item
iso_ir_111
@tab
http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT

@item
jis_x0201_1976
jis_x0208_1990
jis_x0212_1990
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT

@item
koi8_r
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT

@item
koi8_ru
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT

@item
koi8_u
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT

@item
koi8_uni
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT

@item
ksx1001
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT

@item
win_1250
win_1251
win_1252
win_1253
win_1254
win_1255
win_1256
win_1257
win_1258
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
@end multitable

The CCS source files aren't distributed with Newlib because of license
restrictions in most of Unicode.org's files.

The following are the 'mktbl.pl' options which were used to generate the .cct
files. Note that to generate the CCS table source files, the @option{-s} option
should be added.

@enumerate
@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
win_1256.cct, win_1258.cct, win_1251.cct,
win_1253.cct, win_1255.cct, win_1257.cct,
koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
files, only the @option{-i <SRC_FILE_NAME>} option was used.

@item To generate the jis_x0208_1990.cct file, the
@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.

@item To generate the cns11643_plane1.cct file, the
@option{-i cns11643.txt -p1 -N cns11643_plane1 -o cns11643_plane1.cct}
options were used.

@item To generate the cns11643_plane2.cct file, the
@option{-i cns11643.txt -p2 -N cns11643_plane2 -o cns11643_plane2.cct}
options were used.

@item To generate the cns11643_plane14.cct file, the
@option{-i cns11643.txt -p0xE -N cns11643_plane14 -o cns11643_plane14.cct}
options were used.
@end enumerate

@*
For more info about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.

@*
It is assumed that CCS codes are 16 or fewer bits wide. If there are wider CCS codes
in the CCS source file, the bits above the 16th define the plane (see the
cns11643.txt CCS source file).

@*
Sometimes it is impossible to map some CCS codes to the 16-bit UCS if, for example,
several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
a pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
codes}) aren't simply rejected; instead, they are mapped to the default
UCS-2 code (which is currently the @kbd{?} character's code).

@page
@node CES converters
@section CES converters
@findex PCS
@*
Similar to the CCS tables, CES converters are also split into "from UCS"
and "to UCS" parts. Depending on the iconv library configuration, these
parts are enabled or disabled.

@*
The following is the list of CES converters which are currently present
in the Newlib iconv library.

@itemize @bullet
@item
@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
encodings. The @emph{euc} CES converter uses the @emph{table} and the
@emph{us_ascii} CES converters.

@item
@emph{table} - this CES converter corresponds to "null" and just performs
table-based conversion using 8- and 16-bit CCS tables. This converter
is also used by any other CES converter which needs CCS table-based
conversions. The @emph{table} converter is also responsible for loading
.cct files.

@item
@emph{table_pcs} - this is a wrapper over the @emph{table} converter
which is intended for 16-bit encodings which also use the @dfn{Portable
Character Set} (@dfn{PCS}), which is the same as @emph{US-ASCII}.
This means that if the first byte of the CCS code is in the range [0x00-0x7f],
this is a 7-bit PCS code. Otherwise, this is a 16-bit CCS code. Of course,
the 16-bit codes must not contain bytes in the range [0x00-0x7f].
The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the
@emph{table_pcs} CES converter depends on the @emph{table} CES converter.

@item
@emph{ucs_2} - intended for @emph{ucs_2}, @emph{ucs_2be} and
@emph{ucs_2le} encoding support.

@item
@emph{ucs_4} - intended for @emph{ucs_4}, @emph{ucs_4be} and
@emph{ucs_4le} encoding support.

@item
@emph{ucs_2_internal} - intended for @emph{ucs_2_internal} encoding support.

@item
@emph{ucs_4_internal} - intended for @emph{ucs_4_internal} encoding support.

@item
@emph{us_ascii} - intended for @emph{us_ascii} encoding support. In
principle, the most natural way to support the @emph{us_ascii} encoding
is to define the @emph{us_ascii} CCS and use the @emph{table} CES
converter. But for optimization purposes, the specialized
@emph{us_ascii} CES converter was created.

@item
@emph{utf_16} - intended for @emph{utf_16}, @emph{utf_16be} and
@emph{utf_16le} encoding support.

@item
@emph{utf_8} - intended for @emph{utf_8} encoding support.
@end itemize
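
@*
The rule used by the @emph{table_pcs} converter to tell PCS codes from 16-bit
CCS codes can be sketched as follows. This is an illustration only and not the
Newlib code: the function name @code{pcs_next_code} is hypothetical, and the
sketch assumes that the first byte of a 16-bit code is its high byte.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative decoder step (not Newlib's code). Reads one character from
   `in` (with `inlen` bytes available) and stores its code in *code.
   Returns the number of bytes consumed: 1 for a PCS (US-ASCII) code,
   2 for a 16-bit CCS code, or 0 if the input is empty or truncated. */
size_t pcs_next_code(const unsigned char *in, size_t inlen, uint16_t *code)
{
    assert(in != NULL && code != NULL);

    if (inlen == 0)
        return 0;
    if (in[0] <= 0x7F) {      /* first byte in [0x00-0x7f]: 7-bit PCS code */
        *code = in[0];
        return 1;
    }
    if (inlen < 2)            /* a 16-bit code needs a second byte */
        return 0;
    /* Assumption: the first byte is the high byte of the 16-bit code. */
    *code = (uint16_t)((in[0] << 8) | in[1]);
    return 2;
}
```
@end verbatim

@*
The PCS codes would then be handled as plain US-ASCII, while the 16-bit codes
would be passed on to the CCS table lookup done by the @emph{table} converter.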

@page
@node The encodings description file
@section The encodings description file
@findex encoding.deps description file
@findex mkdeps.pl Perl script
@*
To simplify the process of adding support for new encodings, many of the
"glue" files are generated automatically.

@*
There is the 'encoding.deps' file in the @emph{lib/} subdirectory which
is used to describe the properties of encodings. The 'mkdeps.pl' Perl script
uses 'encoding.deps' to generate the "glue" files.

@*
The 'encoding.deps' file is composed of sections; each section consists
of entries, and each entry contains some encoding/CES/CCS description.

@*
The 'encoding.deps' file's syntax is very simple. Currently only two
sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.

@*
Each entry of the @emph{ENCODINGS} section describes one encoding and
contains the following information.

@itemize @bullet
@item
Encoding name (the @emph{ENCODING} field). The name should
be unique and only one name is possible.

@item
The encoding's CES converter name (the @emph{CES} field). Only one CES
converter is allowed.

@item
The whitespace-separated list of CCS table names which are used by the
encoding (the @emph{CCS} field).

@item
The whitespace-separated list of alias names (the @emph{ALIASES}
field).
@end itemize

@*
Note that all names in the 'encoding.deps' file must be in the normalized
form.

@*
Each entry of the @emph{CES_DEPENDENCIES} section describes the dependencies of
one CES converter. For example, the @emph{euc} CES converter depends on
the @emph{table} and the @emph{us_ascii} CES converters since the
@emph{euc} CES converter uses them. This means that both the @emph{table}
and @emph{us_ascii} CES converters should be linked if the @emph{euc}
CES converter is enabled.

@*
The @emph{CES_DEPENDENCIES} section defines the following:

@itemize @bullet
@item
the CES converter name for which the dependencies are defined in this
entry (the @emph{CES} field);

@item
the whitespace-separated list of CES converters which are needed for
this CES converter (the @emph{USED_CES} field).
@end itemize
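
@*
The effect of these dependencies can be sketched in C as a small closure
computation. This only illustrates the idea and is an assumption-laden
example: Newlib resolves the dependencies through generated macros rather
than at run time, and the identifiers below are hypothetical. Only the two
dependencies mentioned in this chapter are encoded.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Converter identifiers used as bit positions (hypothetical names). */
enum ces_id { CES_EUC, CES_TABLE, CES_TABLE_PCS, CES_US_ASCII, CES_COUNT };

/* USED_CES lists taken from the text, one bitmask per converter:
   euc uses table and us_ascii, table_pcs uses table. */
static const unsigned ces_deps[CES_COUNT] = {
    [CES_EUC]       = (1u << CES_TABLE) | (1u << CES_US_ASCII),
    [CES_TABLE_PCS] = (1u << CES_TABLE),
};

/* Returns `enabled` extended with every converter it depends on,
   directly or indirectly. */
unsigned ces_resolve(unsigned enabled)
{
    assert((enabled >> CES_COUNT) == 0);

    unsigned prev;
    do {
        prev = enabled;
        for (int i = 0; i < CES_COUNT; i++)
            if (enabled & (1u << i))
                enabled |= ces_deps[i];
    } while (enabled != prev);  /* repeat until nothing new is added */
    return enabled;
}
```
@end verbatim

@*
Enabling only @emph{euc} in this sketch yields a set that also contains
@emph{table} and @emph{us_ascii}, which matches the example given above.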

@*
The 'mkdeps.pl' Perl script automatically solves the following tasks.

@itemize @bullet
@item
The user works with the iconv library in terms of encodings and doesn't need to know
anything about CES converters and CCS tables. The script automatically
generates code which enables all needed CES converters and CCS tables
for all encodings which were enabled by the user.

@item
The CES converters may have dependencies, and the script automatically
generates the code which handles these dependencies.

@item
The list of each encoding's aliases is also automatically generated.

@item
The script uses a lot of macros in order to enable only the minimum set
of code/data which is needed to support the requested encodings in the
requested directions.
@end itemize

@*
The 'mkdeps.pl' Perl script interprets the 'encoding.deps'
file and generates the following files.

@itemize @bullet
@item
@emph{lib/encnames.h} - this header file contains macro definitions for all
encoding names.

@item
@emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
is used to find the name of the requested encoding by its alias.

@item
@emph{ces/cesbi.c} - this file defines two arrays
(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
descriptions of the enabled "from UCS" and "to UCS" CES converters and the
names of the encodings which are supported by these CES converters.

@item
@emph{ces/cesbi.h} - this file contains the set of macros which defines
the set of CES converters which should be enabled given only the set of
enabled encodings (through macros defined in the
@emph{newlib.h} file). Note that one CES converter may handle several
encodings.

@item
@emph{ces/cesdeps.h} - the CES converter dependencies are handled in
this file.

@item
@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
here.

@item
@emph{ccs/ccsnames.h} - this header file contains macro definitions for all
CCS names.

@item
@emph{encoding.aliases} - the list of supported encodings and their
aliases, which is intended for the Newlib configure scripts in order to
handle the iconv-related configure script options.
@end itemize

@page
@node How to add new encoding
@section How to add new encoding
@*
First, the new encoding should be broken down into CCS and CES. Then,
the process of adding the new encoding consists of the following activities.

@enumerate
@item Generate the .cct CCS file and the .c source file for the new
encoding's CCS (if it isn't already present). To do this, obtain the CCS source
file and process it with the 'mktbl.pl' script.

@item Write the corresponding CES converter (if it isn't already
present). Use the existing CES converters as an example.

@item
Add the corresponding entries to the 'encoding.deps' file and regenerate
the autogenerated "glue" files using the 'mkdeps.pl' script.

@item
Don't forget to add entries to the newlib/newlib.hin file.

@item
Of course, the 'Makefile.am'-s should also be updated (if new files were
added) and the 'Makefile.in'-s should be regenerated using the correct
version of 'automake'.

@item
Don't forget to update the documentation (the list of
supported encodings and CES converters).
@end enumerate

If a new encoding doesn't fit the CES/CCS decomposition model, or
if specialized (non-UCS-based) conversion support is desired,
the Newlib iconv library code should be upgraded.

@page
|
1652 |
|
|
@node The locale support interfaces
|
1653 |
|
|
@section The locale support interfaces
|
1654 |
|
|
@*
Besides the @code{iconv}, @code{iconv_open} and @code{iconv_close}
interfaces, the Newlib iconv library also provides several interface
functions intended for the Locale subsystem. All the locale-related code is
placed in the @emph{lib/iconvnls.c} file.

@*
The locale-related interfaces are described below:

@itemize @bullet
@item
@code{_iconv_nls_open} - opens two iconv descriptors, one for the "CCS ->
wchar_t" conversion and one for the "wchar_t -> CCS" conversion. The
normalized CCS name is passed as a function parameter. The @emph{wchar_t}
character encoding is either ucs_2_internal or ucs_4_internal, depending on
the size of @emph{wchar_t}.

@item
@code{_iconv_nls_conv} - similar to the @code{iconv} function, except that
when the output encoding has no character corresponding to a character in
the input encoding, the default conversion isn't performed (the
@code{iconv} function sets such output characters to the @kbd{?} symbol,
which is the behavior specified in SUSv3).

@item
@code{_iconv_nls_get_state} - returns the current encoding's shift state
(the @code{mbstate_t} object).

@item
@code{_iconv_nls_set_state} - sets the current encoding's shift state (the
@code{mbstate_t} object).

@item
@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
or stateless.

@item
@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
maximum number of bytes) of the encoding's characters.
@end itemize
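
Two of the points above can be sketched in a few lines of C: how the
@emph{wchar_t} encoding name follows from the size of @emph{wchar_t}, and
how a conversion with the default @kbd{?} substitution differs from one
without it. This is only an illustration under stated assumptions: the
helpers @code{toy_wchar_encoding} and @code{toy_convert_char} are
hypothetical, the ASCII-only target encoding is made up, and returning -1
for an unconvertible character is an assumption rather than the documented
behavior of @code{_iconv_nls_conv}.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: the internal encoding name used for wchar_t,
   selected by the size of wchar_t as described for _iconv_nls_open. */
const char *toy_wchar_encoding(void)
{
    return sizeof(wchar_t) == 2 ? "ucs_2_internal" : "ucs_4_internal";
}

/* Hypothetical converter to a toy target encoding in which only code
   points below 0x80 are representable.
   use_default != 0: an unrepresentable character becomes '?', which is
                     the iconv behavior described above.
   use_default == 0: no default conversion is done and -1 is returned
                     (the return value is an assumption of this sketch). */
int toy_convert_char(wchar_t wc, int use_default, char *out)
{
    assert(out != NULL);
    if ((unsigned long)wc < 0x80) {
        *out = (char)wc;
        return 0;
    }
    if (use_default) {
        *out = '?';
        return 0;
    }
    return -1;
}
```
@end verbatim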
@page
@node Contact
@section Contact
@*
The author of the original BSD iconv library (Alexander Chuguev) no longer
supports that code.

@*
Any questions regarding the iconv library may be forwarded to
Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
well as to the public Newlib mailing list.