<?xml version="1.0" encoding="ISO-8859-1"?>
2
<!DOCTYPE html
3
          PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
4
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
5
 
6
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
7
<head>
8
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
9
   <meta name="AUTHOR" content="bkoz@redhat.com (Benjamin Kosnik)" />
10
   <meta name="KEYWORDS" content="HOWTO, libstdc++, GCC, g++, libg++, STL" />
11
   <meta name="DESCRIPTION" content="Notes on the codecvt implementation." />
12
   <title>Notes on the codecvt implementation.</title>
13
<link rel="StyleSheet" href="../lib3styles.css" type="text/css" />
14
<link rel="Start" href="../documentation.html" type="text/html"
15
  title="GNU C++ Standard Library" />
16
<link rel="Bookmark" href="howto.html" type="text/html" title="Localization" />
17
<link rel="Copyright" href="../17_intro/license.html" type="text/html" />
18
<link rel="Help" href="../faq/index.html" type="text/html" title="F.A.Q." />
19
</head>
20
<body>
21
  <h1>
22
  Notes on the codecvt implementation.
23
  </h1>
24
<p>
25
<em>
26
prepared by Benjamin Kosnik (bkoz@redhat.com) on August 28, 2000
27
</em>
28
</p>
29
 
30
<h2>
31
1. Abstract
32
</h2>
33
<p>
34
The standard class codecvt attempts to address conversions between
35
different character encoding schemes. In particular, the standard
36
attempts to detail conversions between the implementation-defined wide
37
characters (hereafter referred to as wchar_t) and the standard type
38
char that is so beloved in classic &quot;C&quot; (which can now be referred to
39
as narrow characters.)  This document attempts to describe how the GNU
40
libstdc++-v3 implementation deals with the conversion between wide and
41
narrow characters, and also presents a framework for dealing with the
42
huge number of other encodings that iconv can convert, including
43
Unicode and UTF8. Design issues and requirements are addressed, and
44
examples of correct usage for both the required specializations for
45
wide and narrow characters and the implementation-provided extended
46
functionality are given.
47
</p>
48
 
49
<h2>
50
2. What the standard says
51
</h2>
52
Around page 425 of the C++ Standard, this charming heading comes into view:
53
 
54
<blockquote>
55
22.2.1.5 - Template class codecvt [lib.locale.codecvt]
56
</blockquote>
57
 
58
The text around the codecvt definition gives some clues:
59
 
60
<blockquote>
61
<em>
62
-1- The class codecvt&lt;internT,externT,stateT&gt; is for use when
63
converting from one codeset to another, such as from wide characters
64
to multibyte characters, between wide character encodings such as
65
Unicode and EUC.
66
</em>
67
</blockquote>
68
 
69
<p>
70
Hmm. So, in some unspecified way, Unicode encodings and
71
translations between other character sets should be handled by this
72
class.
73
</p>
74
 
75
<blockquote>
76
<em>
77
-2- The stateT argument selects the pair of codesets being mapped between.
78
</em>
79
</blockquote>
80
 
81
<p>
82
Ah ha! Another clue...
83
</p>
84
 
85
<blockquote>
86
<em>
87
-3- The instantiations required in the Table ??
88
(lib.locale.category), namely codecvt&lt;wchar_t,char,mbstate_t&gt; and
89
codecvt&lt;char,char,mbstate_t&gt;, convert the implementation-defined
90
native character set. codecvt&lt;char,char,mbstate_t&gt; implements a
91
degenerate conversion; it does not convert at
92
all. codecvt&lt;wchar_t,char,mbstate_t&gt; converts between the native
93
character sets for tiny and wide characters. Instantiations on
94
mbstate_t perform conversion between encodings known to the library
95
implementor.  Other encodings can be converted by specializing on a
96
user-defined stateT type. The stateT object can contain any state that
97
is useful to communicate to or from the specialized do_convert member.
98
</em>
99
</blockquote>
100
 
101
<p>
At this point, a couple of points become clear:
</p>
104
 
105
<p>
106
One: The standard clearly implies that attempts to add non-required
107
(yet useful and widely used) conversions need to do so through the
108
third template parameter, stateT.</p>
109
 
110
<p>
Two: The required conversions, by specifying mbstate_t as the third
template parameter, imply an implementation strategy that is mostly
(or wholly) based on the underlying C library, and the functions
mbsrtowcs and wcsrtombs in particular.</p>
115
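<p>
As a rough illustration of that C library machinery, the sketch below
round-trips a null-terminated string through mbsrtowcs and wcsrtombs.
The helper names widen and narrow are invented for this example and
are not part of libstdc++-v3. Both helpers depend on the global
&quot;C&quot; locale that is current when they are called.
</p>

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: char -> wchar_t using the current global C locale.
std::wstring widen(const char* narrow_text)
{
  std::mbstate_t state = std::mbstate_t();      // initial conversion state
  const char* src = narrow_text;
  // A null destination asks only for the required length.
  std::size_t len = std::mbsrtowcs(0, &src, 0, &state);
  if (len == static_cast<std::size_t>(-1) || len == 0)
    return std::wstring();                      // invalid input or empty string
  std::wstring wide(len, L'\0');
  state = std::mbstate_t();
  src = narrow_text;
  std::mbsrtowcs(&wide[0], &src, len, &state);
  return wide;
}

// Hypothetical helper: wchar_t -> char using the current global C locale.
std::string narrow(const std::wstring& wide_text)
{
  std::mbstate_t state = std::mbstate_t();
  const wchar_t* src = wide_text.c_str();
  std::size_t len = std::wcsrtombs(0, &src, 0, &state);
  if (len == static_cast<std::size_t>(-1) || len == 0)
    return std::string();
  std::string out(len, '\0');
  state = std::mbstate_t();
  src = wide_text.c_str();
  std::wcsrtombs(&out[0], &src, len, &state);
  return out;
}
```

<p>
Calling narrow(widen(&quot;black pearl&quot;)) in the default
&quot;C&quot; locale should give back the original bytes. Both
functions stop at the first null character.
</p>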
 
116
<h2>
117
3. Some thoughts on what would be useful
118
</h2>
119
Probably the most frequently asked question about code conversion is:
&quot;So dudes, what's the deal with Unicode strings?&quot; The dude part is
optional, but apparently the usefulness of Unicode strings is pretty
widely appreciated. Sadly, neither this specific encoding nor other
useful encodings such as UTF8, UCS4, and ISO 8859-10 are mentioned
in the C++ standard.
125
 
126
<p>
In particular, the simple implementation detail of wchar_t's size
seems to repeatedly confound people. Many systems use a two-byte,
unsigned integral type to represent wide characters, and use an
internal encoding of Unicode or UCS2 (AIX, Microsoft NT, and Java,
among others). Other systems use a four-byte, unsigned integral type to
represent wide characters, and use an internal encoding of
UCS4 (GNU/Linux systems using glibc, in particular). The C
programming language (and thus C++) does not specify a size
for the type wchar_t.
</p>
137
 
138
<p>
139
Thus, portable C++ code cannot assume a byte size (or endianness) either.
140
</p>
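<p>
A minimal sketch of how portable code might query these properties
instead of assuming them. The function names are made up for this
example.
</p>

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on this implementation (often 16 or 32).
inline std::size_t wchar_bits()
{ return sizeof(wchar_t) * CHAR_BIT; }

// True if one wchar_t can represent every code point up to 0x10FFFF.
inline bool wchar_holds_any_code_point()
{ return WCHAR_MAX >= 0x10FFFF; }

// True if the least significant byte of an integer is stored first.
inline bool host_is_little_endian()
{
  const unsigned int probe = 1;
  return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}
```

<p>
Code that writes wide characters to a file or socket can consult
these at run time and pick an explicit external encoding, rather
than dumping the in-memory representation.
</p>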
141
 
142
<p>
143
Getting back to the frequently asked question: What about Unicode strings?
144
</p>
145
 
146
<p>
147
What magic spell will do this conversion?
148
</p>
149
 
150
<p>
151
A couple of comments:
152
</p>
153
 
154
<p>
155
The thought that all one needs to convert between two arbitrary
156
codesets is two types and some kind of state argument is
157
unfortunate. In particular, encodings may be stateless. The naming of
158
the third parameter as stateT is unfortunate, as what is really needed
159
is some kind of generalized type that accounts for the issues that
160
abstract encodings will need. The minimum information that is required
161
includes:
162
</p>
163
 
164
<ul>
165
<li>
166
<p>
167
 Identifiers for each of the codesets involved in the conversion. For
168
example, using the iconv family of functions from the Single Unix
169
Specification (what used to be called X/Open) hosted on the GNU/Linux
170
operating system allows bi-directional mapping between far more than
171
the following tantalizing possibilities:
172
</p>
173
 
174
<p>
An edited list taken from <code>iconv --list</code> on a Red Hat 6.2/Intel system:
</p>

<blockquote>
<pre>
8859_1, 8859_9, 10646-1:1993, 10646-1:1993/UCS4, ARABIC, ARABIC7,
ASCII, EUC-CN, EUC-JP, EUC-KR, EUC-TW, GREEK-CCITT, GREEK, GREEK7-OLD,
GREEK7, GREEK8, HEBREW, ISO-8859-1, ISO-8859-2, ISO-8859-3,
ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8,
ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14,
ISO-8859-15, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4,
ISO-10646/UTF-8, ISO-10646/UTF8, SHIFT-JIS, SHIFT_JIS, UCS-2, UCS-4,
UCS2, UCS4, UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-8,
UTF-16, UTF8, UTF16
</pre>
</blockquote>
191
 
192
<p>
For iconv-based implementations, string literals for each of the
encodings (e.g. &quot;UCS-2&quot; and &quot;UTF-8&quot;) are necessary,
although for other, non-iconv implementations a table of enumerated
values or some other mechanism may be required.
</p>
199
</li>
200
 
201
<li>
202
 Maximum length of the identifying string literal.
203
</li>
204
 
205
<li>
 Some encodings require explicit endianness. As such, some kind
  of endian marker or other byte-order marker will be necessary. See
  &quot;Footnotes for C/C++ developers&quot; in Haible for more information on
  UCS-2/Unicode endian issues. (Summary: big endian seems most likely,
  however implementations, most notably Microsoft, vary.)
</li>
212
 
213
<li>
214
 Types representing the conversion state, for conversions involving
215
  the machinery in the &quot;C&quot; library, or the conversion descriptor, for
216
  conversions using iconv (such as the type iconv_t.)  Note that the
217
  conversion descriptor encodes more information than a simple encoding
218
  state type.
219
</li>
220
 
221
<li>
 Conversion descriptors for both directions of encoding (i.e., both
  UCS-2 to UTF-8 and UTF-8 to UCS-2).
</li>

<li>
 Something to indicate if the conversion requested is valid.
</li>
229
 
230
<li>
231
 Something to represent if the conversion descriptors are valid.
232
</li>
233
 
234
<li>
235
 Some way to enforce strict type checking on the internal and
236
  external types. As part of this, the size of the internal and
237
  external types will need to be known.
238
</li>
239
</ul>
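<p>
To make the list above concrete, here is one possible shape for a
record carrying that information. Everything in it (the name
encoding_info, its members, and make_encoding_info) is hypothetical
and chosen only for illustration; the conversion descriptors
themselves are left out because their type depends on the
conversion back end.
</p>

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical record; none of these names come from libstdc++-v3.
struct encoding_info
{
  enum { max_name = 32 };              // longest identifier, terminator included
  enum byte_order { unspecified, big_endian, little_endian };

  char        internal_name[max_name]; // e.g. "UCS-2"
  char        external_name[max_name]; // e.g. "UTF-8"
  byte_order  order;                   // explicit endianness, if one is required
  std::size_t internal_size;           // sizeof the internal character type
  std::size_t external_size;           // sizeof the external character type
  bool        names_fit;               // false if an identifier was too long
};

// Fills in a record, refusing identifiers that exceed the length limit.
encoding_info
make_encoding_info(const char* internal, const char* external,
                   std::size_t internal_size, std::size_t external_size)
{
  encoding_info info;
  std::memset(&info, 0, sizeof info);
  info.order = encoding_info::unspecified;
  info.internal_size = internal_size;
  info.external_size = external_size;

  const std::size_t limit = static_cast<std::size_t>(encoding_info::max_name);
  info.names_fit = std::strlen(internal) < limit
                   && std::strlen(external) < limit;
  if (info.names_fit)
    {
      std::strcpy(info.internal_name, internal);
      std::strcpy(info.external_name, external);
    }
  return info;
}
```

<p>
For instance, make_encoding_info(&quot;UCS-2&quot;, &quot;UTF-8&quot;,
sizeof(unsigned short), sizeof(char)) describes a conversion between
two-byte internal characters and UTF-8 bytes.
</p>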
240
 
241
<h2>
4. Problems with &quot;C&quot; code conversions: thread safety, global
locales, termination.
</h2>
245
 
246
In addition, multi-threaded and multi-locale environments also impact
247
the design and requirements for code conversions. In particular, they
248
affect the required specialization codecvt&lt;wchar_t, char, mbstate_t&gt;
249
when implemented using standard &quot;C&quot; functions.
250
 
251
<p>
252
Three problems arise, one big, one of medium importance, and one small.
253
</p>
254
 
255
<p>
First, the small: mbsrtowcs and wcsrtombs may not be multithread-safe
on all systems required by the GNU tools. For GNU/Linux and glibc,
this is not an issue.
</p>
260
 
261
<p>
262
Of medium concern, in the grand scope of things, is that the functions
263
used to implement this specialization work on null-terminated
264
strings. Buffers, especially file buffers, may not be null-terminated,
265
thus giving conversions that end prematurely or are otherwise
266
incorrect. Yikes!
267
</p>
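<p>
The termination problem can be seen with a five-byte buffer that
contains an embedded null. The sketch below contrasts the
string-oriented mbsrtowcs with a loop over mbrtowc that is driven by
an explicit end pointer. Both function names are invented for this
example, and the loop assumes the null character occupies a single
byte in the current encoding.
</p>

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Counts the wide characters mbsrtowcs would produce: it stops at the first null.
std::size_t widen_until_null(const char* text)
{
  std::mbstate_t state = std::mbstate_t();
  const char* src = text;
  return std::mbsrtowcs(0, &src, 0, &state);
}

// Converts exactly [first, last), embedded nulls included.
std::wstring widen_range(const char* first, const char* last)
{
  std::wstring out;
  std::mbstate_t state = std::mbstate_t();
  while (first != last)
    {
      wchar_t wc;
      std::size_t n = std::mbrtowc(&wc, first, last - first, &state);
      if (n == static_cast<std::size_t>(-1)
          || n == static_cast<std::size_t>(-2))
        break;                // invalid or incomplete sequence: give up
      if (n == 0)
        n = 1;                // a null was converted; assume it took one byte
      out.push_back(wc);
      first += n;
    }
  return out;
}
```

<p>
Given the bytes 'a', 'b', '\0', 'c', 'd', the first function reports
two characters while the second converts all five.
</p>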
268
 
269
<p>
The last, and fundamental problem, is the assumption of a global
locale for all the &quot;C&quot; functions referenced above. For something like
C++ iostreams (where codecvt is explicitly used) the notion of
multiple locales is fundamental. In practice, most users may not run
into this limitation. However, as a quality of implementation issue,
the GNU C++ library would like to offer a solution that allows
multiple locales and/or simultaneous usage with computationally
correct results. In short, libstdc++-v3 is trying to offer, as an
option, a high-quality implementation, damn the additional complexity!
</p>
280
 
281
<p>
For the required specialization codecvt&lt;wchar_t, char, mbstate_t&gt;,
conversions are made between the internal character set (always UCS4
on GNU/Linux) and whatever the currently selected locale for the
LC_CTYPE category implements.
</p>
287
 
288
<h2>
289
5. Design
290
</h2>
291
The two required specializations are implemented as follows:
292
 
293
<p>
294
<code>
295
codecvt&lt;char, char, mbstate_t&gt;
296
</code>
297
</p>
298
<p>
299
This is a degenerate (ie, does nothing) specialization. Implementing
300
this was a piece of cake.
301
</p>
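<p>
A short check of the degenerate behaviour, using only standard
library calls. The function name char_codecvt_is_noconv is made up
for this example.
</p>

```cpp
#include <cwchar>
#include <locale>

typedef std::codecvt<char, char, std::mbstate_t> char_codecvt;

// Returns true if the facet reports that no conversion ever takes place
// and in() confirms it for a sample string.
bool char_codecvt_is_noconv()
{
  const char_codecvt& cvt
    = std::use_facet<char_codecvt>(std::locale::classic());

  const char      src[] = "tea";
  char            dst[4] = { 0 };
  std::mbstate_t  state = std::mbstate_t();
  const char*     from_next = 0;
  char*           to_next = 0;

  std::codecvt_base::result r
    = cvt.in(state, src, src + 3, from_next, dst, dst + 3, to_next);

  return cvt.always_noconv() && r == std::codecvt_base::noconv;
}
```

<p>
A result of noconv tells the caller that the input can be used as
is, so nothing needs to be written to the destination buffer.
</p>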
302
 
303
<p>
<code>
codecvt&lt;wchar_t, char, mbstate_t&gt;
</code>
</p>
<p>
This specialization, by specifying all the template parameters, pretty
much ties the hands of implementors. As such, the implementation is
straightforward, involving mbsrtowcs for conversions from char
to wchar_t and wcsrtombs for conversions from wchar_t to char.
</p>
314
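<p>
For comparison with the extended specialization discussed next, here
is a sketch of using the required
codecvt&lt;wchar_t, char, mbstate_t&gt; facet directly. The helper
widen_with_facet is a hypothetical name; the locale is passed in
explicitly so that the global locale is not consulted.
</p>

```cpp
#include <cwchar>
#include <locale>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> wide_codecvt;

// Hypothetical helper: converts s to wide characters with the facet in loc.
// Returns an empty string if the facet does not report success.
std::wstring widen_with_facet(const std::string& s, const std::locale& loc)
{
  if (s.empty())
    return std::wstring();

  const wide_codecvt& cvt = std::use_facet<wide_codecvt>(loc);
  std::mbstate_t state = std::mbstate_t();
  std::wstring out(s.size(), L'\0');   // at most one wide char per input byte

  const char* from_next = 0;
  wchar_t*    to_next = 0;
  std::codecvt_base::result r
    = cvt.in(state, s.data(), s.data() + s.size(), from_next,
             &out[0], &out[0] + out.size(), to_next);
  if (r != std::codecvt_base::ok)
    return std::wstring();

  out.resize(to_next - &out[0]);       // keep only what was produced
  return out;
}
```

<p>
With std::locale::classic() and plain ASCII input, the result should
contain the same characters widened one for one.
</p>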
 
315
<p>
Neither of these two required specializations deals with Unicode
characters. As such, libstdc++-v3 implements a partial specialization
of the codecvt class with an iconv wrapper class, __enc_traits, as the
third template parameter.
</p>
321
 
322
<p>
323
This implementation should be standards conformant. First of all, the
324
standard explicitly points out that instantiations on the third
325
template parameter, stateT, are the proper way to implement
326
non-required conversions. Second of all, the standard says (in Chapter
327
17) that partial specializations of required classes are a-ok. Third
328
of all, the requirements for the stateT type elsewhere in the standard
329
(see 21.1.2 traits typedefs) only indicate that this type be copy
330
constructible.
331
</p>
332
 
333
<p>
334
As such, the type __enc_traits is defined as a non-templatized, POD
335
type to be used as the third type of a codecvt instantiation. This
336
type is just a wrapper class for iconv, and provides an easy interface
337
to iconv functionality.
338
</p>
339
 
340
<p>
341
There are two constructors for __enc_traits:
342
</p>
343
 
344
<p>
345
<code>
346
__enc_traits() : __in_desc(0), __out_desc(0)
347
</code>
348
</p>
349
<p>
350
This default constructor sets the internal encoding to some default
351
(currently UCS4) and the external encoding to whatever is returned by
352
nl_langinfo(CODESET).
353
</p>
354
 
355
<p>
356
<code>
357
__enc_traits(const char* __int, const char* __ext)
358
</code>
359
</p>
360
<p>
361
This constructor takes as parameters string literals that indicate the
362
desired internal and external encoding. There are no defaults for
363
either argument.
364
</p>
365
 
366
<p>
One of the issues with iconv is that the string literals identifying
conversions are not standardized. Because of this, the thought of
mandating and/or enforcing some set of pre-determined valid
identifiers seems iffy. Thus, a more practical (and non-migraine
inducing) strategy was implemented: end-users can specify any string
(subject to a pre-determined length qualifier, currently 32 bytes) for
encodings. It is up to the user to make sure that these strings are
valid on the target system.
</p>
376
 
377
<p>
378
<code>
379
void
380
_M_init()
381
</code>
382
</p>
383
<p>
Strangely enough, this member function attempts to open conversion
descriptors for a given __enc_traits object. If the encoding names
are not valid, the conversion descriptors returned will
not be valid and the resulting calls to the codecvt conversion
functions will return an error.
</p>
390
 
391
<p>
392
<code>
393
bool
394
_M_good()
395
</code>
396
</p>
397
<p>
398
Provides a way to see if the given __enc_traits object has been
399
properly initialized. If the string literals describing the desired
400
internal and external encoding are not valid, initialization will
401
fail, and this will return false. If the internal and external
402
encodings are valid, but iconv_open could not allocate conversion
403
descriptors, this will also return false. Otherwise, the object is
404
ready to convert and will return true.
405
</p>
406
 
407
<p>
408
<code>
409
__enc_traits(const __enc_traits&amp;)
410
</code>
411
</p>
412
<p>
413
As iconv allocates memory and sets up conversion descriptors, the copy
414
constructor can only copy the member data pertaining to the internal
415
and external code conversions, and not the conversion descriptors
416
themselves.
417
</p>
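<p>
The behaviour described for the constructors, _M_init, _M_good and
the copy constructor can be approximated with a small wrapper around
iconv_open. This is only a sketch under assumptions: the class name
encoding_pair and its members are invented here and do not reproduce
the actual __enc_traits source.
</p>

```cpp
#include <cstring>
#include <iconv.h>

// Hypothetical stand-in for the kind of wrapper described above.
class encoding_pair
{
public:
  enum { max_name = 32 };   // length limit for encoding identifiers

  encoding_pair(const char* internal, const char* external)
  : in_desc_(invalid()), out_desc_(invalid())
  {
    copy_name(internal_, internal);
    copy_name(external_, external);
  }

  // Copies the encoding names only; descriptors are not shared or cloned.
  encoding_pair(const encoding_pair& other)
  : in_desc_(invalid()), out_desc_(invalid())
  {
    copy_name(internal_, other.internal_);
    copy_name(external_, other.external_);
  }

  ~encoding_pair() { close(); }

  // Opens one descriptor per direction; iconv_open takes (tocode, fromcode).
  void init()
  {
    close();
    in_desc_  = iconv_open(internal_, external_);   // external -> internal
    out_desc_ = iconv_open(external_, internal_);   // internal -> external
  }

  // True only if both descriptors were opened successfully.
  bool good() const
  { return in_desc_ != invalid() && out_desc_ != invalid(); }

private:
  encoding_pair& operator=(const encoding_pair&);   // omitted from the sketch

  static iconv_t invalid() { return (iconv_t)-1; }

  static void copy_name(char* dst, const char* src)
  {
    std::strncpy(dst, src, max_name - 1);
    dst[max_name - 1] = '\0';
  }

  void close()
  {
    if (in_desc_ != invalid())  { iconv_close(in_desc_);  in_desc_ = invalid(); }
    if (out_desc_ != invalid()) { iconv_close(out_desc_); out_desc_ = invalid(); }
  }

  char    internal_[max_name];
  char    external_[max_name];
  iconv_t in_desc_;
  iconv_t out_desc_;
};
```

<p>
A copy therefore starts out with good() returning false and has to
call init() itself, which mirrors the restriction on the copy
constructor noted above.
</p>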
418
 
419
<p>
420
Definitions for all the required codecvt member functions are provided
421
for this specialization, and usage of codecvt&lt;internal character type,
422
external character type, __enc_traits&gt; is consistent with other
423
codecvt usage.
424
</p>
425
 
426
<h2>
427
6.  Examples
428
</h2>
429
 
430
<ul>
431
        <li>
432
        a. conversions involving string literals
433
 
434
<pre>
  typedef codecvt_base::result                  result;
  typedef unsigned short                        unicode_t;
  typedef unicode_t                             int_type;
  typedef char                                  ext_type;
  typedef __enc_traits                          enc_type;
  typedef codecvt&lt;int_type, ext_type, enc_type&gt; unicode_codecvt;
  // traits type used for the comparison of converted data below.
  typedef char_traits&lt;int_type&gt;                 int_traits;

  const ext_type*       e_lit = "black pearl jasmine tea";
  int                   size = strlen(e_lit);
  // expected internal representation of e_lit.
  int_type              i_lit_base[24] =
  { 25088, 27648, 24832, 25344, 27392, 8192, 28672, 25856, 24832, 29184,
    27648, 8192, 27136, 24832, 29440, 27904, 26880, 28160, 25856, 8192, 29696,
    25856, 24832, 2560
  };
  const int_type*       i_lit = i_lit_base;
  const ext_type*       efrom_next;
  const int_type*       ifrom_next;
  ext_type*             e_arr = new ext_type[size + 1];
  ext_type*             eto_next;
  int_type*             i_arr = new int_type[size + 1];
  int_type*             ito_next;

  // construct a locale object with the specialized facet.
  locale                loc(locale::classic(), new unicode_codecvt);
  // sanity check the constructed locale has the specialized facet.
  VERIFY( has_facet&lt;unicode_codecvt&gt;(loc) );
  const unicode_codecvt&amp; cvt = use_facet&lt;unicode_codecvt&gt;(loc);
  // convert between const char* and unicode strings
  unicode_codecvt::state_type state01("UNICODE", "ISO_8859-1");
  // initialize_state is a testsuite helper; it is assumed here to
  // open the conversion descriptors by calling state01._M_init().
  initialize_state(state01);
  result r1 = cvt.in(state01, e_lit, e_lit + size, efrom_next,
                     i_arr, i_arr + size, ito_next);
  VERIFY( r1 == codecvt_base::ok );
  VERIFY( !int_traits::compare(i_arr, i_lit, size) );
  VERIFY( efrom_next == e_lit + size );
  VERIFY( ito_next == i_arr + size );
</pre>
472
        </li>
473
        <li>
474
        b. conversions involving std::string
475
        </li>
476
        <li>
477
        c. conversions involving std::filebuf and std::ostream
478
        </li>
479
</ul>
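<p>
Example a. relies on the __enc_traits extension. For readers who
want to see the underlying iconv calls in isolation, the sketch
below converts ISO-8859-1 text to big-endian UTF-16 code units. The
function name latin1_to_utf16 is hypothetical, and the encoding
names &quot;UTF-16BE&quot; and &quot;ISO-8859-1&quot; are assumed to
be accepted by the local iconv.
</p>

```cpp
#include <cstddef>
#include <iconv.h>
#include <string>
#include <vector>

// Hypothetical helper: returns one 16-bit code unit per input byte,
// or an empty vector if the conversion could not be performed.
std::vector<unsigned short> latin1_to_utf16(const std::string& text)
{
  std::vector<unsigned short> units;
  if (text.empty())
    return units;

  iconv_t cd = iconv_open("UTF-16BE", "ISO-8859-1");   // (tocode, fromcode)
  if (cd == (iconv_t)-1)
    return units;

  std::vector<char> in(text.begin(), text.end());  // iconv wants mutable input
  std::vector<char> out(2 * text.size());          // two bytes per character

  char*       in_ptr = &in[0];
  char*       out_ptr = &out[0];
  std::size_t in_left = in.size();
  std::size_t out_left = out.size();

  std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
  iconv_close(cd);
  if (rc == (std::size_t)-1 || in_left != 0)
    return units;

  // Assemble code units from the big-endian byte pairs that were written.
  const std::size_t produced = out.size() - out_left;
  for (std::size_t i = 0; i + 1 < produced; i += 2)
    {
      const unsigned char hi = static_cast<unsigned char>(out[i]);
      const unsigned char lo = static_cast<unsigned char>(out[i + 1]);
      units.push_back(static_cast<unsigned short>((hi << 8) | lo));
    }
  return units;
}
```

<p>
Because the byte order is named explicitly in the target encoding,
the numeric value of each code unit does not depend on the
endianness of the host.
</p>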
480
 
481
More information can be found in the following testcases:
482
<ul>
483
<li> testsuite/22_locale/codecvt_char_char.cc       </li>
484
<li> testsuite/22_locale/codecvt_unicode_wchar_t.cc </li>
485
<li> testsuite/22_locale/codecvt_unicode_char.cc    </li>
486
<li> testsuite/22_locale/codecvt_wchar_t_char.cc    </li>
487
</ul>
488
 
489
<h2>
490
7.  Unresolved Issues
491
</h2>
492
<ul>
493
<li>
494
   a. things that are sketchy, or remain unimplemented:
495
      do_encoding, max_length and length member functions
496
      are only weakly implemented. I have no idea how to do
497
      this correctly, and in a generic manner.  Nathan?
498
</li>
499
 
500
<li>
501
   b. conversions involving std::string
502
 
503
   <ul>
504
      <li>
505
      how should operators != and == work for string of
506
      different/same encoding?
507
      </li>
508
 
509
      <li>
510
      what is equal? A byte by byte comparison or an
511
      encoding then byte comparison?
512
      </li>
513
 
514
      <li>
515
      conversions between narrow, wide, and unicode strings
516
      </li>
517
   </ul>
518
</li>
519
<li>
520
   c. conversions involving std::filebuf and std::ostream
521
   <ul>
522
      <li>
523
      how to initialize the state object in a
524
      standards-conformant manner?
525
      </li>
526
 
527
                <li>
528
      how to synchronize the &quot;C&quot; and &quot;C++&quot;
529
      conversion information?
530
      </li>
531
 
532
                <li>
533
      wchar_t/char internal buffers and conversions between
534
      internal/external buffers?
535
      </li>
536
   </ul>
537
</li>
538
</ul>
539
 
540
<h2>
541
8. Acknowledgments
542
</h2>
543
Ulrich Drepper for the iconv suggestions and patient answering of
544
late-night questions, Jason Merrill for the template partial
545
specialization hints, language clarification, and wchar_t fixes.
546
 
547
<h2>
548
9. Bibliography / Referenced Documents
549
</h2>
550
 
551
Drepper, Ulrich, GNU libc (glibc) 2.2 manual. In particular, Chapters &quot;6. Character Set Handling&quot; and &quot;7 Locales and Internationalization&quot;
552
 
553
<p>
554
Drepper, Ulrich, Numerous, late-night email correspondence
555
</p>
556
 
557
<p>
558
Feather, Clive, &quot;A brief description of Normative Addendum 1,&quot; in particular the parts on Extended Character Sets
559
http://www.lysator.liu.se/c/na1.html
560
</p>
561
 
562
<p>
563
Haible, Bruno, &quot;The Unicode HOWTO&quot; v0.18, 4 August 2000
564
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
565
</p>
566
 
567
<p>
568
ISO/IEC 14882:1998 Programming languages - C++
569
</p>
570
 
571
<p>
572
ISO/IEC 9899:1999 Programming languages - C
573
</p>
574
 
575
<p>
Kuhn, Markus, &quot;UTF-8 and Unicode FAQ for Unix/Linux&quot;
http://www.cl.cam.ac.uk/~mgk25/unicode.html
</p>
579
 
580
<p>
581
Langer, Angelika and Klaus Kreft, Standard C++ IOStreams and Locales, Advanced Programmer's Guide and Reference, Addison Wesley Longman, Inc. 2000
582
</p>
583
 
584
<p>
585
Stroustrup, Bjarne, Appendix D, The C++ Programming Language, Special Edition, Addison Wesley, Inc. 2000
586
</p>
587
 
588
<p>
589
System Interface Definitions, Issue 6 (IEEE Std. 1003.1-200x)
590
The Open Group/The Institute of Electrical and Electronics Engineers, Inc.
591
http://www.opennc.org/austin/docreg.html
592
</p>
593
 
594
</body>
595
</html>
