OpenCores
URL https://opencores.org/ocsvn/openrisc/openrisc/trunk

Subversion Repositories openrisc

[/] [openrisc/] [trunk/] [gnu-dev/] [or1k-gcc/] [libstdc++-v3/] [doc/] [xml/] [manual/] [codecvt.xml] - Blame information for rev 742

codecvt

Keywords: ISO C++, codecvt
 
The standard class codecvt attempts to address conversions between
different character encoding schemes. In particular, the standard
attempts to detail conversions between the implementation-defined wide
characters (hereafter referred to as wchar_t) and the standard type
char that is so beloved in classic C (which can now be
referred to as narrow characters). This document attempts to describe
how the GNU libstdc++ implementation deals with the conversion between
wide and narrow characters, and also presents a framework for dealing
with the huge number of other encodings that iconv can convert,
including Unicode and UTF8. Design issues and requirements are
addressed, and examples of correct usage for both the required
specializations for wide and narrow characters and the
implementation-provided extended functionality are given.
 
Requirements

Around page 425 of the C++ Standard, this charming heading comes into view:

22.2.1.5 - Template class codecvt

The text around the codecvt definition gives some clues:
 
-1- The class codecvt<internT,externT,stateT> is for use when
converting from one codeset to another, such as from wide characters
to multibyte characters, between wide character encodings such as
Unicode and EUC.

Hmm. So, in some unspecified way, Unicode encodings and
translations between other character sets should be handled by this
class.
 
-2- The stateT argument selects the pair of codesets being mapped between.

Ah ha! Another clue...
 
-3- The instantiations required in the Table ??
(lib.locale.category), namely codecvt<wchar_t,char,mbstate_t> and
codecvt<char,char,mbstate_t>, convert the implementation-defined
native character set. codecvt<char,char,mbstate_t> implements a
degenerate conversion; it does not convert at
all. codecvt<wchar_t,char,mbstate_t> converts between the native
character sets for tiny and wide characters. Instantiations on
mbstate_t perform conversion between encodings known to the library
implementor.  Other encodings can be converted by specializing on a
user-defined stateT type. The stateT object can contain any state that
is useful to communicate to or from the specialized do_convert member.
 
At this point, a couple of points become clear:

One: The standard clearly implies that attempts to add non-required
(yet useful and widely used) conversions need to do so through the
third template parameter, stateT.
 
Two: The required conversions, by specifying mbstate_t as the third
template parameter, imply an implementation strategy that is mostly
(or wholly) based on the underlying C library, and the functions
mbsrtowcs and wcsrtombs in particular.
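
As a rough illustration of the required wide/narrow specialization in
use, the sketch below widens a narrow string through the facet held by
a locale. The helper name widen is made up for this example, and the
one-wchar_t-per-char buffer sizing is an assumption that holds for the
plain ASCII input shown; it is not taken from the libstdc++ sources.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: widen a narrow string with the required facet
// codecvt<wchar_t, char, mbstate_t> taken from the given locale.
// Returns true and fills 'out' only if the whole input was converted.
bool widen(const std::string& in, std::wstring& out,
           const std::locale& loc = std::locale::classic())
{
  typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
  const facet_type& cvt = std::use_facet<facet_type>(loc);

  out.clear();
  if (in.empty())
    return true;

  std::mbstate_t state = std::mbstate_t();  // initial conversion state
  std::wstring buf(in.size(), L'\0');       // assumes <= 1 wchar_t per char
  const char* from_next = 0;
  wchar_t* to_next = 0;

  std::codecvt_base::result r =
    cvt.in(state, in.data(), in.data() + in.size(), from_next,
           &buf[0], &buf[0] + buf.size(), to_next);

  if (r != std::codecvt_base::ok || from_next != in.data() + in.size())
    return false;

  out.assign(&buf[0], to_next);
  return true;
}
```

Calling widen("tea", w) with the classic locale is expected to leave
L"tea" in w. Results for non-ASCII input depend on the locale passed in.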
 
Design

wchar_t Size

      The simple implementation detail of wchar_t's size seems to
      repeatedly confound people. Many systems use a two byte,
      unsigned integral type to represent wide characters, and use an
      internal encoding of Unicode or UCS2. (See AIX, Microsoft NT,
      Java, others.) Other systems use a four byte integral
      type to represent wide characters, and use an internal encoding
      of UCS4. (GNU/Linux systems using glibc, in particular.) The C
      programming language (and thus C++) does not specify a specific
      size for the type wchar_t.
    
      Thus, portable C++ code cannot assume a byte size (or endianness) either.
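
Since the size cannot be assumed, code that cares can ask. The
following is only a sketch; the function names wchar_bits and
wchar_holds_ucs4 are invented for this example.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <limits>

// Storage width of wchar_t in bits on this implementation.
inline std::size_t wchar_bits()
{ return sizeof(wchar_t) * CHAR_BIT; }

// True if every code point up to 0x10FFFF fits in a single wchar_t,
// i.e. a UCS4-style internal encoding is at least possible.
inline bool wchar_holds_ucs4()
{
  return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
         >= 0x10FFFFul;
}
```

On a two byte wchar_t system wchar_holds_ucs4() would be expected to
return false, and on a four byte system true, but neither outcome is
guaranteed by the language.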
 
Support for Unicode

    Probably the most frequently asked question about code conversion
    is: "So dudes, what's the deal with Unicode strings?"
    The dude part is optional, but apparently the usefulness of
    Unicode strings is pretty widely appreciated. Sadly, this specific
    encoding (and other useful encodings like UTF8, UCS4, ISO 8859-10,
    etc.) is not mentioned in the C++ standard.
  
    A couple of comments:

    The thought that all one needs to convert between two arbitrary
    codesets is two types and some kind of state argument is
    unfortunate. In particular, encodings may be stateless. The naming
    of the third parameter as stateT is unfortunate, as what is really
    needed is some kind of generalized type that accounts for the
    issues that abstract encodings will need. The minimum information
    that is required includes:
  
        Identifiers for each of the codesets involved in the
        conversion. For example, using the iconv family of functions
        from the Single Unix Specification (what used to be called
        X/Open) hosted on the GNU/Linux operating system allows
        bi-directional mapping between far more than the following
        tantalizing possibilities:
      
        (An edited list taken from `iconv --list` on a
        Red Hat 6.2/Intel system:

8859_1, 8859_9, 10646-1:1993, 10646-1:1993/UCS4, ARABIC, ARABIC7,
ASCII, EUC-CN, EUC-JP, EUC-KR, EUC-TW, GREEK-CCITT, GREEK, GREEK7-OLD,
GREEK7, GREEK8, HEBREW, ISO-8859-1, ISO-8859-2, ISO-8859-3,
ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8,
ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14,
ISO-8859-15, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4,
ISO-10646/UTF-8, ISO-10646/UTF8, SHIFT-JIS, SHIFT_JIS, UCS-2, UCS-4,
UCS2, UCS4, UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-8,
UTF-16, UTF8, UTF16).
 
For iconv-based implementations, string literals for each of the
encodings (e.g. "UCS-2" and "UTF-8") are necessary,
although for other,
non-iconv implementations a table of enumerated values or some other
mechanism may be required.
 
 Maximum length of the identifying string literal.

 Some encodings require explicit endian-ness. As such, some kind
  of endian marker or other byte-order marker will be necessary. See
  "Footnotes for C/C++ developers" in Haible for more information on
  UCS-2/Unicode endian issues. (Summary: big endian seems most likely,
  however implementations, most notably Microsoft, vary.)

 Types representing the conversion state, for conversions involving
  the machinery in the "C" library, or the conversion descriptor, for
  conversions using iconv (such as the type iconv_t). Note that the
  conversion descriptor encodes more information than a simple encoding
  state type.

 Conversion descriptors for both directions of encoding. (i.e., both
  UCS-2 to UTF-8 and UTF-8 to UCS-2.)

 Something to indicate if the conversion requested is valid.

 Something to represent if the conversion descriptors are valid.

 Some way to enforce strict type checking on the internal and
  external types. As part of this, the size of the internal and
  external types will need to be known.
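
To make the list above more concrete, here is a minimal sketch of a
record that carries most of those items. Everything in it
(conversion_info, make_info, the field names, the 32 byte name limit)
is hypothetical and chosen for illustration; the conversion
descriptors themselves are left out so the sketch does not depend on
iconv.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical record collecting the items listed above.
// None of these names come from libstdc++.
struct conversion_info
{
  enum { max_name = 32 };          // maximum length of an encoding name
  char internal_name[max_name];    // identifier of the internal codeset
  char external_name[max_name];    // identifier of the external codeset
  unsigned short byte_order_mark;  // e.g. 0xFEFF or 0xFFFE, 0 for "none"
  std::size_t internal_size;       // sizeof the internal character type
  std::size_t external_size;       // sizeof the external character type
  bool names_ok;                   // false if a name had to be truncated
};

// Fill a record for the character types InternT and ExternT.
template<typename InternT, typename ExternT>
conversion_info make_info(const char* int_name, const char* ext_name,
                          unsigned short bom = 0)
{
  const std::size_t cap = conversion_info::max_name;
  conversion_info info;
  info.names_ok = std::strlen(int_name) < cap
                  && std::strlen(ext_name) < cap;
  // strncpy pads with nulls; the last byte is forced to null so that
  // an over-long name is truncated rather than left unterminated.
  std::strncpy(info.internal_name, int_name, cap - 1);
  info.internal_name[cap - 1] = '\0';
  std::strncpy(info.external_name, ext_name, cap - 1);
  info.external_name[cap - 1] = '\0';
  info.byte_order_mark = bom;
  info.internal_size = sizeof(InternT);
  info.external_size = sizeof(ExternT);
  return info;
}
```

A call such as make_info<unsigned short, char>("UCS-2", "UTF-8", 0xFEFF)
records the sizes of both character types at compile time, which is one
way the type checking requirement could be approached.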
 
Other Issues

In addition, multi-threaded and multi-locale environments also impact
the design and requirements for code conversions. In particular, they
affect the required specialization codecvt<wchar_t, char, mbstate_t>
when implemented using standard "C" functions.

Three problems arise, one big, one of medium importance, and one small.
 
First, the small: mbsrtowcs and wcsrtombs may not be multithread-safe
on all systems required by the GNU tools. For GNU/Linux and glibc,
this is not an issue.
 
Of medium concern, in the grand scope of things, is that the functions
used to implement this specialization work on null-terminated
strings. Buffers, especially file buffers, may not be null-terminated,
thus giving conversions that end prematurely or are otherwise
incorrect. Yikes!
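
One way around the null-termination problem, shown here only as a
sketch and not as what libstdc++ actually does, is to convert one
character at a time with mbrtowc, which takes an explicit byte count.
The helper name widen_range is made up, and the handling of a
converted null character assumes it occupies a single byte in the
external encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: convert the bytes in [first, last) to wide
// characters using the current global C locale. No terminating null
// is required, and embedded nulls do not stop the conversion.
// Returns false on an invalid or incomplete multibyte sequence.
bool widen_range(const char* first, const char* last, std::wstring& out)
{
  std::mbstate_t state = std::mbstate_t();  // initial conversion state
  out.clear();
  while (first != last)
    {
      wchar_t wc;
      std::size_t n = std::mbrtowc(&wc, first,
                                   static_cast<std::size_t>(last - first),
                                   &state);
      if (n == static_cast<std::size_t>(-1)      // invalid sequence
          || n == static_cast<std::size_t>(-2))  // sequence cut short
        return false;
      if (n == 0)
        n = 1;  // a null character was converted; assumed to be one byte
      out.push_back(wc);
      first += n;
    }
  return true;
}
```

Note that this still relies on the global "C" locale, so it does
nothing for the locale problem described next.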
 
The last, and fundamental problem, is the assumption of a global
locale for all the "C" functions referenced above. For something like
C++ iostreams (where codecvt is explicitly used) the notion of
multiple locales is fundamental. In practice, most users may not run
into this limitation. However, as a quality of implementation issue,
the GNU C++ library would like to offer a solution that allows
multiple locales and/or simultaneous usage with computationally
correct results. In short, libstdc++ is trying to offer, as an
option, a high-quality implementation, damn the additional complexity!
 
For the required specialization codecvt<wchar_t, char, mbstate_t>,
conversions are made between the internal character set (always UCS4
on GNU/Linux) and whatever the currently selected locale for the
LC_CTYPE category implements.
 
Implementation

The two required specializations are implemented as follows:
 
codecvt<char, char, mbstate_t>

This is a degenerate (i.e., does nothing) specialization. Implementing
this was a piece of cake.
 
codecvt<wchar_t, char, mbstate_t>

This specialization, by specifying all the template parameters, pretty
much ties the hands of implementors. As such, the implementation is
straightforward, involving mbsrtowcs for the conversions from char
to wchar_t and wcsrtombs for conversions from wchar_t to char.
 
Neither of these two required specializations deals with Unicode
characters. As such, libstdc++ implements a partial specialization
of the codecvt class with an iconv wrapper class, encoding_state, as the
third template parameter.
 
This implementation should be standards conformant. First of all, the
standard explicitly points out that instantiations on the third
template parameter, stateT, are the proper way to implement
non-required conversions. Second of all, the standard says (in Chapter
17) that partial specializations of required classes are a-ok. Third
of all, the requirements for the stateT type elsewhere in the standard
(see 21.1.2 traits typedefs) only indicate that this type be copy
constructible.
 
As such, the type encoding_state is defined as a non-templatized, POD
type to be used as the third type of a codecvt instantiation. This
type is just a wrapper class for iconv, and provides an easy interface
to iconv functionality.
 
There are two constructors for encoding_state:

encoding_state() : __in_desc(0), __out_desc(0)

This default constructor sets the internal encoding to some default
(currently UCS4) and the external encoding to whatever is returned by
nl_langinfo(CODESET).
 
encoding_state(const char* __int, const char* __ext)

This constructor takes as parameters string literals that indicate the
desired internal and external encoding. There are no defaults for
either argument.
 
One of the issues with iconv is that the string literals identifying
conversions are not standardized. Because of this, the thought of
mandating and/or enforcing some set of pre-determined valid
identifiers seems iffy: thus, a more practical (and non-migraine
inducing) strategy was implemented: end-users can specify any string
(subject to a pre-determined length qualifier, currently 32 bytes) for
encodings. It is up to the user to make sure that these strings are
valid on the target system.
 
void
_M_init()

Strangely enough, this member function attempts to open conversion
descriptors for a given encoding_state object. If the encoding names
are not valid, the conversion descriptors returned will
not be valid and the resulting calls to the codecvt conversion
functions will return error.
 
bool
_M_good()

Provides a way to see if the given encoding_state object has been
properly initialized. If the string literals describing the desired
internal and external encoding are not valid, initialization will
fail, and this will return false. If the internal and external
encodings are valid, but iconv_open could not allocate conversion
descriptors, this will also return false. Otherwise, the object is
ready to convert and will return true.
 
encoding_state(const encoding_state&)

As iconv allocates memory and sets up conversion descriptors, the copy
constructor can only copy the member data pertaining to the internal
and external code conversions, and not the conversion descriptors
themselves.
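
The shape of such a wrapper can be sketched as follows. This is not
the libstdc++ encoding_state; the class iconv_pair and all of its
members are hypothetical stand-ins, written against the POSIX
iconv_open and iconv_close calls, and it assumes an iconv
implementation is available without extra link flags (as with glibc).
In this sketch a copy re-opens descriptors of its own from the copied
names rather than sharing them.

```cpp
#include <cassert>
#include <cstring>
#include <iconv.h>

// Hypothetical stand-in for the wrapper described above.
class iconv_pair
{
  enum { max_name = 32 };  // name length limit, as in the text
  char int_enc_[max_name];
  char ext_enc_[max_name];
  iconv_t in_desc_;        // external -> internal
  iconv_t out_desc_;       // internal -> external

  // iconv_open reports failure by returning (iconv_t)-1.
  static iconv_t invalid() { return (iconv_t)-1; }

  static void copy_name(char* dst, const char* src)
  {
    std::strncpy(dst, src, max_name - 1);
    dst[max_name - 1] = '\0';
  }

  void close()
  {
    if (in_desc_ != invalid()) iconv_close(in_desc_);
    if (out_desc_ != invalid()) iconv_close(out_desc_);
    in_desc_ = out_desc_ = invalid();
  }

  iconv_pair& operator=(const iconv_pair&);  // not needed for the sketch

public:
  iconv_pair(const char* int_enc, const char* ext_enc)
  : in_desc_(invalid()), out_desc_(invalid())
  {
    copy_name(int_enc_, int_enc);
    copy_name(ext_enc_, ext_enc);
    init();
  }

  // Copy only the encoding names; never share the descriptors.
  iconv_pair(const iconv_pair& other)
  : in_desc_(invalid()), out_desc_(invalid())
  {
    copy_name(int_enc_, other.int_enc_);
    copy_name(ext_enc_, other.ext_enc_);
    init();
  }

  ~iconv_pair() { close(); }

  // (Re)open both descriptors. Note the argument order of
  // iconv_open: destination encoding first, source encoding second.
  void init()
  {
    close();
    in_desc_ = iconv_open(int_enc_, ext_enc_);
    out_desc_ = iconv_open(ext_enc_, int_enc_);
  }

  // True only if both directions could be opened.
  bool good() const
  { return in_desc_ != invalid() && out_desc_ != invalid(); }
};
```

Which encoding names are accepted is up to the iconv on the target
system, so good() should be checked before any conversion is tried.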
 
Definitions for all the required codecvt member functions are provided
for this specialization, and usage of codecvt<internal character type,
external character type, encoding_state> is consistent with other
codecvt usage.
 
Use

A conversion involving a string literal.
 
  // VERIFY and initialize_state are assumed to be test-harness helpers
  // rather than part of the library interface.
  typedef codecvt_base::result                  result;
  typedef unsigned short                        unicode_t;
  typedef unicode_t                             int_type;
  typedef char                                  ext_type;
  typedef encoding_state                        state_type;
  typedef codecvt<int_type, ext_type, state_type> unicode_codecvt;
  // needed for the compare() call below
  typedef char_traits<int_type>                 int_traits;

  const ext_type*       e_lit = "black pearl jasmine tea";
  int                   size = strlen(e_lit);
  int_type              i_lit_base[24] =
  { 25088, 27648, 24832, 25344, 27392, 8192, 28672, 25856, 24832, 29184,
    27648, 8192, 27136, 24832, 29440, 27904, 26880, 28160, 25856, 8192, 29696,
    25856, 24832, 2560
  };
  const int_type*       i_lit = i_lit_base;
  const ext_type*       efrom_next;
  const int_type*       ifrom_next;
  ext_type*             e_arr = new ext_type[size + 1];
  ext_type*             eto_next;
  int_type*             i_arr = new int_type[size + 1];
  int_type*             ito_next;

  // construct a locale object with the specialized facet.
  locale                loc(locale::classic(), new unicode_codecvt);
  // sanity check the constructed locale has the specialized facet.
  VERIFY( has_facet<unicode_codecvt>(loc) );
  const unicode_codecvt& cvt = use_facet<unicode_codecvt>(loc);
  // convert between const char* and unicode strings
  unicode_codecvt::state_type state01("UNICODE", "ISO_8859-1");
  initialize_state(state01);
  result r1 = cvt.in(state01, e_lit, e_lit + size, efrom_next,
                     i_arr, i_arr + size, ito_next);
  VERIFY( r1 == codecvt_base::ok );
  VERIFY( !int_traits::compare(i_arr, i_lit, size) );
  VERIFY( efrom_next == e_lit + size );
  VERIFY( ito_next == i_arr + size );

  // release the scratch buffers
  delete [] e_arr;
  delete [] i_arr;
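
The numbers in i_lit_base are consistent with each ISO 8859-1 byte
sitting in the high byte of a 16-bit unit, which is what big-endian
UCS-2 looks like when read as a native unsigned short on a
little-endian host. That reading is inferred from the values
themselves, not stated in the text. The helper names below are
invented for this check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the 16-bit unit obtained when the big-endian
// UCS-2 encoding of an ISO 8859-1 character is read as a native
// unsigned short on a little-endian host. The character code ends
// up in the high byte.
inline unsigned short swapped_unit(unsigned char c)
{ return static_cast<unsigned short>(c << 8); }

// Apply swapped_unit to the first n characters of s.
std::vector<unsigned short> swapped_units(const char* s, std::size_t n)
{
  std::vector<unsigned short> v;
  v.reserve(n);
  for (std::size_t i = 0; i < n; ++i)
    v.push_back(swapped_unit(static_cast<unsigned char>(s[i])));
  return v;
}
```

For example, 'b' is 0x62, and 0x62 shifted left by eight bits is 25088,
the first entry of i_lit_base.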
 
Future

   a. things that are sketchy, or remain unimplemented:
      do_encoding, max_length and length member functions
      are only weakly implemented. I have no idea how to do
      this correctly, and in a generic manner.  Nathan?

   b. conversions involving std::string

      - how should operators != and == work for strings of
        different/same encoding?

      - what is equal? A byte by byte comparison or an
        encoding then byte comparison?

      - conversions between narrow, wide, and unicode strings

   c. conversions involving std::filebuf and std::ostream

      - how to initialize the state object in a
        standards-conformant manner?

      - how to synchronize the "C" and "C++"
        conversion information?

      - wchar_t/char internal buffers and conversions between
        internal/external buffers?
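
For comparison with item a, the two required specializations can be
asked what they report through the public encoding, max_length and
always_noconv members. This is a small sketch; facet_summary and
summarize are names made up here, and nothing below exercises the
encoding_state specialization.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>

// Hypothetical summary of what a required codecvt facet reports.
struct facet_summary
{
  int encoding;        // result of encoding()
  int max_length;      // result of max_length()
  bool always_noconv;  // result of always_noconv()
};

// Query codecvt<InternT, char, mbstate_t> in the given locale.
template<typename InternT>
facet_summary summarize(const std::locale& loc = std::locale::classic())
{
  typedef std::codecvt<InternT, char, std::mbstate_t> facet_type;
  const facet_type& cvt = std::use_facet<facet_type>(loc);
  facet_summary s;
  s.encoding = cvt.encoding();
  s.max_length = cvt.max_length();
  s.always_noconv = cvt.always_noconv();
  return s;
}
```

summarize<char>() describes the degenerate specialization, for which
the standard fixes encoding and max_length at 1 and always_noconv at
true. The values for summarize<wchar_t>() depend on the implementation
and locale.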
 
Bibliography

  The GNU C Library. Roland McGrath, Ulrich Drepper. 2007. FSF.
    Chapters 6 Character Set Handling and 7 Locales and Internationalization.

  Correspondence. Ulrich Drepper. 2002.

  ISO/IEC 14882:1998 Programming languages - C++. 1998. ISO.

  ISO/IEC 9899:1999 Programming languages - C. 1999. ISO.

  System Interface Definitions, Issue 7 (IEEE Std. 1003.1-2008). 2008.
    The Open Group/The Institute of Electrical and Electronics
    Engineers, Inc.
    http://www.opengroup.org/austin

  The C++ Programming Language, Special Edition. Bjarne Stroustrup.
    2000. Addison Wesley, Inc. Appendix D.

  Standard C++ IOStreams and Locales: Advanced Programmer's Guide and
    Reference. Angelika Langer, Klaus Kreft. 2000.
    Addison Wesley Longman, Inc.

  A brief description of Normative Addendum 1. Clive Feather.
    Extended Character Sets.
    http://www.lysator.liu.se/c/na1.html

  The Unicode HOWTO. Bruno Haible.
    http://tldp.org/HOWTO/Unicode-HOWTO.html

  UTF-8 and Unicode FAQ for Unix/Linux. Markus Kuhn.
    http://www.cl.cam.ac.uk/~mgk25/unicode.html
