codecvt

The standard class codecvt attempts to address conversions between
different character encoding schemes. In particular, the standard
attempts to detail conversions between the implementation-defined wide
characters (hereafter referred to as wchar_t) and the standard type
char that is so beloved in classic C (which can now be
referred to as narrow characters). This document attempts to describe
how the GNU libstdc++ implementation deals with the conversion between
wide and narrow characters, and also presents a framework for dealing
with the huge number of other encodings that iconv can convert,
including Unicode and UTF8. Design issues and requirements are
addressed, and examples of correct usage for both the required
specializations for wide and narrow characters and the
implementation-provided extended functionality are given.

Requirements

Around page 425 of the C++ Standard, this charming heading comes into view:

    22.2.1.5 - Template class codecvt

The text around the codecvt definition gives some clues:

    -1- The class codecvt<internT,externT,stateT> is for use when
    converting from one codeset to another, such as from wide characters
    to multibyte characters, between wide character encodings such as
    Unicode and EUC.

Hmm. So, in some unspecified way, Unicode encodings and
translations between other character sets should be handled by this
class.

    -2- The stateT argument selects the pair of codesets being mapped between.

Ah ha! Another clue...

    -3- The instantiations required in the Table ??
    (lib.locale.category), namely codecvt<wchar_t,char,mbstate_t> and
    codecvt<char,char,mbstate_t>, convert the implementation-defined
    native character set. codecvt<char,char,mbstate_t> implements a
    degenerate conversion; it does not convert at
    all. codecvt<wchar_t,char,mbstate_t> converts between the native
    character sets for tiny and wide characters. Instantiations on
    mbstate_t perform conversion between encodings known to the library
    implementor.  Other encodings can be converted by specializing on a
    user-defined stateT type. The stateT object can contain any state that
    is useful to communicate to or from the specialized do_convert member.

At this point, a couple of points become clear:

One: The standard clearly implies that attempts to add non-required
(yet useful and widely used) conversions need to do so through the
third template parameter, stateT.

Two: The required conversions, by specifying mbstate_t as the third
template parameter, imply an implementation strategy that is mostly
(or wholly) based on the underlying C library, and the functions
mbsrtowcs and wcsrtombs in particular.

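The two C library functions named in the second point can be sketched as
follows. This is only an illustration of the C interface, not the
libstdc++ implementation: the helper names widen_sketch and narrow_sketch
are hypothetical, and the conversion uses whatever locale the C library
currently has selected (the default "C" locale unless setlocale has been
called).

```cpp
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper (not part of libstdc++): widen a null-terminated
// narrow string with mbsrtowcs, using the current C locale.
bool widen_sketch(const char* src, std::wstring& out)
{
    std::mbstate_t state = std::mbstate_t();    // initial conversion state
    const char* p = src;
    // With a null destination, mbsrtowcs only counts the wide characters.
    std::size_t n = std::mbsrtowcs(0, &p, 0, &state);
    if (n == static_cast<std::size_t>(-1))
        return false;                           // invalid multibyte sequence

    std::vector<wchar_t> buf(n + 1);            // room for the terminator
    state = std::mbstate_t();
    p = src;
    std::mbsrtowcs(&buf[0], &p, buf.size(), &state);
    out.assign(&buf[0], n);
    return true;
}

// Hypothetical helper: the reverse direction, using wcsrtombs.
bool narrow_sketch(const wchar_t* src, std::string& out)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* p = src;
    // With a null destination, wcsrtombs only counts the bytes needed.
    std::size_t n = std::wcsrtombs(0, &p, 0, &state);
    if (n == static_cast<std::size_t>(-1))
        return false;                           // unrepresentable character

    std::vector<char> buf(n + 1);
    state = std::mbstate_t();
    p = src;
    std::wcsrtombs(&buf[0], &p, buf.size(), &state);
    out.assign(&buf[0], n);
    return true;
}
```

Both helpers take null-terminated input and depend on the global C
locale, which is exactly the coupling to the C library that the
mbstate_t parameter implies.
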
Design

wchar_t Size

The simple implementation detail of wchar_t's size seems to
repeatedly confound people. Many systems use a two byte,
unsigned integral type to represent wide characters, and use an
internal encoding of Unicode or UCS2. (See AIX, Microsoft NT,
Java, others.) Other systems use a four byte integral
type to represent wide characters, and use an internal encoding
of UCS4. (GNU/Linux systems using glibc, in particular.) The C
programming language (and thus C++) does not specify a specific
size for the type wchar_t.

Thus, portable C++ code cannot assume a byte size (or endianness) either.

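Since the size of wchar_t varies, code that cares can query it rather
than assume it. The following is a small sketch; the function names are
hypothetical and chosen only for this illustration.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Number of bits in the object representation of wchar_t on this platform.
std::size_t wchar_bits_sketch()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Whether wchar_t is a signed type here; this also varies by platform.
bool wchar_is_signed_sketch()
{
    return std::numeric_limits<wchar_t>::is_signed;
}

// Whether a single wchar_t can hold every code point up to 0x10FFFF.
bool wchar_holds_ucs4_sketch()
{
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFul;
}
```

On a system with a two byte wchar_t the last function returns false,
which is one reason an internal encoding of UCS4 cannot be assumed
portably.
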
  Support for Unicode

Probably the most frequently asked question about code conversion
is: "So dudes, what's the deal with Unicode strings?"
The dude part is optional, but apparently the usefulness of
Unicode strings is pretty widely appreciated. Sadly, this specific
encoding (and other useful encodings like UTF8, UCS4, ISO 8859-10,
etc.) is not mentioned in the C++ standard.

A couple of comments:

The thought that all one needs to convert between two arbitrary
codesets is two types and some kind of state argument is
unfortunate. In particular, encodings may be stateless. The naming
of the third parameter as stateT is unfortunate, as what is really
needed is some kind of generalized type that accounts for the
issues that abstract encodings will need. The minimum information
that is required includes:

  * Identifiers for each of the codesets involved in the
    conversion. For example, using the iconv family of functions
    from the Single Unix Specification (what used to be called
    X/Open) hosted on the GNU/Linux operating system allows
    bi-directional mapping between far more than the following
    tantalizing possibilities:

    (An edited list taken from `iconv --list` on a
    Red Hat 6.2/Intel system:

    8859_1, 8859_9, 10646-1:1993, 10646-1:1993/UCS4, ARABIC, ARABIC7,
    ASCII, EUC-CN, EUC-JP, EUC-KR, EUC-TW, GREEK-CCITT, GREEK, GREEK7-OLD,
    GREEK7, GREEK8, HEBREW, ISO-8859-1, ISO-8859-2, ISO-8859-3,
    ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8,
    ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14,
    ISO-8859-15, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4,
    ISO-10646/UTF-8, ISO-10646/UTF8, SHIFT-JIS, SHIFT_JIS, UCS-2, UCS-4,
    UCS2, UCS4, UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-8,
    UTF-16, UTF8, UTF16).

    For iconv-based implementations, string literals for each of the
    encodings (e.g. "UCS-2" and "UTF-8") are necessary,
    although for other,
    non-iconv implementations a table of enumerated values or some other
    mechanism may be required.

  * Maximum length of the identifying string literal.

  * Some encodings require explicit endian-ness. As such, some kind
    of endian marker or other byte-order marker will be necessary. See
    "Footnotes for C/C++ developers" in Haible for more information on
    UCS-2/Unicode endian issues. (Summary: big endian seems most likely,
    however implementations, most notably Microsoft, vary.)

  * Types representing the conversion state, for conversions involving
    the machinery in the "C" library, or the conversion descriptor, for
    conversions using iconv (such as the type iconv_t). Note that the
    conversion descriptor encodes more information than a simple encoding
    state type.

  * Conversion descriptors for both directions of encoding. (i.e., both
    UCS-2 to UTF-8 and UTF-8 to UCS-2.)

  * Something to indicate if the conversion requested is valid.

  * Something to represent if the conversion descriptors are valid.

  * Some way to enforce strict type checking on the internal and
    external types. As part of this, the size of the internal and
    external types will need to be known.

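The list above can be read as the member list of a small class. The
following sketch assumes a POSIX system that provides iconv in
<iconv.h>. The type encoding_info_sketch and all of its members are
hypothetical names for this illustration; it is not the libstdc++
encoding_state type described later.

```cpp
#include <iconv.h>
#include <cstring>

// Hypothetical type for illustration; this is NOT libstdc++'s encoding_state.
class encoding_info_sketch
{
public:
    enum { max_name = 32 };     // maximum identifier length, NUL included

    encoding_info_sketch(const char* internal, const char* external,
                         int internal_bom = 0, int external_bom = 0)
    : internal_bom_(internal_bom), external_bom_(external_bom)
    {
        copy_name(internal_, internal);
        copy_name(external_, external);
        // iconv_open takes (tocode, fromcode); one descriptor per direction.
        in_desc_  = iconv_open(internal_, external_);   // external -> internal
        out_desc_ = iconv_open(external_, internal_);   // internal -> external
    }

    ~encoding_info_sketch()
    {
        if (in_desc_ != invalid())  iconv_close(in_desc_);
        if (out_desc_ != invalid()) iconv_close(out_desc_);
    }

    // True only if both conversion descriptors were opened successfully.
    bool good() const
    { return in_desc_ != invalid() && out_desc_ != invalid(); }

    const char* internal_name() const { return internal_; }
    const char* external_name() const { return external_; }

private:
    // Copying is disabled in this sketch: descriptors own library resources.
    encoding_info_sketch(const encoding_info_sketch&);
    encoding_info_sketch& operator=(const encoding_info_sketch&);

    static iconv_t invalid() { return (iconv_t)-1; }

    static void copy_name(char (&dst)[max_name], const char* src)
    {
        std::strncpy(dst, src, max_name - 1);   // over-long names are truncated
        dst[max_name - 1] = '\0';
    }

    char    internal_[max_name];   // identifier of the internal codeset
    char    external_[max_name];   // identifier of the external codeset
    int     internal_bom_;         // byte-order marker; 0 means none requested
    int     external_bom_;
    iconv_t in_desc_;              // descriptor for external -> internal
    iconv_t out_desc_;             // descriptor for internal -> external
};
```

Constructing the object with names that iconv_open does not recognize
leaves good() returning false, which covers the two "is it valid" items.
Type checking on the internal and external character types is not
addressed by this sketch.
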
  Other Issues

In addition, multi-threaded and multi-locale environments also impact
the design and requirements for code conversions. In particular, they
affect the required specialization codecvt<wchar_t, char, mbstate_t>
when implemented using standard "C" functions.

Three problems arise, one big, one of medium importance, and one small.

First, the small: mbsrtowcs and wcsrtombs may not be multithread-safe
on all systems required by the GNU tools. For GNU/Linux and glibc,
this is not an issue.

Of medium concern, in the grand scope of things, is that the functions
used to implement this specialization work on null-terminated
strings. Buffers, especially file buffers, may not be null-terminated,
thus giving conversions that end prematurely or are otherwise
incorrect. Yikes!

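The null-termination problem is easy to reproduce. In the sketch below
(the function name is hypothetical), a buffer that happens to contain an
embedded null byte is handed to mbsrtowcs, which stops at that byte even
though more data follows.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical helper: convert a buffer with mbsrtowcs and report how
// many wide characters were produced, not counting the terminator.
std::size_t widen_count_sketch(const char* buffer, wchar_t* out,
                               std::size_t out_len)
{
    std::mbstate_t state = std::mbstate_t();
    const char* p = buffer;
    // mbsrtowcs has no "end of input" parameter: it reads until it finds
    // a null byte or fills the output, whichever comes first.
    return std::mbsrtowcs(out, &p, out_len, &state);
}
```

Given the six bytes 'a', 'b', 0, 'c', 'd', 0, only the first two
characters are converted. A buffer with no null byte at all is worse,
since the function would read past its end.
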
The last, and fundamental problem, is the assumption of a global
locale for all the "C" functions referenced above. For something like
C++ iostreams (where codecvt is explicitly used) the notion of
multiple locales is fundamental. In practice, most users may not run
into this limitation. However, as a quality of implementation issue,
the GNU C++ library would like to offer a solution that allows
multiple locales and/or simultaneous usage with computationally
correct results. In short, libstdc++ is trying to offer, as an
option, a high-quality implementation, damn the additional complexity!

For the required specialization codecvt<wchar_t, char, mbstate_t>,
conversions are made between the internal character set (always UCS4
on GNU/Linux) and whatever the currently selected locale for the
LC_CTYPE category implements.

Implementation

The two required specializations are implemented as follows:

codecvt<char, char, mbstate_t>

This is a degenerate (i.e., does nothing) specialization. Implementing
this was a piece of cake.

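The degenerate behaviour is visible through the standard facet
interface. The sketch below uses only standard library calls; the
function name narrow_identity_sketch is hypothetical.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical helper: run the char/char facet of the classic locale over
// a buffer and return the result code it reports.
std::codecvt_base::result
narrow_identity_sketch(const char* from, const char* from_end,
                       char* to, char* to_end)
{
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    const facet_type& cvt =
        std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    char*       to_next = 0;
    // For this specialization the standard specifies a result of noconv:
    // the caller is expected to use the input as-is.
    return cvt.out(state, from, from_end, from_next, to, to_end, to_next);
}
```

A caller can also ask the facet up front with always_noconv(), which
returns true for this specialization.
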
codecvt<wchar_t, char, mbstate_t>

This specialization, by specifying all the template parameters, pretty
much ties the hands of implementors. As such, the implementation is
straightforward, involving mbsrtowcs for the conversions from char
to wchar_t and wcsrtombs for conversions from wchar_t to char.

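From the user's side, this specialization is reached through use_facet.
The following sketch widens a narrow string with the classic locale's
facet. The function name widen_with_facet_sketch is hypothetical, and the
output buffer sizing assumes that each input byte yields at most one wide
character.

```cpp
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: widen `in` using codecvt<wchar_t, char, mbstate_t>
// from the classic ("C") locale.
bool widen_with_facet_sketch(const std::string& in, std::wstring& out)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt =
        std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    std::vector<wchar_t> buf(in.size() + 1);    // never empty
    const char* from_next = 0;
    wchar_t*    to_next = 0;

    // in() converts external (char) data to internal (wchar_t) data.
    std::codecvt_base::result r =
        cvt.in(state, in.data(), in.data() + in.size(), from_next,
               &buf[0], &buf[0] + in.size(), to_next);
    if (r != std::codecvt_base::ok)
        return false;

    out.assign(&buf[0], to_next);
    return true;
}
```

Unlike the C functions, in() takes explicit begin and end pointers, so
the caller does not need to null-terminate the input.
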
Neither of these two required specializations deals with Unicode
characters. As such, libstdc++ implements a partial specialization
of the codecvt class with an iconv wrapper class, encoding_state, as the
third template parameter.

This implementation should be standards conformant. First of all, the
standard explicitly points out that instantiations on the third
template parameter, stateT, are the proper way to implement
non-required conversions. Second of all, the standard says (in Chapter
17) that partial specializations of required classes are a-ok. Third
of all, the requirements for the stateT type elsewhere in the standard
(see 21.1.2 traits typedefs) only indicate that this type be copy
constructible.

As such, the type encoding_state is defined as a non-templatized, POD
type to be used as the third type of a codecvt instantiation. This
type is just a wrapper class for iconv, and provides an easy interface
to iconv functionality.

There are two constructors for encoding_state:

encoding_state() : __in_desc(0), __out_desc(0)

This default constructor sets the internal encoding to some default
(currently UCS4) and the external encoding to whatever is returned by
nl_langinfo(CODESET).

encoding_state(const char* __int, const char* __ext)

This constructor takes as parameters string literals that indicate the
desired internal and external encoding. There are no defaults for
either argument.

One of the issues with iconv is that the string literals identifying
conversions are not standardized. Because of this, the thought of
mandating and/or enforcing some set of pre-determined valid
identifiers seems iffy: thus, a more practical (and non-migraine
inducing) strategy was implemented: end-users can specify any string
(subject to a pre-determined length qualifier, currently 32 bytes) for
encodings. It is up to the user to make sure that these strings are
valid on the target system.

void _M_init()

Strangely enough, this member function attempts to open conversion
descriptors for a given encoding_state object. If the encoding names
are not valid, the conversion descriptors returned will
not be valid and the resulting calls to the codecvt conversion
functions will return error.

bool _M_good()

Provides a way to see if the given encoding_state object has been
properly initialized. If the string literals describing the desired
internal and external encoding are not valid, initialization will
fail, and this will return false. If the internal and external
encodings are valid, but iconv_open could not allocate conversion
descriptors, this will also return false. Otherwise, the object is
ready to convert and will return true.

encoding_state(const encoding_state&)

As iconv allocates memory and sets up conversion descriptors, the copy
constructor can only copy the member data pertaining to the internal
and external code conversions, and not the conversion descriptors
themselves.

Definitions for all the required codecvt member functions are provided
for this specialization, and usage of codecvt<internal character type,
external character type, encoding_state> is consistent with other
codecvt usage.

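For readers unfamiliar with the iconv interface that encoding_state
wraps, the following sketch performs one conversion with the raw POSIX
calls. It does not use encoding_state or codecvt. The function name
iconv_convert_sketch is hypothetical, the output buffer sizing (four
bytes per input byte, plus slack) is an assumption that suits common
encodings, and the char** input parameter matches the glibc prototype of
iconv. As noted above, the encoding names must be valid on the target
system.

```cpp
#include <iconv.h>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: convert `in` from `from_code` to `to_code` with iconv.
bool iconv_convert_sketch(const char* to_code, const char* from_code,
                          const std::string& in, std::string& out)
{
    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)-1)
        return false;                   // encoding name not recognized

    // iconv wants mutable pointers, so work on a copy of the input.
    std::vector<char> src(in.begin(), in.end());
    src.push_back('\0');                // keeps &src[0] valid for empty input
    std::vector<char> dst(in.size() * 4 + 4);

    char*       src_p = &src[0];
    char*       dst_p = &dst[0];
    std::size_t src_left = in.size();
    std::size_t dst_left = dst.size();

    std::size_t rc = iconv(cd, &src_p, &src_left, &dst_p, &dst_left);
    if (rc != (std::size_t)-1)
        // A null input flushes any pending shift sequence (stateful encodings).
        rc = iconv(cd, 0, 0, &dst_p, &dst_left);
    iconv_close(cd);

    if (rc == (std::size_t)-1)
        return false;                   // invalid input or output too small
    out.assign(&dst[0], dst_p);
    return true;
}
```

Converting the ISO-8859-1 bytes "caf\xE9" to UTF-8 this way should yield
"caf\xC3\xA9" on systems whose iconv knows both names.
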
Use

A conversion involving a string literal.

  // Fragment from a test case. VERIFY, initialize_state and int_traits
  // are not defined here and are assumed to be provided elsewhere.
  typedef codecvt_base::result                    result;
  typedef unsigned short                          unicode_t;
  typedef unicode_t                               int_type;
  typedef char                                    ext_type;
  typedef encoding_state                          state_type;
  typedef codecvt<int_type, ext_type, state_type> unicode_codecvt;

  const ext_type*       e_lit = "black pearl jasmine tea";
  int                   size = strlen(e_lit);
  // the same text as 16-bit code units
  int_type              i_lit_base[24] =
  { 25088, 27648, 24832, 25344, 27392, 8192, 28672, 25856, 24832, 29184,
    27648, 8192, 27136, 24832, 29440, 27904, 26880, 28160, 25856, 8192, 29696,
    25856, 24832, 2560
  };
  const int_type*       i_lit = i_lit_base;
  const ext_type*       efrom_next;
  const int_type*       ifrom_next;
  ext_type*             e_arr = new ext_type[size + 1];
  ext_type*             eto_next;
  int_type*             i_arr = new int_type[size + 1];
  int_type*             ito_next;

  // construct a locale object with the specialized facet.
  locale                loc(locale::classic(), new unicode_codecvt);
  // sanity check the constructed locale has the specialized facet.
  VERIFY( has_facet<unicode_codecvt>(loc) );
  const unicode_codecvt& cvt = use_facet<unicode_codecvt>(loc);
  // convert between const char* and unicode strings
  unicode_codecvt::state_type state01("UNICODE", "ISO_8859-1");
  initialize_state(state01);
  // in() converts external (char) data to internal (unicode_t) data
  result r1 = cvt.in(state01, e_lit, e_lit + size, efrom_next,
                     i_arr, i_arr + size, ito_next);
  VERIFY( r1 == codecvt_base::ok );
  VERIFY( !int_traits::compare(i_arr, i_lit, size) );
  VERIFY( efrom_next == e_lit + size );
  VERIFY( ito_next == i_arr + size );

Future

a. things that are sketchy, or remain unimplemented:
   do_encoding, max_length and length member functions
   are only weakly implemented. I have no idea how to do
   this correctly, and in a generic manner.  Nathan?

b. conversions involving std::string

   * how should operators != and == work for strings of
     different/same encoding?

   * what is equal? A byte by byte comparison or an
     encoding then byte comparison?

   * conversions between narrow, wide, and unicode strings

c. conversions involving std::filebuf and std::ostream

   * how to initialize the state object in a
     standards-conformant manner?

   * how to synchronize the "C" and "C++"
     conversion information?

   * wchar_t/char internal buffers and conversions between
     internal/external buffers?

Bibliography

The GNU C Library. Roland McGrath and Ulrich Drepper. 2007. FSF.
Chapters 6 Character Set Handling and 7 Locales and Internationalization.

Correspondence. Ulrich Drepper. 2002.

ISO/IEC 14882:1998 Programming languages - C++. 1998. ISO.

ISO/IEC 9899:1999 Programming languages - C. 1999. ISO.

System Interface Definitions, Issue 7 (IEEE Std. 1003.1-2008). 2008.
The Open Group/The Institute of Electrical and Electronics
Engineers, Inc.

The C++ Programming Language, Special Edition. Bjarne Stroustrup. 2000.
Addison Wesley, Inc. Appendix D.

Standard C++ IOStreams and Locales: Advanced Programmer's Guide and
Reference. Angelika Langer and Klaus Kreft. 2000.
Addison Wesley Longman, Inc.

A brief description of Normative Addendum 1. Clive Feather.
Extended Character Sets.

The Unicode HOWTO. Bruno Haible.

UTF-8 and Unicode FAQ for Unix/Linux. Markus Kuhn.