<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="AUTHOR" content="bkoz@redhat.com (Benjamin Kosnik)" />
<meta name="KEYWORDS" content="HOWTO, libstdc++, GCC, g++, libg++, STL" />
<meta name="DESCRIPTION" content="Notes on the codecvt implementation." />
<title>Notes on the codecvt implementation.</title>
<link rel="StyleSheet" href="../lib3styles.css" type="text/css" />
<link rel="Start" href="../documentation.html" type="text/html"
title="GNU C++ Standard Library" />
<link rel="Bookmark" href="howto.html" type="text/html" title="Localization" />
<link rel="Copyright" href="../17_intro/license.html" type="text/html" />
<link rel="Help" href="../faq/index.html" type="text/html" title="F.A.Q." />
</head>
<body>
<h1>
Notes on the codecvt implementation.
</h1>
<p>
<em>
prepared by Benjamin Kosnik (bkoz@redhat.com) on August 28, 2000
</em>
</p>

<h2>
1. Abstract
</h2>
<p>
The standard class codecvt attempts to address conversions between
different character encoding schemes. In particular, the standard
attempts to detail conversions between the implementation-defined wide
characters (hereafter referred to as wchar_t) and the standard type
char that is so beloved in classic "C" (which can now be referred to
as narrow characters.) This document attempts to describe how the GNU
libstdc++-v3 implementation deals with the conversion between wide and
narrow characters, and also presents a framework for dealing with the
huge number of other encodings that iconv can convert, including
Unicode and UTF8. Design issues and requirements are addressed, and
examples of correct usage for both the required specializations for
wide and narrow characters and the implementation-provided extended
functionality are given.
</p>

<h2>
2. What the standard says
</h2>
Around page 425 of the C++ Standard, this charming heading comes into view:

<blockquote>
22.2.1.5 - Template class codecvt [lib.locale.codecvt]
</blockquote>

The text around the codecvt definition gives some clues:

<blockquote>
<em>
-1- The class codecvt&lt;internT,externT,stateT&gt; is for use when
converting from one codeset to another, such as from wide characters
to multibyte characters, between wide character encodings such as
Unicode and EUC.
</em>
</blockquote>

<p>
Hmm. So, in some unspecified way, Unicode encodings and
translations between other character sets should be handled by this
class.
</p>

<blockquote>
<em>
-2- The stateT argument selects the pair of codesets being mapped between.
</em>
</blockquote>

<p>
Ah ha! Another clue...
</p>

<blockquote>
<em>
-3- The instantiations required in the Table ??
(lib.locale.category), namely codecvt&lt;wchar_t,char,mbstate_t&gt; and
codecvt&lt;char,char,mbstate_t&gt;, convert the implementation-defined
native character set. codecvt&lt;char,char,mbstate_t&gt; implements a
degenerate conversion; it does not convert at
all. codecvt&lt;wchar_t,char,mbstate_t&gt; converts between the native
character sets for tiny and wide characters. Instantiations on
mbstate_t perform conversion between encodings known to the library
implementor. Other encodings can be converted by specializing on a
user-defined stateT type. The stateT object can contain any state that
is useful to communicate to or from the specialized do_convert member.
</em>
</blockquote>

<p>
At this point, a couple of points become clear:
</p>

<p>
One: The standard clearly implies that attempts to add non-required
(yet useful and widely used) conversions need to do so through the
third template parameter, stateT.</p>

<p>
Two: The required conversions, by specifying mbstate_t as the third
template parameter, imply an implementation strategy that is mostly
(or wholly) based on the underlying C library, and the functions
mbsrtowcs and wcsrtombs in particular.</p>
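<p>
To make the "C" library strategy concrete, here is a minimal sketch of
widening a string with mbsrtowcs and an mbstate_t. The function name
widen_with_c_library is made up for this note, and the code is an
illustration of the idea rather than the libstdc++-v3 source. It assumes
a null-terminated input and uses whatever LC_CTYPE the global "C" locale
currently has.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: widen a null-terminated multibyte string using the
// C library. The result depends on LC_CTYPE of the global C locale.
// An empty result means either empty input or an invalid sequence.
std::wstring widen_with_c_library(const char* narrow)
{
  assert(narrow != 0);
  std::mbstate_t state = std::mbstate_t();   // initial conversion state

  // First pass: a null destination only asks for the required length.
  const char* src = narrow;
  std::size_t len = std::mbsrtowcs(0, &src, 0, &state);
  if (len == static_cast<std::size_t>(-1))
    return std::wstring();                   // invalid multibyte sequence

  // Second pass: convert for real, leaving room for the terminator.
  std::wstring wide(len + 1, L'\0');
  src = narrow;
  state = std::mbstate_t();
  std::mbsrtowcs(&wide[0], &src, wide.size(), &state);
  wide.resize(len);
  return wide;
}
```

<p>
Note that the state object is the only thing passed in besides the
buffers: nothing in the call names the codesets involved, which is why
mbstate_t alone cannot select an arbitrary pair of encodings.
</p>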

<h2>
3. Some thoughts on what would be useful
</h2>
Probably the most frequently asked question about code conversion is:
"So dudes, what's the deal with Unicode strings?" The dude part is
optional, but apparently the usefulness of Unicode strings is pretty
widely appreciated. Sadly, this specific encoding (and other useful
encodings like UTF8, UCS4, ISO 8859-10, etc etc etc) is not mentioned
in the C++ standard.

<p>
In particular, the simple implementation detail of wchar_t's size
seems to repeatedly confound people. Many systems use a two byte,
unsigned integral type to represent wide characters, and use an
internal encoding of Unicode or UCS2. (See AIX, Microsoft NT, Java,
others.) Other systems use a four byte, unsigned integral type to
represent wide characters, and use an internal encoding of
UCS4. (GNU/Linux systems using glibc, in particular.) The C
programming language (and thus C++) does not mandate a specific size
for the type wchar_t.
</p>

<p>
Thus, portable C++ code cannot assume a byte size (or endianness) either.
</p>
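<p>
A small sketch of what "cannot assume a byte size" means in practice:
code that cares has to ask the compiler instead of hard-coding two or
four bytes. The helper names wide_char_bits and fits_in_wchar_t are
invented for this illustration.
</p>

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <limits>

// Number of bits in wchar_t on the platform this is compiled for.
std::size_t wide_char_bits()
{
  return sizeof(wchar_t) * CHAR_BIT;
}

// Whether a UCS-4 code point fits in one wchar_t without truncation.
bool fits_in_wchar_t(unsigned long code_point)
{
  assert(code_point <= 0x7FFFFFFFul);        // UCS-4 uses at most 31 bits
  const unsigned long widest =
    static_cast<unsigned long>(std::numeric_limits<wchar_t>::max());
  return code_point <= widest;
}
```

<p>
On a system with a two byte wchar_t, fits_in_wchar_t(0x10000) would be
false, while on a four byte system it would be true; the same source
gives different answers, which is the whole point.
</p>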

<p>
Getting back to the frequently asked question: What about Unicode strings?
</p>

<p>
What magic spell will do this conversion?
</p>

<p>
A couple of comments:
</p>

<p>
The thought that all one needs to convert between two arbitrary
codesets is two types and some kind of state argument is
unfortunate. In particular, encodings may be stateless. The naming of
the third parameter as stateT is unfortunate, as what is really needed
is some kind of generalized type that accounts for the issues that
abstract encodings will need. The minimum information that is required
includes:
</p>

<ul>
<li>
<p>
Identifiers for each of the codesets involved in the conversion. For
example, using the iconv family of functions from the Single Unix
Specification (what used to be called X/Open) hosted on the GNU/Linux
operating system allows bi-directional mapping between far more than
the following tantalizing possibilities:
</p>

<p>
(An edited list taken from <code>`iconv --list`</code> on a Red Hat 6.2/Intel system:
</p>

<blockquote>
<pre>
8859_1, 8859_9, 10646-1:1993, 10646-1:1993/UCS4, ARABIC, ARABIC7,
ASCII, EUC-CN, EUC-JP, EUC-KR, EUC-TW, GREEK-CCITT, GREEK, GREEK7-OLD,
GREEK7, GREEK8, HEBREW, ISO-8859-1, ISO-8859-2, ISO-8859-3,
ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8,
ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14,
ISO-8859-15, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4,
ISO-10646/UTF-8, ISO-10646/UTF8, SHIFT-JIS, SHIFT_JIS, UCS-2, UCS-4,
UCS2, UCS4, UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-8,
UTF-16, UTF8, UTF16).
</pre>
</blockquote>

<p>
For iconv-based implementations, string literals for each of the
encodings (e.g. "UCS-2" and "UTF-8") are necessary,
although for other,
non-iconv implementations a table of enumerated values or some other
mechanism may be required.
</p>
</li>

<li>
Maximum length of the identifying string literal.
</li>

<li>
Some encodings require explicit endianness. As such, some kind
of endian marker or other byte-order marker will be necessary. See
"Footnotes for C/C++ developers" in Haible for more information on
UCS-2/Unicode endian issues. (Summary: big endian seems most likely,
however implementations, most notably Microsoft, vary.)
</li>

<li>
Types representing the conversion state, for conversions involving
the machinery in the "C" library, or the conversion descriptor, for
conversions using iconv (such as the type iconv_t.) Note that the
conversion descriptor encodes more information than a simple encoding
state type.
</li>

<li>
Conversion descriptors for both directions of encoding. (ie, both
UCS-2 to UTF-8 and UTF-8 to UCS-2.)
</li>

<li>
Something to indicate if the conversion requested is valid.
</li>

<li>
Something to represent if the conversion descriptors are valid.
</li>

<li>
Some way to enforce strict type checking on the internal and
external types. As part of this, the size of the internal and
external types will need to be known.
</li>
</ul>
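<p>
The list above can be pictured as a plain record. The following is only
a sketch under stated assumptions: the names encoding_info,
byte_order_hint and make_encoding_info are hypothetical, the 32 byte
limit is simply a plausible value for the maximum identifier length, and
the descriptors are left unopened so that nothing here depends on which
encodings the local iconv knows about.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iconv.h>   // POSIX; declares iconv_t

// Hypothetical names throughout; this is not the libstdc++ type.
enum byte_order_hint { order_unspecified, order_big, order_little };

struct encoding_info
{
  enum { max_name = 32 };              // identifier limit, terminator included

  char            int_name[max_name];  // internal codeset, e.g. "UCS-2"
  char            ext_name[max_name];  // external codeset, e.g. "UTF-8"
  byte_order_hint order;               // endian marker for encodings needing one
  iconv_t         in_desc;             // descriptor for external -> internal
  iconv_t         out_desc;            // descriptor for internal -> external
  bool            names_ok;            // was the request well-formed?
};

// Record the identifiers; descriptors stay unopened in this sketch.
encoding_info make_encoding_info(const char* int_name, const char* ext_name)
{
  assert(int_name != 0 && ext_name != 0);
  encoding_info info;
  std::memset(info.int_name, 0, sizeof info.int_name);
  std::memset(info.ext_name, 0, sizeof info.ext_name);
  info.order = order_unspecified;
  info.in_desc = (iconv_t)(-1);        // the value iconv_open returns on failure
  info.out_desc = (iconv_t)(-1);

  const std::size_t limit = encoding_info::max_name;
  info.names_ok = std::strlen(int_name) < limit
               && std::strlen(ext_name) < limit;
  if (info.names_ok)
    {
      std::strcpy(info.int_name, int_name);
      std::strcpy(info.ext_name, ext_name);
    }
  return info;
}
```

<p>
Two separate descriptors are kept because an iconv descriptor converts
in one direction only, and the two validity questions (is the request
sensible, are the descriptors open) are kept apart for the same reason
the list separates them.
</p>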

<h2>
4. Problems with "C" code conversions: thread safety, global
locales, termination.
</h2>

In addition, multi-threaded and multi-locale environments also impact
the design and requirements for code conversions. In particular, they
affect the required specialization codecvt&lt;wchar_t, char, mbstate_t&gt;
when implemented using standard "C" functions.

<p>
Three problems arise, one big, one of medium importance, and one small.
</p>

<p>
First, the small: mbsrtowcs and wcsrtombs may not be multithread-safe
on all systems required by the GNU tools. For GNU/Linux and glibc,
this is not an issue.
</p>

<p>
Of medium concern, in the grand scope of things, is that the functions
used to implement this specialization work on null-terminated
strings. Buffers, especially file buffers, may not be null-terminated,
thus giving conversions that end prematurely or are otherwise
incorrect. Yikes!
</p>
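<p>
One way around the termination problem, sketched here under the
assumption that the restartable single-character function mbrtowc is
available, is to walk the buffer with an explicit byte count instead of
hunting for a terminator. The name widen_buffer is hypothetical, and
this is not necessarily how libstdc++-v3 handles it.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Convert exactly [first, first + count) without relying on a terminator.
// Returns the number of bytes consumed; stops early on an invalid or
// incomplete multibyte sequence.
std::size_t widen_buffer(const char* first, std::size_t count, std::wstring& out)
{
  assert(first != 0 || count == 0);
  std::mbstate_t state = std::mbstate_t();
  std::size_t used = 0;
  while (used < count)
    {
      wchar_t wc;
      std::size_t n = std::mbrtowc(&wc, first + used, count - used, &state);
      if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
        break;               // invalid (-1) or incomplete (-2) sequence
      if (n == 0)
        n = 1;               // an embedded null byte was converted; step past it
      out.push_back(wc);
      used += n;
    }
  return used;
}
```

<p>
The return value tells the caller how much of a file buffer was
actually consumed, so a trailing partial character can be carried over
to the next read rather than silently dropped.
</p>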

<p>
The last, and fundamental, problem is the assumption of a global
locale for all the "C" functions referenced above. For something like
C++ iostreams (where codecvt is explicitly used) the notion of
multiple locales is fundamental. In practice, most users may not run
into this limitation. However, as a quality of implementation issue,
the GNU C++ library would like to offer a solution that allows
multiple locales and/or simultaneous usage with computationally
correct results. In short, libstdc++-v3 is trying to offer, as an
option, a high-quality implementation, damn the additional complexity!
</p>

<p>
For the required specialization codecvt&lt;wchar_t, char, mbstate_t&gt;,
conversions are made between the internal character set (always UCS4
on GNU/Linux) and whatever the currently selected locale for the
LC_CTYPE category implements.
</p>

<h2>
5. Design
</h2>
The two required specializations are implemented as follows:

<p>
<code>
codecvt&lt;char, char, mbstate_t&gt;
</code>
</p>
<p>
This is a degenerate (ie, does nothing) specialization. Implementing
this was a piece of cake.
</p>
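<p>
"Does nothing" is observable through the standard interface. The sketch
below uses only standard facet calls; the two helper names are invented
for the illustration.
</p>

```cpp
#include <cassert>
#include <cwchar>
#include <locale>

// True when the standard char-to-char facet reports that it never converts.
bool char_codecvt_is_noop()
{
  typedef std::codecvt<char, char, std::mbstate_t> facet_type;
  const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());
  return cvt.always_noconv();
}

// The result code in() reports when asked to "convert" a char range.
std::codecvt_base::result char_codecvt_result(const char* first, const char* last)
{
  assert(first <= last);
  typedef std::codecvt<char, char, std::mbstate_t> facet_type;
  const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());
  std::mbstate_t state = std::mbstate_t();
  char to[16];
  const char* from_next;
  char* to_next;
  return cvt.in(state, first, last, from_next, to, to + 16, to_next);
}
```

<p>
A caller such as a file buffer can check always_noconv() once and then
skip the conversion step entirely for char streams.
</p>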

<p>
<code>
codecvt&lt;wchar_t, char, mbstate_t&gt;
</code>
</p>
<p>
This specialization, by specifying all the template parameters, pretty
much ties the hands of implementors. As such, the implementation is
straightforward, involving mbsrtowcs for the conversions from char
to wchar_t and wcsrtombs for conversions from wchar_t to char.
</p>
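<p>
From the user's side, this required specialization is reached through a
locale. Here is a minimal sketch that widens a std::string with the
facet from the classic locale; widen_with_facet is a made-up name, and
the one-wchar_t-per-char buffer size is an assumption that holds for
the plain ASCII input used here.
</p>

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <string>

// Widen a char range with the standard facet from the classic locale.
// Returns an empty string unless the whole range converts cleanly.
std::wstring widen_with_facet(const std::string& narrow)
{
  typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
  const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());

  std::wstring wide(narrow.size(), L'\0');   // at most one wchar_t per char
  if (narrow.empty())
    return wide;

  std::mbstate_t state = std::mbstate_t();
  const char* from = narrow.data();
  const char* from_next = from;
  wchar_t* to = &wide[0];
  wchar_t* to_next = to;

  std::codecvt_base::result r =
    cvt.in(state, from, from + narrow.size(), from_next,
           to, to + wide.size(), to_next);
  if (r != std::codecvt_base::ok)
    return std::wstring();

  assert(from_next == from + narrow.size());  // ok means all input was used
  wide.resize(to_next - to);
  return wide;
}
```

<p>
Unlike the "C" functions, in() takes explicit begin and end pointers for
both buffers, so no terminator is needed at this level.
</p>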

<p>
Neither of these two required specializations deals with Unicode
characters. As such, libstdc++-v3 implements a partial specialization
of the codecvt class with an iconv wrapper class, __enc_traits, as the
third template parameter.
</p>

<p>
This implementation should be standards conformant. First of all, the
standard explicitly points out that instantiations on the third
template parameter, stateT, are the proper way to implement
non-required conversions. Second of all, the standard says (in Chapter
17) that partial specializations of required classes are a-ok. Third
of all, the requirements for the stateT type elsewhere in the standard
(see 21.1.2 traits typedefs) only indicate that this type be copy
constructible.
</p>

<p>
As such, the type __enc_traits is defined as a non-templatized, POD
type to be used as the third type of a codecvt instantiation. This
type is just a wrapper class for iconv, and provides an easy interface
to iconv functionality.
</p>

<p>
There are two constructors for __enc_traits:
</p>

<p>
<code>
__enc_traits() : __in_desc(0), __out_desc(0)
</code>
</p>
<p>
This default constructor sets the internal encoding to some default
(currently UCS4) and the external encoding to whatever is returned by
nl_langinfo(CODESET).
</p>

<p>
<code>
__enc_traits(const char* __int, const char* __ext)
</code>
</p>
<p>
This constructor takes as parameters string literals that indicate the
desired internal and external encoding. There are no defaults for
either argument.
</p>

<p>
One of the issues with iconv is that the string literals identifying
conversions are not standardized. Because of this, the thought of
mandating and/or enforcing some set of pre-determined valid
identifiers seems iffy. Thus, a more practical (and non-migraine
inducing) strategy was implemented: end-users can specify any string
(subject to a pre-determined length qualifier, currently 32 bytes) for
encodings. It is up to the user to make sure that these strings are
valid on the target system.
</p>

<p>
<code>
void
_M_init()
</code>
</p>
<p>
Strangely enough, this member function attempts to open conversion
descriptors for a given __enc_traits object. If the encoding names
are not valid, the conversion descriptors returned will
not be valid and the resulting calls to the codecvt conversion
functions will return an error.
</p>

<p>
<code>
bool
_M_good()
</code>
</p>
<p>
Provides a way to see if the given __enc_traits object has been
properly initialized. If the string literals describing the desired
internal and external encoding are not valid, initialization will
fail, and this will return false. If the internal and external
encodings are valid, but iconv_open could not allocate conversion
descriptors, this will also return false. Otherwise, the object is
ready to convert and will return true.
</p>

<p>
<code>
__enc_traits(const __enc_traits&amp;)
</code>
</p>
<p>
As iconv allocates memory and sets up conversion descriptors, the copy
constructor can only copy the member data pertaining to the internal
and external code conversions, and not the conversion descriptors
themselves.
</p>
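<p>
To tie the pieces together, here is a hedged sketch of a wrapper with
the same general shape: two names, two descriptors, an init step, a
validity check, and a copy constructor that copies names only. The
class iconv_pair and all of its members are hypothetical and are not
the __enc_traits source; only iconv_open and iconv_close are real
(POSIX) calls.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iconv.h>   // POSIX iconv_open / iconv_close

// Hypothetical wrapper, loosely modelled on the description above.
class iconv_pair
{
public:
  enum { max_name = 32 };              // identifier limit, terminator included

  iconv_pair(const char* int_enc, const char* ext_enc)
  : in_desc_(closed()), out_desc_(closed())
  {
    assert(int_enc != 0 && ext_enc != 0);
    copy_name(int_name_, int_enc);
    copy_name(ext_name_, ext_enc);
  }

  // Copy only the encoding names; the copy must open its own descriptors.
  iconv_pair(const iconv_pair& other)
  : in_desc_(closed()), out_desc_(closed())
  {
    copy_name(int_name_, other.int_name_);
    copy_name(ext_name_, other.ext_name_);
  }

  ~iconv_pair() { close(); }

  // Open one descriptor per direction.
  void init()
  {
    close();
    if (int_name_[0] == '\0' || ext_name_[0] == '\0')
      return;
    in_desc_ = iconv_open(int_name_, ext_name_);    // external -> internal
    out_desc_ = iconv_open(ext_name_, int_name_);   // internal -> external
  }

  bool good() const
  { return in_desc_ != closed() && out_desc_ != closed(); }

private:
  iconv_pair& operator=(const iconv_pair&);   // left out of the sketch

  static iconv_t closed() { return (iconv_t)(-1); }

  // Over-long names are stored as empty, so init() leaves the object not good.
  static void copy_name(char* dst, const char* src)
  {
    if (std::strlen(src) < static_cast<std::size_t>(max_name))
      std::strcpy(dst, src);
    else
      dst[0] = '\0';
  }

  void close()
  {
    if (in_desc_ != closed()) iconv_close(in_desc_);
    if (out_desc_ != closed()) iconv_close(out_desc_);
    in_desc_ = out_desc_ = closed();
  }

  char    int_name_[max_name];
  char    ext_name_[max_name];
  iconv_t in_desc_;
  iconv_t out_desc_;
};
```

<p>
Sharing descriptors between copies would mean two objects closing the
same handle, so copying the names and reopening is the simpler contract.
Which identifiers init() accepts depends entirely on the local iconv.
</p>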

<p>
Definitions for all the required codecvt member functions are provided
for this specialization, and usage of codecvt&lt;internal character type,
external character type, __enc_traits&gt; is consistent with other
codecvt usage.
</p>

<h2>
6. Examples
</h2>

<ul>
<li>
a. conversions involving string literals

<pre>
typedef codecvt_base::result result;
typedef unsigned short unicode_t;
typedef unicode_t int_type;
typedef char ext_type;
typedef __enc_traits enc_type;
typedef codecvt&lt;int_type, ext_type, enc_type&gt; unicode_codecvt;
// assumed: int_traits is not defined in this excerpt; char_traits fits its use below
typedef char_traits&lt;int_type&gt; int_traits;

const ext_type* e_lit = "black pearl jasmine tea";
int size = strlen(e_lit);
// expected 16-bit units; each holds the character code in its high byte
// (for example 25088 == 0x6200, and 'b' == 0x62)
int_type i_lit_base[24] =
{ 25088, 27648, 24832, 25344, 27392, 8192, 28672, 25856, 24832, 29184,
27648, 8192, 27136, 24832, 29440, 27904, 26880, 28160, 25856, 8192, 29696,
25856, 24832, 2560
};
const int_type* i_lit = i_lit_base;
const ext_type* efrom_next;
const int_type* ifrom_next;
ext_type* e_arr = new ext_type[size + 1];
ext_type* eto_next;
int_type* i_arr = new int_type[size + 1];
int_type* ito_next;

// construct a locale object with the specialized facet.
locale loc(locale::classic(), new unicode_codecvt);
// sanity check the constructed locale has the specialized facet.
// (VERIFY is the testsuite's assertion macro.)
VERIFY( has_facet&lt;unicode_codecvt&gt;(loc) );
const unicode_codecvt&amp; cvt = use_facet&lt;unicode_codecvt&gt;(loc);
// convert between const char* and unicode strings
unicode_codecvt::state_type state01("UNICODE", "ISO_8859-1");
// initialize_state is a testsuite helper not shown here; presumably it
// opens the conversion descriptors via _M_init().
initialize_state(state01);
result r1 = cvt.in(state01, e_lit, e_lit + size, efrom_next,
i_arr, i_arr + size, ito_next);
VERIFY( r1 == codecvt_base::ok );
VERIFY( !int_traits::compare(i_arr, i_lit, size) );
VERIFY( efrom_next == e_lit + size );
VERIFY( ito_next == i_arr + size );
</pre>
</li>
<li>
b. conversions involving std::string
</li>
<li>
c. conversions involving std::filebuf and std::ostream
</li>
</ul>

More information can be found in the following testcases:
<ul>
<li> testsuite/22_locale/codecvt_char_char.cc </li>
<li> testsuite/22_locale/codecvt_unicode_wchar_t.cc </li>
<li> testsuite/22_locale/codecvt_unicode_char.cc </li>
<li> testsuite/22_locale/codecvt_wchar_t_char.cc </li>
</ul>

<h2>
7. Unresolved Issues
</h2>
<ul>
<li>
a. things that are sketchy, or remain unimplemented:
do_encoding, max_length and length member functions
are only weakly implemented. I have no idea how to do
this correctly, and in a generic manner. Nathan?
</li>

<li>
b. conversions involving std::string

<ul>
<li>
how should operators != and == work for strings of
different/same encoding?
</li>

<li>
what is equal? A byte by byte comparison or an
encoding then byte comparison?
</li>

<li>
conversions between narrow, wide, and unicode strings
</li>
</ul>
</li>
<li>
c. conversions involving std::filebuf and std::ostream
<ul>
<li>
how to initialize the state object in a
standards-conformant manner?
</li>

<li>
how to synchronize the "C" and "C++"
conversion information?
</li>

<li>
wchar_t/char internal buffers and conversions between
internal/external buffers?
</li>
</ul>
</li>
</ul>

<h2>
8. Acknowledgments
</h2>
Ulrich Drepper for the iconv suggestions and patient answering of
late-night questions, Jason Merrill for the template partial
specialization hints, language clarification, and wchar_t fixes.

<h2>
9. Bibliography / Referenced Documents
</h2>

Drepper, Ulrich, GNU libc (glibc) 2.2 manual. In particular, Chapters "6. Character Set Handling" and "7 Locales and Internationalization"

<p>
Drepper, Ulrich, Numerous, late-night email correspondence
</p>

<p>
Feather, Clive, "A brief description of Normative Addendum 1," in particular the parts on Extended Character Sets
http://www.lysator.liu.se/c/na1.html
</p>

<p>
Haible, Bruno, "The Unicode HOWTO" v0.18, 4 August 2000
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
</p>

<p>
ISO/IEC 14882:1998 Programming languages - C++
</p>

<p>
ISO/IEC 9899:1999 Programming languages - C
</p>

<p>
Kuhn, Markus, "UTF-8 and Unicode FAQ for Unix/Linux"
http://www.cl.cam.ac.uk/~mgk25/unicode.html
</p>

<p>
Langer, Angelika and Klaus Kreft, Standard C++ IOStreams and Locales, Advanced Programmer's Guide and Reference, Addison Wesley Longman, Inc. 2000
</p>

<p>
Stroustrup, Bjarne, Appendix D, The C++ Programming Language, Special Edition, Addison Wesley, Inc. 2000
</p>

<p>
System Interface Definitions, Issue 6 (IEEE Std. 1003.1-200x)
The Open Group/The Institute of Electrical and Electronics Engineers, Inc.
http://www.opennc.org/austin/docreg.html
</p>

</body>
</html>