ISO C++
library

Strings

String Classes

Simple Transformations

Here are Standard, simple, and portable ways to perform common
transformations on a string instance, such as
"convert to all upper case." The word transformations
is especially apt, because the standard template function
transform<> is used.

This code will go through some iterations. Here's a simple
version:

   #include <string>
   #include <algorithm>
   #include <cctype>      // old <ctype.h>

   struct ToLower
   {
     char operator() (char c) const  { return std::tolower(c); }
   };

   struct ToUpper
   {
     char operator() (char c) const  { return std::toupper(c); }
   };

   int main()
   {
     std::string  s ("Some Kind Of Initial Input Goes Here");

     // Change everything into upper case
     std::transform (s.begin(), s.end(), s.begin(), ToUpper());

     // Change everything into lower case
     std::transform (s.begin(), s.end(), s.begin(), ToLower());

     // Change everything back into upper case, but store the
     // result in a different string
     std::string  capital_s;
     capital_s.resize(s.size());
     std::transform (s.begin(), s.end(), capital_s.begin(), ToUpper());
   }

Note that these calls all
involve the global C locale through the use of the C functions
toupper/tolower. This is absolutely guaranteed to work --
but only if the string contains only characters
from the basic source character set, and there are only
96 of those, which means that not even all English text can be
represented (certain British spellings, proper names, and so forth).
So, if all your input forevermore consists of only those 96
characters (hahahahahaha), then you're done.

Note that the
ToUpper and ToLower function objects
are needed because toupper and tolower
are overloaded names (declared in <cctype> and
<locale>), so the template arguments for
transform<> cannot be deduced, as explained in
this
message.

At minimum, you can write short wrappers like

   char toLower (char c)
   {
      return std::tolower(c);
   }

(Thanks to James Kanze for assistance and suggestions on all of this.)
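
One further caveat applies to wrappers of this kind: handing a negative char value to the C functions toupper/tolower is undefined behaviour (unless the value equals EOF), so a common precaution is to convert through unsigned char first. The following is only a minimal sketch under that assumption; the names upper_char, lower_char, to_upper_copy and to_lower_copy are hypothetical and not part of any library.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical wrappers: cast through unsigned char so that a negative
// char value is never passed to the C functions.
inline char upper_char(char c)
{
  return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

inline char lower_char(char c)
{
  return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Return a converted copy; the caller's string is left untouched
// because the parameter is taken by value.
inline std::string to_upper_copy(std::string s)
{
  std::transform(s.begin(), s.end(), s.begin(), upper_char);
  return s;
}

inline std::string to_lower_copy(std::string s)
{
  std::transform(s.begin(), s.end(), s.begin(), lower_char);
  return s;
}
```

Because upper_char and lower_char are ordinary, non-overloaded functions, they can be passed to transform directly. The conversion still goes through the global C locale, so the limits described above continue to apply.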

Another common operation is trimming off excess whitespace. Much
like transformations, this task is trivial with the use of string's
find family. These examples are broken into multiple
statements for readability:

   std::string  str (" \t blah blah blah \n ");

   // trim leading whitespace
   std::string::size_type  notwhite = str.find_first_not_of(" \t\n");
   str.erase(0,notwhite);

   // trim trailing whitespace
   notwhite = str.find_last_not_of(" \t\n");
   str.erase(notwhite+1);

Obviously, the calls to find could be inserted directly
into the calls to erase, in case your compiler does not
optimize named temporaries out of existence.
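
The same two find calls can be packaged as a small function that returns a trimmed copy instead of modifying the string in place. This is just a sketch; trim_copy and its default whitespace argument are hypothetical names chosen for illustration.

```cpp
#include <string>

// Hypothetical helper: returns a copy of `in` without leading or
// trailing characters drawn from `whitespace`.
inline std::string trim_copy(const std::string& in,
                             const char* whitespace = " \t\n")
{
  const std::string::size_type first = in.find_first_not_of(whitespace);
  if (first == std::string::npos)
    return std::string();        // empty input, or nothing but whitespace

  // A non-whitespace character exists, so this search cannot fail.
  const std::string::size_type last = in.find_last_not_of(whitespace);
  return in.substr(first, last - first + 1);
}
```

The explicit npos check makes the all-whitespace case visible, rather than relying on npos+1 wrapping around to zero.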

Case Sensitivity

The well-known-and-if-it-isn't-well-known-it-ought-to-be
Guru of the Week
discussions held on Usenet covered this topic in January of 1998.
Briefly, the challenge was to write a 'ci_string' class which
is identical to the standard 'string' class, but is
case-insensitive in the same way as the (common but nonstandard)
C function stricmp().

   ci_string s( "AbCdE" );

   // case insensitive
   assert( s == "abcde" );
   assert( s == "ABCDE" );

   // still case-preserving, of course
   assert( strcmp( s.c_str(), "AbCdE" ) == 0 );
   assert( strcmp( s.c_str(), "abcde" ) != 0 );

The solution is surprisingly easy. The original answer was
posted on Usenet, and a revised version appears in Herb Sutter's
book Exceptional C++ and on his website as GotW 29.
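
For readers without the book at hand, the general idea is to supply basic_string with a traits class whose comparisons fold case. What follows is a minimal sketch along those lines, not the published GotW text: the helper fold and the portable loops are assumptions made here for illustration, and everything goes through the global C locale via toupper.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Traits class that compares characters without regard to case.
// Everything not redefined here is inherited from char_traits<char>.
struct ci_char_traits : public std::char_traits<char>
{
  // Hypothetical helper: case-fold one character (C locale only).
  static int fold(char c)
  { return std::toupper(static_cast<unsigned char>(c)); }

  static bool eq(char a, char b) { return fold(a) == fold(b); }
  static bool lt(char a, char b) { return fold(a) <  fold(b); }

  static int compare(const char* a, const char* b, std::size_t n)
  {
    for (std::size_t i = 0; i < n; ++i)
    {
      if (lt(a[i], b[i])) return -1;
      if (lt(b[i], a[i])) return 1;
    }
    return 0;
  }

  static const char* find(const char* s, std::size_t n, char c)
  {
    for (std::size_t i = 0; i < n; ++i)
      if (eq(s[i], c)) return s + i;
    return 0;   // not found
  }
};

typedef std::basic_string<char, ci_char_traits> ci_string;
```

Note that ci_string is a distinct type from std::string, so the two do not convert into each other implicitly and the usual stream operators are not available for it without extra work.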

See? Told you it was easy!

Added June 2000: The May 2000 issue of C++
Report contains a fascinating
article (http://lafstern.org/matt/col2_new.pdf) by
Matt Austern (yes, the Matt Austern) on why
case-insensitive comparisons are not as easy as they seem, and
why creating a class is the wrong way to go
about it in production code. (The GotW answer mentions one of
the principal difficulties; his article mentions more.)

Basically, this is "easy" only if you ignore some things,
things which may be too important to your program to ignore. (I chose
to ignore them when originally writing this entry, and am surprised
that nobody ever called me on it...) The GotW question and answer
remain useful instructional tools, however.

Added September 2000: James Kanze provided a link to a
Unicode
Technical Report discussing case handling, which provides some
very good information.

Arbitrary Character Types

The std::basic_string is tantalizingly general, in that
it is parameterized on the type of the characters which it holds.
In theory, you could whip up a Unicode character class and instantiate
std::basic_string<my_unicode_char>, or assuming
that integers are wider than characters on your platform, maybe just
declare variables of type std::basic_string<int>.

That's the theory. Remember however that basic_string has additional
type parameters, which take default arguments based on the character
type (called CharT here):

   template <typename CharT,
             typename Traits = char_traits<CharT>,
             typename Alloc = allocator<CharT> >
   class basic_string { .... };

Now, allocator<CharT> will probably Do The Right
Thing by default, unless you need to implement your own allocator
for your characters.

But char_traits takes more work. The char_traits
template is declared but not defined.
That means there is only

   template <typename CharT>
     struct char_traits
     {
         static void foo (type1 x, type2 y);
         ...
     };

and functions such as char_traits<CharT>::foo() are not
actually defined anywhere for the general case. The C++ standard
permits this, because writing such a definition to fit all possible
CharT's cannot be done.

The C++ standard also requires that char_traits be specialized for
instantiations of char and wchar_t, and it
is these template specializations that permit entities like
basic_string<char,char_traits<char>> to work.

If you want to use character types other than char and wchar_t,
such as unsigned char and int, you will
need suitable specializations for them. For a time, in earlier
versions of GCC, there was a mostly-correct implementation that
let programmers be lazy, but it broke under many situations, so it
was removed. GCC 3.4 introduced a new implementation that mostly
works and can be specialized even for int and other
built-in types.

If you want to use your own special character class, then you have
a lot
of work to do, especially if you wish to use i18n features
(facets require traits information but don't have a traits argument).

Another example of how to specialize char_traits was given on the
mailing list and at a later date was put into the file
include/ext/pod_char_traits.h. We agree
that the way it's used with basic_string (scroll down to main())
doesn't look nice, but that's because the
nice-looking first attempt turned out to not
be conforming C++, due to the rule that CharT must be a POD.
(See how tricky this is?)
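
To give a feel for how much a traits class has to provide, here is a rough sketch for unsigned char. It deliberately does not specialize char_traits; instead it passes a separate, hypothetical class (uchar_traits) as the second template argument, and the name ustring is likewise an assumption made for this example. It covers what basic_string needs and has not been exercised with streams or facets.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>   // std::mbstate_t
#include <ios>      // std::streamoff, std::streampos
#include <string>

// Hypothetical traits class for unsigned char, built on the C memory
// functions. The n == 0 checks avoid calling them with unusable pointers.
struct uchar_traits
{
  typedef unsigned char  char_type;
  typedef int            int_type;
  typedef std::streamoff off_type;
  typedef std::streampos pos_type;
  typedef std::mbstate_t state_type;

  static void assign(char_type& d, const char_type& s) { d = s; }
  static bool eq(char_type a, char_type b) { return a == b; }
  static bool lt(char_type a, char_type b) { return a < b; }

  static int compare(const char_type* a, const char_type* b, std::size_t n)
  { return n == 0 ? 0 : std::memcmp(a, b, n); }

  static std::size_t length(const char_type* s)
  { std::size_t n = 0; while (s[n] != 0) ++n; return n; }

  static const char_type* find(const char_type* s, std::size_t n,
                               const char_type& c)
  { return n == 0 ? 0 : static_cast<const char_type*>(std::memchr(s, c, n)); }

  static char_type* move(char_type* d, const char_type* s, std::size_t n)
  { return n == 0 ? d : static_cast<char_type*>(std::memmove(d, s, n)); }

  static char_type* copy(char_type* d, const char_type* s, std::size_t n)
  { return n == 0 ? d : static_cast<char_type*>(std::memcpy(d, s, n)); }

  static char_type* assign(char_type* d, std::size_t n, char_type c)
  { return n == 0 ? d : static_cast<char_type*>(std::memset(d, c, n)); }

  static int_type  to_int_type(char_type c) { return c; }
  static char_type to_char_type(int_type i) { return static_cast<char_type>(i); }
  static bool      eq_int_type(int_type a, int_type b) { return a == b; }
  static int_type  eof() { return -1; }
  static int_type  not_eof(int_type i) { return i == eof() ? 0 : i; }
};

typedef std::basic_string<unsigned char, uchar_traits> ustring;
```

Even this small case needs around twenty members, which is a fair illustration of why a user-defined character class is so much work.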

Tokenizing

The Standard C (and C++) function strtok() leaves a lot to
be desired in terms of user-friendliness. It's unintuitive, it
destroys the character string on which it operates, and it requires
you to handle all the memory problems. But it does let the client
code decide what to use to break the string into pieces; it allows
you to choose the "whitespace," so to speak.

A C++ implementation lets us keep the good things and fix those
annoyances. The implementation here is more intuitive (you only
call it once, not in a loop with a varying argument), it does not
affect the original string at all, and all the memory allocation
is handled for you.

It's called stringtok, and it's a template function. Sources are
as below, in a less-portable form than it could be, to keep this
example simple (for example, see the comments on what kind of
string it will accept).

   #include <string>
   template <typename Container>
   void
   stringtok(Container &container, std::string const &in,
             const char * const delimiters = " \t\n")
   {
       const std::string::size_type len = in.length();
             std::string::size_type i = 0;

       while (i < len)
       {
           // Eat leading whitespace
           i = in.find_first_not_of(delimiters, i);
           if (i == std::string::npos)
               return;   // Nothing left but white space

           // Find the end of the token
           std::string::size_type j = in.find_first_of(delimiters, i);

           // Push token
           if (j == std::string::npos)
           {
               container.push_back(in.substr(i));
               return;
           }
           else
               container.push_back(in.substr(i, j-i));

           // Set up for next loop
           i = j + 1;
       }
   }

The author uses a more general (but less readable) form of it for
parsing command strings and the like. If you compiled and ran this
code using it:

   std::list<std::string>  ls;
   stringtok (ls, " this \t is\t\n a test ");
   for (std::list<std::string>::const_iterator i = ls.begin();
        i != ls.end(); ++i)
   {
       std::cerr << ':' << (*i) << ":\n";
   }

You would see this as output:

   :this:
   :is:
   :a:
   :test:

with all the whitespace removed. The input string is left
unmodified, ls will clean up after itself, and
ls.size() will return how many tokens there were.

As always, there is a price paid here, in that stringtok is not
as fast as strtok. The other benefits usually outweigh that, however.

Added February 2001: Mark Wilden pointed out that the
standard std::getline() function can be used with standard
istringstreams to perform
tokenizing as well. Build an istringstream from the input text,
and then use std::getline with varying delimiters (the three-argument
signature) to extract tokens into a string.
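
A minimal sketch of that getline approach might look like the following. The function name split_on is hypothetical, and skipping empty tokens is a choice made for this example rather than something getline does on its own.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on a single delimiter character
// using the three-argument std::getline.
inline std::vector<std::string> split_on(const std::string& text,
                                         char delimiter)
{
  std::vector<std::string> tokens;
  std::istringstream input(text);
  std::string token;

  // getline returns the stream, which tests false once input runs out.
  while (std::getline(input, token, delimiter))
  {
    if (!token.empty())          // drop empty fields between delimiters
      tokens.push_back(token);
  }
  return tokens;
}
```

Unlike stringtok, each getline call accepts only one delimiter character, although a different one can be passed on every call.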

Shrink to Fit

From GCC 3.4 onward, calling s.reserve(res) on a
string s with res < s.capacity() will
reduce the string's capacity to std::max(s.size(), res).

This behaviour is suggested, but not required, by the standard. Prior
to GCC 3.4 the following alternative can be used instead:

   std::string(str.data(), str.size()).swap(str);

This is similar to the idiom for reducing
a vector's memory usage
(see this FAQ
entry), but the regular copy constructor cannot be used
because libstdc++'s string is Copy-On-Write.
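
The swap idiom can be wrapped in a small function, sketched below with the hypothetical name shrink_by_swap. For comparison, a second hypothetical wrapper uses shrink_to_fit(), which std::string has offered since C++11; that call is a non-binding request, so the resulting capacity remains up to the implementation in both cases.

```cpp
#include <string>

// Hypothetical helper: build a temporary from the character data (not
// via the copy constructor) and swap it with the original.
inline void shrink_by_swap(std::string& str)
{
  std::string(str.data(), str.size()).swap(str);
}

#if __cplusplus >= 201103L
// Hypothetical helper: ask the library to drop unused capacity.
// The request is non-binding.
inline void shrink_by_request(std::string& str)
{
  str.shrink_to_fit();
}
#endif
```

Either way the contents and size() are unchanged; only capacity() may differ afterwards.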

CString (MFC)

A common lament seen in various newsgroups deals with the Standard
string class as opposed to the Microsoft Foundation Class called
CString. Often programmers realize that a standard portable
answer is better than a proprietary nonportable one, but in porting
their application from a Win32 platform, they discover that they
are relying on special functions offered by the CString class.

Things are not as bad as they seem. In
this
message, Joe Buck points out a few very important things:

The Standard string supports all the operations
that CString does, with three exceptions.

Two of those exceptions (whitespace trimming and case
conversion) are trivial to implement. In fact, we do so
on this page.

The third is CString::Format, which allows formatting
in the style of sprintf. This deserves some mention:

The old libg++ library had a function called form(), which did much
the same thing. But for a Standard solution, you should use the
stringstream classes. These are the bridge between the iostream
hierarchy and the string class, and they operate with regular
streams seamlessly because they inherit from the iostream
hierarchy. A quick example:

   #include <iostream>
   #include <string>
   #include <sstream>

   std::string f (std::string& incoming)     // incoming is "foo N"
   {
       std::istringstream   incoming_stream(incoming);
       std::string          the_word;
       int                  the_number;

       incoming_stream >> the_word        // extract "foo"
                       >> the_number;     // extract N

       std::ostringstream   output_stream;
       output_stream << "The word was " << the_word
                     << " and 3*N was " << (3*the_number);

       return output_stream.str();
   }
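
Field widths and precision, the other things people tend to reach for sprintf to get, are available through the manipulators in <iomanip>. The sketch below is only an illustration; format_reading and its parameters are hypothetical, and for a non-negative id it produces roughly what the sprintf format "%s #%03d = %.2f" would.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: label, zero-padded three-digit id, and a value
// printed with two digits after the decimal point.
inline std::string format_reading(const std::string& label,
                                  int id, double value)
{
  std::ostringstream out;
  out << label << " #"
      << std::setw(3) << std::setfill('0') << id   // setw affects only id
      << " = "
      << std::fixed << std::setprecision(2) << value;
  return out.str();
}
```

Because the result is an ordinary std::string, there is no fixed-size buffer to overflow, which is one of the usual arguments for this approach over sprintf.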

A serious problem with CString is a design bug in its memory
allocation. Specifically, quoting from that same message:

CString suffers from a common programming error that results in
poor performance. Consider the following code:

   CString n_copies_of (const CString& foo, unsigned n)
   {
       CString tmp;
       for (unsigned i = 0; i < n; i++)
           tmp += foo;
       return tmp;
   }

This function is O(n^2), not O(n). The reason is that each +=
causes a reallocation and copy of the existing string. Microsoft
applications are full of this kind of thing (quadratic performance
on tasks that can be done in linear time) -- on the other hand,
we should be thankful, as it's created such a big market for high-end
ix86 hardware. :-)

If you replace CString with string in the above function, the
performance is O(n).
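
For reference, the std::string version of that function might be sketched as follows. The reserve call is an addition made here, not part of the quoted message: it is optional, and simply requests the final size up front instead of relying on the string growing its capacity as it goes.

```cpp
#include <string>

// The quoted function with std::string substituted for CString.
inline std::string n_copies_of(const std::string& foo, unsigned n)
{
  std::string tmp;
  tmp.reserve(foo.size() * n);   // optional: a single allocation up front
  for (unsigned i = 0; i < n; ++i)
    tmp += foo;
  return tmp;
}
```

Without the reserve call the linear behaviour depends on the implementation growing capacity by more than the amount appended each time, which is the point the quoted message is making.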

Joe Buck also pointed out some other things to keep in mind when
comparing CString and the Standard string class:

CString permits access to its internal representation; coders
who exploited that may have problems moving to string.

Microsoft ships the source to CString (in the files
MFC\SRC\Str{core,ex}.cpp), so you could fix the allocation
bug and rebuild your MFC libraries.
Note: It looks like the CString shipped
with VC++6.0 has fixed this, although it may in fact have been
one of the VC++ SPs that did it.

string operations like this have O(n) complexity
if the implementors do it correctly. The libstdc++
implementors did it correctly. Other vendors might not.

While parts of the SGI STL are used in libstdc++, their
string class is not. The SGI string is essentially
vector<char> and does not do any reference
counting like libstdc++'s does. (It is O(n), though.)
So if you're thinking about SGI's string or rope classes,
you're now looking at four possibilities: CString, the
libstdc++ string, the SGI string, and the SGI rope, and this
is all before any allocator or traits customizations! (More
choices than you can shake a stick at -- want fries with that?)