'\"
'\" Copyright (c) 1993 The Regents of the University of California.
'\" Copyright (c) 1994-1996 Sun Microsystems, Inc.
'\"
'\" See the file "license.terms" for information on usage and redistribution
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
'\"
'\" RCS: @(#) $Id: regexp.n,v 1.1.1.1 2002-01-16 10:25:25 markom Exp $
'\"
.so man.macros
.TH regexp n "" Tcl "Tcl Built-In Commands"
.BS
'\" Note: do not modify the .SH NAME line immediately below!
.SH NAME
regexp \- Match a regular expression against a string
.SH SYNOPSIS
\fBregexp \fR?\fIswitches\fR? \fIexp string \fR?\fImatchVar\fR? ?\fIsubMatchVar subMatchVar ...\fR?
.BE

.SH DESCRIPTION
.PP
Determines whether the regular expression \fIexp\fR matches part or
all of \fIstring\fR and returns 1 if it does, 0 if it doesn't.
.LP
If additional arguments are specified after \fIstring\fR then they
are treated as the names of variables in which to return
information about which part(s) of \fIstring\fR matched \fIexp\fR.
\fIMatchVar\fR will be set to the range of \fIstring\fR that
matched all of \fIexp\fR. The first \fIsubMatchVar\fR will contain
the characters in \fIstring\fR that matched the leftmost parenthesized
subexpression within \fIexp\fR, the next \fIsubMatchVar\fR will
contain the characters that matched the next parenthesized
subexpression to the right in \fIexp\fR, and so on.
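.LP
As a minimal sketch of how these variables are filled in (the input string
and the variable names \fBall\fR, \fBkey\fR, and \fBval\fR are invented
for this illustration), consider
.CS
\fBregexp {([a-z]+)=([0-9]+)} {width=42} all key val\fR
.CE
The command returns 1. Afterwards \fBall\fR holds \fBwidth=42\fR,
\fBkey\fR holds \fBwidth\fR, and \fBval\fR holds \fB42\fR.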
.LP
If the initial arguments to \fBregexp\fR start with \fB\-\fR then
they are treated as switches. The following switches are
currently supported:
.TP 10
\fB\-nocase\fR
Causes upper-case characters in \fIstring\fR to be treated as
lower case during the matching process.
.TP 10
\fB\-indices\fR
Changes what is stored in \fImatchVar\fR and the \fIsubMatchVar\fRs.
Instead of storing the matching characters from \fIstring\fR,
each variable
will contain a list of two decimal strings giving the indices
in \fIstring\fR of the first and last characters in the matching
range of characters.
.TP 10
\fB\-\|\-\fR
Marks the end of switches. The argument following this one will
be treated as \fIexp\fR even if it starts with a \fB\-\fR.
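.LP
The switches might be used as follows; the sample strings and the variable
names \fBm\fR and \fBsub\fR are assumptions made for this sketch:
.CS
\fBregexp \-nocase tcl {I like TCL}\fR
\fBregexp \-indices {b(c+)} abccd m sub\fR
\fBregexp \-\|\- \-x a\-x m\fR
.CE
The first command returns 1 because the upper-case letters of the string
are compared as lower case. The second returns 1 and leaves \fB1 3\fR
in \fBm\fR and \fB2 3\fR in \fBsub\fR, since indices count from 0.
The third returns 1 and sets \fBm\fR to \fB\-x\fR; without \fB\-\|\-\fR
the pattern \fB\-x\fR would be taken for a switch.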
.LP
If there are more \fIsubMatchVar\fRs than parenthesized
subexpressions within \fIexp\fR, or if a particular subexpression
in \fIexp\fR doesn't match the string (e.g. because it was in a
portion of the expression that wasn't matched), then the corresponding
\fIsubMatchVar\fR will be set to ``\fB\-1 \-1\fR'' if \fB\-indices\fR
has been specified or to an empty string otherwise.
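.LP
A small sketch of both cases (the variable names are chosen only for
illustration):
.CS
\fBregexp {(a)|(b)} b m x y z\fR
\fBregexp \-indices {(a)|(b)} b m x y z\fR
.CE
Here \fB(a)\fR takes no part in the match and there is no third
subexpression. After the first command \fBm\fR and \fBy\fR are \fBb\fR
while \fBx\fR and \fBz\fR are empty strings. After the second,
\fBm\fR and \fBy\fR are \fB0 0\fR while \fBx\fR and \fBz\fR are
\fB\-1 \-1\fR.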

.SH "REGULAR EXPRESSIONS"
.PP
Regular expressions are implemented using Henry Spencer's package
(thanks, Henry!),
and much of the description of regular expressions below is copied verbatim
from his manual entry.
.PP
A regular expression is zero or more \fIbranches\fR, separated by ``|''.
It matches anything that matches one of the branches.
.PP
A branch is zero or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second, etc.
.PP
A piece is an \fIatom\fR possibly followed by ``*'', ``+'', or ``?''.
An atom followed by ``*'' matches a sequence of 0 or more matches of the atom.
An atom followed by ``+'' matches a sequence of 1 or more matches of the atom.
An atom followed by ``?'' matches a match of the atom, or the null string.
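.PP
The three suffixes can be compared on a few made-up inputs (a sketch;
the strings are arbitrary):
.CS
\fBregexp {ab*c} ac\fR
\fBregexp {ab+c} ac\fR
\fBregexp {ab+c} abbc\fR
\fBregexp {ab?c} abbc\fR
.CE
These commands return 1, 0, 1, and 0 respectively: ``*'' accepts zero
occurrences of \fBb\fR, ``+'' requires at least one, and ``?'' accepts
at most one.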
.PP
An atom is a regular expression in parentheses (matching a match for the
regular expression), a \fIrange\fR (see below), ``.''
(matching any single character), ``^'' (matching the null string at the
beginning of the input string), ``$'' (matching the null string at the
end of the input string), a ``\e'' followed by a single character (matching
that character), or a single character with no other significance
(matching that character).
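.PP
A few hypothetical commands illustrating these atoms (the strings are
arbitrary samples):
.CS
\fBregexp {^a.c$} abc\fR
\fBregexp {^a.c$} xabc\fR
\fBregexp {a\e.c} abc\fR
\fBregexp {a\e.c} a.c\fR
.CE
These return 1, 0, 0, and 1 respectively. The second fails because
``^'' ties the match to the beginning of the string, and the third fails
because the escaped ``.'' matches only a literal period.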
.PP
A \fIrange\fR is a sequence of characters enclosed in ``[]''.
It normally matches any single character from the sequence.
If the sequence begins with ``^'',
it matches any single character \fInot\fR from the rest of the sequence.
If two characters in the sequence are separated by ``\-'', this is shorthand
for the full list of ASCII characters between them
(e.g. ``[0-9]'' matches any decimal digit).
To include a literal ``]'' in the sequence, make it the first character
(following a possible ``^'').
To include a literal ``\-'', make it the first or last character.
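.PP
As a sketch of these range rules (the sample strings and the variable
\fBm\fR are made up for the example):
.CS
\fBregexp {[0-9]+} abc123def m\fR
\fBregexp {[^0-9]+} abc123def m\fR
\fBregexp {[]a]+} {x]a]y} m\fR
\fBregexp {[a\-]+} b\-a\-c m\fR
.CE
Each command returns 1, and \fBm\fR is set in turn to \fB123\fR,
\fBabc\fR, \fB]a]\fR, and \fB\-a\-\fR.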

.SH "CHOOSING AMONG ALTERNATIVE MATCHES"
.PP
In general there may be more than one way to match a regular expression
to an input string. For example, consider the command
.CS
\fBregexp (a*)b* aabaaabb x y\fR
.CE
Considering only the rules given so far, \fBx\fR and \fBy\fR could
end up with the values \fBaabb\fR and \fBaa\fR, \fBaaab\fR and \fBaaa\fR,
\fBab\fR and \fBa\fR, or any of several other combinations.
To resolve this potential ambiguity \fBregexp\fR chooses among
alternatives using the rule ``first then longest''.
In other words, it considers the possible matches in order working
from left to right across the input string and the pattern, and it
attempts to match longer pieces of the input string before shorter
ones. More specifically, the following rules apply in decreasing
order of priority:
.IP [1]
If a regular expression could match two different parts of an input string
then it will match the one that begins earliest.
.IP [2]
If a regular expression contains \fB|\fR operators then the leftmost
matching sub-expression is chosen.
.IP [3]
In \fB*\fR, \fB+\fR, and \fB?\fR constructs, longer matches are chosen
in preference to shorter ones.
.IP [4]
In sequences of expression components the components are considered
from left to right.
.LP
In the example from above, \fB(a*)b*\fR matches \fBaab\fR: the \fB(a*)\fR
portion of the pattern is matched first and it consumes the leading
\fBaa\fR; then the \fBb*\fR portion of the pattern consumes the
next \fBb\fR. Or, consider the following example:
.CS
\fBregexp (ab|a)(b*)c abc x y z\fR
.CE
After this command \fBx\fR will be \fBabc\fR, \fBy\fR will be
\fBab\fR, and \fBz\fR will be an empty string.
Rule 4 specifies that \fB(ab|a)\fR gets first shot at the input
string and Rule 2 specifies that the \fBab\fR sub-expression
is checked before the \fBa\fR sub-expression.
Thus the \fBb\fR has already been claimed before the \fB(b*)\fR
component is checked and \fB(b*)\fR must match an empty string.
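.LP
The priority of Rule 1 over Rule 3 can be seen in one further made-up
example (the string and the variable \fBm\fR are assumptions of this
sketch):
.CS
\fBregexp b+ abbcbbb m\fR
.CE
After this command \fBm\fR will be \fBbb\fR. Rule 1 selects the match
that begins earliest even though a longer run of \fBb\fR's appears later
in the string, and Rule 3 then makes \fBb+\fR take both adjacent
\fBb\fR's rather than just one.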

.SH KEYWORDS
match, regular expression, string