1 |
199 |
simons |
.TH REGEXP 3 local
|
2 |
|
|
.DA 30 Nov 1985
|
3 |
|
|
.SH NAME
|
4 |
|
|
regcomp, regexec, regsub, regerror \- regular expression handler
|
5 |
|
|
.SH SYNOPSIS
|
6 |
|
|
.ft B
|
7 |
|
|
.nf
|
8 |
|
|
#include
|
9 |
|
|
|
10 |
|
|
regexp *regcomp(exp)
|
11 |
|
|
char *exp;
|
12 |
|
|
|
13 |
|
|
int regexec(prog, string)
|
14 |
|
|
regexp *prog;
|
15 |
|
|
char *string;
|
16 |
|
|
|
17 |
|
|
regsub(prog, source, dest)
|
18 |
|
|
regexp *prog;
|
19 |
|
|
char *source;
|
20 |
|
|
char *dest;
|
21 |
|
|
|
22 |
|
|
regerror(msg)
|
23 |
|
|
char *msg;
|
24 |
|
|
.SH DESCRIPTION
|
25 |
|
|
These functions implement
|
26 |
|
|
.IR egrep (1)-style
|
27 |
|
|
regular expressions and supporting facilities.
|
28 |
|
|
.PP
|
29 |
|
|
.I Regcomp
|
30 |
|
|
compiles a regular expression into a structure of type
|
31 |
|
|
.IR regexp ,
|
32 |
|
|
and returns a pointer to it.
|
33 |
|
|
The space has been allocated using
|
34 |
|
|
.IR malloc (3)
|
35 |
|
|
and may be released by
|
36 |
|
|
.IR free .
|
37 |
|
|
.PP
|
38 |
|
|
.I Regexec
|
39 |
|
|
matches a NUL-terminated \fIstring\fR against the compiled regular expression
|
40 |
|
|
in \fIprog\fR.
|
41 |
|
|
It returns 1 for success and 0 for failure, and adjusts the contents of
|
42 |
|
|
\fIprog\fR's \fIstartp\fR and \fIendp\fR (see below) accordingly.
|
43 |
|
|
.PP
|
44 |
|
|
The members of a
|
45 |
|
|
.I regexp
|
46 |
|
|
structure include at least the following (not necessarily in order):
|
47 |
|
|
.PP
|
48 |
|
|
.RS
|
49 |
|
|
char *startp[NSUBEXP];
|
50 |
|
|
.br
|
51 |
|
|
char *endp[NSUBEXP];
|
52 |
|
|
.RE
|
53 |
|
|
.PP
|
54 |
|
|
where
|
55 |
|
|
.I NSUBEXP
|
56 |
|
|
is defined (as 10) in the header file.
|
57 |
|
|
Once a successful \fIregexec\fR has been done using the \fIregexp\fR,
|
58 |
|
|
each \fIstartp\fR-\fIendp\fR pair describes one substring
|
59 |
|
|
within the \fIstring\fR,
|
60 |
|
|
with the \fIstartp\fR pointing to the first character of the substring and
|
61 |
|
|
the \fIendp\fR pointing to the first character following the substring.
|
62 |
|
|
The 0th substring is the substring of \fIstring\fR that matched the whole
|
63 |
|
|
regular expression.
|
64 |
|
|
The others are those substrings that matched parenthesized expressions
|
65 |
|
|
within the regular expression, with parenthesized expressions numbered
|
66 |
|
|
in left-to-right order of their opening parentheses.
|
67 |
|
|
.PP
|
68 |
|
|
.I Regsub
|
69 |
|
|
copies \fIsource\fR to \fIdest\fR, making substitutions according to the
|
70 |
|
|
most recent \fIregexec\fR performed using \fIprog\fR.
|
71 |
|
|
Each instance of `&' in \fIsource\fR is replaced by the substring
|
72 |
|
|
indicated by \fIstartp\fR[\fI0\fR] and
|
73 |
|
|
\fIendp\fR[\fI0\fR].
|
74 |
|
|
Each instance of `\e\fIn\fR', where \fIn\fR is a digit, is replaced by
|
75 |
|
|
the substring indicated by
|
76 |
|
|
\fIstartp\fR[\fIn\fR] and
|
77 |
|
|
\fIendp\fR[\fIn\fR].
|
78 |
|
|
.PP
|
79 |
|
|
.I Regerror
|
80 |
|
|
is called whenever an error is detected in \fIregcomp\fR, \fIregexec\fR,
|
81 |
|
|
or \fIregsub\fR.
|
82 |
|
|
The default \fIregerror\fR writes the string \fImsg\fR,
|
83 |
|
|
with a suitable indicator of origin,
|
84 |
|
|
on the standard
|
85 |
|
|
error output
|
86 |
|
|
and invokes \fIexit\fR(2).
|
87 |
|
|
.I Regerror
|
88 |
|
|
can be replaced by the user if other actions are desirable.
|
89 |
|
|
.SH "REGULAR EXPRESSION SYNTAX"
|
90 |
|
|
A regular expression is zero or more \fIbranches\fR, separated by `|'.
|
91 |
|
|
It matches anything that matches one of the branches.
|
92 |
|
|
.PP
|
93 |
|
|
A branch is zero or more \fIpieces\fR, concatenated.
|
94 |
|
|
It matches a match for the first, followed by a match for the second, etc.
|
95 |
|
|
.PP
|
96 |
|
|
A piece is an \fIatom\fR possibly followed by `*', `+', or `?'.
|
97 |
|
|
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
|
98 |
|
|
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
|
99 |
|
|
An atom followed by `?' matches a match of the atom, or the null string.
|
100 |
|
|
.PP
|
101 |
|
|
An atom is a regular expression in parentheses (matching a match for the
|
102 |
|
|
regular expression), a \fIrange\fR (see below), `.'
|
103 |
|
|
(matching any single character), `^' (matching the null string at the
|
104 |
|
|
beginning of the input string), `$' (matching the null string at the
|
105 |
|
|
end of the input string), a `\e' followed by a single character (matching
|
106 |
|
|
that character), or a single character with no other significance
|
107 |
|
|
(matching that character).
|
108 |
|
|
.PP
|
109 |
|
|
A \fIrange\fR is a sequence of characters enclosed in `[]'.
|
110 |
|
|
It normally matches any single character from the sequence.
|
111 |
|
|
If the sequence begins with `^',
|
112 |
|
|
it matches any single character \fInot\fR from the rest of the sequence.
|
113 |
|
|
If two characters in the sequence are separated by `\-', this is shorthand
|
114 |
|
|
for the full list of ASCII characters between them
|
115 |
|
|
(e.g. `[0-9]' matches any decimal digit).
|
116 |
|
|
To include a literal `]' in the sequence, make it the first character
|
117 |
|
|
(following a possible `^').
|
118 |
|
|
To include a literal `\-', make it the first or last character.
|
119 |
|
|
.SH AMBIGUITY
|
120 |
|
|
If a regular expression could match two different parts of the input string,
|
121 |
|
|
it will match the one which begins earliest.
|
122 |
|
|
If both begin in the same place but match different lengths, or match
|
123 |
|
|
the same length in different ways, life gets messier, as follows.
|
124 |
|
|
.PP
|
125 |
|
|
In general, the possibilities in a list of branches are considered in
|
126 |
|
|
left-to-right order, the possibilities for `*', `+', and `?' are
|
127 |
|
|
considered longest-first, nested constructs are considered from the
|
128 |
|
|
outermost in, and concatenated constructs are considered leftmost-first.
|
129 |
|
|
The match that will be chosen is the one that uses the earliest
|
130 |
|
|
possibility in the first choice that has to be made.
|
131 |
|
|
If there is more than one choice, the next will be made in the same manner
|
132 |
|
|
(earliest possibility) subject to the decision on the first choice.
|
133 |
|
|
And so forth.
|
134 |
|
|
.PP
|
135 |
|
|
For example, `(ab|a)b*c' could match `abc' in one of two ways.
|
136 |
|
|
The first choice is between `ab' and `a'; since `ab' is earlier, and does
|
137 |
|
|
lead to a successful overall match, it is chosen.
|
138 |
|
|
Since the `b' is already spoken for,
|
139 |
|
|
the `b*' must match its last possibility\(emthe empty string\(emsince
|
140 |
|
|
it must respect the earlier choice.
|
141 |
|
|
.PP
|
142 |
|
|
In the particular case where no `|'s are present and there is only one
|
143 |
|
|
`*', `+', or `?', the net effect is that the longest possible
|
144 |
|
|
match will be chosen.
|
145 |
|
|
So `ab*', presented with `xabbbby', will match `abbbb'.
|
146 |
|
|
Note that if `ab*' is tried against `xabyabbbz', it
|
147 |
|
|
will match `ab' just after `x', due to the begins-earliest rule.
|
148 |
|
|
(In effect, the decision on where to start the match is the first choice
|
149 |
|
|
to be made, hence subsequent choices must respect it even if this leads them
|
150 |
|
|
to less-preferred alternatives.)
|
151 |
|
|
.SH SEE ALSO
|
152 |
|
|
egrep(1), expr(1)
|
153 |
|
|
.SH DIAGNOSTICS
|
154 |
|
|
\fIRegcomp\fR returns NULL for a failure
|
155 |
|
|
(\fIregerror\fR permitting),
|
156 |
|
|
where failures are syntax errors, exceeding implementation limits,
|
157 |
|
|
or applying `+' or `*' to a possibly-null operand.
|
158 |
|
|
.SH HISTORY
|
159 |
|
|
Both code and manual page were
|
160 |
|
|
written at U of T.
|
161 |
|
|
They are intended to be compatible with the Bell V8 \fIregexp\fR(3),
|
162 |
|
|
but are not derived from Bell code.
|
163 |
|
|
.SH BUGS
|
164 |
|
|
Empty branches and empty regular expressions are not portable to V8.
|
165 |
|
|
.PP
|
166 |
|
|
The restriction against
|
167 |
|
|
applying `*' or `+' to a possibly-null operand is an artifact of the
|
168 |
|
|
simplistic implementation.
|
169 |
|
|
.PP
|
170 |
|
|
Does not support \fIegrep\fR's newline-separated branches;
|
171 |
|
|
neither does the V8 \fIregexp\fR(3), though.
|
172 |
|
|
.PP
|
173 |
|
|
Due to emphasis on
|
174 |
|
|
compactness and simplicity,
|
175 |
|
|
it's not strikingly fast.
|
176 |
|
|
It does give special attention to handling simple cases quickly.
|