This post is part of a series on the difference between pattern matching notation and extended regular expressions.
POSIX specifies that the shell provide some interesting pattern matching abilities in its parameter and pathname expansions, using what it calls Pattern Matching Notation. However, this notation is not the same as the standard for regular expressions (also defined in IEEE Std 1003.2 (POSIX.2)), which is used by other pattern matching tools such as `egrep` and `sed`. In this post we will explore some of the differences so as to avoid some confusing mistakes.
The IEEE Standards
As of this writing the latest version of POSIX I am aware of is the IEEE Std 1003.1, 2013 Edition, which is luckily available for free online.
Two Systems
As for the naming of the two systems, we will use the names that POSIX uses:
Name | Used by/in |
---|---|
Pattern Matching Notation | Parameter and Pathname Expansion |
[Extended] Regular Expressions | [e]grep and sed as well as conditional expressions when used with =~ |
Why Extended?
Extended is used to differentiate the system from the older regular expression notation, called basic. The `man re_format` page has this to say:
Regular expressions (``REs''), as defined in IEEE Std 1003.2 (``POSIX.2''), come in two forms: modern REs (roughly those of egrep(1); 1003.2 calls these ``extended'' REs) and obsolete REs (roughly those of ed(1); 1003.2 ``basic'' REs). Obsolete REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. IEEE Std 1003.2 (``POSIX.2'') leaves some aspects of RE syntax and semantics open; `=' marks decisions on these aspects that may not be fully portable to other IEEE Std 1003.2 (``POSIX.2'') implementations.
Similar but not the Same
Pattern Matching Notation is actually closely related to regular expressions but with some differences, as mentioned in the IEEE Std 1003.1, 2004 Edition section 2.13:
The pattern matching notation described in this section is used to specify patterns for matching strings in the shell. Historically, pattern matching notation is related to, but slightly different from, the regular expression notation described in XBD Regular Expressions. For this reason, the description of the rules for this pattern matching notation are based on the description of regular expression notation, modified to account for the differences.
The differences are mostly in the special operators:
Special Operators
Operator | Pattern Matching | Regular Expression |
---|---|---|
* | Matches any string, including the null string. | An atom followed by `*` matches a sequence of 0 or more matches of the atom. |
? | Matches any single character. | An atom followed by `?` matches a sequence of 0 or 1 matches of the atom. |
+ | Matches itself. | An atom followed by `+` matches a sequence of 1 or more matches of the atom. |
. | Matches itself. | Matches any single character. |
^ | Matches itself. | Matches the null string at the beginning of a line. |
$ | Matches itself. | Matches the null string at the end of a line. |
\| | Matches itself. | Indicates a branch (a way of saying 'or' between patterns). |
\ | Escapes the special characters used by the system. | Escapes the special characters used by the system. |
{i,j} | Matches itself. | An atom followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the atom. To be explained in more detail. |
[…] General | Matches any one of the enclosed characters. | Same as Pattern Matching. |
[…] Range | A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale's collating sequence and character set, is matched. | Same as Pattern Matching. |
[…] Character Classes | Within `[` and `]`, character classes can be specified using the syntax `[:class:]`, where class is one of the following classes defined in the POSIX standard: alnum alpha ascii blank cntrl digit graph lower print punct space upper word xdigit. A character class matches any character belonging to that class. | Same as Pattern Matching. |
[…] Negation | If the first character following the `[` is a `!` or a `^`, then any character not enclosed is matched. | If the first character following the `[` is a `^`, then any character not enclosed is matched. |
[…] - | A `-` may be matched by including it as the first or last character in the set. | Same as Pattern Matching. |
[…] ] | A `]` may be matched by including it as the first character in the set. | Same as Pattern Matching. |
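To make a few rows of the table concrete, here is a minimal sketch that tests the same operators in both systems. It assumes a POSIX-style shell with `grep -E` available; the helper names `glob_match` and `ere_match` are made up for this illustration and are not part of any standard.

```bash
# glob_match and ere_match are hypothetical helper names for this sketch.

# Succeeds when the WHOLE string matches the shell pattern.
glob_match() {
  case $1 in
    $2) return 0 ;;   # an unquoted expansion here is treated as a pattern
    *)  return 1 ;;
  esac
}

# Succeeds when the extended regular expression matches anywhere in the string.
ere_match() {
  printf '%s\n' "$1" | grep -Eq -- "$2"
}

# '?' is a single-character wildcard in a pattern, a 0-or-1 quantifier in an ERE.
glob_match 'abc' 'a?c'    && echo 'pattern a?c matches abc'
ere_match  'ac'  '^ab?c$' && echo 'ERE ^ab?c$ matches ac'

# '.' is literal in a pattern, but matches any single character in an ERE.
glob_match 'abc' 'a.c'    || echo 'pattern a.c does not match abc'
ere_match  'abc' '^a.c$'  && echo 'ERE ^a.c$ matches abc'

# '+' is literal in a pattern, a 1-or-more quantifier in an ERE.
glob_match 'a+'  'a+'     && echo 'pattern a+ matches the literal string a+'
ere_match  'aaa' '^a+$'   && echo 'ERE ^a+$ matches aaa'

# '*' stands alone in a pattern, but quantifies the preceding atom in an ERE.
glob_match 'abc' 'a*'     && echo 'pattern a* matches abc'
ere_match  'b'   'a*'     && echo 'ERE a* matches b (zero occurrences of a)'
```

Note that the `case` statement requires the pattern to cover the whole string, while `grep -E` is happy with a match anywhere, which is why the ERE examples are anchored with `^` and `$` where it matters.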
You will notice that atoms were mentioned several times in the table. `man re_format` gives us some more information about what an atom is:
An atom is a regular expression enclosed in `()` (matching a match for the regular expression), an empty set of `()` (matching the null string)=, a bracket expression (see below), `.` (matching any single character), `^` (matching the null string at the beginning of a line), `$` (matching the null string at the end of a line), a `\` followed by one of the characters `^.[$()|*+?{\` (matching that character taken as an ordinary character), a `\` followed by any other character= (matching that character taken as an ordinary character, as if the `\` had not been present=), or a single character with no other significance (matching that character). A `{` followed by a character other than a digit is an ordinary character, not the beginning of a bound=. It is illegal to end an RE with `\`.
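As a rough illustration of atoms and bounds, the sketch below uses the same text, `a{2,3}`, first as an extended regular expression and then as a shell pattern. It assumes `grep -E` is available; the variable names are made up for the example.

```bash
pat='a{2,3}'   # the same text, used first as an ERE and then as a shell pattern

# As an ERE (anchored with ^ and $), the bound means "the atom a, 2 or 3 times".
if printf '%s\n' 'aaa'  | grep -Eq "^${pat}\$"; then ere_aaa=match;  else ere_aaa='no match';  fi
if printf '%s\n' 'aaaa' | grep -Eq "^${pat}\$"; then ere_aaaa=match; else ere_aaaa='no match'; fi

# A parenthesised RE is also an atom, so a bound can repeat a whole group.
if printf '%s\n' 'abab' | grep -Eq '^(ab){2}$'; then ere_group=match; else ere_group='no match'; fi

# As a shell pattern, the braces are ordinary characters that match themselves.
case 'aaa'    in $pat) glob_aaa=match ;;     *) glob_aaa='no match' ;;     esac
case 'a{2,3}' in $pat) glob_literal=match ;; *) glob_literal='no match' ;; esac

echo "ERE     a{2,3} vs aaa    : $ere_aaa"
echo "ERE     a{2,3} vs aaaa   : $ere_aaaa"
echo "ERE     (ab){2} vs abab  : $ere_group"
echo "pattern a{2,3} vs aaa    : $glob_aaa"
echo "pattern a{2,3} vs a{2,3} : $glob_literal"
```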
The Applications
As mentioned earlier, pattern matching is used in pathname and parameter expansion. However, because of their uses there are some limitations. Let’s take a closer look:
Pathname Expansion
The `man bash` page tells us the following:
Pathname Expansion
After word splitting, unless the -f option has been set, bash scans each word for the characters *, ?, and [. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern.
Special Note on Pathname Expansion
Something to note about pathname expansion is that for there to be a match, the whole filename must match the pattern, and not just a part of it. So for example, if we have a file named `test`, the pattern `te?` won't match because it doesn't match the whole filename; it only matches a part, `tes`. To get a match we would have to use `tes?` or `te*` instead. As we will see, this is not the case with parameter expansion.
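The `test` example can be tried safely in a scratch directory. This is only a sketch: it assumes `mktemp -d` is available and that the shell is running with its default globbing behaviour (in bash, without `nullglob` or `failglob`), in which case an unmatched pattern is left as typed.

```bash
# Work in a throwaway directory so the demo cannot touch real files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch test

short=$(echo te?)    # te? covers exactly three characters, so "test" is not matched
exact=$(echo tes?)   # four characters, so the whole filename matches
star=$(echo te*)     # te followed by any string, so the whole filename matches

echo "te?  -> $short"
echo "tes? -> $exact"
echo "te*  -> $star"

# Return to where we started and clean up.
cd "$OLDPWD" || exit 1
rm -r "$demo_dir"
```

The first line prints the pattern `te?` itself because nothing matched, while the other two print `test`.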
Parameter Expansion
There are a few cases where pattern matching can be used in parameter expansion.
Format | Matches | Matches Shortest or Longest | Function |
---|---|---|---|
${parameter#word} | pattern matches the beginning of the expanded value of parameter | Shortest | Deletes matched string |
${parameter##word} | pattern matches the beginning of the expanded value of parameter | Longest | Deletes matched string |
${parameter%word} | pattern matches a trailing portion of the expanded value of parameter | Shortest | Deletes matched string |
${parameter%%word} | pattern matches a trailing portion of the expanded value of parameter | Longest | Deletes matched string |
${parameter/pattern/string} | pattern matches anywhere in the expanded value of parameter, doesn’t have to be beginning or end | Longest | Substitutes matched portion with string |
Unlike pathname expansion, which must match a whole filename, parameter expansion has no such restriction, as the matches are made within a string. However, notice that when `#` is used the match must be at the beginning of the parameter, whereas when `%` is used the match must be at the end of the parameter. The substitution format, `${parameter/pattern/string}`, has no such anchoring requirement.
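The five formats in the table can be compared side by side on one value. The sketch below assumes bash (the `/` substitution form is not part of POSIX sh), and the path is a made-up value chosen so that the shortest and longest matches differ.

```bash
path='/usr/local/lib/libfoo.so.1.2'   # hypothetical value for illustration

short_prefix=${path#*/}    # shortest leading match of */  -> usr/local/lib/libfoo.so.1.2
long_prefix=${path##*/}    # longest leading match of */   -> libfoo.so.1.2
short_suffix=${path%.*}    # shortest trailing match of .* -> /usr/local/lib/libfoo.so.1
long_suffix=${path%%.*}    # longest trailing match of .*  -> /usr/local/lib/libfoo
replaced=${path/lib/LIB}   # first match anywhere          -> /usr/local/LIB/libfoo.so.1.2

# "local" occurs in the middle, so the anchored # form leaves the value unchanged.
unchanged=${path#local}

printf '%s\n' "$short_prefix" "$long_prefix" "$short_suffix" \
              "$long_suffix" "$replaced" "$unchanged"
```

Here `.` is an ordinary character in the patterns, so `%.*` removes a literal dot and whatever follows it.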
The `man bash` page tells us the following:
${parameter#word}
${parameter##word}
The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches the beginning of the value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``#'' case) or the longest matching pattern (the ``##'' case) deleted. If parameter is @ or *, the pattern removal operation is applied to each positional parameter in turn, and the expansion is the resultant list. If parameter is an array variable subscripted with @ or *, the pattern removal operation is applied to each member of the array in turn, and the expansion is the resultant list.
${parameter%word}
${parameter%%word}
The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches a trailing portion of the expanded value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``%'' case) or the longest matching pattern (the ``%%'' case) deleted. If parameter is @ or *, the pattern removal operation is applied to each positional parameter in turn, and the expansion is the resultant list. If parameter is an array variable subscripted with @ or *, the pattern removal operation is applied to each member of the array in turn, and the expansion is the resultant list.
${parameter/pattern/string}
The pattern is expanded to produce a pattern just as in pathname expansion. Parameter is expanded and the longest match of pattern against its value is replaced with string. If pattern begins with /, all matches of pattern are replaced with string. Normally only the first match is replaced. If pattern begins with #, it must match at the beginning of the expanded value of parameter. If pattern begins with %, it must match at the end of the expanded value of parameter. If string is null, matches of pattern are deleted and the / following pattern may be omitted. If parameter is @ or *, the substitution operation is applied to each positional parameter in turn, and the expansion is the resultant list. If parameter is an array variable subscripted with @ or *, the substitution operation is applied to each member of the array in turn, and the expansion is the resultant list.
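The variants described in that last paragraph can be sketched as follows. This assumes bash, and the string is a made-up value in which `foo` appears at both ends.

```bash
name='foo.bar.foo'   # hypothetical value for illustration

first=${name/foo/X}       # only the first match      -> X.bar.foo
all=${name//foo/X}        # pattern begins with /     -> X.bar.X
at_start=${name/#foo/X}   # must match at the start   -> X.bar.foo
at_end=${name/%foo/X}     # must match at the end     -> foo.bar.X
deleted=${name//foo}      # null string, / omitted    -> .bar.
longest=${name/f*o/X}     # longest match of f*o      -> X
dots=${name//./-}         # . is literal in a pattern -> foo-bar-foo

printf '%s\n' "$first" "$all" "$at_start" "$at_end" "$deleted" "$longest" "$dots"
```

The last line is the one most likely to surprise someone used to regular expressions, where `.` would match every character rather than only the dots.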
Extended Regular Expressions
Alongside `egrep`, `sed`, and other external utilities, these are also usable in conditional expressions:
An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3))
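A short sketch of the two operators inside `[[ ]]` shows the difference in practice. It assumes bash; the version string and variable names are made up for the example, and the regular expression is kept in a variable, a common way to sidestep quoting problems on the right-hand side of `=~`.

```bash
version='v12.3'              # hypothetical input
re='^v([0-9]+)\.([0-9]+)$'   # ERE stored in a variable

# == uses pattern matching notation and must match the whole string.
if [[ $version == v*.? ]]; then glob_whole=match; else glob_whole='no match'; fi
if [[ $version == 12 ]];   then glob_part=match;  else glob_part='no match';  fi

# =~ uses an ERE, which may match anywhere unless anchored with ^ and $.
if [[ $version =~ 12 ]];   then ere_part=match;   else ere_part='no match';   fi

# Parenthesised subexpressions are captured in bash's BASH_REMATCH array.
if [[ $version =~ $re ]]; then
  major=${BASH_REMATCH[1]}
  minor=${BASH_REMATCH[2]}
fi

echo "pattern v*.? : $glob_whole"
echo "pattern 12   : $glob_part"
echo "ERE 12       : $ere_part"
echo "major=$major minor=$minor"
```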
Next Steps
- Practical Explorations of the Differences Between Pattern Matching Notation Used in Pathname and Parameter Expansion and Extended Regular Expressions
- A Table of Practical Matching Differences Between Pattern Matching Notation Used in Pathname and Parameter Expansion and Extended Regular Expressions
References
- IEEE Std 1003.1, 2013 Edition
- `man bash` page
- `man re_format` page (FreeBSD version)