A Theoretical Summary of the Differences Between Pattern Matching Notation Used in Pathname and Parameter Expansion and Extended Regular Expressions

This post is part of a series on the difference between pattern matching notation and extended regular expressions.

POSIX specifies that the shell has some interesting pattern matching abilities in its parameter and pathname expansions, using what it calls Pattern Matching Notation. However, this notation does not follow the standard for regular expressions (originally defined in IEEE Std 1003.2 (POSIX.2)), which is the one used by other pattern matching utilities such as egrep and sed. In this post we will explore some of the differences so as to avoid some confusing mistakes.

The IEEE Standards

As of this writing, the latest version of POSIX I am aware of is IEEE Std 1003.1, 2013 Edition, which is fortunately available for free online.

Two Systems

As for the naming of the two systems, we will use the names that POSIX uses:

Name | Used by/in
Pattern Matching Notation | Parameter and Pathname Expansion
[Extended] Regular Expressions | [e]grep and sed, as well as conditional expressions when used with =~

Why Extended?

Extended is used to differentiate the system from the old regular expression notation called basic. The man re_format page has this to say:

 Regular expressions (``REs''), as defined in IEEE Std 1003.2 (``POSIX.2''), come in two forms: modern REs (roughly those of egrep(1); 1003.2 calls these ``extended'' REs) and obsolete REs (roughly those of ed(1); 1003.2 ``basic'' REs). Obsolete REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. IEEE Std 1003.2 (``POSIX.2'') leaves some aspects of RE syntax and semantics open; `=' marks decisions on these aspects that may not be fully portable to other IEEE Std 1003.2 (``POSIX.2'') implementations.

Similar but not the Same

Pattern Matching Notation is actually closely related to regular expressions but with some differences, as mentioned in section 2.13 of the IEEE Std 1003.1, 2013 Edition:

The pattern matching notation described in this section is used to specify patterns for matching strings in the shell. Historically, pattern matching notation is related to, but slightly different from, the regular expression notation described in XBD Regular Expressions. For this reason, the description of the rules for this pattern matching notation are based on the description of regular expression notation, modified to account for the differences.

The differences are mostly in the special operators:

Special Operators

Operator | Pattern Matching | Regular Expression
`*' | Matches any string, including the null string. | An atom followed by `*' matches a sequence of 0 or more matches of the atom.
`?' | Matches any single character. | An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
`+' | Matches itself. | An atom followed by `+' matches a sequence of 1 or more matches of the atom.
`.' | Matches itself. | Matches any single character.
`^' | Matches itself. | Matches the null string at the beginning of a line.
`$' | Matches itself. | Matches the null string at the end of a line.
`|' | Matches itself. | Indicates a branch (a way of saying ‘or’ between patterns).
`\' | Escapes the special characters used by the system. | Same as in pattern matching.
`{i,j}' | Matches itself. | An atom followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the atom. To be explained in more detail.
`[...]' (general) | Matches any one of the enclosed characters. | Same as in pattern matching.
`[...]' (range) | A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale’s collating sequence and character set, is matched. | Same as in pattern matching.
`[...]' (character classes) | Within [ and ], character classes can be specified using the syntax [:class:], where class is one of the following classes defined in the POSIX standard: alnum alpha ascii blank cntrl digit graph lower print punct space upper word xdigit. A character class matches any character belonging to that class. | Same as in pattern matching.
`[...]' (negation) | If the first character following the [ is a ! or a ^ then any character not enclosed is matched. | If the first character following the [ is a ^ then any character not enclosed is matched.
`-' inside `[...]' | A - may be matched by including it as the first or last character in the set. | Same as in pattern matching.
`]' inside `[...]' | A ] may be matched by including it as the first character in the set. | Same as in pattern matching.
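To see a few rows of the table in action, here is a minimal bash sketch. The helper names glob_match, ere_match, and report are made up for this illustration (they are not bash built-ins), and the regular expression is wrapped in ^( )$ so that both helpers test the whole string. It assumes bash, since [[ ]] and =~ are not part of POSIX sh.

```shell
#!/usr/bin/env bash
# glob_match, ere_match and report are hypothetical helpers for this demo.

# Succeeds if the whole string matches the pattern (pattern matching notation).
glob_match() {
    local string=$1 pattern=$2
    [[ $string == $pattern ]]   # unquoted right-hand side is treated as a pattern
}

# Succeeds if the whole string matches the extended regular expression.
ere_match() {
    local string=$1 regex="^($2)\$"   # anchored so the entire string must match
    [[ $string =~ $regex ]]
}

# Prints one line saying whether each notation accepts the string.
report() {
    local string=$1 text=$2 glob=no ere=no
    glob_match "$string" "$text" && glob=yes
    ere_match "$string" "$text" && ere=yes
    printf 'string=%-4s text=%-5s pattern:%-3s regex:%s\n' \
        "$string" "$text" "$glob" "$ere"
}

report abc 'a*'     # pattern: yes (a, then any string)    regex: no  (only a's allowed)
report ab  'a?'     # pattern: yes (a, then any one char)  regex: no  (an optional a only)
report a+  'a+'     # pattern: yes (+ matches itself)      regex: no  (one or more a's)
report abc 'a.c'    # pattern: no  (. matches itself)      regex: yes (. is any character)
report a   '[!a]'   # pattern: no  (! negates the set)     regex: yes (! is an ordinary member)
```

The same text is passed to both helpers on purpose: the point is that a pattern copied from one notation into the other can silently change meaning.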

You will notice that atoms were mentioned several times in the table. man re_format gives us some more information about what an atom is:

An atom is a regular expression enclosed in `()` (matching a match for the regular expression), an empty set of `()` (matching the null string)=, a bracket expression (see below), `.` (matching any single character), `^` (matching the null string at the beginning of a line), `$` (matching the null string at the end of a line), a `\` followed by one of the characters `^.[$()|*+?{\` (matching that character taken as an ordinary character), a `\` followed by any other character= (matching that character taken as an ordinary character, as if the `\` had not been present=), or a single character with no other significance (matching that character). A `{` followed by a character other than a digit is an ordinary character, not the beginning of a bound=. It is illegal to end an RE with `\`.
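As a rough illustration of atoms and bounds, the sketch below uses bash's =~ operator, which is an assumption about the shell in use. The variable names and the check helper are invented for this example.

```shell
#!/usr/bin/env bash
# All names below are illustrative only.

# (ab) is an atom: a regular expression enclosed in (). The bound {2,3}
# applies to the whole atom, so it asks for 2 to 3 copies of "ab".
bound_regex='^(ab){2,3}$'

# \+ is also an atom: a backslash followed by a special character,
# which matches that character taken as an ordinary character.
escaped_regex='^a\+b$'

# Hypothetical helper: prints "match" or "no match".
check() {
    local string=$1 regex=$2
    if [[ $string =~ $regex ]]; then echo "match"; else echo "no match"; fi
}

two_copies=$(check abab "$bound_regex")        # 2 copies of the atom
one_copy=$(check ab "$bound_regex")            # only 1 copy
four_copies=$(check abababab "$bound_regex")   # 4 copies
literal_plus=$(check 'a+b' "$escaped_regex")   # + taken literally
repeated_a=$(check aab "$escaped_regex")       # \+ no longer means "one or more"

echo "abab      against $bound_regex: $two_copies"
echo "ab        against $bound_regex: $one_copy"
echo "abababab  against $bound_regex: $four_copies"
echo "a+b       against $escaped_regex: $literal_plus"
echo "aab       against $escaped_regex: $repeated_a"
```

Only the first and fourth checks should report a match: the bound rejects one copy and four copies, and the escaped + stops acting as a repetition operator.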

The Applications

As mentioned earlier, pattern matching is used in pathname and parameter expansion. However, each use places its own restrictions on how a pattern may match. Let’s take a closer look:

Pathname Expansion

The man bash page tells us the following:

Pathname Expansion
    After word splitting, unless the -f option has been set, bash scans each word for the characters *, ?, and [. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern.

Special Note on Pathname Expansion

One thing to note about pathname expansion is that the whole filename must match the pattern, not just a part of it. For example, if we have a file named test, the pattern te? won’t match: it only accounts for the first three characters, tes, and leaves the final t unmatched. To get a match we would have to use tes? or te* instead. As we will see, this is not the case with parameter expansion.
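The whole-filename rule can be checked with a small bash sketch. It assumes a scratch directory created with mktemp -d, and count_matches is a hypothetical helper written for this example, not something bash provides.

```shell
#!/usr/bin/env bash
# Scratch directory holding a single file named "test".
workdir=$(mktemp -d)
touch "$workdir/test"

# count_matches is a hypothetical helper: it prints how many existing
# files in $workdir the given pattern expands to.
count_matches() {
    local pattern=$1 count=0 path
    # $pattern is left unquoted on purpose so that pathname expansion happens.
    for path in "$workdir"/$pattern; do
        # An unmatched pattern is left as-is, so check that the file exists.
        [[ -e $path ]] && count=$((count + 1))
    done
    echo "$count"
}

too_short=$(count_matches 'te?')   # te + one character: only 3 characters long
exact=$(count_matches 'tes?')      # tes + one character: the whole name "test"
star=$(count_matches 'te*')        # te + any string: the whole name "test"
partial=$(count_matches 'es')      # part of the name only, so nothing matches

echo "te?  matched $too_short file(s)"
echo "tes? matched $exact file(s)"
echo "te*  matched $star file(s)"
echo "es   matched $partial file(s)"

rm -rf "$workdir"   # clean up the scratch directory
```

The counts are captured in variables before the directory is removed, so the results can still be inspected afterwards.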

Parameter Expansion

There are a few cases where pattern matching can be used in parameter expansion.

Format | Pattern must match | Shortest or longest match | Effect
${parameter#word} | The beginning of the expanded value of parameter | Shortest | Deletes the matched string
${parameter##word} | The beginning of the expanded value of parameter | Longest | Deletes the matched string
${parameter%word} | A trailing portion of the expanded value of parameter | Shortest | Deletes the matched string
${parameter%%word} | A trailing portion of the expanded value of parameter | Longest | Deletes the matched string
${parameter/pattern/string} | Anywhere in the expanded value of parameter (not only the beginning or end) | Longest | Substitutes the matched portion with string

Unlike pathname expansion, which must match a whole filename, parameter expansion has no such restriction, because the match is made within a string. However, notice that when # is used the match must be at the beginning of the parameter’s value, whereas when % is used the match must be at the end. The substitution format, ${parameter/pattern/string}, has neither restriction.
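A short bash sketch of these forms follows. The value of path is a made-up example, and the variable names are only there to label each case; the comments show the value each expansion should produce for that input.

```shell
#!/usr/bin/env bash
path=/usr/local/lib/libfoo.so.1   # hypothetical example value

shortest_prefix=${path#*/}    # usr/local/lib/libfoo.so.1  (removes the leading "/")
longest_prefix=${path##*/}    # libfoo.so.1                (removes up to the last "/")
shortest_suffix=${path%.*}    # /usr/local/lib/libfoo.so   (removes ".1")
longest_suffix=${path%%.*}    # /usr/local/lib/libfoo      (removes ".so.1")

# "local" occurs inside the value but not at its beginning, so # removes nothing.
not_at_start=${path#local}    # /usr/local/lib/libfoo.so.1

# The substitution form may match anywhere in the value.
anywhere=${path/local/opt}    # /usr/opt/lib/libfoo.so.1

# The substitution form takes the longest match: l*b runs from the first "l"
# to the last "b" in the value.
longest_sub=${path/l*b/X}     # /usr/Xfoo.so.1

printf '%s\n' "$shortest_prefix" "$longest_prefix" "$shortest_suffix" \
    "$longest_suffix" "$not_at_start" "$anywhere" "$longest_sub"
```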

The bash man page tells us the following:

${parameter#word}
${parameter##word}
    The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches the beginning of the value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``#'' case) or the longest matching pattern (the ``##'' case) deleted. If parameter is @ or *, the pattern removal operation is applied to each positional parameter in turn, and the expansion is the resultant list. If parameter is an array variable subscripted with @ or *, the pattern removal operation is applied to each member of the array in turn, and the expansion is the resultant list.

${parameter%word}
${parameter%%word}
    The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches a trailing portion of the expanded value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``%'' case) or the longest matching pattern (the ``%%'' case) deleted. If parameter is @ or *, the pattern removal operation is applied to each positional parameter in turn, and the expansion is the resultant list. If parameter is an array variable subscripted with @ or *, the pattern removal operation is applied to each member of the array in turn, and the expansion is the resultant list.

${parameter/pattern/string}
    The pattern is expanded to produce a pattern just as in pathname expansion. Parameter is expanded and the longest match of pattern against its value is replaced with string. If pattern begins with /, all matches of pattern are replaced with string. Normally only the first match is replaced. If pattern begins with #, it must match at the beginning of the expanded value of parameter. If pattern begins with %, it must match at the end of the expanded value of parameter. If string is null, matches of pattern are deleted and the / following pattern may be omitted. If parameter is @ or *, the substitution operation is applied to each positional parameter in turn, and the expansion is the resultant list. If parameter is an array variable subscripted with @ or *, the substitution operation is applied to each member of the array in turn, and the expansion is the resultant list.
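The variations described in the excerpt above can be sketched in bash as follows. The values banana and the file names are invented for the example, and the comments show the value each expansion should produce.

```shell
#!/usr/bin/env bash
name=banana   # hypothetical example value

first_only=${name/a/A}      # bAnana  (normally only the first match is replaced)
every_match=${name//a/A}    # bAnAnA  (pattern begins with /: all matches)
at_start=${name/#b/B}       # Banana  (pattern begins with #: must match at the beginning)
start_miss=${name/#a/A}     # banana  (the value does not begin with "a")
at_end=${name/%a/A}         # bananA  (pattern begins with %: must match at the end)
deleted=${name//an}         # ba      (null string: matches are deleted)

# With an array subscripted with @, the operation is applied to each member.
files=(notes.txt todo.txt build.log)
stems=("${files[@]%.*}")    # notes todo build

printf '%s\n' "$first_only" "$every_match" "$at_start" "$start_miss" \
    "$at_end" "$deleted" "${stems[*]}"
```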

Extended Regular Expressions

Alongside egrep, sed, and other external utilities, extended regular expressions can also be used in bash conditional expressions ([[ ... ]]):

An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3))
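Here is a minimal sketch of =~ in a conditional expression. The version string and variable names are hypothetical. The sketch also relies on the BASH_REMATCH array, which bash fills with the matched text and the parenthesized subexpressions; that array is not mentioned in the excerpt above.

```shell
#!/usr/bin/env bash
version=release-12.4   # hypothetical example value

# Keeping the regular expression in a variable avoids quoting surprises
# on the right-hand side of =~.
version_regex='^release-([0-9]+)\.([0-9]+)$'

major='' minor=''
if [[ $version =~ $version_regex ]]; then
    major=${BASH_REMATCH[1]}   # first parenthesized subexpression
    minor=${BASH_REMATCH[2]}   # second parenthesized subexpression
fi

# Without ^ and $, a regular expression may match just part of the string.
digits=''
[[ $version =~ [0-9]+ ]] && digits=${BASH_REMATCH[0]}   # leftmost run of digits

echo "major=$major minor=$minor first digits=$digits"
```

For this input the sketch should print major=12 minor=4 first digits=12.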

Next Steps

  1. Practical Explorations of the Differences Between Pattern Matching Notation Used in Pathname and Parameter Expansion and Extended Regular Expressions
  2. A Table of Practical Matching Differences Between Pattern Matching Notation Used in Pathname and Parameter Expansion and Extended Regular Expressions

References

  1. IEEE Std 1003.1, 2013 Edition
  2. man bash page
  3. man re_format page (FreeBSD version)

Ahmed Amayem has written 90 articles

A Web Application Developer Entrepreneur.