When using regular expressions, the * special operator is very useful. An expression followed by * matches a sequence of zero or more matches of that expression. When the whole expression is followed by a *, so that the entire pattern is allowed to match the empty string, we get some interesting behaviour that people may not expect. We will explore that behaviour in this post.
Method of Testing
We will be using the sed command, which is a stream editor. We will echo a string and pipe it to sed, where we will substitute the letter x for the first match:
[ahmed@amayem ~]$ echo aaa | sed 's/a/x/'
xaa
The s in s/a/x/ indicates a substitution. The a is the regex pattern, and the x is what we want to substitute into the string in place of the matched part. Without the g flag, sed replaces only the first match on each line, which is why only the first a became x. For more on sed, check ShellTree 1: Analyzing a one Line Command Implementation.
Exploration
Let’s begin with a simple example:
[ahmed@amayem ~]$ echo aaa | sed 's/a*/x/'
x
As expected, the regex a* matched the whole string aaa, because a regex is supposed to find the longest match. However, what happens when we put a different letter at the beginning?
[ahmed@amayem ~]$ echo Aaaa | sed 's/a*/x/'
xAaaa
It didn’t match the aaa; instead it matched only the null string at the beginning of the string.
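To see that the match really is empty, here is a small sketch on an input that contains no a at all. The string bbb and the variable name no_a are illustrative assumptions and do not come from the examples above:

```shell
# 'bbb' contains no 'a', yet a* still matches: zero a's at position 0.
# The empty match is replaced by x, so x is simply inserted at the front.
no_a=$(echo bbb | sed 's/a*/x/')
echo "$no_a"    # prints xbbb
```

Nothing is removed from the input, because the matched substring has length zero.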
Explanation
We find the following excerpt in the re_format man page:
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later. Note that higher-level subexpressions thus take priority over their lower-level component subexpressions.
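The key is that the matching gives precedence to the earliest-starting match, not the longest one; the longest-match rule only chooses among matches that start at the same position. Since * matches the expression before it zero or more times, the null string at the very beginning of the string is a valid match. It starts earlier than aaa, so it wins.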
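If the intent is to match a run of at least one a, one possible workaround is to require one a explicitly by writing aa* instead of a*. This is only a sketch of that idea (the variable names are made up for illustration), not the only way to do it:

```shell
# a* can match the empty string, so the earliest match is at position 0.
empty_ok=$(echo Aaaa | sed 's/a*/x/')
echo "$empty_ok"     # prints xAaaa

# aa* needs at least one 'a', so no match can start at 'A'.
# The earliest possible start is the first 'a', and from there
# the longest match (aaa) is taken.
at_least_one=$(echo Aaaa | sed 's/aa*/x/')
echo "$at_least_one" # prints Ax
```

With aa* the pattern can no longer match the empty string, so the earliest match is pushed forward to the first a, and the longest-match rule then picks up the whole run.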
References
re_format man page (FreeBSD version)