Solving the Mystery of the Regular Expression Special Operator Star * Not Matching Anything (Matching the Null String)

When using regular expressions, the * special operator is very useful. An expression followed by * matches a sequence of 0 or more matches of that expression. When the whole expression is followed by a *, we get some interesting behaviour that people may not expect. We will explore that behaviour in this post.

Method of Testing

We will be using the sed command, which is a stream editor. We will echo a string and pipe it to sed, where we will replace the first match of the pattern with the letter x:

[ahmed@amayem ~]$ echo aaa | sed 's/a/x/'
xaa

The s in s/a/x/ indicates a substitution. The a is the regex pattern, and the x is what we want to substitute into the string in place of the matched part. Only the first match on the line is replaced, which is why the output is xaa and not xxx. For more on sed check ShellTree 1: Analyzing a one Line Command Implementation
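To make the "first match only" behaviour concrete, here is a small sketch that compares the plain substitution with the same command using the standard g flag, which replaces every match on the line. The variable names first_only and all_matches are only illustrative.

```shell
# Without the g flag, sed replaces only the first match on the line.
first_only=$(echo aaa | sed 's/a/x/')

# With the g flag, sed replaces every match on the line.
all_matches=$(echo aaa | sed 's/a/x/g')

echo "$first_only"    # xaa
echo "$all_matches"   # xxx
```

The rest of this post leaves the g flag off, so each command shows exactly which single match the regex picked.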

Exploration

Let’s begin with a simple example:

[ahmed@amayem ~]$ echo aaa | sed 's/a*/x/'
x

As expected, the regex a* matched the whole string aaa: among the matches that start at the earliest possible position, a regex takes the longest one. However, what happens when we put a different letter at the beginning?

[ahmed@amayem ~]$ echo Aaaa | sed 's/a*/x/'
xAaaa

It didn’t match the aaa. Instead, it matched only the null string at the beginning of the string, so the x was inserted in front of the A and nothing was removed.
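One way to see exactly what was matched is to use & in the replacement, which sed expands to the matched text. The following is a small sketch along those lines; the square brackets are just visual markers around the match, and the variable names whole and empty are made up for illustration.

```shell
# & in the replacement stands for the text that the pattern matched.
# Here a* matches all of "aaa", so the brackets wrap the whole string.
whole=$(echo aaa | sed 's/a*/[&]/')

# Here a* matches the null string before the "A", so the brackets are empty.
empty=$(echo Aaaa | sed 's/a*/[&]/')

echo "$whole"   # [aaa]
echo "$empty"   # []Aaaa
```

The empty pair of brackets in the second result is the null string match made visible.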

Explanation

We find the following excerpt from the man re_format page:

In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later. Note that higher-level subexpressions thus take priority over their lower-level component subexpressions.

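The key is that matching gives precedence to the earliest match, not the longest one. The longest-match rule only chooses among matches that start at that same earliest position. Since * matches the expression before it zero or more times, a* matches the null string at the very beginning of the string, before the A. That is the earliest possible match, and no longer match starts there because the A is not an a. So the null string wins, and the aaa further along is never considered.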
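If the intent was to match the run of a characters, one possible fix is to use a pattern that cannot match the null string, so that the earliest match has to start at the first a. The sketch below shows two ways to require at least one a in a POSIX basic regular expression; the variable names at_least_one and interval are only illustrative.

```shell
# "aa*" means one a followed by zero or more a's, so it needs at least one a.
at_least_one=$(echo Aaaa | sed 's/aa*/x/')

# The interval expression \{1,\} says "one or more" of the preceding a.
interval=$(echo Aaaa | sed 's/a\{1,\}/x/')

echo "$at_least_one"   # Ax
echo "$interval"       # Ax
```

In both cases the null string before the A no longer qualifies, so the earliest match is the aaa, and the longest-match rule then takes all three letters.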

References

  1. man re_format page (FreeBSD version)

Ahmed Amayem has written 90 articles

A Web Application Developer Entrepreneur.