Practical Regex Building: The Regex to Match the Files part of a Unix-Like Command (ls)

This post is part of a practical series on regex (regular expressions) in a bash shell.

Setup

Let’s make a shell script. In your favourite editor type

#!/bin/bash

And save it somewhere as test.sh. Now we need to make it executable as follows:

[ahmed@amayem ~]$ chmod +x ./test.sh
[ahmed@amayem ~]$ ./test.sh 

Looks good so far.

Getting the Passed Parameters

This is done with the $* or $@ parameters.

echo $*

Let’s run it with a couple of parameters and see what happens:

[ahmed@amayem ~]$ ./test.sh -A checking
-A checking

Looks good.
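
The two parameters behave differently once an argument contains a space. The following is a minimal sketch of that difference; count_args is a made-up helper and set -- is only used to simulate the command line, neither is part of test.sh:

```shell
#!/bin/bash
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

# Simulate running: ./test.sh -A "my file"
set -- -A "my file"

unquoted=$(count_args $*)   # word splitting yields three words: -A, my, file
quoted=$(count_args "$@")   # every original argument is kept intact
joined="$*"                 # one string, arguments joined by a space

echo "unquoted: $unquoted, quoted: $quoted"
echo "joined: $joined"
```

Since we want to run a regex over the whole command line as one string, joining everything with $* is what we need in this post.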

Processing the Passed Parameters

We will be building the regex step by step. If you wish to see the end result, skip to the summary.

Using sed

We will be using sed for preprocessing because it gives us the ability to change the string. For more on sed check ShellTree 1: Analyzing a one line implementation. We will be using the substitute format which looks like this:

s/pattern/replacement/

It means that the substring that matches the pattern is replaced with replacement.
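
One detail worth seeing in action: without any flags, only the first match on each line is replaced. A small sketch, with a made-up input string:

```shell
#!/bin/bash
line="a-b-c"

# Without a flag only the first dash is replaced.
first_only=$(printf '%s\n' "$line" | sed "s/-/+/")

# The g flag replaces every match on the line.
every=$(printf '%s\n' "$line" | sed "s/-/+/g")

echo "$first_only"   # a+b-c
echo "$every"        # a+b+c
```

We only ever want one substitution per line in this post, so the plain form is enough.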

Finding the regex for the files

Specifying the Lack of a Dash in the Beginning of a Word

The beginning of the files part is when we meet a word that doesn’t start with a dash. So let’s indicate the start of a word without a dash by using [^-], which matches any character other than a dash. We will follow it with the bracket expression [[:alnum:]], built from the character class [:alnum:], to match any letter or digit.

echo $* | sed -E  "s/[^-][[:alnum:]]*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-x -q -A i -i

It ignores the first dash but takes the letters after it till it hits a space. We need to indicate that the absence of a dash is at the beginning of a word, so we will use the [[:<:]] bracket expression, which matches the beginning of a word. To use [[:<:]] we need extended regular expressions, enabled with the -E flag (depending on the version of sed you have, you may have to use a different option such as -r; check man sed).

echo $* | sed -E  "s/[^-][[:<:]][[:alnum:]]*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -Ax -i

Things are looking better. The i was substituted but nothing after it. Also notice that file names may contain special characters, and that someone may pass an unusual file name such as the -i at the end.

Generalizing the Characters that are in the Files part

Instead of [[:alnum:]] let’s use . to indicate any character. We will, of course, need a * after the . to indicate zero or more.

echo $* | sed -E  "s/[^-][[:<:]].*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -Ax

We have success.

Investigating [[:<:]] in the Beginning of a String

Let’s try a different case:

[ahmed@amayem ~]$ ./test.sh i
i

Hmm that’s not good. It seems that [[:<:]] is not matching the string when it is the beginning of the line. We can see this more clearly here:

[ahmed@amayem ~]$ ./test.sh i i
ix

So it’s clearly getting the second i, just not the first. Let’s add the special case of having it in the beginning of a line, by using the ^, which matches the beginning of a line. We will make two cases by using | which means or when used in a regex:

echo $* | sed -E  "s/^[^-].*|[^-][[:<:]].*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh i i
x
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -Ax

Great now it’s working.

Maintaining Portability by Finding an Alternative to [[:<:]]

We would like this to be portable, and when we look at the man re_format page we see the following:

There are two special cases of bracket expressions: the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at the beginning and end of a word respectively. A word is defined as a sequence of word characters which is neither preceded nor followed by word characters. A word character is an alnum character (as defined by ctype(3)) or an underscore. This is an extension, compatible with but not specified by IEEE Std 1003.2 (``POSIX.2''), and should be used with caution in software intended to be portable to other systems.

So let’s try to do it without [[:<:]]. The way we can find out if it is a beginning of a word is if it is preceded by a whitespace character, which can be any one of the characters in [:space:]:

echo $* | sed -E  "s/^[^-].*|[[:space:]][^-].*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -Ax
[ahmed@amayem ~]$ ./test.sh i i
x

Great, that’s exactly what we wanted.
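
To see exactly which part of the line the regex grabs, we can put & in the replacement, which sed expands to the matched text. This is only a sketch for inspecting the match; mark_files is a made-up helper name and the sample inputs are the ones used above plus one with no files:

```shell
#!/bin/bash
# mark_files is a hypothetical helper: it wraps whatever the regex matched in <...>.
mark_files() {
  printf '%s\n' "$1" | sed -E "s/^[^-].*|[[:space:]][^-].*/<&>/"
}

with_options=$(mark_files "-ia -q -A i -i")   # files after some options
files_only=$(mark_files "i i")                # no options at all
options_only=$(mark_files "-ia -q")           # no files, so nothing is matched

echo "$with_options"   # -ia -q -A< i -i>
echo "$files_only"     # <i i>
echo "$options_only"   # -ia -q
```

Notice that in the first case the match includes the space in front of the first file, because [[:space:]] is part of the pattern.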

Attempting to Simplify the Regex

I am wondering if we can make the regex simpler by removing the | and simply adding a * after the [:space:] to allow it to match the beginning of the line when there is no space:

echo $* | sed -E  "s/[[:space:]]*[^-].*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh  i -i
x
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-x

It deals with the first case properly, but the second case goes wrong. The * after [[:space:]] allows the whitespace part to match zero characters, so the pattern can start at any non-dash character, and sed takes the leftmost one: the i right after the first dash. That effectively removes the restriction that the non-dash character must be at the beginning of a word. Therefore we have to return to our previous pattern:

echo $* | sed -E  "s/^[^-].*|[[:space:]][^-].*/x/"
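
The same & trick makes the failure of the simplified pattern easy to see. This is a small sketch with the sample line hard-coded:

```shell
#!/bin/bash
line="-ia -q -A i -i"

# [[:space:]]* may match nothing, so the match starts at the first non-dash character.
simplified=$(printf '%s\n' "$line" | sed -E "s/[[:space:]]*[^-].*/<&>/")

# The pattern with the alternation only starts after whitespace (or at the start of the line).
original=$(printf '%s\n' "$line" | sed -E "s/^[^-].*|[[:space:]][^-].*/<&>/")

echo "$simplified"   # -<ia -q -A i -i>
echo "$original"     # -ia -q -A< i -i>
```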

Trying to Use Basic Regular Expressions

The problem we face is the |. According to the man re_format page, the | is an ordinary character in basic regular expressions:

Obsolete (``basic'') regular expressions differ in several respects. `|' is an ordinary character and there is no equivalent for its functionality.

So we are stuck with extended only, unless we can find another regex that is compatible with ‘basic’ regular expressions.
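
One possible way around the missing | (a sketch of my own, not something taken from the man pages) is to let sed run two basic-regex substitutions in sequence, one per alternative. The helper name strip_files_bre is made up for illustration:

```shell
#!/bin/bash
# strip_files_bre is a hypothetical helper: it removes the files part using
# basic regular expressions only, one sed command per alternative.
strip_files_bre() {
  printf '%s\n' "$1" | sed -e 's/^[^-].*//' -e 's/[[:space:]][^-].*//'
}

with_options=$(strip_files_bre "-ia -q -A i -i")
files_only=$(strip_files_bre "i i")
options_only=$(strip_files_bre "-ia -q")

echo "[$with_options]"   # [-ia -q -A]
echo "[$files_only]"     # []
echo "[$options_only]"   # [-ia -q]
```

If the first command matches, it empties the whole line, so the second one has nothing left to match. For the inputs tried here that gives the same result as the extended regex with the |.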

Using the Matched Files to Isolate the Files Part

Now that we can match the files part, we can easily isolate the options part by substituting the match with nothing instead of x (or you can use the regex we built for the files part here):

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -A

Once we have the string of the options we can use that as the pattern again to remove that side of the string:

options=$(echo $* | sed -E  "s/^[^-].*|[[:space:]][^-].*//")
files=$(echo $* | sed  "s/$options//")
echo $options
echo $files

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -A
i -i
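
Using $options as a sed pattern means its contents are read as a regex, and something like a / inside an option would break the s command. As a hedged alternative, here is a sketch that strips the options as literal text using bash’s prefix removal instead of a second sed call. The function split_args and the global variables it sets are made up for this example:

```shell
#!/bin/bash
# split_args is a hypothetical helper: it sets the globals "options" and "files".
split_args() {
  local all="$*"
  options=$(printf '%s\n' "$all" | sed -E "s/^[^-].*|[[:space:]][^-].*//")
  files=${all#"$options"}   # quoted, so $options is treated as literal text
  files=${files#" "}        # drop the separating space, if there is one
}

split_args -ia -q -A i -i
echo "options: [$options] files: [$files]"

split_args i i              # no options at all
echo "options: [$options] files: [$files]"

split_args -I/usr file      # an option containing a slash
echo "options: [$options] files: [$files]"
```

This also copes with the case where there are no options, since removing an empty prefix simply leaves the string as it is.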

Note on the Portability of sed

The issue with this approach is that sed complains if we give it an empty pattern, which is what happens when there are no options. Furthermore, the option for enabling extended regular expressions differs between versions: for example, the FreeBSD version uses the -E flag, whereas the GNU version uses -r (newer GNU sed releases also accept -E). The FreeBSD man sed page says:

The -E, -a and -i options are non-standard FreeBSD extensions and may not be available on other operating systems.

Let’s use a variant that would be more portable such as egrep.

Variants

egrep

We can use the -o option of egrep and use the regex mentioned earlier as follows:

files=$(echo $* | egrep -o  "^[^-].*|[[:space:]][^-].*")

echo $files

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
i -i
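
A couple of details are hidden by the unquoted echo above, so here is a small sketch that keeps the quotes. It uses grep -E, which is the same thing as egrep, and assumes a grep that supports -o (both GNU and BSD grep do, but it is not in POSIX):

```shell
#!/bin/bash
regex="^[^-].*|[[:space:]][^-].*"

# The raw match still contains the space that preceded the first file.
raw=$(printf '%s\n' "-ia -q -A i -i" | grep -E -o "$regex")

# Strip the leading whitespace explicitly instead of relying on word splitting.
trimmed=$(printf '%s\n' "$raw" | sed 's/^[[:space:]]*//')

# With no files there is no match: grep prints nothing and exits with status 1.
none=$(printf '%s\n' "-ia -q" | grep -E -o "$regex" || true)

echo "[$raw]"       # [ i -i]
echo "[$trimmed]"   # [i -i]
echo "[$none]"      # []
```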

Bash Parameter Expansion

bash provides some string manipulation functionality in the form of parameter expansion. The following is from the man bash pages:

${parameter#word}
${parameter##word}
        The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches the beginning of the value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``#'' case) or the longest matching pattern (the ``##'' case) deleted.

As well as the following:

${parameter%word}
${parameter%%word}
        The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches a trailing portion of the expanded value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``%'' case) or the longest matching pattern (the ``%%'' case) deleted.

As well as the following:

${parameter/pattern/string}
        The pattern is expanded to produce a pattern just as in pathname expansion. Parameter is expanded and the longest match of pattern against its value is replaced with string. If pattern begins with /, all matches of pattern are replaced with string. Normally only the first match is replaced.

The key phrase we should be checking here is, The pattern is expanded to produce a pattern just as in pathname expansion. Pathname expansion patterns are not the same as the regular expressions we have been using thus far, so we cannot reuse our regex here.
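
To get a feel for what these pathname-style patterns look like, here is a short sketch on a made-up file name. In these patterns * means any string and . is just a literal dot, unlike in a regex:

```shell
#!/bin/bash
name="archive.tar.gz"

short_prefix=${name#*.}    # shortest leading match of "*." removed: tar.gz
long_prefix=${name##*.}    # longest leading match of "*." removed:  gz
short_suffix=${name%.*}    # shortest trailing match of ".*" removed: archive.tar
long_suffix=${name%%.*}    # longest trailing match of ".*" removed:  archive
first_dot=${name/./_}      # first dot replaced:  archive_tar.gz
all_dots=${name//./_}      # every dot replaced:  archive_tar_gz

echo "$short_prefix $long_prefix $short_suffix $long_suffix"
echo "$first_dot $all_dots"
```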

Expr

The command expr is another command that has some regular expression abilities. However, we can’t really use it here because it only works with basic regular expressions and we are using extended regular expressions.

expr1 : expr2
         The ``:'' operator matches expr1 against expr2, which must be a regular expression. The regular expression is anchored to the beginning of  the string with an implicit ``^''. expr expects "basic" regular expressions, see re_format(7) for more information on regular expressions.
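
For completeness, here is a small sketch of how the : operator behaves, using made-up strings rather than our ls arguments. It shows the count that is printed by default, a captured group, and the effect of the implicit ^:

```shell
#!/bin/bash
# Without a group, expr prints the number of characters matched.
count=$(expr "abc123" : '[[:alpha:]]*')

# With a \( \) group (basic regex syntax), expr prints the captured text instead.
captured=$(expr "abc123" : '\([[:alpha:]]*\)')

# The match is anchored at the start, so the letters after the digits are not found.
# expr prints 0 and returns a non-zero status here, hence the "|| true".
anchored=$(expr "123abc" : '[[:alpha:]]*' || true)

echo "$count"      # 3
echo "$captured"   # abc
echo "$anchored"   # 0
```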

Summary

The regex that matches the files part of the ls command is the following:

^[^-].*|[[:space:]][^-].*

The files part can be removed using sed:

sed -E "s/^[^-].*|[[:space:]][^-].*//"

The files part can be isolated using egrep:

egrep -o "^[^-].*|[[:space:]][^-].*"

References

  1. man re_format page (FreeBSD version)
  2. man sed page (FreeBSD version)
  3. man bash page
  4. man egrep page (FreeBSD version)
  5. man expr page (FreeBSD version)

Ahmed Amayem has written 90 articles

A Web Application Developer Entrepreneur.