Practical Regex Building: The Regex to Match the Options part of a Unix-Like Command (ls)

This post is part of a practical series on regex (regular expressions) in a bash shell.

Setup

Let’s make a shell script. In your favourite editor type

#!/bin/bash

And save it somewhere as test.sh. Now we need to make it executable as follows:

[ahmed@amayem ~]$ chmod +x ./test.sh
[ahmed@amayem ~]$ ./test.sh

Looks good so far.

Getting the Passed Parameters

This is done with the $* or $@ parameters.

echo $*

Let’s run it with a couple of parameters and see what happens:

[ahmed@amayem ~]$ ./test.sh -A checking
-A checking
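As a small sketch of what $* does, the hypothetical function show_params below stands in for test.sh (a function receives its own positional parameters the same way a script does). It uses printf rather than echo, on the assumption that a parameter such as -n should be printed instead of being interpreted by echo as one of its own options:

```shell
# show_params is a hypothetical stand-in for test.sh.
show_params() {
    # "$*" joins all positional parameters into one string, separated by spaces.
    printf '%s\n' "$*"
}

joined=$(show_params -A checking)
echo "$joined"    # prints: -A checking
```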

Looks good.

Processing the Passed Parameters

We will be building the regex step by step. If you wish to see the end result, move on to the summary.

Using sed

We will be using sed for preprocessing because it gives us the ability to change the string. For more on sed check ShellTree 1: Analyzing a one line implementation. We will be using the substitute format which looks like this:

s/pattern/replacement/

It means that the substring that matches the pattern is replaced with replacement.
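A minimal illustration of the substitute format, using a made-up input string that is not part of the script we are building:

```shell
# Replace the first substring that matches the pattern "world" with "there".
result=$(echo "hello world" | sed 's/world/there/')
echo "$result"    # prints: hello there
```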

Deciding what will be Legal Options

Every command has a set of legal options. Let’s arbitrarily choose some letters as legal options:

legalOptions="Aai1l"

Let’s start building the regex.

Finding the regex for the Options

Starting Small

We have two possible parts of the parameters, the options and the files. The options are the first part and the files are the second part. The way we know that something is an option is that it starts with a dash. So let’s start by searching for a dash. As for the characters that come after the dash, let’s just deal with an i in the beginning. We will begin by trying to substitute the options with an x. Add the following line to test.sh:

echo $* | sed s/[-i]/x/

Let’s run it as follows:

[ahmed@amayem ~]$ ./test.sh -i checking
xi checking

It seems that only the - is replaced. This is because [-i] is a bracket expression that matches a single character, either - or i, and sed only replaces the first match. If we want it to replace all matches then we have to use the g flag at the end of the substitution as follows:

echo $* | sed s/[-i]/x/g

It gives us the following:

[ahmed@amayem ~]$ ./test.sh -i checking
xx checkxng

The options were substituted but the file portion was also affected: checkxng. We want to leave the file portion untouched, so the pattern should match the whole options part in one go instead of being applied repeatedly along the line. Hence we should remove the g flag and instead add a * after the [-i], which means that any of the characters - and i can be repeated zero or more times:

echo $* | sed s/[-i]*/x/

It gives us the following:

[ahmed@amayem ~]$ ./test.sh -i checking
x checking
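The three attempts can be put side by side in a short sketch. The variable names are only for illustration, and the patterns are quoted here so the shell passes them to sed untouched:

```shell
input="-i checking"

# No flag: only the first character that matches [-i] is replaced.
first=$(echo "$input" | sed 's/[-i]/x/')
# g flag: every - or i on the line is replaced, including the one in the file name.
global=$(echo "$input" | sed 's/[-i]/x/g')
# * quantifier: the leading run of - and i characters is replaced as a single match.
run=$(echo "$input" | sed 's/[-i]*/x/')

echo "$first"     # xi checking
echo "$global"    # xx checkxng
echo "$run"       # x checking
```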

That looks much better. All of the options part was replaced with an x.

Matching More than one Dashed Options

Let’s try adding some other options:

[ahmed@amayem ~]$ ./test.sh -ia -A checking
xa -A checking

Nope, that’s because we only specified the i as a legal option; we need to add some more. Let’s add the legalOptions we specified earlier:

echo $* | sed s/[-$legalOptions]*/x/

It gives us the following:

[ahmed@amayem ~]$ ./test.sh -ia -A checking
x -A checking

It replaced the first -ia but didn’t recognize the -A afterwards because there is a space in between. We need to add a space to the pattern. Let’s use the character class [:space:] inside the bracket expression, which matches any whitespace character.

echo $* | sed s/[-[:space:]$legalOptions]*/x/

It gives us the following:

[ahmed@amayem ~]$ ./test.sh -ia -A checking
xchecking
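To see the effect of adding whitespace to the bracket expression, here is a sketch that runs both patterns on the same input (the variable names no_space and with_space are only for illustration):

```shell
legalOptions="Aai1l"
input="-ia -A checking"

# Without whitespace in the brackets the match stops at the first space.
no_space=$(echo "$input" | sed "s/[-$legalOptions]*/x/")
# With [:space:] inside the brackets the match runs through both option groups.
with_space=$(echo "$input" | sed "s/[-[:space:]$legalOptions]*/x/")

echo "$no_space"      # x -A checking
echo "$with_space"    # xchecking
```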

That is much better. We are isolating the whole options part, whether it is multiple characters after a dash or multiple dashed groups. However, what will happen if the file name is formed of letters from the legal options?

Limiting the Regex to Match Only Words Starting with a Dash and Switching to Extended Regular Expressions

[ahmed@amayem ~]$ ./test.sh -ia -A i
x

That’s not good. We need it to check for only the words that start with a dash. For this we will have to put the dash outside the square brackets to indicate that it is the start of a word.

echo $* | sed -E s/-[[:space:]$legalOptions]*/x/

⇩

[ahmed@amayem ~]$ ./test.sh -ia -A i
x-A i

We went back to only getting the first dashed option. We need a way to maintain the order of a dash, followed by the characters allowed for options, followed by whitespace, and then allow that whole sequence to be repeated zero or more times. To accomplish this we will use extended regular expressions by adding the -E flag (depending on your version of sed you may need a different option such as -r; check man sed), put the sequence inside parentheses, and add a * at the end as follows:

echo $* | sed -E "s/(-[$legalOptions]*[:space:])*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -A i
x -A i

Again it only replaced the first option, which means that the space is not being matched even though we have [:space:] specified. Let’s try to figure out this mystery, by replacing [:space:] with an actual space:

echo $* | sed -E "s/(-[$legalOptions]* )*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -A i
xi

Looks like it did just what we wanted, which means that [:space:] isn’t being expanded properly, even though it was earlier. Perhaps it is a quoting issue? Let’s try unquoting that part:

echo $* | sed -E "s/(-[$legalOptions]*"[:space:]")*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -A i
x -A i

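Nope, no use. The real cause is that [:space:] is a character class, and a character class is only recognized inside a bracket expression. Written on its own, [:space:] is itself parsed as a bracket expression that matches a single one of the characters :, s, p, a, c or e, so it never matches a space (in the runs above it matched the a of -ia instead, and GNU sed rejects this form with an error). We just need to put [:space:] inside square brackets so that it matches any whitespace character, and then put a * after it: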

echo $* | sed -E "s/(-[$legalOptions]*[[:space:]]*)*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -A i
xi
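A quick sketch of the working form, assuming the same legalOptions value as before. Because [[:space:]] matches any whitespace character, a tab between the option groups is handled the same way as a space, which a literal space in the pattern would not do:

```shell
legalOptions="Aai1l"
pattern="s/(-[$legalOptions]*[[:space:]]*)*/x/"

spaces=$(echo "-ia -A i" | sed -E "$pattern")
# printf is used here to produce a real tab between the two option groups.
tabs=$(printf -- '-ia\t-A i\n' | sed -E "$pattern")

echo "$spaces"    # xi
echo "$tabs"      # xi
```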

Worked beautifully. Now we need to see what will happen if I add illegal options:

Matching Legal and Illegal Options

[ahmed@amayem ~]$ ./test.sh -ia -q -A i
xq -A i

Argh! I have to worry about legal and illegal options. If it’s an illegal option then I still have to isolate it as part of the options. Options can be alphanumeric but they could also be special characters like @ or !. Let’s add all the possibilities:

echo $* | sed -E "s/(-[[:alnum:][:punct:]]*[[:space:]]*)*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i
xi

Wonderful. However, perhaps a better way of indicating the possible characters that can be in options is to specify any character that is not whitespace, since whitespace is what separates the options and files.

echo $* | sed -E "s/(-[^[:space:]]*[[:space:]]*)*/x/"

It gives us the same result as before. Now let’s check if it keeps files that are in the same format as the options:

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
xi -i
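The two ways of describing option characters can be compared directly. This is only a sketch on the sample input used in this post; the variable names listed and negated are made up for the comparison:

```shell
input="-ia -q -A i -i"

# Option characters listed explicitly: letters, digits and punctuation.
listed=$(echo "$input" | sed -E "s/(-[[:alnum:][:punct:]]*[[:space:]]*)*/x/")
# Option characters described as "anything that is not whitespace".
negated=$(echo "$input" | sed -E "s/(-[^[:space:]]*[[:space:]]*)*/x/")

echo "$listed"     # xi -i
echo "$negated"    # xi -i
```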

Works great. How about no options being passed and only files?

Investigation of Matching at the Beginning of a String

[ahmed@amayem ~]$ ./test.sh i -i
xi -i

This is interesting. The pattern matched the empty string at the beginning of the line instead of the -i at the end. The answer for this can be found in the man re_format page:

In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest.

The trick is in the * that is after the whole expression. It indicates that the pattern before the * appears zero or more times. The empty string at the beginning of the string is the first match because it matches the pattern zero times. Let’s test our understanding as follows:

[ahmed@amayem ~]$ echo i -i | sed s/-*/x/
xi -i
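For contrast, here is a sketch that puts the same test next to a pattern that requires at least one dash (--* is one dash followed by zero or more dashes). The variable names are only for illustration:

```shell
# -* can match zero dashes, so it succeeds with an empty match at the very start.
zero_or_more=$(echo "i -i" | sed 's/-*/x/')
# --* needs at least one dash, so the earliest possible match is the real dash.
one_or_more=$(echo "i -i" | sed 's/--*/x/')

echo "$zero_or_more"    # xi -i
echo "$one_or_more"     # i xi
```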

Looks like we were correct. However, to be on the safe side let’s indicate that the pattern is supposed to be at the beginning of the string using the caret ^. The question is, should we put the caret inside the parentheses or before them? Let’s give it a try:

echo $* | sed -E "s/(^-[^[:space:]]*[[:space:]]*)*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
x-q -A i -i

It only picked out the first dashed option. This is because the caret was put inside the parentheses, which means that the * after the parentheses is saying, match the following pattern zero or more times: “A pattern that begins at the beginning of the string and starts with a dash then some optional non space characters then some optional spaces.” Only the first repetition can begin at the beginning of the string, so we need to put the caret outside the parentheses.

echo $* | sed -E "s/^(-[^[:space:]]*[[:space:]]*)*/x/"

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
xi -i
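The two caret placements can be compared in one sketch on the same input (the variable names inside and outside are only for illustration):

```shell
input="-ia -q -A i -i"

# Caret inside the group: every repetition must start at the beginning of the
# line, so only the first dashed option can match.
inside=$(echo "$input" | sed -E "s/(^-[^[:space:]]*[[:space:]]*)*/x/")
# Caret outside the group: the anchor applies once, then the group repeats freely.
outside=$(echo "$input" | sed -E "s/^(-[^[:space:]]*[[:space:]]*)*/x/")

echo "$inside"     # x-q -A i -i
echo "$outside"    # xi -i
```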

Great.

Using the Matched Options to Isolate the Files Part

Now that we have isolated the options part, we can easily isolate the files part by substituting nothing instead of x (or you can use the regex we built for the files part here):

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
i -i

Once we have the string of the files we can use that as the pattern again to remove that side of the string:

files=$(echo $* | sed -E "s/^(-[^[:space:]]*[[:space:]]*)*//")
options=$(echo $* | sed -E "s/$files//")

echo $options
echo $files

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -A
i -i

Done. Note, however, that this fails when no files are passed: $files is then empty and the second command becomes s///, which sed rejects as an empty pattern.
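One possible way around the empty case is to skip the second substitution when there are no files. The sketch below wraps the two steps in a hypothetical helper called split_params, which sets the variables files and options. It still uses $files as a regex, so file names containing regex metacharacters or a / would need escaping, which is outside this sketch:

```shell
# split_params is a hypothetical helper; it sets the globals files and options.
split_params() {
    local all="$*"
    # printf avoids echo treating a lone parameter such as -n as its own option.
    files=$(printf '%s\n' "$all" | sed -E "s/^(-[^[:space:]]*[[:space:]]*)*//")
    if [ -n "$files" ]; then
        # Same second substitution as above: remove the files text from the full string.
        options=$(printf '%s\n' "$all" | sed -E "s/$files//")
    else
        # No files were passed, so everything is options and sed is not called.
        options=$all
    fi
}

split_params -ia -q -A i -i
echo "$options"    # prints: -ia -q -A (followed by a trailing space)
echo "$files"      # prints: i -i
```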

Note on the Portability of sed

The issue with this approach is that sed complains if we give it an empty pattern. Furthermore, the option for enabling extended regular expressions differs between versions: for example, the FreeBSD version uses the -E flag, whereas the GNU version uses -r, as mentioned in the man sed page:

The -E, -a and -i options are non-standard FreeBSD extensions and may not be available on other operating systems.

Let’s use a variant that would be more portable such as egrep.

Variants

egrep

The grep command is great but it uses basic regular expressions by default, whereas we are using an extended regular expression. Instead we can use egrep, which is the same as grep but for extended regex. Let’s use the -o option of egrep and use the pattern built earlier as follows:

options=$(echo $* | egrep -o "^(-[^[:space:]]*[[:space:]]*)*")

echo $options

⇩

[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -A
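The same idea can be sketched with grep -E, which is the POSIX spelling of egrep. The -o option is an extension that both GNU and BSD grep provide, so this assumes one of those. The params variable stands in for $* so the snippet can be run on its own:

```shell
params="-ia -q -A i -i"

# -E selects extended regular expressions; -o prints only the matched part of the line.
options=$(printf '%s\n' "$params" | grep -E -o "^(-[^[:space:]]*[[:space:]]*)*")

echo "$options"    # prints: -ia -q -A (followed by a trailing space)
```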

Bash Parameter Expansion

bash provides some string manipulation functionality in the form of parameter expansion. The following is from the man bash pages:

${parameter#word}
${parameter##word}
        The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches the beginning of the value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``#'' case) or the longest matching pattern (the ``##'' case) deleted.

As well as the following:

${parameter%word}
${parameter%%word}
        The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches a trailing portion of the expanded value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``%'' case) or the longest matching pattern (the ``%%'' case) deleted.

As well as the following:

${parameter/pattern/string}
        The pattern is expanded to produce a pattern just as in pathname expansion. Parameter is expanded and the longest match of pattern against its value is replaced with string. If pattern begins with /, all matches of pattern are replaced with string. Normally only the first match is replaced.

The key phrase we should be checking here is, The pattern is expanded to produce a pattern just as in pathname expansion. Pathname expansion patterns (globs) are not the same as the regular expressions we have been using thus far, so the regex we built cannot be used here.
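A short sketch of the difference, using a made-up params variable. In a pathname expansion pattern the * means "any string" rather than "repeat the previous item", so there is no way to say "a dashed word, repeated":

```shell
params="-ia -q -A i -i"

# Shortest prefix matching the glob "-* " is "-ia ", so only that is removed.
shortest=${params#-* }
# Longest prefix matching "-* " runs up to the last space, swallowing the file i.
longest=${params##-* }

echo "$shortest"    # -q -A i -i
echo "$longest"     # -i
```

Neither form stops exactly where the options end, which is why the regex tools above were needed.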

Expr

The command expr is another command that has some regular expression abilities, however we can’t really use it here because it only works with basic regular expressions and we are using extended regular expressions.

expr1 : expr2
         The ``:'' operator matches expr1 against expr2, which must be a regular expression. The regular expression is anchored to the beginning of the string with an implicit ``^''. expr expects "basic" regular expressions, see re_format(7) for more information on regular expressions.

Summary

The regex that matches the options part of the ls command is the following:

^(-[^[:space:]]*[[:space:]]*)*

It can be removed using sed:

sed -E "s/^(-[^[:space:]]*[[:space:]]*)*//"

It can be isolated using egrep:

egrep -o "^(-[^[:space:]]*[[:space:]]*)*"

Now let’s try the opposite and build a regex for the files part instead.

References

  1. man re_format page (FreeBSD version)
  2. man sed page (FreeBSD version)
  3. man bash page
  4. man egrep page (FreeBSD version)
  5. man expr page (FreeBSD version)

Ahmed Amayem has written 90 articles

A Web Application Developer Entrepreneur.