This post is part of a practical series on regex (regular expressions) in a bash shell.
Setup
Let’s make a shell script. In your favourite editor type
#!/bin/bash
And save it somewhere as test.sh. Now we need to make it executable as follows:
[ahmed@amayem ~]$ chmod +x ./test.sh
[ahmed@amayem ~]$ ./test.sh
Looks good so far.
Getting the Passed Parameters
This is done with the $* or $@ parameters:
echo $*
Let’s run it with a couple of parameters and see what happens:
[ahmed@amayem ~]$ ./test.sh -A checking
-A checking
Looks good.
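The two parameters only behave differently once an argument contains whitespace. The following is a minimal sketch, not part of test.sh: count_args is a hypothetical helper, and set -- merely simulates running ./test.sh -A "my file".

```shell
#!/bin/bash
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

# Simulate: ./test.sh -A "my file"
set -- -A "my file"

unquoted=$(count_args $*)    # word splitting breaks "my file" into two words
quoted=$(count_args "$@")    # each parameter is kept as its own word
joined=$(count_args "$*")    # all parameters are joined into a single word

echo "$unquoted $quoted $joined"
```

With the default IFS this prints 3 2 1. The rest of this post keeps the unquoted echo $* because the sample arguments contain no whitespace.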
Processing the Passed Parameters
We will be building the regex step by step. If you wish to see the end result, move on to the summary.
Using sed
We will be using sed for preprocessing because it gives us the ability to change the string. For more on sed check ShellTree 1: Analyzing a one line implementation. We will be using the substitute format which looks like this:
s/pattern/replacement/
It means that the first substring that matches the pattern is replaced with replacement.
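As a quick illustration of the substitute format, here is a small sketch; the sample text is made up and has nothing to do with the options we will parse later.

```shell
#!/bin/bash
# Replace the first substring matching the pattern "world" with "there".
result=$(echo "hello world" | sed 's/world/there/')
echo "$result"    # hello there
```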
Deciding what will be Legal Options
Every command has a set of legal options. Let’s arbitrarily choose some letters as legal options:
legalOptions="Aai1l"
Let’s start building the regex.
Finding the regex for the Options
Starting Small
We have two possible parts of the parameters, the options and the files. The options are the first part and the files are the second part. The way we know that something is an option is that it starts with a dash. So let’s start by searching for a dash. As for the characters that come after the dash, let’s just deal with an i in the beginning. We will begin by trying to substitute the options with an x. Add the following line to test.sh:
echo $* | sed s/[-i]/x/
Let’s run it as follows:
[ahmed@amayem ~]$ ./test.sh -i checking
xi checking
It seems that only the - is replaced. This is because [-i] matches a single character (either a dash or an i), not several, and sed replaces only the first match by default. If we want it to replace all instances then we have to use the option g at the end of the substitution as follows:
echo $* | sed s/[-i]/x/g
It gives us the following:
[ahmed@amayem ~]$ ./test.sh -i checking
xx checkxng
The options were substituted but the file portion was also affected: checkxng. We want to leave the file portion unaffected, so we want the pattern to match the whole options part in one go, without being repeated and affecting the files part. Hence we should remove the g flag and instead add a * after the [-i], which means that the characters - and i can be repeated zero or more times:
echo $* | sed s/[-i]*/x/
It gives us the following:
[ahmed@amayem ~]$ ./test.sh -i checking
x checking
That looks much better. All of the options part was replaced with an x.
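The three attempts so far can be compared side by side. This is only a sketch: the variable names are assumptions, and the input is hard-coded instead of coming from $*.

```shell
#!/bin/bash
input="-i checking"

# One character from the set, first match only.
first=$(echo "$input" | sed 's/[-i]/x/')
# One character from the set, every match on the line.
global=$(echo "$input" | sed 's/[-i]/x/g')
# A run of zero or more characters from the set, first match only.
star=$(echo "$input" | sed 's/[-i]*/x/')

echo "$first"     # xi checking
echo "$global"    # xx checkxng
echo "$star"      # x checking
```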
Matching More than one Dashed Options
Let’s try adding some other options:
[ahmed@amayem ~]$ ./test.sh -ia -A checking
xa -A checking
Nope, that’s because we only specified the i as a legal option; we need to add some more. Let’s add the legalOptions we specified earlier:
echo $* | sed s/[-$legalOptions]*/x/
It gives us the following:
[ahmed@amayem ~]$ ./test.sh -ia -A checking
x -A checking
It replaced the first -ia but didn’t recognize the -A afterwards because there is a space in between. We need to add a space to the pattern. Let’s use the character class [:space:], which matches any whitespace character when it is placed inside a bracket expression.
echo $* | sed s/[-[:space:]$legalOptions]*/x/
It gives us the following:
[ahmed@amayem ~]$ ./test.sh -ia -A checking
xchecking
That is much better. We are isolating all of the options part, whether it be multiple characters after a dash or multiple dashes. However, what will happen if the file is formed of letters from the legal options?
Limiting the Regex to Match Only Words Starting with a Dash and Switching to Extended Regular Expressions
[ahmed@amayem ~]$ ./test.sh -ia -A i
x
That’s not good. We need it to check for only the words that start with a dash. For this we will have to put the dash outside the square brackets to indicate that it is the start of a word.
echo $* | sed s/-[[:space:]$legalOptions]*/x/
⇩
[ahmed@amayem ~]$ ./test.sh -ia -A i
x-A i
We went back to only getting the first dashed option. We need a way to maintain the order of a dash, followed by the characters allowed for options, followed by whitespace, and then allow that whole sequence to be repeated zero or more times. To accomplish this we will need to use extended regular expressions by adding the -E flag (depending on the version of sed you have, you may have to use a different option such as -r; check man sed), then grouping the sequence inside parentheses and adding a * at the end as follows:
echo $* | sed -E "s/(-[$legalOptions]*[:space:])*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -A i
x -A i
Again it only replaced the first option, which means that the space is not being matched even though we have [:space:] specified. Let’s try to figure out this mystery by replacing [:space:] with an actual space:
echo $* | sed -E "s/(-[$legalOptions]* )*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -A i
xi
Looks like it did just what we wanted, which means that [:space:] isn’t being expanded properly, even though it was earlier. Perhaps it is a quoting issue? Let’s try unquoting that part:
echo $* | sed -E "s/(-[$legalOptions]*"[:space:]")*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -A i
x -A i
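Nope, no use. The difference from earlier is that [:space:] only acts as a character class when it sits inside a bracket expression. On its own, this sed reads [:space:] as an ordinary bracket expression that matches one of the characters :, s, p, a, c or e, so the pattern is equivalent to:
"s/(-[$legalOptions]*[:space])*/x/"
That’s why it’s not working: the a of -ia is consumed as one of those characters and the space that follows is never matched. Other implementations, such as GNU sed, may instead reject the pattern with an error about character class syntax. We just need to put [:space:] in square brackets so that it matches any whitespace character, and then put a * after it: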
echo $* | sed -E "s/(-[$legalOptions]*[[:space:]]*)*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -A i
xi
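To double-check that [[:space:]]* really absorbs any run of whitespace and not just a single space, here is a small sketch. strip_legal is a hypothetical helper that wraps the sed call above and takes the argument string directly instead of reading $*.

```shell
#!/bin/bash
legalOptions="Aai1l"

# strip_legal is a hypothetical helper: it replaces the leading options with an x.
strip_legal() {
  printf '%s\n' "$1" | sed -E "s/(-[$legalOptions]*[[:space:]]*)*/x/"
}

one_space=$(strip_legal "-ia -A i")
many_spaces=$(strip_legal "-ia    -A   i")
with_tab=$(strip_legal "$(printf -- '-ia\t-A i')")   # a tab between the options

echo "$one_space"      # xi
echo "$many_spaces"    # xi
echo "$with_tab"       # xi
```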
Worked beautifully. Now we need to see what will happen if I add illegal options:
Matching Legal and Illegal Options
[ahmed@amayem ~]$ ./test.sh -ia -q -A i
xq -A i
Argh! I have to worry about legal and illegal options. If it’s an illegal option then I still have to isolate it as part of the options. Options can be alphanumeric but they could also be special characters like @ or !. Let’s add all the possibilities:
echo $* | sed -E "s/(-[[:alnum:][:punct:]]*[[:space:]]*)*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i
xi
Wonderful. However, perhaps a better way of indicating the possible characters that can be in options is to specify any character that is not whitespace, since whitespace is what separates the options and files.
echo $* | sed -E "s/(-[^[:space:]]*[[:space:]]*)*/x/"
It gives us the same result as before. Now let’s check if it keeps files that are in the same format as the options:
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
xi -i
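The same checks can be collected in one place. This is a sketch under the assumption that the arguments are passed as a single string; strip_options is a hypothetical helper name.

```shell
#!/bin/bash
# strip_options is a hypothetical helper: it replaces the leading dashed words with an x.
strip_options() {
  printf '%s\n' "$1" | sed -E 's/(-[^[:space:]]*[[:space:]]*)*/x/'
}

illegal=$(strip_options '-ia -q -A i')        # illegal option -q is still consumed
dashed_file=$(strip_options '-ia -q -A i -i') # the trailing -i comes after a file, so it stays
punct=$(strip_options '-@ -! file')           # punctuation options are consumed too

echo "$illegal"        # xi
echo "$dashed_file"    # xi -i
echo "$punct"          # xfile
```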
Works great. How about no options being passed and only files?
Investigation of Matching at the Beginning of a String
[ahmed@amayem ~]$ ./test.sh i -i
xi -i
This is interesting. The pattern matched the empty string at the beginning of the input instead of the -i at the end. The answer for this can be found in the man re_format page:
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest.
The trick is in the * that is after the whole expression. It indicates that the pattern before the * appears zero or more times. The empty string at the beginning of the string is the first match because it matches the pattern zero times. Let’s test our understanding as follows:
[ahmed@amayem ~]$ echo i -i | sed s/-*/x/
xi -i
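For contrast, here is a small sketch comparing that pattern with one that requires at least one dash. The variable names are assumptions; --* is simply "a dash followed by zero or more dashes".

```shell
#!/bin/bash
# -* can match zero dashes, so the earliest match is the empty string at position 0.
zero_or_more=$(echo "i -i" | sed 's/-*/x/')
# --* needs at least one dash, so the earliest match is the dash itself.
one_or_more=$(echo "i -i" | sed 's/--*/x/')

echo "$zero_or_more"    # xi -i
echo "$one_or_more"     # i xi
```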
Looks like we were correct. However, to be on the safe side let’s indicate that the pattern is supposed to be at the beginning of the string using the caret ^. The question is, should we put the caret inside the parentheses or before them? Let’s give it a try:
echo $* | sed -E "s/(^-[^[:space:]]*[[:space:]]*)*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
x-q -A i -i
It only picked out the first dashed option. This is because the caret was put inside the parentheses, which means that the * after the parentheses is saying, match the following pattern zero or more times: “A pattern that begins at the beginning of the string and starts with a dash then some optional non space characters then some optional spaces.” Only the first repetition can start at the beginning of the string, so we need to put the caret outside the parentheses.
echo $* | sed -E "s/^(-[^[:space:]]*[[:space:]]*)*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
xi -i
Great.
Using the Matched Options to Isolate the Files Part
Now that we have isolated the options part, we can easily isolate the files part by substituting for nothing instead of x (or you can use the regex we built for the files part here):
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
i -i
Once we have the string of the files we can use that as the pattern again to remove that side of the string:
files=$(echo $* | sed -E "s/^(-[^[:space:]]*[[:space:]]*)*//")
options=$(echo $* | sed -E "s/$files//")
echo $options
echo $files
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -A
i -i
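One way to avoid feeding $files back into sed as a pattern is to remove it as a literal suffix with bash’s ${parameter%word} expansion; quoting the word makes bash treat it as plain text instead of a pattern. This is a sketch of an alternative, not the approach used above, and the variables all and all2 are assumptions standing in for $*.

```shell
#!/bin/bash
all="-ia -q -A i -i"

# Strip the leading dashed words to get the files part.
files=$(printf '%s\n' "$all" | sed -E 's/^(-[^[:space:]]*[[:space:]]*)*//')
# Remove the files part as a literal suffix; the quotes disable pattern matching.
options=${all%"$files"}

echo "options: [$options]"    # options: [-ia -q -A ]
echo "files:   [$files]"      # files:   [i -i]

# The same steps still work when no files are passed.
all2="-a -b"
files2=$(printf '%s\n' "$all2" | sed -E 's/^(-[^[:space:]]*[[:space:]]*)*//')
options2=${all2%"$files2"}
echo "options: [$options2]"   # options: [-a -b]
```

Note that options keeps the trailing space that separated it from the files; an unquoted echo $options, as used above, hides it.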
Done. Note, however, that this won’t work when $files is empty, because sed is then given an empty pattern.
Note on the Portability of sed
The issue with this is that sed complains if we give it an empty pattern. Furthermore, the option for using extended regular expressions differs between versions: for example, the FreeBSD version uses the -E flag, whereas the GNU version uses -r (recent GNU releases also accept -E), as mentioned in the man sed page:
The -E, -a and -i options are non-standard FreeBSD extensions and may not be available on other operating systems.
Let’s use a variant that would be more portable, such as egrep.
Variants
egrep
The grep command is great but it uses basic regular expressions, whereas we are using an extended regular expression. Instead we can use egrep, which is the same as grep but for extended regex. Let’s use the -o option of egrep, which prints only the matched part of each line, and use the pattern built earlier as follows:
options=$(echo $* | egrep -o "^(-[^[:space:]]*[[:space:]]*)*")
echo $options
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -A
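Here is the same idea as a standalone sketch. It uses grep -E, which is the equivalent spelling of egrep, and hard-codes the input; the variable names are assumptions.

```shell
#!/bin/bash
pattern='^(-[^[:space:]]*[[:space:]]*)*'

# -o prints only the part of the line that matched the pattern.
options=$(echo "-ia -q -A i -i" | grep -E -o "$pattern")
# With no leading options the only match is the empty string, so nothing is printed.
none=$(echo "i -i" | grep -E -o "$pattern")

echo "[$options]"    # [-ia -q -A ]
echo "[$none]"       # []
```

The brackets in the output make the trailing space after -A visible.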
Bash Parameter Expansion
bash provides some string manipulation functionality in the form of parameter expansion. The following is from the man bash pages:
${parameter#word}
${parameter##word}
The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches the beginning of the value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``#'' case) or the longest matching pattern (the ``##'' case) deleted.
As well as the following:
${parameter%word}
${parameter%%word}
The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches a trailing portion of the expanded value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``%'' case) or the longest matching pattern (the ``%%'' case) deleted.
As well as the following:
${parameter/pattern/string}
The pattern is expanded to produce a pattern just as in pathname expansion. Parameter is expanded and the longest match of pattern against its value is replaced with string. If pattern begins with /, all matches of pattern are replaced with string. Normally only the first match is replaced.
The key phrase we should be checking here is, The pattern is expanded to produce a pattern just as in pathname expansion. Pathname expansion uses glob patterns, which are not the same as the regular expressions we have been using thus far. A glob has no way to say “repeat this group zero or more times”, hence it is inapplicable.
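A small sketch shows what goes wrong. The glob "-* " means a dash, then anything, then a space; the variable args is an assumption standing in for $*.

```shell
#!/bin/bash
args="-ia -q -A i -i"

# Shortest prefix matching the glob: only the first option is removed.
shortest=${args#-* }
# Longest prefix matching the glob: the file i is swallowed as well.
longest=${args##-* }

echo "$shortest"    # -q -A i -i
echo "$longest"     # -i
```

Neither result is the files part i -i: the shortest match stops too early and the longest match runs past the options.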
Expr
The command expr is another command that has some regular expression abilities. However, we can’t really use it here because it only works with basic regular expressions and we are using extended regular expressions.
expr1 : expr2
The ``:'' operator matches expr1 against expr2, which must be a regular expression. The regular expression is anchored to the beginning of the string with an implicit ``^''. expr expects "basic" regular expressions, see re_format(7) for more information on regular expressions.
Summary
The regex that matches the options part of the ls command is the following:
^(-[^[:space:]]*[[:space:]]*)*
It can be removed using sed:
sed -E "s/^(-[^[:space:]]*[[:space:]]*)*//"
It can be isolated using egrep:
egrep -o "^(-[^[:space:]]*[[:space:]]*)*"
Now let’s try the opposite and build a regex for the files part instead.
References
man re_format page (FreeBSD version)
man sed page (FreeBSD version)
man bash page
man egrep page (FreeBSD version)
man expr page (FreeBSD version)