This post is part of a practical series on regex (regular expressions) in a bash shell.
Setup
Let’s make a shell script. In your favourite editor, type:
#!/bin/bash
Save it somewhere as test.sh. Now we need to make it executable and run it:
[ahmed@amayem ~]$ chmod +x ./test.sh
[ahmed@amayem ~]$ ./test.sh
Looks good so far.
Getting the Passed Parameters
This is done with the $* or $@ parameters.
echo $*
Let’s run it with a couple of parameters and see what happens:
[ahmed@amayem ~]$ ./test.sh -A checking
-A checking
Looks good.
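The two parameters only behave differently once they are quoted. The following is a minimal sketch, not part of the original script: count_args is a hypothetical helper, and set -- merely simulates running ./test.sh -A "my file".

```shell
#!/bin/bash
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Simulate the command line: ./test.sh -A "my file"
set -- -A "my file"

star_count=$(count_args "$*")   # "$*" joins everything into one word
at_count=$(count_args "$@")     # "$@" keeps each argument as its own word

echo "\"\$*\" expands to $star_count argument(s)"
echo "\"\$@\" expands to $at_count argument(s)"
```

Since this post treats the arguments as one string to run a regex over, $* is the natural fit here.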
Processing the Passed Parameters
We will be building the regex step by step. If you wish to see the end result, move on to the summary.
Using sed
We will be using sed for preprocessing because it gives us the ability to change the string. For more on sed, check ShellTree 1: Analyzing a one line implementation. We will be using the substitute format, which looks like this:
s/pattern/replacement/
It means that the first substring that matches pattern is replaced with replacement.
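As a small illustration of the substitute format (the input string is made up for this example), the sketch below shows that only the first match on a line is replaced unless the g flag is appended:

```shell
#!/bin/bash
# Without a flag, only the first match on the line is replaced.
first_only=$(echo "one two two" | sed "s/two/2/")

# With the g flag, every match on the line is replaced.
all_matches=$(echo "one two two" | sed "s/two/2/g")

echo "$first_only"    # one 2 two
echo "$all_matches"   # one 2 2
```

The rest of this post relies on the first-match behaviour, because the pattern we build consumes everything up to the end of the line anyway.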
Finding the regex for the files
Specifying the Lack of a Dash in the Beginning of a Word
The files part begins when we meet a word that doesn’t start with a dash. So let’s indicate the start of a word without a dash by using [^-], which matches any character other than a dash. We will use the character class [:alnum:] inside a bracket expression to match any letter or digit.
echo $* | sed -E "s/[^-][[:alnum:]]*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-x -q -A i -i
It ignores the first dash but takes the letters after it until it hits a space. We need to indicate that the absence of a dash is at the beginning of a word, so we will use the [[:<:]] bracket expression, which matches the beginning of a word. To use [[:<:]] we will need extended regular expressions, which we enable with the -E flag (depending on the version of sed you have, you may have to use a different option such as -r; check man sed).
echo $* | sed -E "s/[^-][[:<:]][[:alnum:]]*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -Ax -i
Things are looking better. The i was substituted but nothing after it. Also notice that file names may contain special characters, and that someone may put illegal characters in the file names, such as the -i at the end.
Generalizing the Characters that are in the Files part
Instead of [[:alnum:]] let’s use . to match any character. We will, of course, need a * after the . to indicate zero or more.
echo $* | sed -E "s/[^-][[:<:]].*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -Ax
We have success.
Investigating [[:<:]] in the Beginning of a String
Let’s try a different case:
[ahmed@amayem ~]$ ./test.sh i
i
Hmm, that’s not good. It seems the pattern is not matching a word at the very beginning of the line: [^-] has to consume a character before [[:<:]], and there is none. We can see this more clearly here:
[ahmed@amayem ~]$ ./test.sh i i
ix
So it’s clearly getting the second i, just not the first. Let’s add the special case of the word being at the beginning of a line by using ^, which matches the beginning of a line. We will make two cases by using |, which means “or” when used in a regex:
echo $* | sed -E "s/^[^-].*|[^-][[:<:]].*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh i i
x
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -Ax
Great now it’s working.
Maintaining Portability by Finding an Alternative to [[:<:]]
We would like this to be portable, and when we look at the man re_format page we see the following:
There are two special cases of bracket expressions: the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at the beginning and end of a word respectively. A word is defined as a sequence of word characters which is neither preceded nor followed by word characters. A word character is an alnum character (as defined by ctype(3)) or an underscore. This is an extension, compatible with but not specified by IEEE Std 1003.2 (``POSIX.2''), and should be used with caution in software intended to be portable to other systems.
So let’s try to do it without [[:<:]]. We can tell that we are at the beginning of a word if it is preceded by a whitespace character, which can be any one of the characters in [:space:]:
echo $* | sed -E "s/^[^-].*|[[:space:]][^-].*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -Ax
[ahmed@amayem ~]$ ./test.sh i i
x
Great, that’s exactly what we wanted.
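To check the portable pattern against more than one input at a time, a small harness helps. The sketch below is an illustration rather than part of the original script: mark_files is a hypothetical helper, printf stands in for echo, and it assumes a sed that accepts -E.

```shell
#!/bin/bash
# Hypothetical helper: replaces the files part of its argument with x.
mark_files() {
  printf '%s\n' "$1" | sed -E 's/^[^-].*|[[:space:]][^-].*/x/'
}

opts_then_files=$(mark_files "-ia -q -A i -i")   # options followed by files
files_only=$(mark_files "i i")                   # no options at all
opts_only=$(mark_files "-ia -q")                 # no files, so nothing matches

printf '%s\n' "$opts_then_files" "$files_only" "$opts_only"
```

The third case is worth keeping in the test set: when there are no files the line should pass through unchanged.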
Attempting to Simplify the Regex
I am wondering if we can make the regex simpler by removing the | and simply adding a * after the [[:space:]] to allow it to match the beginning of the line when there is no space:
echo $* | sed -E "s/[[:space:]]*[^-].*/x/"
⇩
[ahmed@amayem ~]$ ./test.sh i -i
x
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-x
It deals with the first case properly, but when we test it with the second case we find that adding a * after the [[:space:]] caused it to match everything after the first dash, which effectively removed the restriction that the non-dash character must be at the beginning of a word. Therefore we have to return to our previous pattern:
echo $* | sed -E "s/^[^-].*|[[:space:]][^-].*/x/"
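One way to see exactly what each pattern consumes is to print the matched text instead of replacing it. The sketch below assumes a grep that supports the -o option (print only the matched part); the variable names are made up for the illustration.

```shell
#!/bin/bash
args="-ia -q -A i -i"

# Starred version: [[:space:]]* may match nothing at all, so the match
# is free to begin right after the first dash.
loose=$(printf '%s\n' "$args" | grep -E -o '[[:space:]]*[^-].*')

# Two-case version: the match must begin at the start of the line
# or immediately after a whitespace character.
strict=$(printf '%s\n' "$args" | grep -E -o '^[^-].*|[[:space:]][^-].*')

echo "loose:  [$loose]"
echo "strict: [$strict]"
```

The brackets in the output make the leading space of the strict match visible.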
Trying to Use Basic Regular Expressions
The problem we face is the |. According to the man re_format page, | is an ordinary character in basic regular expressions:
Obsolete (``basic'') regular expressions differ in several respects. `|' is an ordinary character and there is no equivalent for its functionality.
So we are stuck with extended only, unless we can find another regex that is compatible with ‘basic’ regular expressions.
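One possible workaround, which is not taken from the original post, is to split the alternation into two separate substitutions, each of which is a valid basic regular expression. This is only a sketch: strip_files_bre is a hypothetical helper, and it assumes the replacement is empty (or at least contains no whitespace) so that the second expression cannot re-match the output of the first.

```shell
#!/bin/bash
# Hypothetical helper: removes the files part using only basic regular expressions.
strip_files_bre() {
  # First expression handles a line that starts with a file,
  # second expression handles a file that follows whitespace.
  printf '%s\n' "$1" | sed -e 's/^[^-].*//' -e 's/[[:space:]][^-].*//'
}

options=$(strip_files_bre "-ia -q -A i -i")
no_options=$(strip_files_bre "i -i")

echo "[$options]"      # [-ia -q -A]
echo "[$no_options]"   # []
```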
Using the Matched Files to Isolate the Files Part
Now that we have matched the files part, we can easily isolate the options part by substituting the empty string instead of x (or you can use the regex we built for the files part here):
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -A
Once we have the string of the options we can use that as the pattern again to remove that side of the string:
options=$(echo $* | sed -E "s/^[^-].*|[[:space:]][^-].*//")
files=$(echo $* | sed "s/$options//")
echo $options
echo $files
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
-ia -q -A
i -i
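Reusing $options as a sed pattern means its contents are interpreted as a regex, and an empty value produces an empty pattern. A hedged alternative, not from the original post, is to strip the options as a literal prefix. In the sketch below, set -- simulates the command line, args is a made-up variable name, and printf is used instead of echo because echo may interpret a leading argument such as -n or -e as one of its own options.

```shell
#!/bin/bash
# Simulate the command line: ./test.sh -ia -q -A i -i
set -- -ia -q -A i -i
args="$*"

# Remove the files part, as before.
options=$(printf '%s\n' "$args" | sed -E 's/^[^-].*|[[:space:]][^-].*//')

# Strip the options as a literal prefix; quoting disables pattern matching.
files=${args#"$options"}
# Drop the single space that separated the options from the files.
files=${files#" "}

echo "$options"   # -ia -q -A
echo "$files"     # i -i
```

If there are no options, $options is empty, the prefix removal is a no-op, and files ends up holding the whole argument string.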
Note on the Portability of sed
The issue with this approach is that if the options string is empty, sed is handed an empty pattern and complains. Furthermore, the option for enabling extended regular expressions differs between versions: for example, the FreeBSD version uses the -E flag, whereas the GNU version uses -r, as mentioned in the man sed page:
The -E, -a and -i options are non-standard FreeBSD extensions and may not be available on other operating systems.
Let’s use a variant that would be more portable, such as egrep.
Variants
egrep
We can use the -o option of egrep with the regex mentioned earlier, as follows:
files=$(echo $* | egrep -o "^[^-].*|[[:space:]][^-].*")
echo $files
⇩
[ahmed@amayem ~]$ ./test.sh -ia -q -A i -i
i -i
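Note that the match includes the whitespace character in front of the first file; the unquoted echo $files hides it through word splitting. The sketch below makes that visible and trims it explicitly. It uses grep -E, which is equivalent to egrep (some newer grep releases mark egrep as obsolescent), and the variable names are made up for the illustration.

```shell
#!/bin/bash
args="-ia -q -A i -i"

# grep -E is the same as egrep; -o prints only the matched text.
files=$(printf '%s\n' "$args" | grep -E -o '^[^-].*|[[:space:]][^-].*')

# ${files%%[![:space:]]*} is the run of leading whitespace;
# removing it as a prefix trims the value.
trimmed=${files#"${files%%[![:space:]]*}"}

echo "[$files]"     # [ i -i]
echo "[$trimmed]"   # [i -i]
```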
Bash Parameter Expansion
bash provides some string manipulation functionality in the form of parameter expansion. The following is from the man bash pages:
${parameter#word}
${parameter##word}
The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches the beginning of the value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``#'' case) or the longest matching pattern (the ``##'' case) deleted.
As well as the following:
${parameter%word}
${parameter%%word}
The word is expanded to produce a pattern just as in pathname expansion. If the pattern matches a trailing portion of the expanded value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ``%'' case) or the longest matching pattern (the ``%%'' case) deleted.
As well as the following:
${parameter/pattern/string}
The pattern is expanded to produce a pattern just as in pathname expansion. Parameter is expanded and the longest match of pattern against its value is replaced with string. If pattern begins with /, all matches of pattern are replaced with string. Normally only the first match is replaced.
The key phrase to note here is “The pattern is expanded to produce a pattern just as in pathname expansion”. Pathname expansion patterns (globs) are not the same as the regular expressions we have been using thus far, so our regex cannot be used here.
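For reference, this is how the expansions quoted above behave with glob patterns, where * means “any string” rather than “zero or more of the previous item”. The file name and variable names below are made up for the illustration.

```shell
#!/bin/bash
name="archive.tar.gz"

short_prefix=${name#*.}    # shortest match of "*." removed from the front
long_prefix=${name##*.}    # longest match of "*." removed from the front
short_suffix=${name%.*}    # shortest match of ".*" removed from the end
long_suffix=${name%%.*}    # longest match of ".*" removed from the end

dashes="a-b-c"
first_replaced=${dashes/-/+}    # only the first - is replaced
all_replaced=${dashes//-/+}     # pattern begins with /, so every - is replaced

printf '%s\n' "$short_prefix" "$long_prefix" "$short_suffix" "$long_suffix"
printf '%s\n' "$first_replaced" "$all_replaced"
```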
Expr
The command expr also has some regular expression abilities. However, we can’t really use it here because it only works with basic regular expressions and we are using extended regular expressions.
expr1 : expr2
The ``:'' operator matches expr1 against expr2, which must be a regular expression. The regular expression is anchored to the beginning of the string with an implicit ``^''. expr expects "basic" regular expressions, see re_format(7) for more information on regular expressions.
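A short sketch of the ``:'' operator in action, with made-up inputs: without a \( \) group expr prints the number of characters matched, with a group it prints the captured text, and the implicit anchoring means a pattern cannot match in the middle of the string.

```shell
#!/bin/bash
# No group: prints the length of the match at the start of the string.
matched_len=$(expr "abc123" : '[a-z]*')

# With a \( \) group: prints the text captured by the group.
digits=$(expr "abc123" : '[a-z]*\([0-9]*\)')

# Implicit ^ anchor: digits are not at the start, so nothing matches.
# expr exits non-zero in that case, hence the || true.
anchored=$(expr "abc123" : '[0-9][0-9]*' || true)

echo "$matched_len"   # 3
echo "$digits"        # 123
echo "$anchored"      # 0
```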
Summary
The regex that matches the files part of the ls command is the following:
^[^-].*|[[:space:]][^-].*
The files part can be removed using sed:
sed -E "s/^[^-].*|[[:space:]][^-].*//"
The files part can be isolated using egrep:
egrep -o "^[^-].*|[[:space:]][^-].*"
References
man re_format page (FreeBSD version)
man sed page (FreeBSD version)
man bash page
man egrep page (FreeBSD version)
man expr page (FreeBSD version)