Pattern Matching In Bash

bash

 

Wildcards have been around forever. Some even claim they appear in the hieroglyphics of the ancient Egyptians. Wildcards allow you to specify succinctly a pattern that matches a set of filenames (for example, *.pdf to get a list of all the PDF files). Wildcards are also often referred to as glob patterns (or when using them, as "globbing"). But glob patterns have uses beyond just generating a list of useful filenames. The bash man page refers to glob patterns simply as "Pattern Matching".

 

First, let's do a quick review of bash's glob patterns. In addition to the simple wildcard characters that are fairly well known, bash also has extended globbing, which adds additional features. These extended features are enabled via the extglob option.

Pattern Description
* Match zero or more characters
? Match any single character
[...] Match any of the characters in a set
?(patterns) Match zero or one occurrences of the patterns (extglob)
*(patterns) Match zero or more occurrences of the patterns (extglob)
+(patterns) Match one or more occurrences of the patterns (extglob)
@(patterns) Match one occurrence of the patterns (extglob)
!(patterns) Match anything that doesn't match one of the patterns (extglob)

 

For example:

$ ls
a.jpg  b.gif  c.png  d.pdf ee.pdf

$ ls *.jpg
a.jpg

$ ls ?.pdf
d.pdf

$ ls [ab]*
a.jpg  b.gif

$ shopt -s extglob   # turn on extended globbing

$ ls ?(*.jpg|*.gif)
a.jpg  b.gif

$ ls !(*.jpg|*.gif)  # not a jpg or a gif
c.png d.pdf ee.pdf

When first using extended globbing, many of them didn't seem to do what I initially thought they ought to do. For example, it appeared to me that, given a.jpg, the pattern ?(*.jpg|a.jpg) should not match, because a.jpg matched both patterns, and the ? is "zero or one", right? Wrong. My confusion was due to a misreading of the description: it's not the filename that can match only once, it's the pattern that can match only once. Think of it terms of regular expressions:

Glob Regular Expression Equivalent Description
?(patterns) (regex)? Match an optional regex
*(patterns) (regex)* Match zero or more occurrences of a regex
+(patterns) (regex)+ Match one or more occurrences of a regex
@(patterns) (regex) Match the regex (one occurrence)

 

So, for example:

$ ls *.pdf
ee.pdf  e.pdf  .pdf

$ ls ?(e).pdf    # zero or one "e" allowed
e.pdf  .pdf

$ ls *(e).pdf    # zero or more "e"s allowed
ee.pdf  e.pdf  .pdf

$ ls +(e).pdf    # one or more "e"s allowed
ee.pdf  e.pdf

$ ls @(e).pdf    # only one e allowed
e.pdf

And while I'm comparing glob patterns to regular expressions, there's an important point to be made that may not be immediately obvious: glob patterns are just another syntax for doing pattern matching in general in bash. And you can use them in a number of different places:

  • After the == in a bash [[ expr ]] expression.
  • In the patterns to a case command.
  • In parameter expansions (%, %%, #, ##, /, //).

The following example uses pattern matching in the expression of an if statement to test whether a variable has a value of "something" or "anything":

$ shopt +s extglob

$ a=something
$ if [[ $a == +(some|any)thing ]]; then echo yes; else echo no; fi
yes

$ a=anything
$ if [[ $a == +(some|any)thing ]]; then echo yes; else echo no; fi
yes

$ a=nothing
$ if [[ $a == +(some|any)thing ]]; then echo yes; else echo no; fi
no

The following example uses pattern matching in a case statement to determine whether a file is an image file:

shopt +s extglob
for f in $*
do
    case $f in
    !(*.gif|*.jpg|*.png))  # ! == does not match
        echo "Not an image: $f"
        ;;
    *)
        echo "Image: $f"
        ;;
    esac
done
$ bash script.sh a.jpg b.gif c.png d.pdf e.pdf
Image: a.jpg
Image: b.gif
Image: c.png
Not an image: d.pdf
Not an image: e.pdf

In the example above, the pattern !(*.gif|*.jpg|*.png) will match a filename if it's not a gif, jpg or png.

The following example uses pattern matching in a %% parameter expansion to remove the extension from all image files:

shopt -s extglob
for f in $*
do
    echo ${f%%*(.gif|.jpg|.png)}
done
$ bash script.sh a.jpg b.gif c.png d.pdf e.pdf
a
b
c
d.pdf
e.pdf

A feature that I just recently became aware of is that you can do the above action in one fell swoop: if you use "*" or "@" as the variable name, the transformation is done on all the command-line arguments at once. [Note to self: always read the last half of the paragraph from now on]:

shopt -s extglob
echo ${*%%*(.gif|.jpg|.png)}
$ bash script.sh a.jpg b.gif c.png d.pdf e.pdf
a b c d.pdf e.pdf

And that works on arrays too:

shopt -s extglob
array=($*)
echo ${array[*]%%*(.gif|.jpg|.png)}
$ bash script.sh a.jpg b.gif c.png d.pdf e.pdf
a b c d.pdf e.pdf

The biggest takeaway here is to stop thinking of wildcards as a mechanism just to get a list of filenames and start thinking of them as glob patterns that can be used to do general pattern matching in your bash scripts. Think of glob patterns as regular expressions in a different language.


Any code found in my articles should be considered licensed as follows:

# Copyright 2019 Mitch Frazier <mitch -at- linuxjournal.com>
#
# This software may be used and distributed according to the terms of the
# MIT License or the GNU General Public License version 2 (or any later version).

Mitch Frazier is an embedded systems programmer at Emerson Electric Co. Mitch has been a contributor to and a friend of Linux Journal since the early 2000s.

Load Disqus comments