Fun with Mail Merge and Cool Bash Arrays
Creating a sed-based file substitution tool.
A few weeks ago, I was digging through my spam folder and found an email message that started out like this:
Dear #name#
Congratulations on winning the $15.7 million lottery payout!
To learn how to claim your winnings, please...
Obviously, it was a scam (does anyone actually fall for these?), but what captured my attention was the #name# sequence. Clearly that was a fail on the part of the sender, who presumably didn't know how to use AnnoyingSpamTool 1.3 or whatever the heck he or she was using.
The more general notation for bulk email and file transformations is pretty interesting, however. There are plenty of legitimate reasons to use this sort of substitution, ranging from email newsletters (like the one I send every week from AskDaveTaylor.com—check it out!) to stockholder announcements and much more.
With that as the inspiration, let's build a tool that offers just this capability.
The simple version will be a 1:1 substitution, so #name# becomes, say, "Rick Deckard", while #first# might be "Rick" and #last# might be "Deckard". Let's build on that, but let's start small.
Simple Word Substitution in Linux
There are plenty of ways to tackle word substitution from the command line, ranging from Perl to awk, but here I'm using the original UNIX command sed (stream editor), which was designed for exactly this purpose. The general notation for a substitution is s/old/new/, and if you tack on a g at the end, it matches every occurrence on a line, not only the first, so the full command is s/old/new/g.
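To see the difference the trailing g makes, here's a minimal sketch. The sample line and the variable names (line, first_only, every_match) are made up for illustration:

```shell
# Hypothetical sample line with two occurrences of the same placeholder.
line='Dear #name#, yes you, #name#!'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/#name#/Rick/')

# With g: every match on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/#name#/Rick/g')

echo "$first_only"
echo "$every_match"
```

The first echo still contains one untouched #name#, while the second has none left.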
Before going further, here's a simple document that has necessary substitutions embedded:
$ cat convertme.txt
#date#
Dear #name#, I wanted to start by again thanking you for your
generous donation of #amount# in #month#. We couldn't do our
work without support from humans like you, #first#.
This year we're looking at some unexpected expenses,
particularly in Sector 5, which encompasses #state#, as you
know. I'm hoping you can start the year with an additional
contribution? Even #suggested# would be tremendously helpful.
Thanks for your ongoing support. With regards,
Rick Deckard
Society for the Prevention of Cruelty to Replicants
Scan through it, and you'll see there are a lot of substitutions to do: #date#, #name#, #amount#, #month#, #first#, #state# and #suggested#. It turns out that #date# will be replaced with the current date, and #suggested# is one that'll be calculated as the letter is processed, but that's for a bit later, so stay tuned.
To make life easy, a source file that's a comma-separated list allows for easy interaction with a source spreadsheet, so a sample input data file might look like this:
name,first,amount,month,state
Eldon Tyrell,Eldon,500,July,California
At its most basic, the first line defines variable names (without the # notation), and subsequent lines are a set of values for a particular donor or recipient. To start, let's read in the variable names:
while IFS=',' read -r f1 f2 f3 f4 f5 f6 f7
do
  declare -a varname=($f1 $f2 $f3 $f4 $f5 $f6 $f7)
done
Key to understanding this is knowing about IFS, the internal field separator. Normally, it's white space, which is why, for example, ls my file name looks for three files called my, file and name. But you can change it, as I demonstrate by changing IFS to a comma.
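Here's a small sketch of that IFS change in isolation. The record and the variable names (fullname, first, amount) are illustrative, not part of the final script:

```shell
# Hypothetical record; the first field contains a space.
record='Eldon Tyrell,Eldon,500'

# Putting IFS=',' on the same line as read changes the separator
# for that one command only; the rest of the script is unaffected.
IFS=',' read -r fullname first amount <<EOF
$record
EOF

echo "fullname=$fullname"
echo "first=$first"
echo "amount=$amount"
```

Because only commas split the line, "Eldon Tyrell" lands in a single variable, space and all.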
Those Cool Bash Arrays
I declare an array called varname that receives each of the fields read into the script. There are only five fields in use at this point, but let's support up to seven to make the resultant script a bit more flexible.
Arrays are really cool in Bash actually, but the notation is a smidge funky. That is, you can't just use $array[index], because it won't be parsed correctly, so curly braces are a necessary addition:
echo ${varname[1]}
That works just fine.
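If you're curious what actually goes wrong without the braces, here's a quick sketch. The array contents and the variables wrong, right and count are just for demonstration:

```shell
declare -a varname=(name first amount)

# Without braces, Bash expands $varname (which means element 0)
# and then tacks the literal text "[1]" onto the end.
wrong="$varname[1]"

# With braces, the index is part of the expansion. Indices start at 0.
right="${varname[1]}"

# ${#varname[@]} is the number of elements in the array.
count=${#varname[@]}

echo "$wrong"
echo "$right"
echo "$count"
```

The unbraced form quietly produces name[1] rather than an error, which is why this mistake can be hard to spot.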
For a basic algorithm, you're going to have two parallel arrays (parallel in that their indices will match up): one that retains all the variable names, and the other that contains the values for this instance of the data entry list.
This means you'll need to differentiate between the situation when the script is reading the first line and when subsequent lines of the data file are read. Easily done:
(( lines++ ))
if [ $lines -eq 1 ] ; then
  # variable names
  declare -a varname=($f1 $f2 $f3 $f4 $f5 $f6 $f7)
else
  # values for this line (can contain spaces)
  declare -a value=("$f1" "$f2" "$f3" "$f4" "$f5"
                    "$f6" "$f7")
fi
As with most code, this makes some assumptions, but they're safe: variable names aren't quoted because they're always a single word, but variable values might have spaces, so they do end up quoted in the declare statement. Otherwise, this should be easy, and the (( lines++ )) notation should make you cheer. It's a nice Bash shortcut!
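Putting the read loop and the first-line test together might look something like the sketch below. The temporary data file and the datafile variable are assumptions added so the example runs on its own, and I've swapped in (( ++lines )) because the post-increment form returns a non-zero status when the old value is 0, which trips up scripts run under set -e:

```shell
# Hypothetical data file, created here so the sketch is self-contained.
datafile=$(mktemp)
printf '%s\n' 'name,first,amount,month,state' \
              'Eldon Tyrell,Eldon,500,July,California' > "$datafile"

lines=0
while IFS=',' read -r f1 f2 f3 f4 f5 f6 f7
do
  (( ++lines ))   # pre-increment: the result is never 0, so the status is 0
  if [ "$lines" -eq 1 ] ; then
    # first line: variable names (single words, so unquoted is okay)
    declare -a varname=($f1 $f2 $f3 $f4 $f5 $f6 $f7)
  else
    # later lines: values, quoted because they can contain spaces
    declare -a value=("$f1" "$f2" "$f3" "$f4" "$f5" "$f6" "$f7")
    echo "${varname[0]} -> ${value[0]}"
  fi
done < "$datafile"   # redirect (not a pipe) so the arrays survive the loop
rm -f "$datafile"
```

Note that value always ends up with seven slots, some of them empty, while varname only keeps the names that were actually present.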
Once you're past the very first line, the script can look in varname[x] for the xth variable name and value[x] for the value of that named variable, then express each pair as a sed-friendly substitution command:
for ((i=0; i<${#value[*]}; i++))
do
  if [ ! -z "${value[$i]}" ] ; then
    echo "s/#${varname[$i]}#/${value[$i]}/g"
  fi
done
Which produces this:
s/#name#/Eldon Tyrell/g
s/#first#/Eldon/g
s/#amount#/500/g
s/#month#/July/g
s/#state#/California/g
That's pretty darn close to what you want actually. Let's push forward.
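One assumption baked into these commands is that no value contains a slash, an ampersand or a backslash, all of which mean something special on the replacement side of s/old/new/. If your data might include something like "AT&T" or "24/7", a helper along these lines could escape values first. The function name escape_value is my own invention, and this sketch doesn't attempt to handle values containing newlines:

```shell
# Hypothetical helper: put a backslash in front of every \, / and &
# so the value is safe inside the replacement half of s/old/new/.
escape_value() {
  printf '%s' "$1" | sed 's|[\\/&]|\\&|g'
}

raw='AT&T 24/7'
safe=$(escape_value "$raw")

# Use the escaped value in a substitution built the same way as above.
result=$(printf '%s\n' 'Thanks, #name#!' | sed "s/#name#/$safe/g")

echo "$safe"
echo "$result"
```

The helper uses | as its own delimiter so the slash inside the bracket expression can't be mistaken for the end of the pattern.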
Working with sed
The stream editor sed is far more powerful than its modest and ancient history might suggest. It's perfect for this job, as shown above.
You could write the above lines into a temp file and invoke sed directly, but let's avoid the file I/O and turn it all into a command-line argument as necessary. That's done by simply separating each command with a semicolon, which you can do by building it in a temp variable instead:
for ((i=0; i<${#value[*]}; i++))
do
  if [ ! -z "${value[$i]}" ] ; then
    if [ -z "$SUBS" ] ; then
      SUBS="s/#${varname[$i]}#/${value[$i]}/g"
    else
      SUBS="$SUBS;s/#${varname[$i]}#/${value[$i]}/g"
    fi
  fi
done
There's undoubtedly a way to avoid the innermost if-then-else statement to omit the unnecessary ; prefix, but sometimes it's easier to have a few lines of code than yet more gobbledygook.
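For the curious, one such way is the ${var:+word} parameter expansion, which expands to word only when var is set and non-empty. Here's a sketch with hard-coded sample arrays standing in for the ones read from the data file:

```shell
# Sample arrays standing in for the ones read from the data file.
declare -a varname=(name first amount)
declare -a value=("Eldon Tyrell" "Eldon" "500" "" "")

SUBS=""
for ((i=0; i<${#value[@]}; i++))
do
  if [ -n "${value[$i]}" ] ; then
    # ${SUBS:+$SUBS;} is "$SUBS;" when SUBS has content, and nothing
    # at all when SUBS is empty, so no leading semicolon appears.
    SUBS="${SUBS:+$SUBS;}s/#${varname[$i]}#/${value[$i]}/g"
  fi
done

echo "$SUBS"
```

Whether that counts as tidier or as more gobbledygook is a matter of taste.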
Otherwise, the above is a simple expansion of the previous for loop. This time, it builds the entire sed command within the SUBS substitution variable. Here's how to test:
echo " sed \"$SUBS\" $inputfile"
When you run this with the input data file, here's what's pushed out to the terminal:
sed "s/#name#/Eldon Tyrell/g;s/#first#/Eldon/g;
s/#amount#/500/g;s/#month#/July/g;
s/#state#/California/g" convertme.txt
sed "s/#name#/Rachel/g;s/#first#/Rachel/g;
s/#amount#/100/g;s/#month#/March/g;
s/#state#/New York/g" convertme.txt
(Note: line breaks added for formatting purposes only.)
It's actually a very small step from here to invoke the command, so let's do that:
$ sub.sh
#date#
Dear Eldon Tyrell, I wanted to start by again thanking you
for your generous donation of 500 in July. We couldn't do
our work without support from humans like you, Eldon.
This year we're looking at some unexpected expenses,
particularly in Sector 5, which encompasses California, as
you know. I'm hoping you can start the year with an
additional contribution? Even #suggested# would be
tremendously helpful.
Thanks for your ongoing support. With regards,
Rick Deckard
Society for the Prevention of Cruelty to Replicants
$
Generally, this looks good. #date# and #suggested# are still untranslated, but that's as expected. What is a bit odd is that it didn't get the second entry too. A bug.
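I haven't diagnosed that bug here, so treat the following as a sketch of how the whole loop might fit together rather than as the fix. It makes two defensive choices that are my own assumptions: SUBS is cleared for every record, and the loop keeps a final line that lacks a trailing newline. The temporary files, the shortened letter and the output variable are all made up so the example runs on its own:

```shell
# Hypothetical files, created here so the sketch is self-contained.
workdir=$(mktemp -d)
datafile="$workdir/data.csv"
inputfile="$workdir/convertme.txt"
printf '%s\n' 'name,first,amount' \
              'Eldon Tyrell,Eldon,500' \
              'Rachel,Rachel,100' > "$datafile"
printf '%s\n' 'Dear #name#, thanks for the #amount#, #first#.' > "$inputfile"

lines=0
output=""
# The "|| [ -n "$f1" ]" part keeps a last record that has no final newline.
while IFS=',' read -r f1 f2 f3 f4 f5 f6 f7 || [ -n "$f1" ]
do
  (( ++lines ))
  if [ "$lines" -eq 1 ] ; then
    declare -a varname=($f1 $f2 $f3 $f4 $f5 $f6 $f7)
    continue
  fi
  declare -a value=("$f1" "$f2" "$f3" "$f4" "$f5" "$f6" "$f7")

  SUBS=""   # start fresh so one donor's values don't leak into the next
  for ((i=0; i<${#value[@]}; i++))
  do
    if [ -n "${value[$i]}" ] ; then
      SUBS="${SUBS:+$SUBS;}s/#${varname[$i]}#/${value[$i]}/g"
    fi
  done

  # Run sed once per record; the file name argument means sed never
  # reads from the loop's standard input.
  letter=$(sed "$SUBS" "$inputfile")
  echo "$letter"
  output="$output$letter|"
done < "$datafile"
rm -rf "$workdir"
```

With the two sample records, this prints one letter for Eldon Tyrell and one for Rachel.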
I'm going to stop here, however, and maybe next time, I'll add some system substitutions like #date# and figure out how to calculate #suggested#, which can be 50% of the actual donation. See you soon!