Finishing Up the Content Spinner
You'll recall that in my last article I shared a long, complex explanation for why spam email catches my attention and intrigues me, perhaps more than it should. Part of it is that I've been involved in email forever—I even wrote one of the most popular old-school email programs back in the day. But, there's also just the puzzle factor of taking a massive data set of millions of records and trying to produce "personalized" messages on such a large scale.
The easy version of this is to have named data fields like ${firstname}, so you can open your email with "Dear ${firstname}, I heard you went to ${college}? Me too!" and so on.
But, I'm more interested in the "spinning" side of things—the production of prose that has built-in synonyms, as exemplified by:
The {idea|concept|inspiration} is that each time you'd use a
{word|phrase} you instead list a set of {similar words|synonyms|
alternative words} and the software automatically picks one
{randomly|at random} and is done.
I know, you're likely shaking your head and wondering "what the deuce happened to Dave?", but humor me, let's explore this together as a text-processing puzzle.
In my June 2016 column, I presented the core building blocks of the article spinner, a script that could identify the {} surrounded choices, isolate them, count how many options were present and display it to the user as debugging output.
So, the above would be displayed as:
$ sh spinner.sh spinme.txt
The
3 options, spinning --- idea|concept|inspiration
is that each time you'd use a
2 options, spinning --- word|phrase
you instead list a set of
3 options, spinning --- similar words|synonyms|alternative words
and the software automatically picks one
2 options, spinning --- randomly|at random
and is done.
That's a good start, but this time, let's finish the job and actually pick randomly from the set of choices each time, output only the selected option and reflow the text to make it all look good.
Pick a Card, Any Card
The basic way to work with random numbers in Bash is to use the special $RANDOM variable. Each time it's referenced, it returns a randomly chosen number between 0 and MAXINT (32767). I constrain it to a specific range by using the modulus operator, so this will generate a random number between 0 and MAXVALUE - 1:
randomnum=$(( $RANDOM % $MAXVALUE ))
The double-parentheses notation triggers mathematical evaluation, but you already know that, right?
To make the range start at 1 instead of zero (so it runs from 1 to MAXVALUE), I just add a bit more math to the equation:
randomnum=$(( $RANDOM % $MAXVALUE + 1 ))
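As a quick sanity check on that arithmetic, here's a minimal sketch that wraps the one-liner in a function and rolls a six-sided die a few times. The name random_upto is a hypothetical helper for illustration only; it isn't part of the spinner script.

```shell
#!/usr/bin/env bash
# random_upto is a hypothetical helper, not part of the spinner script.
# It prints a random integer between 1 and the given maximum, inclusive.
random_upto() {
    local max=$1
    # RANDOM % max yields 0..max-1; adding 1 shifts that to 1..max
    echo $(( RANDOM % max + 1 ))
}

# Roll a six-sided die five times
for i in 1 2 3 4 5; do
    random_upto 6
done
```

Because 32768 possible values rarely divide evenly into the requested range, the modulus trick is very slightly biased toward the lower numbers. For picking among a handful of synonyms, that's not something to lose sleep over.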
The script already can identify how many choices are in a specific cluster (for example, "{one|two|three}"), and now we have a simple one-liner to help randomly pick one of the values. The challenge, of course, is to pick the actual string value, not just show a number!
I know, I know—work, work, work.
Halfway through the spinline() function (which I'll show in its entirety in just a sec), $choices stores the count of how many options are in the cluster, and $source is the set of choices, minus the open and close curly brackets.
Here's my first attempt at the random word extraction:
pick=$(( $RANDOM % $choices ))
wordpick=$( echo $source | cut -d\| -f$pick )
But that generates an error message whenever the random pick comes out as zero. It's not because of a typo (it's legit to use cut and specify the pipe symbol as the field delimiter) but because I haven't compensated for the 0..(n-1) selection of the random number generator: request field -f0 from cut, and it complains because, well, there is no field zero.
That's easily fixed now that I understand the problem, however, and so here's version two:
pick=$(( $RANDOM % $choices + 1 ))
wordpick=$( echo $source | cut -d\| -f$pick )
Remember that modulus returns 0..(n-1) for its values, so when there are three choices, for example, $RANDOM % 3 returns 0, 1 or 2. Add one to each, and it's back on track with the values 1, 2 and 3.
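To see the field extraction on its own, here's a small sketch that isolates the cut step. The helper name pick_option and the variable source_text are made up for this illustration; the real script does this work inline inside spinline().

```shell
#!/usr/bin/env bash
# pick_option is a hypothetical helper for illustration only.
# It prints field number $2 (counting from 1) of the pipe-separated list in $1.
pick_option() {
    printf '%s\n' "$1" | cut -d'|' -f"$2"
}

source_text="similar words|synonyms|alternative words"

pick_option "$source_text" 1    # prints: similar words
pick_option "$source_text" 3    # prints: alternative words

# A random choice: the + 1 keeps the field number in the range 1..3,
# which matters because cut has no field 0
choices=3
pick=$(( RANDOM % choices + 1 ))
pick_option "$source_text" "$pick"
```

Quoting "$1" matters here: it keeps multi-word options like "similar words" intact and stops the shell from treating any stray wildcard characters as filename patterns.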
With a few useful debugging lines, here's the function in its entirety:
function spinline()
{
  source="$*"
  # count the pipe separators; the number of options is one more than that
  choices=$(grep -o '|' <<< "$source" | wc -l)
  choices=$(( $choices + 1 ))
  echo "$choices options, spinning --- $source"
  # pick a field number in the range 1..$choices
  pick=$(( $RANDOM % $choices + 1 ))
  wordpick=$( echo "$source" | cut -d\| -f$pick )
  echo "I pick choice $pick which is $wordpick"
}
Yeah, code. Let's see what happens when I run it with the test sentence as input:
$ sh spinner.sh spinme.txt
The
3 options, spinning --- idea|concept|inspiration
I pick choice 2 which is concept
is that each time you'd use a
2 options, spinning --- word|phrase
I pick choice 1 which is word
you instead list a set of
3 options, spinning --- similar words|synonyms|alternative words
I pick choice 2 which is synonyms
and the software automatically picks one
2 options, spinning --- randomly|at random
I pick choice 2 which is at random
and is done.
It's close, actually—really close!
In fact, let's get rid of those superfluous debugging echo statements (actually, I always just comment them out instead by prepending # on each line, so that if I develop the script further, and things start to go sideways, I can simply uncomment the lines and figure out what's going on).
Here's the result:
$ sh spinner.sh spinme.txt
The
idea
is that each time you'd use a
word
you instead list a set of
synonyms
and the software automatically picks one
at random
and is done.
The magic really becomes apparent when the entire output is piped through the handy fmt command to put all the puzzle pieces back together on the line:
$ sh spinner.sh spinme.txt | fmt
The idea is that each time you'd use a word you instead list a set of
synonyms and the software automatically picks one randomly and is done.
Run it a second time, and it's the same concept being discussed, but the specific word choices are different:
$ sh spinner.sh spinme.txt | fmt
The idea is that each time you'd use a phrase you instead list a set of
alternative words and the software automatically picks one randomly and
is done.
So that's the program—mission accomplished.
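For comparison, here's a sketch of a different way to make the same pick: split the cluster into a Bash array instead of counting pipes and calling cut. This is a variant for illustration, not the function used above; the name spinline_array is hypothetical, and it assumes the cluster is non-empty and sits on a single line.

```shell
#!/usr/bin/env bash
# spinline_array is a hypothetical variant of spinline(), for illustration only.
# Assumes the cluster passed in is non-empty and contains no newlines.
spinline_array() {
    local source_text="$*"
    local -a options
    # Split on "|" into an array; IFS is changed only for this one read
    IFS='|' read -r -a options <<< "$source_text"
    # Array indexes start at 0, so the raw modulus result is already valid
    local pick=$(( RANDOM % ${#options[@]} ))
    printf '%s\n' "${options[pick]}"
}

spinline_array "similar words|synonyms|alternative words"
```

The array length takes the place of $choices, and since indexes are zero-based, there's no + 1 adjustment to forget.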
Don't Bug Me, Man!
It turns out that there's a bug in the script. It's a subtle one, but it's nonetheless tricky to solve: if the text to spin includes a word cluster followed immediately by punctuation, the punctuation ends up separated from the word by a stray space.
For example, consider if I slightly modified the spinme text like this:
The {idea|concept|inspiration} is that each time you'd
use a {word|phrase}, you instead list a
set of {similar words|synonyms|alternative words} and the
software automatically picks one
{randomly|at random} and is done.
See the added punctuation immediately after the word cluster on the second line? Here's what happens if I run this through the spinner script:
The inspiration is that each time you'd use a phrase , you instead list
a set of similar words and the software automatically picks one randomly
and is done.
See the problem? There shouldn't be a space before the comma. That's easily fixed with a sed statement, but it's an instance of a bigger problem, so rather than sed 's/ ,/,/g', I'm going to leave it to you, dear reader, to try to come up with a more generalized solution that takes into account all punctuation, including sequences like:
({cat|dog})
so that they'll be formatted properly in the final output.
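One possible starting point for that exercise, offered as a sketch rather than a finished answer: a sed filter that closes up the space before common closing punctuation and after an opening parenthesis. The name tidy_punctuation is hypothetical, and the filter assumes the stray spaces and the punctuation are on the same line by the time it sees them.

```shell
#!/usr/bin/env bash
# tidy_punctuation is a hypothetical filter, for illustration only.
# It assumes each stray space and its punctuation mark share a line.
tidy_punctuation() {
    # First expression: drop spaces before , . ; : ! ? and )
    # Second expression: drop spaces after an opening parenthesis
    sed -E -e 's/ +([,.;:!?)])/\1/g' -e 's/\( +/(/g'
}

printf '%s\n' 'use a phrase , or a ( cat ) and is done .' | tidy_punctuation
# prints: use a phrase, or a (cat) and is done.
```

Since the spinner emits each piece on its own line, the pieces would need to be joined before this filter runs, for example with tr '\n' ' ' ahead of it and fmt after it. That flattens any paragraph breaks, and quotation marks remain ambiguous (opening or closing?), so there's still plenty of puzzle left.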
And, that's a wrap for this article. For my next article, I'll look at, um, something or other. Perhaps it's time to start another game script?