System Status as SMS Text Messages
If you're paying really close attention, you'll remember that in my last article, I was exploring the rudiments of a script that would accept a list of words as input and create a word search grid, suitable for printing. It turns out that's crazy hard to do as a shell script—it just doesn't have the muscle to implement any sort of functional algorithm in an elegant fashion. So, I'm going to bail on it, at least until I can find someone else's open-source code I can explore for inspiration.
Or, of course, if you're motivated and have some time to experiment, go back to my April 2015 column, read through it, then try your own hand at implementing something.
Before I get letters about the oddity of being the shell script programming columnist who is bailing on a script, I will point out that there's a lot to learn from this experience actually. Most specifically, although it's nice to imagine that the Linux environment is completely egalitarian, and that every script, every language and every program is as powerful and well designed as every other, it's clear that's not the case.
Take Perl versus Awk, for example. Awk is powerful and I use it frequently, but although there are major software programs written in Perl, you'd be hard-pressed to find any significant software, functions, applications or utilities programmed directly in Awk. The same goes for C++ versus PHP, for example, or any modern structured language versus, well, the Bourne Again Shell. There, I said it. Shell script programming can take you only so far, and then you realize that you've hit the edges of the environment and its capabilities, and it's time to jump to another language.
Indeed, when I wrote my popular book Wicked Cool Shell Scripts, there was a tiny C program that snuck in by necessity: it was a few lines of C to do a certain date calculation that would have been dozens, if not hundreds, of lines of shell script.
Having said that, I will rush back to defend the shell as a powerful, lightweight programming and prototyping environment perfect for a variety of tasks because of its super-easy access to the power and capabilities of the entire Linux environment and, by extension, the entire Internet.
What's your take? You read this column, so it's reasonable for me to conclude that you are interested in learning more about programming within the Linux shell environment. How often do you bump into the bleeding edge of your shell and realize you have to flip into Perl, Ruby, C, Cobol (just kidding!) or another more sophisticated development environment to solve the problem properly?
Let's Talk about Text Messages
I was watching the Apple introduction of its new Apple Watch and was struck by the fact that, like a few of the high-end Android smart watches, it will show you the entirety of e-mail and text messages on the tiny watch screen. This means it's a great device for sysadmins and Linux IT folk to keep tabs on the status of their machine or set of machines.
Sure, you could do this by having the system send an e-mail, but let's go a bit further and tap into one of the e-mail-to-SMS gateways instead. Table 1 shows a list of gateway addresses for the most common cellular carriers in the United States.
Wireless Carrier | Domain Name |
AT&T | @txt.att.net |
Cricket | @mms.mycricket.com |
Nextel | @messaging.nextel.com |
Qwest | @qwestmp.com |
Sprint | @messaging.sprintpcs.com |
T-Mobile | @tmomail.net |
US Cellular | @email.uscc.net |
Verizon | @vtext.com |
Virgin | @vmobl.com |
For example, I can send a text message to someone on the AT&T network with the number (303) 555-1234 by formatting the e-mail like this:
3035551234@txt.att.net
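If you'd rather not assemble that address by hand, here's a minimal sketch of a helper that does it. The function name sms_address and the short carrier keywords are my own inventions for illustration; the domains are taken from Table 1, and only four carriers are included to keep it short:

```shell
#!/bin/sh
# sms_address: build an e-mail-to-SMS address.
# $1 = phone number in any common format, $2 = carrier keyword.
# The function name and keywords are hypothetical; domains are from Table 1.
sms_address() {
  digits=$(printf '%s' "$1" | tr -cd '0-9')   # keep only the digits
  case "$2" in
    att)     domain=txt.att.net ;;
    sprint)  domain=messaging.sprintpcs.com ;;
    tmobile) domain=tmomail.net ;;
    verizon) domain=vtext.com ;;
    *)       echo "unknown carrier: $2" >&2 ; return 1 ;;
  esac
  echo "${digits}@${domain}"
}

sms_address "(303) 555-1234" att
```

Run as shown, it prints 3035551234@txt.att.net, the same address as above.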
Armed with this information, you can monitor a lot of different statuses and get a succinct text message if something's messed up.
Worried about load averages becoming excessive? That's a figure easily accessible through the one-line output of uptime:
$ uptime
11:20 up 4 days, 22:44, 3 users, load averages: 1.08 1.40 1.46
The last three figures are the load average over the last 1, 5 and 15 minutes. In this case, the system barely is being tapped at all. But what if it jumped up to 10, or 35 or more than 100? Then everything would slow down. Here's how you could write a simple script to test for that condition:
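#!/bin/sh
# loadwatch.sh - send an alert if the 1-minute load average > MAXOK
MAXOK=10
# Grab the figure after "load average(s):" rather than counting fields,
# because the layout of uptime's output differs between systems.
loadavg=$(uptime | sed 's/.*load averages*: *\([0-9]*\).*/\1/')
if [ "$loadavg" -gt "$MAXOK" ] ; then
  echo "Alert: Load avg $(uptime | sed 's/.*load averages*: *\([0-9.]*\).*/\1/')"
fi
exit 0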
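One limitation of comparing whole numbers is that a threshold like 2.5 can't be expressed, and the exact layout of the uptime line varies (Linux, for instance, prints "load average:" and separates the figures with commas). Here's a rough sketch of one way around both, under the assumption that your uptime labels the figures with "load average" or "load averages". The function names first_load and load_exceeds are hypothetical, and the sample line is the one shown earlier:

```shell
#!/bin/sh
# first_load: read one line of uptime output on stdin and print the
# 1-minute load figure. (Hypothetical helper name.)
first_load() {
  sed 's/.*load averages*: *\([0-9.]*\).*/\1/'
}

# load_exceeds: succeed when load ($1) is greater than threshold ($2).
# awk does the comparison, so decimal values are fine. (Hypothetical name.)
load_exceeds() {
  awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 > max + 0) }'
}

sample="11:20 up 4 days, 22:44, 3 users, load averages: 1.08 1.40 1.46"
load=$(echo "$sample" | first_load)
if load_exceeds "$load" 0.75 ; then
  echo "Alert: Load avg $load"
fi
```

In a real script, you'd feed it the live output with uptime | first_load instead of the sample line.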
Armed with the information about the various SMS gateways, it's easy to hard code a recipient address, which changes just the echo line within the conditional:
mail -s "Alert: Load avg $(uptime | sed 's/.*load averages*: *\([0-9.]*\).*/\1/')" "$recipient"
where, earlier in the script, recipient is set to something like:
recipient=3035551234@txt.att.net
or as appropriate for your own smart watch or, um, other device.
For this script to be useful, you'd likely want to run it every few minutes so that when there is a spike in usage, you're alerted as soon as possible. This most easily would be a cron job, and if you haven't explored how your own custom cron jobs can make your life as even the most rudimentary of Linux users better, well, you're missing out!
To make the script run every ten minutes, here's how it might look in the root or even just your user crontab file:
0,10,20,30,40,50 * * * * /home/taylor/bin/loadwatch.sh
Modern crontabs have a more sophisticated notational language that can make this a wee bit more succinct:
*/10 * * * * /home/taylor/bin/loadwatch.sh
For this really to be useful, it might be better to have the script monitor state changes, so it'd notify you when the load rose above a specified threshold but not notify you again until it then went back down below that threshold.
This is done with what we old-school programmers call a semaphore, a state variable that remembers what's happening. Because a shell script is transient in nature, the semaphore needs to be a file. Typically these are located in a protected directory of some sort, but let's just drop it in your home directory for the purposes of this demo script.
The command-line utility that's useful to know at this point is lockfile(1), which ships with procmail. It manages the atomic creation of the semaphore so that you never hit what's called a "race condition", where two instances of the script might collide over which one creates the file.
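If lockfile isn't installed on your system, the shell's own noclobber option can stand in for it. To be clear, this is a substitute technique, not what lockfile does internally: with set -C, the ">" redirection refuses to overwrite an existing file, and as far as I know, common shells implement that with an exclusive open, so only one of two competing scripts gets to create the file. The function name claim_state is hypothetical:

```shell
#!/bin/sh
# claim_state: create the state file ($1) only if it doesn't already exist.
# Succeeds for the caller that creates it; fails if it's already there.
# (Hypothetical name; a noclobber-based stand-in for lockfile.)
claim_state() {
  # set -C (noclobber) makes ">" refuse to overwrite an existing file.
  # The subshell keeps the option from leaking into the rest of the script.
  ( set -C ; : > "$1" ) 2> /dev/null
}

dir=$(mktemp -d)                 # scratch directory for the demonstration
claim_state "$dir/.loadavg" && echo "first caller created the file"
claim_state "$dir/.loadavg" || echo "second caller was refused"
rm -rf "$dir"
```

Unlike lockfile, this never waits and retries; it simply reports failure, which is all a state flag needs.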
Here's how it'll work with the addition of the semaphore:
statefile=/home/taylor/bin/.loadavg
if [ -f "$statefile" ] ; then
  # statefile already exists, we're in a high load situation
  if [ "$loadavg" -gt "$MAXOK" ] ; then
    # still in high load situation
    echo "nothing to do, still in high load situation"
  else
    # high load situation has ended
    /bin/rm -f "$statefile"
    mail -s "Alert: load average back to normal" "$recipient" \
      < /dev/null > /dev/null 2>&1
  fi
else
  # statefile doesn't exist, so the load was okay on the last run
  if [ "$loadavg" -gt "$MAXOK" ] ; then
    # load average has jumped above OK level: create the statefile
    lockfile "$statefile"
    load=$(uptime | sed 's/.*load averages*: *\([0-9.]*\).*/\1/')
    mail -s "Alert: load average is $load" "$recipient" \
      < /dev/null > /dev/null 2>&1
  else
    # load average was okay and still is.
    echo "nothing to do, load average still ok."
  fi
fi
Of course, two of the four possible scenarios do nothing except print a debugging message. You'd really want to remove that code, clean up the if-then-else chain and shorten the script, because if this is going to run every ten minutes, you most assuredly do not want "no change" messages generated!
With that in mind, here's the more succinct code block:
if [ -f "$statefile" ] ; then
  # statefile already exists, we're in a high load situation
  if [ "$loadavg" -le "$MAXOK" ] ; then
    # high load situation has ended
    /bin/rm -f "$statefile"
    mail -s "Alert: load average back to normal" "$recipient" \
      < /dev/null > /dev/null 2>&1
  fi
else
  # statefile doesn't exist, so the load was okay on the last run
  if [ "$loadavg" -gt "$MAXOK" ] ; then
    # load average has jumped above OK level: create the statefile
    lockfile "$statefile"
    load=$(uptime | sed 's/.*load averages*: *\([0-9.]*\).*/\1/')
    mail -s "Alert: load average is $load" "$recipient" \
      < /dev/null > /dev/null 2>&1
  fi
fi
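If you'd like to watch the state changes happen without waiting for a real load spike or sending any texts, here's a small dry-run sketch. It restates the same logic as a function that takes the load, the threshold and the state file as arguments and prints "high" or "normal" only when the state flips. The function name transition and the made-up load figures are purely for illustration, and it uses a plain redirection rather than lockfile to create the file:

```shell
#!/bin/sh
# transition: report a state change, or print nothing if there is none.
# $1 = integer load, $2 = threshold, $3 = state file. (Hypothetical name.)
transition() {
  if [ -f "$3" ] ; then
    if [ "$1" -le "$2" ] ; then
      rm -f "$3"               # high load situation has ended
      echo normal
    fi
  elif [ "$1" -gt "$2" ] ; then
    : > "$3"                   # load has jumped above the threshold
    echo high
  fi
}

dir=$(mktemp -d)               # scratch directory for the dry run
for sample in 2 14 22 9 3 ; do
  echo "load $sample: $(transition "$sample" 10 "$dir/.loadavg")"
done
rm -rf "$dir"
```

With a threshold of 10, only the 14 and the 9 produce a report; the other three samples leave the state unchanged and print nothing after the colon.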
Note the extra work involved in using the command-line mail program: you have to redirect its input so that it isn't left waiting for a message on stdin, and redirect its output to discard the resultant "null message body" warning. That's what this does:
< /dev/null > /dev/null 2>&1
Otherwise, hopefully you can read through and see what it does.
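Because those redirections have to accompany every mail invocation, one option is to tuck them into a small function so they live in one place. This is just a sketch: send_alert is a hypothetical name, and the MAILER variable is my own addition so you can substitute a stand-in command while testing. It defaults to mail:

```shell
#!/bin/sh
# send_alert: mail an empty-bodied message whose subject carries the alert.
# $1 = subject text, $2 = recipient address.
# MAILER is a hypothetical override for testing; it defaults to mail.
send_alert() {
  # stdin from /dev/null so mail doesn't wait for a message body;
  # stdout and stderr discarded to hide the "null message body" warning.
  "${MAILER:-mail}" -s "$1" "$2" < /dev/null > /dev/null 2>&1
}
```

Each mail line in the script would then shrink to something like send_alert "Alert: load average is $load" "$recipient".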
What Else Could You Monitor?
Tracking load average is rather trivial when you think about all the many things that can go wrong on a Linux system, including processes that get wedged and use an inordinate amount of CPU time, disk space that could be close to filling up, RAM that's tapped out and causing excessive swapping, or even unauthorized users logging in.
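To give just one of those a starting point, here's a rough sketch of a disk space check. It assumes the six-column layout that df -P produces and mount points without spaces in their names; the function name full_disks and the 90% threshold are arbitrary choices for illustration:

```shell
#!/bin/sh
# full_disks: read "df -P" output on stdin and report every filesystem
# whose capacity is above $1 percent. (Hypothetical name.)
full_disks() {
  awk -v max="$1" 'NR > 1 { pct = $5; sub(/%/, "", pct)
                            if (pct + 0 > max) print $6 " is " $5 " full" }'
}

df -P | full_disks 90
```

Any lines it prints could become the subject of a text message in the same way as the load average alert.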
All of those situations can be analyzed, and alerts can be sent to you via e-mail or SMS text message, even to your shiny gold $17,000 Apple Watch. Now, you tell me, what do you think is worth monitoring on your system?