Book Excerpt: DevOps Troubleshooting: Linux Server Best Practices

This excerpt is from the book, 'DevOps Troubleshooting: Linux Server Best Practices' by Kyle Rankin, published by Pearson/Addison-Wesley Professional, ISBN 0321832043, Nov 2012, Copyright 2013 Pearson Education, Inc. For more info please visit: https://www.informit.com/store/devops-troubleshooting-linux-server-best-practices-9780321832047

Chapter 8: Is the Website Down? Tracking Down Web Server Problems

Get Web Server Statistics

Although there’s a fair amount of web server troubleshooting you can perform outside of the server itself, ultimately you will get in a situation where you want to know this type of information: How many web server processes are currently serving requests? How many web server processes are idle? What are the busy processes doing right now? To pull data like this, you can enable a special server status page that gives you all sorts of useful server statistics.

Both Apache and Nginx provide a server status page. In the case of Apache, it requires that you enable a built-in module named status. How modules are enabled varies depending on your distribution; for example, on an Ubuntu server, you would type a2enmod status. On other distributions you may need to browse through the Apache configuration files and look for a commented-out section that loads the status module; it may look something like this:

LoadModule status_module /usr/lib/apache2/modules/mod_status.so

After the module is loaded on Ubuntu systems, the server-status page is already configured for use by localhost. On other systems you may need to add configuration like the following to your Apache configuration:

ExtendedStatus On
<IfModule mod_status.c>
#
# Allow server status reports generated by mod_status,
# with the URL of https://servername/server-status
# Uncomment and change the ".example.com" to allow
# access from other hosts.
#
<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from localhost ip6-localhost
#    Allow from .example.com
</Location>

</IfModule>

Note that in this configuration example, we have really locked down who can access the page by saying deny from all hosts and only allow from localhost. This is a safe default because you generally don’t want the world to be able to view this kind of debugging information. As you can see in the commented-out example, you can add additional Allow from statements to add IPs or hostnames that are allowed to view the page.

For Nginx, you would add a configuration like the following to your existing Nginx config. In this example, Nginx will only listen on localhost, but you could change this to allow other machines on your local network:

server {
   listen 127.0.0.1:80;

   location /nginx_status {
           stub_status on;
           access_log off;
           allow 127.0.0.1;
           deny all;
   }
}

Once you have your configuration set and have reloaded the web server, if you have allowed some remote IPs to view the page, open up a web browser and access /server-status on your web server. For instance, if your web server was located at https://www.example.net, you would load https://www.example.net/server-status and see a page like the one shown in Figure 8-1.

Figure 8-1: Standard Apache status page

At the top of the status page, you will see general statistics about the web server including what version of Apache it is running and data about its uptime, overall traffic, and how many requests it is serving per second. Below that is a scoreboard that gives a nice at-a-glance overview of how busy your web server is, and below that is a table that provides data on the last request that each process served.

Although all of this data is useful in different troubleshooting circumstances, the scoreboard is particularly handy to quickly gauge the health of a server. Each spot in the scoreboard corresponds to a particular web server process, and the character that is used for that process gives you information about what that process is doing:

  • _

  • Waiting for a connection

  • S

  • Starting up

  • R

  • Reading the request

  • W

  • Sending a reply

  • K

  • Staying open as a keepalive process so it can send multiple files

  • D

  • Performing DNS lookup

  • C

  • Closing connection

  • L

  • Logging

  • G

  • Gracefully finishing

  • I

  • Performing idle cleanup of worker

  • .

  • Open slot with no current process

Figure 8-1 shows a fairly idle web server with only one process in a K (-keep-alive) state and one process in a W (sending reply) state. If you are curious about what each of those processes were doing last, just scroll down the page to the table and find the process of the correct number in the scoreboard. So, for instance, the W process would be found as server 2-16. It’s not apparent in the screenshot, but that process was actually the response to the request for the server-status page itself. You will also notice a few _ (waiting for connection) processes in the scoreboard, which correspond to the number of processes Apache is configured to always have running to respond to new requests. The rest of the scoreboard is full of ., which symbolize slots where new process could go—basically the MaxClients setting (the maximum number of processes Apache will spawn).

What you will notice as you refresh this page is that the objects in the scoreboard should change during each request. This scoreboard is handy when you want to keep an eye on your web server; just continually refresh the page. During a spike, you are able to see new processes get spawned, switch to W to serve the request, and then, if the spike in traffic subsides, those processes slowly change to _ and ultimately . as they are no longer needed.

Generally speaking, when you access the server status page, you do so from the command line while logged into the web server. This lets you restrict what hosts can view the page while still providing all the information you need. Now, by default, if you were to run curl against the regular server status page, you would get HTML output. However, if you pass the auto option to the server-status page, you will get text output that’s more useful for both command-line viewing and parsing by scripts:

$ curl https://localhost/server-status?auto
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
117   586  117   586    0     0   2579      0 --:--:-- --:--:-- --:--:--  2579117   586  11
7   586    0     0   1905      0 --:--:-- --:--:-- --:--:--     0
Total Accesses: 2343235
Total kBytes: 265925501
CPULoad: .0773742
Uptime: 6801454
ReqPerSec: .34452 
BytesPerSec: 40036.7
BytesPerReq: 116210
BusyWorkers: 53
IdleWorkers: 28  
Scoreboard: WW_W__W_W__W_K_W_W_K___WWWWW_WWKWWWW_WWWWWWWWWWWWWWWWWWKKKKKK_KW_.WC.CW_____K__
____.......................................................................................
...........................................................................................
...........................................................................................
................................................

When you want to monitor the status page of a server in the command line, although you could just run the curl command over and over, you could use a handy command called watch, which will run whatever command you specify every x number of seconds (by default 2). So if you wanted to keep an eye on the status page and have it refresh every 5 seconds on the command line, you would type

$ watch -n 5 'curl https://localhost/server-status?auto'

To exit out of watch, just hit Ctrl-C.

Solve Common Web Server Problems

Although it’s difficult to document how to solve any and all web server problems, you are likely to run into a few general classes of problems that have identifiable symptoms. This section will highlight some of the common types of problems you may find, their symptoms, and how to remedy them.

Configuration Problems

One common and relatively simple problem to identify in a web server is a configuration problem. Since web servers need to be reloaded to take on changes in their configuration, it can be tempting to make many changes without reloading the web server; however, if you do so, you are likely to find out during a server maintenance (or when you need to restart the server to load new SSL certificates) that there’s some syntax error in your configuration files and your server will refuse to start.

Both Apache and Nginx validate their configuration files when you start, restart, or reload the service, so that’s one way to find configuration errors—unfortunately, it also means that in the case of a problem, the server is down while you fix the errors. Fortunately, both web servers provide means to test configuration syntax and highlight any syntax errors while the server is still running.

In the case of Apache, the command is apache2ctl configtest. Be sure to run this command as a user who can read all of the configuration files (probably the root user). A successful run looks like this:

$ sudo apache2ctl configtest
Syntax OK

When there is a syntax error, this command will identify the file and line number of the error so it’s easy to find:

$ sudo apache2ctl configtest
apache2: Syntax error on line 233 of /etc/apache2/apache2.conf: Could not open configuration
  ↪file /etc/apache2/conf/: No such file or directory

In this case, the configuration file had a typo—the directory you wanted to include was /etc/apache2/conf.d.

Nginx also provides a syntax check by running nginx -t:

$ sudo nginx -t
the configuration file /etc/nginx/nginx.conf syntax is ok
configuration file /etc/nginx/nginx.conf test is successful

As with Apache, when Nginx detects an error, it tells you the file and line number:

$ sudo nginx -t
[emerg]: unknown directive "included" in /etc/nginx/nginx.conf:13
configuration file /etc/nginx/nginx.conf test failed

Permissions Problems

A common headache, especially for new web server administrators, is permission problems on the web server. Although both Apache and Nginx’s initial processes run as root, all subprocesses that actually do the work of serving content run as a user with more restricted permissions—usually a user like www-data or apache. If you are, for instance, uploading web pages as a different user, you may initially run into permissions problems until you make sure that each file you want to serve is readable by the www-data or apache user.

So what does a permissions problem look like from the outside? This example takes a basic Nginx setup and changes the permissions on the main index.html file so that it is no longer readable by the world. Then it uses curl to attempt to load the page:

$ curl https://localhost
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/0.7.65</center>
</body>
</html>

The output from the web page tells us the HTTP error even without having to tell curl to display it: a 403 Forbidden error. Unfortunately, although we can see the page is forbidden, from this output, we’re not yet sure why. At this point, though, we would turn to the Nginx error logs and see

2012/07/07 16:13:37 [error] 547#0: *2 open() "/var/www/nginx-default/index.html" failed  
↪(13: Permission denied), client: 127.0.0.1, server: localhost, request: "GET /  
↪HTTP/1.1", host: "localhost"

This error log lets us know that Nginx attempted to open /var/www/nginx-default/index.html, but permission was denied. At this point, we could check out the permissions of that file and confirm that it wasn’t readable by the www-data user Nginx runs as:

$ ps -ef | grep nginx
root       545     1  0 15:19 ?        00:00:00 nginx: master process /usr/sbin/nginx
www-data   547   545  0 15:19 ?        00:00:00 nginx: worker process

$ ls -l /var/www/nginx-default/index.html
-rw-r----- 1 root root 151 2006-08-30 03:39 /var/www/nginx-default/index.html

In this case, you could fix this permission problem with the chmod o+r command, which would add world read permissions to the file. Alternatively, you could also change the group ownership of the file so it was owned by the www-data group (or by a group that www-data was a member of).

Although some administrators may sidestep permissions problems by basically making all files readable and writeable by everyone, the security risks of doing so aren’t worth the easy fix. Instead, consider creating a group on the system whose membership includes both www-data or apache users (depending on what user your web server runs as) and the users you upload files as. If you do try the “chmod 777” method of making the file readable by everyone, use it only as a temporary sanity check to confirm that the problem truly is a permissions issue. Be sure after you have solved the problem to change permissions back to something more secure.

Sluggish or Unavailable Web Server

Although configuration and permission problems are pretty well defined, probably one of the more common web server problems you will troubleshoot is nice and vague—the server seems slow to the point it may even be temporarily unavailable. Although a large number of root causes make this kind of problem, this section will guide you through some common causes for sluggish web servers along with their symptoms.

High Load

One of the first things I check when a server is sluggish or temporarily unavailable is its load. If you haven’t already read through Chapter 2, read it to learn how to determine whether the server is suffering from high load, and if so, whether that high load is the result of your web server processes; if it is, you’ll learn how to determine whether the load is CPU, RAM, or I/O bound.

Once you have identified whether the load is high and that your web server processes are the issue, if the load is CPU-bound, then you will likely need to troubleshoot any CGIs, PHP code, and so on, that your web server executes to generate dynamic content. Go through your web server logs and attempt to identify which pages are being accessed during this high load period; then attempt to load them yourself (possibly on a test server if your main server is overloaded) to gauge how much CPU various dynamic pages consume.

If the load seems RAM-bound and you notice you are using more and more swap storage and may even completely run out of RAM, then you may be facing the dreaded web server swap death spiral. This shows up commonly in Apache prefork servers but could potentially show up in Apache worker or even Nginx servers in the right conditions. Essentially, when you configure your web server, you can configure the maximum number of web server instances the server will spawn in response to traffic. In Apache prefork, this is known as the MaxClient setting. When a server gets so much traffic that it spawns more web server processes than can fit in RAM, processes end up using the much slower swap space instead. This causes those processes to respond much more slowly than processes residing in RAM, which causes the requests to take longer, which in turn causes more processes to be needed to handle the current load until, ultimately, both RAM and swap are consumed.

To solve this issue, you will need to calculate how many web server processes can fit into RAM. First calculate how much RAM an individual web server process will take, then take your total RAM and subtract your operating system overhead. Then figure out how many Apache processes you currently can fit into the remaining free RAM without going into swap. You should then configure your web server so it never launches more processes than it can fit into RAM.

Of course, with modern dynamically generated web pages, setting this value can be a bit tricky. After all, some PHP scripts, for instance, use little RAM whereas others may use quite a bit. In circumstances like this, the best tactic is to look at all of the web server processes on a busy web server and attempt to gauge the maximum, minimum, and average amount of RAM a process consumes. Then you can decide whether to set the number of web servers according to the worst case (maximum amount of RAM) or the average case.

If your load is I/O bound, and the web server has a database back-end on the same machine, you might simply be saturating your disk I/O with database requests. Of course, if you followed the load troubleshooting guide from Chapter 2, you should have been able to identify database processes as the culprit instead of web server processes. In either case, you may want to consider either putting your database on a separate server, upgrading your storage speed, or going to Chapter 9 for more information on how to troubleshoot database issues. Even if the database server is on a separate machine, each web server process that is waiting on a response from the database over the network may still generate a high load average.

Otherwise, if the server is I/O bound but the problem seems to be coming from the web server itself and not the database, it could be that the software that powers your website running on the machine simply is saturating disk I/O with requests. Alternatively, if you have enabled reverse DNS resolution in your logs so that IP addresses are converted into hostnames, your web server processes could simply have to wait on each DNS query to resolve before it finishes its request.

Server Status Pages

One of the other main places to look when diagnosing sluggish servers, other than troubleshooting high load on the system, is in the server status page. Earlier in the chapter we talked about how to enable and view the server status page in your web server. In cases of slow or unavailable web servers, this status page gives a nice overall view of the health of your web server. You not only see system load averages, you can also see how many processes are currently busy and what they are doing.

If, for instance, you see something like this,

$ curl https://localhost/server-status?auto
. . .
Scoreboard: WWWWWWWWWWWWWKWWWWWKWWWWWWWWWWWKWWWWWWWWWWWWWWWWWWWWWWWKKKKKKWKWWWWCWCWWWWWWKWW
____WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW

you’ll know that this server is completely overloaded with requests. As you refresh this page, you may see a process open up every now and then, but clearly, just about every process is busy fulfilling a request. In this circumstance, you may just need to allow your web server to spawn more processes (if you can fit them in RAM), or, alternatively, it may be time to add another web server to help share the load.

Then again, if you see a scoreboard like the one shown earlier, but notice that your web server seems quite responsive, it could be that each web request is having to wait on something on the back end. Behavior like this can happen when an application server is overloaded with waiting requests (sometimes because, ultimately, the database server it depends on is overloaded), so although all the web server processes are busy, adding more wouldn’t necessarily help the issue—they would also still be waiting on the back end to respond.

On the other hand, you might see something like this:

Scoreboard: WW_W__W_W__W_K_W_W_K___WWWWW_WWKWWWW_WWWWWWWWWWWWWWWWWWKKKKKK_KW_.WC.CW_____K__
____.......................................................................................
...........................................................................................
...........................................................................................
................................................

This is a server that has many processes to spare, both ones that are loaded into RAM and ones that are waiting to be loaded. If your server is sluggish but your scoreboard looks like this, then you are going to need to dig into your web server logs and try to identify which pages are currently being loaded. Ultimately you will want to identify which pages on your site are taking so long to respond, and then you’ll need to dig into that software to try to find the root cause. Of course, it could also simply be that your web server is underpowered for the software it’s running, and if so, it’s time to consider a hardware upgrade.


© Copyright Pearson Education. All rights reserved.

Kyle Rankin is a Tech Editor and columnist at Linux Journal and the Chief Security Officer at Purism. He is the author of Linux Hardening in Hostile Networks, DevOps Troubleshooting, The Official Ubuntu Server Book, Knoppix Hacks, Knoppix Pocket Reference, Linux Multimedia Hacks and Ubuntu Hacks, and also a contributor to a number of other O'Reilly books. Rankin speaks frequently on security and open-source software including at BsidesLV, O'Reilly Security Conference, OSCON, SCALE, CactusCon, Linux World Expo and Penguicon. You can follow him at @kylerankin.

Load Disqus comments