Integrating Web Applications with Apache
When you deploy a web application, how do end users access it? Often, web applications sit behind a gateway through which end users reach them. One popular product for this role on Linux is the Apache web server. Although it can function as a normal web server, it also can act as a reverse proxy, relaying requests to other web servers.
In this article, I discuss what it takes to integrate a web application into Apache. This includes integrating the HTTP protocol functionality, customizing content to render properly and reusing pieces of configuration. Once you understand those basic bits of functionality, you'll have the tools you need to maximize your web applications' usability. So, let's get started!
Crash Course in RegEx
A mechanism that I use throughout this article that might need a brief introduction is regular expressions (or regex). A regex defines a text pattern to search for within a URL or to find and replace text within content, such as HTML or JavaScript. The text processing command sed uses regex to do searches and substitutions.
For each example below there will be three parts: input, regex pattern and output. The pattern will be applied to the input text and determine the value of the output text.
Example 1:
Input:
Name: Frank Sinatra
Genre: Jazz
Name: 2Pac
Genre: Rap
Name: Reel Big Fish
Genre: Ska
Regex pattern: "^Name: "
Output:
Name: Frank Sinatra
Name: 2Pac
Name: Reel Big Fish
This example searches the input text for text that matches the pattern "^Name: ". This pattern says, "Look for the text 'Name: ' at the beginning of each line." Since three lines begin with that text, only those three lines are returned. While "^" represents the beginning of a line, "$" represents the end of a line. So if you were to apply the pattern "a$", two lines would be returned ("Name: Frank Sinatra" and "Genre: Ska"). Let's expand on that example and use the input from Example 1 with a new pattern.
Example 2:
Regex pattern: "^Name: [0-9]"
Output:
Name: 2Pac
As you can see, I've taken the original regex pattern and added [0-9] to the end. This matches a single character that can be any digit from 0 to 9, which is why "Name: 2Pac" was the only line returned. You also can specify a range of alphabetic characters ([a-z] or [A-Z]).
Along with pattern selection, you also can do substitution with regex. There are two common formats for regex substitutions: s|pattern|replace|modifier or s/pattern/replace/modifier. In Apache, I find it easier to use the pipe-style substitution (the slashes in URLs then need no escaping). Example 3 uses the input from Example 1 with two lines appended ("Name: Frank Zappa" and "Genre: Unknown") and a new pattern.
Example 3:
Regex pattern: "s|^(.*)Frank(.*)$|\1Dwezil\2|g"
Output:
Name: Dwezil Sinatra
Genre: Jazz
Name: 2Pac
Genre: Rap
Name: Reel Big Fish
Genre: Ska
Name: Dwezil Zappa
Genre: Unknown
This pattern has a lot to dissect. One of the great features of regex is the ability to match any character. The dot operator will match any one character. The asterisk operator will match 0 or more of whatever character or operator preceded it. Putting these two operators together matches 0 or more of any character. Enclosing this in parentheses allows the matched text to be represented in the replace portion of the pattern with a variable. In this case, \1 represents the first block of text within parentheses and \2 represents the second.
The only characters that are explicitly being matched are "Frank". As such, the lines containing "Frank" will be replaced with everything up to "Frank" (represented by \1), "Dwezil", and everything following "Frank" (represented by \2). As you can see, the entirety of the text input was sent to the output, although modified by the pattern.
Protocol Integration
When it is decided that an application would benefit from Apache integration, there is a high likelihood that it will reside on a separate server from Apache. To integrate applications being accessed via HTTP fully, any or all of these modules may be used: mod_rewrite, mod_proxy, mod_ssl and mod_headers. Each of these modules allows you to customize the way communication between the end user and web servers occurs, from modifying HTTP header data to managing proxy connections to other servers.
First, let's look at mod_rewrite. There are a number of directives within the mod_rewrite module, but I cover only a handful here: RewriteEngine, RewriteCond and RewriteRule.
The RewriteEngine directive simply enables URL rewriting and is invoked as follows:
RewriteEngine on
RewriteRule allows the server to respond to an HTTP request to a specific URL by, among other things, returning an HTTP redirect (code 301 or 302), which will redirect the end user to a specified URL, or by sending a proxied request to a back-end server. Here's an example of issuing an HTTP redirect:
RewriteRule "^/google$" https://www.google.com [R=301]
In this example, when the URL of /google is accessed, the server will respond with an HTTP 301 that will redirect the user to https://www.google.com. Because the pattern is anchored with "^" and "$", this example will work only if the request URL path is exactly equal to "/google". If the need is to redirect on any URL starting with "/google", you could define a conditional redirect using RewriteCond as follows:
RewriteCond "%{REQUEST_URI}" "^/google"
RewriteRule "^.*$" https://www.google.com [R=301]
The RewriteCond directive has two parts: a test string to check and a regex pattern to match against it. In this example, you are looking in the REQUEST_URI server variable for anything beginning with "/google". If that condition is met, the RewriteRule on the following line is applied. Because the RewriteCond already decides which URLs qualify, the pattern in the RewriteRule is defined as "^.*$", which matches any URL.
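Put together, the pieces so far might look like the following minimal sketch. The <VirtualHost> wrapper, the ServerName of public.test and the port are assumptions for illustration, and the [L] flag, which stops further rules from being processed once this one matches, is an addition that the examples above do not use.

```apache
# Sketch only: assumes mod_rewrite is loaded, for example with
#   LoadModule rewrite_module modules/mod_rewrite.so
<VirtualHost *:80>
    # Hypothetical user-facing host name
    ServerName public.test

    # Enable the rewrite engine for this virtual host
    RewriteEngine on

    # Redirect any path beginning with /google
    RewriteCond "%{REQUEST_URI}" "^/google"
    RewriteRule "^.*$" "https://www.google.com" [R=301,L]
</VirtualHost>
```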
The examples given so far are all user-facing events like a 301 redirect. The RewriteRule directive also can be used to proxy requests to a server. Unlike an HTTP redirect, this is done behind the scenes, so the request is forwarded without the user's knowledge. A proxied request may be configured like the example below:
RewriteRule "^/home/(.*)$" https://back-end01.test:8080/$1 [P]
The above illustrates an example of a virtual root directory. When the user accesses anything underneath /home (note the ".*" expression), the request is sent to back-end01.test on port 8080 with the location set to the URL path beneath /home. For example, if the user tries to access /home/test/image.jpg, the request is sent to back-end01.test:8080 with a location of /test/image.jpg. A proxied RewriteRule also may be used in conjunction with RewriteCond for further customization. Note that the [P] flag hands the request to mod_proxy, so that module must be loaded, and that this statement does nothing to back-end URLs that come back in response headers such as Location. Rewriting those requires ProxyPassReverse from mod_proxy.
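As a hedged sketch of what the proxied rule needs around it: the [P] flag relies on mod_proxy and mod_proxy_http, and because the target here is an https:// URL, mod_ssl's SSLProxyEngine has to be switched on as well. The module file paths below are typical defaults and are assumptions; they may differ on your distribution.

```apache
# Sketch only: module file paths are assumptions and vary by distribution.
LoadModule rewrite_module    modules/mod_rewrite.so
LoadModule proxy_module      modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule ssl_module        modules/mod_ssl.so

RewriteEngine on

# Allow Apache to open TLS connections to the back end
SSLProxyEngine on

# Proxy everything beneath /home/ to the back end behind the scenes
RewriteRule "^/home/(.*)$" "https://back-end01.test:8080/$1" [P]
```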
Another option for proxying HTTP connections through Apache is mod_proxy, which provides ProxyPass, ProxyPassReverse and ProxyPassMatch among many other directives that provide more robust proxying options. I focus primarily on these three directives here. As mentioned previously, RewriteRule provides proxying of HTTP requests. Let's compare the example already given for proxying with RewriteRule and an example for ProxyPass:
ProxyPass /home/ https://back-end01.test:8080/
This ProxyPass statement provides roughly the same level of functionality as the RewriteRule statement with a simpler directive. When a request comes in for any URL beginning with "/home", the request line will be rewritten so that the request will be received properly by https://back-end01.test:8080/. Consider the following first lines of an HTTP request:
From user to server: GET /home/test/image.jpg HTTP/1.1
From server to back-end: GET /test/image.jpg HTTP/1.1
The first line of the request contains the method (GET in this case) and the URL being requested. When the server receives the request from the client, it strips off "/home", as specified in the ProxyPass directive, and forwards the request to the back-end server. The back end's response is relayed to the client automatically, but any redirect it issues will still name the back-end host. To have Apache rewrite those response headers as well, the following ProxyPassReverse statement can be paired with the previous ProxyPass statement:
ProxyPassReverse /home/ https://back-end01.test:8080/
The syntax is exactly the same as ProxyPass, adding to the simplicity of the mod_proxy configuration. ProxyPassReverse adjusts the URL in the Location, Content-Location and URI headers of back-end responses, so a redirect issued by the back end points at the matching /home URL on the proxy instead of at the back-end server itself. If you need to add some programmatic proxying (similar to RewriteCond), you can use ProxyPassMatch, which takes a regex instead of a plain path prefix. When implementing a reverse-proxy configuration, ProxyPassMatch can replace ProxyPass. Here's an example:
ProxyPassMatch "^/home/([a-z0-9]*/docs)" https://docserver01.test:8080/$1
ProxyPassReverse /home/ https://docserver01.test:8080/
This example suggests that within the /home folder, there are many sub-folders (let's say user names) and within each of those exists a folder named "docs". The USERNAME/docs URL exists on docserver01.test:8080 in the root of the web server, as denoted by the $1 in the server URL. The ProxyPassReverse will function in the same manner as it did in the previous example.
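For reference, a minimal sketch of how the ProxyPass and ProxyPassReverse pair might sit together in a config. ProxyRequests and SSLProxyEngine are additions assumed here: the first keeps Apache from acting as an open forward proxy, and the second is needed because the back end is addressed with an https:// URL. Note that the path and the target URL both end in a slash so the two sides line up.

```apache
# Sketch only: assumes mod_proxy, mod_proxy_http and mod_ssl are loaded.

# Act as a reverse proxy only, never as an open forward proxy
ProxyRequests Off

# Needed because the back end is reached over https://
SSLProxyEngine on

# Map /home/... on the proxy to /... on the back end
ProxyPass        "/home/" "https://back-end01.test:8080/"

# Rewrite Location headers in back-end redirects to the /home/ URL
ProxyPassReverse "/home/" "https://back-end01.test:8080/"
```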
Securing websites with SSL in Apache is accomplished with mod_ssl. Although I won't discuss configuring SSL from the ground up, a few directives relate to proxied SSL connections: SSLProxyCheckPeerExpire, SSLProxyCheckPeerName and SSLProxyCheckPeerCN.
It is a common practice to use self-signed certificates on back-end servers (provided a valid cert is in place on the user-facing server), and these directives address common issues that can arise when using self-signed certs. Each of these directives takes one of two arguments: "on" or "off". If set to "off", SSLProxyCheckPeerExpire will skip checking the expiration date on the SSL cert used on a back-end server. To avoid checking a certificate's common name or alternate names against the server name used to access a back end, set SSLProxyCheckPeerName to "off". In older versions of Apache, you might be able to use SSLProxyCheckPeerCN (set to "off") instead of SSLProxyCheckPeerName.
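A minimal sketch of these directives in use, assuming a back end with a self-signed certificate. SSLProxyEngine is an addition assumed here; it is what enables TLS connections to the back end in the first place. Keep in mind that each check you turn off is one less safeguard on the connection between Apache and the back end, so it is worth disabling only the ones that actually fail.

```apache
# Sketch only: assumes mod_ssl and mod_proxy are loaded.
SSLProxyEngine on

# Skip the expiry-date check on the back-end certificate
SSLProxyCheckPeerExpire off

# Skip matching the certificate's names against the back-end host name
SSLProxyCheckPeerName off

# Older Apache releases without SSLProxyCheckPeerName may need this instead
SSLProxyCheckPeerCN off
```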
Along with rewriting URLs, it may be necessary to rewrite HTTP request or response header fields. In Apache, this is done with mod_headers. There are only two directives within this module: Header and RequestHeader. These directives are used to modify response and request header fields, respectively. Many actions can be used with either of these directives, but here, let's look at the set and edit actions. For example:
Header set ReceiveTime "%t"
This example sets a header named ReceiveTime in the HTTP response, replacing any existing header of that name, and gives it the time the request was received by the server (represented by "%t"). The value takes the form "t=" followed by the number of microseconds since the UNIX epoch. If you need to replace the value of a header that comes from a back-end server, you would use the edit action. Consider the following example:
Header edit Location "^https://back-end01.test:8080/(.*)$" "https://public.test/$1"
This example rewrites the Location header in an HTTP response, which will exist in a 301/302 redirect. If it finds https://back-end01.test:8080 at the beginning of the Location header, it replaces that part with "https://public.test" (the user-facing URL).
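The RequestHeader directive works the same way in the other direction. The sketch below shows both directives side by side; the X-Forwarded-Proto header is a common convention for telling a back end that the client connected over HTTPS, and its use here is an assumption, not something the setup above requires.

```apache
# Sketch only: assumes mod_headers is loaded.

# Request side: tell the back end the client connection used HTTPS
# (X-Forwarded-Proto is a conventional header name, assumed here)
RequestHeader set X-Forwarded-Proto "https"

# Response side: stamp each response with the time the request was received
Header set ReceiveTime "%t"

# Response side: point back-end redirects at the user-facing site
Header edit Location "^https://back-end01.test:8080/(.*)$" "https://public.test/$1"
```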
Content Integration
Once a remote application is integrated with an Apache server from a protocol standpoint, it may be necessary to integrate content. This will generally manifest itself as URLs coded into HTML or JavaScript that are specific to a back-end server and not to a user-facing server. The basic necessity is to be able to search and replace bits of HTML or JavaScript content, so that it can render and perform correctly when accessed through an Apache proxy. The module that accomplishes this is mod_substitute and specifically the Substitute directive. Substitute allows a simple regex substitute to be performed on the payload data of an HTTP response.
Something to consider before attempting to replace text is whether the back-end web server compresses data before sending it over the network. If it does, your Substitute statements might not work, as Apache will be searching for ASCII text within binary compressed data. To account for this, you can instruct Apache to decompress the data, manipulate the response and then re-compress it. This is done using the SetOutputFilter directive, which is part of Apache core functionality. Here's how it works:
SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
Reading the arguments from left to right, this tells Apache to INFLATE (decompress) the data from the back-end server, perform the substitute and DEFLATE (compress) the data before returning it to the end user.
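A minimal sketch of where the filter chain might live, assuming the proxied application is mounted under /home/ as in the earlier examples. The INFLATE and DEFLATE filters are provided by mod_deflate and SUBSTITUTE by mod_substitute, so both modules need to be loaded. The commented-out RequestHeader line is an alternative that is not part of the setup described above: asking the back end not to compress at all.

```apache
# Sketch only: assumes mod_deflate and mod_substitute are loaded.
<Location "/home/">
    # Decompress, run substitutions, then recompress for the client
    SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE

    # Alternative (needs mod_headers): strip Accept-Encoding so the back end
    # sends uncompressed content, then use "SetOutputFilter SUBSTITUTE" alone.
    # RequestHeader unset Accept-Encoding
</Location>
```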
The Substitute statement uses a regex substitute expression. As I mentioned previously, I found it easier to use the pipe-style substitute expression in Apache. To recap, the syntax is s|search|replace|options. Two options worth knowing are "i", which denotes a case-insensitive search, and "n", which treats the search value as a fixed string instead of a regex. Regex is the default, so leave "n" off whenever the pattern uses groups or other regex operators. Here's a common use example:
Substitute "s|(href=\"http)(://)back-end01.test:8080|$1s$2public.test|i"
For this example, let's assume that the user-facing site (public.test) runs HTTPS, and the back-end server (back-end01.test) runs HTTP on port 8080. This would be a solution if the back-end web server returned hyperlinks that were specific to itself as opposed to the user-facing site. The search portion of the regex substitute splits out two groups of text in parentheses: (href=\"http) and (://). These are blocks of text that you want preserved in the replace section of the regex. In the replace, you are inserting an "s" after http and replacing the hostname/port with the user-facing site name. After processing, the resulting string will be href="https://public.test. This will update hyperlinks that use "href" attributes (<a> and <link>). For <img> and <script> tags, you could use this same Substitute statement and replace "href" with "src". Another consideration would be to account for double or single quotes delimiting attribute values (href=' vs. href=").
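Both of those considerations can be sketched as follows, under the same assumptions as the example above (public.test on HTTPS, back-end01.test on HTTP port 8080). The character class ['\"] matches either a single or a double quote after the attribute name, and because it sits inside the first group, whichever quote was found is preserved in the output.

```apache
# Sketch only: assumes mod_substitute is loaded and the SUBSTITUTE filter is active.

# Rewrite back-end links in href attributes, with either quote style
Substitute "s|(href=['\"]http)(://)back-end01.test:8080|$1s$2public.test|i"

# Same rewrite for src attributes (<img>, <script>)
Substitute "s|(src=['\"]http)(://)back-end01.test:8080|$1s$2public.test|i"
```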
Another application of Substitute is to extend the functionality of a page without manipulating the original source code. Consider the following example:
Substitute "s|(<body[^>]*>)|$1<div style=\"font-size:14pt;font-weight:bold;background-color:#ff0000;color:#ffffff;display:block;text-align:center;\">This site will be down for 24 hours beginning at 8 pm tonight</div>|i"
If a website needs to be taken off-line for maintenance, this is an easy way to alert the user population of the outage without modifying the application itself. This example simply inserts a red bar along the top of the page (right after the <body> tag), which displays information about the outage. Depending on how your page is rendered, you might need to choose another tag to act as your starting point instead of <body>.
Streamlining Future Integrations
All of the topics presented here can be configured and maintained relatively easily if you have only a few statements. In the real world, there typically will be many sites that use a similar configuration, and having to define the functionality for each site can be time-consuming and can lead to mistakes. Luckily, Apache provides a mechanism to repeat functionality throughout your configuration through the use of mod_macro. The <Macro> directive within an Apache config functions very much like a function or subroutine. Once a macro is defined, it can be referenced as many times as is necessary, leaving you with one place within your config to maintain your detailed functionality. Here's an example macro:
<Macro RedirectSecure $host $path>
RewriteCond "%{REQUEST_URI}" "^$path"
RewriteRule "^/(.*)$" "https://$host/$1"
</Macro>
When called, this macro will define a RewriteCond and RewriteRule that, if the user accesses a URL starting with the value of the $path argument, will redirect the user to https://$host/$1, where $host is the hostname specified as a macro argument and $1 is the requested URL path without its leading slash. The following syntax would be used to call this macro:
Use RedirectSecure public.test /users
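To make the expansion concrete, here is roughly what that Use line turns into once mod_macro substitutes the arguments. The surrounding <VirtualHost>, ServerName and RewriteEngine lines are assumptions added for context, and the macro itself is expected to have been defined earlier in the config.

```apache
# Sketch only: assumes mod_macro and mod_rewrite are loaded and that the
# RedirectSecure macro has already been defined.
<VirtualHost *:80>
    # Hypothetical host name for the plain-HTTP site
    ServerName public.test
    RewriteEngine on

    Use RedirectSecure public.test /users
    # ...which mod_macro expands to:
    #   RewriteCond "%{REQUEST_URI}" "^/users"
    #   RewriteRule "^/(.*)$" "https://public.test/$1"
</VirtualHost>
```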
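Something to consider is the location within the Apache config from which a macro is called. The directives inside a macro are subject to the same context rules as if they had been typed at the point of the Use line. A RewriteRule, for example, defined in the main server config is not inherited by a <VirtualHost> block by default, so a macro containing one generally belongs inside the <VirtualHost> it should affect. If a macro expands to a directive that isn't allowed in the context it is called from, Apache will throw an error and not start. Here's another example: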
<Macro ReplaceContentURL $backendurl $publicurl>
Substitute "s|(href=\")$backendurl|$1$publicurl|i"
Substitute "s|(src=\")$backendurl|$1$publicurl|i"
</Macro>
This macro expands on the replacing of URLs that I covered previously. This will search for tag attributes of "href" and "src" and replace the hyperlinks of the back-end server with that of the user-facing server. Here's an example of how this might be called:
Use ReplaceContentURL https://back-end01.test:8080 https://public.test
This will search for https://back-end01.test:8080, preceded by either href=" or src=", and replace the URL with https://public.test. Macros can be used for any piece of Apache configuration. They can be used to do small tasks as shown here as well as whole site configurations. Although macros are pretty simple, they make the difference between a large amount of difficult-to-maintain configuration files and a simplified reusable configuration.
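As an illustration of the whole-site case, the sketch below wraps a complete reverse-proxied virtual host in a macro. The macro name, its parameter names and the app1.test, app2.test and back-end02.test host names are all hypothetical. The back ends are addressed over plain HTTP here to keep the example short.

```apache
# Sketch only: assumes mod_macro, mod_proxy and mod_proxy_http are loaded.
# The macro name, parameters and host names are made up for illustration.
<Macro ProxiedSite $name $backend>
    <VirtualHost *:80>
        ServerName $name
        ProxyPass        "/" "$backend/"
        ProxyPassReverse "/" "$backend/"
    </VirtualHost>
</Macro>

# One line per site instead of one virtual host block per site
Use ProxiedSite app1.test http://back-end01.test:8080
Use ProxiedSite app2.test http://back-end02.test:8080

# Optional: drop the definition once it is no longer needed
UndefMacro ProxiedSite
```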
At this point, you have some basic knowledge of integrating HTTP, customizing content and reproducing configuration within Apache. Although many directives and modules weren't covered here, this will be a great starting point and can help you get started with accessing your applications through Apache.
Resources
The following are some articles I've found useful along with some example Apache configs I've written.
Apache Module Reference (2.2): https://httpd.apache.org/docs/2.2/mod
Apache Module Reference (2.4): https://httpd.apache.org/docs/2.4/mod
Git Instaweb Reverse Proxy: https://git.andydoestech.com/git/scripts/.git/tree/config/gitreverseproxy.conf
Monit Reverse Proxy: https://git.andydoestech.com/git/scripts/.git/tree/config/monit.conf
Adobe Experience Manager Apache Config: https://git.andydoestech.com/git/aem-dispatcher-config/.git/tree