About the Service Monitoring Script
This is a *very* simple PHP script used to monitor the services running on a server.
I wrote it in PHP because it was the quickest and easiest option at the time, but the idea could
certainly be ported to any other language, or even a shell script. In fact, I'll probably
rewrite it as a shell script at some point in case a server doesn't have PHP installed.
How the Script is able to monitor the services
The script needs to run every couple of minutes on the system so it can check that
the services are running (and do something if they are not). This is done by
setting the script to run as a cron job. More information about cron jobs can be found
online, but here is a basic tutorial.
* * * * * user command_to_be_executed
- - - - -
| | | | |
| | | | +----- day of week (0 - 6) (Sunday=0)
| | | +------- month (1 - 12)
| | +--------- day of month (1 - 31)
| +----------- hour (0 - 23)
+------------- min (0 - 59)
For instance, the following would execute the command "/sbin/runscript" every hour on the hour as root:
0 * * * * root /sbin/runscript
The following would run the command "wget http://google.com" every day at 3am and 5pm as user "bob":
0 3,17 * * * bob wget http://google.com
The following would run the command "service apache2 start" every 5 minutes as the root user:
*/5 * * * * root service apache2 start
The cron commands and time configurations are saved in /etc/crontab. You can also use a user's own cron file by running crontab -e as that user; if you do that, leave out the "user" field shown above. The crontab file has a list of
time configurations and commands (like the examples above). Also, it is important to note that
having a "#" at the beginning of a line makes that line a comment in the crontab file.
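To make the field layout concrete, here is a minimal Python sketch that splits one /etc/crontab-style line into its parts. The names CronEntry and parse_system_crontab_line are made up for this illustration, and the sketch assumes the system format with a user field; it does not handle environment lines such as PATH=... or validate the time fields.

```python
from typing import NamedTuple, Optional


class CronEntry(NamedTuple):
    """One entry of a system crontab (hypothetical helper type)."""
    minute: str
    hour: str
    day_of_month: str
    month: str
    day_of_week: str
    user: str
    command: str


def parse_system_crontab_line(line: str) -> Optional[CronEntry]:
    """Split one /etc/crontab-style line into its fields.

    Returns None for blank lines and for comment lines starting with "#".
    Assumes the system format that includes a user field.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # Five time fields, then the user; whatever is left is the command.
    parts = stripped.split(None, 6)
    if len(parts) < 7:
        raise ValueError("expected 5 time fields, a user, and a command")
    return CronEntry(*parts)
```

Calling parse_system_crontab_line("0 3,17 * * * bob wget http://google.com") gives an entry whose hour is "3,17", whose user is "bob", and whose command is "wget http://google.com".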
Below is the crontab file for my monitor script.
# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly
# this is the monitor script that emails us if apache or mysql stops
*/5 * * * * root php /root/monitor
Notice the line that runs /root/monitor: that is my monitoring script! As you can see, I have it
running every 5 minutes as root. If running it as root concerns you, it should work as
a normal user too. That said, the script accepts no input parameters and does not read any files, and
I keep it in the /root directory, owned by the root user, with its permissions set to 700, so
running it as root should not really matter. 700 permissions mean other users cannot
execute the script, write the script, or even read the script. If you don't know how that works,
look up information on the commands "chmod", "chown", and "chgrp", which are used to change
the permissions and owners of a file in Linux. Let me make this clear: If you run the script from
the crontab file as the root user, ONLY ALLOW THE ROOT USER ACCESS TO THAT FILE!
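If you want to double-check the permissions from code rather than by eye, a small Python sketch like the following could do it. The function names is_owner_only and is_root_only are invented for this example; the check itself only relies on the standard os and stat modules.

```python
import os
import stat


def is_owner_only(path: str) -> bool:
    """Return True if group and others have no permissions at all (e.g. mode 700)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def is_root_only(path: str) -> bool:
    """Return True if the file is owned by root (uid 0) and closed to everyone else."""
    return os.stat(path).st_uid == 0 and is_owner_only(path)
```

For the setup described above, is_root_only("/root/monitor") should come back True when run as root; a file with mode 755 would fail the check because group and others can still read and execute it.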
For more information on cron and crontab files, go here, here, and here. You might also
notice that most crontab documentation does not mention the user being in the line, only the 5
time fields (for when to run the command) and the command. That is because the user field
belongs to the system-wide /etc/crontab file; per-user crontabs edited with crontab -e leave
it out. Both servers I administer (one mine and one not mine) have a user field there, so I
used it. I suggest you take a look at your crontab file and take a cue from the lines already
there to know if you should specify the user or not.
How the Script Works
The script will first check to see if the services are running. To do this, it needs the port that
the service should be listening on. Then it runs the command "netstat -an | grep 0.0.0.0:X"
(where X is the port number) on the command line and checks the output. If the command does not
return anything, the service is not listening on the port and is presumed to be in trouble.
The script will run through all the services, compiling a list of any that are stopped.
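The same check can be sketched in Python as a function that looks through netstat output for a given port. This is only an illustration, not the script itself: the name is_listening is made up, and it assumes the Linux netstat -an layout where the local address is the fourth column. It compares the whole column, so port 80 does not accidentally match 8080, and like the grep above it only looks for sockets bound to 0.0.0.0.

```python
def is_listening(netstat_output: str, port: int) -> bool:
    """Return True if the netstat text shows a socket bound to 0.0.0.0:<port>."""
    wanted = "0.0.0.0:%d" % port
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumption: the local address is the fourth column of each row.
        if len(fields) >= 4 and fields[3] == wanted:
            return True
    return False


def find_stopped(netstat_output: str, services: dict) -> dict:
    """Return the subset of {name: port} that is not listening."""
    return {name: port for name, port in services.items()
            if not is_listening(netstat_output, port)}
```

You would feed it the text produced by netstat -an (for example captured with subprocess.run) together with a dictionary such as {"Apache": 80, "MySQL": 3306}.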
If any are stopped, the script will check for a special lock file located in /tmp before the
alerts are sent, and if it exists it will not send out any. If the lock file does not exist,
it will create it and then send out the alerts. Once the script sees that no services are stopped
anymore, it will remove the lock file if it exists. In this way, the script will only send out
the alerts once, so it will not flood your email address or text your cellphone into oblivion.
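The lock file logic is easy to get backwards, so here is a minimal Python sketch of the decision. The function name should_send_alerts is hypothetical, and the lock path is passed in rather than hard-coded to /tmp so the idea can be tried out safely.

```python
import os


def should_send_alerts(stopped_services, lock_path):
    """Decide whether to alert, creating or removing the lock file as needed.

    Returns True only on the first run that sees stopped services.
    """
    if stopped_services:
        if os.path.exists(lock_path):
            # Alerts already went out for this outage; stay quiet.
            return False
        with open(lock_path, "w") as lock:
            lock.write("FILLER\n")
        return True
    # Everything is running again: clear the lock so the next outage alerts.
    if os.path.exists(lock_path):
        os.remove(lock_path)
    return False
```

The first call with a non-empty list of stopped services returns True and creates the lock; repeated calls return False until a call with no stopped services removes the lock again.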
The full script can be seen below, or can be downloaded here.
<?php
/*============================================================================
BEGIN OPTIONS SECTION. THESE ARE THE OPTIONS FOR THE SCRIPT.
============================================================================*/
// email this person if things go wrong!!
$emails[0] = "myemail@mydomain.com";
$emails[1] = "anotheremail@mydomain.com";
$emails[2] = "somebody@somewhere.com";
$emails[3] = "youget@theidea.com";
// text these phone numbers using the appropriate SMS gateways
// for more info on SMS, see http://en.wikipedia.org/wiki/SMS_gateways
// we make the SMS email message shorter so the text message is not too expensive!
$sms[0] = "4045557777@mymetropcs.com"; // michael's cell with Metro PCS
$sms[1] = "7705558888@messaging.sprintpcs.com"; // Reece's Cell with Sprint
// list the services and ports we need to check.
// note that the service name is the key and the port is the value.
$service['Apache'] = 80;
$service['MySQL'] = 3306;
// set up the email information (name and email the message says it is from)
$from_name = "Monitor Script";
$from_email = "monitor@mydomain.com";
/*============================================================================
BEGIN THE INITIAL SETUP. THIS WILL INITIALIZE SOME VARIABLES.
============================================================================*/
// set up headers
$headers = "From: $from_name <$from_email>\n";
$headers.= "Reply-To: $from_name <$from_email>\n";
$headers.= "Content-Type: text/plain\n";
// get the current date/time
$time = date("Y-m-d H:i:s");
// the name of the lock file
$lockfile = "MONITOR.lck";
// collects the services found not to be running (service name => port)
$error = array();
// clear cache used for file_exists() function so we get updated data
clearstatcache();
/* ============================================================================
BEGIN SERVICE CHECKING TO SEE WHICH SERVICES ARE NOT RUNNING.
============================================================================*/
// loop through services
foreach($service as $k=>$v)
{
    // quick sanity checks
    if(!is_numeric($v)) continue;
    if($k=="") continue;
    // build the command
    // (note: grep matches substrings, so port 80 would also match 0.0.0.0:8080)
    $cmd = "netstat -an | grep 0.0.0.0:".$v;
    // get the output of the command and see if it's blank
    if(shell_exec($cmd)==""){
        // debug echo so you can run this manually.
        echo "[ERROR] $k is NOT listening on port $v\n";
        $error[$k] = $v;
    }else{
        // debug echo so you can run this manually.
        echo "[SUCCESS] $k is listening on port $v\n";
    }
}
/* ============================================================================
BEGIN ACTIONS IF A SERVICE WAS FOUND NOT TO BE RUNNING
============================================================================*/
// if any service was not listening on its designated port, perform actions
if( count($error) > 0)
{
    // if lock file exists, exit and do nothing because we already did actions.
    if(file_exists("/tmp/$lockfile")){
        echo "Lock file exists already, exiting.\n";
        exit;
    }
    // try to set lock file so that we don't send more than one alert
    // lock file MUST be in tmp directory so file will be removed on reboot automatically
    echo "Creating lock file\n";
    if(strlen($lockfile)>0) shell_exec("echo \"FILLER\" > /tmp/$lockfile");
    // set main message that will be put in messages file
    $msg = "Monitor Script Found services *NOT* running at $time :";
    // set subject
    $subject = "[SERVER ERROR] Services Reporting as Stopped";
    // set the body of the email for the main emails
    $body = "\nThis is a Monitor Script Alert email message.\n\n";
    $body.= "At $time, the following services have problems:\n\n";
    // build the list of services found stopped
    foreach($error as $k=>$v){
        $body.= "$k on port $v\n";
    }
    $body.= "\nAnother alert will not be sent until all services are found to be running.\n\n";
    // begin sending emails.
    foreach($emails as $email){
        if(mail($email, $subject, $body, $headers)){
            shell_exec("echo \"$msg Email Sent Successfully to $email\" >> /var/log/messages");
        }else{
            shell_exec("echo \"$msg EMAIL SEND ERROR TO $email\" >> /var/log/messages");
            // debug echo in case you manually run this.
            echo "Email Sending Failed.\n";
        }
    }
    // set up SMS body and send SMS emails to text cell phones
    $subject = "SERVER ERROR. ";
    $body = "Services stopped:";
    // add list of services not running to the sms message
    foreach($error as $k=>$v){
        $body.= " $k";
    }
    // loop through sms emails and send the text messages
    foreach($sms as $email){
        if(mail($email, $subject, $body, $headers)){
            shell_exec("echo \"$msg SMS Sent Successfully to $email\" >> /var/log/messages");
        }else{
            shell_exec("echo \"$msg SMS SEND ERROR TO $email\" >> /var/log/messages");
            // debug echo in case you manually run this.
            echo "SMS Sending Failed.\n";
        }
    }
}
// if all services are okay and running, do this.
else{
    // sanity check
    if(strlen($lockfile)>0){
        // see if lock file exists or not
        if(file_exists("/tmp/$lockfile")){
            // if lock file exists, remove it since everything is okay and running now.
            echo "Removing lockfile\n";
            shell_exec("rm -f /tmp/$lockfile");
        }else{
            echo "No lock file to remove\n";
        }
    }else echo "Error: Lockfile is blank\n";
}
Download the script
You can download the script below. Please make sure you change the emails and SMS emails that the
alerts are sent to before you run it! Also, you can run the file manually with the same command
you put in the crontab file, and it should print out basic debugging messages for you. That way
you don't have to wait 5 minutes every time you want to test a change.
Taking it further
The above PHP script does not take zombie processes into account, though, which might cause issues. I saw this
a bit when dealing with Redmine running as its own Ruby process. Ideally you would set it to run under
Apache via mod_ruby, but in this instance we wanted to run it as its own process.
In order to fix it, I simply wrote a
watchdog process in my new language of choice, Python! It basically checks every 5 minutes (via cron) to see
if it can connect to the Redmine HTTP server, and if not it will start the server again.
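import http.client  # this module was called httplib in Python 2
import os

def main():
    # pass the timeout by keyword: as a third positional argument,
    # Python 2's httplib would read the 5 as "strict", not as a timeout
    h3 = http.client.HTTPConnection('localhost', 3000, timeout=5)
    try:
        h3.request("GET", "/")
    except Exception:
        # could not reach redmine, so start it again in the background
        cmd = "/redmine/script/manualrun -e production 1>/tmp/redmine.log 2>&1 &"
        os.system(cmd)
    finally:
        h3.close()

# just redirects the call of __main__ to main()
if __name__ == "__main__":
    main()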