Posts Tagged ‘alerts’

[More Splunk: Part 4] Narrow search results to create an alert

Wednesday, January 30th, 2013

This post continues [More Splunk: Part 3] Report on remote server activity.

Now that we have Splunk generating reports and turning raw data into useful information, let’s use that information to trigger an automatic action, such as sending an email alert.

In the prior posts a Splunk Forwarder was gathering information using a shell script and sending the results to the Splunk Receiver. To find those results we used this search string:

host="TMI" source="/Applications/splunkforwarder/etc/apps/talkingmoose/bin/counters.sh"

It returned data every 60 seconds that looked something like:

2012-11-20 14:34:45-08:00 MySQLCPU=23.2 ApacheCount=1
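The earlier posts built counters.sh on the Forwarder to emit lines in this format. A minimal sketch of such a collection script might look like the following; the ps/awk commands and the exact timestamp format are assumptions here, not the original script (note %z prints the UTC offset without a colon):

```shell
#!/bin/sh
# Hypothetical sketch of a Forwarder-side collection script; the real
# counters.sh from the earlier posts may differ.
TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S%z")

# Sum CPU% across any mysqld processes; prints 0.0 if none are running
MYSQLCPU=$(ps -A -o pcpu= -o comm= | awk '$2 ~ /mysqld/ {sum+=$1} END {printf "%.1f", sum+0}')

# Count running Apache (httpd) processes; grep -c prints 0 when none match
APACHECOUNT=$(ps -A -o comm= | grep -c httpd || true)

LINE="$TIMESTAMP MySQLCPU=$MYSQLCPU ApacheCount=$APACHECOUNT"
echo "$LINE"
```

Splunk then extracts MySQLCPU and ApacheCount as fields from each line it indexes.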

Using the timechart function of Splunk we extracted the MySQLCPU field to get its value 23.2 and put that into a graph for easier viewing.

Area graph

Returning to view that graph every few minutes, hours or days can get tedious if nothing really changes or the data isn’t out of the ordinary. Ideally, Splunk would watch the data and alert us when something is out of the ordinary. That’s where alerts are useful.

For example, the graph above shows the highest spike in activity to be around 45% and we can assume that a spike at 65% would be unusual. We want to know about that before processor usage gets out of control.

Configuring Splunk for email alerts

Before Splunk can send email alerts it needs basic email server settings for outgoing mail (SMTP). Click the Manager link in the upper right corner and then click System Settings. Click on Email alert settings. Enter public or private outgoing mail server settings for Splunk. If using a public mail server such as Gmail then include a user name and password to authenticate to the server and select the option for either SSL or TLS. Be sure to append port number 465 for SSL or 587 for TLS to the mail server name.

Splunk email server settings

In the same settings area Splunk includes some additional basic settings. Modify them as needed or just accept the defaults.

Splunk additional email server settings

Click the Save button when done.

Refining the search

Next, select Search from the App menu. Let’s refine the search to find only those results that may be out of the ordinary. Our first search found all results for the MySQLCPU field but now we want to limit its results to anything at 65% or higher. The where function is our new friend.

host="TMI" source="/Applications/splunkforwarder/etc/apps/talkingmoose/bin/counters.sh" | where MySQLCPU >= 65

This takes the results from the Forwarder and pipes them into an operation that returns only events where the MySQLCPU field is greater than or equal to 65. The search results, we hope, are empty. To verify the search is working correctly, temporarily change the value from 65 to something lower such as 30 or 40. The lower values should return multiple results.
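Outside of Splunk, the same threshold filter can be sketched with awk over a couple of sample Forwarder lines; this is purely an illustration of what the where clause does, not part of the Splunk setup (the sample events are hypothetical):

```shell
# Two hypothetical Forwarder events; awk stands in for Splunk's where
# clause and keeps only lines whose MySQLCPU value is >= 65.
RESULT=$(printf '%s\n' \
  '2012-11-20 14:34:45-08:00 MySQLCPU=23.2 ApacheCount=1' \
  '2012-11-20 14:35:45-08:00 MySQLCPU=67.8 ApacheCount=1' |
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^MySQLCPU=/) {
        split($i, kv, "=")        # kv[2] holds the numeric value
        if (kv[2] + 0 >= 65) print
      }
  }')
echo "$RESULT"
```

Only the 67.8 event survives the filter, which is exactly what the alert will key on.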

As a side note, if we wanted an alert for a range of values, an AND operator connecting two statements limits the results to anything between the two values:

host="TMI" source="/Applications/splunkforwarder/etc/apps/talkingmoose/bin/counters.sh" | where MySQLCPU >= 55 AND MySQLCPU <= 65

Creating an alert

An alert evaluates this search as frequently as Splunk receives new data, and if the search returns any results at all, the alert can trigger an action automatically.

With the search results (or lack of them) in view, select Alert… from the Create drop-down menu in the upper right corner. Name the search “MySQL CPU Usage Over 65%” or something that’s recognizable later. One drawback with Splunk is that it won’t allow renaming the search later; doing that requires editing .conf files by hand. Leave the Schedule at its default, Trigger in real-time whenever a result matches. Click the Next button.

Schedule an alert

Enable Send email and enter one or more addresses to receive the alerts. Also enable Throttling by selecting Suppress for results with the same field value and entering the MySQLCPU field name. Set the suppression time to five minutes, which is pretty aggressive. Remember, the script on the Forwarder server is sending new values every minute; without throttling, Splunk would send an alert every minute as well. Throttling lets an administrator keep some sanity. Click the Next button.

Enable alert actions

Finally, select whether to keep the alert private or share it with other users on the Splunk system. This only applies to the Enterprise version of Splunk. Click the Finish button.

Share an alert

Splunk is now watching for new data from the Forwarder, and as that data arrives Splunk evaluates it against the saved search. Any result other than no results found triggers an email.
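Behind the scenes, the saved alert becomes a stanza in savedsearches.conf, the kind of .conf file mentioned above as the place to edit if the alert ever needs renaming. A rough sketch of what such a stanza might contain follows; the key names come from Splunk’s savedsearches.conf reference, but the values here (especially the address) are illustrative:

```
[MySQL CPU Usage Over 65%]
search = ... the alert's search string ...
action.email = 1
action.email.to = admin@example.com
alert.suppress = 1
alert.suppress.fields = MySQLCPU
alert.suppress.period = 5m
```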

Note that alerts don’t just trigger emails. They can also run scripts. For example, an advanced Splunk search may look for multiple Java processes on a server running a Java-based application. If it found more than 20 spawned processes it could trigger a script to send a killall command to stop them before they consumed the server’s resources and then issue a start command to the application.
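A trigger script along those lines might be sketched like this; the threshold, process name, and restart command are all illustrative, not from an actual deployment:

```shell
#!/bin/sh
# Hypothetical Splunk alert script: stop runaway java processes and
# restart the application. All names here are illustrative.

too_many_java() {
    # Trigger when the count passed in exceeds 20 spawned processes
    [ "$1" -gt 20 ]
}

COUNT=$(ps -A -o comm= | grep -c java || true)
if too_many_java "$COUNT"; then
    killall java
    /usr/local/bin/start-myapp   # hypothetical application start command
fi
```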

Installing Lithium on Mac OS X

Thursday, November 1st, 2007

Installing Lithium Core 4.9.0

Make sure the system is not currently a web server and port 80 is available.
Download the Lithium 4.9.0 package.
Double-click the Core 4.9.0 Installer.
Click Continue through the license agreement screens.
Choose the packages to install and click Continue.
Choose the location to install the Lithium Core application and click Install.
Enter the credentials of an administrator and click OK.
When the installer is complete, click the Close button.
Open Lithium Core Admin from the /Applications folder.
Click Next and enter the name of the client for whom you are installing Lithium.
Click Next and enter a new administrative username and password for accessing Lithium.
Click Next and you will be placed into the database configuration screen. Unless you are using PostgreSQL on another host, do not modify these settings.
Click Next and double-check the settings. If they look good, click the Finish button and enter administrative credentials to commit the changes.
When you open Lithium Console from the /Applications folder for the first time, you will be asked whether you would like to check for updates each time. Click Yes.

You have now installed Lithium and can move on to adding hosts to be monitored.

Xsan: Sometimes You’re Going to Lose a Drive

Wednesday, April 4th, 2007

Sometimes a drive fails, or a RAID controller goes down on an array with a redundant drive, and the parity on a RAID must be rebuilt. In other words, if you lose a drive in a RAID 5, RAID 1, RAID 0+1 or RAID 3 array, you will be left with a degraded RAID (also referred to as a critical RAID) unless you have configured your Xserve RAID to use a hot spare. If you are using a hot spare on the channel of the failed drive, the RAID will begin to rebuild itself automatically. If you are not using a hot spare, upgrading your degraded RAID back to a healthy state should happen as quickly as possible to avoid data loss. In the event of a second drive failure on the array most of the data could be lost – and Murphy’s Law is evil when it comes to RAIDs. The data should be backed up as quickly as possible if it has not already been backed up.

Once the data is backed up, you should rebuild the parity for the array. The parity is rebuilt based on the data that is on the array. This does not fix any issues that may be present with the actual data. In other words, if you were using the Xserve RAID as a local volume, this would only repair issues with the array and not also perform a repair disk on the drives. In an Xsan, any data corruption could force you to rebuild your volume from the LUNs. You would not need to relabel the LUNs, but you may have to rebuild your volume.

In many situations you will be able to simply swap the bad drive out with an identical good drive and configure it as a hot spare. Then the Xserve RAID will automatically begin rebuilding the array, moving it from a degraded state into a healthy state.

However, there are often logical issues with drives and arrays. Also, hot spares do not always join the degraded array. In these situations you may need to manually rebuild an array. To do this:
Silence the alarm on the Xserve RAID.
Verify that you have a clean backup of your data.
Verify that you have a clean backup of your data again or better, have someone else check as well.
Open up your trusty Xserve RAID Spare Parts Kit and grab the spare drive module.
Remove the drive module that has gone down (typically the one with the amber light).
Install the new drive in your now empty slot.
Open RAID Admin from the /Applications/Server directory.
Click on the RAID containing the damaged array.
Click on the Advanced button in the toolbar.
Enter the management password for the Xserve RAID you are rebuilding the parity for.
Click on the button for Verify or Rebuild Parity and click on Continue.
Select the array needing to be rebuilt.
Click Rebuild Array and be prepared to wait for hours during the rebuild process. It is possible to use the array during the rebuild process – although if you don’t have to use the array it is probably best not to as you will see a performance loss. During the rebuild the lights on the drive will flash between an amber and a green state.
Once the rebuild is complete, perform a Verify Array on the RAID.
Verify the data on the volumes using the array.
Order a new drive to replace the broken drive in your Xserve RAID Spare Parts Kit.

If the rebuild of the data does not go well and the array is lost, then you will likely need to delete the array and re-add it. This will cause you to lose the data that was stored on that array and possibly on the volume, so it can never hurt to call Apple first and see if they have any more steps you can attempt. This is one of the many good reasons for backing data up. Just because you are using a RAID does not mean you should not back your data up.

The Verify Array can also be used to help troubleshoot issues with corrupted arrays.

This process has been tested using firmware 1.5 and below for Xserve RAIDs.

Using NRPE with Nagios

Wednesday, July 5th, 2006

Nagios is computer monitoring software. Nagios runs on one central server and has the ability to check resources on remote computers. To allow these remote computers to be monitored, the Nagios software team provides NRPE. NRPE functions as a daemon and plugin for executing plugins on remote hosts. When installed on the remote server, it creates a medium for the Nagios server to execute commands through the remote agent. You have the ability to check a wide range of resources: CPU, hard drive space, computer load, server services, DNS checks, etc.

To download NRPE and the NRPE plugins:

http://www.nagios.org/download/

In this example we will be installing NRPE on a remote Linux server. First, send the downloaded files to the remote server:
scp Documents/NAGIOS/nagios-plugins-1.4.9.tar.gz USER@HOSTNAME:
scp Documents/NAGIOS/nrpe-2.8.1.tar.gz USER@HOSTNAME:
These commands copy the downloaded files to the user’s home folder on the remote Linux server.

NRPE Plugins
Once the files have been copied to the server, place them in an appropriate location, for example /usr/local/src.
First we will uncompress the package:
[root@HOSTNAME src]# tar -xzvf nagios-plugins-1.4.9.tar.gz
Then cd into nagios-plugins-1.4.9.
Perform the following to compile and install the plugins:

[root@HOSTNAME nagios-plugins-1.4.9]# ./configure
When this executes successfully you should expect output similar to this:
config.status: creating po/Makefile
--with-apt-get-command:
--with-ping6-command:
--with-ping-command: /bin/ping -n -U -w %d -c %d %s
--with-ipv6: yes
--with-mysql: /usr/bin/mysql_config
--with-openssl: yes
--with-gnutls: no
--with-perl: /usr/bin/perl
--with-cgiurl: /nagios/cgi-bin
--with-trusted-path: /bin:/sbin:/usr/bin:/usr/sbin

Now we will make the package
[root@HOSTNAME nagios-plugins-1.4.9]# make
When this executes successfully, the output begins with:
Making all in po

Now we will run make install to install the Nagios plugins on the remote server. Run this command as root:
[root@HOSTNAME nagios-plugins-1.4.9]# make install

The Nagios documentation recommends changing the permissions on the files after the install is performed. The commands are as follows:
chown nagios:nagios /usr/local/nagios
chown -R nagios:nagios /usr/local/nagios/libexec

Now we need to install the NRPE daemon to execute these plugins.

NRPE Daemon
Next we will uncompress the package on the remote server. This document assumes that you have some type of root access to the server.
First we will move and uncompress the file; I generally keep my source files in /usr/local/src.
To complete these tasks, first SSH into the remote server:
sudo mv nrpe-2.8.1.tar.gz /usr/local/src
cd /usr/local/src
sudo tar -xzvf nrpe-2.8.1.tar.gz

We have now moved the package file to /usr/local/src and uncompressed the package to /usr/local/src/nrpe-2.8.1

Next, a nagios user account needs to be added on the remote system. Execute these commands with sudo or as root:
useradd nagios
passwd nagios

Next we must configure and compile the package before we can install it. I have found that compiling NRPE works best when executed as root; perhaps this will be resolved in later versions. For the purpose of this documentation, it is assumed that the following commands are executed as root:

cd /usr/local/src/nrpe-2.8.1
[root@HOSTNAME nrpe-2.8.1]# ./configure
[root@HOSTNAME nrpe-2.8.1]# ./configure

When ./configure executes properly you should see output similar to this:
*** Configuration summary for nrpe 2.8.1 05-10-2007 ***:

General Options:
————————-
NRPE port: 5666
NRPE user: nagios
NRPE group: nagios
Nagios user: nagios
Nagios group: nagios

Review the options above for accuracy. If they look okay,
type ‘make all’ to compile the NRPE daemon and client.

If you receive errors, do not proceed; you must resolve whatever dependency errors you are receiving before attempting to compile again.

Next we will run make all to compile NRPE. This is done with the following command:
[root@HOSTNAME nrpe-2.8.1]# make all

With a successful make, output similar to this should be seen:
*** Compile finished ***

If the NRPE daemon and client compiled without any errors, you
can continue with the installation or upgrade process.

We are now ready to install NRPE on the remote system. To perform the install, execute the following commands as root:
[root@phillip2 nrpe-2.8.1]# make install-plugin
[root@phillip2 nrpe-2.8.1]# make install-daemon
[root@phillip2 nrpe-2.8.1]# make install-daemon-config
[root@phillip2 nrpe-2.8.1]# make install-xinetd

Since we installed NRPE with xinetd, we have to edit the following nrpe file to allow outside connections to NRPE from the Nagios server.
Perform this as root:
[root@HOSTNAME nrpe-2.8.1]# nano /etc/xinetd.d/nrpe
Now under ‘only_from’ add the local IP address of the Nagios server; for example:
only_from = 127.0.0.1 192.168.1.23
Save the changes and exit the editor.
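For reference, the whole xinetd service file looks roughly like this after the edit. The values below reflect a typical file produced by make install-xinetd; your copy may differ slightly:

```
# /etc/xinetd.d/nrpe (typical contents after 'make install-xinetd')
service nrpe
{
        flags           = REUSE
        socket_type     = stream
        wait            = no
        user            = nagios
        server          = /usr/local/nagios/bin/nrpe
        server_args     = -c /usr/local/nagios/etc/nrpe.cfg --inetd
        log_on_failure  += USERID
        disable         = no
        only_from       = 127.0.0.1 192.168.1.23
}
```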

Now we have to edit the services file to add the port for NRPE to run on. Perform the following as root:
[root@HOSTNAME nrpe-2.8.1]# nano /etc/services
At the bottom of the file, add the following text:
nrpe 5666/tcp #NRPE

Once we have edited the xinetd file and services file, restart xinetd to apply the changes.
This should look something like this if done correctly:
[root@HOSTNAME nrpe-2.8.1]# service xinetd restart
Stopping xinetd: [ OK ]
Starting xinetd: [ OK ]
To test that the NRPE daemon is running, you can run this command:
netstat -at|grep nrpe
tcp 0 0 *:nrpe *:* LISTEN

If you do not see the same output, the Nagios documentation recommends checking the following:
– You added the nrpe entry to your /etc/services file
– The only_from directive in the /etc/xinetd.d/nrpe file contains an entry for “127.0.0.1”
– xinetd is installed and started
– Check the system log files for references to xinetd or nrpe and fix any problems that are reported

You can also use the check_nrpe plugin to check your installation of NRPE with the following command:
[root@HOSTNAME nrpe-2.8.1]# /usr/local/nagios/libexec/check_nrpe -H localhost
NRPE v2.8.1

The service name and version number should be the expected output of this command.

Before we go to the Nagios server to execute the plugins, in most cases we just have to confirm that the plugins in the config file are pointed at the correct devices. The most common problem is that the wrong hard drive gets checked, which makes the tool much less useful.

For this server, when I run the df command I see that I want Nagios to examine the partitions mounted at / and /home.

For this server I want NRPE to monitor /dev/sda2 and /dev/sdb1. To make this happen we will edit the nrpe.cfg file, /usr/local/nagios/etc/nrpe.cfg; down at the bottom are the check commands tied to devices. I have added these two lines and removed any other hard drive devices:
command[check_sda2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda2
command[check_sdb1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sdb1

-w 20% means warn when the drive has less than 20% free space available
-c 10% means issue a critical alert when less than 10% free space is available
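The percentage logic behind those thresholds can be sketched as a shell function. This is just an illustration of what warning and critical mean here; check_disk itself implements this (in C) along with the actual disk measurement:

```shell
# Illustration of the -w 20% / -c 10% thresholds; pass in the percent
# of free space and get back the status check_disk would report.
disk_status() {
    # usage: disk_status <percent-free>
    if [ "$1" -lt 10 ]; then
        echo CRITICAL
    elif [ "$1" -lt 20 ]; then
        echo WARNING
    else
        echo OK
    fi
}
```

For example, disk_status 50 prints OK, disk_status 15 prints WARNING, and disk_status 5 prints CRITICAL.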

Configure Nagios Server to Monitor Remote Host

The remote server must already be running Nagios; this documentation assumes that has already been done. You will also need to install the check_nrpe plugin on the Nagios server. This is done in the same way that a remote host would have NRPE installed, which has already been covered.

After all the requirements have been met, it’s now time to define service checks for the new host.
First you have to define the config files. This is done on the Nagios server in nagios.cfg. Edit the file located at /usr/local/nagios/etc/nagios.cfg to include, for example:
cfg_file=/usr/local/nagios/etc/hosts.cfg
cfg_file=/usr/local/nagios/etc/services.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg

Then we will edit hosts.cfg to include the new remote host. Using a text editor, add the following; this will add a host called HOSTNAME with a local IP of 192.168.1.226 and give it the linux-server template. Templates are configured in the nagios.cfg file.

define host{
        use                     linux-server
        host_name               HOSTNAME
        alias                   HOSTNAME
        address                 192.168.1.226
        }

Now we can edit the services file to tell Nagios what NRPE services to check on the new Remote Host. I generally use this template to check various default services

#HOSTNAME
define service{
        use                     generic-service
        host_name               HOSTNAME
        service_description     / Free Space
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_nrpe!check_sda2
        }

define service{
        use                     generic-service
        host_name               HOSTNAME
        service_description     /home Free Space
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_nrpe!check_sdb1
        }

define service{
        use                     generic-service ; Name of service template to use
        host_name               HOSTNAME
        service_description     SMTP
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_smtp
        }

define service{
        use                     generic-service ; Name of service template to use
        host_name               HOSTNAME
        service_description     FTP
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_ftp
        }

define service{
        use                     generic-service ; Name of service template to use
        host_name               HOSTNAME
        service_description     HTTP
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_nrpe!check_http
        }

define service{
        use                     generic-service
        host_name               HOSTNAME
        service_description     CPU Load
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_nrpe!check_load
        }

define service{
        use                     generic-service
        host_name               HOSTNAME
        service_description     Current Users
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_nrpe!check_users
        }

define service{
        use                     generic-service
        host_name               HOSTNAME
        service_description     Total Processes
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_nrpe!check_total_procs
        }

define service{
        use                     generic-service
        host_name               HOSTNAME
        service_description     Zombie Processes
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_nrpe!check_zombie_procs
        }

After you have edited the host and service files, you may want to run a check to see what, if anything, is wrong with your config. I find it helpful to make this a small script you can execute once you’re done. The command is:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
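That check can be wrapped in the small script suggested above. A sketch, assuming the default /usr/local/nagios paths:

```shell
#!/bin/sh
# preflight.sh - sketch of a config-check wrapper around 'nagios -v'
# (paths assume the default /usr/local/nagios layout).

preflight() {
    # $1 = nagios binary, $2 = main config file
    if "$1" -v "$2" > /dev/null 2>&1; then
        echo "Config OK - safe to restart nagios"
    else
        echo "Config errors - fix before restarting" >&2
        return 1
    fi
}

# Only run the real check if the binary is actually installed here
if [ -x /usr/local/nagios/bin/nagios ]; then
    preflight /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
fi
```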

If everything runs without error, it’s safe to restart the nagios service and check the website for the new host.

How to Monitor Remote services with Nagios and NRPE

As long as both the Nagios server and the remote host running NRPE have the Nagios plugins installed, this is very straightforward.

In this example we will add monitoring of HTTP and MySQL to a remote server named terrence2.

First, on the remote host with NRPE, edit the nrpe.cfg file to include:
command[check_mysql]=/usr/local/nagios/libexec/check_mysql -H localhost
command[check_http]=/usr/local/nagios/libexec/check_http -H localhost

Then on the Nagios server, simply edit the services.cfg file to add the following check commands:
define service{
        use                     generic-service ; Name of service template to use
        host_name               HOSTNAME
        service_description     HTTP
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_nrpe!check_http
        }

and for MySQL

define service{
        use                     generic-service ; Name of service template to use
        host_name               HOSTNAME
        service_description     MySQL
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   3
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_nrpe!check_mysql
        }

Restart xinetd on the remote host, restart nagios on the server, and you’re up and monitoring.