Skip to end of metadata
Go to start of metadata

Log Monitor is a simple python script to monitor log files and send email alerts when a relevant pattern (regex) is detected. By default it looks for 'ERROR' and 'FAIL', but this is configurable.  Log Monitor checks log files in a dated directory that matches yesterday's date.

Log Monitor can also looks for segfaults reported by dmesg (although dmesg time stamps are unreliable, so it is recommended instead to catch these signals, and log them instead - see the "Monitoring Segfaults" section below for details).

Example Installation

Python 2 Version

mkdir cvsTmp
cd cvsTmp
cvs co apps/monitoring/src/scripts/LogMonitor.py
cp !$ ~/CIP/bin

You will also need to install ConfigMaster.

mkdir cvsTmp
cvs cvsTmp cvs co libs/python/src/ConfigMaster/ConfigMaster.py
cp !$ ~/CIP/bin

Python 3 Version

The instructions will install both LogMonitor and ConfigMaster

mkdir gitTmp
cd gitTmp
git clone https://github.com/NCAR/LogMonitor
cd LogMonitor
./manage_externals/checkout_externals
cp configmaster/ConfigMaster.py ~/CIP/bin
cp LogMonitor.py ~/CIP/bin


Configuration

LogMonitor.py --print_params >& LogMonitor.conf.py
emacs LogMonitor.conf.py

Example Configuration

#!/usr/bin/python
import os
 
##############################################################
##  EMAIL OPTIONS
##############################################################
emailList = ["prestop@ucar.edu"]

emailListTruncated = ["prestop@ucar.edu"]

fromEmail = "prestop@ucar.edu"

# Report logs in the summary even if they had no hits.
verboseSummary = False


# A dated directory is created under the destination given
# Set it empty to not have details sent to a file.
sendDetailsToDirs = ["/scratch/WEEKLY/ifigen/LogMonitor"]

# use this to make sure emails are not too long.
# set to zero or negative to get the full message in the email
truncateEmailAfterCharacter = 0

##############################################################
## GREP OPTIONS
##############################################################
searchStrings = ["ERROR","FATAL"]

# number of lines before and after hit to include in report
contextLines = 3

# strings to exclude
excludeStrings = []

##############################################################
## LOG FILE OPTIONS
##############################################################
logDir = os.environ["CIP_LOG_DIR"]

# logNameRegex allows you to limit the log files searched to ones that match this regex.
# Leave empty to match all log files.
logNameRegex = ""

# logNameExclude allows you to exclude log files that match this regex.
# Leave empty to not exclude any log files.
logNameExclude = ""

#############################################################
## ADDITIONAL MONITORING OPTIONS
#############################################################

# If true, grep dmesg for segfaults and include that in the list
# of errors
checkForSegFaults = True

# Number of minutes to look back in the dmesg output.
# defaults to one day
dmesgMaxAge = 1440

# Skip error reporting if the line contains the string to ignore
dmesgIgnoreRegex = ""

Running

emacs run_LogMonitor

paste:

#! /bin/sh

echo "running LogMonitor.py"
LogMonitor.py -c $CIP_PROJ_DIR/monitorHome/params/LogMonitor.conf.py 2>&1 >\
$LOG_DIR/LogMonitor.py.log &
chmod a+x run_LogMonitor

Add to crontab:

# monitor logs
10 0 * * * runWithEnv.sh run_LogMonitor &>     /home/iida/CIP/logs/LogMonitor.cron.log

Handling Strings from Libraries

Sometime a call to a library may result in a string that implies an error, but in your case it's not an error. Trap the string and replace the offending characters with something else:

C++ example

string errStr = _mdvx->getErrStr();
size_t ePos = errStr.find("ERROR");
while (ePos != string::npos){
   errStr.replace(ePos,std::string("ERROR").length(), "WARNING");
   ePos = errStr.find("ERROR");
}
cerr << errStr << endl;

 

If you have multiple strings you want to swap, something like this makes more sense:

// In the header file
 vector<pair<string,string> > _swapStrings;

 string _stringSwap(string errStr, vector<pair<string,string> > swapStrings);

//In the .cc file

_swapStrings.push_back(make_pair("ERROR","WARNING"));
_swapStrings.push_back(make_pair("error","warning")); 

string _stringSwap(string errStr, vector<pair<string,string> > swapStrings)
{
  size_t ePos;
  vector<pair<string,string> >::iterator it;
  
  for(it = swapStrings.begin(); it != swapStrings.end(); it++)
  {
    ePos = errStr.find((*it).first);
    while (ePos != string::npos){
      errStr.replace(ePos,std::string((*it).first).length(), (*it).second);
      ePos = errStr.find((*it).first);
    }
  }
  return errStr;
}


cerr << _stringSwap(_mdvx->getErrStr(), _swapStrings) << endl;

 

Monitoring SegFaults

LogMonitor uses dmesg to find segfaults.  Unfortunately the times reported by dmesg are not accurate (see this discussion: http://serverfault.com/questions/576139/dmesg-time-vs-system-time-time-isnt-correct).

A better way to monitor segfaults is to trap them in your program and log them.  Then log monitor will find them in the log file with a correct time stamp.

A simple technique for catching and logging segfaults in C++ programs is used in apps/mdv_utils/src/MdvBlender.  It not only catches and logs SEGV signals, but it will still allow a core file to be created when they are enabled via ulimits.

 

Here is an example of what needs to be done:

//static limits their scope to this file.
static std::string cmdline = "";
static MdvBlender *mainObj = 0;

void store_arguments(int argc, char **argv){

        for (int ix = 0; ix < argc; ix++){
                cmdline += argv[ix];
                cmdline += " ";
        }
}



// based on blog at http://www.alexonlinux.com/how-to-handle-sigsegv-but-also-generate-core-dump
// NOTE:  that comments on the blog indicate the core file generated on red hat or on multi-threaded programs
//        might contain unhelpful information.
void segv_handler(int signum)
{
  time_t rawtime;
  struct tm * timeinfo;
  char timebuffer[80];
  char cwdbuffer[PATH_MAX+1];

  time (&rawtime);
  timeinfo = localtime(&rawtime);

  strftime(timebuffer,80,"%Y-%m-%d %H:%M:%S",timeinfo);
  getcwd(cwdbuffer,PATH_MAX+1);

  printf("FATAL ERROR (SEGFAULT): Process %d got signal %d @ local time = %s\n", getpid(), signum, timebuffer);
  printf("FATAL ERROR (SEGFAULT): Look for a core file in %s\n",cwdbuffer);
  printf("FATAL ERROR (SEGFAULT): Process command line: %s\n",cmdline.c_str());
  signal(signum, SIG_DFL);
  kill(getpid(), signum);
}

void tidy_and_exit(int signal)
{
   // Remove the top-level application class
   delete mainObj;
   printf("Exiting %d\n", signal);
   exit(signal);
}


void no_memory() {
  cerr << "Caught bad_alloc exception (Failed to allocate memory!)\n";
  tidy_and_exit(1);
}
/////////////////////////////////////////////////////////////////////////
//
// Function Name:       main
// 
// Description:         main routine for CipAlgo
// 
// Returns:             -1 for failure and iret id program runs
// 

int main(int argc, char **argv)
{
  store_arguments(argc,argv);

  // create main object
  mainObj = MdvBlender::instance(argc, argv);
  if(!mainObj->isOK()) {
    tidy_and_exit(-1);
  } 
  // set signal handling
  PORTsignal(SIGINT, tidy_and_exit);
  PORTsignal(SIGHUP, tidy_and_exit);
  PORTsignal(SIGTERM, tidy_and_exit);
  PORTsignal(SIGPIPE, (PORTsigfunc)SIG_IGN);
  PORTsignal(SIGSEGV, segv_handler);
  set_new_handler(no_memory);

 

Example Output

Here is what an email might look like

#########################################################
##                  LOG FILE REPORT                    ##
#########################################################
##                  2016-10-20                         ##
#########################################################
Searching /d1/iida/data/CIP/logs/20161020
----------------------------------------------
SUMMARY: 8 instances of ERROR in CipAlgo.wrf_rr_ncep.log
SUMMARY: 2 instances of ERROR in cip_control.log
SUMMARY: 4 instances of ERROR in MdvMerge2.conus.log
SUMMARY: 0 segfaults found in dmesg in last 1440 minutes.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
========================================
looking at: /d1/iida/data/CIP/logs/20161020/LtgSpdb2Mdv.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/build_logdir_links.primary
========================================
looking at: /d1/iida/data/CIP/logs/20161020/CipAlgo.wrf_rr_ncep.log
----------------------------------------
looking for: ERROR
03:14:15 - Found 151987 matching points for PirepData
03:14:15 - Found 151987 matching points for RadarData
03:14:16 - Found 151987 matching points for SatelliteData
03:14:23 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight()
03:14:23 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight()
03:15:01 - writing geopotential height
03:15:01 - writing icing probability
03:15:02 - writing supercooled large droplet potential
--
04:14:56 - Found 151987 matching points for PirepData
04:14:56 - Found 151987 matching points for RadarData
04:14:56 - Found 151987 matching points for SatelliteData
04:15:03 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight()
04:15:03 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight()
04:15:03 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight()
04:15:03 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight()
04:15:03 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight()
04:15:41 - writing geopotential height
04:15:41 - writing icing probability
04:15:41 - writing supercooled large droplet potential
--
05:14:37 - Found 151987 matching points for PirepData
05:14:37 - Found 151987 matching points for RadarData
05:14:37 - Found 151987 matching points for SatelliteData
05:14:44 - DATA ERROR: METAR with CC = 3 with CBZ = -9999 in CipAlgo calcBaseHeight()
05:15:25 - writing geopotential height
05:15:25 - writing icing probability
05:15:25 - writing supercooled large droplet potential

========================================
looking at: /d1/iida/data/CIP/logs/20161020/Grib2toMdv.mrms.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/pressure2flight.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/satDerive.west.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/cip_control.log
----------------------------------------
looking for: ERROR
08:09:41 -      DEBUG -- stop_time = 2016 10 20 08 08 00
08:09:41 -      DEBUG -- out_dir = /d1/iida/data/CIP/mdv/satellite/west_derived
08:09:41 -      DEBUG --        satDerive finished run at 2016-10-20 08:09
08:09:42 -      ERROR --        Could not find file in /d1/iida/data/CIP/mdv/satellite/west_derived
08:09:42 -      DEBUG -- SUCCESS: satDerive.
08:09:42 -      DEBUG --        Starting MdvMerge2 : conus at time 2016-10-20 08:09
08:09:42 -      DEBUG -- start_time = 2016 10 20 07 08 00
08:09:42 -      DEBUG -- stop_time = 2016 10 20 08 08 00
08:09:42 -      DEBUG -- out_dir = /d1/iida/data/CIP/mdv/satellite/conus
08:09:42 -      DEBUG --        MdvMerge2 finished run at 2016-10-20 08:09
08:09:44 -      ERROR --        Could not find file in /d1/iida/data/CIP/mdv/satellite/conus
08:09:44 -      DEBUG -- SUCCESS: MdvMerge
08:09:44 -
08:10:34 -   Begin process_wrf_rr_file:

========================================
looking at: /d1/iida/data/CIP/logs/20161020/Gini2Mdv.east.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/PirepMapper.cip-spdb.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/LdataWatcher.pressure2flight2.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/pressure2flight2.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/LdataWatcher.create_plots.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/MetarCsv2Spdb.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/MdvBlender.metar.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/MdvtoGrib2.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/satDerive.east.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/MetarMapper.cip-spdb.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/PirepCsv2Spdb.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/icingXSect.cip_plot.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/IcingCategory.grib2.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/derived_model_fields.wrf_rr_ncep.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/Mdv2NetCDF.cip_plot.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/Ltg2Spdb.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/DsFileDist.primary.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/create_images.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/Grib2toMdv.wrf_rr_ncep.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/createModelMetars.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/Mdv2NetCDF.metar_mapper.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/MdvMerge2.conus.log
----------------------------------------
looking for: ERROR
08:09:42 -   Search margin: 1800 secs
08:09:42 -   Read dir: /d1/iida/data/CIP/mdv/satellite/west_derived
08:09:42 -   Read dir url: /d1/iida/data/CIP/mdv/satellite/west_derived
08:09:42 - ERROR - MdvMerge2
08:09:42 -   Cannot read data for url: /d1/iida/data/CIP/mdv/satellite/west_derived
08:09:42 -   Time for following error: 2016/10/20 08:09:42 UTC
08:09:42 - ERROR - computeReadPath
08:09:42 -   Read closest failed, dir: /d1/iida/data/CIP/mdv/satellite/west_derived
08:09:42 -   Search margin (secs): 1800
08:09:42 -   Search time: 2016/10/20 08:00:00
08:09:42 - ERROR - DsMdvx::_readVolumeLocal
08:09:42 -
09:08:33 - ======================================================================
09:08:33 - Program 'MdvMerge2'

========================================
looking at: /d1/iida/data/CIP/logs/20161020/create_images.cip_xsect.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/plotcipnc_main.ncl.cip_plot.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/RadarMapper.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/pirep
========================================
looking at: /d1/iida/data/CIP/logs/20161020/plotcipnc_xsect_main.ncl.cip_xsect.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/create_plots.cip.test.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/LdataWatcher.create_images.cip.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/Hybrid2Pressure.wrf_rr_ncep.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/metar
========================================
looking at: /d1/iida/data/CIP/logs/20161020/Gini2Mdv.west.log
========================================
looking at: /d1/iida/data/CIP/logs/20161020/LdataWatcher.create_images.cip_xsect.log

==========================================
Looking for segfaults in dmesg:
----------------------------------------
  • No labels