Log Monitor is a simple python script to monitor log files and send email alerts when a relevant pattern (regex) is detected. By default it looks for 'ERROR' and 'FAIL', but this is configurable. Log Monitor checks log files in a dated directory that matches yesterday's date.
Log Monitor can also looks for segfaults reported by dmesg (although dmesg time stamps are unreliable, so it is recommended instead to catch these signals, and log them instead - see the "Monitoring Segfaults" section below for details).
Example Installation
Python 2 Version
mkdir cvsTmp cd cvsTmp cvs co apps/monitoring/src/scripts/LogMonitor.py cp !$ ~/CIP/bin
You will also need to install ConfigMaster.
mkdir cvsTmp cvs cvsTmp cvs co libs/python/src/ConfigMaster/ConfigMaster.py cp !$ ~/CIP/bin
Python 3 Version
The instructions will install both LogMonitor and ConfigMaster
mkdir gitTmp cd gitTmp git clone https://github.com/NCAR/LogMonitor cd LogMonitor ./manage_externals/checkout_externals cp configmaster/ConfigMaster.py ~/CIP/bin cp LogMonitor.py ~/CIP/bin
Configuration
LogMonitor.py --print_params >& LogMonitor.conf.py emacs LogMonitor.conf.py
Example Configuration
#!/usr/bin/python import os ############################################################## ## EMAIL OPTIONS ############################################################## emailList = ["prestop@ucar.edu"] emailListTruncated = ["prestop@ucar.edu"] fromEmail = "prestop@ucar.edu" # Report logs in the summary even if they had no hits. verboseSummary = False # A dated directory is created under the destination given # Set it empty to not have details sent to a file. sendDetailsToDirs = ["/scratch/WEEKLY/ifigen/LogMonitor"] # use this to make sure emails are not too long. # set to zero or negative to get the full message in the email truncateEmailAfterCharacter = 0 ############################################################## ## GREP OPTIONS ############################################################## searchStrings = ["ERROR","FATAL"] # number of lines before and after hit to include in report contextLines = 3 # strings to exclude excludeStrings = [] ############################################################## ## LOG FILE OPTIONS ############################################################## logDir = os.environ["CIP_LOG_DIR"] # logNameRegex allows you to limit the log files searched to ones that match this regex. # Leave empty to match all log files. logNameRegex = "" # logNameExclude allows you to exclude log files that match this regex. # Leave empty to not exclude any log files. logNameExclude = "" ############################################################# ## ADDITIONAL MONITORING OPTIONS ############################################################# # If true, grep dmesg for segfaults and include that in the list # of errors checkForSegFaults = True # Number of minutes to look back in the dmesg output. # defaults to one day dmesgMaxAge = 1440 # Skip error reporting if the line contains the string to ignore dmesgIgnoreRegex = ""
Running
emacs run_LogMonitor
paste:
#! /bin/sh echo "running LogMonitor.py" LogMonitor.py -c $CIP_PROJ_DIR/monitorHome/params/LogMonitor.conf.py 2>&1 >\ $LOG_DIR/LogMonitor.py.log &
chmod a+x run_LogMonitor
Add to crontab:
# monitor logs 10 0 * * * runWithEnv.sh run_LogMonitor &> /home/iida/CIP/logs/LogMonitor.cron.log
Handling Strings from Libraries
Sometime a call to a library may result in a string that implies an error, but in your case it's not an error. Trap the string and replace the offending characters with something else:
C++ example
string errStr = _mdvx->getErrStr(); size_t ePos = errStr.find("ERROR"); while (ePos != string::npos){ errStr.replace(ePos,std::string("ERROR").length(), "WARNING"); ePos = errStr.find("ERROR"); } cerr << errStr << endl;
If you have multiple strings you want to swap, something like this makes more sense:
// In the header file vector<pair<string,string> > _swapStrings; string _stringSwap(string errStr, vector<pair<string,string> > swapStrings); //In the .cc file _swapStrings.push_back(make_pair("ERROR","WARNING")); _swapStrings.push_back(make_pair("error","warning")); string _stringSwap(string errStr, vector<pair<string,string> > swapStrings) { size_t ePos; vector<pair<string,string> >::iterator it; for(it = swapStrings.begin(); it != swapStrings.end(); it++) { ePos = errStr.find((*it).first); while (ePos != string::npos){ errStr.replace(ePos,std::string((*it).first).length(), (*it).second); ePos = errStr.find((*it).first); } } return errStr; } cerr << _stringSwap(_mdvx->getErrStr(), _swapStrings) << endl;
Monitoring SegFaults
LogMonitor uses dmesg to find segfaults. Unfortunately the times reported by dmesg are not accurate (see this discussion: http://serverfault.com/questions/576139/dmesg-time-vs-system-time-time-isnt-correct).
A better way to monitor segfaults is to trap them in your program and log them. Then log monitor will find them in the log file with a correct time stamp.
A simple technique for catching and logging segfaults in C++ programs is used in apps/mdv_utils/src/MdvBlender. It not only catches and logs SEGV signals, but it will still allow a core file to be created when they are enabled via ulimits.
Here is an example of what needs to be done:
//static limits their scope to this file. static std::string cmdline = ""; static MdvBlender *mainObj = 0; void store_arguments(int argc, char **argv){ for (int ix = 0; ix < argc; ix++){ cmdline += argv[ix]; cmdline += " "; } } // based on blog at http://www.alexonlinux.com/how-to-handle-sigsegv-but-also-generate-core-dump // NOTE: that comments on the blog indicate the core file generated on red hat or on multi-threaded programs // might contain unhelpful information. void segv_handler(int signum) { time_t rawtime; struct tm * timeinfo; char timebuffer[80]; char cwdbuffer[PATH_MAX+1]; time (&rawtime); timeinfo = localtime(&rawtime); strftime(timebuffer,80,"%Y-%m-%d %H:%M:%S",timeinfo); getcwd(cwdbuffer,PATH_MAX+1); printf("FATAL ERROR (SEGFAULT): Process %d got signal %d @ local time = %s\n", getpid(), signum, timebuffer); printf("FATAL ERROR (SEGFAULT): Look for a core file in %s\n",cwdbuffer); printf("FATAL ERROR (SEGFAULT): Process command line: %s\n",cmdline.c_str()); signal(signum, SIG_DFL); kill(getpid(), signum); } void tidy_and_exit(int signal) { // Remove the top-level application class delete mainObj; printf("Exiting %d\n", signal); exit(signal); } void no_memory() { cerr << "Caught bad_alloc exception (Failed to allocate memory!)\n"; tidy_and_exit(1); } ///////////////////////////////////////////////////////////////////////// // // Function Name: main // // Description: main routine for CipAlgo // // Returns: -1 for failure and iret id program runs // int main(int argc, char **argv) { store_arguments(argc,argv); // create main object mainObj = MdvBlender::instance(argc, argv); if(!mainObj->isOK()) { tidy_and_exit(-1); } // set signal handling PORTsignal(SIGINT, tidy_and_exit); PORTsignal(SIGHUP, tidy_and_exit); PORTsignal(SIGTERM, tidy_and_exit); PORTsignal(SIGPIPE, (PORTsigfunc)SIG_IGN); PORTsignal(SIGSEGV, segv_handler); set_new_handler(no_memory);
Example Output
Here is what an email might look like
######################################################### ## LOG FILE REPORT ## ######################################################### ## 2016-10-20 ## ######################################################### Searching /d1/iida/data/CIP/logs/20161020 ---------------------------------------------- SUMMARY: 8 instances of ERROR in CipAlgo.wrf_rr_ncep.log SUMMARY: 2 instances of ERROR in cip_control.log SUMMARY: 4 instances of ERROR in MdvMerge2.conus.log SUMMARY: 0 segfaults found in dmesg in last 1440 minutes. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ======================================== looking at: /d1/iida/data/CIP/logs/20161020/LtgSpdb2Mdv.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/build_logdir_links.primary ======================================== looking at: /d1/iida/data/CIP/logs/20161020/CipAlgo.wrf_rr_ncep.log ---------------------------------------- looking for: ERROR 03:14:15 - Found 151987 matching points for PirepData 03:14:15 - Found 151987 matching points for RadarData 03:14:16 - Found 151987 matching points for SatelliteData 03:14:23 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight() 03:14:23 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight() 03:15:01 - writing geopotential height 03:15:01 - writing icing probability 03:15:02 - writing supercooled large droplet potential -- 04:14:56 - Found 151987 matching points for PirepData 04:14:56 - Found 151987 matching points for RadarData 04:14:56 - Found 151987 matching points for SatelliteData 04:15:03 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight() 04:15:03 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight() 04:15:03 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight() 04:15:03 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight() 04:15:03 - DATA ERROR: METAR with CC = 4 with CBZ = -9999 in CipAlgo calcBaseHeight() 04:15:41 - writing geopotential height 04:15:41 - writing icing probability 04:15:41 - writing supercooled large droplet potential -- 05:14:37 - Found 151987 matching points for PirepData 05:14:37 - Found 151987 matching points for RadarData 05:14:37 - Found 151987 matching points for SatelliteData 05:14:44 - DATA ERROR: METAR with CC = 3 with CBZ = -9999 in CipAlgo calcBaseHeight() 05:15:25 - writing geopotential height 05:15:25 - writing icing probability 05:15:25 - writing supercooled large droplet potential ======================================== looking at: /d1/iida/data/CIP/logs/20161020/Grib2toMdv.mrms.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/pressure2flight.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/satDerive.west.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/cip_control.log ---------------------------------------- looking for: ERROR 08:09:41 - DEBUG -- stop_time = 2016 10 20 08 08 00 08:09:41 - DEBUG -- out_dir = /d1/iida/data/CIP/mdv/satellite/west_derived 08:09:41 - DEBUG -- satDerive finished run at 2016-10-20 08:09 08:09:42 - ERROR -- Could not find file in /d1/iida/data/CIP/mdv/satellite/west_derived 08:09:42 - DEBUG -- SUCCESS: satDerive. 08:09:42 - DEBUG -- Starting MdvMerge2 : conus at time 2016-10-20 08:09 08:09:42 - DEBUG -- start_time = 2016 10 20 07 08 00 08:09:42 - DEBUG -- stop_time = 2016 10 20 08 08 00 08:09:42 - DEBUG -- out_dir = /d1/iida/data/CIP/mdv/satellite/conus 08:09:42 - DEBUG -- MdvMerge2 finished run at 2016-10-20 08:09 08:09:44 - ERROR -- Could not find file in /d1/iida/data/CIP/mdv/satellite/conus 08:09:44 - DEBUG -- SUCCESS: MdvMerge 08:09:44 - 08:10:34 - Begin process_wrf_rr_file: ======================================== looking at: /d1/iida/data/CIP/logs/20161020/Gini2Mdv.east.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/PirepMapper.cip-spdb.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/LdataWatcher.pressure2flight2.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/pressure2flight2.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/LdataWatcher.create_plots.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/MetarCsv2Spdb.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/MdvBlender.metar.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/MdvtoGrib2.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/satDerive.east.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/MetarMapper.cip-spdb.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/PirepCsv2Spdb.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/icingXSect.cip_plot.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/IcingCategory.grib2.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/derived_model_fields.wrf_rr_ncep.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/Mdv2NetCDF.cip_plot.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/Ltg2Spdb.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/DsFileDist.primary.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/create_images.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/Grib2toMdv.wrf_rr_ncep.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/createModelMetars.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/Mdv2NetCDF.metar_mapper.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/MdvMerge2.conus.log ---------------------------------------- looking for: ERROR 08:09:42 - Search margin: 1800 secs 08:09:42 - Read dir: /d1/iida/data/CIP/mdv/satellite/west_derived 08:09:42 - Read dir url: /d1/iida/data/CIP/mdv/satellite/west_derived 08:09:42 - ERROR - MdvMerge2 08:09:42 - Cannot read data for url: /d1/iida/data/CIP/mdv/satellite/west_derived 08:09:42 - Time for following error: 2016/10/20 08:09:42 UTC 08:09:42 - ERROR - computeReadPath 08:09:42 - Read closest failed, dir: /d1/iida/data/CIP/mdv/satellite/west_derived 08:09:42 - Search margin (secs): 1800 08:09:42 - Search time: 2016/10/20 08:00:00 08:09:42 - ERROR - DsMdvx::_readVolumeLocal 08:09:42 - 09:08:33 - ====================================================================== 09:08:33 - Program 'MdvMerge2' ======================================== looking at: /d1/iida/data/CIP/logs/20161020/create_images.cip_xsect.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/plotcipnc_main.ncl.cip_plot.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/RadarMapper.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/pirep ======================================== looking at: /d1/iida/data/CIP/logs/20161020/plotcipnc_xsect_main.ncl.cip_xsect.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/create_plots.cip.test.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/LdataWatcher.create_images.cip.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/Hybrid2Pressure.wrf_rr_ncep.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/metar ======================================== looking at: /d1/iida/data/CIP/logs/20161020/Gini2Mdv.west.log ======================================== looking at: /d1/iida/data/CIP/logs/20161020/LdataWatcher.create_images.cip_xsect.log ========================================== Looking for segfaults in dmesg: ----------------------------------------