When a host check enters a failure state (WARN or CRIT), the other service monitors on the target element have their alerts and/or actions suppressed and enter an UNKN state until the host check recovers. Even so, there are several scenarios in which you may still see alerts sent for the non-host check monitors:

 

  • The host check hasn't failed all of its rechecks yet. While the host check is still in its recheck loop and hasn't started alerting, other monitors will still register outages and may send alerts.
  • The element in question doesn't have a proper host check defined. Confirm that a Host Check exists for the system and that it is the correct monitor.
  • A non-host check monitor does not force a run of the host check before it sends an alert. For example, if your host check runs once every 15 minutes but another monitor runs every minute, the once-per-minute monitor may fail and alert well before the host check has registered the outage.
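The timing gap in the last scenario can be illustrated with a rough calculation. The sketch below is only a simplified model, not the product's actual scheduler: it assumes a monitor alerts after its regular run fails and every recheck then fails, and the names `Schedule` and `worst_case_minutes_to_alert` and the recheck values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    """Hypothetical monitor timing settings, all in minutes."""
    check_interval: float    # time between regular runs
    recheck_interval: float  # time between rechecks after a failure
    max_rechecks: int        # failed rechecks required before alerting


def worst_case_minutes_to_alert(s: Schedule) -> float:
    # Assumed model: the outage starts just after a successful run, so it
    # goes unnoticed for up to one full check interval, and the monitor
    # must then fail every recheck before it alerts.
    return s.check_interval + s.max_rechecks * s.recheck_interval


# Host check every 15 minutes versus a service monitor every minute.
# The recheck settings here are made-up values for illustration.
host_check = Schedule(check_interval=15, recheck_interval=1, max_rechecks=3)
service = Schedule(check_interval=1, recheck_interval=1, max_rechecks=2)

gap = worst_case_minutes_to_alert(host_check) - worst_case_minutes_to_alert(service)
print(f"Service monitor can alert up to {gap} minutes before the host check")
```

Under these assumed settings the service monitor can finish its rechecks and alert long before the host check has confirmed the outage, so nothing is suppressed in the meantime.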

 

As a general rule of thumb, your host check should run at least as often as the most frequently checked service on the element, and its recheck interval / max rechecks should be shorter than that service's, so that the host check confirms an outage first.
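One way to apply this rule of thumb is to compare the settings directly. The following is a minimal sketch under one interpretation of the rule: the "recheck window" is taken to be recheck interval multiplied by max rechecks, and the function name `host_check_is_safe` and the tuple layout are hypothetical rather than part of any product API.

```python
def host_check_is_safe(host, services):
    """Return True if the host check follows the rule of thumb.

    `host` and each entry of `services` are hypothetical tuples of
    (check_interval, recheck_interval, max_rechecks), in minutes.
    """
    # The most frequently checked service has the smallest check interval.
    fastest = min(services, key=lambda s: s[0])

    # The host check should run at least as often as that service.
    runs_often_enough = host[0] <= fastest[0]

    # Assumed reading of the rule: the host check's total recheck window
    # should be shorter than the fastest service's recheck window.
    confirms_sooner = host[1] * host[2] < fastest[1] * fastest[2]

    return runs_often_enough and confirms_sooner


# Host check every minute with two 30-second rechecks: follows the rule.
print(host_check_is_safe((1, 0.5, 2), [(1, 1, 2), (5, 1, 3)]))
# Host check every 15 minutes against a 1-minute service: does not.
print(host_check_is_safe((15, 1, 3), [(1, 1, 2)]))
```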
