In up.time, a service level agreement (SLA) measures your IT infrastructure’s ability to meet performance goals, particularly from the end-user perspective. Different goals can focus on different aspects of your infrastructure, from underlying network performance, to back-end database availability, to user-facing application server response time. Given this broad coverage, a performance goal encompasses anything from a handful of monitored systems to an entire production center.
Defining and working toward fulfilling SLAs provides you with more insight into the performance and planning of your infrastructure:
An SLA can measure the success of your IT infrastructure by using end-user-focused service monitors such as the Web Application Transaction monitor and the Email Delivery monitor.
Use SLAs to methodically set expectations on all or the most critical aspects of your infrastructure. SLAs provide you with metrics with which you can gauge the success of your network administration.
Trend lines in SLA reports can give you an estimate for when your current hardware deployment will require augmentation.
Compliance reports quantify the value of the IT department’s efforts, and objective-based reports exist to identify recurring problems that affect business outcomes.
Like other up.time Elements (i.e., systems, network devices, and Applications), an SLA definition consists of service monitors that you have previously created. Depending on its use, an SLA can consist of a single service level objective (SLO) that in turn consists of a single service monitor.
In other cases, an SLA’s coverage can be broad enough to include an ungainly list of service monitors; in this case the SLA can be refined to consist of multiple SLOs that focus on different aspects of the SLA. Creating multiple objectives helps you further refine your performance targeting and reporting.
For example, consider an SLA called “Web Application” that focuses on IT performance for end users. The SLA’s objectives could be broken down by performance:
Consider another example: an SLA called “Customer Service Group” that focuses on the operational readiness of a support team. The SLA’s objectives could be broken down by application:
Service level agreements are viewed in the Global Scan panel from a monitoring perspective, and in My Infrastructure from a configuration perspective. The type of information displayed depends on which of the two you use.
You can view the status of all your SLAs in the Service Level Agreements subpanel, which can be accessed by clicking the View SLAs tab when you are in the Global Scan panel.
For more information about what kind of SLA information you can view in the Global Scan panel, see Viewing All SLAs.
The details of an SLA definition can be viewed in the Service Level Agreement General Information subpanel. This can be accessed from the My Infrastructure panel by clicking the SLA name listed among the Elements, or from the Global Scan panel by clicking the Info tab in the Tree panel, then clicking Info.
The General Information subpanel displays a summary for the SLA that includes the following:
You can view information about the services that make up the SLA by clicking the Services tab in the Tree panel.
Clicking the Graphing tab in the Tree panel, then clicking Current Status displays a verbose status summary of the SLA that includes the following:
See A Note About SLOs and Compliance for more information about SLOs and the Achieving statistic.
SLA downtime occurs when any of the SLA’s services are in a critical state. An SLA is compliant if its downtime has not exceeded a maximum number of minutes over a one-week or one-month Monitoring Period.
For example, consider an SLA whose compliance period type is weekly and whose Monitoring Period is Monday through Friday, 9 a.m. to 5 p.m. The Monitoring Period consists of five eight-hour days--in other words, 40 hours, or 2400 minutes. If the SLA’s target is 95%, it has 120 minutes of allowable downtime (5% of 2400 minutes) for any of its services.
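The arithmetic in this example can be reproduced with a short Python sketch. The function name and parameters are illustrative only and are not part of up.time; the sketch assumes the Monitoring Period is the same number of hours on each monitored day.

```python
def allowable_downtime_minutes(days, hours_per_day, target_percent):
    """Minutes of downtime permitted in one compliance period.

    Hypothetical helper: up.time performs this calculation internally.
    """
    monitored_minutes = days * hours_per_day * 60
    return monitored_minutes * (100 - target_percent) / 100


# Five eight-hour days at a 95% target
print(allowable_downtime_minutes(5, 8, 95))  # 120.0
```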
An SLA’s reported status in the Global Scan panel includes the following in the form of progress bars: the percentage of the Monitoring Period that has expired, and the percentage of allowable downtime consumed during the Monitoring Period. (See Viewing All SLAs for information about SLA information in the Global Scan panel.)
An SLA reaches a critical state when its allowable downtime has been depleted. An SLA reaches a warning-level state when its allowable downtime, at the current rate of use, will be depleted before the compliance period has ended. These states, and the conditions under which they occur, are shown in the Global Scan status display.
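One way to reason about these two states is sketched below in Python. This is not up.time's implementation: the function name, the "ok" label, and the straight-line projection of the current rate of use are all assumptions made for illustration.

```python
def sla_status(period_elapsed_fraction, downtime_used, downtime_allowed):
    """Classify an SLA from the two progress-bar values.

    Assumption: "current rate of use" is projected linearly to the end
    of the compliance period.
    """
    if downtime_used >= downtime_allowed:
        return "critical"  # allowable downtime is depleted
    if period_elapsed_fraction > 0:
        projected = downtime_used / period_elapsed_fraction
        if projected > downtime_allowed:
            return "warning"  # on course to run out before the period ends
    return "ok"


print(sla_status(0.25, 40, 120))  # warning
```

In the usage line, 40 minutes consumed a quarter of the way through the period projects to 160 minutes, which exceeds the 120 minutes allowed.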
The simultaneous downtime of multiple services does not cumulatively impact an SLA’s remaining allowable downtime; the term “allowable downtime” can be expanded to mean the amount of time during which there can be any service downtimes (until the compliance period has ended, after which the counters are reset).
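The non-cumulative rule amounts to measuring the time during which at least one service is down, rather than summing each service's downtime. The Python sketch below illustrates this by merging overlapping outage intervals; the interval representation (start and end minutes) and the function name are assumptions for illustration, not up.time data structures.

```python
def merged_downtime(outages):
    """Total minutes covered by at least one (start, end) outage interval."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(outages):
        if current_end is None or start > current_end:
            # Gap before this outage: close out the previous merged span
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the current span: extend it instead of adding
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Two services overlap between minutes 10 and 30; a third fails later
print(merged_downtime([(0, 30), (10, 40), (100, 110)]))  # 50
```

Summing the three outages separately would give 70 minutes; counting simultaneous outages once gives 50.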
In the following outage graph for an SLO, note that any time an outage is experienced--whether by one or four services--the SLO is deemed to have experienced an outage, which is reflected in the top red line:
It is important to note the role an SLO plays regarding SLA compliance: SLOs exist to help you conceptually separate services into logical groups that make it easier for you to monitor, diagnose, and set performance goals for them. Although the descriptions of “allowable downtime” in the previous section implied that service downtime affects SLA downtime, it is more accurate to say that service downtime affects SLO performance--which in turn, affects SLA downtime.
SLO outages affect reported SLA compliance in the same way service outages affect SLO compliance: allowable downtime is reduced when any outage is experienced. This is also pertinent if you are scanning the “Achieving” statistic for an SLA Summary. (This statistic can be viewed in the Service Level Agreement subpanel of My Infrastructure, by clicking the Graphing tab, then clicking Current Status.)
You can verify how well or poorly an SLA is achieving its target, but you can also view how the component SLOs are performing for the time period. See Viewing SLA Details for information on how to find information such as the Achieving statistic in an SLA summary.
The key to an effective SLA is defining a service level that satisfies end users, yet is also attainable by IT staff and their system configurations. This section covers the suggested steps for pinpointing this target service level:
Determine which service monitors will best reflect the end-user experience, based on the aspect of your infrastructure that your SLA will cover. See SLAs, Service Monitors, and SLOs for some sample SLAs and objectives.
up.time users who do not have existing service monitors should create them and allow them to accumulate data for at least one week. Having historical data is essential to determining what level of service you should target.
When added to an SLA, service monitors that have been collecting data will immediately contribute to the SLA’s reported status. For example, if all of an SLA’s service monitors have a year’s worth of historical data, creating a trial SLA will allow you to see how it would have performed over that last year. Having this historical data in SLA reports helps you analyze each component service monitor in the context of the SLA.
Consider a sample SLA called System Performance that is meant to ensure your application servers are not experiencing excessive loads; this can be indicated by CPU usage and disk space. The first service level objective is based on the Performance Check monitor for the application servers. A critical state occurs when CPU usage exceeds 90%. The second service level objective is based on the File System Capacity monitor. A critical state occurs when remaining disk space falls under 10%.
After creating an SLA based on these objectives, the SLA is immediately shown to be in a critical state--for the current Monitoring Period, one or both of the objectives have already failed to meet the defined service level:
You can investigate outages using the SLA Detailed report. In this example, you determine that the cause of the SLA failure was a prolonged disk-space-related outage that, based on the outage graph, appears to have been resolved:
However, there may be cases where analyzing the SLA Detailed report will show intermittent outages that have not caused your trial SLA to fail, but represent underperforming services that should be optimized:
After outages and underperforming systems have been addressed, use the SLA Summary report to compare test service levels to historical data.
Find a service level that is attainable. For example, in the SLA graph below, a 95% service level would be more realistic than the default 99% level, given the historical data. In the bottom SLA graph, although the 90% service level is compliant based on historical data, the performance history shows that a 95% service level is attainable if the IT department is able to isolate and improve key underperforming systems.
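Comparing candidate service levels against historical data can be sketched in Python as follows. The weekly downtime figures are invented sample data, and the function is a hypothetical helper; in up.time this comparison is made by reading the SLA Summary report.

```python
def compliant_fraction(period_downtimes, monitored_minutes, target_percent):
    """Share of past compliance periods that would have met the target.

    Assumption: a period is compliant when its downtime does not exceed
    the allowable downtime for that target.
    """
    allowed = monitored_minutes * (100 - target_percent) / 100
    met = sum(1 for minutes in period_downtimes if minutes <= allowed)
    return met / len(period_downtimes)


# Hypothetical downtime (in minutes) for five past weeks of 2400 minutes each
history = [30, 150, 90, 200, 60]
for target in (99, 95, 90):
    print(target, compliant_fraction(history, 2400, target))
```

With this sample data, no week meets a 99% target, three of five meet 95%, and all five meet 90%, which suggests 95% becomes attainable once the two worst weeks are addressed.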
up.time provides two types of SLA reports. The SLA Summary report provides high-level SLA compliance information, and the SLA Detailed report provides SLO- and service-level compliance information for system administrators.
See Reports for Service Level Agreements for more information.
Adding and using an SLA requires that you first define the SLA, then add one or more SLOs to it.
Note - When you create an SLA, it will be inserted into the current compliance period. For example, a newly created SLA that reports over a monthly compliance period will, if created on the 15th of the month, already be around 50% through the period.
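The following Python sketch estimates how far through a monthly compliance period a given moment falls. It assumes a calendar-month period monitored around the clock; an SLA with a narrower Monitoring Period would count only the monitored hours. The function name is illustrative and is not an up.time API.

```python
import calendar
from datetime import datetime


def monthly_period_elapsed(now):
    """Fraction of the calendar month that has passed at `now`."""
    days_in_month = calendar.monthrange(now.year, now.month)[1]
    start = datetime(now.year, now.month, 1)
    elapsed_seconds = (now - start).total_seconds()
    return elapsed_seconds / (days_in_month * 24 * 60 * 60)


# An SLA created at noon on the 15th of a 30-day month
print(round(monthly_period_elapsed(datetime(2024, 6, 15, 12)), 2))  # 0.48
```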
Once saved, the SLA’s Service Level Agreement General Information subpanel is displayed (see Viewing SLA Details for more information). From this page, you can add SLOs, as well as associate Alert Profiles and Action Profiles to the SLA.
Note - Any changes made are immediately reflected in any SLA reporting.