The following reports enable you to visualize the overall performance of a system in the Uptime Infrastructure Monitor environment, as well as analyze the information to determine the cause of problems with those systems:
The Resource Usage report tracks the usage of system resources and performance information for systems over a given period of time. In addition to the usage information reported on, the report displays the following information:
Multi-CPU
The percentage of total CPU time used on systems with more than one CPU.
If the report’s rendered graph is too dense because of a large number of CPUs, generate a Multi-CPU Usage graph that includes fewer CPUs instead. |
If the system for which you are creating a report has multiple disks, a graph is generated for each disk on the system. |
Workload (Top 10 - CPU)
The top 10 processes that are consuming CPU time, grouped by user ID, group ID, and process name. This information appears as a graph in the report.
This graph does not appear when you generate this report for a VMware ESX system. |
Workload (Top 10 - Memsize)
The top 10 processes that consume system memory, based on the total memory size of the processes - including virtual pages and shared memory. This information appears as a graph in the report.
This graph does not appear when you generate this report for a VMware ESX system. |
Workload (Top 10 - RSS)
The top 10 processes that are consuming physical memory (in KB), as measured by the resident set size (RSS) of the process. This information appears as a graph in the report.
This graph does not appear when you generate this report for a VMware ESX system. |
Network Device Interfaces
The Resource Usage statistics for all Network Device interfaces associated with the selected Elements. The following statistics are included:
Port Name | the name of the port |
Type | the type of port |
Usage | the percentage of the port's maximum throughput that was used by inbound and outbound packets |
In Rate | the average throughput of inbound packets, in Mbps |
In Usage | the percentage of the port's maximum throughput that was used by inbound packets |
Out Rate | the average throughput of outbound packets, in Mbps |
Out Usage % | the percentage of the port’s maximum throughput that was used by outbound packets |
Errors #/sec | the average number of errors per second |
Discards #/sec | the average number of packets discarded per second |
Status | the port status |
Multiple historical graphs are provided for this report including:
The Resource Hot Spot report is a key checkpoint report that allows you to quickly identify servers and network devices across your enterprise that may be having performance issues, so you can immediately start working to identify what may be causing them.
The Resource Hot Spot report helps you answer the following types of questions:
The report is also a valuable investigative tool that helps you quickly focus on the parts of your infrastructure that require troubleshooting. The report can be configured to include full listings of threshold-violating servers and network devices based on key resource-usage metrics such as memory and CPU usage, port throughput caps, or packet-issue counts. The high, low, and average for these metrics are presented, along with historical graphs for offending metrics; these details can help you confirm whether sustained resource strain or wild swings are caused by resourcing deficits or configuration errors.
The Resource Hot Spot report is also a key starter report, as it is automatically created and saved for new Uptime Infrastructure Monitor installations. This report provides a summary of top resource users for the week. By default, a PDF version of the report is emailed to the SysAdmin user group.
The following are portions of an example Resource Hot Spot report:
The Resource Hot Spot report is a default report that is automatically created and saved for weekly generation on new Uptime Infrastructure Monitor installations, beginning the third day after Uptime Infrastructure Monitor was first installed. |
The following information comprises the Resource Hot Spot report:
Servers | |
---|---|
Top Servers | Of the servers included in the report, the top five resource consuming servers in terms of CPU, memory, swap usage, and disk usage. These servers are listed regardless of whether they violated resource usage thresholds set during report configuration; if your entire infrastructure is meeting your resource usage criteria for the report, the top-five servers are still included in the summary for each category. |
CPU | The percentage of CPU capacity used during the defined time period. In the Top Servers summary, this is an average value for the time period. |
Mem | The percentage of memory used by processes for the defined time period. In the Top Servers summary, this is an average value for the time period. |
Swap Usage | The percentage of memory swap space used during the defined time period. In the Top Servers summary, this is an average value for the time period. |
Disk Busy | The percentage of time the server disk is handling transactions in progress for the defined time period. In the Top Servers summary, this is an average value for the time period. |
Servers with High CPU Usage | A listing of all servers included in the report whose average CPU usage for the time period exceeded the threshold defined during report configuration. Each server's entry includes the following information to help correlate the violating metric with other metrics or events:
|
Servers with High Memory Usage | A listing of all servers included in the report whose average memory usage for the time period exceeded the threshold defined during report configuration. Each server's entry includes the following information to help correlate the violating metric with other metrics or events:
|
Servers with High Swap Usage | A listing of all servers included in the report whose average memory swap space usage for the time period exceeded the threshold defined during report configuration. Each server's entry includes the following information to help correlate the violating metric with other metrics or events:
|
Servers with High Disk Busy | A listing of all servers included in the report whose average disk processing for the time period exceeded the threshold defined during report configuration. Each server's entry includes the following information to help correlate the violating metric with other metrics or events:
|
Network Devices | |
Top Network Devices | Of the network-device type Elements included in the report, the five most inefficient network devices in terms of in rate, out rate, error count, and discards. These network devices are listed regardless of whether they violated throughput or error thresholds set during report configuration; if all of your network devices are passing the criteria for the report, the top-five network devices are still included in the summary for each category. |
In Rate % | The percentage of the network device's maximum throughput, on a per port basis, that was used by inbound packets during the defined time period. |
Out Rate % | The percentage of the network device's maximum throughput, on a per port basis, that was used by outbound packets during the defined time period. |
Errors per sec | The average number of errors encountered per second, on a per port basis, during the defined time period. |
Discards per sec | The average number of packets discarded per second, on a per port basis, during the defined time period. |
Network Devices with High In Rate | A listing of all network devices included in the report whose average in-rate percentage for the time period exceeded the threshold defined during report configuration. Each network device's entry includes the following information to help correlate the violating metric with other metrics or events:
|
Network Devices with High Out Rate | A listing of all network devices included in the report whose average out-rate percentage for the time period exceeded the threshold defined during report configuration. Each network device's entry includes the following information to help correlate the violating metric with other metrics or events:
|
Network Devices with High Errors | A listing of all network devices included in the report whose average error-per-second count for the time period exceeded the threshold defined during report configuration. Each network device's entry includes the following information to help correlate the violating metric with other metrics or events:
|
Network Devices with High Discards | A listing of all network devices included in the report whose average discarded-packet count for the time period exceeded the threshold defined during report configuration. Each network device's entry includes the following information to help correlate the violating metric with other metrics or events:
|
The Resource Cold Spot report is a key checkpoint report that identifies servers and network devices whose resource utilization falls below the specified thresholds; it is the inverse of the Resource Hot Spot report.
The Resource Cold Spot report helps you answer the following types of questions:
The resource consumer summaries rank physical and virtual servers as well as network devices in various resource-related categories, allowing you to correlate bottom-ranking consumers across categories to identify availability.
The report is also a valuable investigative tool that helps you quickly focus on the parts of your infrastructure that require troubleshooting. The report can be configured to include full listings of servers and network devices based on key resource-usage metrics such as memory and CPU usage, port throughput caps, or packet-issue counts. The high, low, and average for these metrics are presented, along with historical graphs for offending metrics.
The Resource Cold Spot report is a default report that is automatically created and saved for weekly generation on new Uptime Infrastructure Monitor installations, beginning the third day after Uptime Infrastructure Monitor was first installed. |
The following information comprises the Resource Cold Spot report:
Servers | |
---|---|
Bottom Servers | Of the servers included in the report, the bottom five resource-consuming servers in terms of CPU and memory. These servers are listed regardless of whether they violated resource usage thresholds set during report configuration; if your entire infrastructure is meeting your resource usage criteria for the report, the bottom-five servers are still included in the summary for each category. |
CPU | The percentage of CPU capacity used during the defined time period. In the Bottom Servers summary, this is an average value for the time period. |
Mem | The percentage of memory used by processes for the defined time period. In the Bottom Servers summary, this is an average value for the time period. |
Servers with Low CPU Usage | A listing of all servers included in the report whose average CPU usage for the time period was less than the threshold defined during report configuration. Each server's entry includes the following information to help correlate the metric with other metrics or events:
|
Servers with Low Memory Usage | A listing of all servers included in the report whose average memory usage for the time period was less than the threshold defined during report configuration. Each server's entry includes the following information to help correlate the metric with other metrics or events:
|
Network Devices | |
Bottom Network Devices | Of the network-device type Elements included in the report, the five least inefficient network devices in terms of in rate, out rate, error count, and discards. These network devices are listed regardless of whether they violated throughput or error thresholds set during report configuration; if all of your network devices are passing the criteria for the report, the bottom-five network devices are still included in the summary for each category. |
In Rate % | The percentage of the network device's maximum throughput, on a per port basis, that was used by inbound packets during the defined time period. |
Out Rate % | The percentage of the network device's maximum throughput, on a per port basis, that was used by outbound packets during the defined time period. |
Network Devices with Low In Rate | A listing of all network devices included in the report whose average in-rate percentage for the time period was less than the threshold defined during report configuration. Each network device's entry includes the following information to help correlate the metric with other metrics or events:
|
Network Devices with Low Out Rate | A listing of all network devices included in the report whose average out-rate percentage for the time period was less than the threshold defined during report configuration. Each network device's entry includes the following information to help correlate the metric with other metrics or events:
|
The Multi-System CPU report charts and compares the CPU performance statistics from multiple systems in your environment. These statistics indicate whether the systems are exhibiting balanced behavior, or if processes are forced off CPUs in certain circumstances.
To create a Multi-System CPU report, do the following:
The CPU Utilization Summary report generates a tabular summary of the CPU and memory consumption over a specific time period. Specifically, this report returns the following information:
To create a CPU Utilization Summary report, do the following:
Page Scans
The number of page scans per second.
The statistic you select must match the sort criteria that you selected in step 4. For example, if your sort criteria is Average CPU you must also select the CPU statistic. Otherwise, an error message appears when you try to generate the report. |
Optionally, in the Architectures to exclude field enter either the name of a system architecture or a regular expression that Uptime Infrastructure Monitor uses to ignore certain system architectures when generating the report.
For example, if you want to exclude all Solaris systems from the report, enter SunOS in the field.
Uptime Infrastructure Monitor determines the architecture of a system by checking the output of the uname -a command on UNIX or Linux, or by analyzing one or both of the following Windows registry keys: HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion and HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion |
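To get a feel for how an exclusion pattern behaves before entering it, you can try it against sample architecture strings. The following is a minimal sketch only: it uses Python's re module and substring matching, both of which are assumptions rather than a description of how Uptime Infrastructure Monitor evaluates the field, and the function name, system names, and uname-style strings are hypothetical.

```python
import re


def exclude_architectures(systems, pattern):
    """Return the names of systems whose architecture string does NOT match.

    `systems` maps a system name to its architecture string (for example,
    text resembling `uname -a` output). Substring matching via re.search is
    an assumption made for this sketch.
    """
    regex = re.compile(pattern)
    return [name for name, arch in systems.items() if not regex.search(arch)]


# Hypothetical sample data, for illustration only.
sample_systems = {
    "web1": "Linux web1 5.15.0 x86_64 GNU/Linux",
    "db1": "SunOS db1 5.10 Generic sun4v sparc",
}

# Excluding Solaris systems by entering SunOS leaves only the Linux system.
print(exclude_architectures(sample_systems, "SunOS"))
```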
The CPU Utilization Ratio report presents, in a table, the ratio of system CPU time to user CPU time over a specified period of time. The ratio is derived by dividing the percentage of system time that is used by the percentage of user time that is used. For example, if the amount of system time that is used is 22.12% and the amount of user time is 5.2%, then the CPU utilization ratio is 4.25.
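The ratio calculation can be reproduced by hand when you want to check a value in the report. The following is a minimal sketch; the function name is hypothetical, and rounding to two decimal places is an assumption based on the example figure.

```python
def cpu_utilization_ratio(system_pct: float, user_pct: float) -> float:
    """Divide the percentage of system time by the percentage of user time."""
    if user_pct <= 0:
        # A ratio is undefined when no user time was recorded.
        raise ValueError("user_pct must be greater than zero")
    return system_pct / user_pct


# The example from the text: 22.12% system time and 5.2% user time.
print(round(cpu_utilization_ratio(22.12, 5.2), 2))  # 4.25
```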
This report contains the following information:
The Wait I/O report enables you to determine the amount of time that processes spend waiting on I/O from a system device.
The Wait I/O report contains the following information:
The Inventory report provides details about the composition of your monitored infrastructure by operating system, across physical and virtual Elements. The report contents can optionally be organized by group, and can include individual Element entries.
These different reporting options allow you to confidently assess your inventory from a variety of perspectives, and help you answer the following types of questions:
Because the Inventory report displays all monitored Elements, the report is intended for system administrators. Non-administrative Uptime Infrastructure Monitor users who do not have permission to view all Elements cannot view complete inventory listings. |
The following information can be displayed in the Inventory Report when you select Show Operating System Summary:
Operating System Summary | ||
unique breakdown of physical and virtual Elements, with totals by OS, and component totals for OS versions | ||
Physical Elements | the total number of systems-type Elements (i.e., not network devices, Applications, and SLAs); this total includes virtual machines that are not managed by a VMware vCenter server Element (e.g., LPARs or VMware VMs manually added to Uptime Infrastructure Monitor, and not through vSync) | |
Virtual Elements | the total number of Elements running on VM instances (i.e., VM instances with their own UUID) | |
Operating System | the detected operating system, including VMware environments | |
Version | the detected operating system version; build version details are listed if available | |
Element Name | the Element’s host name | |
Architecture | the detected hardware platform type on which the Element’s CPUs are running | |
Agent Version | if applicable, the version of the Uptime Infrastructure Monitor Agent that is running on the Element | |
Added Date | the date the Element was added to Uptime Infrastructure Monitor monitored inventory | |
Group | the Element’s Infrastructure group name |
The following information can be displayed in the Inventory Report when you select Show Element Type Summary:
Element Type Summary | ||
unique breakdown of physical and virtual Elements by type, with when the element was added, monitoring status, which group contains each element, and component totals for each Element type | ||
Agent Elements | contains Elements identified by Uptime Infrastructure Monitor as Agents | |
vSphere Elements | contains Elements identified by Uptime Infrastructure Monitor as vSpheres | |
Virtual Machine Agentless Elements | contains Elements identified by Uptime Infrastructure Monitor as Agentless Virtual Machines | |
VMware vCenter Server Elements | contains Elements identified by Uptime Infrastructure Monitor as VMware vCenter Servers | |
Network Device Elements | contains Elements identified by Uptime Infrastructure Monitor as Network Devices | |
Element Name | the Element’s host name | |
OS Type | the detected operating system, including VMware environments | |
Added Date | the date the Element was added to Uptime Infrastructure Monitor monitored inventory | |
Monitored | whether the Element is monitored (True) or not monitored (False) | |
Group | the Element’s Infrastructure group name |
The following information can be displayed in the Inventory Report when you select Show Monitor Summary:
Monitor Summary | ||
unique breakdown of physical and virtual Elements by type, with element name, assigned Service Monitor, whether the Element is monitored and for what period of time, whether there are associated Alert and Action Profiles, and component totals for each Element Service Monitor | ||
Agent Elements | contains Elements identified by Uptime Infrastructure Monitor as Agents | |
vSphere Elements | contains Elements identified by Uptime Infrastructure Monitor as vSpheres | |
Virtual Machine Agentless Elements | contains Elements identified by Uptime Infrastructure Monitor as Agentless Virtual Machines | |
VMware vCenter Server Elements | contains Elements identified by Uptime Infrastructure Monitor as VMware vCenter Servers | |
Network Device Elements | contains Elements identified by Uptime Infrastructure Monitor as Network Devices | |
Element Name | the Element’s host name | |
Service Monitor | the name of the service monitor assigned to this Element | |
Monitored | whether the Element is monitored (True) or not monitored (False) | |
whether notifications are issued for this Element's service monitor (Yes) or notifications are not issued (No) regardless of status or interval | ||
Monitoring Period | the Element's service monitor time period at which Uptime Infrastructure Monitor sends alerts | |
Alert Profile | whether the Element has an associated alert profile (Yes) or not (No). Alert Profiles are templates that tell Uptime Infrastructure Monitor how to react to various alerts that are generated by service checks. | |
Action Profile | whether the Element has an associated action profile (Yes) or not (No). Action Profiles are templates that direct Uptime Infrastructure Monitor when it encounters a problem on a monitored system. |
You can configure the Uptime Infrastructure Monitor service monitors to retain data, which is saved to the Uptime Infrastructure Monitor DataStore for later use. The Service Monitor Metrics report visualizes the retained data in a line chart.
For example, if you have configured a service monitor to retain response time data then this report charts any changes in the response time (in milliseconds) that have occurred over the time period that you specified for the report.
Creating a Service Monitor Metrics report is a two-step process:
The following reports enable you to visualize the resource usage of systems in your Uptime Infrastructure Monitor environment, and then use that information to better plan, deploy, and consolidate your server resources:
The Enterprise CPU Utilization report enables you to compare the processing power of different types of systems in your environment. Performing this kind of comparison is difficult because different types of systems use different processors - for example, a Windows server uses an Intel processor while a Solaris server may use a SPARC processor. The benchmarks for measuring the power of each type of processor are different.
An Enterprise CPU Utilization report offers a quick snapshot of the overall performance of the servers in your environment. Based on the information in the report, you can then determine how best to optimize CPU capacity across your enterprise.
Uptime Infrastructure Monitor can measure processing power using a statistic called power units. Power units are the number of CPUs on a system multiplied by the speed of the processors. For example, a Solaris server has four CPUs and each CPU runs at 168 MHz. The total number of power units for the server is 672 (4 x 168). If you compare this to a Windows server with one CPU running at 2900 MHz (2,900 power units), then you can conclude that the Windows server has more processing power.
Enterprise CPU utilization is a percentage that is derived by dividing the total number of power units used by the total number of power units available. For example, if the number of power units used is 104 and the total number of available power units is 2,346 then the enterprise CPU utilization is 4.43%.
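The power unit and enterprise CPU utilization calculations described above can be sketched as follows. This is an illustration of the arithmetic only, not the product's implementation; the class, function, and server names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    cpu_count: int
    cpu_mhz: float

    @property
    def power_units(self) -> float:
        # Power units = number of CPUs multiplied by processor speed.
        return self.cpu_count * self.cpu_mhz


def enterprise_cpu_utilization(units_used: float, units_available: float) -> float:
    """Return power units used as a percentage of power units available."""
    if units_available <= 0:
        raise ValueError("units_available must be greater than zero")
    return 100.0 * units_used / units_available


solaris = Server("solaris-example", cpu_count=4, cpu_mhz=168)
windows = Server("windows-example", cpu_count=1, cpu_mhz=2900)

print(solaris.power_units)  # 672
print(windows.power_units)  # 2900

# 104 power units used out of 2,346 available.
print(round(enterprise_cpu_utilization(104, 2346), 2))  # 4.43
```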
The File System Capacity Growth report illustrates the following:
On Windows servers with a single disk, Uptime Infrastructure Monitor looks at the capacity of the main partition (usually the C:\ drive). If the Windows server has multiple disks, this report collects information for all of the disks. On UNIX and Linux servers, Uptime Infrastructure Monitor looks at individual file systems (for example, /var , /export , or /usr ) on all the disks in the system.
This report ignores floppy drives, tape drives, and CD-ROM drives. |
Many organizations have a number of production servers that are not used to their full capacity. For example, a server could be running one or two applications and not using much of the hardware. Instead of wasting resources, you can consolidate these applications in a virtual environment, for example using VMware. This enables you to run applications on distinct servers, but without using as much hardware.
The Server Virtualization report can help you to pinpoint physical servers that can be combined on a single virtual server. The report highlights servers that are good candidates for virtualization - ones that do not fully use their CPU, memory, or disk resources.
In the report, each system has one of the following stars beside it:
As well, the metrics for Average Power Units Used (power units measure the power of CPUs by multiplying the number of CPUs on a system by their speed), Avg Disk I/O, and Avg Network I/O for each system may be highlighted.
The results of a Server Virtualization report can help you to determine which physical servers to combine on a single virtual server. In order to effectively use the report, you must analyze the results in more depth.
First, look at the average number of power units used by the systems that you want to consolidate on a virtual server. That figure should be less than the total number of power units available on the target system.
Next, look at the disk I/O for the individual systems. If the system is running an application that has high levels of disk usage (for example, a database), that system might not benefit from virtualization. If, however, the target system has a very fast disk, you can still consider moving the candidate system to it.
Also, consider the geographical locations of the systems for which you are generating the report. For example, the report states that four systems of a similar type are good candidates for virtualization. However, two of those systems are in different parts of the country or the world. In this case, adding them to a virtual server is not a viable option.
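The power-unit check described above can be sketched as a simple comparison. This sketch assumes that the average power units used by the candidate systems are summed before being compared with the target system's capacity; the function name and the sample figures are hypothetical, and disk I/O and location still need to be reviewed separately.

```python
from typing import Iterable


def fits_on_target(avg_power_units_used: Iterable[float],
                   target_cpu_count: int,
                   target_cpu_mhz: float) -> bool:
    """Check whether the candidates' combined average load fits on the target.

    Assumption for this sketch: the candidates' averages are added together,
    and the total must be less than the target's available power units.
    """
    target_capacity = target_cpu_count * target_cpu_mhz
    return sum(avg_power_units_used) < target_capacity


# Hypothetical candidates using 104, 250, and 90 power units on average,
# consolidated on a hypothetical target with two 2900 MHz CPUs.
print(fits_on_target([104, 250, 90], target_cpu_count=2, target_cpu_mhz=2900))
```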
Solaris systems with two or more CPUs can suffer from mutex (mutual exclusion) locks when two or more threads are waiting for the same resource. During processing, the Solaris kernel maintains locks on various resources. The kernel allocates enough mutex locks to allow multiple CPUs to complete their work simultaneously. However, if two or more CPUs try to get the same lock at the same time, all but one CPU stalls.
The Solaris Mutex Exception report pinpoints multi-processor Solaris systems that have a high number of mutex stalls. The report contains the following information:
If you are generating reports for specific Applications in your environment, select them from the List of Elements.
Only Solaris systems with two or more CPUs are shown in the List of Elements. |
The following is an example of a Solaris Mutex Exception report:
The number of mutex stalls for the first system in the list exceeds the threshold that was set when the report was defined. Based on this information, you can generate one of the following graphs to get a better idea of the performance of the CPUs on the system:
From there, you determine how to best reduce the queue size to improve performance.
The Network Bandwidth report keeps track of the amount of data moving in and out of each network interface on a system. This report helps you identify or confirm that specific systems are overloaded, based on the amount of data they are sending or receiving; such systems could become bottlenecks for the whole network.
The amount of data moving through each interface is measured in megabytes. However, the following systems store data as packets rather than bytes:
If you are monitoring one or more of these systems, you can specify a ratio for converting packets to bytes.
Different network interfaces have a maximum packet size called a Maximum Transmission Unit (MTU) - an Ethernet interface, for example, has an MTU of 1,500 bytes. Most interfaces do not transmit packets at the MTU. The value that you specify for the bytes-per-packet conversion is based on the observed performance of the network interface. Fifty percent of MTU is a good average to use - the default value in Uptime Infrastructure Monitor is 750.
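The packet-to-byte conversion can be sketched as follows. The default of 750 bytes per packet comes from the text above; treating one megabyte as 1,048,576 bytes is an assumption, as are the function and constant names.

```python
ETHERNET_MTU_BYTES = 1500
DEFAULT_BYTES_PER_PACKET = 750  # fifty percent of the Ethernet MTU
BYTES_PER_MEGABYTE = 1024 * 1024  # assumption: binary megabytes


def packets_to_megabytes(packets: int,
                         bytes_per_packet: int = DEFAULT_BYTES_PER_PACKET) -> float:
    """Estimate the data volume, in megabytes, for a packet count."""
    if packets < 0 or bytes_per_packet <= 0:
        raise ValueError("packets must be >= 0 and bytes_per_packet must be > 0")
    return packets * bytes_per_packet / BYTES_PER_MEGABYTE


# 1,048,576 packets at 750 bytes each.
print(packets_to_megabytes(1_048_576))  # 750.0
```

If the interfaces you monitor typically carry larger or smaller packets, pass a different bytes_per_packet value to see how much the estimate changes.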
The report contains the following information:
The following is an example of a Network Bandwidth report:
In this example, the system Filter has high levels of network traffic flowing in and out of a particular network interface. Based on this information, you can generate a Network graph (see Network Graphs for more information) to get a better idea of why network I/O is so high on the system.
The Disk I/O Bandwidth report keeps track of the amount of data read from and written to a disk on a system. The report can display the amount of data either as blocks or megabytes.
The report contains the following information:
You can use regular expressions to include or exclude disks and file systems when generating a Disk I/O Bandwidth Report (or a File System Service Time Summary Report), as shown below:
Using regular expressions, you can focus on particular disks or file systems on a server and also decrease the length of your report.
The regular expression syntax used with the Disk I/O Bandwidth Report or a File System Service Time Summary Report is similar to that used with the File System Capacity Growth report. For example, if you are generating a report on an Oracle volume and only want to focus on five specific file systems, you can enter the regular expression /u[0-4] in the Exceptions field.
If, on the other hand, you are working with a UNIX system with multiple disks and want to focus on disks whose names start with md1 but ignore those whose names start with md2 , you can enter the regular expression /md1.* in the Exceptions field and /md2.* in the Exclude Disks field.
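You can test patterns like these against sample names before using them in a report. The following is a minimal sketch: it uses Python's re module and matches each pattern against the whole name, both of which are assumptions rather than a description of how Uptime Infrastructure Monitor evaluates these fields. The function name and sample names are hypothetical.

```python
import re
from typing import Iterable, List, Optional


def select_names(names: Iterable[str],
                 include_pattern: str,
                 exclude_pattern: Optional[str] = None) -> List[str]:
    """Keep names that match include_pattern and do not match exclude_pattern.

    Whole-name matching (re.fullmatch) is an assumption made for this sketch.
    """
    include = re.compile(include_pattern)
    exclude = re.compile(exclude_pattern) if exclude_pattern else None
    selected = []
    for name in names:
        if not include.fullmatch(name):
            continue
        if exclude is not None and exclude.fullmatch(name):
            continue
        selected.append(name)
    return selected


# /u[0-4] selects the five file systems /u0 through /u4.
print(select_names(["/u0", "/u3", "/u4", "/u5", "/var"], r"/u[0-4]"))

# /md1.* keeps disks whose names start with md1; /md2.* drops the md2 disks.
print(select_names(["/md10", "/md11", "/md20", "/sd0"], r"/md1.*", r"/md2.*"))
```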
The following is an example of a Disk I/O Bandwidth report:
In this example, the systems Brightmail and Weblogic Server have high levels of disk I/O. Based on this information, you can generate a Disk Performance Statistics graph (see Generating a Disk Performance Statistics Graph for more information) to get a better idea of why disk I/O is so high on the system.
The CPU Run Queue Threshold report lists, for periods when a system’s CPU reaches a high level of usage, the number of jobs that were ready to run but waiting in a queue, as well as the amount of time they were waiting.
If the size of the run queue is appreciably larger than the number of available processors on a system, or the run queue is backlogged for long periods of time, you can conclude that the server is overloaded.
You can use this report to pinpoint servers that are overloaded using the following factors:
This report contains the following information:
The following is an example of a CPU Run Queue Threshold report:
In this example, the system is consistently over the run queue threshold that was specified when the report was defined. Based on this information, you can generate a CPU performance graph (see Monitoring CPU Performance for more information) to get a better idea of why the system is exceeding the CPU run queue threshold.
The File System Service Time Summary report indicates which system disks (and file systems) are using an excessive amount of time to complete disk operations. This report helps you identify which systems may benefit from configuration changes (e.g., adding RAM, moving a file system to another hard disk, implementing a RAID).
The report contains the following information:
You can also sort the results in the report by one of six criteria that you can specify when defining the report.
The following is an example of a File System Service Time Summary report:
In this example, the disks on each system have high levels of service time, and they are in the highest percentile that exceeds the service time threshold.
The following reports enable you to assess your organization’s ability to meet service level agreements, and to diagnose failures in meeting them, by summarizing compliance and reporting on the compliance and non-compliance of an SLA’s component objectives and services:
The SLA Summary report shows whether an SLA’s performance target is met, whether performance, even though currently compliant with the defined target, may eventually fall short, and how component SLOs contributed to performance. The report contains charts and a table that provide the following information:
The report answers the following questions:
For more information on SLA definitions, see Service Level Agreements.
In cases where an SLA compliance target is not met, the SLA Detailed report breaks down both the outages of an SLA’s component SLOs, and the outages of each SLO’s component services. This report allows you to pinpoint when specific services experienced outages, assisting with further investigation.
The report answers the following questions:
For more information on SLA definitions, see Service Level Agreements.
The following reports enable you to visualize the availability metrics for all your mission-critical Applications and your critical system services:
The Server Uptime report is a key checkpoint report that provides you with a focused and succinct snapshot of your infrastructure's availability. Report components include overall availability based on a defined uptime threshold, availability by defined interval over the reporting period, as well as tallies of the number of Elements that experienced one or more outages, and the total number of outages. To assist with follow-up actions, Elements are listed by outage time and include details that help you determine whether the outage frequency or duration is contributing the most to total downtime. The Server Uptime report helps you answer the following types of questions:
The Server Uptime report is also a key starter report, as it is automatically created and saved for new Uptime Infrastructure Monitor installations. This daily report provides an hourly breakdown of availability, using a 95% uptime threshold. By default, a PDF version of the report is emailed to the SysAdmin user group.
The following is an example of a Server Uptime report:
The Server Uptime report is a default report that is automatically created and saved for daily generation on new Uptime Infrastructure Monitor installations. |
The following details are displayed in the Server Uptime report:
Uptime Summary | |
---|---|
Overall Uptime | The uptime of all Elements included in the report for the defined time period. This is a composite uptime value for all individual Elements that are in an OK, WARN, or MAINT state; Element or Element group averages, or maximum values for a time period do not contribute to overall uptime. |
Element Outages | The total number of separate Element outages during the time period (because an individual Element can have more than one outage in the same time period). |
Elements That Failed | The number of Elements that experienced an outage during the time period. Use this value to ensure the previous Element Outages count is not misleading due to the underperformance of, for example, a single Element. |
availability graph | A breakdown of the overall uptime for the time period, where the granularity is dependent on the Breakdown Type set during report configuration: by hour, day, week, or month. Availability for each time slice (i.e., whether it is marked as pass/green, or fail/red) is determined by the Target Percentage set during report configuration. |
Uptime Details | |
Element | The name of the Element that's in this report. Whether this Element is listed individually or within an Element group listing depends on whether you selected the Group by Element Group check box during report configuration. Elements are primarily sorted by uptime; Elements with equal uptime are sorted by name. |
Uptime | The uptime for the specific Element during the time period, expressed as a percentage and bar. Element lists in the report are sorted by uptime. The Target Percentage set during report configuration determines whether the Element is marked as pass/green, or fail/red. |
Minutes Down | The total number of minutes the Element spent in a "down" state (CRIT or UNKNOWN) for the time period. If the Element experienced no downtime, this field is blank. |
Outages | The number of outages the Element experienced during the time period. If the Element experienced no downtime, this field is blank. |
Longest | The number of minutes that comprises the Element's longest outage during the time period. Use this value to ensure the previous Minutes Down tally is not misleading due to, for example, a particularly long single outage among several short ones. If the Element experienced no downtime, this field is blank. |
Because reports have a finite amount of space to present their information, use a level of granularity that suits the breadth of the date and time range selected for the report (e.g., hourly time slices for a daily report, or daily time slices for a weekly report). |
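
The tallies described in the Uptime Details table can be illustrated with a small calculation. The following is a minimal sketch, not the product's implementation: it assumes outages are supplied as non-overlapping minute ranges measured from the start of the reporting period, and the names `Outage`, `element_uptime`, and `slice_passes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_min: int  # minutes from the start of the reporting period
    end_min: int    # exclusive end, same unit


def element_uptime(outages, period_min):
    """Return (uptime_pct, minutes_down, outage_count, longest_min)."""
    durations = [o.end_min - o.start_min for o in outages]
    minutes_down = sum(durations)
    uptime_pct = 100.0 * (period_min - minutes_down) / period_min
    longest = max(durations, default=0)
    return uptime_pct, minutes_down, len(durations), longest


def slice_passes(outages, slice_start, slice_len, target_pct):
    """True if uptime within one time slice meets the target percentage."""
    slice_end = slice_start + slice_len
    down = 0
    for o in outages:
        # Count only the part of the outage that falls inside this slice.
        overlap = min(o.end_min, slice_end) - max(o.start_min, slice_start)
        if overlap > 0:
            down += overlap
    return 100.0 * (slice_len - down) / slice_len >= target_pct


# A daily report (1440 minutes) with hourly slices and a 95% target.
outages = [Outage(60, 70), Outage(600, 690)]
print(element_uptime(outages, 1440))
print([slice_passes(outages, h * 60, 60, 95.0) for h in range(24)])
```

Comparing the minutes-down total with the longest single outage shows whether frequency or duration dominates downtime. In this sample, one 90-minute outage accounts for most of the 100 minutes down.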
The Application Availability report tracks the availability of the Applications in your environment, as well as the monitors that are associated with the Applications. This report contains the following information:
For more information on Applications, see Working with Applications.
The Incident Priority report provides information on the frequency, duration, and recovery time of critical-level events, and the overall reliability of your monitored systems. This information is presented for services that are associated with groups of Elements (whether a pre-defined group, or a manually selected list of individual Elements). Compared to the Service Monitor Outages report, the Incident Priority report, instead of providing an auditable list of outages, uses a comparative approach to indicate how efficiently systems are running in relation to each other, and furthermore, how efficiently problems are dealt with.
In order to report this efficiency, the following building blocks are available as elements in the report:
Note that, to provide clear results in the report, only service monitors that were manually assigned to, and are directly associated with, an Element are taken into account when downtime and incident counts are tallied. This means service monitors that may be automatically installed such as the Platform Performance Gatherer are not included; additionally, only an Application’s status as a whole affects downtime and incident counts, but its component service monitors--both master and regular service monitors--do not.
Using downtime and efficiency counts, the Incident Priority report includes the following key elements:
For all report elements, a service monitor is considered to have reached a critical state--thus has caused an incident, is contributing to downtime, or is an ongoing failure--when it actually generates an alert. The period preceding the alert, during which rechecks are intermittently performed to avoid a false positive, does not count. See Understanding the Alert Flow for information on rechecks leading to a generated alert.
The Service Monitor Availability report tracks the status of the services associated with the hosts in your environment. This report lists the percentage of time each service was in the following states over the time period that you specify: OK, Warning, Critical, Maintenance, or Unknown.
For more information on each status, see Understanding the Status of Services.
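The per-state percentages in the Service Monitor Availability report amount to a time-weighted breakdown. The sketch below shows one way such a breakdown could be computed; it assumes status history is available as `(state, minutes)` pairs, which is an illustrative input format rather than anything the product exposes, and `state_percentages` is a hypothetical name.

```python
from collections import defaultdict

STATES = ("OK", "Warning", "Critical", "Maintenance", "Unknown")


def state_percentages(intervals):
    """Map each state to the percentage of total time spent in it.

    intervals: iterable of (state, minutes) pairs covering the period.
    """
    totals = defaultdict(float)
    for state, minutes in intervals:
        if state not in STATES:
            raise ValueError(f"unexpected state: {state}")
        totals[state] += minutes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {s: 0.0 for s in STATES}
    return {s: 100.0 * totals[s] / grand_total for s in STATES}


history = [("OK", 900), ("Warning", 60), ("Critical", 40)]
for state, pct in state_percentages(history).items():
    print(f"{state:<12}{pct:5.1f}%")
```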
The Service Monitor Outages report lists all warning or critical events for services that have occurred over a specified time period. Use this report to determine the cause of a problem by analyzing the declining availability of a server or set of servers.
The Service Monitor Outages report contains the following information:
To create a Service Monitor Outages report, do the following:
The following reports enable you to visualize any performance problems with applications that are running in J2EE environments:
The WebSphere report charts a set of counters that provide insight into the health and performance of a WebSphere Application Server. Depending on the number of options that you select, the report can become quite long and can take considerable time to generate. For most options, the report contains charts for two or more metrics.
Because WebSphere is large and complex, it can be difficult to pinpoint the source of a problem with the server or an application running on the server. This is especially true when that problem is intermittent. Watching for problems in real time only gives you a snapshot of the problem. The Uptime Infrastructure Monitor WebSphere report, on the other hand, gives you a detailed historical perspective of the problem. Using the information in the report, you can find the source of the problem.
For example, users have trouble working with an application that intensively uses a database. Checking the Connection Pool charts section of a WebSphere report could indicate the source of the problem - the database has reached its maximum number of connections.
You can then adjust the size of the database connection pool to allow more connections.
Or, if a WebSphere application is using a large amount of memory, you could check the JVM charts section of the report. If there are spikes in the heap size or memory usage of the JVM, you can tune the JVM to ensure that it is working at optimal levels.
The WebLogic report charts a set of metrics (see WebLogic for details) that provide insight into the health and performance of a WebLogic server. Using the WebLogic report, you can pinpoint problem areas on your WebLogic server and quickly determine how to fix those problems.
Depending on the number of options that you select, the report can become quite long and can take considerable time to generate. For most options, the report contains charts for two or more metrics.
Because WebLogic is large and complex, it can be difficult to pinpoint the source of a problem with the server or an application running on the server. This is especially true when that problem is intermittent. Watching for problems in real time only gives you a snapshot of the problem. The Uptime Infrastructure Monitor WebLogic report, on the other hand, gives you a detailed historical perspective of the problem. Using the information in the report, you can find the source of the problem.
For example, users have trouble logging into an application that is running on the WebLogic server. Checking the Connection Pool charts section of a WebLogic report, you might see that the size of the connection pool has reached its maximum, and that there are a large number of connections that are waiting in the pool. From there, you can then adjust the size of the connection pool to allow more connections.
Or, if a WebLogic application is using a large amount of memory, you could check the JVM charts section of the report.
If there are increases or sudden spikes in the heap size or memory usage of the JVM, then you can tune the JVM to ensure that it is working at optimal levels.
Virtualization platforms such as VMware vSphere enable you to consolidate servers and applications in a virtual environment. Using virtual machine managers such as VMware ESX, you can run multiple servers or applications on a single system, but without using as much hardware. Each server or application runs in its own VMware instance. You can use VMware vSphere to manage and monitor ESX servers, as well as allocate resources among virtual machines.
Uptime Infrastructure Monitor's VMware and pSeries reports enable you to visualize the performance of systems that are consolidated on virtual machines, whether using VMware or IBM pSeries Logical Partitions (LPARs):
The VM Sprawl report helps you assess the extent of sprawl across your virtual infrastructure, and provides you with the information needed to reduce it. Using the report, you can perform the following types of tasks:
Depending on how it is configured, the VM Sprawl report consists of at least one of the following components:
The VMware vSphere workload report provides a broad view of workloads across your entire virtual infrastructure. For selected Elements, it includes a resource usage overview (CPU, memory, disk, and network). It also includes detailed resource usage charts for selected Elements' child objects in the vSphere hierarchy.
For reporting periods, total resource usage is reported regardless of how the VMware vSphere object's child objects change during that time period. For example, a four-week overview chart for an ESX server includes performance of VMs that have since migrated. |
Using this report, you can visualize resource usage at the datacenter, cluster or ESX server level, as well as with resource pools, vApps, and virtual machines. You can also understand how these Elements' respective child objects contribute to their usage levels.
The following metrics can be displayed in a VM Workload report:
CPU Usage | the following is shown for the interval:
| |
Memory Usage | the following is shown for the interval:
| |
Disk I/O | the rate, in MBps, at which ESX hosts or VMs are reading and writing data to and from disk | |
Network I/O | the rate, in Mbps, at which ESX hosts or VMs are receiving or transmitting data over the network | |
% Wait | the amount of time during the interval, as a percentage, that the VM or all VMs on an ESX host, resource pool or vApp had scheduled CPU time, but had nothing to process | |
% Ready | the amount of time during the interval, as a percentage, that the VM or all VMs on an ESX host, resource pool or vApp were ready to process, but were not scheduled CPU time by the host |
For the % Wait and % Ready metrics, it is possible to be presented with values that exceed 100%. The underlying data used in the workload report's graphs are migrated from VMware vSphere via vSync; VMware conventions include percent-based metrics that can be greater than 100%. For example, refer to the VMware Technical Note, Performance Counters, at http://www.vmware.com/files/pdf/technote_PerformanceCounters.pdf. |
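
To see how a percent-based value can exceed 100%, consider the conversion VMware documents for the CPU ready summation counter: the ready time in milliseconds is divided by the length of the sampling interval. Because ready time is summed across a VM's virtual CPUs, a multi-vCPU VM can accumulate more ready time than the interval itself contains. The following is a minimal sketch under that assumption; the source does not state how Uptime Infrastructure Monitor derives the charted value, and `ready_percent` and its parameters are hypothetical.

```python
def ready_percent(ready_ms, interval_s, vcpus=1):
    """Convert a CPU ready summation (ms) to a percentage of the interval.

    With vcpus=1 the summed value is used as-is and can exceed 100%.
    Passing the VM's vCPU count normalizes the result per vCPU.
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return 100.0 * ready_ms / (interval_s * 1000.0 * vcpus)


# A 4-vCPU VM whose vCPUs each waited 6000 ms during a 20-second sample.
summed_ready_ms = 4 * 6000
print(ready_percent(summed_ready_ms, 20))           # summed across vCPUs
print(ready_percent(summed_ready_ms, 20, vcpus=4))  # normalized per vCPU
```

In this example the summed figure is 120%, while the per-vCPU figure is 30%.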
A VMware server often slows down because an instance on the server is consuming large amounts of system resources such as CPU, disk I/O, and memory. The problem could lie with an instance that is currently slow or another instance on the same server. The VMware Workload report charts the workload of the server. You can also use the report to determine whether you are using a particular VMware server to its optimal capacity.
This report provides information only about older, legacy VMware ESX type Elements that are part of your monitored infrastructure. |
The VMware Workload report can be a useful tool for determining whether a VMware server is used to its optimal capacity. Consider the following example, in which the VMware Workload report returns the following information about the top ten CPU loads on the VMware server:
This graph indicates that, on average, the 10 most CPU-intensive instances use only 20% of the server’s CPU capacity. The CPU on the server can handle roughly three to four times its current load.
The memory usage section of the report indicates that the instances are using roughly the same amount of memory:
The server appears to have an ample amount of memory available.
The report indicates that you can add more instances to the VMware server.
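The headroom estimate in this example can be reproduced with simple arithmetic. The sketch below treats the estimate as the ratio between a chosen utilization ceiling and the current average load. The 70% and 80% ceilings are illustrative assumptions, not values defined by Uptime Infrastructure Monitor, and `load_multiple` is a hypothetical helper.

```python
def load_multiple(current_pct, ceiling_pct):
    """How many times the current load fits under a utilization ceiling."""
    if current_pct <= 0:
        raise ValueError("current_pct must be positive")
    return ceiling_pct / current_pct


# Average CPU use of 20%, evaluated against several assumed ceilings.
for ceiling in (70, 80, 100):
    multiple = load_multiple(20, ceiling)
    print(f"ceiling {ceiling}%: {multiple:.1f}x current load")
```

With a 20% average, ceilings of 70% and 80% give multiples of 3.5 and 4, which leaves a safety margin below the theoretical limit of 5.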
The VM Density report enables you to assess the carrying capacity and workload distribution of your ESX infrastructure. To accomplish this, virtual machine counts are tracked and reported on a daily basis, where the peak VM count for a given day is used as that day’s tally. The information available in the report includes the following:
Using this report, you can have a better understanding of virtualized workloads by seeing ESX server use and trends, and quantifying VM creation overall, and on a server-by-server basis.
This report provides information only about older, legacy VMware ESX type Elements that are part of your monitored infrastructure. |
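
The daily tally described for the VM Density report, where the peak VM count for a day stands in for that day, can be sketched as follows. This is an illustration only: it assumes VM counts are available as timestamped samples, and `daily_peak_vm_counts` is a hypothetical name.

```python
from datetime import datetime


def daily_peak_vm_counts(samples):
    """Reduce (timestamp, vm_count) samples to the peak count per day."""
    peaks = {}
    for ts, count in samples:
        day = ts.date()
        # Keep the highest count seen so far for this calendar day.
        peaks[day] = max(count, peaks.get(day, count))
    return dict(sorted(peaks.items()))


samples = [
    (datetime(2024, 1, 1, 9, 0), 12),
    (datetime(2024, 1, 1, 15, 0), 18),
    (datetime(2024, 1, 2, 8, 0), 16),
    (datetime(2024, 1, 2, 20, 0), 14),
]
for day, peak in daily_peak_vm_counts(samples).items():
    print(day, peak)
```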
The LPAR Workload report charts the workload of the individual logical partitions (LPARs) on an IBM pSeries server. It does this by graphing the following workload data:
Using the information in the report, you can ... This enables you to ...
The LPAR Workload report takes the guesswork out of determining CPU entitlements for the LPARs on a pSeries server. The entitlements indicate the amount of CPU power that is assigned to an LPAR.
For example, you have an LPAR with hard entitlement (one that cannot use spare processing power from another CPU on the server) and its CPU usage is constantly at or near the maximum. In this case, you can either increase the CPU entitlement of the LPAR, or change it to a soft entitlement.
If, on the other hand, the LPAR has a soft entitlement (one which can use spare processing power from another CPU on the server) and its CPU usage is consistently at or greater than the entitlement, you can increase it.
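The two entitlement cases above can be expressed as a small decision rule. The following is a sketch under stated assumptions: usage and entitlement are in the same units, "near the maximum" is taken to mean at least 95% of the entitlement, and "constantly" means at least 90% of samples. Both thresholds, and the name `entitlement_advice`, are hypothetical choices for illustration.

```python
def entitlement_advice(usage_samples, entitlement, hard, near=0.95, share=0.9):
    """Suggest an action for an LPAR based on its CPU usage history.

    hard=True  : the LPAR cannot borrow spare processing power.
    hard=False : soft entitlement; usage may exceed the entitlement.
    """
    if not usage_samples:
        raise ValueError("usage_samples must not be empty")
    # Hard: at or near the maximum. Soft: at or above the entitlement.
    threshold = near * entitlement if hard else entitlement
    hits = sum(1 for u in usage_samples if u >= threshold)
    if hits / len(usage_samples) < share:
        return "no change"
    if hard:
        return "increase entitlement or change to soft entitlement"
    return "increase entitlement"


print(entitlement_advice([0.98, 1.0, 0.99, 1.0], 1.0, hard=True))
print(entitlement_advice([1.2, 1.1, 1.0, 1.3], 1.0, hard=False))
print(entitlement_advice([0.3, 0.4, 0.2], 1.0, hard=True))
```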
The Datastore Capacity Growth report illustrates the following for virtual storage capacity:
On Windows servers with a single disk, Uptime Infrastructure Monitor looks at the capacity of the main partition (usually the C:\ drive). If the Windows server has multiple disks, this report collects information for all of the disks. On UNIX and Linux servers, Uptime Infrastructure Monitor looks at individual file systems (for example, /var, /export, or /usr) on all the disks in the system.
This report ignores floppy drives, tape drives, and CD-ROM drives. |