Uptime Infrastructure Monitor can display the performance and availability statistics for the systems that you are monitoring in a graph. You can use the graphs to collect and display information for Elements, services, and configurations.
You have different graphing options depending on the operating system that is running on a host. The metrics that Uptime Infrastructure Monitor agents capture and return to the Monitoring Station differ from operating system to operating system.
If a graph is not available in the Tree panel for a given host, the host does not provide the metric that the graph requires. Also, if you add a node or a virtual node, such as a router or IP address, it appears only on the Config and Services tabs, because other metrics such as CPU and disk usage are not available from the node.
In most cases, you can interpret performance data from different platforms - such as Windows, UNIX and Linux - in similar ways. When the interpretation of the data is different, the Uptime Infrastructure Monitor interface displays operating system-specific information - such as the performance counters used - as necessary.
You can view the status of a system in your environment using a Quick Snapshot. The Quick Snapshot summarizes key hardware and process information for a system for the last 24 hours. If there are not 24 hours' worth of data available, Uptime Infrastructure Monitor uses data from as far back as possible to generate charts.
The Quick Snapshot is typically used as a preliminary step toward root cause analysis. When you first acknowledge an issue by clicking an Element name on either the Global Scan dashboard, or the My Alerts section of My Portal, you are shown the Quick Snapshot for that Element. From here, you can work with the information provided in the charts and tables and begin further investigation:
The Quick Snapshot contains the following information:
System Status Charts
Top 10 Processes
File System Statistics
The components that make up a Quick Snapshot depend on the type of Element in view. Monitored Elements typically provide the information listed above. For information about Quick Snapshots for VMware vSphere objects or network devices, see Viewing the Status of a vSphere Element and Viewing the Status of a Network Device, respectively.
On the Global Scan dashboard, click the name of the system whose information you want to graph. The Quick Snapshot is displayed by default.
Generally speaking, you can access a Quick Snapshot for an Element by clicking the Graphing tab, then clicking Quick Snapshot in the Tree panel.
Uptime Infrastructure Monitor uses the following graphs to chart the performance of one or more CPUs on a system:
These graphs use the same input criteria, but they return different data.
The Usage (% Busy) graph charts the percentage of a system’s CPU resources that are used over a period that you specify. This graph displays three components of CPU time: user, system, and wait I/O. Taken together, these components display the total amount of CPU usage. On a system with multiple CPUs, the numbers are averages across all CPUs.
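The way the three components combine can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the product's implementation; the function name and the dictionary keys are hypothetical.

```python
def average_cpu_busy(per_cpu):
    """Average user, system, and wait I/O percentages across all CPUs.

    per_cpu is a list with one dict per CPU; the keys "user", "system",
    and "wait_io" are hypothetical names chosen for this sketch.
    """
    n = len(per_cpu)
    user = sum(cpu["user"] for cpu in per_cpu) / n
    system = sum(cpu["system"] for cpu in per_cpu) / n
    wait_io = sum(cpu["wait_io"] for cpu in per_cpu) / n
    # Total CPU usage is the three components taken together.
    return {
        "user": user,
        "system": system,
        "wait_io": wait_io,
        "busy": user + system + wait_io,
    }


two_cpus = [
    {"user": 40, "system": 10, "wait_io": 10},
    {"user": 20, "system": 10, "wait_io": 0},
]
result = average_cpu_busy(two_cpus)
```

On the two-CPU sample, the per-CPU totals are 60% and 30%, so the averaged busy figure is 45%.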
The key CPU usage metric in Windows is % Processor Time, which measures the percentage of time the CPU spends executing a thread that is not idle. If usage is consistently at 80% to 90%, you may need to upgrade the CPU or add more processors.
You should monitor a separate instance of this counter for each processor on systems with multiple CPUs. The value returned by the counter represents the sum of processor time on a specific processor.
To determine the average for all processors, monitor the System: %Total Processor Time metric.
Optionally, you can monitor the following metrics:
In UNIX and Linux, Uptime Infrastructure Monitor graphs the following metrics:
The Run Queue Length graph charts the number of processes that are not currently running and are waiting to be served by the CPU. If several processes are competing for CPU time, you might need to install a faster processor, or add another processor if you are using a multiprocessor system.
A long queue increases the time that a request waits before it is carried out by the CPU. However, it does not affect the time that is required to process each request once the CPU starts carrying out the request.
Uptime Infrastructure Monitor counts the number of processes that are waiting in queue at a particular point in time. If the run queue or load average is greater than four times the number of CPUs, then processes must wait too long for the CPU to process the requests.
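As a rough illustration of the rule of thumb above, the check can be written as a one-line comparison. The function name and the `factor` parameter are hypothetical; only the "four times the number of CPUs" threshold comes from the text.

```python
def run_queue_overloaded(run_queue_length, cpu_count, factor=4):
    """Return True when the run queue (or load average) exceeds
    factor times the number of CPUs.

    factor defaults to 4, the multiplier quoted in the text.
    """
    return run_queue_length > factor * cpu_count


# A 2-CPU system: the threshold is 4 * 2 = 8 waiting processes.
at_threshold = run_queue_overloaded(8, 2)
over_threshold = run_queue_overloaded(9, 2)
```

On a two-CPU system, a queue of 8 sits exactly at the threshold and is not flagged, while a queue of 9 is.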
The Run Queue Occupancy graph charts the percentage of time that one or more services or processes are waiting to be served by the CPU.
If the run queue occupancy is close to 100% and the run queue length is considered low, the CPU is not necessarily overloaded. While there may always be services waiting to be processed, the CPU may still be able to quickly process them.
If the run queue occupancy is high and the queue is long, then there is a capacity problem. However, a system should always have some idle time. Having consistently low idle time usually means that your system is working near its maximum capacity.
The Multi-CPU Usage graph charts the performance statistics for systems with more than one CPU. These statistics indicate whether a system is effectively balancing tasks between CPUs, or if processes are forced off CPUs in certain circumstances. You can also use this graph to determine whether there are too many system interrupts that are using a CPU or that are overloading a CPU.
Uptime Infrastructure Monitor can also collect and chart information for systems running Net-SNMP that have two or more CPUs. However, if the system was recently added to Uptime Infrastructure Monitor, or if the HOST-RESOURCES MIB (which is used to collect data from the system) is not properly installed and configured, Uptime Infrastructure Monitor cannot collect CPU performance data. You must either wait until Uptime Infrastructure Monitor is able to collect performance data, or check whether the HOST-RESOURCES MIB is properly installed and configured on the monitored system.
If there is only one CPU on the system, the following message is displayed instead of a graph:
% User Time
% System Time
SMTX: the number of read or write locks that a thread was not able to acquire on the first attempt, as reported by the mpstat command
While it is trying to acquire locks, the thread is active but is not performing any tasks.
XCAL: the number of interprocessor cross-calls
In a multi-processor environment, one processor sends cross-calls to another processor to get that processor to do work. Cross-calls can also be used to ensure consistency in virtual memory. Heavy file system activity, such as NFS, can result in a high number of cross-calls.
% Interrupt Time
% Total: the sum of % User Time, % Privileged Time, and % Interrupt Time
Uptime Infrastructure Monitor uses the following graphs to chart memory usage on a system:
These graphs use the same input criteria, but they return different data.
This graph charts the amount of memory used on a system. Used memory is the amount of physical memory occupied by the operating system, system library files, and applications.
This graph indicates how effectively buffers are controlling the flow of data between disks and the system.
CPU cache is a small store of free memory that is used by frequently-performed tasks for repeated fast disk access. The cache hit rate measures how often the system accesses the CPU cache.
The cache hit rate calculations are taken from the following metrics:
Cache read efficiency should be close to 100%. Cache write efficiency should be approximately 66%. However, low percentages do not always indicate performance problems.
This graph indicates whether a system is short of memory. Uptime Infrastructure Monitor checks whether the pgscan rate and page-out statistics are consistently high. Use the following equation to calculate the scan rate threshold:
scan threshold = handspreadpages ÷ residence time
The handspreadpages variable is fixed at 8192 on UltraSPARC systems with more than 256 MB of memory. The residence time variable is generally fixed at 30 seconds. Therefore, the default scan rate threshold is 273.
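The default threshold follows directly from the equation above. The short sketch below reproduces the arithmetic; the function and parameter names are hypothetical, and the default values are the ones quoted in the text.

```python
def scan_rate_threshold(handspreadpages=8192, residence_time_s=30):
    """Scan rate threshold = handspreadpages / residence time."""
    return handspreadpages / residence_time_s


threshold = scan_rate_threshold()
print(round(threshold))  # 8192 / 30 is about 273.07, which rounds to 273
```

If either value differs on your system, pass it explicitly, for example `scan_rate_threshold(handspreadpages=4096)`.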
You should also examine the swap device for excessive activity. To identify the device, check the file /etc/vfstab for the tmpfs file system. You can also use the swap -l command to list the physical partitions that are used for swap on the system.
When a program requires more memory than is physically available, information that is not in use is written to a temporary buffer on the hard disk, called swap. The Free Swap graph charts the amount of free swap space as a percentage of the total swap space.
Microsoft Windows writes data to the Windows Page File when it needs additional memory. The Windows Page File can range in size from 20 million bytes to over 200 million bytes. The \Paging File(_Total)\% Usage performance counter extracts page file information.
On Solaris, swap space is separated into:
The actual space on a disk available for swapping.
The amount of physical swap space and the amount of memory that is available for swapping.
If the amount of swap space drops to zero, then the system cannot create new processes or store information in the /tmp file system.
Linux swaps data to a dedicated swap partition.
Uptime Infrastructure Monitor uses the following graphs to chart the activity of processes on a system.
This graph charts the number of processes that are currently running on a system. The process count is taken from the system kernel, and can be used to determine process usage trends.
This graph indicates whether there is enough CPU capacity for the processes that are run on a system. If the size of the blocked or waiting queue is disproportionate to the running queue, then either the system does not have enough CPUs or is too I/O bound.
A blocked process signals a disk bottleneck. If the number of blocked processes approaches or exceeds the number of processes in the run queue, you should tune the disk subsystem. Whenever there are any blocked processes, all CPU idle time is treated as wait for I/O time. If database batch jobs are running on the system that is monitored, there are always some blocked processes. However, you can increase the throughput of batch jobs by removing disk bottlenecks.
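The comparison between blocked processes and the run queue can be sketched as follows. The text does not say how close "approaches" is, so the `ratio` cutoff of 0.8 is purely an assumption made for this example, as are the function and parameter names.

```python
def disk_bottleneck_suspected(blocked, run_queue, ratio=0.8):
    """Return True when blocked processes approach or exceed the run queue.

    ratio is a hypothetical cutoff for "approaches"; the text gives
    no specific number.
    """
    if blocked == 0:
        return False
    return blocked >= ratio * run_queue


few_blocked = disk_bottleneck_suspected(blocked=1, run_queue=5)
many_blocked = disk_bottleneck_suspected(blocked=6, run_queue=5)
```

A system that always runs database batch jobs will show some blocked processes, so a check like this is more useful as a trend indicator than as a hard alarm.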
This graph determines whether there are runaway processes on a system or if a forking-based process (like a Web server) is spawning too many processes over a specified period of time.
The TCP Retransmits graph indicates whether data is transmitted over a network. Using TCP, information is transmitted in pieces called packets. A packet consists of:
Header: contains transmission information, such as the IP addresses of the sender and receiver, the protocol used, and the packet number.
Payload: contains the sent data.
Trailer: contains data that denotes the end of the packet, as well as error correction information.
TCP retransmits indicate that certain network services may not be completing properly because of a high load on a network or a system. A lost packet can indicate network congestion, and requires the sender to reduce the transmission rate and to retransmit the packet. A slower transmission rate combined with retransmitted packets reduces network performance.
Uptime Infrastructure Monitor uses the following graphs to chart the activity of users on a Linux or UNIX system:
The number of times or frequency at which a user has logged into a system during any 30 minute time interval.
The number of sessions or number of distinct users who are logged into a system during any 30 minute time interval.
Using these graphs, an administrator can identify user load and whether there is any correlation between user logins or number of sessions and problems with the performance of the system. These graphs use the same input criteria, but they return different data.
If there is no data to graph, the message
The three workload graphs determine the demand that network and local services are putting on a system. The graphs chart an aggregate amount of performance information for a given user, group, or process.
You can generate the following workload graphs:
The demand that network and local services are putting on the system, based on the IDs of the users who are logged into a system.
The demand that network and local services are putting on the system, based on the IDs of the user groups that are logged into a system.
The demand that network and local services are putting on a system, based on the processes that are running.
These graphs use the same input criteria, but they return different data.
Each workload graph captures the following metrics:
The percentage of CPU time that is taken up by a user, group, or process.
The amount of the page file and virtual memory that is taken up by a user, group, or process.
On Windows systems, Memory Size is called Virtual Bytes.
The Resident Set Size (RSS), which is the amount of physical memory used by a user, group, or process. On Windows systems, RSS is called Working Set.
Graphs generated for SNMP agents only chart the memory metric.
See RSS or Working Set (on UNIX and Windows, respectively).
You can only graph one metric at a time.
Select one or more of the available users, groups, or processes from the list.
If you are generating a workload graph by process (that is, the Workload - Process Name graph), enter a regular expression in the Process Selection Regex field to automatically add matching process names for graphing, and avoid dealing with unwieldy lists of system processes.
The list of available processes varies by server and by operating system.
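To get a feel for how a process-selection pattern narrows a long process list, you can try the pattern against sample names first. The sketch below uses Python's `re` module purely for illustration; the regular expression dialect and matching rules used by the Process Selection Regex field may differ, and the process names are invented.

```python
import re


def select_processes(names, pattern):
    """Return the process names that match the regular expression.

    Uses re.search, so the pattern may match anywhere in the name
    unless it is anchored. This matching behavior is an assumption.
    """
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


# Hypothetical process list for the example.
processes = ["httpd", "httpd-worker", "sshd", "java", "cron"]
web_processes = select_processes(processes, r"^httpd")
```

Anchoring the pattern with `^` keeps it from matching names that merely contain the text somewhere in the middle.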
The three Workload top 10 graphs chart the 10 processes that are consuming the most CPU resources. Consumption of CPU resources is tracked via one of the following: a user ID, a group ID, or the name of a process. Workload Top 10 graphs enable you to quickly determine which processes are consuming the most CPU resources over a specified time period.
Each graph uses the same input criteria, but they return different data.
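Conceptually, a top 10 view is a ranking by CPU consumption with everything after the tenth entry cut off. The following sketch shows that idea with the standard library; it is not how the product computes the graphs, and the process names and values are made up.

```python
import heapq


def top_consumers(cpu_by_name, n=10):
    """Return the n (name, cpu_percent) pairs with the highest CPU use."""
    return heapq.nlargest(n, cpu_by_name.items(), key=lambda item: item[1])


# Twelve hypothetical processes using 1.0% to 12.0% CPU.
sample = {f"proc{i}": float(i) for i in range(1, 13)}
top10 = top_consumers(sample)
```

The same function works for user IDs or group IDs; only the keys of the dictionary change.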
Uptime Infrastructure Monitor can collect workload information from logical partitions (LPARs) that are running on pSeries servers. The following graphs visualize the workload information for all LPARs on a server:
The amount of CPU time used by the LPAR.
The total amount of memory used by an LPAR.
The amount of data that is transferred to and from the disk.
The amount of data that is transferred over the network interface used by the LPAR.
You can also graph the CPU entitlement of individual LPARs using the CPU Utilization graph. See “LPAR CPU Utilization Graphs” for more information.
Using the CPU Utilization graph, you can better determine the CPU entitlements of the LPARs on a system. The entitlements indicate the amount of CPU power that is assigned to an individual LPAR. For example, an entitlement of 0.5 indicates that an LPAR is assigned half of the processing power of a CPU.
You can use the graphs to give you a clearer view of how much you may need to increase an LPAR’s entitlement. Instead of using trial and error to determine optimum entitlements, you can use actual data to determine accurate entitlements.
If the message There are no LPARs for this date range is displayed, do one of the following:
Network graphs track the performance and reliability of your computing network. You can generate I/O and Errors graphs. These graphs use the same input criteria, but return different data.
The I/O graph charts the average amount of data that is moving in and out of a network interface over a specified time period. Uptime Infrastructure Monitor also identifies bursts of network traffic.
The I/O graph captures the following statistics:
Out bytes: the number of bytes sent by the network interface each second
The Errors graph charts the number of network interface errors that occur each second. The most common types of errors include collisions in a hubbed environment or the presence of full-duplex handshake errors between a system and a switch. The following communication line problems can also cause network errors:
The Errors graph captures the following statistics:
For network device Elements that are monitored by Scrutinizer, a graph that covers a specified time frame is generated. It shows the monitored node’s bi-directional throughput rates through known ports, which are determined based on use by all known applications.
The Quick Snapshot for a network device summarizes both the recent (24-hour) and current performance of SNMP-based devices, and can help administrators identify potential issues.
If there are not 24 hours’ worth of data available, Uptime Infrastructure Monitor uses data from as far back as possible to generate charts.
The Quick Snapshot is typically used as a preliminary step toward root-cause analysis. When you first acknowledge an issue by clicking the network device’s Element name in either Global Scan or the My Alerts section of My Portal, you are shown its Quick Snapshot. From here, you can work with the information provided in the charts and tables (e.g., overloaded ports, or excessively long round-trip times) and begin further investigation:
The following information is displayed in a network device’s Quick Snapshot.
Performance Charts |
---|
% Packet Loss |
Average Round-Trip Time |

Port Status | |
---|---|
Port Name | the name of the port on the network device |
Port Type | the interface type (e.g., Ethernet or Virtual/VLAN) |
Usage | the percentage of the port’s maximum throughput that was used during the most recent time interval |
In Rate | the average throughput of inbound packets, in Mbps, during the most recent time interval |
In Usage | the percentage of the port’s maximum throughput that was used by inbound packets during the most recent time interval |
Out Rate | the average throughput of outbound packets, in Mbps, during the most recent time interval |
Out Usage | the percentage of the port’s maximum throughput that was used by outbound packets during the most recent time interval |
Errors | the average number of errors per second during the most recent time interval |
Discards | the average number of packets discarded per second during the most recent time interval |
Status | the current status of the port, based on information retrieved from the network device’s Platform Performance Gatherer service |
To display the Quick Snapshot page for a network device Element, do the following:
Note that when you are viewing a network device Element’s profile, you can always access its Quick Snapshot by clicking the Graphing tab, then clicking Quick Snapshot in the tree panel.
Uptime Infrastructure Monitor allows you to generate graphs to display the performance of the following:
The I/O graph displays the average amount of data moving in and out of a network device’s ports over a specified time period. This can help you confirm bursts in network traffic, and identify ports that are receiving and transmitting large amounts of data in relation to their maximum throughput.
You can generate top-10-port graphs based on a specific criterion, or focus on a specific port on your network device, and create a graph that includes multiple metrics.
The following metrics can be used when generating a Network I/O graph for a network device Element:
Total Rate | the combined incoming and outgoing data rates, in Mbps, for the port during the time period |
Usage | the percentage of the port’s maximum throughput that was used by inbound and outbound packets during the time interval |
In Rate | the average throughput of inbound packets, in Mbps, during the time interval |
In Usage | the percentage of the port’s maximum throughput that was used by inbound packets during the time interval |
Out Rate | the average throughput of outbound packets, in Mbps, during the time interval |
Out Usage | the percentage of the port’s maximum throughput that was used by outbound packets during the time interval |
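The relationship between the rate and usage metrics can be illustrated with a small calculation. This is a sketch of the definitions in the table, assuming the port's maximum throughput is known in Mbps; the function and parameter names are hypothetical.

```python
def port_usage_percent(rate_mbps, max_throughput_mbps):
    """Express a throughput rate as a percentage of the port's maximum."""
    return 100.0 * rate_mbps / max_throughput_mbps


# Hypothetical 1000 Mbps port with 250 Mbps inbound and 100 Mbps outbound.
in_usage = port_usage_percent(250, 1000)
out_usage = port_usage_percent(100, 1000)
```

With these sample values, In Usage works out to 25% and Out Usage to 10%.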
The network device Errors graph displays the number of errors or discards that occur each second. The following communication line problems can cause network errors:
The following metrics can be used when generating a Network Error graph:
Errors | the total number of errors per second during the time period |
In Errors | the number of packets received, but unable to be decoded, per second, due to a missing header or trailer |
Out Errors | the number of packets that were not sent, per second, due to problems transmitting the packet or formatting the packet for transmission |
Discards | the total number of packets dropped per second, through the port, during the time period |
In Discards | the number of packets inbound through the port that were dropped per second, during the time period |
Out Discards | the number of packets outbound through the port that were dropped per second, during the time period |
The Disk Performance Statistics graph charts a set of disk performance metrics returned by utilities - such as perfmon on Windows, and iostat or sar on Solaris - that are running on a system.
Requests can experience delays proportional to the length of the request queue minus the number of spindles on the disks. For optimal performance, this difference should be less than two on average.
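The guideline above reduces to a subtraction and a comparison. The sketch below is one way to express it; the function names and the `limit` parameter are hypothetical, and the value 2 is the figure quoted in the text.

```python
def queue_excess(avg_queue_length, spindles):
    """Request queue length minus the number of spindles."""
    return avg_queue_length - spindles


def disk_queue_ok(avg_queue_length, spindles, limit=2.0):
    """Return True when the average difference stays below the limit."""
    return queue_excess(avg_queue_length, spindles) < limit


healthy = disk_queue_ok(avg_queue_length=5.0, spindles=4)
congested = disk_queue_ok(avg_queue_length=6.0, spindles=4)
```

Because the guideline refers to the average, feed the function a queue length averaged over the graphing period rather than a single sample.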
Percent Busy
The percentage of the disk capacity used.
For NFS systems, 100% busy does not indicate that the server itself is saturated, but that the client always has outstanding requests to that server.
The Top 10 Disks graph displays the ten busiest disks in your environment as of the last sample that Uptime Infrastructure Monitor has taken. If there are fewer than ten disks on the system, then all of the disks on the system are charted in the graph.
Percent Busy
The percentage of the disk capacity used.
For NFS systems, 100% busy does not indicate that the server itself is saturated, but that the client has outstanding requests to that server.
A File System Capacity graph charts the amount of total and used space, in kilobytes, on a server’s disk. On Windows servers, Uptime Infrastructure Monitor looks at the capacity of the main partition (usually the C:\ drive). On UNIX and Linux servers, Uptime Infrastructure Monitor looks at the individual file systems (for example, /var, /export, /usr) on all the disks on the server.
If a single disk system has no partitions, then the file system capacity is the same as the disk capacity.
The File System Capacity graph visualizes the following statistics:
The VXVM Stats graph charts the amount of data written to or read from a Solaris volume that is managed by the Veritas Volume Manager. Veritas Volume Manager is a storage management system that operates between a host’s operating system and its file systems or database management systems. Veritas Volume Manager enables you to manage disk drives on a system as if they were volumes (logical devices that appear to be physical partitions on a disk).
Depending on the options that you specify, this graph contains the following information:
If Veritas Volume Manager is not running on a host, or if Uptime Infrastructure Monitor cannot connect to the volume, an error message informing you that Uptime Infrastructure Monitor cannot detect the Veritas Volume Manager appears in the Graphing subpanel.
In the Info & Rescan panel, verify that the entry Has a Logical Volume Manager? is set to Yes. If it is, then ensure that you can connect to the host from the Monitoring Station.
If you selected Average Service Times in step 6, the amount of time required to read and write data to and from the volume.
Select only one option if you are comparing more than one volume.
Uptime Infrastructure Monitor can collect data from systems that are running version 6.5 of the Novell Remote Manager (NRM). Uptime Infrastructure Monitor retrieves NRM service metrics and then stores this information in the DataStore. Using the data that is collected from NRM, you can generate graphs for the following metrics:
For more information about Novell NRM systems, see Novell NRM Systems.
The VMware VMotion tool enables you to move ESX instances from one server to another without any downtime or loss of data. You would use VMotion to, for example, move an instance to newer and faster hardware, or to temporarily relocate the instance while performing a hardware upgrade.
The Instance Motion graph enables you to keep track of a moving VMware instance. For a given ESX instance, the graph charts which systems it is running on over a given time range.
Detailed process information provides an insight into how various user and system processes are consuming system resources. The information is not presented in a graph - it is a table that contains the following information:
User System Time
The amount of time (in seconds) that a process is consuming system time on the CPU.
This value is not displayed for Windows systems.
You can get a better indication of the amount of work a process has done by dividing this amount by a sample of time - for example, five minutes.
Start Time
The time at which the process started. This can be used to determine the lifetime of a process.
The process information for the current date and time is displayed in the Graphing subpanel.
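The tip about dividing a process's CPU time by a sample of time can be worked through with a short example. The function and parameter names below are hypothetical, and the five-minute window is the example interval mentioned in the table.

```python
def cpu_share_percent(cpu_seconds, interval_seconds=300):
    """CPU time consumed during the interval, as a percentage of it.

    interval_seconds defaults to 300, i.e. a five-minute sample.
    """
    return 100.0 * cpu_seconds / interval_seconds


# A process that accumulated 45 seconds of CPU time in five minutes.
share = cpu_share_percent(45)
```

Here the process kept one CPU busy for 15% of the five-minute window, which is easier to compare across processes than the raw number of seconds.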
The percentage of time that the CPU spends executing Windows kernel commands. If this metric is consistently high, you should consider using a faster or more efficient disk subsystem.