Agent health checks

NOTE:
The agent health checks described here check the health of the HPOM agent and of all its subagents.

Self management monitors the health of the agents on each managed node using the following mechanisms:

The control agent checks the health of its subagents and reports any subagent that has aborted by sending a message to the message browser.
After a configurable interval (300 seconds by default), the management server checks the agent health. The management server attempts to contact the agent with either an ICMP ping or a call to the control agent, or both.

The management server reports the health of the agents to the active message browser, to the Windows event log, or to both. An event log policy that is deployed to the management server evaluates events in the event log and forwards them to the message browser. Message correlation acknowledges Node down messages automatically when a Node up message arrives.

Some sample messages generated by the server are:

Node down messages:

Could not contact RPC server of agent on radish. RPC server of agent not registered. Agent is probably not running.
Node rhubarb is maybe down. Even to contact it with ping-packages failed.
Message agent on node radish is not running

Node up messages:

Message agent on node radish is now running
Control agent on node radish is now running.

NOTE:
The management server does not check the health of nodes that have an empty package inventory. Nodes can have an empty package inventory if, for example, you install the agent manually, or if you upload the node configuration from another management server. If you want the management server to start checking the health of these nodes, synchronize the package inventory.

To configure advanced agent health check options

In the console tree, right-click Operations Manager, and then click Configure Server.... The Server Configuration dialog opens.
Click Namespaces, and then click Agent Health Check. A list of values appears.
Change any of the values in the following table:

Value	Value type	Unit or allowed values	Default value	Description
Health check ping protocol	List	DISABLED AGENTONLY ICMPONLY ENABLED	ENABLED	This value configures the default ping protocol. You can change the default for each node in the Node Properties dialog. DISABLED means that the management server performs no agent health check at all. AGENTONLY means that the server does not actively contact the node with ICMP pings, but still contacts the agent on the node. This is useful for nodes behind a firewall. ICMPONLY means that the server does not contact the agent, but only uses ICMP pings. This is useful for managed nodes like SNMP devices that do not have an agent installed. ENABLED means that all aspects of agent health check are used.
Enable health check	Boolean	True False	True	Enables or disables all aspects of the health check.
Time interval to check agent health	Integer	Number of seconds	300	The default interval at which the management server checks the health of each agent. You can change the default for each node in the Node Properties dialog.
Maximum number of parallel checks	Integer	Number of threads	100	The maximum number of parallel threads that are used to do the active check (server pings the node). After you have changed this value, restart the OvEpMessageActionServer service for the change to take effect.
Health check retries	Integer	0 to 3 retries	0	This value configures the number of health check ping retries to do immediately if an agent could not be reached. The node is considered down when all retries have been unsuccessful. Increase this value if you have an unreliable network infrastructure.
Target for agent health problem messages	List	SERVER EVENTLOG SERVER_EVENTLOG	SERVER	The target for messages that indicate problems with agent health checking. SERVER means that these messages are directly written to the active message browser on the management server, without passing any policy-based message filter. EVENTLOG means that these messages are written to the application event log so that they can be picked up by a Windows Event Log policy. The VP_SM-Server_EventLogEntries policy already contains two rules for these health messages named "forwards all health check...". These rules can be easily adapted or used as templates for your own health checking rules. SERVER_EVENTLOG combines SERVER and EVENTLOG.
Severity of agent health problem messages	List	Normal Warning Minor Major Critical	Critical	The severity for messages that indicate problems with agent health checking. For example, "Node xxx may be down. Failed to contact it using ping." If you configure the Target for agent health problem messages to include the event log, this value sets the event types as follows: Normal results in information events. Warning, minor, and major result in warning events. Critical results in error events.
Health check report buffering	Boolean	True False	True	This value configures whether to report that an agent is buffering messages.
Severity of buffering for this management server	List	Normal Warning Minor Major Critical	Major	This value configures the severity of messages that indicate that the agent is buffering messages for this management server.
Severity of buffering for other management servers	List	Normal Warning Minor Major Critical	Warning	This value configures the severity of messages that indicate that the agent is buffering messages for a management server other than this one.
Enable access denied warning for raw socket creation	Boolean	True False	True	This value configures whether to write a warning to the system event log if the management server cannot accept alive packets from agents. (See Accepting alive packets below.)
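
The way several of these values combine can be summarized in a short Python sketch. It is an illustration of the table only, not the management server's implementation: the names PingProtocol, checks_for, event_type_for, and node_is_down are hypothetical, and probe stands for any callable that returns True when the node answers.

```python
from enum import Enum


class PingProtocol(Enum):
    """Allowed values of "Health check ping protocol"."""
    DISABLED = "DISABLED"
    AGENTONLY = "AGENTONLY"
    ICMPONLY = "ICMPONLY"
    ENABLED = "ENABLED"


def checks_for(protocol):
    """Return (use_icmp_ping, use_agent_call) for a ping protocol setting."""
    return {
        PingProtocol.DISABLED: (False, False),   # no health check at all
        PingProtocol.AGENTONLY: (False, True),   # for example, nodes behind a firewall
        PingProtocol.ICMPONLY: (True, False),    # for example, SNMP devices without an agent
        PingProtocol.ENABLED: (True, True),      # all aspects of the health check
    }[protocol]


# Event types used when the message target includes the event log.
EVENT_TYPE_BY_SEVERITY = {
    "Normal": "Information",
    "Warning": "Warning",
    "Minor": "Warning",
    "Major": "Warning",
    "Critical": "Error",
}


def event_type_for(severity):
    """Map a configured message severity to an event type."""
    return EVENT_TYPE_BY_SEVERITY[severity]


def node_is_down(probe, retries=0):
    """Treat the node as down only if the first attempt and all retries fail.

    probe is a hypothetical callable that returns True if the node answered.
    """
    for _ in range(1 + retries):
        if probe():
            return False
    return True
```

For example, checks_for(PingProtocol.AGENTONLY) gives (False, True), and event_type_for("Major") gives "Warning".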

Accepting alive packets

On nodes that have the DCE agent, the message agent sends an alive packet to the management server at a configurable interval. However, in HPOM 8.10, the management server is no longer able to receive these alive packets by default. The management server runs under the HP-OVE-User account, which no longer has administrative rights. Without administrative rights, the management server cannot open the raw socket that it needs to receive alive packets.

To continue receiving alive packets, you must add the HP-OVE-User to the local administrators group on the management server. Before you give the HP-OVE-User administrative rights, check the security requirements of your organization.

If the management server can accept alive packets, it checks whether it received a packet from a node before it contacts that node by ICMP ping or call to the control agent. If the management server has received an alive packet, it does not attempt to contact the node.
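
The decision described above can be sketched as follows. This is a minimal illustration, not the server's actual logic: the function name and parameters are hypothetical, and it assumes that an alive packet counts only if it arrived after the previous health check, which the text does not state explicitly.

```python
def needs_active_check(last_alive_packet, last_check):
    """Return True if the server should contact the node by ping or agent call.

    last_alive_packet: time (in seconds) of the most recent alive packet
                       from the node, or None if none was received.
    last_check:        time (in seconds) of the previous health check.
    """
    if last_alive_packet is None:
        return True
    # Assumption: a packet received before the previous check is stale.
    return last_alive_packet <= last_check
```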

You can change the frequency with which each node sends alive packets. You do this by configuring the value for OPC_HBP_INTERVAL_ON_AGENT in a nodeinfo policy, which you deploy to the agent. The agent sends an alive packet at an interval equal to two-thirds of the configured value. On nodes that have the DCE agent, the default value of OPC_HBP_INTERVAL_ON_AGENT is 280, so the agent sends an alive packet approximately every 187 seconds.
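
The two-thirds rule can be checked with a few lines of Python. The function name is hypothetical and serves only to show the arithmetic; a value of 0 is treated as "no alive packets", as described for nodes where sending is switched off.

```python
def alive_packet_interval(hbp_interval_on_agent):
    """Return the seconds between alive packets, or None if sending is off.

    The agent sends at two-thirds of OPC_HBP_INTERVAL_ON_AGENT.
    """
    if hbp_interval_on_agent == 0:
        return None
    return hbp_interval_on_agent * 2 / 3


print(round(alive_packet_interval(280)))  # prints 187
```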

If the management server cannot accept alive packets, change the default value of OPC_HBP_INTERVAL_ON_AGENT to 0 on nodes with DCE agents. The agent stops sending alive packets, which prevents unnecessary network load. On nodes that have the HTTPS agent, the value is not set by default, so the HTTPS agent sends no alive packets by default.

Changing agent health check behavior

To reduce network traffic by monitoring less frequently, increase Time interval to check agent health and OPC_HBP_INTERVAL_ON_AGENT. Time interval to check agent health should remain greater than the value for OPC_HBP_INTERVAL_ON_AGENT to ensure that the server looks for the alive packet after the node sends it.
To monitor more frequently, decrease Time interval to check agent health and OPC_HBP_INTERVAL_ON_AGENT. Time interval to check agent health should remain greater than the value for OPC_HBP_INTERVAL_ON_AGENT to ensure that the server looks for the alive packet after the node sends it.
To monitor nodes through a firewall, or if ICMP cannot be used, set Health check ping protocol to AGENTONLY to switch off the check with ICMP ping packets. The active check with RPCs is still done. Note that this increases network traffic, because the server checks the health of the agent with an RPC call each time the Time interval to check agent health is exceeded. (RPC calls require more bandwidth than ping.)
To reduce CPU load and memory consumption, reduce the value of Maximum number of parallel checks. This might be necessary when monitoring large environments, or if the management server has limited resources.
To stop the health check entirely, set Enable health check to False.
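
Before changing both intervals, it can help to verify that they still satisfy the rule above. The following Python sketch uses a hypothetical helper, interval_warnings, that is not part of HPOM; it only encodes the requirement that Time interval to check agent health stays greater than OPC_HBP_INTERVAL_ON_AGENT.

```python
def interval_warnings(check_interval, hbp_interval):
    """Return a list of warnings for a planned pair of interval values.

    check_interval: planned "Time interval to check agent health" in seconds.
    hbp_interval:   planned OPC_HBP_INTERVAL_ON_AGENT (0 means no alive packets).
    """
    warnings = []
    if hbp_interval > 0 and check_interval <= hbp_interval:
        warnings.append(
            "Time interval to check agent health (%d) should be greater than "
            "OPC_HBP_INTERVAL_ON_AGENT (%d)." % (check_interval, hbp_interval)
        )
    return warnings


print(interval_warnings(300, 280))  # prints []
```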
