== Fault Manager ==

The fault manager is a redundant client/server service that maintains a table of entities and their current status.  This table is located in shared memory, is available on every node in the cluster, and is maintained on each node by the fault manager server running on that node.

Fault clients can rapidly query the current state of an entity, can register new entities, and can report faults or alarms in an entity.
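As a rough illustration of these client operations, the following Python sketch models the entity table as an ordinary in-process dictionary.  Every name in it (<code>EntityTable</code>, <code>register</code>, <code>query</code>, <code>report_fault</code>) is hypothetical and is not the fault manager's actual API; the real table lives in shared memory, and the sketch assumes that a fault report is only queued for the server rather than changing the table directly.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    UP = "up"
    DOWN = "down"


@dataclass
class EntityRecord:
    state: State = State.UP
    alarms: set = field(default_factory=set)


@dataclass
class FaultReport:
    reporter: str   # entity that observed the problem
    suspect: str    # entity the reporter believes is affected
    condition: str  # free-form label, e.g. "rpc-timeout"


class EntityTable:
    """Toy stand-in for the shared-memory entity table (hypothetical)."""

    def __init__(self):
        self._records = {}
        self.pending_reports = []  # reports awaiting a decision by the server

    def register(self, name):
        # Registering an already-known entity leaves its record unchanged.
        return self._records.setdefault(name, EntityRecord())

    def query(self, name):
        # A query is a purely local lookup, which is what keeps it fast.
        return self._records[name].state

    def report_fault(self, reporter, suspect, condition):
        # Assumption: a report is queued and does not alter the table itself.
        report = FaultReport(reporter, suspect, condition)
        self.pending_reports.append(report)
        return report


table = EntityTable()
table.register("node1")
table.register("node2")
table.report_fault("node1", "node2", "rpc-timeout")
```

In this sketch, querying <code>node2</code> immediately after the report still returns the "up" state, because nothing has yet decided what the report means.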

A single fault server is elected "master", and all reports are filtered through user-defined plugins to determine what effect each fault report should have.  Fault plugins can control whether alarms are applied or cleared on an entity, and whether the entity is seen as "up" or "down" by the rest of the cluster.
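One way to picture the plugin filtering is as a chain of functions that each refine a pending decision before the master applies it.  The Python sketch below is only an assumption about how such a chain could look: the <code>Decision</code> structure, the plugin signature, and both example policies are invented for illustration and are not taken from the fault manager itself.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    up: bool = True
    alarms: set = field(default_factory=set)


@dataclass
class Decision:
    """What the plugins want done about one report (hypothetical)."""
    apply: dict = field(default_factory=dict)   # entity name -> alarms to apply
    clear: dict = field(default_factory=dict)   # entity name -> alarms to clear
    mark_down: set = field(default_factory=set)
    mark_up: set = field(default_factory=set)


def alarm_on_report(report, decision):
    # Assumed policy: every report raises a matching alarm on the suspect.
    decision.apply.setdefault(report["suspect"], set()).add(report["condition"])
    return decision


def down_when_unreachable(report, decision):
    # Assumed policy: an unreachable entity cannot be managed, so mark it down.
    if report["condition"] == "unreachable":
        decision.mark_down.add(report["suspect"])
    return decision


def master_handle(report, plugins, entities):
    """Run the report through every plugin, then apply the final decision."""
    decision = Decision()
    for plugin in plugins:
        decision = plugin(report, decision)
    for name, alarms in decision.apply.items():
        entities[name].alarms |= alarms
    for name, alarms in decision.clear.items():
        entities[name].alarms -= alarms
    for name in decision.mark_down:
        entities[name].up = False
    for name in decision.mark_up:
        entities[name].up = True
    return decision


entities = {"web": Entity("web"), "db": Entity("db")}
report = {"reporter": "web", "suspect": "db", "condition": "unreachable"}
master_handle(report, [alarm_on_report, down_when_unreachable], entities)
```

Because each plugin receives the decision produced by the previous one, a later plugin in the list could also veto or rewrite an earlier plugin's choice.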

When it is decided that an entity is "down", the notification is broadcast across the entire cluster with high priority, optionally wrapped inside a transaction.  This ensures that all applications in the cluster see a consistent state for the entity.

=== Faults vs. Alarms ===

An "Alarm" is a persistent, exceptional condition that affects an entity.  Alarms are well-defined in telecom management literature.

A fault is the report of a not-necessarily-persistent condition that Entity X believes is affecting Entity Y.  The fault manager may agree with Entity X and so apply an alarm to Entity Y.  It may instead decide that Entity X is misreporting the fault and so apply an alarm to X, or ignore the report altogether.  For example, an RPC (remote procedure call) timeout could be caused by a failure on the sending side, the receiving side, or the network in between.  So a fault can be considered an "alarm candidate".
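To make the "alarm candidate" idea concrete, here is a minimal Python sketch of one possible arbitration rule for timeout reports.  The rule is an assumption chosen for illustration, not the fault manager's own policy: an entity accused by several distinct reporters is blamed, a reporter that accuses several distinct entities is itself blamed, and a lone report is ignored.  The class name <code>TimeoutArbiter</code> and its threshold are hypothetical.

```python
from collections import defaultdict


class TimeoutArbiter:
    """Decides which entity, if any, should receive an alarm (hypothetical)."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.suspects_by_reporter = defaultdict(set)
        self.reporters_by_suspect = defaultdict(set)

    def judge(self, reporter, suspect):
        """Record one fault report and return the entity to alarm, or None."""
        self.suspects_by_reporter[reporter].add(suspect)
        self.reporters_by_suspect[suspect].add(reporter)
        # Several independent reporters agree: the suspect is probably at fault.
        if len(self.reporters_by_suspect[suspect]) >= self.threshold:
            return suspect
        # One reporter blames many entities: the reporter is probably at fault.
        if len(self.suspects_by_reporter[reporter]) >= self.threshold:
            return reporter
        # A single uncorroborated report stays a candidate and is ignored.
        return None


# Two different entities report Y, so Y receives the alarm.
agree = TimeoutArbiter()
first_verdict = agree.judge("X", "Y")
second_verdict = agree.judge("Z", "Y")

# X reports two different entities, so X receives the alarm.
disagree = TimeoutArbiter()
disagree.judge("X", "Y")
misreport_verdict = disagree.judge("X", "W")
```

A real plugin would presumably also expire old reports and consider network topology; this sketch keeps only the counting logic.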

The terms "up" and "down" are also used in this document to indicate that an entity is usable or unusable (some literature uses the term "faulted", but here I use "down" to clearly distinguish this state from fault reports).  The decision to mark an entity "up" or "down" is made by the active fault server (and its plugins), and marking an entity implicitly triggers recovery or failover actions by the AMF's redundancy management plugins.

An entity that is either "up" or "down" may have any number of active alarms associated with it.  The fault manager's job is to transition an entity into the "down" state when it cannot be managed (communicated with) and to maintain the active alarms.

It is the job of sophisticated AMF plugins to look at an entity's active alarms, realise that although the entity is "up" it cannot fulfill its job function, and so choose to gracefully transition service to a redundant entity with fewer active alarms.  Of course, the entity itself may help the AMF make this decision by reporting its status via RPC, by management calls, or (in the case of a process) by transitioning to the "down" state (AKA calling exit() or abort()).
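The selection step can be sketched in a few lines of Python.  This is a deliberately naive illustration that assumes the number of active alarms is an adequate proxy for an entity's ability to do its job; a real AMF plugin would weigh which alarms are active, not just how many.  The function name <code>pick_failover_target</code> and its arguments are hypothetical.

```python
def pick_failover_target(active, standbys, alarms, up):
    """Return the standby that should take over service, or None to stay put.

    alarms maps entity name -> set of active alarm names.
    up maps entity name -> True if the entity is "up".
    """
    # Only standbys that are "up" can accept service.
    candidates = [name for name in standbys if up[name]]
    if not candidates:
        return None
    # Prefer the standby with the fewest active alarms (first one wins ties).
    best = min(candidates, key=lambda name: len(alarms[name]))
    # Move service if the active entity is down or the standby is healthier.
    if not up[active] or len(alarms[best]) < len(alarms[active]):
        return best
    return None


alarms = {
    "a": {"disk-full", "link-degraded"},
    "b": {"link-degraded"},
    "c": set(),
}
up = {"a": True, "b": True, "c": True}
target = pick_failover_target("a", ["b", "c"], alarms, up)
```

Here entity <code>a</code> is still "up", yet service would move to <code>c</code> because <code>c</code> has no active alarms.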