= Checkpoint =
The checkpoint entity forms the backbone of the coordination of information between nodes in the cluster. Abstractly, it is a "dictionary", "map", or database table data structure -- that is, a user provides an arbitrary data "key" that returns an arbitrary data "value". However, a Checkpoint differs from these structures because it exists in all processes that are interested in it. A checkpoint is fully replicated to all nodes that are interested in it -- it is not a "distributed" dictionary where every node has partial data.
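
To make the dictionary semantics concrete, here is a minimal, purely local C++ stand-in. It uses a plain std::unordered_map and is not the SAFplus API; in a real Checkpoint every write would additionally be replicated to the other interested nodes, as the comments note.

{{{
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in: locally, a Checkpoint behaves like a key/value map.
// In the real system every write would also be replicated to every other
// process and node that has opened the same checkpoint.
int main()
{
    std::unordered_map<std::string, std::string> ckpt;       // one local replica

    ckpt["routeTable/10.0.0.0"] = "eth0";                     // write: key -> value
    std::cout << ckpt["routeTable/10.0.0.0"] << std::endl;    // read it back
    return 0;
}
}}}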

== Uses ==

The primary use for checkpoint is synchronization of state between redundant programs. The "active" software writes a checkpoint, and the "standby" software reads it. This is more useful than a simple message interface because the checkpoint abstraction simultaneously presents the total state and the incremental state changes. Total state access is required when a "cold" standby is first started, when the standby fails and is restarted, and when the active fails over if the "warm" standby has not been actively tracking state changes. Incremental state changes (deltas) are required so a "hot" standby can continually update its program state in response to the active software's actions.
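
The sketch below models this active/standby pattern in plain C++. The Checkpoint class here is an in-process illustration only -- the method names write, readAll, and subscribe are invented for this example and are not the SAFplus API -- but it shows how one abstraction can serve both the total state and the incremental deltas.

{{{
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Illustrative in-process model (not the SAFplus API): a checkpoint exposes
// both the total state (iteration) and incremental deltas (callbacks).
class Checkpoint
{
public:
    using Delta = std::function<void(const std::string& key, const std::string& val)>;

    void write(const std::string& k, const std::string& v)
    {
        data[k] = v;                              // update the total state
        for (auto& cb : subscribers) cb(k, v);    // push the delta to "hot" standbys
    }

    // A cold or warm standby reads the full state when it starts or fails over.
    void readAll(const Delta& visit) const
    {
        for (const auto& kv : data) visit(kv.first, kv.second);
    }

    void subscribe(const Delta& cb) { subscribers.push_back(cb); }

private:
    std::map<std::string, std::string> data;
    std::vector<Delta> subscribers;
};

int main()
{
    Checkpoint ckpt;

    // Hot standby: tracks every change as it happens.
    ckpt.subscribe([](const std::string& k, const std::string& v)
                   { std::cout << "delta: " << k << "=" << v << "\n"; });

    // Active: records its state as it runs.
    ckpt.write("session/42", "ESTABLISHED");

    // Cold standby starting late: recovers the total state in one pass.
    ckpt.readAll([](const std::string& k, const std::string& v)
                 { std::cout << "full:  " << k << "=" << v << "\n"; });
}
}}}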

Checkpoint is also used whenever commonly-used data needs to be distributed throughout the cluster. Checkpoints can be single-writer, multiple-reader (more efficient), or multiple-writer, multiple-reader.


A Checkpoint is inappropriate when the data is not often used. Although a checkpoint may be written to disk for recovery during a catastrophic failure event, the entire checkpoint data set is stored in RAM. Therefore a traditional replicated database is more appropriate for a large and/or rarely used data set.

== Major Features ==

Most major features can be selected at object creation time to optimize for speed or for utility (a sketch of such creation-time flags follows this list).

 * '''Replicated:'''
 Replicated efficiently to multiple nodes
 * '''Nested:'''
 Checkpoint values can be unique identifiers that automatically resolve-and-lookup in another Checkpoint
 * '''Persistent:'''
 Checkpoints can automatically store themselves to disk ([[Persistence]])
 * '''Notifiable:'''
 You can subscribe to get notified of changes to Checkpoints ([[Event]])
 * '''Shared memory:'''
 For efficiency, a single service per node can maintain coherency (with the rest of the cluster) of a checkpoint. To do this, the checkpoint is stored in shared memory
 * '''Transactional:'''
 Checkpoint operations can be wrapped in a cluster-wide [[Transaction]]
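
Below is a minimal sketch of what selecting features at creation time could look like. The flag names are hypothetical and only mirror the list above; they are not the actual SAFplus constants, and a real API would pass them to its checkpoint-create call.

{{{
#include <cstdint>
#include <cstdio>

// Hypothetical creation flags -- the names are illustrative, not the actual
// SAFplus constants.  Each flag enables one of the features listed above.
enum CkptFlags : uint32_t
{
    CKPT_REPLICATED    = 1 << 0,  // keep replicas on other interested nodes
    CKPT_SHARED_MEM    = 1 << 1,  // one shared-memory replica per node
    CKPT_PERSISTENT    = 1 << 2,  // automatically store to disk
    CKPT_NOTIFIABLE    = 1 << 3,  // allow change subscriptions
    CKPT_SINGLE_WRITER = 1 << 4   // only the "master replica" may write
};

int main()
{
    // Select features at creation time to trade utility against speed.
    uint32_t flags = CKPT_REPLICATED | CKPT_SHARED_MEM | CKPT_NOTIFIABLE;
    std::printf("creating checkpoint with flags 0x%x\n", (unsigned)flags);
    // ...a real API would pass these flags to its checkpoint-create call...
}
}}}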


== Design ==

In this document the term "Checkpoint" will be used to refer to the entire replicated checkpoint abstraction. The term "local replica" will be used to refer to a particular copy of the checkpoint data.

=== Process Access ===

A Checkpoint can be located in process private memory or in shared memory based on an option when the Checkpoint is created.

A Checkpoint that is used by multiple processes on the same node should be located in shared memory for efficiency.

=== Checkpoint Creation ===

A checkpoint is always identified by a [[Handle]]. At the API layer, a string name can be used instead; in that case, the name will be registered with the [[Name]] service using the checkpoint's [[Handle]] as the value.
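
The following sketch models the name-to-handle relationship with ordinary C++ containers. Handle, checkpointCreate, and the nameSvc map are stand-ins invented for illustration; they are not the real [[Handle]] type or [[Name]] service API.

{{{
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Illustrative model of the name/handle relationship (not the SAFplus API).
using Handle = std::uint64_t;                  // stand-in for the real Handle type
static std::map<std::string, Handle> nameSvc;  // stand-in for the Name service

Handle checkpointCreate(const std::string& name)
{
    static Handle next = 0x1000;
    Handle h = next++;                         // a real Handle would be cluster-unique
    if (!name.empty()) nameSvc[name] = h;      // register name -> handle
    return h;
}

int main()
{
    Handle h = checkpointCreate("safplus.routeTable");
    // A later open-by-name resolves through the Name service back to the handle.
    std::cout << (nameSvc.at("safplus.routeTable") == h) << std::endl;
}
}}}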

=== Intra-Node and Inter-Node Replication and Communication ===

There are two ways that processes can share checkpoint data: either via shared memory or via messaging. Typically, all processes on the same node will use shared memory to share checkpoint data (intra-node) and a designated process per node will handle communication with other nodes (inter-node). However, if a process opens the checkpoint without the shared memory flag, it can only use the messaging communication mechanism. It is ''in essence'' behaving as if it were running in isolation on its own node. When this document refers to inter-node communication, messaging communication, a process being a "node leader", etc., it may refer to actual communication across nodes, but it equally refers to communication with processes that have opened the checkpoint "unshared".

For example, it is possible to have a single checkpoint opened and shared by 3 processes and opened unshared by 2 processes, all on the same node. In this case, there will be 3 copies of that checkpoint stored in RAM on the node: 1 copy located in shared memory and 2 copies in process-local memory. As of this writing, this configuration is intended mostly for test and debugging, but supporting it ensures that the code is written in a robust fashion.
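
The replica count in this example can be expressed directly. The helper below is only illustrative arithmetic, not part of any API: shared openers collectively map one shared-memory replica, while every unshared opener keeps its own private copy.

{{{
#include <iostream>

// Illustrative arithmetic for the example above: processes that open the
// checkpoint "shared" all map one shared-memory replica, while every
// "unshared" opener keeps its own private copy.
int replicasOnNode(int sharedOpeners, int unsharedOpeners)
{
    return (sharedOpeners > 0 ? 1 : 0) + unsharedOpeners;
}

int main()
{
    std::cout << replicasOnNode(3, 2) << std::endl;  // prints 3: 1 shared + 2 private
}
}}}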


==== Discovery ====

When a checkpoint is created, it will be opened in shared memory (if applicable). Data within the shared memory will be used to elect a "synchronization replica process" (see the Election section). A process that opens an "unshared" checkpoint is de facto the "sync replica process" of its own checkpoint replica.
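
A simplified model of the shared-memory election is sketched below, assuming a first-opener-wins rule purely for illustration (the real algorithm is whatever the Election section specifies). The CkptShmHeader structure and its field name are hypothetical, and the structure is kept process-local here so the sketch stays self-contained.

{{{
#include <atomic>
#include <iostream>
#include <sys/types.h>
#include <unistd.h>

// Simplified model of the shared-memory election (the real algorithm is
// described in the Election section).  A structure like this would live at a
// fixed offset in the checkpoint's shared-memory segment; here it is a
// process-local global so the sketch stays self-contained.
struct CkptShmHeader
{
    std::atomic<pid_t> syncReplicaPid{0};   // 0 means "nobody elected yet"
};

static CkptShmHeader header;

bool tryBecomeSyncReplica()
{
    pid_t expected = 0;
    // First opener to swap its pid in wins and becomes the sync replica
    // process for this node's replica.
    return header.syncReplicaPid.compare_exchange_strong(expected, getpid());
}

int main()
{
    std::cout << (tryBecomeSyncReplica() ? "elected" : "follower") << std::endl;
}
}}}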

The job of the synchronization replica process is to keep the checkpoint synchronized with other replicas via the messaging interface. To do this, the synchronization replica process will register a new group with the [[Group]] service. This group is identified by a well-known [[Handle]] or by the [[ClusterUniqueId]] returned by the [[Group]] registration. It may also be identified by a string entry in the [[Name]] service, and all APIs that need a checkpoint will accept either a [[ClusterUniqueId]] or a string name.

Every checkpoint replica will therefore have exactly one process that is a member of this checkpoint group, and each checkpoint will have a different group. The [[Group]] service (and [[Name]] service) thus end up being implicitly responsible for ensuring that a particular checkpoint (identified by handle or name/handle) is shared across the cluster. In other words, Checkpoint replicas discover each other by joining the Checkpoint's group and iterating through the group members.
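
The discovery flow can be pictured with the sketch below, which models the [[Group]] service as a simple map from group handle to member set. The joinCheckpointGroup helper and the member-id strings are invented for illustration and are not the real Group API.

{{{
#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <string>

// Illustrative model of discovery through the Group service (not the real
// SAFplus Group API).  One process per replica joins the checkpoint's group;
// replicas find each other by listing that group's members.
using Handle = std::uint64_t;
static std::map<Handle, std::set<std::string>> groupSvc;  // group -> member ids

void joinCheckpointGroup(Handle ckptGroup, const std::string& memberId)
{
    groupSvc[ckptGroup].insert(memberId);  // one entry per replica's sync process
}

int main()
{
    Handle ckptGroup = 0x2000;  // well-known handle for this checkpoint's group
    joinCheckpointGroup(ckptGroup, "node1/syncProc");
    joinCheckpointGroup(ckptGroup, "node2/syncProc");

    // A newly started replica discovers its peers by iterating the members.
    for (const auto& peer : groupSvc[ckptGroup])
        std::cout << "peer replica: " << peer << std::endl;
}
}}}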

In checkpoints that allow only a single writer, the active process designated in the checkpoint's group will be the checkpoint writer. In this case, that process is called the "master replica".
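
A short sketch of the single-writer rule follows, with hypothetical helper names (isMasterReplica, guardedWrite): only the replica whose sync process matches the group's active member is allowed to write.

{{{
#include <stdexcept>
#include <string>

// Sketch of the single-writer rule (hypothetical helper names): in a
// single-writer checkpoint only the group's active process -- the "master
// replica" -- is allowed to write.
bool isMasterReplica(const std::string& self, const std::string& groupActive)
{
    return self == groupActive;  // the group's active member is the writer
}

void guardedWrite(const std::string& self, const std::string& groupActive)
{
    if (!isMasterReplica(self, groupActive))
        throw std::runtime_error("read-only replica: only the master replica may write");
    // ...perform the write and replicate it to the other replicas...
}

int main() { guardedWrite("node1/syncProc", "node1/syncProc"); }
}}}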

==== Replication ====
 
