
****WORKING DOCUMENT****

6.3      Fault-Tolerant Cloud Group

Actors: cloud-subscriber, cloud-provider-1, cloud-provider-2, cloud-provider-n

Goals:  Synthesize a highly reliable service using the facilities of multiple cloud-providers.

Assumptions: Assume that a cloud-subscriber has already opened accounts with N cloud-providers (See Use Case "Open An Account").  We also assume that when data or output results from the N cloud-providers are compared, a majority of the data or results will be found to be equivalent. Finally, we assume that the metadata about data objects includes time stamps or sequence numbers.

Success Scenario 1 (write data, IaaS, PaaS):  The cloud-subscriber attempts to copy a data object onto all N of the cloud-providers using the data object APIs that each cloud-provider publishes (See Use Case "Copy Data Objects Into A Cloud"). Each cloud-provider returns a message indicating whether or not the copy operation succeeded.  The cloud-subscriber records the number of successes M.  If M < N, the cloud-subscriber may re-issue the request or evaluate whether the data has been stored with sufficient redundancy.  If the redundancy is insufficient, the cloud-subscriber may optionally open accounts with new cloud-providers.
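
The write scenario above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the PutFn callable is a hypothetical stand-in for whatever data object API each cloud-provider publishes, and min_copies and max_attempts are assumed parameters that express the cloud-subscriber's own redundancy and retry policy.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for one provider's data object API:
# takes (object name, data) and returns True if the copy succeeded.
PutFn = Callable[[str, bytes], bool]


def replicated_write(providers: Dict[str, PutFn], name: str, data: bytes,
                     min_copies: int, max_attempts: int = 2) -> Tuple[List[str], bool]:
    """Copy data to every provider; return (providers that succeeded, redundancy met?)."""
    succeeded = set()
    for _ in range(max_attempts):
        for pname, put in providers.items():
            if pname in succeeded:
                continue  # do not re-issue a request that already succeeded
            try:
                if put(name, data):
                    succeeded.add(pname)
            except Exception:
                pass  # an exception is treated as a failed copy
        if len(succeeded) == len(providers):
            break  # M == N, nothing left to retry
    # M = len(succeeded); the caller decides what to do when M < N
    return sorted(succeeded), len(succeeded) >= min_copies
```

If the second return value is False, the cloud-subscriber would fall back to the last step of the scenario and consider adding cloud-providers.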

Success Scenario 2 (read data, IaaS, PaaS): The cloud-subscriber issues K concurrent object read requests using the data object APIs that each cloud-provider publishes. The cloud-subscriber chooses K to be large enough that at least one of the responses from the responding cloud-providers will contain data from the object's most recent update. The cloud-subscriber compares the responses from the responding cloud-providers and chooses the response representing the latest version of the object.
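
One way to choose K, offered here as an assumption rather than as part of the use case: if the most recent write succeeded on M of the N cloud-providers, then any N - M + 1 responses from distinct cloud-providers must include at least one current copy, because at most N - M providers hold a stale one. The sketch below shows that calculation and the version comparison. The Response type and its field names are hypothetical; the sequence field stands for the time stamp or sequence number mentioned under Assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Response:
    provider: str
    sequence: int  # sequence number or time stamp from the object's metadata
    data: bytes


def min_reads(n: int, m: int) -> int:
    """Smallest K such that any K of N providers overlap the M holding the latest write."""
    return n - m + 1


def latest(responses: Iterable[Response]) -> Optional[Response]:
    """Return the response with the highest sequence number, or None if none arrived."""
    return max(responses, key=lambda r: r.sequence, default=None)
```

Note that K counts responses actually received; if some cloud-providers may fail to respond, the cloud-subscriber would need to issue more than K requests.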

Success Scenario 3 (redundant batch jobs, IaaS, PaaS): The cloud-subscriber starts a processing job on each of the N cloud-providers (e.g., see Use Case "VM Control: Manage Virtual Machine Instances"). Each cloud-provider runs exactly the same job, on the same input data, and produces output data. The cloud-subscriber retrieves the output data from the first-completing cloud-provider and checksums it, then checksums the output of each subsequently returning cloud-provider and compares the checksums for equality. If any of the equality checks fail, the cloud-subscriber can rerun the job, perhaps allocating it onto a different set of cloud-providers, or simply take a majority vote and accept the majority output as the result.
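
The checksum comparison and majority vote described above might look like the following sketch. SHA-256 is an assumed choice of checksum, and the vote function and its return shape are hypothetical; the outputs are assumed to have already been retrieved from each cloud-provider as bytes.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Optional, Tuple


def checksum(output: bytes) -> str:
    """Checksum one provider's output (SHA-256 is an assumption, any strong hash works)."""
    return hashlib.sha256(output).hexdigest()


def vote(outputs: Dict[str, bytes]) -> Tuple[Optional[bytes], List[str]]:
    """Return (majority output or None, providers that disagreed with it)."""
    digests = {provider: checksum(out) for provider, out in outputs.items()}
    if not digests:
        return None, []
    winner, count = Counter(digests.values()).most_common(1)[0]
    if count * 2 <= len(digests):
        # No strict majority: every provider is suspect, so rerun the job.
        return None, sorted(digests)
    dissenters = sorted(p for p, d in digests.items() if d != winner)
    result = next(out for p, out in outputs.items() if digests[p] == winner)
    return result, dissenters
```

A None result corresponds to the "rerun the job" branch of the scenario; a non-empty list of dissenters identifies the cloud-providers the cloud-subscriber might replace.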

Success Scenario 4 (state machine replication, IaaS, PaaS): The cloud-subscriber starts a long-running server process in each of the N cloud-providers. Iteratively, the cloud-subscriber sends a service request to each server process in the N cloud-providers, receives each server's results, and compares the results. If the comparisons do not show equality, the cloud-subscriber reinitializes the servers that are determined to have failed, perhaps by migrating them to new cloud-providers. If a server fails to respond to requests for a timeout period, the cloud-subscriber reinitializes the server, bringing it up to the state of the others.
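
One iteration of the loop above can be sketched as follows. Everything here is an illustrative assumption: Replica is an in-process stand-in for a server at one cloud-provider, the transition function is assumed to be deterministic, and a failed server is brought up to date by replaying a request log kept by the cloud-subscriber. A real deployment would more likely transfer a state snapshot and would detect timeouts over the network.

```python
from collections import Counter
from typing import Callable, List


class Replica:
    """Hypothetical stand-in for a long-running server at one cloud-provider."""

    def __init__(self, apply: Callable[[int, int], int]):
        self.apply = apply  # deterministic transition: (state, request) -> new state
        self.state = 0

    def handle(self, request: int) -> int:
        self.state = self.apply(self.state, request)
        return self.state


def step(replicas: List[Replica], request: int, log: List[int],
         make_replica: Callable[[], Replica]) -> int:
    """Send one request to every replica, vote, and rebuild replicas that disagree."""
    results = []
    for replica in replicas:
        try:
            results.append(replica.handle(request))
        except Exception:
            results.append(None)  # a crash or timeout counts as no answer
    answers = Counter(x for x in results if x is not None)
    if not answers:
        raise RuntimeError("no replica answered")
    winner, count = answers.most_common(1)[0]
    if count * 2 <= len(replicas):
        raise RuntimeError("no majority among replicas")
    log.append(request)  # record only requests the majority agreed on
    for i, result in enumerate(results):
        if result != winner:
            fresh = make_replica()  # e.g., started at a new cloud-provider
            for past in log:        # replay the log to reach the others' state
                fresh.handle(past)
            replicas[i] = fresh
    return winner
```

The majority check relies on the assumption stated earlier that a majority of results will be equivalent; without it, the sketch raises an error rather than guessing.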

Failure Conditions: The requested action or process performed at one or more of the N cloud-providers fails or returns incorrect data to the cloud-subscriber.

Failure Handling: The cloud-subscriber either reinitiates the requested action or considers performing the action with new cloud-provider(s).

Requirements File:

Credit: Note: there is a large body of literature on how to implement replication in network services using protocols such as two-phase commit, quorum consensus, timestamps, or transactions; this is just a sketch. One good source of information on how to compare results (termed "voting") is the n-version programming literature.