Resilient Computing in the Clouds

Project summary

Cloud computing is a recent approach for providing computing, network and storage resources over the internet. This virtualization of resources allows companies to benefit from scalability and pay-per-use pricing, i.e., companies do not have to make huge investments in resources sized for their worst-case requirements, but can obtain them dynamically from a cloud computing provider, paying only for the resources they actually need. The provider does not have to provision for the worst case of each client because the resources are shared among a large number of clients. In Portugal there is currently great interest from medium and large companies in using cloud computing to cut investments in computing, network and storage equipment. However, the notion of having computing or storage resources located in a third party’s infrastructure leads to concerns about security and dependability. These concerns prevent many companies from adopting cloud computing due to the criticality of their resources, or at least to the risk of economic losses.

This project aims to tackle the challenge of improving the security and dependability of cloud computing services using recent techniques investigated under designations such as resilience, Byzantine fault tolerance or survivability, which we call intrusion tolerance or InTol for short. InTol has received much research attention in recent years since it can be used to obtain both fault tolerance and added security. InTol starts from the premise that complex systems cannot be made entirely secure. However, similarly to accidental faults, attacks and intrusions in components of a system can be tolerated, leaving the system as a whole working properly, i.e., satisfying availability, integrity and confidentiality properties, despite such events. A common approach is to use state machine replication to replicate a service across several servers in such a way that even if some of the servers are compromised, the service as a whole remains correct.
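The replication idea can be illustrated with the usual arithmetic of Byzantine fault-tolerant state machine replication: classic protocols need at least 3f + 1 replicas to tolerate f compromised ones, and a client can trust a reply once f + 1 replicas report the same value, since at least one of them must be correct. The following is a minimal Python sketch of that client-side vote only; the function names are hypothetical and it does not model the agreement protocol run among the replicas.

```python
from collections import Counter
from typing import List, Optional


def min_replicas(f: int) -> int:
    """Replicas needed by classic BFT state machine replication to tolerate f faults."""
    return 3 * f + 1


def accept_reply(replies: List[str], f: int) -> Optional[str]:
    """Return the reply reported by at least f + 1 replicas, or None if there is none.

    With at most f faulty replicas, f + 1 matching replies guarantee that
    at least one correct replica produced that value.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None
```

For example, with f = 1 the service runs on four replicas, and a client that receives three matching replies and one divergent reply accepts the matching value.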

InTol can be beneficial in cloud computing at two levels. Inside a cloud (intra-cloud level) it can be used by the cloud provider or its users to implement highly resilient critical services. One class of such services that the project will examine carefully is coordination services, which are currently a key component in these environments (examples are Google’s Chubby and Yahoo!’s ZooKeeper). These services are vital for the operation of dozens of applications in current clouds, so improving their resilience has a direct impact on overall resilience. However, companies running large clusters of machines are reluctant to use InTol protocols for two reasons: a perceived lack of performance, which is not a real limitation; and the instability of the protocols, which can lead to undesirable effects such as multicast storms and chaotic load fluctuations, and which is a real problem.

At the second level, inter-cloud, the idea is that cloud users may improve the resilience of their resources by replicating them in several clouds. This poses its own set of problems. For instance, congestion and link failures in some of the internet connections between these clouds may degrade the perceived quality of service.

The project will deal with these challenges and make contributions in the following important areas:

First, the project will investigate and design efficient and stable intra-cloud intrusion-tolerant replication protocols and show how to use them to implement a highly resilient coordination service. The main challenge is obtaining stability, by avoiding the multicast storms and load fluctuations that current protocols can cause, while retaining good efficiency (low latency, high throughput). The project will leverage the following approaches to achieve this objective: use of a combination of techniques such as failure detectors, randomization and ordering oracles to obtain stability; use of small secure components to constrain the power of faulty replicas to cause disagreement; use of different protocols for implementing each of the coordination service operations, instead of a one-size-fits-all solution.
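One way a small secure component could constrain a faulty replica is by binding every outgoing message to a unique, monotonically increasing counter value, so that the replica cannot send two different messages under the same sequence number. The Python sketch below illustrates this under strong assumptions: the class and function names are hypothetical, the component is assumed to be tamper-proof, and verifiers are assumed to share its symmetric key, which a real design would avoid. It is not presented as the mechanism the project will adopt.

```python
import hashlib
import hmac
from typing import Tuple


class TrustedCounter:
    """Hypothetical tamper-proof component (assumption: its state cannot be altered)."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._value = 0

    def certify(self, message: bytes) -> Tuple[int, bytes]:
        """Assign the next counter value to the message and authenticate the pair."""
        self._value += 1  # never reused, never decremented
        payload = self._value.to_bytes(8, "big") + message
        tag = hmac.new(self._key, payload, hashlib.sha256).digest()
        return self._value, tag


def verify(key: bytes, counter: int, message: bytes, tag: bytes) -> bool:
    """Check that the tag binds exactly this message to this counter value."""
    payload = counter.to_bytes(8, "big") + message
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)
```

Because each counter value is certified only once, a compromised replica that wants to tell different things to different peers must use different counter values, which the peers can detect as a gap or a conflict.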

Second, the project will investigate and design efficient and stable inter-cloud replication protocols. The main challenge is to improve resilience to instability and unavailability in the wide-area network. Furthermore, reducing the number of replicas needed is extremely important due to the extra cost of contacting each additional cloud. The project will leverage approaches such as: application-level routing using overlay networks to divert traffic through better and/or non-faulty connections; connection of each cloud to more than one service provider (multihoming) in order to tolerate critical failures in some of them.
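The application-level routing idea can be sketched as a choice between the direct connection and a detour through one intermediate cloud, taking whichever currently has the lowest latency. The Python sketch below assumes a hypothetical table of measured one-way latencies in milliseconds, with failed links marked as infinite; the function name, the data layout and the one-relay limit are illustrative assumptions, not part of the project description.

```python
import math
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def best_path(latency: Dict[Link, float], src: str, dst: str) -> Tuple[List[str], float]:
    """Pick the direct path or the best single-relay path between two clouds.

    latency maps (from, to) to a measured delay; a missing or failed link
    counts as math.inf and is therefore never preferred.
    """
    nodes = {node for link in latency for node in link}
    best = ([src, dst], latency.get((src, dst), math.inf))
    for relay in sorted(nodes - {src, dst}):
        cost = latency.get((src, relay), math.inf) + latency.get((relay, dst), math.inf)
        if cost < best[1]:
            best = ([src, relay, dst], cost)
    return best
```

If the direct link between two clouds has failed, the function returns the cheapest relay path instead; if no path is usable, the returned cost is infinite and the caller can treat the destination as unreachable.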