Rui Oliveira

Vilaça R, Oliveira R.  2009.  Clouder: a flexible large scale decentralized object store - architecture overview. Proceedings of 3rd Workshop on Dependable Distributed Data Management (with EuroSys).

The current exponential growth of data calls for massive-scale storage and processing capabilities. Such large volumes of data tend to preclude centralized storage and processing, making extensive and flexible data partitioning unavoidable. This is being acknowledged by several major Internet players embracing the Cloud computing model and offering first-generation remote storage services with simple processing capabilities. In this position paper we present preliminary ideas for the architecture of a flexible, efficient and dependable fully decentralized object store able to manage very large sets of variable-size objects and to coordinate in-place processing. Our target is local-area large computing facilities composed of tens of thousands of nodes under the same administrative domain. The system should be capable of leveraging massive replication of data to balance read scalability and fault tolerance.

Oliveira R.  2008.  An Open Architecture for Scalable Database Clustering. Proceedings of 3rd Enterprise Distributed Object Computing Conference Workshops (EDOC).

The Database Management System (DBMS) used to be a commodity software component, with well known standard interfaces and semantics. However, the performance, scalability and reliability expectations being placed on DBMS have increased the demand for a variety of add-ons that augment the functionality of the database in a wide range of deployment scenarios, offering support for features such as clustering, replication, and self-management, among others. Recently, several such add-ons have been designed and implemented both in academia and by leading commercial database providers. Each proposal tends to target certain goals and applications, therefore establishing specific tradeoffs that impair their flexibility. Moreover, it has been a common fundamental assumption that any add-ons should not be intrusive and that the DBMS should be kept unchanged and monolithically handled. While this is a very sensible and pragmatic view due to the complexity of DBMS and the critical role they play in existing information systems, emerging demands on scalability require greater flexibility of the whole data management system so that major functionalities can be realized as autonomous services with specific tradeoffs and quality of service. The GORDA project (EU IST FP6) proposed a general purpose DBMS reflection architecture and interface - GAPI, which supports a number of useful extensions while at the same time admitting efficient implementations. By exposing at the interface an abstract representation of the system's inner functionality, the latter can be inspected and manipulated, thus changing its behavior without loss of encapsulation. DBMS have long taken advantage of this - on the database schema, on triggers, and when exposing the log. In this talk we describe the various aspects and goals that led to GAPI and we illustrate the usefulness of the architecture and interface with concrete examples. GORDA fundamentally emphasizes the modularity of the add-ons, e.g. clustering, replication and management, the DBMS itself and fundamental building blocks such as reliable group communication. This effort clearly seems to be of major relevance for the emerging Cloud storage systems. By easing the development of different add-ons for database systems, it can be used to enrich the current products offered by key providers such as Amazon and Google and enable small providers to jump into this new trend. Cloud storage offers are touted as being able to deal with both very large data volumes as well as large numbers of clients with different storage needs. Per se, these two requirements call for highly scalable and flexible infrastructures. Current general tradeoffs, however, favor minimal client interfaces with rather relaxed consistency guarantees which are not adequate for general applications. Bringing transactional semantics and ACID guarantees to the Cloud appears as a major commercial trend and research challenge.

Carvalho N, Pereira JO, Oliveira R, Rodrigues L.  2007.  Emergent structure in unstructured epidemic multicast. Proceedings of 37th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

In epidemic or gossip-based multicast protocols, each node simply relays each message to some random neighbors, such that all destinations receive it at least once with high probability. In sharp contrast, structured multicast protocols explicitly build and use a spanning tree to take advantage of efficient paths, and aim at having each message received exactly once. Unfortunately, when failures occur, the tree must be rebuilt. Gossiping thus provides simplicity and resilience at the expense of performance and resource efficiency. In this paper we propose a novel technique that exploits knowledge about the environment to schedule payload transmission when gossiping. The resulting protocol retains the desirable qualities of gossip, but approximates the performance of structured multicast. In some sense, instead of imposing structure by construction, we let it emerge from the operation of the gossip protocol. Experimental evaluation shows that this approach is effective even when knowledge about the environment is only approximate.
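As a hypothetical sketch of the basic gossip step this line of work builds on (names and structure are assumptions for illustration, not the authors' code), each node relays a freshly seen message to a few random neighbors and drops duplicates:

```python
import random

def gossip_relay(message_id, neighbors, fanout, seen):
    """One gossip step: pick `fanout` random neighbors to relay to.

    `seen` holds ids this node has already relayed; duplicates are
    dropped, which keeps total traffic bounded while the random
    relaying still reaches all destinations with high probability.
    """
    if message_id in seen:
        return []                     # already relayed once: drop
    seen.add(message_id)
    return random.sample(neighbors, min(fanout, len(neighbors)))
```

With a fanout on the order of ln(n), a message reaches all n nodes with high probability; the paper's contribution lies in scheduling *when* the payload is attached to these relays, so that structure emerges from an otherwise unstructured exchange.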

Grov J, Soares L, Correia A, Pereira JO, Oliveira R, Pedone F.  2006.  A pragmatic protocol for database replication in interconnected clusters. Proceedings of 12th Pacific Rim International Symposium on Dependable Computing (PRDC).

Multi-master update everywhere database replication, as achieved by protocols based on group communication such as DBSM and Postgres-R, addresses both performance and availability. By scaling it to wide area networks, one could save costly bandwidth and avoid large round-trips to a distant master server. Also, by ensuring that updates are safely stored at a remote site within transaction boundaries, disaster recovery is guaranteed. Unfortunately, scaling existing cluster based replication protocols is troublesome. In this paper we present a database replication protocol based on group communication that targets interconnected clusters. In contrast with previous proposals, it uses a separate multicast group for each cluster and thus does not impose any additional requirements on group communication, easing implementation and deployment in a real setting. Nonetheless, the protocol ensures one-copy equivalence while allowing all sites to execute update transactions. Experimental evaluation using the workload of the industry standard TPC-C benchmark confirms the advantages of the approach.

Oliveira R, Pereira JO, Correia A, Archibald E.  2006.  Revisiting 1-Copy equivalence in clustered databases. Proceedings of 21st ACM Symposium on Applied Computing (SAC).

Recently renewed interest in scalable database systems for shared nothing clusters has been supported by replication protocols based on group communication that are aimed at seamlessly extending the native consistency criteria of centralized database management systems. By using a read-one/write-all-available approach and avoiding the fine-grained synchronization associated with traditional distributed locking, one needs just a single distributed interaction step for each update transaction. Therefore the system can easily be scaled to a large number of replicas, especially with read intensive loads typical of Web server support environments. In this paper we point out that 1-copy equivalence for causal consistency, which is subsumed by both serializability and snapshot isolation criteria, depends on basic session guarantees that are costly to ensure in clusters, especially in a multi-tier environment. We then point out a simple solution that guarantees causal consistency in the Database State Machine protocol and evaluate its performance, thus highlighting the cost of seamlessly providing common consistency criteria of centralized databases in a clustered environment.

Sousa AL, Correia A, Moura F, Pereira JO, Oliveira R.  2006.  Evaluating certification protocols in the partial database state machine. Proceedings of 1st International Conference on Availability, Reliability and Security (ARES).

Partial replication is an alluring technique to ensure the reliability of very large and geographically distributed databases while, at the same time, offering good performance. By correctly exploiting access locality most transactions become confined to a small subset of the database replicas, thus reducing the processing, storage access and communication overhead associated with replication. The advantages of partial replication have, however, to be weighed against the added complexity that is required to manage it. In fact, if the chosen replica configuration prevents the local execution of transactions or if the overhead of consistency protocols offsets the savings of locality, potential gains cannot be realized. These issues are heavily dependent on the application used for evaluation and render simplistic benchmarks useless. In this paper, we present a detailed analysis of partial database state machine (PDBSM) replication by comparing alternative partial replication protocols with full replication. This is done using a realistic scenario based on a detailed network simulator and access patterns from an industry standard database benchmark. The results obtained allow us to identify the best configuration for typical online transaction processing applications.

Correia A, Sousa AL, Soares L, Pereira JO, Moura F, Oliveira R.  2005.  Group-based replication of on-line transaction processing servers. Proceedings of 2nd Latin-American Symposium on Dependable Computing (LADC), LNCS 3747.

Several techniques for database replication using group communication have recently been proposed, namely, the Database State Machine, Postgres-R, and the NODO protocol. Although all rely on a totally ordered multicast for consistency, they differ substantially on how multicast is used. This results in different performance trade-offs which are hard to compare, as each protocol is presented using a different load scenario and evaluation method. In this paper we evaluate the suitability of such protocols for replication of On-Line Transaction Processing (OLTP) applications in clusters of servers and over wide area networks. This is achieved by implementing them using a common infrastructure and by using a standard workload. The results allow us to select the best protocol regarding performance and scalability in a demanding but realistic usage scenario.

Sousa AL, Pereira JO, Soares L, Correia A, Rocha L, Oliveira R, Moura F.  2005.  Testing the dependability and performance of group communication based database replication protocols. Proceedings of 35th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

Database replication based on group communication systems has recently been proposed as an efficient and resilient solution for large-scale data management. However, its evaluation has been conducted either on simplistic simulation models, which fail to assess concrete implementations, or on complete system implementations which are costly to test with realistic large-scale scenarios. This paper presents a tool that combines implementations of replication and communication protocols under study with simulated network, database engine, and traffic generator models. Replication components can therefore be subjected to realistic large scale loads in a variety of scenarios, including fault-injection, while at the same time providing global observation and control. The paper shows first how the model is configured and validated to closely reproduce the behavior of a real system, and then how it is applied, allowing us to derive interesting conclusions both on replication and communication protocols and on their implementations.

Pereira JO, Oliveira R.  2005.  Rewriting 'The Hare and the Turtle': Sleeping to Get There Faster. Proceedings of the 35th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (Supplemental Volume).

When developing algorithms for dependable distributed systems one often makes several simplifying assumptions that are essential to reason about the problem at hand. It is usual to assume an asynchronous system model, unconstrained system resources and the absence of easily maskable faults such as message loss. While most of the simplifications strengthen the model and are particularly useful when proving theoretical edge results, asynchrony, on the contrary, is a "non-assumption" and is especially appealing in practice as it yields robust solutions that are correct regardless of the actual timing behavior of the target systems.

Pereira JO, Rodrigues L, Pinto A, Oliveira R.  2004.  Low latency probabilistic broadcast in wide area networks. Proceedings of 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS).

In this paper we propose a novel probabilistic broadcast protocol that reduces the average end-to-end latency by dynamically adapting to network topology and traffic conditions. It does so by using a unique strategy that consists in adjusting the fanout and preferred targets for different gossip rounds as a function of the properties of each node. Node classification is lightweight and integrated in the protocol's membership management. Furthermore, each node is not required to have full knowledge of the group membership or of the network topology. The paper shows how the protocol can be configured and evaluates its performance with a detailed simulation model.

Pereira JO, Oliveira R.  2004.  The mutable consensus protocol. Proceedings of 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS).

In this paper we propose the mutable consensus protocol, a pragmatic and theoretically appealing approach to enhance the performance of distributed consensus. First, an apparently inefficient protocol is developed using the simple stubborn channel abstraction for unreliable message passing. Then, performance is improved by introducing judiciously chosen finite delays in the implementation of channels. Although this does not compromise correctness, which rests on an asynchronous system model, it makes it likely that the transmission of some messages is avoided and thus the message exchange pattern at the network level changes noticeably. By choosing different delays in the underlying stubborn channels, the mutable consensus protocol can actually be made to resemble several different protocols. Besides presenting the mutable consensus protocol and four different mutations, we evaluate in detail the particularly interesting permutation gossip mutation, which allows the protocol to scale gracefully to a large number of processes by balancing the number of messages to be handled by each process with the number of communication steps required to decide. The evaluation is performed using a realistic simulation model which accurately reproduces resource consumption in real systems.
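A minimal sketch of the idea, under assumed names (not the paper's code): a stubborn channel need only deliver the last message sent on it, so delaying transmission lets a newer message overwrite a pending one and suppresses traffic without affecting correctness:

```python
class StubbornChannel:
    """Stubborn point-to-point channel with an added finite send delay.

    Only the most recent message per destination is retained: if a new
    send arrives before a pending one is transmitted, the pending one
    is overwritten and never hits the network. Illustrative sketch only;
    all names here are assumptions for exposition.
    """

    def __init__(self, delay):
        self.delay = delay
        self.pending = {}      # destination -> (transmit_time, message)
        self.transmitted = []  # (destination, message) actually sent

    def send(self, now, dest, message):
        # Overwriting is safe: the channel only promises that the
        # *last* message sent is eventually delivered.
        self.pending[dest] = (now + self.delay, message)

    def tick(self, now):
        # Transmit every pending message whose delay has expired.
        for dest, (t, msg) in list(self.pending.items()):
            if t <= now:
                self.transmitted.append((dest, msg))
                del self.pending[dest]
```

For example, `send(0, "q", "estimate-1")` followed by `send(1, "q", "estimate-2")` on a channel with `delay=2` transmits only the second message. Different delay choices in different channels yield the different "mutations" of the protocol.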

Sousa AL, Pereira JO, Oliveira R, Moura F.  2004.  Semantic reliability on the Database State Machine. JISBD - Jornadas de Ingenieria del Software y Bases de Datos.

Database replication protocols based on group communication primitives have recently been the subject of a considerable body of research [1, 11, 13, 6, 8, 4]. The reason for this stems from the adequacy of the order and atomicity properties of group communication primitives to implement synchronous (i.e., strongly consistent) replication strategies. Unlike database replication schemes based on traditional transactional.

Correia A, Sousa AL, Soares L, Moura F, Oliveira R.  2004.  Revisiting epsilon serializability to improve the database state machine. Proceedings of the Workshop on Dependable Distributed Data Management (with SRDS 2004).

Recently, a large body of research has been exploiting group communication based techniques to improve the dependability and performance of synchronously replicated database systems.

Pereira JO, Oliveira R.  2003.  A mutable protocol for Consensus in large groups. Proceedings of the Workshop on Large-Scale Group Communication.

In this paper we propose the mutable consensus protocol, a pragmatic and theoretically appealing approach to enhance the performance of distributed consensus with a large number of participants. First, an apparently inefficient consensus protocol is developed using the very simple stubborn channel abstraction for unreliable message passing. Then, the introduction of judiciously chosen finite delays in the implementation of channels makes it likely that the transmission of some messages is avoided. Although this does not affect correctness, which rests on an asynchronous system model, the message exchange pattern at the network level changes noticeably and can be made to resemble several different protocols. A particularly appealing instantiation, called the permutation gossip, allows the protocol to scale gracefully to a large number of processes.

Pereira JO, Rodrigues L, Monteiro M, Oliveira R, Kermarrec AM.  2003.  NEEM: Network-friendly epidemic multicast. Proceedings of 22nd IEEE International Symposium on Reliable Distributed Systems (SRDS).

Epidemic, or probabilistic, multicast protocols have emerged as a viable mechanism to circumvent the scalability problems of reliable multicast protocols. However, most existing epidemic approaches use connectionless transport protocols to exchange messages and rely on the intrinsic robustness of the epidemic dissemination to mask network omissions. Unfortunately, such an approach is not network-friendly, since the epidemic protocol makes no effort to reduce the load imposed on the network when the system is congested. In this paper, we propose a novel epidemic protocol whose main characteristic is to be network-friendly. This property is achieved by relying on connection-oriented transport connections, such as TCP/IP, to support the communication among peers. Since during congestion messages accumulate in the border of the network, the protocol uses an innovative buffer management scheme that combines different selection techniques to discard messages upon overflow. This technique improves the quality of the information delivered to the application during periods of network congestion. The protocol has been implemented and the benefits of the approach are illustrated using a combination of experimental and simulation results.

Soares L, Sousa AL, Pereira JO, Oliveira R, Rocha L, Moura F, Correia A.  2003.  Avaliação de um SGBD replicado usando simulação de redes. Actas da 6ª Conferência sobre Redes de Computadores.

Replication of database management systems (DBMS) is a fundamental mechanism for the reliability of information systems. In geographically distributed systems it is also useful for disaster recovery and ubiquitous data availability. A recently proposed replication technique is the Database State Machine (DBSM), which promises to combine reliability with high performance by taking advantage of group communication systems. The performance evaluation of this technique has, however, been carried out with overly simple or unrealistic communication networks and with a non-representative load. This paper proposes a rigorous evaluation of an implementation of this replication technique, combining a realistic simulation model of communication networks with load generation according to the standards established by the Transaction Processing Council (TPC). The results obtained confirm the interest of this technique in local area networks, but show that its performance is constrained by the characteristics of the network and the load.

Pereira JO, Rodrigues L, Oliveira R.  2002.  Reducing the cost of group communication with semantic view synchrony. Proceedings of 32nd IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

View Synchrony (VS) is a powerful abstraction in the design and implementation of dependable distributed systems. By ensuring that processes deliver the same set of messages in each view, it allows them to maintain consistency across membership changes. However, experience indicates that it is hard to combine strong reliability guarantees as offered by VS with stable high performance. In this paper we propose a novel abstraction, Semantic View Synchrony (SVS), that exploits the application's semantics to cope with high throughput applications. This is achieved by allowing some messages to be dropped while still preserving consistency when new views are installed. Thus, SVS inherits the elegance of view synchronous communication. The paper describes how SVS can be implemented and illustrates its usefulness in the context of distributed multi-player games.

Sousa AL, Pereira JO, Moura F, Oliveira R.  2002.  Optimistic total order in wide area networks. Proceedings of 21st IEEE International Symposium on Reliable Distributed Systems (SRDS).

Total order multicast greatly simplifies the implementation of fault-tolerant services using the replicated state machine approach. The additional latency of total ordering can be masked by taking advantage of spontaneous ordering observed in LANs: A tentative delivery allows the application to proceed in parallel with the ordering protocol. The effectiveness of the technique rests on the optimistic assumption that a large share of correctly ordered tentative deliveries offsets the cost of undoing the effect of mistakes. This paper proposes a simple technique which enables the usage of optimistic delivery also in WANs with much larger transmission delays where the optimistic assumption does not normally hold. Our proposal exploits local clocks and the stability of network delays to reduce the mistakes in the ordering of tentative deliveries. An experimental evaluation of a modified sequencer-based protocol is presented, illustrating the usefulness of the approach in fault-tolerant database management.
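The core idea can be sketched as a toy model (names and structure assumed for illustration, not the paper's protocol): instead of tentatively delivering each message on arrival, a receiver holds it until a fixed offset past its sender timestamp, so that with roughly synchronized clocks and stable delays tentative deliveries follow timestamp order:

```python
def tentative_order(arrivals, hold):
    """Compute a tentative delivery order.

    arrivals: list of (arrival_time, sender_timestamp, message).
    Each message becomes deliverable at max(arrival_time,
    sender_timestamp + hold); sorting by that instant (ties broken by
    sender timestamp) yields the tentative delivery sequence.
    """
    scheduled = [(max(arr, ts + hold), ts, msg) for arr, ts, msg in arrivals]
    return [msg for _, _, msg in sorted(scheduled)]
```

With `hold = 0` a message sent later but arriving earlier is tentatively delivered first, forcing a rollback when the final order differs; a hold larger than the delay jitter restores timestamp order at the cost of a bounded extra wait.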

Oliveira R, Pereira JO, Schiper A.  2001.  Primary-backup replication: From a time-free protocol to a time-based implementation. Proceedings of 20th IEEE International Symposium on Reliable Distributed Systems (SRDS).

Fault-tolerant control systems can be built by replicating critical components. However, replication raises the issue of inconsistency. Multiple protocols for ensuring consistency have been described in the literature. PADRE (Protocol for Asymmetric Duplex Redundancy) is such a protocol, and an interesting case study of a complex and sensitive problem: the management of replicated traffic controllers in a railway system [5]. However, the low level at which the protocol has been developed embodies system details, namely timeliness assumptions, that make it difficult to understand and may narrow its applicability. We argue that, when designing a protocol, it is preferable to consider first a general solution that does not include any timeliness assumptions; then, by taking into account additional hypotheses, one can easily design a time-based solution tailored to a specific environment. This paper illustrates the benefit of a top-down protocol design approach, and shows that PADRE can be seen as an instance of a standard Primary-backup replication protocol based on View Synchronous Communication (VSC).

Pereira JO, Rodrigues L, Oliveira R, Kermarrec AM.  2001.  Probabilistic semantically reliable multicast. Proceedings of the IEEE International Symposium on Network Computing and Applications (NCA).

Traditional reliable broadcast protocols fail to scale to large settings. The paper proposes a reliable multicast protocol that integrates two approaches to deal with the large-scale dimension in group communication protocols: gossip-based probabilistic broadcast and semantic reliability. The aim of the resulting protocol is to improve the resiliency of the probabilistic protocol to network congestion by allocating scarce resources to semantically relevant messages. Although intuitively it seems that a straightforward combination of probabilistic and semantic reliable protocols is possible, we show that it offers disappointing results. Instead, we propose an architecture based on a specialized probabilistic semantically reliable layer and show that it produces the desired results. The combined primitive is thus scalable to a large number of participants, highly resilient to network and process failures, and delivers a high quality data flow even when the load exceeds the available bandwidth. We present a summary of simulation results that compare different protocol configurations.

Position: Associate Professor