Distributed systems are difficult to program and near impossible
to debug. Existing tools that focus on single-node computation are
poorly suited to diagnosing errors that involve the interaction of
many machines over time. The database notion of provenance would
appear better suited to answering the sort of cause-and-effect
questions that arise during debugging, but existing provenance-based
approaches target only a narrow set of debugging scenarios. In this
paper, we explore the limits of provenance-based debugging. We
propose a simple query language to express common debugging
questions as expressions over provenance graphs capturing
traces of distributed executions. We show that when programs
and their correctness properties are written in a high-level
declarative language, we can go a step further than merely highlighting
errors and generate repairs for distributed programs. We validate
our prototype debugger, Nemo, on six protocols from our taxonomy
of 52 real-world distributed bugs, either generating repair rules
or pointing the programmer to root causes.
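
To make the flavor of such queries concrete, here is a minimal Python sketch (not Nemo's actual query language; the trace, event names, and the why() helper are all hypothetical) that treats a distributed trace as a provenance graph and answers a common debugging question: which earlier events lie in the causal history of a bad output.

```python
from collections import deque

# Hypothetical trace of one execution: each event (node, action, logical time)
# maps to the events that directly caused it.
provenance = {
    ("replica2", "stale_read", 7): [("replica2", "apply_write", 3),
                                    ("client", "read_req", 6)],
    ("replica2", "apply_write", 3): [("primary", "replicate", 2)],
    ("primary", "replicate", 2): [("client", "write_req", 1)],
    ("client", "read_req", 6): [],
    ("client", "write_req", 1): [],
}

def why(event):
    """Answer 'why did this output appear?': every event in the transitive
    causal history of `event` under the provenance edges."""
    seen, frontier = set(), deque([event])
    while frontier:
        for cause in provenance.get(frontier.popleft(), []):
            if cause not in seen:
                seen.add(cause)
                frontier.append(cause)
    return seen

# Debugging question: what led to the stale read observed at replica2?
for cause in sorted(why(("replica2", "stale_read", 7))):
    print(cause)
```
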
The heterogeneity, complexity, and scale of cloud applications make
verification of their fault tolerance properties challenging. Companies
are moving away from formal methods and toward large-scale testing in
which components are deliberately compromised to identify weaknesses in
the software. For example, techniques such as Jepsen apply
fault-injection testing to distributed data stores, and Chaos
Engineering performs fault injection experiments on production systems,
often on live traffic. Both approaches have captured the attention of
industry and academia alike.
Unfortunately, the search space of distinct fault combinations that an
infrastructure can test is intractable. Existing failure-testing
solutions require skilled and intelligent users who can supply the
faults to inject. These superusers, known as Chaos Engineers and Jepsen
experts, must study the systems under test, observe system executions,
and then formulate hypotheses about which faults are most likely to
expose real system-design flaws. This approach is fundamentally
unscalable and unprincipled. It relies on the superuser's ability to
interpret how a distributed system employs redundancy to mask or
ameliorate faults and, moreover, the ability to recognize the
insufficiencies in those redundancies—in other words, human genius.
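
As a back-of-the-envelope illustration of that intractability (the deployment numbers below are assumptions, not measurements from any system), simply counting the fault combinations available to a fault-injection infrastructure shows how quickly exhaustive exploration becomes hopeless:

```python
from math import comb

# Assumed deployment numbers, for illustration only.
n_calls = 100       # distinct service calls touched by one request class
max_faults = 4      # faults injected together in a single experiment

# Experiments needed to cover every combination of up to `max_faults`
# failures among those calls.
total = sum(comb(n_calls, k) for k in range(1, max_faults + 1))
print(f"{total:,} experiments for one request class")   # ~4.1 million
```
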
Verification is often regarded as a one-time procedure undertaken after a protocol is specified but
before it is implemented. However, in practice, protocols continually evolve with the addition of
new capabilities and performance optimizations. Existing verification tools are ill-suited to
“tracking” protocol evolution, and programmers are too busy (or too lazy?) to co-evolve
specifications manually. This means that the correctness guarantees determined at verification time can
erode as protocols evolve. Existing software quality techniques such as regression testing and root
cause analysis, which naturally support system evolution, are poorly suited to reasoning about fault
tolerance properties of a distributed system because these properties require a search of the execution
schedule rather than merely replaying inputs. This paper advocates that our community should explore the
intersection of testing and verification to better ensure quality for distributed software and presents
our experience evolving a data replication protocol at Elastic using a novel bug-finding technology
called LDFI as evidence.
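
To illustrate the distinction, the toy sketch below (a hypothetical replication example, not the Elastic protocol or LDFI itself) contrasts a conventional regression test, which replays a fixed input, with a fault-tolerance check, which must additionally enumerate the fault schedules under which that input is processed:

```python
from itertools import chain, combinations

REPLICAS = ["r1", "r2", "r3"]

def replicated_write(payload, dropped=frozenset()):
    """Toy protocol: a write is durable if at least one replica receives it."""
    return any(r not in dropped for r in REPLICAS)

# Conventional regression test: replay one fixed input in one failure-free run.
assert replicated_write("x")

# Fault-tolerance check: the same input under every schedule that drops the
# messages to a strict subset of the replicas (faults the protocol must mask).
def strict_subsets(xs):
    return chain.from_iterable(combinations(xs, k) for k in range(len(xs)))

for dropped in strict_subsets(REPLICAS):
    assert replicated_write("x", frozenset(dropped)), f"write lost when {dropped} dropped"
print("write survives the loss of any strict subset of replicas")
```
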
Large-scale distributed systems must be built to anticipate and
mitigate a variety of hardware and software failures. In order to build
confidence that fault-tolerant systems are correctly implemented,
Netflix (and similar enterprises) regularly run failure drills in which
faults are deliberately injected into their production systems. The
combinatorial space of failure scenarios is too large to explore
exhaustively. Existing failure testing approaches either explore the
space of potential failures randomly or exploit the “hunches” of
domain experts to guide the search. Random strategies
waste resources testing “uninteresting” faults, while programmer-guided
approaches are only as good as human intuition and only scale with
human effort. In this paper, we describe how we adapted and implemented
a research prototype called lineage-driven fault injection (LDFI) to
automate failure testing at Netflix. Along the way, we describe the
challenges that arose adapting the LDFI model to the complex and
dynamic realities of the Netflix architecture. We show how we
implemented the adapted algorithm as a service atop the existing
tracing and fault injection infrastructure, and present early results.
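
The sketch below gives a hypothetical rendering of that feedback loop; the call names, supports, and the run_experiment and next_hypothesis stubs are invented stand-ins rather than Netflix's tracing or fault-injection APIs. The loop traces a successful request, asks a solver-like step for an untested combination of faults that would cut every known path to success, injects it, and then either reports a potential bug or folds newly observed redundancy back into the model.

```python
from itertools import combinations

# Faults we are permitted to inject (assumption: only these downstream calls).
INJECTABLE = ["playback->cache", "playback->origin"]

# Redundancy the running system really has: the request succeeds if either the
# cache call or the origin fallback completes.
REAL_SUPPORTS = [{"playback->cache"}, {"playback->origin"}]

# What the first successful trace revealed: only the cache path was exercised.
known_supports = [{"playback->cache"}]

def run_experiment(faults):
    """Stand-in for injecting `faults` in production and replaying the request."""
    return any(not (faults & s) for s in REAL_SUPPORTS)

def next_hypothesis(supports, tried):
    """Smallest untried set of injectable faults that cuts every known support."""
    for k in range(1, len(INJECTABLE) + 1):
        for faults in map(frozenset, combinations(INJECTABLE, k)):
            if faults not in tried and all(faults & s for s in supports):
                return faults
    return None

tried = set()
while (faults := next_hypothesis(known_supports, tried)) is not None:
    tried.add(faults)
    if run_experiment(faults):
        # The request survived: this run's trace exposes redundancy (the origin
        # fallback) missing from the model, so fold it in and re-solve.
        known_supports = REAL_SUPPORTS
    else:
        print("hypothesis falsified the good outcome; injected:", sorted(faults))
        break
```
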
Failure is always an option; in large-scale data management systems, it
is practically a certainty. Fault-tolerant protocols and components are
notoriously difficult to implement and debug. Worse still, choosing
existing fault-tolerance mechanisms and integrating them correctly into
complex systems remains an art form, and programmers have few tools to
assist them. We propose a novel approach for discovering bugs in
fault-tolerant data management systems: lineage-driven fault injection.
A lineage-driven fault injector reasons backwards from correct system
outcomes to determine whether failures in the execution could have
prevented the outcome. We present MOLLY, a prototype of lineage-driven
fault injection that exploits a novel combination of data lineage
techniques from the database literature and state-of-the-art
satisfiability testing. If fault-tolerance bugs exist for a particular
configuration, MOLLY finds them rapidly, in many cases using an order
of magnitude fewer executions than random fault injection. Otherwise,
MOLLY certifies that the code is bug-free for that configuration.
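
Here is a minimal sketch of that backward reasoning (a toy broadcast lineage, with brute force standing in for the SAT solver MOLLY uses): the lineage of a correct outcome is a disjunction of supports, and a fault hypothesis is any set of message failures, within the fault budget, that falsifies every support.

```python
from itertools import combinations

# Lineage of the correct outcome "B learned the value", as a disjunction of
# supports: the outcome holds if every message in at least one support arrives.
supports = [
    {"msg(A->B, t=1)"},                       # direct broadcast from A to B
    {"msg(A->C, t=1)", "msg(C->B, t=2)"},     # relay through C
]
messages = sorted(set().union(*supports))

def outcome_holds(failed):
    """The good outcome survives iff some support loses none of its messages."""
    return any(not (failed & s) for s in supports)

def smallest_counterexample(max_faults):
    """Reason backwards from the lineage: the smallest set of message failures
    within the fault budget that falsifies every support (a hitting set)."""
    for k in range(1, max_faults + 1):
        for failed in map(frozenset, combinations(messages, k)):
            if not outcome_holds(failed):
                return failed
    return None

hypothesis = smallest_counterexample(max_faults=2)
if hypothesis is None:
    print("no failure combination within the budget invalidates the lineage")
else:
    print("inject these omissions and re-execute:", sorted(hypothesis))
```
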
Recent research has explored using Datalog-based languages to express a
distributed system as a set of logical invariants. Two properties of
distributed systems proved difficult to model in Datalog. First, the
state of any such system evolves with its execution. Second, deductions
in these systems may be arbitrarily delayed, dropped, or reordered by
the unreliable network links they must traverse. Previous efforts
addressed the former by extending Datalog to include updates, key
constraints, persistence and events, and the latter by assuming ordered
and reliable delivery while ignoring delay. These details have a
semantics outside Datalog, which increases the complexity of the
language or its interpretation, and forces programmers to think
operationally. We argue that the missing component from these previous
languages is a notion of time. In this paper we present Dedalus, a
foundation language for programming and reasoning about distributed
systems. Dedalus reduces to a subset of Datalog with negation,
aggregate functions, successor, and choice, and adds an explicit
representation of time to the logic language. We show that Dedalus
provides a declarative foundation for the two signature features of
distributed systems: mutable state, and asynchronous processing and
communication. Given these two features, we address three important
properties of programs in a domain-specific manner: a notion of safety
appropriate to non-terminating computations, stratified monotonic
reasoning with negation over time, and efficient evaluation over time
via a simple execution strategy. We also provide conservative syntactic
checks for our temporal notions of safety and stratification. Our
experience implementing full-featured systems in variants of Datalog
suggests that Dedalus is well-suited to the specification of rich
distributed services and protocols, and provides both cleaner semantics
and richer tests of correctness.
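
For intuition, the small simulation below is an assumed Python rendering (not Dedalus syntax or its evaluator, and the facts and rules are invented) of those two signature features: every fact carries an explicit timestamp, "mutable" state is just an inductive persistence rule from t to t+1 combined with deletion and insertion events, and asynchronous delivery is a rule whose head timestamp is chosen nondeterministically later than its body.

```python
import random
random.seed(1)
MAX_T = 6

# kb[t] holds the facts true at timestamp t; p("k","v1")@0 is the initial state.
kb = {t: set() for t in range(MAX_T + 1)}
kb[0].add(("p", "k", "v1"))

deletions = {2: {("p", "k", "v1")}}   # del_p("k","v1")@2: stop persisting v1
insertions = {2: {("p", "k", "v2")}}  # ins_p("k","v2")@2: v2 appears at time 3
outbox = {1: [("q", "k")]}            # q("k") sent at time 1
inbox = {}                            # asynchronous deliveries, filled below

for t in range(MAX_T):
    # Inductive rule (persistence, i.e. mutable state):
    #   p(K,V)@t+1 :- p(K,V)@t, not del_p(K,V)@t.
    kb[t + 1] |= {f for f in kb[t] if f not in deletions.get(t, set())}
    kb[t + 1] |= insertions.get(t, set())
    # Asynchronous rule (communication): the receipt timestamp is chosen
    # nondeterministically later than the send timestamp (delay/reordering).
    for fact in outbox.get(t, []):
        arrival = t + random.randint(1, 3)
        inbox.setdefault(arrival, []).append(("recv",) + fact)

for t in range(MAX_T + 1):
    print(t, sorted(kb[t]), inbox.get(t, []))
```
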