Distributed systems are difficult to program and near impossible
to debug. Existing tools that focus on single-node computation are
poorly suited to diagnosing errors that involve the interaction of
many machines over time. The database notion of provenance would
appear better suited to answering the sort of cause-and-effect
questions that arise during debugging, but existing provenance-based
approaches target only a narrow set of debugging scenarios. In this
paper, we explore the limits of provenance-based debugging. We
propose a simple query language to express common debugging
questions as expressions over provenance graphs capturing
traces of distributed executions. We show that when programs
and their correctness properties are written in a high-level
declarative language, we can go a step further than merely highlighting
errors and generate repairs for distributed programs. We validate
our prototype debugger, Nemo, on six protocols from our taxonomy
of 52 real-world distributed bugs, either generating repair rules
or pointing the programmer to root causes.
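
To make the flavor of such queries concrete, here is a minimal Python sketch (not Nemo's actual query language; the trace, event names, and the why() helper are all hypothetical) that treats a distributed trace as a provenance graph and answers a common debugging question: which earlier events lie in the causal history of a bad output.

```python
from collections import deque

# Hypothetical trace of one execution: each event (node, action, logical time)
# maps to the events that directly caused it.
provenance = {
    ("replica2", "stale_read", 7): [("replica2", "apply_write", 3),
                                    ("client", "read_req", 6)],
    ("replica2", "apply_write", 3): [("primary", "replicate", 2)],
    ("primary", "replicate", 2): [("client", "write_req", 1)],
    ("client", "read_req", 6): [],
    ("client", "write_req", 1): [],
}

def why(event):
    """Answer 'why did this output appear?': every event in the transitive
    causal history of `event` under the provenance edges."""
    seen, frontier = set(), deque([event])
    while frontier:
        for cause in provenance.get(frontier.popleft(), []):
            if cause not in seen:
                seen.add(cause)
                frontier.append(cause)
    return seen

# Debugging question: what led to the stale read observed at replica2?
for cause in sorted(why(("replica2", "stale_read", 7))):
    print(cause)
```
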
The heterogeneity, complexity, and scale of cloud applications make
verification of their fault tolerance properties challenging. Companies
are moving away from formal methods and toward large-scale testing in
which components are deliberately compromised to identify weaknesses in
the software. For example, techniques such as Jepsen apply
fault-injection testing to distributed data stores, and Chaos
Engineering performs fault injection experiments on production systems,
often on live traffic. Both approaches have captured the attention of
industry and academia alike.
Unfortunately, the search space of distinct fault combinations that an
infrastructure can test is intractable. Existing failure-testing
solutions require skilled and intelligent users who can supply the
faults to inject. These superusers, known as Chaos Engineers and Jepsen
experts, must study the systems under test, observe system executions,
and then formulate hypotheses about which faults are most likely to
expose real system-design flaws. This approach is fundamentally
unscalable and unprincipled. It relies on the superuser's ability to
interpret how a distributed system employs redundancy to mask or
ameliorate faults and, moreover, the ability to recognize the
insufficiencies in those redundancies—in other words, human genius.
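
As a back-of-the-envelope illustration of that intractability (the deployment numbers below are assumptions, not measurements from any system), simply counting the fault combinations available to a fault-injection infrastructure shows how quickly exhaustive exploration becomes hopeless:

```python
from math import comb

# Assumed deployment numbers, for illustration only.
n_calls = 100       # distinct service calls touched by one request class
max_faults = 4      # faults injected together in a single experiment

# Experiments needed to cover every combination of up to `max_faults`
# failures among those calls.
total = sum(comb(n_calls, k) for k in range(1, max_faults + 1))
print(f"{total:,} experiments for one request class")   # ~4.1 million
```
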
Verification is often regarded as a one-time procedure undertaken after a protocol is specified but
before it is implemented. However, in practice, protocols continually evolve with the addition of
new capabilities and performance optimizations. Existing verification tools are ill-suited to
“tracking” protocol evolution, and programmers are too busy (or too lazy?) to co-evolve
specifications manually. This means that the correctness guarantees determined at verification time can
erode as protocols evolve. Existing software quality techniques such as regression testing and root
cause analysis, which naturally support system evolution, are poorly suited to reasoning about fault
tolerance properties of a distributed system because these properties require a search of the execution
schedule rather than merely replaying inputs. This paper advocates that our community should explore the
intersection of testing and verification to better ensure quality for distributed software and presents
our experience evolving a data replication protocol at Elastic using a novel bug-finding technology
called LDFI as evidence.
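
To illustrate the distinction, the toy sketch below (a hypothetical replication example, not the Elastic protocol or LDFI itself) contrasts a conventional regression test, which replays a fixed input, with a fault-tolerance check, which must additionally enumerate the fault schedules under which that input is processed:

```python
from itertools import chain, combinations

REPLICAS = ["r1", "r2", "r3"]

def replicated_write(payload, dropped=frozenset()):
    """Toy protocol: a write is durable if at least one replica receives it."""
    return any(r not in dropped for r in REPLICAS)

# Conventional regression test: replay one fixed input in one failure-free run.
assert replicated_write("x")

# Fault-tolerance check: the same input under every schedule that drops the
# messages to a strict subset of the replicas (faults the protocol must mask).
def strict_subsets(xs):
    return chain.from_iterable(combinations(xs, k) for k in range(len(xs)))

for dropped in strict_subsets(REPLICAS):
    assert replicated_write("x", frozenset(dropped)), f"write lost when {dropped} dropped"
print("write survives the loss of any strict subset of replicas")
```
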
Large-scale distributed systems must be built to anticipate and
mitigate a variety of hardware and software failures. In order to build
confidence that fault-tolerant systems are correctly implemented,
Netflix (and similar enterprises) regularly run failure drills in which
faults are deliberately injected into their production systems. The
combinatorial space of failure scenarios is too large to explore
exhaustively. Existing failure testing approaches either explore the
space of potential failures randomly or exploit the “hunches” of
domain experts to guide the search. Random strategies
waste resources testing “uninteresting” faults, while programmer-guided
approaches are only as good as human intuition and only scale with
human effort. In this paper, we describe how we adapted and implemented
a research prototype called lineage-driven fault injection (LDFI) to
automate failure testing at Netflix. Along the way, we describe the
challenges that arose adapting the LDFI model to the complex and
dynamic realities of the Netflix architecture. We show how we
implemented the adapted algorithm as a service atop the existing
tracing and fault injection infrastructure, and present early results.
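
The sketch below gives a hypothetical rendering of that feedback loop; the call names, supports, and the run_experiment and next_hypothesis stubs are invented stand-ins rather than Netflix's tracing or fault-injection APIs. The loop traces a successful request, asks a solver-like step for an untested combination of faults that would cut every known path to success, injects it, and then either reports a potential bug or folds newly observed redundancy back into the model.

```python
from itertools import combinations

# Faults we are permitted to inject (assumption: only these downstream calls).
INJECTABLE = ["playback->cache", "playback->origin"]

# Redundancy the running system really has: the request succeeds if either the
# cache call or the origin fallback completes.
REAL_SUPPORTS = [{"playback->cache"}, {"playback->origin"}]

# What the first successful trace revealed: only the cache path was exercised.
known_supports = [{"playback->cache"}]

def run_experiment(faults):
    """Stand-in for injecting `faults` in production and replaying the request."""
    return any(not (faults & s) for s in REAL_SUPPORTS)

def next_hypothesis(supports, tried):
    """Smallest untried set of injectable faults that cuts every known support."""
    for k in range(1, len(INJECTABLE) + 1):
        for faults in map(frozenset, combinations(INJECTABLE, k)):
            if faults not in tried and all(faults & s for s in supports):
                return faults
    return None

tried = set()
while (faults := next_hypothesis(known_supports, tried)) is not None:
    tried.add(faults)
    if run_experiment(faults):
        # The request survived: this run's trace exposes redundancy (the origin
        # fallback) missing from the model, so fold it in and re-solve.
        known_supports = REAL_SUPPORTS
    else:
        print("hypothesis falsified the good outcome; injected:", sorted(faults))
        break
```
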
Failure is always an option; in large-scale data management systems, it
is practically a certainty. Fault-tolerant protocols and components are
notoriously difficult to implement and debug. Worse still, choosing
existing fault-tolerance mechanisms and integrating them correctly into
complex systems remains an art form, and programmers have few tools to
assist them. We propose a novel approach for discovering bugs in
fault-tolerant data management systems: lineage-driven fault injection.
A lineage-driven fault injector reasons backwards from correct system
outcomes to determine whether failures in the execution could have
prevented the outcome. We present MOLLY, a prototype of lineage-driven
fault injection that exploits a novel combination of data lineage
techniques from the database literature and state-of-the-art
satisfiability testing. If fault-tolerance bugs exist for a particular
configuration, MOLLY finds them rapidly, in many cases using an order
of magnitude fewer executions than random fault injection. Otherwise,
MOLLY certifies that the code is bug-free for that configuration.
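
Here is a minimal sketch of that backward reasoning (a toy broadcast lineage, with brute force standing in for the SAT solver MOLLY uses): the lineage of a correct outcome is a disjunction of supports, and a fault hypothesis is any set of message failures, within the fault budget, that falsifies every support.

```python
from itertools import combinations

# Lineage of the correct outcome "B learned the value", as a disjunction of
# supports: the outcome holds if every message in at least one support arrives.
supports = [
    {"msg(A->B, t=1)"},                       # direct broadcast from A to B
    {"msg(A->C, t=1)", "msg(C->B, t=2)"},     # relay through C
]
messages = sorted(set().union(*supports))

def outcome_holds(failed):
    """The good outcome survives iff some support loses none of its messages."""
    return any(not (failed & s) for s in supports)

def smallest_counterexample(max_faults):
    """Reason backwards from the lineage: the smallest set of message failures
    within the fault budget that falsifies every support (a hitting set)."""
    for k in range(1, max_faults + 1):
        for failed in map(frozenset, combinations(messages, k)):
            if not outcome_holds(failed):
                return failed
    return None

hypothesis = smallest_counterexample(max_faults=2)
if hypothesis is None:
    print("no failure combination within the budget invalidates the lineage")
else:
    print("inject these omissions and re-execute:", sorted(hypothesis))
```
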
Recent research has explored using Datalog-based languages to express a
distributed system as a set of logical invariants. Two properties of
distributed systems proved difficult to model in Datalog. First, the
state of any such system evolves with its execution. Second, deductions
in these systems may be arbitrarily delayed, dropped, or reordered by
the unreliable network links they must traverse. Previous efforts
addressed the former by extending Datalog to include updates, key
constraints, persistence and events, and the latter by assuming ordered
and reliable delivery while ignoring delay. These details have a
semantics outside Datalog, which increases the complexity of the
language or its interpretation, and forces programmers to think
operationally. We argue that the missing component from these previous
languages is a notion of time. In this paper we present Dedalus, a
foundation language for programming and reasoning about distributed
systems. Dedalus reduces to a subset of Datalog with negation,
aggregate functions, successor, and choice, and adds an explicit
representation of time to the logic language. We show that Dedalus
provides a declarative foundation for the two signature features of
distributed systems: mutable state, and asynchronous processing and
communication. Given these two features, we address three important
properties of programs in a domain-specific manner: a notion of safety
appropriate to non-terminating computations, stratified monotonic
reasoning with negation over time, and efficient evaluation over time
via a simple execution strategy. We also provide conservative syntactic
checks for our temporal notions of safety and stratification. Our
experience implementing full-featured systems in variants of Datalog
suggests that Dedalus is well-suited to the specification of rich
distributed services and protocols, and provides both cleaner semantics
and richer tests of correctness.
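
For intuition, the small simulation below is an assumed Python rendering (not Dedalus syntax or its evaluator, and the facts and rules are invented) of those two signature features: every fact carries an explicit timestamp, "mutable" state is just an inductive persistence rule from t to t+1 combined with deletion and insertion events, and asynchronous delivery is a rule whose head timestamp is chosen nondeterministically later than its body.

```python
import random
random.seed(1)
MAX_T = 6

# kb[t] holds the facts true at timestamp t; p("k","v1")@0 is the initial state.
kb = {t: set() for t in range(MAX_T + 1)}
kb[0].add(("p", "k", "v1"))

deletions = {2: {("p", "k", "v1")}}   # del_p("k","v1")@2: stop persisting v1
insertions = {2: {("p", "k", "v2")}}  # ins_p("k","v2")@2: v2 appears at time 3
outbox = {1: [("q", "k")]}            # q("k") sent at time 1
inbox = {}                            # asynchronous deliveries, filled below

for t in range(MAX_T):
    # Inductive rule (persistence, i.e. mutable state):
    #   p(K,V)@t+1 :- p(K,V)@t, not del_p(K,V)@t.
    kb[t + 1] |= {f for f in kb[t] if f not in deletions.get(t, set())}
    kb[t + 1] |= insertions.get(t, set())
    # Asynchronous rule (communication): the receipt timestamp is chosen
    # nondeterministically later than the send timestamp (delay/reordering).
    for fact in outbox.get(t, []):
        arrival = t + random.randint(1, 3)
        inbox.setdefault(arrival, []).append(("recv",) + fact)

for t in range(MAX_T + 1):
    print(t, sorted(kb[t]), inbox.get(t, []))
```
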