Main menu

Pages

Database Internals: A Deep Dive into How Distributed Data Systems Work free PDF

 


Database Internals: A Deep Dive into How Distributed Data Systems Work PDF free download


Understanding the internals of a database is critical when it comes to selecting, utilising, and managing one. However, with so many distributed databases and tools on the market today, it can be difficult to know what each one offers and how they differ. Alex Petrov walks developers through the ideas behind current database and storage engine internals in this practical guide.

You'll read pertinent content from a variety of books, articles, blog posts, and the source code of multiple open source databases throughout the book. Parts one and two contain a list of these resources. The most significant differences between many modern databases can be found in subsystems that control how storage is organised and data is distributed.

This book investigates:

Explore storage classification and taxonomy, as well as B-Tree-based and immutable Log storage engines. Storage building blocks: Structured storage engines, with differences and use-cases for each Learn how to use auxiliary data structures like Page Cache, Buffer Pool, and Write-Ahead Log to organise database files for efficient storage.
Systems that are distributed: Learn how to connect nodes and processes and create complicated communication patterns step by step.
Clusters of databases: What are the most frequent consistency models in current databases, and how can distributed storage systems accomplish consistency?

Preface
Distributed database systems are an integral part of most businesses and the vast
majority of software applications. These applications provide logic and a user inter‐
face, while database systems take care of data integrity, consistency, and redundancy.
Back in 2000, if you were to choose a database, you would have just a few options,
and most of them would be within the realm of relational databases, so differences
between them would be relatively small. Of course, this does not mean that all data‐
bases were completely the same, but their functionality and use cases were very
similar.
Some of these databases have focused on horizontal scaling (scaling out)—improving
performance and increasing capacity by running multiple database instances acting
as a single logical unit: Gamma Database Machine Project, Teradata, Greenplum, Par‐
allel DB2, and many others. Today, horizontal scaling remains one of the most impor‐
tant properties that customers expect from databases. This can be explained by the
rising popularity of cloud-based services. It is often easier to spin up a new instance
and add it to the cluster than scaling vertically (scaling up) by moving the database to
a larger, more powerful machine. Migrations can be long and painful, potentially
incurring downtime.
Around 2010, a new class of eventually consistent databases started appearing, and
terms such as NoSQL, and later, big data grew in popularity. Over the last 15 years,
the open source community, large internet companies, and database vendors have
created so many databases and tools that it’s easy to get lost trying to understand use
cases, details, and specifics.
The Dynamo paper [DECANDIA07], published by the team at Amazon in 2007, had
so much impact on the database community that within a short period it inspired
many variants and implementations. The most prominent of them were Apache Cas‐
sandra, created at Facebook; Project Voldemort, created at LinkedIn; and Riak, cre‐
ated by former Akamai engineers.
Today, the field is changing again: after the time of key-value stores, NoSQL, and
eventual consistency, we have started seeing more scalable and performant databases,
able to execute complex queries with stronger consistency guarantees.

Audience of This Book
In conversations at technical conferences, I often hear the same question: “How can I
learn more about database internals? I don’t even know where to start.” Most of the
books on database systems do not go into details of storage engine implementation,
and cover the access methods, such as B-Trees, on a rather high level. There are very
few books that cover more recent concepts, such as different B-Tree variants and log-
structured storage, so I usually recommend reading papers.
Everyone who reads papers knows that it’s not that easy: you often lack context, the
wording might be ambiguous, there’s little or no connection between papers, and
they’re hard to find. This book contains concise summaries of important database
systems concepts and can serve as a guide for those who’d like to dig in deeper, or as a
cheat sheet for those already familiar with these concepts.
Not everyone wants to become a database developer, but this book will help people
who build software that uses database systems: software developers, reliability engi‐
neers, architects, and engineering managers.
If your company depends on any infrastructure component, be it a database, a mes‐
saging queue, a container platform, or a task scheduler, you have to read the project
change-logs and mailing lists to stay in touch with the community and be up-to-date
with the most recent happenings in the project. Understanding terminology and
knowing what’s inside will enable you to yield more information from these sources
and use your tools more productively to troubleshoot, identify, and avoid potential
risks and bottlenecks. Having an overview and a general understanding of how data‐
base systems work will help in case something goes wrong. Using this knowledge,
you’ll be able to form a hypothesis, validate it, find the root cause, and present it to
other project maintainers.
This book is also for curious minds: for the people who like learning things without
immediate necessity, those who spend their free time hacking on something fun, cre‐
ating compilers, writing homegrown operating systems, text editors, computer
games, learning programming languages, and absorbing new information.
The reader is assumed to have some experience with developing backend systems and
working with database systems as a user. Having some prior knowledge of different
data structures will help to digest material faster.

Why Should I Read This Book?
We often hear people describing database systems in terms of the concepts and algo‐
rithms they implement: “This database uses gossip for membership propagation” (see
Chapter 12), “They have implemented Dynamo,” or “This is just like what they’ve
described in the Spanner paper” (see Chapter 13). Or, if you’re discussing the algo‐
rithms and data structures, you can hear something like “ZAB and Raft have a lot in
common” (see Chapter 14), “Bw-Trees are like the B-Trees implemented on top of log
structured storage” (see Chapter 6), or “They are using sibling pointers like in Blink
-
Trees” (see Chapter 5).
We need abstractions to discuss complex concepts, and we can’t have a discussion
about terminology every time we start a conversation. Having shortcuts in the form
of common language helps us to move our attention to other, higher-level problems.
One of the advantages of learning the fundamental concepts, proofs, and algorithms
is that they never grow old. Of course, there will always be new ones, but new algo‐
rithms are often created after finding a flaw or room for improvement in a classical
one. Knowing the history helps to understand differences and motivation better.
Learning about these things is inspiring. You see the variety of algorithms, see how
our industry was solving one problem after the other, and get to appreciate that work.
At the same time, learning is rewarding: you can almost feel how multiple puzzle
pieces move together in your mind to form a full picture that you will always be able
to share with others.


Scope of This Book
This is neither a book about relational database management systems nor about
NoSQL ones, but about the algorithms and concepts used in all kinds of database sys‐
tems, with a focus on a storage engine and the components responsible for
distribution.
Some concepts, such as query planning, query optimization, scheduling, the rela‐
tional model, and a few others, are already covered in several great textbooks on data‐
base systems. Some of these concepts are usually described from the user’s
perspective, but this book concentrates on the internals. You can find some pointers
to useful literature in the Part II Conclusion and in the chapter summaries. In these
books you’re likely to find answers to many database-related questions you might
have.
Query languages aren’t discussed, since there’s no single common language among
the database systems mentioned in this book.
To collect material for this book, I studied over 15 books, more than 300 papers,
countless blog posts, source code, and the documentation for several open source databases. The rule of thumb for whether or not to include a particular concept in the
book was the question: “Do the people in the database industry and research circles
talk about this concept?” If the answer was “yes,” I added the concept to the long list
of things to discuss.

Structure of This Book
There are some examples of extensible databases with pluggable components (such as
[SCHWARZ86]), but they are rather rare. At the same time, there are plenty of exam‐
ples where databases use pluggable storage. Similarly, we rarely hear database vendors
talking about query execution, while they are very eager to discuss the ways their
databases preserve consistency.
The most significant distinctions between database systems are concentrated around
two aspects: how they store and how they distribute the data. (Other subsystems can
at times also be of importance, but are not covered here.) The book is arranged into
parts that discuss the subsystems and components responsible for storage (Part I) and
distribution (Part II).
Part I discusses node-local processes and focuses on the storage engine, the central
component of the database system and one of the most significant distinctive factors.
First, we start with the architecture of a database management system and present
several ways to classify database systems based on the primary storage medium and
layout.
We continue with storage structures and try to understand how disk-based structures
are different from in-memory ones, introduce B-Trees, and cover algorithms for effi‐
ciently maintaining B-Tree structures on disk, including serialization, page layout,
and on-disk representations. Later, we discuss multiple variants to illustrate the power
of this concept and the diversity of data structures influenced and inspired by B-
Trees.
Last, we discuss several variants of log-structured storage, commonly used for imple‐
menting file and storage systems, motivation, and reasons to use them.
Part II is about how to organize multiple nodes into a database cluster. We start with
the importance of understanding the theoretical concepts for building fault-tolerant
distributed systems, how distributed systems are different from single-node applica‐
tions, and which problems, constraints, and complications we face in a distributed
environment.
After that, we dive deep into distributed algorithms. Here, we start with algorithms
for failure detection, helping to improve performance and stability by noticing and
reporting failures and avoiding the failed nodes. Since many algorithms discussed
later in the book rely on understanding the concept of leadership, we introduce sev‐
eral algorithms for leader election and discuss their suitability.
As one of the most difficult things in distributed systems is achieving data consis‐
tency, we discuss concepts of replication, followed by consistency models, possible
divergence between replicas, and eventual consistency. Since eventually consistent
systems sometimes rely on anti-entropy for convergence and gossip for data dissemi‐
nation, we discuss several anti-entropy and gossip approaches. Finally, we discuss log‐
ical consistency in the context of database transactions, and finish with consensus
algorithms.
It would’ve been impossible to write this book without all the research and publica‐
tions. You will find many references to papers and publications in the text, in square
brackets with monospace font; for example, [DECANDIA07]. You can use these ref‐
erences to learn more about related concepts in more detail.
After each chapter, you will find a summary section that contains material for further
study, related to the content of the chapter.


Part I. Storage Engines
- Introduction and Overview
- DBMS Architecture
- Memory- Versus Disk-Based DBMS
- Durability in Memory-Based Stores
- Column- Versus Row-Oriented DBMS
- Row-Oriented Data Layout
- Column-Oriented Data Layout
- Distinctions and Optimizations
- Wide Column Stores
- Data Files and Index Files
- Data Files
- Index Files
- Primary Index as an Indirection
- Buffering, Immutability, and Ordering
- Summary
- B-Tree Basics.
- Binary Search Trees
- Tree Balancing
- Trees for Disk-Based Storage
- Disk-Based Structures.
- Hard Disk Drives
- Solid State Drives
- On-Disk Structures
- Ubiquitous B-Trees
- B-Tree Hierarchy
- Separator Keys
- B-Tree Lookup Complexity
- B-Tree Lookup Algorithm
- Counting Keys
- B-Tree Node Splits
- B-Tree Node Merges
- Summary
- File Formats
- Motivation
- Binary Encoding
- Primitive Types
- Strings and Variable-Size Data
- Bit-Packed Data: Booleans, Enums, and Flags
- General Principles
- Page Structure
- Slotted Pages
- Cell Layout
- Combining Cells into Slotted Pages
- Managing Variable-Size Data
- Versioning
- Checksumming
- Summary
- Implementing B-Trees
- Page Header
- Magic Numbers
- Sibling Links
- Rightmost Pointers
- Node High Keys
- Overflow Pages
- Binary Search
- Binary Search with Indirection Pointers
- Propagating Splits and Merges
- Breadcrumbs
- Rebalancing
- Right-Only Appends
- Bulk Loading
- Compression
- Vacuum and Maintenance
- Fragmentation Caused by Updates and Deletes
- Page Defragmentation
- Summary
- Transaction Processing and Recovery
- Buffer Management
- Caching Semantics
- Cache Eviction
- Locking Pages in Cache
- Page Replacement
- Recovery
- Log Semantics
- Operation Versus Data Log
- Steal and Force Policies
- ARIES
- Concurrency Control
- Serializability
- Transaction Isolation
- Read and Write Anomalies
- Isolation Levels
- Optimistic Concurrency Control
- Multiversion Concurrency Control
- Pessimistic Concurrency Control
- Lock-Based Concurrency Control
- Summary
- B-Tree Variants
- Copy-on-Write
- Implementing Copy-on-Write: LMDB
- Abstracting Node Updates
- Lazy B-Trees
- WiredTiger
- Lazy-Adaptive Tree
- FD-Trees
- Fractional Cascading
- Logarithmic Runs
- Bw-Trees
- Update Chains
- Taming Concurrency with Compare-and-Swap
- Structural Modification Operations
- Consolidation and Garbage Collection
- Cache-Oblivious B-Trees
- van Emde Boas Layout
- Summary
- Log-Structured Storage
- LSM Trees
- LSM Tree Structure
- Updates and Deletes
- LSM Tree Lookups
- Merge-Iteration
- Reconciliation
- Maintenance in LSM Trees
- Read, Write, and Space Amplification
- RUM Conjecture
- Implementation Details
- Sorted String Tables
- Bloom Filters
- Skiplist
- Disk Access
- Compression
- Unordered LSM Storage
- Bitcask
- WiscKey
- Concurrency in LSM Trees
- Log Stacking
- Flash Translation Layer
- Filesystem Logging
- LLAMA and Mindful Stacking
- Open-Channel SSDs
- Summary
Part I Conclusion
Part II. Distributed Systems
- Introduction and Overview
- Concurrent Execution
- Shared State in a Distributed System
- Fallacies of Distributed Computing
- Processing
Clocks and Time
- State Consistency
- Local and Remote Execution
- Need to Handle Failures
- Network Partitions and Partial Failures
- Cascading Failures
- Distributed Systems Abstractions
- Links
- Two Generals’ Problem
- FLP Impossibility
- System Synchrony
- Failure Models
- Crash Faults
- Omission Faults
- Arbitrary Faults
- Handling Failures
- Summary
- Failure Detection
- Heartbeats and Pings
- Timeout-Free Failure Detector
- Outsourced Heartbeats
- Phi-Accural Failure Detector
- Gossip and Failure Detection
- Reversing Failure Detection Problem Statement
- Summary
- Leader Election
- Bully Algorithm
- Next-In-Line Failover
- Candidate/Ordinary Optimization
- Invitation Algorithm
- Ring Algorithm 211
Summary
- Replication and Consistency
- Achieving Availability
- Infamous CAP
- Use CAP Carefully
- Harvest and Yield
- Shared Memory
- Ordering
- Consistency Models
- Strict Consistency
- Linearizability
- Sequential Consistency
- Causal Consistency
- Session Models
- Eventual Consistency
- Tunable Consistency
- Witness Replicas
- Strong Eventual Consistency and CRDTs
- Summary
- Anti-Entropy and Dissemination
- Read Repair
- Digest Reads
- Hinted Handoff
- Merkle Trees
- Bitmap Version Vectors
- Gossip Dissemination
- Gossip Mechanics
- Overlay Networks
- Hybrid Gossip
- Partial Views
- Summary
- Distributed Transactions
- Making Operations Appear Atomic
- Two-Phase Commit
- Cohort Failures in 2PC
- Coordinator Failures in 2PC
- Three-Phase Commit
- Coordinator Failures in 3PC
- Distributed Transactions with Calvin
- Distributed Transactions with Spanner
- Database Partitioning
- Consistent Hashing
- Distributed Transactions with Percolator
- Coordination Avoidance
- Summary
 - Consensus
- Broadcast
- Atomic Broadcast
- Virtual Synchrony
- Zookeeper Atomic Broadcast (ZAB)
- Paxos
- Paxos Algorithm
- Quorums in Paxos
- Failure Scenarios
- Multi-Paxos
- Fast Paxos
- Egalitarian Paxos
- Flexible Paxos
- Generalized Solution to Consensus
- Raft
- Leader Role in Raft
- Failure Scenarios
- Byzantine Consensus
- PBFT Algorithm
- Recovery and Checkpointing
- Summary
Part II 
- Conclusion
- A.Bibliography
- Index

PDF free download
EPUB free download

Comments

table of contents title