Main menu

Pages

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

table of contents title

    Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

    Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

     "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" is a book written by Martin Kleppmann which provides a comprehensive overview of the technologies and techniques used to build data-intensive applications. It covers topics such as data modeling, distributed systems, and data storage, as well as the trade-offs and challenges involved in designing and implementing these systems. The book is intended for engineers and developers who want to understand the underlying concepts and principles of data-intensive applications, and how to design and build systems that are reliable, scalable, and maintainable. It covers both practical and theoretical aspects of data-intensive systems, and provides a wealth of examples and case studies to illustrate key concepts.


    "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" is a book that covers the concepts and strategies behind building data-intensive applications. The book covers topics such as data modeling, distributed systems, data storage, data processing, and data governance. It provides an overview of the various technologies and tools available for building data-intensive systems, as well as best practices for designing and implementing these systems. The book also covers the challenges that come with scaling and maintaining data-intensive systems, and provides guidance on how to address these challenges. Overall, the book is aimed at engineers and architects who are building or maintaining data-intensive systems and want to learn about the underlying concepts and best practices for doing so.


    Book content table


    Part I. Foundations of Data Systems

    1. Reliable, Scalable, and Maintainable Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    Thinking About Data Systems 4

    Reliability 6

    Hardware Faults 7

    Software Errors 8

    Human Errors 9

    How Important Is Reliability? 10

    Scalability 10

    Describing Load 11

    Describing Performance 13

    Approaches for Coping with Load 17

    Maintainability 18

    Operability: Making Life Easy for Operations 19

    Simplicity: Managing Complexity 20

    Evolvability: Making Change Easy 21

    Summary 22

    2. Data Models and Query Languages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    Relational Model Versus Document Model 28

    The Birth of NoSQL 29

    The Object-Relational Mismatch 29

    Many-to-One and Many-to-Many Relationships 33

    Are Document Databases Repeating History? 36

    Relational Versus Document Databases Today 38

    Query Languages for Data 42

    Declarative Queries on the Web 44

    MapReduce Querying 46

    Graph-Like Data Models 49

    Property Graphs 50

    The Cypher Query Language 52

    Graph Queries in SQL 53

    Triple-Stores and SPARQL 55

    The Foundation: Datalog 60

    Summary 63

    3. Storage and Retrieval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

    Data Structures That Power Your Database 70

    Hash Indexes 72

    SSTables and LSM-Trees 76

    B-Trees 79

    Comparing B-Trees and LSM-Trees 83

    Other Indexing Structures 85

    Transaction Processing or Analytics? 90

    Data Warehousing 91

    Stars and Snowflakes: Schemas for Analytics 93

    Column-Oriented Storage 95

    Column Compression 97

    Sort Order in Column Storage 99

    Writing to Column-Oriented Storage 101

    Aggregation: Data Cubes and Materialized Views 101

    Summary 103

    4. Encoding and Evolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

    Formats for Encoding Data 112

    Language-Specific Formats 113

    JSON, XML, and Binary Variants 114

    Thrift and Protocol Buffers 117

    Avro 122

    The Merits of Schemas 127

    Modes of Dataflow 128

    Dataflow Through Databases 129

    Dataflow Through Services: REST and RPC 131

    Message-Passing Dataflow 136

    Summary 139

    Part II. Distributed Data

    5. Replication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

    Leaders and Followers 152

    Synchronous Versus Asynchronous Replication 153

    Setting Up New Followers 155

    Handling Node Outages 156

    Implementation of Replication Logs 158

    Problems with Replication Lag 161

    Reading Your Own Writes 162

    Monotonic Reads 164

    Consistent Prefix Reads 165

    Solutions for Replication Lag 167

    Multi-Leader Replication 168

    Use Cases for Multi-Leader Replication 168

    Handling Write Conflicts 171

    Multi-Leader Replication Topologies 175

    Leaderless Replication 177

    Writing to the Database When a Node Is Down 177

    Limitations of Quorum Consistency 181

    Sloppy Quorums and Hinted Handoff 183

    Detecting Concurrent Writes 184

    Summary 192

    6. Partitioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

    Partitioning and Replication 200

    Partitioning of Key-Value Data 201

    Partitioning by Key Range 202

    Partitioning by Hash of Key 203

    Skewed Workloads and Relieving Hot Spots 205

    Partitioning and Secondary Indexes 206

    Partitioning Secondary Indexes by Document 206

    Partitioning Secondary Indexes by Term 208

    Rebalancing Partitions 209

    Strategies for Rebalancing 210

    Operations: Automatic or Manual Rebalancing 213

    Request Routing 214

    Parallel Query Execution 216

    Summary 216

    7. Transactions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

    The Slippery Concept of a Transaction 222

    The Meaning of ACID 223

    Single-Object and Multi-Object Operations 228

    Weak Isolation Levels 233

    Read Committed 234

    Snapshot Isolation and Repeatable Read 237

    Preventing Lost Updates 242

    Write Skew and Phantoms 246

    Serializability 251

    Actual Serial Execution 252

    Two-Phase Locking (2PL) 257

    Serializable Snapshot Isolation (SSI) 261

    Summary 266

    8. The Trouble with Distributed Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

    Faults and Partial Failures 274

    Cloud Computing and Supercomputing 275

    Unreliable Networks 277

    Network Faults in Practice 279

    Detecting Faults 280

    Timeouts and Unbounded Delays 281

    Synchronous Versus Asynchronous Networks 284

    Unreliable Clocks 287

    Monotonic Versus Time-of-Day Clocks 288

    Clock Synchronization and Accuracy 289

    Relying on Synchronized Clocks 291

    Process Pauses 295

    Knowledge, Truth, and Lies 300

    The Truth Is Defined by the Majority 300

    Byzantine Faults 304

    System Model and Reality 306

    Summary 310

    9. Consistency and Consensus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321

    Consistency Guarantees 322

    Linearizability 324

    What Makes a System Linearizable? 325

    Relying on Linearizability 330

    Implementing Linearizable Systems 332

    The Cost of Linearizability 335

    Ordering Guarantees 339

    Ordering and Causality 339

    Sequence Number Ordering 343

    Total Order Broadcast 348

    Distributed Transactions and Consensus 352

    Atomic Commit and Two-Phase Commit (2PC) 354

    Distributed Transactions in Practice 360

    Fault-Tolerant Consensus 364

    Membership and Coordination Services 370

    Summary 373

    Part III. Derived Data

    10. Batch Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389

    Batch Processing with Unix Tools 391

    Simple Log Analysis 391

    The Unix Philosophy 394

    MapReduce and Distributed Filesystems 397

    MapReduce Job Execution 399

    Reduce-Side Joins and Grouping 403

    Map-Side Joins 408

    The Output of Batch Workflows 411

    Comparing Hadoop to Distributed Databases 414

    Beyond MapReduce 419

    Materialization of Intermediate State 419

    Graphs and Iterative Processing 424

    High-Level APIs and Languages 426

    Summary 429

    11. Stream Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439

    Transmitting Event Streams 440

    Messaging Systems 441

    Partitioned Logs 446

    Databases and Streams 451

    Keeping Systems in Sync 452

    Change Data Capture 454

    Event Sourcing 457

    State, Streams, and Immutability 459

    Processing Streams 464

    Uses of Stream Processing 465

    Reasoning About Time 468

    Stream Joins 472

    Fault Tolerance 476

    Summary 479

    12. The Future of Data Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489

    Data Integration 490

    Combining Specialized Tools by Deriving Data 490

    Batch and Stream Processing 494

    Unbundling Databases 499

    Composing Data Storage Technologies 499

    Designing Applications Around Dataflow 504

    Observing Derived State 509

    Aiming for Correctness 515

    The End-to-End Argument for Databases 516

    Enforcing Constraints 521

    Timeliness and Integrity 524

    Trust, but Verify 528

    Doing the Right Thing 533

    Predictive Analytics 533

    Privacy and Tracking 536

    Summary 543


    Comments