Fundamentals of Data Engineering: Plan and Build Robust Data Systems

    Data engineering is the process of designing, building, and maintaining the infrastructure and systems needed to store, process, and analyze large volumes of data. A book on data engineering typically covers topics such as data warehousing, data modeling, data integration, data quality, data governance, and big data technologies, along with guidance on designing and building robust data systems that can handle the scale, complexity, and variability of modern data environments.

    "Fundamentals of Data Engineering: Plan and Build Robust Data Systems" covers the fundamentals of data engineering and gives guidance on designing and implementing robust data systems. As the contents listed on this page show, its topics include data architecture, technology selection, source systems, storage, ingestion, data modeling and transformation, serving data, and security and privacy. Recurring undercurrents such as data management, DataOps, orchestration, and software engineering run through each lifecycle chapter. The book also discusses storage and architecture patterns such as the data warehouse, the data lake, the data lakehouse, and the Hadoop Distributed File System, as well as on-premises, cloud, hybrid cloud, and multicloud deployment.


    book content


    Part I. Foundation and Building Blocks

    1. Data Engineering Described

    What Is Data Engineering?

    Data Engineering Defined

    The Data Engineering Lifecycle

    Evolution of the Data Engineer

    Data Engineering and Data Science

    Data Engineering Skills and Activities

    Data Maturity and the Data Engineer

    The Background and Skills of a Data Engineer

    Business Responsibilities

    Technical Responsibilities

    The Continuum of Data Engineering Roles, from A to B

    Data Engineers Inside an Organization

    Internal-Facing Versus External-Facing Data Engineers

    Data Engineers and Other Technical Roles

    Data Engineers and Business Leadership

    Conclusion

    Additional Resources

    2. The Data Engineering Lifecycle

    What Is the Data Engineering Lifecycle?

    The Data Lifecycle Versus the Data Engineering Lifecycle

    Generation: Source Systems

    Storage

    Ingestion

    Transformation

    Serving Data

    Major Undercurrents Across the Data Engineering Lifecycle

    Security

    Data Management

    DataOps

    Data Architecture

    Orchestration

    Software Engineering

    Conclusion

    Additional Resources

    3. Designing Good Data Architecture

    What Is Data Architecture?

    Enterprise Architecture Defined

    Data Architecture Defined

    “Good” Data Architecture

    Principles of Good Data Architecture

    Principle 1: Choose Common Components Wisely

    Principle 2: Plan for Failure

    Principle 3: Architect for Scalability

    Principle 4: Architecture Is Leadership

    Principle 5: Always Be Architecting

    Principle 6: Build Loosely Coupled Systems

    Principle 7: Make Reversible Decisions

    Principle 8: Prioritize Security

    Principle 9: Embrace FinOps

    Major Architecture Concepts

    Domains and Services

    Distributed Systems, Scalability, and Designing for Failure

    Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices

    User Access: Single Versus Multitenant

    Event-Driven Architecture

    Brownfield Versus Greenfield Projects

    Examples and Types of Data Architecture

    Data Warehouse

    Data Lake

    Convergence, Next-Generation Data Lakes, and the Data Platform

    Modern Data Stack

    Lambda Architecture

    Kappa Architecture

    The Dataflow Model and Unified Batch and Streaming

    Architecture for IoT

    Data Mesh

    Other Data Architecture Examples

    Who’s Involved with Designing a Data Architecture?

    Conclusion

    Additional Resources

    4. Choosing Technologies Across the Data Engineering Lifecycle

    Team Size and Capabilities

    Speed to Market

    Interoperability

    Cost Optimization and Business Value

    Total Cost of Ownership

    Total Opportunity Cost of Ownership

    FinOps

    Today Versus the Future: Immutable Versus Transitory Technologies

    Our Advice

    Location

    On Premises

    Cloud

    Hybrid Cloud

    Multicloud

    Decentralized: Blockchain and the Edge

    Our Advice

    Cloud Repatriation Arguments

    Build Versus Buy

    Open Source Software

    Proprietary Walled Gardens

    Our Advice

    Monolith Versus Modular

    Monolith

    Modularity

    The Distributed Monolith Pattern

    Our Advice

    Serverless Versus Servers

    Serverless

    Containers

    How to Evaluate Server Versus Serverless

    Our Advice

    Optimization, Performance, and the Benchmark Wars

    Big Data...for the 1990s

    Nonsensical Cost Comparisons

    Asymmetric Optimization

    Caveat Emptor

    Undercurrents and Their Impacts on Choosing Technologies

    Data Management

    DataOps

    Data Architecture

    Orchestration Example: Airflow

    Software Engineering

    Conclusion

    Additional Resources

    Part II. The Data Engineering Lifecycle in Depth

    5. Data Generation in Source Systems

    Sources of Data: How Is Data Created?

    Source Systems: Main Ideas

    Files and Unstructured Data

    APIs

    Application Databases (OLTP Systems)

    Online Analytical Processing System

    Change Data Capture

    Logs

    Database Logs

    CRUD

    Insert-Only

    Messages and Streams

    Types of Time

    Source System Practical Details

    Databases

    APIs

    Data Sharing

    Third-Party Data Sources

    Message Queues and Event-Streaming Platforms

    Whom You’ll Work With

    Undercurrents and Their Impact on Source Systems

    Security

    Data Management

    DataOps

    Data Architecture

    Orchestration

    Software Engineering

    Conclusion

    Additional Resources

    6. Storage

    Raw Ingredients of Data Storage

    Magnetic Disk Drive

    Solid-State Drive

    Random Access Memory

    Networking and CPU

    Serialization

    Compression

    Caching

    Data Storage Systems

    Single Machine Versus Distributed Storage

    Eventual Versus Strong Consistency

    File Storage

    Block Storage

    Object Storage

    Cache and Memory-Based Storage Systems

    The Hadoop Distributed File System

    Streaming Storage

    Indexes, Partitioning, and Clustering

    Data Engineering Storage Abstractions

    The Data Warehouse

    The Data Lake

    The Data Lakehouse

    Data Platforms

    Stream-to-Batch Storage Architecture

    Big Ideas and Trends in Storage

    Data Catalog

    Data Sharing

    Schema

    Separation of Compute from Storage

    Data Storage Lifecycle and Data Retention

    Single-Tenant Versus Multitenant Storage

    Whom You’ll Work With

    Undercurrents

    Security

    Data Management

    DataOps

    Data Architecture

    Orchestration

    Software Engineering

    Conclusion

    Additional Resources

    7. Ingestion

    What Is Data Ingestion?

    Key Engineering Considerations for the Ingestion Phase

    Bounded Versus Unbounded Data

    Frequency

    Synchronous Versus Asynchronous Ingestion

    Serialization and Deserialization

    Throughput and Scalability

    Reliability and Durability

    Payload

    Push Versus Pull Versus Poll Patterns

    Batch Ingestion Considerations

    Snapshot or Differential Extraction

    File-Based Export and Ingestion

    ETL Versus ELT

    Inserts, Updates, and Batch Size

    Data Migration

    Message and Stream Ingestion Considerations

    Schema Evolution

    Late-Arriving Data

    Ordering and Multiple Delivery

    Replay

    Time to Live

    Message Size

    Error Handling and Dead-Letter Queues

    Consumer Pull and Push

    Location

    Ways to Ingest Data

    Direct Database Connection

    Change Data Capture

    APIs

    Message Queues and Event-Streaming Platforms

    Managed Data Connectors

    Moving Data with Object Storage

    EDI

    Databases and File Export

    Practical Issues with Common File Formats

    Shell

    SSH

    SFTP and SCP

    Webhooks

    Web Interface

    Web Scraping

    Transfer Appliances for Data Migration

    Data Sharing

    Whom You’ll Work With

    Upstream Stakeholders

    Downstream Stakeholders

    Undercurrents

    Security

    Data Management

    DataOps

    Orchestration

    Software Engineering

    Conclusion

    Additional Resources

    8. Queries, Modeling, and Transformation

    Queries

    What Is a Query?

    The Life of a Query

    The Query Optimizer

    Improving Query Performance

    Queries on Streaming Data

    Data Modeling

    What Is a Data Model?

    Conceptual, Logical, and Physical Data Models

    Normalization

    Techniques for Modeling Batch Analytical Data

    Modeling Streaming Data

    Transformations

    Batch Transformations

    Materialized Views, Federation, and Query Virtualization

    Streaming Transformations and Processing

    Whom You’ll Work With

    Upstream Stakeholders

    Downstream Stakeholders

    Undercurrents

    Security

    Data Management

    DataOps

    Data Architecture

    Orchestration

    Software Engineering

    Conclusion

    Additional Resources

    9. Serving Data for Analytics, Machine Learning, and Reverse ETL

    General Considerations for Serving Data

    Trust

    What’s the Use Case, and Who’s the User?

    Data Products

    Self-Service or Not?

    Data Definitions and Logic

    Data Mesh

    Analytics

    Business Analytics

    Operational Analytics

    Embedded Analytics

    Machine Learning

    What a Data Engineer Should Know About ML

    Ways to Serve Data for Analytics and ML

    File Exchange

    Databases

    Streaming Systems

    Query Federation

    Data Sharing

    Semantic and Metrics Layers

    Serving Data in Notebooks

    Reverse ETL

    Whom You’ll Work With

    Undercurrents

    Security

    Data Management

    DataOps

    Data Architecture

    Orchestration

    Software Engineering

    Conclusion

    Additional Resources

    Part III. Security, Privacy, and the Future of Data Engineering

    10. Security and Privacy

    People

    The Power of Negative Thinking

    Always Be Paranoid

    Processes

    Security Theater Versus Security Habit

    Active Security

    The Principle of Least Privilege

    Shared Responsibility in the Cloud

    Always Back Up Your Data

    An Example Security Policy

    Technology

    Patch and Update Systems

    Encryption

    Logging, Monitoring, and Alerting

    Network Access

    Security for Low-Level Data Engineering

    Conclusion

    Additional Resources

    11. The Future of Data Engineering

    The Data Engineering Lifecycle Isn’t Going Away

    The Decline of Complexity and the Rise of Easy-to-Use Data Tools

    The Cloud-Scale Data OS and Improved Interoperability

    “Enterprisey” Data Engineering

    Titles and Responsibilities Will Morph...

    Moving Beyond the Modern Data Stack, Toward the Live Data Stack

    The Live Data Stack

    Streaming Pipelines and Real-Time Analytical Databases

    The Fusion of Data with Applications

    The Tight Feedback Between Applications and ML

    Dark Matter Data and the Rise of...Spreadsheets?!

    Conclusion

    A. Serialization and Compression Technical Details

    B. Cloud Networking

