Fundamentals of Data Engineering: Plan and Build Robust Data Systems

Data engineering is the process of designing, building, and maintaining the infrastructure and systems needed to store, process, and analyze large volumes of data, at the scale, complexity, and variability of modern data environments.

"Fundamentals of Data Engineering: Plan and Build Robust Data Systems" covers the fundamentals of data engineering and provides guidance on designing and implementing robust data systems. It likely covers data modeling, data warehousing, data integration, data pipeline design and implementation, data governance, and data quality, and may include case studies and best practices for building and maintaining large-scale data systems. It probably also covers tools and technologies commonly used in data engineering, such as SQL, big data technologies like Hadoop and Spark, and cloud data platforms like AWS, GCP, and Azure.


book content


Part I. Foundation and Building Blocks

1. Data Engineering Described

What Is Data Engineering?

Data Engineering Defined

The Data Engineering Lifecycle

Evolution of the Data Engineer

Data Engineering and Data Science

Data Engineering Skills and Activities

Data Maturity and the Data Engineer

The Background and Skills of a Data Engineer

Business Responsibilities

Technical Responsibilities

The Continuum of Data Engineering Roles, from A to B

Data Engineers Inside an Organization

Internal-Facing Versus External-Facing Data Engineers

Data Engineers and Other Technical Roles

Data Engineers and Business Leadership

Conclusion

Additional Resources

2. The Data Engineering Lifecycle

What Is the Data Engineering Lifecycle?

The Data Lifecycle Versus the Data Engineering Lifecycle

Generation: Source Systems

Storage

Ingestion

Transformation

Serving Data

Major Undercurrents Across the Data Engineering Lifecycle

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

3. Designing Good Data Architecture

What Is Data Architecture?

Enterprise Architecture Defined

Data Architecture Defined

“Good” Data Architecture

Principles of Good Data Architecture

Principle 1: Choose Common Components Wisely

Principle 2: Plan for Failure

Principle 3: Architect for Scalability

Principle 4: Architecture Is Leadership

Principle 5: Always Be Architecting

Principle 6: Build Loosely Coupled Systems

Principle 7: Make Reversible Decisions

Principle 8: Prioritize Security

Principle 9: Embrace FinOps

Major Architecture Concepts

Domains and Services

Distributed Systems, Scalability, and Designing for Failure

Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices

User Access: Single Versus Multitenant

Event-Driven Architecture

Brownfield Versus Greenfield Projects

Examples and Types of Data Architecture

Data Warehouse

Data Lake

Convergence, Next-Generation Data Lakes, and the Data Platform

Modern Data Stack

Lambda Architecture

Kappa Architecture

The Dataflow Model and Unified Batch and Streaming

Architecture for IoT

Data Mesh

Other Data Architecture Examples

Who’s Involved with Designing a Data Architecture?

Conclusion

Additional Resources

4. Choosing Technologies Across the Data Engineering Lifecycle

Team Size and Capabilities

Speed to Market

Interoperability

Cost Optimization and Business Value

Total Cost of Ownership

Total Opportunity Cost of Ownership

FinOps

Today Versus the Future: Immutable Versus Transitory Technologies

Our Advice

Location

On Premises

Cloud

Hybrid Cloud

Multicloud

Decentralized: Blockchain and the Edge

Our Advice

Cloud Repatriation Arguments

Build Versus Buy

Open Source Software

Proprietary Walled Gardens

Our Advice

Monolith Versus Modular

Monolith

Modularity

The Distributed Monolith Pattern

Our Advice

Serverless Versus Servers

Serverless

Containers

How to Evaluate Server Versus Serverless

Our Advice

Optimization, Performance, and the Benchmark Wars

Big Data...for the 1990s

Nonsensical Cost Comparisons

Asymmetric Optimization

Caveat Emptor

Undercurrents and Their Impacts on Choosing Technologies

Data Management

DataOps

Data Architecture

Orchestration Example: Airflow

Software Engineering

Conclusion

Additional Resources

Part II. The Data Engineering Lifecycle in Depth

5. Data Generation in Source Systems

Sources of Data: How Is Data Created?

Source Systems: Main Ideas

Files and Unstructured Data

APIs

Application Databases (OLTP Systems)

Online Analytical Processing System

Change Data Capture

Logs

Database Logs

CRUD

Insert-Only

Messages and Streams

Types of Time

Source System Practical Details

Databases

APIs

Data Sharing

Third-Party Data Sources

Message Queues and Event-Streaming Platforms

Whom You’ll Work With

Undercurrents and Their Impact on Source Systems

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

6. Storage

Raw Ingredients of Data Storage

Magnetic Disk Drive

Solid-State Drive

Random Access Memory

Networking and CPU

Serialization

Compression

Caching

Data Storage Systems

Single Machine Versus Distributed Storage

Eventual Versus Strong Consistency

File Storage

Block Storage

Object Storage

Cache and Memory-Based Storage Systems

The Hadoop Distributed File System

Streaming Storage

Indexes, Partitioning, and Clustering

Data Engineering Storage Abstractions

The Data Warehouse

The Data Lake

The Data Lakehouse

Data Platforms

Stream-to-Batch Storage Architecture

Big Ideas and Trends in Storage

Data Catalog

Data Sharing

Schema

Separation of Compute from Storage

Data Storage Lifecycle and Data Retention

Single-Tenant Versus Multitenant Storage

Whom You’ll Work With

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

7. Ingestion

What Is Data Ingestion?

Key Engineering Considerations for the Ingestion Phase

Bounded Versus Unbounded Data

Frequency

Synchronous Versus Asynchronous Ingestion

Serialization and Deserialization

Throughput and Scalability

Reliability and Durability

Payload

Push Versus Pull Versus Poll Patterns

Batch Ingestion Considerations

Snapshot or Differential Extraction

File-Based Export and Ingestion

ETL Versus ELT

Inserts, Updates, and Batch Size

Data Migration

Message and Stream Ingestion Considerations

Schema Evolution

Late-Arriving Data

Ordering and Multiple Delivery

Replay

Time to Live

Message Size

Error Handling and Dead-Letter Queues

Consumer Pull and Push

Location

Ways to Ingest Data

Direct Database Connection

Change Data Capture

APIs

Message Queues and Event-Streaming Platforms

Managed Data Connectors

Moving Data with Object Storage

EDI

Databases and File Export

Practical Issues with Common File Formats

Shell

SSH

SFTP and SCP

Webhooks

Web Interface

Web Scraping

Transfer Appliances for Data Migration

Data Sharing

Whom You’ll Work With

Upstream Stakeholders

Downstream Stakeholders

Undercurrents

Security

Data Management

DataOps

Orchestration

Software Engineering

Conclusion

Additional Resources

8. Queries, Modeling, and Transformation

Queries

What Is a Query?

The Life of a Query

The Query Optimizer

Improving Query Performance

Queries on Streaming Data

Data Modeling

What Is a Data Model?

Conceptual, Logical, and Physical Data Models

Normalization

Techniques for Modeling Batch Analytical Data

Modeling Streaming Data

Transformations

Batch Transformations

Materialized Views, Federation, and Query Virtualization

Streaming Transformations and Processing

Whom You’ll Work With

Upstream Stakeholders

Downstream Stakeholders

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

9. Serving Data for Analytics, Machine Learning, and Reverse ETL

General Considerations for Serving Data

Trust

What’s the Use Case, and Who’s the User?

Data Products

Self-Service or Not?

Data Definitions and Logic

Data Mesh

Analytics

Business Analytics

Operational Analytics

Embedded Analytics

Machine Learning

What a Data Engineer Should Know About ML

Ways to Serve Data for Analytics and ML

File Exchange

Databases

Streaming Systems

Query Federation

Data Sharing

Semantic and Metrics Layers

Serving Data in Notebooks

Reverse ETL

Whom You’ll Work With

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software Engineering

Conclusion

Additional Resources

Part III. Security, Privacy, and the Future of Data Engineering

10. Security and Privacy

People

The Power of Negative Thinking

Always Be Paranoid

Processes

Security Theater Versus Security Habit

Active Security

The Principle of Least Privilege

Shared Responsibility in the Cloud

Always Back Up Your Data

An Example Security Policy

Technology

Patch and Update Systems

Encryption

Logging, Monitoring, and Alerting

Network Access

Security for Low-Level Data Engineering

Conclusion

Additional Resources

11. The Future of Data Engineering

The Data Engineering Lifecycle Isn’t Going Away

The Decline of Complexity and the Rise of Easy-to-Use Data Tools

The Cloud-Scale Data OS and Improved Interoperability

“Enterprisey” Data Engineering

Titles and Responsibilities Will Morph...

Moving Beyond the Modern Data Stack, Toward the Live Data Stack

The Live Data Stack

Streaming Pipelines and Real-Time Analytical Databases

The Fusion of Data with Applications

The Tight Feedback Between Applications and ML

Dark Matter Data and the Rise of...Spreadsheets?!

Conclusion

A. Serialization and Compression Technical Details

B. Cloud Networking

