Learning Fundamentals of Data Engineering

Fundamentals Of Data Engineering

These are my notes from the book Fundamentals Of Data Engineering.

Although you can access the content through the GitHub page, this is served with mkdocs-material 💕

Header image

Why? đŸ€”

This is an amazing book for everyone involved in data.

By the end of the book you'll be better equipped to:

Which is a pretty good deal. 🎉

I thought I could share some of my highlights from it. If you want to discover more about any of the topics, please check out the book.

If you’re interested in the book, you can purchase one. It was previously available via Redpanda, but the free copy is no longer offered. Now, that link redirects to a guide, which is still useful.

The Structure 🔹

The book consists of 3 parts, made up of 11 chapters and 2 appendices.

Here is the tree of the book.

And the following are my notes, following this structure.

Fundamentals of Data Engineering
├── Part 1 – Foundation and Building Blocks
│   ├── 1. Data Engineering Described
│   ├── 2. The Data Engineering Lifecycle
│   ├── 3. Designing Good Data Architecture
│   └── 4. Choosing Technologies Across the Data Engineering Lifecycle
├── Part 2 – The Data Engineering Lifecycle in Depth
│   ├── 5. Data Generation in Source Systems
│   ├── 6. Storage
│   ├── 7. Ingestion
│   ├── 8. Orchestration
│   └── 9. Queries, Modeling, and Transformation
└── Part 3 – Security, Privacy, and the Future of Data Engineering
    ├── 10. Security and Privacy
    └── 11. The Future of Data Engineering

Part 1 – Foundation and Building Blocks 🏂

Let's discover the land of data together.


1. Data Engineering Described

Let's clarify why we are here.

Definition of Data Engineer đŸ€š

Who is a data engineer? What do they do?

Here is Joe and Matt's definition:

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.

Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.

If you do not understand these definitions fully, don't worry. 💕

Throughout the book, we will unpack this definition.

Data Engineering Lifecycle

The book is centered around an idea called the data engineering lifecycle, which gives data engineers the holistic context to view their role.

So we'll dive deep into these five stages:

and consider the undercurrents of them.

I believe this is a fantastic way to see the field. It's free from any single technology, and it helps us focus on the end goal. đŸ„ł

Evolution of the Data Engineer

This bit gives us a history of the Data Engineering field.

Most important points are:

Data engineers managing the data engineering lifecycle have better tools and techniques than ever before. All we have to do is to master them. 😌

Data Hierarchy Of Needs

Another crucial idea to understand is the Data Hierarchy Of Needs:

Special thanks to Monica Rogati.

We need a solid foundation for effective AI and ML.

Here is how I interpret this image:

Collect:

We gather the raw inputs that fuel all downstream data work.

Instrumentation

Instrumentation means embedding code or tools into applications & systems to collect data about usage or performance.

Examples:

📌 Goal: Make sure data is being captured from the start.
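To make this concrete, here is a minimal, hypothetical Python sketch of instrumentation: a decorator that times a function and emits a usage event (the print stands in for shipping the event to an analytics pipeline).

```python
import time

def instrumented(func):
    """Wrap a function so every call emits a small usage/performance event."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        event = {
            "event": f"{func.__name__}_called",
            "duration_ms": round((time.perf_counter() - start) * 1000, 2),
        }
        print(event)  # stand-in for sending the event to an analytics pipeline
        return result
    return wrapper

@instrumented
def checkout(cart_total):
    return f"charged {cart_total}"  # pretend business logic

checkout(42.50)
```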

Logging

Logging is the automatic recording of system or application events — it's like keeping a diary of what the system is doing.

Examples:

📌 Goal: Enable debugging, monitoring, and behavioral analysis.
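As a minimal sketch, assuming a hypothetical checkout service, Python's standard logging module already covers the basics; the log level controls what actually gets recorded:

```python
import logging

# In production, handlers would typically ship records to files or a log
# aggregation service instead of the console.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("checkout-service")

logger.debug("Cart recalculated")            # filtered out at INFO level
logger.info("Order 1234 placed by user 42")
logger.warning("Payment retry for order 1234")
logger.error("Payment failed for order 1234")
```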

External Data

This refers to data sourced from outside your system, like 3rd-party APIs or public datasets.

Examples:

📌 Goal: Enrich internal data with external context.

Clickstream Data

Clickstream data tracks how users navigate through a website or app, capturing sequences of events.

Examples:

📌 Goal: Understand user behavior and intent.

Sensors

Sensors collect physical world signals and convert them into data.

Examples:

📌 Goal: Capture real-time data from the physical environment.

User-Generated Content

This is any content that users make themselves, either actively or passively.

Examples:

📌 Goal: Leverage user input for insights, personalization, or community building.

Move / Store

The Move/Store stage of the Data Hierarchy of Needs is all about getting the data from its source to where it can be used — reliably, at scale, and efficiently.

Here's what each part means:

Reliable Data Flow

This ensures that data moves consistently and accurately from one system to another without loss, duplication, or delay.

Examples:

📌 Goal: Trust that your data is flowing smoothly and predictably.

Infrastructure

Infrastructure includes the compute, storage, and networking resources that support data movement and storage.

Examples:

📌 Goal: Provide the foundation for scalable and secure data systems.

Pipelines

Pipelines are automated systems that move and transform data from source to destination in a defined sequence.

Examples:

📌 Goal: Automate reliable and repeatable data movement and processing.

ETL (Extract, Transform, Load)

ETL refers to the process of extracting data, cleaning or transforming it, and loading it into a final system like a warehouse or database.

📌 Goal: Prepare data for consumption by analytics, ML, or applications.

Data Storage

This is where data lives long-term, structured in a way that it can be easily accessed, queried, or analyzed.

Examples:

📌 Goal: Store data cost-effectively while ensuring durability and accessibility.

Explore / Transform

The Explore and Transform stage of the Data Hierarchy of Needs is where raw data is shaped into something useful for analysis or modeling. Here's a breakdown of its three components:

Cleaning

This is about removing errors and inconsistencies from raw data to make it usable.

Examples:

📌 Goal: Make the data trustworthy and consistent.

Preparation

This involves transforming clean data into a form suited for analysis or modeling.

Examples:

📌 Goal: Reshape data to match your downstream tasks (analytics, ML, etc.).

Anomaly Detection

This is the process of identifying unexpected, unusual, or suspicious data points that could indicate errors or rare events.

Examples:

📌 Goal: Spot and address data quality issues or operational anomalies before they affect insights or models.

Aggregate / Label

The Aggregate & Label stage of the Data Hierarchy of Needs is about creating summarized, structured, and labeled data that supports analysis, ML, and business insights. Let’s break down each term:

Analytics

This is the process of examining data to draw insights, usually through queries, reports, and dashboards.

Examples:

📌 Goal: Support business decisions with summarized views of data.

Metrics

Metrics are quantifiable measurements used to track performance over time.

Examples:

📌 Goal: Provide standardized KPIs (Key Performance Indicators) that align teams.

Segments

Segments are subsets of data grouped by shared characteristics.

Examples:

📌 Goal: Enable targeted analysis, personalization, or experimentation.

Aggregates

Aggregates are summarized data values computed from raw data, often using functions like sum(), avg(), count(), etc.

Examples:

📌 Goal: Reduce data volume and highlight meaningful patterns.
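As a small illustration (using DuckDB over a pandas DataFrame, like the query example later in these notes, with made-up data), here is how raw order rows collapse into per-customer aggregates:

```python
import pandas as pd
import duckdb

orders = pd.DataFrame({
    "customer": ["Alice", "Bob", "Alice", "Eva", "Bob"],
    "amount": [20.0, 35.5, 12.0, 50.0, 7.5],
})

# Summarize raw order rows into per-customer aggregates
result = duckdb.query("""
    SELECT customer,
           count(*)    AS order_count,
           sum(amount) AS total_spent,
           avg(amount) AS avg_order_value
    FROM orders
    GROUP BY customer
    ORDER BY total_spent DESC
""").to_df()

print(result)
```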

Features

In machine learning, features are input variables used to train a model.

Examples:

📌 Goal: Create informative variables that help ML models make predictions.

Training Data

This is labeled data used to train machine learning models.

Examples:

📌 Goal: Provide examples of correct behavior for supervised learning.

Learn / Optimize

The Learn/Optimize stage in the Data Hierarchy of Needs is the pinnacle — it's where data actually drives decisions or automation through learning, experimentation, and predictive models.

A/B Testing

A/B testing is a method of comparing two or more versions of something (like a web page or product feature) to see which performs better.

How it works: Split users into groups → Show each group a different version (A or B) → Measure outcomes (e.g., clicks, purchases).

Purpose: Understand which version leads to better results using data-driven evidence.

📌 Think of it as controlled experimentation to validate ideas.
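A tiny, hypothetical sketch of the analysis side in pandas: split the events by variant and compare conversion rates (a real analysis would add a significance test before declaring a winner):

```python
import pandas as pd

# One row per user: which variant they saw and whether they converted
events = pd.DataFrame({
    "variant":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "converted": [0,   1,   0,   1,   1,   1,   0,   1],
})

summary = events.groupby("variant")["converted"].agg(users="count", conversions="sum")
summary["conversion_rate"] = summary["conversions"] / summary["users"]
print(summary)
```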

Experimentation

A broader concept than A/B testing, experimentation includes testing changes or ideas under controlled conditions to learn causal effects.

Examples:

📌 Goal: Use experiments to explore how changes impact behavior or outcomes.

Classical Machine Learning (ML)

This includes well-established algorithms that learn patterns from data to make predictions or decisions.

Examples:

📌 Used when the data and problem are well-structured and interpretable.

Artificial Intelligence (AI)

AI is a broader field focused on building systems that can perform tasks that usually require human intelligence.

Examples:

📌 Goal: Build intelligent systems that can perceive, reason, and act.

Deep Learning (DL)

Deep Learning is a subset of ML based on neural networks with many layers, designed to learn complex representations of data.

Examples:

📌 Used when the problem is too complex for classical ML and massive data is available.

What's the focus for Data Engineer?

So, even though almost everyone is focused on AI/ML applications, a strong Data Engineering Team should provide them with an infrastructure that has:

These are really simple things, but they can be really hard to implement in complex systems. đŸ€­

As engineers, we work under constraints. We must optimize along these axes:

Data Maturity

Another great idea from this chapter is Data Maturity.

Data Maturity refers to the organization's advancement in utilizing, integrating, and maximizing data capabilities.

Data maturity isn’t determined solely by a company’s age or revenue; an early-stage startup may demonstrate higher data maturity than a century-old corporation with billions in annual revenue.

What truly matters is how effectively the company leverages data as a competitive advantage.

Let's understand this with some examples:

đŸŒ Low Data Maturity
Example: A small retail store writes sales down in a notebook.

🧒 Early Data Maturity
Example: A startup uses Excel to track customer data and email open rates.

🧑‍🎓 Growing Data Maturity
Example: A mid-sized company uses a dashboard to track user behavior and marketing ROI.

🧠 High Data Maturity
Example: An e-commerce company uses real-time data to personalize recommendations and detect fraud.

🧙 Very High Data Maturity
Example: A global tech company automatically retrains ML models, predicts demand, and adjusts supply chain in real-time.

How to become a Data Engineer ? đŸ„ł

Data engineering is a rapidly growing field, but lacks a formal training path. Universities don't offer standardized programs, and while boot camps exist, a unified curriculum is missing.

People enter the field with diverse backgrounds, often transitioning from roles like software engineering or data analysis, and self-study is crucial. 🏂

A data engineer must master data management, technology tools, and understand the needs of data consumers like analysts and scientists.

Success in data engineering requires both technical expertise and a broader understanding of the business impact of data.

Business Responsibilities:

Example: You’re building a dashboard for the marketing team. You explain in simple terms how long it will take and ask them what insights matter most—without using technical jargon like “ETL pipelines” or “schema evolution.” Then you talk to your fellow engineers in detail about data modeling and infrastructure.

Example: A product manager says, “We want to know why users drop off after sign-up.” You don’t just jump into building something—you ask follow-up questions: “What’s your definition of drop-off? Are we looking at mobile or web users? Over what time frame?”

Example: You don’t wait months to launch a data product. Instead, you release a small working version (MVP), get feedback from stakeholders, and iterate quickly. You write tests and automate your pipeline deployments using CI/CD like a software engineer.

Example: Instead of running an expensive BigQuery job every hour, you optimize the SQL and reduce the schedule to once every 6 hours—saving the company hundreds or thousands of dollars a month in compute costs.

Example: You hear your team wants to adopt Apache Iceberg. You’ve never used it, so you take an online course, read the docs, and build a mini project over the weekend to see how it works.

A successful data engineer always zooms out to understand the big picture and how to achieve outsized value for the business.

Technical Responsibilities:

Data engineers remain software engineers, in addition to their many other roles.

What languages should a data engineer know?

You can also add a CI/CD tool like Jenkins, containerization with Docker, and orchestration with Kubernetes to this list.

Data Engineers and Other Technical Roles

It is important to understand the technical stakeholders that you'll be working with.

The crucial idea is that, you are a part of a bigger team. As a unit, you are trying to achieve something. 🏉

A great tactic is to understand the workflows of the people who sit upstream or downstream of your work.

So feel free to research all the technical roles with a prompt to an LLM like the following:

As a Data Engineer, one of my stakeholders is the Machine Learning Engineers. Can you help me understand what they do, how they do it, and how their work quality is measured? I want to serve them in the best way possible.

Data Engineers and Leadership

Data engineers act as connectors within organizations, bridging business and data teams.

They now play a key role in strategic planning, helping align business goals with data initiatives and supporting data architects in driving data-centric projects.

Data in the C-Suite

C-level executives increasingly recognize data as a core asset.

The CEO typically partners with technical leaders on high-level data strategies without diving into technical specifics.

The CIO focuses on internal IT systems and often collaborates with data engineers on initiatives like cloud migrations and infrastructure planning.

The CTO handles external-facing technologies, working with data teams to integrate information from customer-facing platforms such as web and mobile applications.

The Chief Data Officer (CDO) oversees data strategy and governance, ensuring data delivers tangible business value.

There are other examples, but these are enough to demonstrate the value we bring as data engineers.

Conclusion

Now we know about:

Let's dive deep on the lifecycle. đŸ„ł


2. The Data Engineering Lifecycle 🐩

We can move beyond viewing data engineering as a specific collection of data technologies, which is a big trap. 😼

We can think with data engineering lifecycle. 💯

It shows the stages that turn raw data ingredients into a useful end product, ready for consumption by analysts, data scientists, ML engineers, and others.

Let's remember the figure for the lifecycle.

In the following chapters we'll dive deep for each of these stages, but let's learn the useful questions to ask about them first.

Arguably, the most impactful contribution we can make lies in the answers to these questions. Regardless of the company structure or the system we’re working on, asking the right questions about data generation, ingestion, storage, transformation, and serving allows us to identify opportunities for improvement and drive meaningful change.

Generation: Source Systems 🌊

A source system is where data originates in the data engineering process.

Examples of source systems include IoT devices, application message queues, or transactional databases.

Data engineers use data from these source systems but typically do not own or control them.

Therefore, it's important for data engineers to understand how these source systems operate, how they generate data, how frequently and quickly they produce data (frequency & velocity) and the different types of data they generate.

Here is a set of evaluation questions for Source Systems:

We'll learn more about Source Systems in Chapter 5.

Storage đŸŒ±

Choosing the right data storage solution is critical yet complex in data engineering because it affects all stages of the data lifecycle.

Cloud architectures often use multiple storage systems that offer capabilities beyond storage, like data transformation and querying.

Storage intersects with other stages such as ingestion, transformation, and serving, influencing how data is used throughout the entire pipeline.

Here is a set of evaluation questions for Storage:

Regardless of the storage type, the temperature of data is a good frame to interpret storage and data.

Data access frequency defines data "temperatures": Hot data is frequently accessed and needs fast retrieval; lukewarm data is accessed occasionally; cold data is rarely accessed and suited for archival storage. Cloud storage tiers match these temperatures, balancing cost with retrieval speed.

We'll learn more about Storage in Chapter 6.

Ingestion đŸ§˜â€â™‚ïž

Data ingestion from source systems is a critical stage in the data engineering lifecycle and often represents the biggest bottleneck.

Source systems are typically outside of our control and may become unresponsive or provide poor-quality data.

Ingestion services might also fail for various reasons, halting data flow and impacting storage, processing, and serving stages. These unreliabilities can ripple across the entire lifecycle, but if we've addressed the key questions about source systems, we can better mitigate these challenges.

Here is a set of evaluation questions for Ingestion:

Batch processing is often preferred over streaming due to the added complexity and cost of streaming; real-time streaming should be used only when necessary.

Data ingestion involves push models (source sends data) and pull models (system retrieves data), often combined in pipelines. Traditional ETL uses the pull model.

Continuous Change Data Capture (CDC) can be push-based (triggers on data changes) or pull-based (reading logs).

Streaming ingestion pushes data directly to endpoints, ideal for scenarios like IoT sensors emitting events, simplifying real-time processing by treating each data point as an event.

We'll learn more about Ingestion in Chapter 7.

Transformation 🔹

After data is ingested and stored, it must be transformed into usable formats for downstream purposes like reporting, analysis, or machine learning.

Transformation converts raw, inert data into valuable information by correcting data types, standardizing formats, removing invalid records, and preparing data for further processing.

This preparation can be applying normalization, performing large-scale aggregations for reports or extracting features for ML models.

Here is a set of evaluation questions for Transformation:

Transformation often overlaps with other stages of the data lifecycle, such as ingestion, where data may be enriched or formatted on the fly.

Business logic plays a significant role in shaping transformations, especially in data modeling, to provide clear insights into business processes and ensure consistent implementation across systems.

Additionally, data featurization is an important transformation for machine learning, involving the extraction and enhancement of data features for model training—a process that data engineers can automate once defined by data scientists.

We'll learn more about Transformation in Chapter 8.

Serving Data đŸ€č

After data is ingested, stored, and transformed, the goal is to derive value from it.

In the beginning of the book, we've seen how data engineering is enabling predictive analysis, descriptive analytics, and reports.

In simple terms, here is what they are:

Here is a set of questions to make a solid Serving Stage:

ML is cool, but it’s generally best to develop competence in analytics before moving to ML.

We'll dive deep on Serving in Chapter 9.

The Undercurrents

Data engineering is evolving beyond just technology, integrating traditional practices like data management and cost optimization with newer approaches such as DataOps.

These key "undercurrents"—including security, data architecture, orchestration, and software engineering—support the entire data engineering lifecycle.

Let's talk about them in single sentences here, and we'll explore them in greater detail throughout the book.

Security

Security is paramount in data engineering, requiring engineers to enforce the principle of least privilege, cultivate a security-focused culture, implement robust access controls and encryption, and possess comprehensive security administration skills to effectively protect sensitive data.

Data Management

Modern data engineering integrates comprehensive data management practices—such as governance and lifecycle management—transforming it from a purely technical role into a strategic function essential for treating data as a vital organizational asset.

DataOps

DataOps applies Agile and DevOps principles to data engineering by fostering a collaborative culture and implementing automation, monitoring, and incident response practices to enhance the quality, speed, and reliability of data products.

Data Architecture

Data architecture is a fundamental aspect of data engineering that involves understanding business requirements, designing cost-effective and simple data systems, and collaborating with data architects to support an organization’s evolving data strategy.

Orchestration

Orchestration in DataOps is the coordinated management of data jobs using systems like Apache Airflow to handle dependencies, scheduling, monitoring, and automation, ensuring efficient and reliable execution of data workflows.

Software Engineering

Software engineering is fundamental to data engineering, encompassing the development and testing of data processing code, leveraging and contributing to open source frameworks, managing streaming complexities, implementing infrastructure and pipelines as code, and addressing diverse technical challenges to support and advance evolving data systems.

Conclusion 🌠

The data engineering lifecycle, supported by key undercurrents such as security, data management, DataOps, architecture, orchestration, and software engineering, provides a comprehensive framework for data engineers to optimize ROI, reduce costs and risks, and maximize the value and utility of data.

Let's learn to think with this mindset! 🧠


3. Designing Good Data Architecture 🎋

What is it? Here is a definition:

We can divide Data Architecture into two parts, Operational and Technical.

Here are my definitions:

Operational architecture involves the practical needs related to people, processes, and technology. For example, it looks at which business activities the data supports, how the company maintains data quality, and how quickly data needs to be available for use after it's made.

Technical architecture explains the methods for collecting, storing, changing, and delivering data throughout its lifecycle. For example, it might describe how to move 10 TB of data every hour from a source database to a data lake.

In short, operational architecture defines what needs to be done, while technical architecture explains how to do it.

Effective data architecture meets business needs by using standardized, reusable components while remaining adaptable and balancing necessary compromises. It's also dynamic and continually evolving. It is never truly complete.

By definition, adaptability and growth are fundamental to the essence and objectives of data architecture.

Next, let's explore the principles that underpin good data architecture. 😌

Principles of Good Data Architecture ✅

Here are 9 principles to keep in mind.

1: Choose Common Components Wisely

A key responsibility of data engineers is selecting shared components and practices—such as object storage, version control systems, observability tools, orchestration platforms, and processing engines—that are widely usable across the organization.

Effective selection promotes collaboration, breaks down silos, and enhances flexibility by leveraging common knowledge and skills.

These shared tools should be accessible to all relevant teams, encouraging the use of existing solutions over creating new ones, while ensuring robust permissions and security to safely share resources.

Cloud platforms are ideal for implementing these components, allowing teams to access a common storage layer with specialized tools for their specific needs.

Balancing organizational-wide requirements with the flexibility for specialized tasks is essential to support various projects and foster collaboration without imposing one-size-fits-all solutions.

Further details are provided in Chapter 4.

2: Plan for Failure

Modern hardware is generally reliable, but failures are inevitable over time.

To build robust data systems, it's essential to design with potential failures in mind by understanding key concepts such as availability (the percentage of time a service is operational), reliability (the likelihood a system performs its intended function), recovery time objective (the maximum acceptable downtime), and recovery point objective (the maximum acceptable data loss).

These factors guide engineers in making informed architectural decisions to effectively handle and mitigate failure scenarios, ensuring systems remain resilient and meet business requirements.

3: Architect for Scalability

Scalability in data systems means the ability to automatically increase capacity to handle large data volumes or temporary spikes and decrease it to reduce costs when demand drops.

Elastic systems adjust dynamically, sometimes even scaling to zero when not needed, as seen in serverless architectures. However, choosing the right scaling strategy is essential to avoid complexity and high costs.

This requires carefully assessing current usage, anticipating future growth, and selecting appropriate database architectures to ensure efficiency and cost-effectiveness as the organization expands.

4: Architecture Is Leadership

Data architects combine strong technical expertise with leadership and mentorship to make technology decisions, promote flexibility and innovation, and guide data engineers in achieving organizational goals.

It really helps to have a growth mindset. 🧠

5: Always Be Architecting ♻

Data architects continuously design and adapt architectures in an agile, collaborative way, responding to business and technology changes by planning and prioritizing updates.

"Innovation requires iteration." (Mark Papermaster)

6: Build Loosely Coupled Systems

Loose coupling through independent components and APIs allows teams to collaborate efficiently and evolve systems flexibly.

7: Make Reversible Decisions

To stay agile in a rapidly changing data landscape, architects should make reversible decisions that keep architectures simple and adaptable.

You can read this shareholder letter from Jeff Bezos on reversible decisions.

8: Prioritize Security

Data engineers must take responsibility for system security by adopting zero-trust models and the shared responsibility approach, ensuring robust protection in cloud-native environments and preventing breaches through proper configuration and proactive security practices.

9: Embrace FinOps

FinOps is a cloud financial management practice that encourages collaboration between engineering and finance teams to optimize cloud spending through data-driven decisions and continuous cost monitoring.

We should embrace FinOps! It helps us defend our decisions.


Now that we have a grasp of the fundamental principles of effective data architecture, let's explore the key concepts necessary for designing and building robust data systems in more detail.

Major Architecture Concepts

To learn more about:

please read this part. đŸ„°

Next, we’ll explore different types of architectures.

Examples and Types of Data Architecture

Here, we can explore some 101 information about:

which is the foundational knowledge we'll build on later.

Conclusion

Architectural design involves close collaboration with business teams to weigh different options.

For instance:

Gaining a strong grasp of these decision points will equip us to make sound, reasonable choices.

Next, we’ll explore approaches to selecting the right technologies for our data architecture and throughout the data engineering lifecycle. 😍


4. Choosing Technologies Across the Data Engineering Lifecycle

Chapter 3 explored the concept of good data architecture and its importance.

Now, we shift focus to selecting the right technologies to support this architecture.

For data engineers, choosing the right tools is crucial for building high-quality data products.

The key question to ask when evaluating a technology is straightforward:

Does it add value to the data product and the broader business? 💡

One common misconception is equating architecture with tools.

Architecture is strategic, while tools are tactical.

Key Factors for Choosing Data Technologies

When selecting technologies to support your data architecture, consider the following across the data engineering lifecycle:

These points might be helpful for you to demonstrate that your approach is rooted in industry best practices and aligned with the system’s goals.

Read this part in detail on how to choose the right tooling.

Continue with the second part of the book here đŸ„ł

Part 2 – The Data Engineering Lifecycle in Depth 🔬

Now we move on to the second part of the book, which helps us understand the core idea.


5. Data Generation in Source Systems

Before getting the raw data, we must understand where the data exists, how it is generated, and its characteristics.

Let's make sure we get the absolute basics about source systems correctly. 🍓

Main Ideas on Source Systems

Files

A file is a sequence of bytes, typically stored on a disk. Applications often write data to files. Files may store local parameters, events, logs, images, and audio.

In addition, files are a universal medium of data exchange. As much as data engineers wish that they could get data programmatically, much of the world still sends and receives files.

APIs

APIs are a standard data exchange method.

A simple example would be the "log in with Twitter/Google/GitHub" capability seen on many websites. Rather than handling users' social media credentials directly (which would be a severe security risk), applications with this capability use the APIs of these platforms to authenticate the user on each login.

Application Databases (OLTP)

Application databases store app state with fast, high-volume reads/writes.

They are ideal for transactional tasks like banking. Commonly low-latency, high-concurrency systems—RDBMS, document, or graph DBs.

More info about ACID and atomic transactions can be found here.

OLAP Systems

Built for large, interactive analytics—relatively slower at single-record lookups. Often used in ML pipelines or reverse ETL.

Change Data Capture (CDC)

CDC captures DB changes (insert/update/delete) for real-time sync or streaming. Implementation varies by database type.

Database Logs store operations before execution for recovery. Key for CDC and reliable event streams.

Logs

Logs are tracked system events for analysis or ML. Common sources: OS, apps, servers, networks. Formats: binary, semi-structured (e.g. JSON), or plain text.

Log Resolution defines how much detail logs capture. Log level controls what gets recorded (e.g., errors only). Logs can be batch or real-time.

CRUD

Create, Read, Update, Delete.

Core data operations in apps and databases. Common in APIs and storage systems.

Insert-Only

Instead of updates, new records are inserted with timestamps. This is great for history, but tables grow fast and lookups get costly.
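A minimal sketch of the pattern with DuckDB and made-up data: every change is appended as a new timestamped row, and the "current" state is recovered with a window function:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE customer_address (
        customer_id INTEGER,
        address     VARCHAR,
        updated_at  TIMESTAMP
    )
""")

# Instead of UPDATEs, every change is appended as a new timestamped row
con.execute("""
    INSERT INTO customer_address VALUES
        (1, '123 Old Street', '2024-01-01 09:00:00'),
        (1, '456 New Avenue', '2024-06-01 10:30:00')
""")

# The current state is the latest row per customer
print(con.execute("""
    SELECT customer_id, address
    FROM (
        SELECT *, row_number() OVER (
            PARTITION BY customer_id ORDER BY updated_at DESC
        ) AS rn
        FROM customer_address
    ) AS t
    WHERE rn = 1
""").fetchdf())
```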

Messages and Streams

Messages are single-use signals between systems. Once the message is received and the action is taken, it is removed from the message queue.

Streams are ordered, persistent logs of events for long-term processing. With their append-only nature, records in a stream are persisted over a retention window.

Streaming platforms often handle both.

Types of Time

Track all to monitor delays and flow.

Source Systems Practical Details

Practical knowledge of APIs, databases, and data flow tools is essential but ever-changing—stay current.

Relational Databases (RDBMS)

Structured, ACID-compliant, great for transactional systems. They have tables, foreign keys, normalization, etc.

Examples would be PostgreSQL, MySQL, SQL Server, Oracle DB etc.

NoSQL Databases

Flexible, horizontally scalable databases with different data models.

Key-Value Stores

Fast read/write using unique keys. Great for caching or real-time event storage.

Examples would be Redis, Amazon DynamoDB etc.

Document Stores

Schema-flexible, store nested JSON-like documents.

Some examples are MongoDB, Couchbase, Firebase Firestore etc.

Wide-Column Stores

High-throughput databases that scale horizontally. They use column families and rows.

Some examples are Cassandra, ScyllaDB, Google Bigtable.

Graph Databases

Store nodes and edges. Ideal for analyzing relationships.

Examples could be given as Neo4j, Amazon Neptune, ArangoDB etc.

Search Databases

Fast search and text analysis engines. Common in logs and e-commerce.

Popular examples are Elasticsearch, Apache Solr, Algolia.

Time-Series Databases

Optimized for time-stamped data: metrics, sensors, logs.

Some examples would be InfluxDB, TimescaleDB, Apache Druid etc.

APIs

Standard for data exchange across systems, especially over HTTP.

REST (Representational State Transfer)

Stateless API style using HTTP verbs (GET, POST, etc.). Widely adopted but loosely defined—developer experience varies.

An example would be the GitHub REST API.
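As a minimal sketch with the requests library, here is a read-only call to a public GitHub REST endpoint (the repository is just an example; unauthenticated calls are heavily rate-limited):

```python
import requests

resp = requests.get(
    "https://api.github.com/repos/apache/airflow",
    headers={"Accept": "application/vnd.github+json"},
    timeout=10,
)
resp.raise_for_status()
repo = resp.json()
print(repo["full_name"], repo["stargazers_count"])
```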

GraphQL

Made by Meta. Lets clients request exactly the data they need in one query—more flexible than REST.

Here is a link for the curious: GitHub GraphQL API.
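A minimal sketch of the same idea against the GitHub GraphQL API: the client asks for exactly two fields in one query. A personal access token is assumed (the placeholder below is not a real credential):

```python
import requests

query = """
query {
  repository(owner: "apache", name: "airflow") {
    name
    stargazerCount
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": query},
    headers={"Authorization": "Bearer YOUR_TOKEN_HERE"},  # assumed token
    timeout=10,
)
print(resp.json())
```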

Webhooks

Event-based callbacks from source systems to endpoints. Called reverse APIs because the server pushes data to the client.

Stripe Webhooks and Slack Webhooks are great examples.
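On the receiving side, a webhook is just an HTTP endpoint you expose. Here is a minimal, hypothetical Flask sketch (a real endpoint would also verify the provider's signature header):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/payments", methods=["POST"])
def handle_payment_event():
    event = request.get_json()                 # the provider POSTs an event payload here
    print("received event:", event.get("type"))
    return "", 204                             # acknowledge quickly; process async if heavy

if __name__ == "__main__":
    app.run(port=5000)
```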

RPC / gRPC

Run remote functions as if local. gRPC (by Google) uses Protocol Buffers and HTTP/2 for fast, efficient communication.

Check out gRPC by Google for more.

And More

There are more details about Data Sharing, Third-Party Data Sources, Message Queues and Event-Streaming Platforms.

Summary

Now we have a baseline for understanding source systems. The details matter.

We should work closely with app teams to improve data quality, anticipate changes, and build better data products. Collaboration leads to shared success—especially with trends like reverse ETL and event-driven architectures.

Making source teams part of the data journey is also a great idea.

Next: storing the data.

One additional note: Ideally our systems should be idempotent. An idempotent system produces the same result whether a message is processed once or multiple times—crucial for handling retries safely.


6. Storage 📩

Storage is core to every stage—data is stored repeatedly across ingestion, transformation, and serving.

Two things to consider while deciding on storage are:

The way storage is explained in the book is with the following figure:

Raw Ingredients of Data Storage

Here are some one liners as definitions.

Data Storage Systems

Operate above raw hardware—like disks—using platforms such as cloud object stores or HDFS. Higher abstractions include data lakes and lakehouses.

Here are some one liners about them.

Data Engineering Storage Abstractions

These are the abstractions that are built on top of storage systems.

Let's remember our map for storage.

Here are some of the Storage Abstractions.

Big Ideas in Data Storage

Here are some big ideas in Storage.

🔍 Data Catalogs

Data catalogs are centralized metadata hubs that let users search, explore, and describe datasets.

They support:

🔗 Data Sharing

Cloud platforms enable secure sharing of data across teams or organizations.

⚠ This requires strong access controls to avoid accidental exposure.

đŸ§± Schema Management

Understanding structure is essential:

💡 Use formats like Parquet for built-in schema support. Avoid raw CSV.
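A quick pandas sketch of why this matters (assuming pyarrow or fastparquet is installed): Parquet keeps column names and types with the data, so they survive the round trip, unlike CSV:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup_ts": pd.to_datetime(["2024-01-01", "2024-02-15", "2024-03-10"]),
    "plan": ["free", "pro", "free"],
})

df.to_parquet("users.parquet", index=False)   # schema is stored alongside the data

back = pd.read_parquet("users.parquet")
print(back.dtypes)                            # dtypes survive the round trip
```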

⚙ Separation of Compute & Storage

Modern systems decouple compute from storage for better scalability and cost control.

🔁 Hybrid Storage Examples

📎 Zero-Copy Cloning

Clone data without duplicating it (e.g., Snowflake, BigQuery).

⚠ Deleting original files may affect clones — know the limits.

📈 Data Storage Lifecycle & Retention

We talked about the temperature of data. Let's see an example.

đŸ”„ Hot, 🟠 Warm, 🧊 Cold Data

| Type | Frequency    | Storage | Cost   | Use Case                      |
|------|--------------|---------|--------|-------------------------------|
| Hot  | Frequent     | RAM/SSD | High   | Recommendations, live queries |
| Warm | Occasional   | S3 IA   | Medium | Monthly reports, staging data |
| Cold | Rare/Archive | Glacier | Low    | Compliance, backups           |

Use lifecycle policies to move data between tiers automatically.
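As one hedged example of such a policy on AWS (assuming S3 and a hypothetical bucket and prefix), boto3 can register transitions that move objects to cheaper tiers as they cool down:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",            # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cool-down-events",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm tier
                    {"Days": 365, "StorageClass": "GLACIER"},     # cold tier
                ],
            }
        ]
    },
)
```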

⏳ Retention Strategy

🏱 Single-Tenant vs Multitenant Storage

Single-Tenant

Multitenant

Summary 😌

Storage is the backbone of the data engineering lifecycle—powering ingestion, transformation, and serving. As data flows through systems, it's stored multiple times across various layers, so understanding how, where, and why we store data is critical.

Smart storage decisions—paired with good schema design, lifecycle management, and collaboration—can drastically improve scalability, performance, and cost-efficiency in any data platform.

Here are 3 strong quotes from the book.

As always, exercise the principle of least privilege. Don’t give full database access to anyone unless required.

Data engineers must monitor storage in a variety of ways. This includes monitoring infrastructure storage components, object storage and other “serverless” systems.

Orchestration is highly entangled with storage. Storage allows data to flow through pipelines, and orchestration is the pump.


7. Ingestion

Ingestion is the process of moving data from source systems into storage—it's the first step in the data engineering lifecycle after data is generated.

Quick definitions: data ingestion is data movement from point A to point B, while data integration combines data from disparate sources into a new dataset. An example of data integration is combining data from a CRM system, advertising analytics, and web analytics to build a user profile, which is then saved to our data warehouse.

A data pipeline is the full system that moves data through the data engineering lifecycle. Design of data pipelines typically starts at the ingestion stage.

What to Consider when Building Ingestion? đŸ€”

Consider these factors when designing your ingestion architecture:

Bounded vs. Unbounded

All data is unbounded until constrained. Streaming preserves natural flow; batching adds structure.

Frequency

Choose between batch, micro-batch, or real-time ingestion. "Real-time" typically means low-latency, near real-time.

Synchronous vs. Asynchronous

Serialization & Deserialization

Data must be encoded before transfer and properly decoded at destination. Incompatible formats make data unusable.
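A tiny sketch of the round trip using JSON (production pipelines often prefer binary formats such as Avro or Protobuf, but the idea is the same):

```python
import json
from datetime import datetime, timezone

event = {
    "user_id": 42,
    "action": "login",
    "ts": datetime.now(timezone.utc).isoformat(),
}

payload = json.dumps(event).encode("utf-8")    # serialize before sending over the wire
# ... payload travels through a queue, an HTTP call, or a file ...
decoded = json.loads(payload.decode("utf-8"))  # deserialize at the destination
print(decoded["action"])
```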

Throughput & Scalability

Design to handle spikes and backlogs. Use buffering and managed services (e.g., Kafka, Kinesis) for elasticity.

⏳ Reliability & Durability

Ensure uptime and no data loss through redundancy and failover. Balance cost vs. risk—design for resilience within reason.

🗃 Payload

Let's understand data characteristics:

Push vs. Pull vs. Poll

And here are some additional insights.

🔄 Streaming + Batch Coexist

Even with real-time ingestion, batch steps are common (e.g., model training, reports). Expect a hybrid approach.

đŸ§± Schema Awareness

Schemas change—new columns, types, or renames can silently break pipelines. Use schema registries to version and manage schemas reliably.

đŸ—‚ïž Metadata Mattera

Without rich metadata, raw data can become a data swamp. Proper tagging and descriptions are critical for usability.

Batch Ingestion

If we go the batch route, here are some things to keep in mind. Batch ingestion moves data in bulk, usually based on a time interval or data size. It’s widely used in traditional ETL and for transferring data from stream processors to long-term storage like data lakes.

Snapshot vs. Differential Extraction

File-Based Export and Ingestion

ETL vs. ELT

We defined ETL before. In ELT the definition is as follows:

Choose based on system capabilities and transformation complexity.

đŸ“„ Inserts, Updates, and Batch Size

🔄 Data Migration

Large migrations (TBs+) involve moving full tables or entire systems.

Key challenges are:

Use staging via object storage and test sample loads before full migration. Also consider migration tools instead of writing from scratch.

📹 Message and Stream Ingestion

Event-based ingestion is common in modern architectures. This section covers best practices and challenges to watch for when working with streaming data.

🧬 Schema Evolution

Schema changes (added fields, type changes) can break pipelines.

Here is what you can do:

🕓 Late-Arriving Data

🔁 Ordering & Duplicate Delivery

âȘ Replay

Replay lets you reprocess historical events within a time range.

⏳ Time to Live (TTL)

TTL defines how long events are retained before being discarded. It's the maximum message retention time, which is helpful to reduce backpressure.

Short TTLs can cause data loss; long TTLs can create backlogs.

Examples:

📏 Message Size

Be mindful of max size limits:

🧯 Error Handling & Dead-Letter Queues

Invalid or oversized messages should be routed to a dead-letter queue.

🔄 Consumer Models: Pull vs. Push

Pull is default for data engineering; push is used for specialized needs.

🌍 Ingestion Location & Latency

đŸ“„ Ways to Ingest Data

There are many ways to ingest data—each with its own trade-offs depending on the source system, use case, and infrastructure setup.

đŸ§© Direct Database Connections (JDBC/ODBC)

JDBC/ODBC are standard interfaces for pulling data directly from databases.

JDBC is Java-native and widely portable; ODBC is system-specific. These connections can be parallelized for performance but are row-based and struggle with nested/columnar data.

Many modern systems now favor native file export (e.g., Parquet) or REST APIs instead.

🔄 Change Data Capture (CDC)

Here are some quick definitions on CDC:

🌐 APIs

APIs are a common ingestion method from external systems.

📹 Message Queues & Event Streams

Use systems like Kafka, Kinesis, or Pub/Sub to ingest real-time event data.

Design for low latency, high throughput, and consider autoscaling or managed services to reduce ops burden.

🔌 Managed Data Connectors

Services like Fivetran, Airbyte, and Stitch provide plug-and-play connectors.

đŸȘŁ Object Storage

Object storage (e.g., S3, GCS, Azure Blob) is great for moving files between teams and systems.

Use signed URLs for temporary access and treat object stores as secure staging zones for data.

đŸ’Ÿ EDI (Electronic Data Interchange)

This is a legacy format still common in business environments.

đŸ“€ File Exports from Databases

Large exports put load on source systems—use read replicas or key-based partitioning for efficiency.

Modern cloud data warehouses support direct export to object storage in formats like Parquet or ORC.

đŸ§Ÿ File Formats & Considerations

Avoid CSV when possible due to its lack of schema support, its inability to handle nested data, and its error-prone behavior.

Prefer Parquet, ORC, Avro, Arrow, JSON—which support schema and complex structures.

đŸ’» Shell, SSH, SCP, and SFTP

Shell scripts and CLI tools still play a big role in scripting ingestion pipelines.

📡 Webhooks

Webhooks are "Reverse API" where the data provider pushes data to your service.

🌐 Web Interfaces & Scraping

🚚 Transfer Appliances

đŸ€ Data Sharing

Platforms like Snowflake, BigQuery, Redshift, and S3 allow read-only data sharing.

Summary đŸ„ł

Here are some quotes from the book.

Moving data introduces security vulnerabilities because you have to transfer data between locations. Data that needs to move within your VPC should use secure endpoints and never leave the confines of the VPC.

Do not collect the data you don't need. Data cannot leak if it is never collected.

My summary is down below:

Ingestion is the stage in the data engineering lifecycle where raw data is moved from source systems into storage or processing systems.

Data can be ingested in batch or streaming modes, depending on the use case. Batch ingestion processes large chunks of data at set intervals or based on file size, making it ideal for daily reports or large-scale migrations. Streaming ingestion, on the other hand, continuously processes data as it arrives, making it suitable for real-time applications like IoT, event tracking, or transaction streams.

Designing ingestion systems involves careful consideration of factors like bounded vs. unbounded data, frequency, serialization, throughput, reliability, and the push/pull method of data retrieval.

Ingestion isn’t just about moving data—it’s about understanding the shape, schema, and sensitivity of that data to ensure it's usable downstream.

As Data Engineers we must track metadata, consider ingestion frequency vs. transformation frequency, and apply best practices for security, compliance, and cost.

We should also stay flexible: even legacy methods like EDI or manual downloads may still be part of real-world workflows.

The key is to choose ingestion patterns that match the needs of the business while staying robust, scalable, and future-proof.


8. Queries, Modeling, and Transformation đŸȘ‡

Now we'll learn how to make data useful. đŸ„ł

Queries

Queries are at the core of data engineering and data analysis, enabling users to interact with, manipulate, and retrieve data.

Just to paint a picture, here is an example query:

SELECT name, age
FROM df
WHERE city = 'LA' AND age > 27;

Here is a complete example in Python:

import pandas as pd
import duckdb

# Create a sample DataFrame
data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 40, 28],
    'city': ['NY', 'LA', 'NY', 'SF', 'LA']
}

df = pd.DataFrame(data)

# Run SQL query: Get users from LA older than 27
result = duckdb.query("""
    SELECT name, age
    FROM df
    WHERE city = 'LA' AND age > 27
""").to_df()

print(result)
#   name  age
# 0  Bob   30
# 1  Eva   28

Structured Query Language (SQL) is commonly used for querying tabular and semistructured data.

A query may read data (SELECT), modify it (INSERT, UPDATE, DELETE), or control access (GRANT, REVOKE).

Under the hood, a query goes through parsing, compilation to bytecode, optimization, and execution.

Various query languages (DML, DDL, DCL, TCL) are used to define and manipulate data and database objects, manage access, and control transactions for consistency and reliability.

To improve query performance, data engineers must understand the role of the query optimizer and write efficient queries. Strategies include optimizing joins, using prejoined tables or materialized views, leveraging indexes and partitioning, and avoiding full table scans.

We should monitor execution plans, system resource usage, and take advantage of query caching. Managing commits properly and vacuuming dead records are essential to maintain database performance. Understanding the consistency models of databases (e.g., ACID, eventual consistency) ensures reliable query results.

Streaming queries differ from batch queries, requiring real-time strategies such as session, fixed-time, or sliding windows.

Watermarks are used to handle late-arriving data, while triggers enable event-driven processing.

Combining streams with batch data, enriching events, or joining multiple streams adds complexity but unlocks deeper insights.

Technologies like Kafka, Flink, and Spark are essential for such patterns. Modern architectures like Kappa treat streaming logs as first-class data stores, enabling analytics on both recent and historical data with minimal latency.
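As a minimal, hypothetical PySpark Structured Streaming sketch of these ideas, here is a sliding window with a watermark for late-arriving data. It uses Spark's built-in rate source so it runs without a real Kafka topic:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg, col

spark = SparkSession.builder.appName("streaming-window-sketch").getOrCreate()

# The built-in `rate` source emits rows with a `timestamp` and a counter `value`,
# which is enough to demonstrate windowing without a real event stream.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumnRenamed("timestamp", "event_time")
    .withColumn("latency_ms", col("value") % 100)
)

# 5-minute sliding windows every minute, tolerating events up to 10 minutes late
windowed = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes", "1 minute"))
    .agg(avg("latency_ms").alias("avg_latency_ms"))
)

query = windowed.writeStream.outputMode("update").format("console").start()
# query.awaitTermination()  # uncomment to keep the job running
```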

Data Modeling

Data modeling is a foundational practice in data engineering that ensures data structures reflect business needs and logic. A data model shows how data relates to the real world.

Despite its long-standing history, it has often been overlooked, especially with the rise of big data and NoSQL systems.

Today, there's a renewed focus on data modeling as companies recognize the importance of structured data for quality, governance, and decision-making.

A good data model aligns with business outcomes, supports consistent definitions (like what qualifies as a "customer"), and provides a scalable framework for analytics.

Modeling typically progresses from conceptual (business rules), to logical (types and keys), to physical (actual database schemas), and always considers the grain of the data (the resolution at which data is stored and queried).

A normalized model avoids redundancy and maintains data integrity. The first three normal forms (1NF, 2NF, 3NF) establish increasingly strict rules for structuring tables. While normalization reduces duplication, denormalization—often found in analytical or OLAP systems—can improve performance.
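To make the trade-off concrete, here is a small DuckDB sketch with made-up data: the denormalized table repeats customer attributes on every order row, while the normalized form keeps customer facts in one place and answers the same question with a join:

```python
import duckdb

con = duckdb.connect()

# Denormalized: customer attributes repeated on every order row
con.execute("""
    CREATE TABLE orders_wide AS SELECT * FROM (VALUES
        (1, 'Alice', 'NY', 20.0),
        (2, 'Alice', 'NY', 12.0),
        (3, 'Bob',   'LA', 35.5)
    ) AS t(order_id, customer_name, customer_city, amount)
""")

# Normalized: customer facts live in one table and are referenced by key
con.execute("""
    CREATE TABLE customers AS SELECT * FROM (VALUES
        (1, 'Alice', 'NY'), (2, 'Bob', 'LA')
    ) AS t(customer_id, name, city)
""")
con.execute("""
    CREATE TABLE orders AS SELECT * FROM (VALUES
        (1, 1, 20.0), (2, 1, 12.0), (3, 2, 35.5)
    ) AS t(order_id, customer_id, amount)
""")

# The normalized form needs a join to answer the same question
print(con.execute("""
    SELECT c.city, sum(o.amount) AS revenue
    FROM orders o JOIN customers c USING (customer_id)
    GROUP BY c.city
""").fetchdf())
```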

Three dominant batch modeling strategies are Inmon (centralized, normalized warehouse with downstream marts), Kimball (dimensional model with fact/dimension tables in star schemas), and Data Vault (insert-only, agile, source-aligned modeling using hubs, links, and satellites).

Wide, denormalized tables are gaining popularity in the cloud era due to flexible schemas and cheap storage, especially in columnar databases.

Additionally, streaming data modeling presents new challenges. Traditional batch paradigms don’t easily apply due to continuous schema changes and unbounded nature.

So flexibility is key: assume the source defines business logic, expect schema drift, and store both recent and historical data together.

Automation and dynamic analytics on streaming data are emerging trends. While no universal approach has yet emerged, models like the Data Vault show promise in adapting to streaming workflows.

The future may involve unified layers that combine metrics, semantics, pipelines, and real-time source-connected analytics, reducing the batch-vs-stream divide.

Transformation

Transformations enhance and persist data for downstream use.

Unlike queries, which retrieve data, transformations are about shaping and saving data—often as part of a pipeline. This reduces cost, increases performance, and enables reuse.

Batch Transformations

Batch transformations process data in chunks on a schedule (e.g., hourly, daily) and support reports, analytics, and ML models.

Distributed Joins:

ETL vs. ELT:

Choose based on context—no need to stick to one approach for the entire org. 😌

SQL vs. Code-Based Tools

Avoid excessive use of Python UDFs; they slow performance in Spark. Prefer native Scala/Java implementations when needed.

Update Patterns

Schema Updates

Data Wrangling

Example (Spark)
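A minimal, hypothetical PySpark batch wrangling sketch (paths and column names are made up): read raw JSON, fix types, drop bad records, and persist a columnar output for downstream use.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("batch-transform-sketch").getOrCreate()

# Hypothetical raw JSON dropped in object storage by an upstream process
raw = spark.read.json("s3://my-bucket/raw/orders/2024-06-01/")

# Typical wrangling steps: fix types, derive columns, drop invalid records
cleaned = (
    raw
    .withColumn("order_date", to_date(col("order_ts")))
    .withColumn("amount", col("amount").cast("double"))
    .filter(col("amount") > 0)
    .dropDuplicates(["order_id"])
)

# Persist the transformed output in a columnar format for downstream use
cleaned.write.mode("overwrite").partitionBy("order_date").parquet("s3://my-bucket/clean/orders/")
```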

Business Logic & Derived Data

MapReduce

Materialized Views, Federation, and Query Virtualization

Here are some one liners.

Streaming Transformations and Processing

Streaming Transformations vs. Queries

Streaming DAGs

Micro-Batch vs. True Streaming

Choose based on latency requirements, team expertise, and real-world testing.

Summary

Modern data systems revolve around three tightly interwoven pillars: queries, data modeling, and transformations.

At the surface, SQL queries let us retrieve, filter, and analyze data in declarative ways, whether for dashboards or ad-hoc investigations.

But queries alone are not enough—they assume data is structured and meaningful. Techniques like joins (e.g., combining customer orders and product data), window functions, and streaming queries (e.g., computing moving averages in real time) depend on underlying data that’s clean, normalized, and aligned with business logic. Without good structure, queries become brittle, hard to reuse, and difficult to scale.

That structure comes from data modeling—the process of organizing raw data into logical layers that reflect the organization’s goals.

Whether it’s Inmon’s normalized warehouse-first approach, Kimball’s dimensional star schemas, or the flexibility of a Data Vault, modeling helps define relationships, enforce consistency, and preserve meaning over time.

Modeling even applies to stream data, albeit in more relaxed forms, where business definitions may shift dynamically, and flexibility (e.g., using JSON columns or CDC feeds) becomes more important than strict schema enforcement.

Poorly modeled data often leads to data swamps, reporting confusion, and redundant pipelines—while good models lead to faster insights and cleaner transformations downstream.

Finally, transformations take center stage in turning data into its most useful, consumable form. This includes batch pipelines (e.g., ETL/ELT jobs using Spark or SQL), real-time stream enrichments, and creating derived data that reflects business logic like profit metrics.

Tools like materialized views, Airflow DAGs, and orchestration frameworks help simplify these complex workflows and reduce redundant processing.

As data engineers, we’re often tasked with choosing between performance and flexibility—using insert-only patterns, upserts, or schema evolution strategies that balance cost and query speed.

Whether we persist transformed data in a wide denormalized table or virtualize it across systems, our transformations are what elevate raw data into decision-ready information. 💕


9. Serving Data for Analytics, Machine Learning, and Reverse ETL 🍜

Serving is the final stage of the data engineering lifecycle, where data is delivered to drive insights, predictions, and actions.

It covers use cases like dashboards, machine learning, and feeding transformed data back into operational tools (reverse ETL).

Success here depends on data trust, user understanding, performance, and thoughtful system design.

Let's discover them further 😌

General Considerations for Serving Data

Trust

Trust is the most critical factor when serving data—if users don’t believe the data is accurate or consistent, they won’t use it.

Data validation, observability, and adherence to SLAs/SLOs ensure trustworthiness throughout the lifecycle. Once trust is lost, it’s difficult to regain and often leads to poor adoption and failed data initiatives.

Here is an example of an SLA and SLO at the serving stage.

Setting an SLA isn’t enough. Clear communication is essential—teams must regularly discuss any risks that could impact expectations and define a process for addressing issues and continuous improvement.

What’s the Use Case, and Who’s the User?

Knowing the user and their intended action helps shape data products with real business impact.

Serving data should begin by identifying the use case and working backwards from the decision or trigger it supports.

This user-first approach ensures relevance, usability, and alignment with goals.

Data Products

A data product is a reusable dataset or service that solves a defined user problem through data.

Building effective products requires collaboration with end users and clarity on their goals and expected outcomes.

Good data products generate feedback loops, improving themselves as usage increases and needs evolve.

Data Definitions and Logic

Definitions like “customer” or “churn” must be consistent across systems to ensure correct and aligned usage.

Embedded business logic should be captured and centralized to avoid ambiguity and hidden institutional knowledge.

Tools like semantic layers or catalogs can document and enforce shared definitions across teams and systems.

Data Mesh

Data mesh distributes data ownership across teams, turning them into both producers and consumers of high-quality data.

This decentralization improves scale and accountability, as each domain serves its data for others to use. It changes how data is served—teams must prepare, document, and support the data they publish to the mesh.

Analytics

This is the first use case for data-serving. 😌

Business Analytics

Business analytics helps stakeholders make strategic decisions using historical trends, KPIs, and dashboards.

Data is often served through data warehouses or lakes, using BI tools like Tableau, Looker, or Power BI.

Dashboards, reports, and ad hoc analysis are key outputs, with data engineers enabling access and quality.

Operational Analytics

Operational analytics supports real-time monitoring and rapid responses by processing live data streams.

It powers use cases like fraud detection, system monitoring, and factory floor analytics with low-latency data. This category requires real-time pipelines and databases optimized for concurrency, freshness, and speed.

Real-time analytics at the factory is a great example here!

Embedded Analytics

Embedded analytics integrates data and insights directly into user-facing applications, enabling real-time, data-driven decision-making.

For instance, a smart thermostat app displays live temperature and energy usage, helping users optimize their heating or cooling schedules for efficiency.

Similarly, a third-party e-commerce platform offers sellers real-time dashboards on sales, inventory, and returns—empowering them to react quickly, like launching instant promotions.

Other examples include fitness apps showing health trends and workout suggestions based on user data, SaaS platforms that provide usage and engagement insights to customer success teams, and ride-sharing apps surfacing driver performance and earnings in real time.

In all these cases, analytics isn’t a separate tool—it’s woven into the experience, driving immediate, contextual decisions.

Users expect near-instant data updates and smooth interactivity, which requires low-latency serving systems. Data engineers manage performance, concurrency, and delivery infrastructure behind the scenes.

Machine Learning

Really good quote:

The boundary between ML, data science, data engineering, and ML engineering is increasingly fuzzy, and this boundary varies dramatically between organizations.

Serving for ML means preparing and delivering high-quality data for model training, tuning, and inference.

Data engineers may handle raw ingestion, feature pipelines, or even batch scoring alongside ML teams.

What a Data Engineer Should Know About ML đŸ€š

This knowledge helps data engineers better support ML pipelines and collaborate effectively with data scientists and ML engineers.

Ways to Serve Data for Analytics and ML

File Exchange

File-based serving is still common—CSV, Excel, JSON—but lacks scalability and consistency.

Better alternatives include cloud file sharing, object storage, or automated pipelines into data lakes.

It’s often a stopgap or used when consumers lack access to more advanced platforms.

Databases

OLAP databases like Snowflake, BigQuery, and Redshift offer structured, high-performance serving for analytics and ML.

They support schemas, access control, and caching, and allow slicing compute for cost management.

Data engineers manage performance, security, and scaling based on usage and workload.
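
As a rough sketch of what this looks like in practice (the connection string, schema, and table are made up, and the same pattern applies to Snowflake, BigQuery, or Redshift via their own drivers):

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical warehouse connection; swap in the dialect/driver for your platform.
engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")

query = text("""
    SELECT order_date, SUM(revenue) AS total_revenue
    FROM curated.daily_orders
    GROUP BY order_date
    ORDER BY order_date
""")

# Serve an aggregate to a BI tool, notebook, or downstream job.
with engine.connect() as conn:
    df = pd.read_sql(query, conn)

print(df.head())
```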

Streaming Systems

Streaming systems enable near real-time serving by continuously processing incoming data. They’re used for operational dashboards, anomaly detection, and time-sensitive applications.

Technologies like Flink, Kafka, and materialized views help bridge streaming and batch worlds.
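
Here is a minimal consumer sketch, assuming the `kafka-python` client, a hypothetical `orders-events` topic, and JSON-encoded events:

```python
import json
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "orders-events",                      # hypothetical topic
    bootstrap_servers="localhost:9092",   # hypothetical broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

# Continuously process incoming events, e.g., to feed an operational dashboard
# or flag anomalies with low latency.
for message in consumer:
    event = message.value
    if event.get("amount", 0) > 10_000:
        print(f"Possible anomaly: large order {event}")
```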

Query Federation

Query federation lets users query multiple systems (e.g., OLTP, OLAP, APIs) without centralizing the data.

It’s useful for ad hoc analysis and controlled access but requires performance and resource safeguards.

Tools like Trino and Starburst make this practical, especially in data mesh environments.
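
A small sketch with the `trino` Python client shows the idea; the coordinator, catalog, and table names are placeholders:

```python
import trino

# Connect to a hypothetical Trino coordinator.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
# The same session could join across catalogs (e.g., hive and postgresql)
# without copying data into a central store first.
cur.execute("SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id LIMIT 10")
for row in cur.fetchall():
    print(row)
```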

Data Sharing

Data sharing provides secure, scalable access to datasets between teams or organizations in the cloud.

It reduces the need for duplicating data and allows for real-time consumption through platforms like Snowflake or BigQuery.

Access control becomes the main concern, and engineers shift to enabling visibility while managing cost.

Semantic and Metrics Layers

Semantic layers define shared metrics and business logic once, enabling reuse across dashboards and queries.

They improve consistency, trust, and speed of development by centralizing definitions.

Tools like Looker and dbt exemplify this approach, bridging analysts, engineers, and stakeholders.
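
A toy way to picture the idea (purely illustrative, not how Looker or dbt actually implement it) is a single registry of metric definitions that every query is rendered from:

```python
# Toy metrics registry: each metric is defined once and reused everywhere.
METRICS = {
    "active_customers": "COUNT(DISTINCT customer_id)",
    "total_revenue": "SUM(revenue)",
}

def build_query(metric_name: str, table: str, group_by: str) -> str:
    """Render a SQL query from the shared metric definition."""
    expression = METRICS[metric_name]
    return (
        f"SELECT {group_by}, {expression} AS {metric_name} "
        f"FROM {table} GROUP BY {group_by}"
    )

# Dashboards, notebooks, and ad hoc queries all inherit the same definition.
print(build_query("total_revenue", "curated.orders", "order_date"))
```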

Serving Data in Notebooks

Notebooks like Jupyter are central to data science work, but local environments often hit memory limits. Data scientists typically connect to data sources programmatically, whether it's an API, a relational database, a cloud data warehouse, or a data lake.

Engineers help scale access via cloud notebooks, distributed engines (e.g., Dask, Spark), or managed services. They also manage permissions, access control, and infrastructure for collaborative, reproducible analysis.

Reverse ETL

Reverse ETL pushes processed data from the warehouse back into operational tools like CRMs or ad platforms.

It enables teams to act on insights directly within their workflows, improving impact and usability.

However, it can create feedback loops and must be carefully monitored for accuracy, cost, and unintended consequences.
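
A bare-bones sketch of the pattern (the warehouse connection and the CRM endpoint are placeholders, not a real API):

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# 1. Read modeled data, e.g., ML-scored leads, out of the warehouse.
engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")
leads = pd.read_sql("SELECT lead_id, score FROM curated.scored_leads", engine)

# 2. Push each record back into an operational tool through its API.
for _, row in leads.iterrows():
    response = requests.post(
        "https://crm.example.com/api/leads/update",   # placeholder endpoint
        json={"lead_id": str(row["lead_id"]), "score": float(row["score"])},
        timeout=10,
    )
    response.raise_for_status()
```

Real reverse ETL tools add batching, retries, and schema mapping on top of this basic loop.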

Summary

At the final stage of the data engineering lifecycle, serving data ensures insights flow into action.

This involves delivering clean, timely, and trustworthy data to a variety of consumers: analysts generating dashboards and reports, data scientists training models, and even business systems via reverse ETL—where insights are pushed back into operational tools like CRMs or ad platforms.

Regardless of the use case, trust is foundational: teams must invest in data validation, observability, and clear service-level agreements (SLAs/SLOs) to maintain reliability.

The right data definitions and consistent logic—often managed via semantic or metrics layers—ensure that users interpret and act on data the same way across the organization.

Data must be served with the user and use case in mind.

Business analysts rely on OLAP databases and BI tools (like Tableau, Looker, or Power BI) for trend detection and strategic reporting, while operational and embedded analytics require real-time or low-latency systems.

For machine learning, engineers prepare structured or semi-structured data for offline or online model training, often via feature pipelines and batch exports from data warehouses.

Data scientists may use notebooks like Jupyter, often hitting local memory limits and scaling out to tools like Spark, Ray, or SageMaker.

Whether serving analytics or ML, delivery options include query engines, object storage, streaming systems, and federated queries—all chosen based on latency, concurrency, and access control needs.

Lastly, reverse ETL has emerged as a key method to close the loop between insights and action. Rather than expecting users to access insights in dashboards or files, reverse ETL pipelines push enriched or modeled data directly into operational tools—like inserting ML-scored leads back into Salesforce.

This approach reduces friction and enables real-time decisioning within the tools teams already use.

However, it also introduces potential feedback loops and risks, such as runaway bid models in ad platforms. Monitoring and safeguards are essential.

As serving becomes more complex and democratized, concepts like data mesh, where teams produce and consume data products autonomously, shift the mindset from centralized pipelines to federated, domain-driven delivery.


Part 3 – Security, Privacy, and the Future of Data Engineering

The final part of the book is about Security, Privacy, and the Future of Data Engineering.


10. Security and Privacy đŸ›Ąïž

Security in data engineering is not optional—it’s foundational.

As custodians of sensitive data, data engineers must prioritize security at every stage of the data lifecycle.

Beyond protecting infrastructure, strong security builds trust, ensures regulatory compliance, and prevents damaging breaches that could derail careers and companies.

People

Humans are the weakest link in the security chain.

Engineers should adopt a defensive mindset, practice negative thinking to anticipate worst-case scenarios, and be cautious with sensitive data and credentials.

Ethical concerns about data handling should be raised, not buried.

Processes

Security must be habitual, not theatrical. Many organizations prioritize compliance checklists (SOC-2, ISO 27001) without truly securing systems.

Embed active security thinking into the culture, regularly audit risks, and exercise the principle of least privilege by granting only necessary access, only for the time it’s needed.

See this doc from Google Cloud as an example.

Understand your shared responsibility when using the cloud.

Technology

Keep your software patched and systems updated (Good Luck).

Use encryption at rest and in transit to protect against basic attacks, but remember encryption alone won’t prevent human errors.
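
As one small, hedged illustration of client-side encryption at rest (using the `cryptography` package; in real systems keys live in a secrets manager, and managed storage services typically offer built-in encryption as well):

```python
from cryptography.fernet import Fernet

# In practice the key comes from a secrets manager, never from source code.
key = Fernet.generate_key()
fernet = Fernet(key)

plaintext = b"customer_id,email\n42,jane@example.com\n"

# Encrypt before writing to disk or object storage (encryption at rest).
ciphertext = fernet.encrypt(plaintext)

# Decrypt only where and when the data is actually needed.
assert fernet.decrypt(ciphertext) == plaintext
```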

Ensure network access is locked down—never expose databases or cloud instances to public IPs without strict controls.

Regular logging, monitoring, and alerting will help detect anomalies in access, resource usage, or billing that could signal a breach.

Data Backups and Disaster Preparedness

Always back up your data and test restore procedures regularly. ♻

In the era of ransomware, recovery readiness is as vital as prevention. Don’t wait for disaster to find out your backups are broken.
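
A minimal sketch of the "test the restore, not just the backup" habit, assuming `boto3` and a hypothetical bucket:

```python
import filecmp
import boto3

s3 = boto3.client("s3")
BUCKET = "example-backup-bucket"  # hypothetical

# Back up the file...
s3.upload_file("warehouse_export.db", BUCKET, "backups/warehouse_export.db")

# ...then exercise the restore path and verify the bytes match.
s3.download_file(BUCKET, "backups/warehouse_export.db", "restored_warehouse_export.db")
assert filecmp.cmp("warehouse_export.db", "restored_warehouse_export.db", shallow=False)
```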

Security at a Technical Level

Security risks exist even at the hardware and low-level software layer—e.g., vulnerabilities in logging libraries, microcode, or memory.

While this book focuses on higher-level pipelines, engineers working closer to storage and processing must stay vigilant and up-to-date.

Internal Security Awareness

Encourage engineers to be active security contributors within their domains. Familiarity with specific tools gives them a unique vantage point to identify potential flaws.

Security shouldn’t be siloed—it should be everyone’s responsibility.

Conclusion

Security is not just a policy—it’s a habit. Treat data like your most valuable possession.

While you may not be the lead on security at your company, by practicing good security hygiene, staying alert, and keeping security front of mind, you play a key role in protecting your organization’s data.


11. The Future of Data Engineering đŸ—»

The field of data engineering is evolving rapidly, but its lifecycle—ingest, transform, serve—remains a durable foundation.

Though tools and best practices evolve, the underlying need to build trustworthy, performant data systems persists.

Simplicity is on the rise, but that doesn’t diminish the need for engineers—it elevates them to higher-level thinking and system design.

Simplification, Not Elimination

Rise of Simpler Tools:

Managed cloud services (like Snowflake, BigQuery, and Airbyte) have reduced complexity and democratized data engineering.

Open source tools, now available as cloud offerings, reduce the need for infrastructure expertise, allowing companies of all sizes to participate in building robust data platforms.

Here are some examples of popular open-source data engineering tools along with their managed cloud offerings from major providers like Google Cloud, AWS, and Azure:

| Open-source tool | Google Cloud | AWS | Azure |
| --- | --- | --- | --- |
| Apache Airflow | Google Cloud Composer | Amazon Managed Workflows for Apache Airflow (MWAA) | Azure Managed Airflow (via Azure Data Factory) |
| Apache Beam | Google Cloud Dataflow | Amazon Kinesis Data Analytics (Apache Flink runtime) | Azure Stream Analytics (similar capabilities, not Beam-based directly) |
| Apache Kafka | Google Pub/Sub, Confluent Cloud | Amazon Managed Streaming for Apache Kafka (MSK) | Azure Event Hubs (with Kafka interface) |
| Apache Spark | Dataproc, Databricks on Google Cloud | Amazon EMR, Databricks on AWS | Azure Databricks, Azure Synapse Analytics (Spark runtime) |
| Apache Flink | Google Cloud Dataflow (Apache Flink runtime), Ververica Platform | Amazon Kinesis Data Analytics (Apache Flink runtime) | Azure HDInsight (Flink cluster preview), Azure Stream Analytics (similar capabilities) |
| Apache Cassandra | Google Cloud Bigtable (similar) | Amazon Keyspaces (Managed Apache Cassandra) | Azure Managed Instance for Apache Cassandra |
| Apache HBase | Google Cloud Bigtable | Amazon EMR (with HBase) | Azure HDInsight (HBase) |
| Apache Hadoop/HDFS | Google Cloud Dataproc, Google Cloud Storage | Amazon EMR, Amazon S3 | Azure HDInsight, Azure Data Lake Storage Gen2 |
| PostgreSQL/MySQL | Cloud SQL | Amazon RDS, Aurora | Azure Database for PostgreSQL/MySQL |
| Apache NiFi | Cloud Data Fusion (similar no-code ETL) | AWS Glue (visual ETL, similar) | Azure Data Factory (visual ETL, similar) |
| Elasticsearch | Elastic Cloud on GCP Marketplace | Amazon OpenSearch Service (formerly Elasticsearch Service) | Elastic Cloud on Azure Marketplace |
| Redis | Google Cloud Memorystore | Amazon ElastiCache for Redis | Azure Cache for Redis |

These examples illustrate how each major cloud provider packages open-source tools into managed services, abstracting away infrastructure management and simplifying operational complexity.

Shift in Focus:

As foundational components become plug-and-play, engineers will shift from pipeline plumbing to designing interoperable, resilient systems.

Tools like dbt, Fivetran, and managed Airflow free up time for higher-value work.

The Data Operating System

From Devices to the Cloud:

Cloud services resemble operating system services—storage, compute, orchestration—operating at global scale.

Just as app developers rely on OS abstractions, data engineers will increasingly build upon cloud-native primitives with standard APIs, enhanced metadata, and smart orchestration layers like Airflow, Dagster, and Prefect.

Future Stack Evolution:

We should expect:

This scaffolding will make cloud data systems feel like OS-level services.

From Batch to Live Data

The End of the Modern Data Stack (MDS):

While MDS made analytics accessible and scalable, its batch-oriented paradigm limits real-time applications.

The Live Data Stack is emerging, built on streaming pipelines and real-time OLAP databases (e.g., ClickHouse, Druid).

STL (Stream-Transform-Load) may replace ELT.

Expected Changes:

New Roles and Blurred Boundaries

Hybrid Roles Will Rise:

Engineers will wear mixed hats—data scientists with pipeline skills, ML engineers embedded in ops, software engineers integrating streaming data and analytics.

Expect the rise of ML platform engineers and real-time data app developers.

Embedded Data Engineering:

Instead of siloed teams, data engineers will become part of application teams, enabling faster experimentation and deeper integration of data and ML into the user experience.

The Rise of Interactive Analytics

Dark Matter of Data: Spreadsheets:

Spreadsheets remain the most widely used data tool.

Future platforms will merge the spreadsheet’s interactivity with the backend power of real-time OLAP, giving business users rich interfaces without sacrificing performance or structure.

Summary 🌟

Here are some trends to watch:

Your Role:

Stay curious, engage with the community, and keep learning. Whether you design pipelines or invent tools, you’re part of a fast-moving and impactful domain.

Data engineering’s future is bright—and you get to help build it.


Appendices

Appendix A. Serialization and Compression Technical Details

Modern data engineers, especially in the cloud, must understand how data is serialized, compressed, and deserialized to optimize pipeline performance.

Choosing the right formats and compression strategies can significantly reduce storage size, improve query performance, and support interoperability across systems.
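
A quick way to feel the difference is to serialize the same DataFrame as row-oriented CSV and as compressed, columnar Parquet (a small, self-made comparison; the Parquet write assumes `pyarrow` or `fastparquet` is installed):

```python
import os
import pandas as pd

# A small, synthetic dataset.
df = pd.DataFrame({"event_id": range(100_000), "value": [1.5] * 100_000})

# Row-oriented, uncompressed text serialization.
df.to_csv("events.csv", index=False)

# Columnar serialization with compression.
df.to_parquet("events.parquet", compression="snappy")

for path in ("events.csv", "events.parquet"):
    print(path, os.path.getsize(path), "bytes")
```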

Serialization Formats

Compression Techniques

Storage Engines

Storage engines handle how data is physically arranged, indexed, and compressed.

Columnar storage is now standard in analytics systems, with modern engines optimized for SSDs, complex types, and structured queries.

Engines like those in SQL Server, PostgreSQL, and MySQL offer pluggable or configurable storage modes, and innovations continue in database internals to better support today's workloads.

Key Takeaway

Understanding serialization and compression isn't optional—it’s essential for designing fast, scalable, and reliable data systems.

Choosing the right format and compression algorithm can yield massive performance improvements and smoother system interoperability.


Appendix B. Cloud Networking

Data engineers must understand cloud networking basics to design performant and cost-efficient systems.

Cloud networks impact latency, cost (especially due to data egress fees), and system architecture.

Key Concepts

Network Topology & Resource Hierarchy

Public clouds (AWS, GCP, Azure) follow similar structures: zones (smallest unit), regions (group of zones), and in GCP’s case, multiregions (group of regions).

Engineers must align data systems with this topology for high performance and resilience.

Data Egress Fees

Clouds allow free inbound traffic but charge for outbound traffic, especially across regions or to the internet.

This pricing model can create vendor lock-in and affect architecture choices.

Direct connections or CDNs can reduce costs.
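
A back-of-the-envelope calculation shows why this matters (the rate is illustrative only; actual pricing varies by provider, region, and tier):

```python
# Moving 10 TB out to the internet at an illustrative ~9 cents/GB.
egress_gb = 10 * 1024
rate_per_gb = 0.09

print(f"Estimated egress cost: ${egress_gb * rate_per_gb:,.2f}")  # -> $921.60
```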

Zones vs. Regions

GCP’s Premium Networking

Google offers premium-tier networking, where inter-region traffic stays on its private network, improving reliability and speed.

Direct Connect

Providers like AWS, Azure, and GCP offer direct network connections (e.g., AWS Direct Connect), lowering latency and significantly cutting egress costs (e.g., from roughly 9¢/GB to 2¢/GB).

CDNs (Content Delivery Networks)

CDNs like Cloudflare and cloud-native options cache data closer to users, improving delivery speed and reducing load on origin servers. However, their availability varies by region and political factors.

The Future of Data Egress

Data egress fees restrict cloud portability and multi-cloud adoption.

Competitive pressure and customer demand may push providers to reduce or eliminate egress fees in the near future, just as telecom pricing models evolved.

Takeaway

Cloud networking shapes system performance, resilience, and cost.

Data engineers must be aware of how their data moves within and across zones, regions, and providers—and should design architectures that balance latency, cost, and reliability while keeping an eye on evolving cloud pricing models.


Closing

So grateful that this book exists. Thanks to Joe Reis and Matt Housley.