Learning Fundamentals of Data Engineering

6. Storage 📦

Storage is core to every stage—data is stored repeatedly across ingestion, transformation, and serving.

Two things to consider while deciding on storage are:

The way storage is explained in the book is with the following figure:

Raw Ingredients of Data Storage

Here are some one liners as definitions.

Data Storage Systems

Operate above raw hardware—like disks—using platforms such as cloud object stores or HDFS. Higher abstractions include data lakes and lakehouses.

Here are some one liners about them.

Data Engineering Storage Abstractions

These are the abstractions that are built on top of storage systems.

Let's remember our map for storage.

Here are some of the Storage Abstractions.

Big Ideas in Data Storage

Here are some big ideas in Storage.

🔍 Data Catalogs

Data catalogs are centralized metadata hubs that let users search, explore, and describe datasets.

They support:

🔗 Data Sharing

Cloud platforms enable secure sharing of data across teams or organizations.

⚠️ This requires strong access controls to avoid accidental exposure.

🧱 Schema Management

Understanding structure is essential:

💡 Use formats like Parquet for built-in schema support. Avoid raw CSV.

⚙️ Separation of Compute & Storage

Modern systems decouple compute from storage for better scalability and cost control.

🔁 Hybrid Storage Examples

📎 Zero-Copy Cloning

Clone data without duplicating it (e.g., Snowflake, BigQuery).

⚠️ Deleting original files may affect clones — know the limits.

📈 Data Storage Lifecycle & Retention

We talked about the temperature of data. Let's see an example.

🔥 Hot, 🟠 Warm, 🧊 Cold Data

Type Frequency Storage Cost Use Case
Hot Frequent RAM/SSD High Recommendations, live queries
Warm Occasional S3 IA Medium Monthly reports, staging data
Cold Rare/Archive Glacier Low Compliance, backups

Use lifecycle policies to move data between tiers automatically.

⏳ Retention Strategy

🏢 Single-Tenant vs Multitenant Storage

Single-Tenant

Multitenant

Summary 😌

Storage is the backbone of the data engineering lifecycle—powering ingestion, transformation, and serving. As data flows through systems, it's stored multiple times across various layers, so understanding how, where, and why we store data is critical.

Smart storage decisions—paired with good schema design, lifecycle management, and collaboration—can drastically improve scalability, performance, and cost-efficiency in any data platform.

Here are 3 strong quotes from the book.

As always, exercise the principle of least privilege. Don’t give full database access to anyone unless required.

Data engineers must monitor storage in a variety of ways. This includes monitoring infrastructure storage components, object storage and other “serverless” systems.

Orchestration is highly entangled with storage. Storage allows data to flow through pipelines, and orchestration is the pump.


🡐 Part 2 Overview