Learning Fundamentals of Data Engineering

7. Ingestion

Ingestion is the process of moving data from source systems into storage—it's the first step in the data engineering lifecycle after data is generated.

A quick definition: data ingestion is moving data from point A to point B, while data integration combines data from disparate sources into a new dataset. An example of data integration: combining CRM data, advertising analytics, and web analytics into a single user profile, which is then saved to our data warehouse.

A data pipeline is the full system that moves data through the data engineering lifecycle. Design of data pipelines typically starts at the ingestion stage.

What to Consider when Building Ingestion? 🤔

Consider these factors when designing your ingestion architecture:

Bounded vs. Unbounded

All data is unbounded until it is constrained: events occur as a continuous stream. Streaming ingestion preserves that natural flow; batching imposes artificial boundaries (for example, cutting the stream into daily chunks).

Frequency

Choose between batch, micro-batch, or real-time ingestion. "Real-time" typically means low-latency, near real-time.

Synchronous vs. Asynchronous

Serialization & Deserialization

Data must be encoded before transfer and properly decoded at destination. Incompatible formats make data unusable.
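As a minimal illustration (the `user_event` record and its fields are made up), here is a record serialized to bytes at the source and deserialized at the destination; both sides must agree on the format and encoding or the payload is unusable downstream.

```python
import json

# Source side: serialize an in-memory record to bytes for transfer.
user_event = {"user_id": 42, "action": "login", "ts": "2024-01-01T00:00:00Z"}
payload = json.dumps(user_event).encode("utf-8")

# Destination side: deserialize back into a usable structure.
# Both sides must agree on the format (JSON) and encoding (UTF-8);
# a mismatch means the data arrives but cannot be used.
decoded = json.loads(payload.decode("utf-8"))
assert decoded["user_id"] == 42
```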

Throughput & Scalability

Design to handle spikes and backlogs. Use buffering and managed services (e.g., Kafka, Kinesis) for elasticity.
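A rough sketch of client-side buffering with a managed streaming service, here using the confluent-kafka Python client; the broker address, topic name, and batching settings are assumptions, not recommendations.

```python
from confluent_kafka import Producer

# Client-side batching smooths out spikes: events accumulate in a buffer
# and are sent in batches rather than one network round trip per event.
producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "linger.ms": 50,                        # wait up to 50 ms to fill a batch
    "batch.num.messages": 10000,            # assumed batch size cap
})

def ingest(events):
    for event in events:
        # produce() is asynchronous; the client buffers and sends in the background.
        producer.produce("ingest-topic", value=event)
    producer.flush()  # block until the buffer is fully drained

ingest([b'{"user_id": 1}', b'{"user_id": 2}'])
```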

⏳ Reliability & Durability

Ensure uptime and no data loss through redundancy and failover. Balance cost vs. risk—design for resilience within reason.

🗃 Payload

Let's understand the data's characteristics: the payload is the dataset being ingested, and its key traits are kind, shape, size, schema and data types, and metadata.

Push vs. Pull vs. Poll

And here are some additional insights.

🔄 Streaming + Batch Coexist

Even with real-time ingestion, batch steps are common (e.g., model training, reports). Expect a hybrid approach.

🧱 Schema Awareness

Schemas change—new columns, types, or renames can silently break pipelines. Use schema registries to version and manage schemas reliably.
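A minimal sketch of schema awareness at ingestion time: compare incoming records against the expected schema and surface drift instead of loading it silently. The expected fields and types here are invented; in practice a schema registry would own and version this definition.

```python
# Expected schema for incoming records (assumed fields and types;
# in practice this would be fetched from a schema registry).
EXPECTED_SCHEMA = {"user_id": int, "action": str, "ts": str}

def check_schema(record: dict) -> list:
    """Return a list of drift issues: missing, mistyped, or unexpected fields."""
    issues = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            issues.append(f"type change on {field}: got {type(record[field]).__name__}")
    for field in record.keys() - EXPECTED_SCHEMA.keys():
        issues.append(f"new field: {field}")  # candidate for a schema version bump
    return issues

print(check_schema({"user_id": 42, "action": "login", "ts": "2024-01-01", "plan": "pro"}))
# -> ['new field: plan']
```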

🗂️ Metadata Matters

Without rich metadata, raw data can become a data swamp. Proper tagging and descriptions are critical for usability.

Batch Ingestion

If we go the batch route, here are some things to keep in mind. Batch ingestion moves data in bulk, usually triggered by a time interval or data size. It's widely used in traditional ETL and for transferring data from stream processors to long-term storage like data lakes.
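A sketch of that "time interval or data size" trigger in plain Python; the thresholds and the `write_batch` sink are hypothetical placeholders.

```python
import time

BATCH_MAX_RECORDS = 1_000   # size-based trigger (assumed threshold)
BATCH_MAX_SECONDS = 60      # time-based trigger (assumed threshold)

def write_batch(records):
    # Hypothetical sink: in practice this might write a Parquet file to object storage.
    print(f"flushing {len(records)} records")

def batch_ingest(source):
    buffer, last_flush = [], time.monotonic()
    for record in source:
        buffer.append(record)
        too_big = len(buffer) >= BATCH_MAX_RECORDS
        too_old = time.monotonic() - last_flush >= BATCH_MAX_SECONDS
        if too_big or too_old:
            write_batch(buffer)
            buffer, last_flush = [], time.monotonic()
    if buffer:  # flush whatever remains at the end of the run
        write_batch(buffer)

batch_ingest({"n": i} for i in range(2_500))
```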

Snapshot vs. Differential Extraction

File-Based Export and Ingestion

ETL vs. ELT

We defined ETL before. In ELT, data is extracted and loaded into the target system (typically a data warehouse or data lake) largely as-is, and the transformation happens afterward inside that system using its own compute.

Choose based on system capabilities and transformation complexity.
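A minimal ELT sketch using SQLite as a stand-in for a cloud warehouse (table and column names are illustrative): raw rows are loaded first, and the transformation runs afterward as SQL inside the target system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Extract + Load: land the raw data as-is, no cleaning yet.
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "19.99", "us"), (2, "5.00", "US"), (3, "bad", "de")],
)

# Transform: done inside the target system, after loading.
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_id,
           CAST(amount AS REAL) AS amount,
           UPPER(country)       AS country
    FROM raw_orders
    WHERE amount GLOB '[0-9]*'   -- drop rows that fail a basic sanity check
""")
print(conn.execute("SELECT * FROM orders").fetchall())
```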

📥 Inserts, Updates, and Batch Size

🔄 Data Migration

Large migrations (TBs+) involve moving full tables or entire systems.

Key challenges include the sheer volume of data, schema differences between the old and new systems, and moving the pipeline connections that feed the old system.

Use staging via object storage and test sample loads before full migration. Also consider migration tools instead of writing from scratch.

📨 Message and Stream Ingestion

Event-based ingestion is common in modern architectures. This section covers best practices and challenges to watch for when working with streaming data.

🧬 Schema Evolution

Schema changes (added fields, type changes) can break pipelines.

Here is what you can do: version schemas in a schema registry, route messages that fail to parse into a dead-letter queue, and keep regular communication with upstream teams about upcoming schema changes.

🕓 Late-Arriving Data

🔁 Ordering & Duplicate Delivery

⏪ Replay

Replay lets you reprocess historical events within a time range.

⏳ Time to Live (TTL)

TTL defines how long events are retained before being discarded. It's the maximum message retention time, which is helpful to reduce backpressure.

Short TTLs can cause data loss; long TTLs can create backlogs.

Examples: Amazon Kinesis Data Streams supports retention of up to 365 days, while Kafka retention is configurable and can be effectively indefinite, bounded only by disk space.
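A toy sketch of what TTL means for a consumer (the 15-minute window is an arbitrary assumption): events older than the retention window are no longer available, so they are dropped rather than processed.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(minutes=15)  # assumed retention window

def still_retained(event_time, now=None):
    """True if the event is still inside its TTL and can be consumed."""
    now = now or datetime.now(timezone.utc)
    return now - event_time <= TTL

now = datetime.now(timezone.utc)
print(still_retained(now - timedelta(minutes=5)))   # True: still in the window
print(still_retained(now - timedelta(minutes=30)))  # False: expired, effectively lost
```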

📏 Message Size

Be mindful of max size limits: Amazon Kinesis caps individual records at 1 MB, and Kafka's default maximum message size is about 1 MB, though it can be configured higher.

🧯 Error Handling & Dead-Letter Queues

Invalid or oversized messages should be routed to a dead-letter queue.
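A sketch of dead-letter routing; `process`, `main_sink`, and `dead_letter_sink` are hypothetical placeholders for your parsing logic and queue clients, and the size limit is an assumption.

```python
import json

MAX_MESSAGE_BYTES = 1_000_000  # assumed size limit

def process(message: bytes) -> dict:
    # Hypothetical processing step: parse and validate the message.
    record = json.loads(message)
    if "user_id" not in record:
        raise ValueError("missing user_id")
    return record

def consume(messages, main_sink, dead_letter_sink):
    for msg in messages:
        # Oversized or unparseable messages go to the dead-letter queue with a
        # reason attached, instead of blocking or crashing the main flow.
        if len(msg) > MAX_MESSAGE_BYTES:
            dead_letter_sink({"reason": "too large", "payload_bytes": len(msg)})
            continue
        try:
            main_sink(process(msg))
        except (json.JSONDecodeError, ValueError) as exc:
            dead_letter_sink({"reason": str(exc), "payload": msg.decode(errors="replace")})

consume([b'{"user_id": 1}', b'not json'], main_sink=print, dead_letter_sink=print)
```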

🔄 Consumer Models: Pull vs. Push

Pull is default for data engineering; push is used for specialized needs.

🌍 Ingestion Location & Latency

📥 Ways to Ingest Data

There are many ways to ingest data—each with its own trade-offs depending on the source system, use case, and infrastructure setup.

🧩 Direct Database Connections (JDBC/ODBC)

JDBC/ODBC are standard interfaces for pulling data directly from databases.

JDBC is Java-native and widely portable; ODBC is system-specific. These connections can be parallelized for performance but are row-based and struggle with nested/columnar data.

Many modern systems now favor native file export (e.g., Parquet) or REST APIs instead.
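A sketch of an ODBC pull with pyodbc (the connection string, table, and chunk size are assumptions, and `load_rows` is a hypothetical loader); fetching in chunks keeps memory bounded, and the same pattern can be parallelized across key ranges.

```python
import pyodbc  # requires an ODBC driver for the source database

def load_rows(rows):
    # Hypothetical loader: in practice this writes to staging storage.
    print(f"loaded {len(rows)} rows")

# Assumed connection string; real credentials belong in secrets management.
conn = pyodbc.connect("DSN=source_db;UID=reader;PWD=secret")
cursor = conn.cursor()
cursor.execute("SELECT order_id, amount, updated_at FROM orders")

while True:
    rows = cursor.fetchmany(10_000)  # row-based: fine for tabular data,
    if not rows:                     # awkward for nested or columnar payloads
        break
    load_rows(rows)

conn.close()
```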

🔄 Change Data Capture (CDC)

Quick definitions: CDC (change data capture) extracts change events (inserts, updates, deletes) from a source database so they can be replayed downstream. Query-based CDC periodically polls for rows changed since a watermark (e.g., an updated_at column), while log-based CDC reads the database's transaction log and captures every change with less load on the source.
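A sketch of the query-based flavor described above, using an `updated_at` column as a high-watermark (SQLite stands in for the source database; table and column names are assumptions). Log-based CDC with a tool such as Debezium is generally more complete, but harder to show in a few lines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the source database
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Ada",   "2024-01-01T10:00:00"),
    (2, "Grace", "2024-01-02T09:30:00"),
])

def pull_changes(conn, last_watermark):
    """Query-based CDC: fetch only rows modified since the last run."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

changes, watermark = pull_changes(conn, "2024-01-01T12:00:00")
print(changes)    # only Grace's row; Ada's predates the watermark
print(watermark)  # carried forward to the next incremental run
```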

🌐 APIs

APIs are a common ingestion method from external systems.

📨 Message Queues & Event Streams

Use systems like Kafka, Kinesis, or Pub/Sub to ingest real-time event data.

Design for low latency, high throughput, and consider autoscaling or managed services to reduce ops burden.

🔌 Managed Data Connectors

Services like Fivetran, Airbyte, and Stitch provide plug-and-play connectors.

🪣 Object Storage

Object storage (e.g., S3, GCS, Azure Blob) is great for moving files between teams and systems.

Use signed URLs for temporary access and treat object stores as secure staging zones for data.
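A sketch of generating a time-limited signed URL with boto3; the bucket and key are assumptions, and credentials come from the usual AWS configuration.

```python
import boto3

s3 = boto3.client("s3")

# Grant temporary, read-only access to a single object without sharing credentials.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "team-staging-bucket", "Key": "exports/orders.parquet"},
    ExpiresIn=3600,  # link is valid for one hour
)
print(url)  # hand this to the consuming team or system
```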

💾 EDI (Electronic Data Interchange)

This is a legacy format still common in business environments.

📤 File Exports from Databases

Large exports put load on source systems—use read replicas or key-based partitioning for efficiency.

Modern cloud data warehouses support direct export to object storage in formats like Parquet or ORC.

🧾 File Formats & Considerations

Avoid CSV when possible: it has no schema, cannot represent nested data, and is error-prone (delimiters, encodings, and type inference all cause trouble).

Prefer Parquet, ORC, Avro, or Arrow, which carry schema information and support complex structures; JSON handles nesting but provides no built-in schema.
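A quick comparison sketch with pandas (pyarrow installed for Parquet support; column names are made up): the Parquet round trip preserves types and nested values, while the CSV round trip flattens everything to text.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2],
    "tags": [["a", "b"], ["c"]],  # nested value
    "signup_ts": pd.to_datetime(["2024-01-01", "2024-02-01"]),
})

df.to_parquet("users.parquet")       # schema and nesting preserved (via pyarrow)
df.to_csv("users.csv", index=False)  # everything flattened to plain text

print(pd.read_parquet("users.parquet").dtypes)  # timestamps stay timestamps
print(pd.read_csv("users.csv").dtypes)          # timestamps and lists come back as strings
```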

💻 Shell, SSH, SCP, and SFTP

Shell scripts and CLI tools still play a big role in scripting ingestion pipelines.
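A sketch of pulling a vendor-dropped file over SFTP with paramiko; the host, user, key path, and file paths are assumptions, and in many pipelines the same step is just an `scp` or `sftp` call inside a shell script.

```python
import paramiko

# Assumed host and key; real values belong in configuration/secrets management.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("sftp.vendor.example.com", username="ingest",
               key_filename="/home/ingest/.ssh/id_rsa")

sftp = client.open_sftp()
sftp.get("/outbound/daily_export.csv", "/tmp/daily_export.csv")  # download the drop
sftp.close()
client.close()
```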

📡 Webhooks

Webhooks are often called a "reverse API": instead of you pulling data, the data provider pushes data to an endpoint that your service exposes.
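A minimal webhook receiver sketch with Flask; the route, payload shape, and `land_event` sink are assumptions. The provider POSTs events to this endpoint, and we land them for downstream ingestion rather than processing them inline.

```python
from flask import Flask, request

app = Flask(__name__)

def land_event(event):
    # Hypothetical sink: in practice, write to a queue or object storage so a
    # burst of webhook calls does not overwhelm this endpoint.
    print("received:", event)

@app.route("/webhooks/events", methods=["POST"])
def receive_event():
    event = request.get_json(silent=True)
    if event is None:
        return {"error": "expected a JSON body"}, 400
    land_event(event)
    return {"status": "accepted"}, 202

if __name__ == "__main__":
    app.run(port=8080)
```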

🌐 Web Interfaces & Scraping

🚚 Transfer Appliances

🤝 Data Sharing

Platforms like Snowflake, BigQuery, Redshift, and S3 allow read-only data sharing.

Summary 🥳

Here are some quotes from the book.

Moving data introduces security vulnerabilities because you have to transfer data between locations. Data that needs to move within your VPC should use secure endpoints and never leave the confines of the VPC.

Do not collect the data you don't need. Data cannot leak if it is never collected.

My summary is down below:

Ingestion is the stage in the data engineering lifecycle where raw data is moved from source systems into storage or processing systems.

Data can be ingested in batch or streaming modes, depending on the use case. Batch ingestion processes large chunks of data at set intervals or based on file size, making it ideal for daily reports or large-scale migrations. Streaming ingestion, on the other hand, continuously processes data as it arrives, making it suitable for real-time applications like IoT, event tracking, or transaction streams.

Designing ingestion systems involves careful consideration of factors like bounded vs. unbounded data, frequency, serialization, throughput, reliability, and the push/pull method of data retrieval.

Ingestion isn’t just about moving data—it’s about understanding the shape, schema, and sensitivity of that data to ensure it's usable downstream.

As Data Engineers we must track metadata, consider ingestion frequency vs. transformation frequency, and apply best practices for security, compliance, and cost.

We should also stay flexible: even legacy methods like EDI or manual downloads may still be part of real-world workflows.

The key is to choose ingestion patterns that match the needs of the business while staying robust, scalable, and future-proof.

