Appendix A. Serialization and Compression Technical Details
Modern data engineers, especially those working in the cloud, must understand how data is serialized, compressed, and deserialized in order to optimize pipeline performance.
Choosing the right formats and compression strategies can significantly reduce storage size, improve query performance, and support interoperability across systems.
Serialization Formats
- Row-Based Formats like CSV, JSON, and Avro store records one after another. CSV is fragile and inefficient, though still common. JSON is widely used for APIs and semistructured data, while Avro supports compact binary serialization with schema evolution, making it well suited to big data pipelines and RPC (see the Avro sketch after this list).
- Columnar Formats like Parquet and ORC store data by column, improving analytical query performance and compression. Parquet is widely supported and preferred in cloud ecosystems, while ORC is more common in Hadoop- and Hive-based stacks (see the Parquet/Arrow sketch after this list).
- Apache Arrow enables in-memory columnar processing and interoperability across languages (Python, Java, Rust, etc.), reducing the need for repeated serialization/deserialization and improving performance in real-time applications (also covered in the sketch after this list).
- Hybrid Formats such as Apache Hudi and Apache Iceberg combine transactional row-oriented writes with columnar reads, and support features like schema evolution, time travel, and efficient change data capture (CDC) in data lakes.
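To make the row-based side concrete, here is a minimal sketch of an Avro round trip in Python, assuming the third-party fastavro package; the schema and records are illustrative, not from any particular pipeline.

```python
# Minimal Avro round trip, assuming the third-party fastavro package
# (pip install fastavro). Schema and records are illustrative.
import io

from fastavro import parse_schema, reader, writer

# Avro files embed the writer schema, which is what enables schema
# evolution: readers resolve it against their own expected schema.
schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "action", "type": "string"},
    ],
})

records = [
    {"user_id": 1, "action": "login"},
    {"user_id": 2, "action": "purchase"},
]

buf = io.BytesIO()
writer(buf, schema, records)   # compact binary, record-by-record

buf.seek(0)
for rec in reader(buf):        # reader recovers the embedded schema
    print(rec)
```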
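On the columnar side, here is a sketch of Parquet and Arrow together, assuming the third-party pyarrow package; the file name and data are again illustrative.

```python
# Columnar write/read with Parquet plus an in-memory Arrow table,
# assuming the third-party pyarrow package (pip install pyarrow).
# Paths and data are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

# An Arrow table is the in-memory columnar representation; tools such
# as pandas and DuckDB can exchange it with little or no copying.
table = pa.table({
    "user_id": [1, 2, 3],
    "action": ["login", "purchase", "logout"],
})

# Parquet is the on-disk columnar format; compression applies per
# column chunk, so similar values compress well together.
pq.write_table(table, "events.parquet", compression="snappy")

# Column pruning: read back only the columns a query actually needs.
actions = pq.read_table("events.parquet", columns=["action"])
print(actions.column("action"))
```

Because Arrow is the in-memory format pyarrow uses throughout, the same table object can be handed to other Arrow-aware engines without re-serializing, which is the interoperability benefit the list above describes.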
Compression Techniques
- Compression reduces file size by encoding repeating patterns and redundancy. Algorithms like gzip and bzip2 achieve high compression ratios but are comparatively slow.
- Newer, faster options like Snappy, Zstandard, LZ4, and LZFSE are optimized for speed over compression ratio and are commonly used in modern data lakes and analytics databases (see the comparison sketch below).
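The trade-off is easy to observe directly. Below is a rough size/speed comparison sketch: gzip and bz2 ship with the Python standard library, while zstandard and lz4 are assumed third-party packages; the sample data is synthetic, so treat the output as indicative rather than a benchmark.

```python
# Rough size/speed comparison of compression codecs. gzip and bz2 are
# standard library; zstandard and lz4 are third-party packages
# (pip install zstandard lz4). Results vary with the data, so this is
# a sketch, not a benchmark.
import bz2
import gzip
import time

import lz4.frame
import zstandard

# Highly repetitive synthetic data, typical of log/CSV workloads.
data = b"timestamp,user_id,action\n" + b"2024-01-01,12345,login\n" * 100_000

codecs = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "zstd": zstandard.ZstdCompressor(level=3).compress,
    "lz4": lz4.frame.compress,
}

for name, compress in codecs.items():
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:>6}: {len(out):>9} bytes in {elapsed * 1000:.1f} ms")
```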
Storage Engines
Storage engines handle how data is physically arranged, indexed, and compressed.
Columnar storage is now standard in analytics systems, with modern engines optimized for SSDs, complex types, and structured queries.
Databases such as SQL Server, PostgreSQL, and MySQL offer pluggable or configurable storage engines, and database internals continue to evolve to better support today's workloads (see the sketch below).
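As one concrete example of pluggable storage, MySQL lets you select the engine per table. The sketch below assumes a reachable MySQL server and the third-party mysql-connector-python package; the connection details and table name are hypothetical.

```python
# Choosing a storage engine per table in MySQL. Assumes a reachable
# server and the third-party mysql-connector-python package
# (pip install mysql-connector-python); credentials, database, and
# table name below are hypothetical.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="demo", password="demo", database="demo"
)
cur = conn.cursor()

# The ENGINE clause selects the storage engine for this table.
cur.execute("CREATE TABLE events (id BIGINT, payload JSON) ENGINE=InnoDB")

# SHOW ENGINES lists the engines this server supports.
cur.execute("SHOW ENGINES")
for name, support, *_ in cur.fetchall():
    print(name, support)

conn.close()
```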
Key Takeaway
Understanding serialization and compression isn't optional; it's essential for designing fast, scalable, and reliable data systems.
Choosing the right format and compression algorithm can yield substantial performance gains and smoother interoperability across systems.