DDIA Chapter 3: Storage and Retrieval
Tony Duong
Mar 13, 2026 · 7 min
Overview
This chapter explores how databases internally store data and how they find it again when asked. Understanding storage engines helps you pick the right database for your workload and tune its performance. The two main families are log-structured engines and page-oriented engines.
Data Structures That Power Your Database
The Simplest Database: An Append-Only Log
The simplest storage engine is an append-only file. Writing is fast (O(1)), but reading requires scanning the whole file (O(n)). To make reads efficient, we need an index — an additional data structure derived from the primary data that acts as a signpost.
Key trade-off: Indexes speed up reads but slow down writes, because the index must be updated on every write. This is why databases don't index everything by default — you choose indexes based on your query patterns.
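As a toy sketch of the append-only approach (assuming a one-record-per-line comma-separated format; `AppendOnlyLog` is a hypothetical name, not any real engine's API):

```python
import os

# A minimal append-only key-value store: writes are O(1) appends;
# reads scan the whole file and keep the last value seen, so they are O(n).
class AppendOnlyLog:
    def __init__(self, path):
        self.path = path

    def set(self, key, value):
        with open(self.path, "a") as f:
            f.write(f"{key},{value}\n")  # append only, never overwrite

    def get(self, key):
        if not os.path.exists(self.path):
            return None
        result = None
        with open(self.path) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v  # later entries win
        return result
```

The linear-scan read is exactly what an index exists to avoid.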
Hash Indexes
The simplest indexing strategy: keep an in-memory hash map where every key maps to the byte offset in the data file. This is essentially what Bitcask (the default storage engine in Riak) does.
- Works well when you have a limited number of keys with frequent updates
- All keys must fit in memory
- Range queries are not efficient (you can't easily scan between `kitty00000` and `kitty99999`)
Compaction: Since the log is append-only, it grows forever. Compaction merges log segments by throwing away duplicate keys and keeping only the most recent value.
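A Bitcask-style sketch of both ideas (assumed simplifications: one file, no segments, text offsets; not Riak's actual implementation):

```python
import os

# Hash index over an append-only log: an in-memory map from key to the
# byte offset of that key's latest record, plus a compaction pass.
class HashIndexedLog:
    def __init__(self, path):
        self.path = path
        self.index = {}               # key -> offset of the latest record
        open(self.path, "a").close()  # ensure the file exists

    def set(self, key, value):
        with open(self.path, "a") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(f"{key},{value}\n")
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path) as f:
            f.seek(offset)            # one seek instead of a full scan
            line = f.readline()
        return line.rstrip("\n").partition(",")[2]

    def compact(self):
        # Rewrite the log keeping only the most recent value per key.
        latest = {k: self.get(k) for k in self.index}
        with open(self.path, "w") as f:
            self.index = {}
            for k, v in latest.items():
                self.index[k] = f.tell()
                f.write(f"{k},{v}\n")
```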
SSTables and LSM-Trees
Sorted String Tables (SSTables): Like hash-indexed segments, but key-value pairs are sorted by key.
Advantages over hash indexes:
- Merging segments is efficient (like merge sort)
- A sparse in-memory index suffices — you keep only some keys, find the nearest one below your target, and scan forward from its offset
- Blocks between index entries can be compressed
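The sparse-index idea can be sketched over an in-memory sorted segment (assumed format: a list of (key, value) pairs, indexing every Nth key; the function names are illustrative):

```python
import bisect

# Build a sparse index holding only every Nth key and its position.
def build_sparse_index(sorted_pairs, every=3):
    return [(k, i) for i, (k, _) in enumerate(sorted_pairs) if i % every == 0]

# Binary-search the sparse index, then scan forward from that position.
def sstable_get(sorted_pairs, sparse, key):
    keys = [k for k, _ in sparse]
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None                   # key sorts before the first entry
    start = sparse[pos][1]
    for k, v in sorted_pairs[start:]:
        if k == key:
            return v
        if k > key:
            return None               # keys are sorted, so stop early
    return None
```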
LSM-Trees (Log-Structured Merge-Trees): The full storage engine pattern:
- Writes go to an in-memory balanced tree (the memtable)
- When the memtable gets big enough, flush it to disk as an SSTable
- Reads check the memtable first, then the most recent on-disk segment, then the next-older, etc.
- Background merging and compaction keep the number of segments manageable
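The read path above can be sketched with plain dicts standing in for the memtable and the on-disk segments (newest segment last in the list; this ignores sparse indexes and everything else a real engine does):

```python
# Check the memtable first, then segments from newest to oldest.
def lsm_get(memtable, segments, key):
    if key in memtable:
        return memtable[key]          # freshest data wins
    for segment in reversed(segments):
        if key in segment:
            return segment[key]       # first (newest) hit is the answer
    return None
```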
Used by: LevelDB, RocksDB, Cassandra, HBase, Lucene (for its term dictionary).
Bloom filters are commonly used to avoid unnecessary disk reads for keys that don't exist.
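A minimal Bloom filter, just to show the "definitely absent" property (a toy sketch, not the tuned variants RocksDB or Cassandra actually use):

```python
import hashlib

# k hash functions set k bits per key; if any of a key's bits is unset,
# the key was definitely never added, so the disk read can be skipped.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means definitely absent; True means only "possibly present".
        return all(self.bits[p] for p in self._positions(key))
```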
B-Trees
The most widely used indexing structure. B-trees break the database into fixed-size pages (typically 4 KB) and read or write one page at a time — matching the underlying hardware.
- Each page is identified by an address (like a pointer on disk)
- A page contains keys and references to child pages
- The branching factor is typically several hundred
- A four-level B-tree with 4 KB pages and a branching factor of 500 can store up to 256 TB
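The 256 TB figure is a back-of-envelope upper bound (counting leaf pages at the full branching factor, in decimal terabytes):

```python
branching = 500
page_bytes = 4 * 1024                 # 4 KB pages
leaf_pages = branching ** 4           # pages addressable by a 4-level tree
capacity_bytes = leaf_pages * page_bytes
print(capacity_bytes / 10**12)        # 256.0 (TB)
```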
Key difference from LSM-trees: B-trees modify pages in place, while LSM-trees only append to files and never modify in place.
Write-ahead log (WAL): Before modifying any page, the database writes the change to an append-only log. This allows crash recovery.
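The WAL discipline can be sketched like this (assumed toy format: one comma-separated entry per change; the function names and in-memory `pages` dict are illustrative, not how any real engine lays out its log):

```python
import os

# Make the change durable in the log BEFORE touching the page, so a
# crash mid-update can be repaired by replaying the log.
def apply_with_wal(wal_path, pages, page_id, value):
    with open(wal_path, "a") as wal:
        wal.write(f"{page_id},{value}\n")
        wal.flush()
        os.fsync(wal.fileno())        # durable before the in-place update
    pages[page_id] = value            # now it is safe to modify the page

# Replay the log after a crash; later entries win.
def recover(wal_path, pages):
    with open(wal_path) as wal:
        for line in wal:
            page_id, _, value = line.rstrip("\n").partition(",")
            pages[page_id] = value
```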
B-Trees vs. LSM-Trees
| Aspect | LSM-Trees | B-Trees |
|---|---|---|
| Write throughput | Generally higher | Lower (every write updates the tree) |
| Write amplification | Can be high due to compaction | One write to WAL + one to page |
| Read performance | Slower (check multiple structures) | Faster (one definitive location) |
| Compression | Better (no page fragmentation) | Some wasted space in pages |
| Predictability | Compaction can interfere with reads | More predictable latency |
| Transactions | Harder (key may exist in multiple segments) | Easier (key exists in exactly one place) |
Other Indexing Structures
Secondary Indexes
Unlike primary indexes, secondary indexes don't uniquely identify a record. They can either:
- Store the list of matching row IDs (like a postings list)
- Have each entry point to the row's location
Storing Values Within the Index
- Clustered index: Stores the actual row data within the index (e.g., InnoDB primary key)
- Non-clustered index: Stores a reference (heap file pointer) to where the row lives
- Covering index: Stores some columns within the index to allow some queries to be answered by the index alone
Multi-Column Indexes
- Concatenated index: Appends columns in a defined order (like a phone book: last name, then first name)
- Multi-dimensional indexes: For queries that need to search on multiple columns simultaneously (e.g., latitude AND longitude). R-trees are a common choice.
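A concatenated index is essentially a sort on a tuple of columns, which is easy to see with hypothetical phone-book data:

```python
# Sorting tuples orders by last name first, then by first name
# within equal last names — exactly a concatenated index's key order.
people = [("Smith", "Ann"), ("Jones", "Bo"), ("Smith", "Al")]
people.sort()
print(people)  # [('Jones', 'Bo'), ('Smith', 'Al'), ('Smith', 'Ann')]
# A range scan on the leading column (last name) is efficient;
# a scan on first name alone cannot use this order.
```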
Full-Text Search and Fuzzy Indexes
Lucene uses an SSTable-like structure for its term dictionary but adds a finite state automaton (similar to a trie) to support fuzzy matching within a given edit distance.
In-Memory Databases
Keeping everything in memory (e.g., VoltDB, MemSQL, Redis):
- Faster not because they avoid disk reads (the OS page cache often handles that anyway) but because they avoid the overhead of encoding in-memory data structures into a disk format
- Can offer data structures that are hard to implement with disk-based indexes (e.g., Redis priority queues and sets)
- Durability can be achieved through WAL, snapshots, or replication
Transaction Processing vs. Analytics (OLTP vs. OLAP)
| Property | OLTP | OLAP |
|---|---|---|
| Read pattern | Small number of records, fetched by key | Aggregate over large number of records |
| Write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream |
| Primary users | End users via web application | Internal analysts |
| Dataset size | GB to TB | TB to PB |
| Bottleneck | Disk seek time | Disk bandwidth |
Data Warehousing
A separate read-optimized copy of the data, extracted from OLTP systems via ETL (Extract–Transform–Load). Uses a star schema (or snowflake schema):
- Fact table: Each row represents an event (e.g., a sale). Can be extremely large (billions of rows).
- Dimension tables: Describe the who, what, where, when, why of the event (e.g., product, customer, store, date).
Column-Oriented Storage
In a typical data warehouse query, you access a few columns from a table with hundreds of columns. Row-oriented storage wastes I/O by loading all columns.
Column-oriented storage stores all values from each column together. This makes analytical queries much faster because you only read the columns you need.
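The contrast between the two layouts, with hypothetical sales rows and in-memory lists standing in for column files:

```python
# Row-oriented: each record's fields are stored together.
rows = [
    {"date": "2024-01-01", "product_id": 1, "amount": 5},
    {"date": "2024-01-01", "product_id": 2, "amount": 3},
    {"date": "2024-01-02", "product_id": 1, "amount": 4},
]
# Column-oriented: one sequence per column, same row order in each.
columns = {name: [row[name] for row in rows] for name in rows[0]}
# An aggregate query now reads one column, not every field of every row.
total = sum(columns["amount"])
print(total)  # 12
```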
Column Compression
Columns often have many repeated values, making them very compressible:
- Bitmap encoding: One bitmap per distinct value in the column. Works well when the number of distinct values is small relative to the number of rows.
- Run-length encoding: Compress the bitmaps further.
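Both encodings can be sketched in a few lines (assumed run-length representation: a list of (bit, run_length) pairs):

```python
# One bitmap per distinct value: bit i is 1 where row i holds that value.
def bitmap_encode(column):
    return {v: [1 if x == v else 0 for x in column]
            for v in set(column)}

# Collapse each bitmap into (bit, run_length) pairs.
def run_length(bits):
    runs, prev, count = [], bits[0], 0
    for b in bits:
        if b == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = b, 1
    runs.append((prev, count))
    return runs
```

With a low-cardinality column like country codes, each bitmap is mostly zeros and compresses to a handful of runs.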
Compressed columns also enable vectorized processing — operating on chunks of compressed column data directly using CPU SIMD instructions.
Sort Order in Column Storage
Rows must be in the same order across all column files. You can choose a sort order (e.g., by date, then by product) to help common queries and improve compression of the first sort key.
Multiple sort orders: Store the same data sorted in different ways (like having multiple secondary indexes). Vertica uses this approach.
Writing to Column Storage
LSM-trees work well here: writes go to an in-memory store (row-oriented), then are flushed to disk in columnar format and merged with existing column files.
Aggregation: Materialized Views and Data Cubes
- Materialized view: A precomputed cache of common aggregates (e.g., SUM(amount) GROUP BY product_id, date). Updated when underlying data changes.
- Data cube (OLAP cube): A grid of aggregates along multiple dimensions. Very fast for precomputed queries but inflexible for ad-hoc analysis.
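A tiny materialized-view sketch over hypothetical sales data — the aggregate is maintained incrementally as writes arrive, so reads never rescan the raw rows:

```python
from collections import defaultdict

# Precompute SUM(amount) GROUP BY product_id.
sales = [(1, 5), (2, 3), (1, 2)]      # (product_id, amount)
view = defaultdict(int)
for product_id, amount in sales:
    view[product_id] += amount        # incremental maintenance on write
print(dict(view))  # {1: 7, 2: 3}
```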
Key Takeaways
- Two families of storage engines: Log-structured (LSM-trees) append data and merge in the background; page-oriented (B-trees) update fixed-size pages in place
- LSM-trees tend to have higher write throughput; B-trees tend to have more predictable read performance
- OLTP and OLAP have fundamentally different access patterns and are best served by different storage engines
- Column-oriented storage dramatically improves analytical query performance by only reading relevant columns and enabling aggressive compression
- Indexing is a trade-off: every index speeds up reads but adds overhead to writes — choose based on your query patterns
- In-memory databases win not by avoiding disk reads but by avoiding the overhead of disk-oriented data encoding