DDIA Chapter 2: Data Models and Query Languages
Tony Duong
Mar 12, 2026 Β· 4 min
Overview
Chapter 2 explores how we model data and how we query it. The choice of data model has a profound effect on how we think about the problem we're solving. Each model comes with assumptions about how data will be used, and those assumptions make some operations easy and others awkward.
Relational Model vs Document Model
The relational model (SQL) organizes data into tables of rows. It was proposed by Edgar Codd in 1970, became dominant by the mid-1980s, and remains the standard for most applications.
The document model (NoSQL) emerged in the 2010s, driven by:
- Need for greater scalability (large datasets, high write throughput)
- Preference for free and open-source software
- Specialized query operations not well supported by relational model
- Frustration with restrictive relational schemas β desire for a more dynamic, expressive data model
The Object-Relational Mismatch
Most application code uses objects or structs, but relational tables require an awkward translation layer (sometimes called impedance mismatch). ORMs help but don't fully hide the differences.
A rΓ©sumΓ© is a good example: in a relational database you'd need multiple tables (education, work experience, contact info) joined to a user. In a document model, the whole thing can be one JSON document β much closer to how the application sees it.
Relational vs Document Databases Today
| Aspect | Relational | Document |
|---|---|---|
| Schema | Schema-on-write (enforced) | Schema-on-read (flexible) |
| Joins | Natural, well-supported | Weak, often requires application-level joins |
| Many-to-one | Easy with foreign keys | Requires denormalization or app-level joins |
| Data locality | Spread across tables | Whole document stored together |
| Best for | Highly interconnected data | Self-contained documents, one-to-many trees |
Convergence: relational databases now support JSON/XML columns; document databases are adding join-like features. The models are becoming more similar over time.
Many-to-One and Many-to-Many Relationships
When you store a plain text value like "Greater Seattle Area" instead of an ID, you're duplicating information. Using an ID that references a canonical record (normalization) provides:
- Consistent spelling/formatting
- Easy updates (change in one place)
- Localization support
- Better search
This requires joins. Document databases historically haven't supported joins well, pushing joins into application code β which is slower and more complex.
As applications grow, data tends to become more interconnected, and the lack of join support becomes increasingly painful.
Query Languages for Data
Declarative vs Imperative
- Imperative (most programming languages): you tell the machine what steps to perform
- Declarative (SQL, CSS): you describe the pattern of the result you want, and the database optimizer figures out how to get it
Declarative languages are better for databases because:
- The optimizer can improve without changing queries
- They're easier to parallelize
- They hide implementation details
MapReduce
A programming model for processing large amounts of data across many machines. It's neither fully declarative nor fully imperative β you write map and reduce functions (which must be pure), and the framework handles distribution.
MongoDB supports a MapReduce-like query API, but also introduced the aggregation pipeline as a more declarative alternative.
Graph-Like Data Models
When many-to-many relationships dominate your data, it's natural to model it as a graph: vertices (nodes) connected by edges (relationships).
Examples: social networks, web links, road networks, fraud detection.
Property Graphs (Neo4j, etc.)
- Each vertex has a unique ID, a set of outgoing/incoming edges, and a collection of properties (key-value pairs)
- Each edge has a unique ID, tail vertex, head vertex, a label describing the relationship, and properties
- Any vertex can connect to any other vertex β no schema restricts which kinds of things can be connected
Query language: Cypher β a declarative language for property graphs.
MATCH (person)-[:BORN_IN]->(place)-[:WITHIN*0..]->(usa:Location {name: 'United States'})
RETURN person.name
Triple-Stores (RDF)
All information stored as (subject, predicate, object) triples. The subject is a vertex, and the object is either a primitive value (property) or another vertex (edge).
Query language: SPARQL β similar to Cypher but for triple-stores.
Datalog
The oldest of the three, foundational to later query languages. Uses a set of rules that can be combined and reused, making it powerful for complex queries.
Key Takeaways
- No single data model is best for everything β pick based on your access patterns
- Document databases shine when data is self-contained and relationships are mostly one-to-many
- Relational databases remain the best default for most applications with interconnected data
- Graph databases are ideal when anything can relate to anything
- Declarative query languages outperform imperative ones in databases because they enable optimization
- Data models are converging β relational databases add document features, document databases add relational features