Chapter 4: Data Modeling & Normalization
Learning Objectives
After completing this chapter, you will be able to:
- Design Entity-Relationship (ER) diagrams using Chen and Crow's Foot notation
- Identify entities, attributes, relationships, and cardinality constraints
- Map ER diagrams to relational tables
- Recognize data redundancy and its problems
- Apply normalization rules (1NF, 2NF, 3NF, BCNF) to eliminate anomalies
- Determine functional dependencies in a relation
- Explain when and why denormalization is appropriate
- Evaluate normalization trade-offs in real business systems
4.1 Introduction to Data Modeling
Before writing SQL or loading data, you need a blueprint. Data modeling is the process of designing the structure of a database: what data to store, how pieces of data relate to each other, and what rules the data must follow.
Think of it like an architect's floor plan. You would never start pouring concrete without a plan, and you should never start building tables without a data model.
Why Data Modeling Matters
| Concern | What Happens Without Modeling |
|---|---|
| Data redundancy | The same customer address stored in five tables |
| Update anomalies | Changing a price in one place but not another |
| Insert anomalies | Unable to add a new product until someone orders it |
| Delete anomalies | Deleting an order accidentally removes the only record of a customer |
Business Perspective
Poor data models cost real money. A 2023 Gartner report estimated that poor data quality costs organizations an average of $12.9 million per year. Much of this traces back to flawed database design.
4.2 The Entity-Relationship Model
The Entity-Relationship (ER) model, introduced by Peter Chen in 1976, is a conceptual tool for describing data requirements. It focuses on what data the business needs, not how it will be stored.
Entities
An entity is a thing or object in the real world that is distinguishable from other objects. Entities become tables in a relational database.
Examples: Customer, Product, Order, Employee, Department
Attributes
An attribute is a property or characteristic of an entity. Attributes become columns.
Types of attributes:
| Type | Description | Example |
|---|---|---|
| Simple | Cannot be subdivided | first_name, price |
| Composite | Can be broken into sub-parts | address (street, city, state, zip) |
| Derived | Computed from other attributes | age (from birth_date) |
| Multi-valued | Can hold multiple values | phone_numbers (home, work, cell) |
| Key attribute | Uniquely identifies the entity | customer_id |
Multi-Valued Attributes
Multi-valued attributes violate the atomic value rule from Chapter 2. During ER-to-relational mapping, they must be moved to a separate table.
Entity Relationships
A relationship is an association between two or more entities. In a business context, relationships represent verbs: a customer places an order, an employee works in a department, a product belongs to a category.
Cardinality
Cardinality describes the maximum number of entity instances that can participate in a relationship.
| Cardinality | Meaning | Example |
|---|---|---|
| 1:1 (one-to-one) | Each A is associated with at most one B | Employee has one Parking Spot |
| 1:M (one-to-many) | Each A can be associated with many Bs | Department employs many Employees |
| M:N (many-to-many) | Each A can be associated with many Bs and vice versa | Student enrolls in many Courses; each Course has many Students |
Participation Constraints
Participation describes the minimum number of instances required in a relationship.
- Total participation (mandatory): Every entity instance must participate. Shown as a double line in Chen notation.
- Partial participation (optional): Some instances may not participate. Shown as a single line.
Participation Example
In an e-commerce system:
- Every ORDER must be placed by a CUSTOMER (total participation of ORDER in the "places" relationship)
- A CUSTOMER may or may not have placed any orders (partial participation of CUSTOMER)
4.3 ER Diagrams
ER diagrams are the visual representation of the ER model. Two notations are widely used.
Chen Notation
Chen notation uses distinct shapes for each component:
1 2 3 4 5 6 7 | |
Shape legend:
| Shape | Meaning |
|---|---|
| Rectangle | Entity |
| Ellipse | Attribute |
| Diamond | Relationship |
| Double rectangle | Weak entity |
| Double ellipse | Multi-valued attribute |
| Dashed ellipse | Derived attribute |
Crow's Foot Notation
Crow's foot notation is more common in industry tools (MySQL Workbench, Lucidchart, draw.io). It shows cardinality using symbols at the ends of relationship lines.
1 2 3 4 5 6 7 8 | |
Crow's foot symbols:
| Symbol | Meaning |
|---|---|
\|\| (single bar) |
Exactly one (mandatory) |
o (circle) |
Zero (optional) |
> or crow's foot |
Many |
\|\|---\|\| |
One-to-one, both mandatory |
\|\|---o< |
One-to-many, many side optional |
Which Notation to Use?
- Chen notation is better for learning because each concept has its own visual shape.
- Crow's foot notation is better for implementation because it maps more directly to tables and is used by most database design tools.
In this course, we will use both. Exams may use either notation.
4.4 Weak Entities and Enhanced ER
Weak Entities
A weak entity cannot be uniquely identified by its own attributes alone. It depends on a strong (owner) entity for identification.
Weak Entity Example
An ORDER_ITEM cannot exist without its parent ORDER. Its identity depends on the combination of order_id (from ORDER) and line_number (its own partial key).
1 2 3 4 5 6 7 8 | |
The primary key of ORDER_ITEM is the composite key {order_id, line_number}.
Enhanced ER Model: Specialization and Generalization
The Enhanced ER (EER) model adds inheritance-like concepts from object-oriented design.
Specialization (top-down): Splitting a general entity into specialized sub-entities.
Generalization (bottom-up): Combining similar entities into a general super-entity.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | |
Constraints on specialization:
- Disjoint (d): An entity can belong to at most one subclass (an employee is either Salaried or Hourly, not both).
- Overlapping (o): An entity can belong to multiple subclasses.
- Total: Every entity in the superclass must belong to some subclass.
- Partial: Some entities may not belong to any subclass.
4.5 ER-to-Relational Mapping
Once your ER diagram is complete, you translate it into relational tables. Here are the mapping rules.
Rule 1: Strong Entities Become Tables
Each strong entity becomes a table. Each attribute becomes a column. The key attribute becomes the primary key.
1 2 3 4 5 6 7 8 9 | |
Rule 2: Weak Entities Become Tables with Composite Keys
The weak entity becomes a table. Its primary key is the combination of the owner's primary key (as a foreign key) and its own partial key.
1 2 3 4 5 6 7 8 9 | |
Rule 3: 1:1 Relationships
Add the primary key of one entity as a foreign key in the other. Prefer putting the FK on the side with total participation.
Rule 4: 1:M Relationships
Add the primary key of the "one" side as a foreign key on the "many" side.
1 2 3 | |
Rule 5: M:N Relationships Become Junction Tables
Create a new junction table (also called a bridge table or associative entity) with foreign keys referencing both entities. The composite of these foreign keys forms the primary key.
1 2 3 4 5 6 7 8 9 | |
Rule 6: Multi-Valued Attributes Become Tables
Create a separate table with the entity's PK as a FK plus the multi-valued attribute.
1 2 3 4 5 6 7 8 | |
Mapping Summary
| ER Concept | Relational Mapping |
|---|---|
| Strong entity | Table |
| Weak entity | Table with composite PK (owner PK + partial key) |
| 1:1 relationship | FK on either side (prefer total participation side) |
| 1:M relationship | FK on the "many" side |
| M:N relationship | Junction table with two FKs |
| Multi-valued attribute | Separate table |
| Composite attribute | Flatten into simple columns |
| Derived attribute | Omit from schema (compute at query time) |
4.6 Data Redundancy and Anomalies
Now that we can build tables, we need to build them well. Data redundancy -- storing the same fact in multiple places -- is the root cause of most database design problems.
A Badly Designed Table
Consider this single table for an e-commerce company:
1 2 3 4 5 6 7 8 9 | |
Problems with this design:
- Update anomaly: If Alice moves from Chicago to Champaign, we must update every row where she appears. Miss one and the data is inconsistent.
- Insert anomaly: We cannot add a new customer until they place an order, because
order_idis required. - Delete anomaly: If we delete order 1002, we lose all information about Bob Smith entirely.
Functional Dependencies
A functional dependency (FD) exists when the value of one attribute (or set of attributes) determines the value of another.
Notation: A -> B means "A determines B" or "B is functionally dependent on A."
Examples from the table above:
order_id, product_name -> qty(an order and product determine the quantity)cust_id -> cust_name, cust_city(customer ID determines name and city)product_name -> price(product name determines price, assuming one price per product)
How to Find Functional Dependencies
Ask yourself: "If I know the value of X, can I determine exactly one value of Y?" If yes, then X -> Y.
- Knowing
cust_idtells you exactly onecust_name. Socust_id -> cust_name. - Knowing
order_iddoes NOT tell you exactly oneproduct_name(an order can have many products). Soorder_id -> product_nameis NOT a valid FD.
4.7 Normalization: First Normal Form (1NF)
Normalization is the process of organizing tables to reduce redundancy and eliminate anomalies. We apply a series of "normal forms," each building on the previous one.
1NF: Atomic Values
A table is in First Normal Form (1NF) if:
- Every column contains only atomic (indivisible) values
- There are no repeating groups or arrays
- Each row is unique (has a primary key)
Violation of 1NF:
1 2 3 4 5 6 7 | |
Fixed (1NF):
1 2 3 4 5 6 7 8 9 10 | |
Common 1NF Mistake
Storing comma-separated values in a column (like "Laptop, Mouse") is a very common real-world mistake, especially in Excel-to-database migrations. It makes filtering, joining, and aggregating extremely difficult.
4.8 Second Normal Form (2NF)
A table is in Second Normal Form (2NF) if:
- It is in 1NF, AND
- Every non-key attribute is fully dependent on the entire primary key (no partial dependencies)
2NF only matters when the primary key is composite (multi-column). If the PK is a single column, a 1NF table is automatically in 2NF.
Identifying Partial Dependencies
Return to our ORDER_DETAILS table. Suppose the primary key is {order_id, product_name}:
1 2 3 4 | |
Partial dependencies (non-key attribute depends on only part of the PK):
order_id -> cust_id, cust_name, cust_city(depends on onlyorder_id, not the full PK)product_name -> price(depends on onlyproduct_name, not the full PK)
Full dependency:
{order_id, product_name} -> qty(depends on the entire PK)
Fixing Partial Dependencies
Remove partially dependent attributes into their own tables:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | |
Now every non-key attribute depends on the full primary key of its table. The tables are in 2NF.
4.9 Third Normal Form (3NF)
A table is in Third Normal Form (3NF) if:
- It is in 2NF, AND
- No non-key attribute depends on another non-key attribute (no transitive dependencies)
Identifying Transitive Dependencies
Look at the ORDERS table. We still have customer info mixed in:
1 2 3 4 | |
The dependency chain is: order_id -> cust_id -> cust_name, cust_city
Here cust_name and cust_city depend on cust_id, which is not the primary key. This is a transitive dependency.
Fixing Transitive Dependencies
Move the transitively dependent attributes to their own table:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | |
The Complete 3NF Design
Here is our fully normalized e-commerce schema:
1 2 3 4 5 6 7 8 9 10 11 | |
Each fact is stored exactly once. Update Alice's city in one row. Delete an order without losing customer data. Add a new product without needing an order.
Step-by-Step Normalization Summary
Starting from the flat ORDER_DETAILS table:
- 1NF: Made all values atomic, established composite PK
{order_id, product_name} - 2NF: Removed partial dependencies -- split out PRODUCT (depends on
product_nameonly) and ORDERS (depends onorder_idonly), keeping ORDER_LINE for fully dependent attributes - 3NF: Removed transitive dependency -- split CUSTOMER out of ORDERS because
cust_nameandcust_citydepend oncust_id, not onorder_id
4.10 Boyce-Codd Normal Form (BCNF)
BCNF is a stricter version of 3NF. A table is in BCNF if:
- For every functional dependency
X -> Y, X is a superkey (can uniquely identify the entire row)
In most practical cases, 3NF and BCNF are equivalent. They differ only when a table has overlapping composite candidate keys.
When BCNF Differs from 3NF
Consider a TEACHING table where each course section is taught by one professor, and each professor teaches only one course (but a course can have many professors):
1 2 3 4 | |
Candidate keys: {student, course} and {student, professor}
FD: professor -> course (each professor teaches one course)
This is in 3NF but NOT in BCNF because professor is not a superkey. The fix is to decompose into two tables.
For this course, achieving 3NF is sufficient for most business database designs. BCNF is worth knowing for edge cases and interviews.
4.11 Denormalization and Trade-offs
Why Denormalize?
Normalization reduces redundancy but increases the number of tables. Queries that need data from many tables require JOINs, which can be slower on large datasets.
Denormalization is the intentional introduction of controlled redundancy to improve read performance.
When to Denormalize
| Scenario | Rationale |
|---|---|
| Reporting and analytics | Dashboard queries that JOIN 8 tables are too slow |
| Read-heavy workloads | Application reads data 100x more often than it writes |
| Caching derived values | Storing order_total instead of recomputing SUM(qty * price) every time |
| Data warehousing | Star schema designs intentionally denormalize for query speed |
When NOT to Denormalize
| Scenario | Rationale |
|---|---|
| Transactional systems | Order entry, banking -- data integrity is paramount |
| Write-heavy workloads | Redundant data means more updates to keep in sync |
| Small datasets | JOINs on small tables are fast; denormalization adds complexity for no gain |
Denormalize by Choice, Not by Accident
Denormalization should be a deliberate design decision with documented rationale. A table that is "not normalized" because nobody thought about it is not "denormalized" -- it is poorly designed. Always normalize first, then selectively denormalize where performance data justifies it.
Normalization Trade-offs Summary
| Factor | More Normalized | More Denormalized |
|---|---|---|
| Data redundancy | Low | Higher |
| Data integrity | Easier to maintain | Requires extra logic |
| Write performance | Better (fewer updates) | Worse (update multiple copies) |
| Read performance | May need many JOINs | Faster (fewer JOINs) |
| Storage | Less | More |
| Schema complexity | More tables | Fewer tables |
| Best for | OLTP (transactions) | OLAP (analytics) |
Key Takeaways
- Data modeling comes before implementation -- design your ER diagram before writing CREATE TABLE statements
- Entities are nouns, relationships are verbs -- Customer places Order, Employee works in Department
- Cardinality (1:1, 1:M, M:N) determines how tables connect -- M:N relationships always require a junction table
- Normalization eliminates redundancy step by step -- 1NF (atomic values), 2NF (no partial dependencies), 3NF (no transitive dependencies)
- Functional dependencies drive normalization -- ask "does knowing X uniquely determine Y?"
- 3NF is the standard target for transactional business databases
- Denormalization is a performance optimization -- normalize first, denormalize selectively with justification
Review Questions
- Explain the difference between Chen notation and Crow's Foot notation for ER diagrams. When would you use each?
- Given a university system with Students, Courses, and Professors, identify the entities, key attributes, and the cardinality of each relationship.
- What is a functional dependency? Give two examples from a typical HR database.
- Walk through the differences between 1NF, 2NF, and 3NF. Why must you achieve each form in order?
- A data warehouse team wants to store
customer_namedirectly in the ORDERS table to speed up reports. Is this denormalization justified? What risks does it introduce?
Practical Exercise
Starting from this unnormalized INVOICE table, normalize it to 3NF:
1 2 3 4 5 6 7 8 9 | |
- List all functional dependencies you can identify
- Is this table in 1NF? Why or why not?
- Decompose into 2NF tables. Show the resulting tables with sample data.
- Decompose further into 3NF tables. Show the resulting tables with sample data.
- Write the
CREATE TABLEstatements (with primary and foreign keys) for your final 3NF design. - Identify one scenario where you might denormalize part of this design and explain why.
Next Steps
In Chapter 5, we'll shift from database design to data analysis, learning how Python and pandas connect to databases and enable powerful data manipulation beyond what SQL alone can do.
Corresponds to Week 3 of BADM 554 -- Data Modeling & Normalization