Chapter 3b: JOINs, Subqueries & Window Functions
Learning Objectives
After completing this chapter, you will be able to:
- Insert, update, and delete data using DML write operations
- Modify table structures with ALTER TABLE
- Group and filter aggregated data using GROUP BY and HAVING
- Combine data from multiple tables using INNER, LEFT, RIGHT, and Self JOINs
- Write subqueries (scalar, correlated, and EXISTS) to solve multi-step problems
- Simplify complex queries with Common Table Expressions (CTEs)
- Apply window functions (ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG) for analytics
- Combine JOINs, CTEs, and window functions to answer sophisticated business questions
Our Example Database
Every example in this chapter uses the same four-table e-commerce schema. If you completed the normalization exercise in Chapter 4, these tables will look familiar -- they are the end result of normalizing a flat order spreadsheet into 3NF.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | |
Notice that customer C05 (Eva Chen) has never placed an order. This is intentional -- it will matter when we discuss LEFT JOINs.
3b.1 DML Write Operations: INSERT, UPDATE, DELETE
Chapter 3 covered reading data with SELECT. Real databases also need to write data. The three DML write statements are INSERT, UPDATE, and DELETE.
INSERT -- Adding Rows
1 2 3 4 5 6 7 8 9 10 | |
Column List Best Practice
Always list the columns explicitly in your INSERT statement. Writing INSERT INTO customer VALUES (...) works but breaks if the table structure ever changes.
UPDATE -- Modifying Existing Rows
1 2 3 4 5 6 7 8 9 | |
Always Include a WHERE Clause
Running UPDATE product SET price = 0 without a WHERE clause sets every product's price to zero. This is one of the most common and costly SQL mistakes. Always double-check your WHERE clause before executing an UPDATE.
DELETE -- Removing Rows
1 2 3 4 5 6 7 | |
DELETE is Permanent
Unlike a spreadsheet, there is no "undo" for DELETE in SQL. In production, many teams use "soft deletes" -- setting an is_deleted flag instead of removing rows -- so data can be recovered.
3b.2 ALTER TABLE -- Changing Table Structure
After a table is created, you can modify its structure with ALTER TABLE.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | |
| Operation | Syntax | Use Case |
|---|---|---|
| Add column | ADD COLUMN name type |
New business requirement |
| Modify column | MODIFY COLUMN name new_type |
Change data type or size |
| Rename column | RENAME COLUMN old TO new |
Clarify naming |
| Drop column | DROP COLUMN name |
Remove unused field |
| Add constraint | ADD CONSTRAINT name ... |
Enforce new rule |
DDL vs DML
ALTER TABLE is a DDL (Data Definition Language) command -- it changes the table's structure. INSERT, UPDATE, and DELETE are DML (Data Manipulation Language) commands -- they change the table's data. This distinction matters for transactions and permissions.
3b.3 GROUP BY and HAVING
Chapter 3 introduced aggregate functions (COUNT, SUM, AVG, MIN, MAX). GROUP BY lets you apply those functions per group rather than across the entire table.
GROUP BY Basics
1 2 3 4 | |
Result:
1 2 3 4 5 6 7 8 | |
Multiple Aggregations in One Query
1 2 3 4 5 6 7 8 9 10 11 12 | |
The GROUP BY Rule
Every column in your SELECT must either appear in the GROUP BY clause or be inside an aggregate function. If you SELECT customer_id and COUNT(*), you must GROUP BY customer_id. Violating this rule produces an error in most databases (MySQL in strict mode, PostgreSQL, SQL Server).
HAVING -- Filtering Groups
WHERE filters individual rows before grouping. HAVING filters groups after aggregation.
1 2 3 4 5 | |
Result:
1 2 3 4 5 6 | |
WHERE vs HAVING
| Clause | Filters | Runs | Can Use Aggregates? |
|---|---|---|---|
| WHERE | Individual rows | Before GROUP BY | No |
| HAVING | Groups | After GROUP BY | Yes |
1 2 3 4 5 6 7 8 9 10 | |
Execution Order
SQL does not execute in the order you write it. The actual execution order is:
- FROM -- identify the tables
- WHERE -- filter individual rows
- GROUP BY -- form groups
- HAVING -- filter groups
- SELECT -- compute output columns
- ORDER BY -- sort results
- LIMIT -- restrict row count
This is why you cannot use a column alias from SELECT inside a WHERE clause -- SELECT has not run yet.
3b.4 JOINs -- Combining Tables
JOINs are what make relational databases relational. They let you combine rows from two or more tables based on a related column (typically a foreign key relationship).
INNER JOIN
An INNER JOIN returns only rows where the join condition matches in both tables.
1 2 3 4 | |
How it works:
1 2 3 4 5 6 7 8 9 10 11 | |
Result:
1 2 3 4 5 6 7 8 9 10 | |
LEFT JOIN (LEFT OUTER JOIN)
A LEFT JOIN returns all rows from the left table, plus matching rows from the right table. If there is no match, the right side columns are NULL.
1 2 3 4 | |
Result:
1 2 3 4 5 6 7 8 9 10 11 | |
Finding Missing Data with LEFT JOIN
A common pattern is LEFT JOIN + WHERE ... IS NULL to find records with no match:
1 2 3 4 5 | |
Result: Eva Chen (C05).
RIGHT JOIN (RIGHT OUTER JOIN)
A RIGHT JOIN is the mirror image of LEFT JOIN: it returns all rows from the right table, plus matching rows from the left.
1 2 3 4 | |
LEFT JOIN vs RIGHT JOIN
In practice, most SQL developers use LEFT JOIN almost exclusively. Any RIGHT JOIN can be rewritten as a LEFT JOIN by swapping the table order:
1 2 3 | |
Pick one convention and stick with it. LEFT JOIN is the industry standard.
JOIN Summary
| JOIN Type | Left Table Rows | Right Table Rows | NULL Fills |
|---|---|---|---|
| INNER JOIN | Only matching | Only matching | Never |
| LEFT JOIN | All | Only matching | Right side |
| RIGHT JOIN | Only matching | All | Left side |
| FULL OUTER JOIN | All | All | Both sides |
Joining Multiple Tables
Real queries often join three or more tables. Read them left to right, one JOIN at a time.
1 2 3 4 5 6 7 8 9 10 11 12 13 | |
Table Aliases
Use short aliases (o for orders, c for customer) to keep multi-table queries readable. Without aliases, you would write orders.order_id and customer.name everywhere, which gets verbose quickly.
3b.5 Self Joins
A self join joins a table to itself. This is useful when a table contains a hierarchical or reflexive relationship -- where a row references another row in the same table.
The Classic Example: Employee-Manager
1 2 3 4 5 6 7 | |
1 2 3 4 5 6 7 8 9 10 | |
1 2 3 4 5 6 | |
Result:
1 2 3 4 5 6 7 8 9 | |
The key insight: we use the same table twice with different aliases (e for the employee role, m for the manager role). The LEFT JOIN ensures the CEO (who has no manager) still appears in the results.
3b.6 Subqueries
A subquery is a query nested inside another query. Subqueries let you break complex problems into steps.
Scalar Subqueries (Return One Value)
A scalar subquery returns a single value and can be used anywhere a single value is expected.
1 2 3 4 | |
How it executes:
- Inner query runs first:
SELECT AVG(price) FROM productreturns 444.99 - Outer query becomes:
WHERE price > 444.99
Column Subqueries (Return a List)
A column subquery returns a set of values, used with IN, ANY, or ALL.
1 2 3 4 5 6 7 | |
Table Subqueries (Return a Table)
A table subquery appears in the FROM clause and acts as a temporary table.
1 2 3 4 5 6 7 8 9 10 11 12 | |
Derived Table Alias
MySQL requires that every subquery in the FROM clause has an alias (the AS customer_totals above). Forgetting this alias is a common syntax error.
Correlated Subqueries
A correlated subquery references a column from the outer query. It re-executes for every row of the outer query.
1 2 3 4 5 6 7 8 9 | |
The inner query references p.category from the outer query, so it runs once per product, computing the average for that product's specific category.
EXISTS -- Testing for Existence
EXISTS returns TRUE if the subquery returns any rows. It is often faster than IN for large datasets.
1 2 3 4 5 6 7 8 9 | |
1 2 3 4 5 6 7 8 | |
EXISTS vs IN
- EXISTS stops as soon as it finds one matching row. Efficient when the subquery table is large.
- IN collects all values from the subquery into a list, then checks membership. Can be simpler for small lists.
Rule of thumb: use EXISTS for correlated checks against large tables; use IN for short, known lists.
3b.7 Common Table Expressions (CTEs)
A Common Table Expression (CTE) is a named temporary result set defined with the WITH keyword. CTEs make complex queries readable by breaking them into logical steps.
Basic CTE Syntax
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | |
Compare this to the table subquery version in Section 3b.6. The CTE version reads top-to-bottom like a recipe: "First, calculate customer spending. Then, filter to big spenders."
Multiple CTEs
You can chain multiple CTEs separated by commas:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | |
CTEs vs Subqueries
CTEs and subqueries produce the same results. The advantages of CTEs:
- Readability: Named steps are easier to follow than deeply nested parentheses
- Reuse: You can reference a CTE multiple times in the same query
- Debugging: You can test each CTE independently by selecting from it
Use CTEs as your default for any query that would require more than one level of nesting.
3b.8 Window Functions
Window functions perform calculations across a set of rows related to the current row -- without collapsing those rows into a single group the way GROUP BY does.
Window Functions vs GROUP BY
| Feature | GROUP BY + Aggregate | Window Function |
|---|---|---|
| Rows in result | One row per group | Every original row preserved |
| Syntax | GROUP BY col |
OVER (PARTITION BY col) |
| Use case | Summary tables | Rankings, running totals, comparisons |
The OVER Clause
Every window function includes an OVER() clause that defines the "window" (the set of rows the function operates on).
1 2 3 4 5 6 | |
Result:
1 2 3 4 5 6 7 8 9 | |
All five rows are preserved. The OVER () with empty parentheses means "the window is the entire table."
PARTITION BY -- Windows Within Groups
PARTITION BY divides rows into groups (partitions) for the window function, like GROUP BY but without collapsing rows.
1 2 3 4 5 6 7 | |
Result:
1 2 3 4 5 6 7 8 9 | |
ROW_NUMBER -- Sequential Numbering
ROW_NUMBER assigns a unique sequential integer to each row within a partition.
1 2 3 4 5 6 7 8 9 10 11 | |
Result:
1 2 3 4 5 6 7 8 9 10 | |
RANK and DENSE_RANK
RANK and DENSE_RANK assign rankings based on the ORDER BY expression. They differ in how they handle ties.
1 2 3 4 5 6 | |
To illustrate the difference, assume two products share the same price:
1 2 3 4 5 6 7 8 9 | |
| Function | Ties | Next Rank After Tie |
|---|---|---|
| ROW_NUMBER | No ties (arbitrary tiebreak) | Always increments by 1 |
| RANK | Same rank for ties | Skips numbers (1, 1, 3) |
| DENSE_RANK | Same rank for ties | No gaps (1, 1, 2) |
When Ties Matter
Imagine three students tied for 2nd place in a competition:
- RANK: 1st, 2nd, 2nd, 2nd, 5th (skips 3rd and 4th)
- DENSE_RANK: 1st, 2nd, 2nd, 2nd, 3rd (no gap)
- ROW_NUMBER: 1st, 2nd, 3rd, 4th, 5th (no ties -- arbitrary among the tied students)
LEAD and LAG -- Accessing Adjacent Rows
LEAD looks at the next row; LAG looks at the previous row. These are invaluable for time-series analysis.
1 2 3 4 5 6 7 8 | |
Result:
1 2 3 4 5 6 | |
A practical application -- calculating days between consecutive orders:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | |
Window Function Summary
| Function | Purpose | Example Use Case |
|---|---|---|
| ROW_NUMBER() | Sequential numbering | Pagination, deduplication |
| RANK() | Ranking with gaps | Competition rankings |
| DENSE_RANK() | Ranking without gaps | Top-N without skipping |
| LEAD(col, n) | Value from next row | Forecasting, trend analysis |
| LAG(col, n) | Value from previous row | Period-over-period comparison |
| SUM() OVER | Running total | Cumulative revenue |
| AVG() OVER | Moving average | Smoothing time-series data |
3b.9 Putting It All Together
Real business questions rarely require just one technique. This section combines JOINs, CTEs, and window functions in a single query.
Business question: "Rank our customers by total spending and show each customer's percentage of overall revenue."
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | |
Result:
1 2 3 4 5 6 7 8 | |
Reading the Query
This query uses four techniques from this chapter:
- JOINs (three tables: orders, customer, order_line) to bring the data together
- GROUP BY with SUM to calculate total spending per customer
- CTE (
customer_revenue) to name and reuse the intermediate result - Window functions (RANK and SUM OVER) to rank customers and compute percentages
When building complex queries like this, develop and test each CTE independently before combining them. Start with Step 1, verify it produces correct numbers, then add Step 2, and so on.
Key Takeaways
- INSERT, UPDATE, DELETE complete the SQL data manipulation toolkit -- always include a WHERE clause with UPDATE and DELETE
- GROUP BY transforms rows into groups for aggregation; HAVING filters those groups after aggregation
- JOINs connect tables through foreign key relationships -- INNER JOIN keeps only matches, LEFT JOIN preserves all rows from the left table
- Subqueries nest queries inside queries -- use scalar subqueries for single values, correlated subqueries when the inner query depends on the outer
- CTEs (WITH clause) are named subqueries that make complex SQL readable and debuggable -- prefer them over deeply nested subqueries
- Window functions analyze rows without collapsing them -- ROW_NUMBER for sequencing, RANK for ordering, LEAD/LAG for comparing adjacent rows
- Real business queries combine all these tools -- build incrementally, test each piece, then assemble
Review Questions
- What is the difference between WHERE and HAVING? Write a query that uses both to find product categories with more than 2 products where the average price exceeds $100.
- Explain when you would use a LEFT JOIN instead of an INNER JOIN. Give a business scenario where using INNER JOIN would produce incorrect results.
- What is a correlated subquery? How does it differ from a regular subquery in terms of execution?
- Rewrite the following subquery as a CTE. Which version is more readable?
1 2 3 4 5
SELECT name FROM customer WHERE customer_id IN ( SELECT customer_id FROM orders WHERE order_date >= '2026-03-01' ); - Explain the difference between RANK() and DENSE_RANK(). In what business scenario would the distinction matter?
Practical Exercise
Using the four-table e-commerce database from this chapter, write SQL queries for each task:
-
INSERT: Add a new customer (C06, 'Frank Nguyen') and a new order (1007) for that customer with two products of your choice.
-
GROUP BY + HAVING: Find all products that have been ordered more than twice (across all orders). Show the product name and total quantity ordered.
-
LEFT JOIN: Produce a list of ALL customers showing their name and total amount spent. Customers who have never ordered should show $0.00. (Hint: use COALESCE to replace NULL with 0.)
-
CTE + Window Function: Write a query that shows each order with its total value, and includes a column ranking orders from most to least expensive. Use a CTE to calculate order totals, then apply RANK() in the final SELECT.
-
LEAD/LAG: For customers who have placed more than one order, show each order date alongside the number of days since their previous order.
Next Steps
In Chapter 4, we'll learn how to design databases using ER diagrams and normalization -- the principles that produce well-structured schemas like the e-commerce database used throughout this chapter.
Corresponds to Week 2 of BADM 554 -- JOINs, Subqueries & Window Functions