Chapter 5: Python & pandas for Data
Learning Objectives
After completing this chapter, you will be able to:
- Write basic Python code using variables, lists, dictionaries, and functions
- Use Jupyter Notebooks and Google Colab as interactive data workspaces
- Create and manipulate pandas Series and DataFrames
- Read CSV files and select data using loc and iloc
- Filter, group, and merge DataFrames for business analysis
- Handle missing data and convert data types in pandas
- Translate familiar SQL operations into equivalent pandas code
5.1 Why Python for Data?
In Chapters 3 and 4, you learned SQL for querying databases. SQL is powerful for retrieving and aggregating data, but real-world analysis often requires more: cleaning messy data, building visualizations, running statistical models, and automating repetitive tasks. That is where Python comes in.
Python is a general-purpose programming language that has become the dominant tool for data science and analytics. You do not need to become a software engineer to use it effectively. In this chapter, we cover just enough Python to work with data using the pandas library.
Python in the Business World
| Use Case | How Python Helps |
|---|---|
| Data Cleaning | Fix inconsistent formats, handle missing values, merge sources |
| Analysis & Reporting | Summarize data, compute metrics, automate recurring reports |
| Visualization | Create charts and dashboards with matplotlib, seaborn, Plotly |
| Machine Learning | Build predictive models with scikit-learn, TensorFlow |
| Automation | Schedule data pipelines, send alerts, update dashboards |
You Already Know the Concepts
If you can write a SQL query, you already understand filtering, grouping, and joining. Python and pandas use the same logic with different syntax.
5.2 Just Enough Python
This section covers the minimum Python you need to work with data. We are not teaching software engineering; we are teaching data manipulation.
Variables and Data Types
A variable stores a value so you can use it later. Python figures out the type automatically.
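For example (the variable names here are illustrative):

```python
# Python infers the type from the value -- no declarations needed
company = "Acme Corp"      # str  (text)
revenue = 1250000          # int  (whole number)
growth_rate = 0.12         # float (decimal)
is_public = True           # bool (True/False)

# type() reports the inferred type
print(type(revenue))       # <class 'int'>

# f-strings embed variables directly in text
print(f"{company} revenue: ${revenue:,}")
```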
No Semicolons, No Type Declarations
Unlike SQL or Java, Python does not require semicolons at the end of lines or explicit type declarations. Indentation (spaces) matters instead of curly braces.
Lists
A list is an ordered collection of items. Think of it as a single column of data.
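A short example (the revenue figures are illustrative):

```python
# A list holds ordered items -- like one column of data
monthly_revenue = [42000, 45500, 39800, 51200]

print(monthly_revenue[0])      # first item: 42000 (indexing starts at 0)
print(monthly_revenue[-1])     # last item: 51200
print(len(monthly_revenue))    # number of items: 4

monthly_revenue.append(48000)  # add an item to the end
print(sum(monthly_revenue))    # built-in aggregation over the list
```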
Dictionaries
A dictionary stores key-value pairs. Think of it as a single row of named data.
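For example (the customer fields are illustrative):

```python
# A dictionary maps keys to values -- like one named row
customer = {
    "name": "Alice Chen",
    "region": "East",
    "lifetime_value": 12500.0,
}

print(customer["name"])              # access a value by key
customer["is_active"] = True         # add a new key-value pair
print(customer.get("phone", "N/A"))  # .get() returns a default instead of an error

# A table is just a list of dictionaries
customers = [
    customer,
    {"name": "Bob Diaz", "region": "West", "lifetime_value": 8300.0},
]
```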
Lists vs Dictionaries
- List = ordered items accessed by position: `revenue[0]`
- Dictionary = named items accessed by key: `customer["name"]`
- A table is essentially a list of dictionaries (rows of named columns)
Functions
A function is a reusable block of code. You have already used functions in SQL (COUNT(), SUM()). Python functions work the same way.
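A minimal example (the function name and discount rate are illustrative):

```python
# def creates a reusable function
def apply_discount(price, rate=0.1):
    """Return the price after applying a discount rate (default 10%)."""
    return round(price * (1 - rate), 2)

print(apply_discount(100))        # uses the default rate: 90.0
print(apply_discount(100, 0.25))  # overrides the rate: 75.0
```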
Loops and Comprehensions
Loops let you repeat an operation for every item in a collection.
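For example (the figures are illustrative):

```python
monthly_revenue = [42000, 45500, 39800, 51200]

# A for loop repeats an operation for each item
for amount in monthly_revenue:
    print(f"${amount:,}")

# A list comprehension transforms a list in one line
with_tax = [round(amount * 1.08, 2) for amount in monthly_revenue]

# Comprehensions can also filter, like a WHERE clause
big_months = [amount for amount in monthly_revenue if amount > 43000]
```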
Comprehensions Replace Simple Loops
List comprehensions are a compact way to transform data. You will see them frequently in pandas code and data pipelines.
File I/O
Python can read and write files directly, though pandas usually handles this for data work.
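A small sketch (the filename and contents are illustrative):

```python
# Write a text file; the with-block closes the file automatically
with open("notes.txt", "w") as f:
    f.write("Q1 revenue up 12%\n")
    f.write("Q2 forecast pending\n")

# Read it back as a list of lines
with open("notes.txt") as f:
    lines = f.readlines()

print(len(lines))  # 2
```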
5.3 Jupyter Notebooks and Google Colab
What is a Jupyter Notebook?
A Jupyter Notebook is an interactive document that combines code, output, and explanations in a single file. Instead of writing a script and running it all at once, you work in cells that you execute one at a time.
This is ideal for data analysis because you can:
- Load data, inspect it, then decide what to do next
- See results immediately after each step
- Mix code cells with markdown cells for documentation
- Share your analysis as a readable document
Google Colab
Google Colab is a free, cloud-based Jupyter environment. You do not need to install anything. Open a browser, go to colab.research.google.com, and start coding.
| Feature | Jupyter (Local) | Google Colab |
|---|---|---|
| Setup | Install Python + Jupyter | None (browser only) |
| Cost | Free | Free (paid tiers for more GPU) |
| Collaboration | Manual sharing | Google Drive sharing |
| Libraries | Install yourself | pandas, numpy, etc. pre-installed |
| Best For | Production work, large datasets | Learning, quick analysis, sharing |
We Use Google Colab in This Course
All assignments and labs use Google Colab. You only need a Google account and a web browser.
Notebook Workflow
A typical data analysis notebook follows this pattern:
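A sketch of that pattern, one notebook cell per step (an inline `StringIO` stands in for a real file here so the example is self-contained; in Colab you would pass a filename or uploaded file to `pd.read_csv`):

```python
# Cell 1: imports
import pandas as pd
from io import StringIO

# Cell 2: load the data (in practice: pd.read_csv("sales.csv"))
csv_data = StringIO("region,amount\nEast,120\nWest,95\nEast,80\n")
df = pd.read_csv(csv_data)

# Cell 3: inspect before doing anything else
print(df.head())
print(df.shape)

# Cell 4: analyze one step at a time, checking each result
totals = df.groupby("region")["amount"].sum()
print(totals)
```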
5.4 Introduction to pandas
What is pandas?
pandas is a Python library for data manipulation and analysis. It provides two core data structures:
- Series: A single column of data (like a list with labels)
- DataFrame: A table of data (like a spreadsheet or SQL table)
pandas is to Python what Excel is to business users, but far more powerful and reproducible.
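To use pandas, import it once at the top of your notebook:

```python
# The conventional import -- "pd" is the universal alias
import pandas as pd
```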
Series
A Series is a one-dimensional labeled array. Think of it as a single column from a table.
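For example (the index labels and values are illustrative):

```python
import pandas as pd

# A Series is one labeled column of data
revenue = pd.Series([42000, 45500, 39800], index=["Jan", "Feb", "Mar"])

print(revenue["Feb"])   # access by label: 45500
print(revenue.mean())   # built-in statistics
print(revenue.max())    # 45500
```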
DataFrame
A DataFrame is a two-dimensional table with labeled rows and columns. This is the data structure you will use 90% of the time.
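A common way to build one is from a dictionary of columns (the product data here is illustrative). Printing the result shows the table with an automatic integer index (0, 1, 2) down the left side:

```python
import pandas as pd

# A DataFrame is a full table: column name -> column values
df = pd.DataFrame({
    "product": ["Widget", "Gadget", "Gizmo"],
    "price": [19.99, 34.50, 9.75],
    "units_sold": [120, 85, 300],
})

print(df)
```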
DataFrame = SQL Table = Excel Sheet
All three represent the same concept: rows and columns of structured data. The difference is how you interact with them:
- Excel: Click and drag
- SQL: Write queries against a database
- pandas: Write Python code in a notebook
5.5 Reading Data into pandas
Reading CSV Files
CSV (Comma-Separated Values) files are the most common format for sharing tabular data.
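The core call is `pd.read_csv`. In this self-contained sketch a `StringIO` stands in for the file; in practice you pass a filename such as `"sales.csv"`:

```python
import pandas as pd
from io import StringIO

# Stand-in for a file on disk; normally: df = pd.read_csv("sales.csv")
csv_text = StringIO(
    "order_id,region,amount\n"
    "1001,East,250.0\n"
    "1002,West,99.5\n"
)
df = pd.read_csv(csv_text)

# Useful options: parse_dates= for date columns, sep= for other delimiters,
# na_values= to treat strings like "N/A" as missing
print(df.shape)  # (2, 3)
```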
Inspecting Your Data
After loading data, always inspect it before doing anything else.
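The standard inspection calls, shown on a small illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "West", None],
    "amount": [250.0, 99.5, 80.0],
})

print(df.head())          # first 5 rows
print(df.shape)           # (rows, columns)
print(df.dtypes)          # type of each column
df.info()                 # columns, non-null counts, memory usage
print(df.isnull().sum())  # missing values per column
print(df.describe())      # summary statistics for numeric columns
```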
Always Inspect Before You Analyze
Skipping this step is how you end up with wrong results. Check for unexpected column names, wrong data types (numbers stored as strings), and missing values before writing any analysis code.
5.6 Selecting Data: loc and iloc
pandas provides two main ways to select data from a DataFrame.
Selecting Columns
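Single brackets return a Series; a list of names inside double brackets returns a DataFrame (the columns here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Widget", "Gadget"],
    "price": [19.99, 34.50],
    "units_sold": [120, 85],
})

one_col = df["price"]                # one name -> Series
two_cols = df[["product", "price"]]  # list of names -> DataFrame

print(type(one_col).__name__)   # Series
print(type(two_cols).__name__)  # DataFrame
```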
loc — Selection by Label
Use loc when you know the row labels or column names.
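For example, with product names as the row index (illustrative data):

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [19.99, 34.50, 9.75], "units_sold": [120, 85, 300]},
    index=["Widget", "Gadget", "Gizmo"],
)

print(df.loc["Gadget"])                    # one row, by label
print(df.loc["Widget", "price"])           # a single cell: 19.99
print(df.loc["Widget":"Gadget", "price"])  # label slices INCLUDE both endpoints
```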
iloc — Selection by Position
Use iloc when you want rows/columns by numeric position (0-indexed).
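The same DataFrame, selected by position instead of label:

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [19.99, 34.50, 9.75], "units_sold": [120, 85, 300]},
    index=["Widget", "Gadget", "Gizmo"],
)

print(df.iloc[0])     # first row
print(df.iloc[0, 0])  # first row, first column: 19.99
print(df.iloc[0:2])   # positional slices EXCLUDE the endpoint (rows 0 and 1)
```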
| Method | Selects By | Example | Returns |
|---|---|---|---|
| `df["col"]` | Column name | `df["price"]` | Series |
| `df[["a","b"]]` | Column names | `df[["price","units_sold"]]` | DataFrame |
| `df.loc[]` | Labels | `df.loc[0:2, "price"]` | By name |
| `df.iloc[]` | Positions | `df.iloc[0:3, 1]` | By number |
When in Doubt, Use loc
loc is explicit about what you are selecting by name. iloc is useful for quick positional slicing, but loc is safer for production code.
5.7 Filtering DataFrames
Filtering in pandas is conceptually identical to SQL's WHERE clause.
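A comparison on a column produces a boolean mask; indexing with that mask keeps only the matching rows (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "West", "East"],
    "revenue": [62000, 48000, 30000],
})

# The comparison alone is a boolean Series (the mask)
mask = df["revenue"] > 40000
print(mask.tolist())  # [True, True, False]

# Indexing with the mask keeps matching rows -- like WHERE revenue > 40000
high = df[df["revenue"] > 40000]

# .isin() matches against a list of values, like SQL's IN
east_west = df[df["region"].isin(["East", "West"])]
```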
Use & and | Instead of 'and' and 'or'
When combining conditions in pandas, use & (and) and | (or), not Python's and / or keywords. Always wrap each condition in parentheses.
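For example (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "West", "East"],
    "revenue": [62000, 48000, 30000],
})

# Each condition in parentheses, joined with & (and) or | (or)
both = df[(df["region"] == "East") & (df["revenue"] > 50000)]
either = df[(df["region"] == "West") | (df["revenue"] > 50000)]

print(len(both))    # 1
print(len(either))  # 2
```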
5.8 GroupBy: Aggregating Data
GroupBy in pandas works exactly like SQL's GROUP BY. It splits data into groups, applies a function to each group, and combines the results.
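For example (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "West", "East"],
    "revenue": [62000, 48000, 30000],
})

# Split by region, sum within each group, combine the results
totals = df.groupby("region")["revenue"].sum()
print(totals["East"])  # 92000

# .agg() applies several functions at once
summary = df.groupby("region")["revenue"].agg(["sum", "mean", "count"])
print(summary)
```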
GroupBy in Action
Suppose you have monthly sales data for a retail company:
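A sketch with hypothetical figures:

```python
import pandas as pd

# Hypothetical monthly sales for a retail company
sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "region": ["East", "West", "East", "West"],
    "revenue": [42000, 38000, 45500, 41000],
})

by_region = sales.groupby("region")["revenue"].sum()   # total per region
by_month = sales.groupby("month")["revenue"].mean()    # average per month

print(by_region.sort_values(ascending=False))
```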
5.9 Merging and Joining DataFrames
In SQL, you use JOIN to combine tables. In pandas, you use merge().
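For example, joining orders to customers on a shared key (illustrative data). Each order row in the result gains the matching customer's name:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [101, 102, 101],
    "amount": [250.0, 99.5, 80.0],
})
customers = pd.DataFrame({
    "customer_id": [101, 102],
    "name": ["Alice", "Bob"],
})

# Like: SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id
# (an inner join by default)
merged = pd.merge(orders, customers, on="customer_id")
print(merged)
```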
Types of Merges
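The `how` parameter selects the join type, exactly as in SQL (illustrative data; customer 103 has no customer record and customer 102 has no order):

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [101, 103], "amount": [250.0, 60.0]})
customers = pd.DataFrame({"customer_id": [101, 102], "name": ["Alice", "Bob"]})

inner = pd.merge(orders, customers, on="customer_id", how="inner")  # matches only
left = pd.merge(orders, customers, on="customer_id", how="left")    # all orders
outer = pd.merge(orders, customers, on="customer_id", how="outer")  # everything

print(len(inner), len(left), len(outer))  # 1 2 3
```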
Same Logic as SQL JOINs
If you understood INNER JOIN, LEFT JOIN, and FULL OUTER JOIN in Chapter 4, you already understand pandas merges. The only difference is syntax.
5.10 Data Types and Conversion
pandas assigns data types automatically when reading files, but sometimes it gets them wrong.
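A typical fix, using illustrative data where prices were read as strings:

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["$1,200", "$850"],          # read as strings because of $ and ,
    "date": ["2026-01-15", "2026-02-01"], # read as strings, not dates
})

print(df.dtypes)  # both columns show 'object'

# Strip the non-numeric characters, then convert with .astype()
df["price"] = (
    df["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# pd.to_datetime parses date strings into datetime64
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes)
```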
| pandas dtype | Python Type | SQL Equivalent | Example |
|---|---|---|---|
| `int64` | int | INT | 42, 1000 |
| `float64` | float | DECIMAL, FLOAT | 3.14, 99.99 |
| `object` | str | VARCHAR, TEXT | "Alice", "East" |
| `bool` | bool | BOOLEAN | True, False |
| `datetime64` | datetime | DATE, DATETIME | 2026-01-15 |
The 'object' Trap
When df.dtypes shows object, it almost always means string. If you expect a numeric column and see object, the column likely has non-numeric values (like "$1,200" or "N/A") that prevented automatic conversion. Clean those values first, then convert.
5.11 Handling Missing Data
Real-world datasets are messy. Missing values are the most common problem you will encounter.
Detecting Missing Data
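For example (illustrative data with one fully missing row):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "product": ["Widget", None, "Gizmo"],
    "amount": [250.0, np.nan, 80.0],
})

print(df.isnull())                   # True/False mask of missing cells
print(df.isnull().sum())             # missing count per column
print(df["amount"].notnull().sum())  # count of present values
```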
Strategies for Missing Data
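The three most common strategies, sketched on the same illustrative data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "product": ["Widget", None, "Gizmo"],
    "amount": [250.0, np.nan, 80.0],
})

# Strategy 1: drop rows containing any missing value
dropped = df.dropna()

# Strategy 2: fill numeric gaps with the column median (or mean)
filled_num = df.copy()
filled_num["amount"] = filled_num["amount"].fillna(filled_num["amount"].median())

# Strategy 3: fill text gaps with a placeholder category
filled_cat = df.copy()
filled_cat["product"] = filled_cat["product"].fillna("Unknown")
```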
Choosing a Strategy
- Drop rows when very few rows are missing and your dataset is large
- Fill with mean/median for numeric columns where missing is random
- Fill with a category (like "Unknown") for text columns
- Never silently ignore missing data — it will distort your results
5.12 pandas vs SQL: A Comparison
You now know two ways to work with tabular data. This table maps common SQL operations to their pandas equivalents.
| Operation | SQL | pandas |
|---|---|---|
| Select columns | `SELECT name, price FROM products` | `df[["name", "price"]]` |
| Filter rows | `WHERE price > 100` | `df[df["price"] > 100]` |
| Sort | `ORDER BY price DESC` | `df.sort_values("price", ascending=False)` |
| Limit rows | `LIMIT 10` | `df.head(10)` |
| Count | `SELECT COUNT(*) FROM products` | `len(df)` or `df.shape[0]` |
| Unique values | `SELECT DISTINCT region FROM sales` | `df["region"].unique()` |
| Aggregation | `SELECT region, SUM(revenue) FROM sales GROUP BY region` | `df.groupby("region")["revenue"].sum()` |
| Join | `SELECT * FROM orders JOIN customers ON ...` | `pd.merge(orders, customers, on="customer_id")` |
| NULL check | `WHERE email IS NULL` | `df[df["email"].isnull()]` |
| Alias | `SELECT price * 0.9 AS discounted` | `df["discounted"] = df["price"] * 0.9` |
When to Use Which
- SQL: Data lives in a database, you need to extract specific subsets, or the dataset is very large (millions+ rows)
- pandas: Data is in files (CSV, Excel), you need to clean or reshape it, or you want to combine analysis with visualization and modeling
- Both: Most real-world workflows start with SQL to extract data, then switch to pandas for analysis
Key Takeaways
- Python is a tool, not a destination — you need just enough to work with pandas and data libraries effectively
- Jupyter Notebooks are your data workspace — they let you explore data interactively, one step at a time
- The DataFrame is the core abstraction — it is the pandas equivalent of a SQL table or Excel spreadsheet
- pandas operations mirror SQL — filtering, grouping, joining, and aggregation work the same conceptually
- Always inspect your data first — use `head()`, `info()`, `dtypes`, and `isnull().sum()` before any analysis
- Missing data must be handled explicitly — ignoring it leads to wrong results and broken pipelines
- SQL and pandas are complementary — use SQL to get data out of databases, pandas to clean and analyze it
Review Questions
- What is the difference between a pandas Series and a DataFrame? How does each relate to a SQL table?
- When would you use `loc` versus `iloc` to select data from a DataFrame?
- Write pandas code to filter a DataFrame for rows where the `region` column is "East" and `revenue` is greater than 50000.
- Explain three strategies for handling missing data in pandas, and when you would choose each one.
- Translate this SQL query into pandas code: `SELECT department, AVG(salary) FROM employees WHERE is_active = TRUE GROUP BY department ORDER BY AVG(salary) DESC`
Practical Exercise
You have a CSV file with monthly sales data. Load it into a Colab notebook and perform the following analysis:
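A sample of what the file might look like (the column names are inferred from the tasks below and are illustrative):

```csv
order_id,date,region,product,amount
1001,2026-01-05,East,Widget,250.00
1002,2026-01-07,West,Gadget,99.50
1003,2026-01-12,East,Gizmo,80.00
1004,2026-01-15,South,Widget,120.00
1005,2026-01-20,East,,145.25
```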
Complete these tasks using pandas:
- Load the CSV and inspect it with `head()`, `info()`, and `isnull().sum()`
- Find all orders from the "East" region with an amount greater than 100
- Calculate total revenue by region
- Calculate the average order amount by product
- Identify and handle any missing values in the product column
- Merge this data with a `regions.csv` file that contains region manager names
Next Steps
In Chapter 6, we'll learn how to build ETL (Extract, Transform, Load) pipelines that combine SQL and pandas to move data from source systems into analytics-ready formats.
Corresponds to Week 4 of BADM 554 — Python & pandas