DataFrames & Series Fundamentals
A DataFrame is a 2D labeled data structure whose columns can hold different types.
A Series is a 1D labeled array, and each DataFrame column is a Series. Together they form the foundation of most pandas data analysis workflows.
| Function | Creates | Use Case |
| pd.DataFrame() | From dicts | Tabular data |
| pd.read_csv() | CSV files | Data loading |
| pd.Series() | 1D arrays | Single columns |
| df.groupby() | Groups | Aggregation |
| df.merge() | Joins | SQL-like joins |
Hello Pandas World
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [10, 20, 30]
})
print(df)
# From NumPy
data = np.random.randn(5, 3)
df_np = pd.DataFrame(data, columns=['x', 'y', 'z'])
print(df_np.head())
# Series
s = pd.Series([1, 3, 5, np.nan, 6])
print(s)
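To make the relationship between the two structures concrete, here is a minimal sketch. It reuses the small example values from above and shows that selecting one column returns a Series sharing the DataFrame's index, and that Series reductions skip NaN by default. The helper variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data; the column names are arbitrary.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col = df['a']                            # selecting one column yields a Series
same_index = col.index.equals(df.index)  # the Series keeps the row labels

# Reductions such as mean() ignore NaN unless skipna=False is passed.
s = pd.Series([1, 3, 5, np.nan, 6])
mean_skipping_nan = s.mean()             # (1 + 3 + 5 + 6) / 4
missing_count = int(s.isna().sum())

print(type(col).__name__, same_index, mean_skipping_nan, missing_count)
```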
Data Loading & Exploration
File I/O
# Read (each call returns a new DataFrame)
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')  # .xlsx files need the openpyxl package
df = pd.read_json('data.json')
# Write
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx')
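The read and write calls above can be tried without touching the disk. The sketch below assumes a tiny made-up table and uses an in-memory text buffer in place of a real file path. It is only meant to show the round trip and the effect of index=False.

```python
import io
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [28, 34]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False avoids an extra unnamed column

buffer.seek(0)                   # rewind before reading
restored = pd.read_csv(buffer)

print(restored)
```

Without index=False, the row labels would be written as a first column and come back as "Unnamed: 0" on reload.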
Quick Inspection
print(df.head()) # First 5 rows
df.info() # Data types and non-null counts (prints directly, returns None)
print(df.describe()) # Statistics
print(df.shape) # Dimensions
print(df.columns) # Column names
Column Selection
# Single column (Series)
name = df['name']
# Multiple columns
subset = df[['name', 'age', 'salary']]
# By position
df.iloc[:, 0:2]
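As a rough illustration of label-based versus position-based selection, the following sketch builds a small frame with assumed columns and picks the same two columns both ways. It also shows .loc filtering rows and selecting a column in a single step.

```python
import pandas as pd

# Assumed example data matching the column names used above.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [28, 34, 29],
    'salary': [55000.0, None, 62000.0],
})

by_label = df.loc[:, ['name', 'age']]   # select columns by label
by_position = df.iloc[:, 0:2]           # select by integer position (end exclusive)

# .loc can filter rows and pick a column in one step.
names_over_28 = df.loc[df['age'] > 28, 'name']

print(by_label.equals(by_position))
print(names_over_28.tolist())
```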
Boolean Filtering
# Age > 30
high_age = df[df['age'] > 30]
# Multiple conditions
df[(df['age'] > 25) & (df['salary'] > 50000)]
# Rows with a missing salary (use notnull() for the opposite)
df[df['salary'].isnull()]
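The filters above rely on two details that are easy to miss: each condition needs its own parentheses, and comparisons against NaN evaluate to False. A small sketch with made-up rows (the fourth row is hypothetical) shows both, alongside the equivalent query() form.

```python
import pandas as pd

# Made-up rows; 'Dan' is added only to have a row that fails the salary test.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'Dan'],
    'age': [28, 34, 29, 41],
    'salary': [55000.0, None, 62000.0, 48000.0],
})

# Parentheses are required: & binds more tightly than > in Python.
mask = (df['age'] > 25) & (df['salary'] > 50000)
with_mask = df[mask]

# query() expresses the same filter as a string.
with_query = df.query('age > 25 and salary > 50000')

# NaN > 50000 is False, so Bob is excluded above and found here instead.
missing_salary = df[df['salary'].isna()]

print(with_mask['name'].tolist(), missing_salary['name'].tolist())
```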
Data Cleaning & Transformation
    name  age    city   salary
0  Alice   28  London  55000.0
1    Bob   34   Paris      NaN
2  Carol   29  Berlin  62000.0
# Handle missing values
df['salary'] = df['salary'].fillna(df['salary'].mean())  # assign back; inplace=True on a selected column may not update df
df.dropna(subset=['name'], inplace=True)
# Remove duplicates
df.drop_duplicates(subset=['email'], inplace=True)
# Data types
df['age'] = df['age'].astype('int32')
df['date'] = pd.to_datetime(df['date'])
# String operations
df['name_upper'] = df['name'].str.upper()
df['name_len'] = df['name'].str.len()
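Applied to the three sample rows shown at the start of this section, the cleaning steps might look like the sketch below. It assumes 'name' is the de-duplication key, since the sample has no 'email' column, and it assigns results back instead of mutating in place.

```python
import numpy as np
import pandas as pd

# The three sample rows from the table above.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [28, 34, 29],
    'city': ['London', 'Paris', 'Berlin'],
    'salary': [55000.0, np.nan, 62000.0],
})

# Fill the missing salary with the mean of the known salaries.
df['salary'] = df['salary'].fillna(df['salary'].mean())

df = df.dropna(subset=['name'])            # drop rows without a name
df = df.drop_duplicates(subset=['name'])   # 'name' stands in for a unique key
df['age'] = df['age'].astype('int32')      # safe here: no missing ages
df['name_upper'] = df['name'].str.upper()

print(df)
```

Bob's missing salary becomes the mean of the two known values, (55000 + 62000) / 2 = 58500.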
GroupBy & Aggregation
# Group by city
city_stats = df.groupby('city').agg({
'salary': ['mean', 'median', 'count'],
'age': 'mean'
}).round(2)
print(city_stats)
# Multiple groups
df.groupby(['city', 'department'])['salary'].mean()
# Custom aggregation
df.groupby('city')['salary'].agg(min_salary='min', max_salary='max', salary_range=lambda x: x.max() - x.min())  # keywords name the output columns
📈 GroupBy Result (illustrative values)
| city | salary_mean | age_mean |
| London | 55,000 | 28.5 |
| Paris | 62,000 | 34.0 |
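Flat column names such as salary_mean and age_mean can be produced directly with named aggregation. The sketch below uses made-up rows, so its numbers are not meant to match the table above. Each keyword names an output column and maps it to an (input column, function) pair.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    'city': ['London', 'London', 'Paris', 'Paris'],
    'age': [28, 30, 34, 36],
    'salary': [50000.0, 60000.0, 62000.0, 64000.0],
})

# Named aggregation: output_column=(input_column, function).
city_stats = (
    df.groupby('city')
      .agg(
          salary_mean=('salary', 'mean'),
          salary_range=('salary', lambda x: x.max() - x.min()),
          age_mean=('age', 'mean'),
      )
      .reset_index()   # turn the 'city' index back into a regular column
)

print(city_stats)
```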
Reshaping & Pivot Tables
# Pivot tables (Excel-like)
pivot = df.pivot_table(
values='salary',
index='department',
columns='city',
aggfunc='mean',
fill_value=0
)
# Melt (wide → long)
long_df = pd.melt(df, id_vars=['name'], value_vars=['salary', 'bonus'])
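# Stack/Unstack (unstack requires each city/department pair to appear once)
salary_by_group = df.set_index(['city', 'department'])['salary']
wide = salary_by_group.unstack()  # departments become columns
long_again = wide.stack()  # back to a MultiIndex Series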
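To see how the reshaping calls relate to each other, here is a small sketch on hypothetical department and city rows. It pivots long data to wide and then melts the result back to long form.

```python
import pandas as pd

# Hypothetical rows; one row per (department, city) pair.
df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'Ops', 'Ops'],
    'city': ['London', 'Paris', 'London', 'Paris'],
    'salary': [70000, 72000, 50000, 52000],
})

# Long -> wide: one row per department, one column per city.
wide = df.pivot_table(values='salary', index='department',
                      columns='city', aggfunc='mean')

# Wide -> long again with melt.
long_df = wide.reset_index().melt(id_vars='department',
                                  var_name='city', value_name='salary')

print(wide)
print(long_df)
```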
Merging & Joining
# SQL-style joins
employees = pd.DataFrame({'id': [1,2,3], 'name': ['Alice', 'Bob', 'Carol']})
salaries = pd.DataFrame({'id': [1,2,4], 'salary': [55000, 62000, 70000]})
# Inner join
merged = pd.merge(employees, salaries, on='id', how='inner')
# Left join
left_join = pd.merge(employees, salaries, on='id', how='left')
# Concatenate
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
combined = pd.concat([df1, df2], ignore_index=True)  # renumber rows 0..3 instead of repeating 0, 1
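The difference between the join types is easiest to see by counting rows. This sketch reuses the employees and salaries frames from above and adds an outer join with indicator=True, which is not in the original snippet, to show where each row came from.

```python
import pandas as pd

employees = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
salaries = pd.DataFrame({'id': [1, 2, 4], 'salary': [55000, 62000, 70000]})

inner = pd.merge(employees, salaries, on='id', how='inner')  # ids in both frames
left = pd.merge(employees, salaries, on='id', how='left')    # every employee kept

# indicator=True adds a '_merge' column naming the source of each row.
outer = pd.merge(employees, salaries, on='id', how='outer', indicator=True)

print(inner['id'].tolist())
print(int(left['salary'].isna().sum()))
print(outer[['id', '_merge']])
```

Carol (id 3) has no salary row, so the left join keeps her with a NaN salary. Id 4 has no employee row and appears only in the outer join.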
Time Series Analysis
# DateTime handling
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Resampling (daily → monthly)
monthly = df['sales'].resample('ME').sum()  # 'ME' = month end (pandas 2.2+); use 'M' on older versions
# Rolling windows
df['ma_7'] = df['sales'].rolling(window=7).mean()
df['ma_30'] = df['sales'].rolling(window=30).mean()
# Time-based slicing (partial-string indexing goes through .loc)
last_year = df.loc['2023']
q1_2024 = df.loc['2024-01':'2024-03']
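The time series operations above can be checked on synthetic data. The sketch below assumes 60 days of constant sales starting on 2024-01-01. It uses the 'MS' (month start) alias, which both older and newer pandas releases accept, in place of a month-end alias.

```python
import pandas as pd

# Synthetic daily data: 60 days starting 2024-01-01, one sale per day.
dates = pd.date_range('2024-01-01', periods=60, freq='D')
df = pd.DataFrame({'sales': 1}, index=dates)

# 'MS' labels each monthly bin by its first day.
monthly = df['sales'].resample('MS').sum()

# The first 6 values of a 7-day rolling mean are NaN (window not yet full).
df['ma_7'] = df['sales'].rolling(window=7).mean()

# Partial-string indexing on a DatetimeIndex goes through .loc.
january = df.loc['2024-01']

print(monthly)
print(len(january))
```

Because 2024 is a leap year, the 60 days cover exactly January (31 days) and February (29 days).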
Production Projects
1. Sales Analytics Dashboard
1M+ rows: revenue trends, customer segmentation, forecasting
2. Customer Churn Analysis
Feature engineering, cohort analysis, retention metrics
3. Financial Reporting
P&L statements, multi-sheet Excel, automated reports
4. ETL Pipeline
10GB datasets: clean → transform → load → dashboard
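For datasets that do not fit in memory, one common approach is to process the file in chunks. The following is a minimal sketch of a clean → transform → load loop, not a production pipeline. The column names, the tiny inline CSV, and the dictionary used as the "load" target are all assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for a large file; a real pipeline would pass a file path instead.
raw = io.StringIO(
    "city,sales\n"
    "London,10\n"
    "Paris,20\n"
    "London,\n"
    "Paris,5\n"
    "London,7\n"
)

totals = {}
# chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(raw, chunksize=2):
    chunk = chunk.dropna(subset=['sales'])           # clean: drop rows without sales
    partial = chunk.groupby('city')['sales'].sum()   # transform: per-chunk totals
    for city, value in partial.items():              # load: accumulate across chunks
        totals[city] = totals.get(city, 0) + value

print(totals)
```

Only one chunk is held in memory at a time, so peak memory depends on chunksize instead of the file size.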
Data Analysis Expert!
📊 Production Data Workflows:
💼 Business Intelligence
ETL, dashboards, KPIs
🔬 Data Science Prep
Cleaning, feature engineering
📈 Financial Analysis
Reporting, forecasting
🛒 E-commerce Analytics
Cohort, A/B testing
⚡ Production Scale
1TB+ datasets optimized
Transform raw data into business intelligence! 🚀