📈

Pandas Mastery

Data analysis & manipulation

INTERMEDIATE

Pandas - Data Analysis & Manipulation

A powerful, flexible open-source library for data analysis and manipulation in Python. Widely used by data scientists for cleaning, transforming, and analyzing data.

DataFrames & Series Fundamentals

DataFrames are 2D labeled data structures with columns of potentially different types. Series are 1D labeled arrays. Together they are the foundation of most pandas workflows.
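To make "labeled" concrete, here is a minimal sketch with hypothetical sample data (the names, ages, salaries and the row labels a, b, c are made up for illustration). It shows that each column can hold its own type and that selecting one column gives back a Series sharing the DataFrame's row labels.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type.
people = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [28, 34, 29],
        "salary": [55000.0, 48000.0, 62000.0],
    },
    index=["a", "b", "c"],  # explicit row labels instead of 0, 1, 2
)

# Selecting one column returns a Series that keeps the row labels.
ages = people["age"]

print(type(ages).__name__)  # Series
print(list(ages.index))     # ['a', 'b', 'c']
print(people.dtypes)        # one dtype per column
print(ages["b"])            # label-based lookup: 34
```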

Function         Creates      Use Case
pd.DataFrame()   From dicts   Tabular data
pd.read_csv()    CSV files    Data loading
pd.Series()      1D arrays    Single columns
df.groupby()     Groups       Aggregation
df.merge()       Joins        SQL-like joins

Hello Pandas World

import pandas as pd
import numpy as np

# Create DataFrame
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [10, 20, 30]
})
print(df)

# From NumPy (random values, so this output differs on every run)
data = np.random.randn(5, 3)
df_np = pd.DataFrame(data, columns=['x', 'y', 'z'])
print(df_np.head())

# Series
s = pd.Series([1, 3, 5, np.nan, 6])
print(s)

🚀 python pandas_intro.py
   a   b
0  1  10
1  2  20
2  3  30
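The Series in the script contains np.nan, which affects both its dtype and its statistics. The short sketch below reuses the same values to show this; it is an illustration of default behavior, not part of the original script.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6])

# NaN is a float, so the integers are stored as float64.
print(s.dtype)          # float64

# Count the missing entries.
print(s.isna().sum())   # 1

# Reductions skip NaN by default: (1 + 3 + 5 + 6) / 4
print(s.mean())         # 3.75

# Opting out of skipping propagates the NaN instead.
total_with_nan = s.sum(skipna=False)
print(total_with_nan)   # nan
```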

Data Loading & Exploration

File I/O

# Read
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')   # needs an Excel engine such as openpyxl installed
df = pd.read_json('data.json')

# Write
df.to_csv('output.csv', index=False)   # index=False omits the row-label column
df.to_excel('output.xlsx')
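The readers and writers accept file-like objects as well as paths, which makes it possible to try a round trip without touching the disk. The sketch below assumes a small made-up CSV held in a string; the column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; Bob's salary is left empty on purpose.
csv_text = "name,age,salary\nAlice,28,55000\nBob,34,\nCarol,29,62000\n"

# read_csv accepts any file-like object, not only a path.
df = pd.read_csv(io.StringIO(csv_text))

# Write to an in-memory buffer, then read it back.
out = io.StringIO()
df.to_csv(out, index=False)
round_trip = pd.read_csv(io.StringIO(out.getvalue()))
print(round_trip.equals(df))  # True

# With the default index=True, an extra unnamed column is written first.
header_with_index = df.to_csv().splitlines()[0]
print(header_with_index)      # ,name,age,salary
```

Forgetting index=False is a common source of stray "Unnamed: 0" columns when the file is loaded again.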

Quick Inspection

print(df.head())       # First 5 rows
df.info()              # Data types (prints directly and returns None, so no print())
print(df.describe())   # Statistics
print(df.shape)        # Dimensions
print(df.columns)      # Column names
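One detail worth seeing in action: df.info() writes its report to standard output and returns None, while describe() returns a DataFrame. A minimal sketch with a hypothetical two-row frame, capturing the report through the buf parameter:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [28, 34]})

# info() prints its report; pass buf= to capture it as text instead.
buffer = io.StringIO()
result = df.info(buf=buffer)
report = buffer.getvalue()

print(result)        # None
print(report)        # the captured summary

# describe() returns a DataFrame of summary statistics.
stats = df.describe()
print(stats.loc["mean", "age"])  # 31.0
```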

Column Selection

# Single column (Series)
name = df['name']

# Multiple columns
subset = df[['name', 'age', 'salary']]

# By position (all rows, first two columns)
df.iloc[:, 0:2]
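Position-based and label-based selection differ in how they treat the end of a slice. The sketch below uses a hypothetical three-column frame to compare iloc with loc, and single brackets with double brackets.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [28, 34, 29],
        "salary": [55000, 48000, 62000],
    }
)

# iloc slices by position; the end position is excluded.
by_position = df.iloc[:, 0:2]

# loc slices by label; the end label is included.
by_label = df.loc[:, "name":"age"]

print(by_position.equals(by_label))  # True

# Single brackets give a Series, double brackets a one-column DataFrame.
as_series = df["name"]
as_frame = df[["name"]]
print(type(as_series).__name__, type(as_frame).__name__)  # Series DataFrame
```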

Boolean Filtering

# Age > 30
high_age = df[df['age'] > 30]

# Multiple conditions (each condition needs its own parentheses)
df[(df['age'] > 25) & (df['salary'] > 50000)]

# Rows where salary is missing (notnull() is the inverse)
df[df['salary'].isnull()]
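Boolean masks combine with & (and), | (or) and ~ (not), and comparisons against NaN evaluate to False. The following sketch, built on made-up rows, shows these operators next to isin() and the equivalent query() form.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; Bob's salary is missing.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol", "Dave"],
        "age": [28, 34, 29, 41],
        "salary": [55000.0, np.nan, 62000.0, 48000.0],
    }
)

# AND: Bob drops out because NaN > 50000 is False.
both = df[(df["age"] > 25) & (df["salary"] > 50000)]
print(list(both["name"]))        # ['Alice', 'Carol']

# The same filter written as a query string.
same = df.query("age > 25 and salary > 50000")

# NOT: keep rows whose salary is present.
has_salary = df[~df["salary"].isnull()]
print(list(has_salary["name"]))  # ['Alice', 'Carol', 'Dave']

# Membership test against a list of values.
in_set = df[df["name"].isin(["Alice", "Bob"])]
print(list(in_set["name"]))      # ['Alice', 'Bob']
```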

Data Cleaning & Transformation

    name  age    city   salary
0  Alice   28  London  55000.0
1    Bob   34   Paris      NaN
2  Carol   29  Berlin  62000.0

# Handle missing values (assign the result back; calling fillna with
# inplace=True on a column selection may not update df in recent pandas)
df['salary'] = df['salary'].fillna(df['salary'].mean())
df.dropna(subset=['name'], inplace=True)

# Remove duplicates (assumes an 'email' column, not shown in the sample above)
df.drop_duplicates(subset=['email'], inplace=True)

# Data types (assumes a 'date' column, not shown in the sample above)
df['age'] = df['age'].astype('int32')
df['date'] = pd.to_datetime(df['date'])

# String operations
df['name_upper'] = df['name'].str.upper()
df['name_len'] = df['name'].str.len()
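As a runnable illustration, the sketch below rebuilds the three sample rows shown above and applies only the steps that need no extra columns (no email or date). The mean used to fill Bob's salary is computed from the two known values.

```python
import numpy as np
import pandas as pd

# The three sample rows from the table above.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [28, 34, 29],
        "city": ["London", "Paris", "Berlin"],
        "salary": [55000.0, np.nan, 62000.0],
    }
)

# Mean of the known salaries: (55000 + 62000) / 2
mean_salary = df["salary"].mean()
print(mean_salary)  # 58500.0

# Fill the gap by assigning the result back to the column.
df["salary"] = df["salary"].fillna(mean_salary)

# Narrow the integer type and derive string columns.
df["age"] = df["age"].astype("int32")
df["name_upper"] = df["name"].str.upper()
df["name_len"] = df["name"].str.len()

print(df)
```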

GroupBy & Aggregation

# Group by city
city_stats = df.groupby('city').agg({
    'salary': ['mean', 'median', 'count'],
    'age': 'mean'
}).round(2)
print(city_stats)

# Multiple groups
df.groupby(['city', 'department'])['salary'].mean()

# Custom aggregation
df.groupby('city')['salary'].agg(['min', 'max', lambda x: x.max() - x.min()])

📈 GroupBy Result

city     salary_mean   age_mean
London   55,000        28.5
Paris    62,000        34.0
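Flat column names such as salary_mean and age_mean are what named aggregation produces; passing a dict of lists to agg() yields two-level columns instead. The sketch below assumes three hypothetical rows, chosen so that the averages match the result table.

```python
import pandas as pd

# Hypothetical rows chosen so the averages match the table above.
df = pd.DataFrame(
    {
        "city": ["London", "London", "Paris"],
        "salary": [50000, 60000, 62000],
        "age": [27, 30, 34],
    }
)

# Named aggregation: output_name=(input_column, function)
city_stats = (
    df.groupby("city")
    .agg(salary_mean=("salary", "mean"), age_mean=("age", "mean"))
    .round(2)
)
print(city_stats)

# A dict of lists gives two-level (MultiIndex) columns instead.
nested = df.groupby("city").agg({"salary": ["mean", "count"]})
print(nested.columns.nlevels)  # 2
```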

Reshaping & Pivot Tables

# Pivot tables (Excel-like)
pivot = df.pivot_table(
    values='salary',
    index='department',
    columns='city',
    aggfunc='mean',
    fill_value=0
)

# Melt (wide → long)
long_df = pd.melt(df, id_vars=['name'], value_vars=['salary', 'bonus'])

# Stack/Unstack (unstack needs unique (city, department) pairs)
multi_index = df.set_index(['city', 'department'])['salary']
unstacked = multi_index.unstack()   # departments become columns
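To see what these reshaping calls return, here is a small sketch on made-up data (four hypothetical employees with department and bonus columns). Because Paris/Eng appears twice, the unstack example aggregates with groupby first so that each (city, department) pair is unique.

```python
import pandas as pd

# Hypothetical employees; Paris/Eng appears twice on purpose.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol", "Dave"],
        "city": ["London", "London", "Paris", "Paris"],
        "department": ["Eng", "Sales", "Eng", "Eng"],
        "salary": [55000, 48000, 62000, 58000],
        "bonus": [5000, 3000, 6000, 4000],
    }
)

# Pivot: mean salary per department and city; missing cells become 0.
pivot = df.pivot_table(
    values="salary", index="department", columns="city",
    aggfunc="mean", fill_value=0,
)
print(pivot)

# Melt: one row per (name, variable) pair.
long_df = pd.melt(df, id_vars=["name"], value_vars=["salary", "bonus"])
print(long_df.shape)  # (8, 3)

# Unstack: aggregate first so each (city, department) pair is unique.
wide = df.groupby(["city", "department"])["salary"].mean().unstack()
print(wide)
```

Unlike pivot_table with fill_value=0, unstack leaves combinations that never occur (here London/Sales exists but Paris/Sales does not) as NaN.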

Merging & Joining

# SQL-style joins
employees = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
salaries = pd.DataFrame({'id': [1, 2, 4], 'salary': [55000, 62000, 70000]})

# Inner join (only ids present in both: 1 and 2)
merged = pd.merge(employees, salaries, on='id', how='inner')

# Left join (all employees; Carol has no salary row)
left_join = pd.merge(employees, salaries, on='id', how='left')

# Concatenate (stacks rows; original index labels are kept)
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
combined = pd.concat([df1, df2])
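Using the same employees and salaries frames, the sketch below adds an outer join with indicator=True, which records where each row came from, and shows the effect of ignore_index on concat. It is an illustration of standard merge options, not something the original example included.

```python
import pandas as pd

employees = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
salaries = pd.DataFrame({"id": [1, 2, 4], "salary": [55000, 62000, 70000]})

# Inner keeps ids 1 and 2; left keeps all three employees.
merged = pd.merge(employees, salaries, on="id", how="inner")
left_join = pd.merge(employees, salaries, on="id", how="left")
print(len(merged), len(left_join))  # 2 3

# Outer keeps every id; _merge records which side each row came from.
outer = pd.merge(employees, salaries, on="id", how="outer", indicator=True)
origin = dict(zip(outer["id"], outer["_merge"].astype(str)))
print(origin)

# concat keeps the original labels unless ignore_index=True is passed.
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})
print(list(pd.concat([df1, df2]).index))                     # [0, 1, 0, 1]
print(list(pd.concat([df1, df2], ignore_index=True).index))  # [0, 1, 2, 3]
```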

Time Series Analysis

# DateTime handling
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Resampling (daily → monthly); 'ME' means month end, use 'M' on pandas older than 2.2
monthly = df['sales'].resample('ME').sum()

# Rolling windows
df['ma_7'] = df['sales'].rolling(window=7).mean()
df['ma_30'] = df['sales'].rolling(window=30).mean()

# Time-based slicing (use .loc; df['2023'] would look for a column named '2023')
last_year = df.loc['2023']
q1_2024 = df.loc['2024-01':'2024-03']
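The time series steps can be tried on synthetic data. The sketch below assumes a made-up daily sales column that simply counts up from 0 between 2023-12-01 and 2024-03-31, and it uses the 'MS' (month start) alias for resampling because that alias is accepted by both older and newer pandas releases.

```python
import numpy as np
import pandas as pd

# Synthetic daily data: 122 days, sales = 0, 1, 2, ...
dates = pd.date_range("2023-12-01", "2024-03-31", freq="D")
df = pd.DataFrame({"sales": np.arange(len(dates), dtype=float)}, index=dates)
df.index.name = "date"

# Daily to monthly totals, labeled by the first day of each month.
monthly = df["sales"].resample("MS").sum()
print(len(monthly))          # 4
print(monthly.iloc[0])       # 465.0 (0 + 1 + ... + 30 for December)

# 7-day moving average; the first 6 rows lack a full window.
df["ma_7"] = df["sales"].rolling(window=7).mean()
print(df["ma_7"].isna().sum())  # 6
print(df["ma_7"].iloc[6])       # 3.0

# Partial-string slicing on the DatetimeIndex goes through .loc.
rows_2023 = df.loc["2023"]
q1_2024 = df.loc["2024-01":"2024-03"]
print(len(rows_2023), len(q1_2024))  # 31 91
```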

Production Projects

1. Sales Analytics Dashboard

1M+ rows: revenue trends, customer segmentation, forecasting

2. Customer Churn Analysis

Feature engineering, cohort analysis, retention metrics

3. Financial Reporting

P&L statements, multi-sheet Excel, automated reports

4. ETL Pipeline

10GB datasets: clean → transform → load → dashboard
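For files too large to load at once, such as the ETL project above, one common approach is to read in chunks and combine partial results. The sketch below is only an illustration under stated assumptions: a tiny in-memory CSV stands in for the large file, and the city and sales columns and the chunk size of 4 are hypothetical.

```python
import io
import pandas as pd

# Stand-in for a large file: 10 rows alternating between two cities.
rows = [f"{'London' if i % 2 == 0 else 'Paris'},{i}" for i in range(10)]
csv_text = "city,sales\n" + "\n".join(rows) + "\n"

totals = {}

# chunksize makes read_csv yield DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Clean: drop rows without a sales value.
    chunk = chunk.dropna(subset=["sales"])
    # Transform: aggregate within the chunk.
    partial = chunk.groupby("city")["sales"].sum()
    # Load: fold the partial sums into the running totals.
    for city, value in partial.items():
        totals[city] = totals.get(city, 0) + int(value)

print(totals)
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size rather than on the file size.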

Data Analysis Expert!

📊 Production Data Workflows:

💼 Business Intelligence
ETL, dashboards, KPIs
🔬 Data Science Prep
Cleaning, feature engineering
📈 Financial Analysis
Reporting, forecasting
🛒 E-commerce Analytics
Cohort, A/B testing
⚡ Production Scale
1TB+ datasets, processed in chunks

Transform raw data into business intelligence! 🚀