Master Pandas - Data Analysis & Manipulation

DataFrames & Series Fundamentals

DataFrames are 2D labeled data structures whose columns can hold different types. Series are 1D labeled arrays, and each DataFrame column is a Series. Together they are the foundation of most pandas data analysis workflows.

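Function	What It Does	Use Case
`pd.DataFrame()`	Builds a DataFrame from dicts, lists, or arrays	Tabular data
`pd.read_csv()`	Loads a CSV file into a DataFrame	Data loading
`pd.Series()`	Builds a 1D labeled array	Single columns
`df.groupby()`	Splits rows into groups	Aggregation
`df.merge()`	Joins two DataFrames on key columns	SQL-like joins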
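
To make the DataFrame/Series relationship concrete, here is a minimal sketch. The column names and values are illustrative only and do not come from a real dataset.

```python
import pandas as pd

# A small DataFrame whose columns hold different types (illustrative data).
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [28, 34],
    "salary": [55000.0, 62000.0],
})

# Selecting one column returns a Series that shares the DataFrame's row index.
ages = df["age"]

print(type(ages).__name__)           # Series
print(df.shape)                      # (2, 3)
print(ages.index.equals(df.index))   # True
print(df.dtypes)                     # one dtype per column
```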

Hello Pandas World

🚀 python pandas_intro.py

import pandas as pd
import numpy as np

# Create DataFrame
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [10, 20, 30]
})
print(df)

# From NumPy
data = np.random.randn(5, 3)
df_np = pd.DataFrame(data, columns=['x', 'y', 'z'])
print(df_np.head())

# Series
s = pd.Series([1, 3, 5, np.nan, 6])
print(s)
                

Data Loading & Exploration

File I/O

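# Read (each call returns a new DataFrame; use the one that matches your file)
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')   # needs an Excel engine such as openpyxl
df = pd.read_json('data.json')

# Write
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)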
                        

Quick Inspection

print(df.head())      # First 5 rows
df.info()             # Dtypes and non-null counts (prints directly, returns None)
print(df.describe())  # Statistics
print(df.shape)       # Dimensions
print(df.columns)     # Column names
                        

Column Selection

# Single column (Series)
name = df['name']

# Multiple columns
subset = df[['name', 'age', 'salary']]

# By position
df.iloc[:, 0:2]
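
Label-based and position-based selection are easy to confuse, so here is a small sketch contrasting `.loc` and `.iloc`. The row labels `a`, `b`, `c` are hypothetical and chosen only so that labels and positions differ visibly.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [28, 34, 29],
        "salary": [55000.0, None, 62000.0],
    },
    index=["a", "b", "c"],  # hypothetical row labels
)

by_label = df.loc["b", "age"]       # row label "b", column label "age"
by_position = df.iloc[1, 1]         # second row, second column

# Position slices exclude the end, so 0:2 selects columns 0 and 1.
first_two_cols = df.iloc[:, 0:2]

# Label slices include both endpoints.
label_slice = df.loc["a":"b", "name":"age"]

print(by_label, by_position)
print(first_two_cols.columns.tolist())
print(label_slice)
```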
                        

Boolean Filtering

# Age > 30
high_age = df[df['age'] > 30]

# Multiple conditions
df[(df['age'] > 25) & (df['salary'] > 50000)]

# isnull/notnull
df[df['salary'].isnull()]
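
A runnable sketch of these filters, using the three-row sample from the Data Cleaning section below as input. The `isin` and `~` (negation) lines are additions for illustration. Note that a comparison against `NaN` evaluates to `False`, so rows with a missing salary drop out of the combined filter.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [28, 34, 29],
    "city": ["London", "Paris", "Berlin"],
    "salary": [55000.0, np.nan, 62000.0],
})

# Single condition
older = df[df["age"] > 28]

# Combine conditions with & / | and wrap each one in parentheses
both = df[(df["age"] > 25) & (df["salary"] > 50000)]

# Rows where salary is missing
missing = df[df["salary"].isna()]

# Negate a mask with ~
not_london = df[~df["city"].isin(["London"])]

print(older["name"].tolist())
print(both["name"].tolist())
print(missing["name"].tolist())
print(not_london["name"].tolist())
```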
                        

Data Cleaning & Transformation

    name  age    city  salary
0  Alice   28  London  55000.0
1    Bob   34  Paris     NaN
2  Carol   29  Berlin  62000.0

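# Handle missing values (assign the result back; inplace=True on a
# column selection is deprecated and may not update df in recent pandas)
df['salary'] = df['salary'].fillna(df['salary'].mean())
df = df.dropna(subset=['name'])

# Remove duplicates (assumes an 'email' column, not shown in the sample above)
df = df.drop_duplicates(subset=['email'])

# Data types (assumes a 'date' column, not shown in the sample above)
df['age'] = df['age'].astype('int32')
df['date'] = pd.to_datetime(df['date'])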

# String operations
df['name_upper'] = df['name'].str.upper()
df['name_len'] = df['name'].str.len()
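
The steps above can be tried end to end on the three-row sample. This is a minimal sketch: it uses only the columns shown in the sample (so duplicates are dropped on `name` rather than `email`) and assigns the result of `fillna` back to the column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [28, 34, 29],
    "city": ["London", "Paris", "Berlin"],
    "salary": [55000.0, np.nan, 62000.0],
})

# Fill the missing salary with the mean of the known salaries.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Drop rows without a name, then drop repeated names.
df = df.dropna(subset=["name"]).drop_duplicates(subset=["name"])

# Narrow the integer type and derive string columns.
df["age"] = df["age"].astype("int32")
df["name_upper"] = df["name"].str.upper()
df["name_len"] = df["name"].str.len()

print(df)
print(df.loc[1, "salary"])  # 58500.0, the mean of 55000 and 62000
```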
                

GroupBy & Aggregation

# Group by city
city_stats = df.groupby('city').agg({
    'salary': ['mean', 'median', 'count'],
    'age': 'mean'
}).round(2)

print(city_stats)

# Multiple groups
df.groupby(['city', 'department'])['salary'].mean()

# Custom aggregation
df.groupby('city')['salary'].agg(['min', 'max', lambda x: x.max()-x.min()])
                

📈 GroupBy Result

city	salary_mean	age_mean
London	55,000	28.5
Paris	62,000	34.0
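
The flat column names in the table above (`salary_mean`, `age_mean`) are not what a dict passed to `agg` produces; that form yields two-level column labels. One way to get flat names is named aggregation. The sketch below uses three hypothetical rows chosen only so that the output matches the table.

```python
import pandas as pd

# Hypothetical rows, picked to reproduce the result table above.
df = pd.DataFrame({
    "city": ["London", "London", "Paris"],
    "salary": [50000.0, 60000.0, 62000.0],
    "age": [27, 30, 34],
})

# Named aggregation: output_column=(input_column, function)
city_stats = (
    df.groupby("city")
      .agg(
          salary_mean=("salary", "mean"),
          age_mean=("age", "mean"),
      )
      .round(2)
)

print(city_stats)
```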

Reshaping & Pivot Tables

# Pivot tables (Excel-like)
pivot = df.pivot_table(
    values='salary', 
    index='department', 
    columns='city', 
    aggfunc='mean',
    fill_value=0
)

# Melt (wide → long)
long_df = pd.melt(df, id_vars=['name'], value_vars=['salary', 'bonus'])

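# Stack/Unstack (unstack needs unique (city, department) pairs;
# aggregate with groupby first if they repeat)
by_city_dept = df.set_index(['city', 'department'])['salary']
wide = by_city_dept.unstack()   # inner level (department) becomes columns
long_again = wide.stack()       # columns move back into the index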
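
A runnable sketch of the three reshaping operations. The rows, including the `department` and `bonus` values, are hypothetical. Because the (`Paris`, `Eng`) pair appears twice, the data is aggregated with `groupby` before `unstack`, which would otherwise raise an error on duplicate index entries.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan"],
    "city": ["London", "London", "Paris", "Paris"],
    "department": ["Eng", "Sales", "Eng", "Eng"],
    "salary": [55000, 48000, 62000, 58000],
    "bonus": [5000, 3000, 6000, 4000],
})

# Pivot: mean salary per department (rows) and city (columns).
pivot = df.pivot_table(
    values="salary",
    index="department",
    columns="city",
    aggfunc="mean",
    fill_value=0,
)

# Melt: one row per (name, variable) pair.
long_df = pd.melt(df, id_vars=["name"], value_vars=["salary", "bonus"])

# Unstack: aggregate first so each (city, department) pair is unique.
wide = df.groupby(["city", "department"])["salary"].mean().unstack()

print(pivot)
print(long_df)
print(wide)
```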
                

Merging & Joining

# SQL-style joins
employees = pd.DataFrame({'id': [1,2,3], 'name': ['Alice', 'Bob', 'Carol']})
salaries = pd.DataFrame({'id': [1,2,4], 'salary': [55000, 62000, 70000]})

# Inner join
merged = pd.merge(employees, salaries, on='id', how='inner')

# Left join
left_join = pd.merge(employees, salaries, on='id', how='left')

# Concatenate
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
combined = pd.concat([df1, df2], ignore_index=True)  # renumber rows 0..3
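
To see which rows each join keeps, the sketch below reuses the `employees` and `salaries` frames and adds an outer join with `indicator=True`. The outer join is an addition for illustration. The `_merge` column it creates reports whether a row came from the left frame, the right frame, or both.

```python
import pandas as pd

employees = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
salaries = pd.DataFrame({"id": [1, 2, 4], "salary": [55000, 62000, 70000]})

# Inner: only ids present in both frames.
inner = pd.merge(employees, salaries, on="id", how="inner")

# Left: every employee; salary is NaN where no match exists.
left = pd.merge(employees, salaries, on="id", how="left")

# Outer with indicator: every id from either side, plus its origin.
outer = pd.merge(employees, salaries, on="id", how="outer", indicator=True)
origin = dict(zip(outer["id"], outer["_merge"].astype(str)))

print(inner)
print(left)
print(origin)
```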
                

Time Series Analysis

# DateTime handling
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Resampling (daily → monthly)
monthly = df['sales'].resample('ME').sum()  # month-end bins; use 'M' on pandas < 2.2

# Rolling windows
df['ma_7'] = df['sales'].rolling(window=7).mean()
df['ma_30'] = df['sales'].rolling(window=30).mean()

# Time-based slicing (select rows by partial date string through .loc)
year_2023 = df.loc['2023']
q1_2024 = df.loc['2024-01':'2024-03']
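
A self-contained sketch of resampling, rolling windows, and date slicing on synthetic data. The `sales` values are made up (1, 2, 3, ... per day) so the results are easy to check by hand. It uses the month-start alias `'MS'` as an assumption-free choice, because that alias is spelled the same across pandas versions, whereas the month-end alias changed from `'M'` to `'ME'`.

```python
import numpy as np
import pandas as pd

# Synthetic daily data for Q1 2024 (91 days; 2024 is a leap year).
dates = pd.date_range("2024-01-01", "2024-03-31", freq="D")
df = pd.DataFrame({"sales": np.arange(1, len(dates) + 1)}, index=dates)

# Daily -> monthly totals, labeled by the first day of each month.
monthly = df["sales"].resample("MS").sum()

# 7-day moving average; the first 6 rows are NaN (window not yet full).
df["ma_7"] = df["sales"].rolling(window=7).mean()

# Partial-string slicing through .loc
february = df.loc["2024-02"]
jan_feb = df.loc["2024-01":"2024-02"]

print(monthly)
print(df.head(8))
print(len(february), len(jan_feb))
```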
                

Production Projects

1. Sales Analytics Dashboard

1M+ rows: revenue trends, customer segmentation, forecasting

2. Customer Churn Analysis

Feature engineering, cohort analysis, retention metrics

3. Financial Reporting

P&L statements, multi-sheet Excel, automated reports

4. ETL Pipeline

10GB datasets: clean → transform → load → dashboard
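
For files too large to load at once, one common approach is to read in chunks and combine partial results. The sketch below illustrates the clean → transform → load steps under stated assumptions: the in-memory CSV stands in for a large file, the `city` and `salary` columns are hypothetical, and `chunksize=2` is far smaller than anything realistic.

```python
import io

import pandas as pd

# Stand-in for a large file; a real pipeline would pass a file path.
raw = io.StringIO(
    "city,salary\n"
    "London,55000\n"
    "Paris,\n"
    "London,60000\n"
    "Berlin,62000\n"
    "Paris,58000\n"
)

totals = {}
for chunk in pd.read_csv(raw, chunksize=2):
    # Clean: drop rows with a missing salary.
    chunk = chunk.dropna(subset=["salary"])
    # Transform: partial sum per city within this chunk.
    partial = chunk.groupby("city")["salary"].sum()
    # Load: fold the partial result into the running totals.
    for city, value in partial.items():
        totals[city] = totals.get(city, 0.0) + float(value)

result = pd.Series(totals, name="salary_total").sort_index()
print(result)
```

Only sums, counts, and similar decomposable aggregates combine this simply across chunks; a median, for example, would need a different strategy.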

Data Analysis Expert!

📊 Production Data Workflows:

💼 Business Intelligence
ETL, dashboards, KPIs

🔬 Data Science Prep
Cleaning, feature engineering

📈 Financial Analysis
Reporting, forecasting

🛒 E-commerce Analytics
Cohort, A/B testing

⚡ Production Scale
Chunked processing for datasets larger than memory

Transform raw data into business intelligence! 🚀
