Master Data Science with Python – Complete Hands-on Tutorial for Students, Researchers & Professionals (2026 Edition)
TABLE OF CONTENTS
From Zero to Job-Ready | Real Projects | Pandas, NumPy, Visualization, Machine Learning & Deployment
Introduction to Data Science & Python Setup 1.1 What is Data Science and Why Python in 2026? 1.2 Data Science Career Paths for Students, Researchers & Professionals 1.3 Complete Python Environment Setup (Anaconda, Jupyter, VS Code) 1.4 Essential Libraries Overview (NumPy, Pandas, Matplotlib, Scikit-learn)
NumPy – Foundation of Numerical Computing 2.1 NumPy Arrays vs Python Lists 2.2 Array Operations, Broadcasting & Vectorization 2.3 Indexing, Slicing & Advanced Array Manipulation 2.4 Mathematical & Statistical Functions
Pandas – Data Manipulation & Analysis 3.1 Series and DataFrame – Core Data Structures 3.2 Data Loading (CSV, Excel, JSON, SQL) 3.3 Data Cleaning, Filtering & Transformation 3.4 Grouping, Aggregation & Pivot Tables 3.5 Handling Missing Values & Outliers
Data Visualization with Matplotlib & Seaborn 4.1 Matplotlib Basics – Line, Bar, Scatter & Histogram 4.2 Seaborn for Statistical Visualization 4.3 Advanced Plots – Heatmap, Pairplot, Boxplot 4.4 Creating Publication-Ready Visualizations
Exploratory Data Analysis (EDA) 5.1 Understanding Data Distribution & Summary Statistics 5.2 Univariate, Bivariate & Multivariate Analysis 5.3 Correlation Analysis & Feature Relationships 5.4 Real-World EDA Case Study
Data Preprocessing & Feature Engineering 6.1 Data Scaling & Normalization 6.2 Encoding Categorical Variables 6.3 Feature Selection Techniques 6.4 Handling Imbalanced Datasets
Statistics & Probability for Data Science 7.1 Descriptive vs Inferential Statistics 7.2 Hypothesis Testing & p-value 7.3 Probability Distributions 7.4 Correlation, Regression & Confidence Intervals
Machine Learning with Scikit-learn 8.1 Supervised Learning – Regression & Classification 8.2 Model Training, Evaluation & Hyperparameter Tuning 8.3 Cross-Validation & Model Selection 8.4 Unsupervised Learning – Clustering & Dimensionality Reduction
Advanced Data Science Topics 9.1 Time Series Analysis & Forecasting 9.2 Natural Language Processing (NLP) Basics 9.3 Introduction to Deep Learning with TensorFlow/Keras 9.4 Model Deployment & MLOps Basics
Real-World Projects & Case Studies 10.1 Project 1: House Price Prediction (Regression) 10.2 Project 2: Customer Churn Prediction (Classification) 10.3 Project 3: Sentiment Analysis on Reviews (NLP) 10.4 Project 4: Sales Dashboard & EDA Report
Best Practices, Portfolio & Career Guidance 11.1 Writing Clean & Reproducible Data Science Code 11.2 Building a Strong Data Science Portfolio 11.3 Git, Kaggle & Resume Tips for Students & Professionals 11.4 Interview Preparation & Top Data Science Questions
Next Steps & Learning Roadmap 12.1 Advanced Topics (Deep Learning, Computer Vision, Big Data) 12.2 Recommended Books, Courses & Resources (2026 Updated) 12.3 Career Paths & Job Opportunities in Data Science
1. Introduction to Data Science & Python Setup
Welcome to your journey into Data Science with Python! This section lays the foundation — understanding what data science really is in 2026, why Python remains the #1 choice, career opportunities, and how to set up a powerful, professional environment.
1.1 What is Data Science and Why Python in 2026?
Data Science is the field of extracting meaningful insights and knowledge from structured and unstructured data using scientific methods, processes, algorithms, and systems.
In 2026, data science combines:
Statistics & mathematics
Programming & computer science
Domain expertise
Machine learning & AI
Data visualization & storytelling
Core activities in modern data science:
Collecting & cleaning data
Exploratory Data Analysis (EDA)
Building predictive models
Deploying models into production
Communicating insights (dashboards, reports)
Why Python is still #1 in 2026
Extremely rich ecosystem: NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch, Hugging Face, Polars, Streamlit, FastAPI
Beginner-friendly syntax + powerful for production
Largest community & job market demand (Stack Overflow, LinkedIn, IEEE reports)
Used by Google, Meta, Netflix, NASA, ISRO, startups & research labs
Excellent for automation, web scraping, APIs, cloud (AWS, GCP, Azure)
Fast prototyping + scalable deployment
Python vs R vs Julia vs others (2026 view):
Python → general-purpose, industry standard, huge ecosystem
R → strong in statistics & academia (but declining in industry)
Julia → fast computation (but small ecosystem & adoption)
1.2 Data Science Career Paths for Students, Researchers & Professionals
Career Roles in 2026 (with approximate global salary ranges – India & International)
| Role | Typical Responsibilities | Best For | India Salary (₹ LPA) | Global Salary (USD) |
|---|---|---|---|---|
| Data Analyst | SQL, Excel, Power BI, basic Python, dashboards | Students & freshers | 4–12 | $60k–$90k |
| Data Scientist | ML models, EDA, feature engineering, deployment | Students + Professionals | 10–25 | $100k–$160k |
| Machine Learning Engineer | Production ML, MLOps, pipelines, cloud | Professionals & researchers | 15–40 | $130k–$220k |
| AI Research Scientist | Deep learning, papers, innovation | Researchers & PhD holders | 18–50+ | $150k–$300k+ |
| Data Engineer | ETL pipelines, big data (Spark, Airflow) | Professionals | 12–30 | $110k–$180k |
| Business Intelligence Analyst | Dashboards, KPIs, stakeholder communication | Freshers & mid-level | 6–15 | $70k–$110k |
Skills in demand (2026):
Python + SQL (must-have)
Cloud (AWS/GCP/Azure)
Git & GitHub
Docker & FastAPI
ML deployment (MLflow, BentoML, Streamlit)
Communication & storytelling
1.3 Complete Python Environment Setup (Anaconda, Jupyter, VS Code)
Recommended Setup for Data Science (2026 standard):
Option 1 – Anaconda (easiest for beginners & researchers)
Download Anaconda: https://www.anaconda.com/download
Install → includes Python, Jupyter, Spyder, NumPy, Pandas, Matplotlib, Scikit-learn, etc.
Open Anaconda Navigator → launch Jupyter Notebook or JupyterLab
Option 2 – Miniconda + VS Code (lightweight & professional)
Install Miniconda: https://docs.conda.io/en/latest/miniconda.html
Create environment:
Bash
conda create -n datascience python=3.11
conda activate datascience
conda install jupyter numpy pandas matplotlib seaborn scikit-learn
pip install jupyterlab
Install VS Code: https://code.visualstudio.com
Install extensions: Python, Jupyter, Pylance, Black Formatter, GitLens
Recommended VS Code Settings (settings.json):
JSON
{
  "python.defaultInterpreterPath": "~/.conda/envs/datascience/bin/python",
  "jupyter.alwaysTrustNotebooks": true,
  "editor.formatOnSave": true,
  "python.formatting.provider": "black"
}
Quick start JupyterLab:
Bash
conda activate datascience
jupyter lab
1.4 Essential Libraries Overview (NumPy, Pandas, Matplotlib, Scikit-learn)
NumPy – Numerical foundation
Fast arrays & matrices
Vectorized operations (no loops)
Broadcasting, linear algebra
Pandas – Data wrangling & analysis
DataFrame (Excel-like table)
Read/write CSV, Excel, SQL, JSON
Filtering, grouping, merging
Matplotlib + Seaborn – Visualization
Matplotlib: base plotting library
Seaborn: beautiful statistical plots on top of Matplotlib
Scikit-learn – Machine Learning
Preprocessing, models (regression, classification, clustering)
Model evaluation, pipelines, grid search
Quick import cheat sheet
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Mini Hello Data Science Code (run in Jupyter)
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns   # needed for the scatterplot below

# Create sample data
data = pd.DataFrame({
    'age': np.random.randint(20, 60, 100),
    'salary': np.random.normal(80000, 20000, 100)
})

# Quick EDA
print(data.describe())
sns.scatterplot(x='age', y='salary', data=data)
plt.title("Age vs Salary")
plt.show()
This completes the full Introduction to Data Science & Python Setup section — your perfect starting point for the entire Data Science with Python tutorial!
2. NumPy – Foundation of Numerical Computing
NumPy (Numerical Python) is the most important library for numerical and scientific computing in Python. Almost every data science library (Pandas, Scikit-learn, Matplotlib, TensorFlow, PyTorch, etc.) is built on top of NumPy.
Why NumPy is essential in 2026:
Extremely fast (written in C, vectorized operations)
Memory-efficient multi-dimensional arrays
Broadcasting (no loops needed for many operations)
Basis for all modern data science & machine learning
Install NumPy (if not using Anaconda)
Bash
pip install numpy
Import convention (standard in data science):
Python
import numpy as np
2.1 NumPy Arrays vs Python Lists
Python lists are flexible but slow for numerical work.
NumPy arrays (ndarray) are homogeneous, fixed-type, multi-dimensional arrays optimized for math.
| Feature | Python List | NumPy Array (ndarray) | Winner |
|---|---|---|---|
| Data types | Mixed (int, str, float, etc.) | Homogeneous (all same type) | NumPy |
| Speed (math operations) | Slow (loops in Python) | Very fast (vectorized, C-level) | NumPy |
| Memory usage | High (objects + pointers) | Low (contiguous memory block) | NumPy |
| Multi-dimensional support | Manual (list of lists) | Native (ndarray with shape) | NumPy |
| Broadcasting | Not supported | Automatic (shape rules) | NumPy |
| Mathematical functions | Manual or loop | Built-in (np.sum, np.mean, etc.) | NumPy |
Quick comparison example
Python
# Python list (slow)
lst = list(range(1000000))
%timeit [x**2 for x in lst]   # ~100–150 ms

# NumPy array (fast)
arr = np.arange(1000000)
%timeit arr**2                # ~1–5 ms
2.2 Array Operations, Broadcasting & Vectorization
Vectorization = performing operations on entire arrays without explicit loops.
Basic array creation
Python
import numpy as np

a = np.array([1, 2, 3, 4])         # 1D array
b = np.array([[1, 2], [3, 4]])     # 2D array
zeros = np.zeros((3, 4))           # 3×4 array of zeros
ones = np.ones(5)                  # [1. 1. 1. 1. 1.]
arange = np.arange(0, 10, 2)       # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5)    # 5 evenly spaced points
rand = np.random.rand(3, 2)        # random values [0, 1)
Vectorized operations
Python
a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])

print(a + b)       # [11 22 33 44]
print(a * 2)       # [20 40 60 80]
print(a ** 2)      # [100 400 900 1600]
print(np.sqrt(a))  # square root of each element
Broadcasting – automatic shape alignment
Python
a = np.array([[1, 2, 3],
              [4, 5, 6]])            # shape (2, 3)
b = np.array([10, 20, 30])           # shape (3,)

print(a + b)   # adds b to each row
# [[11 22 33]
#  [14 25 36]]

c = np.array([[100], [200]])         # shape (2, 1)
print(a + c)   # adds c to each column
Rule of thumb: broadcasting works when the shapes are compatible: comparing dimensions from the rightmost side, each pair must be equal, or one of them must be 1 (missing leading dimensions are treated as 1).
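A quick sketch of the rule in action (the array values here are arbitrary):

```python
import numpy as np

a = np.ones((2, 3))
b = np.ones(3)   # trailing dimension matches (3 == 3) → broadcasts
c = np.ones(2)   # trailing dimension clashes (2 vs 3) → error

print((a + b).shape)   # (2, 3)

try:
    a + c
except ValueError:
    print("shapes (2, 3) and (2,) are incompatible")

# A size-1 axis stretches: reshape c to (2, 1) and it broadcasts per row
print((a + c.reshape(2, 1)).shape)   # (2, 3)
```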
2.3 Indexing, Slicing & Advanced Array Manipulation
Basic indexing & slicing
Python
arr = np.array([10, 20, 30, 40, 50])

print(arr[0])      # 10
print(arr[-1])     # 50 (last element)
print(arr[1:4])    # [20 30 40]
print(arr[::2])    # [10 30 50] (every second)
print(arr[::-1])   # [50 40 30 20 10] (reverse)
2D array indexing
Python
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print(matrix[0, 2])    # 3
print(matrix[:, 1])    # [2 5 8] (second column)
print(matrix[1:, :2])  # [[4 5]
                       #  [7 8]] (rows 1–2, columns 0–1)
Boolean indexing (very powerful)
Python
arr = np.array([10, 25, 7, 40, 15])
print(arr[arr > 20])   # [25 40]
Advanced manipulation
Python
# Reshape
a = np.arange(12)
print(a.reshape(3, 4))        # 3×4 matrix

# Flatten / ravel
print(a.ravel())              # back to 1D

# Transpose
matrix = np.arange(6).reshape(2, 3)
print(matrix.T)               # rows ↔ columns

# Concatenate & stack (arrays must have compatible shapes)
b = np.arange(12, 24)
print(np.concatenate([a, b]))
print(np.vstack([a, b]))      # vertical stack
print(np.hstack([a, b]))      # horizontal stack
2.4 Mathematical & Statistical Functions
NumPy provides fast, vectorized versions of almost all math operations.
Basic math
Python
a = np.array([1, 4, 9, 16])

print(np.sqrt(a))              # [1. 2. 3. 4.]
print(np.exp(a))               # exponential
print(np.log(a))               # natural log
print(np.sin(np.deg2rad(30)))  # sin(30°) = 0.5
Statistical functions
Python
data = np.random.randn(1000)       # 1000 random normal values

print(np.mean(data))               # ≈ 0
print(np.median(data))
print(np.std(data))                # standard deviation
print(np.var(data))                # variance
print(np.min(data), np.max(data))
print(np.percentile(data, 25))     # 25th percentile
Axis-wise operations (very important)
Python
matrix = np.random.randint(1, 100, size=(4, 5))

print(matrix.mean(axis=0))   # mean of each column
print(matrix.sum(axis=1))    # sum of each row
print(matrix.max(axis=0))    # max per column
Mini Summary Project – Basic Data Analysis with NumPy
Python
import numpy as np

# Simulate student marks
marks = np.random.randint(40, 100, size=50)

print("Average marks:", np.mean(marks))
print("Highest marks:", np.max(marks))
print("Lowest marks:", np.min(marks))
print("Top 10% percentile:", np.percentile(marks, 90))

# Students above 80
above_80 = marks[marks >= 80]
print(f"{len(above_80)} students scored 80+")
This completes the full NumPy – Foundation of Numerical Computing section — the true backbone of all data science in Python!
3. Pandas – Data Manipulation & Analysis
Pandas is the most powerful and widely used Python library for data wrangling, cleaning, exploration, and analysis. It is built on top of NumPy and provides high-level data structures (Series and DataFrame) that make working with tabular/structured data feel like using Excel or SQL — but much more powerful.
Install Pandas (if not using Anaconda)
Bash
pip install pandas
Standard import (always use this)
Python
import pandas as pd
3.1 Series and DataFrame – Core Data Structures
Series A one-dimensional labeled array (like a column in Excel or a vector with labels).
Python
# Create Series from list
s1 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s1)
# a    10
# b    20
# c    30
# d    40
# dtype: int64

# Access by label
print(s1['b'])   # 20

# From dictionary
s2 = pd.Series({'Math': 85, 'Science': 92, 'English': 78})
print(s2['Science'])   # 92
DataFrame A two-dimensional labeled data structure (like a spreadsheet or SQL table) — the heart of Pandas.
Python
# Create DataFrame from dictionary
data = {
    'Name': ['Anshuman', 'Priya', 'Rahul', 'Sneha'],
    'Age': [25, 23, 24, 22],
    'City': ['Ranchi', 'Delhi', 'Patna', 'Kolkata'],
    'Marks': [92, 88, 85, 90]
}
df = pd.DataFrame(data)
print(df)
#        Name  Age     City  Marks
# 0  Anshuman   25   Ranchi     92
# 1     Priya   23    Delhi     88
# 2     Rahul   24    Patna     85
# 3     Sneha   22  Kolkata     90

# Basic inspection
print(df.head(2))      # first 2 rows
print(df.info())       # data types, non-null count
print(df.describe())   # summary statistics
print(df.shape)        # (rows, columns) → (4, 4)
Quick access
Python
df['Name']               # Series – one column
df[['Name', 'Marks']]    # DataFrame – multiple columns
df.iloc[0]               # first row (position-based)
df.loc[0, 'Name']        # label-based access
3.2 Data Loading (CSV, Excel, JSON, SQL)
Pandas makes reading data from almost any source effortless.
CSV
Python
df = pd.read_csv("sales_data.csv")
# Options: skiprows=2, usecols=['date', 'sales'], dtype={'sales': float}
Excel
Python
df = pd.read_excel("report.xlsx", sheet_name="Sales", skiprows=1)
# Needs: pip install openpyxl (or xlrd for legacy .xls files)
JSON
Python
df = pd.read_json("data.json")
# or pd.json_normalize() for nested JSON
SQL (with database connection)
Python
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///mydb.db")
df = pd.read_sql("SELECT * FROM customers", engine)
# or pd.read_sql_query(query, engine)
Quick save
Python
df.to_csv("cleaned_data.csv", index=False)
df.to_excel("report.xlsx", index=False)
df.to_json("data.json", orient="records")
3.3 Data Cleaning, Filtering & Transformation
Real data is messy — Pandas excels at cleaning it.
Basic cleaning
Python
df = df.drop_duplicates()                          # remove duplicate rows
df = df.dropna(subset=['age', 'city'])             # drop rows with missing values
df['age'] = df['age'].fillna(df['age'].median())   # fill missing with median
df['salary'] = df['salary'].astype(float)          # change data type
df['date'] = pd.to_datetime(df['date'])            # convert to datetime
Filtering
Python
high_salary = df[df['salary'] > 80000]
young_delhi = df[(df['age'] < 30) & (df['city'] == 'Delhi')]
top_10 = df.nlargest(10, 'marks')
Transformations
Python
df['tax'] = df['salary'] * 0.18                    # new column
df['full_name'] = df['first'] + " " + df['last']
df['salary_category'] = pd.cut(df['salary'],
                               bins=[0, 50000, 100000, np.inf],
                               labels=['Low', 'Medium', 'High'])
3.4 Grouping, Aggregation & Pivot Tables
Groupby – most powerful feature
Python
# Group by city and calculate mean salary
df.groupby('city')['salary'].mean()

# Multiple aggregations
df.groupby('city').agg({
    'salary': ['mean', 'max', 'count'],
    'age': 'median'
})
Pivot Tables
Python
pd.pivot_table(df, values='salary', index='city',
               columns='department', aggfunc='mean', fill_value=0)
Crosstab
Python
pd.crosstab(df['city'], df['gender'], margins=True)
3.5 Handling Missing Values & Outliers
Missing values
Python
# Check missing
df.isnull().sum()

# Fill missing (assignment is safer than inplace=True in modern Pandas)
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna('Unknown')

# Drop missing
df = df.dropna(subset=['salary'])
Detect & handle outliers
Python
# Using the IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Remove outliers
df_clean = df[(df['salary'] >= lower) & (df['salary'] <= upper)]

# Or cap them
df['salary'] = df['salary'].clip(lower=lower, upper=upper)
Mini Summary Project – Quick EDA on Sample Dataset
Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load sample (or your own CSV)
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

# Quick look
print(df.head())
print(df.info())
print(df.describe())

# Missing values
print(df.isnull().sum())

# Group analysis
print(df.groupby('day')['total_bill'].mean())

# Visualization
sns.boxplot(x='day', y='total_bill', data=df)
plt.title("Total Bill by Day")
plt.show()
This completes the full Pandas – Data Manipulation & Analysis section — the most important tool for real data science work in Python!
4. Data Visualization with Matplotlib & Seaborn
Data visualization is one of the most powerful ways to explore data, communicate insights, and tell stories. Matplotlib is the foundational plotting library in Python (highly customizable but requires more code). Seaborn is built on top of Matplotlib — it provides beautiful, high-level statistical plots with minimal code.
Install (if not using Anaconda)
Bash
pip install matplotlib seaborn
Standard imports (always use these)
Python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Set beautiful default style (highly recommended)
sns.set_style("whitegrid")                 # clean white background with grid
plt.rcParams['figure.figsize'] = (10, 6)   # default figure size
4.1 Matplotlib Basics – Line, Bar, Scatter & Histogram
Line Plot
Python
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

plt.plot(x, y1, label='sin(x)', color='blue', linewidth=2)
plt.plot(x, y2, label='cos(x)', color='red', linestyle='--')
plt.title("Sine and Cosine Waves")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.grid(True)
plt.show()
Bar Plot
Python
categories = ['Python', 'R', 'SQL', 'Excel', 'Tableau']
usage = [85, 45, 70, 60, 55]

plt.bar(categories, usage, color='skyblue')
plt.title("Programming Language Popularity (2026)")
plt.xlabel("Language")
plt.ylabel("Usage (%)")
plt.xticks(rotation=45)
plt.show()
Scatter Plot
Python
np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5

plt.scatter(x, y, color='purple', alpha=0.6, s=80, edgecolor='black')
plt.title("Scatter Plot with Correlation")
plt.xlabel("Feature X")
plt.ylabel("Feature Y")
plt.show()
Histogram
Python
data = np.random.normal(loc=50, scale=15, size=1000)   # normal distribution

plt.hist(data, bins=30, color='teal', edgecolor='black', alpha=0.7)
plt.title("Distribution of Exam Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.axvline(data.mean(), color='red', linestyle='--',
            label=f'Mean = {data.mean():.1f}')
plt.legend()
plt.show()
4.2 Seaborn for Statistical Visualization
Seaborn makes statistical plots beautiful and easy.
Line Plot with confidence interval
Python
tips = sns.load_dataset("tips")   # built-in dataset

sns.lineplot(x="total_bill", y="tip", data=tips, hue="time", style="time")
plt.title("Tip vs Total Bill by Time")
plt.show()
Count Plot
Python
sns.countplot(x="day", data=tips, hue="sex", palette="Set2")
plt.title("Number of Customers by Day and Gender")
plt.show()
Pair Plot (exploratory)
Python
sns.pairplot(tips, hue="smoker", diag_kind="kde")
plt.suptitle("Pair Plot of Tips Dataset", y=1.02)
plt.show()
4.3 Advanced Plots – Heatmap, Pairplot, Boxplot
Heatmap (Correlation Matrix)
Python
# Correlation matrix
corr = tips.corr(numeric_only=True)

sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Tips Dataset")
plt.show()
Boxplot (distribution & outliers)
Python
sns.boxplot(x="day", y="total_bill", hue="smoker", data=tips, palette="Set3")
plt.title("Total Bill Distribution by Day & Smoking Status")
plt.show()
Violin Plot (distribution + density)
Python
sns.violinplot(x="day", y="tip", hue="sex", data=tips, split=True, palette="muted")
plt.title("Tip Distribution by Day and Gender")
plt.show()
4.4 Creating Publication-Ready Visualizations
Tips to make plots look professional and publication-quality:
Best Practices Code Template
Python
plt.figure(figsize=(10, 6), dpi=120)       # high resolution
sns.set_context("paper", font_scale=1.3)   # publication style

# Your plot here
sns.boxplot(x="day", y="total_bill", data=tips, palette="pastel")

plt.title("Total Bill Distribution by Day", fontsize=16, fontweight='bold')
plt.xlabel("Day of the Week", fontsize=14)
plt.ylabel("Total Bill (USD)", fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)

# Save high-quality image
plt.tight_layout()
plt.savefig("publication_plot.png", dpi=300, bbox_inches='tight')
plt.show()
Additional Tips for Publication/Report Quality:
Use sns.set_style("whitegrid") or "ticks" for clean look
Choose color palettes: "viridis", "magma", "coolwarm", "Set2", "pastel"
Add annotations: plt.annotate(), sns.despine()
Use fig, ax = plt.subplots() for multi-plot figures
Export as PNG (300+ dpi) or SVG for journals
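As a sketch of the last two tips (the figure size, titles, and file name here are arbitrary choices), a two-panel figure built with fig, ax = plt.subplots() and exported as SVG might look like:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)

# One figure, two axes side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 5), dpi=120)

axes[0].plot(x, np.sin(x), color='steelblue')
axes[0].set_title("Panel A: sin(x)")

axes[1].hist(np.random.default_rng(0).normal(size=500),
             bins=30, color='salmon', edgecolor='black')
axes[1].set_title("Panel B: Normal sample")

fig.suptitle("Two-Panel Figure with plt.subplots()", fontweight='bold')
fig.tight_layout()
fig.savefig("two_panel.svg")   # SVG scales cleanly in journals
plt.close(fig)
```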
Mini Summary Project – Full EDA Visualization
Python
# Load built-in dataset
df = sns.load_dataset("penguins")

# 1. Pair plot
sns.pairplot(df, hue="species", diag_kind="kde")
plt.suptitle("Penguin Species Comparison", y=1.02)
plt.show()

# 2. Boxplot of bill length by species
sns.boxplot(x="species", y="bill_length_mm", data=df, palette="Set2")
plt.title("Bill Length Distribution by Penguin Species")
plt.show()

# 3. Heatmap of correlations
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix – Penguin Dataset")
plt.show()
This completes the full Data Visualization with Matplotlib & Seaborn section — now you can create beautiful, insightful, and publication-ready visualizations!
5. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the most critical step in any data science project. It helps you understand the data, discover patterns, detect anomalies, find relationships, and form hypotheses — all before building any model.
Why EDA is important in 2026:
Prevents garbage-in-garbage-out (bad data → bad model)
Saves time & money by identifying issues early
Guides feature engineering and model selection
Creates compelling stories for stakeholders/reports
Core tools for EDA:
Pandas (data manipulation)
NumPy (numerical operations)
Matplotlib + Seaborn (visualization)
Missingno, Sweetviz, ydata-profiling (formerly Pandas Profiling) – automated EDA reports
5.1 Understanding Data Distribution & Summary Statistics
First step: Load & inspect data
Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset (or use your own CSV)
df = sns.load_dataset("titanic")

# Quick overview
print(df.head())
print(df.info())
print(df.shape)                        # (rows, columns)
print(df.describe())                   # numerical summary
print(df.describe(include='object'))   # categorical summary
Key summary statistics
Mean / Median / Mode → central tendency
Standard deviation / IQR → spread
Min / Max / Percentiles → range & outliers
Skewness & Kurtosis → shape of distribution
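Each of these statistics is a one-liner in Pandas. A minimal sketch on simulated ages (synthetic data, not the Titanic column):

```python
import pandas as pd
import numpy as np

# Simulated, roughly normal "age" data
rng = np.random.default_rng(0)
age = pd.Series(rng.normal(35, 12, 500))

print("Mean:    ", age.mean())
print("Median:  ", age.median())
print("Mode:    ", age.round().mode()[0])
print("Std:     ", age.std())
print("IQR:     ", age.quantile(0.75) - age.quantile(0.25))
print("Skewness:", age.skew())       # ~0 for symmetric data
print("Kurtosis:", age.kurtosis())   # ~0 for normal data (Fisher definition)
```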
Visualize distribution (histogram + KDE)
Python
plt.figure(figsize=(10, 6))
sns.histplot(df['age'].dropna(), kde=True, bins=30, color='teal')
plt.title("Age Distribution of Titanic Passengers")
plt.xlabel("Age")
plt.ylabel("Count")
plt.axvline(df['age'].mean(), color='red', linestyle='--',
            label=f'Mean = {df["age"].mean():.1f}')
plt.axvline(df['age'].median(), color='green', linestyle='--',
            label=f'Median = {df["age"].median():.1f}')
plt.legend()
plt.show()
Check skewness
Python
print("Skewness of Age:", df['age'].skew()) # positive → right-skewed
5.2 Univariate, Bivariate & Multivariate Analysis
Univariate Analysis – Study one variable at a time
Python
# Categorical
sns.countplot(x='class', data=df, palette='Set2')
plt.title("Passenger Class Distribution")
plt.show()

# Numerical
sns.boxplot(x='fare', data=df, color='lightblue')
plt.title("Fare Distribution (with outliers)")
plt.show()
Bivariate Analysis – Relationship between two variables
Python
# Numerical vs Numerical
sns.scatterplot(x='age', y='fare', hue='survived', data=df, palette='coolwarm')
plt.title("Age vs Fare by Survival")
plt.show()

# Categorical vs Numerical
sns.boxplot(x='class', y='fare', hue='sex', data=df)
plt.title("Fare by Passenger Class & Gender")
plt.show()

# Categorical vs Categorical
pd.crosstab(df['class'], df['survived'], normalize='index').plot(kind='bar', stacked=True)
plt.title("Survival Rate by Passenger Class")
plt.show()
Multivariate Analysis – More than two variables
Python
# Pair plot (best for a quick multivariate look)
sns.pairplot(df[['age', 'fare', 'survived']], hue='survived', diag_kind='kde')
plt.suptitle("Multivariate Relationships – Titanic Dataset", y=1.02)
plt.show()
5.3 Correlation Analysis & Feature Relationships
Correlation Matrix (Pearson)
Python
# Select only numeric columns
numeric_df = df.select_dtypes(include=['number'])
corr = numeric_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f',
            linewidths=0.5, vmin=-1, vmax=1)
plt.title("Correlation Matrix – Titanic Features")
plt.show()
Interpretation tips:
Values near +1 → strong positive correlation
Values near -1 → strong negative correlation
Values near 0 → no linear relationship
Correlation ≠ causation!
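A tiny sketch of why "near 0" only rules out linear relationships: a perfect quadratic dependence y = x² still gives a Pearson r close to zero (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 1000)
y = x ** 2   # perfectly dependent on x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")   # close to 0 despite a strong relationship
```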
Advanced: Spearman / Kendall correlation (good for non-linear or ordinal data)
Python
corr_spearman = numeric_df.corr(method='spearman')
sns.heatmap(corr_spearman, annot=True, cmap='viridis')
plt.title("Spearman Correlation")
plt.show()
5.4 Real-World EDA Case Study
Dataset: Titanic (classic but very educational)
Complete EDA workflow (copy-paste ready)
Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("titanic")

# 1. Overview
print("Shape:", df.shape)
print("\nMissing Values:\n", df.isnull().sum())
print("\nData Types:\n", df.dtypes)

# 2. Univariate
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['age'].dropna(), kde=True, color='teal')
plt.title("Age Distribution")
plt.subplot(1, 2, 2)
sns.countplot(x='class', data=df, palette='Set2')
plt.title("Passenger Class Distribution")
plt.tight_layout()
plt.show()

# 3. Bivariate
plt.figure(figsize=(10, 6))
sns.boxplot(x='class', y='fare', hue='survived', data=df)
plt.title("Fare by Class & Survival")
plt.show()

# 4. Correlation
numeric = df.select_dtypes(include=['number'])
corr = numeric.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Heatmap")
plt.show()

# 5. Survival Rate by Gender & Class
pd.crosstab([df['sex'], df['class']], df['survived'],
            normalize='index').plot(kind='bar', stacked=True)
plt.title("Survival Rate by Gender & Class")
plt.show()

print("Key Insights:")
print("- Females had a much higher survival rate than males")
print("- Higher class (1st) had better survival and higher fares")
print("- Age has missing values – needs imputation")
print("- Fare is highly skewed – consider log transformation")
Key Insights from Titanic EDA (typical findings):
Women & children had higher survival rates
1st class passengers survived more
Fare is a strong indicator of class & survival
Age and cabin columns have many missing values → imputation (or dropping) needed
Many categorical variables → encoding required
This completes the full Exploratory Data Analysis (EDA) section — now you know how to deeply understand any dataset before modeling!
6. Data Preprocessing & Feature Engineering
Data preprocessing and feature engineering are the most time-consuming and most important steps in any data science project. Good preprocessing turns raw, messy data into clean, model-ready input. Feature engineering creates new powerful features that can dramatically improve model performance.
Goal: Prepare data so machine learning models can learn effectively and generalize well.
6.1 Data Scaling & Normalization
Many machine learning algorithms (especially distance-based ones like KNN, SVM, K-Means, Neural Networks) perform poorly if features are on different scales.
Common Scaling Techniques:
| Technique | Formula / Method | When to Use | Range / Output | Affected by Outliers? |
|---|---|---|---|---|
| Min-Max Scaling | (X - min) / (max - min) | Neural networks, image data, bounded data | [0, 1] or custom range | Yes |
| Standardization | (X - mean) / std | Most algorithms (SVM, logistic regression, PCA) | Mean ≈ 0, Std ≈ 1 | Yes |
| Robust Scaling | (X - median) / IQR | Data with outliers | Centered around median | No |
| Log Transformation | log(1 + X) or Box-Cox | Highly skewed data (income, time, counts) | Reduces skewness | Reduces impact |
Code Examples
Python
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

df = pd.DataFrame({
    'age': [25, 30, 45, 22, 60],
    'salary': [30000, 50000, 120000, 25000, 200000],
    'experience': [1, 3, 10, 0, 20]
})

# 1. Min-Max Scaling (0 to 1)
scaler_minmax = MinMaxScaler()
df[['age_minmax', 'salary_minmax']] = scaler_minmax.fit_transform(df[['age', 'salary']])

# 2. Standardization (mean=0, std=1)
scaler_std = StandardScaler()
df[['age_std', 'salary_std']] = scaler_std.fit_transform(df[['age', 'salary']])

# 3. Robust Scaling (handles outliers)
scaler_robust = RobustScaler()
df[['salary_robust']] = scaler_robust.fit_transform(df[['salary']])

# 4. Log Transformation (for skewed data)
df['salary_log'] = np.log1p(df['salary'])   # log(1 + x) to handle 0

print(df)
Quick rule of thumb (2026):
Use StandardScaler for most ML models (default choice)
Use MinMaxScaler for neural networks or when you need [0,1] range
Use RobustScaler if data has outliers
Apply log or sqrt for highly right-skewed features (income, time, counts)
6.2 Encoding Categorical Variables
Machine learning models require numerical input — so we convert categories to numbers.
Common Encoding Techniques:
| Technique | Method / Library | When to Use | Pros | Cons |
|---|---|---|---|---|
| Label Encoding | sklearn.preprocessing.LabelEncoder | Ordinal categories (low < medium < high) | Simple, fast | Implies order (bad for nominal) |
| One-Hot Encoding | pd.get_dummies / OneHotEncoder | Nominal categories (colors, cities) | No order assumption | High dimensionality (curse) |
| Target / Mean Encoding | category_encoders.TargetEncoder | High cardinality nominal (many unique values) | Captures target relationship | Risk of data leakage |
| Frequency / Count Encoding | Manual or category_encoders.CountEncoder | High cardinality, no target leakage needed | Simple | No target information |
| Binary Encoding | category_encoders.BinaryEncoder | Very high cardinality | Reduces dimensions | Less interpretable |
Code Examples
Python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from category_encoders import TargetEncoder   # pip install category_encoders

df = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Kolkata'],
    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
    'salary': [50000, 80000, 120000, 65000, 90000]
})

# 1. Label Encoding (ordinal)
le = LabelEncoder()
df['education_label'] = le.fit_transform(df['education'])

# 2. One-Hot Encoding (nominal)
df_onehot = pd.get_dummies(df, columns=['city'], prefix='city', drop_first=True)

# 3. Target Encoding (high cardinality + target)
encoder = TargetEncoder(cols=['city'])
df['city_target'] = encoder.fit_transform(df['city'], df['salary'])
Best practice:
Use OneHotEncoder for low-cardinality nominal features (<10–15 categories)
Use TargetEncoder or Mean Encoding for high-cardinality with target variable
Always fit on train set only — avoid data leakage
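A minimal sketch of leakage-free encoding on toy data (the city values are hypothetical): fit the encoder on the training rows only, then reuse it on the test rows, so categories unseen in training don't silently change the feature space.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy split: 'Chennai' appears only in the test set
train = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi']})
test = pd.DataFrame({'city': ['Mumbai', 'Chennai']})

# Fit on the training set ONLY
enc = OneHotEncoder(handle_unknown='ignore')
train_enc = enc.fit_transform(train[['city']]).toarray()
test_enc = enc.transform(test[['city']]).toarray()

print(train_enc.shape)  # (3, 2): one column each for Delhi and Mumbai
print(test_enc[1])      # all zeros: the unseen 'Chennai' is safely ignored
```

`handle_unknown='ignore'` is what makes the second `transform` safe; without it, an unseen category raises an error at prediction time.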
6.3 Feature Selection Techniques
Feature selection reduces dimensionality, removes noise, speeds up training, and improves model performance.
Common Methods:
Filter Methods (fast, model-independent)
Variance Threshold
Correlation with target
Chi-square, ANOVA F-test
Wrapper Methods (model-dependent, accurate but slow)
Forward/Backward Selection
Recursive Feature Elimination (RFE)
Embedded Methods (built into model)
Lasso / Ridge regression (L1 regularization)
Tree-based feature importance (Random Forest, XGBoost)
Code Examples
Python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier

# Synthetic numeric features with a classification target
# (these selectors expect numeric X; f_classif and RandomForestClassifier expect a class label y)
X_arr, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=42)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(10)])

# 1. Remove low-variance features
selector_var = VarianceThreshold(threshold=0.01)
X_var = selector_var.fit_transform(X)

# 2. Select top K features (ANOVA F-test for classification)
selector_k = SelectKBest(score_func=f_classif, k=5)
X_kbest = selector_k.fit_transform(X, y)

# 3. Recursive Feature Elimination
model = RandomForestClassifier(random_state=42)
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

# 4. Tree-based importance
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
Rule of thumb:
Start with filter methods (fast)
Use embedded or wrapper for final selection
Never select features on full dataset — use train set only
6.4 Handling Imbalanced Datasets
Imbalanced data (e.g., fraud detection, disease prediction) is very common — models trained on it tend to favor the majority class.
Common Techniques:
Resampling Methods
Oversampling minority (SMOTE, ADASYN)
Undersampling majority (RandomUnderSampler)
Combination (SMOTE + Tomek / SMOTE + ENN)
Class Weighting
Most algorithms support class_weight='balanced'
Evaluation Metrics
Use Precision, Recall, F1-score, ROC-AUC (not accuracy)
Code Examples
Python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# X, y = your features and (imbalanced) class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# 1. SMOTE oversampling (training data only!)
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

# 2. Class weighting (easier)
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# 3. Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Best practice (2026):
Prefer class_weight or balanced accuracy first (simple & no data creation)
Use SMOTE carefully — only on training data
Always evaluate with stratified split and F1 / ROC-AUC
Mini Summary Project – Full Preprocessing Pipeline
Python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# df: a DataFrame with numeric and categorical columns (e.g. the Titanic dataset)
numeric_features = ['age', 'fare']
categorical_features = ['sex', 'class']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first'), categorical_features)
])

X_preprocessed = preprocessor.fit_transform(df)
This completes the full Data Preprocessing & Feature Engineering section — now you know how to transform raw data into model-ready input!
7. Statistics & Probability for Data Science
Statistics and probability form the mathematical foundation of data science. Without understanding them, machine learning models, hypothesis testing, confidence intervals, and model evaluation become guesswork.
7.1 Descriptive vs Inferential Statistics
Descriptive Statistics → Summarizes and describes the data you already have (the sample).
Common tools:
Measures of central tendency: mean, median, mode
Measures of spread: range, variance, standard deviation, IQR
Shape: skewness, kurtosis
Visuals: histogram, boxplot, density plot
Example (Python)
Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("tips")

# Descriptive summary
print(df['total_bill'].describe())
# count    244.000000
# mean      19.785943
# std        8.902412
# min        3.070000
# 25%       13.347500
# 50%       17.795000
# 75%       24.127500
# max       50.810000

sns.histplot(df['total_bill'], kde=True)
plt.title("Distribution of Total Bill (Descriptive)")
plt.show()
Inferential Statistics → Uses sample data to make conclusions / predictions about the population.
Common tools:
Hypothesis testing
Confidence intervals
Regression analysis
p-values, significance levels
Key difference (2026 perspective)
Descriptive: "What does my data look like?" (past/current)
Inferential: "What can I say about the larger population?" (future/generalization)
7.2 Hypothesis Testing & p-value
Hypothesis testing helps decide whether observed effects in sample data are real (statistically significant) or due to random chance.
Basic steps
State null hypothesis (H₀) – usually "no effect / no difference"
State alternative hypothesis (H₁) – what you want to prove
Choose significance level (α) – commonly 0.05
Calculate test statistic & p-value
If p-value ≤ α → reject H₀ (statistically significant)
Common tests
t-test (compare means)
Chi-square test (categorical data)
ANOVA (compare means across 3+ groups)
p-value interpretation (2026 correct understanding)
p-value = probability of observing data at least as extreme as what was seen, assuming H₀ is true
Small p-value (< 0.05) → strong evidence against H₀
Not "probability that H₀ is true"
Example: One-sample t-test
Python
from scipy import stats

# Suppose the claimed average salary = ₹80,000
salaries = [75000, 82000, 78000, 79000, 81000, 83000, 77000]

t_stat, p_value = stats.ttest_1samp(salaries, 80000)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
# If p-value < 0.05 -> reject null (mean salary ≠ ₹80,000)
Two-sample t-test
Python
group1 = [85, 88, 90, 92, 87]
group2 = [78, 80, 82, 79, 81]

t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"p-value: {p_value:.4f}")
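The chi-square test from the list above works on categorical counts rather than means; a sketch with a made-up 2×2 contingency table (the groups and counts are hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = group A/B, columns = preference for product X/Y
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2: {chi2:.3f}, p-value: {p_value:.4f}, dof: {dof}")
# A small p-value (< 0.05) suggests the two variables are NOT independent
```

`expected` holds the counts you would see under independence, which is useful for sanity-checking the result.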
7.3 Probability Distributions
Probability distributions describe how probabilities are distributed over values of a random variable.
Key distributions in data science (2026)
Normal / Gaussian Distribution (bell curve)
Most important – Central Limit Theorem
Used in: z-scores, confidence intervals, many ML assumptions
Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4, 4, 1000)
plt.plot(x, norm.pdf(x, loc=0, scale=1))
plt.title("Standard Normal Distribution")
plt.show()
Binomial Distribution (discrete)
Number of successes in n independent trials
Example: Click-through rate (CTR)
Poisson Distribution (discrete)
Number of events in fixed interval (rare events)
Example: Number of customer complaints per day
Exponential Distribution (continuous)
Time between events in Poisson process
Example: Time between customer arrivals
Uniform Distribution
All values equally likely
Quick visualization of common distributions
Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, binom, poisson, expon

x = np.linspace(0, 20, 1000)

plt.subplot(2, 2, 1)
plt.plot(x, norm.pdf(x, loc=10, scale=3))
plt.title("Normal")

plt.subplot(2, 2, 2)
plt.bar(range(20), binom.pmf(range(20), n=20, p=0.5))
plt.title("Binomial")

plt.subplot(2, 2, 3)
plt.bar(range(20), poisson.pmf(range(20), mu=5))
plt.title("Poisson")

plt.subplot(2, 2, 4)
plt.plot(x, expon.pdf(x, scale=5))
plt.title("Exponential")

plt.tight_layout()
plt.show()
7.4 Correlation, Regression & Confidence Intervals
Correlation measures linear relationship strength & direction.
Python
# Pearson correlation
print(df[['total_bill', 'tip']].corr())
#             total_bill       tip
# total_bill    1.000000  0.675734
# tip           0.675734  1.000000

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
Simple Linear Regression
Python
from sklearn.linear_model import LinearRegression

X = df[['total_bill']]
y = df['tip']

model = LinearRegression()
model.fit(X, y)

print("Slope (β1):", model.coef_[0])
print("Intercept (β0):", model.intercept_)
Confidence Intervals
Python
from scipy import stats

# 95% confidence interval for the mean tip
mean_tip = df['tip'].mean()
ci = stats.t.interval(0.95, len(df['tip']) - 1,
                      loc=mean_tip, scale=stats.sem(df['tip']))
print(f"95% CI for mean tip: {ci}")
Interpretation (2026 correct way): "We are 95% confident that the true population mean tip lies between X and Y."
Mini Summary Project – Full Statistical Analysis
Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

df = sns.load_dataset("tips")

# 1. Summary stats
print(df['tip'].describe())

# 2. Hypothesis test: Do smokers tip differently?
smoker_tip = df[df['smoker'] == 'Yes']['tip']
non_smoker_tip = df[df['smoker'] == 'No']['tip']
t_stat, p_val = stats.ttest_ind(smoker_tip, non_smoker_tip)
print(f"p-value: {p_val:.4f}")
if p_val < 0.05:
    print("Significant difference in tipping between smokers and non-smokers")

# 3. Correlation & regression
sns.regplot(x='total_bill', y='tip', data=df)
plt.title("Tip vs Total Bill with Regression Line")
plt.show()
This completes the full Statistics & Probability for Data Science section — now you understand the mathematical foundation behind every data science model and decision!
8. Machine Learning with Scikit-learn
Scikit-learn (sklearn) is the most popular open-source machine learning library in Python. It provides simple, consistent, and efficient tools for data mining and analysis — from preprocessing to model evaluation and deployment.
Install Scikit-learn (if not using Anaconda)
Bash
pip install scikit-learn
Standard import
Python
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error
8.1 Supervised Learning – Regression & Classification
Supervised Learning = Learning from labeled data (input + correct output).
Regression → Predict continuous values (e.g., house price, temperature, salary)
Classification → Predict discrete classes (e.g., spam/not spam, disease/no disease)
Common algorithms in Scikit-learn
Regression:
Linear Regression
Ridge / Lasso (regularized)
Decision Tree / Random Forest Regressor
Gradient Boosting (XGBoost, LightGBM often used via sklearn interface)
Classification:
Logistic Regression
Decision Tree / Random Forest Classifier
Support Vector Machine (SVM)
k-Nearest Neighbors (KNN)
Gradient Boosting Classifier
Basic Regression Example (House Price Prediction)
Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing

# Load data
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Note: squared=False was removed from mean_squared_error in newer
# scikit-learn releases, so take the square root explicitly
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"RMSE: {rmse:.3f}")
Basic Classification Example (Iris Dataset)
Python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Accuracy: {accuracy:.3f}")
8.2 Model Training, Evaluation & Hyperparameter Tuning
Training = model.fit(X_train, y_train)
Prediction = model.predict(X_test)
Evaluation Metrics
Regression:
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
R² Score
Classification:
Accuracy
Precision / Recall / F1-Score
Confusion Matrix
ROC-AUC (especially for imbalanced data)
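All of the classification metrics listed above are available in sklearn.metrics; a sketch on small made-up labels (here TP=3, FP=1, FN=1, TN=5, so accuracy is 0.8 and precision, recall, and F1 are all 0.75):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

# Made-up ground truth, hard predictions, and predicted probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.2, 0.7, 0.1]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.8
print("Precision:", precision_score(y_true, y_pred))  # 0.75
print("Recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1       :", f1_score(y_true, y_pred))         # 0.75
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))    # uses probabilities, not labels
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```

Note that ROC-AUC is computed from predicted probabilities (`y_prob`), while the other metrics use the hard predictions.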
Hyperparameter Tuning (Grid Search / Random Search)
Python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
Best practice (2026):
Use RandomizedSearchCV for large search spaces (faster)
Use cross-validation for reliable evaluation
Never tune on test set — use validation set or cross-validation
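A sketch of RandomizedSearchCV on the iris dataset (the parameter ranges are illustrative): instead of exhaustively trying every grid combination, it samples a fixed number of random ones, which scales much better for large search spaces.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample 10 random combinations instead of the full grid
param_dist = {
    'n_estimators': randint(50, 201),
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': randint(2, 11),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10, cv=3, scoring='f1_macro',
    random_state=42, n_jobs=-1
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV score:", round(search.best_score_, 3))
```

With `n_iter=10` and `cv=3` this fits 30 models; the equivalent exhaustive grid over these ranges would be far larger.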
8.3 Cross-Validation & Model Selection
Cross-Validation = Splitting data multiple times to get reliable performance estimate.
Most common: K-Fold CV
Python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X, y,
    cv=5,  # 5-fold
    scoring='accuracy'
)
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())
print("Std deviation:", scores.std())
Stratified K-Fold (for imbalanced classification)
Python
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5)
# model = any classifier defined earlier
scores = cross_val_score(model, X, y, cv=cv, scoring='f1_macro')
Model Selection Flow (recommended)
Split data → train / validation / test (80/10/10 or 70/15/15)
Preprocess → fit on train only, transform validation/test
Try multiple models with cross-validation on train+validation
Select best model → tune hyperparameters
Final evaluation on hold-out test set
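Step 1 of the flow above can be sketched with two calls to train_test_split on dummy data (an 80/10/10 split; the data itself is a placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 1000 samples, 1 feature
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2  # dummy labels

# First carve out the hold-out test set (10%),
# then split the remainder into train and validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=100, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=100, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```

The test set is set aside before any model work begins and is touched only once, for the final evaluation.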
8.4 Unsupervised Learning – Clustering & Dimensionality Reduction
Unsupervised Learning → No labels. Discover hidden structure in data.
Clustering – Group similar data points
Most popular: K-Means
Python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200)
plt.title("K-Means Clustering")
plt.show()
Dimensionality Reduction – Reduce number of features while preserving information
PCA (Principal Component Analysis)
Python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.title("PCA – 2D Visualization")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

print("Explained variance ratio:", pca.explained_variance_ratio_)
t-SNE (non-linear, great for visualization)
Python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='viridis')
plt.title("t-SNE Visualization")
plt.show()
Mini Summary Project – End-to-End ML Pipeline
Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = df.drop('target', axis=1)  # your features
y = df['target']               # your (binary) target

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print("F1 scores:", scores)
print("Mean F1:", scores.mean())
This completes the full Machine Learning with Scikit-learn section — now you can build, train, evaluate, and tune real ML models!
9. Advanced Data Science Topics
After mastering the fundamentals (EDA, preprocessing, classical ML), this section introduces more advanced and highly in-demand areas in 2026: time series, NLP, deep learning, and deployment/MLOps. These topics are essential for real-world projects, research papers, and industry roles.
9.1 Time Series Analysis & Forecasting
Time series data has a temporal order (stock prices, sales, weather, sensor readings). The goal is to understand patterns (trend, seasonality, cycles) and predict future values.
Key concepts
Trend: long-term increase/decrease
Seasonality: repeating patterns (weekly, monthly, yearly)
Stationarity: statistical properties constant over time (most models require it)
Autocorrelation: correlation with lagged values
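The `d` in ARIMA's (p, d, q) order is exactly the stationarity idea above: difference the series until its statistical properties stop drifting. A minimal sketch of first-order differencing on a toy trending series:

```python
import pandas as pd

# A series with a clear upward trend (non-stationary)
s = pd.Series([100, 104, 109, 115, 122, 130, 139])

# First-order differencing often removes a roughly linear trend
diff = s.diff().dropna()
print(diff.tolist())  # [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```

In practice you would re-run a stationarity check (such as the ADF test shown below) on the differenced series and difference again only if needed.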
Popular libraries
statsmodels (ARIMA, SARIMA)
Prophet (Facebook)
sktime (modern, scikit-learn compatible)
Basic ARIMA example
Python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Load sample time series (AirPassengers or your own data)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url, parse_dates=['Month'], index_col='Month')
series = df['Passengers']

# Check stationarity
result = adfuller(series)
print("ADF Statistic:", result[0])
print("p-value:", result[1])  # if > 0.05 -> not stationary -> difference

# Fit ARIMA(p, d, q) – here (5, 1, 0) is a reasonable starting point
model = ARIMA(series, order=(5, 1, 0))
model_fit = model.fit()

# Forecast next 12 months
forecast = model_fit.forecast(steps=12)
print("Forecast:", forecast)

# Plot
plt.plot(series, label='Actual')
plt.plot(forecast, label='Forecast', color='red')
plt.title("Air Passengers Forecast")
plt.legend()
plt.show()
Prophet (very easy & powerful)
Python
from prophet import Prophet

df_prophet = df.reset_index().rename(columns={'Month': 'ds', 'Passengers': 'y'})

m = Prophet(yearly_seasonality=True)
m.fit(df_prophet)

future = m.make_future_dataframe(periods=12, freq='MS')
forecast = m.predict(future)

m.plot(forecast)
plt.title("Prophet Forecast – Air Passengers")
plt.show()
When to use what (2026):
Short-term, classic data → ARIMA/SARIMA
Business time series with holidays → Prophet
Multivariate → VAR, LSTM (deep learning)
9.2 Natural Language Processing (NLP) Basics
NLP deals with text data: sentiment analysis, chatbots, translation, summarization, etc.
Essential libraries in 2026
NLTK / spaCy (traditional)
Transformers (Hugging Face) → state-of-the-art (BERT, GPT-style)
Basic NLP pipeline with spaCy
Python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion in 2026"
doc = nlp(text)

for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named Entity Recognition (NER)
for ent in doc.ents:
    print(ent.text, ent.label_)

# Output:
# Apple ORG
# U.K. GPE
# $1 billion MONEY
# 2026 DATE
Sentiment Analysis with Hugging Face (easiest & best in 2026)
Python
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")

reviews = [
    "This product is amazing! Love it.",
    "Worst experience ever. Do not buy.",
    "It's okay, nothing special."
]

results = sentiment_pipeline(reviews)
for review, res in zip(reviews, results):
    print(f"Review: {review}")
    print(f"Sentiment: {res['label']} (score: {res['score']:.4f})\n")
Text Classification (custom model) Use Hugging Face Trainer API or scikit-learn + TF-IDF.
9.3 Introduction to Deep Learning with TensorFlow/Keras
Deep learning = neural networks with many layers. In 2026, Keras (inside TensorFlow) is still the easiest high-level API for beginners.
Basic Neural Network (Classification)
Python
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = models.Sequential([
    layers.Dense(16, activation='relu', input_shape=(4,)),
    layers.Dense(8, activation='relu'),
    layers.Dense(3, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=1)

test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")
Why Keras in 2026?
Simple & readable (Sequential / Functional API)
Built-in callbacks (EarlyStopping, ModelCheckpoint)
Integrates with TensorFlow ecosystem (TensorBoard, TPU support)
9.4 Model Deployment & MLOps Basics
Deployment = putting model into production so others can use it.
Popular options in 2026 (easy to advanced):
Streamlit / Gradio → interactive web apps (fastest)
FastAPI + Uvicorn → production API
Flask / Django → traditional web
Docker + Kubernetes → scalable deployment
MLflow / BentoML → full MLOps
Simple Streamlit app example
Python
# app.py
import streamlit as st
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

st.title("Simple Iris Flower Prediction")

sepal_length = st.slider("Sepal Length", 4.0, 8.0, 5.0)
sepal_width = st.slider("Sepal Width", 2.0, 4.5, 3.5)
petal_length = st.slider("Petal Length", 1.0, 7.0, 4.0)
petal_width = st.slider("Petal Width", 0.1, 2.5, 1.3)

@st.cache_resource
def get_model():
    # Train a small demo model (in practice, load your saved trained model here)
    iris = load_iris()
    return LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

model = get_model()
prediction = model.predict([[sepal_length, sepal_width, petal_length, petal_width]])
st.write(f"Predicted class: {prediction[0]}")
Run:
Bash
pip install streamlit
streamlit run app.py
MLOps Basics (2026 essentials)
Version control data & models → DVC
Track experiments → MLflow
Package models → BentoML / ONNX
Deploy → Docker + Render / Railway / AWS / GCP
Mini Summary Project – End-to-End Churn Prediction
Load data → EDA (section 5)
Preprocess → scale, encode (section 6)
Train Random Forest / XGBoost
Evaluate → cross-validation, ROC-AUC
Deploy simple Streamlit app for prediction
This completes the full Advanced Data Science Topics section — now you have exposure to time series, NLP, deep learning, and deployment!
10. Real-World Projects & Case Studies
These four hands-on projects apply everything you've learned: data loading, EDA, preprocessing, modeling, evaluation, visualization, and interpretation. They are designed to be portfolio-ready and commonly asked about in interviews.
10.1 Project 1: House Price Prediction (Regression)
Goal: Predict house prices based on features (classic regression problem).
Dataset: California Housing (built-in in sklearn) or use Kaggle's House Prices dataset.
Steps & Code
Python
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load data
housing = fetch_california_housing(as_frame=True)
df = housing.frame
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

# 2. EDA (quick look)
print(df.describe())
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix – House Prices")
plt.show()

# 3. Preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Model training & evaluation
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print(f"RMSE: {rmse:.3f} (lower is better)")
print(f"R² Score: {r2:.3f} (closer to 1 is better)")

# 5. Feature importance
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).plot(kind='bar')
plt.title("Feature Importance – House Price Prediction")
plt.show()
Key Takeaways:
Median Income is usually the strongest predictor
RMSE in range 0.45–0.55 is good for this dataset
Try XGBoost or LightGBM for better performance
Improvements:
Add feature engineering (rooms per household, age buckets)
Hyperparameter tuning (GridSearchCV)
Deploy as Streamlit app
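The "rooms per household" and "age buckets" ideas above, sketched on a toy frame (the column names here are hypothetical, not the exact California Housing schema):

```python
import pandas as pd

# Hypothetical raw housing columns
df = pd.DataFrame({
    'total_rooms': [600, 1200, 900],
    'households': [200, 300, 450],
    'house_age': [3, 27, 52],
})

# Ratio feature: rooms per household
df['rooms_per_household'] = df['total_rooms'] / df['households']

# Age buckets: turn a continuous feature into coarse categories
df['age_bucket'] = pd.cut(df['house_age'], bins=[0, 10, 30, 100],
                          labels=['new', 'mid', 'old'])
print(df)
```

Ratio and bucket features like these often capture non-linear structure that raw columns miss, and they give tree models cleaner split points.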
10.2 Project 2: Customer Churn Prediction (Classification)
Goal: Predict whether a customer will leave (churn) — imbalanced classification problem.
Dataset: Telco Customer Churn (Kaggle or use seaborn example)
Steps & Code
Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# 1. Load & quick EDA
df = pd.read_csv("https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv")
print(df['Churn'].value_counts(normalize=True))  # imbalanced: ~73% "No"

# 2. Preprocessing
df = df.drop(['customerID'], axis=1)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df = df.dropna()

X = df.drop('Churn', axis=1)
y = df['Churn'].map({'Yes': 1, 'No': 0})

categorical = X.select_dtypes(include='object').columns
numeric = X.select_dtypes(include=['int64', 'float64']).columns

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric),
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical)
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 3. Model pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(class_weight='balanced', random_state=42))
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))

# 4. Confusion matrix visualization
sns.heatmap(pd.crosstab(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix – Churn Prediction")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
Key Takeaways:
Class imbalance → use class_weight='balanced' or SMOTE
Focus on Recall (catching churners) and ROC-AUC
Top features: Contract type, tenure, monthly charges
Improvements:
Try XGBoost / LightGBM
Add SMOTE in pipeline
Create dashboard with Streamlit
10.3 Project 3: Sentiment Analysis on Reviews (NLP)
Goal: Classify product/movie reviews as positive/negative/neutral.
Dataset: Amazon Reviews or IMDb (use Hugging Face datasets)
Easy & powerful method: Hugging Face Transformers
Python
from transformers import pipeline
import pandas as pd

# Load sentiment pipeline (pre-trained model)
sentiment = pipeline("sentiment-analysis",
                     model="nlptown/bert-base-multilingual-uncased-sentiment")

# Sample reviews
reviews = [
    "This phone is amazing! Battery lasts all day.",
    "Worst product ever. Broke in 2 days.",
    "It's okay, nothing special but works fine.",
    "Absolutely love it! Best purchase this year."
]

results = sentiment(reviews)
for review, res in zip(reviews, results):
    print(f"Review: {review}")
    print(f"Sentiment: {res['label']} (score: {res['score']:.4f})\n")
Custom model with scikit-learn + TF-IDF
Python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assume df has 'review' and 'sentiment' columns (1=positive, 0=negative)
X = df['review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)
print(classification_report(y_test, y_pred))
Key Takeaways:
Pre-trained transformers (Hugging Face) → best accuracy with almost no code
TF-IDF + Logistic Regression → fast baseline, good interpretability
Improvements:
Fine-tune BERT/RoBERTa
Add emoji/text cleaning
Create Streamlit app for live prediction
10.4 Project 4: Sales Dashboard & EDA Report
Goal: Create an interactive EDA & sales dashboard using Streamlit.
Install
Bash
pip install streamlit pandas plotly
Full code (save as app.py)
Python
import streamlit as st
import pandas as pd
import plotly.express as px

st.title("Sales Dashboard & EDA Report")

# Upload data (assumes Superstore-style columns: Region, Sales, Product)
uploaded_file = st.file_uploader("Upload your sales CSV", type="csv")

if uploaded_file:
    df = pd.read_csv(uploaded_file)

    st.subheader("Data Overview")
    st.dataframe(df.head())

    st.subheader("Summary Statistics")
    st.write(df.describe())

    # Interactive filters
    category = st.selectbox("Select Category", df.columns)

    # Visualizations
    fig1 = px.histogram(df, x=category, title=f"Distribution of {category}")
    st.plotly_chart(fig1)

    fig2 = px.box(df, x="Region", y="Sales", title="Sales by Region")
    st.plotly_chart(fig2)

    st.subheader("Top Products")
    top_products = df.groupby("Product")["Sales"].sum().nlargest(10)
    st.bar_chart(top_products)
Run
Bash
streamlit run app.py
Key Takeaways:
Streamlit = fastest way to turn data scripts into interactive dashboards
Plotly = interactive charts (zoom, hover)
Great for EDA reports, stakeholder presentations
This completes the full Real-World Projects & Case Studies section — now you have four portfolio-ready projects!
11. Best Practices, Portfolio & Career Guidance
You’ve now learned the full technical stack — from Python basics to advanced ML. This final section focuses on how to stand out in the real world: writing production-ready code, building a strong portfolio, using Git & Kaggle effectively, and acing data science interviews in 2026.
11.1 Writing Clean & Reproducible Data Science Code
Clean, reproducible code is what separates hobbyists from professionals.
Core Principles (2026 Standard)
Follow PEP 8 + modern formatting tools
Use Black (auto-formatter) + isort (import sorter)
Bash
pip install black isort
black . && isort .
Use virtual environments (never install globally)
Bash
python -m venv env
source env/bin/activate
pip install -r requirements.txt
Always create requirements.txt
Bash
pip freeze > requirements.txt
Write reproducible notebooks (Jupyter)
Set random seeds everywhere
Python
import numpy as np
import random

np.random.seed(42)
random.seed(42)
Use nbdev or papermill for production notebooks
Prefer .py scripts for final pipelines
Structure projects professionally
text
my_project/
├── data/              # raw & processed data (never commit raw)
├── notebooks/         # exploratory .ipynb files
├── src/               # reusable .py modules
│   ├── data.py
│   ├── model.py
│   └── utils.py
├── models/            # saved models
├── reports/           # figures, dashboards
├── requirements.txt
├── README.md
└── main.py            # or run_pipeline.py
Document everything
Use docstrings (PEP 257)
Add README with project goal, setup instructions, results
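A PEP 257-style docstring on a small hypothetical helper (the function itself is just an illustration):

```python
def train_test_ratio(n_samples: int, test_size: float = 0.2) -> tuple:
    """Split a sample count into train and test counts.

    Args:
        n_samples: Total number of samples.
        test_size: Fraction reserved for testing (0 < test_size < 1).

    Returns:
        A (n_train, n_test) tuple.
    """
    n_test = int(n_samples * test_size)
    return n_samples - n_test, n_test


print(train_test_ratio(1000))  # (800, 200)
```

A one-line summary, argument descriptions, and the return value are usually enough; tools like Sphinx and IDE tooltips pick these up automatically.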
11.2 Building a Strong Data Science Portfolio
Your GitHub portfolio is your resume in 2026 — recruiters look here first.
Must-Have Projects (2026 recruiters love these)
End-to-end regression project (House Prices / Bike Sharing)
Imbalanced classification (Fraud Detection / Churn Prediction)
NLP project (Sentiment Analysis / Resume Parser)
Time series forecasting (Sales / Stock Price)
Interactive dashboard (Streamlit / Plotly)
Deep learning project (Image classification with transfer learning)
Portfolio Tips
Host 4–6 high-quality projects
Each repo should have:
Clean README (problem statement, approach, results, visuals)
requirements.txt
Jupyter notebook + .py pipeline
Visuals (charts, confusion matrix, feature importance)
Model performance metrics
Deploy 2–3 projects (Streamlit, Heroku, Render, Hugging Face Spaces)
Add blog posts (Medium / Hashnode) explaining your projects
Example README structure
Markdown
# House Price Prediction

## Problem
Predict house prices in California using regression models.

## Dataset
California Housing (sklearn)

## Approach
- EDA → Correlation analysis, outlier removal
- Preprocessing → Scaling, feature engineering
- Models → Linear Regression, Random Forest, XGBoost
- Best model → Random Forest (RMSE 0.47)

## Results
- R²: 0.81
- Feature importance: Median Income > House Age

## Deployment
Live demo: https://house-price-app.streamlit.app

## Tech Stack
Python, Pandas, Scikit-learn, Streamlit
11.3 Git, Kaggle & Resume Tips for Students & Professionals
Git & GitHub Workflow (2026 standard)
Create repo → git init
Work on feature branch: git checkout -b feature/eda
Commit often: git commit -m "Add EDA visualizations"
Push & create Pull Request
Use .gitignore (ignore data/, *.pkl, __pycache__/)
Add GitHub Actions for CI (lint, tests)
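The CI step above usually just runs a linter plus a test suite. As a sketch, a pytest-style test file (e.g. tests/test_utils.py; the `clean_column_names` helper is illustrative, not a real library function) can be as small as:

```python
# tests/test_utils.py — discovered and run by `pytest` in CI
# (function and file names here are illustrative)

def clean_column_names(columns):
    """Lower-case column names and replace spaces with underscores."""
    return [c.strip().lower().replace(" ", "_") for c in columns]

def test_clean_column_names():
    assert clean_column_names([" House Age ", "Median Income"]) == [
        "house_age",
        "median_income",
    ]
```

Even one or two tests like this give your GitHub Actions workflow something meaningful to verify on every push.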
Kaggle Tips
Participate in competitions → top 10% looks great
Create notebooks → aim for upvotes & medals
Fork good kernels → learn from top solutions
Build datasets → upload clean versions
Resume & LinkedIn Tips (2026)
One-page resume for freshers
Structure:
Projects (3–5) → title, tech stack, results (metrics!)
Skills → Python, SQL, Pandas, Scikit-learn, Git, AWS/GCP (basic)
Education + certifications (Coursera, Kaggle)
LinkedIn: Post weekly → project updates, Kaggle kernels, articles
Add badges: Kaggle Expert/Master, GitHub streak
11.4 Interview Preparation & Top Data Science Questions
Common Interview Stages (2026)
Resume screening + HR
Technical MCQ / coding test (HackerRank, LeetCode)
Live coding / take-home assignment
ML system design / case study
Behavioral + project deep-dive
Top 20 Data Science Interview Questions (2026)
Explain bias-variance tradeoff.
What is overfitting? How to prevent it?
Difference between L1 and L2 regularization?
Explain cross-validation. Why stratified?
How does Random Forest work? Why better than single tree?
What is gradient boosting? Difference from Random Forest?
Explain ROC-AUC vs Precision-Recall curve.
How to handle imbalanced datasets?
What is multicollinearity? How to detect & fix?
Explain PCA. When to use it?
Difference between bagging and boosting?
How does k-means clustering work?
What is a confusion matrix? Precision, Recall, F1?
Explain time series components (trend, seasonality).
What is stationarity? How to test it?
Difference between ARIMA and Prophet?
How does BERT work? (high-level)
Explain attention mechanism.
What is transfer learning? When to use it?
How would you deploy a model in production?
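Several of the questions above (confusion matrix, precision, recall, F1) are easy to rehearse by hand. A minimal pure-Python sketch, using made-up counts for a binary classifier:

```python
# Precision, recall and F1 from binary confusion-matrix counts (toy numbers)
def binary_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)          # of predicted positives, how many were right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Example: 80 true positives, 20 false positives, 10 false negatives
p, r, f1 = binary_metrics(tp=80, fp=20, fn=10, tn=890)
print(round(p, 2), round(r, 2), round(f1, 3))  # → 0.8 0.89 0.842
```

Being able to derive these on a whiteboard, not just call `sklearn.metrics`, is exactly what interviewers probe with question 13.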
Preparation Strategy (2026)
Practice LeetCode (medium SQL & Python)
Build 4–6 strong projects → explain end-to-end
Revise statistics & ML theory (StatQuest YouTube)
Mock interviews (Pramp, Interviewing.io)
Read “Ace the Data Science Interview” book
This completes the Best Practices, Portfolio & Career Guidance section.
12. Next Steps & Learning Roadmap
You’ve now completed a full, structured journey from Python basics → OOP → data manipulation → visualization → EDA → preprocessing → statistics → machine learning → advanced topics → real projects. This final section gives you a clear, realistic, and up-to-date (2026) roadmap to take your skills to the next level — whether your goal is jobs, research papers, freelancing, or startup building.
12.1 Advanced Topics (Deep Learning, Computer Vision, Big Data)
After mastering classical ML (Scikit-learn), these are the high-impact areas to learn next:
Deep Learning (Neural Networks & Transformers)
Frameworks: PyTorch (industry/research favorite in 2026) or TensorFlow/Keras
Key topics:
Neural network fundamentals (layers, activation, backpropagation)
CNNs (Convolutional Neural Networks) for images
RNNs / LSTMs / GRUs for sequences
Transformers (BERT, GPT-style models) → Hugging Face Transformers library
Best starting course: fast.ai “Practical Deep Learning for Coders” (free, project-based)
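The backpropagation fundamentals listed above can be seen on the smallest possible example: a single linear neuron trained by gradient descent on synthetic data. This is a NumPy-only sketch of the idea, not how you would write it in PyTorch or Keras:

```python
import numpy as np

np.random.seed(42)

# Synthetic data: y = 3x + small noise
X = np.random.rand(100, 1)
y = 3 * X + 0.01 * np.random.randn(100, 1)

w = np.zeros((1, 1))   # one weight, no bias, for simplicity
lr = 0.5               # learning rate

for _ in range(200):
    y_hat = X @ w                            # forward pass
    grad = 2 * X.T @ (y_hat - y) / len(X)    # gradient of MSE w.r.t. w
    w -= lr * grad                           # gradient step (backprop, one layer)

print(round(float(w[0, 0]), 2))  # converges to roughly 3.0, the true slope
```

A real network is this same loop repeated across many layers, with the chain rule carrying gradients backward; frameworks like PyTorch compute the `grad` line for you automatically.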
Computer Vision
Image classification, object detection, segmentation
Libraries: PyTorch + torchvision, Ultralytics YOLOv8, Hugging Face
Projects:
Custom image classifier (cats vs dogs)
Object detection on your own photos (YOLO)
Face recognition / emotion detection
Big Data & Scalability
Tools: PySpark (Spark with Python), Dask (parallel Pandas), Polars (fast DataFrame)
Cloud platforms: AWS (S3 + SageMaker), GCP (BigQuery + Vertex AI), Azure
Key skills:
Distributed computing
Handling terabyte-scale data
ETL pipelines (Airflow / Prefect)
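Before reaching for Spark or Dask, the core idea behind all of these tools — never load the whole dataset into memory — can be practiced with the standard library alone. A toy sketch of a chunked, generator-based aggregation (the in-memory `io.StringIO` stands in for a file far too large to read at once):

```python
import csv
import io
import itertools

def read_in_chunks(lines, chunk_size):
    """Yield lists of parsed CSV rows, chunk_size rows at a time."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(itertools.islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

# Toy "file": in practice this would be open("huge.csv") streamed from disk
raw = io.StringIO("city,sales\nDelhi,10\nPune,5\nDelhi,7\nPune,2\n")

totals = {}
for chunk in read_in_chunks(raw, chunk_size=2):   # only 2 rows in memory at once
    for row in chunk:
        totals[row["city"]] = totals.get(row["city"], 0) + int(row["sales"])

print(totals)  # {'Delhi': 17, 'Pune': 7}
```

PySpark, Dask, and Polars apply this same streaming/partitioning principle, just distributed across cores or machines.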
Learning Order Suggestion (2026)
Deep Learning basics (fast.ai or DeepLearning.AI Coursera)
Computer Vision (PyTorch + YOLO)
NLP Advanced (fine-tune BERT)
Big Data basics (PySpark or Polars)
MLOps / Deployment (MLflow, BentoML, Docker)
12.2 Career Paths & Job Opportunities in Data Science
Main Career Tracks in 2026 (with approximate global salary ranges)
| Role | Primary Skills Required | Typical Experience | India Salary (₹ LPA) | Global Salary (USD/year) | Best For |
| --- | --- | --- | --- | --- | --- |
| Data Analyst | SQL, Excel/Power BI, basic Python/Pandas | 0–3 years | 4–12 | $60k–$95k | Freshers & students |
| Data Scientist | Python, ML (sklearn), stats, SQL, visualization | 1–6 years | 10–28 | $100k–$170k | Most common path |
| Machine Learning Engineer | Python, ML deployment, MLOps, Docker, cloud | 3–8 years | 18–45 | $130k–$220k | Professionals |
| MLOps Engineer | Docker, Kubernetes, MLflow, CI/CD, cloud | 3–7 years | 20–50 | $140k–$240k | High demand in 2026 |
| AI Research Scientist | Deep learning, PyTorch, research papers | 3–10+ years / PhD | 25–70+ | $150k–$350k+ | Researchers & PhDs |
| Data Engineer | SQL, Spark, Airflow, cloud pipelines | 3–8 years | 12–35 | $110k–$190k | Infrastructure focused |
How to Get Hired in 2026
Build 4–6 strong projects (GitHub + deployed versions)
Participate in Kaggle competitions (top 10% = strong signal)
Earn certifications: Google Data Analytics, IBM Data Science, DeepLearning.AI
Contribute to open source (Hugging Face, scikit-learn, fastai)
Network: LinkedIn, Twitter/X (post weekly), Kaggle discussions
Prepare for interviews: LeetCode (SQL + Python), system design cases
Final Motivation
Data science is one of the most rewarding careers in 2026 — high impact, high salary, and endless learning. Code every day. Build real things. Share your work. Stay curious.
You’ve completed the entire Master Data Science with Python tutorial — from setup to advanced topics and career guidance. You are now equipped to start real projects, contribute to open source, and pursue exciting opportunities.