Master Data Science with Python – Complete Hands-on Tutorial for Students, Researchers & Professionals (2026 Edition)


TABLE OF CONTENTS

From Zero to Job-Ready | Real Projects | Pandas, NumPy, Visualization, Machine Learning & Deployment

  1. Introduction to Data Science & Python Setup
     1.1 What is Data Science and Why Python in 2026?
     1.2 Data Science Career Paths for Students, Researchers & Professionals
     1.3 Complete Python Environment Setup (Anaconda, Jupyter, VS Code)
     1.4 Essential Libraries Overview (NumPy, Pandas, Matplotlib, Scikit-learn)

  2. NumPy – Foundation of Numerical Computing
     2.1 NumPy Arrays vs Python Lists
     2.2 Array Operations, Broadcasting & Vectorization
     2.3 Indexing, Slicing & Advanced Array Manipulation
     2.4 Mathematical & Statistical Functions

  3. Pandas – Data Manipulation & Analysis
     3.1 Series and DataFrame – Core Data Structures
     3.2 Data Loading (CSV, Excel, JSON, SQL)
     3.3 Data Cleaning, Filtering & Transformation
     3.4 Grouping, Aggregation & Pivot Tables
     3.5 Handling Missing Values & Outliers

  4. Data Visualization with Matplotlib & Seaborn
     4.1 Matplotlib Basics – Line, Bar, Scatter & Histogram
     4.2 Seaborn for Statistical Visualization
     4.3 Advanced Plots – Heatmap, Pairplot, Boxplot
     4.4 Creating Publication-Ready Visualizations

  5. Exploratory Data Analysis (EDA)
     5.1 Understanding Data Distribution & Summary Statistics
     5.2 Univariate, Bivariate & Multivariate Analysis
     5.3 Correlation Analysis & Feature Relationships
     5.4 Real-World EDA Case Study

  6. Data Preprocessing & Feature Engineering
     6.1 Data Scaling & Normalization
     6.2 Encoding Categorical Variables
     6.3 Feature Selection Techniques
     6.4 Handling Imbalanced Datasets

  7. Statistics & Probability for Data Science
     7.1 Descriptive vs Inferential Statistics
     7.2 Hypothesis Testing & p-value
     7.3 Probability Distributions
     7.4 Correlation, Regression & Confidence Intervals

  8. Machine Learning with Scikit-learn
     8.1 Supervised Learning – Regression & Classification
     8.2 Model Training, Evaluation & Hyperparameter Tuning
     8.3 Cross-Validation & Model Selection
     8.4 Unsupervised Learning – Clustering & Dimensionality Reduction

  9. Advanced Data Science Topics
     9.1 Time Series Analysis & Forecasting
     9.2 Natural Language Processing (NLP) Basics
     9.3 Introduction to Deep Learning with TensorFlow/Keras
     9.4 Model Deployment & MLOps Basics

  10. Real-World Projects & Case Studies
     10.1 Project 1: House Price Prediction (Regression)
     10.2 Project 2: Customer Churn Prediction (Classification)
     10.3 Project 3: Sentiment Analysis on Reviews (NLP)
     10.4 Project 4: Sales Dashboard & EDA Report

  11. Best Practices, Portfolio & Career Guidance
     11.1 Writing Clean & Reproducible Data Science Code
     11.2 Building a Strong Data Science Portfolio
     11.3 Git, Kaggle & Resume Tips for Students & Professionals
     11.4 Interview Preparation & Top Data Science Questions

  12. Next Steps & Learning Roadmap
     12.1 Advanced Topics (Deep Learning, Computer Vision, Big Data)
     12.2 Recommended Books, Courses & Resources (2026 Updated)
     12.3 Career Paths & Job Opportunities in Data Science

1. Introduction to Data Science & Python Setup

Welcome to your journey into Data Science with Python! This section lays the foundation — understanding what data science really is in 2026, why Python remains the #1 choice, career opportunities, and how to set up a powerful, professional environment.

1.1 What is Data Science and Why Python in 2026?

Data Science is the field of extracting meaningful insights and knowledge from structured and unstructured data using scientific methods, processes, algorithms, and systems.

In 2026, data science combines:

  • Statistics & mathematics

  • Programming & computer science

  • Domain expertise

  • Machine learning & AI

  • Data visualization & storytelling

Core activities in modern data science:

  • Collecting & cleaning data

  • Exploratory Data Analysis (EDA)

  • Building predictive models

  • Deploying models into production

  • Communicating insights (dashboards, reports)

Why is Python still #1 in 2026?

  • Extremely rich ecosystem: NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch, Hugging Face, Polars, Streamlit, FastAPI

  • Beginner-friendly syntax + powerful for production

  • Largest community & job market demand (Stack Overflow, LinkedIn, IEEE reports)

  • Used by Google, Meta, Netflix, NASA, ISRO, startups & research labs

  • Excellent for automation, web scraping, APIs, cloud (AWS, GCP, Azure)

  • Fast prototyping + scalable deployment

Python vs R vs Julia vs others (2026 view):

  • Python → general-purpose, industry standard, huge ecosystem

  • R → strong in statistics & academia (but declining in industry)

  • Julia → fast computation (but small ecosystem & adoption)

1.2 Data Science Career Paths for Students, Researchers & Professionals

Career Roles in 2026 (with approximate global salary ranges – India & International)

| Role | Typical Responsibilities | Best For | India Salary (₹ LPA) | Global Salary (USD) |
|---|---|---|---|---|
| Data Analyst | SQL, Excel, Power BI, basic Python, dashboards | Students & freshers | 4–12 | $60k–$90k |
| Data Scientist | ML models, EDA, feature engineering, deployment | Students + Professionals | 10–25 | $100k–$160k |
| Machine Learning Engineer | Production ML, MLOps, pipelines, cloud | Professionals & researchers | 15–40 | $130k–$220k |
| AI Research Scientist | Deep learning, papers, innovation | Researchers & PhD holders | 18–50+ | $150k–$300k+ |
| Data Engineer | ETL pipelines, big data (Spark, Airflow) | Professionals | 12–30 | $110k–$180k |
| Business Intelligence Analyst | Dashboards, KPIs, stakeholder communication | Freshers & mid-level | 6–15 | $70k–$110k |

Skills in demand (2026):

  • Python + SQL (must-have)

  • Cloud (AWS/GCP/Azure)

  • Git & GitHub

  • Docker & FastAPI

  • ML deployment (MLflow, BentoML, Streamlit)

  • Communication & storytelling

1.3 Complete Python Environment Setup (Anaconda, Jupyter, VS Code)

Recommended Setup for Data Science (2026 standard):

Option 1 – Anaconda (easiest for beginners & researchers)

  1. Download Anaconda: https://www.anaconda.com/download

  2. Install → includes Python, Jupyter, Spyder, NumPy, Pandas, Matplotlib, Scikit-learn, etc.

  3. Open Anaconda Navigator → launch Jupyter Notebook or JupyterLab

Option 2 – Miniconda + VS Code (lightweight & professional)

  1. Install Miniconda: https://docs.conda.io/en/latest/miniconda.html

  2. Create environment:

    Bash

    conda create -n datascience python=3.11
    conda activate datascience
    conda install jupyter numpy pandas matplotlib seaborn scikit-learn
    pip install jupyterlab

  3. Install VS Code: https://code.visualstudio.com

  4. Install extensions: Python, Jupyter, Pylance, Black Formatter, GitLens

Recommended VS Code Settings (settings.json):

JSON

{
  "python.defaultInterpreterPath": "~/.conda/envs/datascience/bin/python",
  "jupyter.alwaysTrustNotebooks": true,
  "editor.formatOnSave": true,
  "python.formatting.provider": "black"
}

Quick start JupyterLab:

Bash

conda activate datascience
jupyter lab

1.4 Essential Libraries Overview (NumPy, Pandas, Matplotlib, Scikit-learn)

NumPy – Numerical foundation

  • Fast arrays & matrices

  • Vectorized operations (no loops)

  • Broadcasting, linear algebra

Pandas – Data wrangling & analysis

  • DataFrame (Excel-like table)

  • Read/write CSV, Excel, SQL, JSON

  • Filtering, grouping, merging

Matplotlib + Seaborn – Visualization

  • Matplotlib: base plotting library

  • Seaborn: beautiful statistical plots on top of Matplotlib

Scikit-learn – Machine Learning

  • Preprocessing, models (regression, classification, clustering)

  • Model evaluation, pipelines, grid search

Quick import cheat sheet

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Mini Hello Data Science Code (run in Jupyter)

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns   # needed for sns.scatterplot below

# Create sample data
data = pd.DataFrame({
    'age': np.random.randint(20, 60, 100),
    'salary': np.random.normal(80000, 20000, 100)
})

# Quick EDA
print(data.describe())
sns.scatterplot(x='age', y='salary', data=data)
plt.title("Age vs Salary")
plt.show()

This completes the full Introduction to Data Science & Python Setup section — your perfect starting point for the entire Data Science with Python tutorial!

2. NumPy – Foundation of Numerical Computing

NumPy (Numerical Python) is the most important library for numerical and scientific computing in Python. Almost every data science library (Pandas, Scikit-learn, Matplotlib, TensorFlow, PyTorch, etc.) is built on top of NumPy.

Why NumPy is essential in 2026:

  • Extremely fast (written in C, vectorized operations)

  • Memory-efficient multi-dimensional arrays

  • Broadcasting (no loops needed for many operations)

  • Basis for all modern data science & machine learning

Install NumPy (if not using Anaconda)

Bash

pip install numpy

Import convention (standard in data science):

Python

import numpy as np

2.1 NumPy Arrays vs Python Lists

Python lists are flexible but slow for numerical work.

NumPy arrays (ndarray) are homogeneous, fixed-type, multi-dimensional arrays optimized for math.

| Feature | Python List | NumPy Array (ndarray) | Winner |
|---|---|---|---|
| Data types | Mixed (int, str, float, etc.) | Homogeneous (all same type) | NumPy |
| Speed (math operations) | Slow (loops in Python) | Very fast (vectorized, C-level) | NumPy |
| Memory usage | High (objects + pointers) | Low (contiguous memory block) | NumPy |
| Multi-dimensional support | Manual (list of lists) | Native (ndarray with shape) | NumPy |
| Broadcasting | Not supported | Automatic (shape rules) | NumPy |
| Mathematical functions | Manual or loop | Built-in (np.sum, np.mean, etc.) | NumPy |

Quick comparison example

Python

# Python list (slow)
lst = list(range(1000000))
%timeit [x**2 for x in lst]   # ~100–150 ms

# NumPy array (fast)
arr = np.arange(1000000)
%timeit arr**2                # ~1–5 ms

2.2 Array Operations, Broadcasting & Vectorization

Vectorization = performing operations on entire arrays without explicit loops.

Basic array creation

Python

import numpy as np

a = np.array([1, 2, 3, 4])        # 1D array
b = np.array([[1, 2], [3, 4]])    # 2D array
zeros = np.zeros((3, 4))          # 3×4 array of zeros
ones = np.ones(5)                 # [1. 1. 1. 1. 1.]
arange = np.arange(0, 10, 2)      # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5)   # 5 evenly spaced points
rand = np.random.rand(3, 2)       # random values in [0, 1)

Vectorized operations

Python

a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])

print(a + b)        # [11 22 33 44]
print(a * 2)        # [20 40 60 80]
print(a ** 2)       # [100 400 900 1600]
print(np.sqrt(a))   # square root of each element

Broadcasting – automatic shape alignment

Python

a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
b = np.array([10, 20, 30])             # shape (3,)
print(a + b)   # adds b to each row
# [[11 22 33]
#  [14 25 36]]

c = np.array([[100], [200]])           # shape (2, 1)
print(a + c)   # adds c to each column

Rule of thumb: Broadcasting works when dimensions are compatible (equal or one is 1).
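That rule can be checked directly, including the ValueError NumPy raises when shapes are incompatible (the arrays below are just illustrative ones-filled examples):

```python
import numpy as np

a = np.ones((2, 3))
b = np.ones((3,))
print((a + b).shape)   # trailing dimensions match (3 == 3)

c = np.ones((2, 1))
print((a + c).shape)   # the size-1 dimension stretches to 3

d = np.ones((2,))
try:
    a + d              # trailing dims are 3 vs 2: incompatible
except ValueError as err:
    print("Broadcast error:", err)
```

Both successful cases produce shape (2, 3); the last pairing fails because neither trailing dimension is 1 and they are not equal.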

2.3 Indexing, Slicing & Advanced Array Manipulation

Basic indexing & slicing

Python

arr = np.array([10, 20, 30, 40, 50])
print(arr[0])      # 10
print(arr[-1])     # 50 (last element)
print(arr[1:4])    # [20 30 40]
print(arr[::2])    # [10 30 50] (every second)
print(arr[::-1])   # [50 40 30 20 10] (reverse)

2D array indexing

Python

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(matrix[0, 2])    # 3
print(matrix[:, 1])    # [2 5 8] (second column)
print(matrix[1:, :2])  # [[4 5]
                       #  [7 8]] (rows 1–2, columns 0–1)

Boolean indexing (very powerful)

Python

arr = np.array([10, 25, 7, 40, 15])
print(arr[arr > 20])   # [25 40]

Advanced manipulation

Python

# Reshape
a = np.arange(12)
m = a.reshape(3, 4)            # 3×4 matrix
print(m)

# Flatten / ravel
print(m.ravel())               # back to 1D

# Transpose
print(m.T)                     # rows ↔ columns

# Concatenate & stack (arrays must have compatible shapes)
b = np.arange(12, 24)
print(np.concatenate([a, b]))  # 1D join → shape (24,)
print(np.vstack([a, b]))       # vertical stack → shape (2, 12)
print(np.hstack([a, b]))       # horizontal stack → shape (24,)

2.4 Mathematical & Statistical Functions

NumPy provides fast, vectorized versions of almost all math operations.

Basic math

Python

a = np.array([1, 4, 9, 16])
print(np.sqrt(a))              # [1. 2. 3. 4.]
print(np.exp(a))               # exponential
print(np.log(a))               # natural log
print(np.sin(np.deg2rad(30)))  # sin(30°) ≈ 0.5

Statistical functions

Python

data = np.random.randn(1000)        # 1000 random normal values
print(np.mean(data))                # ≈ 0
print(np.median(data))
print(np.std(data))                 # standard deviation
print(np.var(data))                 # variance
print(np.min(data), np.max(data))
print(np.percentile(data, 25))      # 25th percentile

Axis-wise operations (very important)

Python

matrix = np.random.randint(1, 100, size=(4, 5))
print(matrix.mean(axis=0))   # mean of each column
print(matrix.sum(axis=1))    # sum of each row
print(matrix.max(axis=0))    # max per column

Mini Summary Project – Basic Data Analysis with NumPy

Python

import numpy as np

# Simulate student marks
marks = np.random.randint(40, 100, size=50)

print("Average marks:", np.mean(marks))
print("Highest marks:", np.max(marks))
print("Lowest marks:", np.min(marks))
print("90th percentile:", np.percentile(marks, 90))

# Students above 80
above_80 = marks[marks >= 80]
print(f"{len(above_80)} students scored 80+")

This completes the full NumPy – Foundation of Numerical Computing section — the true backbone of all data science in Python!

3. Pandas – Data Manipulation & Analysis

Pandas is the most powerful and widely used Python library for data wrangling, cleaning, exploration, and analysis. It is built on top of NumPy and provides high-level data structures (Series and DataFrame) that make working with tabular/structured data feel like using Excel or SQL — but much more powerful.

Install Pandas (if not using Anaconda)

Bash

pip install pandas

Standard import (always use this)

Python

import pandas as pd

3.1 Series and DataFrame – Core Data Structures

Series A one-dimensional labeled array (like a column in Excel or a vector with labels).

Python

# Create Series from list
s1 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s1)
# a    10
# b    20
# c    30
# d    40
# dtype: int64

# Access by label
print(s1['b'])   # 20

# From dictionary
s2 = pd.Series({'Math': 85, 'Science': 92, 'English': 78})
print(s2['Science'])   # 92

DataFrame A two-dimensional labeled data structure (like a spreadsheet or SQL table) — the heart of Pandas.

Python

# Create DataFrame from dictionary
data = {
    'Name': ['Anshuman', 'Priya', 'Rahul', 'Sneha'],
    'Age': [25, 23, 24, 22],
    'City': ['Ranchi', 'Delhi', 'Patna', 'Kolkata'],
    'Marks': [92, 88, 85, 90]
}
df = pd.DataFrame(data)
print(df)
#        Name  Age     City  Marks
# 0  Anshuman   25   Ranchi     92
# 1     Priya   23    Delhi     88
# 2     Rahul   24    Patna     85
# 3     Sneha   22  Kolkata     90

# Basic inspection
print(df.head(2))      # first 2 rows
print(df.info())       # data types, non-null count
print(df.describe())   # summary statistics
print(df.shape)        # (rows, columns) → (4, 4)

Quick access

Python

df['Name']              # Series – one column
df[['Name', 'Marks']]   # DataFrame – multiple columns
df.iloc[0]              # first row (position-based)
df.loc[0, 'Name']       # label-based access

3.2 Data Loading (CSV, Excel, JSON, SQL)

Pandas makes reading data from almost any source effortless.

CSV

Python

df = pd.read_csv("sales_data.csv")
# Options: skiprows=2, usecols=['date', 'sales'], dtype={'sales': float}

Excel

Python

df = pd.read_excel("report.xlsx", sheet_name="Sales", skiprows=1)
# Requires: pip install openpyxl (for .xlsx) or xlrd (legacy .xls)

JSON

Python

df = pd.read_json("data.json")
# or pd.json_normalize() for nested JSON

SQL (with database connection)

Python

import sqlalchemy as sa

engine = sa.create_engine("sqlite:///mydb.db")
df = pd.read_sql("SELECT * FROM customers", engine)
# or pd.read_sql_query(query, engine)

Quick save

Python

df.to_csv("cleaned_data.csv", index=False)
df.to_excel("report.xlsx", index=False)
df.to_json("data.json", orient="records")

3.3 Data Cleaning, Filtering & Transformation

Real data is messy — Pandas excels at cleaning it.

Basic cleaning

Python

df = df.drop_duplicates()                          # remove duplicate rows
df = df.dropna(subset=['age', 'city'])             # drop rows with missing values
df['age'] = df['age'].fillna(df['age'].median())   # fill missing with median
df['salary'] = df['salary'].astype(float)          # change data type
df['date'] = pd.to_datetime(df['date'])            # convert to datetime

Filtering

Python

high_salary = df[df['salary'] > 80000]
young_delhi = df[(df['age'] < 30) & (df['city'] == 'Delhi')]
top_10 = df.nlargest(10, 'marks')

Transformations

Python

import numpy as np   # needed for np.inf below

df['tax'] = df['salary'] * 0.18                    # new column
df['full_name'] = df['first'] + " " + df['last']
df['salary_category'] = pd.cut(df['salary'],
                               bins=[0, 50000, 100000, np.inf],
                               labels=['Low', 'Medium', 'High'])

3.4 Grouping, Aggregation & Pivot Tables

Groupby – most powerful feature

Python

# Group by city and calculate mean salary
df.groupby('city')['salary'].mean()

# Multiple aggregations
df.groupby('city').agg({
    'salary': ['mean', 'max', 'count'],
    'age': 'median'
})

Pivot Tables

Python

pd.pivot_table(df, values='salary', index='city',
               columns='department', aggfunc='mean', fill_value=0)

Crosstab

Python

pd.crosstab(df['city'], df['gender'], margins=True)

3.5 Handling Missing Values & Outliers

Missing values

Python

# Check missing
df.isnull().sum()

# Fill missing (assignment avoids pandas chained-assignment warnings)
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna('Unknown')

# Drop missing
df = df.dropna(subset=['salary'])

Detect & handle outliers

Python

# Using the IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Remove outliers
df_clean = df[(df['salary'] >= lower) & (df['salary'] <= upper)]

# Or cap them
df['salary'] = df['salary'].clip(lower=lower, upper=upper)

Mini Summary Project – Quick EDA on Sample Dataset

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load sample (or your own CSV)
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

# Quick look
print(df.head())
print(df.info())
print(df.describe())

# Missing values
print(df.isnull().sum())

# Group analysis
print(df.groupby('day')['total_bill'].mean())

# Visualization
sns.boxplot(x='day', y='total_bill', data=df)
plt.title("Total Bill by Day")
plt.show()

This completes the full Pandas – Data Manipulation & Analysis section — the most important tool for real data science work in Python!

4. Data Visualization with Matplotlib & Seaborn

Data visualization is one of the most powerful ways to explore data, communicate insights, and tell stories. Matplotlib is the foundational plotting library in Python (highly customizable but requires more code). Seaborn is built on top of Matplotlib — it provides beautiful, high-level statistical plots with minimal code.

Install (if not using Anaconda)

Bash

pip install matplotlib seaborn

Standard imports (always use these)

Python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Set a clean default style (highly recommended)
sns.set_style("whitegrid")                 # white background with grid
plt.rcParams['figure.figsize'] = (10, 6)   # default figure size

4.1 Matplotlib Basics – Line, Bar, Scatter & Histogram

Line Plot

Python

x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

plt.plot(x, y1, label='sin(x)', color='blue', linewidth=2)
plt.plot(x, y2, label='cos(x)', color='red', linestyle='--')
plt.title("Sine and Cosine Waves")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.grid(True)
plt.show()

Bar Plot

Python

categories = ['Python', 'R', 'SQL', 'Excel', 'Tableau']
usage = [85, 45, 70, 60, 55]

plt.bar(categories, usage, color='skyblue')
plt.title("Programming Language Popularity (2026)")
plt.xlabel("Language")
plt.ylabel("Usage (%)")
plt.xticks(rotation=45)
plt.show()

Scatter Plot

Python

np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5

plt.scatter(x, y, color='purple', alpha=0.6, s=80, edgecolor='black')
plt.title("Scatter Plot with Correlation")
plt.xlabel("Feature X")
plt.ylabel("Feature Y")
plt.show()

Histogram

Python

data = np.random.normal(loc=50, scale=15, size=1000)   # normal distribution

plt.hist(data, bins=30, color='teal', edgecolor='black', alpha=0.7)
plt.title("Distribution of Exam Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.axvline(data.mean(), color='red', linestyle='--',
            label=f'Mean = {data.mean():.1f}')
plt.legend()
plt.show()

4.2 Seaborn for Statistical Visualization

Seaborn makes statistical plots beautiful and easy.

Line Plot with confidence interval

Python

tips = sns.load_dataset("tips")   # built-in dataset
sns.lineplot(x="total_bill", y="tip", data=tips, hue="time", style="time")
plt.title("Tip vs Total Bill by Time")
plt.show()

Count Plot

Python

sns.countplot(x="day", data=tips, hue="sex", palette="Set2")
plt.title("Number of Customers by Day and Gender")
plt.show()

Pair Plot (exploratory)

Python

sns.pairplot(tips, hue="smoker", diag_kind="kde")
plt.suptitle("Pair Plot of Tips Dataset", y=1.02)
plt.show()

4.3 Advanced Plots – Heatmap, Pairplot, Boxplot

Heatmap (Correlation Matrix)

Python

# Correlation matrix
corr = tips.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Tips Dataset")
plt.show()

Boxplot (distribution & outliers)

Python

sns.boxplot(x="day", y="total_bill", hue="smoker", data=tips, palette="Set3")
plt.title("Total Bill Distribution by Day & Smoking Status")
plt.show()

Violin Plot (distribution + density)

Python

sns.violinplot(x="day", y="tip", hue="sex", data=tips, split=True, palette="muted")
plt.title("Tip Distribution by Day and Gender")
plt.show()

4.4 Creating Publication-Ready Visualizations

Tips to make plots look professional and publication-quality:

Best Practices Code Template

Python

plt.figure(figsize=(10, 6), dpi=120)       # high resolution
sns.set_context("paper", font_scale=1.3)   # publication style

# Your plot here
sns.boxplot(x="day", y="total_bill", data=tips, palette="pastel")

plt.title("Total Bill Distribution by Day", fontsize=16, fontweight='bold')
plt.xlabel("Day of the Week", fontsize=14)
plt.ylabel("Total Bill (USD)", fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)

# Save high-quality image
plt.tight_layout()
plt.savefig("publication_plot.png", dpi=300, bbox_inches='tight')
plt.show()

Additional Tips for Publication/Report Quality:

  • Use sns.set_style("whitegrid") or "ticks" for clean look

  • Choose color palettes: "viridis", "magma", "coolwarm", "Set2", "pastel"

  • Add annotations: plt.annotate(), sns.despine()

  • Use fig, ax = plt.subplots() for multi-plot figures

  • Export as PNG (300+ dpi) or SVG for journals

Mini Summary Project – Full EDA Visualization

Python

# Load built-in dataset
df = sns.load_dataset("penguins")

# 1. Pair plot
sns.pairplot(df, hue="species", diag_kind="kde")
plt.suptitle("Penguin Species Comparison", y=1.02)
plt.show()

# 2. Boxplot of bill length by species
sns.boxplot(x="species", y="bill_length_mm", data=df, palette="Set2")
plt.title("Bill Length Distribution by Penguin Species")
plt.show()

# 3. Heatmap of correlations
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix – Penguin Dataset")
plt.show()

This completes the full Data Visualization with Matplotlib & Seaborn section — now you can create beautiful, insightful, and publication-ready visualizations!

5. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the most critical step in any data science project. It helps you understand the data, discover patterns, detect anomalies, find relationships, and form hypotheses — all before building any model.

Why EDA is important in 2026:

  • Prevents garbage-in-garbage-out (bad data → bad model)

  • Saves time & money by identifying issues early

  • Guides feature engineering and model selection

  • Creates compelling stories for stakeholders/reports

Core tools for EDA:

  • Pandas (data manipulation)

  • NumPy (numerical operations)

  • Matplotlib + Seaborn (visualization)

  • Missingno, Sweetviz, Pandas Profiling (now ydata-profiling) for automated EDA reports

5.1 Understanding Data Distribution & Summary Statistics

First step: Load & inspect data

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset (or use your own CSV)
df = sns.load_dataset("titanic")

# Quick overview
print(df.head())
print(df.info())
print(df.shape)                        # (rows, columns)
print(df.describe())                   # numerical summary
print(df.describe(include='object'))   # categorical summary

Key summary statistics

  • Mean / Median / Mode → central tendency

  • Standard deviation / IQR → spread

  • Min / Max / Percentiles → range & outliers

  • Skewness & Kurtosis → shape of distribution
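As a quick sketch, all of these statistics are available directly on a pandas Series; the right-skewed lognormal "income" sample below is synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (stand-in for something like income)
rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=1000))

print("Mean:", s.mean(), "| Median:", s.median())     # mean > median for right skew
print("Std:", s.std())
print("IQR:", s.quantile(0.75) - s.quantile(0.25))
print("Skewness:", s.skew())       # > 0 means right-skewed (long right tail)
print("Kurtosis:", s.kurtosis())   # excess kurtosis; 0 for a normal distribution
```

For right-skewed data like this, the mean sits above the median, which is exactly the pattern the summary statistics should reveal before you pick an imputation or transformation strategy.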

Visualize distribution (histogram + KDE)

Python

plt.figure(figsize=(10, 6))
sns.histplot(df['age'].dropna(), kde=True, bins=30, color='teal')
plt.title("Age Distribution of Titanic Passengers")
plt.xlabel("Age")
plt.ylabel("Count")
plt.axvline(df['age'].mean(), color='red', linestyle='--',
            label=f'Mean = {df["age"].mean():.1f}')
plt.axvline(df['age'].median(), color='green', linestyle='--',
            label=f'Median = {df["age"].median():.1f}')
plt.legend()
plt.show()

Check skewness

Python

print("Skewness of Age:", df['age'].skew()) # positive → right-skewed

5.2 Univariate, Bivariate & Multivariate Analysis

Univariate Analysis – Study one variable at a time

Python

# Categorical
sns.countplot(x='class', data=df, palette='Set2')
plt.title("Passenger Class Distribution")
plt.show()

# Numerical
sns.boxplot(x='fare', data=df, color='lightblue')
plt.title("Fare Distribution (with outliers)")
plt.show()

Bivariate Analysis – Relationship between two variables

Python

# Numerical vs Numerical
sns.scatterplot(x='age', y='fare', hue='survived', data=df, palette='coolwarm')
plt.title("Age vs Fare by Survival")
plt.show()

# Categorical vs Numerical
sns.boxplot(x='class', y='fare', hue='sex', data=df)
plt.title("Fare by Passenger Class & Gender")
plt.show()

# Categorical vs Categorical
pd.crosstab(df['class'], df['survived'], normalize='index').plot(kind='bar', stacked=True)
plt.title("Survival Rate by Passenger Class")
plt.show()

Multivariate Analysis – More than two variables

Python

# Pair plot (best for a quick multivariate look)
sns.pairplot(df[['age', 'fare', 'survived']], hue='survived', diag_kind='kde')
plt.suptitle("Multivariate Relationships – Titanic Dataset", y=1.02)
plt.show()

5.3 Correlation Analysis & Feature Relationships

Correlation Matrix (Pearson)

Python

# Select only numeric columns
numeric_df = df.select_dtypes(include=['number'])
corr = numeric_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f',
            linewidths=0.5, vmin=-1, vmax=1)
plt.title("Correlation Matrix – Titanic Features")
plt.show()

Interpretation tips:

  • Values near +1 → strong positive correlation

  • Values near -1 → strong negative correlation

  • Values near 0 → no linear relationship

  • Correlation ≠ causation!
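If you also want a p-value alongside the correlation coefficient, scipy.stats provides both the Pearson and Spearman tests. A minimal sketch on synthetic data (the x and y variables here are illustrative, not Titanic columns):

```python
import numpy as np
from scipy import stats

# Two synthetic, linearly related variables with noise
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

r, p = stats.pearsonr(x, y)         # linear correlation + significance
print(f"Pearson r = {r:.2f}, p-value = {p:.3g}")

rho, p_s = stats.spearmanr(x, y)    # rank-based, robust to monotone non-linearity
print(f"Spearman rho = {rho:.2f}, p-value = {p_s:.3g}")
```

A small p-value only says the correlation is unlikely to be zero by chance; it still says nothing about causation.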

Advanced: Spearman / Kendall correlation (good for non-linear or ordinal data)

Python

corr_spearman = numeric_df.corr(method='spearman')
sns.heatmap(corr_spearman, annot=True, cmap='viridis')
plt.title("Spearman Correlation")
plt.show()

5.4 Real-World EDA Case Study

Dataset: Titanic (classic but very educational)

Complete EDA workflow (copy-paste ready)

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("titanic")

# 1. Overview
print("Shape:", df.shape)
print("\nMissing Values:\n", df.isnull().sum())
print("\nData Types:\n", df.dtypes)

# 2. Univariate
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['age'].dropna(), kde=True, color='teal')
plt.title("Age Distribution")
plt.subplot(1, 2, 2)
sns.countplot(x='class', data=df, palette='Set2')
plt.title("Passenger Class Distribution")
plt.tight_layout()
plt.show()

# 3. Bivariate
plt.figure(figsize=(10, 6))
sns.boxplot(x='class', y='fare', hue='survived', data=df)
plt.title("Fare by Class & Survival")
plt.show()

# 4. Correlation
numeric = df.select_dtypes(include=['number'])
corr = numeric.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Heatmap")
plt.show()

# 5. Survival rate by gender & class
pd.crosstab([df['sex'], df['class']], df['survived'],
            normalize='index').plot(kind='bar', stacked=True)
plt.title("Survival Rate by Gender & Class")
plt.show()

print("Key Insights:")
print("- Females had a much higher survival rate than males")
print("- Higher class (1st) had better survival and higher fares")
print("- Age has missing values – needs imputation")
print("- Fare is highly skewed – consider log transformation")

Key Insights from Titanic EDA (typical findings):

  • Women & children had higher survival rates

  • 1st class passengers survived more

  • Fare is a strong indicator of class & survival

  • Age has missing values (esp. in cabin) → imputation needed

  • Many categorical variables → encoding required

This completes the full Exploratory Data Analysis (EDA) section — now you know how to deeply understand any dataset before modeling!

6. Data Preprocessing & Feature Engineering

Data preprocessing and feature engineering are the most time-consuming and most important steps in any data science project. Good preprocessing turns raw, messy data into clean, model-ready input. Feature engineering creates new powerful features that can dramatically improve model performance.

Goal: Prepare data so machine learning models can learn effectively and generalize well.

6.1 Data Scaling & Normalization

Many machine learning algorithms (especially distance-based ones like KNN, SVM, K-Means, Neural Networks) perform poorly if features are on different scales.

Common Scaling Techniques:

| Technique | Formula / Method | When to Use | Range / Output | Affected by Outliers? |
|---|---|---|---|---|
| Min-Max Scaling | (X - min) / (max - min) | Neural networks, image data, bounded data | [0, 1] or custom range | Yes |
| Standardization | (X - mean) / std | Most algorithms (SVM, logistic regression, PCA) | Mean ≈ 0, Std ≈ 1 | Yes |
| Robust Scaling | (X - median) / IQR | Data with outliers | Centered around median | No |
| Log Transformation | log(1 + X) or Box-Cox | Highly skewed data (income, time, counts) | Reduces skewness | Reduces impact |

Code Examples

Python

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

df = pd.DataFrame({
    'age': [25, 30, 45, 22, 60],
    'salary': [30000, 50000, 120000, 25000, 200000],
    'experience': [1, 3, 10, 0, 20]
})

# 1. Min-Max Scaling (0 to 1)
scaler_minmax = MinMaxScaler()
df[['age_minmax', 'salary_minmax']] = scaler_minmax.fit_transform(df[['age', 'salary']])

# 2. Standardization (mean=0, std=1)
scaler_std = StandardScaler()
df[['age_std', 'salary_std']] = scaler_std.fit_transform(df[['age', 'salary']])

# 3. Robust Scaling (handles outliers)
scaler_robust = RobustScaler()
df[['salary_robust']] = scaler_robust.fit_transform(df[['salary']])

# 4. Log Transformation (for skewed data)
df['salary_log'] = np.log1p(df['salary'])   # log(1 + x) to handle 0

print(df)

Quick rule of thumb (2026):

  • Use StandardScaler for most ML models (default choice)

  • Use MinMaxScaler for neural networks or when you need [0,1] range

  • Use RobustScaler if data has outliers

  • Apply log or sqrt for highly right-skewed features (income, time, counts)

6.2 Encoding Categorical Variables

Machine learning models require numerical input — so we convert categories to numbers.

Common Encoding Techniques:

  • Label Encoding: sklearn.preprocessing.LabelEncoder; for ordinal categories (low < medium < high); pros: simple, fast; cons: implies an order (bad for nominal data)

  • One-Hot Encoding: pd.get_dummies / OneHotEncoder; for nominal categories (colors, cities); pros: no order assumption; cons: high dimensionality (curse of dimensionality)

  • Target / Mean Encoding: category_encoders.TargetEncoder; for high-cardinality nominal features (many unique values); pros: captures the target relationship; cons: risk of data leakage

  • Frequency / Count Encoding: manual or category_encoders.CountEncoder; for high cardinality when target leakage must be avoided; pros: simple; cons: carries no target information

  • Binary Encoding: category_encoders.BinaryEncoder; for very high cardinality; pros: reduces dimensions; cons: less interpretable

Code Examples

Python

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from category_encoders import TargetEncoder

df = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Kolkata'],
    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
    'salary': [50000, 80000, 120000, 65000, 90000]
})

# 1. Label Encoding (ordinal)
# Note: LabelEncoder assigns codes alphabetically, not by rank; for a true
# ordinal order, map the values explicitly (e.g., {'High School': 0, 'Bachelor': 1, ...})
le = LabelEncoder()
df['education_label'] = le.fit_transform(df['education'])

# 2. One-Hot Encoding (nominal)
df_onehot = pd.get_dummies(df, columns=['city'], prefix='city', drop_first=True)

# 3. Target Encoding (high cardinality + target)
encoder = TargetEncoder(cols=['city'])
df['city_target'] = encoder.fit_transform(df['city'], df['salary'])

Best practice:

  • Use OneHotEncoder for low-cardinality nominal features (<10–15 categories)

  • Use TargetEncoder or Mean Encoding for high-cardinality with target variable

  • Always fit on train set only — avoid data leakage

6.3 Feature Selection Techniques

Feature selection reduces dimensionality, removes noise, speeds up training, and improves model performance.

Common Methods:

  1. Filter Methods (fast, model-independent)

    • Variance Threshold

    • Correlation with target

    • Chi-square, ANOVA F-test

  2. Wrapper Methods (model-dependent, accurate but slow)

    • Forward/Backward Selection

    • Recursive Feature Elimination (RFE)

  3. Embedded Methods (built into model)

    • Lasso / Ridge regression (L1 regularization)

    • Tree-based feature importance (Random Forest, XGBoost)

Code Examples

Python

import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression, RFE
from sklearn.ensemble import RandomForestRegressor

# Assumes X is a numeric feature DataFrame and y is continuous (salary),
# so we use the regression variants; for a class label target, swap in
# f_classif and RandomForestClassifier.
X = df.drop('salary', axis=1)   # features
y = df['salary']                # target (regression example)

# 1. Remove low-variance features
selector_var = VarianceThreshold(threshold=0.01)
X_var = selector_var.fit_transform(X)

# 2. Select top K features (F-test)
selector_k = SelectKBest(score_func=f_regression, k=5)
X_kbest = selector_k.fit_transform(X, y)

# 3. Recursive Feature Elimination
model = RandomForestRegressor()
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

# 4. Tree-based importance
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

Rule of thumb:

  • Start with filter methods (fast)

  • Use embedded or wrapper for final selection

  • Never select features on full dataset — use train set only

6.4 Handling Imbalanced Datasets

Imbalanced data (e.g., fraud detection, disease prediction) is very common — models tend to favor majority class.

Common Techniques:

  1. Resampling Methods

    • Oversampling minority (SMOTE, ADASYN)

    • Undersampling majority (RandomUnderSampler)

    • Combination (SMOTE + Tomek / SMOTE + ENN)

  2. Class Weighting

    • Most algorithms support class_weight='balanced'

  3. Evaluation Metrics

    • Use Precision, Recall, F1-score, ROC-AUC (not accuracy)

Code Examples

Python

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Assumes X (features) and y (binary class labels) are already defined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# 1. SMOTE oversampling (training data only!)
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

# 2. Class weight (easier)
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# 3. Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Best practice (2026):

  • Prefer class_weight='balanced' first (simple, and it creates no synthetic data)

  • Use SMOTE carefully — only on training data

  • Always evaluate with stratified split and F1 / ROC-AUC

Mini Summary Project – Full Preprocessing Pipeline

Python

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numeric_features = ['age', 'fare']
categorical_features = ['sex', 'class']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first'), categorical_features)
])

X_preprocessed = preprocessor.fit_transform(df)

This completes the full Data Preprocessing & Feature Engineering section — now you know how to transform raw data into model-ready input!

7. Statistics & Probability for Data Science

Statistics and probability form the mathematical foundation of data science. Without understanding them, machine learning models, hypothesis testing, confidence intervals, and model evaluation become guesswork.

7.1 Descriptive vs Inferential Statistics

Descriptive Statistics → Summarizes and describes the data you already have (the sample).

Common tools:

  • Measures of central tendency: mean, median, mode

  • Measures of spread: range, variance, standard deviation, IQR

  • Shape: skewness, kurtosis

  • Visuals: histogram, boxplot, density plot

Example (Python)

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("tips")

# Descriptive summary
print(df['total_bill'].describe())
# count    244.000000
# mean      19.785943
# std        8.902412
# min        3.070000
# 25%       13.347500
# 50%       17.795000
# 75%       24.127500
# max       50.810000

sns.histplot(df['total_bill'], kde=True)
plt.title("Distribution of Total Bill (Descriptive)")
plt.show()

Inferential Statistics → Uses sample data to make conclusions / predictions about the population.

Common tools:

  • Hypothesis testing

  • Confidence intervals

  • Regression analysis

  • p-values, significance levels

Key difference (2026 perspective)

  • Descriptive: "What does my data look like?" (past/current)

  • Inferential: "What can I say about the larger population?" (future/generalization)

7.2 Hypothesis Testing & p-value

Hypothesis testing helps decide whether observed effects in sample data are real (statistically significant) or due to random chance.

Basic steps

  1. State null hypothesis (H₀) – usually "no effect / no difference"

  2. State alternative hypothesis (H₁) – what you want to prove

  3. Choose significance level (α) – commonly 0.05

  4. Calculate test statistic & p-value

  5. If p-value ≤ α → reject H₀ (statistically significant)

Common tests

  • t-test (compare means)

  • Chi-square test (categorical data)

  • ANOVA (compare means across 3+ groups)

p-value interpretation (2026 correct understanding)

  • p-value = probability of observing the data (or more extreme) assuming H₀ is true

  • Small p-value (< 0.05) → strong evidence against H₀

  • Not "probability that H₀ is true"

Example: One-sample t-test

Python

from scipy import stats

# Suppose the claimed average salary = ₹80,000
salaries = [75000, 82000, 78000, 79000, 81000, 83000, 77000]

t_stat, p_value = stats.ttest_1samp(salaries, 80000)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
# If p-value < 0.05 → reject null (salary ≠ ₹80,000)

Two-sample t-test

Python

group1 = [85, 88, 90, 92, 87]
group2 = [78, 80, 82, 79, 81]

t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"p-value: {p_value:.4f}")
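
The chi-square test and ANOVA listed above follow the same SciPy pattern (the contingency counts and group scores below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Chi-square test of independence on a 2x2 contingency table
# (e.g., gender vs. purchased yes/no)
table = np.array([[30, 10],
                  [20, 40]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
print(f"chi-square p-value: {p_chi:.4f}")

# One-way ANOVA: compare means across 3+ groups
group_a = [85, 88, 90, 92, 87]
group_b = [78, 80, 82, 79, 81]
group_c = [91, 94, 89, 93, 95]
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA p-value: {p_anova:.4f}")
```
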

7.3 Probability Distributions

Probability distributions describe how probabilities are distributed over values of a random variable.

Key distributions in data science (2026)

  1. Normal / Gaussian Distribution (bell curve)

    • Most important – Central Limit Theorem

    • Used in: z-scores, confidence intervals, many ML assumptions

    Python

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    x = np.linspace(-4, 4, 1000)
    plt.plot(x, norm.pdf(x, loc=0, scale=1))
    plt.title("Standard Normal Distribution")
    plt.show()

  2. Binomial Distribution (discrete)

    • Number of successes in n independent trials

    • Example: Click-through rate (CTR)

  3. Poisson Distribution (discrete)

    • Number of events in fixed interval (rare events)

    • Example: Number of customer complaints per day

  4. Exponential Distribution (continuous)

    • Time between events in Poisson process

    • Example: Time between customer arrivals

  5. Uniform Distribution

    • All values equally likely

Quick visualization of common distributions

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, binom, poisson, expon

x = np.linspace(0, 20, 1000)

plt.subplot(2, 2, 1)
plt.plot(x, norm.pdf(x, loc=10, scale=3))
plt.title("Normal")

plt.subplot(2, 2, 2)
plt.bar(range(20), binom.pmf(range(20), n=20, p=0.5))
plt.title("Binomial")

plt.subplot(2, 2, 3)
plt.bar(range(20), poisson.pmf(range(20), mu=5))
plt.title("Poisson")

plt.subplot(2, 2, 4)
plt.plot(x, expon.pdf(x, scale=5))
plt.title("Exponential")

plt.tight_layout()
plt.show()

7.4 Correlation, Regression & Confidence Intervals

Correlation measures linear relationship strength & direction.

Python

# Pearson correlation (df is the seaborn "tips" dataset loaded in 7.1)
print(df[['total_bill', 'tip']].corr())
#             total_bill       tip
# total_bill    1.000000  0.675734
# tip           0.675734  1.000000

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

Simple Linear Regression

Python

from sklearn.linear_model import LinearRegression

X = df[['total_bill']]
y = df['tip']

model = LinearRegression()
model.fit(X, y)

print("Slope (β1):", model.coef_[0])
print("Intercept (β0):", model.intercept_)

Confidence Intervals

Python

from scipy import stats

# 95% confidence interval for the mean tip
mean_tip = df['tip'].mean()
ci = stats.t.interval(0.95, len(df['tip']) - 1,
                      loc=mean_tip, scale=stats.sem(df['tip']))
print(f"95% CI for mean tip: {ci}")

Interpretation (2026 correct way): "We are 95% confident that the true population mean tip lies between X and Y."

Mini Summary Project – Full Statistical Analysis

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

df = sns.load_dataset("tips")

# 1. Summary stats
print(df['tip'].describe())

# 2. Hypothesis test: Do smokers tip more?
smoker_tip = df[df['smoker'] == 'Yes']['tip']
non_smoker_tip = df[df['smoker'] == 'No']['tip']
t_stat, p_val = stats.ttest_ind(smoker_tip, non_smoker_tip)
print(f"p-value: {p_val:.4f}")
if p_val < 0.05:
    print("Significant difference in tipping between smokers and non-smokers")

# 3. Correlation & regression
sns.regplot(x='total_bill', y='tip', data=df)
plt.title("Tip vs Total Bill with Regression Line")
plt.show()

This completes the full Statistics & Probability for Data Science section — now you understand the mathematical foundation behind every data science model and decision!

8. Machine Learning with Scikit-learn

Scikit-learn (sklearn) is the most popular open-source machine learning library in Python. It provides simple, consistent, and efficient tools for data mining and analysis — from preprocessing to model evaluation and deployment.

Install Scikit-learn (if not using Anaconda)

Bash

pip install scikit-learn

Standard import

Python

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error

8.1 Supervised Learning – Regression & Classification

Supervised Learning = Learning from labeled data (input + correct output).

Regression → Predict continuous values (e.g., house price, temperature, salary)

Classification → Predict discrete classes (e.g., spam/not spam, disease/no disease)

Common algorithms in Scikit-learn

Regression:

  • Linear Regression

  • Ridge / Lasso (regularized)

  • Decision Tree / Random Forest Regressor

  • Gradient Boosting (XGBoost, LightGBM often used via sklearn interface)

Classification:

  • Logistic Regression

  • Decision Tree / Random Forest Classifier

  • Support Vector Machine (SVM)

  • k-Nearest Neighbors (KNN)

  • Gradient Boosting Classifier

Basic Regression Example (House Price Prediction)

Python

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing

# Load data
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
# RMSE = sqrt(MSE); the squared=False shortcut is deprecated in recent scikit-learn
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"RMSE: {rmse:.3f}")

Basic Classification Example (Iris Dataset)

Python

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Accuracy: {accuracy:.3f}")

8.2 Model Training, Evaluation & Hyperparameter Tuning

Training = model.fit(X_train, y_train)

Prediction = model.predict(X_test)

Evaluation Metrics

Regression:

  • Mean Squared Error (MSE)

  • Root Mean Squared Error (RMSE)

  • Mean Absolute Error (MAE)

  • R² Score

Classification:

  • Accuracy

  • Precision / Recall / F1-Score

  • Confusion Matrix

  • ROC-AUC (especially for imbalanced data)
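
Every metric above is available in sklearn.metrics; a quick sketch on toy predictions (the values are made up just to show the calls):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score, r2_score, recall_score,
                             roc_auc_score)

# Regression metrics
y_true_r = np.array([3.0, 2.5, 4.0, 5.0])
y_pred_r = np.array([2.8, 2.7, 3.6, 5.2])
mse = mean_squared_error(y_true_r, y_pred_r)
rmse = np.sqrt(mse)                           # RMSE = sqrt(MSE)
mae = mean_absolute_error(y_true_r, y_pred_r)
r2 = r2_score(y_true_r, y_pred_r)

# Classification metrics (hard labels + probabilities for ROC-AUC)
y_true_c = np.array([0, 1, 1, 0, 1, 0])
y_pred_c = np.array([0, 1, 0, 0, 1, 1])
y_prob_c = np.array([0.2, 0.9, 0.4, 0.1, 0.8, 0.6])
acc = accuracy_score(y_true_c, y_pred_c)
prec = precision_score(y_true_c, y_pred_c)
rec = recall_score(y_true_c, y_pred_c)
f1 = f1_score(y_true_c, y_pred_c)
cm = confusion_matrix(y_true_c, y_pred_c)
auc = roc_auc_score(y_true_c, y_prob_c)       # needs probabilities, not labels

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
print(f"acc={acc:.2f}  precision={prec:.2f}  recall={rec:.2f}  F1={f1:.2f}  ROC-AUC={auc:.2f}")
```
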

Hyperparameter Tuning (Grid Search / Random Search)

Python

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

Best practice (2026):

  • Use RandomizedSearchCV for large search spaces (faster)

  • Use cross-validation for reliable evaluation

  • Never tune on test set — use validation set or cross-validation

8.3 Cross-Validation & Model Selection

Cross-Validation = Splitting data multiple times to get reliable performance estimate.

Most common: K-Fold CV

Python

from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X, y,
    cv=5,               # 5-fold
    scoring='accuracy'
)
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())
print("Std deviation:", scores.std())

Stratified K-Fold (for imbalanced classification)

Python

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1_macro')

Model Selection Flow (recommended)

  1. Split data → train / validation / test (80/10/10 or 70/15/15)

  2. Preprocess → fit on train only, transform validation/test

  3. Try multiple models with cross-validation on train+validation

  4. Select best model → tune hyperparameters

  5. Final evaluation on hold-out test set
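
Step 1 of this flow can be done with two successive train_test_split calls (a 70/15/15 split shown on the iris data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off 30%, then cut that 30% in half -> 70/15/15
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```
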

8.4 Unsupervised Learning – Clustering & Dimensionality Reduction

Unsupervised Learning → No labels. Discover hidden structure in data.

Clustering – Group similar data points

Most popular: K-Means

Python

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)

labels = kmeans.labels_
centers = kmeans.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200)
plt.title("K-Means Clustering")
plt.show()

Dimensionality Reduction – Reduce number of features while preserving information

PCA (Principal Component Analysis)

Python

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.title("PCA – 2D Visualization")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

print("Explained variance ratio:", pca.explained_variance_ratio_)

t-SNE (non-linear, great for visualization)

Python

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='viridis')
plt.title("t-SNE Visualization")
plt.show()

Mini Summary Project – End-to-End ML Pipeline

Python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = df.drop('target', axis=1)   # your features
y = df['target']                # your target column

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print("F1 scores:", scores)
print("Mean F1:", scores.mean())

This completes the full Machine Learning with Scikit-learn section — now you can build, train, evaluate, and tune real ML models!

9. Advanced Data Science Topics

After mastering the fundamentals (EDA, preprocessing, classical ML), this section introduces more advanced and highly in-demand areas in 2026: time series, NLP, deep learning, and deployment/MLOps. These topics are essential for real-world projects, research papers, and industry roles.

9.1 Time Series Analysis & Forecasting

Time series data has a temporal order (stock prices, sales, weather, sensor readings). The goal is to understand patterns (trend, seasonality, cycles) and predict future values.

Key concepts

  • Trend: long-term increase/decrease

  • Seasonality: repeating patterns (weekly, monthly, yearly)

  • Stationarity: statistical properties constant over time (most models require it)

  • Autocorrelation: correlation with lagged values

Popular libraries

  • statsmodels (ARIMA, SARIMA)

  • Prophet (Facebook)

  • sktime (modern, scikit-learn compatible)

Basic ARIMA example

Python

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Load sample time series (AirPassengers or your own data)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url, parse_dates=['Month'], index_col='Month')
series = df['Passengers']

# Check stationarity
result = adfuller(series)
print("ADF Statistic:", result[0])
print("p-value:", result[1])  # if > 0.05 → not stationary → difference

# Fit ARIMA (p, d, q) – here (5, 1, 0) is a good start
model = ARIMA(series, order=(5, 1, 0))
model_fit = model.fit()

# Forecast next 12 months
forecast = model_fit.forecast(steps=12)
print("Forecast:", forecast)

# Plot
plt.plot(series, label='Actual')
plt.plot(forecast, label='Forecast', color='red')
plt.title("Air Passengers Forecast")
plt.legend()
plt.show()

Prophet (very easy & powerful)

Python

from prophet import Prophet

df_prophet = df.reset_index().rename(columns={'Month': 'ds', 'Passengers': 'y'})

m = Prophet(yearly_seasonality=True)
m.fit(df_prophet)

future = m.make_future_dataframe(periods=12, freq='MS')
forecast = m.predict(future)

m.plot(forecast)
plt.title("Prophet Forecast – Air Passengers")
plt.show()

When to use what (2026):

  • Short-term, classic data → ARIMA/SARIMA

  • Business time series with holidays → Prophet

  • Multivariate → VAR, LSTM (deep learning)

9.2 Natural Language Processing (NLP) Basics

NLP deals with text data: sentiment analysis, chatbots, translation, summarization, etc.

Essential libraries in 2026

  • NLTK / spaCy (traditional)

  • Transformers (Hugging Face) → state-of-the-art (BERT, GPT-style)

Basic NLP pipeline with spaCy

Python

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion in 2026"
doc = nlp(text)

for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named Entity Recognition (NER)
for ent in doc.ents:
    print(ent.text, ent.label_)
# Output:
# Apple ORG
# U.K. GPE
# $1 billion MONEY
# 2026 DATE

Sentiment Analysis with Hugging Face (easiest & best in 2026)

Python

from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")

reviews = [
    "This product is amazing! Love it.",
    "Worst experience ever. Do not buy.",
    "It's okay, nothing special."
]

results = sentiment_pipeline(reviews)
for review, res in zip(reviews, results):
    print(f"Review: {review}")
    print(f"Sentiment: {res['label']} (score: {res['score']:.4f})\n")

Text Classification (custom model) Use Hugging Face Trainer API or scikit-learn + TF-IDF.

9.3 Introduction to Deep Learning with TensorFlow/Keras

Deep learning = neural networks with many layers. In 2026, Keras (inside TensorFlow) is still the easiest high-level API for beginners.

Basic Neural Network (Classification)

Python

import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = models.Sequential([
    layers.Dense(16, activation='relu', input_shape=(4,)),
    layers.Dense(8, activation='relu'),
    layers.Dense(3, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=1)

test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")

Why Keras in 2026?

  • Simple & readable (Sequential / Functional API)

  • Built-in callbacks (EarlyStopping, ModelCheckpoint)

  • Integrates with TensorFlow ecosystem (TensorBoard, TPU support)
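
A minimal sketch of the EarlyStopping callback mentioned above, on a small synthetic dataset (the network size and patience value are arbitrary choices for this demo):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import callbacks, layers, models

# Synthetic binary classification data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

model = models.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when val_loss hasn't improved for 5 epochs, and
# roll back to the best weights seen so far.
early = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                restore_best_weights=True)
history = model.fit(X, y, epochs=50, validation_split=0.2,
                    callbacks=[early], verbose=0)
print("epochs actually run:", len(history.history["loss"]))
```
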

9.4 Model Deployment & MLOps Basics

Deployment = putting model into production so others can use it.

Popular options in 2026 (easy to advanced):

  • Streamlit / Gradio → interactive web apps (fastest)

  • FastAPI + Uvicorn → production API

  • Flask / Django → traditional web

  • Docker + Kubernetes → scalable deployment

  • MLflow / BentoML → full MLOps

Simple Streamlit app example

Python

# app.py
import streamlit as st
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

st.title("Simple Iris Flower Prediction")

sepal_length = st.slider("Sepal Length", 4.0, 8.0, 5.0)
sepal_width = st.slider("Sepal Width", 2.0, 4.5, 3.5)
petal_length = st.slider("Petal Length", 1.0, 7.0, 4.0)
petal_width = st.slider("Petal Width", 0.1, 2.5, 1.3)

# Demo only: fit on the built-in iris data at startup.
# In production, load a pre-trained model instead (e.g., joblib.load("model.pkl")).
iris = load_iris()
model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

prediction = model.predict([[sepal_length, sepal_width, petal_length, petal_width]])
st.write(f"Predicted class: {iris.target_names[prediction[0]]}")

Run:

Bash

pip install streamlit
streamlit run app.py

MLOps Basics (2026 essentials)

  • Version control data & models → DVC

  • Track experiments → MLflow

  • Package models → BentoML / ONNX

  • Deploy → Docker + Render / Railway / AWS / GCP

Mini Summary Project – End-to-End Churn Prediction

  1. Load data → EDA (section 5)

  2. Preprocess → scale, encode (section 6)

  3. Train Random Forest / XGBoost

  4. Evaluate → cross-validation, ROC-AUC

  5. Deploy simple Streamlit app for prediction

This completes the full Advanced Data Science Topics section — now you have exposure to time series, NLP, deep learning, and deployment!

10. Real-World Projects & Case Studies

These four hands-on projects apply everything you've learned: data loading, EDA, preprocessing, modeling, evaluation, visualization, and interpretation. They are designed to be portfolio-ready and commonly asked about in interviews.

10.1 Project 1: House Price Prediction (Regression)

Goal: Predict house prices based on features (classic regression problem).

Dataset: California Housing (built-in in sklearn) or use Kaggle's House Prices dataset.

Steps & Code

Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load data
housing = fetch_california_housing(as_frame=True)
df = housing.frame
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

# 2. EDA (quick look)
print(df.describe())
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix – House Prices")
plt.show()

# 3. Preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Model training & evaluation
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print(f"RMSE: {rmse:.3f} (lower is better)")
print(f"R² Score: {r2:.3f} (closer to 1 is better)")

# 5. Feature importance
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).plot(kind='bar')
plt.title("Feature Importance – House Price Prediction")
plt.show()

Key Takeaways:

  • Median Income is usually the strongest predictor

  • RMSE in range 0.45–0.55 is good for this dataset

  • Try XGBoost or LightGBM for better performance

Improvements:

  • Add feature engineering (rooms per household, age buckets)

  • Hyperparameter tuning (GridSearchCV)

  • Deploy as Streamlit app
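
The feature-engineering improvement above can be sketched on a toy frame that reuses the California Housing column names (the values are made up for illustration):

```python
import pandas as pd

df_fe = pd.DataFrame({
    "HouseAge": [5, 15, 30, 45],
    "AveRooms": [6.0, 5.5, 4.8, 5.2],
    "AveBedrms": [1.1, 1.0, 1.1, 1.0],
})

# Ratio feature: share of bedrooms among rooms
df_fe["BedrmShare"] = df_fe["AveBedrms"] / df_fe["AveRooms"]

# Bucket feature: house age binned into ordered categories
df_fe["AgeBucket"] = pd.cut(df_fe["HouseAge"],
                            bins=[0, 10, 25, 40, 100],
                            labels=["new", "mid", "old", "historic"])
print(df_fe)
```
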

10.2 Project 2: Customer Churn Prediction (Classification)

Goal: Predict whether a customer will leave (churn) — imbalanced classification problem.

Dataset: Telco Customer Churn (Kaggle or use seaborn example)

Steps & Code

Python

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# 1. Load & quick EDA
df = pd.read_csv("https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv")
print(df['Churn'].value_counts(normalize=True))  # imbalanced: ~73% "No"

# 2. Preprocessing
df = df.drop(['customerID'], axis=1)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df = df.dropna()

X = df.drop('Churn', axis=1)
y = df['Churn'].map({'Yes': 1, 'No': 0})

categorical = X.select_dtypes(include='object').columns
numeric = X.select_dtypes(include=['int64', 'float64']).columns

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric),
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical)
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# 3. Model pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(class_weight='balanced', random_state=42))
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))

# 4. Confusion matrix visualization
sns.heatmap(pd.crosstab(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix – Churn Prediction")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

Key Takeaways:

  • Class imbalance → use class_weight='balanced' or SMOTE

  • Focus on Recall (catching churners) and ROC-AUC

  • Top features: Contract type, tenure, monthly charges

Improvements:

  • Try XGBoost / LightGBM

  • Add SMOTE in pipeline

  • Create dashboard with Streamlit

10.3 Project 3: Sentiment Analysis on Reviews (NLP)

Goal: Classify product/movie reviews as positive/negative/neutral.

Dataset: Amazon Reviews or IMDb (use Hugging Face datasets)

Easy & powerful method: Hugging Face Transformers

Python

from transformers import pipeline

# Load sentiment pipeline (pre-trained model)
sentiment = pipeline("sentiment-analysis",
                     model="nlptown/bert-base-multilingual-uncased-sentiment")

# Sample reviews
reviews = [
    "This phone is amazing! Battery lasts all day.",
    "Worst product ever. Broke in 2 days.",
    "It's okay, nothing special but works fine.",
    "Absolutely love it! Best purchase this year."
]

results = sentiment(reviews)
for review, res in zip(reviews, results):
    print(f"Review: {review}")
    print(f"Sentiment: {res['label']} (score: {res['score']:.4f})\n")

Custom model with scikit-learn + TF-IDF

Python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assume df has 'review' and 'sentiment' columns (1=positive, 0=negative)
X = df['review']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)
print(classification_report(y_test, y_pred))

Key Takeaways:

  • Pre-trained transformers (Hugging Face) → best accuracy with almost no code

  • TF-IDF + Logistic Regression → fast baseline, good interpretability

Improvements:

  • Fine-tune BERT/RoBERTa

  • Add emoji/text cleaning

  • Create Streamlit app for live prediction

10.4 Project 4: Sales Dashboard & EDA Report

Goal: Create an interactive EDA & sales dashboard using Streamlit.

Install

Bash

pip install streamlit pandas plotly

Full code (save as app.py)

Python

import streamlit as st
import pandas as pd
import plotly.express as px

st.title("Sales Dashboard & EDA Report")

# Upload data
uploaded_file = st.file_uploader("Upload your sales CSV", type="csv")

if uploaded_file:
    df = pd.read_csv(uploaded_file)

    st.subheader("Data Overview")
    st.dataframe(df.head())

    st.subheader("Summary Statistics")
    st.write(df.describe())

    # Interactive filters
    category = st.selectbox("Select Category", df.columns)

    # Visualizations
    fig1 = px.histogram(df, x=category, title=f"Distribution of {category}")
    st.plotly_chart(fig1)

    fig2 = px.box(df, x="Region", y="Sales", title="Sales by Region")
    st.plotly_chart(fig2)

    st.subheader("Top Products")
    top_products = df.groupby("Product")["Sales"].sum().nlargest(10)
    st.bar_chart(top_products)

Run

Bash

streamlit run app.py

Key Takeaways:

  • Streamlit = fastest way to turn data scripts into interactive dashboards

  • Plotly = interactive charts (zoom, hover)

  • Great for EDA reports, stakeholder presentations

This completes the full Real-World Projects & Case Studies section — now you have four portfolio-ready projects!

11. Best Practices, Portfolio & Career Guidance

You’ve now learned the full technical stack — from Python basics to advanced ML. This section focuses on how to stand out in the real world: writing production-ready code, building a strong portfolio, using Git & Kaggle effectively, and acing data science interviews in 2026.

11.1 Writing Clean & Reproducible Data Science Code

Clean, reproducible code is what separates hobbyists from professionals.

Core Principles (2026 Standard)

  1. Follow PEP 8 + modern formatting tools

    • Use Black (auto-formatter) + isort (import sorter)

    Bash

    pip install black isort
    black . && isort .

  2. Use virtual environments (never install globally)

    Bash

    python -m venv env
    source env/bin/activate
    pip install -r requirements.txt

  3. Always create requirements.txt

    Bash

    pip freeze > requirements.txt

  4. Write reproducible notebooks (Jupyter)

    • Set random seeds everywhere

    Python

    import numpy as np
    import random

    np.random.seed(42)
    random.seed(42)

    • Use nbdev or papermill for production notebooks

    • Prefer .py scripts for final pipelines

  5. Structure projects professionally

    text

    my_project/
    ├── data/              # raw & processed data (never commit raw)
    ├── notebooks/         # exploratory .ipynb files
    ├── src/               # reusable .py modules
    │   ├── data.py
    │   ├── model.py
    │   └── utils.py
    ├── models/            # saved models
    ├── reports/           # figures, dashboards
    ├── requirements.txt
    ├── README.md
    └── main.py            # or run_pipeline.py

  6. Document everything

    • Use docstrings (PEP 257)

    • Add README with project goal, setup instructions, results
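A PEP 257-style docstring for a `src/` module function might look like this; the function itself is a made-up example, not from the project code above:

```python
def train_test_ratio(n_train: int, n_test: int) -> float:
    """Return the fraction of samples used for training.

    Args:
        n_train: Number of training samples.
        n_test: Number of test samples.

    Returns:
        Training fraction between 0 and 1.

    Raises:
        ValueError: If both counts are zero.
    """
    total = n_train + n_test
    if total == 0:
        raise ValueError("Dataset is empty")
    return n_train / total

print(train_test_ratio(80, 20))  # → 0.8
```

Tools like Sphinx and your IDE's hover help both read these docstrings, so documenting once pays off everywhere.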

11.2 Building a Strong Data Science Portfolio

Your GitHub portfolio is your resume in 2026 — recruiters look here first.

Must-Have Projects (2026 recruiters love these)

  1. End-to-end regression project (House Prices / Bike Sharing)

  2. Imbalanced classification (Fraud Detection / Churn Prediction)

  3. NLP project (Sentiment Analysis / Resume Parser)

  4. Time series forecasting (Sales / Stock Price)

  5. Interactive dashboard (Streamlit / Plotly)

  6. Deep learning project (Image classification with transfer learning)

Portfolio Tips

  • Host 4–6 high-quality projects

  • Each repo should have:

    • Clean README (problem statement, approach, results, visuals)

    • requirements.txt

    • Jupyter notebook + .py pipeline

    • Visuals (charts, confusion matrix, feature importance)

    • Model performance metrics

  • Deploy 2–3 projects (Streamlit, Heroku, Render, Hugging Face Spaces)

  • Add blog posts (Medium / Hashnode) explaining your projects

Example README structure

Markdown

# House Price Prediction

## Problem
Predict house prices in California using regression models.

## Dataset
California Housing (sklearn)

## Approach
- EDA → Correlation analysis, outlier removal
- Preprocessing → Scaling, feature engineering
- Models → Linear Regression, Random Forest, XGBoost
- Best model → Random Forest (RMSE 0.47)

## Results
- R²: 0.81
- Feature importance: Median Income > House Age

## Deployment
Live demo: https://house-price-app.streamlit.app

## Tech Stack
Python, Pandas, Scikit-learn, Streamlit

11.3 Git, Kaggle & Resume Tips for Students & Professionals

Git & GitHub Workflow (2026 standard)

  1. Create repo → git init

  2. Work on feature branch: git checkout -b feature/eda

  3. Commit often: git commit -m "Add EDA visualizations"

  4. Push & create Pull Request

  5. Use .gitignore (ignore data/, *.pkl, __pycache__/)

  6. Add GitHub Actions for CI (lint, tests)

Kaggle Tips

  • Participate in competitions → top 10% looks great

  • Create notebooks → aim for upvotes & medals

  • Fork good kernels → learn from top solutions

  • Build datasets → upload clean versions

Resume & LinkedIn Tips (2026)

  • One-page resume for freshers

  • Structure:

    • Projects (3–5) → title, tech stack, results (metrics!)

    • Skills → Python, SQL, Pandas, Scikit-learn, Git, AWS/GCP (basic)

    • Education + certifications (Coursera, Kaggle)

  • LinkedIn: Post weekly → project updates, Kaggle kernels, articles

  • Add badges: Kaggle Expert/Master, GitHub streak

11.4 Interview Preparation & Top Data Science Questions

Common Interview Stages (2026)

  1. Resume screening + HR

  2. Technical MCQ / coding test (HackerRank, LeetCode)

  3. Live coding / take-home assignment

  4. ML system design / case study

  5. Behavioral + project deep-dive

Top 20 Data Science Interview Questions (2026)

  1. Explain bias-variance tradeoff.

  2. What is overfitting? How to prevent it?

  3. Difference between L1 and L2 regularization?

  4. Explain cross-validation. Why stratified?

  5. How does Random Forest work? Why better than single tree?

  6. What is gradient boosting? Difference from Random Forest?

  7. Explain ROC-AUC vs Precision-Recall curve.

  8. How to handle imbalanced datasets?

  9. What is multicollinearity? How to detect & fix?

  10. Explain PCA. When to use it?

  11. Difference between bagging and boosting?

  12. How does k-means clustering work?

  13. What is a confusion matrix? Precision, Recall, F1?

  14. Explain time series components (trend, seasonality).

  15. What is stationarity? How to test it?

  16. Difference between ARIMA and Prophet?

  17. How does BERT work? (high-level)

  18. Explain attention mechanism.

  19. What is transfer learning? When to use it?

  20. How would you deploy a model in production?
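As a quick refresher for the confusion-matrix question (no. 13), precision, recall and F1 fall straight out of the four counts; the values below are made up for illustration:

```python
# Counts from a binary classifier's confusion matrix (illustrative values)
tp, fp, fn, tn = 80, 10, 20, 90

precision = tp / (tp + fp)   # of predicted positives, how many were correct
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# → precision=0.889 recall=0.800 f1=0.842
```

Being able to derive these by hand on a whiteboard is a common interview expectation.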

Preparation Strategy (2026)

  • Practice LeetCode (medium SQL & Python)

  • Build 4–6 strong projects → explain end-to-end

  • Revise statistics & ML theory (StatQuest YouTube)

  • Mock interviews (Pramp, Interviewing.io)

  • Read “Ace the Data Science Interview” book

This completes the full Best Practices, Portfolio & Career Guidance section.

12. Next Steps & Learning Roadmap

You’ve now completed a full, structured journey from Python basics → OOP → data manipulation → visualization → EDA → preprocessing → statistics → machine learning → advanced topics → real projects. This final section gives you a clear, realistic, and up-to-date (2026) roadmap to take your skills to the next level — whether your goal is jobs, research papers, freelancing, or startup building.

12.1 Advanced Topics (Deep Learning, Computer Vision, Big Data)

After mastering classical ML (Scikit-learn), these are the high-impact areas to learn next:

Deep Learning (Neural Networks & Transformers)

  • Frameworks: PyTorch (industry/research favorite in 2026) or TensorFlow/Keras

  • Key topics:

    • Neural network fundamentals (layers, activation, backpropagation)

    • CNNs (Convolutional Neural Networks) for images

    • RNNs / LSTMs / GRUs for sequences

    • Transformers (BERT, GPT-style models) → Hugging Face Transformers library

  • Best starting course: fast.ai “Practical Deep Learning for Coders” (free, project-based)
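Before reaching for a framework, it helps to see that a "layer" is just a weighted sum plus an activation function. A minimal pure-Python sketch (toy weights, no framework — the `dense` and `relu` names are illustrative):

```python
def relu(x):
    """ReLU activation: clamp negative values to zero."""
    return [max(0.0, v) for v in x]

def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] is the weight vector of output neuron j."""
    return [sum(i * w for i, w in zip(inputs, wj)) + b
            for wj, b in zip(weights, biases)]

x = [1.0, -2.0]
hidden = relu(dense(x, weights=[[0.5, 0.5], [1.0, -1.0]], biases=[0.0, 0.0]))
# neuron 0: 0.5*1 + 0.5*(-2) = -0.5 → relu → 0.0
# neuron 1: 1.0*1 + (-1.0)*(-2) = 3.0 → relu → 3.0
print(hidden)  # → [0.0, 3.0]
```

PyTorch's `nn.Linear` plus `nn.ReLU` do exactly this, just vectorized and with automatic gradients for backpropagation.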

Computer Vision

  • Image classification, object detection, segmentation

  • Libraries: PyTorch + torchvision, Ultralytics YOLOv8, Hugging Face

  • Projects:

    • Custom image classifier (cats vs dogs)

    • Object detection on your own photos (YOLO)

    • Face recognition / emotion detection

Big Data & Scalability

  • Tools: PySpark (Spark with Python), Dask (parallel Pandas), Polars (fast DataFrame)

  • Cloud platforms: AWS (S3 + SageMaker), GCP (BigQuery + Vertex AI), Azure

  • Key skills:

    • Distributed computing

    • Handling terabyte-scale data

    • ETL pipelines (Airflow / Prefect)

Learning Order Suggestion (2026)

  1. Deep Learning basics (fast.ai or DeepLearning.AI Coursera)

  2. Computer Vision (PyTorch + YOLO)

  3. NLP Advanced (fine-tune BERT)

  4. Big Data basics (PySpark or Polars)

  5. MLOps / Deployment (MLflow, BentoML, Docker)

12.2 Career Paths & Job Opportunities in Data Science

Main Career Tracks in 2026 (with approximate global salary ranges)

| Role | Primary Skills Required | Typical Experience | India Salary (₹ LPA) | Global Salary (USD/year) | Best For |
|---|---|---|---|---|---|
| Data Analyst | SQL, Excel/Power BI, basic Python/Pandas | 0–3 years | 4–12 | $60k–$95k | Freshers & students |
| Data Scientist | Python, ML (sklearn), stats, SQL, visualization | 1–6 years | 10–28 | $100k–$170k | Most common path |
| Machine Learning Engineer | Python, ML deployment, MLOps, Docker, cloud | 3–8 years | 18–45 | $130k–$220k | Professionals |
| MLOps Engineer | Docker, Kubernetes, MLflow, CI/CD, cloud | 3–7 years | 20–50 | $140k–$240k | High demand in 2026 |
| AI Research Scientist | Deep learning, PyTorch, research papers | 3–10+ years / PhD | 25–70+ | $150k–$350k+ | Researchers & PhDs |
| Data Engineer | SQL, Spark, Airflow, cloud pipelines | 3–8 years | 12–35 | $110k–$190k | Infrastructure focused |

How to Get Hired in 2026

  • Build 4–6 strong projects (GitHub + deployed versions)

  • Participate in Kaggle competitions (top 10% = strong signal)

  • Earn certifications: Google Data Analytics, IBM Data Science, DeepLearning.AI

  • Contribute to open source (Hugging Face, scikit-learn, fastai)

  • Network: LinkedIn, Twitter/X (post weekly), Kaggle discussions

  • Prepare for interviews: LeetCode (SQL + Python), system design cases

Final Motivation

Data science is one of the most rewarding careers in 2026 — high impact, high salary, and endless learning. Code every day. Build real things. Share your work. Stay curious.

You’ve completed the entire Master Data Science with Python tutorial — from setup to advanced topics and career guidance. You are now equipped to start real projects, contribute to open source, and pursue exciting opportunities.

