LEARN COMPLETE PYTHON IN 24 HOURS

🟦 Table of Contents – Master Data Science with Python

🔹 1. Introduction to Data Science & Python Setup

  • 1.1 What is Data Science and Why Python

  • 1.2 Data Science Career Paths

  • 1.3 Python Environment Setup

  • 1.4 Essential Libraries Overview

🔹 2. NumPy – Foundation of Numerical Computing

  • 2.1 NumPy Arrays vs Python Lists

  • 2.2 Array Operations, Broadcasting & Vectorization

  • 2.3 Indexing, Slicing & Array Manipulation

  • 2.4 Mathematical & Statistical Functions

🔹 3. Pandas – Data Manipulation & Analysis

  • 3.1 Series and DataFrame

  • 3.2 Data Loading

  • 3.3 Data Cleaning & Transformation

  • 3.4 Grouping & Aggregation

  • 3.5 Handling Missing Values & Outliers

🔹 4. Data Visualization with Matplotlib & Seaborn

  • 4.1 Matplotlib Basics

  • 4.2 Seaborn Visualization

  • 4.3 Advanced Plots

  • 4.4 Publication-Ready Visualizations

🔹 5. Exploratory Data Analysis (EDA)

  • 5.1 Data Distribution & Summary Statistics

  • 5.2 Univariate, Bivariate & Multivariate Analysis

  • 5.3 Correlation Analysis

  • 5.4 EDA Case Study

🔹 6. Data Preprocessing & Feature Engineering

  • 6.1 Data Scaling & Normalization

  • 6.2 Encoding Categorical Variables

  • 6.3 Feature Selection

  • 6.4 Handling Imbalanced Data

🔹 7. Statistics & Probability for Data Science

  • 7.1 Descriptive vs Inferential Statistics

  • 7.2 Hypothesis Testing

  • 7.3 Probability Distributions

  • 7.4 Correlation & Regression

🔹 8. Machine Learning with Scikit-learn

  • 8.1 Supervised Learning

  • 8.2 Model Training & Evaluation

  • 8.3 Cross-Validation

  • 8.4 Unsupervised Learning

🔹 9. Advanced Data Science Topics

  • 9.1 Time Series Analysis

  • 9.2 NLP Basics

  • 9.3 Deep Learning Introduction

  • 9.4 Model Deployment

🔹 10. Real-World Projects & Case Studies

  • 10.1 House Price Prediction

  • 10.2 Customer Churn Prediction

  • 10.3 Sentiment Analysis

  • 10.4 Sales Dashboard

🔹 11. Best Practices, Portfolio & Career Guidance

  • 11.1 Clean Code Practices

  • 11.2 Portfolio Building

  • 11.3 Git & Resume Tips

  • 11.4 Interview Preparation

🔹 12. Next Steps & Learning Roadmap

  • 12.1 Advanced Topics

  • 12.2 Books & Resources

  • 12.3 Career Opportunities

6. Data Preprocessing & Feature Engineering

Data preprocessing and feature engineering are often the most time-consuming, and most important, steps in any data science project. Good preprocessing turns raw, messy data into clean, model-ready input, and feature engineering creates new, informative features that can dramatically improve model performance.

Goal: Prepare data so machine learning models can learn effectively and generalize well.

6.1 Data Scaling & Normalization

Many machine learning algorithms (especially distance-based ones like KNN, SVM, K-Means, Neural Networks) perform poorly if features are on different scales.
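To see why, here is a minimal sketch with made-up age and salary values (and hypothetical feature ranges): before scaling, the Euclidean distance between two people is dominated almost entirely by salary, so a 20-year age gap barely registers.

```python
import numpy as np

# Two hypothetical customers: (age, salary). Salary is orders of
# magnitude larger than age.
a = np.array([25, 30000])
b = np.array([45, 31000])

# Raw Euclidean distance: the 1000-unit salary gap swamps the
# 20-year age gap.
dist_raw = np.linalg.norm(a - b)

# After min-max scaling both features to [0, 1] (using illustrative
# ranges: age 20-60, salary 20k-200k), both features contribute on
# comparable scales.
a_scaled = np.array([(25 - 20) / 40, (30000 - 20000) / 180000])
b_scaled = np.array([(45 - 20) / 40, (31000 - 20000) / 180000])
dist_scaled = np.linalg.norm(a_scaled - b_scaled)

print(dist_raw)     # roughly 1000: salary dominates
print(dist_scaled)  # roughly 0.5: age now matters
```

After scaling, the age difference drives the distance, which is what a KNN or K-Means model usually needs.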

Common Scaling Techniques:

  • Min-Max Scaling: (X - min) / (max - min). Use for neural networks, image data, bounded data. Output: [0, 1] or a custom range. Affected by outliers: yes.

  • Standardization: (X - mean) / std. Use for most algorithms (SVM, logistic regression, PCA). Output: mean ≈ 0, std ≈ 1. Affected by outliers: yes.

  • Robust Scaling: (X - median) / IQR. Use for data with outliers. Output: centered around the median. Affected by outliers: no.

  • Log Transformation: log(1 + X) or Box-Cox. Use for highly skewed data (income, time, counts). Output: reduced skewness. Affected by outliers: reduces their impact.

Code Examples

Python

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

df = pd.DataFrame({
    'age': [25, 30, 45, 22, 60],
    'salary': [30000, 50000, 120000, 25000, 200000],
    'experience': [1, 3, 10, 0, 20]
})

# 1. Min-Max Scaling (0 to 1)
scaler_minmax = MinMaxScaler()
df[['age_minmax', 'salary_minmax']] = scaler_minmax.fit_transform(df[['age', 'salary']])

# 2. Standardization (mean=0, std=1)
scaler_std = StandardScaler()
df[['age_std', 'salary_std']] = scaler_std.fit_transform(df[['age', 'salary']])

# 3. Robust Scaling (handles outliers)
scaler_robust = RobustScaler()
df[['salary_robust']] = scaler_robust.fit_transform(df[['salary']])

# 4. Log Transformation (for skewed data)
df['salary_log'] = np.log1p(df['salary'])  # log(1 + x) handles zeros safely

print(df)

Quick rule of thumb (2026):

  • Use StandardScaler for most ML models (default choice)

  • Use MinMaxScaler for neural networks or when you need [0,1] range

  • Use RobustScaler if data has outliers

  • Apply log or sqrt for highly right-skewed features (income, time, counts)

6.2 Encoding Categorical Variables

Machine learning models require numerical input — so we convert categories to numbers.

Common Encoding Techniques:

  • Label Encoding (sklearn.preprocessing.LabelEncoder): for ordinal categories (low < medium < high). Pros: simple, fast. Cons: implies an order (bad for nominal data).

  • One-Hot Encoding (pd.get_dummies / OneHotEncoder): for nominal categories (colors, cities). Pros: no order assumption. Cons: high dimensionality with many categories.

  • Target / Mean Encoding (category_encoders.TargetEncoder): for high-cardinality nominal features (many unique values). Pros: captures the relationship with the target. Cons: risk of data leakage.

  • Frequency / Count Encoding (manual or category_encoders.CountEncoder): for high cardinality when target leakage must be avoided. Pros: simple. Cons: carries no target information.

  • Binary Encoding (category_encoders.BinaryEncoder): for very high cardinality. Pros: reduces dimensionality. Cons: less interpretable.

Code Examples

Python

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from category_encoders import TargetEncoder

df = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Kolkata'],
    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
    'salary': [50000, 80000, 120000, 65000, 90000]
})

# 1. Label Encoding (note: LabelEncoder assigns codes alphabetically,
#    not by true ordinal rank; map manually for a real ordinal order)
le = LabelEncoder()
df['education_label'] = le.fit_transform(df['education'])

# 2. One-Hot Encoding (nominal)
df_onehot = pd.get_dummies(df, columns=['city'], prefix='city', drop_first=True)

# 3. Target Encoding (high cardinality + target)
encoder = TargetEncoder(cols=['city'])
df['city_target'] = encoder.fit_transform(df['city'], df['salary'])

Best practice:

  • Use OneHotEncoder for low-cardinality nominal features (<10–15 categories)

  • Use TargetEncoder or Mean Encoding for high-cardinality with target variable

  • Always fit on train set only — avoid data leakage

6.3 Feature Selection Techniques

Feature selection reduces dimensionality, removes noise, speeds up training, and improves model performance.

Common Methods:

  1. Filter Methods (fast, model-independent)

    • Variance Threshold

    • Correlation with target

    • Chi-square, ANOVA F-test

  2. Wrapper Methods (model-dependent, accurate but slow)

    • Forward/Backward Selection

    • Recursive Feature Elimination (RFE)

  3. Embedded Methods (built into model)

    • Lasso / Ridge regression (L1 regularization)

    • Tree-based feature importance (Random Forest, XGBoost)

Code Examples

Python

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif, RFE

# Synthetic classification data: 10 numeric features, binary target
X_arr, y = make_classification(n_samples=200, n_features=10,
                               n_informative=4, random_state=42)
X = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(10)])

# 1. Remove low-variance features
selector_var = VarianceThreshold(threshold=0.01)
X_var = selector_var.fit_transform(X)

# 2. Select top K features (ANOVA F-test for classification)
selector_k = SelectKBest(score_func=f_classif, k=5)
X_kbest = selector_k.fit_transform(X, y)

# 3. Recursive Feature Elimination
model = RandomForestClassifier(random_state=42)
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

# 4. Tree-based importance
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

Rule of thumb:

  • Start with filter methods (fast)

  • Use embedded or wrapper for final selection

  • Never select features on full dataset — use train set only
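One way to enforce the train-set-only rule automatically is to put the selector inside a Pipeline, as in this sketch (the synthetic data and k=5 are made up for illustration): during cross-validation, the selector is re-fitted on each training fold, so the validation fold never influences which features are chosen.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, only 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=42)

# Selection happens inside each CV fold, so there is no leakage from
# the validation rows into the feature scores.
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=5)),
    ('model', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Selecting features on the full dataset first and cross-validating afterwards would give optimistically biased scores.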

6.4 Handling Imbalanced Datasets

Imbalanced data (e.g., fraud detection, disease prediction) is very common, and models trained on it tend to favor the majority class.

Common Techniques:

  1. Resampling Methods

    • Oversampling minority (SMOTE, ADASYN)

    • Undersampling majority (RandomUnderSampler)

    • Combination (SMOTE + Tomek / SMOTE + ENN)

  2. Class Weighting

    • Many scikit-learn algorithms support class_weight='balanced'

  3. Evaluation Metrics

    • Use Precision, Recall, F1-score, ROC-AUC (not accuracy)

Code Examples

Python

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (90% majority / 10% minority)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 1. SMOTE oversampling, applied to the training set only
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

# 2. Undersampling the majority class
under = RandomUnderSampler(random_state=42)
X_under, y_under = under.fit_resample(X_train, y_train)

# 3. Class weight (easier alternative, no data creation)
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# 4. Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Best practice (2026):

  • Prefer class_weight or balanced accuracy first (simple & no data creation)

  • Use SMOTE carefully — only on training data

  • Always evaluate with stratified split and F1 / ROC-AUC
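The first two rules above can be sketched together (synthetic 90/10 data, made up for illustration): class_weight='balanced' plus a stratified split, scored with F1 rather than accuracy, which would look deceptively high on data this skewed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 90/10 imbalanced dataset.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

model = LogisticRegression(class_weight='balanced', max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# F1 scores the minority class; accuracy would look deceptively good,
# since always predicting the majority class already scores about 90%.
f1 = cross_val_score(model, X, y, cv=cv, scoring='f1').mean()
acc = cross_val_score(model, X, y, cv=cv, scoring='accuracy').mean()
print(f'F1: {f1:.3f}, Accuracy: {acc:.3f}')
```

If you do reach for SMOTE instead, imblearn's own Pipeline applies it only during fitting, so resampling never touches the validation folds.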

Mini Summary Project – Full Preprocessing Pipeline

Python

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample Titanic-style data
df = pd.DataFrame({
    'age': [22, 38, 26, 35],
    'fare': [7.25, 71.28, 7.92, 53.1],
    'sex': ['male', 'female', 'female', 'female'],
    'class': ['Third', 'First', 'Third', 'First']
})

numeric_features = ['age', 'fare']
categorical_features = ['sex', 'class']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first'), categorical_features)
])

X_preprocessed = preprocessor.fit_transform(df)
print(X_preprocessed)
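A ColumnTransformer like this is usually chained with a model in one Pipeline, so a single fit() call fits the scaler and encoder on the training rows and then trains the classifier. A sketch with a tiny made-up Titanic-style frame (column names and values are illustrative only):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up Titanic-style frame so the example is self-contained.
df = pd.DataFrame({
    'age': [22, 38, 26, 35, 28, 54, 2, 27],
    'fare': [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.08, 11.13],
    'sex': ['male', 'female', 'female', 'female',
            'male', 'male', 'male', 'female'],
    'class': ['Third', 'First', 'Third', 'First',
              'Third', 'First', 'Third', 'Third'],
    'survived': [0, 1, 1, 1, 0, 0, 0, 1],
})
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'class']),
])

# One Pipeline = preprocessing + model; .fit() fits the scaler and
# encoder on the training rows only, then trains the classifier.
clf = Pipeline([('prep', preprocessor), ('model', LogisticRegression())])
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='survived'), df['survived'],
    test_size=0.25, random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(preds)
```

Because preprocessing lives inside the pipeline, predict() accepts raw rows and applies the exact same transformations that were learned during training.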

This completes the full Data Preprocessing & Feature Engineering section — now you know how to transform raw data into model-ready input!
