LEARN COMPLETE PYTHON IN 24 HOURS

🟦 Table of Contents – Master Data Science with Python

🔹 1. Introduction to Data Science & Python Setup

  • 1.1 What is Data Science and Why Python

  • 1.2 Data Science Career Paths

  • 1.3 Python Environment Setup

  • 1.4 Essential Libraries Overview

🔹 2. NumPy – Foundation of Numerical Computing

  • 2.1 NumPy Arrays vs Python Lists

  • 2.2 Array Operations, Broadcasting & Vectorization

  • 2.3 Indexing, Slicing & Array Manipulation

  • 2.4 Mathematical & Statistical Functions

🔹 3. Pandas – Data Manipulation & Analysis

  • 3.1 Series and DataFrame

  • 3.2 Data Loading

  • 3.3 Data Cleaning & Transformation

  • 3.4 Grouping & Aggregation

  • 3.5 Handling Missing Values & Outliers

🔹 4. Data Visualization with Matplotlib & Seaborn

  • 4.1 Matplotlib Basics

  • 4.2 Seaborn Visualization

  • 4.3 Advanced Plots

  • 4.4 Publication-Ready Visualizations

🔹 5. Exploratory Data Analysis (EDA)

  • 5.1 Data Distribution & Summary Statistics

  • 5.2 Univariate, Bivariate & Multivariate Analysis

  • 5.3 Correlation Analysis

  • 5.4 EDA Case Study

🔹 6. Data Preprocessing & Feature Engineering

  • 6.1 Data Scaling & Normalization

  • 6.2 Encoding Categorical Variables

  • 6.3 Feature Selection

  • 6.4 Handling Imbalanced Data

🔹 7. Statistics & Probability for Data Science

  • 7.1 Descriptive vs Inferential Statistics

  • 7.2 Hypothesis Testing

  • 7.3 Probability Distributions

  • 7.4 Correlation & Regression

🔹 8. Machine Learning with Scikit-learn

  • 8.1 Supervised Learning

  • 8.2 Model Training & Evaluation

  • 8.3 Cross-Validation

  • 8.4 Unsupervised Learning

🔹 9. Advanced Data Science Topics

  • 9.1 Time Series Analysis

  • 9.2 NLP Basics

  • 9.3 Deep Learning Introduction

  • 9.4 Model Deployment

🔹 10. Real-World Projects & Case Studies

  • 10.1 House Price Prediction

  • 10.2 Customer Churn Prediction

  • 10.3 Sentiment Analysis

  • 10.4 Sales Dashboard

🔹 11. Best Practices, Portfolio & Career Guidance

  • 11.1 Clean Code Practices

  • 11.2 Portfolio Building

  • 11.3 Git & Resume Tips

  • 11.4 Interview Preparation

🔹 12. Next Steps & Learning Roadmap

  • 12.1 Advanced Topics

  • 12.2 Books & Resources

  • 12.3 Career Opportunities

6. Data Preprocessing & Feature Engineering

Data preprocessing and feature engineering are often the most time-consuming, and most important, steps in any data science project. Good preprocessing turns raw, messy data into clean, model-ready input, and feature engineering creates new, informative features that can dramatically improve model performance.

Goal: Prepare data so machine learning models can learn effectively and generalize well.

6.1 Data Scaling & Normalization

Many machine learning algorithms (especially distance-based ones like KNN, SVM, K-Means, Neural Networks) perform poorly if features are on different scales.
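To see why, here is a minimal sketch with made-up age and salary values (and hypothetical feature ranges): before scaling, the Euclidean distance between two people is dominated almost entirely by salary, so a 20-year age gap barely registers.

```python
import numpy as np

# Two hypothetical customers: (age, salary). Salary is orders of
# magnitude larger than age.
a = np.array([25, 30000])
b = np.array([45, 31000])

# Raw Euclidean distance: the 1000-unit salary gap swamps the
# 20-year age gap.
dist_raw = np.linalg.norm(a - b)

# After min-max scaling both features to [0, 1] (using illustrative
# ranges: age 20-60, salary 20k-200k), both features contribute on
# comparable scales.
a_scaled = np.array([(25 - 20) / 40, (30000 - 20000) / 180000])
b_scaled = np.array([(45 - 20) / 40, (31000 - 20000) / 180000])
dist_scaled = np.linalg.norm(a_scaled - b_scaled)

print(dist_raw)     # roughly 1000: salary dominates
print(dist_scaled)  # roughly 0.5: age now matters
```

After scaling, the age difference drives the distance, which is what a KNN or K-Means model usually needs.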

Common Scaling Techniques:

  • Min-Max Scaling: (X - min) / (max - min). Use for neural networks, image data, bounded data. Output: [0, 1] or a custom range. Affected by outliers: yes.

  • Standardization: (X - mean) / std. Use for most algorithms (SVM, logistic regression, PCA). Output: mean ≈ 0, std ≈ 1. Affected by outliers: yes.

  • Robust Scaling: (X - median) / IQR. Use for data with outliers. Output: centered around the median. Affected by outliers: no.

  • Log Transformation: log(1 + X) or Box-Cox. Use for highly skewed data (income, time, counts). Output: reduced skewness. Affected by outliers: reduces their impact.

Code Examples

Python

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

df = pd.DataFrame({
    'age': [25, 30, 45, 22, 60],
    'salary': [30000, 50000, 120000, 25000, 200000],
    'experience': [1, 3, 10, 0, 20]
})

# 1. Min-Max Scaling (0 to 1)
scaler_minmax = MinMaxScaler()
df[['age_minmax', 'salary_minmax']] = scaler_minmax.fit_transform(df[['age', 'salary']])

# 2. Standardization (mean=0, std=1)
scaler_std = StandardScaler()
df[['age_std', 'salary_std']] = scaler_std.fit_transform(df[['age', 'salary']])

# 3. Robust Scaling (handles outliers)
scaler_robust = RobustScaler()
df[['salary_robust']] = scaler_robust.fit_transform(df[['salary']])

# 4. Log Transformation (for skewed data)
df['salary_log'] = np.log1p(df['salary'])  # log(1 + x) handles zeros safely

print(df)

Quick rule of thumb (2026):

  • Use StandardScaler for most ML models (default choice)

  • Use MinMaxScaler for neural networks or when you need [0,1] range

  • Use RobustScaler if data has outliers

  • Apply log or sqrt for highly right-skewed features (income, time, counts)

6.2 Encoding Categorical Variables

Machine learning models require numerical input — so we convert categories to numbers.

Common Encoding Techniques:

  • Label Encoding (sklearn.preprocessing.LabelEncoder): for ordinal categories (low < medium < high). Pros: simple, fast. Cons: implies an order (bad for nominal data).

  • One-Hot Encoding (pd.get_dummies / OneHotEncoder): for nominal categories (colors, cities). Pros: no order assumption. Cons: high dimensionality with many categories.

  • Target / Mean Encoding (category_encoders.TargetEncoder): for high-cardinality nominal features (many unique values). Pros: captures the relationship with the target. Cons: risk of data leakage.

  • Frequency / Count Encoding (manual or category_encoders.CountEncoder): for high cardinality when target leakage must be avoided. Pros: simple. Cons: carries no target information.

  • Binary Encoding (category_encoders.BinaryEncoder): for very high cardinality. Pros: reduces dimensionality. Cons: less interpretable.

Code Examples

Python

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from category_encoders import TargetEncoder

df = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Kolkata'],
    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
    'salary': [50000, 80000, 120000, 65000, 90000]
})

# 1. Label Encoding (note: LabelEncoder assigns codes alphabetically,
#    not by true ordinal rank; map manually for a real ordinal order)
le = LabelEncoder()
df['education_label'] = le.fit_transform(df['education'])

# 2. One-Hot Encoding (nominal)
df_onehot = pd.get_dummies(df, columns=['city'], prefix='city', drop_first=True)

# 3. Target Encoding (high cardinality + target)
encoder = TargetEncoder(cols=['city'])
df['city_target'] = encoder.fit_transform(df['city'], df['salary'])

Best practice:

  • Use OneHotEncoder for low-cardinality nominal features (<10–15 categories)

  • Use TargetEncoder or Mean Encoding for high-cardinality with target variable

  • Always fit on train set only — avoid data leakage

6.3 Feature Selection Techniques

Feature selection reduces dimensionality, removes noise, speeds up training, and improves model performance.

Common Methods:

  1. Filter Methods (fast, model-independent)

    • Variance Threshold

    • Correlation with target

    • Chi-square, ANOVA F-test

  2. Wrapper Methods (model-dependent, accurate but slow)

    • Forward/Backward Selection

    • Recursive Feature Elimination (RFE)

  3. Embedded Methods (built into model)

    • Lasso / Ridge regression (L1 regularization)

    • Tree-based feature importance (Random Forest, XGBoost)

Code Examples

Python

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif, RFE

# Synthetic classification data: 10 numeric features, binary target
X_arr, y = make_classification(n_samples=200, n_features=10,
                               n_informative=4, random_state=42)
X = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(10)])

# 1. Remove low-variance features
selector_var = VarianceThreshold(threshold=0.01)
X_var = selector_var.fit_transform(X)

# 2. Select top K features (ANOVA F-test for classification)
selector_k = SelectKBest(score_func=f_classif, k=5)
X_kbest = selector_k.fit_transform(X, y)

# 3. Recursive Feature Elimination
model = RandomForestClassifier(random_state=42)
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

# 4. Tree-based importance
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

Rule of thumb:

  • Start with filter methods (fast)

  • Use embedded or wrapper for final selection

  • Never select features on full dataset — use train set only
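One way to enforce the train-set-only rule automatically is to put the selector inside a Pipeline, as in this sketch (the synthetic data and k=5 are made up for illustration): during cross-validation, the selector is re-fitted on each training fold, so the validation fold never influences which features are chosen.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, only 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=42)

# Selection happens inside each CV fold, so there is no leakage from
# the validation rows into the feature scores.
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=5)),
    ('model', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Selecting features on the full dataset first and cross-validating afterwards would give optimistically biased scores.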

6.4 Handling Imbalanced Datasets

Imbalanced data (e.g., fraud detection, disease prediction) is very common, and models trained on it tend to favor the majority class.

Common Techniques:

  1. Resampling Methods

    • Oversampling minority (SMOTE, ADASYN)

    • Undersampling majority (RandomUnderSampler)

    • Combination (SMOTE + Tomek / SMOTE + ENN)

  2. Class Weighting

    • Many scikit-learn algorithms support class_weight='balanced'

  3. Evaluation Metrics

    • Use Precision, Recall, F1-score, ROC-AUC (not accuracy)

Code Examples

Python

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (90% majority / 10% minority)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 1. SMOTE oversampling, applied to the training set only
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

# 2. Undersampling the majority class
under = RandomUnderSampler(random_state=42)
X_under, y_under = under.fit_resample(X_train, y_train)

# 3. Class weight (easier alternative, no data creation)
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# 4. Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Best practice (2026):

  • Prefer class_weight or balanced accuracy first (simple & no data creation)

  • Use SMOTE carefully — only on training data

  • Always evaluate with stratified split and F1 / ROC-AUC
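The first two rules above can be sketched together (synthetic 90/10 data, made up for illustration): class_weight='balanced' plus a stratified split, scored with F1 rather than accuracy, which would look deceptively high on data this skewed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 90/10 imbalanced dataset.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

model = LogisticRegression(class_weight='balanced', max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# F1 scores the minority class; accuracy would look deceptively good,
# since always predicting the majority class already scores about 90%.
f1 = cross_val_score(model, X, y, cv=cv, scoring='f1').mean()
acc = cross_val_score(model, X, y, cv=cv, scoring='accuracy').mean()
print(f'F1: {f1:.3f}, Accuracy: {acc:.3f}')
```

If you do reach for SMOTE instead, imblearn's own Pipeline applies it only during fitting, so resampling never touches the validation folds.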

Mini Summary Project – Full Preprocessing Pipeline

Python

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample Titanic-style data
df = pd.DataFrame({
    'age': [22, 38, 26, 35],
    'fare': [7.25, 71.28, 7.92, 53.1],
    'sex': ['male', 'female', 'female', 'female'],
    'class': ['Third', 'First', 'Third', 'First']
})

numeric_features = ['age', 'fare']
categorical_features = ['sex', 'class']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first'), categorical_features)
])

X_preprocessed = preprocessor.fit_transform(df)
print(X_preprocessed)
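A ColumnTransformer like this is usually chained with a model in one Pipeline, so a single fit() call fits the scaler and encoder on the training rows and then trains the classifier. A sketch with a tiny made-up Titanic-style frame (column names and values are illustrative only):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up Titanic-style frame so the example is self-contained.
df = pd.DataFrame({
    'age': [22, 38, 26, 35, 28, 54, 2, 27],
    'fare': [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.08, 11.13],
    'sex': ['male', 'female', 'female', 'female',
            'male', 'male', 'male', 'female'],
    'class': ['Third', 'First', 'Third', 'First',
              'Third', 'First', 'Third', 'Third'],
    'survived': [0, 1, 1, 1, 0, 0, 0, 1],
})
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'class']),
])

# One Pipeline = preprocessing + model; .fit() fits the scaler and
# encoder on the training rows only, then trains the classifier.
clf = Pipeline([('prep', preprocessor), ('model', LogisticRegression())])
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='survived'), df['survived'],
    test_size=0.25, random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(preds)
```

Because preprocessing lives inside the pipeline, predict() accepts raw rows and applies the exact same transformations that were learned during training.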

This completes the full Data Preprocessing & Feature Engineering section — now you know how to transform raw data into model-ready input!
