LEARN COMPLETE PYTHON IN 24 HOURS

Statistics and probability form the mathematical foundation of data science. Without understanding them, machine learning models, hypothesis testing, confidence intervals, and model evaluation become guesswork.

7.1 Descriptive vs Inferential Statistics

Descriptive Statistics → Summarizes and describes the data you already have (the sample).

Common tools:

Measures of central tendency: mean, median, mode
Measures of spread: range, variance, standard deviation, IQR
Shape: skewness, kurtosis
Visuals: histogram, boxplot, density plot

Example (Python)

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("tips")

# Descriptive summary
print(df['total_bill'].describe())
# count    244.000000
# mean      19.785943
# std        8.902412
# min        3.070000
# 25%       13.347500
# 50%       17.795000
# 75%       24.127500
# max       50.810000

sns.histplot(df['total_bill'], kde=True)
plt.title("Distribution of Total Bill (Descriptive)")
plt.show()
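The describe() summary does not report the mode, range, or IQR directly. A minimal sketch of those measures using Python's standard statistics module is shown here. The bills list is made-up illustrative data, not the tips dataset, and statistics.quantiles uses a different default quartile rule than pandas, so quartiles can differ slightly between the two.

```python
import statistics

# Hypothetical bill amounts (illustrative only, not taken from the tips dataset)
bills = [12.5, 18.0, 18.0, 21.3, 24.7, 30.1, 45.9]

# Measures of central tendency
mean_bill = statistics.mean(bills)
median_bill = statistics.median(bills)
mode_bill = statistics.mode(bills)

# Measures of spread
data_range = max(bills) - min(bills)
sample_var = statistics.variance(bills)  # sample variance (divides by n - 1)
sample_std = statistics.stdev(bills)     # square root of the sample variance
q1, q2, q3 = statistics.quantiles(bills, n=4)
iqr = q3 - q1                            # spread of the middle 50% of values

print(f"mean={mean_bill:.2f}, median={median_bill}, mode={mode_bill}")
print(f"range={data_range:.2f}, std={sample_std:.2f}, IQR={iqr:.2f}")
```

The mean here is pulled above the median by the single large bill, which is the kind of right skew the histogram of total_bill also shows.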

Inferential Statistics → Uses sample data to draw conclusions or make predictions about the wider population the sample came from.

Common tools:

Hypothesis testing
Confidence intervals
Regression analysis
p-values, significance levels

Key difference

Descriptive: "What does my data look like?" (past/current)
Inferential: "What can I say about the larger population?" (future/generalization)

7.2 Hypothesis Testing & p-value

Hypothesis testing helps decide whether an effect observed in sample data is statistically significant or could plausibly be explained by random chance alone.

Basic steps

State the null hypothesis (H₀) – usually "no effect / no difference"
State the alternative hypothesis (H₁) – the effect or difference you are testing for
Choose a significance level (α) – commonly 0.05
Calculate the test statistic & p-value
If p-value ≤ α → reject H₀ (statistically significant); otherwise fail to reject H₀
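The steps above can be sketched by hand for a one-sample t-test using only the standard library. This is an illustration under stated assumptions: the salary values and the claimed mean of 80,000 are hypothetical, and the decision uses the critical-value form of step 5 (|t| at or above the critical value is equivalent to p-value ≤ α), with the critical value 2.447 read from a standard t-table for 6 degrees of freedom.

```python
import math
import statistics

# Illustrative sample of salaries (hypothetical values)
salaries = [75000, 82000, 78000, 79000, 81000, 83000, 77000]

# Steps 1-2: H0: population mean = 80,000; H1: population mean != 80,000
claimed_mean = 80000

# Step 3: significance level
alpha = 0.05

# Step 4: test statistic t = (sample mean - claimed mean) / standard error
n = len(salaries)
sample_mean = statistics.mean(salaries)
std_error = statistics.stdev(salaries) / math.sqrt(n)
t_stat = (sample_mean - claimed_mean) / std_error

# Step 5: decision. 2.447 is the two-sided critical value for alpha = 0.05
# and n - 1 = 6 degrees of freedom, taken from a standard t-table.
t_critical = 2.447
reject_h0 = abs(t_stat) >= t_critical

print(f"t-statistic: {t_stat:.3f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

In practice a library routine returns the p-value directly, so the table lookup is only needed when working by hand.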

Common tests

t-test (compare means)
Chi-square test (categorical data)
ANOVA (compare means across 3+ groups)

p-value interpretation

p-value = probability of observing data at least as extreme as the sample, assuming H₀ is true
Small p-value (≤ α, e.g. 0.05) → the data would be unlikely if H₀ were true, so H₀ is rejected
It is not the probability that H₀ is true
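To make the definition concrete, here is a small sketch that computes a p-value by direct counting. It uses an exact permutation test, which is a different technique from the tests listed above and is chosen only because it needs no distribution tables. The two score lists are hypothetical.

```python
from itertools import combinations

# Hypothetical test scores for two groups of equal size
group_a = [85, 88, 90, 92, 87]
group_b = [78, 80, 82, 79, 81]

pooled = group_a + group_b
total = sum(pooled)
size = len(group_a)

# With equal group sizes, the difference in means is proportional to the
# difference in sums, so exact integer arithmetic is enough.
observed_gap = abs(sum(group_a) - sum(group_b))

# Under H0 the group labels are arbitrary, so every way of choosing which
# 5 of the 10 scores are labelled "group A" is equally likely.
splits = 0
as_extreme = 0
for chosen in combinations(pooled, size):
    gap = abs(2 * sum(chosen) - total)  # |sum(chosen) - sum(rest)|
    splits += 1
    if gap >= observed_gap:
        as_extreme += 1

# p-value: share of relabellings at least as extreme as what was observed
p_value = as_extreme / splits
print(f"{as_extreme} of {splits} relabellings are as extreme: p = {p_value:.4f}")
```

The result is a statement about how rare the observed gap is when H₀ holds, not about how likely H₀ itself is.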

Example: One-sample t-test

Python

from scipy import stats

# Suppose average salary claim = ₹80,000
salaries = [75000, 82000, 78000, 79000, 81000, 83000, 77000]

t_stat, p_value = stats.ttest_1samp(salaries, 80000)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
# If p-value < 0.05 → reject null (salary ≠ ₹80,000)

Two-sample t-test

Python

from scipy import stats

group1 = [85, 88, 90, 92, 87]
group2 = [78, 80, 82, 79, 81]

# Independent two-sample t-test (assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"p-value: {p_value:.4f}")

7.3 Probability Distributions

Probability distributions describe how probabilities are distributed over values of a random variable.

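Key distributions in data science

Normal / Gaussian Distribution (bell curve)
- The most important distribution: by the Central Limit Theorem, means of large samples are approximately normal even when the underlying data are not (given finite variance)
- Used in: z-scores, confidence intervals, many ML assumptions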

Python

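import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Density of the standard normal (mean 0, standard deviation 1)
x = np.linspace(-4, 4, 1000)
plt.plot(x, norm.pdf(x, loc=0, scale=1))
plt.title("Standard Normal Distribution")
plt.show()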
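The Central Limit Theorem can be checked with a short simulation. The sketch below uses only the standard library, and the sample size, number of samples, and choice of an exponential population are arbitrary assumptions made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

sample_size = 50
num_samples = 2000

# Draw from an exponential distribution with mean 1 and standard deviation 1,
# which is strongly right-skewed rather than bell-shaped.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)

# CLT prediction: sample means centre on 1 with spread 1 / sqrt(sample_size)
predicted_spread = 1 / sample_size ** 0.5

# Share of sample means within 1.96 predicted standard errors of the centre;
# for a normal distribution this share is about 95%.
within = sum(abs(m - 1.0) <= 1.96 * predicted_spread for m in sample_means) / num_samples

print(f"mean of sample means: {mean_of_means:.3f} (predicted 1.000)")
print(f"spread of sample means: {spread_of_means:.3f} (predicted {predicted_spread:.3f})")
print(f"share within 1.96 standard errors: {within:.3f}")
```

Even though individual draws are skewed, the sample means cluster symmetrically, which is why normal-based z-scores and confidence intervals work for averages.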

Binomial Distribution (discrete)
- Number of successes in n independent trials
- Example: Click-through rate (CTR)

Poisson Distribution (discrete)
- Number of events in fixed interval (rare events)
- Example: Number of customer complaints per day

Exponential Distribution (continuous)
- Time between events in Poisson process
- Example: Time between customer arrivals

Uniform Distribution
- All values equally likely
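The discrete and waiting-time distributions listed above have short closed-form formulas. The following sketch writes them out with the standard math module; the function names and the three scenarios (CTR of 10%, 5 complaints per day, 0.5 arrivals per minute) are hypothetical choices for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    """P(exactly k events in an interval whose average event count is mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def exponential_cdf(t, rate):
    """P(waiting time until the next event <= t) for events at `rate` per unit time."""
    return 1 - math.exp(-rate * t)

# Hypothetical scenarios
p_3_clicks = binomial_pmf(3, n=20, p=0.1)       # 3 clicks from 20 impressions at 10% CTR
p_no_complaints = poisson_pmf(0, mu=5)          # zero complaints on a day that averages 5
p_arrival_soon = exponential_cdf(2, rate=0.5)   # next arrival within 2 minutes

print(f"P(3 clicks in 20 impressions) = {p_3_clicks:.4f}")
print(f"P(no complaints today)        = {p_no_complaints:.4f}")
print(f"P(arrival within 2 minutes)   = {p_arrival_soon:.4f}")
```

Library functions such as binom.pmf and poisson.pmf in scipy.stats evaluate the same formulas and are preferable for real work.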

Quick visualization of common distributions

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, binom, poisson, expon

x = np.linspace(0, 20, 1000)  # grid for the continuous distributions
k = np.arange(0, 21)          # integer outcomes 0..20 for the discrete distributions

plt.subplot(2, 2, 1)
plt.plot(x, norm.pdf(x, loc=10, scale=3))
plt.title("Normal")

plt.subplot(2, 2, 2)
plt.bar(k, binom.pmf(k, n=20, p=0.5))
plt.title("Binomial")

plt.subplot(2, 2, 3)
plt.bar(k, poisson.pmf(k, mu=5))
plt.title("Poisson")

plt.subplot(2, 2, 4)
plt.plot(x, expon.pdf(x, scale=5))
plt.title("Exponential")

plt.tight_layout()
plt.show()

7.4 Correlation, Regression & Confidence Intervals

Correlation measures the strength and direction of the linear relationship between two variables, on a scale from -1 (perfect negative) to +1 (perfect positive).

Python

# Continues from the 7.1 example: df, sns and plt are already defined

# Pearson correlation
print(df[['total_bill', 'tip']].corr())
#             total_bill       tip
# total_bill    1.000000  0.675734
# tip           0.675734  1.000000

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
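The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal by-hand sketch follows; pearson_r is a hypothetical helper name and the four bill/tip pairs are made-up data, not rows from the tips dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sum of co-deviations over the root of the
    product of squared deviations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(ss_x * ss_y)

# Hypothetical bills and tips
bills = [10, 20, 30, 40]
tips = [2.0, 3.5, 4.0, 6.5]

r = pearson_r(bills, tips)
print(f"Pearson r = {r:.4f}")
```

A value near +1 means the points lie close to an upward-sloping line. It says nothing about whether one variable causes the other.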

Simple Linear Regression

Python

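from sklearn.linear_model import LinearRegression

# Continues from the 7.1 example: df is the tips dataset
X = df[['total_bill']]  # feature matrix must be 2-D, hence the double brackets
y = df['tip']

model = LinearRegression()
model.fit(X, y)

print("Slope (β1):", model.coef_[0])
print("Intercept (β0):", model.intercept_)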
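For a single feature, ordinary least squares has a closed form: the slope is the sum of co-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below applies that formula to made-up data; fit_line is a hypothetical helper name, not part of scikit-learn.

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = co_dev / ss_x
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical bills and tips
bills = [10, 20, 30, 40]
tips = [2.0, 3.5, 4.0, 6.5]

slope, intercept = fit_line(bills, tips)
predicted_tip = intercept + slope * 25  # prediction for a bill of 25

print(f"slope={slope:.3f}, intercept={intercept:.3f}, predicted tip={predicted_tip:.2f}")
```

The slope reads as "expected change in tip per one-unit increase in the bill", which is also how model.coef_[0] is interpreted.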

Confidence Intervals

Python

from scipy import stats

# 95% confidence interval for mean tip
mean_tip = df['tip'].mean()
ci = stats.t.interval(0.95, len(df['tip'])-1, loc=mean_tip, scale=stats.sem(df['tip']))
print(f"95% CI for mean tip: {ci}")

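Interpretation: "We are 95% confident that the true population mean tip lies between X and Y." Here "95% confident" describes the method: if the sampling were repeated many times, about 95% of the intervals built this way would contain the true mean. It is not a 95% probability that this one interval contains it.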
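The repeated-sampling meaning of "95% confident" can be checked by simulation. This sketch makes simplifying assumptions that are not in the example above: the population is normal with a known mean and standard deviation (both hypothetical), and it uses a z-interval with known standard deviation rather than the t-interval, so that only the standard library is needed.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

true_mean, true_sd = 3.0, 1.4  # hypothetical population of tip amounts
sample_size = 40
num_experiments = 2000

# Two-sided 95% critical value of the standard normal (about 1.96)
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * true_sd / sample_size ** 0.5

covered = 0
for _ in range(num_experiments):
    sample = [random.gauss(true_mean, true_sd) for _ in range(sample_size)]
    sample_mean = statistics.fmean(sample)
    # Does this experiment's interval contain the true mean?
    if sample_mean - half_width <= true_mean <= sample_mean + half_width:
        covered += 1

coverage = covered / num_experiments
print(f"Intervals containing the true mean: {coverage:.1%}")
```

Each individual interval either contains the true mean or it does not. The 95% figure is the long-run success rate of the procedure.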

Mini Summary Project – Full Statistical Analysis

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

df = sns.load_dataset("tips")

# 1. Summary stats
print(df['tip'].describe())

# 2. Hypothesis test: do smokers and non-smokers tip differently? (two-sided)
smoker_tip = df[df['smoker'] == 'Yes']['tip']
non_smoker_tip = df[df['smoker'] == 'No']['tip']
t_stat, p_val = stats.ttest_ind(smoker_tip, non_smoker_tip)
print(f"p-value: {p_val:.4f}")
if p_val < 0.05:
    print("Significant difference in tipping between smokers and non-smokers")
else:
    print("No significant difference detected at the 5% level")

# 3. Correlation & regression
sns.regplot(x='total_bill', y='tip', data=df)
plt.title("Tip vs Total Bill with Regression Line")
plt.show()

This completes the Statistics & Probability for Data Science section. You now have the statistical foundation behind common data science models and decisions.
