R Programming Mastery From Beginner to Advanced (Complete 2026 Guide)


TABLE OF CONTENTS

R Programming Mastery – From Beginner to Advanced (Complete 2026 Guide)
Hands-on Learning Path for Statistics, Data Analysis, Visualization & Machine Learning

  1. Introduction to R Programming
     1.1 What is R and Why Learn It in 2026?
     1.2 R vs Python – Quick Comparison for Data Science
     1.3 Who Should Learn R? (Students, Researchers, Statisticians, Analysts)
     1.4 Installing R & RStudio (2026 Recommended Setup)

  2. R Basics – Syntax & Core Concepts
     2.1 Variables, Data Types & Basic Operations
     2.2 Vectors, Lists, Matrices & Arrays
     2.3 Factors & Data Frames – The Heart of R
     2.4 Control Structures (if-else, for, while, apply family)
     2.5 Writing Your First R Script

  3. Data Import & Export
     3.1 Reading CSV, Excel, SPSS, SAS, Stata & JSON Files
     3.2 Working with Databases (SQL, BigQuery, etc.)
     3.3 Exporting Data – CSV, Excel, RDS, RData
     3.4 Handling Large Datasets Efficiently

  4. Data Manipulation with dplyr & tidyverse
     4.1 Introduction to tidyverse & Pipes (%>%)
     4.2 filter(), select(), arrange(), mutate(), summarise()
     4.3 group_by() + summarise() – Powerful Aggregations
     4.4 Joining Data (inner_join, left_join, full_join)
     4.5 tidyr – pivot_longer, pivot_wider, separate, unite

  5. Data Visualization with ggplot2
     5.1 ggplot2 Grammar of Graphics – Core Logic
     5.2 Scatter Plots, Line Charts, Bar Plots & Histograms
     5.3 Boxplots, Violin Plots & Density Plots
     5.4 Faceting, Themes & Publication-Ready Plots
     5.5 Advanced Visuals – Heatmaps, Correlation Plots, Marginal Plots

  6. Exploratory Data Analysis (EDA) in R
     6.1 Summary Statistics & Descriptive Analysis
     6.2 Handling Missing Values & Outliers
     6.3 Univariate, Bivariate & Multivariate EDA
     6.4 Automated EDA with DataExplorer / SmartEDA

  7. Statistical Analysis in R
     7.1 Descriptive vs Inferential Statistics
     7.2 Hypothesis Testing (t-test, ANOVA, Chi-square)
     7.3 Correlation & Linear Regression
     7.4 Logistic Regression & Generalized Linear Models
     7.5 Non-parametric Tests & Post-hoc Analysis

  8. Machine Learning with R
     8.1 Supervised Learning – Regression & Classification
     8.2 caret vs tidymodels – Two Main ML Frameworks
     8.3 Random Forest, XGBoost & Gradient Boosting in R
     8.4 Model Evaluation – Cross-validation, ROC-AUC, Confusion Matrix
     8.5 Unsupervised Learning – Clustering (k-means, hierarchical)

  9. Time Series Analysis & Forecasting
     9.1 Time Series Objects – ts, xts, zoo
     9.2 Decomposition – Trend, Seasonality, Remainder
     9.3 ARIMA & SARIMA Models
     9.4 Prophet & forecast Package
     9.5 Real-world Forecasting Project

  10. R Markdown & Reproducible Reports
     10.1 Creating Dynamic Reports with R Markdown
     10.2 Parameters, Tables, Figures & Citations
     10.3 Converting to HTML, PDF, Word
     10.4 Quarto – The Modern Replacement (2026 Standard)

  11. Real-World Projects & Portfolio Building
     11.1 Project 1: Exploratory Analysis & Dashboard (ggplot2 + flexdashboard)
     11.2 Project 2: Customer Churn Prediction (Classification)
     11.3 Project 3: Sales Forecasting (Time Series)
     11.4 Project 4: Sentiment Analysis on Reviews
     11.5 Creating a Professional Portfolio (GitHub + RPubs)

  12. Best Practices, Career Guidance & Next Steps
     12.1 Writing Clean, Reproducible & Production-Ready R Code
     12.2 R in Industry – Shiny Apps, R Packages, APIs
     12.3 Git & GitHub Workflow for R Users
     12.4 Top R Interview Questions & Answers
     12.5 Career Paths – Data Analyst, Biostatistician, Researcher, Data Scientist
     12.6 Recommended Books, Courses & Communities (2026 Updated)

1. Introduction to R Programming

Welcome to your journey into R Programming! This first section explains what R is, why it remains extremely relevant in 2026, how it compares to Python, who should learn it, and how to set up a powerful, modern R environment.

1.1 What is R and Why Learn It in 2026?

R is an open-source programming language and software environment specifically designed for statistical computing, data analysis, data visualization, and research.

Created in 1993 by Ross Ihaka and Robert Gentleman, R is now maintained by the R Foundation and a massive global community.

Why R is still powerful and relevant in 2026:

  • Unmatched statistical packages and cutting-edge methods (many statisticians and biostatisticians still prefer R)

  • Publication-quality graphics (ggplot2 is the gold standard in academia and journals)

  • Reproducible research (R Markdown → Quarto in 2026)

  • Huge ecosystem: tidyverse (dplyr, ggplot2, tidyr), Shiny (interactive apps), caret/tidymodels (ML), Bioconductor (bioinformatics)

  • Free, open-source, and cross-platform

  • Dominant in academia, pharma, clinical trials, government research, finance, and bioinformatics

R is not dying — it is evolving: Quarto, tidyverse, arrow, duckdb integration, faster engines (Posit), and strong community support.

1.2 R vs Python – Quick Comparison for Data Science

Both R and Python are excellent — choose based on your goal and domain.

Feature-by-feature, R (2026) vs Python (2026):

  • Primary Strength → R: statistics, advanced analytics, publication graphics | Python: general-purpose, ML/AI, production deployment | Choose R for stats/research, Python for ML/engineering

  • Data Visualization → R: ggplot2 (best-in-class, publication-ready) | Python: Matplotlib + Seaborn (good), Plotly (interactive) | Winner: R (ggplot2)

  • Statistical Modeling → R: extremely rich (thousands of packages) | Python: good (statsmodels, pingouin), but less depth | Winner: R

  • Machine Learning → R: caret, tidymodels, mlr3 (solid but smaller) | Python: scikit-learn, XGBoost, PyTorch, TensorFlow | Winner: Python

  • Reproducible Reports → R: R Markdown → Quarto (excellent) | Python: Jupyter + nbconvert (good) | Winner: R (Quarto)

  • Interactive Apps → R: Shiny (very strong) | Python: Streamlit, Dash, Panel | Winner: R (Shiny) for statistical apps

  • Speed & Big Data → R: improving fast (duckdb, arrow, data.table) | Python: Polars, PySpark, Dask | Winner: Python (slightly ahead)

  • Community & Job Market → R: strong in academia, pharma, research | Python: much larger overall, dominant in industry | Python for jobs, R for research

2026 verdict:

  • Choose R if you work in statistics, biostats, clinical research, academia, or publication-heavy fields.

  • Choose Python for machine learning, deep learning, big data, web apps, or broad industry roles.

  • Many professionals learn both — R for stats & visualization, Python for ML & deployment.

1.3 Who Should Learn R? (Students, Researchers, Statisticians, Analysts)

R is especially valuable for:

  • Students (Statistics, Biostatistics, Economics, Psychology, Social Sciences) → Learn R early — many university courses still use it heavily

  • Researchers (Academic, Clinical, Market Research) → ggplot2 + R Markdown/Quarto = perfect for papers, theses, reproducible reports

  • Statisticians & Biostatisticians → R has the deepest collection of statistical tests, mixed models, survival analysis, Bayesian methods

  • Data Analysts in pharma, healthcare, finance, government → R excels at regulatory-compliant reporting and advanced analytics

  • Professionals transitioning from SPSS/Stata/SAS → R is free and more modern

Who can skip R (or learn later)?

  • Pure ML engineers (deep learning, computer vision)

  • Web/full-stack developers

  • Big data engineers (Spark, Hadoop)

1.4 Installing R & RStudio (2026 Recommended Setup)

Step-by-step modern setup (2026 best practice):

  1. Install R (base language) → Go to https://cran.r-project.org → Download latest version (R 4.4.x or 4.5.x in 2026) for your OS

  2. Install RStudio Desktop (best IDE) → https://posit.co/download/rstudio-desktop/ → the free Open Source Edition is all you need

  3. Recommended: Configure Posit Public Package Manager (formerly RSPM) as your package repository → pre-built binaries install much faster, especially on corporate/university networks

  4. Create a project & set working directory

    • Open RStudio → File → New Project → New Directory → New Project

    • This keeps everything organized

  5. Install essential packages (run in R console)

    R

    install.packages(c(
      "tidyverse",  # core: dplyr, ggplot2, tidyr, readr, etc.
      "rmarkdown",  # reports
      "quarto",     # modern publishing (2026 standard)
      "here",       # easy file paths
      "janitor",    # clean_names()
      "skimr",      # quick EDA
      "esquisse"    # drag-and-drop ggplot2
    ))

  6. Recommended VS Code alternative (for power users)

    • Install VS Code + R Extension (by REditorSupport)

    • Use radian (better R console) → pip install radian

Quick test – Run this in R console

R

library(tidyverse)

ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(title = "Horsepower vs MPG")

You should see a beautiful scatter plot with regression line.

This completes the full Introduction to R Programming section — your perfect starting point for the entire R tutorial!

2. R Basics – Syntax & Core Concepts

This section covers the foundational elements of R programming. Once you master these, you'll be able to read, write, and understand most R code used in data analysis, statistics, and visualization.

2.1 Variables, Data Types & Basic Operations

In R, you create variables using <- (traditional) or = (also accepted).

Basic data types in R

R

# Numeric (double by default)
x <- 42    # double, even without a decimal point
y <- 3.14  # double (floating point)
z <- 1L    # explicit integer (L suffix)

# Character (strings)
name <- "Anshuman"
city <- 'Ranchi'

# Logical (boolean)
is_student <- TRUE
has_degree <- FALSE

# Check type
class(x)       # "numeric"
typeof(y)      # "double"
is.numeric(z)  # TRUE

Basic operations

R

a <- 10
b <- 3

print(a + b)    # 13
print(a - b)    # 7
print(a * b)    # 30
print(a / b)    # 3.333333
print(a %/% b)  # integer division → 3
print(a %% b)   # modulo → 1
print(a ^ b)    # power → 1000

Logical operations

R

print(a > b)         # TRUE
print(a == 10)       # TRUE
print(a != 5)        # TRUE
print(!TRUE)         # FALSE
print(TRUE & FALSE)  # FALSE (AND)
print(TRUE | FALSE)  # TRUE (OR)

Tip: Use <- for assignment (community standard). Avoid = except in function arguments.

2.2 Vectors, Lists, Matrices & Arrays

Vectors – The most basic and most used data structure in R (1D, atomic)

R

# Create vectors
v1 <- c(1, 2, 3, 4, 5)
v2 <- 10:20               # sequence 10 to 20
v3 <- seq(0, 10, by = 2)  # 0, 2, 4, ..., 10

# Operations are vectorized (no loops needed!)
print(v1 * 2)    # 2 4 6 8 10
print(v1 + 100)  # 101 102 103 104 105

# Indexing (starts from 1!)
print(v1[3])           # 3
print(v1[c(1, 3, 5)])  # 1 3 5
print(v1[-2])          # exclude 2nd element → 1 3 4 5

Lists – Can hold mixed types (most flexible)

R

my_list <- list(
  name = "Anshuman",
  age = 25,
  scores = c(85, 92, 78),
  passed = TRUE,
  details = list(city = "Ranchi", state = "Jharkhand")
)

print(my_list$name)          # "Anshuman"
print(my_list[[3]])          # scores vector
print(my_list$details$city)  # "Ranchi"

Matrices – 2D, homogeneous (same type)

R

m <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
print(m)
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    5    6    7    8
# [3,]    9   10   11   12

print(m[2, 3])  # 7
print(m[, 2])   # column 2 → 2 6 10

Arrays – Multi-dimensional (rarely used directly)

R

arr <- array(1:24, dim = c(2, 3, 4))  # 2 × 3 × 4 array
print(arr[1, 2, 3])                   # access one element

2.3 Factors & Data Frames – The Heart of R

Factors – Used for categorical data (levels)

R

gender <- factor(c("Male", "Female", "Male", "Other", "Female"))
print(gender)
# [1] Male   Female Male   Other  Female
# Levels: Female Male Other

levels(gender) <- c("F", "M", "O")  # rename levels (order: Female, Male, Other)
print(gender)

Data Frames – Rectangular table (like Excel or SQL table) – most important structure

R

# Create a data frame
students <- data.frame(
  name = c("Anshuman", "Priya", "Rahul", "Sneha"),
  age = c(25, 23, 24, 22),
  marks = c(92, 88, 85, 90),
  passed = c(TRUE, TRUE, TRUE, TRUE)
)

print(students)
#       name age marks passed
# 1 Anshuman  25    92   TRUE
# 2    Priya  23    88   TRUE
# 3    Rahul  24    85   TRUE
# 4    Sneha  22    90   TRUE

# Access
students$marks
students[1, ]                  # first row
students[, "age"]              # age column
students[students$age > 23, ]  # filter rows

Important: Data frames are lists of vectors (columns) — every column must have the same length.
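A quick way to see this list-of-columns structure in action (a small sketch with a two-column data frame):

```r
students <- data.frame(
  name = c("Anshuman", "Priya"),
  age  = c(25, 23)
)

is.list(students)         # TRUE: a data frame is a list under the hood
length(students)          # 2: one list element per column
sapply(students, length)  # every column has the same length (2)

# Column access is just list access
identical(students$age, students[["age"]])  # TRUE
```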

2.4 Control Structures (if-else, for, while, apply family)

if-else

R

score <- 85

if (score >= 90) {
  print("A+")
} else if (score >= 80) {
  print("A")
} else {
  print("B or below")
}

for loop

R

for (i in 1:5) {
  print(i^2)
}

while loop

R

count <- 1
while (count <= 5) {
  print(count)
  count <- count + 1
}

apply family – Vectorized alternatives to loops (very important in R)

R

# apply (for matrices/arrays)
m <- matrix(1:12, nrow = 3)
apply(m, 1, sum)  # sum of each row

# lapply (returns a list)
lapply(students$marks, function(x) x + 5)

# sapply (simplifies to vector/matrix)
sapply(students$marks, function(x) x > 85)

# tapply (grouped apply)
tapply(students$marks, students$age > 23, mean)

Best practice: Avoid explicit for loops when possible — use apply, lapply, sapply, tapply, or tidyverse map_* functions.
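The map_* family from purrr (loaded with the tidyverse) is the tidyverse counterpart to lapply/sapply, with the return type fixed by the function name. A minimal sketch; the marks vector mirrors the students data above:

```r
library(purrr)

marks <- c(92, 88, 85, 90)

map(marks, ~ .x + 5)       # returns a list, like lapply()
map_dbl(marks, ~ .x + 5)   # numeric vector: 97 93 90 95
map_lgl(marks, ~ .x > 85)  # logical vector: TRUE TRUE FALSE TRUE

# Column-wise over a data frame (a data frame is a list of columns)
map_dbl(mtcars[, c("mpg", "hp")], mean)
```

Unlike sapply(), which guesses the output shape, the typed variants fail loudly if a result has the wrong type, which makes pipelines easier to debug.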

2.5 Writing Your First R Script

Create a new R script in RStudio File → New File → R Script

Example script – student_analysis.R

R

# student_analysis.R

# Load packages
library(tidyverse)

# Sample data
students <- data.frame(
  name = c("Anshuman", "Priya", "Rahul", "Sneha"),
  age = c(25, 23, 24, 22),
  marks = c(92, 88, 85, 90)
)

# Summary
summary(students)

# Visualization
p <- ggplot(students, aes(x = age, y = marks)) +
  geom_point(size = 4, color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Marks vs Age", x = "Age", y = "Marks") +
  theme_minimal()
print(p)  # explicit print so the plot also shows when the script is sourced

# Save plot
ggsave("marks_vs_age.png", plot = p, width = 8, height = 6, dpi = 300)

# Save results
write.csv(students, "students_data.csv", row.names = FALSE)

print("Analysis complete!")

Run script

  • Press Ctrl + Enter (line by line)

  • Or Source entire script (Ctrl + Shift + S)

This completes the full R Basics – Syntax & Core Concepts section — now you have the strong foundation to write real R code!

3. Data Import & Export

In real-world data analysis with R, most of your time is spent getting data in and out of R efficiently and correctly. R has excellent support for almost every common data format used in statistics, research, business, and academia.

3.1 Reading CSV, Excel, SPSS, SAS, Stata & JSON Files

CSV (Comma-Separated Values) – Most common format

R

# Basic read (base R)
df <- read.csv("sales_data.csv")

# Recommended modern way (faster, better type handling)
library(readr)
df <- read_csv("sales_data.csv", show_col_types = FALSE)

# Useful options
df <- read_csv(
  "data.csv",
  col_types = cols(
    date = col_date("%Y-%m-%d"),
    price = col_double(),
    category = col_factor()
  ),
  na = c("", "NA", "missing"),
  skip = 2  # skip first 2 rows
)

Excel (.xlsx / .xls)

R

# Recommended package
library(readxl)

df <- read_excel("report.xlsx", sheet = "Sales", skip = 1)

# or read a specific cell range
df <- read_excel("report.xlsx", range = "B2:F100")

SPSS (.sav), SAS (.sas7bdat), Stata (.dta) – Very common in research

R

library(haven)

# SPSS
df_spss <- read_sav("survey.sav")

# SAS
df_sas <- read_sas("clinical.sas7bdat")

# Stata
df_stata <- read_dta("economics.dta")

# All preserve value labels, formats, etc.

JSON (JavaScript Object Notation)

R

library(jsonlite)

df_json <- fromJSON("data.json", flatten = TRUE)

# or read from a URL
df_api <- fromJSON("https://api.example.com/data")

Tip (2026 best practice): Always use readr::read_csv() or readxl::read_excel() instead of base R functions — they are 5–10× faster and handle types better.

3.2 Working with Databases (SQL, BigQuery, etc.)

Connecting to SQL databases

R

# SQLite (local file database)
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "mydatabase.db")
df <- dbGetQuery(con, "SELECT * FROM customers WHERE age > 30")
dbDisconnect(con)

PostgreSQL / MySQL / MariaDB

R

library(DBI)
library(RPostgres)  # or RMariaDB for MySQL/MariaDB

con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "sales_db",
  host = "localhost",
  port = 5432,
  user = "user",
  password = Sys.getenv("DB_PASSWORD")
)

df <- dbGetQuery(con, "SELECT * FROM orders LIMIT 1000")

Google BigQuery (cloud)

R

library(bigrquery)

# Authenticate once
bq_auth()

project <- "my-project-id"

df <- bq_project_query(
  project,
  query = "SELECT * FROM `sales_data.2025_transactions` LIMIT 1000"
) %>%
  bq_table_download()

Best practice:

  • Never hardcode passwords → use Sys.getenv() or .Renviron file

  • Use DBI + backend package (standard interface)
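A minimal sketch of the .Renviron approach (DB_PASSWORD is an example variable name, not a fixed convention):

```r
# 1. In your home directory or project root, create a plain-text file
#    named .Renviron containing one line:
#
#      DB_PASSWORD=my_secret_password
#
# 2. Restart R, then read the value at runtime; it never appears in code:
password <- Sys.getenv("DB_PASSWORD")

# Sys.getenv() returns "" (not an error) when a variable is unset,
# so it is worth checking before connecting:
if (!nzchar(password)) {
  message("DB_PASSWORD is not set - add it to .Renviron and restart R")
}
```

Keep .Renviron out of version control (add it to .gitignore) so the secret never lands in a repository.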

3.3 Exporting Data – CSV, Excel, RDS, RData

CSV

R

write.csv(df, "cleaned_data.csv", row.names = FALSE)

# Faster & better (readr):
write_csv(df, "cleaned_data.csv")

Excel

R

library(openxlsx)

write.xlsx(df, "report.xlsx", sheetName = "Analysis", rowNames = FALSE)

RDS (single R object – recommended for saving models/data frames)

R

saveRDS(df, "processed_data.rds")
df_loaded <- readRDS("processed_data.rds")

RData / .rda (multiple objects)

R

save(df, model, file = "session_data.rda")
load("session_data.rda")

Quick rule (2026):

  • Use CSV for sharing with non-R users

  • Use RDS for saving R objects (preserves types, factors, dates)

  • Use RData when saving multiple objects together
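A small self-contained demonstration of why RDS is preferred for R objects (temporary file paths stand in for real ones):

```r
df <- data.frame(
  grade = factor(c("A", "B", "A")),
  when  = as.Date(c("2025-01-01", "2025-02-01", "2025-03-01"))
)

# Round-trip through CSV: factor and Date come back as plain character
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
df_csv <- read.csv(csv_path)
class(df_csv$grade)  # "character"

# Round-trip through RDS: types survive exactly
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
df_rds <- readRDS(rds_path)
class(df_rds$grade)  # "factor"
class(df_rds$when)   # "Date"
```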

3.4 Handling Large Datasets Efficiently

R can struggle when data approaches or exceeds available RAM. Modern tooling (2026) makes working with multi-gigabyte files straightforward.

data.table – Fast alternative to data.frame

R

library(data.table)

dt <- fread("very_large_file.csv")  # much faster than read.csv

# Compact syntax: dt[rows, columns, by]
dt[age > 30, .(mean_salary = mean(salary)), by = city]

arrow + duckdb – Work with data larger than RAM

R

library(DBI)  # needed for dbConnect()/dbGetQuery()
library(arrow)
library(duckdb)

# Read Parquet (columnar, compressed format)
df <- read_parquet("large_data.parquet")

# Use duckdb to run SQL on large files without loading them fully
con <- dbConnect(duckdb())
df <- dbGetQuery(con, "
  SELECT * FROM 'large_data.parquet'
  WHERE sales > 100000
  LIMIT 1000
")
dbDisconnect(con)

Best practices for big data in R (2026)

  • Use Parquet format instead of CSV (faster, smaller)

  • Prefer data.table or Polars (R package) for in-memory speed

  • Use duckdb or arrow for querying files larger than memory

  • Avoid read.csv() on large files → use fread() or read_csv()

  • Sample data first for EDA: df_sample <- head(df, 10000)
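The arrow route deserves a sketch of its own: open_dataset() registers a file lazily, and dplyr verbs are pushed down to the scan, so only the final result enters memory. The tiny data frame and tempfile() below are stand-ins for a genuinely large Parquet file:

```r
library(arrow)
library(dplyr)

# Stand-in for a large file: write a small Parquet to a temp path
pq <- tempfile(fileext = ".parquet")
write_parquet(
  data.frame(region = c("North", "South", "North"),
             sales  = c(150000, 50000, 200000)),
  pq
)

# open_dataset() reads metadata only; nothing is loaded yet
ds <- open_dataset(pq)

result <- ds %>%
  filter(sales > 100000) %>%       # pushed down to the file scan
  group_by(region) %>%
  summarise(total = sum(sales)) %>%
  collect()                        # data enters RAM only here

result  # one row: North, total = 350000
```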

Mini Summary Project – Import, Clean & Export Pipeline

R

library(tidyverse)
library(haven)
library(janitor)  # clean_names()
library(skimr)    # skim()
library(arrow)    # write_parquet()

# 1. Import SPSS file
df_raw <- read_sav("survey_data.sav")

# 2. Clean & transform
df_clean <- df_raw %>%
  clean_names() %>%
  filter(age >= 18 & age <= 65) %>%
  mutate(
    income_k = income / 1000,
    income_log = log1p(income)
  ) %>%
  select(id, age, gender, income_k, income_log, everything())

# 3. Quick summary
skim(df_clean)

# 4. Export
write_csv(df_clean, "cleaned_survey.csv")
saveRDS(df_clean, "cleaned_survey.rds")
write_parquet(df_clean, "cleaned_survey.parquet")

This completes the full Data Import & Export section — now you can confidently bring any kind of data into R, clean it, and save it efficiently!

4. Data Manipulation with dplyr & tidyverse

The tidyverse is a collection of modern R packages designed for data science. The most important one for data manipulation is dplyr — it provides a consistent, readable, and fast grammar for working with data frames.

Core tidyverse packages used here:

  • dplyr – data manipulation

  • tidyr – reshaping data

  • magrittr / pipe – %>% operator

  • readr – fast data import (already covered)

Install tidyverse (once)

R

install.packages("tidyverse")

Load it (always start with this)

R

library(tidyverse)

4.1 Introduction to tidyverse & Pipes (%>%)

The pipe operator %>% (pronounced "then") makes code read like natural language: "Take this data → then do this → then do that".

Without pipe (classic R style)

R

mean(filter(students, age > 20)$marks)

With pipe (tidyverse style – much clearer)

R

students %>%
  filter(age > 20) %>%
  summarise(mean_marks = mean(marks))

Key benefits of piping:

  • Code reads from left to right (natural flow)

  • No need to create temporary variables

  • Easier to debug (run line by line)

  • Chain many operations cleanly
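As an aside, since R 4.1 base R also ships a native pipe, |>, which covers simple chains without loading any package. A minimal sketch, reusing the students data from Section 2:

```r
library(dplyr)

students <- data.frame(
  name  = c("Anshuman", "Priya", "Rahul", "Sneha"),
  age   = c(25, 23, 24, 22),
  marks = c(92, 88, 85, 90)
)

# magrittr pipe (tidyverse)
students %>% filter(age > 23) %>% summarise(mean_marks = mean(marks))

# base R native pipe (R >= 4.1), same result: mean_marks = 88.5
students |> filter(age > 23) |> summarise(mean_marks = mean(marks))
```

The magrittr pipe has extras (such as the . placeholder) that |> handles differently, so %>% remains the convention in most tidyverse material.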

4.2 filter(), select(), arrange(), mutate(), summarise()

These are the five core dplyr verbs — learn them well and you can do 80% of data manipulation.

filter() – Keep rows matching condition

R

students %>% filter(age > 23 & marks >= 90)

select() – Choose columns (by name or position)

R

students %>% select(name, marks)       # keep only these columns
students %>% select(-age)              # drop age
students %>% select(starts_with("m"))  # columns starting with "m"

arrange() – Sort rows

R

students %>% arrange(desc(marks))       # highest marks first
students %>% arrange(age, desc(marks))  # age ascending, then marks descending

mutate() – Create or modify columns

R

students %>%
  mutate(
    percentage = marks / 100,
    grade = case_when(
      marks >= 90 ~ "A+",
      marks >= 80 ~ "A",
      TRUE ~ "B"
    )
  )

summarise() – Collapse data into single row (usually with group_by)

R

students %>%
  summarise(
    avg_marks = mean(marks),
    max_age = max(age),
    total_students = n()
  )

4.3 group_by() + summarise() – Powerful Aggregations

group_by() splits data into groups → summarise() computes per group.

Examples

R

# Average marks by gender (assuming students has a gender column)
students %>%
  group_by(gender) %>%
  summarise(
    avg_marks = mean(marks),
    count = n(),
    highest = max(marks)
  )

# Multiple grouping variables
sales %>%
  group_by(region, product) %>%
  summarise(
    total_sales = sum(sales_amount),
    avg_price = mean(price),
    .groups = "drop"  # removes grouping for the next step
  )

Tip: Always use .groups = "drop" in modern code to avoid unexpected behavior.

4.4 Joining Data (inner_join, left_join, full_join)

Joining combines two data frames based on common columns.

Common join types

  • inner_join – only matching rows

  • left_join – keep all rows from left table

  • right_join – keep all rows from right table

  • full_join – keep all rows from both

Example

R

students <- data.frame(
  id = 1:4,
  name = c("Anshuman", "Priya", "Rahul", "Sneha"),
  marks = c(92, 88, 85, 90)
)

scores <- data.frame(
  id = c(1, 2, 5),
  subject = c("Math", "Science", "Physics"),
  score = c(95, 90, 82)
)

# Left join – keep all students, add scores where available
left_join(students, scores, by = "id")

# Inner join – only students that have scores
inner_join(students, scores, by = "id")

Multiple keys / different column names

R

left_join(students, scores, by = c("id" = "student_id"))

4.5 tidyr – pivot_longer, pivot_wider, separate, unite

tidyr helps reshape data from wide to long format (and vice versa) — very common in data preparation.

pivot_longer – make wide data long (tidy format)

R

# Wide format
wide <- data.frame(
  id = 1:3,
  math = c(85, 90, 78),
  science = c(92, 88, 95),
  english = c(80, 82, 87)
)

# To long (tidy) format
long <- wide %>%
  pivot_longer(
    cols = math:english,
    names_to = "subject",
    values_to = "score"
  )

print(long)
# id subject score
#  1 math       85
#  1 science    92
# ...

pivot_wider – opposite (long to wide)

R

long %>% pivot_wider(names_from = subject, values_from = score)

separate() & unite()

R

df <- data.frame(
  id = 1:3,
  name_age = c("Anshuman_25", "Priya_23", "Rahul_24")
)

df_split <- df %>%
  separate(name_age, into = c("name", "age"), sep = "_") %>%
  mutate(age = as.integer(age))

# Opposite: combine the separated columns back together
df_split %>% unite("full_info", name, age, sep = " - ")

Mini Summary Project – Full Data Manipulation Pipeline

R

library(tidyverse)

# Sample messy data
sales_raw <- data.frame(
  region = c("North", "South", "East", "West"),
  Q1_2025 = c(12000, 15000, 9000, 18000),
  Q2_2025 = c(14000, 16000, 11000, 20000)
)

sales_raw %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "quarter",
    values_to = "sales"
  ) %>%
  separate(quarter, into = c("quarter", "year"), sep = "_") %>%
  mutate(sales_in_lakhs = sales / 100000) %>%
  group_by(region) %>%
  summarise(
    total_sales = sum(sales),
    avg_quarterly = mean(sales),
    best_quarter = max(sales)
  ) %>%
  arrange(desc(total_sales))

This completes the full Data Manipulation with dplyr & tidyverse section — now you can clean, transform, reshape, and summarize data like a pro in R!

5. Data Visualization with ggplot2

ggplot2 is the gold-standard visualization package in R — and one of the best statistical visualization systems in any language. It is based on the Grammar of Graphics by Leland Wilkinson, which breaks plots into logical layers.

Install & Load (if not already in tidyverse)

R

# install.packages("ggplot2")  # not needed if tidyverse is installed
library(ggplot2)

5.1 ggplot2 Grammar of Graphics – Core Logic

Every ggplot follows this structure:

R

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  geom_<TYPE>() +
  labs(...) +
  theme(...)

Key components:

  • data → the dataset (usually a data frame)

  • aes() → aesthetics: map variables to visual properties (x, y, color, size, fill, shape…)

  • geom_ → geometric objects: points, lines, bars, histograms, etc.

  • labs() → titles, axis labels, caption

  • theme() → appearance (fonts, colors, grid, background)

Basic template

R

ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Car Weight vs MPG",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

5.2 Scatter Plots, Line Charts, Bar Plots & Histograms

Scatter Plot

R

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 4, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Weight vs MPG by Cylinders",
    color = "Number of Cylinders"
  ) +
  theme_bw()

Line Chart (time series or trend)

R

library(gapminder)

gapminder %>%
  filter(country %in% c("India", "China", "United States")) %>%
  ggplot(aes(x = year, y = lifeExp, color = country)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines (ggplot2 3.4+)
  geom_point(size = 3) +
  labs(
    title = "Life Expectancy Over Time",
    x = "Year",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

Bar Plot

R

ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar() +
  labs(title = "Diamond Cuts Distribution", x = "Cut Quality", y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme_light()

Histogram

R

ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black") +
  labs(title = "Price Distribution of Diamonds", x = "Price (USD)", y = "Frequency") +
  theme_classic()

5.3 Boxplots, Violin Plots & Density Plots

Boxplot

R

# The tips dataset ships with the reshape2 package
library(reshape2)
data(tips)

ggplot(tips, aes(x = day, y = total_bill, fill = day)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 3) +
  labs(title = "Total Bill by Day", x = "Day", y = "Total Bill (USD)") +
  theme_minimal()

Violin Plot (shows density + boxplot)

R

ggplot(tips, aes(x = day, y = tip, fill = sex)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "white") +
  labs(title = "Tip Distribution by Day and Gender") +
  theme_light()

Density Plot

R

ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_density(alpha = 0.6) +
  labs(title = "Price Density by Cut Quality") +
  theme_minimal()

5.4 Faceting, Themes & Publication-Ready Plots

Faceting – Split plots by category

R

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class), size = 3, alpha = 0.7) +
  geom_smooth(method = "loess", color = "red") +
  facet_wrap(~ class, scales = "free_y") +
  labs(title = "Engine Displacement vs Highway MPG by Vehicle Class") +
  theme_minimal(base_size = 14)

Popular themes

R

theme_minimal()  # clean & modern
theme_bw()       # black & white
theme_classic()  # minimal lines
theme_light()    # light background
theme_dark()     # dark mode

Publication-ready plot template

R

p <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.5, size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Diamond Price vs Carat by Clarity",
    x = "Carat",
    y = "Price (USD)"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(face = "bold"),
    legend.position = "top"
  )

# Save at high resolution
ggsave("diamond_plot.png", plot = p, width = 10, height = 7, dpi = 300)

5.5 Advanced Visuals – Heatmaps, Correlation Plots, Marginal Plots

Heatmap (Correlation)

R

library(reshape2)  # for melt()

corr <- cor(mtcars)

ggplot(melt(corr), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), color = "black") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  labs(title = "Correlation Heatmap – mtcars Dataset") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Marginal Plots (joint distribution)

R

library(ggExtra)

p <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl)), size = 3) +
  theme_minimal()

ggMarginal(p, type = "histogram", fill = "skyblue", color = "black")

Advanced Correlation Plot

R

library(GGally)

# cyl must be included in the data for the color mapping to work
ggpairs(
  mtcars[, c("mpg", "hp", "wt", "qsec", "cyl")],
  columns = 1:4,
  aes(color = factor(cyl)),
  upper = list(continuous = "cor")
)

This completes the full Data Visualization with ggplot2 section — now you can create beautiful, insightful, and publication-ready visualizations in R!

6. Exploratory Data Analysis (EDA) in R

Exploratory Data Analysis (EDA) is the process of investigating a dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions — before building any model. In R, EDA is extremely powerful thanks to tidyverse, ggplot2, and specialized packages.

Core goals of EDA

  • Understand data structure & quality

  • Identify missing values, outliers, errors

  • Discover relationships between variables

  • Detect patterns (trend, seasonality, clusters)

  • Guide feature engineering and modeling decisions

6.1 Summary Statistics & Descriptive Analysis

Start every EDA with a quick overview of the data.

Basic summary functions

R

library(tidyverse)

# Load example dataset
data("mtcars")

# Quick overview
glimpse(mtcars)      # structure & types
summary(mtcars)      # min, max, mean, median, quartiles
skimr::skim(mtcars)  # very detailed summary (install skimr first)

Custom summary by group

R

mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg, na.rm = TRUE),
    median_hp = median(hp),
    sd_wt = sd(wt),
    n = n()
  ) %>%
  arrange(desc(avg_mpg))

Best practice: Always use na.rm = TRUE and check for missing values first.
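The difference na.rm makes, in one line each:

```r
x <- c(1, 2, NA, 4)

mean(x)                # NA: a single NA poisons the result
mean(x, na.rm = TRUE)  # 2.333333: NAs dropped before averaging
sum(is.na(x))          # 1: always count the missing values first
```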

6.2 Handling Missing Values & Outliers

Detect missing values

R

# Count missing values per column
colSums(is.na(airquality))

# Percentage missing
colMeans(is.na(airquality)) * 100

# Visual overview (install naniar)
library(naniar)
vis_miss(airquality)
gg_miss_var(airquality)

Handling missing values

R

# 1. Drop rows with missing values
airquality_complete <- airquality %>% drop_na()
airquality %>% drop_na(Ozone)  # drop only rows where Ozone is missing

# 2. Impute with mean/median
airquality %>%
  mutate(Ozone = if_else(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))

# 3. Last observation carried forward (time series)
library(zoo)
airquality$Ozone <- na.locf(airquality$Ozone, na.rm = FALSE)

# 4. Advanced imputation: see the missForest and mice packages

Detecting & handling outliers

R

# Boxplot visual
ggplot(airquality, aes(y = Ozone)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Ozone Outliers")

# IQR method (lowercase iqr avoids masking stats::IQR())
Q1 <- quantile(airquality$Ozone, 0.25, na.rm = TRUE)
Q3 <- quantile(airquality$Ozone, 0.75, na.rm = TRUE)
iqr <- Q3 - Q1
lower <- Q1 - 1.5 * iqr
upper <- Q3 + 1.5 * iqr

# Flag outliers
airquality <- airquality %>%
  mutate(ozone_outlier = Ozone < lower | Ozone > upper)

# Winsorize (cap) outliers
airquality$Ozone_winsor <- pmin(pmax(airquality$Ozone, lower), upper)

Tip: Never blindly remove outliers — investigate first (measurement error? interesting case?).

6.3 Univariate, Bivariate & Multivariate EDA

Univariate (one variable)

R

# Categorical
ggplot(diamonds, aes(x = cut)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Diamond Cut Distribution")

# Numerical
ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 50, fill = "coral") +
  labs(title = "Price Distribution")

# Density + boxplot overlay
ggplot(diamonds, aes(x = price)) +
  geom_density(fill = "lightgreen", alpha = 0.5) +
  geom_boxplot(width = 0.1, fill = "white") +
  labs(title = "Price Density & Boxplot")

Bivariate (two variables)

R

# Numeric vs numeric
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)), size = 3) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Weight vs MPG by Cylinders")

# Categorical vs numeric (tips data from the reshape2 package)
ggplot(tips, aes(x = day, y = total_bill, fill = day)) +
  geom_boxplot() +
  labs(title = "Total Bill by Day")

# Categorical vs categorical
ggplot(tips, aes(x = sex, fill = smoker)) +
  geom_bar(position = "fill") +
  labs(title = "Smoking Status by Gender (proportions)")

Multivariate (three or more variables)

R

# Pair plot (smoker must be included in the subset to color by it)
GGally::ggpairs(tips[, c("total_bill", "tip", "size", "smoker")],
                aes(color = smoker))

# Faceted plot
ggplot(tips, aes(x = total_bill, y = tip)) +
  geom_point(aes(color = sex)) +
  facet_grid(time ~ day) +
  labs(title = "Tip vs Bill by Time & Day")

6.4 Automated EDA with DataExplorer / SmartEDA

Manual EDA takes time — automated tools generate full reports instantly.

DataExplorer (very popular)

R

# install.packages("DataExplorer")
library(DataExplorer)

# Generate a full EDA report (HTML)
create_report(airquality, output_file = "airquality_eda_report.html")

# Quick plots
plot_intro(airquality)
plot_missing(airquality)
plot_histogram(airquality)
plot_correlation(airquality)
plot_boxplot(airquality, by = "Month")

SmartEDA (alternative)

R

# install.packages("SmartEDA")
library(SmartEDA)

# Full EDA report (set Target to your outcome variable if you have one)
ExpReport(airquality, Target = NULL, op_file = "smarteda_report.html")

When to use automated EDA

  • First look at new dataset

  • Quick report for team/stakeholders

  • Identify issues before deep manual analysis

Mini Summary Project – Full EDA on Titanic Dataset

R

library(tidyverse)
library(DataExplorer)

df <- titanic::titanic_train

# 1. Quick overview
glimpse(df)
create_report(df, output_file = "titanic_eda.html")

# 2. Key manual plots
ggplot(df, aes(x = Age, fill = factor(Survived))) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  labs(title = "Age Distribution by Survival")

ggplot(df, aes(x = Pclass, fill = factor(Survived))) +
  geom_bar(position = "fill") +
  labs(title = "Survival Rate by Passenger Class")

# 3. Correlation heatmap (numeric columns only)
df_numeric <- df %>% select(where(is.numeric))
corr <- cor(df_numeric, use = "pairwise.complete.obs")
corrplot::corrplot(corr, method = "color", type = "upper", tl.cex = 0.8)

This completes the full Exploratory Data Analysis (EDA) in R section — now you can deeply understand any dataset before modeling or reporting!

7. Statistical Analysis in R

R was originally created for statistics — it remains one of the most powerful environments for statistical computing in 2026. This section covers the most important statistical techniques used in research, academia, pharma, finance, and data science.

7.1 Descriptive vs Inferential Statistics

Descriptive Statistics → Describe, summarize, and visualize the data you have (the sample).

Common functions in R:

  • summary(), mean(), median(), sd(), var(), min(), max(), quantile()

  • table(), prop.table() for categorical data

  • skimr::skim() for detailed overview

Inferential Statistics → Use sample data to make generalizations / predictions about the population.

Common goals:

  • Hypothesis testing (is there a real difference?)

  • Confidence intervals (range where true value likely lies)

  • Regression (model relationships)

Quick example comparison

R

# Descriptive
summary(airquality$Ozone)
mean(airquality$Ozone, na.rm = TRUE)
sd(airquality$Ozone, na.rm = TRUE)

# Inferential (covered in detail below)
t.test(airquality$Ozone ~ airquality$Month == 5)

7.2 Hypothesis Testing (t-test, ANOVA, Chi-square)

Hypothesis testing helps decide whether observed differences are statistically significant.

One-sample t-test (compare sample mean to known value)

R

t.test(na.omit(airquality$Ozone), mu = 30)
# p-value < 0.05 → reject null (true mean differs from 30)

Two-sample t-test (compare means of two groups)

R

t.test(Ozone ~ (Month == 5), data = airquality)
# Welch's t-test by default (does not assume equal variances)

Paired t-test (before-after, same subjects)

R

# `before` and `after` are numeric vectors of paired measurements
t.test(before, after, paired = TRUE)

ANOVA (compare means across 3+ groups)

R

anova_model <- aov(mpg ~ factor(cyl), data = mtcars)
summary(anova_model)

# Post-hoc test if significant
TukeyHSD(anova_model)

Chi-square test (categorical association)

R

tbl <- table(mtcars$cyl, mtcars$gear)
chisq.test(tbl)

Interpretation tip (2026):

  • p < 0.05 → statistically significant (evidence against null hypothesis)

  • p < 0.01 → very strong evidence

  • Always report effect size + confidence interval (p-value alone is incomplete)
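To act on that last tip, here is a minimal, illustrative sketch (not from the original text) that reports a confidence interval and a hand-computed Cohen's d alongside the t-test's p-value:

```r
# Two-sample comparison on mtcars: automatic vs manual transmission
res <- t.test(mpg ~ am, data = mtcars)
print(res$p.value)
print(res$conf.int)   # 95% CI for the difference in group means

# Cohen's d by hand (pooled standard deviation)
g0 <- mtcars$mpg[mtcars$am == 0]
g1 <- mtcars$mpg[mtcars$am == 1]
sd_pooled <- sqrt(((length(g0) - 1) * var(g0) + (length(g1) - 1) * var(g1)) /
                    (length(g0) + length(g1) - 2))
d <- (mean(g1) - mean(g0)) / sd_pooled
print(d)   # |d| around 0.8 or more is conventionally a large effect
```

Packages such as effectsize automate this, but the manual version makes the definition explicit.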

7.3 Correlation & Linear Regression

Correlation – measures linear relationship strength & direction

R

# Pearson correlation
cor(mtcars$mpg, mtcars$hp)   # -0.776 → strong negative

# Spearman (rank-based, robust to non-linearity)
cor(mtcars$mpg, mtcars$hp, method = "spearman")

# Correlation matrix
cor(mtcars[, c("mpg", "hp", "wt", "qsec")])
corrplot::corrplot(cor(mtcars), method = "color", type = "upper")

Simple Linear Regression

R

model <- lm(mpg ~ wt, data = mtcars)
summary(model)
# Look at: R-squared, p-values of coefficients, F-statistic

# Plot with confidence interval
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Linear Regression: MPG vs Weight")

Multiple Linear Regression

R

multi_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(multi_model)

7.4 Logistic Regression & Generalized Linear Models

Logistic Regression – for binary outcome (0/1, yes/no, success/failure)

R

# Titanic survival example
titanic <- titanic::titanic_train %>%
  mutate(Survived = factor(Survived))

log_model <- glm(Survived ~ Pclass + Sex + Age,
                 data = titanic,
                 family = binomial(link = "logit"))
summary(log_model)

# Odds ratios
exp(coef(log_model))

Generalized Linear Models (GLM)

  • family = gaussian → linear regression

  • family = binomial → logistic

  • family = poisson → count data
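As a hedged illustration of the Poisson case (not from the source), using the built-in warpbreaks count data:

```r
# Poisson GLM: number of warp breaks modeled from wool type and tension
pois_model <- glm(breaks ~ wool + tension,
                  data = warpbreaks,
                  family = poisson(link = "log"))
summary(pois_model)

# Coefficients are on the log scale — exponentiate for rate ratios
exp(coef(pois_model))
```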

7.5 Non-parametric Tests & Post-hoc Analysis

Non-parametric tests – used when the data violate the normality assumptions of parametric tests

Wilcoxon rank-sum test (non-parametric t-test)

R

wilcox.test(mpg ~ vs, data = mtcars) # vs = engine type

Kruskal-Wallis (non-parametric ANOVA)

R

kruskal.test(mpg ~ factor(cyl), data = mtcars)

Post-hoc (after significant Kruskal-Wallis)

R

# install.packages("dunn.test")
library(dunn.test)
dunn.test(mtcars$mpg, mtcars$cyl, method = "bonferroni")

Mini Summary Project – Full Statistical Workflow

R

library(tidyverse)

# Load data (replace with your own file; needs `group` and `outcome` columns)
df <- read_csv("your_data.csv")

# 1. Descriptive statistics by group
df %>%
  group_by(group) %>%
  summarise(mean = mean(outcome, na.rm = TRUE),
            sd   = sd(outcome, na.rm = TRUE),
            n    = n())

# 2. Visualization
ggplot(df, aes(x = group, y = outcome)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5)

# 3. Omnibus test
anova_result <- aov(outcome ~ group, data = df)
summary(anova_result)

# 4. Post-hoc if significant
TukeyHSD(anova_result)

This completes the full Statistical Analysis in R section — now you can perform professional-grade statistical tests and interpret results correctly!

8. Machine Learning with R

R has excellent support for machine learning — especially for statistical modeling, interpretable models, and research-oriented workflows. In 2026, two main frameworks dominate: caret (classic, still widely used) and tidymodels (modern, tidyverse-integrated, recommended for new projects).

8.1 Supervised Learning – Regression & Classification

Supervised learning = predict an outcome variable (label) from input features.

  • Regression → continuous target (price, temperature, sales)

  • Classification → categorical target (yes/no, spam/not-spam, 0/1/2)

Common algorithms in R

  • Linear / Logistic Regression

  • Decision Trees & Random Forest

  • Gradient Boosting (XGBoost, LightGBM, CatBoost)

  • Support Vector Machines

  • k-Nearest Neighbors
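In tidymodels (introduced next), each of these algorithms is declared through the same parsnip interface; a brief sketch, assuming the relevant engine packages (rpart, ranger, xgboost, kernlab, kknn) are installed:

```r
library(parsnip)

# One consistent grammar, many models/engines
linear_reg() %>% set_engine("lm")
logistic_reg() %>% set_engine("glm")
decision_tree() %>% set_engine("rpart") %>% set_mode("classification")
rand_forest(trees = 500) %>% set_engine("ranger") %>% set_mode("regression")
boost_tree() %>% set_engine("xgboost") %>% set_mode("regression")
svm_rbf() %>% set_engine("kernlab") %>% set_mode("classification")
nearest_neighbor(neighbors = 5) %>% set_engine("kknn") %>% set_mode("classification")
```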

8.2 caret vs tidymodels – Two Main ML Frameworks

| Feature | caret (older, still very popular) | tidymodels (modern, tidyverse-style) | Winner in 2026 |
|---|---|---|---|
| Syntax | Functional, base-R style | Consistent, pipe-friendly, tidyverse ecosystem | tidymodels |
| Preprocessing | Built-in preProcess() | recipes package (very powerful) | tidymodels |
| Model tuning | train() with grid | tune_grid(), tune_bayes() | tidymodels |
| Workflow | Manual steps | workflow() – combines recipe + model | tidymodels |
| Community momentum | Large legacy user base | Rapidly growing, Posit-supported | tidymodels |
| Learning curve | Moderate | Slightly steeper at first, then easier | — |

Recommendation (2026): Use tidymodels for all new work — it’s more readable, reproducible, and integrates perfectly with tidyverse. Learn caret only if maintaining legacy code.

Quick tidymodels example

R

library(tidymodels)

# Split data
set.seed(42)
split <- initial_split(iris, prop = 0.8, strata = Species)
train_data <- training(split)
test_data  <- testing(split)

# Recipe (preprocessing)
rec <- recipe(Species ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())

# Model
rf_model <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# Workflow
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_model)

# Fit
fit <- wf %>% fit(data = train_data)

# Predict & evaluate
predictions <- predict(fit, test_data)
accuracy <- accuracy_vec(test_data$Species, predictions$.pred_class)
print(accuracy)

8.3 Random Forest, XGBoost & Gradient Boosting in R

Random Forest (bagging ensemble – very robust)

R

# tidymodels way
rf_spec <- rand_forest(trees = tune(), min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger")

# Tune over a grid with 5-fold cross-validation
tune_res <- tune_grid(
  rf_spec,
  mpg ~ .,
  resamples = vfold_cv(mtcars, v = 5),
  grid = 10
)

best_params <- select_best(tune_res, metric = "rmse")
final_model <- finalize_model(rf_spec, best_params)

XGBoost (gradient boosting – often top performer)

R

# Install: install.packages("xgboost")
library(xgboost)

# Prepare data (matrix format required)
X <- as.matrix(mtcars[, -1])
y <- mtcars$mpg

xgb_model <- xgboost(
  data = X, label = y,
  nrounds = 100,
  objective = "reg:squarederror",
  eta = 0.1, max_depth = 6
)

# Prediction (on the training data here — use a hold-out set in practice)
pred <- predict(xgb_model, X)
rmse <- sqrt(mean((y - pred)^2))
print(rmse)

Gradient Boosting comparison (2026)

  • XGBoost → fastest, most accurate, GPU support

  • LightGBM → even faster on large data

  • CatBoost → best for categorical features out-of-the-box

8.4 Model Evaluation – Cross-validation, ROC-AUC, Confusion Matrix

Cross-validation in tidymodels

R

folds <- vfold_cv(train_data, v = 10, strata = Species)
metrics <- metric_set(accuracy, roc_auc)

res <- fit_resamples(
  wf,                 # the workflow (recipe + model) defined earlier
  resamples = folds,
  metrics = metrics
)
collect_metrics(res)

Confusion Matrix & ROC-AUC

R

# Classification example: bind predictions to the truth column first
results <- bind_cols(test_data, predictions)

confusion <- conf_mat(results, truth = Species, estimate = .pred_class)
autoplot(confusion, type = "heatmap")

# ROC-AUC (binary classification; y_test = true labels,
# prob_positive_class = predicted probability of the positive class)
roc_auc_vec(truth = y_test, estimate = prob_positive_class)

Regression metrics

  • RMSE, MAE, R² (rsq_trad)
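A small sketch (illustrative toy data, not from the source) of computing these with yardstick:

```r
library(yardstick)
library(tibble)

# Toy truth vs. predictions
results <- tibble(truth    = c(3.2, 4.1, 5.0, 6.3),
                  estimate = c(3.0, 4.5, 4.8, 6.0))

rmse(results, truth, estimate)      # root mean squared error
mae(results, truth, estimate)       # mean absolute error
rsq(results, truth, estimate)       # R² (squared correlation)
rsq_trad(results, truth, estimate)  # traditional R²
```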

8.5 Unsupervised Learning – Clustering (k-means, hierarchical)

K-Means Clustering

R

# Scale the data first!
data_scaled <- scale(mtcars[, c("mpg", "hp", "wt")])

# K-means
km <- kmeans(data_scaled, centers = 3, nstart = 25)

# Visualize
mtcars$cluster <- factor(km$cluster)
ggplot(mtcars, aes(x = mpg, y = hp, color = cluster)) +
  geom_point(size = 4) +
  labs(title = "K-Means Clustering – mtcars")

Hierarchical Clustering

R

dist_matrix <- dist(data_scaled, method = "euclidean")
hc <- hclust(dist_matrix, method = "complete")
plot(hc, main = "Hierarchical Clustering Dendrogram")
rect.hclust(hc, k = 3, border = "red")

Choosing k (number of clusters)

R

library(factoextra)

# Elbow method (within-cluster sum of squares)
fviz_nbclust(data_scaled, kmeans, method = "wss") +
  labs(title = "Elbow Method")

# Silhouette score
fviz_nbclust(data_scaled, kmeans, method = "silhouette")

Mini Summary Project – Customer Segmentation

R

library(tidyverse)

# Load sample customer data (or your own)
df <- read_csv("customer_data.csv")

# Preprocess
df_scaled <- df %>%
  select(age, annual_income, spending_score) %>%
  scale()

# K-means
k <- 5
km <- kmeans(df_scaled, centers = k, nstart = 25)
df$segment <- factor(km$cluster)

# Visualize
ggplot(df, aes(x = annual_income, y = spending_score, color = segment)) +
  geom_point(size = 4) +
  labs(title = "Customer Segments",
       subtitle = "Based on Income & Spending Score")

This completes the full Machine Learning with R section — now you can build, evaluate, and deploy real ML models in R!

9. Time Series Analysis & Forecasting

Time series data is any data collected over time at regular intervals (daily sales, monthly temperature, hourly stock prices, yearly population, etc.). Forecasting = predicting future values based on past patterns.

R has one of the strongest time series ecosystems — especially for classical statistical forecasting.

9.1 Time Series Objects – ts, xts, zoo

R has several classes for handling time series data.

ts – The classic base R time series class (regular frequency)

R

# Monthly data starting from Jan 2020
sales <- c(120, 135, 148, 162, 175, 190, 210, 225, 240, 255, 270, 290)
ts_sales <- ts(sales, start = c(2020, 1), frequency = 12)
print(ts_sales)
plot(ts_sales, main = "Monthly Sales", ylab = "Sales", xlab = "Time")

zoo – Irregular time series (very flexible)

R

library(zoo)
dates <- as.Date(c("2025-01-01", "2025-01-05", "2025-01-12", "2025-01-20"))
values <- c(100, 120, 115, 140)
z <- zoo(values, dates)
plot(z, main = "Irregular Time Series", ylab = "Value")

xts – Modern, high-performance extension of zoo (most recommended in 2026 for financial/time series)

R

library(xts)
dates <- seq(as.Date("2025-01-01"), by = "day", length.out = 30)
values <- cumsum(rnorm(30)) + 100
xts_data <- xts(values, order.by = dates)
plot(xts_data, main = "Daily Random Walk", ylab = "Value")

# Subsetting: slice by date range
xts_data["2025-01-10/2025-01-20"]

Quick recommendation (2026):

  • Use ts for simple, regular, monthly/quarterly data

  • Use xts for financial, daily, or irregular high-frequency data

  • Use zoo only if you need very old compatibility
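Conversion between the classes is straightforward; a small sketch (assuming xts is installed — it pulls in zoo):

```r
library(xts)   # loads zoo as a dependency

# ts → xts: the date index is reconstructed from the ts time attributes
ts_sales <- ts(c(120, 135, 148, 162), start = c(2020, 1), frequency = 12)
xts_sales <- as.xts(ts_sales)

# xts/zoo → plain values + date index
values <- coredata(xts_sales)
dates  <- index(xts_sales)
```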

9.2 Decomposition – Trend, Seasonality, Remainder

Decomposition breaks a time series into:

  • Trend – long-term direction

  • Seasonal – repeating pattern

  • Remainder (residual) – random noise

Classical decomposition (additive/multiplicative)

R

# AirPassengers dataset (built-in)
data("AirPassengers")
plot(AirPassengers)

# Additive decomposition
decomp_add <- decompose(AirPassengers, type = "additive")
plot(decomp_add)

# Multiplicative (better when variance grows with the level)
decomp_mult <- decompose(AirPassengers, type = "multiplicative")
plot(decomp_mult)

STL decomposition (more robust – handles changing seasonality)

R

# stl() ships with base R's stats package
stl_decomp <- stl(AirPassengers, s.window = "periodic")
plot(stl_decomp)

Seasonal decomposition with X-13-ARIMA-SEATS (very advanced, used in official statistics)

R

# install.packages("seasonal")
library(seasonal)
seas_decomp <- seas(AirPassengers)
plot(seas_decomp)

Key takeaway: Use STL for most modern work — it’s robust and handles most real-world series well.

9.3 ARIMA & SARIMA Models

ARIMA (AutoRegressive Integrated Moving Average) is the classic statistical forecasting model.

Components

  • AR(p) – autoregression (depends on past values)

  • I(d) – differencing (make stationary)

  • MA(q) – moving average (depends on past errors)

SARIMA adds seasonal components (P,D,Q,m)

Step-by-step ARIMA in R

R

library(forecast)
library(tseries)   # provides adf.test()

# 1. Check stationarity (ADF test)
adf.test(AirPassengers)        # p > 0.05 → non-stationary

# Difference once
diff_series <- diff(AirPassengers, differences = 1)
adf.test(diff_series)          # now stationary

# 2. Auto ARIMA (best automatic choice)
auto_model <- auto.arima(AirPassengers, seasonal = TRUE,
                         stepwise = FALSE, approximation = FALSE)
summary(auto_model)

# 3. Forecast 24 months ahead
fc <- forecast(auto_model, h = 24)
plot(fc, main = "ARIMA Forecast – Air Passengers")

Manual SARIMA (when you know parameters)

R

sarima_model <- Arima(AirPassengers,
                      order = c(0, 1, 1),
                      seasonal = list(order = c(0, 1, 1), period = 12))
checkresiduals(sarima_model)   # residuals should be white noise

fc_manual <- forecast(sarima_model, h = 12)
plot(fc_manual)

9.4 Prophet & forecast Package

Prophet (by Facebook/Meta) – Extremely easy and powerful for business time series

R

library(prophet)
library(zoo)

# Prepare data: prophet requires columns ds (date) and y (value)
df_prophet <- data.frame(
  ds = as.Date(as.yearmon(time(AirPassengers))),
  y  = as.numeric(AirPassengers)
)

# Fit model
m <- prophet(df_prophet, yearly.seasonality = TRUE,
             weekly.seasonality = FALSE)

# Future dataframe
future <- make_future_dataframe(m, periods = 24, freq = "month")

# Forecast
forecast_prophet <- predict(m, future)

# Plot
plot(m, forecast_prophet)
prophet_plot_components(m, forecast_prophet)

forecast package – Traditional, comprehensive, still very strong

R

library(forecast)
fit <- ets(AirPassengers)        # exponential smoothing state-space model
fc_ets <- forecast(fit, h = 24)
plot(fc_ets)
# Or auto.arima(), as shown earlier

When to choose (2026):

  • Prophet → Business forecasting, strong seasonality, holidays, missing data

  • ARIMA/SARIMA → Classical stats, high accuracy needed, research

  • ETS → Exponential smoothing (good for short-term)

9.5 Real-world Forecasting Project

Project: Monthly Sales Forecasting (End-to-End)

R

library(tidyverse)
library(forecast)
library(prophet)

# 1. Load & prepare (assume monthly_sales.csv has month_year and sales_amount)
sales <- read_csv("monthly_sales.csv") %>%
  mutate(ds = as.Date(paste0(month_year, "-01")),   # "%Y-%m" needs a day appended
         y  = sales_amount)

# 2. EDA
ggplot(sales, aes(x = ds, y = y)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(title = "Monthly Sales Trend", x = "Date", y = "Sales") +
  theme_minimal()

# 3. Prophet model
m <- prophet(sales, yearly.seasonality = TRUE)
future <- make_future_dataframe(m, periods = 12, freq = "month")
fc <- predict(m, future)
plot(m, fc)

# 4. ARIMA alternative
auto_fit <- auto.arima(sales$y, seasonal = TRUE)
fc_arima <- forecast(auto_fit, h = 12)
plot(fc_arima)

# 5. Compare & choose the best model
accuracy(fc_arima)   # RMSE, MAE, etc. (training-set fit; prefer a hold-out set)

Key Takeaways from Project:

  • Always visualize trend & seasonality first

  • Prophet is easiest for business users

  • ARIMA gives more control & statistical diagnostics

  • Evaluate with hold-out set or cross-validation (tsCV in forecast)
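The tsCV() evaluation mentioned in the last point can be sketched as follows (rolling-origin one-step-ahead errors; illustrative, and somewhat slow because a model is refit at every origin):

```r
library(forecast)

# Rolling-origin cross-validation: one-step-ahead errors at each origin
e_naive <- tsCV(AirPassengers, forecastfunction = naive, h = 1)
e_ets   <- tsCV(AirPassengers,
                forecastfunction = function(x, h) forecast(ets(x), h = h),
                h = 1)

# Compare out-of-sample RMSE
sqrt(mean(e_naive^2, na.rm = TRUE))
sqrt(mean(e_ets^2,   na.rm = TRUE))
```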

This completes the full Time Series Analysis & Forecasting section — now you can confidently analyze and predict time-based data in R!

10. R Markdown & Reproducible Reports

R Markdown is one of the most powerful features of R — it allows you to combine executable R code, narrative text, results, tables, figures, and references into a single document. The output can be HTML, PDF, Word, presentations, dashboards, websites, and more — all fully reproducible.

In 2026, Quarto has become the modern successor to R Markdown (recommended for new projects), but R Markdown is still widely used and supported.

10.1 Creating Dynamic Reports with R Markdown

Basic structure of an R Markdown (.Rmd) file

YAML

---
title: "My First Reproducible Report"
author: "Anshuman"
date: "March 2026"
output: html_document
---

# Introduction

This is a sample report using R Markdown.

## Summary Statistics

```{r summary, echo=TRUE}
summary(mtcars)
```

## Visualization

```{r viz}
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm") +
  labs(title = "Weight vs MPG by Cylinders")
```

How to create & render

  1. In RStudio: File → New File → R Markdown → choose output format → OK

  2. Write text in Markdown + code in `{r}` chunks

  3. Click the Knit button (or Ctrl+Shift+K) to render

Key chunk options (very useful)

```{r chunk-name, echo=FALSE, warning=FALSE, message=FALSE, fig.width=10, fig.height=6, eval=TRUE}
# code here
```

  • echo = FALSE → hide code, show only output

  • warning = FALSE / message = FALSE → hide warnings/messages

  • fig.width / fig.height → control figure size

  • eval = FALSE → don't run the chunk (useful for setup examples)

10.2 Parameters, Tables, Figures & Citations

Parameterized reports (run with different inputs)

YAML

---
title: "Sales Report"
params:
  region: "North"
  year: 2025
---

Use inside document:

R

# Sales for `r params$region` in `r params$year`

Beautiful tables

R

library(knitr)
library(kableExtra)

mtcars %>%
  head(10) %>%
  kbl(caption = "First 10 Rows of mtcars") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Figures with captions & references

R

```{r scatter-plot, fig.cap="Scatter plot of MPG vs Weight"}
ggplot(mtcars, aes(wt, mpg)) + geom_point()
```

Citations & bibliography

YAML

---
bibliography: references.bib
---

See @wickham2019 for more on the tidyverse.

references.bib file example:

text

@book{wickham2019,
  title     = {Advanced {R}},
  author    = {Wickham, Hadley},
  year      = {2019},
  publisher = {Chapman and Hall/CRC}
}

10.3 Converting to HTML, PDF, Word

HTML (default – interactive, fast)

YAML

output: html_document

PDF (professional, publication-ready)

YAML

output: pdf_document

Requires LaTeX (install TinyTeX: tinytex::install_tinytex())

Word (.docx)

YAML

output: word_document

Multiple formats at once

YAML

output:
  html_document: default
  pdf_document: default
  word_document: default

Custom themes & CSS

YAML

output:
  html_document:
    theme: cosmo        # or cerulean, journal, flatly, darkly, etc.
    highlight: tango
    css: styles.css

10.4 Quarto – The Modern Replacement (2026 Standard)

Quarto (released by Posit in 2022) is the next-generation successor to R Markdown. It supports R, Python, Julia, and Observable — one tool for all.

Key advantages over R Markdown (2026)

  • Unified syntax across languages

  • Better PDF output (native LaTeX control)

  • Built-in support for interactive plots, code folding, tabs, callouts

  • Freeze computation (cache results)

  • Websites, books, presentations, manuscripts from one source

Basic Quarto document (.qmd)

YAML

---
title: "My Quarto Report"
author: "Anshuman"
format:
  html: default
  pdf: default
execute:
  echo: true
  warning: false
---

# Introduction

This is a Quarto document.

## Summary

```{r}
summary(mtcars)
```

## Plot

```{r}
#| label: mpg-plot
#| fig-cap: "MPG vs Weight"
#| fig-width: 8
#| fig-height: 5
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm")
```

Render Quarto

Bash

quarto render document.qmd
# or in RStudio: click the Render button

Quarto features you should know

  • Callouts: ::: {.callout-note} … :::

  • Tabs: ::: {.panel-tabset}

  • Code folding: code-fold: true

  • Cross-references: @fig-mpg-plot

  • Citation: @wickham2019

Recommendation (2026):

  • Use Quarto for all new work

  • Keep R Markdown only for legacy projects or when Quarto is not yet supported

Mini Summary Project – Reproducible Report Create a new .qmd file in RStudio:

YAML

---
title: "Titanic Survival Analysis"
format: html
execute:
  echo: false
  warning: false
---

```{r setup}
library(tidyverse)
library(knitr)
```

## Data Overview

```{r}
titanic <- titanic::titanic_train
kable(head(titanic))
```

## Survival by Class

```{r}
#| fig-cap: "Survival Rate by Passenger Class"
titanic %>%
  ggplot(aes(x = factor(Pclass), fill = factor(Survived))) +
  geom_bar(position = "fill") +
  labs(x = "Class", y = "Proportion")
```

This completes the full R Markdown & Reproducible Reports section — now you can create dynamic, professional, fully reproducible reports and publications in R!

11. Real-World Projects & Portfolio Building

These five practical projects combine everything you’ve learned — from data import and manipulation to visualization, statistical analysis, machine learning, time series, and reproducible reporting. They are designed to be portfolio-ready, interview-impressive, and real-world applicable.

11.1 Project 1: Exploratory Analysis & Dashboard (ggplot2 + flexdashboard)

Goal: Perform complete EDA on a dataset and present it as an interactive dashboard.

Tools used: tidyverse, ggplot2, flexdashboard

Steps & Code Structure (save as dashboard.Rmd)

YAML

---
title: "Exploratory Analysis Dashboard – Titanic Dataset"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
runtime: shiny
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(ggplot2)
library(DT)
library(plotly)
df <- titanic::titanic_train   # titanic_train lives in the titanic package
```

Column {data-width=600}

Data Overview

```{r}
DT::datatable(df, filter = "top",
              options = list(pageLength = 10, scrollX = TRUE))
```

Key Insights

  • Total passengers: `r nrow(df)`

  • Survival rate: `r round(mean(df$Survived, na.rm = TRUE) * 100, 1)`%

  • Missing Age values: `r sum(is.na(df$Age))`

Column {data-width=400}

Age Distribution by Survival

```{r}
ggplot(df, aes(x = Age, fill = factor(Survived))) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  labs(title = "Age vs Survival", fill = "Survived (1 = Yes)") +
  theme_minimal()
```

Fare by Class

```{r}
ggplot(df, aes(x = factor(Pclass), y = Fare, fill = factor(Pclass))) +
  geom_boxplot(outlier.shape = 21) +
  labs(title = "Fare Distribution by Passenger Class") +
  theme_minimal()
```

Value Boxes

Total Passengers

```{r}
valueBox(nrow(df), icon = "fa-users", color = "primary")
```

Survival Rate

```{r}
valueBox(paste0(round(mean(df$Survived, na.rm = TRUE) * 100, 1), "%"),
         icon = "fa-heartbeat", color = "success")
```

Average Fare

```{r}
valueBox(paste0("₹", round(mean(df$Fare, na.rm = TRUE), 1)),
         icon = "fa-money-bill-wave", color = "warning")
```

How to run: Knit → save as HTML → open in browser (interactive).

Key Takeaways: flexdashboard is perfect for quick, interactive EDA reports.

11.2 Project 2: Customer Churn Prediction (Classification)

Goal: Predict which customers will churn using classification models.

Dataset: Telco Customer Churn (Kaggle)

Steps & Code

R

library(tidyverse)
library(tidymodels)
library(themis)   # for SMOTE

# 1. Load & clean
df <- read_csv("telco_churn.csv") %>%
  janitor::clean_names() %>%
  mutate(churn = factor(churn, levels = c("No", "Yes"))) %>%
  select(-customer_id)

# 2. Split & recipe
set.seed(42)
split <- initial_split(df, prop = 0.8, strata = churn)
train <- training(split)
test  <- testing(split)

rec <- recipe(churn ~ ., data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>%
  step_smote(churn) %>%                       # handle class imbalance
  step_normalize(all_numeric_predictors())

# 3. Model spec
rf_spec <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# 4. Workflow & fit
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_spec)
fit <- wf %>% fit(data = train)

# 5. Evaluate ("Yes" is the second factor level, hence event_level = "second")
results <- bind_cols(
  tibble(truth = test$churn),
  predict(fit, test),
  predict(fit, test, type = "prob")
)
print(conf_mat(results, truth = truth, estimate = .pred_class))
print(roc_auc(results, truth, .pred_Yes, event_level = "second"))

Key Takeaways: Use themis::step_smote() for imbalance. Focus on Recall & ROC-AUC for churn problems.

11.3 Project 3: Sales Forecasting (Time Series)

Goal: Forecast monthly sales using Prophet and ARIMA.

Code

R

library(prophet)
library(forecast)
library(tidyverse)

# Assume monthly_sales.csv has columns: date (YYYY-MM-01), sales
df <- read_csv("monthly_sales.csv") %>%
  mutate(ds = as.Date(date), y = sales)

# Prophet
m <- prophet(df, yearly.seasonality = TRUE)
future <- make_future_dataframe(m, periods = 12, freq = "month")
fc_prophet <- predict(m, future)
plot(m, fc_prophet)

# ARIMA
ts_data <- ts(df$y, frequency = 12, start = c(2020, 1))
fit_arima <- auto.arima(ts_data)
fc_arima <- forecast(fit_arima, h = 12)
plot(fc_arima)

Key Takeaways: Prophet is easier for business users; ARIMA gives more statistical control.

11.4 Project 4: Sentiment Analysis on Reviews

Goal: Classify product reviews as positive/negative.

Code (using tidytext)

R

library(tidyverse)
library(tidytext)
library(textdata)

reviews <- read_csv("amazon_reviews.csv")

# Tokenize & join the sentiment lexicon
review_words <- reviews %>%
  unnest_tokens(word, review_text) %>%
  inner_join(get_sentiments("bing"), by = "word")

sentiment_summary <- review_words %>%
  count(word, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)

# Visualize top words
sentiment_summary %>%
  arrange(desc(score)) %>%
  slice_head(n = 20) %>%
  ggplot(aes(reorder(word, score), score, fill = score > 0)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top Sentiment Words in Reviews")

Key Takeaways: tidytext + Bing lexicon = simple & effective baseline.

11.5 Creating a Professional Portfolio (GitHub + RPubs)

Portfolio Structure (2026 standard)

  • GitHub repo for each project

  • README.md with:

    • Project goal

    • Dataset description

    • Key findings & visuals

    • Code walkthrough

    • Live link (RPubs, ShinyApps.io, Quarto Pub)

  • RPubs / Quarto Pub for rendered reports/dashboards

  • Personal portfolio website (Quarto website or GitHub Pages)

Best Practices

  • Use meaningful repo names (e.g., customer-churn-prediction-r)

  • Add screenshots & GIFs in README

  • Include requirements.txt equivalent → sessionInfo() or renv.lock

  • Pin top 6 projects on GitHub profile

  • Add badges: R version, license, stars

Final Advice: Publish 4–6 high-quality projects, write blog posts explaining your thought process, and share them on LinkedIn, Kaggle, RStudio Community, and Reddit (r/rstats, r/datascience). You now have a strong R portfolio!

This completes the full Real-World Projects & Portfolio Building section — and the entire R Programming Mastery tutorial!

12. Best Practices, Career Guidance & Next Steps

You’ve now completed a comprehensive journey through R programming — from basics to advanced data manipulation, visualization, statistical modeling, machine learning, time series, and reproducible reporting. This final section focuses on professional habits, industry applications, Git workflow, interview preparation, career paths, and resources to help you succeed in 2026 and beyond.

12.1 Writing Clean, Reproducible & Production-Ready R Code

Clean and reproducible code is what separates hobbyists from professionals in R.

Core Best Practices (2026 Standard)

  1. Follow Tidyverse Style Guide & use modern tools

    • Use snake_case for objects/functions

    • Consistent spacing & indentation

    • Pipe (%>%) for readability

    R

    # Auto-format an R script
    styler::style_file("script.R")
    # or use the lintr + styler add-ins in RStudio

  2. Always use projects & here package

    R

    library(here)
    read_csv(here("data", "sales.csv"))

    → No more broken paths when moving files

  3. Reproducibility

    • Set random seed: set.seed(42)

    • Use renv or groundhog for package versions

    • Document session info: sessionInfo()

    • Prefer Quarto over R Markdown for new work

  4. Production-ready tips

    • Avoid global variables

    • Write functions instead of copy-paste code

    • Use assertthat or checkmate for input validation

    • Add error handling: tryCatch()

    • Log messages: logger package or message()

  5. Code structure for large projects

    text

    project/
    ├── R/          # functions & scripts
    ├── data/       # raw & processed
    ├── output/     # figures, tables
    ├── reports/    # Quarto/Rmd files
    ├── tests/      # testthat tests
    ├── renv.lock   # package versions
    └── main.qmd
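Tip 4 above (validation, error handling, logging) can be sketched in one small function — illustrative only, with a hypothetical name:

```r
# A defensive, production-style function: validate inputs, handle errors, log
safe_mean_ratio <- function(x, y) {
  stopifnot(is.numeric(x), is.numeric(y), length(y) > 0)   # input validation
  tryCatch(
    {
      r <- mean(x, na.rm = TRUE) / mean(y, na.rm = TRUE)
      message("Computed ratio: ", round(r, 3))             # lightweight logging
      r
    },
    error = function(e) {
      message("safe_mean_ratio failed: ", conditionMessage(e))
      NA_real_
    }
  )
}

safe_mean_ratio(c(2, 4, 6), c(1, 2, 3))
```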

12.2 R in Industry – Shiny Apps, R Packages, APIs

Shiny – Build interactive web apps directly from R

Simple Shiny app example

R

library(shiny)
library(tidyverse)

ui <- fluidPage(
  titlePanel("Interactive MPG Explorer"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("hp", "Horsepower:", min = 50, max = 350,
                  value = c(100, 200))
    ),
    mainPanel(
      plotOutput("mpgPlot")
    )
  )
)

server <- function(input, output) {
  output$mpgPlot <- renderPlot({
    mtcars %>%
      filter(hp >= input$hp[1], hp <= input$hp[2]) %>%
      ggplot(aes(wt, mpg)) +
      geom_point(size = 4, alpha = 0.7) +
      theme_minimal()
  })
}

shinyApp(ui = ui, server = server)

Deployment options (2026):

  • shinyapps.io (free tier available)

  • Posit Connect (enterprise)

  • Docker + RStudio Server / Shiny Server

  • Combine with FastAPI/Plumber for hybrid apps

Building R Packages

  • Use devtools & usethis

  • Structure: R/ (functions), tests/, man/ (documentation), DESCRIPTION

  • Publish to CRAN or GitHub
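That workflow can be sketched with usethis/devtools; the calls are real, but the package and file names are illustrative, and the commands are meant to be run interactively:

```r
library(usethis)
library(devtools)

create_package("~/mypackage")   # scaffold DESCRIPTION, R/, .Rproj, etc.
use_r("my_function")            # create R/my_function.R
use_test("my_function")         # create tests/testthat/test-my_function.R
use_mit_license()               # add a LICENSE
document()                      # roxygen2 → man/ pages and NAMESPACE
check()                         # run R CMD check
# then install() locally, or publish to CRAN / GitHub
```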

R APIs with Plumber

R

# plumber.R — annotations live in a source file, not in a string
library(plumber)

#* Predict mpg from horsepower using a pre-fitted model `lm_model`
#* @param hp
#* @get /predict
function(hp) {
  predict(lm_model, newdata = data.frame(hp = as.numeric(hp)))
}

# Run the API (from a separate script):
# plumber::pr("plumber.R") %>% plumber::pr_run(port = 8000)

12.3 Git & GitHub Workflow for R Users

Recommended workflow (2026):

  1. Create repo on GitHub

  2. Clone locally: git clone https://github.com/username/repo.git

  3. Create branch: git checkout -b feature/eda-report

  4. Work → stage → commit:

    Bash

    git add .
    git commit -m "Add EDA dashboard and summary stats"

  5. Push: git push origin feature/eda-report

  6. Create Pull Request → review → merge

  7. Delete branch after merge

R-specific tips

  • Commit your .Rproj file, but add .Rproj.user/, .Rhistory and .RData to .gitignore

  • Never commit large data files → use Git LFS or external storage

  • Use usethis::use_git() & usethis::use_github() to initialize

  • Add GitHub Actions for linting & testing
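The usethis calls above can be run from a fresh RStudio Project (interactive prompts will appear; assumes a GitHub personal access token is configured):

```r
library(usethis)

use_git()                            # initialise a local git repo + first commit
use_github()                         # create the GitHub repo and push
use_github_action("check-standard")  # CI: run R CMD check on push/PR
git_vaccinate()                      # globally ignore .Rhistory, .RData, etc.
```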

12.4 Top R Interview Questions & Answers

Frequently asked in 2026:

  1. What is the difference between data.frame and tibble? → tibble is stricter (no partial column-name matching), prints more compactly, and never converts character columns to factors

  2. Explain the pipe operator %>% vs native |> → %>% comes from magrittr and has more features (e.g. the . placeholder); |> is built into base R since 4.1 and needs no package

  3. How to handle missing values in R? → na.omit(), drop_na(), replace_na(), mice / missForest for imputation

  4. Difference between lapply and sapply? → lapply returns list, sapply simplifies to vector/matrix

  5. What is tidy data? → Each variable = column, each observation = row, each type of observational unit = table

  6. How to reshape data in tidyverse? → pivot_longer() (wide → long), pivot_wider() (long → wide)

  7. Explain group_by() + summarise() vs mutate() → summarise() collapses groups, mutate() keeps rows

  8. What is ggplot2 grammar of graphics? → Data + Aesthetics + Geometries + Scales + Facets + Themes

  9. How to perform t-test in R? → t.test(x, y) or t.test(outcome ~ group, data = df)

  10. Difference between lm() and glm()? → lm() for linear regression, glm() for generalized (logistic, poisson, etc.)
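Several of these answers can be verified interactively; a quick sketch using built-in data:

```r
library(tidyr)

# Q4: lapply returns a list, sapply simplifies the result
lapply(1:3, sqrt)   # list of length 3
sapply(1:3, sqrt)   # numeric vector of length 3

# Q6: reshaping with pivot_longer / pivot_wider
wide <- data.frame(id = 1:2, x = c(10, 20), y = c(30, 40))
long <- pivot_longer(wide, cols = c(x, y),
                     names_to = "var", values_to = "value")
pivot_wider(long, names_from = var, values_from = value)  # back to wide

# Q9: t-test with the formula interface (mpg by transmission type)
t.test(mpg ~ am, data = mtcars)
```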

12.5 Career Paths – Data Analyst, Biostatistician, Researcher, Data Scientist

Main Career Tracks in R-heavy domains (2026):

| Role | Primary Skills in R | Typical Employers | India Salary (₹ LPA) | Global Salary (USD/year) |
|---|---|---|---|---|
| Data Analyst | dplyr, ggplot2, R Markdown, SQL | Consulting, BFSI, E-commerce | 5–14 | $65k–$100k |
| Biostatistician | Survival analysis, mixed models, clinical trials | Pharma, CROs, Hospitals, Research | 10–30 | $90k–$160k |
| Academic Researcher | Advanced stats, reproducible reports, packages | Universities, Research Institutes | 8–25 | $70k–$140k |
| Data Scientist | tidyverse + ML (caret/tidymodels), Shiny | Tech, Finance, Healthcare | 12–35 | $100k–$180k |
| Statistical Programmer | CDISC standards, SAS/R integration | Pharma, Clinical Research | 12–28 | $90k–$150k |

High-demand skills in R ecosystem (2026):

  • tidyverse mastery

  • Quarto / R Markdown

  • Shiny apps

  • Statistical modeling (survival, longitudinal)

  • Reproducible research & reporting

This completes your full R Programming Mastery tutorial! You are now equipped to write professional R code, build impactful projects, and pursue exciting careers in statistics, data science, and research.
