R Programming Mastery From Beginner to Advanced (Complete 2026 Guide)
TABLE OF CONTENTS
R Programming Mastery – From Beginner to Advanced (Complete 2026 Guide) Hands-on Learning Path for Statistics, Data Analysis, Visualization & Machine Learning
1. Introduction to R Programming
   1.1 What is R and Why Learn It in 2026?
   1.2 R vs Python – Quick Comparison for Data Science
   1.3 Who Should Learn R? (Students, Researchers, Statisticians, Analysts)
   1.4 Installing R & RStudio (2026 Recommended Setup)
2. R Basics – Syntax & Core Concepts
   2.1 Variables, Data Types & Basic Operations
   2.2 Vectors, Lists, Matrices & Arrays
   2.3 Factors & Data Frames – The Heart of R
   2.4 Control Structures (if-else, for, while, apply family)
   2.5 Writing Your First R Script
3. Data Import & Export
   3.1 Reading CSV, Excel, SPSS, SAS, Stata & JSON Files
   3.2 Working with Databases (SQL, BigQuery, etc.)
   3.3 Exporting Data – CSV, Excel, RDS, RData
   3.4 Handling Large Datasets Efficiently
4. Data Manipulation with dplyr & tidyverse
   4.1 Introduction to tidyverse & Pipes (%>%)
   4.2 filter(), select(), arrange(), mutate(), summarise()
   4.3 group_by() + summarise() – Powerful Aggregations
   4.4 Joining Data (inner_join, left_join, full_join)
   4.5 tidyr – pivot_longer, pivot_wider, separate, unite
5. Data Visualization with ggplot2
   5.1 ggplot2 Grammar of Graphics – Core Logic
   5.2 Scatter Plots, Line Charts, Bar Plots & Histograms
   5.3 Boxplots, Violin Plots & Density Plots
   5.4 Faceting, Themes & Publication-Ready Plots
   5.5 Advanced Visuals – Heatmaps, Correlation Plots, Marginal Plots
6. Exploratory Data Analysis (EDA) in R
   6.1 Summary Statistics & Descriptive Analysis
   6.2 Handling Missing Values & Outliers
   6.3 Univariate, Bivariate & Multivariate EDA
   6.4 Automated EDA with DataExplorer / SmartEDA
7. Statistical Analysis in R
   7.1 Descriptive vs Inferential Statistics
   7.2 Hypothesis Testing (t-test, ANOVA, Chi-square)
   7.3 Correlation & Linear Regression
   7.4 Logistic Regression & Generalized Linear Models
   7.5 Non-parametric Tests & Post-hoc Analysis
8. Machine Learning with R
   8.1 Supervised Learning – Regression & Classification
   8.2 caret vs tidymodels – Two Main ML Frameworks
   8.3 Random Forest, XGBoost & Gradient Boosting in R
   8.4 Model Evaluation – Cross-validation, ROC-AUC, Confusion Matrix
   8.5 Unsupervised Learning – Clustering (k-means, hierarchical)
9. Time Series Analysis & Forecasting
   9.1 Time Series Objects – ts, xts, zoo
   9.2 Decomposition – Trend, Seasonality, Remainder
   9.3 ARIMA & SARIMA Models
   9.4 Prophet & forecast Package
   9.5 Real-world Forecasting Project
10. R Markdown & Reproducible Reports
   10.1 Creating Dynamic Reports with R Markdown
   10.2 Parameters, Tables, Figures & Citations
   10.3 Converting to HTML, PDF, Word
   10.4 Quarto – The Modern Replacement (2026 Standard)
11. Real-World Projects & Portfolio Building
   11.1 Project 1: Exploratory Analysis & Dashboard (ggplot2 + flexdashboard)
   11.2 Project 2: Customer Churn Prediction (Classification)
   11.3 Project 3: Sales Forecasting (Time Series)
   11.4 Project 4: Sentiment Analysis on Reviews
   11.5 Creating a Professional Portfolio (GitHub + RPubs)
12. Best Practices, Career Guidance & Next Steps
   12.1 Writing Clean, Reproducible & Production-Ready R Code
   12.2 R in Industry – Shiny Apps, R Packages, APIs
   12.3 Git & GitHub Workflow for R Users
   12.4 Top R Interview Questions & Answers
   12.5 Career Paths – Data Analyst, Biostatistician, Researcher, Data Scientist
   12.6 Recommended Books, Courses & Communities (2026 Updated)
1. Introduction to R Programming
Welcome to your journey into R Programming! This first section explains what R is, why it remains extremely relevant in 2026, how it compares to Python, who should learn it, and how to set up a powerful, modern R environment.
1.1 What is R and Why Learn It in 2026?
R is an open-source programming language and software environment specifically designed for statistical computing, data analysis, data visualization, and research.
Created in 1993 by Ross Ihaka and Robert Gentleman, R is now maintained by the R Foundation and a massive global community.
Why R is still powerful and relevant in 2026:
Unmatched statistical packages and cutting-edge methods (many statisticians and biostatisticians still prefer R)
Publication-quality graphics (ggplot2 is the gold standard in academia and journals)
Reproducible research (R Markdown → Quarto in 2026)
Huge ecosystem: tidyverse (dplyr, ggplot2, tidyr), Shiny (interactive apps), caret/tidymodels (ML), Bioconductor (bioinformatics)
Free, open-source, and cross-platform
Dominant in academia, pharma, clinical trials, government research, finance, and bioinformatics
R is not dying — it is evolving: Quarto, the tidyverse, arrow and duckdb integration, faster tooling from Posit, and strong community support.
1.2 R vs Python – Quick Comparison for Data Science
Both R and Python are excellent — choose based on your goal and domain.
Feature / Aspect | R (2026) | Python (2026) | Winner / When to Choose
Primary strength | Statistics, advanced analytics, publication graphics | General-purpose, ML/AI, production deployment | R for stats/research, Python for ML/engineering
Data visualization | ggplot2 – best-in-class, publication-ready | Matplotlib + Seaborn (good), Plotly (interactive) | R (ggplot2)
Statistical modeling | Extremely rich (thousands of packages) | Good (statsmodels, pingouin), but less depth | R
Machine learning | caret, tidymodels, mlr3 (solid but smaller) | Scikit-learn, XGBoost, PyTorch, TensorFlow | Python
Reproducible reports | R Markdown → Quarto (excellent) | Jupyter + nbconvert (good) | R (Quarto)
Interactive apps | Shiny (very strong) | Streamlit, Dash, Panel | R (Shiny) for stats apps
Speed & big data | Improving fast (duckdb, arrow, data.table) | Polars, PySpark, Dask | Python (slightly ahead)
Community & job market | Strong in academia, pharma, research | Much larger overall, dominant in industry | Python for jobs, R for research
2026 verdict:
Choose R if you work in statistics, biostats, clinical research, academia, or publication-heavy fields.
Choose Python for machine learning, deep learning, big data, web apps, or broad industry roles.
Many professionals learn both — R for stats & visualization, Python for ML & deployment.
1.3 Who Should Learn R? (Students, Researchers, Statisticians, Analysts)
R is especially valuable for:
Students (Statistics, Biostatistics, Economics, Psychology, Social Sciences) → Learn R early — many university courses still use it heavily
Researchers (Academic, Clinical, Market Research) → ggplot2 + R Markdown/Quarto = perfect for papers, theses, reproducible reports
Statisticians & Biostatisticians → R has the deepest collection of statistical tests, mixed models, survival analysis, Bayesian methods
Data Analysts in pharma, healthcare, finance, government → R excels at regulatory-compliant reporting and advanced analytics
Professionals transitioning from SPSS/Stata/SAS → R is free and more modern
Who can skip R (or learn later)?
Pure ML engineers (deep learning, computer vision)
Web/full-stack developers
Big data engineers (Spark, Hadoop)
1.4 Installing R & RStudio (2026 Recommended Setup)
Step-by-step modern setup (2026 best practice):
Install R (base language) → Go to https://cran.r-project.org → Download latest version (R 4.4.x or 4.5.x in 2026) for your OS
Install RStudio Desktop (best IDE) → https://posit.co/download/rstudio-desktop/ → The free Open Source Edition is perfect
Recommended: Use Posit Public Package Manager (formerly RSPM) as your package repository → faster binary package installation, especially on corporate/university networks
Create a project & set working directory
Open RStudio → File → New Project → New Directory → New Project
This keeps everything organized
Install essential packages (run in R console)
R
install.packages(c(
  "tidyverse",  # core: dplyr, ggplot2, tidyr, readr, etc.
  "rmarkdown",  # reports
  "quarto",     # modern publishing (2026 standard)
  "here",       # easy file paths
  "janitor",    # clean_names()
  "skimr",      # quick EDA
  "esquisse"    # drag-and-drop ggplot2
))
Recommended VS Code alternative (for power users)
Install VS Code + R Extension (by REditorSupport)
Use radian (better R console) → pip install radian
Quick test – Run this in R console
R
library(tidyverse)

ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(title = "Horsepower vs MPG")
You should see a scatter plot with a fitted regression line.
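With your project created, the here package (from the install list above) lets you build file paths relative to the project root, so scripts work no matter which subfolder they run from. A minimal sketch — "data/sales_data.csv" is a hypothetical file used only for illustration:

```r
library(here)

# here() resolves paths from the project root, not the current working directory
data_path <- here("data", "sales_data.csv")
print(data_path)

# Typical use: df <- readr::read_csv(here("data", "sales_data.csv"))
```

This avoids fragile setwd() calls and makes projects portable across machines.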
This completes the full Introduction to R Programming section — your perfect starting point for the entire R tutorial!
2. R Basics – Syntax & Core Concepts
This section covers the foundational elements of R programming. Once you master these, you'll be able to read, write, and understand most R code used in data analysis, statistics, and visualization.
2.1 Variables, Data Types & Basic Operations
In R, you create variables using <- (traditional) or = (also accepted).
Basic data types in R
R
# Numeric (double by default, even without a decimal point)
x <- 42    # numeric (double)
y <- 3.14  # double (floating point)
z <- 1L    # explicit integer (L suffix)

# Character (strings)
name <- "Anshuman"
city <- 'Ranchi'

# Logical (boolean)
is_student <- TRUE
has_degree <- FALSE

# Check type
class(x)       # "numeric"
typeof(y)      # "double"
is.numeric(z)  # TRUE
Basic operations
R
a <- 10
b <- 3

print(a + b)    # 13
print(a - b)    # 7
print(a * b)    # 30
print(a / b)    # 3.333333
print(a %/% b)  # integer division → 3
print(a %% b)   # modulo → 1
print(a ^ b)    # power → 1000
Logical operations
R
print(a > b)         # TRUE
print(a == 10)       # TRUE
print(a != 5)        # TRUE
print(!TRUE)         # FALSE
print(TRUE & FALSE)  # FALSE (AND)
print(TRUE | FALSE)  # TRUE (OR)
Tip: Use <- for assignment (community standard). Avoid = except in function arguments.
2.2 Vectors, Lists, Matrices & Arrays
Vectors – The most basic and most used data structure in R (1D, atomic)
R
# Create vectors
v1 <- c(1, 2, 3, 4, 5)
v2 <- 10:20               # sequence 10 to 20
v3 <- seq(0, 10, by = 2)  # 0, 2, 4, ..., 10

# Operations are vectorized (no loops needed!)
print(v1 * 2)    # 2 4 6 8 10
print(v1 + 100)  # 101 102 103 104 105

# Indexing (starts from 1!)
print(v1[3])           # 3
print(v1[c(1, 3, 5)])  # 1 3 5
print(v1[-2])          # exclude 2nd element → 1 3 4 5
Lists – Can hold mixed types (most flexible)
R
my_list <- list(
  name = "Anshuman",
  age = 25,
  scores = c(85, 92, 78),
  passed = TRUE,
  details = list(city = "Ranchi", state = "Jharkhand")
)

print(my_list$name)          # "Anshuman"
print(my_list[[3]])          # scores vector
print(my_list$details$city)  # "Ranchi"
Matrices – 2D, homogeneous (same type)
R
m <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
print(m)
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    5    6    7    8
# [3,]    9   10   11   12

print(m[2, 3])  # 7
print(m[, 2])   # column 2 → 2 6 10
Arrays – Multi-dimensional (rarely used directly)
R
arr <- array(1:24, dim = c(2, 3, 4))  # 2 × 3 × 4 array
print(arr[1, 2, 3])                   # access one element
2.3 Factors & Data Frames – The Heart of R
Factors – Used for categorical data (levels)
R
gender <- factor(c("Male", "Female", "Male", "Other", "Female"))
print(gender)
# [1] Male   Female Male   Other  Female
# Levels: Female Male Other

# Rename levels — note the alphabetical level order: Female, Male, Other
levels(gender) <- c("F", "M", "O")
print(gender)
Data Frames – Rectangular table (like Excel or SQL table) – most important structure
R
# Create a data frame
students <- data.frame(
  name = c("Anshuman", "Priya", "Rahul", "Sneha"),
  age = c(25, 23, 24, 22),
  marks = c(92, 88, 85, 90),
  passed = c(TRUE, TRUE, TRUE, TRUE)
)
print(students)
#       name age marks passed
# 1 Anshuman  25    92   TRUE
# 2    Priya  23    88   TRUE
# 3    Rahul  24    85   TRUE
# 4    Sneha  22    90   TRUE

# Access
students$marks
students[1, ]                  # first row
students[, "age"]              # age column
students[students$age > 23, ]  # filter rows
Important: Data frames are lists of vectors (the columns) — every column must be the same length.
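You can verify this yourself with the students data frame defined above — list functions work directly on data frames:

```r
# A data frame really is a list of column vectors
is.list(students)        # TRUE
length(students)         # 4 — the number of columns
names(students)          # "name" "age" "marks" "passed"

# So list-style extraction works too
students[["marks"]]      # same as students$marks
sapply(students, class)  # class of every column
```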
2.4 Control Structures (if-else, for, while, apply family)
if-else
R
score <- 85
if (score >= 90) {
  print("A+")
} else if (score >= 80) {
  print("A")
} else {
  print("B or below")
}
for loop
R
for (i in 1:5) {
  print(i^2)
}
while loop
R
count <- 1
while (count <= 5) {
  print(count)
  count <- count + 1
}
apply family – Vectorized alternatives to loops (very important in R)
R
# apply (for matrices/arrays)
m <- matrix(1:12, nrow = 3)
apply(m, 1, sum)  # sum of each row

# lapply (returns a list)
lapply(students$marks, function(x) x + 5)

# sapply (simplifies to vector/matrix)
sapply(students$marks, function(x) x > 85)

# tapply (grouped apply)
tapply(students$marks, students$age > 23, mean)
Best practice: Avoid explicit for loops when possible — use apply, lapply, sapply, tapply, or tidyverse map_* functions.
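The tidyverse map_* functions mentioned above live in the purrr package (loaded with library(tidyverse)). A minimal sketch of the same patterns in purrr style:

```r
library(purrr)

# map() returns a list; map_dbl()/map_lgl() return typed vectors
map_dbl(c(85, 92, 78), ~ .x + 5)   # 90 97 83
map_lgl(c(85, 92, 78), ~ .x > 85)  # FALSE TRUE FALSE

# Because a data frame is a list of columns, mapping over it
# applies the function to each column
map_dbl(mtcars, mean)  # named vector of column means
```

The typed variants (map_dbl, map_lgl, map_chr) fail loudly on type mismatches, which makes them safer than sapply's silent simplification.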
2.5 Writing Your First R Script
Create a new R script in RStudio: File → New File → R Script
Example script – student_analysis.R
R
# student_analysis.R

# Load packages
library(tidyverse)

# Sample data
students <- data.frame(
  name = c("Anshuman", "Priya", "Rahul", "Sneha"),
  age = c(25, 23, 24, 22),
  marks = c(92, 88, 85, 90)
)

# Summary
summary(students)

# Visualization
ggplot(students, aes(x = age, y = marks)) +
  geom_point(size = 4, color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Marks vs Age", x = "Age", y = "Marks") +
  theme_minimal()

# Save plot
ggsave("marks_vs_age.png", width = 8, height = 6, dpi = 300)

# Save results
write.csv(students, "students_data.csv", row.names = FALSE)
print("Analysis complete!")
Run script
Press Ctrl + Enter (line by line)
Or Source entire script (Ctrl + Shift + S)
This completes the full R Basics – Syntax & Core Concepts section — now you have the strong foundation to write real R code!
3. Data Import & Export
In real-world data analysis with R, most of your time is spent getting data in and out of R efficiently and correctly. R has excellent support for almost every common data format used in statistics, research, business, and academia.
3.1 Reading CSV, Excel, SPSS, SAS, Stata & JSON Files
CSV (Comma-Separated Values) – Most common format
R
# Basic read (base R)
df <- read.csv("sales_data.csv")

# Recommended modern way (faster, better type handling)
library(readr)
df <- read_csv("sales_data.csv", show_col_types = FALSE)

# Useful options
df <- read_csv("data.csv",
               col_types = cols(
                 date = col_date("%Y-%m-%d"),
                 price = col_double(),
                 category = col_factor()
               ),
               na = c("", "NA", "missing"),
               skip = 2)  # skip first 2 rows
Excel (.xlsx / .xls)
R
# Recommended package
library(readxl)
df <- read_excel("report.xlsx", sheet = "Sales", skip = 1)

# or read a specific cell range
df <- read_excel("report.xlsx", range = "B2:F100")
SPSS (.sav), SAS (.sas7bdat), Stata (.dta) – Very common in research
R
library(haven)

# SPSS
df_spss <- read_sav("survey.sav")

# SAS
df_sas <- read_sas("clinical.sas7bdat")

# Stata
df_stata <- read_dta("economics.dta")

# All preserve value labels, formats, etc.
JSON (JavaScript Object Notation)
R
library(jsonlite)
df_json <- fromJSON("data.json", flatten = TRUE)

# or read from a URL
df_api <- fromJSON("https://api.example.com/data")
Tip (2026 best practice): Always use readr::read_csv() or readxl::read_excel() instead of base R functions — they are 5–10× faster and handle types better.
3.2 Working with Databases (SQL, BigQuery, etc.)
Connecting to SQL databases
R
# SQLite (local file database)
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "mydatabase.db")
df <- dbGetQuery(con, "SELECT * FROM customers WHERE age > 30")
dbDisconnect(con)
PostgreSQL / MySQL / MariaDB
R
library(RPostgres)  # or RMariaDB for MySQL/MariaDB

con <- dbConnect(RPostgres::Postgres(),
                 dbname = "sales_db",
                 host = "localhost",
                 port = 5432,
                 user = "user",
                 password = Sys.getenv("DB_PASSWORD"))

df <- dbGetQuery(con, "SELECT * FROM orders LIMIT 1000")
Google BigQuery (cloud)
R
library(bigrquery)

# Authenticate once
bq_auth()

project <- "my-project-id"
dataset <- "sales_data"
table <- "2025_transactions"

# Run the query, then download the result
tb <- bq_project_query(
  project,
  query = "SELECT * FROM `sales_data.2025_transactions` LIMIT 1000"
)
df <- bq_table_download(tb)
Best practice:
Never hardcode passwords → use Sys.getenv() or .Renviron file
Use DBI + backend package (standard interface)
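To illustrate the Sys.getenv() pattern: put the secret in a .Renviron file (in your home or project directory, one NAME=value per line), restart R, and read it at runtime. The variable name DB_PASSWORD below is just an example:

```r
# In .Renviron (never commit this file to Git):
# DB_PASSWORD=my_secret_password

# In your R script — the password never appears in the code
pw <- Sys.getenv("DB_PASSWORD")
if (pw == "") stop("DB_PASSWORD is not set — check your .Renviron file")
```

R reads .Renviron automatically at startup, so the same script works on any machine where the variable is defined.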
3.3 Exporting Data – CSV, Excel, RDS, RData
CSV
R
# Base R
write.csv(df, "cleaned_data.csv", row.names = FALSE)

# Faster & better (readr)
write_csv(df, "cleaned_data.csv")
Excel
R
library(openxlsx)
write.xlsx(df, "report.xlsx", sheetName = "Analysis", rowNames = FALSE)
RDS (single R object – recommended for saving models/data frames)
R
saveRDS(df, "processed_data.rds")
df_loaded <- readRDS("processed_data.rds")
RData / .rda (multiple objects)
R
save(df, model, file = "session_data.rda")
load("session_data.rda")
Quick rule (2026):
Use CSV for sharing with non-R users
Use RDS for saving R objects (preserves types, factors, dates)
Use RData when saving multiple objects together
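A quick demonstration of why RDS preserves types while CSV does not — a round trip through each format with a factor column:

```r
df <- data.frame(grade = factor(c("A", "B", "A")))

# RDS round trip: the factor survives
saveRDS(df, "tmp.rds")
class(readRDS("tmp.rds")$grade)   # "factor"

# CSV round trip: type information is lost
write.csv(df, "tmp.csv", row.names = FALSE)
class(read.csv("tmp.csv")$grade)  # "character"
```

The same applies to dates, ordered factors, and attributes — CSV stores only text, so use RDS whenever the file stays inside R.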
3.4 Handling Large Datasets Efficiently
R can struggle with very large data (> RAM size). Modern solutions (2026) make it possible to work with gigabytes easily.
data.table – Fast alternative to data.frame
R
library(data.table)
dt <- fread("very_large_file.csv")  # much faster than read.csv

# data.table syntax: dt[filter, compute, by = group]
dt[age > 30, .(mean_salary = mean(salary)), by = city]
arrow + duckdb – Work with data larger than RAM
R
library(arrow)
library(duckdb)
library(DBI)

# Read Parquet (columnar, compressed format)
df <- read_parquet("large_data.parquet")

# Use duckdb to run SQL on large files without loading them fully
con <- dbConnect(duckdb())
df <- dbGetQuery(con, "
  SELECT *
  FROM 'large_data.parquet'
  WHERE sales > 100000
  LIMIT 1000
")
dbDisconnect(con)
Best practices for big data in R (2026)
Use Parquet format instead of CSV (faster, smaller)
Prefer data.table or Polars (R package) for in-memory speed
Use duckdb or arrow for querying files larger than memory
Avoid read.csv() on large files → use fread() or read_csv()
Sample data first for EDA — prefer a random sample such as df_sample <- dplyr::slice_sample(df, n = 10000), since head(df, 10000) takes only the first rows and may be biased
Mini Summary Project – Import, Clean & Export Pipeline
R
library(tidyverse)
library(haven)    # SPSS import
library(janitor)  # clean_names()
library(skimr)    # skim()
library(arrow)    # write_parquet()

# 1. Import SPSS file
df_raw <- read_sav("survey_data.sav")

# 2. Clean & transform
df_clean <- df_raw %>%
  clean_names() %>%
  filter(age >= 18 & age <= 65) %>%
  mutate(income_k = income / 1000,
         income_log = log1p(income)) %>%
  select(id, age, gender, income_k, income_log, everything())

# 3. Quick summary
skim(df_clean)

# 4. Export
write_csv(df_clean, "cleaned_survey.csv")
saveRDS(df_clean, "cleaned_survey.rds")
write_parquet(df_clean, "cleaned_survey.parquet")
This completes the full Data Import & Export section — now you can confidently bring any kind of data into R, clean it, and save it efficiently!
4. Data Manipulation with dplyr & tidyverse
The tidyverse is a collection of modern R packages designed for data science. The most important one for data manipulation is dplyr — it provides a consistent, readable, and fast grammar for working with data frames.
Core tidyverse packages used here:
dplyr – data manipulation
tidyr – reshaping data
magrittr / pipe – %>% operator
readr – fast data import (already covered)
Install tidyverse (once)
R
install.packages("tidyverse")
Load it (always start with this)
R
library(tidyverse)
4.1 Introduction to tidyverse & Pipes (%>%)
The pipe operator %>% (pronounced "then") makes code read like natural language: "Take this data → then do this → then do that".
Without pipe (classic R style)
R
mean(filter(students, age > 20)$marks)
With pipe (tidyverse style – much clearer)
R
students %>%
  filter(age > 20) %>%
  summarise(mean_marks = mean(marks))
Key benefits of piping:
Code reads from left to right (natural flow)
No need to create temporary variables
Easier to debug (run line by line)
Chain many operations cleanly
4.2 filter(), select(), arrange(), mutate(), summarise()
These are the five core dplyr verbs — learn them well and you can do 80% of data manipulation.
filter() – Keep rows matching condition
R
students %>% filter(age > 23 & marks >= 90)
select() – Choose columns (by name or position)
R
students %>% select(name, marks)       # keep only these columns
students %>% select(-age)              # drop age
students %>% select(starts_with("m"))  # columns starting with "m"
arrange() – Sort rows
R
students %>% arrange(desc(marks))       # highest marks first
students %>% arrange(age, desc(marks))  # age ascending, then marks descending
mutate() – Create or modify columns
R
students %>%
  mutate(
    percentage = marks / 100,
    grade = case_when(
      marks >= 90 ~ "A+",
      marks >= 80 ~ "A",
      TRUE ~ "B"
    )
  )
summarise() – Collapse data into single row (usually with group_by)
R
students %>%
  summarise(
    avg_marks = mean(marks),
    max_age = max(age),
    total_students = n()
  )
4.3 group_by() + summarise() – Powerful Aggregations
group_by() splits data into groups → summarise() computes per group.
Examples
R
# Average marks by gender (assumes the data has a gender column)
students %>%
  group_by(gender) %>%
  summarise(
    avg_marks = mean(marks),
    count = n(),
    highest = max(marks)
  )

# Multiple grouping variables (illustrative sales data)
sales %>%
  group_by(region, product) %>%
  summarise(
    total_sales = sum(sales_amount),
    avg_price = mean(price),
    .groups = "drop"  # removes grouping for the next step
  )
Tip: Always use .groups = "drop" in modern code to avoid unexpected behavior.
4.4 Joining Data (inner_join, left_join, full_join)
Joining combines two data frames based on common columns.
Common join types
inner_join – only matching rows
left_join – keep all rows from left table
right_join – keep all rows from right table
full_join – keep all rows from both
Example
R
students <- data.frame(
  id = 1:4,
  name = c("Anshuman", "Priya", "Rahul", "Sneha"),
  marks = c(92, 88, 85, 90)
)

scores <- data.frame(
  id = c(1, 2, 5),
  subject = c("Math", "Science", "Physics"),
  score = c(95, 90, 82)
)

# Left join – keep all students, add scores where available
left_join(students, scores, by = "id")

# Inner join – only students that have scores
inner_join(students, scores, by = "id")
Multiple keys / different column names
R
# If the key columns have different names in each table:
left_join(students, scores, by = c("id" = "student_id"))
4.5 tidyr – pivot_longer, pivot_wider, separate, unite
tidyr helps reshape data from wide to long format (and vice versa) — very common in data preparation.
pivot_longer – make wide data long (tidy format)
R
# Wide format
wide <- data.frame(
  id = 1:3,
  math = c(85, 90, 78),
  science = c(92, 88, 95),
  english = c(80, 82, 87)
)

# To long (tidy) format
long <- wide %>%
  pivot_longer(cols = math:english,
               names_to = "subject",
               values_to = "score")
print(long)
# id subject score
#  1 math       85
#  1 science    92
# ...
pivot_wider – opposite (long to wide)
R
long %>% pivot_wider(names_from = subject, values_from = score)
separate() & unite()
R
df <- data.frame(
  id = 1:3,
  name_age = c("Anshuman_25", "Priya_23", "Rahul_24")
)

# Split one column into two
df_sep <- df %>%
  separate(name_age, into = c("name", "age"), sep = "_") %>%
  mutate(age = as.integer(age))

# Opposite: combine columns back together
df_sep %>%
  unite("full_info", name, age, sep = " - ")
Mini Summary Project – Full Data Manipulation Pipeline
R
library(tidyverse)

# Sample messy (wide) data
sales_raw <- data.frame(
  region = c("North", "South", "East", "West"),
  Q1_2025 = c(12000, 15000, 9000, 18000),
  Q2_2025 = c(14000, 16000, 11000, 20000)
)

sales_raw %>%
  pivot_longer(cols = starts_with("Q"),
               names_to = "quarter",
               values_to = "sales") %>%
  separate(quarter, into = c("quarter", "year"), sep = "_") %>%
  mutate(sales_in_lakhs = sales / 100000) %>%
  group_by(region) %>%
  summarise(
    total_sales = sum(sales),
    avg_quarterly = mean(sales),
    best_quarter = max(sales)
  ) %>%
  arrange(desc(total_sales))
This completes the full Data Manipulation with dplyr & tidyverse section — now you can clean, transform, reshape, and summarize data like a pro in R!
5. Data Visualization with ggplot2
ggplot2 is the gold-standard visualization package in R — and one of the best statistical visualization systems in any language. It is based on the Grammar of Graphics by Leland Wilkinson, which breaks plots into logical layers.
Install & Load (if not already in tidyverse)
R
# install.packages("ggplot2")
library(ggplot2)
5.1 ggplot2 Grammar of Graphics – Core Logic
Every ggplot follows this structure:
R
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  geom_<TYPE>() +
  labs(...) +
  theme(...)
Key components:
data → the dataset (usually a data frame)
aes() → aesthetics: map variables to visual properties (x, y, color, size, fill, shape…)
geom_ → geometric objects: points, lines, bars, histograms, etc.
labs() → titles, axis labels, caption
theme() → appearance (fonts, colors, grid, background)
Basic template
R
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Car Weight vs MPG",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon") +
  theme_minimal()
5.2 Scatter Plots, Line Charts, Bar Plots & Histograms
Scatter Plot
R
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 4, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Weight vs MPG by Cylinders",
       color = "Number of Cylinders") +
  theme_bw()
Line Chart (time series or trend)
R
library(gapminder)

gapminder %>%
  filter(country %in% c("India", "China", "United States")) %>%
  ggplot(aes(x = year, y = lifeExp, color = country)) +
  geom_line(linewidth = 1.2) +  # `size` is deprecated for lines since ggplot2 3.4
  geom_point(size = 3) +
  labs(title = "Life Expectancy Over Time",
       x = "Year", y = "Life Expectancy (years)") +
  theme_minimal()
Bar Plot
R
ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar() +
  labs(title = "Diamond Cuts Distribution",
       x = "Cut Quality", y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme_light()
Histogram
R
ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black") +
  labs(title = "Price Distribution of Diamonds",
       x = "Price (USD)", y = "Frequency") +
  theme_classic()
5.3 Boxplots, Violin Plots & Density Plots
Boxplot
R
# `tips` dataset: data(tips, package = "reshape2")
ggplot(tips, aes(x = day, y = total_bill, fill = day)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 3) +
  labs(title = "Total Bill by Day",
       x = "Day", y = "Total Bill (USD)") +
  theme_minimal()
Violin Plot (shows density + boxplot)
R
ggplot(tips, aes(x = day, y = tip, fill = sex)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "white") +
  labs(title = "Tip Distribution by Day and Gender") +
  theme_light()
Density Plot
R
ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_density(alpha = 0.6) +
  labs(title = "Price Density by Cut Quality") +
  theme_minimal()
5.4 Faceting, Themes & Publication-Ready Plots
Faceting – Split plots by category
R
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class), size = 3, alpha = 0.7) +
  geom_smooth(method = "loess", color = "red") +
  facet_wrap(~ class, scales = "free_y") +
  labs(title = "Engine Displacement vs Highway MPG by Vehicle Class") +
  theme_minimal(base_size = 14)
Popular themes
R
theme_minimal()  # clean & modern
theme_bw()       # black & white
theme_classic()  # minimal lines
theme_light()    # light background
theme_dark()     # dark mode
Publication-ready plot template
R
p <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.5, size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Diamond Price vs Carat by Clarity",
       x = "Carat", y = "Price (USD)") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(face = "bold"),
    legend.position = "top"
  )

# Save a high-resolution version
ggsave("diamond_plot.png", plot = p, width = 10, height = 7, dpi = 300)
5.5 Advanced Visuals – Heatmaps, Correlation Plots, Marginal Plots
Heatmap (Correlation)
R
library(reshape2)  # for melt()

corr <- cor(mtcars)

ggplot(melt(corr), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), color = "black") +
  scale_fill_gradient2(low = "blue", high = "red",
                       mid = "white", midpoint = 0) +
  labs(title = "Correlation Heatmap – mtcars Dataset") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Marginal Plots (joint distribution)
R
library(ggExtra)

p <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl)), size = 3) +
  theme_minimal()

ggMarginal(p, type = "histogram", fill = "skyblue", color = "black")
Advanced Correlation Plot
R
library(GGally)

# Include cyl in the data so it can be used for coloring,
# and restrict the plotted panels to the four numeric variables
ggpairs(mtcars[, c("mpg", "hp", "wt", "qsec", "cyl")],
        aes(color = factor(cyl)),
        columns = 1:4,
        upper = list(continuous = "cor"))
This completes the full Data Visualization with ggplot2 section — now you can create beautiful, insightful, and publication-ready visualizations in R!
6. Exploratory Data Analysis (EDA) in R
Exploratory Data Analysis (EDA) is the process of investigating a dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions — before building any model. In R, EDA is extremely powerful thanks to tidyverse, ggplot2, and specialized packages.
Core goals of EDA
Understand data structure & quality
Identify missing values, outliers, errors
Discover relationships between variables
Detect patterns (trend, seasonality, clusters)
Guide feature engineering and modeling decisions
6.1 Summary Statistics & Descriptive Analysis
Start every EDA with a quick overview of the data.
Basic summary functions
R
library(tidyverse)

# Load example dataset
data("mtcars")

# Quick overview
glimpse(mtcars)      # structure & types
summary(mtcars)      # min, max, mean, median, quartiles
skimr::skim(mtcars)  # very detailed summary (install skimr first)
Custom summary by group
R
mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg, na.rm = TRUE),
    median_hp = median(hp),
    sd_wt = sd(wt),
    n = n()
  ) %>%
  arrange(desc(avg_mpg))
Best practice: Always use na.rm = TRUE and check for missing values first.
6.2 Handling Missing Values & Outliers
Detect missing values
R
# Count missing values per column
colSums(is.na(airquality))

# Percentage missing
colMeans(is.na(airquality)) * 100

# Visual overview (install naniar)
library(naniar)
vis_miss(airquality)
gg_miss_var(airquality)
Handling missing values
R
# 1. Drop rows with missing values
airquality_complete <- airquality %>% drop_na()
airquality %>% drop_na(Ozone)  # drop only rows where Ozone is missing

# 2. Impute with the mean (or median)
airquality %>%
  mutate(Ozone = if_else(is.na(Ozone),
                         mean(Ozone, na.rm = TRUE),
                         Ozone))

# 3. Last observation carried forward (time series)
library(zoo)
airquality$Ozone <- na.locf(airquality$Ozone, na.rm = FALSE)

# 4. Advanced imputation: see the missForest and mice packages
Detecting & handling outliers
R
# Visual check with a boxplot
ggplot(airquality, aes(y = Ozone)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Ozone Outliers")

# IQR method
Q1 <- quantile(airquality$Ozone, 0.25, na.rm = TRUE)
Q3 <- quantile(airquality$Ozone, 0.75, na.rm = TRUE)
iqr <- Q3 - Q1
lower <- Q1 - 1.5 * iqr
upper <- Q3 + 1.5 * iqr

# Flag outliers
airquality <- airquality %>%
  mutate(ozone_outlier = Ozone < lower | Ozone > upper)

# Winsorize (cap) outliers
airquality$Ozone_winsor <- pmin(pmax(airquality$Ozone, lower), upper)
Tip: Never blindly remove outliers — investigate first (measurement error? interesting case?).
6.3 Univariate, Bivariate & Multivariate EDA
Univariate (one variable)
R
# Categorical
ggplot(diamonds, aes(x = cut)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Diamond Cut Distribution")

# Numerical
ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 50, fill = "coral") +
  labs(title = "Price Distribution")

# Density with a boxplot overlay
ggplot(diamonds, aes(x = price)) +
  geom_density(fill = "lightgreen", alpha = 0.5) +
  geom_boxplot(width = 0.1, fill = "white") +
  labs(title = "Price Density & Boxplot")
Bivariate (two variables)
R
# Numeric vs numeric
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)), size = 3) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Weight vs MPG by Cylinders")

# Categorical vs numeric
ggplot(tips, aes(x = day, y = total_bill, fill = day)) +
  geom_boxplot() +
  labs(title = "Total Bill by Day")

# Categorical vs categorical
ggplot(tips, aes(x = sex, fill = smoker)) +
  geom_bar(position = "fill") +
  labs(title = "Smoking Status by Gender (proportions)")
Multivariate (three or more variables)
R
# Pair plot – include smoker so it can be used for coloring
GGally::ggpairs(tips[, c("total_bill", "tip", "size", "smoker")],
                aes(color = smoker),
                columns = 1:3)

# Faceted plot
ggplot(tips, aes(x = total_bill, y = tip)) +
  geom_point(aes(color = sex)) +
  facet_grid(time ~ day) +
  labs(title = "Tip vs Bill by Time & Day")
6.4 Automated EDA with DataExplorer / SmartEDA
Manual EDA takes time — automated tools generate full reports instantly.
DataExplorer (very popular)
R
# install.packages("DataExplorer") library(DataExplorer) # Generate full EDA report (HTML) create_report(airquality, output_file = "airquality_eda_report.html") # Quick plots plot_intro(airquality) plot_missing(airquality) plot_histogram(airquality) plot_correlation(airquality) plot_boxplot(airquality, by = "Month")
SmartEDA (alternative)
R
# install.packages("SmartEDA")
library(SmartEDA)
# Target variable analysis (if you have one); note the argument is op_file, not output_file
ExpReport(airquality, Target = NULL, op_file = "smarteda_report.html")
When to use automated EDA
First look at new dataset
Quick report for team/stakeholders
Identify issues before deep manual analysis
Mini Summary Project – Full EDA on Titanic Dataset
R
library(tidyverse)
library(DataExplorer)
df <- titanic::titanic_train

# 1. Quick overview
glimpse(df)
create_report(df, output_file = "titanic_eda.html")

# 2. Manual key plots (Survived is 0/1, so convert to factor for discrete fills)
ggplot(df, aes(x = Age, fill = factor(Survived))) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  labs(title = "Age Distribution by Survival")

ggplot(df, aes(x = Pclass, fill = factor(Survived))) +
  geom_bar(position = "fill") +
  labs(title = "Survival Rate by Passenger Class")

# 3. Correlation heatmap (numeric only)
df_numeric <- df %>% select(where(is.numeric))
corr <- cor(df_numeric, use = "pairwise.complete.obs")
corrplot::corrplot(corr, method = "color", type = "upper", tl.cex = 0.8)
This completes the full Exploratory Data Analysis (EDA) in R section — now you can deeply understand any dataset before modeling or reporting!
7. Statistical Analysis in R
R was originally created for statistics — it remains one of the most powerful environments for statistical computing in 2026. This section covers the most important statistical techniques used in research, academia, pharma, finance, and data science.
7.1 Descriptive vs Inferential Statistics
Descriptive Statistics → Describe, summarize, and visualize the data you have (the sample).
Common functions in R:
summary(), mean(), median(), sd(), var(), min(), max(), quantile()
table(), prop.table() for categorical data
skimr::skim() for detailed overview
Inferential Statistics → Use sample data to make generalizations / predictions about the population.
Common goals:
Hypothesis testing (is there a real difference?)
Confidence intervals (range where true value likely lies)
Regression (model relationships)
Quick example comparison
R
# Descriptive summary(airquality$Ozone) mean(airquality$Ozone, na.rm = TRUE) sd(airquality$Ozone, na.rm = TRUE) # Inferential (example later) t.test(airquality$Ozone ~ airquality$Month == 5)
7.2 Hypothesis Testing (t-test, ANOVA, Chi-square)
Hypothesis testing helps decide whether observed differences are statistically significant.
One-sample t-test (compare sample mean to known value)
R
# The default t.test() method has no na.action argument, so drop NAs first
t.test(na.omit(airquality$Ozone), mu = 30)
# p-value < 0.05 → reject null (mean ≠ 30)
Two-sample t-test (compare means of two groups)
R
t.test(Ozone ~ Month == 5, data = airquality, na.action = na.omit) # Welch's t-test by default (unequal variances)
Paired t-test (before-after, same subjects)
R
t.test(before, after, paired = TRUE)
ANOVA (compare means across 3+ groups)
R
anova_model <- aov(mpg ~ factor(cyl), data = mtcars) summary(anova_model) # Post-hoc test if significant TukeyHSD(anova_model)
Chi-square test (categorical association)
R
tbl <- table(mtcars$cyl, mtcars$gear) chisq.test(tbl)
Interpretation tip (2026):
p < 0.05 → statistically significant (evidence against null hypothesis)
p < 0.01 → very strong evidence
Always report effect size + confidence interval (p-value alone is incomplete)
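For instance, a group comparison can be reported with its confidence interval plus a standardized effect size — a sketch using the effsize package (an assumption; any effect-size package works):

```r
# Compare mpg for automatic vs manual cars (mtcars)
res <- t.test(mpg ~ am, data = mtcars)
res$p.value    # significance
res$conf.int   # 95% CI for the difference in means

# Cohen's d effect size – assumes install.packages("effsize")
library(effsize)
cohen.d(mtcars$mpg[mtcars$am == 1], mtcars$mpg[mtcars$am == 0])
```

Reporting the CI and d alongside p tells the reader how large the difference is, not just whether it exists.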
7.3 Correlation & Linear Regression
Correlation – measures linear relationship strength & direction
R
# Pearson correlation cor(mtcars$mpg, mtcars$hp) # -0.776 → strong negative # Spearman (rank-based, non-linear) cor(mtcars$mpg, mtcars$hp, method = "spearman") # Correlation matrix cor(mtcars[, c("mpg", "hp", "wt", "qsec")]) corrplot::corrplot(cor(mtcars), method = "color", type = "upper")
Simple Linear Regression
R
model <- lm(mpg ~ wt, data = mtcars) summary(model) # Look at: R-squared, p-value of coefficients, F-statistic # Plot with confidence intervals ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth(method = "lm", color = "red") + labs(title = "Linear Regression: MPG vs Weight")
Multiple Linear Regression
R
multi_model <- lm(mpg ~ wt + hp + cyl, data = mtcars) summary(multi_model)
7.4 Logistic Regression & Generalized Linear Models
Logistic Regression – for binary outcome (0/1, yes/no, success/failure)
R
# Titanic survival example titanic <- titanic::titanic_train %>% mutate(Survived = factor(Survived)) log_model <- glm(Survived ~ Pclass + Sex + Age, data = titanic, family = binomial(link = "logit")) summary(log_model) # Odds ratios exp(coef(log_model))
Generalized Linear Models (GLM)
family = gaussian → linear regression
family = binomial → logistic
family = poisson → count data
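To make the third family concrete, here is a minimal Poisson GLM sketch on the built-in warpbreaks count data:

```r
# Count outcome (breaks) modeled by wool type and tension
pois_model <- glm(breaks ~ wool + tension,
                  data = warpbreaks,
                  family = poisson(link = "log"))
summary(pois_model)
exp(coef(pois_model))   # exponentiated coefficients = rate ratios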
7.5 Non-parametric Tests & Post-hoc Analysis
Non-parametric – when data violates normality assumption
Wilcoxon rank-sum test (non-parametric t-test)
R
wilcox.test(mpg ~ vs, data = mtcars) # vs = engine type
Kruskal-Wallis (non-parametric ANOVA)
R
kruskal.test(mpg ~ factor(cyl), data = mtcars)
Post-hoc (after significant Kruskal-Wallis)
R
library(dunn.test) dunn.test(mtcars$mpg, mtcars$cyl, method = "bonferroni")
Mini Summary Project – Full Statistical Workflow
R
library(tidyverse) # Load data df <- read_csv("your_data.csv") # 1. Descriptive df %>% group_by(group) %>% summarise(mean = mean(outcome, na.rm = TRUE), sd = sd(outcome, na.rm = TRUE), n = n()) # 2. Visualization ggplot(df, aes(x = group, y = outcome)) + geom_boxplot() + geom_jitter(width = 0.2, alpha = 0.5) # 3. Test anova_result <- aov(outcome ~ group, data = df) summary(anova_result) # 4. Post-hoc if significant TukeyHSD(anova_result)
This completes the full Statistical Analysis in R section — now you can perform professional-grade statistical tests and interpret results correctly!
8. Machine Learning with R
R has excellent support for machine learning — especially for statistical modeling, interpretable models, and research-oriented workflows. In 2026, two main frameworks dominate: caret (classic, still widely used) and tidymodels (modern, tidyverse-integrated, recommended for new projects).
8.1 Supervised Learning – Regression & Classification
Supervised learning = predict an outcome variable (label) from input features.
Regression → continuous target (price, temperature, sales)
Classification → categorical target (yes/no, spam/not-spam, 0/1/2)
Common algorithms in R
Linear / Logistic Regression
Decision Trees & Random Forest
Gradient Boosting (XGBoost, LightGBM, CatBoost)
Support Vector Machines
k-Nearest Neighbors
8.2 caret vs tidymodels – Two Main ML Frameworks
| Feature | caret (older, still very popular) | tidymodels (modern, tidyverse-style) | Winner in 2026 |
|---|---|---|---|
| Syntax | Functional, base-R style | Consistent, pipe-friendly, tidyverse ecosystem | tidymodels |
| Preprocessing | Built-in preProcess() | recipes package (very powerful) | tidymodels |
| Model tuning | train() with grid | tune_grid(), tune_bayes() | tidymodels |
| Workflow | Manual steps | workflow() – combines recipe + model | tidymodels |
| Community momentum | Large legacy user base | Rapidly growing, Posit-supported | tidymodels |
| Learning curve | Moderate | Slightly steeper at first, then easier | — |
Recommendation (2026): Use tidymodels for all new work — it’s more readable, reproducible, and integrates perfectly with tidyverse. Learn caret only if maintaining legacy code.
Quick tidymodels example
R
library(tidymodels)

# Split data
set.seed(42)
split <- initial_split(iris, prop = 0.8, strata = Species)
train_data <- training(split)
test_data <- testing(split)

# Recipe (preprocessing)
rec <- recipe(Species ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())

# Model
rf_model <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# Workflow
wf <- workflow() %>% add_recipe(rec) %>% add_model(rf_model)

# Fit
fit <- wf %>% fit(data = train_data)

# Predict & evaluate
predictions <- predict(fit, test_data)
accuracy <- accuracy_vec(test_data$Species, predictions$.pred_class)
print(accuracy)
8.3 Random Forest, XGBoost & Gradient Boosting in R
Random Forest (bagging ensemble – very robust)
R
# tidymodels way
rf_spec <- rand_forest(trees = tune(), min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger")

# Tune
tune_res <- tune_grid(
  rf_spec,
  mpg ~ .,
  resamples = vfold_cv(mtcars, v = 5),
  grid = 10
)
best_params <- select_best(tune_res, metric = "rmse")   # metric must be named in current tune versions
final_model <- finalize_model(rf_spec, best_params)
XGBoost (gradient boosting – often top performer)
R
# Install: install.packages("xgboost") library(xgboost) # Prepare data (matrix format required) X <- as.matrix(mtcars[, -1]) y <- mtcars$mpg xgb_model <- xgboost( data = X, label = y, nrounds = 100, objective = "reg:squarederror", eta = 0.1, max_depth = 6 ) # Prediction pred <- predict(xgb_model, X) rmse <- sqrt(mean((y - pred)^2)) print(rmse)
Gradient Boosting comparison (2026)
XGBoost → fastest, most accurate, GPU support
LightGBM → even faster on large data
CatBoost → best for categorical features out-of-the-box
8.4 Model Evaluation – Cross-validation, ROC-AUC, Confusion Matrix
Cross-validation in tidymodels
R
folds <- vfold_cv(train_data, v = 10, strata = Species)
metrics <- metric_set(accuracy, roc_auc)

# fit_resamples() needs a workflow (or model + preprocessor) – reuse wf from 8.2
res <- fit_resamples(
  wf,
  resamples = folds,
  metrics = metrics
)
collect_metrics(res)
Confusion Matrix & ROC-AUC
R
# Classification example – conf_mat() expects a data frame with truth + estimate columns
results <- bind_cols(test_data, predict(fit, test_data))
confusion <- conf_mat(results, truth = Species, estimate = .pred_class)
confusion %>% autoplot(type = "heatmap")

# ROC-AUC (binary classification)
roc_auc_vec(truth = y_test, estimate = prob_positive_class)
Regression metrics
RMSE, MAE, R² (rsq_trad)
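These can be computed with yardstick's data-frame interface — a sketch using mtcars as a stand-in regression:

```r
library(yardstick)
library(tibble)

model <- lm(mpg ~ wt + hp, data = mtcars)
results <- tibble(truth = mtcars$mpg, estimate = predict(model))

rmse(results, truth = truth, estimate = estimate)       # root mean squared error
mae(results, truth = truth, estimate = estimate)        # mean absolute error
rsq_trad(results, truth = truth, estimate = estimate)   # traditional R²
```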
8.5 Unsupervised Learning – Clustering (k-means, hierarchical)
K-Means Clustering
R
# Scale data first! data_scaled <- scale(mtcars[, c("mpg", "hp", "wt")]) # K-means km <- kmeans(data_scaled, centers = 3, nstart = 25) # Visualize mtcars$cluster <- factor(km$cluster) ggplot(mtcars, aes(x = mpg, y = hp, color = cluster)) + geom_point(size = 4) + labs(title = "K-Means Clustering – mtcars")
Hierarchical Clustering
R
dist_matrix <- dist(data_scaled, method = "euclidean") hc <- hclust(dist_matrix, method = "complete") plot(hc, main = "Hierarchical Clustering Dendrogram") rect.hclust(hc, k = 3, border = "red")
Choosing k (number of clusters)
R
library(factoextra) fviz_nbclust(data_scaled, kmeans, method = "wss") + labs(title = "Elbow Method") fviz_nbclust(data_scaled, kmeans, method = "silhouette") # Silhouette score
Mini Summary Project – Customer Segmentation
R
library(tidyverse) # Load sample customer data (or your own) df <- read_csv("customer_data.csv") # Preprocess df_scaled <- df %>% select(age, annual_income, spending_score) %>% scale() # K-means k <- 5 km <- kmeans(df_scaled, centers = k, nstart = 25) df$segment <- factor(km$cluster) # Visualize ggplot(df, aes(x = annual_income, y = spending_score, color = segment)) + geom_point(size = 4) + labs(title = "Customer Segments", subtitle = "Based on Income & Spending Score")
This completes the full Machine Learning with R section — now you can build, evaluate, and deploy real ML models in R!
9. Time Series Analysis & Forecasting
Time series data is any data collected over time at regular intervals (daily sales, monthly temperature, hourly stock prices, yearly population, etc.). Forecasting = predicting future values based on past patterns.
R has one of the strongest time series ecosystems — especially for classical statistical forecasting.
9.1 Time Series Objects – ts, xts, zoo
R has several classes for handling time series data.
ts – The classic base R time series class (regular frequency)
R
# Monthly data starting from Jan 2020 sales <- c(120, 135, 148, 162, 175, 190, 210, 225, 240, 255, 270, 290) ts_sales <- ts(sales, start = c(2020, 1), frequency = 12) print(ts_sales) plot(ts_sales, main = "Monthly Sales", ylab = "Sales", xlab = "Time")
zoo – Irregular time series (very flexible)
R
library(zoo) dates <- as.Date(c("2025-01-01", "2025-01-05", "2025-01-12", "2025-01-20")) values <- c(100, 120, 115, 140) z <- zoo(values, dates) plot(z, main = "Irregular Time Series", ylab = "Value")
xts – Modern, high-performance extension of zoo (most recommended in 2026 for financial/time series)
R
library(xts) dates <- seq(as.Date("2025-01-01"), by = "day", length.out = 30) values <- cumsum(rnorm(30)) + 100 xts_data <- xts(values, order.by = dates) plot(xts_data, main = "Daily Random Walk", ylab = "Value") # Subsetting xts_data["2025-01-10/2025-01-20"] # slice by date range
Quick recommendation (2026):
Use ts for simple, regular, monthly/quarterly data
Use xts for financial, daily, or irregular high-frequency data
Use zoo only if you need very old compatibility
9.2 Decomposition – Trend, Seasonality, Remainder
Decomposition breaks a time series into:
Trend – long-term direction
Seasonal – repeating pattern
Remainder (residual) – random noise
Classical decomposition (additive/multiplicative)
R
# AirPassengers dataset (built-in) data("AirPassengers") plot(AirPassengers) # Additive decomposition decomp_add <- decompose(AirPassengers, type = "additive") plot(decomp_add) # Multiplicative (better for increasing variance) decomp_mult <- decompose(AirPassengers, type = "multiplicative") plot(decomp_mult)
STL decomposition (more robust – handles changing seasonality)
R
library(forecast) stl_decomp <- stl(AirPassengers, s.window = "periodic") plot(stl_decomp)
Seasonal decomposition with X-13-ARIMA-SEATS (very advanced, used in official statistics)
R
library(seasonal) seas_decomp <- seas(AirPassengers) plot(seas_decomp)
Key takeaway: Use STL for most modern work — it’s robust and handles most real-world series well.
9.3 ARIMA & SARIMA Models
ARIMA (AutoRegressive Integrated Moving Average) is the classic statistical forecasting model.
Components
AR(p) – autoregression (depends on past values)
I(d) – differencing (make stationary)
MA(q) – moving average (depends on past errors)
SARIMA adds seasonal components (P,D,Q,m)
Step-by-step ARIMA in R
R
library(forecast)
library(tseries)   # adf.test() lives in tseries, not forecast

# 1. Check stationarity (ADF test)
adf.test(AirPassengers)   # p > 0.05 → non-stationary

# Difference once
diff_series <- diff(AirPassengers, differences = 1)
adf.test(diff_series)     # now stationary

# 2. Auto ARIMA (best automatic choice)
auto_model <- auto.arima(AirPassengers, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
summary(auto_model)

# 3. Forecast
fc <- forecast(auto_model, h = 24)   # 24 months ahead
plot(fc, main = "ARIMA Forecast – Air Passengers")
Manual SARIMA (when you know parameters)
R
# seasonal takes a list of order + period (or a length-3 order vector)
sarima_model <- Arima(AirPassengers,
                      order = c(0, 1, 1),
                      seasonal = list(order = c(0, 1, 1), period = 12))
checkresiduals(sarima_model)   # residuals should be white noise
fc_manual <- forecast(sarima_model, h = 12)
plot(fc_manual)
9.4 Prophet & forecast Package
Prophet (by Facebook/Meta) – Extremely easy and powerful for business time series
R
library(prophet)
library(zoo)   # for as.yearmon()

# Prophet needs a data frame with columns ds (date) and y (value);
# convert the ts object explicitly instead of relying on rownames
df_prophet <- data.frame(
  ds = as.Date(as.yearmon(time(AirPassengers))),
  y  = as.numeric(AirPassengers)
)

# Fit model
m <- prophet(df_prophet, yearly.seasonality = TRUE, weekly.seasonality = FALSE)

# Future dataframe
future <- make_future_dataframe(m, periods = 24, freq = "month")

# Forecast
forecast_prophet <- predict(m, future)

# Plot
plot(m, forecast_prophet)
prophet_plot_components(m, forecast_prophet)
forecast package – Traditional, comprehensive, still very strong
R
library(forecast) fit <- ets(AirPassengers) # Exponential smoothing fc_ets <- forecast(fit, h = 24) plot(fc_ets) # Or auto.arima as shown earlier
When to choose (2026):
Prophet → Business forecasting, strong seasonality, holidays, missing data
ARIMA/SARIMA → Classical stats, high accuracy needed, research
ETS → Exponential smoothing (good for short-term)
9.5 Real-world Forecasting Project
Project: Monthly Sales Forecasting (End-to-End)
R
library(tidyverse)
library(forecast)
library(prophet)

# 1. Load & prepare (assume you have monthly sales data)
# as.Date() needs a day component, so append "-01" to "YYYY-MM" strings
sales <- read_csv("monthly_sales.csv") %>%
  mutate(ds = as.Date(paste0(month_year, "-01")), y = sales_amount)

# 2. EDA
ggplot(sales, aes(x = ds, y = y)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(title = "Monthly Sales Trend", x = "Date", y = "Sales") +
  theme_minimal()

# 3. Prophet model
m <- prophet(sales, yearly.seasonality = TRUE)
future <- make_future_dataframe(m, periods = 12, freq = "month")
fc <- predict(m, future)
plot(m, fc) + labs(title = "Prophet Forecast – Monthly Sales")

# 4. ARIMA alternative (wrap in ts() so the seasonal structure is known)
auto_fit <- auto.arima(ts(sales$y, frequency = 12), seasonal = TRUE)
fc_arima <- forecast(auto_fit, h = 12)
plot(fc_arima)

# 5. Compare & choose the best model
accuracy(fc_arima)   # RMSE, MAE, etc. (training-set accuracy)
Key Takeaways from Project:
Always visualize trend & seasonality first
Prophet is easiest for business users
ARIMA gives more control & statistical diagnostics
Evaluate with hold-out set or cross-validation (tsCV in forecast)
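The tsCV() idea in the last point looks like this — rolling-origin errors for an ARIMA forecaster (refitting at every origin, so it can be slow):

```r
library(forecast)

fc_fun <- function(y, h) forecast(auto.arima(y), h = h)
errors <- tsCV(AirPassengers, fc_fun, h = 12)   # matrix of h-step-ahead errors
sqrt(mean(errors^2, na.rm = TRUE))              # overall CV RMSE
```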
This completes the full Time Series Analysis & Forecasting section — now you can confidently analyze and predict time-based data in R!
10. R Markdown & Reproducible Reports
R Markdown is one of the most powerful features of R — it allows you to combine executable R code, narrative text, results, tables, figures, and references into a single document. The output can be HTML, PDF, Word, presentations, dashboards, websites, and more — all fully reproducible.
In 2026, Quarto has become the modern successor to R Markdown (recommended for new projects), but R Markdown is still widely used and supported.
10.1 Creating Dynamic Reports with R Markdown
Basic structure of an R Markdown (.Rmd) file
Rmd
---
title: "My First Reproducible Report"
author: "Anshuman"
date: "March 2026"
output: html_document
---

# Introduction

This is a sample report using R Markdown.

## Summary Statistics

```{r summary, echo=TRUE}
summary(mtcars)
```

## Visualization

```{r mpg-plot}
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm") +
  labs(title = "Weight vs MPG by Cylinders")
```

How to create & render
1. In RStudio: File → New File → R Markdown → choose output format → OK
2. Write text in Markdown + code in `{r}` chunks
3. Click the Knit button (or Ctrl+Shift+K) to render

Key chunk options (very useful)

R
```{r chunk-name, echo=FALSE, warning=FALSE, message=FALSE, fig.width=10, fig.height=6, eval=TRUE}
# code here
```

- `echo = FALSE` → hide code, show only output
- `warning = FALSE / message = FALSE` → hide warnings/messages
- `fig.width / fig.height` → control figure size
- `eval = FALSE` → don't run chunk (useful for setup)

10.2 Parameters, Tables, Figures & Citations

Parameterized reports (run with different inputs)

YAML
---
title: "Sales Report"
params:
  region: "North"
  year: 2025
---
Use inside document:
R
# Sales for `r params$region` in `r params$year`
Beautiful tables
R
library(knitr) library(kableExtra) mtcars %>% head(10) %>% kbl(caption = "First 10 Rows of mtcars") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Figures with captions & references
R
```{r scatter-plot, fig.cap="Scatter plot of MPG vs Weight"}
ggplot(mtcars, aes(wt, mpg)) + geom_point()
```

Citations & bibliography

YAML
---
bibliography: references.bib
---

See @wickham2019 for more on tidyverse.
references.bib file example:
text
@book{wickham2019,
  title     = {Advanced {R}},
  author    = {Wickham, Hadley},
  year      = {2019},
  publisher = {Chapman and Hall/CRC}
}
10.3 Converting to HTML, PDF, Word
HTML (default – interactive, fast)
YAML
output: html_document
PDF (professional, publication-ready)
YAML
output: pdf_document
Requires LaTeX (install TinyTeX: tinytex::install_tinytex())
Word (.docx)
YAML
output: word_document
Multiple formats at once
YAML
output:
  html_document: default
  pdf_document: default
  word_document: default
Custom themes & CSS
YAML
output:
  html_document:
    theme: cosmo       # or cerulean, journal, flatly, darkly, etc.
    highlight: tango
    css: styles.css
10.4 Quarto – The Modern Replacement (2026 Standard)
Quarto (released by Posit in 2022) is the next-generation successor to R Markdown. It supports R, Python, Julia, and Observable — one tool for all.
Key advantages over R Markdown (2026)
Unified syntax across languages
Better PDF output (native LaTeX control)
Built-in support for interactive plots, code folding, tabs, callouts
Freeze computation (cache results)
Websites, books, presentations, manuscripts from one source
Basic Quarto document (.qmd)
Qmd
---
title: "My Quarto Report"
author: "Anshuman"
format:
  html: default
  pdf: default
execute:
  echo: true
  warning: false
---

# Introduction

This is a Quarto document.

## Summary

```{r}
summary(mtcars)
```

## Plot

```{r}
#| label: mpg-plot
#| fig-cap: "MPG vs Weight"
#| fig-width: 8
#| fig-height: 5
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm")
```

Render Quarto

Bash
quarto render document.qmd
# or in RStudio: click the Render button
Quarto features you should know
Callouts: ::: {.callout-note} … :::
Tabs: ::: {.panel-tabset}
Code folding: code-fold: true
Cross-references: @fig-mpg-plot
Citation: @wickham2019
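The callout and tabset syntax from the list above looks like this in a .qmd file (a minimal sketch; block names follow the Quarto docs):

```markdown
::: {.callout-note}
This renders as a highlighted note box.
:::

::: {.panel-tabset}
## Plot

(content of the first tab)

## Table

(content of the second tab)
:::
```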
Recommendation (2026):
Use Quarto for all new work
Keep R Markdown only for legacy projects or when Quarto is not yet supported
Mini Summary Project – Reproducible Report Create a new .qmd file in RStudio:
Qmd
---
title: "Titanic Survival Analysis"
format: html
execute:
  echo: false
  warning: false
---

```{r setup}
library(tidyverse)
library(knitr)
```

# Data Overview

```{r}
titanic <- titanic::titanic_train
kable(head(titanic))
```

# Survival by Class

```{r}
#| fig-cap: "Survival Rate by Passenger Class"
titanic %>%
  ggplot(aes(x = factor(Pclass), fill = factor(Survived))) +
  geom_bar(position = "fill") +
  labs(x = "Class", y = "Proportion")
```
This completes the full R Markdown & Reproducible Reports section — now you can create dynamic, professional, fully reproducible reports and publications in R!
11. Real-World Projects & Portfolio Building
These five practical projects combine everything you’ve learned — from data import and manipulation to visualization, statistical analysis, machine learning, time series, and reproducible reporting. They are designed to be portfolio-ready, interview-impressive, and real-world applicable.
11.1 Project 1: Exploratory Analysis & Dashboard (ggplot2 + flexdashboard)
Goal: Perform complete EDA on a dataset and present it as an interactive dashboard.
Tools used: tidyverse, ggplot2, flexdashboard
Steps & Code Structure (save as dashboard.Rmd)
Rmd
---
title: "Exploratory Analysis Dashboard – Titanic Dataset"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
runtime: shiny
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(ggplot2)
library(DT)
library(plotly)
library(titanic)        # provides titanic_train
df <- titanic_train
```

Column {data-width=600}
-----------------------------------------------------------------------

### Data Overview

```{r}
DT::datatable(df, filter = "top", options = list(pageLength = 10, scrollX = TRUE))
```

### Key Insights

- Total passengers: `r nrow(df)`
- Survival rate: `r round(mean(df$Survived, na.rm = TRUE) * 100, 1)`%
- Missing Age values: `r sum(is.na(df$Age))`

Column {data-width=400}
-----------------------------------------------------------------------

### Age Distribution by Survival

```{r}
ggplot(df, aes(x = Age, fill = factor(Survived))) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  labs(title = "Age vs Survival", fill = "Survived (1=Yes)") +
  theme_minimal()
```

### Fare by Class

```{r}
ggplot(df, aes(x = factor(Pclass), y = Fare, fill = factor(Pclass))) +
  geom_boxplot(outlier.shape = 21) +
  labs(title = "Fare Distribution by Passenger Class") +
  theme_minimal()
```

Value Boxes
-----------------------------------------------------------------------

### Total Passengers

```{r}
valueBox(nrow(df), icon = "fa-users", color = "primary")
```

### Survival Rate

```{r}
valueBox(paste0(round(mean(df$Survived, na.rm = TRUE) * 100, 1), "%"), icon = "fa-heartbeat", color = "success")
```

### Average Fare

```{r}
valueBox(paste0("₹", round(mean(df$Fare, na.rm = TRUE), 1)), icon = "fa-money-bill-wave", color = "warning")
```
How to run: Knit → Save as HTML → Open in browser (interactive)

Key Takeaways: flexdashboard is perfect for quick, interactive EDA reports.

11.2 Project 2: Customer Churn Prediction (Classification)

Goal: Predict which customers will churn using classification models.

Dataset: Telco Customer Churn (Kaggle)

Steps & Code

R
library(tidyverse)
library(tidymodels)
library(themis)   # for SMOTE

# 1. Load & clean
df <- read_csv("telco_churn.csv") %>%
  janitor::clean_names() %>%
  mutate(churn = factor(churn, levels = c("No", "Yes"))) %>%
  select(-customer_id)

# 2. Split & recipe
set.seed(42)
split <- initial_split(df, prop = 0.8, strata = churn)
train <- training(split)
test <- testing(split)

rec <- recipe(churn ~ ., data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>%
  step_smote(churn) %>%   # handle imbalance
  step_normalize(all_numeric_predictors())

# 3. Model spec
rf_spec <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# 4. Workflow & fit
wf <- workflow() %>% add_recipe(rec) %>% add_model(rf_spec)
fit <- wf %>% fit(data = train)

# 5. Evaluate – yardstick metrics take a data frame of truth + estimate columns
results <- test %>%
  bind_cols(predict(fit, test)) %>%
  bind_cols(predict(fit, test, type = "prob"))
print(conf_mat(results, truth = churn, estimate = .pred_class))
# "Yes" is the second factor level, so tell roc_auc which level is the event
print(roc_auc(results, truth = churn, .pred_Yes, event_level = "second"))
Key Takeaways: Use themis::step_smote() for imbalance. Focus on Recall & ROC-AUC for churn problems.
11.3 Project 3: Sales Forecasting (Time Series)
Goal: Forecast monthly sales using Prophet and ARIMA.
Code
R
library(prophet) library(forecast) library(tidyverse) # Assume monthly_sales.csv has columns: date (YYYY-MM-01), sales df <- read_csv("monthly_sales.csv") %>% mutate(ds = as.Date(date), y = sales) # Prophet m <- prophet(df, yearly.seasonality = TRUE) future <- make_future_dataframe(m, periods = 12, freq = "month") fc_prophet <- predict(m, future) plot(m, fc_prophet) # ARIMA ts_data <- ts(df$y, frequency = 12, start = c(2020, 1)) fit_arima <- auto.arima(ts_data) fc_arima <- forecast(fit_arima, h = 12) plot(fc_arima)
Key Takeaways: Prophet is easier for business users; ARIMA gives more statistical control.
11.4 Project 4: Sentiment Analysis on Reviews
Goal: Classify product reviews as positive/negative.
Code (using tidytext)
R
library(tidytext) library(textdata) reviews <- read_csv("amazon_reviews.csv") # Tokenize & sentiment review_words <- reviews %>% unnest_tokens(word, review_text) %>% inner_join(get_sentiments("bing")) sentiment_summary <- review_words %>% count(word, sentiment) %>% pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% mutate(score = positive - negative) # Visualize top words sentiment_summary %>% arrange(desc(score)) %>% slice_head(n = 20) %>% ggplot(aes(reorder(word, score), score, fill = score > 0)) + geom_col() + coord_flip() + labs(title = "Top Sentiment Words in Reviews")
Key Takeaways: tidytext + Bing lexicon = simple & effective baseline.
11.5 Creating a Professional Portfolio (GitHub + RPubs)
Portfolio Structure (2026 standard)
GitHub repo for each project
README.md with:
Project goal
Dataset description
Key findings & visuals
Code walkthrough
Live link (RPubs, ShinyApps.io, Quarto Pub)
RPubs / Quarto Pub for rendered reports/dashboards
Personal portfolio website (Quarto website or GitHub Pages)
Best Practices
Use meaningful repo names (e.g., customer-churn-prediction-r)
Add screenshots & GIFs in README
Include requirements.txt equivalent → sessionInfo() or renv.lock
Pin top 6 projects on GitHub profile
Add badges: R version, license, stars
Final Advice: Publish 4–6 high-quality projects, write blogs explaining your thought process, and share them on LinkedIn, Kaggle, RStudio Community, and Reddit (r/rstats, r/datascience). You now have a strong R portfolio!
This completes the full Real-World Projects & Portfolio Building section — and the entire R Programming Mastery tutorial!
12. Best Practices, Career Guidance & Next Steps
You’ve now completed a comprehensive journey through R programming — from basics to advanced data manipulation, visualization, statistical modeling, machine learning, time series, and reproducible reporting. This final section focuses on professional habits, industry applications, Git workflow, interview preparation, career paths, and resources to help you succeed in 2026 and beyond.
12.1 Writing Clean, Reproducible & Production-Ready R Code
Clean and reproducible code is what separates hobbyists from professionals in R.
Core Best Practices (2026 Standard)
Follow Tidyverse Style Guide & use modern tools
Use snake_case for objects/functions
Consistent spacing & indentation
Pipe (%>%) for readability
R
# Auto-format a script to the tidyverse style guide
styler::style_file("script.R")
# or use lintr + styler inside RStudio
Always use projects & here package
R
library(here) read_csv(here("data", "sales.csv"))
→ No more broken paths when moving files
Reproducibility
Set random seed: set.seed(42)
Use renv or groundhog for package versions
Document session info: sessionInfo()
Prefer Quarto over R Markdown for new work
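A typical renv workflow for the package-versioning point above (function names per the renv docs):

```r
# One-time per project
install.packages("renv")
renv::init()        # creates a project library + renv.lock

# After installing/updating packages
renv::snapshot()    # records exact versions in renv.lock

# On another machine / CI
renv::restore()     # reinstalls the locked versions
```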
Production-ready tips
Avoid global variables
Write functions instead of copy-paste code
Use assertthat or checkmate for input validation
Add error handling: tryCatch()
Log messages: logger package or message()
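Putting the validation, error-handling, and logging tips together — a small hedged sketch (safe_read is an illustrative helper, not a standard function):

```r
# Read a CSV defensively: validate input, catch failures, log a message
safe_read <- function(path) {
  stopifnot(is.character(path), length(path) == 1)
  tryCatch(
    readr::read_csv(path),
    error = function(e) {
      message("Failed to read ", path, ": ", conditionMessage(e))
      NULL   # caller checks for NULL instead of crashing
    }
  )
}
```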
Code structure for large projects
text
project/ ├── R/ # functions & scripts ├── data/ # raw & processed ├── output/ # figures, tables ├── reports/ # Quarto/Rmd files ├── tests/ # testthat tests ├── renv.lock # package versions └── main.qmd
12.2 R in Industry – Shiny Apps, R Packages, APIs
Shiny – Build interactive web apps directly from R
Simple Shiny app example
R
library(shiny) ui <- fluidPage( titlePanel("Interactive MPG Explorer"), sidebarLayout( sidebarPanel( sliderInput("hp", "Horsepower:", min = 50, max = 350, value = c(100, 200)) ), mainPanel( plotOutput("mpgPlot") ) ) ) server <- function(input, output) { output$mpgPlot <- renderPlot({ mtcars %>% filter(hp >= input$hp[1], hp <= input$hp[2]) %>% ggplot(aes(wt, mpg)) + geom_point(size = 4, alpha = 0.7) + theme_minimal() }) } shinyApp(ui = ui, server = server)
Deployment options (2026):
shinyapps.io (free tier available)
Posit Connect (enterprise)
Docker + RStudio Server / Shiny Server
Combine with FastAPI/Plumber for hybrid apps
Building R Packages
Use devtools & usethis
Structure: R/ (functions), tests/, man/ (documentation), DESCRIPTION
Publish to CRAN or GitHub
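A minimal usethis/devtools sequence for the steps above (a sketch; the package path and function names "greet" are illustrative):

```r
library(usethis)

create_package("~/mypkg")   # DESCRIPTION, NAMESPACE, R/ skeleton
use_r("greet")              # creates R/greet.R for your function
use_test("greet")           # matching testthat test file
use_mit_license()           # add a license
devtools::document()        # generate man/ pages from roxygen comments
devtools::check()           # run R CMD check before release
```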
R APIs with Plumber
R
# plumber.R – pr() takes a file path, so save the endpoint in its own file
#* @get /predict
function(mpg, hp) {
  # query parameters arrive as strings, so convert before predicting
  predict(lm_model, newdata = data.frame(mpg = as.numeric(mpg), hp = as.numeric(hp)))
}

# In a separate script: build and run the API from that file
library(plumber)
pr("plumber.R") %>% pr_run(port = 8000)
12.3 Git & GitHub Workflow for R Users
Recommended workflow (2026):
Create repo on GitHub
Clone locally: git clone https://github.com/username/repo.git
Create branch: git checkout -b feature/eda-report
Work → stage → commit:
Bash
git add . git commit -m "Add EDA dashboard and summary stats"
Push: git push origin feature/eda-report
Create Pull Request → review → merge
Delete branch after merge
R-specific tips
Add .Rproj to .gitignore (optional)
Never commit large data files → use Git LFS or external storage
Use usethis::use_git() & usethis::use_github() to initialize
Add GitHub Actions for linting & testing
12.4 Top R Interview Questions & Answers
Frequently asked in 2026:
What is the difference between data.frame and tibble? → tibble is stricter (no partial matching), prints better, never changes types automatically
Explain the pipe operator %>% vs native |> → %>% from magrittr → more features; |> is base R (faster, built-in)
How to handle missing values in R? → na.omit(), drop_na(), replace_na(), mice / missForest for imputation
Difference between lapply and sapply? → lapply returns list, sapply simplifies to vector/matrix
What is tidy data? → Each variable = column, each observation = row, each type of observational unit = table
How to reshape data in tidyverse? → pivot_longer() (wide → long), pivot_wider() (long → wide)
Explain group_by() + summarise() vs mutate() → summarise() collapses groups, mutate() keeps rows
What is ggplot2 grammar of graphics? → Data + Aesthetics + Geometries + Scales + Facets + Themes
How to perform t-test in R? → t.test(x, y) or t.test(outcome ~ group, data = df)
Difference between lm() and glm()? → lm() for linear regression, glm() for generalized (logistic, poisson, etc.)
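Several of the answers above can be checked directly at the console using datasets that ship with R (mtcars):

```r
# lapply vs sapply: lapply returns a list, sapply simplifies
x <- list(a = 1:3, b = 4:6)
lapply(x, mean)   # a list: $a = 2, $b = 5
sapply(x, mean)   # a named numeric vector: c(a = 2, b = 5)
stopifnot(identical(sapply(x, mean), c(a = 2, b = 5)))

# lm() vs glm(): linear regression vs logistic regression
fit_lm  <- lm(mpg ~ wt, data = mtcars)
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# t.test() with the formula interface: mpg by transmission type
t.test(mpg ~ am, data = mtcars)
```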
12.5 Career Paths – Data Analyst, Biostatistician, Researcher, Data Scientist
Main Career Tracks in R-heavy domains (2026):
| Role | Primary Skills in R | Typical Employers | India Salary (₹ LPA) | Global Salary (USD/year) |
| --- | --- | --- | --- | --- |
| Data Analyst | dplyr, ggplot2, R Markdown, SQL | Consulting, BFSI, E-commerce | 5–14 | $65k–$100k |
| Biostatistician | Survival analysis, mixed models, clinical trials | Pharma, CROs, Hospitals, Research | 10–30 | $90k–$160k |
| Academic Researcher | Advanced stats, reproducible reports, packages | Universities, Research Institutes | 8–25 | $70k–$140k |
| Data Scientist | tidyverse + ML (caret/tidymodels), Shiny | Tech, Finance, Healthcare | 12–35 | $100k–$180k |
| Statistical Programmer | CDISC standards, SAS/R integration | Pharma, Clinical Research | 12–28 | $90k–$150k |
High-demand skills in R ecosystem (2026):
tidyverse mastery
Quarto / R Markdown
Shiny apps
Statistical modeling (survival, longitudinal)
Reproducible research & reporting
This completes your full R Programming Mastery tutorial! You are now equipped to write professional R code, build impactful projects, and pursue exciting careers in statistics, data science, and research.