Data Analysis Using R

Professor Dr. Md. Kamrul Hasan

Why R?

  • R is a free software environment for statistical computing and graphics.
  • It is widely used among statisticians and data miners for developing statistical software and data analysis.
  • R is highly extensible and has a large number of packages for various statistical techniques.
  • R is command-driven, so what you can do is limited only by the code you write, not by the options in a menu.
  • Code is easy to share, which makes analyses easier to reproduce.

Installation

  • Download R from CRAN and RStudio from Posit.
  • Install R first, then RStudio, an integrated development environment (IDE) for R.
  • Open RStudio, set your working directory, and adjust other options under Tools > Global Options…
  • Explore the RStudio interface, including the console, script editor, environment pane, and plots pane.
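Once both programs are installed, a quick check in the console confirms that R runs and shows which folder it is working in. The lines below are a minimal sketch using base R only; tempdir() stands in for a project folder of your own and is an assumption made for illustration.

```r
# Confirm that R is installed and see where files are read from and written to.
R.version.string        # version of R that RStudio is using
getwd()                 # current working directory

# Point R at a different folder. tempdir() is only a stand-in here;
# replace it with the path to your own project folder.
old_dir <- setwd(tempdir())   # setwd() returns the previous directory
getwd()
setwd(old_dir)                # restore the original working directory
```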

What will we cover?

  • Research design and data collection
  • Data entry and cleaning
  • Basic R syntax and data types
  • Data structures: vectors, matrices, lists, and data frames
  • Importing and exporting data
  • Data manipulation with dplyr and tidyr
  • Data visualization with ggplot2
  • Statistical analysis: chi-square test, t-tests, ANOVA, correlation, regression

Basic R Syntax

# This is a comment
x <- 5  # Assigning a value to a variable x
3 -> y  # Rightward assignment: stores 3 in y
z = 8  # = also assigns a value, here to z
x # Print the value of x
y # Print the value of y
z # Print the value of z
x + y  # Perform addition
x * y  # Perform multiplication
x / y  # Perform division
x^2  # Square of x
sqrt(x)  # Square root of x
x > y  # Check if x is greater than y
x == y  # Check if x is equal to y
x != y  # Check if x is not equal to y
y %/% x  # Integer division
y %% x  # Modulus operation
[1] 5
[1] 3
[1] 8
[1] 8
[1] 15
[1] 1.666667
[1] 25
[1] 2.236068
[1] TRUE
[1] FALSE
[1] TRUE
[1] 0
[1] 3
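The outline lists data types alongside basic syntax, so a short sketch of how to inspect them may help here. It also checks how integer division and modulus fit together; the values 17 and 5 are chosen only for illustration.

```r
x <- 5
y <- 3

# Every value has a type; class() reports it.
class(x)        # "numeric"
class(5L)       # "integer" (the L suffix requests an integer)
class("five")   # "character"
class(x > y)    # "logical"

# Integer division and modulus together recover the original number.
a <- 17
b <- 5
a %/% b                        # 3
a %% b                         # 2
(a %/% b) * b + a %% b == a    # TRUE
```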

Basic R Syntax (continued)

log(1000) # Natural logarithm of 1000
log10(1000) # Base 10 logarithm of 1000
sin(pi/2) # Sine of pi/2

degree = 45
radian = degree * (pi / 180) # Convert degrees to radians
tan(radian) # Tangent of the angle in radians
[1] 6.907755
[1] 3
[1] 1
[1] 1
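A few related calls round out the functions above. This is a small sketch with base R only: log() accepts a base argument, exp() reverses the natural logarithm, and the degree conversion can be run in the other direction.

```r
log(8, base = 2)      # logarithm with an arbitrary base: 3
log2(8)               # shortcut for base 2: 3
exp(log(1000))        # exp() undoes the natural logarithm: 1000

# Convert radians back to degrees
radian <- pi / 4
degree <- radian * (180 / pi)
degree                # 45
```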

Data Structures

# Vectors
v = c(1, 2, 3, 4, 5)  # Create a numeric vector
v_char = c("a", "b", "c")  # Create a character vector
# Matrices
m = matrix(1:9, nrow=3, ncol=3)  # Create a 3x3 matrix
m2 = matrix(1:12, nrow=3, ncol=4)  # Create a 3x4 matrix
# Lists
person = list(name="John", age=30, scores=c(90, 85, 88))  # Create a list (avoid reusing the name of the built-in list() function)
# Data frames by combining vectors
name = c("Alice", "Bob", "Charlie")
age = c(25, 30, 35)
scores = c(90, 85, 88)
df = data.frame(name, age, scores)  # Create a data frame
# Accessing elements in vectors
v = c(1, 2, 3, 4, 5)
v[1]  # First element
v[2:4]  # Elements from index 2 to 4

# Accessing elements in matrices
m = matrix(1:9, nrow=3, ncol=3)
m[1, 2]  # Element in first row, second column
m[2, ]  # Second row
m[, 3]  # Third column
# Accessing elements in lists
person = list(name="John", age=30, scores=c(90, 85, 88))
person$name  # Access the 'name' element
person[[2]]  # Access the second element (age)
# Accessing elements in data frames
df = data.frame(name=c("Alice", "Bob", "Charlie"), age=c(25, 30, 35), scores=c(90, 85, 88))
df$name  # Access the 'name' column
df[1, ]  # Access the first row
df[2, "age"]  # Access the 'age' of the second row
[1] 1
[1] 2 3 4
[1] 4
[1] 2 5 8
[1] 7 8 9
[1] "John"
[1] 30
[1] "Alice"   "Bob"     "Charlie"
   name age scores
1 Alice  25     90
[1] 30
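Before pulling elements out of an object, it often helps to check its size and layout. The sketch below rebuilds the objects used above and inspects them with base R functions, then shows logical indexing, which keeps only the elements or rows that meet a condition.

```r
v <- c(1, 2, 3, 4, 5)
m2 <- matrix(1:12, nrow = 3, ncol = 4)
df <- data.frame(name = c("Alice", "Bob", "Charlie"),
                 age = c(25, 30, 35),
                 scores = c(90, 85, 88))

length(v)    # number of elements: 5
dim(m2)      # rows and columns: 3 4
nrow(df)     # number of rows: 3
names(df)    # column names
str(df)      # compact summary of each column's type and first values

# Logical indexing: keep the elements or rows where a condition is TRUE
v[v > 3]                      # 4 5
older <- df[df$age > 28, ]    # rows for Bob and Charlie
older
```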

Accessing elements in data frames using dplyr

  • First, install the package with install.packages("dplyr").
  • Then load it with library(dplyr).
  • You can also install and update packages from the Tools menu in RStudio.
  • The easiest way to manage packages is with the ‘pacman’ package.
  • Install it using install.packages("pacman").
  • Then you can load other packages using pacman::p_load(dplyr, ggplot2, tidyr, car, jtools, kableExtra), which also installs any that are missing.
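To see which packages are already available before loading them, the check below may be useful. It is a sketch using only base R; the package list is a shortened, illustrative selection, and the install line is left commented out so nothing is downloaded unless you choose to run it.

```r
# Check whether each package is already installed before trying to load it.
pkgs <- c("dplyr", "ggplot2", "tidyr")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed   # named TRUE/FALSE vector, one entry per package

missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```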

Accessing elements in data frames using dplyr (continued)

library(dplyr)
df = data.frame(name=c("Alice", "Bob", "Charlie"), 
                age=c(25, 30, 35), 
                scores=c(90, 85, 88))

df %>% 
  filter(age > 28)  # Filter rows where age is greater than 28

df %>%
  select(name, scores)  # Select specific columns
     name age scores
1     Bob  30     85
2 Charlie  35     88
     name scores
1   Alice     90
2     Bob     85
3 Charlie     88
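The two steps above can also be chained into a single pipeline, with each verb passing its result to the next. The sketch below assumes dplyr is installed and adds arrange() with desc(), two dplyr functions not used above, to sort the result.

```r
library(dplyr)

df <- data.frame(name = c("Alice", "Bob", "Charlie"),
                 age = c(25, 30, 35),
                 scores = c(90, 85, 88))

result <- df %>%
  filter(age > 28) %>%          # keep Bob and Charlie
  select(name, scores) %>%      # drop the age column
  arrange(desc(scores))         # highest score first

result
```

Reading the pipeline top to bottom gives the order in which the operations are applied, which is usually easier to follow than nesting the same calls inside one another.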