For R beginners Lesson 2. "Data Manipulation"
2. Data Manipulation
Data manipulation is a key part of data analysis. In R, the dplyr package provides a consistent set of functions for selecting, filtering, transforming, grouping, and summarizing data frames, which makes these tasks easier and more intuitive.
2.1 Installing and Loading Libraries
Before using any functions from a package like dplyr, you must install the package (once per machine) and load it into your R session (every time you start a new session):
# Installing the dplyr package (only need to do this once)
install.packages("dplyr")
# Loading the dplyr package
library(dplyr)
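If a script is shared between machines, it can be handy to install dplyr only when it is missing. The following is a minimal sketch of that idea using base R's requireNamespace(); it assumes a CRAN mirror is already configured for install.packages().

```r
# Install dplyr only if it is not already available on this machine
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package for the current session
library(dplyr)

# Check which version was loaded
print(packageVersion("dplyr"))
```

This avoids re-downloading the package every time the script runs.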
2.2 Selecting Columns
To select specific columns from a data frame, use the select() function:
# Sample data frame
my_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 95)
)
# Selecting the 'Name' and 'Score' columns
selected_data <- select(my_data, Name, Score)
print(selected_data)
# Output:
#      Name Score
# 1   Alice    85
# 2     Bob    90
# 3 Charlie    95
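select() can do more than list the columns to keep. As an illustrative sketch (the variable names without_age, renamed, and s_columns are made up for this example), it can also drop columns, rename them while selecting, and match columns by name pattern:

```r
library(dplyr)

# Same sample data frame as above
my_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 95)
)

# Drop a column by prefixing its name with a minus sign
without_age <- select(my_data, -Age)

# Rename a column while selecting it (new_name = old_name)
renamed <- select(my_data, Student = Name, Score)

# Keep only the columns whose names start with "S"
s_columns <- select(my_data, starts_with("S"))

print(names(without_age))  # "Name" "Score"
print(names(renamed))      # "Student" "Score"
print(names(s_columns))    # "Score"
```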
2.3 Filtering Rows
To filter rows based on specific conditions, use the filter() function:
# Filtering rows where Age is greater than 25
filtered_data <- filter(my_data, Age > 25)
print(filtered_data)
# Output:
#      Name Age Score
# 1     Bob  30    90
# 2 Charlie  35    95
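Conditions in filter() can also be combined. The sketch below, using the same sample data, shows three common patterns: separating conditions with a comma (all must be true), using | (at least one must be true), and using %in% to match against a set of values. The result variable names are only for illustration.

```r
library(dplyr)

my_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 95)
)

# Comma-separated conditions must ALL be true (logical AND)
both <- filter(my_data, Age > 25, Score >= 95)

# The | operator keeps rows where AT LEAST ONE condition is true (logical OR)
either <- filter(my_data, Age < 30 | Score > 90)

# %in% keeps rows whose value appears in the given vector
by_name <- filter(my_data, Name %in% c("Alice", "Bob"))

print(both$Name)     # "Charlie"
print(either$Name)   # "Alice" "Charlie"
print(by_name$Name)  # "Alice" "Bob"
```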
2.4 Adding New Columns
To add new columns or modify existing ones, use the mutate() function:
# Adding a new column 'Passed' based on the 'Score'
mutated_data <- mutate(my_data, Passed = Score > 80)
print(mutated_data)
# Output:
#      Name Age Score Passed
# 1   Alice  25    85   TRUE
# 2     Bob  30    90   TRUE
# 3 Charlie  35    95   TRUE
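The example above adds a column; modifying an existing one works the same way, by assigning to a name that already exists. Here is a small sketch (the 5-point bonus and the Grade rule are invented for illustration). Note that mutate() evaluates its arguments in order, so later expressions see the updated values.

```r
library(dplyr)

my_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 95)
)

updated_data <- mutate(
  my_data,
  Score = Score + 5,                      # overwrite the existing 'Score' column
  Grade = ifelse(Score >= 95, "A", "B")   # uses the already-updated 'Score'
)

print(updated_data$Score)  # 90 95 100
print(updated_data$Grade)  # "B" "A" "A"

# The original data frame is unchanged; mutate() returns a new one
print(my_data$Score)       # 85 90 95
```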
2.5 Grouping and Summarizing Data
Grouping and summarizing are common tasks when you want to analyze subsets of your data. dplyr provides the group_by() and summarize() functions for this.
2.5.1 Grouping Data
Grouping data means organizing data into groups based on one or more variables. This is often a precursor to summarizing or aggregating data within each group.
Using group_by()
The group_by() function in dplyr is used to group data by one or more variables.
Here’s an example:
# Sample data frame
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Gender = c("Female", "Male", "Male", "Male", "Female"),
  Age = c(25, 30, 35, 40, 28),
  Height = c(165, 180, 175, 170, 160)
)
# Grouping by the 'Gender' column without using pipe
grouped_data <- group_by(df, Gender)
print(grouped_data)
# Output:
# # A tibble: 5 × 4
# # Groups:   Gender [2]
#   Name    Gender   Age Height
#   <chr>   <chr>  <dbl>  <dbl>
# 1 Alice   Female    25    165
# 2 Bob     Male      30    180
# 3 Charlie Male      35    175
# 4 David   Male      40    170
# 5 Eve     Female    28    160
In this example, group_by(df, Gender) groups the data by the Gender column. This does not change the values or the row order; it returns a grouped tibble (note the "Groups: Gender [2]" line in the output) that is prepared for further operations like summarizing.
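Because grouping is not visible in the rows themselves, it can help to inspect it directly. The sketch below uses a few dplyr helpers for this (n_groups(), group_vars(), count(), and ungroup()); the variable names are only for illustration.

```r
library(dplyr)

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Gender = c("Female", "Male", "Male", "Male", "Female"),
  Age = c(25, 30, 35, 40, 28),
  Height = c(165, 180, 175, 170, 160)
)

grouped <- group_by(df, Gender)

print(n_groups(grouped))    # 2 (Female and Male)
print(group_vars(grouped))  # "Gender"

# Row order is the same as in the original data frame
print(identical(grouped$Name, df$Name))  # TRUE

# count() shows how many rows fall into each group
group_sizes <- count(df, Gender)
print(group_sizes$n)        # 2 3

# ungroup() removes the grouping when it is no longer needed
ungrouped <- ungroup(grouped)
print(is_grouped_df(ungrouped))  # FALSE
```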
2.5.2 Summarizing Data
Summarizing data involves calculating summary statistics (like mean, sum, etc.) for each group.
Using summarize()
The summarize() function in dplyr is used to create summary statistics for each group. When used with group_by(), it allows you to perform aggregations within each group.
Here’s how you can summarize data after grouping:
# Summarizing the data by calculating the mean age and mean height for each gender
summary_data <- df %>%
  group_by(Gender) %>%
  summarize(
    mean_age = mean(Age),
    mean_height = mean(Height)
  )
print(summary_data)
# Output:
# # A tibble: 2 × 3
#   Gender mean_age mean_height
#   <chr>     <dbl>       <dbl>
# 1 Female     26.5       162.5
# 2 Male       35         175
In this example, summarize() is used to calculate the mean age and mean height for each gender. The group_by(Gender) function specifies that the summarization should be done for each gender group.
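summarize() is not limited to means. As a further sketch on the same data, the code below counts the rows in each group with n(), takes the maximum age, and also shows that summarize() without group_by() collapses the whole data frame into a single row. The column names count, max_age, and overall are chosen for this example.

```r
library(dplyr)

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Gender = c("Female", "Male", "Male", "Male", "Female"),
  Age = c(25, 30, 35, 40, 28),
  Height = c(165, 180, 175, 170, 160)
)

# Several summary statistics per group
summary_stats <- df %>%
  group_by(Gender) %>%
  summarize(
    count = n(),           # number of rows in the group
    max_age = max(Age),    # oldest person in the group
    mean_height = mean(Height)
  )

print(summary_stats$count)    # 2 3
print(summary_stats$max_age)  # 28 40

# Without group_by(), the summary covers all rows at once
overall <- summarize(df, mean_age = mean(Age))
print(overall$mean_age)       # 31.6
```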
2.6 Piping in R
The %>% operator, known as the "pipe" operator, is a tool for chaining multiple functions together. It comes from the magrittr package and becomes available when you load dplyr. It passes the output of one function as the first argument of the next function, so you don't need intermediate variables, which makes your code more readable and concise.
Without Using Pipes
Without pipes, a common approach is to store the result of each step of your data manipulation in an intermediate variable:
# Sample data frame
grades <- data.frame(
  Student = c("Alice", "Bob", "Alice", "Bob", "Alice", "Bob"),
  Subject = c("Math", "Math", "Science", "Science", "History", "History"),
  Score = c(88, 92, 95, 85, 80, 78)
)
# Step 1: Group by 'Student'
grouped_data <- group_by(grades, Student)
# Step 2: Summarize the average score for each student
average_scores <- summarize(grouped_data, Average_Score = mean(Score))
print(average_scores)
# Output:
# # A tibble: 2 × 2
#   Student Average_Score
#   <chr>           <dbl>
# 1 Alice            87.7
# 2 Bob              85.0
In the code above, you first group the data by Student and store the result in grouped_data. Then, you use summarize() on grouped_data to calculate the average scores, storing the result in another variable average_scores.
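Intermediate variables are not the only pipe-free option. The same two steps can also be written as nested function calls, sketched below; this version has to be read from the inside out, which gets harder to follow as more steps are added.

```r
library(dplyr)

grades <- data.frame(
  Student = c("Alice", "Bob", "Alice", "Bob", "Alice", "Bob"),
  Subject = c("Math", "Math", "Science", "Science", "History", "History"),
  Score = c(88, 92, 95, 85, 80, 78)
)

# group_by() runs first (innermost call), then summarize() (outermost call)
nested_scores <- summarize(
  group_by(grades, Student),
  Average_Score = mean(Score)
)

print(round(nested_scores$Average_Score, 1))  # 87.7 85.0
```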
Using Pipes
Using pipes (%>%), you can accomplish the same task without creating intermediate variables:
# Using pipes to group by 'Student' and calculate the average score
average_scores <- grades %>%
  group_by(Student) %>%
  summarize(Average_Score = mean(Score))
print(average_scores)
# Output:
# # A tibble: 2 × 2
#   Student Average_Score
#   <chr>           <dbl>
# 1 Alice            87.7
# 2 Bob              85.0
The pipe operator %>% passes the result of one function directly to the next. The grades data frame is first passed to group_by(Student), and the output is then passed to summarize(Average_Score = mean(Score)). This eliminates the need for intermediate variables, making the code more compact and easier to read.
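Pipes become more useful as the number of steps grows. The following sketch chains the functions covered in this lesson into one pipeline; the History exclusion, the 90-point threshold, and the column names Passed and Passes are made up for illustration.

```r
library(dplyr)

grades <- data.frame(
  Student = c("Alice", "Bob", "Alice", "Bob", "Alice", "Bob"),
  Subject = c("Math", "Math", "Science", "Science", "History", "History"),
  Score = c(88, 92, 95, 85, 80, 78)
)

result <- grades %>%
  filter(Subject != "History") %>%     # keep Math and Science rows only
  mutate(Passed = Score >= 90) %>%     # flag scores of 90 or more
  group_by(Student) %>%                # one group per student
  summarize(
    Average_Score = mean(Score),       # mean of the remaining scores
    Passes = sum(Passed)               # TRUE counts as 1, FALSE as 0
  )

print(result$Student)        # "Alice" "Bob"
print(result$Average_Score)  # 91.5 88.5
print(result$Passes)         # 1 1
```

Each line reads as one step, in the order it is executed. In R 4.1 and later, base R also offers a native pipe, |>, which works in a similar way for simple chains like this one.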