This case study was completed using the R programming language and is presented here as code chunks woven into this Markdown document.
Markdown
Christopher Davidson
2022-06-24
Annual Members vs Casual Riders
This is a case study for the Google Data Analytics course on Coursera. I chose, and was then asked, to investigate the differences between annual members and casual riders at a fictional Chicago bike-rental company called Cyclistic. I used data from Divvy bikes under this license: https://ride.divvybikes.com/data-license-agreement. This report presents the findings from my analysis.
Ask
The question I was asked by my fictional manager “Moreno” was:
- How do annual members and casual riders use Cyclistic bikes differently?
First, I tried to think from the customer's point of view and identify the different customer groups that make up the two categories of Cyclistic users. My assumptions (since I did not have access to customer data) were:
Casual Riders
- Customers wanting to go from A to B quickly
- Customers using the bikes for exercise
- Tourists
- Riders out exploring
Annual Members
- Commuters
- People who find it easier than owning a bike
- Riders with no worries about per-ride cost
- Regular journeys
Cyclistic is running a full program to try to convert casual riders into annual members. My manager, “Lily Moreno”, is the director of marketing and is responsible for campaigns and initiatives that promote the bike-share program. Other members of the team include:
- Marketing Analytics Team
- The executive team, who are notoriously detail-oriented and who will be responsible for approving the final marketing program recommended by us, the Marketing Analytics Team.
My insights will help Moreno make informed decisions on where to go next. Other members of the analytics team will answer two further questions to give Moreno a complete picture:
- Why would casual riders buy Cyclistic annual memberships?
- How can Cyclistic use digital media to influence casual riders to become members?
Prepare
Once I figured out what I was supposed to do, it was time to start. I pulled the CSV data from the Divvy website for an entire year, from May 2021 to April 2022. I then read all the files and combined them into one large data frame of more than 5 million rows using the code below.
# Combine all the CSV files into one data frame
library(readr)
library(dplyr)
year <- list.files(path = "H:/R/CSV",                # Identify all CSV files
                   pattern = "*.csv", full.names = TRUE) %>%
  lapply(read_csv) %>%                               # Read each file into a list
  bind_rows()                                        # Combine into one data set
This data came directly from a real bike-sharing company, standing in for the fictional Cyclistic. There was little need to check for bias, since it is raw data generated directly by bicycle use in Chicago; the only bias involved would be my own.
In this fictional world, I would present my findings to the analytics team first, to get suggestions on presentation and to check that I am going in the right direction. Afterwards I would present the findings to my boss, Moreno, who would confirm whether the data is useful. Then, when Moreno requires it, I would present my findings to the executive team, alongside the rest of the analytics team and Moreno, in a well-polished presentation.
Here I produced this document to show my working process to you, the interested reader. I currently haven’t made any presentations or reports that are directed towards the executive team or Moreno. I may come back and do that later but for now, I want to show you my thinking process. This is a report for you!
Process
First I had to clean the data. As you can see I used the packages ‘janitor’ and ‘tidyr’ to do so.
# Cleaning time
library(janitor)
year <- clean_names(year)    # Standardise the column titles
get_dupes(year)              # List any duplicate rows for inspection; this takes a long time to load
year <- remove_empty(year)   # Remove rows and columns that are entirely empty
library(tidyr)
year <- year %>% drop_na()   # Drop rows containing NAs; overlaps with remove_empty() but I wanted to be doubly sure
I then checked the structure of the data to see how I can prepare for my analysis.
colnames(year) #List of column names
nrow(year) #How many rows are in data frame?
dim(year) #Dimensions of the data frame?
head(year) #See the first 6 rows of the data frame. Also tail(year)
str(year) #See list of columns and data types (numeric, character, etc)
summary(year) #Statistical summary of data. Mainly for numerics
Due to the way I structured this code, part of the data preparation comes after the cleaning, so this chunk runs afterwards. This was also because I wanted to be sure I understood the errors before working around them. Once I added a column for the duration of each ride, I found that some durations were negative due to maintenance work on the bikes, and I assumed that if someone had a bike for less than 5 minutes there was probably something wrong with it, so I decided to filter those rides out. I also added weekdays (e.g. Monday, Tuesday) and week numbers (e.g. week 52, the last week in December) so I could analyse when people were taking the bikes and how that affected usage.
The final entry under the ‘mutate’ function adds how many days each ride lasted. I wanted to see what users were doing with bikes they kept for more than a day. It turned out that the number of rides lasting that long was very small, so I eventually decided to filter them out as well to remove outliers.
library(lubridate)
yearT <- year %>%
  select(ride_id, rideable_type, started_at, ended_at, member_casual,
         start_lat, start_lng, end_lat, end_lng,
         start_station_name, end_station_name) %>%
  mutate(time = difftime(ended_at, started_at, units = "mins")) %>%
  filter(time > 5) %>%                           # Drop rides of 5 minutes or less
  mutate(week_day = weekdays(started_at)) %>%    # e.g. Monday, Tuesday, ...
  mutate(week_no = week(started_at)) %>%         # Week number within the year
  mutate(days = difftime(ended_at, started_at, units = "days"))
str(yearT)
# Convert "time" to numeric so we can run calculations on the data
yearT$time <- as.numeric(as.character(yearT$time))
is.numeric(yearT$time)
This is where I filtered out the rides that lasted a day or more.
yearTC <- yearT %>%
  filter(days < 1) %>%   # Keep only rides shorter than one day
  select(-days)
This was actually placed further down in my code, but I decided to put it here since it is part of the prepare process. This code creates a new data frame which I used to work out which stations were most likely used for return journeys. More on this later…
yearN <- year %>%
  add_count(start_station_name, name = "start_station_count") %>%
  add_count(end_station_name, name = "end_station_count")
This prepares the data for looking at bike usage throughout the year.
yearTCWcount <- yearTC %>%
  group_by(member_casual, week_no) %>%
  summarise(number_of_rides = n(), average_duration = mean(time)) %>%
  arrange(member_casual, week_no)
This next set of code prepares data frames that will be used to compare return trips with non-return trips. This data is slightly speculative: I assumed that if a station is used as a start point and an end point an equal number of times, then return journeys must be passing through it. I could not know for sure, since I didn’t have access to customer data; had I been working for the company, I would have requested that access.
yearNret <- yearN %>%
  filter(rideable_type != "docked_bike") %>%
  filter(start_station_count == end_station_count) %>%
  count(member_casual, rideable_type, name = "count") %>%
  mutate(perc = count / sum(count)) %>%
  mutate(ID = "Return")
yearNnonret <- yearN %>%
  filter(rideable_type != "docked_bike") %>%
  filter(start_station_count != end_station_count) %>%
  count(member_casual, rideable_type, name = "count") %>%
  mutate(perc = count / sum(count)) %>%
  mutate(ID = "Non-Return")
This prepares the data for the plot looking at bicycles returned to the station they started from: both how often this happens and how long those rides last.
SSpercentage <- yearN %>%
  filter(start_station_id == end_station_id) %>%
  count(member_casual, name = "equal_station_count") %>%
  full_join(count(yearN, member_casual, name = "total_count")) %>%
  mutate(return_percentage = equal_station_count / total_count)
SSduration <- yearN %>%
  full_join(yearT) %>%
  filter(start_station_id == end_station_id) %>%
  group_by(member_casual) %>%
  summarise(duration = mean(time, na.rm = TRUE))
This final block of code prepares the final plot. For the map you can see further down, I first needed to summarise how much each station is used as a start station and as an end station, for both casual riders and annual members. Then I needed to make sure every station had coordinates, which meant creating more data frames. After all of this, I joined the data frames together into the one that drives the map.
yearTFStartM <- yearT %>%
  filter(start_lng != end_lng) %>%
  filter(start_lat != end_lat) %>%
  filter(member_casual == "member") %>%
  count(station_name = start_station_name, name = "m_start_count")
yearTFEndM <- yearT %>%
  filter(start_lng != end_lng) %>%
  filter(start_lat != end_lat) %>%
  filter(member_casual == "member") %>%
  count(station_name = end_station_name, name = "m_end_count")
yearTFStartC <- yearT %>%
  filter(start_lng != end_lng) %>%
  filter(start_lat != end_lat) %>%
  filter(member_casual == "casual") %>%
  count(station_name = start_station_name, name = "c_start_count")
yearTFEndC <- yearT %>%
  filter(start_lng != end_lng) %>%
  filter(start_lat != end_lat) %>%
  filter(member_casual == "casual") %>%
  count(station_name = end_station_name, name = "c_end_count")
yearTcoordS <- yearT %>%
  select(start_station_name, start_lat, start_lng) %>%
  group_by(start_station_name) %>%
  summarize(lat = median(start_lat), lng = median(start_lng))
yearTcoordE <- yearT %>%
  select(end_station_name, end_lat, end_lng) %>%
  group_by(end_station_name) %>%
  summarize(lat = median(end_lat), lng = median(end_lng)) %>%
  rename(station_name = end_station_name)
yearTF <- yearTFStartC %>%
  full_join(yearTFEndC, by = "station_name") %>%
  full_join(yearTFStartM, by = "station_name") %>%
  full_join(yearTFEndM, by = "station_name") %>%
  mutate(c_count = c_start_count + c_end_count) %>%
  mutate(m_count = m_start_count + m_end_count) %>%
  select(station_name, c_count, m_count) %>%
  left_join(yearTcoordE, by = "station_name")
library(sf)
## Linking to GEOS 3.9.1, GDAL 3.3.2, PROJ 7.2.1; sf_use_s2() is TRUE
chi_map <- read_sf("https://raw.githubusercontent.com/thisisdaryn/data/master/geo/chicago/Comm_Areas.geojson")
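The map itself is rendered further down; as a rough sketch (not the exact chunk used for the report, and the colour and size choices here are my own assumptions), the station counts in yearTF could be drawn over the community-area polygons with ggplot2 like this:

```r
library(ggplot2)
library(sf)

# Sketch only: community-area polygons from chi_map, one point per station
# from yearTF, sized by casual usage and coloured by whichever group dominates.
ggplot() +
  geom_sf(data = chi_map, fill = "grey95", colour = "grey70") +
  geom_point(data = yearTF,
             aes(x = lng, y = lat, size = c_count,
                 colour = c_count > m_count),
             alpha = 0.6) +
  scale_colour_manual(values = c("TRUE" = "orange", "FALSE" = "grey40"),
                      labels = c("TRUE" = "Mostly casual", "FALSE" = "Mostly member"),
                      name = NULL) +
  labs(size = "Rides", title = "Station usage across Chicago")
```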
Analyse
Before beginning my analysis, I took an overview of the data to see whether any other questions jumped out at me.
glimpse(yearT)
count(yearT, member_casual)  # Check the contents of member_casual. Could also use table(yearT$member_casual)
summary(yearT)
library(skimr)
skim_without_charts(yearT)   # A quick scan to see if any useful data stands out
First I noticed that some of the bikes were hired for a huge amount of time, more than a day. I investigated and found that these rides made up a tiny fraction of the data (less than 0.2% even among casual riders). The proportion of rides lasting over a day was so small that I treated them as outliers, possibly caused by faulty bikes, and removed them from later analysis. That removal appears back in the Prepare section (this document isn’t in chronological order).
library(dplyr)
day <- count(yearT, days, member_casual) %>%
  filter(days > 1) %>%
  group_by(member_casual) %>%
  summarise(days_greater_1 = sum(n)) %>%
  inner_join(count(yearT, member_casual, name = "total")) %>%
  mutate(percentage_dif = days_greater_1 / total)
head(day)
## # A tibble: 2 × 4
## member_casual days_greater_1 total percentage_dif
## <chr> <int> <int> <dbl>
## 1 casual 3694 2307337 0.00160
## 2 member 492 2492747 0.000197
I continued the analysis by looking at the length of time customers rented their bikes. It was easy to see that casual riders used the bikes for longer on their journeys than their annual counterparts.
yearTC %>%
  group_by(member_casual) %>%
  summarise(mean = mean(time))
## # A tibble: 2 × 2
## member_casual mean
## <chr> <dbl>
## 1 casual 28.4
## 2 member 15.7
There was an issue with the weekday order, so I made sure the weekdays were sorted correctly for the main chart. My apologies for the Danish: weekdays() takes its names from the system locale, and I didn’t quite figure out how to change the language.
yearTC$week_day <- ordered(yearTC$week_day, levels=c("mandag", "tirsdag", "onsdag", "torsdag", "fredag", "lørdag", "søndag"))
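For anyone wanting English labels, two possible fixes (untested in this report, so treat this as a sketch): switch the LC_TIME locale before calling weekdays(), or let lubridate::wday() build the ordered factor directly.

```r
# Option 1: change the time locale so weekdays() returns English names.
# Locale names vary by OS ("English" on Windows, "en_US.UTF-8" on Unix-likes).
Sys.setlocale("LC_TIME", "English")

# Option 2: lubridate::wday() returns an already-ordered factor,
# making the manual ordered() call unnecessary.
library(lubridate)
yearTC$week_day <- wday(yearTC$started_at, label = TRUE, abbr = FALSE,
                        week_start = 1)  # Monday first
```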
After all of that, I wanted a quick look at bike usage by bike type, and it was clear that everyone prefers the classic bike over any other.
yearT %>%
  group_by(member_casual) %>%
  count(member_casual, rideable_type)
## # A tibble: 5 × 3
## # Groups: member_casual [2]
## member_casual rideable_type n
## <chr> <chr> <int>
## 1 casual classic_bike 1140773
## 2 casual docked_bike 283618
## 3 casual electric_bike 882946
## 4 member classic_bike 1562770
## 5 member electric_bike 929977
Weekday Usage
I then created a weekday data frame to look at when annual members and casual riders prefer to use the bikes during the week. Viewing the data frame (only the head is shown below), it was easy to see that casual riders used the bikes for longer than annual members. However, this was easier to see in the chart, so I waited until then to draw a proper conclusion.
yearTCcount <- yearTC %>%
  group_by(member_casual, week_day) %>%
  summarise(number_of_rides = n(), average_duration = mean(time)) %>%
  mutate(number_of_rides / 1000)  # So it fits better on the chart
head(yearTCcount)
## # A tibble: 6 × 5
## # Groups: member_casual [1]
## member_casual week_day number_of_rides average_duration `number_of_rides/1000`
## <chr> <ord> <int> <dbl> <dbl>
## 1 casual mandag 260677 28.8 261.
## 2 casual tirsdag 241016 25.4 241.
## 3 casual onsdag 254529 25.0 255.
## 4 casual torsdag 266360 25.2 266.
## 5 casual fredag 323952 26.6 324.
## 6 casual lørdag 516067 30.7 516.
Act - The Conclusion
Everyone prefers the classic bike, and rides longer than a day are so rare that they could safely be excluded as outliers. Overall, it is clear that casual riders use the bikes for longer than annual members. Casual riders prefer the weekends over the work week, riding both more often and for longer then; annual members, by contrast, prefer to ride in the middle of the work week and ride only slightly longer on weekends. Both groups are more likely to ride in the summer weeks, but the effect is more pronounced for casual riders, who also ride for longer in summer. Annual members are quite consistent in their ride duration between summer and winter weeks.
Surprisingly, casual riders appear more likely to make return trips, while annual members are less likely to. Casual riders are roughly twice as likely to drop a bike off at the station they picked it up from: over 10% of casual rides end where they started, compared with under 5% of member rides. And when they do return a bike to its starting station, casual riders keep it about twice as long as annual members. This is probably sightseeing, whereas annual members are perhaps just popping to a shop, using the bike for exercise, or doing some other short task.
The map shows that casual riders are more likely to rent bikes beside the lake and tend to stay around the city center. A few stations were used only by casual riders. The grey dots with yellow outlines show that annual members are more likely to use stations further out from the city center.
Overall, this has been an interesting journey and I have thoroughly enjoyed it. This case study was created entirely in R from start to finish; it was hard and challenging, and I only wish I had more time to work on it. But let’s create one more data frame to wrap everything up…
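The chunk that builds this summary isn’t shown above; a minimal sketch of how such a two-column frame could be produced (the variable name `wrap_up` is my own, hypothetical choice) is:

```r
# Hypothetical reconstruction of the summary frame whose printout follows.
wrap_up <- data.frame(
  CRiders = c("Casual Riders",
              "- Bike ride duration is longer",
              "- Prefer riding on the weekend",
              "- Prefer to ride in the summer",
              "- More likely to go on return trips",
              "- At 10%, they are twice as likely to drop a bike off from where they got it from",
              "- They take double the amount of time to use a bike when they drop it off from the same place they got it",
              "- They are more likely to rent a bike along the lake"),
  AMembers = c("vs Annual Members",
               "- Prefer riding mid-week",
               "- Ride for slightly long on weekend",
               "- Use bikes more in summer but use them more in winter than casual riders",
               "- Less likely to take return trips",
               "- At 5%, annual members are less likely to drop their bike off where they got it from",
               "- They take half the time to use a bike when they drop a bike off in the same place they got it",
               "- More likely to rent a bike outside of the city center"))
wrap_up
```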
## CRiders
## 1 Casual Riders
## 2 - Bike ride duration is longer
## 3 - Prefer riding on the weekend
## 4 - Prefer to ride in the summer
## 5 - More likely to go on return trips
## 6 - At 10%, they are twice as likely to drop a bike off from where they got it from
## 7 - They take double the amount of time to use a bike when they drop it off from the same place they got it
## 8 - They are more likely to rent a bike along the lake
## AMembers
## 1 vs Annual Members
## 2 - Prefer riding mid-week
## 3 - Ride for slightly long on weekend
## 4 - Use bikes more in summer but use them more in winter than casual riders
## 5 - Less likely to take return trips
## 6 - At 5%, annual members are less likely to drop their bike off where they got it from
## 7 - They take half the time to use a bike when they drop a bike off in the same place they got it
## 8 - More likely to rent a bike outside of the city center
Top Three Recommendations
Moreno asked me what the top three recommendations from my analysis are. I would suggest these:
Casual riders are more likely to use the bikes during the summer. We could convince them that an annual membership would be cheaper, since they use the bikes for at least double the time annual members do, which means they must be paying more per ride overall.
It surprised me that they are more likely to go on return trips. If this is the case, then we can advertise that doing so with an annual membership would be cheaper than paying per ride.
They generally just move around the center of the city. Could the annual membership types be changed to include zones? For example, it could be cheaper to ride around the center of the city than outside it.
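The cost argument in the first recommendation could be illustrated with a toy break-even calculation. All prices below are hypothetical assumptions for illustration, not Cyclistic's or Divvy's real pricing; only the 28.4-minute mean casual ride length comes from the analysis above.

```r
# Hypothetical pricing (assumptions, not real Cyclistic/Divvy prices)
unlock_fee   <- 1.00   # assumed fee to start a ride
per_minute   <- 0.17   # assumed pay-as-you-go price per minute
annual_price <- 119    # assumed annual membership price

casual_mean_ride <- 28.4  # mean casual ride length in minutes (from the analysis)
cost_per_ride <- unlock_fee + per_minute * casual_mean_ride

# Number of casual rides per year at which the membership pays for itself
break_even_rides <- annual_price / cost_per_ride
cost_per_ride     # about 5.83 per ride under these assumptions
break_even_rides  # about 20 rides per year
```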
In this fictional world, I would have been talking with the analytics team throughout the analysis and phoning industry experts to confirm my assumptions and gather other ideas for what to analyse. I would make a report similar to this, but more concise, with charts and simple data frames so Moreno knows what is going on. Then, after meetings with the analytics team to discuss how we will present, I would fold my work into the team’s presentation and stand in front of the executives to show why Moreno’s plan will work, backed up by a great story, data and charts. I am good at storytelling; you can see that in my novel, https://starjump.nu/, which is currently published online for free.
Thank you for reading and I hope you have enjoyed reading about my process as much as I have enjoyed doing it.