This case study was completed using the R programming language and finally wrapped into code chunks in this Markdown text.

Markdown

Annual Members vs Casual Members

This is a case study for the Google Data Analytics course on Coursera. I chose and then was asked to investigate differences between Annual Members and Casual Riders for a fictional bike rental company called Cyclistic located in Chicago. I have been using data from DIVY bikes under this license https://ride.divvybikes.com/data-license-agreement. This is a report to show my findings from my Analysis.

Ask

The question I was asked by my fictional manager “Moreno” was:

  • How do annual members and casual riders use Cyclistic bikes differently?

First, I tried to think from the customers point of view and tried to identify the different customer groups that use Cyclistic bikes in the two groups. These assumptions (since I did not have access to customer data) include:

Casual Riders

  • Customers wanting to go from A to B quickly

  • Customers using the bikes for exercise

  • Tourists

  • For Exploration

Annual Members

  • Commuters

  • Easier then owning a bike

  • No worries on cost

  • Regular journeys

There is a full program in Cyclistic to attempt to convert Casual Riders into Annual Members. My manager is called “Lily Moreno”, she is the director of marketing and is responsible for making campaigns and initiatives to promote the bike-share program. Other members of the team include:

  • Marketing Analytics Team
  • Executive team who are notoriously detail-oriented and they will be responsible for approving the final recommended marketing program made by us the Marketing Analytics Team.

My insights will help Moreno make informed decisions on where to go to next. Other members of the analytics team will also have to answer two other questions to give Moreno a complete picture. Which include:

  • Why would casual rides buy Cyclistic annual memberships?
  • How can Cyclistic use digital media to influence casual riders to become members?

Prepare

Once I figured out what I was suppose to do, it was time to start. I pulled the CSV data from the DIVY website for an entire year between May 2021 to April 2022. I then read all the data and converted into a large data frame consisting of more than 5 million rows using the code below.

# Convert all the CSV files into one data frame
year <- list.files(path = "H:/R/CSV",  # Identify all CSV files
                       pattern = "*.csv", full.names = TRUE) %>% 
  lapply(read_csv) %>%                              # Store all files in list
  bind_rows                                         # Combine data sets into one data set

This data came directly from a bike sharing company and in turn came from the fictional company Cyclistic. There was no need to check for bias since it was raw data that came direct from the use of bicycles in Chicago. The only bias involved here would be my own.

In this fictional world, I will be presenting my findings to the analytics team first to get suggestions on how I can proceed with making it look better and see if they think I am going in the right direction. Afterwards I will present the findings to my boss, Moreno who will confirm that it is useful data. Then, when Moreno requires it of me, I will present my findings to the executive team with the rest of the data analytics team and Moreno in a well polished presentation.

Here I produced this document to show my working process to you, the interested reader. I currently haven’t made any presentations or reports that are directed towards the executive team or Moreno. I may come back and do that later but for now, I want to show you my thinking process. This is a report for you!

Process

First I had to clean the data. As you can see I used the packages ‘janitor’ and ‘tidyr’ to do so.

#cleaning time
library("janitor")
clean_names(`year`) #To clean column titles
get_dupes(`year`) #Remove duplicates, this takes a long time to load
remove_empty(`year`) #Remove any entries that have nulls (blank entries)
library("tidyr")
"year" <- `year` %>% drop_na() #Remove any errors in the data, this also does a similar task to "remove_empty" but I wanted to be double sure.

I then checked the structure of the data to see how I can prepare for my analysis.

colnames(year)  #List of column names
nrow(year)  #How many rows are in data frame?
dim(year)  #Dimensions of the data frame?
head(year)  #See the first 6 rows of data frame.  Also tail(all_trips)
str(year)  #See list of columns and data types (numeric, character, etc)
summary(year)  #Statistical summary of data. Mainly for numerics

Due to the way that I have created this code, a part of the code preparation has come after the cleaning. Which means this will run afterwards. This was also due to wanting to ensure I understood the errors and how to go around it. I decided that I will filter out people who had a bike for less then 5 minutes. When I was able to add extra data that told me the amount of time each ride lasted for. I found that some were in the negative due to maintenance work on the bikes and I assumed that if someone had a bike for less then 5 minutes, then there was probably something wrong with the bike. I also added week days (e.g. Monday, Tuesday, etc) and week numbers (e.g. week 52, last week in December) so I can analyse when they were taking the bikes and how that would effect the usage of the bikes.

The final entry under the ‘mutate’ function was to add the amount of days the users were using the bikes for. I wanted to see what the users were using their bikes for if they had them longer than two days. Turns out that the amount of people renting their bike for longer then two days was very small. So I eventually decided that I could filter them out as well to remove any outliers.

library("lubridate")
"yearT" <- `year` %>% 
  select(ride_id,rideable_type,started_at,ended_at,member_casual,start_lat,start_lng,end_lat,end_lng,start_station_name,end_station_name)%>%
  mutate(time = difftime(ended_at,started_at,units = "mins"))%>%
  filter(time > 5)%>% 
  mutate(week_day = weekdays(started_at))%>% 
  mutate(week_no = week(started_at))%>% 
  mutate(days = difftime(ended_at, started_at, units="days"))
str(yearT) 
# Convert "time" to numeric so we can run calculations on the data
yearT$time <- as.numeric(as.character(yearT$time)) 
is.numeric(yearT$time) 

This was the moment I filtered out those users that rented the bikes for more than two days.

"yearTC" <- `yearT` %>%
  filter(days<1)%>%
  select(-days)

This was actually placed further down in my code but I decided to put it here since it is apart of the prepare process. This code was used to create a new dataframe which I used to figure out the stations which were most likely used for return journeys. More on this later…

"yearN" <- `year` %>%
  add_count(start_station_name, name = "start_station_count")%>%
  add_count(end_station_name, name = "end_station_count")

This is to prepare the data to look at bike usage throughout the year

"yearTCWcount" <- `yearTC` %>%
  group_by(member_casual, week_no)%>%
  summarise(number_of_rides = n(),average_duration = mean(time))%>%
  arrange(member_casual,week_no)

This next set of data prepares dataframes that will be used to look at return trips vs non-returning trips. This data is slightly speculative and assuming that if the number of times the stations are used are equal, then there must be return journeys going through them. I could not know for sure though since I didn’t have access to customer data. However I would have requested that access if I was working for the company and was attempting to collect this data.

"yearNret" <- `yearN` %>% 
  filter(rideable_type != "docked_bike") %>%
  filter(start_station_count == end_station_count) %>% 
  count(member_casual, rideable_type, name = "count") %>%
  mutate("perc" = `count` / sum(`count`)) %>%
  mutate("ID" = "Return")
"yearNnonret" <- `yearN` %>% 
  filter(rideable_type != "docked_bike") %>%
  filter(start_station_count != end_station_count) %>% 
  count(member_casual, rideable_type, name = "count") %>%
  mutate("perc" = `count` / sum(`count`))%>%
  mutate("ID" = "Non-Return")

This was done to prepare the data for the plot that will look at bicycles that were returned to the same station. Looking at duration and frequency in which this happens.

"SSpercentage" <- `yearN` %>% 
  filter(start_station_id == end_station_id) %>% 
  count(member_casual,name="equal_station_count") %>%
  full_join(count(`yearN`,member_casual, name="total_count"))%>%
  mutate("return_percentage" = equal_station_count / total_count)
"SSduration" <- `yearN` %>%
  full_join(yearT) %>% 
  filter(start_station_id == end_station_id) %>%
  group_by(member_casual) %>%
  summarise(duration = mean(time, na.rm = TRUE))
  

This final mess of code was to prepare for the final plot. I created a map you can see further down, I needed to first summarise the data to tell me how much each station is being used for both the start stations and the end stations, for both casual riders and annual members. Then I needed to ensure that they all had co-ordinates which mean’t more dataframes being created. After all of this, I then joined all the dataframes together to create one that will run the map.

"yearTFStartM" <- `yearT` %>%
  filter(start_lng != end_lng) %>%
  filter(start_lat != end_lat) %>%
  filter(member_casual == "member") %>%
  count("station_name" = start_station_name, name = "m_start_count")
"yearTFEndM" <- `yearT` %>%
  filter(start_lng != end_lng) %>%
  filter(start_lat != end_lat) %>%
  filter(member_casual == "member") %>%
  count("station_name" = end_station_name, name = "m_end_count")
"yearTFStartC" <- `yearT` %>%
  filter(start_lng != end_lng) %>%
  filter(start_lat != end_lat) %>%
  filter(member_casual == "casual") %>%
  count("station_name" = start_station_name, name = "c_start_count")
"yearTFEndC" <- `yearT` %>%
  filter(start_lng != end_lng) %>%
  filter(start_lat != end_lat) %>%
  filter(member_casual == "casual") %>%
  count("station_name" = end_station_name, name = "c_end_count")
"yearTcoordS" <- `yearT` %>%
  select(start_station_name,start_lat,start_lng) %>%
  group_by(start_station_name) %>%
  summarize(lat = median(start_lat), lng = median(start_lng)) 
"yearTcoordE" <- `yearT` %>%
  select(end_station_name,end_lat,end_lng) %>%
  group_by(end_station_name) %>%
  summarize(lat = median(end_lat), lng = median(end_lng)) %>%
  rename(station_name = end_station_name)
"yearTF" <- `yearTFStartC` %>%
  full_join(`yearTFEndC`, by = "station_name") %>%
  full_join(`yearTFStartM`, by = "station_name") %>%
  full_join(`yearTFEndM`, by = "station_name") %>%
  mutate(c_count = c_start_count + c_end_count) %>%
  mutate(m_count = m_start_count + m_end_count) %>%
  select(station_name,c_count,m_count) %>%
  left_join(yearTcoordE,by = "station_name")
library(sf)
## Linking to GEOS 3.9.1, GDAL 3.3.2, PROJ 7.2.1; sf_use_s2() is TRUE
chi_map <- read_sf("https://raw.githubusercontent.com/thisisdaryn/data/master/geo/chicago/Comm_Areas.geojson") 

Analyse

Checking to overview the data to check to see if any other questions spring out at me before I begin my analysis.

glimpse(`yearT`)
count(`yearT`,member_casual) #check the contents of member_casual. Could also use table(year$member_casual)
summary(yearT)
library("skimr")
skim_without_charts(yearT) #quickly using this to see if we can pull any useful data

First I noticed that some of the bikes were hired for a huge amount of time, more than a day. I wanted to investigate that and found that the statistical significance of this was so minimal (less that 0.05%). The percentage of rides that went over 2 days was so small, that I considered them outliers, data that could have been caused by faulty bikes. Therefore, I took this data out in the future. Which I did at the beginning in prepare (this document isn’t in chronological order).

library("dplyr")
"day" <- count(yearT,days,member_casual) %>%
  filter(days>1)%>% 
  group_by(member_casual)%>% 
  summarise("days_greater_1" = sum(n)) %>% 
  inner_join(count(`yearT`,member_casual,name = "total")) %>%
  mutate(percentage_dif = days_greater_1/total) 
  
head(day)
## # A tibble: 2 × 4
##   member_casual days_greater_1   total percentage_dif
##   <chr>                  <int>   <int>          <dbl>
## 1 casual                  3694 2307337       0.00160 
## 2 member                   492 2492747       0.000197

I continued the analysis by looking at the length of time customers rented their bikes. It was easy to see that casual riders used the bikes for longer on their journeys than their annual counterparts.

`yearTC` %>%
  group_by(member_casual)%>%
  summarise("mean" = mean(time))
## # A tibble: 2 × 2
##   member_casual  mean
##   <chr>         <dbl>
## 1 casual         28.4
## 2 member         15.7

There was an issue with the week day order. Therefore I ensured that the weekdays were in order on the main chart. My apologies for the Danish. I didn’t quite figure out how to change the language.

yearTC$week_day <- ordered(yearTC$week_day, levels=c("mandag", "tirsdag", "onsdag", "torsdag", "fredag", "lørdag", "søndag"))

After all of that, I wanted to quickly look at bike usage depending on the bike type and it was clear that everyone prefers the classic bike over any other bike.

`yearT` %>%
  group_by(member_casual) %>%
  count(member_casual, rideable_type)
## # A tibble: 5 × 3
## # Groups:   member_casual [2]
##   member_casual rideable_type       n
##   <chr>         <chr>           <int>
## 1 casual        classic_bike  1140773
## 2 casual        docked_bike    283618
## 3 casual        electric_bike  882946
## 4 member        classic_bike  1562770
## 5 member        electric_bike  929977

Weekday Usage

I then created a weekday data frame to look at annual members and casual riders preferences as to when they want to use the bikes during the week. It was easy to see that casual members used the bikes longer than annual members when viewing the dataframe (can only see head below). However, it was easier to view the chart and see that data instead, so I waited until then to draw a proper conclusion.

"yearTCcount" <- `yearTC` %>% 
  group_by(member_casual, week_day)%>%
  summarise(number_of_rides = n(),average_duration = mean(time)) %>%
  mutate(number_of_rides / 1000) #So it fits better on the chart.
head(yearTCcount)
## # A tibble: 6 × 5
## # Groups:   member_casual [1]
##   member_casual week_day number_of_rides average_duration `number_of_rides/1000`
##   <chr>         <ord>              <int>            <dbl>                  <dbl>
## 1 casual        mandag            260677             28.8                   261.
## 2 casual        tirsdag           241016             25.4                   241.
## 3 casual        onsdag            254529             25.0                   255.
## 4 casual        torsdag           266360             25.2                   266.
## 5 casual        fredag            323952             26.6                   324.
## 6 casual        lørdag            516067             30.7                   516.

Share

Bike usage

Time for plots! Here I was able to create two plots from the data frame I just created. This first one shows the number of rides vs riders type throughout the week. In this plot it is easy to see that casual members prefer weekends over the weekdays. However, annual members seem to prefer Wednesdays over any other day.

### Bike Duration Then this plot shows the amount of time the different riders were riding the bikes on each weekday. In this plot it is easy to see that overall, casual riders prefer to use the bikes far longer then the annual members.Everyone use them the longest on the weekends.

Week Number Usage

I created a count in the preparation section that shows us the popularity of the bikes by both groups on different weekdays. I didn’t actually analyse this when it was just numbers because I wanted to see the data visually. So I waited until this stage to properly dig into the data. Then another two charts were created to plot this. The first one again looks at the number of the rides throughout each week. It’s clear to see that the summer months are a lot more popular by both groups. However, annual members are more likely to use the bikes in the winter months more than the casual riders.

Week Number Duration

This chart shows us the ride duration. This doesn’t tell us anything other than the casual riders prefer to use the bikes longer overall. It seems the seasons don’t have a huge effect on the ride length for the annual members, however the casual riders seem to prefer to use the bikes longer in the summer months than the winter. Apart from two weeks around Christmas and New Year when events in Chicago increases.

Station Usage, Return vs Non-Returning Bikes

I wanted to know how the cyclists use the stations. I wanted to begin by looking at return journeys. I did this by looking at how many times people had taken a bike and dropped a bike off at a station. Then took all the ones that were equal to assume they were return journeys. I unfortunately can’t confirm this because I don’t have the rides connected to customer data and don’t have access to the customer data. We cannot completely trust this data of course. But if I was working for this company, it may get the board more interested in knowing more about my findings. This plot surprised me. It is easy to see that annual members are more likely to take non-return trips than casual riders and casual riders love to take return trips. I find that really odd because my original thinking is that annual members were more likely to have return trips than casual riders. It also appears that everyone prefers the classic bikes regardless of the trip.

Same Station Return

Here I melded two potential bar charts into one box chart. I wanted to look at the difference between the groups when they returned a bike to the same station. That is, they take a bike, they go wherever they need to go, keep the bike and then come back to the same place they picked the bike up from. The question on my mind was, “Are casual riders more likely to return a bike to the same station compared with annual members?” This was created in the assumption that casual riders are most likely tourists. To ensure that the data was equal between annual members and casual riders, I used a percentage of total rides for each group. As you can see in the plot, the answer to my question is “Yes!”.

In the same chart you can also see that casual members also tend to use the bikes double the amount of time compared with the annual members when dropping a bike off from the same place they picked it up from. The assumption here is that they could be tourists exploring the city and wanting time to see everything. Annual members could just be using the bikes to go to the shop, exercise or another task that requires a small amount of time to complete.

Location Usage

The final plot is actually a map. It shows the location of all the stations and their usage depending on the size of the dots. I purposely made the annual members yellow and the casual riders blue. So together they can make green, unfortunately they appear to have made grey instead. With the yellow being see through it means you can compare which group uses which station more. If you have a blue border to the dot, then casual members use that station more. If you have a yellow border to the dot, then annual members are using that station more. In the center of Chicago you can see that it’s quite the mess with stations. I could have zoomed into that area with another map, however I don’t think that would give us any significant data. The data we do get from this map is that casual riders prefer to use the bikes along the side of the lake more than the annual members. The annual members are more likely to use their bikes further in the city. It is scattered with bikes throughout the rest of the city. An interesting fact is also that some stations have only be used by casual riders, could this be that annual members don’t have access to these? If I was working for the company, I would call up someone who could tell me more about this situation.

Act - The Conclusion

Everyone prefers the classic bike and the amount of times someone rents a bike for longer than a day is so statistically insignificant, it was not even worth investigating. Overall, it’s definitely apparent that Casual Riders use the bikes longer than annual members. Casual riders definitely prefer the weekends over the work week, using them more and for longer during that time. Whereas annual members prefer to use the bikes in the middle of the work week and only use them slightly longer on the weekend. Both casual riders and annual members are more likely to use their bikes in the summer weeks. However it is more pronounced with casual riders and they tend to use the bikes longer in the summer weeks. Annual members appear to be quite consistent with their ride duration between summer and winter weeks.

Surprisingly it appears that casual riders are more likely to go on return trips using the bicycles. Annual members are less likely to go on return trips. Casual riders are twice as likely to drop their bike off from the same place they picked it up from. However, over 10% of casual riders will do this whereas less than 5% of annual riders do this. When they do pick up a bike from the same station they drop them off at, casual riders take double the amount of time to use them compared with annual members. This is probably due to them sightseeing, whereas annual members perhaps just want to go to a shop, use the bike for exercise or another task that takes a small amount of time.

The map shows that casual riders are more likely to rent bikes besides the lake and tend to stick around the city. There are some random stations which were only used by casual riders. However the grey dots with yellow outlines show that annual members are more likely to use the stations that are out from the city center.

Overall, this has been an interesting journey to get hold of this information and I have enjoyed it thoroughly. This case study was completely created using R from start to finish. It has been a hard and challenging journey but I have enjoyed every minute of it. Just wish I had more time to work on it. But lets create one more dataframe to wrap everything up…

##                                                                                                     CRiders
## 1                                                                                             Casual Riders
## 2                                                                            - Bike ride duration is longer
## 3                                                                            - Prefer riding on the weekend
## 4                                                                            - Prefer to ride in the summer
## 5                                                                       - More likely to go on return trips
## 6                         - At 10%, they are twice as likely to drop a bike off from where they got it from
## 7 - They take double the amount of time to use a bike when they drop it off from the same place they got it
## 8                                                      - They are more likely to rent a bike along the lake
##                                                                                          AMembers
## 1                                                                               vs Annual Members
## 2                                                                        - Prefer riding mid-week
## 3                                                             - Ride for slightly long on weekend
## 4                       - Use bikes more in summer but use them more in winter than casual riders
## 5                                                              - Less likely to take return trips
## 6           - At 5%, annual members are less likely to drop their bike off where they got it from
## 7 - They take half the time to use a bike when they drop a bike off in the same place they got it
## 8                                         - More likely to rent a bike outside of the city center

Top Three Recomendations

Moreno asked me about what the top three recommendations are from my analysis. I would like to suggest this:

  1. Casual Riders are more likely to use the bikes during the summer, however we could convince them that it would be cheaper for them to get an annual membership since they use their bikes for at least double the amount of time annual members do. Which means, they must be paying more overall.

  2. It surprised me that they are more likely to go on return trips. If this is the case, then we can advertise that it would be cheaper for them to do that with an annual membership rather paying for the ride.

  3. They generally just move around the center of the city. Could it be possible to change the annual membership types so you could have zones included? Say it could be cheaper to go around the center of the city rather than outside the center.

In this fictional world, I would have been talking with the analytics team during analysis, phoning many industry experts to confirm whether my assumptions are correct and get some other thoughts on what I could analyse. I would make a report similar to this but more consise with charts and simple data frames for Moreno to know what is going on. Then After some meeting with everyone in the analytics team and discussions on how we will present. I will incorporate my presentation with the rest of the Analytics team presentations and I will stand in front of the executives to show why Moreno’s plan will work backed up with a great story, data and charts. I am good with the story telling, you could see that in my novel https://starjump.nu/ that is currently published online for free.

Thank you for reading and I hope you have enjoyed reading about my process as much as I have enjoyed doing it.