BikeShare Analysis

Introduction

Data cleaning, transforming and wrangling

This report shows step-by-step analysis of Bike-Share data of hypothetic Chicago Bike Share Company. The data is hosted in the link here and publicly available: https://divvy-tripdata.s3.amazonaws.com/index.html. The required analysis is for last 12 months and hence this data is downloaded from the source in .csv format.

Due to large data sets, the files are more than 100mb which caused difficulty to work with in cleaning and transforming. Hence decided to do this step in Google BigQuery using SQL Query, but BigQuery sandbox has a file size limit to upload into the database. For this, they were converted to .xls format to decrease the file size and then separated the longitude and latitude data into separate .csv files and remaining columns for 12 months in separate .csv files. Some initial cleaning up were done in Excel Power Query to identify null values in ride_id and dates. Null values in stations were kept unless the latitude and longitude are also nulls. This is assuming that, rides can be start and end any where not necessarily a station. Also, if duration of the ride is zero, they were also removed from the tables. After cleaning, separating and saving as .csv, all files were now below 100mb and uploaded into tables in a database in BigQuery.

After uploading the files, files were merged into single raw data table after combining inidividual monthly tables and lon-lat tables using union and joins and saved as view. Any further cleaning was done based on contexts and analyzed in detail using SQL Query. The cleaned raw data is downloaded for analysis in R and saved in working directory.

Data Visualization and further analysis

For further analysis we need to create visualizations to understand the rides taking place across the last 12 months. To do this, we need to first install and load the required packages.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr)

Next let’s import the cleaned dataset which has been downloaded from bigquery and stored in the local working directory.

BikeShareData <- read_csv("BikeShareData_Clean.csv")
## Rows: 5319274 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, started_at, ended_at, member_casual, rideable_type, start_...
## dbl (7): duration_minutes, duration_hours, duration_days, start_lat, start_l...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(BikeShareData)
## # A tibble: 6 × 16
##   ride_id        start…¹ ended…² durat…³ durat…⁴ durat…⁵ membe…⁶ ridea…⁷ start…⁸
##   <chr>          <chr>   <chr>     <dbl>   <dbl>   <dbl> <chr>   <chr>   <chr>  
## 1 46719C2D3D513… 2021-0… 2021-0…       1    0.02       0 casual  classi… TA1307…
## 2 F3A82C6371C22… 2022-0… 2022-0…       1    0.02       0 member  classi… TA1307…
## 3 2FC8CF798C7C5… 2022-0… 2022-0…       1    0.02       0 member  electr… <NA>   
## 4 74ADAA803CCBE… 2022-0… 2022-0…       1    0.02       0 member  electr… 13021  
## 5 C246B15EA73DE… 2021-0… 2021-0…       1    0.02       0 member  classi… TA1306…
## 6 08AB194827CCC… 2021-0… 2021-0…       1    0.02       0 member  electr… 13197  
## # … with 7 more variables: start_station_name <chr>, end_station_id <chr>,
## #   end_station_name <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, and abbreviated variable names ¹​started_at, ²​ended_at,
## #   ³​duration_minutes, ⁴​duration_hours, ⁵​duration_days, ⁶​member_casual,
## #   ⁷​rideable_type, ⁸​start_station_id

The initial data shows all the columns imported correctly. Let’s look at the data types and overview of the dataframe.

str(BikeShareData)
## spec_tbl_df [5,319,274 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5319274] "46719C2D3D513D99" "F3A82C6371C22F1D" "2FC8CF798C7C57AF" "74ADAA803CCBE929" ...
##  $ started_at        : chr [1:5319274] "2021-09-26 10:35:00 UTC" "2022-06-14 18:06:00 UTC" "2022-05-29 13:46:00 UTC" "2022-08-25 08:54:00 UTC" ...
##  $ ended_at          : chr [1:5319274] "2021-09-26 10:36:00 UTC" "2022-06-14 18:07:00 UTC" "2022-05-29 13:47:00 UTC" "2022-08-25 08:55:00 UTC" ...
##  $ duration_minutes  : num [1:5319274] 1 1 1 1 1 1 1 1 1 1 ...
##  $ duration_hours    : num [1:5319274] 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ...
##  $ duration_days     : num [1:5319274] 0 0 0 0 0 0 0 0 0 0 ...
##  $ member_casual     : chr [1:5319274] "casual" "member" "member" "member" ...
##  $ rideable_type     : chr [1:5319274] "classic_bike" "classic_bike" "electric_bike" "electric_bike" ...
##  $ start_station_id  : chr [1:5319274] "TA1307000150" "TA1307000062" NA "13021" ...
##  $ start_station_name: chr [1:5319274] "Pine Grove Ave & Waveland Ave" "Sedgwick St & Huron St" NA "Clinton St & Lake St" ...
##  $ end_station_id    : chr [1:5319274] "TA1307000150" "TA1307000062" "WL-012" "WL-012" ...
##  $ end_station_name  : chr [1:5319274] "Pine Grove Ave & Waveland Ave" "Sedgwick St & Huron St" "Clinton St & Washington Blvd" "Clinton St & Washington Blvd" ...
##  $ start_lat         : num [1:5319274] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5319274] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:5319274] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5319274] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   duration_minutes = col_double(),
##   ..   duration_hours = col_double(),
##   ..   duration_days = col_double(),
##   ..   member_casual = col_character(),
##   ..   rideable_type = col_character(),
##   ..   start_station_id = col_character(),
##   ..   start_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

From this we found that started_at and ended_at columns are stored as character data types. We need to change them into date data type. For this, we will first extract the first 10 date characters into new columns and then convert them into date columns using mutate function.

RidesLast12Months <- BikeShareData %>%
  select(ride_id, member_casual, rideable_type, started_at, ended_at)

RidesLast12Months <- RidesLast12Months %>%
  transform(started_at = substr(started_at, 1, 10), ended_at = substr(ended_at, 1, 10))

RidesLast12Months$started_at <- ymd(RidesLast12Months$started_at)
RidesLast12Months$ended_at <- ymd(RidesLast12Months$ended_at)

head(RidesLast12Months)
##            ride_id member_casual rideable_type started_at   ended_at
## 1 46719C2D3D513D99        casual  classic_bike 2021-09-26 2021-09-26
## 2 F3A82C6371C22F1D        member  classic_bike 2022-06-14 2022-06-14
## 3 2FC8CF798C7C57AF        member electric_bike 2022-05-29 2022-05-29
## 4 74ADAA803CCBE929        member electric_bike 2022-08-25 2022-08-25
## 5 C246B15EA73DEAF6        member  classic_bike 2021-09-26 2021-09-26
## 6 08AB194827CCC83E        member electric_bike 2021-09-13 2021-09-13

We have transformed the started_at and ended_at columns to date datatype. Now let’s look at how the ride counts look like throughout last 12 months. For this we will use the ride start dates and see the count of rides throughout the year.

RideCounts_Last12Months <- RidesLast12Months %>%
  select(ride_id, member_casual, started_at) %>%
  group_by(member_casual, started_at) %>%
  summarize(ride_count = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
head(RideCounts_Last12Months)
## # A tibble: 6 × 3
## # Groups:   member_casual [1]
##   member_casual started_at ride_count
##   <chr>         <date>          <int>
## 1 casual        2021-09-01       9317
## 2 casual        2021-09-02       9980
## 3 casual        2021-09-03      10816
## 4 casual        2021-09-04      15920
## 5 casual        2021-09-05      20375
## 6 casual        2021-09-06      16787

We have created a data frame with summary of ride counts by dates and ride types

Now let’s visualize how the rides are distributed throughout last 12 month period to see if there is any pattern in the rides.

ggplot(RideCounts_Last12Months)+geom_point(mapping = aes(x = started_at, y = ride_count, color = member_casual)) + geom_smooth(method = "gam", mapping = aes(x = started_at, y = ride_count)) + labs(title = "Bike Share: Rides in Last 12 Months", subtitle = "Comparison of Casual and Member rides")  
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

From the scatter-plot, there is a very clear pattern that riders during winter times end of October through end of May are significantly less. So from this it looks like instead of an yearly pass, people will be much more attracted towards a 6 month membership.

ggplot(RideCounts_Last12Months)+geom_point(mapping = aes(x = started_at, y = ride_count, color = member_casual)) + geom_smooth(method = "gam", mapping = aes(x = started_at, y = ride_count)) + facet_wrap(~member_casual) + labs(title = "Bike Share: Rides in last 12 Months", subtitle = "Casual Rides and Member Rides through last 12 months")
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

From this, we can see that there is a small dip in casual riders in summer, after the initial peak. But the member riders were quite consistently high during this period. Now let’s see how the same data by ride types.

RideCounts_Last12Months <- RidesLast12Months %>%
  select(ride_id, member_casual, rideable_type, started_at) %>%
  group_by(member_casual, rideable_type, started_at) %>%
  summarize(ride_count = n())
## `summarise()` has grouped output by 'member_casual', 'rideable_type'. You can
## override using the `.groups` argument.
head(RideCounts_Last12Months)
## # A tibble: 6 × 4
## # Groups:   member_casual, rideable_type [1]
##   member_casual rideable_type started_at ride_count
##   <chr>         <chr>         <date>          <int>
## 1 casual        classic_bike  2021-09-01       5681
## 2 casual        classic_bike  2021-09-02       5879
## 3 casual        classic_bike  2021-09-03       6047
## 4 casual        classic_bike  2021-09-04       9733
## 5 casual        classic_bike  2021-09-05      12834
## 6 casual        classic_bike  2021-09-06      10474
ggplot(RideCounts_Last12Months) + geom_point(mapping = aes(x = started_at, y = ride_count, color = rideable_type)) + geom_smooth(method = "gam", mapping = aes(x = started_at, y = ride_count)) + facet_wrap(~member_casual) + labs(title = "Bike Share: Ride Counts by Ride Types and rider type", subtitle = "Shows the difference in rider behavior by bike type")
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

Let’s look at the data in a different way, using the bar charts.

RideCounts_Last12Months <- RidesLast12Months %>%
  select(ride_id, member_casual, rideable_type, started_at)

ggplot(RideCounts_Last12Months) + geom_bar(mapping = aes(x = rideable_type, fill = rideable_type)) + facet_wrap(~member_casual) + labs(title = "Bike Share: Ride count by rider type and bike type")

Let’s see the ride counts in specific periods: Sep - Mar and Apr - Aug First let us look date range Sep - Mar

RideCounts_SepMar <- RidesLast12Months %>%
  select(ride_id, member_casual, rideable_type, started_at) %>%
  filter(started_at >= '2021-09-01', started_at <= '2022-03-31')

ggplot(RideCounts_SepMar) + geom_bar(mapping = aes(x = rideable_type, fill = rideable_type)) + facet_wrap(~member_casual) + labs(title = "Bike Share: Ride count between Sep 2021 and Mar 2022") 

Now let’s look at date range Apr - Aug

RideCounts_AprAug <- RidesLast12Months %>%
  select(ride_id, member_casual, rideable_type, started_at) %>%
  filter(started_at >= '2022-04-01', started_at <= '2022-08-31')

ggplot(RideCounts_AprAug) + geom_bar(mapping = aes(x = rideable_type, fill = rideable_type)) + facet_wrap(~member_casual) + labs(title = "Bike Share: Ride count between Apr 2022 and Aug 2022")

No specific insights we could find from these, although we can still notice that casual riders mostly prefer to ride electric bikes compared to member riders.

Insights

The Bike share data was prepared, processed and analyzed to derive the insights of casual riders and member riders. From the analysis, we found that member rides and casual rides are 58% and 42% of total ride counts. Member riders prefer classic bikes more compared to casual riders.

Casual riders’ average ride duration is high compared to member riders which could be due to the casual riders mostly take routes between long distance ride stations. Member riders most frequently ride between stations include Ellis Ave & 60Th St, University Ave & 57th St, Calumet Ave & 33Rd St, and State St & 33Rd St. The casual riders between these stations are also quite high, but not as high as the member riders. The data is not sufficient to know if these casual riders are same returning customers. More data is needed to confirm this. If this is true, these casual customers could be converted to member riders. Most casual ride counts are between stations Streeter Dr & Grand Ave, Dusable Lake Shore Dr & Monroe St, Millennium Park, and Michigan Ave & Oak St.

Casual riders prefer both classic bikes and electric bikes equally but slightly more preference towards classic bikes. They also use docked bikes, but the docked bike rides are significantly less. This could be due to less number of docking stations, but this data is not sufficient in this regard to make an inference.

Looking at the ride counts distribution in the last 12 months for any seasonality, it is found that the ride count during winter is significantly low compared to summer. This is true for both casual rides and member rides. Hence there is no rider type specific seasonality. But we found that there is a small dip in casual ride count, after the initial peak in summer but member rides are steady throughout summer until September after the initial peak in start of summer.

Conclusion and Recommendations

Include a six month membership plan along with the one year plan

Most of our member riders ride between stations Ellis Ave & 60Th St, University Ave & 57th St, Calumet Ave & 33Rd St, and State St & 33Rd St. There are significant number of casual rides between these stations although not same number of rides as member rides. This could be due to many of the riders preferring not to have a membership because they only use it during the summer.

Include more electric bikes

It is found that casual ride duration are much higher compared to member rides. It could be hard for them to use classic bikes from far stations on a daily basis in their routes. As we can see that our casual riders have opted electric bikes and classic bikes almost equally. If more electric bikes are available consistently, these riders will prefer to join as members. But more investigation is needed in this as bike count by bike type is not available.