By Isaiah Lyons-Galante, Glen Smith, Max Warnock, Magno Gutierrez, Ross Hawley
We have chosen to explore two years (2011-2012) of ridership data from Capital Bikeshare, a bike share company based in Washington, D.C. The data have daily resolution and are split between casual ridership from non-member users and registered ridership from members. The data also include about a dozen additional variables for each day, such as the day of the week, type of day, and weather conditions. The variables are explained below:
Variable Descriptions:
- instant: record index
- dteday: date
- season: season (winter, spring, summer, fall)
- yr: year (0: 2011, 1: 2012)
- mnth: month (1 to 12)
- holiday: whether the day is a holiday or not
- weekday: day of the week
- workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit: weather situation (clear/dry, mist, or rain or snow)
- temp: normalized temperature in Celsius (range from 0 to 1)
- atemp: normalized feeling temperature in Celsius
- hum: normalized humidity
- windspeed: normalized wind speed; the values are divided by 67 (max)
- casual: count of casual (unregistered) users
- registered: count of registered users
- casual_percent: share of daily bike users who are not registered users (stored as a fraction from 0 to 1)
- cnt: count of total rental bikes, including both casual and registered
The goal of this research is to create a predictive model for both casual and registered user ridership based on temporal variables such as day of the week and season. We anticipated that increases in registered use would have a positive effect on casual use. We also suspected that weather conditions affect ridership, so we attempt to explain some of the variability due to weather, but we do not include weather in our long-term model because it is unknowable for future years. Our hypotheses are below:
Null Hypothesis: all temporal and weather variables are uncorrelated with bike ridership for both casual and registered users.
Our Hypotheses:
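Temporal:
- Registered and casual ridership are positively correlated.
- Weekdays will record higher registered ridership because of work-day commuting.
- Weekends will record higher casual ridership because of leisure use.
- Holidays will record higher casual ridership and lower registered ridership.
Weather:
- Higher temperatures lead to higher registered and casual ridership.
- Humidity will be unrelated to both registered and casual ridership.
- Higher windspeeds will slightly decrease ridership for both casual and registered users.
- Weather will be more impactful on casual ridership than registered ridership.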
Here is a snapshot of the first six rows of the dataset which includes both categorical and numerical variables:
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 winter 0 1 0 6 0 mist
## 2 2 2011-01-02 winter 0 1 0 0 0 mist
## 3 3 2011-01-03 winter 0 1 0 1 1 dry
## 4 4 2011-01-04 winter 0 1 0 2 1 dry
## 5 5 2011-01-05 winter 0 1 0 3 1 dry
## 6 6 2011-01-06 winter 0 1 0 4 1 dry
## temp atemp hum windspeed casual registered cnt casual_percent
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985 0.33604061
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801 0.16354557
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349 0.08895478
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562 0.06914213
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600 0.05125000
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606 0.05479452
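As a quick consistency check on the snapshot above, the derived columns can be reproduced from the raw counts. The sketch below re-enters the first three rows by hand (the data frame name `bikes_head` is ours, not from the original analysis) and assumes that `cnt` is the sum of `casual` and `registered` and that `casual_percent` is `casual / cnt`, which is what the printed values suggest.

```r
# First three rows of the snapshot, typed in by hand for illustration.
bikes_head <- data.frame(
  casual     = c(331, 131, 120),
  registered = c(654, 670, 1229)
)

# Assumed definitions of the derived columns.
bikes_head$cnt            <- bikes_head$casual + bikes_head$registered
bikes_head$casual_percent <- bikes_head$casual / bikes_head$cnt

print(bikes_head)
```

The resulting `cnt` and `casual_percent` values agree with the printed rows, which also shows that `casual_percent` is stored as a fraction rather than a percentage.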
The main target variable we studied was total ridership. However, we also studied some aspects of casual and registered ridership. We used histograms to identify the distributions of these different variables below.
Figure 1: This histogram shows a right-skewed
distribution for casual users.
Figure 2: This histogram shows a relatively normal
distribution for registered users.
Figure 3: This histogram shows a relatively normal
distribution of casual and registered users combined.
To summarize the data, we graphed casual and registered ridership by
day over the full 2-year period (in black). We have added a smoothed
line of the data in red, and a linear approximation in blue to help see
trends better.
Figure 4: This graph shows the number of registered
users across the time period. There is a clear seasonal pattern as well
as steady growth and there is significant fluctuation on a day to day
scale.
Figure 5: This graph shows the number of casual users
across the time period.
Interpretation of exploratory figures
There is a larger amount of day-to-day fluctuation for casual users (Figure 5), but both casual and registered users share a similar seasonal pattern. When registered usage is higher, casual usage is also higher. The number of casual users is considerably lower than the number of registered users over the time period. This difference in magnitude could explain why the skewed histogram of casual users (Figure 1) did not affect the roughly normal distribution of total users (Figure 3).
To answer our hypotheses, we explore each appropriate variable individually to see whether it significantly affects ridership through a combination of correlations, plots, and analysis of variance. After determining which variables are of significance, we build the predictive model for total ridership.
Variable: Day of the Week
Figure 6: This boxplot shows that total ridership is
relatively consistent on all days of the week, with slightly higher
usage on weekdays compared to weekends. Because ridership is fairly
consistent on average across all days, this variable appears to have
little effect on total ridership.
Variables: Holiday, Weekday and Weekend
Figure 7: These boxplots show casual and registered
usage for 2011 and 2012 on holidays, weekdays and weekends. Registered
usage is higher on weekdays and lower on weekends and holidays. Casual
usage is higher on holidays and weekends than on weekdays. The boxplots
also demonstrate that growth from 2011 to 2012 is significantly
greater for registered users than for casual users.
Variable: Seasons
Figure 8: This graph shows the total ridership during the different seasons. The number of riders decreases significantly during the winter season. The spring and summer seasons have fairly similar ridership.
##
## Kruskal-Wallis rank sum test
##
## data: bikes$registered by bikes$season
## Kruskal-Wallis chi-squared = 196.77, df = 3, p-value < 2.2e-16
##
## Comparison of x by group
## (No adjustment)
## Col Mean-|
## Row Mean | fall spring summer
## ---------+---------------------------------
## spring | 0.463067
## | 0.3217
## |
## summer | -2.706012 -3.198420
## | 0.0034* 0.0007*
## |
## winter | 10.27458 9.895069 13.13279
## | 0.0000* 0.0000* 0.0000*
##
## alpha = 0.05
## Reject Ho if p <= alpha/2
##
## Kruskal-Wallis rank sum test
##
## data: bikes$casual by bikes$season
## Kruskal-Wallis chi-squared = 278.83, df = 3, p-value < 2.2e-16
##
## Comparison of x by group
## (No adjustment)
## Col Mean-|
## Row Mean | fall spring summer
## ---------+---------------------------------
## spring | -5.409601
## | 0.0000*
## |
## summer | -8.006406 -2.590066
## | 0.0000* 0.0048*
## |
## winter | 7.316487 12.80991 15.45732
## | 0.0000* 0.0000* 0.0000*
##
## alpha = 0.05
## Reject Ho if p <= alpha/2
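Figure 9: These Kruskal-Wallis tests compare registered and casual ridership across seasons, followed by pairwise comparisons between seasons. For registered riders, every pairwise difference is significant except fall versus spring (p = 0.32). For casual riders, every pairwise difference between seasons is significant.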
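For readers who want to reproduce this kind of test, here is a minimal sketch on made-up data. The data frame `toy`, its values, and the seed are hypothetical; `kruskal.test` is the base R function that produces the "Kruskal-Wallis rank sum test" header shown above. The pairwise step uses base R's `pairwise.wilcox.test` as a stand-in, which is not the same procedure as the pairwise comparison table printed in the output.

```r
set.seed(1)

# Hypothetical toy data: 40 days per season, with winter shifted lower.
toy <- data.frame(
  season = factor(rep(c("fall", "spring", "summer", "winter"), each = 40)),
  riders = c(rnorm(40, 4000, 500), rnorm(40, 3900, 500),
             rnorm(40, 4200, 500), rnorm(40, 2000, 500))
)

# Omnibus test: do the seasons share the same ridership distribution?
kw <- kruskal.test(riders ~ season, data = toy)
print(kw)

# Pairwise follow-up without p-value adjustment (a stand-in technique).
pw <- pairwise.wilcox.test(toy$riders, toy$season, p.adjust.method = "none")
print(pw$p.value)
```

With four seasons the test has 3 degrees of freedom, matching the `df = 3` in the output above.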
##
## Call:
## lm(formula = bikes$registered ~ bikes$casual)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2876.1 -1202.1 -27.4 1003.3 3344.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.894e+03 8.434e+01 34.32 <2e-16 ***
## bikes$casual 8.982e-01 7.731e-02 11.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1434 on 729 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1551
## F-statistic: 135 on 1 and 729 DF, p-value: < 2.2e-16
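Figure 10: This is a linear model of registered use as a function of casual use. The p-value is negligibly small (< 2.2e-16), so the positive relationship between registered and casual users is statistically significant. However, casual use explains only about 16% of the variability in registered use (R-squared = 0.156), so the relationship is fairly weak.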
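One way to read the R-squared in Figure 10: in a regression with a single predictor, R-squared equals the squared Pearson correlation between the two variables. The sketch below checks this identity on simulated data (the variable values and seed are hypothetical, chosen only to resemble the scale of the real counts) and then applies it to the reported value.

```r
set.seed(2)

# Hypothetical data loosely on the scale of the bike counts.
casual     <- runif(100, 0, 3000)
registered <- 2900 + 0.9 * casual + rnorm(100, 0, 1400)

fit <- lm(registered ~ casual)
r2  <- summary(fit)$r.squared   # R-squared from the linear model
r   <- cor(casual, registered)  # Pearson correlation

# For a one-predictor model these agree up to floating-point error.
print(all.equal(r2, r^2))

# Correlation implied by the reported Multiple R-squared of 0.1562.
implied_r <- sqrt(0.1562)
print(implied_r)
```

Applied to Figure 10, this implies a correlation of roughly 0.40 between daily casual and registered use.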
Weather Variables The following plots will explore
the effect of weather on ridership.
Figure 11: This plot shows a positive correlation
between the total users and normalized temperatures. As temperature
increases, so does rider usage.
Figure 12: This graph has a negative slope which means
that as wind speed increases, ridership decreases which aligns with our
hypothesis on the effect of wind speed on ridership.
Figure 13: This graph shows that humidity has a
negligible impact on ridership. There is a slight downward slope
suggesting that as humidity increases, ridership decreases, but this
could also be related to other variables like temperature. This
aligns with our hypothesis that humidity will not affect ridership.
Figure 14: This scatterplot shows the temperature
variability across the sample period. Because we know that temperature
affects the number of riders (Figure 11), this seasonal difference in
temperature can help explain the seasonal changes in ridership seen in
Figures 4 and 5.
In this section, we built a predictive model for overall bike ridership to be used by the company to forecast bike use for better logistical planning. We finished with two final models, one that only looks at temporal, knowable variables such as weekdays and seasons, while the other model includes weather variables such as temperature and humidity. Before we started our linear models, we needed to check if the data is normal. We did this with a histogram:
We see a little bit of non-normality in the low values. This is because the casual ridership is not normal:
However, the overall combination of the two is normal, so we will proceed with the linear models of overall count. We approached the building of the linear models with backward selection, first including lots of variables and then whittling it down.
First, we created extra columns to convert categorical variables such as month, season, weather condition, and day of the week into numerical values of either 0 or 1.
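The indicator columns described above can be built along these lines. This is a sketch on a four-row toy data frame, not the original transformation code; the level label `"precip"` is an assumption based on the column name used in the models, and the column names mirror the model formulas.

```r
# Toy data frame with the two categorical columns of interest.
bikes_toy <- data.frame(
  season     = c("winter", "spring", "summer", "fall"),
  weathersit = c("mist", "dry", "dry", "precip"),
  stringsAsFactors = FALSE
)

# One 0/1 column per season, leaving fall out as the baseline.
for (s in c("winter", "spring", "summer")) {
  bikes_toy[[s]] <- as.integer(bikes_toy$season == s)
}

# 0/1 columns for the weather situations.
bikes_toy$dry    <- as.integer(bikes_toy$weathersit == "dry")
bikes_toy$mist   <- as.integer(bikes_toy$weathersit == "mist")
bikes_toy$precip <- as.integer(bikes_toy$weathersit == "precip")

print(bikes_toy)
```

Leaving one level of each category out of the model matters: if every level is included alongside an intercept, one coefficient cannot be estimated and `lm` reports it as `NA`.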
We started our development of the predictive model with a large model containing all explanatory variables that we thought would be important. Note that we left out the days of the week and the months of the year since we thought that this information would be sufficiently captured by working day and by seasons.
##
## Call:
## lm(formula = cnt ~ yr + holiday + workingday + winter + summer +
## spring + dry + precip + temp + hum + windspeed, data = bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3677.3 -378.9 67.4 475.7 3347.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2705.81 281.17 9.623 < 2e-16 ***
## yr 2013.71 61.75 32.610 < 2e-16 ***
## holiday -624.01 188.44 -3.311 0.000974 ***
## workingday 118.19 67.92 1.740 0.082232 .
## winter -1545.59 96.67 -15.988 < 2e-16 ***
## summer -701.49 122.47 -5.728 1.50e-08 ***
## spring -407.65 96.16 -4.239 2.54e-05 ***
## dry 426.72 81.24 5.253 1.98e-07 ***
## precip -1490.23 194.55 -7.660 6.02e-14 ***
## temp 5108.07 306.98 16.640 < 2e-16 ***
## hum -1325.64 295.07 -4.493 8.19e-06 ***
## windspeed -2795.28 428.06 -6.530 1.24e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 822.2 on 719 degrees of freedom
## Multiple R-squared: 0.8226, Adjusted R-squared: 0.8199
## F-statistic: 303.1 on 11 and 719 DF, p-value: < 2.2e-16
We found a lot of interesting factors with this model. The first is that we were able to achieve an impressive adjusted R-squared of 0.82, meaning we are able to explain over 80% of the variation observed in bike data. This is much better than just using the mean to predict the future value. The logistics operators of the cycling company will be able to put our model to good use. We dove into each variable as well to unpack what the model says about each one:
Year: Strong positive correlation, indicative of increased ridership from one year to the next
Holiday: Negative correlation, meaning fewer people ride overall on holidays
Working Day: Weak positive correlation, meaning more bike use during the week, likely due to commuters.
Winter: Strong negative correlation, meaning fewer people bike in winter time, likely due to the cold.
Spring: A weak negative correlation
Summer: A surprising weak negative correlation! Seems like fall is really the most popular time for use of the bike share system. We suspect that it may be driven in part by students.
Dry weather: strong positive correlation. Good weather = more bikers!
Precipitation: strong negative correlation relative to the mist baseline, much larger in magnitude than the bump from dry weather. This really told us that riders avoid the rain but don’t mind the mist.
Temp: positive correlation. Warmer weather, more bikers.
Humidity: negative correlation. More humid weather, fewer bikers. This made sense since humidity makes the effective temperature feel that much hotter.
Windspeed: negative correlation. Windier days, fewer bikers.
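The adjusted R-squared quoted for this model can be recovered from the other numbers in the summary output, which is a useful check when comparing models of different sizes. The helper `adj_r2` below is our own illustrative function; it applies the standard adjustment, using the reported Multiple R-squared (0.8226), the 11 predictors, and the sample size implied by the 719 residual degrees of freedom.

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

p <- 11            # predictors in the first model
n <- 719 + p + 1   # residual df + predictors + intercept = 731 days

print(adj_r2(0.8226, n, p))
```

The result rounds to the 0.8199 shown in the output. The adjustment penalizes every extra predictor, which is why it is the more appropriate figure to track as variables are added or removed.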
Just as a sanity check on the model, we checked the Q-Q plot of the residuals, as well as the residuals vs fitted to look for homoscedasticity, and we were generally content with what we found. The Q-Q plot fits very closely to theoretical quantiles in the middle quantiles, deviating slightly in the upper and lower extremes. We see a similar trend in the residuals vs fitted, but again, we feel that it is randomly distributed enough to merit validity.
Q-Q Plot
Residuals vs. Fitted
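For reference, the two diagnostic plots named above can be produced with base R roughly as follows. The data and the model `fit` here are simulated placeholders, not the bike model; only the plotting calls are the point of the sketch.

```r
set.seed(3)

# Placeholder data and model, standing in for the bike regression.
x   <- runif(200)
y   <- 1 + 2 * x + rnorm(200, sd = 0.3)
fit <- lm(y ~ x)
res <- resid(fit)

# Q-Q plot: points close to the line suggest roughly normal residuals.
qqnorm(res)
qqline(res)

# Residuals vs. fitted: look for an even, patternless band around zero.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Calling `plot(fit, which = 1:2)` gives similar versions of both plots in one step.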
Next, as an exercise in comprehensive analysis, we constructed a linear model with every single variable available.
##
## Call:
## lm(formula = cnt ~ yr + holiday + workingday + winter + spring +
## summer + dry + mist + temp + hum + windspeed + jan + feb +
## mar + apr + may + jun + jul + aug + sep + oct + nov + sun +
## mon + tue + wed + thu + fri, data = bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3960.9 -350.9 74.1 456.0 2919.9
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1495.45 357.97 4.178 3.32e-05 ***
## yr 2018.06 58.22 34.660 < 2e-16 ***
## holiday -613.70 206.68 -2.969 0.003086 **
## workingday -10.09 106.96 -0.094 0.924837
## winter -1578.95 181.04 -8.722 < 2e-16 ***
## spring -689.65 212.36 -3.248 0.001219 **
## summer -746.71 191.42 -3.901 0.000105 ***
## dry 1981.36 196.67 10.075 < 2e-16 ***
## mist 1516.15 184.23 8.230 9.11e-16 ***
## temp 4487.30 411.84 10.896 < 2e-16 ***
## hum -1518.18 292.21 -5.196 2.68e-07 ***
## windspeed -2925.44 406.17 -7.202 1.53e-12 ***
## jan 84.39 182.23 0.463 0.643439
## feb 221.24 183.54 1.205 0.228450
## mar 629.52 185.16 3.400 0.000712 ***
## apr 540.88 242.19 2.233 0.025842 *
## may 807.91 257.73 3.135 0.001792 **
## jun 574.94 262.55 2.190 0.028863 *
## jul 92.79 279.39 0.332 0.739894
## aug 489.30 267.52 1.829 0.067824 .
## sep 1068.34 218.37 4.892 1.24e-06 ***
## oct 605.33 163.54 3.701 0.000231 ***
## nov -26.97 154.85 -0.174 0.861767
## sun -438.70 106.59 -4.116 4.32e-05 ***
## mon -213.73 109.18 -1.958 0.050681 .
## tue -119.47 107.25 -1.114 0.265684
## wed -51.20 107.73 -0.475 0.634758
## thu -43.40 107.06 -0.405 0.685328
## fri NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 769.5 on 703 degrees of freedom
## Multiple R-squared: 0.848, Adjusted R-squared: 0.8422
## F-statistic: 145.3 on 27 and 703 DF, p-value: < 2.2e-16
As you can see, we only marginally improved the adjusted R-squared, up to 0.8422. To be even more comprehensive, we then used the MASS library, which runs a large number of variations of our model to see which has the lowest AIC (a lower AIC indicates a better trade-off between fit and complexity). We fed it every single variable, even the months and weekdays, just to see what the theoretical maximum would be. The resulting output was the following model:
best_lm <- lm(cnt
~ yr + holiday
+ winter + spring + summer
+ dry + mist
+ temp + hum + windspeed
+ mar + apr + may + jun + aug + sep + oct
+ sun + mon
, data=bikes)
We can see that a few variables had been removed, such as
jan, feb, nov,
workingday, and most days of the week except
sun and mon. This model achieved the highest
adjusted R-squared, 0.8432. However, it still includes 19 different
variables. We still felt that this model was too complex, and so we did
some further weeding out of every variable with a p-value > 1e-4.
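To illustrate how AIC-based selection behaves, here is a small sketch on simulated data. It uses base R's `step` function rather than the MASS routine from the analysis, so treat it as an analogous technique and not the original code; the data frame `d` and its columns are hypothetical. Two columns truly drive the response and two are pure noise.

```r
set.seed(4)
n <- 200

# Hypothetical predictors: x1 and x2 matter, noise1 and noise2 do not.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                noise1 = rnorm(n), noise2 = rnorm(n))
d$y <- 3 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

# Start from the full model and drop terms while the AIC keeps improving.
full <- lm(y ~ x1 + x2 + noise1 + noise2, data = d)
sel  <- step(full, direction = "backward", trace = 0)

print(formula(sel))
print(c(full = AIC(full), selected = AIC(sel)))
```

Backward selection only removes a term when doing so lowers the AIC, so the selected model never has a higher AIC than the starting model. It can still keep weak predictors, which is why further manual trimming can be reasonable afterwards.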
##
## Call:
## lm(formula = cnt ~ yr + winter + summer + dry + mist + temp +
## hum + windspeed + mar + sep + oct + sun, data = bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3943.2 -364.0 112.1 493.6 3115.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1261.66 342.05 3.689 0.000243 ***
## yr 2011.32 59.10 34.032 < 2e-16 ***
## winter -1328.70 92.27 -14.400 < 2e-16 ***
## summer -451.03 99.07 -4.553 6.23e-06 ***
## dry 1982.14 199.29 9.946 < 2e-16 ***
## mist 1526.77 186.67 8.179 1.30e-15 ***
## temp 4722.43 271.33 17.405 < 2e-16 ***
## hum -1526.64 285.21 -5.353 1.17e-07 ***
## windspeed -3013.40 407.55 -7.394 3.98e-13 ***
## mar 308.96 110.50 2.796 0.005311 **
## sep 859.54 115.30 7.455 2.59e-13 ***
## oct 638.87 111.38 5.736 1.43e-08 ***
## sun -337.42 83.47 -4.042 5.86e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 787.5 on 718 degrees of freedom
## Multiple R-squared: 0.8375, Adjusted R-squared: 0.8348
## F-statistic: 308.3 on 12 and 718 DF, p-value: < 2.2e-16
Here, we reduced the model from 19 explanatory variables down to 12 while the adjusted R-squared only went down to 0.8348, barely a 1% reduction. This motivated us to continue trimming further, and we eliminated every variable with a p > 1e-10:
##
## Call:
## lm(formula = cnt ~ yr + winter + summer + dry + precip + temp +
## windspeed, data = bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3671.9 -455.7 87.0 503.7 3341.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1833.39 169.07 10.844 < 2e-16 ***
## yr 2057.43 63.17 32.571 < 2e-16 ***
## winter -1400.35 94.83 -14.767 < 2e-16 ***
## summer -345.73 99.31 -3.481 0.000529 ***
## dry 635.07 67.47 9.413 < 2e-16 ***
## precip -1597.80 195.58 -8.169 1.38e-15 ***
## temp 4438.35 285.74 15.533 < 2e-16 ***
## windspeed -2512.25 417.02 -6.024 2.70e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 849.2 on 723 degrees of freedom
## Multiple R-squared: 0.8097, Adjusted R-squared: 0.8079
## F-statistic: 439.5 on 7 and 723 DF, p-value: < 2.2e-16
At this point, we were down to 7 remaining explanatory variables and still had an impressive adjusted R-squared of 0.8079. This is the final version of the model that includes both time and weather variables to be used to predict bike share data. The final variables are:
- Year
- Winter
- Summer
- Dry
- Precipitation
- Temperature
- Windspeed
Finally, we wanted to build one last version of the model that did not include any weather variables, so that it could be used for long-term forecasting. We repeated the AIC process but without any weather-related variables. We started with a new model with all time-related variables:
##
## Call:
## lm(formula = cnt ~ yr + holiday + workingday + winter + spring +
## summer + jan + feb + mar + apr + may + jun + jul + aug +
## sep + oct + nov + sun + mon + tue + wed + thu + fri, data = bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6177.3 -395.7 122.1 600.6 3237.9
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3053.322 178.714 17.085 < 2e-16 ***
## yr 2201.513 73.478 29.962 < 2e-16 ***
## holiday -250.701 265.584 -0.944 0.34551
## workingday 100.502 137.530 0.731 0.46516
## winter -1871.353 231.609 -8.080 2.80e-15 ***
## spring -1132.452 272.513 -4.156 3.64e-05 ***
## summer -783.547 246.094 -3.184 0.00152 **
## jan -1.293 232.625 -0.006 0.99557
## feb 441.295 235.714 1.872 0.06160 .
## mar 1214.636 231.544 5.246 2.06e-07 ***
## apr 1553.542 294.805 5.270 1.82e-07 ***
## may 2419.957 293.951 8.233 8.82e-16 ***
## jun 2703.539 272.474 9.922 < 2e-16 ***
## jul 2287.806 291.652 7.844 1.61e-14 ***
## aug 2366.316 291.829 8.109 2.26e-15 ***
## sep 2272.904 247.621 9.179 < 2e-16 ***
## oct 1139.682 196.411 5.803 9.85e-09 ***
## nov 187.205 198.218 0.944 0.34527
## sun -357.651 137.170 -2.607 0.00932 **
## mon -295.678 140.453 -2.105 0.03563 *
## tue -195.691 137.864 -1.419 0.15621
## wed -154.664 137.845 -1.122 0.26224
## thu -31.500 137.846 -0.229 0.81931
## fri NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 993.2 on 708 degrees of freedom
## Multiple R-squared: 0.745, Adjusted R-squared: 0.7371
## F-statistic: 94.04 on 22 and 708 DF, p-value: < 2.2e-16
The adjusted R-squared has dropped to 0.7371, but this is still not bad considering we are including zero weather data outside of the season of the year. Repeating the AIC-based selection gave the following model:
best_time_lm <- lm(cnt
~ yr + holiday
+ winter + spring + summer
+ feb + mar + apr + may + jun + jul + aug + sep + oct
+ sun + mon
, data=bikes)
Our adjusted R-squared made a tiny improvement from 0.7371 to 0.7380, and we still have 16 variables. We felt there was room again to trim the fat. We cut out all variables with p > 1e-10 again and we were left with just 7 variables:
##
## Call:
## lm(formula = cnt ~ yr + winter + may + jun + jul + aug + sep,
## data = bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5590.7 -496.8 93.2 621.5 4138.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3413.30 78.38 43.548 < 2e-16 ***
## yr 2199.38 78.42 28.046 < 2e-16 ***
## winter -1914.93 104.00 -18.413 < 2e-16 ***
## may 836.78 150.78 5.550 4.02e-08 ***
## jun 1259.37 152.77 8.244 7.84e-16 ***
## jul 1050.69 150.78 6.969 7.22e-12 ***
## aug 1151.43 150.78 7.637 7.07e-14 ***
## sep 1253.52 152.77 8.206 1.05e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1060 on 723 degrees of freedom
## Multiple R-squared: 0.7034, Adjusted R-squared: 0.7005
## F-statistic: 244.9 on 7 and 723 DF, p-value: < 2.2e-16
We maintained an adjusted R-squared above 0.7 with this simplified model. However, looking at the fact that it’s basically year, winter, and then the spring and summer months, we went for one step simpler by just keeping year, winter, and summer:
##
## Call:
## lm(formula = cnt ~ yr + winter + summer, data = bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5940.1 -570.7 111.4 695.4 4138.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3762.74 71.95 52.295 < 2e-16 ***
## yr 2199.38 82.82 26.557 < 2e-16 ***
## winter -2264.38 101.92 -22.217 < 2e-16 ***
## summer 781.87 100.65 7.768 2.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1120 on 727 degrees of freedom
## Multiple R-squared: 0.6674, Adjusted R-squared: 0.666
## F-statistic: 486.2 on 3 and 727 DF, p-value: < 2.2e-16
Here, we paid a more significant penalty for simplicity, with the adjusted R-squared coming down to 0.6660, but this model has just three variables and a high degree of explainability. It told us that 2/3 of the variability in ridership can be explained by the year of operation with an added bump in the summer months and a dip in the winter months. This matched our intuition and hypothesis.
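To make the three-variable model concrete, the coefficients from the output above can be plugged into a small prediction function. The function name `predict_cnt` and the vector `coefs` are ours; the numbers are copied from the coefficient table, and `yr` follows the dataset coding (0 for 2011, 1 for 2012).

```r
# Coefficients copied from the three-variable model output.
coefs <- c(intercept = 3762.74, yr = 2199.38,
           winter = -2264.38, summer = 781.87)

# Predicted total daily ridership for a given year and season flags.
predict_cnt <- function(yr, winter, summer) {
  unname(coefs["intercept"] + coefs["yr"] * yr +
           coefs["winter"] * winter + coefs["summer"] * summer)
}

print(predict_cnt(yr = 1, winter = 0, summer = 1))  # 2012 summer day
print(predict_cnt(yr = 0, winter = 1, summer = 0))  # 2011 winter day
```

This gives about 6744 riders for a 2012 summer day and about 1498 for a 2011 winter day. Using `yr = 2` for a later year would assume the 2011 to 2012 growth simply continues, which the two years of data cannot confirm.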
Q-Q Plot
Residuals vs. Fitted
As a final gut check, we also checked the Q-Q plot and residuals vs. fitted plot and we were happy with a well fit line and random residuals. This version is a model we would be happy to stand by in a board room when forecasting ridership years into the future.
To wrap up this section around predictive modeling, we thought about these two models as having two use cases:
Long Term Forecasting: use the time-only model, with just year, summer, and winter needed for prediction. Explain 2/3 of the expected variability in ridership with the coefficients presented above.
Short Term Forecasting: use the time-plus-weather model. Whether it’s summer matters much less here because the weather variables capture most of that effect. Look at whether it will rain that day, along with the temperature and windspeed, and you can explain 4/5 of the expected variability in ridership with the coefficients presented above.
Conclusion: The goal of this research was to create a predictive model for combined user ridership based on temporal variables such as day of the week and season. We suspected that weather also impacted ridership, and we attempted to explain some of the variability due to weather, but did not include it in our long-term model because it is unknowable for future years. As a reminder, the working hypotheses were:
After analysis of the data, the Null Hypothesis was rejected, as several of the analyzed variables had clear effects. For the Temporal Hypotheses, most were confirmed, though the strength of these effects varied significantly.
Registered and Casual ridership are positively correlated. This was confirmed based on the analysis of same-day Registered and Casual rider use, though this is a relatively weak effect, accounting for approximately 15% of the variability.
Weekdays will record higher Registered user ridership because of work-day commuting. This was confirmed through box-plot analysis, both individually and in aggregate. In 2011, the average Registered User weekday use was above 3300, where average weekend/holiday use were both below 2700. While all Registered users increased in 2012, the weekday difference was even more significant, with the average use above 5000, whereas weekend/holiday were both below 4000.
Weekends will record higher Casual user ridership because of leisure use. This was confirmed through box-plot analysis, both individually and in aggregate, although the difference and change between years was distinctly smaller. In 2011, the average Casual weekend ridership was above 1200, where average Casual weekday ridership was below 800. All casual users increased in 2012, with average weekend ridership increasing to nearly 1900, while the holiday and weekday use only increased to approximately 1000 and 800, respectively.
Holidays will record higher Casual ridership and lower Registered ridership. This was partially confirmed through box-plot analysis, both individually and in aggregate. For Registered Users, the 2011 holiday average was 2300, well below both Weekday and Weekend use. The 2012 ridership increased to nearly 3500 but was still well below the other two categories. The hypothesized high Holiday ridership among Casual users did not manifest: it trailed Weekend use but was still significantly greater than Weekday use. One factor that should be mentioned about “Holiday” analysis is that there were only 21 recorded holidays across 2011 and 2012, and the majority of these fell on weekends. This really limits the value of using “Holiday” as a specific variable for predictive analysis.
For the Weather Hypotheses, all were at least partially confirmed.
Higher temperatures lead to higher Registered and Casual ridership. This was confirmed through analysis of ridership compared to adjusted temperature. Throughout the range of temperatures, lower temperatures resulted in fewer riders, while higher temperatures resulted in increased ridership. This relationship is fairly continuous for Registered riders; for Casual riders, however, there appears to be an upper limit at .0625.
Humidity will be unrelated to both Registered and Casual ridership. Analysis of the humidity variable showed no appreciable connection between changes in humidity and changes in either Registered or Casual ridership.
Higher windspeeds will slightly decrease ridership for both casual and registered users. This was partially confirmed. Analysis showed that higher windspeeds resulted in a slight decrease of registered users, but a very limited effect on casual users.
Weather will be more impactful on casual ridership than registered ridership. While individual weather events were not specifically examined, the fact that casual ridership fell more in the winter than registered ridership generally confirms this hypothesis.
Assuming this analysis was being given to the company before the 2013 year, we would conclude that they could anticipate further growth in ridership, with fairly predictable weekday, weekend and seasonal fluctuation.