By Isaiah Lyons-Galante, Glen Smith, Max Warnock, Magno Gutierrez, Ross Hawley

Introduction


We have chosen to explore two years (2011-2012) of ridership data from Capital Bikeshare, a bike share company based in Washington, D.C. The data have daily resolution and are split between casual ridership from non-member users and registered ridership from members. The data also include about a dozen additional fields for each day that capture the day of the week, the type of day, and the weather. The variables are explained below:

Variable Descriptions:

  • instant: record index
  • dteday: date
  • season: season (winter, spring, summer, fall)
  • yr: year (0: 2011, 1: 2012)
  • mnth: month (1 to 12)
  • holiday: whether the day is a holiday or not
  • weekday: day of the week (0 = Sunday to 6 = Saturday)
  • workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
  • weathersit: weather situation: clear (shown as dry in the data), mist, or rain or snow
  • temp: normalized temperature in Celsius (range from 0 to 1)
  • atemp: normalized feeling temperature in Celsius
  • hum: normalized humidity
  • windspeed: normalized wind speed; the values are divided by 67 (max)
  • casual: count of casual (unregistered) users
  • registered: count of registered users
  • casual_percent: fraction of daily bike users who are not registered users (0 to 1)
  • cnt: count of total rental bikes, including both casual and registered

Research Goal and Hypotheses:

The goal of this research is to create a predictive model of both casual and registered ridership from temporal variables such as day of the week and season. We anticipated that increases in registered use would coincide with increases in casual use. We suspect that weather conditions also affect ridership, so we attempt to explain some of the variability due to weather, but we exclude weather from our long-term model because it is unknowable for future years. Our hypotheses are below:

  • Null Hypothesis: all temporal and weather variables are uncorrelated with bike ridership for both casual and registered users.

  • Our Hypotheses:

    Temporal:

    1. Ridership for both types increases during the summer months, and decreases during winter months.
    2. Weekdays increase registered user ridership because of commuting.
    3. Weekends increase casual user ridership because of leisure time.
    4. Casual ridership increases on holidays, but registered ridership decreases.
    5. Casual ridership is positively correlated to Registered ridership.

    Weather:

    1. Higher temperatures lead to increased ridership for both casual and registered users.
    2. Humidity will be unrelated to ridership on both.
    3. Higher windspeeds will slightly decrease ridership for both casual and registered users.
    4. Weather will be more impactful on casual ridership than registered ridership.

Data Summary

Here is a snapshot of the first six rows of the dataset which includes both categorical and numerical variables:

##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       1 2011-01-01 winter  0    1       0       6          0       mist
## 2       2 2011-01-02 winter  0    1       0       0          0       mist
## 3       3 2011-01-03 winter  0    1       0       1          1        dry
## 4       4 2011-01-04 winter  0    1       0       2          1        dry
## 5       5 2011-01-05 winter  0    1       0       3          1        dry
## 6       6 2011-01-06 winter  0    1       0       4          1        dry
##       temp    atemp      hum windspeed casual registered  cnt casual_percent
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985     0.33604061
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801     0.16354557
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349     0.08895478
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562     0.06914213
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600     0.05125000
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606     0.05479452

The main target variable we studied was total ridership. However, we also studied some aspects of casual and registered ridership. We used histograms to identify the distributions of these different variables below.

Figure 1: This histogram shows a right skewed distribution for casual users.

Figure 2: This histogram shows a relatively normal distribution for registered users.

Figure 3: This histogram shows a relatively normal distribution of casual and registered users combined.

To summarize the data, we graphed casual and registered ridership by day over the full 2-year period (in black). We have added a smoothed line of the data in red, and a linear approximation in blue to help see trends better.

Figure 4: This graph shows the number of registered users across the time period. There is a clear seasonal pattern as well as steady growth, and there is significant fluctuation on a day-to-day scale.

Figure 5: This graph shows the number of casual users across the time period.

Interpretation of exploratory figures

There is a larger amount of day-to-day fluctuation for casual users (Figure 5), but both casual and registered users share a similar seasonal pattern. When registered usage is higher, casual usage is also higher. The number of casual users is substantially smaller than the number of registered users over the time period. This difference in magnitude could explain why the skewed histogram of casual users (Figure 1) did not affect the normal distribution of total users (Figure 3).

Results and Discussion

Section A: Temporal and Seasonal Variables

To answer our hypotheses, we explore each appropriate variable individually to see whether it significantly affects ridership through a combination of correlations, plots, and analysis of variance. After determining which variables are of significance, we build the predictive model for total ridership.

Variable: Day of the Week

Figure 6: This boxplot shows that total ridership is relatively consistent on all days of the week, with slightly higher usage on weekdays compared to weekends. Because average ridership is fairly consistent across all days, day of the week alone appears to have little effect on total ridership.

Variables: Holiday, Weekday and Weekend

Figure 7: These boxplots show casual and registered usage for 2011 and 2012 on holidays, weekdays and weekends. Registered usage is higher on weekdays and lower on weekends and holidays. Casual usage is higher on holidays and weekends than on weekdays. The boxplots also demonstrate that growth from 2011 to 2012 is significantly greater for registered users than for casual users.

Variable: Seasons

Figure 8: This graph shows the total ridership during the different seasons. The number of riders decreases significantly during the winter season. The spring and summer seasons have fairly similar ridership.

## 
##  Kruskal-Wallis rank sum test
## 
## data:  bikes$registered by bikes$season
## Kruskal-Wallis chi-squared = 196.77, df = 3, p-value < 2.2e-16
## 
##                            Comparison of x by group                            
##                                 (No adjustment)                                
## Col Mean-|
## Row Mean |       fall     spring     summer
## ---------+---------------------------------
##   spring |   0.463067
##          |     0.3217
##          |
##   summer |  -2.706012  -3.198420
##          |    0.0034*    0.0007*
##          |
##   winter |   10.27458   9.895069   13.13279
##          |    0.0000*    0.0000*    0.0000*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2
## 
##  Kruskal-Wallis rank sum test
## 
## data:  bikes$casual by bikes$season
## Kruskal-Wallis chi-squared = 278.83, df = 3, p-value < 2.2e-16
## 
##                            Comparison of x by group                            
##                                 (No adjustment)                                
## Col Mean-|
## Row Mean |       fall     spring     summer
## ---------+---------------------------------
##   spring |  -5.409601
##          |    0.0000*
##          |
##   summer |  -8.006406  -2.590066
##          |    0.0000*    0.0048*
##          |
##   winter |   7.316487   12.80991   15.45732
##          |    0.0000*    0.0000*    0.0000*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2

Figure 9: These Kruskal-Wallis tests, with pairwise comparisons, check whether registered and casual ridership differ by season. For registered riders, every pairwise comparison is significant except fall versus spring (p = 0.32). For casual riders, every pairwise comparison is significant.
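A seasonal comparison of this kind can be sketched in base R. The example below is a sketch under stated assumptions, not the code behind Figure 9: the data frame `toy` is simulated, and the pairwise step uses `pairwise.wilcox.test` with no p-value adjustment as a base R substitute for the pairwise table printed above.

```r
set.seed(1)

# Simulated stand-in for the ridership data: 50 days per season.
# The means are invented for illustration only.
toy <- data.frame(
  season = factor(rep(c("fall", "spring", "summer", "winter"), each = 50)),
  registered = c(rnorm(50, 4000, 800), rnorm(50, 4000, 800),
                 rnorm(50, 4300, 800), rnorm(50, 2200, 800))
)

# Omnibus test: do all seasons share one distribution of ridership?
kw <- kruskal.test(registered ~ season, data = toy)
print(kw)

# Pairwise follow-up using Wilcoxon rank-sum tests, unadjusted.
pw <- pairwise.wilcox.test(toy$registered, toy$season,
                           p.adjust.method = "none")
print(pw$p.value)
```

With four seasons the omnibus test has 3 degrees of freedom, matching the df reported in the output above. When many pairs are compared, a p-value adjustment (for example `p.adjust.method = "holm"`) is a common safeguard.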

## 
## Call:
## lm(formula = bikes$registered ~ bikes$casual)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2876.1 -1202.1   -27.4  1003.3  3344.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.894e+03  8.434e+01   34.32   <2e-16 ***
## bikes$casual 8.982e-01  7.731e-02   11.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1434 on 729 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1551 
## F-statistic:   135 on 1 and 729 DF,  p-value: < 2.2e-16

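Figure 10: This is a linear model of registered ridership as a function of casual ridership. The p-value is negligibly small (< 2.2e-16), so the relationship between registered and casual ridership is statistically significant. However, casual ridership explains only about 16% of the variance in registered ridership (R-squared = 0.1562).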
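For a regression with a single predictor, R-squared is the square of the Pearson correlation between the two variables, so an R-squared of 0.1562 corresponds to a correlation of roughly 0.4. The sketch below checks that identity on simulated data; the variable values are invented and are not the bike share data.

```r
set.seed(42)

# Simulated daily counts, loosely shaped like the fitted line above.
casual     <- runif(200, 0, 3000)
registered <- 2900 + 0.9 * casual + rnorm(200, sd = 1400)

fit <- lm(registered ~ casual)

r2 <- summary(fit)$r.squared   # R-squared from the linear model
r  <- cor(registered, casual)  # Pearson correlation

# In simple linear regression these two quantities agree.
print(c(r_squared = r2, correlation_squared = r^2))

# Correlation implied by the R-squared reported in the output above.
implied_r <- sqrt(0.1562)
print(implied_r)
```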

Weather Variables

The following plots explore the effect of weather on ridership.

Figure 11: This plot shows a positive correlation between total users and normalized temperature. As temperature increases, so does rider usage.

Figure 12: This graph has a negative slope which means that as wind speed increases, ridership decreases which aligns with our hypothesis on the effect of wind speed on ridership.

Figure 13: This graph shows that humidity has a negligible impact on ridership. There is a slight downward slope suggesting that as humidity increases, ridership decreases, but this could also be related to other variables like temperature. This aligns with our hypothesis that humidity will not affect ridership.

Figure 14: This scatterplot shows the temperature variability across the sample period. Because we know that temperature affects the number of riders (Figure 11), this seasonal difference in temperature can help explain the seasonal changes in ridership seen in Figures 4 and 5.

Section B: Linear Models

In this section, we built a predictive model for overall bike ridership to be used by the company to forecast bike use for better logistical planning. We finished with two final models: one looks only at temporal, knowable variables such as weekdays and seasons, while the other also includes weather variables such as temperature and humidity. Before fitting our linear models, we needed to check whether the data are approximately normal. We did this with a histogram:

We see a little bit of non-normality in the low values. This is because the casual ridership is not normal:

However, the overall combination of the two is normal, so we will proceed with the linear models of overall count. We approached the building of the linear models with backward selection, first including lots of variables and then whittling it down.

First, we created extra columns to convert categorical variables such as month, season, weather condition, and day of the week into numerical values of either 0 or 1.
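The conversion to 0/1 columns can be sketched as follows. This is a minimal illustration on a hypothetical five-row data frame; the column names follow the model formulas in this section, and the weather label "precip" is an assumption about how rain or snow days are coded.

```r
# Hypothetical five-day extract; "precip" as a weathersit label is assumed.
toy <- data.frame(
  season     = c("winter", "spring", "summer", "fall", "winter"),
  weathersit = c("mist", "dry", "dry", "precip", "mist"),
  stringsAsFactors = FALSE
)

# One 0/1 indicator per season, leaving fall out as the baseline.
# Season coefficients are then read relative to fall.
for (s in c("winter", "spring", "summer")) {
  toy[[s]] <- as.integer(toy$season == s)
}

# One 0/1 indicator per weather situation. A model with an intercept
# should use at most two of these three, because they always sum to 1.
for (w in c("dry", "mist", "precip")) {
  toy[[w]] <- as.integer(toy$weathersit == w)
}

print(toy)
```

Leaving one level out of each group avoids perfect collinearity with the intercept, which is the situation R reports as a coefficient "not defined because of singularities".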

We started our development of the predictive model with a large model containing all explanatory variables that we thought would be important. Note that we left out the days of the week and the months of the year since we thought that this information would be sufficiently captured by working day and by seasons.

## 
## Call:
## lm(formula = cnt ~ yr + holiday + workingday + winter + summer + 
##     spring + dry + precip + temp + hum + windspeed, data = bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3677.3  -378.9    67.4   475.7  3347.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2705.81     281.17   9.623  < 2e-16 ***
## yr           2013.71      61.75  32.610  < 2e-16 ***
## holiday      -624.01     188.44  -3.311 0.000974 ***
## workingday    118.19      67.92   1.740 0.082232 .  
## winter      -1545.59      96.67 -15.988  < 2e-16 ***
## summer       -701.49     122.47  -5.728 1.50e-08 ***
## spring       -407.65      96.16  -4.239 2.54e-05 ***
## dry           426.72      81.24   5.253 1.98e-07 ***
## precip      -1490.23     194.55  -7.660 6.02e-14 ***
## temp         5108.07     306.98  16.640  < 2e-16 ***
## hum         -1325.64     295.07  -4.493 8.19e-06 ***
## windspeed   -2795.28     428.06  -6.530 1.24e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 822.2 on 719 degrees of freedom
## Multiple R-squared:  0.8226, Adjusted R-squared:  0.8199 
## F-statistic: 303.1 on 11 and 719 DF,  p-value: < 2.2e-16

We found a lot of interesting factors with this model. The first is that we were able to achieve an impressive adjusted R-squared of 0.82, meaning we are able to explain over 80% of the variation observed in bike data. This is much better than just using the mean to predict the future value. The logistics operators of the cycling company will be able to put our model to good use. We dove into each variable as well to unpack what the model says about each one:

  • Year: Strong positive correlation, indicative of increased ridership from one year to the next

  • Holiday: Negative correlation, meaning fewer people ride overall on holidays

  • Working Day: Weak positive correlation, meaning more bike use during the week, likely due to commuters.

  • Winter: Strong negative correlation, meaning fewer people bike in winter time, likely due to the cold.

  • Spring: A weak negative correlation

  • Summer: A surprising weak negative correlation! Seems like fall is really the most popular time for use of the bike share system. We suspect that it may be driven in part by students.

  • Dry weather: strong positive correlation. Good weather = more bikers!

  • Precipitation: strong negative correlation. Mist is the omitted baseline category in this model, so this told us that riders avoid rain and snow far more than they avoid mist.

  • Temp: positive correlation. Warmer weather, more bikers.

  • Humidity: negative correlation. More humid weather, fewer bikers. This made sense since humidity makes the effective temperature feel that much hotter.

  • Windspeed: negative correlation. Windier days, fewer bikers.
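The adjusted R-squared quoted for this model can be recovered from the multiple R-squared, the number of observations, and the number of predictors. The sketch below assumes n = 731 days, which is inferred from the 719 residual degrees of freedom plus the 12 estimated coefficients; the function name `adjusted_r2` is our own.

```r
# Adjusted R-squared penalizes the multiple R-squared for the number
# of predictors p, given n observations.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# First model above: 11 predictors, multiple R-squared 0.8226.
first_model <- adjusted_r2(r2 = 0.8226, n = 731, p = 11)
print(round(first_model, 4))
```

The result rounds to the adjusted R-squared shown in the model output. The penalty is small here because 731 observations is large relative to 11 predictors.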

Just as a sanity check on the model, we checked the Q-Q plot of the residuals, as well as the residuals vs fitted to look for homoscedasticity, and we were generally content with what we found. The Q-Q plot fits very closely to theoretical quantiles in the middle quantiles, deviating slightly in the upper and lower extremes. We see a similar trend in the residuals vs fitted, but again, we feel that it is randomly distributed enough to merit validity.

Q-Q Plot

Residuals vs. Fitted

Next, as an exercise in comprehensive analysis, we constructed a linear model with every single variable available.

## 
## Call:
## lm(formula = cnt ~ yr + holiday + workingday + winter + spring + 
##     summer + dry + mist + temp + hum + windspeed + jan + feb + 
##     mar + apr + may + jun + jul + aug + sep + oct + nov + sun + 
##     mon + tue + wed + thu + fri, data = bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3960.9  -350.9    74.1   456.0  2919.9 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1495.45     357.97   4.178 3.32e-05 ***
## yr           2018.06      58.22  34.660  < 2e-16 ***
## holiday      -613.70     206.68  -2.969 0.003086 ** 
## workingday    -10.09     106.96  -0.094 0.924837    
## winter      -1578.95     181.04  -8.722  < 2e-16 ***
## spring       -689.65     212.36  -3.248 0.001219 ** 
## summer       -746.71     191.42  -3.901 0.000105 ***
## dry          1981.36     196.67  10.075  < 2e-16 ***
## mist         1516.15     184.23   8.230 9.11e-16 ***
## temp         4487.30     411.84  10.896  < 2e-16 ***
## hum         -1518.18     292.21  -5.196 2.68e-07 ***
## windspeed   -2925.44     406.17  -7.202 1.53e-12 ***
## jan            84.39     182.23   0.463 0.643439    
## feb           221.24     183.54   1.205 0.228450    
## mar           629.52     185.16   3.400 0.000712 ***
## apr           540.88     242.19   2.233 0.025842 *  
## may           807.91     257.73   3.135 0.001792 ** 
## jun           574.94     262.55   2.190 0.028863 *  
## jul            92.79     279.39   0.332 0.739894    
## aug           489.30     267.52   1.829 0.067824 .  
## sep          1068.34     218.37   4.892 1.24e-06 ***
## oct           605.33     163.54   3.701 0.000231 ***
## nov           -26.97     154.85  -0.174 0.861767    
## sun          -438.70     106.59  -4.116 4.32e-05 ***
## mon          -213.73     109.18  -1.958 0.050681 .  
## tue          -119.47     107.25  -1.114 0.265684    
## wed           -51.20     107.73  -0.475 0.634758    
## thu           -43.40     107.06  -0.405 0.685328    
## fri               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 769.5 on 703 degrees of freedom
## Multiple R-squared:  0.848,  Adjusted R-squared:  0.8422 
## F-statistic: 145.3 on 27 and 703 DF,  p-value: < 2.2e-16

As you can see, this only marginally improved the adjusted R-squared, to 0.8422. To be even more comprehensive, we then used the MASS library, which searches over variations of our model to find the one with the lowest AIC. We fed it every single variable, even the months and weekdays, just to see what the theoretical maximum would be. The resulting output was the following model:

best_lm <- lm(cnt 
              ~ yr + holiday 
              + winter + spring + summer 
              + dry + mist 
              + temp + hum + windspeed 
              + mar + apr + may + jun + aug + sep + oct 
              + sun + mon
              , data=bikes)
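An AIC-driven search of this kind can be sketched with `stepAIC` from MASS. This is an assumption about the procedure, since the report names only the library; the data below are simulated, and `noise1` and `noise2` are deliberately irrelevant predictors.

```r
library(MASS)

set.seed(7)
n <- 300

# Simulated stand-in: two real predictors and two pure-noise columns.
toy <- data.frame(
  temp      = runif(n),
  windspeed = runif(n),
  noise1    = rnorm(n),
  noise2    = rnorm(n)
)
toy$cnt <- 2000 + 4500 * toy$temp - 2500 * toy$windspeed +
  rnorm(n, sd = 800)

# Start from the full model and drop terms while AIC keeps falling.
full     <- lm(cnt ~ temp + windspeed + noise1 + noise2, data = toy)
selected <- stepAIC(full, direction = "backward", trace = FALSE)

print(formula(selected))
print(c(full = AIC(full), selected = AIC(selected)))
```

Lower AIC is better, so the selected model never has a higher AIC than the full model it started from. Stepwise search is greedy: it follows one path of single-term changes and does not evaluate every possible subset.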

We can see that a few variables were removed, such as jan, feb, jul, nov, workingday, and most days of the week except sun and mon. This model achieved the most impressive R-squared, 0.8432. However, it still includes 19 different variables. We still felt that this model was too complex, and so we did some further weeding out of every variable with a p-value > 1e-4.

## 
## Call:
## lm(formula = cnt ~ yr + winter + summer + dry + mist + temp + 
##     hum + windspeed + mar + sep + oct + sun, data = bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3943.2  -364.0   112.1   493.6  3115.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1261.66     342.05   3.689 0.000243 ***
## yr           2011.32      59.10  34.032  < 2e-16 ***
## winter      -1328.70      92.27 -14.400  < 2e-16 ***
## summer       -451.03      99.07  -4.553 6.23e-06 ***
## dry          1982.14     199.29   9.946  < 2e-16 ***
## mist         1526.77     186.67   8.179 1.30e-15 ***
## temp         4722.43     271.33  17.405  < 2e-16 ***
## hum         -1526.64     285.21  -5.353 1.17e-07 ***
## windspeed   -3013.40     407.55  -7.394 3.98e-13 ***
## mar           308.96     110.50   2.796 0.005311 ** 
## sep           859.54     115.30   7.455 2.59e-13 ***
## oct           638.87     111.38   5.736 1.43e-08 ***
## sun          -337.42      83.47  -4.042 5.86e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 787.5 on 718 degrees of freedom
## Multiple R-squared:  0.8375, Adjusted R-squared:  0.8348 
## F-statistic: 308.3 on 12 and 718 DF,  p-value: < 2.2e-16

Here, we reduced the model from 19 explanatory variables down to 12 while the adjusted R-squared only went down to 0.8348, barely a 1% reduction. This motivated us to continue trimming further, and we eliminated every variable with a p > 1e-10:

## 
## Call:
## lm(formula = cnt ~ yr + winter + summer + dry + precip + temp + 
##     windspeed, data = bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3671.9  -455.7    87.0   503.7  3341.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1833.39     169.07  10.844  < 2e-16 ***
## yr           2057.43      63.17  32.571  < 2e-16 ***
## winter      -1400.35      94.83 -14.767  < 2e-16 ***
## summer       -345.73      99.31  -3.481 0.000529 ***
## dry           635.07      67.47   9.413  < 2e-16 ***
## precip      -1597.80     195.58  -8.169 1.38e-15 ***
## temp         4438.35     285.74  15.533  < 2e-16 ***
## windspeed   -2512.25     417.02  -6.024 2.70e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 849.2 on 723 degrees of freedom
## Multiple R-squared:  0.8097, Adjusted R-squared:  0.8079 
## F-statistic: 439.5 on 7 and 723 DF,  p-value: < 2.2e-16

At this point, we were down to 7 remaining explanatory variables and still had an impressive adjusted R-squared of 0.8079. This is the final version of the model that includes both time and weather variables to be used to predict bike share data. The final variables are:

  • Year
  • Winter
  • Summer
  • Dry
  • Precipitation
  • Temperature
  • Windspeed
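To show how this model turns a day's conditions into a forecast, the sketch below applies the coefficients from the output above by hand. The example day (a dry summer day in 2012 with normalized temperature 0.7 and normalized windspeed 0.2) is hypothetical.

```r
# Coefficients copied from the time-plus-weather model output above.
coefs <- c(intercept = 1833.39, yr = 2057.43, winter = -1400.35,
           summer = -345.73, dry = 635.07, precip = -1597.80,
           temp = 4438.35, windspeed = -2512.25)

# Hypothetical day: 2012 (yr = 1), summer, dry, temp 0.7, windspeed 0.2.
day <- c(intercept = 1, yr = 1, winter = 0,
         summer = 1, dry = 1, precip = 0,
         temp = 0.7, windspeed = 0.2)

# The prediction is the sum of each coefficient times its input value.
predicted <- sum(coefs * day[names(coefs)])
print(predicted)
```

Because temp and windspeed are normalized, their large coefficients apply to values between 0 and 1 rather than to degrees or speed units.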

Finally, we wanted to build one last version of the model that did not include any weather variables and that could be used for long-term forecasting. We repeated the AIC process but without any weather-related variables. We started with a new model with all time-related variables:

## 
## Call:
## lm(formula = cnt ~ yr + holiday + workingday + winter + spring + 
##     summer + jan + feb + mar + apr + may + jun + jul + aug + 
##     sep + oct + nov + sun + mon + tue + wed + thu + fri, data = bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6177.3  -395.7   122.1   600.6  3237.9 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3053.322    178.714  17.085  < 2e-16 ***
## yr           2201.513     73.478  29.962  < 2e-16 ***
## holiday      -250.701    265.584  -0.944  0.34551    
## workingday    100.502    137.530   0.731  0.46516    
## winter      -1871.353    231.609  -8.080 2.80e-15 ***
## spring      -1132.452    272.513  -4.156 3.64e-05 ***
## summer       -783.547    246.094  -3.184  0.00152 ** 
## jan            -1.293    232.625  -0.006  0.99557    
## feb           441.295    235.714   1.872  0.06160 .  
## mar          1214.636    231.544   5.246 2.06e-07 ***
## apr          1553.542    294.805   5.270 1.82e-07 ***
## may          2419.957    293.951   8.233 8.82e-16 ***
## jun          2703.539    272.474   9.922  < 2e-16 ***
## jul          2287.806    291.652   7.844 1.61e-14 ***
## aug          2366.316    291.829   8.109 2.26e-15 ***
## sep          2272.904    247.621   9.179  < 2e-16 ***
## oct          1139.682    196.411   5.803 9.85e-09 ***
## nov           187.205    198.218   0.944  0.34527    
## sun          -357.651    137.170  -2.607  0.00932 ** 
## mon          -295.678    140.453  -2.105  0.03563 *  
## tue          -195.691    137.864  -1.419  0.15621    
## wed          -154.664    137.845  -1.122  0.26224    
## thu           -31.500    137.846  -0.229  0.81931    
## fri                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 993.2 on 708 degrees of freedom
## Multiple R-squared:  0.745,  Adjusted R-squared:  0.7371 
## F-statistic: 94.04 on 22 and 708 DF,  p-value: < 2.2e-16

The adjusted R-squared has dropped to 0.7371, but this is still not bad considering we are including no weather data beyond the season of the year.

best_time_lm <- lm(cnt 
                   ~ yr + holiday 
                   + winter + spring + summer 
                   + feb + mar + apr + may + jun + jul + aug + sep + oct 
                   + sun + mon
              , data=bikes)

Our adjusted R-squared made a tiny improvement from 0.7371 to 0.7380, and we still have 16 variables. We felt there was room again to trim the fat. We cut out all variables with p > 1e-10 again and we were left with just 7 variables:

## 
## Call:
## lm(formula = cnt ~ yr + winter + may + jun + jul + aug + sep, 
##     data = bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5590.7  -496.8    93.2   621.5  4138.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3413.30      78.38  43.548  < 2e-16 ***
## yr           2199.38      78.42  28.046  < 2e-16 ***
## winter      -1914.93     104.00 -18.413  < 2e-16 ***
## may           836.78     150.78   5.550 4.02e-08 ***
## jun          1259.37     152.77   8.244 7.84e-16 ***
## jul          1050.69     150.78   6.969 7.22e-12 ***
## aug          1151.43     150.78   7.637 7.07e-14 ***
## sep          1253.52     152.77   8.206 1.05e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1060 on 723 degrees of freedom
## Multiple R-squared:  0.7034, Adjusted R-squared:  0.7005 
## F-statistic: 244.9 on 7 and 723 DF,  p-value: < 2.2e-16

We maintained an adjusted R-squared above 0.7 with this simplified model. However, since it consists of just year, winter, and the late spring and summer months, we went one step simpler by keeping only year, winter, and summer:

## 
## Call:
## lm(formula = cnt ~ yr + winter + summer, data = bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5940.1  -570.7   111.4   695.4  4138.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3762.74      71.95  52.295  < 2e-16 ***
## yr           2199.38      82.82  26.557  < 2e-16 ***
## winter      -2264.38     101.92 -22.217  < 2e-16 ***
## summer        781.87     100.65   7.768 2.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1120 on 727 degrees of freedom
## Multiple R-squared:  0.6674, Adjusted R-squared:  0.666 
## F-statistic: 486.2 on 3 and 727 DF,  p-value: < 2.2e-16

Here, we had a more significant penalty for simplicity in the adjusted R-squared coming down to 0.6660, but this incredible model has just three variables and a high degree of explainability. It told us that 2/3 of the variability in ridership can be predicted by the year of operation with an added bump in the summer months and a dip in the winter months. This matched our intuition and hypothesis.

Q-Q Plot

Residuals vs. Fitted

As a final gut check, we also checked the Q-Q plot and residuals vs. fitted plot and we were happy with a well fit line and random residuals. This version is a model we would be happy to stand by in a board room when forecasting ridership years into the future.

To wrap up this section around predictive modeling, we thought about these two models as having two use cases:

  1. Long Term Forecasting: use the time-only model, with just year, summer, and winter needed for prediction. Explain 2/3 of the expected variability in ridership with the coefficients presented above.

  2. Short Term Forecasting: use the time-plus-weather model. The weather variables capture much of the seasonal effect, so the summer term plays only a small role. Look at whether it will rain that day, along with the temperature and windspeed, and you can explain 4/5 of the expected variability in ridership with the coefficients presented above.
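For the long-term use case, the time-only model reduces to a three-term formula. The sketch below hard-codes the coefficients from the year, winter, and summer model output above; the function name `predict_time_only` is our own.

```r
# Coefficients copied from the time-only model output above.
# yr is 0 for 2011 and 1 for 2012; winter and summer are 0/1 indicators.
predict_time_only <- function(yr, winter, summer) {
  3762.74 + 2199.38 * yr - 2264.38 * winter + 781.87 * summer
}

winter_2011 <- predict_time_only(yr = 0, winter = 1, summer = 0)
summer_2012 <- predict_time_only(yr = 1, winter = 0, summer = 1)
other_2012  <- predict_time_only(yr = 1, winter = 0, summer = 0)

print(c(winter_2011 = winter_2011,
        summer_2012 = summer_2012,
        other_2012  = other_2012))
```

The model was fit with yr taking only the values 0 and 1, so using yr = 2 for 2013 assumes that the year-over-year growth continues at the same rate. That is an extrapolation and not something the data can confirm.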

Conclusion


Conclusion: The goal of this research was to create a predictive model for combined user ridership from temporal variables such as day of the week and season. We suspected that weather also impacted ridership, and we attempted to explain some of the variability due to weather, but we excluded it from our long-term model because it is unknowable for future years. The working hypotheses are listed in the Research Goal and Hypotheses section above.

After analysis of the data, the Null Hypothesis was discarded as there were clear effects by several of the analyzed variables. For the Temporal Hypotheses, most were confirmed, though the strength of these effects varied significantly.

  1. Registered and Casual ridership are positively correlated. This was confirmed based on the analysis of same-day Registered and Casual rider use, though this is a relatively weak effect, accounting for approximately 16% of the variability.

  2. Weekdays will record higher Registered user ridership because of work-day commuting. This was confirmed through box-plot analysis, both individually and in aggregate. In 2011, the average Registered User weekday use was above 3300, where average weekend/holiday use were both below 2700. While all Registered users increased in 2012, the weekday difference was even more significant, with the average use above 5000, whereas weekend/holiday were both below 4000.

  3. Weekends will record higher Casual user ridership because of leisure use. This was confirmed through box-plot analysis, both individually and in aggregate, although the difference and change between years was distinctly smaller. In 2011, the average Casual weekend ridership was above 1200, where average Casual weekday ridership was below 800. All casual users increased in 2012, with average weekend ridership increasing to nearly 1900, while the holiday and weekday use only increased to approximately 1000 and 800, respectively.

  4. Holidays will record higher Casual ridership and lower Registered ridership. This was partially confirmed through box-plot analysis, both individually and in aggregate. For Registered Users, the 2011 average was 2300, well below both Weekday and Weekend use. The 2012 ridership increased to nearly 3500 but was still well below the other two categories. The hypothesized high Holiday ridership among Casual users did not manifest: it trailed behind Weekend use, but was still significantly greater than weekday use. One factor that should be mentioned about “Holiday” analysis is that there were only 21 recorded holidays across 2011 and 2012, and the majority of these fell on weekends. This really limits the value of using “Holiday” as a specific variable for predictive analysis.

For the Weather Hypotheses, all were confirmed at least in part.

  1. Higher temperatures lead to higher Registered and Casual ridership. This was confirmed through analysis of ridership compared to adjusted temperature. Throughout the range of temperatures, lower temperatures resulted in fewer riders, while higher temperatures resulted in increased ridership. This relationship is fairly continuous for Registered riders; for Casual riders, however, there appears to be an upper limit at .0625.

  2. Humidity will be unrelated to both Registered and Casual ridership. Analysis of the humidity variable showed no appreciable connection between changes in humidity and changes in either Registered or Casual ridership.

  3. Higher windspeeds will slightly decrease ridership for both casual and registered users. This was partially confirmed. Analysis showed that higher windspeeds resulted in a slight decrease of registered users, but a very limited effect on casual users.

  4. Weather will be more impactful on casual ridership than registered ridership. While individual weather events were not specifically examined, the fact that casual ridership fell more in the winter than registered ridership generally confirms this hypothesis.

Assuming this analysis was being given to the company before the 2013 year, we would conclude that they could anticipate further growth in ridership, with fairly predictable weekday, weekend and seasonal fluctuations.