Time Series Analysis and Forecasting With ARIMA
In this article we’ll focus on sales forecasting for a company that manufactures bird repellers. We have a univariate dataset of monthly sales covering 2003 to 2014, with each observation recorded on the 1st of the month.
First, we look at the sales data over the years. The values vary, and the distribution is positively skewed, i.e. it has a heavy tail towards the right end of the “Sales” distribution. This indicates that a few sales values sit at the higher end, while most are concentrated at the lower end and near the mean, which is approx. 390. Sales range from 138 to 871. Below is the distribution plot of the sales of bird repellers over the years.
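The summary statistics and the skew mentioned above can be checked with a few lines of pandas. This is only a sketch: the series below is synthetic stand-in data (the names `sales`, `level` and `season` are made up for illustration), so its numbers will not match the ones quoted in this article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: one value on the 1st of each month, 2003-2014.
dates = pd.date_range("2003-01-01", "2014-12-01", freq="MS")
rng = np.random.default_rng(0)
level = np.linspace(200, 600, len(dates))
season = 1 + 0.3 * np.sin(2 * np.pi * (dates.month.to_numpy() - 4) / 12)
sales = pd.Series(level * season + rng.normal(0, 10, len(dates)), index=dates, name="Sales")

summary = sales.describe()  # count, mean, std, min, quartiles, max
skewness = sales.skew()     # a positive value means a longer right tail
print(summary.round(1))
print(f"skewness: {skewness:.2f}")
```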
Second, we create features from the dates provided, so that we can look at yearly, quarterly and monthly sales. With this we note that the sales magnitude increases consistently over the years, following a linear upward trend. In addition, seasonality is visible on a quarterly and monthly basis: maximum sales occur during the 3rd quarter followed by the 2nd, specifically in July and August.
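A minimal sketch of the date features described above, assuming the data sits in a pandas Series with a monthly DatetimeIndex. The series is synthetic and the column names `year`, `quarter` and `month` are my own choice here, not necessarily those used in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real monthly sales series.
dates = pd.date_range("2003-01-01", "2014-12-01", freq="MS")
rng = np.random.default_rng(0)
level = np.linspace(200, 600, len(dates))
season = 1 + 0.3 * np.sin(2 * np.pi * (dates.month.to_numpy() - 4) / 12)
sales = pd.Series(level * season + rng.normal(0, 10, len(dates)), index=dates, name="Sales")

# Derive calendar features from the index.
df = sales.to_frame()
df["year"] = df.index.year
df["quarter"] = df.index.quarter
df["month"] = df.index.month

# Aggregate at each granularity to inspect trend and seasonality.
yearly = df.groupby("year")["Sales"].sum()
quarterly = df.groupby("quarter")["Sales"].mean()
monthly = df.groupby("month")["Sales"].mean()
print(yearly.round(0))
print(quarterly.round(1))
print(monthly.round(1))
```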
Next we look at the months with maximum sales in each year. August, followed by July, saw the maximum sales, except in 2005 and 2007, where the maximum occurred in May followed by August. The third, fourth and fifth positions fluctuate between April, May and June for the rest of the years. Below is the graphical representation of these observations.
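One way to rank the months within each year is sketched below. It runs on synthetic data, so the peak months it prints are a property of that toy series and not of the real sales figures.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real monthly sales series.
dates = pd.date_range("2003-01-01", "2014-12-01", freq="MS")
rng = np.random.default_rng(0)
level = np.linspace(200, 600, len(dates))
season = 1 + 0.3 * np.sin(2 * np.pi * (dates.month.to_numpy() - 4) / 12)
sales = pd.Series(level * season + rng.normal(0, 10, len(dates)), index=dates, name="Sales")

df = sales.to_frame()
df["year"] = df.index.year
df["month"] = df.index.month

# Sort months within each year from highest to lowest sales.
ranked = df.sort_values(["year", "Sales"], ascending=[True, False])
top2 = ranked.groupby("year").head(2)                             # two best months per year
peak = ranked.groupby("year").head(1).set_index("year")["month"]  # best month per year
print(top2[["year", "month", "Sales"]].round(1))
print(peak)
```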
On a yearly basis, looking at the percentage change in sales, 2005 accounts for the largest increase at 22%, followed by 2009 and 2010 with increases of about 19% and 16% over the previous year respectively.
On a quarterly basis, we observed the largest increases during the 2nd quarter throughout the years; the largest percentage changes were around 38% for the 3rd quarter of 2014, followed by 34% for the 3rd quarter of 2013.
Decreases in sales have been observed mostly in the 4th quarter throughout the years, where losses occur as the demand for bird repellers drops.
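The yearly and quarterly percentage changes discussed above can be computed with `pct_change`. In this sketch I assume the quarterly change is measured against the immediately preceding quarter; the data is synthetic, so the percentages will differ from those quoted.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real monthly sales series.
dates = pd.date_range("2003-01-01", "2014-12-01", freq="MS")
rng = np.random.default_rng(0)
level = np.linspace(200, 600, len(dates))
season = 1 + 0.3 * np.sin(2 * np.pi * (dates.month.to_numpy() - 4) / 12)
sales = pd.Series(level * season + rng.normal(0, 10, len(dates)), index=dates, name="Sales")

# Year-over-year change in total sales, in percent.
yearly = sales.groupby(sales.index.year).sum()
yearly_pct = yearly.pct_change() * 100

# Quarter-over-quarter change in total sales, in percent.
quarterly = sales.resample("QS").sum()
quarterly_pct = quarterly.pct_change() * 100

print(yearly_pct.round(1))
print(quarterly_pct.round(1).tail(8))
```

Negative values in `quarterly_pct` mark the quarters where sales fell relative to the previous quarter.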
To look for outliers in the data, we plotted boxplots on a yearly, quarterly and monthly basis. They show the variation over time, but no outliers were observed.
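A boxplot conventionally flags points that lie more than 1.5 times the interquartile range beyond the quartiles. The same rule can be applied numerically, as sketched here on synthetic data (the 1.5 factor is the usual default whisker length, which I assume was not changed in the plots).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real monthly sales series.
dates = pd.date_range("2003-01-01", "2014-12-01", freq="MS")
rng = np.random.default_rng(0)
level = np.linspace(200, 600, len(dates))
season = 1 + 0.3 * np.sin(2 * np.pi * (dates.month.to_numpy() - 4) / 12)
sales = pd.Series(level * season + rng.normal(0, 10, len(dates)), index=dates, name="Sales")

# Whisker limits used by a standard boxplot.
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whiskers would be drawn as outliers.
outliers = sales[(sales < lower) | (sales > upper)]
print(f"limits: [{lower:.1f}, {upper:.1f}], outliers found: {len(outliers)}")
```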
Lastly, we create a lag_feature by shifting the sales values by 12 months. We did this after plotting the ACF, which shows that recent sales values have a higher correlation than the previous year’s values, with a gradual decrease over the lags.
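A sketch of the 12-month lag feature, together with a rough look at the correlation at each lag. It uses pandas' `Series.autocorr`, which is a simple stand-in for a full ACF plot, and it runs on synthetic data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real monthly sales series.
dates = pd.date_range("2003-01-01", "2014-12-01", freq="MS")
rng = np.random.default_rng(0)
level = np.linspace(200, 600, len(dates))
season = 1 + 0.3 * np.sin(2 * np.pi * (dates.month.to_numpy() - 4) / 12)
sales = pd.Series(level * season + rng.normal(0, 10, len(dates)), index=dates, name="Sales")

# Each row gets the sales value from the same month one year earlier.
df = sales.to_frame()
df["lag_feature"] = df["Sales"].shift(12)

# Correlation between the series and its own past at lags 1 to 12.
acf_by_lag = pd.Series({lag: sales.autocorr(lag=lag) for lag in range(1, 13)})
print(df.head(14))
print(acf_by_lag.round(2))
```

The first 12 rows of `lag_feature` are empty because there is no earlier year to shift from, so they have to be dropped or filled before modelling.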
The time series is non-stationary, which we confirmed with the adfuller (Augmented Dickey-Fuller) test. To convert it into stationary data, we applied a log transformation followed by double differencing.
For the univariate forecast, we first applied a Holt-Winters model with an additive trend and multiplicative seasonality, because the sales magnitude increases over time. The same multiplicative logic was applied when doing the seasonal decomposition of the time series. The model captures the seasonality and the trend effectively, with a root mean squared error of approx. 33.72 on the test data.
Secondly, we use the multivariate data we created for forecasting with ARIMA, including seasonal and exogenous factors. First, variables such as year, month and quarter are converted to dummy variables, and the lag_feature we created earlier is standardized.
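The preparation of the exogenous features could look like the sketch below. Dropping the first dummy level (`drop_first=True`) and standardizing with the population standard deviation are my assumptions; the notebook may do either differently. The data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real monthly sales series.
dates = pd.date_range("2003-01-01", "2014-12-01", freq="MS")
rng = np.random.default_rng(0)
level = np.linspace(200, 600, len(dates))
season = 1 + 0.3 * np.sin(2 * np.pi * (dates.month.to_numpy() - 4) / 12)
sales = pd.Series(level * season + rng.normal(0, 10, len(dates)), index=dates, name="Sales")

# Calendar features plus the 12-month lag; the first year has no lag and is dropped.
df = sales.to_frame()
df["year"] = df.index.year
df["quarter"] = df.index.quarter
df["month"] = df.index.month
df["lag_feature"] = df["Sales"].shift(12)
df = df.dropna()

# One-hot encode the calendar variables (one level per variable is dropped).
features = pd.get_dummies(
    df.drop(columns="Sales"),
    columns=["year", "quarter", "month"],
    drop_first=True,
    dtype=float,
)

# Standardize the lag feature to zero mean and unit variance.
lag = features["lag_feature"]
features["lag_feature"] = (lag - lag.mean()) / lag.std(ddof=0)
print(features.shape)
print(features.iloc[:3, :6])
```

In a real forecast the mean and standard deviation should be computed on the training rows only and then reused on the test rows, so that no information leaks from the test period.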
With the help of the pmdarima library we use auto_arima to select the orders (p, d, q) for a seasonal period of 12. Here p is the auto-regressive order (usually read from the PACF), d is the degree of differencing (the “integrated” part) and q is the moving-average order (usually read from the ACF). The model captures the seasonality as well as the trend effectively, but its root mean squared error of 42 is higher than that of the Holt-Winters univariate model.
Here in the link you can find the notebook and the sample data that has been created.
I hope you enjoyed reading the article. In case of any questions, kindly connect with me on LinkedIn, and do provide your valuable feedback.