In the next couple of posts, I'd like to talk about time-series analysis in R. The Global Superstore data has an Order_Date column, so we can use time-series techniques to forecast future values. The picture above shows the goal of this analysis.

I'm skipping the theory here (I prefer to practice first), so for the precise definition of time-series data, please study it by yourself.

In this post, I will talk about

1. how to aggregate data by a certain category

2. how to convert an ordinary data frame into time-series data

For the sake of simplicity, let's use only Sales, Quantity and Discount in this post (my code for all variables is at the very end of this post).

The data import and sorting steps are here.
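As a minimal sketch of the sorting step, here is a toy data frame standing in for the real file. The values and the date format are made up for illustration; only the column names Order_Date, Sales, Quantity and Discount come from this post.

```r
# Toy stand-in for the Global Superstore data (values are made up).
data <- data.frame(
  Order_Date = c("2014-01-03", "2014-01-01", "2014-01-03", "2014-01-02"),
  Sales      = c(100.5, 20.0, 35.25, 60.0),
  Quantity   = c(2, 1, 3, 4),
  Discount   = c(0.1, 0.0, 0.2, 0.0),
  stringsAsFactors = FALSE
)

# Convert the date column from character to Date so it sorts chronologically.
# The "%Y-%m-%d" format is an assumption; adjust it to match the real file.
data$Order_Date <- as.Date(data$Order_Date, format = "%Y-%m-%d")

# Sort the rows by order date, oldest first.
data <- data[order(data$Order_Date), ]
print(data)
```

Note that 2014-01-03 appears twice after sorting, which is exactly the situation the aggregation step has to handle.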

You will notice that some records share the same date, so let's aggregate them.

Let me talk about the aggregate() function a bit.

This function is quite simple. You just need the data to be aggregated, a column (or columns) to group by, and a function to apply to each group.

In this code, I extract Sales, Quantity and Discount with data[,2:4], which also keeps their names. Then I state that they are grouped by Order_Date with the "by" argument (a list containing the Order_Date column), and I calculate their sums for each day.
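The aggregation described above can be sketched like this. It is a minimal example on made-up toy data, not the exact code from the post; the name `daily` is just one I picked for the result.

```r
# Toy data with two records on 2014-01-03 (values are made up).
data <- data.frame(
  Order_Date = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  Sales      = c(20.0, 60.0, 100.5, 35.25),
  Quantity   = c(1, 4, 2, 3),
  Discount   = c(0.0, 0.0, 0.1, 0.2)
)

# Columns 2:4 are Sales, Quantity and Discount.
# Group them by Order_Date and sum within each day.
daily <- aggregate(data[, 2:4],
                   by  = list(Order_Date = data$Order_Date),
                   FUN = sum)
print(daily)
```

Naming the list element (`Order_Date = ...`) keeps a readable column name in the result; without it the grouping column is called Group.1. Summing Discount follows what the post does, though a mean might be easier to interpret for a rate-like variable.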

Next, to transform the data into time-series data, I used this code.

You need the "zoo" and "xts" packages according to __this website__ (in Japanese). (In fact, it didn't work without the zoo() function. I'll leave understanding exactly what the function does for the future.)
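One way the conversion could look is sketched below, assuming the zoo and xts packages are installed. The toy `daily` data frame and the variable names are mine, and the exact calls in the original code may differ.

```r
library(zoo)
library(xts)

# Toy daily totals, as produced by the aggregation step (values are made up).
daily <- data.frame(
  Order_Date = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03")),
  Sales      = c(20.0, 60.0, 135.75),
  Quantity   = c(1, 4, 5),
  Discount   = c(0.0, 0.0, 0.3)
)

# Build a zoo object: the numeric columns become the data matrix
# and the dates become the time index.
z <- zoo(as.matrix(daily[, c("Sales", "Quantity", "Discount")]),
         order.by = daily$Order_Date)

# Convert the zoo object to xts.
ts_data <- as.xts(z)
print(ts_data)
```

The key idea is that the date column stops being an ordinary column and becomes the index of the series.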

Finally, this is the plot of the data in 2014. In the next post, I will talk about a predictive model for this Global Superstore data.
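For reference, a single year can be pulled out of an xts object by indexing with the year as a string. This is a rough sketch on made-up data, assuming zoo and xts are installed; it is not the code that produced the plot in this post.

```r
library(zoo)
library(xts)

# Toy series spanning two years (values are made up).
dates  <- as.Date(c("2013-12-30", "2013-12-31", "2014-01-01", "2014-01-02"))
values <- cbind(Sales    = c(50, 75, 20, 60),
                Quantity = c(2, 3, 1, 4))
x <- xts(values, order.by = dates)

# xts allows date-based subsetting with a string: keep only 2014.
x2014 <- x["2014"]
print(x2014)

# Plot each variable in its own panel (plot.zoo does this for multivariate data).
plot(as.zoo(x2014), main = "Toy data, 2014 only")
```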

See you.

This is an additional note, but here is my entire data prep process for the modeling, including feature elimination/selection.

Let's look at the result of this preparation after we see our model's accuracy. These predictor variables may be wrong (they might be too rough), but I'm looking forward to seeing the result.

(9/29 added)

It didn't work. According to __Stackflow__, if there are too many variables, we can't do VAR selection, which means that we can't build our model. With weekly data, we know in this case that sales follow a seasonal trend, so we need at least 48 weeks' data.

Then, according to the Stackflow post, if the number of variables D > 209 / 48 - 1 ≈ 3.4, we can't build a model.

I feel this means it is quite difficult to use categorical variables for time-series analysis. So let's use only Sales, Quantity and Discount (because now D = 3, which is less than 3.4).
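The arithmetic can be checked quickly in R. I'm assuming here that 209 is the number of weekly observations and 48 is the number of weeks needed to cover the seasonal pattern; the variable names are just for illustration.

```r
n_obs  <- 209  # assumed: number of weekly observations
n_lags <- 48   # weeks needed to cover the seasonal pattern

# Upper bound on the number of variables D from the rule quoted above.
d_max <- n_obs / n_lags - 1
print(round(d_max, 2))  # 3.35

# With Sales, Quantity and Discount we have D = 3.
d <- 3
print(d < d_max)        # TRUE
```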