tiny_m5
If you practice time series modeling, you have probably heard about the M5 challenge by Spyros Makridakis, hosted on Kaggle. It's an excellent dataset to play with if you work with demand time series. The whole dataset is really large, so we'll use a subset to demonstrate how to work with such data.
```r
unique(tiny_m5$store_id)
#> [1] "CA_1" "CA_2" "CA_3" "CA_4" "TX_1" "TX_2" "TX_3" "WI_1" "WI_2" "WI_3"

ca_1_data <-
  tiny_m5 %>%
  filter(store_id == "CA_1") %>%
  select(item_id, store_id, date, value, wday, month, year, snap, sell_price) %>%
  arrange(item_id, date)

ca_1_data %>%
  group_by(item_id) %>%
  summarise(n = n())
#> # A tibble: 28 × 2
#>    item_id         n
#>    <chr>       <int>
#>  1 FOODS_1_033  1913
#>  2 FOODS_1_046  1913
#>  3 FOODS_1_057  1913
#>  4 FOODS_1_218  1913
#>  5 FOODS_2_096  1913
#>  6 FOODS_2_181  1913
#>  7 FOODS_2_352  1913
#>  8 FOODS_2_360  1913
#>  9 FOODS_3_080  1913
#> 10 FOODS_3_377  1913
#> # … with 18 more rows

skimr::skim(ca_1_data)
```
| | |
|---|---|
| Name | ca_1_data |
| Number of rows | 53564 |
| Number of columns | 9 |
| Key | NULL |
| Column type frequency: | |
| character | 2 |
| Date | 1 |
| numeric | 6 |
| Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
item_id | 0 | 1 | 11 | 15 | 0 | 28 | 0 |
store_id | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2011-01-29 | 2016-04-24 | 2013-09-11 | 1913 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
value | 0 | 1.00 | 5.52 | 10.90 | 0.0 | 0.00 | 1.00 | 6.00 | 133.00 | ▇▁▁▁▁ |
wday | 0 | 1.00 | 4.00 | 2.00 | 1.0 | 2.00 | 4.00 | 6.00 | 7.00 | ▇▃▃▃▇ |
month | 0 | 1.00 | 6.36 | 3.46 | 1.0 | 3.00 | 6.00 | 9.00 | 12.00 | ▇▅▅▅▇ |
year | 0 | 1.00 | 2013.21 | 1.53 | 2011.0 | 2012.00 | 2013.00 | 2015.00 | 2016.00 | ▇▅▅▅▁ |
snap | 0 | 1.00 | 0.33 | 0.47 | 0.0 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
sell_price | 7693 | 0.86 | 3.33 | 3.34 | 0.5 | 0.97 | 1.68 | 4.97 | 11.54 | ▇▁▁▁▂ |
For deep learning models for time series, we'd typically like to create a three-dimensional tensor. In the analyzed case, the dimensions represent:

- the items (individual time series)
- the time steps
- the features (variables describing each observation)
The first dimension is "free" - we can add an arbitrary number of items. The length of the second dimension may vary as well; however, for convenience, we'll keep same-length time series. Otherwise, we'd have to use masking or split the dataset into multiple tensors. The size of the last dimension is guaranteed by the data.frame structure itself (each row has the same number of columns).
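To make that layout concrete, here is a toy sketch (the values are made up, not from `tiny_m5`) showing how two same-length series with two features stack into a `[items, timesteps, features]` tensor:

```r
library(torch)

# Two items, each with 3 time steps and 2 features, stacked along
# a new first dimension -> shape [2, 3, 2].
x <- torch_stack(list(
  torch_tensor(matrix(1:6,  ncol = 2)),  # item 1: 3 x 2
  torch_tensor(matrix(7:12, ncol = 2))   # item 2: 3 x 2
))
dim(x)
#> [1] 2 3 2
```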
As we can observe in the output of the `skim` function, there are time series with missing data.
As mentioned before, for simplicity's sake, we can just select a subset of items with the same series length as well as the same first and last date (a sketch of such a check follows below). This way, we can be sure that our data are properly aligned in the tensor. Later, in a separate vignette, we'll dive into a set of methods for handling missing or non-aligned multiple time series when training a deep learning model.
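A minimal sketch of such an alignment filter, assuming only `dplyr` (the object name `aligned_items` is illustrative, not part of the package):

```r
library(dplyr)

# Keep only items whose series cover the full, shared date range,
# so every series has the same length and the same endpoints.
aligned_items <-
  ca_1_data %>%
  group_by(item_id) %>%
  filter(
    n() == n_distinct(ca_1_data$date),
    min(date) == min(ca_1_data$date),
    max(date) == max(ca_1_data$date)
  ) %>%
  ungroup()
```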
```r
head(ca_1_data)
#>        item_id store_id       date value wday month year snap sell_price
#> 1: FOODS_1_033     CA_1 2011-01-29     0    1     1 2011    0         NA
#> 2: FOODS_1_033     CA_1 2011-01-30     0    2     1 2011    0         NA
#> 3: FOODS_1_033     CA_1 2011-01-31     0    3     1 2011    0         NA
#> 4: FOODS_1_033     CA_1 2011-02-01     0    4     2 2011    1         NA
#> 5: FOODS_1_033     CA_1 2011-02-02     0    5     2 2011    1         NA
#> 6: FOODS_1_033     CA_1 2011-02-03     0    6     2 2011    1         NA
```
The first column, `item_id`, identifies the item, and the second one (`date`) the current time step. These two columns will be used to create a data "fold", i.e. to form a 3D tensor. As we mentioned above, the completeness of the time steps is crucial to obtain a proper result from this transformation.
As for the rest of the columns:

- `value` is the target we want to predict
- `wday` is a categorical variable
- `month` is a categorical variable
- `year` can be treated as categorical, but in this case we may remove this variable and introduce a counter instead
- `snap` is a categorical variable
- `sell_price` is a real-valued variable

Summarizing, we have three categorical variables, which should be represented in some way. The most efficient way to represent categorical variables in a neural network is an embedding layer.
In fact, it works similarly to a linear (dense) layer. The difference is that instead of performing a resource-consuming dot product between the weight matrix and a sparse, one-hot-encoded input matrix, we simply use an index to select the "right" row from the weight matrix.
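A minimal sketch of this equivalence (the sizes and indices are illustrative, assuming only the `torch` package):

```r
library(torch)

# An embedding lookup returns the same values as multiplying a one-hot
# matrix by the weight matrix - it just selects rows by index instead.
emb <- nn_embedding(num_embeddings = 7, embedding_dim = 3)  # e.g. 7 weekday levels
idx <- c(1, 5)                                              # two observed categories

emb(torch_tensor(idx, dtype = torch_long()))  # index-based row selection

one_hot <- torch_eye(7)[idx, ]                # explicit one-hot matrix
torch_matmul(one_hot, emb$weight)             # same result via a dot product
```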
Let's transform the tabular data into a tensor. A good way to do it is to use the convenient `as_tensor` function. The first argument of the function (named `.data`) is a `data.frame` object, which we want to transform into a `torch_tensor`.
```r
colnames(ca_1_data)
#> [1] "item_id"    "store_id"   "date"       "value"      "wday"
#> [6] "month"      "year"       "snap"       "sell_price"
```
First, we'll select only `item_id`, `date` and the integer variables.
```r
ca_1 <-
  ca_1_data %>%
  select(item_id, date, wday, month, year, snap)

ca_1_tensor <-
  ca_1 %>%
  as_tensor(item_id, date)

dim(ca_1_tensor)
#> [1]   28 1913    4
class(ca_1_tensor)
#> [1] "torch_tensor" "R7"

ca_1_tensor <-
  ca_1_data %>%
  select(item_id, date, wday, month, year, snap, sell_price) %>%
  mutate(across(where(is.integer), as.numeric)) %>%
  as_tensor(item_id, date)

dim(ca_1_tensor)
#> [1]   28 1913    5
class(ca_1_tensor)
#> [1] "torch_tensor" "R7"
```
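As a quick usage note, the first dimension of the resulting tensor indexes items, so a single slice yields the full multivariate series for one item (a hedged sketch, assuming the tensor created above):

```r
# The full 1913-step, 5-feature series for the first item;
# indexing with a scalar drops the item dimension.
first_item <- ca_1_tensor[1, , ]
dim(first_item)
#> [1] 1913    5
```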
To speed up `torch` model development, the `torchts` package provides the easy-to-use `as_ts_dataset` method, which is a shortcut to create a `torch` dataset from a `data.frame`. For now, keys like `item_id` are not supported - this feature will be implemented in the near future. We'll present this function using the `weather_pl` dataset.
```r
library(rsample)

suwalki_temp <-
  weather_pl %>%
  filter(station == "SWK") %>%
  select(date, temp = tmax_daily)

# Splitting into training and test sets
data_split <- initial_time_split(suwalki_temp)

train_ds <-
  training(data_split) %>%
  as_ts_dataset(temp ~ date, timesteps = 20, horizon = 1)

train_ds[1]
#> $x
#> torch_tensor
#> -1.1453
#> -1.2624
#> -1.0867
#> -1.0282
#> -1.0672
#> -0.8330
#> -0.7354
#> -1.0282
#> -1.0087
#> -0.9891
#> -1.0477
#> -1.1746
#> -1.3893
#> -0.9891
#> -0.9794
#> -1.0965
#> -1.2527
#> -1.4186
#> -1.4869
#> -1.4088
#> [ CPUFloatType{20,1} ]
#>
#> $y
#> torch_tensor
#> -3.6000
#> [ CPUFloatType{1} ]
```
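Each element of this dataset is a sliding window: `x` holds 20 consecutive (standardized) input steps and `y` the value one step ahead. A quick sanity check, assuming only the objects created above (`torch` datasets support `length()` generically):

```r
# Number of (20-step input, 1-step-ahead target) windows
# available in the training data.
length(train_ds)

# The second window is simply the first one shifted one step forward.
train_ds[2]
```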
The quickest shortcut to get the needed data-providing object is to call the `as_ts_dataloader` function. It can be used as follows.
```r
train_dl <-
  training(data_split) %>%
  as_ts_dataloader(temp ~ date, timesteps = 20, horizon = 1, batch_size = 32)

train_dl
#> <dataloader>
#>   Public:
#>     .auto_collation: active binding
#>     .dataset_kind: map
#>     .has_getbatch: FALSE
#>     .index_sampler: active binding
#>     .iter: function ()
#>     .length: function ()
#>     batch_sampler: utils_sampler_batch, utils_sampler, R6
#>     batch_size: 32
#>     clone: function (deep = FALSE)
#>     collate_fn: function (batch)
#>     dataset: ts_dataset, dataset, R6
#>     drop_last: FALSE
#>     generator: NULL
#>     initialize: function (dataset, batch_size = 1, shuffle = FALSE, sampler = NULL,
#>     multiprocessing_context: NULL
#>     num_workers: 0
#>     pin_memory: FALSE
#>     sampler: utils_sampler_sequential, utils_sampler, R6
#>     timeout: -1
#>     worker_globals: NULL
#>     worker_init_fn: NULL
#>     worker_packages: NULL

dataloader_next(dataloader_make_iter(train_dl))
#> $x
#> torch_tensor
#> (1,.,.) =
#>  -1.1453
#>  -1.2624
#>  -1.0867
#>  -1.0282
#>  -1.0672
#>  -0.8330
#>  -0.7354
#>  -1.0282
#>  -1.0087
#>  -0.9891
#>  -1.0477
#>  -1.1746
#>  -1.3893
#>  -0.9891
#>  -0.9794
#>  -1.0965
#>  -1.2527
#>  -1.4186
#>  -1.4869
#>  -1.4088
#>
#> (2,.,.) =
#>  -1.2624
#>  -1.0867
#>  -1.0282
#>  -1.0672
#>  -0.8330
#>  -0.7354
#>  -1.0282
#> ... [the output was truncated (use n=-1 to disable)]
#> [ CPUFloatType{32,20,1} ]
#>
#> $y
#> torch_tensor
#>  -3.6000
#>  -5.0000
#>  -4.2000
#>  -2.5000
#>   0.6000
#>   1.5000
#>   1.5000
#>   1.0000
#>   2.0000
#>   1.2000
#>   0.2000
#>  -2.2000
#>  -5.1000
#>  -6.7000
#> -12.1000
#>  -6.7000
#>   2.6000
#>   4.9000
#>   5.0000
#>   8.8000
#>   3.9000
#>   2.1000
#>   5.3000
#>   5.5000
#>   2.4000
#>   2.1000
#>   3.8000
#>   2.0000
#>   0.3000
#>   0.1000
#> ... [the output was truncated (use n=-1 to disable)]
#> [ CPUFloatType{32,1} ]
```
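To close the loop, here is a minimal consumption sketch. The model below is illustrative - a plain LSTM regressor written directly in `torch`, not a `torchts` model - and shows how such a dataloader is typically iterated during training:

```r
library(torch)

# A single-layer LSTM regressor trained for one epoch on train_dl batches.
net <- nn_module(
  initialize = function() {
    self$lstm <- nn_lstm(input_size = 1, hidden_size = 16, batch_first = TRUE)
    self$fc   <- nn_linear(16, 1)
  },
  forward = function(x) {
    out <- self$lstm(x)[[1]]         # [batch, timesteps, hidden]
    self$fc(out[, dim(out)[2], ])    # last time step -> [batch, 1]
  }
)()

opt <- optim_adam(net$parameters, lr = 0.01)

coro::loop(for (batch in train_dl) {
  opt$zero_grad()
  loss <- nnf_mse_loss(net(batch$x), batch$y)
  loss$backward()
  opt$step()
})
```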