tiny_m5
If you practice time series modeling, you have probably heard about the M5 challenge by Spyros Makridakis, hosted on Kaggle. It's an excellent dataset to play with if you work with demand time series. The whole dataset is really large, so we'll use a subset to demonstrate how to work with such data.
```r
unique(tiny_m5$store_id)
#> [1] "CA_1" "CA_2" "CA_3" "CA_4" "TX_1" "TX_2" "TX_3" "WI_1" "WI_2" "WI_3"

ca_1_data <-
  tiny_m5 %>%
  filter(store_id == "CA_1") %>%
  select(item_id, store_id, date, value, wday, month, year, snap, sell_price) %>%
  arrange(item_id, date)

ca_1_data %>%
  group_by(item_id) %>%
  summarise(n = n())
#> # A tibble: 28 × 2
#>    item_id         n
#>    <chr>       <int>
#>  1 FOODS_1_033  1913
#>  2 FOODS_1_046  1913
#>  3 FOODS_1_057  1913
#>  4 FOODS_1_218  1913
#>  5 FOODS_2_096  1913
#>  6 FOODS_2_181  1913
#>  7 FOODS_2_352  1913
#>  8 FOODS_2_360  1913
#>  9 FOODS_3_080  1913
#> 10 FOODS_3_377  1913
#> # … with 18 more rows

skimr::skim(ca_1_data)
```
| | |
|---|---|
| Name | ca_1_data |
| Number of rows | 53564 |
| Number of columns | 9 |
| Key | NULL |
| Column type frequency: | |
| character | 2 |
| Date | 1 |
| numeric | 6 |
| Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
item_id | 0 | 1 | 11 | 15 | 0 | 28 | 0 |
store_id | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2011-01-29 | 2016-04-24 | 2013-09-11 | 1913 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
value | 0 | 1.00 | 5.52 | 10.90 | 0.0 | 0.00 | 1.00 | 6.00 | 133.00 | ▇▁▁▁▁ |
wday | 0 | 1.00 | 4.00 | 2.00 | 1.0 | 2.00 | 4.00 | 6.00 | 7.00 | ▇▃▃▃▇ |
month | 0 | 1.00 | 6.36 | 3.46 | 1.0 | 3.00 | 6.00 | 9.00 | 12.00 | ▇▅▅▅▇ |
year | 0 | 1.00 | 2013.21 | 1.53 | 2011.0 | 2012.00 | 2013.00 | 2015.00 | 2016.00 | ▇▅▅▅▁ |
snap | 0 | 1.00 | 0.33 | 0.47 | 0.0 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
sell_price | 7693 | 0.86 | 3.33 | 3.34 | 0.5 | 0.97 | 1.68 | 4.97 | 11.54 | ▇▁▁▁▂ |
For deep learning models for time series, we'd typically like to create a three-dimensional tensor. In the analyzed case, the dimensions represent:

- the items (individual time series)
- the time steps
- the features (variables describing each observation)
The first dimension is "free" - we can add an arbitrary number of items. The length of the second dimension may vary as well; however, for convenience, we'll keep same-length time series. Otherwise, we'd have to use masking or split the dataset into multiple tensors. The size of the last dimension is guaranteed by the data.frame structure itself (each row has the same number of columns).
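To make that layout concrete, here is a toy sketch (the values are made up, not from `tiny_m5`) showing how two same-length series with two features stack into a `[items, timesteps, features]` tensor:

```r
library(torch)

# Two items, each with 3 time steps and 2 features, stacked along
# a new first dimension -> shape [2, 3, 2].
x <- torch_stack(list(
  torch_tensor(matrix(1:6,  ncol = 2)),  # item 1: 3 x 2
  torch_tensor(matrix(7:12, ncol = 2))   # item 2: 3 x 2
))
dim(x)
#> [1] 2 3 2
```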
As we can observe in the output of the `skim` function, there are time series with missing data.
As mentioned before, for simplicity's sake, we can just select a subset of items with the same series length as well as the same first and last date (a sketch of such a check follows below). This way, we can be sure that our data are properly aligned in the tensor. Later, in a separate vignette, we'll dive into a set of methods for handling missing or non-aligned multiple time series when training a deep learning model.
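A minimal sketch of such an alignment filter, assuming only `dplyr` (the object name `aligned_items` is illustrative, not part of the package):

```r
library(dplyr)

# Keep only items whose series cover the full, shared date range,
# so every series has the same length and the same endpoints.
aligned_items <-
  ca_1_data %>%
  group_by(item_id) %>%
  filter(
    n() == n_distinct(ca_1_data$date),
    min(date) == min(ca_1_data$date),
    max(date) == max(ca_1_data$date)
  ) %>%
  ungroup()
```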
```r
head(ca_1_data)
#>        item_id store_id       date value wday month year snap sell_price
#> 1: FOODS_1_033     CA_1 2011-01-29     0    1     1 2011    0         NA
#> 2: FOODS_1_033     CA_1 2011-01-30     0    2     1 2011    0         NA
#> 3: FOODS_1_033     CA_1 2011-01-31     0    3     1 2011    0         NA
#> 4: FOODS_1_033     CA_1 2011-02-01     0    4     2 2011    1         NA
#> 5: FOODS_1_033     CA_1 2011-02-02     0    5     2 2011    1         NA
#> 6: FOODS_1_033     CA_1 2011-02-03     0    6     2 2011    1         NA
```
The first column, `item_id`, identifies the item, and the second one (`date`) the current time step. These two columns will be used to create a data "fold", i.e. to form a 3D tensor. As we mentioned above, the completeness of the time steps is crucial to obtain a proper result from this transformation.
As for the rest of the columns:

- `value` is the target we want to predict
- `wday` is a categorical variable
- `month` is a categorical variable
- `year` can be treated as categorical, but in this case we may remove this variable and introduce a counter instead
- `snap` is a categorical variable
- `sell_price` is a real-valued variable

Summarizing, we have three categorical variables, which should be represented in some way. The most efficient way to represent categorical variables in a neural network is an embedding layer.
In fact, it works similarly to a linear (dense) layer. The difference is that instead of performing a resource-consuming dot product between the weight matrix and a sparse, one-hot-encoded input matrix, we simply use an index to select the "right" row from the weight matrix.
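A minimal sketch of this equivalence (the sizes and indices are illustrative, assuming only the `torch` package):

```r
library(torch)

# An embedding lookup returns the same values as multiplying a one-hot
# matrix by the weight matrix - it just selects rows by index instead.
emb <- nn_embedding(num_embeddings = 7, embedding_dim = 3)  # e.g. 7 weekday levels
idx <- c(1, 5)                                              # two observed categories

emb(torch_tensor(idx, dtype = torch_long()))  # index-based row selection

one_hot <- torch_eye(7)[idx, ]                # explicit one-hot matrix
torch_matmul(one_hot, emb$weight)             # same result via a dot product
```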
Let's transform the tabular data into a tensor. A good way to do it is to use the convenient `as_tensor` function. The first argument of the function (named `.data`) is a `data.frame` object, which we want to transform into a `torch_tensor`.
```r
colnames(ca_1_data)
#> [1] "item_id"    "store_id"   "date"       "value"      "wday"
#> [6] "month"      "year"       "snap"       "sell_price"
```
First, we'll select only `item_id`, `date` and the integer variables.
```r
ca_1 <-
  ca_1_data %>%
  select(item_id, date, wday, month, year, snap)

ca_1_tensor <-
  ca_1 %>%
  as_tensor(item_id, date)

dim(ca_1_tensor)
#> [1]   28 1913    4
class(ca_1_tensor)
#> [1] "torch_tensor" "R7"

ca_1_tensor <-
  ca_1_data %>%
  select(item_id, date, wday, month, year, snap, sell_price) %>%
  mutate(across(where(is.integer), as.numeric)) %>%
  as_tensor(item_id, date)

dim(ca_1_tensor)
#> [1]   28 1913    5
class(ca_1_tensor)
#> [1] "torch_tensor" "R7"
```
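As a quick usage note, the first dimension of the resulting tensor indexes items, so a single slice yields the full multivariate series for one item (a hedged sketch, assuming the tensor created above):

```r
# The full 1913-step, 5-feature series for the first item;
# indexing with a scalar drops the item dimension.
first_item <- ca_1_tensor[1, , ]
dim(first_item)
#> [1] 1913    5
```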
To speed up `torch` model development, the `torchts` package provides the easy-to-use `as_ts_dataset` method, which is a shortcut to create a `torch` dataset from a `data.frame`. For now, keys like `item_id` are not supported - this feature will be implemented in the near future. We'll present this function using the `weather_pl` dataset.
```r
library(rsample)

suwalki_temp <-
  weather_pl %>%
  filter(station == "SWK") %>%
  select(date, temp = tmax_daily)

# Splitting into training and test sets
data_split <- initial_time_split(suwalki_temp)

train_ds <-
  training(data_split) %>%
  as_ts_dataset(temp ~ date, timesteps = 20, horizon = 1)

train_ds[1]
#> $x
#> torch_tensor
#> -1.1453
#> -1.2624
#> -1.0867
#> -1.0282
#> -1.0672
#> -0.8330
#> -0.7354
#> -1.0282
#> -1.0087
#> -0.9891
#> -1.0477
#> -1.1746
#> -1.3893
#> -0.9891
#> -0.9794
#> -1.0965
#> -1.2527
#> -1.4186
#> -1.4869
#> -1.4088
#> [ CPUFloatType{20,1} ]
#>
#> $y
#> torch_tensor
#> -3.6000
#> [ CPUFloatType{1} ]
```
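Each element of this dataset is a sliding window: `x` holds 20 consecutive (standardized) input steps and `y` the value one step ahead. A quick sanity check, assuming only the objects created above (`torch` datasets support `length()` generically):

```r
# Number of (20-step input, 1-step-ahead target) windows
# available in the training data.
length(train_ds)

# The second window is simply the first one shifted one step forward.
train_ds[2]
```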
The quickest shortcut to get the needed data-providing object is to call the `as_ts_dataloader` function. It can be used as follows.
```r
train_dl <-
  training(data_split) %>%
  as_ts_dataloader(temp ~ date, timesteps = 20, horizon = 1, batch_size = 32)

train_dl
#> <dataloader>
#>   Public:
#>     .auto_collation: active binding
#>     .dataset_kind: map
#>     .has_getbatch: FALSE
#>     .index_sampler: active binding
#>     .iter: function ()
#>     .length: function ()
#>     batch_sampler: utils_sampler_batch, utils_sampler, R6
#>     batch_size: 32
#>     clone: function (deep = FALSE)
#>     collate_fn: function (batch)
#>     dataset: ts_dataset, dataset, R6
#>     drop_last: FALSE
#>     generator: NULL
#>     initialize: function (dataset, batch_size = 1, shuffle = FALSE, sampler = NULL,
#>     multiprocessing_context: NULL
#>     num_workers: 0
#>     pin_memory: FALSE
#>     sampler: utils_sampler_sequential, utils_sampler, R6
#>     timeout: -1
#>     worker_globals: NULL
#>     worker_init_fn: NULL
#>     worker_packages: NULL

dataloader_next(dataloader_make_iter(train_dl))
#> $x
#> torch_tensor
#> (1,.,.) =
#>  -1.1453
#>  -1.2624
#>  -1.0867
#>  -1.0282
#>  -1.0672
#>  -0.8330
#>  -0.7354
#>  -1.0282
#>  -1.0087
#>  -0.9891
#>  -1.0477
#>  -1.1746
#>  -1.3893
#>  -0.9891
#>  -0.9794
#>  -1.0965
#>  -1.2527
#>  -1.4186
#>  -1.4869
#>  -1.4088
#>
#> (2,.,.) =
#>  -1.2624
#>  -1.0867
#>  -1.0282
#>  -1.0672
#>  -0.8330
#>  -0.7354
#>  -1.0282
#> ... [the output was truncated (use n=-1 to disable)]
#> [ CPUFloatType{32,20,1} ]
#>
#> $y
#> torch_tensor
#>  -3.6000
#>  -5.0000
#>  -4.2000
#>  -2.5000
#>   0.6000
#>   1.5000
#>   1.5000
#>   1.0000
#>   2.0000
#>   1.2000
#>   0.2000
#>  -2.2000
#>  -5.1000
#>  -6.7000
#> -12.1000
#>  -6.7000
#>   2.6000
#>   4.9000
#>   5.0000
#>   8.8000
#>   3.9000
#>   2.1000
#>   5.3000
#>   5.5000
#>   2.4000
#>   2.1000
#>   3.8000
#>   2.0000
#>   0.3000
#>   0.1000
#> ... [the output was truncated (use n=-1 to disable)]
#> [ CPUFloatType{32,1} ]
```
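To close the loop, here is a minimal consumption sketch. The model below is illustrative - a plain LSTM regressor written directly in `torch`, not a `torchts` model - and shows how such a dataloader is typically iterated during training:

```r
library(torch)

# A single-layer LSTM regressor trained for one epoch on train_dl batches.
net <- nn_module(
  initialize = function() {
    self$lstm <- nn_lstm(input_size = 1, hidden_size = 16, batch_first = TRUE)
    self$fc   <- nn_linear(16, 1)
  },
  forward = function(x) {
    out <- self$lstm(x)[[1]]         # [batch, timesteps, hidden]
    self$fc(out[, dim(out)[2], ])    # last time step -> [batch, 1]
  }
)()

opt <- optim_adam(net$parameters, lr = 0.01)

coro::loop(for (batch in train_dl) {
  opt$zero_grad()
  loss <- nnf_mse_loss(net(batch$x), batch$y)
  loss$backward()
  opt$step()
})
```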