library(dplyr, warn.conflicts = FALSE)
library(torch)
library(torchts)

A look at the tiny_m5 dataset

If you practice time series modeling, you have probably heard about the M5 competition by Spyros Makridakis, hosted on Kaggle. It's an excellent dataset for experimenting with demand time series. The whole dataset is really large, so we'll use a subset, tiny_m5, to demonstrate how to work with such data.

unique(tiny_m5$store_id)
#>  [1] "CA_1" "CA_2" "CA_3" "CA_4" "TX_1" "TX_2" "TX_3" "WI_1" "WI_2" "WI_3"

ca_1_data <-
  tiny_m5 %>% 
  filter(store_id == "CA_1") %>% 
  select(item_id, store_id, date, value, wday,
         month, year, snap, sell_price) %>% 
  arrange(item_id, date)

ca_1_data %>% 
  group_by(item_id) %>% 
  summarise(n = n()) 
#> # A tibble: 28 × 2
#>    item_id         n
#>    <chr>       <int>
#>  1 FOODS_1_033  1913
#>  2 FOODS_1_046  1913
#>  3 FOODS_1_057  1913
#>  4 FOODS_1_218  1913
#>  5 FOODS_2_096  1913
#>  6 FOODS_2_181  1913
#>  7 FOODS_2_352  1913
#>  8 FOODS_2_360  1913
#>  9 FOODS_3_080  1913
#> 10 FOODS_3_377  1913
#> # … with 18 more rows

skimr::skim(ca_1_data)

Data summary

Name                    ca_1_data
Number of rows          53564
Number of columns       9
Key                     NULL
_______________________
Column type frequency:
  character             2
  Date                  1
  numeric               6
________________________
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
item_id                0              1   11   15      0        28           0
store_id               0              1    4    4      0         1           0

Variable type: Date

skim_variable  n_missing  complete_rate  min         max         median      n_unique
date                   0              1  2011-01-29  2016-04-24  2013-09-11      1913

Variable type: numeric

skim_variable  n_missing  complete_rate     mean     sd      p0      p25      p50      p75     p100  hist
value                  0           1.00     5.52  10.90     0.0     0.00     1.00     6.00   133.00  ▇▁▁▁▁
wday                   0           1.00     4.00   2.00     1.0     2.00     4.00     6.00     7.00  ▇▃▃▃▇
month                  0           1.00     6.36   3.46     1.0     3.00     6.00     9.00    12.00  ▇▅▅▅▇
year                   0           1.00  2013.21   1.53  2011.0  2012.00  2013.00  2015.00  2016.00  ▇▅▅▅▁
snap                   0           1.00     0.33   0.47     0.0     0.00     0.00     1.00     1.00  ▇▁▁▁▃
sell_price          7693           0.86     3.33   3.34     0.5     0.97     1.68     4.97    11.54  ▇▁▁▁▂

For deep learning time series models, we typically need to create a three-dimensional tensor. In the analyzed case, each dimension may represent:

  • item
  • time steps
  • features

The first dimension is “free” - we can add an arbitrary number of items. When it comes to the second one, its length may vary as well. However, for convenience, we’ll keep same-length time series. Otherwise, we’d have to use masking or split the dataset into multiple tensors. The size of the last dimension is guaranteed by the data.frame structure itself (each row has the same number of columns).

As we can observe in the output of the skim function, there are time series with missing data.

As mentioned before, for simplicity’s sake, we select a subset of items with the same series length as well as the same first and last date. This way, we can be sure that our data are properly aligned in the tensor. Later, in a separate vignette, we’ll dive into methods for handling missing or non-aligned multiple time series when training a deep learning model.
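
As a quick sanity check (a sketch using plain dplyr, not part of the torchts API), we can verify that every item's series has the same length and spans the same date range:

ca_1_data %>%
  group_by(item_id) %>%
  summarise(n = n(), first = min(date), last = max(date)) %>%
  summarise(
    aligned = n_distinct(n) == 1 &
              n_distinct(first) == 1 &
              n_distinct(last) == 1
  )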

as_tensor

head(ca_1_data)
#>        item_id store_id       date value wday month year snap sell_price
#> 1: FOODS_1_033     CA_1 2011-01-29     0    1     1 2011    0         NA
#> 2: FOODS_1_033     CA_1 2011-01-30     0    2     1 2011    0         NA
#> 3: FOODS_1_033     CA_1 2011-01-31     0    3     1 2011    0         NA
#> 4: FOODS_1_033     CA_1 2011-02-01     0    4     2 2011    1         NA
#> 5: FOODS_1_033     CA_1 2011-02-02     0    5     2 2011    1         NA
#> 6: FOODS_1_033     CA_1 2011-02-03     0    6     2 2011    1         NA

The first column, item_id, identifies the item, and the second one (date) the current time step. These two columns will be used to create a data “fold”, i.e. to form a 3D tensor. As mentioned above, the completeness of the time steps is crucial to obtain a proper result from this transformation.

As for the rest of the columns:

  • value is a target we want to predict
  • wday is a categorical variable
  • month is a categorical variable
  • year can be treated as categorical, but in this case we may remove this variable and introduce a counter instead
  • snap is a categorical variable
  • sell_price is a real-valued variable

Summing up, we have three categorical variables, which should be represented in some way. The most efficient way to represent categorical variables in a neural network is an embedding layer.
In fact, it works similarly to a linear (dense) layer. The difference is that instead of performing a resource-consuming dot product between the weight matrix and a one-hot encoded sparse input matrix, we simply use an index to select the “right” row from the weight matrix.
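
To illustrate the idea, here is a minimal sketch using torch directly (the layer sizes are arbitrary and chosen for this example only):

# wday has 7 levels; embed each level as a 3-dimensional vector
wday_embedding <- nn_embedding(num_embeddings = 7, embedding_dim = 3)

# A batch of wday indices (1-based, as usual in R's torch)
wday_idx <- torch_tensor(c(1L, 5L, 7L))

# Each index simply selects a row of the 7 x 3 weight matrix
wday_embedding(wday_idx)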

Let’s transform the tabular data into a tensor. A good way to do it is to use the convenient as_tensor function. The first argument of the function (described as .data) is a data.frame object, which we want to transform into a torch_tensor.

colnames(ca_1_data)
#> [1] "item_id"    "store_id"   "date"       "value"      "wday"      
#> [6] "month"      "year"       "snap"       "sell_price"

First, we’ll select the item_id and date columns plus the integer variables only.

ca_1 <- 
  ca_1_data %>% 
  select(item_id, date, wday, month, year, snap)

ca_1_tensor <- 
  ca_1 %>% 
  as_tensor(item_id, date)

dim(ca_1_tensor)
#> [1]   28 1913    4
class(ca_1_tensor)
#> [1] "torch_tensor" "R7"
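
A tensor holds values of a single type, so to include the real-valued sell_price column as well, we additionally cast the integer variables to numeric before folding: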

ca_1_tensor <- 
  ca_1_data %>% 
  select(item_id, date, wday, month, year, snap, sell_price) %>% 
  mutate(across(where(is.integer), as.numeric)) %>% 
  as_tensor(item_id, date)

dim(ca_1_tensor)
#> [1]   28 1913    5
class(ca_1_tensor)
#> [1] "torch_tensor" "R7"
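
To inspect the result of the fold, we can index the tensor directly: the first dimension selects an item, the second a time step. A quick sketch (output omitted):

# First item's first three time steps, all five features
ca_1_tensor[1, 1:3, ]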

as_ts_dataset

To speed up torch model development, the torchts package provides the easy-to-use as_ts_dataset method, which is a shortcut for creating a torch dataset from a data.frame. For now, keys like item_id are not supported - this feature will be implemented in the near future. We’ll present this function using the weather_pl dataset.

library(rsample)

suwalki_temp <-
  weather_pl %>%
  filter(station == "SWK") %>%
  select(date, temp = tmax_daily)

# Splitting into training and test sets
data_split <- initial_time_split(suwalki_temp)

train_ds <-
  training(data_split) %>%
  as_ts_dataset(temp ~ date, timesteps = 20, horizon = 1)

train_ds[1]
#> $x
#> torch_tensor
#> -1.1453
#> -1.2624
#> -1.0867
#> -1.0282
#> -1.0672
#> -0.8330
#> -0.7354
#> -1.0282
#> -1.0087
#> -0.9891
#> -1.0477
#> -1.1746
#> -1.3893
#> -0.9891
#> -0.9794
#> -1.0965
#> -1.2527
#> -1.4186
#> -1.4869
#> -1.4088
#> [ CPUFloatType{20,1} ]
#> 
#> $y
#> torch_tensor
#> -3.6000
#> [ CPUFloatType{1} ]
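
Each element of the dataset is a named list: x holds a window of the 20 previous observations (timesteps = 20), while y holds the single next value to predict (horizon = 1).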

as_ts_dataloader

The quickest shortcut to get the needed data-providing object is to call the as_ts_dataloader function. It can be used as follows.

train_dl <-
   training(data_split) %>%
   as_ts_dataloader(temp ~ date, timesteps = 20, horizon = 1, batch_size = 32)

train_dl
#> <dataloader>
#>   Public:
#>     .auto_collation: active binding
#>     .dataset_kind: map
#>     .has_getbatch: FALSE
#>     .index_sampler: active binding
#>     .iter: function () 
#>     .length: function () 
#>     batch_sampler: utils_sampler_batch, utils_sampler, R6
#>     batch_size: 32
#>     clone: function (deep = FALSE) 
#>     collate_fn: function (batch) 
#>     dataset: ts_dataset, dataset, R6
#>     drop_last: FALSE
#>     generator: NULL
#>     initialize: function (dataset, batch_size = 1, shuffle = FALSE, sampler = NULL, 
#>     multiprocessing_context: NULL
#>     num_workers: 0
#>     pin_memory: FALSE
#>     sampler: utils_sampler_sequential, utils_sampler, R6
#>     timeout: -1
#>     worker_globals: NULL
#>     worker_init_fn: NULL
#>     worker_packages: NULL

dataloader_next(dataloader_make_iter(train_dl))
#> $x
#> torch_tensor
#> (1,.,.) = 
#>  -1.1453
#>  -1.2624
#>  -1.0867
#>  -1.0282
#>  -1.0672
#>  -0.8330
#>  -0.7354
#>  -1.0282
#>  -1.0087
#>  -0.9891
#>  -1.0477
#>  -1.1746
#>  -1.3893
#>  -0.9891
#>  -0.9794
#>  -1.0965
#>  -1.2527
#>  -1.4186
#>  -1.4869
#>  -1.4088
#> 
#> (2,.,.) = 
#>  -1.2624
#>  -1.0867
#>  -1.0282
#>  -1.0672
#>  -0.8330
#>  -0.7354
#>  -1.0282
#> ... [the output was truncated (use n=-1 to disable)]
#> [ CPUFloatType{32,20,1} ]
#> 
#> $y
#> torch_tensor
#> -3.6000
#> -5.0000
#> -4.2000
#> -2.5000
#>  0.6000
#>  1.5000
#>  1.5000
#>  1.0000
#>  2.0000
#>  1.2000
#>  0.2000
#> -2.2000
#> -5.1000
#> -6.7000
#> -12.1000
#> -6.7000
#>  2.6000
#>  4.9000
#>  5.0000
#>  8.8000
#>  3.9000
#>  2.1000
#>  5.3000
#>  5.5000
#>  2.4000
#>  2.1000
#>  3.8000
#>  2.0000
#>  0.3000
#>  0.1000
#> ... [the output was truncated (use n=-1 to disable)]
#> [ CPUFloatType{32,1} ]
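
In a training loop, we would typically iterate over such a dataloader with coro::loop. A minimal sketch (the body just prints the batch shapes):

library(coro)

loop(for (batch in train_dl) {
  # batch$x: {32, 20, 1}, batch$y: {32, 1}; the last batch may be smaller
  print(dim(batch$x))
})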