Scaled dot product attention layer
layer_scaled_dot_attention.Rd
Introduced in Attention Is All You Need (Vaswani et al., 2017).
Details
Defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Dropout was not part of the original formulation; it was added inside the layer in Google's Temporal Fusion Transformer implementation.
It is a component of the multi-head attention layer (as well as of its interpretable version, available in the aion package).
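The computation itself is small enough to write out directly. Below is a minimal sketch in base R of the equation above for a single, unbatched head, assuming Q has shape [T, d_k] and K and V have shape [S, d_k]; the function names are illustrative and not part of the package API.

# Row-wise softmax; subtracting the row max keeps exp() numerically stable.
softmax_rows <- function(x) {
  e <- exp(x - apply(x, 1, max))
  e / rowSums(e)
}

# softmax(Q K^T / sqrt(d_k)) V for unbatched, single-head inputs.
scaled_dot_attention <- function(Q, K, V) {
  d_k    <- ncol(K)
  scores <- (Q %*% t(K)) / sqrt(d_k)   # [T, S] similarity scores
  softmax_rows(scores) %*% V           # weighted sum of the values, [T, d_v]
}

set.seed(1)
Q <- matrix(rnorm(14 * 5), nrow = 14)  # T = 14 query steps, d_k = 5
K <- matrix(rnorm(42 * 5), nrow = 42)  # S = 42 key/value steps
V <- matrix(rnorm(42 * 5), nrow = 42)
dim(scaled_dot_attention(Q, K, V))     # 14 x 5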
Call arguments
query: Query Tensor of shape [B, T, dim].
value: Value Tensor of shape [B, S, dim].
key: Optional key Tensor of shape [B, S, dim]. If not given, value is used for both key and value, which is the most common case.
attention_mask: A boolean mask of shape [B, T, S] that prevents attention to certain positions (a sketch of building such a mask follows this list).
return_attention_scores: A boolean indicating whether the output should be (attention_output, attention_scores) if TRUE, or attention_output alone if FALSE. Defaults to FALSE.
training: Boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout). Defaults to the training mode of the parent layer/model, or to FALSE (inference) if there is no parent layer.
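For illustration, an attention_mask of shape [B, T, S] can be built with plain R before being passed to the layer. The sketch below assumes the lookback/horizon setup from the example further down (T = 14 horizon steps querying S = 42 total steps) and uses a causal pattern in which each query step may attend only to itself and to earlier positions; the variable names are illustrative only.

T_steps <- 14                          # query time steps (the horizon)
S_steps <- 42                          # key/value time steps (lookback + horizon)
# Entry [i, j] is TRUE when query step i (absolute position 28 + i) may
# attend to key/value position j, i.e. a lower-triangular, causal pattern.
causal <- outer(seq_len(T_steps) + (S_steps - T_steps), seq_len(S_steps), `>=`)
attention_mask <- array(causal, dim = c(1, T_steps, S_steps))  # batch dim B = 1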
References
A. Vaswani et al., Attention Is All You Need (2017)
B. Lim, S.O. Arik, N. Loeff, T. Pfister, Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting (2020)
Examples
lookback   <- 28                    # number of past time steps (the context window)
horizon    <- 14                    # number of future time steps to forecast
all_steps  <- lookback + horizon    # full sequence length seen by keys and values
state_size <- 5                     # feature dimension of each time step

# Symbolic inputs: queries cover only the horizon, while keys and values
# span the whole lookback + horizon window.
queries <- layer_input(c(horizon, state_size))
keys    <- layer_input(c(all_steps, state_size))
values  <- layer_input(c(all_steps, state_size))

# Apply scaled dot product attention to the symbolic tensors.
sdp_attention <- layer_scaled_dot_attention()(queries, keys, values)
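The symbolic output above can then be wrapped into a model for inspection or further composition. This is a sketch assuming the standard keras R workflow (layer_input() and keras_model()); it is not part of this layer's documented interface.

library(keras)                      # assumed provider of layer_input() and keras_model()
sdp_model <- keras_model(
  inputs  = list(queries, keys, values),
  outputs = sdp_attention
)
summary(sdp_model)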