Scaled dot product attention layer
layer_scaled_dot_attention.Rd
Introduced in Attention Is All You Need (Vaswani et al., 2017).
Details
Defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Dropout was not part of the original formulation; it was added inside the layer in Google's Temporal Fusion Transformer implementation.
It is a component of the multi-head attention layer (as well as of its interpretable version, available in the aion package).
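The computation itself is small enough to write out directly. Below is a minimal sketch in base R of the equation above for a single, unbatched head, assuming Q has shape [T, d_k] and K and V have shape [S, d_k]; the function names are illustrative and not part of the package API.

# Row-wise softmax; subtracting the row max keeps exp() numerically stable.
softmax_rows <- function(x) {
  e <- exp(x - apply(x, 1, max))
  e / rowSums(e)
}

# softmax(Q K^T / sqrt(d_k)) V for unbatched, single-head inputs.
scaled_dot_attention <- function(Q, K, V) {
  d_k    <- ncol(K)
  scores <- (Q %*% t(K)) / sqrt(d_k)   # [T, S] similarity scores
  softmax_rows(scores) %*% V           # weighted sum of the values, [T, d_v]
}

set.seed(1)
Q <- matrix(rnorm(14 * 5), nrow = 14)  # T = 14 query steps, d_k = 5
K <- matrix(rnorm(42 * 5), nrow = 42)  # S = 42 key/value steps
V <- matrix(rnorm(42 * 5), nrow = 42)
dim(scaled_dot_attention(Q, K, V))     # 14 x 5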
Call arguments
query: Query Tensor of shape [B, T, dim].
value: Value Tensor of shape [B, S, dim].
key: Optional key Tensor of shape [B, S, dim]. If not given, value is used for both key and value, which is the most common case.
attention_mask: A boolean mask of shape [B, T, S] that prevents attention to certain positions (a sketch of building such a mask follows this list).
return_attention_scores: A boolean indicating whether the output should be (attention_output, attention_scores) if TRUE, or attention_output alone if FALSE. Defaults to FALSE.
training: Boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout). Defaults to the training mode of the parent layer/model, or to FALSE (inference) if there is no parent layer.
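For illustration, an attention_mask of shape [B, T, S] can be built with plain R before being passed to the layer. The sketch below assumes the lookback/horizon setup from the example further down (T = 14 horizon steps querying S = 42 total steps) and uses a causal pattern in which each query step may attend only to itself and to earlier positions; the variable names are illustrative only.

T_steps <- 14                          # query time steps (the horizon)
S_steps <- 42                          # key/value time steps (lookback + horizon)
# Entry [i, j] is TRUE when query step i (absolute position 28 + i) may
# attend to key/value position j, i.e. a lower-triangular, causal pattern.
causal <- outer(seq_len(T_steps) + (S_steps - T_steps), seq_len(S_steps), `>=`)
attention_mask <- array(causal, dim = c(1, T_steps, S_steps))  # batch dim B = 1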
References
A. Vaswani et al., Attention Is All You Need (2017)
B. Lim, S.O. Arik, N. Loeff, T. Pfister, Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting (2020)
Examples
lookback   <- 28                    # number of past time steps (the context window)
horizon    <- 14                    # number of future time steps to forecast
all_steps  <- lookback + horizon    # full sequence length seen by keys and values
state_size <- 5                     # feature dimension of each time step

# Symbolic inputs: queries cover only the horizon, while keys and values
# span the whole lookback + horizon window.
queries <- layer_input(c(horizon, state_size))
keys    <- layer_input(c(all_steps, state_size))
values  <- layer_input(c(all_steps, state_size))

# Apply scaled dot product attention to the symbolic tensors.
sdp_attention <- layer_scaled_dot_attention()(queries, keys, values)
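The symbolic output above can then be wrapped into a model for inspection or further composition. This is a sketch assuming the standard keras R workflow (layer_input() and keras_model()); it is not part of this layer's documented interface.

library(keras)                      # assumed provider of layer_input() and keras_model()
sdp_model <- keras_model(
  inputs  = list(queries, keys, values),
  outputs = sdp_attention
)
summary(sdp_model)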