Submitted by abc220022 t3_100y331 in MachineLearning

The classic way of incorporating sequential context into a transformer model is to make an encoder-decoder transformer, where the context is processed by the encoder component, and then read into the decoder blocks via cross-attention.

However, there are a number of situations where you might want to incorporate non-sequential context into a model - for example, you might want a language model that can generate text conditioned on some input vector describing the person whose text you are trying to emulate, or you might want to condition on some single vector that summarizes all text that occurred prior to the context window. What are standard ways of incorporating such context? I'd also be interested in standard ways of incorporating such context in other sequence models like RNNs with LSTM blocks.

27

Comments


amnezzia t1_j2ljq54 wrote

Make it part of the sequence? Kinda like special tokens in language models.

9

-Rizhiy- t1_j2lkgn2 wrote

Look at papers dealing with multi-modal tasks, e.g. Perceiver/Perceiver IO by DeepMind.

You can encode your data into tokens of the same size using something like an MLP, then feed these tokens into the decoder along with the encoder tokens. You should probably also add a learnable embedding for each type of data to prevent signal confusion.
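A minimal sketch of that idea in plain Python (no ML framework), assuming a single linear layer in place of the MLP and random values standing in for learned weights. The names `project`, `type_embedding`, `build_input` and the dimensions are illustrative assumptions, not taken from the Perceiver papers:

```python
import random

random.seed(0)

D_CONTEXT = 3   # size of the raw non-sequential context vector (assumed)
D_MODEL = 4     # token embedding size used by the model (assumed)


def random_matrix(rows, cols):
    """Stand-in for learned weights: small random values."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(vector, weights):
    """Linear map: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


# Hypothetical learned parameters.
W_context = random_matrix(D_MODEL, D_CONTEXT)
type_embedding = {
    "context": random_matrix(1, D_MODEL)[0],
    "text": random_matrix(1, D_MODEL)[0],
}


def build_input(context_vector, token_embeddings):
    """Return the token sequence with one projected context token in front."""
    context_token = add(project(context_vector, W_context), type_embedding["context"])
    text_tokens = [add(tok, type_embedding["text"]) for tok in token_embeddings]
    return [context_token] + text_tokens


tokens = random_matrix(5, D_MODEL)            # five ordinary token embeddings
sequence = build_input([0.2, -1.0, 0.5], tokens)
print(len(sequence), len(sequence[0]))        # sequence length, embedding size
```

The context ends up as one extra token of the same width as every other token, so the attention layers need no changes. The type embedding is what lets the model tell the context token apart from ordinary text tokens.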

6

notforrob t1_j2lnrhg wrote

Assuming your goal is autoregressive sequence prediction, I would just modify the start-of-sequence token. For example: use some reasonable model which takes the non-sequential context and creates a vector. Add that vector to the learned start-of-sequence token vector. Future time steps will be able to attend to the start-of-sequence token as needed to retrieve the context.

If you're only using a transformer encoder, and not doing the autoregressive thing, I would just add an additional token to the input. I would most likely use a learned position encoding to add to that context vector rather than the normal sequential position embedding. Any time step will be able to attend to this special token and take advantage of the context clue you're providing.
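A rough sketch of the start-of-sequence variant in plain Python, assuming the "reasonable model" is just a linear projection. All numbers and the names `project`, `condition_start_token` and `W_context` are made up for illustration; in a real model they would be learned:

```python
def project(vector, weights):
    """Linear map: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def condition_start_token(start_token, context_vector, weights):
    """Add the projected context vector to the start-of-sequence embedding."""
    projected = project(context_vector, weights)
    return [s + p for s, p in zip(start_token, projected)]


# Illustrative values only; in a real model all of these would be learned.
start_token = [0.1, 0.2, 0.3, 0.4]            # start-of-sequence embedding
W_context = [[1, 0], [0, 1], [1, 1], [0, 0]]  # maps 2-d context to 4-d model space
context = [0.5, -0.5]                         # non-sequential context vector

token_embeddings = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]

# Only the first position changes; the remaining tokens are passed through as-is.
sequence = [condition_start_token(start_token, context, W_context)] + token_embeddings
print(len(sequence))
```

The sequence length stays the same as without conditioning, which is the main difference from adding a separate context token.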

3

farmingvillein t1_j2lnxdd wrote

Or, for vectors, just slam it into the start of the sequence directly (use a normalization technique if you need to align dimensionality).

If you feel the need, place some sort of separator token ('###') between the "context features" and the input data.

6

lukeiy t1_j2luz7z wrote

Use another model to reduce this context to a vector, then append it to each token. This was the process used in Set Transformers (TSPN).
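Reading "append" as concatenation along the feature dimension (an assumption about what is meant here), a minimal plain-Python sketch looks like this. The function name `append_context` and the toy values are illustrative only:

```python
def append_context(token_embeddings, context_vector):
    """Concatenate the same context vector onto every token embedding."""
    return [list(tok) + list(context_vector) for tok in token_embeddings]


# Three tokens with 4-d embeddings and a 2-d context vector (toy values).
token_embeddings = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
context = [1.0, -1.0]

conditioned = append_context(token_embeddings, context)
print(len(conditioned), len(conditioned[0]))  # same token count, wider embeddings
```

Unlike adding a context token, this widens every embedding (here from 4 to 6 values), so the first layer that consumes the tokens has to accept the larger input size.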

2

dark-ascension t1_j2lx7hh wrote

In conditional GANs, the condition (class label) is prepended to the latent-space random vector, i.e. the input vector becomes one-hot-class + rv. You can learn 'conditions' such as style by joint training. I believe a similar concept can be applied to transformers, judging from the other answers.
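As a small illustration of the one-hot-class + rv input, here is a plain-Python sketch. The helper names `one_hot` and `conditioned_latent`, the class count and the latent size are assumptions chosen for the example, not part of any particular GAN implementation:

```python
import random


def one_hot(label, num_classes):
    """Return a list with 1.0 at index `label` and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conditioned_latent(label, num_classes, latent_dim, rng):
    """Prefix the one-hot class vector to a Gaussian random latent vector."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return one_hot(label, num_classes) + noise


rng = random.Random(0)          # seeded generator so runs are repeatable
z = conditioned_latent(label=2, num_classes=4, latent_dim=8, rng=rng)
print(len(z))                   # num_classes + latent_dim values
```

The generator then sees the class in the first `num_classes` entries of its input and the noise in the rest.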

1

ai-lover t1_j2m61e3 wrote

There are a few ways to incorporate non-sequential context into a transformer model:
Attention Mechanisms: One way to incorporate non-sequential context is to use attention mechanisms that allow the model to "pay attention" to relevant parts of the input as it processes it. For example, the transformer model uses self-attention mechanisms that allow it to consider the entire input sequence as it processes each element in the sequence.
Context Vectors: Another way to incorporate non-sequential context is to use context vectors, which are fixed-length vectors that represent the context for a given input. These vectors can be concatenated to the input embeddings or used to compute attention weights.
Multi-Head Attention: The transformer model also uses multi-head attention, which allows it to attend to multiple different sources of context simultaneously.
Conditional Transformer: The conditional transformer is a variant of the transformer model that is specifically designed to incorporate non-sequential context by using an additional input modality (e.g. an image or a set of control parameters) to condition the transformer's self-attention mechanisms.
Hierarchical Transformer: The hierarchical transformer is another variant of the transformer model that incorporates non-sequential context by using a hierarchy of transformer blocks, where the lower-level blocks process the input at a finer granularity and the higher-level blocks process the lower-level representations to capture more global context.
Graph Transformer: The graph transformer is a variant of the transformer model that is specifically designed to process graph-structured data, which allows it to incorporate non-sequential context by considering the relationships between nodes in the graph.

3