In the previous blog post, we saw how to apply a self-attention transformer to a matrix of time-series features of a single stock. The output of that transformer is a transformed feature vector r of dimension 768 × 1, where 768 = 12 × 64: the 12 lagged months, each represented by a 64-dimensional embedding of the 52 features we constructed, are concatenated / flattened into one vector.
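For readers who like to see shapes, here is a minimal numpy sketch of that flattening step; the arrays and names are illustrative stand-ins, not the actual model from the previous post:

```python
import numpy as np

# Illustrative stand-in for the previous post's output: 12 lagged months,
# each a 64-dimensional embedding of the 52 raw features.
n_months, d_embed = 12, 64
X = np.random.randn(n_months, d_embed)

# Flatten the 12 x 64 matrix into the 768 x 1 transformed feature vector r.
r = X.reshape(-1, 1)
print(r.shape)   # (768, 1)
```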
What if we have a portfolio of many stocks whose returns we want to predict, or whose allocations we want to optimize? That’s where cross-attention transformers come in. The idea behind cross-attention is that a feature of asset i may provide relevant context for a feature of asset j. Once again, we follow the development by Cong et al. (2020).
To recap, a self-attention transformer takes as input one n × d matrix X, with n rows of features and d columns for each feature’s embedding. A cross-attention transformer takes as input two or more such matrices X1, X2, … The canonical application of a cross-attention transformer is language translation. E.g. to translate from Chinese (the “key”, on the encoder side) to English (the “query”, on the decoder side), we would have
X1 ~ “I am Chinese”, and
X2 ~ “我是中国人”
To be exact, each row of X1 is actually a d-dimensional vector embedding (representation) of one word in the English sentence, and ditto for each row of X2 and the words in the Chinese sentence. Note that while the embedding dimension d must be the same for both X1 and X2, they obviously do not need to have the same number of words (i.e. rows).
X1 (query) is the English sentence. X2 (key and value) is the Chinese sentence.
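Before specializing to assets, here is a minimal single-head cross-attention sketch in numpy. The dimensions and weight matrices are made up purely for illustration, with X1 in the query role and X2 in the key/value role:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X1, X2, Wq, Wk, Wv):
    """Single-head cross-attention: X1 supplies the queries, X2 supplies the
    keys and values. Row counts may differ; only the embedding dimension d
    (the number of columns) must match."""
    Q, K, V = X1 @ Wq, X2 @ Wk, X2 @ Wv
    d_k = K.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=1)  # (rows of X1) x (rows of X2) weights
    return A @ V                                 # one context vector per query row

# Toy sizes: a 3-word English "query" sentence, a 4-token Chinese "key/value"
# sentence, both embedded in d = 8 dimensions.
rng = np.random.default_rng(0)
d = 8
X1, X2 = rng.standard_normal((3, d)), rng.standard_normal((4, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(cross_attention(X1, X2, Wq, Wk, Wv).shape)   # (3, 8)
```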
Now, we can imagine that Xi (query) is asset i’s transformed feature vector ri from the previous blog post, and Xj (key and value) is asset j’s vector rj, which provides the context for asset i’s features. We can next apply the usual linear transformations Wq, Wk, and Wv to mash up their time components to form the Q, K, and V matrices. Then we can use them to compute the cross-attention matrix A using the usual scaled dot product with the softmax function, which AlphaPortfolio calls SATT(i, j) (“Self Attention function”, a misnomer in our opinion). Because the Q’s and K’s are just 768 × 1 vectors in our case, each (i, j) element of SATT is just a scalar. So the SATT matrix is just another cross-attention matrix, and each row i holds the normalized weights given to assets j = 1, 2, …, I (including j = i), where I is the number of assets. The context vector, given the attention matrix SATT and the value vectors v(j), is, as usual,

Z(i) = Σ_j SATT(i, j) · v(j), summed over j = 1, …, I.
(AlphaPortfolio calls this a(i), an attenuation score. But we prefer to describe it as a context vector Z(i), because we are multiplying an attention matrix with the input value vectors v(j).)
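Putting the pieces together for assets, here is a hedged numpy sketch of the cross-asset attention step as we read it. The number of assets, the random inputs, and the square projection matrices are our illustrative choices, not values from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n_assets, d = 5, 768                            # "I" in the text; 768-dim r vectors
rng = np.random.default_rng(1)
R = rng.standard_normal((n_assets, d))          # row i stands in for asset i's vector ri
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))   # square, so q, k, v stay 768-dim

Q, K, V = R @ Wq, R @ Wk, R @ Wv                # one query / key / value row per asset
SATT = softmax(Q @ K.T / np.sqrt(d), axis=1)    # I x I; each (i, j) entry is a scalar, each row sums to 1
Z = SATT @ V                                    # row i is Z(i) = sum_j SATT(i, j) * v(j)
print(SATT.shape, Z.shape)                      # (5, 5) (5, 768)
```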
Voilà! Once you have the context vector, it is like a superpowered input feature vector that captures all manner of time-series and cross-sectional information about the portfolio, which you can use for downstream applications. In the case of AlphaPortfolio, the authors use Z(i) as the state variables for a deep reinforcement learning (DRL) program that finds the best allocations to the stocks. It is essentially a stock selection program with a side of optimal capital allocation. In the next blog post, we will dissect one of these DRL programs.
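As a teaser for that downstream step, here is a deliberately simplified, hypothetical illustration of how context vectors Z(i) could be scored and turned into allocation weights. This is not AlphaPortfolio's actual policy network, just a toy placeholder for the kind of mapping the DRL program learns:

```python
import numpy as np

# Hypothetical downstream illustration -- NOT AlphaPortfolio's policy network.
rng = np.random.default_rng(2)
n_assets, d = 5, 768
Z = rng.standard_normal((n_assets, d))   # stand-in for the context vectors Z(i)

w_score = rng.standard_normal(d)         # toy linear scoring layer
scores = Z @ w_score                     # one scalar score per asset
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # toy long-only softmax allocation, sums to 1
print(weights.round(3))
```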