from Machine Learning
[R] Causal self-attention as a probabilistic model over embeddings
We’ve been working on a probabilistic interpretation of causal self-attention in which token embeddings are treated as latent variables. In that view, the attention map induces a change-of-variables term, which gives rise to a degeneracy boundary in embedding space that acts as a barrier.
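A guess at the identity presumably at work here (the post doesn't spell it out): if the attention map defines an invertible transform \(z = f(x)\) of the embeddings, the standard change-of-variables formula gives

```latex
\log p(x) = \log p\bigl(f(x)\bigr) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```

The degeneracy boundary would then be where the Jacobian determinant approaches zero and the log-determinant term diverges, which is what motivates treating it as a barrier.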
The resulting picture is:
- a stability-margin interpretation of causal attention
- “support tokens,” i.e. the positions closest to the degeneracy boundary
- a simple MAP-style training penalty: standard cross-entropy plus a smooth log-barrier term
Empirically, this improves robustness to input perturbations and makes the learned geometry more margin-concentrated, without much loss in clean accuracy at modest regularization strengths.
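The post doesn't give the exact form of the penalty, but a minimal sketch of a MAP-style loss of this shape is easy to write down. Here `margins` stands in for the per-token distances to the degeneracy boundary (a hypothetical quantity for illustration, assumed positive), and `lam` is the regularization strength mentioned above:

```python
import math

def barrier_penalty(margins, eps=1e-6):
    # Smooth log-barrier: blows up as a margin approaches the
    # degeneracy boundary (margin -> 0), negligible for large margins.
    # `eps` clamps margins away from zero so the log stays finite.
    return sum(-math.log(max(m, eps)) for m in margins) / len(margins)

def map_style_loss(cross_entropy, margins, lam=0.01):
    # Standard cross-entropy plus the smooth log-barrier term,
    # as described in the post; `lam` trades off clean accuracy
    # against margin concentration.
    return cross_entropy + lam * barrier_penalty(margins)
```

With unit margins the penalty vanishes (`map_style_loss(2.3, [1.0, 1.0])` is just the cross-entropy), while tokens close to the boundary dominate the gradient, which matches the "support tokens" picture above.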
Curious whether this framing feels natural to people, or whether it reads more like a <insert-your-favorite-regularizer-here> than a genuinely probabilistic view.