Inductive Biases and Variable Creation in Self-Attention Mechanisms
Abstract
Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.
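To make the "sparse variable" claim concrete, the following is a minimal sketch of how one softmax attention head could compute a sparse Boolean function of a long input. Everything in it is an illustrative assumption rather than the paper's construction: the target is taken to be a k-sparse AND, the head uses position-only attention scores, and the names `sparse_and`, `attention_head_and`, and `beta` are hypothetical. It shows representation only and says nothing about sample complexity.

```python
import math
import random


def sparse_and(bits, relevant):
    """Target function: AND of the bits at the `relevant` positions."""
    return int(all(bits[i] for i in relevant))


def attention_head_and(bits, relevant, beta):
    """One softmax attention head with position-only scores (assumed setup).

    Relevant positions get score `beta`, all others get score 0. The head
    outputs the attention-weighted average of the input bits, which is then
    thresholded to recover the AND of the relevant bits.
    """
    scores = [beta if i in relevant else 0.0 for i in range(len(bits))]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    avg = sum(w * b for w, b in zip(weights, bits)) / total
    k = len(relevant)
    # With enough attention mass on the relevant positions, the average
    # exceeds 1 - 1/(2k) exactly when all k relevant bits are 1.
    return int(avg > 1.0 - 1.0 / (2 * k))


if __name__ == "__main__":
    T = 64                       # context length (assumed)
    relevant = {3, 17, 42}       # hypothetical sparse support, k = 3
    beta = math.log(2 * T) + 1   # score scale growing like log T
    rng = random.Random(0)
    x = [rng.randint(0, 1) for _ in range(T)]
    print(sparse_and(x, relevant), attention_head_and(x, relevant, beta))
```

In this sketch a score scale of order log T is enough for the softmax to put almost all of its mass on the k relevant positions, so the magnitude the head needs grows only logarithmically with the context length. That is merely suggestive of the logarithmic dependence stated in the abstract, not a derivation of it.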