Task Generalization With AutoRegressive Compositional Structure: Can Learning From $D$ Tasks Generalize to $D^{T}$ Tasks?

ArXiv.org2025

Amirhesam Abedsoltan, Huaqing Zhang, Kaiyue Wen, Hui Lin, Jingzhao Zhang, Mikhail Belkin

Abstract

Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of autoregressive compositional structure, where each task is a composition of $T$ operations, and each operation is among a finite family of $D$ subtasks. This yields a total class of size $D^T$. We first show that generalization to all $D^T$ tasks is theoretically achievable by training on only $\widetilde{O}(D)$ tasks. Empirically, we demonstrate that Transformers achieve such exponential task generalization on sparse parity functions via In-context Learning (ICL) and chain-of-thought (CoT) reasoning. We further show generalization in arithmetic and translation, beyond parity functions.

Related Papers

→ Auxiliary model based recursive and iterative least squares algorithm for autoregressive output error autoregressive systems(2015)5 cited
→ Chapter 5: The Generalization of Relations and the Dependence of Generalization on Analysis and Abstraction(2021)1 cited
Determining the Number of Regimes in a Threshold Autoregressive Model Using Smooth Transition Autoregressions(2005)
→ Implicit Stacked Autoregressive Model for Video Prediction(2023)6 cited
→ Improved autoregressive model(2003)