Faster Transformer Decoding: N-gram Masked Self-Attention

arXiv (Cornell University), 2020

Ciprian Chelba, Mia Xu Chen, Ankur Bapna, Noam Shazeer

Abstract

Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, \ldots, 8$, depending on the task.
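
To make the truncated target-side window concrete, the following is a minimal sketch of how an $N$-gram self-attention mask could be built. It is an illustration only, not the authors' implementation: the function name `ngram_causal_mask` is hypothetical, and the window convention (each target position attends to itself and the $N-1$ positions before it) is an assumption that the abstract does not spell out.

```python
def ngram_causal_mask(length, n):
    """Build a length x length boolean mask for target-side self-attention.

    mask[i][j] is True when target position i may attend to position j.
    Assumption (not stated in the abstract): the N-gram window covers
    position i itself plus the n - 1 positions immediately before it.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Causal (j <= i) and truncated to the last n positions (j > i - n).
    return [[(i - n < j <= i) for j in range(length)] for i in range(length)]


if __name__ == "__main__":
    # Print the mask for a 6-token target with a 3-token window.
    for row in ngram_causal_mask(length=6, n=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this convention every row has at most $N$ allowed positions, so the per-step cost of target-side self-attention during decoding stays bounded by $N$ instead of growing with the length of the generated prefix. Attention over the source sentence is not restricted by this mask.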
