How Anchor Tokens Transform Sequence Information Compression in LLMs

10 Oct 2024

Authors:

(1) Jianhui Pang, University of Macau; work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab ([email protected]);

(2) Fanghua Ye, University College London; work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab ([email protected]);

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab (corresponding author).

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks

3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References

A More Experimental Results

B Data Settings

Our research is inspired by Wang et al. (2023), who investigate how in-context learning (ICL) works within LLMs. Their study examines the underlying mechanisms of ICL, emphasizing how label words in demonstration examples influence information flow. They show that these label words serve as anchors: semantic information converges into them during inference and subsequently directs the LLMs’ final predictions. Motivated by these findings, we aim to extend this behavior to natural language modeling by guiding sequence information compression into manually designed anchor tokens rather than relying solely on label words. This matters because natural language text does not always contain an explicit label.

The existing method most closely related to ours is learning to compress prompts with gist tokens (Mu et al., 2023). That approach compresses task-specific prompts by fine-tuning the model with the proposed gist masking, thereby enforcing prompt compression. However, our study diverges from theirs in several crucial ways. Rather than compressing a task prompt, we train the LLM to condense sequence information into the anchor tokens. Because the anchor tokens are incorporated directly into the model’s language modeling, our approach can be applied to a range of tasks without task-specific training, a property that gist tokens do not share. Furthermore, our anchor-based attention masks account for both information compression within a sequence and information interaction between sequences, and thus extend beyond the compression of task prompts alone.
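To make the idea of an anchor-based attention mask more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes one plausible reading of the description above: each token attends causally to tokens in its own sequence (compression within a sequence) and, across sequence boundaries, only to the anchor tokens of earlier sequences (interaction between sequences). The function names `anchor_attention_mask` and `render`, and the convention that an anchor is the last token of its sequence, are hypothetical choices made for illustration.

```python
from typing import List


def anchor_attention_mask(is_anchor: List[bool]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption (illustrative, not taken from the paper's code): every
    anchor token closes its own segment, so the segment id increases
    immediately after each anchor.
    """
    n = len(is_anchor)

    # Assign a segment id to every position; an anchor ends its segment.
    segment = []
    current = 0
    for flag in is_anchor:
        segment.append(current)
        if flag:
            current += 1

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only positions j <= i
            same_segment = segment[i] == segment[j]
            earlier_anchor = is_anchor[j]
            # Visible if in the same segment, or if j is an anchor.
            mask[i][j] = same_segment or earlier_anchor
    return mask


def render(mask: List[List[bool]]) -> List[str]:
    """Render each mask row as a string: 'x' = visible, '.' = masked."""
    return ["".join("x" if cell else "." for cell in row) for row in mask]


if __name__ == "__main__":
    # Two segments of three tokens each; the last token of each is an anchor.
    flags = [False, False, True, False, False, True]
    for row in render(anchor_attention_mask(flags)):
        print(row)
```

In this toy example, tokens in the second segment cannot see the first two tokens of the first segment and can reach that segment's content only through its anchor at position 2. Under the stated assumptions, this is what would allow the keys and values of non-anchor tokens from earlier segments to be dropped at inference time.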

FlashAttention (Dao et al., 2022) and PagedAttention (Kwon et al., 2023) are also related, as both present memory-efficient attention mechanisms for LLMs. Whereas they focus on optimizing attention computation and subdividing attention processing, our method specifically targets the compression of sequence information into anchor tokens, making it orthogonal to these existing works.

This paper is available on arXiv under the CC BY 4.0 DEED license.