Practical LLMs for Real-World Applications

11 Oct 2024

Authors:

(1) Jianhui Pang, University of Macau (work done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab);

(2) Fanghua Ye, University College London (work done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab);

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab (corresponding author).

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks

3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References

A More Experimental Results

B Data Settings

7 Conclusion

LLMs have emerged as a significant research area in the field of artificial intelligence. However, despite their exceptional performance across various natural language tasks, the practical application of these models is limited by their substantial memory overhead and slow inference. Deploying LLMs on resource-constrained devices, such as smartphones, is particularly challenging. To address this issue, we propose anchor-based LLMs (AnLLMs) with the AnSAN technique. Our experiments demonstrate that, at the cost of a marginal accuracy drop of at most 1.5%, our approach saves 99% of keys/values cache memory while improving inference speed by up to 3.5 times. Applying our methods to machine translation showcases their compatibility and flexibility, effectively enhancing memory efficiency for practical use. Our approach is practical, straightforward, flexible, and compatible with existing methods, paving the way for further adoption of LLMs in real-world applications.
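To give a sense of scale for the 99% cache saving, the following back-of-the-envelope sketch estimates the keys/values cache size of a decoder-only model. The default configuration (32 layers, 32 key/value heads, head dimension 128, 16-bit values) matches the publicly documented Llama2-7B architecture; the 4096-token context and the helper name `kv_cache_bytes` are assumptions made for illustration and are not taken from the paper's experiments.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Estimate the bytes needed to cache keys and values for num_tokens tokens.

    Defaults follow the public Llama2-7B configuration with fp16 values
    (an assumption for this sketch, not a figure reported in the paper).
    """
    # Factor of 2: one key vector and one value vector per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical 4096-token context, full cache versus a cache reduced by 99%.
full = kv_cache_bytes(4096)
reduced = full * (1 - 0.99)

mib = 1024 ** 2
print(f"full cache:    {full / mib:.1f} MiB")
print(f"reduced cache: {reduced / mib:.1f} MiB")
```

Under these assumptions a single 4096-token sequence needs about 2 GiB of cache, so a 99% reduction would bring it down to roughly 20 MiB per sequence.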

Limitations

While our proposed AnLLMs demonstrate significant improvements in memory efficiency and inference acceleration, there are several limitations to consider:

  1. Accuracy trade-off: As observed in the experimental results, our method incurs a minor decrease in accuracy (within 1.5%) compared to the original model. This limitation stems from the information compression process, which may lead to information loss. Although the degradation in accuracy is relatively small, it is crucial to acknowledge this trade-off when deploying our method in practical applications.

  2. Applicability to various tasks: Our experiments primarily focus on question-answering benchmarks and machine translation tasks. The effectiveness of our method in other natural language processing tasks, such as summarization, sentiment analysis, and entity recognition, remains to be thoroughly investigated. Future work should explore the applicability and performance of our method across a broader range of tasks.

  3. Optimal anchor token selection: In our implementation, we chose the last token in a sequence as the anchor token. However, the optimal anchor token selection may vary across different tasks and domains. Further research is needed to develop more sophisticated strategies for identifying and leveraging the most suitable anchor tokens.

  4. Scalability to other LLMs: We have applied our method to the open-source Llama2 models. It remains to be seen how our approach would perform when applied to other open-source LLMs, such as Falcon and Qwen (Almazrouei et al., 2023; Bai et al., 2023). Evaluating the effectiveness and scalability of our method on larger language models is an essential direction for future research.

Despite these limitations, our work presents a novel approach to enhance memory efficiency and inference acceleration in LLMs. Future research efforts should address these limitations, refining our method and extending its applicability to a wider range of tasks and model architectures.
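The anchor choice discussed in the third limitation can be made concrete with a small sketch. It assumes that the input is split into sequences of known lengths, that the last token of each sequence serves as its anchor, and that only the anchor entries of completed sequences stay in the cache while the sequence currently being processed is kept in full. The function names `anchor_positions` and `prune_cache` are hypothetical, and the cache is modeled as a plain list rather than real key/value tensors; this is an illustration, not the authors' implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def anchor_positions(segment_lengths: Sequence[int]) -> List[int]:
    """Return the index of the last token of each segment (the assumed anchor)."""
    positions = []
    end = 0
    for length in segment_lengths:
        if length <= 0:
            raise ValueError("segment lengths must be positive")
        end += length
        positions.append(end - 1)
    return positions


def prune_cache(cache: Sequence[T], segment_lengths: Sequence[int]) -> List[T]:
    """Keep anchor entries of completed segments plus the whole current segment."""
    if sum(segment_lengths) != len(cache):
        raise ValueError("segment lengths must cover the cache exactly")
    if not segment_lengths:
        return []
    anchors = anchor_positions(segment_lengths)
    current_start = len(cache) - segment_lengths[-1]
    # Completed segments contribute only their anchor entry.
    kept = [cache[i] for i in anchors[:-1]]
    # The last segment is still in progress, so all of its entries are kept.
    kept.extend(cache[current_start:])
    return kept


# Toy cache with one entry per token; three segments of lengths 3, 4, and 2.
toy_cache = list("abcdefghi")
print(anchor_positions([3, 4, 2]))
print(prune_cache(toy_cache, [3, 4, 2]))
```

A different selection strategy, as the limitation suggests, would only need to replace `anchor_positions`; the pruning step depends on nothing but the returned indices.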

Ethics Statement

In conducting this research, we have adhered to the highest ethical standards and principles of academic integrity. The development and implementation of the AnLLMs and the AnSAN have been carried out with the primary aim of improving the memory efficiency and inference speed of large language models, without any intention to cause harm or promote malicious applications.

Our methodology and experimental design have been thoroughly reviewed to ensure that the datasets and models employed are used responsibly and appropriately. The RedPajama datasets and the open-source Llama2 models, which we utilized in our study, are publicly available and widely recognized as reliable resources in the research community. All data used in this study have been processed and analyzed in compliance with relevant guidelines and best practices.

We acknowledge that the advancements in large language models and their applications may have potential implications for privacy, security, and fairness. In light of these concerns, we emphasize the importance of responsible usage and deployment of our proposed AnLLMs and AnSAN techniques. Researchers and practitioners adopting our methods should be aware of the potential risks and take necessary precautions to mitigate any unintended consequences.

Throughout this study, we have strived for transparency and reproducibility. Our results and findings are reported honestly and accurately, without any manipulation or misrepresentation. We are committed to sharing our knowledge and insights with the broader research community, and we encourage open discussion and constructive feedback to further advance the understanding and development of efficient and ethical large language models.

In conclusion, this research has been conducted in accordance with the highest ethical standards, and we are dedicated to fostering a responsible and collaborative research environment in the field of large language models and artificial intelligence.

References

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. 2023. The falcon series of open language models. arXiv preprint arXiv:2311.16867.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

Lochan Basyal and Mihir Sanghvi. 2023. Text summarization using large language models: A comparative study of mpt-7b-instruct, falcon-7b-instruct, and openai chat-gpt models. arXiv preprint arXiv:2310.10449.

Petr Baudiš, Silvestr Stanko, and Jan Šedivý. 2016. Joint learning of sentence embeddings for relevance and entailment. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 8–17, Berlin, Germany. Association for Computational Linguistics.

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439.

Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2023. Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.

Together Computer. 2023. Redpajama: an open dataset for training large language models.

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. arXiv preprint arXiv:2310.06839.

Wenxiang Jiao, Jen-tse Huang, Wenxuan Wang, Zhiwei He, Tian Liang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. ParroT: Translating during chat using large language models tuned with human translation and feedback. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15009–15020, Singapore. Association for Computational Linguistics.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789.

Jesse Mu, Xiang Lisa Li, and Noah Goodman. 2023. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467.

OpenAI. 2023. Gpt-4 technical report.

Jianhui Pang, Fanghua Ye, Longyue Wang, Dian Yu, Derek F Wong, Shuming Shi, and Zhaopeng Tu. 2024. Salute the classic: Revisiting challenges of machine translation in the age of large language models. arXiv preprint arXiv:2401.08350.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.

Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. Parallel context windows for large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6383–6402, Toronto, Canada. Association for Computational Linguistics.

Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, Ryan A Rossi, and Franck Dernoncourt. 2023. Pdftriage: Question answering over long, structured documents. arXiv preprint arXiv:2309.08872.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, Singapore. Association for Computational Linguistics.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

This paper is available on arXiv under the CC BY 4.0 DEED license.