Researchers from KAIST have developed a new evaluation protocol called FLASK for assessing large language models (LLMs). FLASK breaks the evaluation process down into specific skills, providing a more comprehensive picture of an LLM's capabilities.
Fine-tuning of LLMs for human-aligned responses
Large language models (LLMs) have made remarkable strides in recent years, not only in understanding human language but also in aligning their responses to human values. This 'fine-tuning' has been achieved primarily through two methods: instruction tuning, where models are adjusted based on specific tasks or user preferences, and reinforcement learning from human feedback (RLHF), where models improve through iterative interaction with human raters. These techniques aim to ensure that LLMs provide responses that are not only accurate but also helpful, honest, and harmless.
Introducing FLASK: A granular LLM evaluation method
To address the limitations of current LLM evaluation approaches, the researchers have introduced a protocol called FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets). Rather than assigning a single broad score, FLASK uses a fine-grained scoring framework that accounts for the specific skill set a task requires, the target domain, and the task's level of difficulty. This approach promises a much more detailed picture of an LLM's capabilities and its areas for improvement.
To make the evaluation process more effective and precise, the researchers working on FLASK chose to focus on 12 'fine-grained' skills that they believe are crucial for the successful completion of tasks given to LLMs. These skills are grouped into four primary abilities: logical reasoning, background knowledge, problem-solving, and consistency with user preferences. By assessing an LLM's capabilities in these specific areas, FLASK allows for a more targeted and thus potentially more fruitful form of fine-tuning.
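The idea of per-skill scoring can be sketched in a few lines. The record layout and skill names below are illustrative assumptions, not FLASK's actual data format; the point is only that each response is rated on individual skills (here on a 1-5 scale) and that scores can then be averaged per skill rather than collapsed into one number:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fine-grained evaluation records:
# (model, skill, domain, difficulty, score on a 1-5 scale)
records = [
    ("model_a", "logical_reasoning", "math", "hard", 3),
    ("model_a", "factuality", "history", "easy", 5),
    ("model_b", "logical_reasoning", "math", "hard", 4),
    ("model_b", "factuality", "history", "easy", 4),
]

def skill_profile(records, model):
    """Average score per skill for one model."""
    by_skill = defaultdict(list)
    for m, skill, domain, difficulty, score in records:
        if m == model:
            by_skill[skill].append(score)
    return {skill: mean(scores) for skill, scores in by_skill.items()}

print(skill_profile(records, "model_a"))
```

Because the domain and difficulty annotations travel with each record, the same aggregation can be restricted to, say, hard math questions only, which is what makes the fine-grained framing useful.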
Performance gap between open-source and proprietary LLMs
Despite the impressive progress in open-source LLMs, the research found that these models still trail proprietary LLMs in logical reasoning and background knowledge abilities, by 25% and 10% respectively. This highlights the complexity of the task at hand and the need for ongoing refinement. Interestingly, the research also noted that model size plays a role in skill acquisition, with different-sized models performing better on different skills.
The detailed insights provided by FLASK can prove invaluable to both researchers and practitioners working with LLMs. By pinpointing which skills need improvement, it can help teams refine their models more effectively. Moreover, by providing a fine-grained comparison of different LLMs, FLASK can assist in identifying the models best suited to specific needs or tasks.
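As a toy illustration of that last use case, per-skill profiles make model selection a simple lookup. The profiles and the selection function below are made-up examples, not FLASK results or FLASK tooling:

```python
# Hypothetical per-skill score profiles (1-5 scale) for two models.
profiles = {
    "open_model": {"logical_reasoning": 3.0, "factuality": 3.6, "conciseness": 4.2},
    "proprietary_model": {"logical_reasoning": 4.0, "factuality": 4.0, "conciseness": 4.1},
}

def best_model_for(required_skills, profiles):
    """Return the model with the highest mean score on the skills a task needs."""
    def avg(model):
        return sum(profiles[model][s] for s in required_skills) / len(required_skills)
    return max(profiles, key=avg)

# A reasoning-heavy task favors the stronger logical profile.
print(best_model_for(["logical_reasoning", "factuality"], profiles))
```

The same profiles would pick the open model for a task weighted toward conciseness, which is exactly the kind of trade-off a single aggregate score hides.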