Scenario application
ZHANG Haoning, GU Jiachen, SUN Yanguang, ZHENG Qiang, LI Minghui
In recent years, large language models (LLMs) have achieved breakthrough progress in natural language processing. Their powerful semantic understanding and reasoning capabilities are accelerating the intelligent and digital transformation of traditional industries. However, a systematic, professional evaluation benchmark for LLMs is still lacking in traditional manufacturing domains such as iron and steel metallurgy. We propose StiBench, a Chinese evaluation benchmark for the steel and metallurgy domain, to assess the performance of existing open-source LLMs. StiBench integrates a large volume of technical documents, process manuals, professional examinations, and similar material, collected through methods such as crawling of open sources and optical character recognition. The collected items are de-duplicated through semantic similarity detection, yielding a benchmark of over 2 000 questions in multiple formats (multiple-choice and true-or-false) that cover knowledge topics such as ironmaking, steelmaking, rolling, metallurgical physical chemistry, heat treatment, and surface engineering. We conducted few-shot and zero-shot experiments on open-source models such as Baichuan, GLM, and Tongyi Qianwen, using both classification-based and generation-based evaluation approaches. The results show that, although LLMs have made progress, their semantic understanding of specialized content in the steel and metallurgy domain still needs improvement. This work provides an evaluation benchmark for LLM performance in the steel and metallurgy domain, which can serve as a valuable reference for future model development and industrial application.
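As an illustration of the de-duplication step mentioned in the abstract, the following minimal sketch filters near-duplicate questions by cosine similarity. It is not the StiBench implementation: the `embed` function is a toy character-bigram stand-in for whatever semantic embedding model is actually used, and the function names and the 0.9 threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic embedding model (assumption):
    represents a text as a bag of character bigrams."""
    text = text.strip().lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(questions, threshold=0.9):
    """Greedy filter: keep a question only if its similarity to every
    previously kept question is below the (hypothetical) threshold."""
    kept, kept_vectors = [], []
    for question in questions:
        vector = embed(question)
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept.append(question)
            kept_vectors.append(vector)
    return kept


questions = [
    "What is the main reducing agent in a blast furnace?",
    "What is the main reducing agent in a blast furnace ?",  # near-duplicate
    "Which process removes carbon from hot metal?",
]
unique_questions = deduplicate(questions)
```

In practice the bigram vectors would be replaced by embeddings from a sentence encoder, and the threshold would be tuned on a sample of manually labeled duplicate pairs.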