ICS 35.240
CCS L 70
National Standard of the People's Republic of China
GB/T 45288.2-2025
Artificial intelligence - Large-scale model - Part 2: Testing and evaluation for metrics and methods
人工智能 大模型 第2部分:评测指标与方法
(English Translation)
Issue date: 2025-02-28 Implementation date: 2025-02-28
Issued by the State Administration for Market Regulation and
the Standardization Administration of the People's Republic of China
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviations
5 Evaluation indicators
6 Evaluation methods
Annex A (Informative) Calculation methods for evaluation indicators
Bibliography
Artificial intelligence - Large-scale model - Part 2: Testing and evaluation for metrics and methods
1 Scope
This document establishes evaluation indicators for large artificial intelligence models and describes the corresponding evaluation methods.
This document is applicable to model providers, application service providers, application consumers, etc., for evaluating and testing the capabilities of large models. It is also applicable as guidance for the design, development, and application of large models.
2 Normative references
The following documents contain requirements which, through reference in this text, constitute provisions of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
GB/T 42755-2023 Artificial intelligence - Code of practice for data labeling of machine learning
GB/T 45288.1 Artificial intelligence - Large-scale model - Part 1: General requirements
3 Terms and definitions
The terms and definitions given in GB/T 45288.1 apply to this document.
4 Abbreviations
The following abbreviations are applicable to this document.
API: Application Programming Interface
BLEU: Bilingual Evaluation Understudy
5 Evaluation indicators
5.1 Comprehension ability evaluation indicators
5.1.1 Overview
The evaluation of large models' comprehension ability is mainly divided into unimodal and multimodal dimensions. The unimodal dimension mainly includes three secondary dimensions: text, image, and audio. The multimodal dimension mainly includes four secondary dimensions: image-text, text-audio, image-audio, and image-text-audio. The evaluation dimensions and typical tasks of comprehension ability are shown in Table 1.
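As an illustration only, the dimension structure described above can be sketched in Python. The sketch assumes that the four multimodal secondary dimensions are exactly the combinations of two or more of the three unimodal ones; the names `UNIMODAL`, `multimodal_dimensions`, and `DIMENSIONS` are hypothetical and do not come from this document.

```python
from itertools import combinations

# The three unimodal secondary dimensions named in 5.1.1.
UNIMODAL = ("image", "text", "audio")


def multimodal_dimensions(modalities=UNIMODAL):
    """Return every combination of two or more modalities, hyphen-joined."""
    combos = []
    for size in range(2, len(modalities) + 1):
        for combo in combinations(modalities, size):
            combos.append("-".join(combo))
    return combos


# Hypothetical container for the two primary dimensions (an assumption,
# not a structure defined by the standard).
DIMENSIONS = {
    "unimodal": list(UNIMODAL),
    "multimodal": multimodal_dimensions(),
}

print(DIMENSIONS["multimodal"])
# prints ['image-text', 'image-audio', 'text-audio', 'image-text-audio']
```

Under this assumption, the enumeration yields the same four multimodal dimensions listed in the text, although in a different order.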
5.1.2 Text classification
Evaluate the large model's ability to conduct overall analysis of input text content, including but not limited to the following capabilities:
a) Classification task: Ability to map input text to specific categories, where users only need to provide the text to be classified without concerning themselves with the specific implementation. This mainly includes single-label and multi-label classification tasks.
b) Word segmentation: Ability to split a sentence into a sequence of words.
c) Part-of-speech tagging: Ability to assign a part of speech to each word in natural language text, where the part-of-speech categories may include nouns, verbs, adjectives, or others.
d) Sentiment analysis: Ability to determine the emotional tendency contained in the text, such as positive, negative, or neutral.
e) Semantic role labeling: Ability to assign corresponding semantic roles to predicates and arguments in sentences.
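A minimal sketch of how single-label and multi-label classification outputs might be scored is given below. The choice of accuracy and micro-averaged F1 is an assumption made for illustration; this clause does not prescribe these metrics, and the function names `single_label_accuracy` and `multi_label_micro_f1` are hypothetical.

```python
def single_label_accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def multi_label_micro_f1(gold, predicted):
    """Micro-averaged F1 over per-item label sets (assumed metric)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels predicted and present in gold
        fp += len(p - g)   # labels predicted but absent from gold
        fn += len(g - p)   # gold labels the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative data, invented for this sketch.
sentiment_gold = ["positive", "negative", "neutral", "positive"]
sentiment_pred = ["positive", "negative", "positive", "positive"]
print(single_label_accuracy(sentiment_gold, sentiment_pred))  # prints 0.75

topic_gold = [{"sports", "finance"}, {"tech"}]
topic_pred = [{"sports"}, {"tech", "finance"}]
print(round(multi_label_micro_f1(topic_gold, topic_pred), 4))  # prints 0.6667
```

Micro-averaging pools true positives, false positives, and false negatives across all items before computing F1, so frequent labels weigh more than rare ones. A macro-averaged variant would be an equally reasonable choice.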
5.1.3 Information extraction
Evaluate the large model's ability to automatically identify and extract key information from complex text content, including but not limited to:
a) Keyword extraction: Ability to identify core words and phrases from text, which are crucial for understanding the overall text content;
b) Fact extraction: Ability to extract specific factual information from text, such as dates, locations, figures, and related events;
c) Argument extraction: Ability to identify and extract viewpoints and arguments in text, including supporting and opposing arguments, which is particularly important for analyzing commentative and argumentative text;
d) Relation extraction: Ability to extract semantic relationships between entities from text. In text, entities may include people, locations, organizations, events, etc., while semantic relationships refer to various relationships between entities, such as subject-verb relationships, verb-object relationships, hyponymy relationships, synonymy relationships, etc.;
e) Coreference resolution: Ability to clearly identify and determine the specific referent of pronouns or noun phrases in a sentence.
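Extraction tasks such as keyword, fact, or relation extraction are commonly scored by comparing the set of extracted items with a reference set. The sketch below assumes exact-match precision, recall, and F1 over (head entity, relation, tail entity) triples; this metric and the name `extraction_prf` are assumptions made for illustration and are not specified in this clause.

```python
def extraction_prf(gold_items, predicted_items):
    """Exact-match precision, recall, and F1 between two collections."""
    gold, pred = set(gold_items), set(predicted_items)
    tp = len(gold & pred)  # items extracted correctly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative relation triples, invented for this sketch.
gold = {
    ("Company A", "headquartered_in", "City X"),
    ("Company A", "acquired", "Company B"),
}
predicted = {
    ("Company A", "headquartered_in", "City X"),
    ("Company A", "headquartered_in", "City Y"),
}
print(extraction_prf(gold, predicted))  # prints (0.5, 0.5, 0.5)
```

The same function applies to keyword or fact extraction if the items are plain strings instead of triples. Partial-match scoring would need a different comparison rule.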
5.1.4 Mathematical reasoning
Evaluate the large model's ability to understand problems, identify the mathematical operations implicit in them, and solve them using mathematical concepts and principles, including but not limited to:
a) Arithmetic operations: Ability to perform basic addition, subtraction, multiplication, and division operations;
b) Algebraic problems: Ability to solve algebraic problems such as equation solving, inequality problems, and simplification of algebraic expressions;
c) Geometric problem-solving: Ability to solve problems involving calculations of geometric figure properties, area, perimeter, etc.;
d) Mathematical application problems: Ability to solve daily life mathematical problems, such as time calculation, distance calculation, proportion problems, etc.;
e) Statistical problems: Ability to perform probability calculations, interpret statistical charts, etc.
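One possible way to score mathematical reasoning is to compare the model's final answer with a reference answer. The sketch below assumes that final answers have already been isolated as strings, treats numerically equivalent forms (for example "3/4" and "0.75") as matching, and falls back to exact string comparison for non-numeric answers. The names `parse_number`, `answers_match`, and `math_accuracy` and the tolerance value are hypothetical and not taken from this document.

```python
import math
from fractions import Fraction


def parse_number(text):
    """Parse '3/4', '0.75', or '1,200' into a Fraction; None if not numeric."""
    try:
        return Fraction(text.strip().replace(",", ""))
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(reference, response, rel_tol=1e-6):
    """True if the two answers are numerically close or textually identical."""
    ref, got = parse_number(reference), parse_number(response)
    if ref is None or got is None:
        # Non-numeric answers (e.g. an inequality): exact string comparison.
        return reference.strip() == response.strip()
    return math.isclose(float(ref), float(got), rel_tol=rel_tol)


def math_accuracy(pairs):
    """Share of (reference, response) pairs whose answers match."""
    if not pairs:
        return 0.0
    return sum(answers_match(r, a) for r, a in pairs) / len(pairs)


# Illustrative answer pairs, invented for this sketch.
pairs = [("3/4", "0.75"), ("12", "12.0"), ("x > 2", "x > 2"), ("7", "8")]
print(math_accuracy(pairs))  # prints 0.75
```

This checks only final answers. Evaluating the intermediate reasoning steps would require a separate procedure.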
5.1.5 Causal reasoning
Evaluate the large model's ability to analyze causal relationships in input text content, including but not limited to:
a) Causal relationship identification: Ability to identify causal relationships from natural language text, such as the "because... so..." structure, including direct and indirect causal relationships;
b) Causal chain construction: Ability to construct a complete causal chain based on information in the text, such as identifying and linking the cause and effect of each event from a series of events;
c) Hypothetical conditional reasoning: Ability to perform logical reasoning on sentences containing hypothetical conditions (such as "if... then...") and accurately identify the relationship between conditions and results;