By WG SA5, Authors: Yizhi Yao (SI/WI rapporteur, Intel), Hassan Al-kanani (SI/WI co-rapporteur, NEC), Stephen Mwanje (Nokia)
Artificial Intelligence/Machine Learning (AI/ML) techniques and applications are being increasingly, and successfully, adopted across the wider industry, and they are now being applied to telecommunications, including mobile networks. The adoption of AI/ML technology is clearly opening a new era for creating business value through improved system performance, higher efficiency, and enhanced end-user experience, as well as through new business models and use cases for 5G and future generations of mobile networks. AI/ML capabilities are used in various domains of the 5G System (5GS), including management and orchestration (e.g., Management Data Analytics (MDA)), the 5G Core (e.g., the Network Data Analytics Function (NWDAF)), and the NG-RAN (e.g., RAN intelligence).
Figure 1: AI/ML in 5GS
Almost all 3GPP working groups, across both SA and RAN, are now engaged in standardization activities related to AI/ML features and capabilities. To support and facilitate the efficient deployment and operation of AI/ML capabilities in the 5GS with suitable AI/ML techniques, the ML model and the AI/ML inference function need to be managed throughout their entire lifecycles.
ML models, once deployed into functions throughout the system, can directly affect the behaviour of the system, leading to improved or degraded performance. The capability and performance of an ML model are determined by every operational phase of its lifecycle (including the training, emulation, deployment, and inference phases); therefore, the AI/ML capabilities need to be operable, manageable, and accountable in each phase.
3GPP WG SA5 (in short, "SA5") initially started developing specifications for AI/ML management as part of the Rel-17 Management Data Analytics (MDA) work item, and concluded there that a generic mechanism is needed for managing ML training for any kind of AI/ML-enabled capability (i.e., not restricted to MDA). The Rel-17 MDA specifications are documented in TS 28.104, while the specifications for AI/ML management are documented separately in TS 28.105. In Rel-18, SA5 continued the AI/ML management specification work, starting with a dedicated, comprehensive study that has recently been completed and is documented in TR 28.908. The study describes the concepts and operational workflow, and addresses a wide range of use cases (capabilities) along with the corresponding potential requirements and solutions. Building on the completed study, SA5 has moved on to normative work, which is currently in progress.
SA5 has already made good progress in specifying management services for AI/ML capabilities. The specification work continues to be documented in TS 28.105 in Rel-18. The Rel-18 version of 3GPP TS 28.105 is planned to be completed and published during the first quarter of 2024.
Multi-vendor interaction with solution flexibility
Deploying an ML model from one vendor into a function from another vendor raises concerns and may require further investigation and discussion beyond the scope of the current SA5 specifications. Such a deployment may introduce unpredictable complexities, e.g., exposure of the proprietary details of the ML model and its associated algorithm, which could pose security risks to both the ML model and the AI/ML inference function and open the door to model re-engineering. Therefore, SA5 agreed to define a manageable entity in a more abstract manner around the ML model, called the "ML entity". The ML entity is manageable as a single composite entity: it is either an ML model, or an entity containing an ML model together with its related metadata. This abstraction brings implementation flexibility while still enabling the required management capabilities, as the sketch below illustrates.
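Purely as an illustration of this abstraction (the class and attribute names below are hypothetical and are not the normative information-model definitions of TS 28.105), an ML entity can be thought of as a model artefact bundled with its management metadata:

```python
from dataclasses import dataclass, field

@dataclass
class MLModel:
    """Opaque, vendor-proprietary model artefact (e.g., serialized weights)."""
    artefact: bytes

@dataclass
class MLEntity:
    """Hypothetical sketch of the SA5 'ML entity' abstraction: an ML model
    plus the metadata needed to manage it as one composite entity, without
    exposing the model's proprietary internals to other vendors' functions."""
    entity_id: str                                 # identifier used by management services
    model: MLModel                                 # the ML model itself, treated as a black box
    metadata: dict = field(default_factory=dict)   # e.g., version, inference type, training context

# Management services address the entity by identifier and metadata only.
entity = MLEntity("mle-001", MLModel(b"..."), {"version": "1.0", "inferenceType": "MDA"})
print(entity.entity_id, entity.metadata)
```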
AI/ML operational workflow
In the normative phase, SA5 agreed on the generic AI/ML operational workflow for an ML entity, depicted in Figure 2.
Figure 2: AI/ML operational workflow
The workflow involves four main phases: training, emulation, deployment, and inference. The main tasks in each phase are briefly described below, followed by a sketch of how the phases chain together.
Training phase:
- ML training: training, including initial training and re-training, of an ML entity or a group of ML entities. It also includes validation of the trained ML entity, i.e., evaluating the variance in performance when the ML entity performs on the training data versus the validation data. If the validation result does not meet the expectation (e.g., the variance is not acceptable), the ML entity needs to be re-trained. ML training is the initial phase of the workflow.
- ML testing: testing of the validated ML entity to evaluate its performance when it performs on testing data. If the testing result meets the expectation, the ML entity may proceed to the next phase; otherwise, the ML entity may need to be re-trained.
Emulation phase:
- ML emulation: running an ML entity or AI/ML inference function for inference in an emulation environment. The purpose is to evaluate the inference performance of the ML entity or AI/ML inference function in the emulation environment prior to applying it to the target operational network or system.
NOTE: The emulation phase is considered optional and can be skipped in the AI/ML operational workflow.
Deployment phase:
- ML entity loading: loading of a trained ML entity into the target AI/ML inference function which will use it for inference.
NOTE: The deployment phase may not be needed in some cases, for example when the training function and inference function are co-located.
Inference phase:
- AI/ML inference: performing inference using the ML entity by the AI/ML inference function.
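The sequencing described above can be summarized in a minimal sketch. This is not specified behaviour; it simply assumes boolean checks for the testing outcome and for the optional emulation and deployment steps, and all function names are illustrative:

```python
# Stub steps standing in for real management operations (all hypothetical).
def train_and_validate(entity): print(f"training and validating {entity}")
def test(entity): return True                  # testing result meets expectation
def emulation_acceptable(entity): return True  # emulated inference performance acceptable
def load_into_inference_function(entity): print(f"loading {entity}")
def start_inference(entity): print(f"running inference with {entity}")

def run_aiml_workflow(entity: str, emulate: bool = True, co_located: bool = False) -> None:
    """Illustrative sequencing of the four operational phases for one ML entity."""
    # Training phase: train and validate, then test; re-train until testing passes.
    while True:
        train_and_validate(entity)
        if test(entity):
            break  # otherwise loop back and re-train

    # Emulation phase (optional): evaluate inference performance in a sandbox
    # before touching the live network; may be skipped entirely.
    if emulate and not emulation_acceptable(entity):
        run_aiml_workflow(entity, emulate, co_located)  # poor results lead back to re-training
        return

    # Deployment phase: load the trained entity into the target inference
    # function; not needed when training and inference are co-located.
    if not co_located:
        load_into_inference_function(entity)

    # Inference phase: the AI/ML inference function uses the entity.
    start_inference(entity)

run_aiml_workflow("mle-001")
```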
AI/ML management capabilities
The AI/ML management study (TR 28.908) discussed and concluded more than 40 use cases related to AI/ML management, categorized into management capabilities for the four corresponding operational phases of the AI/ML workflow. These management capabilities are briefly described below for each operational phase, and the complete list of use cases is given in Table 1.
Management capabilities for training phase
- ML training management: allows the MnS consumer to request ML entity training, to consume and control producer-initiated training, and to set a policy for producer-initiated ML entity training (e.g., conditions that trigger ML (re-)training based on AI/ML inference performance). It also covers management of the ML entity training/re-training process, training performance management, and training data management (a sketch of a consumer-side training request follows this list).
- ML validation: the ML training capability also includes validation, i.e., evaluating the performance of the ML entity when performing on the validation data and identifying the variance of the performance on the training and validation data. If the variance is not acceptable, the ML entity needs to be tuned (re-trained) before being made available to the consumer and used for inference.
- ML testing management: allows the MnS consumer to request testing of a trained ML entity and to receive the testing results. It may also include capabilities for selecting the specific performance metrics to be used or reported by the ML testing function. The MnS consumer may also be allowed to set a policy for producer-initiated ML entity testing after training and validation, to receive a report on the outcome, or to trigger ML entity re-training based on the testing performance requirements.
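To give a feel for the consumer side of ML training management, the following sketch shows what a training request payload might look like. The attribute names only echo the flavour of the TS 28.105 information model and are simplified here; they are not the normative definitions:

```python
import json

# Hypothetical consumer-side request asking the producer to train an ML
# entity; all attribute names and values are illustrative.
training_request = {
    "mLTrainingRequest": {
        "inferenceType": "CoverageProblemAnalysis",      # what the trained entity will be used for
        "expectedRuntimeContext": {"area": "cell-123"},  # where/when the entity is expected to run
        "performanceRequirements": {"accuracy": 0.95},   # expectation the training must meet
        "trainingDataRef": "/data/events/2024-01",       # pointer to (pre-processed) training data
    }
}

print(json.dumps(training_request, indent=2))
```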
Management capabilities for emulation phase
- AI/ML inference emulation: a capability allowing an MnS consumer to request an ML inference emulation for a specific ML entity or entities (after the training, validation, and testing) to evaluate the inference performance in an emulation environment prior to applying it to the target network or system.
- ML inference emulation management: this capability allows an authorized MnS consumer (e.g., an operator) to manage, control, and monitor a specific ML inference emulation process, e.g., to start, suspend, or resume the inference emulation, and to receive the emulation reports (a sketch of such a controllable process follows this list).
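A minimal sketch of such a controllable emulation process is shown below, assuming a simple run/suspend/resume state model; the states, methods, and report format are illustrative, not taken from the specification:

```python
from enum import Enum

# Hypothetical control states for an inference-emulation process; the actual
# control model in TS 28.105 may differ.
class EmulationState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"

class EmulationProcess:
    """Sketch of an emulation process that an authorized MnS consumer controls."""
    def __init__(self, entity_id: str) -> None:
        self.entity_id = entity_id
        self.state = EmulationState.RUNNING   # started on creation

    def suspend(self) -> None:
        if self.state is EmulationState.RUNNING:
            self.state = EmulationState.SUSPENDED

    def resume(self) -> None:
        if self.state is EmulationState.SUSPENDED:
            self.state = EmulationState.RUNNING

    def report(self) -> dict:
        """Emulation report the MnS consumer would receive (illustrative)."""
        return {"entity": self.entity_id, "state": self.state.value}

proc = EmulationProcess("mle-001")
proc.suspend()
proc.resume()
print(proc.report())
```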
Management capabilities for deployment phase
- AI/ML deployment control and monitoring: capabilities for loading the ML entity into the target inference function. This includes informing the consumer when new entities are available, enabling the consumer to request loading of the ML entity or to set a policy for such deployment, and monitoring the deployment process (see the sketch after this list).
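As a sketch of what a consumer-set deployment policy could look like, the following assumes a simple threshold-based auto-loading rule; the field names and the check are hypothetical:

```python
# Hypothetical consumer-set policy allowing the producer to load a newly
# trained ML entity into the target inference function automatically once
# the stated conditions hold; every field name here is illustrative.
loading_policy = {
    "targetInferenceFunction": "aiml-inference-func-7",
    "autoLoad": True,
    "conditions": {
        "minTestedAccuracy": 0.95,            # only load entities that tested at least this well
        "allowedTimeWindow": "02:00-04:00",   # load during a low-traffic maintenance window
    },
}

def may_load(entity_metrics: dict, policy: dict) -> bool:
    """Monitoring step: does a candidate ML entity satisfy the loading policy?"""
    return (policy["autoLoad"]
            and entity_metrics.get("testedAccuracy", 0.0)
                >= policy["conditions"]["minTestedAccuracy"])

print(may_load({"testedAccuracy": 0.97}, loading_policy))  # -> True
```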
Management capabilities for inference phase
- AI/ML inference control: allows an MnS consumer to control the inference, i.e., to activate/deactivate the inference function and/or the ML entity/entities (including instant, partial, schedule-based, and policy-based activation), to configure the allowed ranges of the inference output parameters, and to set the context for performing inference (a sketch of these activation options follows this list).
- AI/ML inference performance evaluation: allowing the MnS consumer to monitor and evaluate the inference performance of an ML entity or an AI/ML inference function.
- AI/ML inference orchestration: enables the MnS consumer to orchestrate the AI/ML inference functions based on, e.g., knowledge of the capabilities of the inference functions, the expected and actual running context of the ML entity, the AI/ML inference performance, and the AI/ML inference trustworthiness. For example, the MnS consumer may set conditions to trigger specific inferences based on the expected outcomes of those inferences.
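The activation options above can be illustrated with a small sketch covering instant, partial, and schedule-based activation; the class and method names are invented for this example and do not appear in TS 28.105:

```python
from datetime import datetime, time

# Hypothetical inference activation controller illustrating instant, partial,
# and schedule-based activation; none of these names are normative.
class InferenceController:
    def __init__(self) -> None:
        self.active_capabilities: set = set()

    def activate(self, capabilities) -> None:
        """Instant activation; passing a subset gives partial activation."""
        self.active_capabilities.update(capabilities)

    def deactivate(self, capabilities) -> None:
        self.active_capabilities.difference_update(capabilities)

    def activate_on_schedule(self, capabilities, start: time, end: time) -> None:
        """Schedule-based activation: capabilities are active only inside the window."""
        now = datetime.now().time()
        if start <= now <= end:
            self.activate(capabilities)
        else:
            self.deactivate(capabilities)

ctrl = InferenceController()
ctrl.activate(["mobility-optimization"])                              # partial: one capability only
ctrl.activate_on_schedule(["energy-saving"], time(0, 0), time(6, 0))  # night-time window
print(ctrl.active_capabilities)
```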
Common management capabilities for all phases
- AI/ML trustworthiness management: allows the MnS consumer to configure, monitor, and evaluate the trustworthiness of an ML entity at every operational phase of the AI/ML workflow.
Table 1: List of use cases for AI/ML management in the operational phases:
| Category | Use cases |
|---|---|
| Management capabilities for ML training phase | |
| Event data for ML training | Pre-processed event data for ML training |
| ML entity validation | ML entity validation performance reporting |
| ML entity testing | Consumer-requested ML entity testing; Control of ML entity testing; Multiple ML entities joint testing |
| ML entity re-training | Producer-initiated threshold-based ML re-training; Efficient ML entity re-training; ML entities updating initiated by producer |
| ML entity joint training | Support for ML entity modularity – joint training of ML entities |
| Training data effectiveness | Training data effectiveness reporting; Training data effectiveness analytics; Measurement data correlation analytics for ML training |
| ML context management | ML context monitoring and reporting; Mobility of ML context; Standby mode for ML entity |
| ML entity capability discovery and mapping | Identifying capabilities of ML entities; Mapping of the capabilities of ML entities |
| Performance evaluation for ML training | Performance indicator selection for ML model training; Monitoring and control of AI/ML behavior; ML entity performance indicators query and selection for ML training; ML entity performance indicators selection based on MnS consumer policy for ML training |
| Configuration management for ML training | Control of producer-initiated ML training |
| ML knowledge transfer learning | Discovering sharable knowledge; Knowledge sharing and transfer learning |
| Management capabilities for ML emulation phase | |
| ML inference emulation | AI/ML inference emulation; Orchestrating ML inference emulation |
| Management capabilities for ML entity deployment phase | |
| ML entity loading | ML entity loading control and monitoring |
| Management capabilities for AI/ML inference phase | |
| AI/ML inference history | Tracking AI/ML inference decisions and context |
| Orchestrating AI/ML inference | Knowledge sharing on executed actions; Knowledge sharing on impacts of executed actions; Abstract information on impacts of executed actions; Triggering execution of AI/ML inference functions or ML entities; Orchestrating decisions of AI/ML inference functions or ML entities |
| Coordination between the ML capabilities | Alignment of the ML capability between 5GC/RAN and the 3GPP management system |
| Performance evaluation for AI/ML inference | AI/ML performance evaluation in inference phase; ML entity performance indicators query and selection for AI/ML inference; ML entity performance indicators selection based on MnS consumer policy for AI/ML inference; AI/ML abstract performance |
| Configuration management for AI/ML inference | ML entity configuration for RAN domain ES initiated by consumer; ML entity configuration for RAN domain ES initiated by producer; Partial activation of AI/ML inference capabilities; Configuration for AI/ML inference initiated by MnS consumer; Configuration for AI/ML inference initiated by producer; Enabling policy-based activation of AI/ML capabilities |
| AI/ML update control | Availability of new capabilities or ML entities; Triggering ML entity update |
| Common management capabilities for ML training and AI/ML inference phase | |
| Trustworthy machine learning | AI/ML trustworthiness indicators; AI/ML data trustworthiness; ML training trustworthiness; AI/ML inference trustworthiness; Assessment of AI/ML trustworthiness |
Further reading
AI/ML management related SIDs/WIDs in Rel-17 and Rel-18:
| Release | UID | Work item title | Report/Specification | Work item document | Acronym |
|---|---|---|---|---|---|
| Rel-17 | 850028 | Study on enhancement of Management Data Analytics Service | | SP-190930 | FS_eMDAS |
| Rel-17 | 910027 | Enhancements of Management Data Analytics | TS 28.105 | SP-210132 | eMDAS |
| Rel-18 | 940039 | Study on AI/ML management | TR 28.908 | SP-211443 | FS_AIML_MGMT |
| Rel-18 | 990119 | AI/ML management | TS 28.105 | SP-230335 | AIML_MGT |
Relevant specifications:
- 3GPP TS 28.105: "Management and orchestration; AI/ML management".
- 3GPP TR 28.908: "Study on AI/ML management".
- 3GPP TS 28.104: "Management and orchestration; Management Data Analytics".
- 3GPP TS 23.288: "Architecture enhancements for 5G System (5GS) to support network data analytics services".
- 3GPP TS 38.300: "NR; NR and NG-RAN Overall description; Stage-2".
- 3GPP TS 38.401: "NG-RAN; Architecture description".