A series of live demos featuring the new 3GPP Immersive Voice and Audio Services (IVAS) codec was on show during the recent TSG#102 meetings (December 2023, Edinburgh). Each demonstrated the quality of the spatial sound experience over the 3GPP voice codecs AMR-WB and EVS in various use cases of immersive telephony and conferencing, extended reality (XR) and user-generated content (UGC) sharing (see Tdoc SP-231232 for details).
By IVAS codec Public Collaboration contributors (WG SA4)*
First published Nov 2023, in Highlights Issue 07
The new 3GPP codec for low-delay Immersive Voice and Audio Services (IVAS) was selected as the standard for Release 18 at the WG SA4#125 meeting (August 2023). The codec enables completely new service scenarios by providing capabilities for interactive stereo and immersive audio communications, content sharing and distribution.
Some envisioned service applications are conversational voice, multi-stream teleconferencing, XR conversational services, and user-generated live and pre-produced content streaming, as well as corresponding applications in the AR/MR space. More details on potential use cases are provided in 3GPP Highlights Issue 05 and the IVAS-9 Usage Scenarios (S4-231523).
The present article focuses on the main features of the new IVAS codec together with a high-level overview of the codec architecture and performance. In addition, the SA4 Audio SWG efforts that led to the successful selection of the new codec are highlighted.
Main features and properties of the IVAS codec
The IVAS codec is an extension of the 3GPP Enhanced Voice Services (EVS) codec offering:
- Complete bit-exact EVS codec functionality for mono speech/audio signal input
- Support of stereo and binaural audio
- Support of audio formats beyond stereo which include multi-channel audio (5.1, 5.1.2, 5.1.4, 7.1, 7.1.4), scene-based audio (Ambisonics up to 3rd order), metadata-assisted spatial audio (MASA), and object-based audio.
- Support of combined immersive audio formats: object-based audio with scene-based audio (OSBA) and object-based audio with metadata-assisted spatial audio (OMASA)
- VAD/DTX/CNG for rate efficient stereo and immersive conversational voice transmissions
- Error concealment mechanisms to combat the effects of transmission errors and lost packets
- Jitter buffer management
- Binaural rendering functionality for headphone playback including head-tracking and scene orientation control, and loudspeaker rendering functionality for loudspeaker playback
The codec is optimized for services over 5G mobile networks and implementations on 5G devices with:
- Operation on 20 ms audio frames
- Multi-rate/multi-mode operation at the following discrete bit rates [kbps]: 13.2, 16.4, 24.4, 32, 48, 64, 80, 96, 128, 160, 192, 256, 384, and 512
- Ability to switch bitrate upon command
- Support of sampling frequencies of 8 kHz (only EVS interoperable coding), 16 kHz, 32 kHz and 48 kHz (fullband audio content)
- Low algorithmic delay (≤38 ms)
- Complexity and memory footprint within design constraint limits defining three levels, suitable for different device types and application scenarios, with Level 1: not exceeding 3 x EVS, Level 2: not exceeding 6 x EVS and Level 3: not exceeding 10 x EVS
Beyond the features and properties outlined above, the IVAS codec is compliant with all IVAS design constraints set forth by 3GPP (S4-231031).
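As a rough illustration of what these figures mean per frame, the following sketch derives the nominal number of coded bits and the number of audio samples in one 20 ms frame. The bit rates, sampling frequencies and frame length are taken from the list above; the function names are hypothetical and chosen only for this example, and the bit counts ignore any transport or packetization overhead.

```python
# Nominal per-frame arithmetic for the figures listed above.
# Function and constant names are illustrative, not taken from the IVAS code.

FRAME_MS = 20  # frame length stated in the feature list

# Discrete bit rates from the feature list, expressed in bits per second.
BITRATES_BPS = [13200, 16400, 24400, 32000, 48000, 64000, 80000,
                96000, 128000, 160000, 192000, 256000, 384000, 512000]

# Supported sampling frequencies in Hz (8 kHz only for EVS interoperable coding).
SAMPLE_RATES_HZ = [8000, 16000, 32000, 48000]


def bits_per_frame(bitrate_bps, frame_ms=FRAME_MS):
    """Nominal number of coded bits in one frame at the given bit rate."""
    return bitrate_bps * frame_ms // 1000


def samples_per_frame(sample_rate_hz, frame_ms=FRAME_MS):
    """Number of audio samples per channel in one frame."""
    return sample_rate_hz * frame_ms // 1000


if __name__ == "__main__":
    for rate in BITRATES_BPS:
        print(f"{rate / 1000:g} kbps: {bits_per_frame(rate)} bits per frame")
    for fs in SAMPLE_RATES_HZ:
        print(f"{fs // 1000} kHz: {samples_per_frame(fs)} samples per frame")
```

At the lowest rate of 13.2 kbps a frame carries 264 bits, and at 48 kHz a frame spans 960 samples per channel.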
Codec architecture
The encoder analyzes the sound scene, derives spatial audio parameters, and downmixes input channels to so-called transport channels which are processed by the encoding tools. These tools include Single Channel Element coding (SCE comprising one core-coder), Channel Pair Element coding (CPE comprising two core-coders), and Multichannel Coding Tool (MCT comprising a joint coding of multiple core-coders). The core-coder is inherited from the EVS codec with additional flexibility and variable bitrate support.
Figure 1: Overview of audio processing functions - Transmit Side
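The idea of reducing input channels to fewer transport channels plus spatial parameters can be shown with a deliberately simple toy example. The sketch below is not the IVAS algorithm: it assumes a two-channel input, a passive average as the downmix, and a single broadband inter-channel level difference as the only spatial parameter. The function name is hypothetical.

```python
import math


def downmix_with_level_difference(left, right):
    """Toy parametric downmix (an assumption for illustration, not IVAS).

    Returns one transport channel (the average of both inputs) and the
    inter-channel level difference in dB as a single spatial parameter.
    """
    if len(left) != len(right):
        raise ValueError("both channels must have the same length")

    # Passive downmix: one transport channel instead of two input channels.
    transport = [(l + r) / 2.0 for l, r in zip(left, right)]

    # Spatial parameter: energy ratio between the channels, in dB.
    eps = 1e-12  # avoids log of zero for silent input
    energy_left = sum(l * l for l in left)
    energy_right = sum(r * r for r in right)
    level_difference_db = 10.0 * math.log10((energy_left + eps) / (energy_right + eps))

    return transport, level_difference_db


if __name__ == "__main__":
    left = [1.0, -1.0, 0.5]
    right = [0.5, -0.5, 0.25]
    transport, ild_db = downmix_with_level_difference(left, right)
    print(transport, round(ild_db, 2))
```

A real codec would derive such parameters per frame and per frequency band, and would feed the transport channel to a core-coder; this sketch only shows the division of labour between downmix and parameters.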
The decoder processes the received bitstream and outputs either the same audio format as the signaled input format (pass-through mode) or any given supported audio format including rendered output for binaural or loudspeaker playback with use of the integrated renderer.
Figure 2: Overview of audio processing functions - Receive Side
IVAS rendering
Rendering is the process of converting the decoded audio signals for reproduction on various playback devices. The IVAS decoder provides integrated rendering functionalities for reproduction on headphones and different loudspeaker configurations. A standalone renderer is also provided, which can be applied without prior IVAS encoding/decoding of the input audio signal or when rendering multiple sources (e.g. from several decoders). Both renderers support the same feature set.
IVAS binaural rendering generates headphone signals that simulate a real-life listening experience. It performs binauralization based on head-related impulse responses, supports head-tracking and listener orientation processing, and models room acoustics using either binaural room impulse responses or synthesis of late reverb and spatialized early reflections. Default rendering parameter sets are available, with an option to override them with custom sets. The rendering implementation uses several algorithms, selected according to encoding scheme, input format, bitrate, and output format. These algorithms operate in either the time or the frequency domain for optimal performance at minimum complexity.
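The basic principle of time-domain binauralization, convolving a source signal with one head-related impulse response (HRIR) per ear, can be sketched as follows. This is a minimal illustration under stated assumptions, not the IVAS renderer: the three-tap HRIRs are made-up toy values standing in for a source to the listener's left, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def binauralize(mono, hrir_left, hrir_right):
    """Render a mono source to two ear signals using a pair of HRIRs."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


if __name__ == "__main__":
    # Toy HRIRs (assumed values): the right ear receives the sound
    # two samples later and at half the amplitude.
    hrir_left = [1.0, 0.0, 0.0]
    hrir_right = [0.0, 0.0, 0.5]

    ear_left, ear_right = binauralize([1.0, 2.0], hrir_left, hrir_right)
    print(ear_left)
    print(ear_right)
```

With head-tracking, a renderer would select or interpolate the HRIR pair according to the source direction relative to the current head orientation; measured HRIRs are also far longer than three taps, which is one reason frequency-domain processing can be more efficient.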
IVAS codec performance
For the IVAS codec selection test, the performance of stereo and immersive operation modes was evaluated against EVS-codec-based reference systems in 18 voice-communication-oriented ITU-T P.SUPPL800 (ITU login required) experiments with naïve listeners and 28 ITU-R BS.1534 audio-oriented expert-listener experiments. The experiments were carried out by four listening labs, independent of the codec proponents and with expertise in subjective voice and audio quality testing.
In total, 319 requirements were tested. The codec met the requirements in 98.4% of the cases, exceeding them in 54% of all cases, and failed in 1.6% of the cases. There was no case where the codec failed a requirement systematically in two labs. The results can be found in the Global Analysis Lab report (S4-231573).
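To relate the percentages to the total of 319 requirements, the short calculation below converts them into approximate counts. These counts are rounded estimates derived only from the percentages quoted above; the exact figures are those given in the lab report, and the variable names are illustrative.

```python
TOTAL_REQUIREMENTS = 319  # total stated in the text


def approx_count(percent, total=TOTAL_REQUIREMENTS):
    """Nearest whole number of requirements implied by a percentage."""
    return round(total * percent / 100.0)


# Approximate counts implied by the quoted percentages (estimates only).
met = approx_count(98.4)
failed = approx_count(1.6)
exceeded = approx_count(54.0)

if __name__ == "__main__":
    print(f"met: about {met}, failed: about {failed}, exceeded: about {exceeded}")
    # Consistency check: met and failed should add up to the total.
    print(met + failed == TOTAL_REQUIREMENTS)
```

Under this rounding, about 314 requirements were met (of which about 172 were exceeded) and about 5 were failed, which is consistent with the two percentages adding up to 100%.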
Demo material illustrating some of the user experience enabled by the new IVAS codec is accessible at https://forge.3gpp.org/rep/ivas-codec-pc/ivas-codec/-/wikis/Demos.
SA4 Audio SWG – Successful IVAS Codec Standardization
The IVAS codec work item was launched in September 2017. After extensive discussions, a Public Collaboration (PC) was established in May 2022 for the development of a joint IVAS codec candidate. The Terms-of-Reference (ToR) stipulated an entirely open collaboration, which gave every 3GPP individual member the chance to contribute to or observe the development. The PC made use of the 3GPP Forge repository, which provided public access to source code, documents and meeting reports.
The IVAS codec standardization process was fully defined in 3GPP SA4 in a set of permanent documents (Pdocs). In particular, the requirements to be met relate to the implementation of the codec and to its performance in the intended use cases.
A budget of 1.2 million euros, funded by proponent companies (contributors to the Public Collaboration), was collected to cover selection tests and tasks of the characterization phase. The 3GPP Mobile Competence Centre (MCC) contracted qualified laboratories for the testing.
The actual subjective listening tests were conducted in the summer of 2023, assessing the submitted IVAS PC candidate codec (based on floating-point code). Test results and the technical information were reviewed at the 3GPP SA4#125 meeting. WG SA4 concluded that the Public Collaboration candidate meets the selection criteria to become the new 3GPP IVAS standard. TSG-SA approved the decision, the floating-point C source code specification, and the lab reports in September 2023.
Future activities include the IVAS codec characterization, which covers conversion of the approved floating-point C code to fixed-point C code and additional testing. The full set of Codec for Immersive Voice and Audio Services specifications being developed for Rel-18 can be found at the 3GPP Portal:
- TS 26.250 - General overview
- TS 26.251 - C code (fixed-point)
- TS 26.252 - Test sequences
- TS 26.253 - Detailed Algorithmic Description
- TS 26.254 - Rendering
- TS 26.255 - Error concealment of lost packets
- TS 26.256 - Jitter Buffer Management
- TS 26.258 - C code (floating-point)
The characteristics and performance of the IVAS codec will be further described in a Technical Report (TR 26.997), including both selection and characterization test results.
Authors*:
Stefan Bruhn (Dolby Laboratories), Tomas Toftgård (Ericsson), Eleni Fotopoulou (Fraunhofer IIS), Huan-yu Su (Huawei), Lasse Laaksonen (Nokia), Takehiro Moriya (NTT), Stéphane Ragot (Orange), Hiroyuki Ehara (Panasonic Holdings), Marek Szczerba (Philips), Imre Varga (Qualcomm), Milan Jelinek (VoiceAge Corporation)