It means we no longer need to record audio information in 6 (5.1), 8 (7.1), or an insane number of audio channels – 22.2, as Japan is pitching for its 8K production and broadcast standard – to properly convey the spatial information of sound reproduction. And not necessarily only for sophisticated soundtracks in immersive formats.
It is true that this entire transition was first inspired by Hollywood proposing immersive audio formats for its blockbuster productions, with Dolby Atmos and DTS:X becoming the new norm in movies. In Europe, Auro-3D was pitched as a generic "immersive" format for any type of production, including music recordings, adding a "height" layer of additional channels to the surround layers and creating a 9.1 to 13.1 approach, depending on the size of the room and the audience.
Dolby Atmos, DTS:X and Auro-3D immersive channel-based formats were intended to add a "critical component in the accurate playback of native 3D audio content," described as "height" or ceiling channels, using a speaker layout that constructs "sound layers." No doubt, this is highly effective in movie theaters, and is not a problem in movie production and distribution.
The problem is, people are not exactly able to build "movie theaters" in their homes for daily use. Some can build dedicated home theaters with those immersive format installations, but we do more at home when we consume media than just watch movies. Most of the time, people just listen to music and watch live and non-fiction TV programs, which don’t necessarily need the creative components of immersive audio formats as described by Dolby Atmos, DTS:X, or Auro-3D.
That's why, when the industry started to look at the requirements for the next-generation standards for media distribution, including next-generation broadcast standards and OTT (on-demand) streamed content distribution – consumed in many cases on simple mobile devices – it was obvious that we shouldn't just add more channels. Instead, we should look at alternative approaches for efficient media distribution to any type of platform.
Also, new types of media (e.g., virtual reality and gaming) were already inspiring a new generation of content creation tools for immersive audio experiences and generated an increased interest in 3D audio concepts and technologies such as Ambisonics/HOA, Binaural, head-related transfer function (HRTF), head-related impulse response (HRIR), and object-based audio techniques.
Not surprisingly, Dolby, DTS, and even Auro Technologies (which always stated that "natively recording Auro-3D in an object-based format is simply not possible") immediately started exploring alternative media distribution techniques for those new platforms, mainly virtual reality (VR) and mobile devices, using enhanced binaural reproduction on headphones. Auro called it 3D Over Headphones, DTS called it DTS Headphone:X, and Dolby still prefers to call it Dolby Atmos in order not to confuse consumers, while its "professional solutions" division markets a multiplicity of Dolby Audio technologies and tools specific for creation, distribution, and playback.
After all, independently of the immersive audio format descriptions, all these companies explore object-based audio techniques for content creation (production). Dolby describes the process for its Dolby Atmos Renderer as metadata that creates "multichannel speaker outputs, binaural headphone outputs, and channel-based deliverables."
Basically, all those formats start as audio tools (plug-ins for standard production DAWs, such as Pyramix or Pro Tools) allowing panning manipulation by placing "audio objects in a 3D space" that generate "object metadata that is authored with the final content," using a visual representation and signal metering to monitor the dynamic mix of these objects in a "3D space."
Independently of the distribution and playback platforms, using this simple object metadata we can more easily describe how a sound moves, with better “resolution” regarding the interim stages.
Let's imagine a sound that evolves from the right side directly to the ceiling above us and to our left. Imagine a Dolby Atmos soundtrack where a sound moves as I described, but the listener only has two conventional stereo speakers in the living room. Since there are no ceiling speakers, the sound would just move from the right speaker to the left speaker.
Now imagine the same program material but reproduced using a soundbar with multiple drivers, with digital signal and beamforming processing for spatial virtualization. The panoramic motion of the sound is in fact translated using spatial (positional) metadata and the resulting reproduction sounds like the sound moves from the right, up above our head, and to the left, as it would if we had multiple speaker channels. Only we don’t “have channels.” We have just the information about the sound's relative position or provenance and a different playback system that is able to generate an immersive experience. That playback system could be a soundbar, or even the tiny speakers on a smartphone or tablet, or a sophisticated omnidirectional speaker that is able to analyze the acoustics in the room and project sounds in different directions, creating an immersive stage (not exactly accurate, but similarly impactful) because the information regarding position or provenance of the sound(s) is generated through descriptive metadata of objects. Audio objects.
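To make the stereo case concrete, here is a minimal Python sketch of that right, overhead, left movement expressed as positional metadata and then rendered to two speakers. Everything in it is an assumption for illustration: the `ObjectPosition` fields, the normalized coordinates, and the constant-power pan law are hypothetical and are not how Dolby Atmos or any other commercial renderer is actually implemented.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ObjectPosition:
    """Hypothetical positional metadata for one instant of an audio object."""
    x: float  # -1.0 = hard left, 0.0 = centre, +1.0 = hard right
    z: float  # 0.0 = ear level, 1.0 = directly overhead


def trajectory(steps: int) -> List[ObjectPosition]:
    """Arc from the right side, over the listener's head, to the left side."""
    if steps < 2:
        raise ValueError("need at least two points to describe a movement")
    points = []
    for i in range(steps):
        arc = math.pi * i / (steps - 1)  # 0 .. pi along the arc
        points.append(ObjectPosition(x=math.cos(arc), z=math.sin(arc)))
    return points


def stereo_gains(pos: ObjectPosition) -> Tuple[float, float]:
    """Constant-power pan to (left, right).

    A plain stereo pair has no way to reproduce height, so this
    illustrative renderer simply ignores pos.z.
    """
    x = max(-1.0, min(1.0, pos.x))
    theta = (x + 1.0) * math.pi / 4.0  # 0 (left) .. pi/2 (right)
    return math.cos(theta), math.sin(theta)


for p in trajectory(5):
    left, right = stereo_gains(p)
    print(f"x={p.x:+.2f} z={p.z:.2f} -> L={left:.2f} R={right:.2f}")
```

The height information is still present in the metadata at every step; this particular renderer just has nowhere to send it, which is why the listener hears a simple right-to-left pan.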
Yes, Dolby Atmos and DTS:X immersive audio formats are both object-based – a combination of raw audio channels and metadata describing position and other properties of the audio objects – at least from the production point of view. The formats use standard multi-channel distribution (5.1 or 7.1 – which are part of any standard distribution infrastructure, including broadcast standards) and are able to convey object-based audio for specific overhead and peripheral sounds using metadata that "articulates the intention and direction of that sound: where it’s located in a room, what direction it’s coming from, how quickly it will travel across the sound field, etc." Standard AV receivers, televisions and STBs, equipped with Dolby Atmos and DTS:X, read that metadata and determine how the experience is "rendered" appropriately to the speakers that exist in the playback system. In DTS:X, it is even possible to manually adjust sound objects – interact and personalize the sound.
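The idea of one set of metadata being rendered to "the speakers that exist" can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Speaker` and `AudioObject` types, the two layouts, and the proximity-based gain rule are all hypothetical, and the rule is a deliberately naive stand-in for the proprietary panning algorithms used in real Dolby Atmos or DTS:X decoders.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Speaker:
    name: str
    x: float  # left (-1) .. right (+1)
    y: float  # back (-1) .. front (+1)
    z: float  # ear level (0) .. ceiling (+1)


@dataclass(frozen=True)
class AudioObject:
    label: str
    x: float
    y: float
    z: float


def _unit(v: Tuple[float, float, float]) -> Tuple[float, float, float]:
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(c / norm for c in v)


def render_gains(obj: AudioObject, layout: List[Speaker],
                 focus: float = 4.0) -> Dict[str, float]:
    """Weight each speaker by how closely it points toward the object.

    Gains are normalized so the summed power is 1 for any layout.
    """
    target = _unit((obj.x, obj.y, obj.z))
    raw = {}
    for spk in layout:
        direction = _unit((spk.x, spk.y, spk.z))
        closeness = max(0.0, sum(a * b for a, b in zip(target, direction)))
        raw[spk.name] = closeness ** focus
    total = math.sqrt(sum(g * g for g in raw.values()))
    if total == 0.0:
        # No speaker points toward the object: spread it evenly.
        even = 1.0 / math.sqrt(len(layout))
        return {spk.name: even for spk in layout}
    return {name: g / total for name, g in raw.items()}


STEREO = [Speaker("L", -1.0, 1.0, 0.0), Speaker("R", 1.0, 1.0, 0.0)]
STEREO_PLUS_HEIGHT = STEREO + [
    Speaker("TopL", -1.0, 1.0, 1.0),
    Speaker("TopR", 1.0, 1.0, 1.0),
]

overhead = AudioObject("flyover", 0.0, 0.0, 1.0)
print(render_gains(overhead, STEREO))
print(render_gains(overhead, STEREO_PLUS_HEIGHT))
```

The same `overhead` object ends up shared equally between left and right on the stereo pair, and is routed to the two height speakers when they are available. Nothing in the content changed, only the layout handed to the renderer.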
As I said, not all content material "needs" to be described in such a sophisticated way and not all metadata is intended to be translated as positional data. There are still excellent mono recordings, there is all sorts of single-channel or stereo broadcast content, there's loads of excellent "stereo field" music, etc., and all can benefit from object-based audio or additional metadata for multiplatform distribution and playback.
Content distribution also faces other more complex challenges, such as multi-language commentary and dialog dubbing, loudness management for different playback scenarios, room equalization, acoustic compensation, etc. Also, using object-based metadata, we could allow some basic interaction with the sound program, enabling users to choose what type of experience they would prefer, like watching live events with more or less "sound environment" and more or less focus on commentary, or even removing commentary altogether. On a broader perspective, we can also see object-based audio being the metadata layer that could help solve the multi-platform challenges of today's media distribution, allowing better audio reproduction for any sort of content in any type of playback device and channel configuration, including binaural virtualization of audio on headphones and spatial audio reproduction in digitally processed speaker arrays.
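That kind of interaction is easy to picture once commentary and ambience travel as separate objects. The Python sketch below is a simplified illustration under my own assumptions: the `ProgramObject` type, its `user_adjustable` flag, and the `mix` function are hypothetical names, not part of any broadcast standard or vendor API, and the objects are mixed to a single mono output to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class ProgramObject:
    """Hypothetical program element: audio samples plus descriptive metadata."""
    label: str                    # e.g. "commentary" or "ambience"
    samples: List[float]
    default_gain_db: float = 0.0
    user_adjustable: bool = True  # producer decides what viewers may change


def db_to_linear(gain_db: float) -> float:
    return 10.0 ** (gain_db / 20.0)


def mix(objects: List[ProgramObject],
        user_gain_db: Dict[str, float],
        muted: FrozenSet[str] = frozenset()) -> List[float]:
    """Sum all objects into one output, applying the listener's preferences."""
    length = max((len(o.samples) for o in objects), default=0)
    out = [0.0] * length
    for obj in objects:
        if obj.user_adjustable and obj.label in muted:
            continue  # listener removed this object entirely
        gain_db = obj.default_gain_db
        if obj.user_adjustable:
            gain_db += user_gain_db.get(obj.label, 0.0)
        gain = db_to_linear(gain_db)
        for i, sample in enumerate(obj.samples):
            out[i] += gain * sample
    return out


stadium = ProgramObject("ambience", [0.1, 0.2])
announcer = ProgramObject("commentary", [0.5, 0.5])

# Same broadcast, two listeners with different preferences.
crowd_only = mix([stadium, announcer], {}, muted=frozenset({"commentary"}))
quieter_talk = mix([stadium, announcer], {"commentary": -6.0})
print(crowd_only, quieter_talk)
```

The point of the design is that the decision is deferred to playback: the broadcaster sends the objects once, and each device builds the mix its listener asked for.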
And that's where the industry is heading, leaving our precious "audiophile" discussions about the ideal "production standards" for stereo recording and reproduction of music in the dust.
In the same week I visited the CanJam SoCal show in Los Angeles, CA, and AXPONA in Chicago, IL – both consumer-oriented shows – I also attended the NAB Show 2018, in Las Vegas, NV, one of the largest professional-oriented conventions in the world addressing media creation and distribution. Those industries are addressing the mainstream market needs, where homes are typically being populated with Bluetooth speakers, increasingly of the "smart speaker" category, multiroom systems, soundbars or under-TV systems, and where a growing number of people use headphones.
There's also a worrying number of consumers who actually listen to music and consume content mainly via mobile devices, with audio playback provided by the microspeakers built into smartphones and tablets, or the integrated flat transducers in laptops. And no matter how frightening this might sound, it clearly is not going away. In fact, the mobile industry is committed to improving sound reproduction capabilities in those devices because they know this will just get worse – as the sound will indeed get sufficiently "better," more and more people will be encouraged to actually do that.
Fortunately, with the changes in consumer behavior, there is also an opportunity to change the paradigm in sound reproduction. Enter 3D audio, spatial sound, and... object-based audio.
One could easily think that the gaming market alone is the reason why a high-end headphone brand such as Audeze invested in the development of its new Audeze Mobius headphones, using Waves Nx 3D Audio Technology. The reality is, Audeze is anticipating the needs of the content creation community for a tool that enables experimenting with 3D audio platforms.
Evolution in this domain is the result of an intense research and development effort by major industry players in Europe and also in the US. You can read more about the ongoing efforts on object-based audio by visiting these websites:
In another relevant example, you can read my NAB 2018 report on Blackmagic Design and the new and much improved DaVinci Resolve 15 software (available for free). This is the company that a year and a half ago acquired Fairlight, one of the better known film-mixing and post-production audio production tool manufacturers, and one of the first in the industry to implement object-based audio mixing in its solutions, allowing for a unique way to visualize and monitor the production of immersive, spatial and binaural audio content. At NAB 2018, Blackmagic Design presented the new DaVinci Resolve 15 software and a new generation of Fairlight modular audio consoles – at very affordable prices – making these production tools easily accessible to everyone. Everything needed is there, and I have no doubt that this will enable many young creators to really start "immersing" themselves in object-based audio conceptualization, independently of the final distribution platform and "rendered" format.
The new audio system based on the MPEG-H Audio standard is now on-air with the new television standards adopted and under implementation in Korea and the US (ATSC 3.0), Europe (DVB UHD), and China. But MPEG-H Audio also offers interactive and immersive sound, employing audio objects, height channels, and Higher-Order Ambisonics for other types of distribution, including OTT services, digital radio, music streaming, VR, AR, and web content. Following Fraunhofer's successful effort of demonstrating a "3D Soundbar" prototype, there are now real products in production from multiple companies, naturally from the Korean consumer electronics giants (e.g., Samsung and LG) and also from Sennheiser and others. Other playback possibilities are being explored on TVs and smart speakers using 3D virtualization technology such as Fraunhofer’s UpHear, enabling immersive sound to be delivered without using multiple speakers.
Production tools are currently also available from different vendors. In 2017, Merging Technologies completed its immersive audio tool set with the creation of the Audio Definition Model (ADM) and MPEG-H export, making the latest version of Pyramix (11.1) the first DAW with a complete workflow to generate master files with Object-Based Audio (OBA) metadata.
No doubt, the customization and personalization features of MPEG-H will be decisive in exciting broadcasters, content providers, and consumers, in turn creating awareness and demand for an understanding of object-based audio from all domains of content production. As I stated previously, this could also include music production, since it would allow optimization of content to sound best on any end device, providing universal delivery in the home theater as well as on headphones, smartphones, tablets, and any speaker configuration.
And the best reason to believe that MPEG-H Audio will create a solid foundation for working with object-based audio content is the fact that it is compatible with today's streaming and broadcast equipment and infrastructure. The MPEG-H Audio codec, together with the channels or objects needed for immersive sound, can be transmitted at bit rates similar to those used today for 5.1 surround broadcasts, and MPEG-H Audio-based systems offer DASH support for streaming, as well as multi-platform loudness control depending on device and listening environment.
This is about changing the paradigm in sound reproduction.
For more about MPEG-H, you can also visit the following links:
Part of this article was published in our weekly newsletter, The Audio Voice #178