Smart speakers and smart home devices with speech recognition, hands-free communication, conference functionalities, or audio playback must be tested to the highest customer requirements in order to ensure optimal voice and audio quality. Learn how to test the different performance parameters of these systems under real-life conditions and what the challenges are.
Voice-operated smart home products are versatile devices that provide a variety of functions to the user. They are intended to pick up the talker’s voice, provide adequate recognition of natural language, and initiate the actions the user intends.
Smart home devices can be used to order services and to control other smart or Internet of Things (IoT) devices in the home. In both scenarios speech recognition must be seamless, independent of home and background noise situations. An appropriate dialog system at the back end is required to provide dialog capabilities, enabling a human-like conversation in case more complicated tasks have to be executed.
Smart home devices can also be used for audio or audio/video playback. Either the built-in loudspeaker or (wireless) connections to other devices enable the playback of all types of music and video. Again, the audio playback functionality needs to work up to the user’s expectation in different types of environments.
Besides speech recognition and audio playback, smart home devices can be used for communication. The communication may be either between humans and machines, involving dialog systems at the back end, or natural human-to-human communication using smart home devices at both ends of a connection. In this operational mode, smart home devices are used like classic conference or hands-free systems. As with professional conference systems, the device has to work at different locations in rooms within a broad range of reverberation times and a huge variety of background noises. Furthermore, the talker’s and the listener’s distance to the device may vary substantially more than in classic conference systems.
As a consequence, to sufficiently test and validate the performance of smart home devices, a variety of tests and simulations are required. This article describes methodologies of how to conduct such types of tests efficiently and in a highly automated manner. These tests include the simulation of various types of rooms (room simulation), of various types of background noise scenarios, and of various types of talkers and interfering talkers at different positions.
An Adverse Environment for Smart Devices
When it comes to speech recognition, speech communication, and audio presentation, the home, like every similar environment where smart devices are used, exposes the devices to highly varying acoustic conditions. Speech recognition typically benefits from a good signal-to-noise ratio (SNR) and from mostly non-reverberant speech. Speech communication in the sending direction benefits from the same properties.
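As a rough illustration of the SNR mentioned above, the following Python sketch computes the ratio in decibels from a speech-only and a noise-only recording. The function name `snr_db` and the sample values are illustrative assumptions and are not taken from any standard.

```python
import math

def snr_db(speech, noise):
    """Return the signal-to-noise ratio in dB from two sample sequences."""
    # Mean power of each signal (mean of the squared samples)
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_speech / p_noise)

# Hypothetical example: speech amplitude ten times the noise amplitude
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1]
print(round(snr_db(speech, noise), 1))  # 20.0
```

In practice the speech and noise levels would be measured at the device position, typically with frequency weighting and time windows that this sketch leaves out.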
However, smart home devices may be placed at any location in the home environment, regardless of whether the position is suitable from an acoustical point of view. The distance between the talker and the device may range from centimeters to meters, and the talker may be moving. Interfering talkers as well as background noise generated by sound sources inside and outside the room may impact the system performance. This may be kitchen noise, living room noise, children playing and shouting, or street noise. In addition, reverberation is added to the speech signal. The farther away the talker is, the more reverberant the speech becomes. The amount of reverberation depends on the position of the device and on the acoustic treatment of the rooms used. Furthermore, various talkers may be in the room simultaneously.
For speech recognition, it is important to be able to focus on the talker who issued the command to the smart home device. The system should not react to interfering talkers. In communication mode, by contrast, it may be desired to simultaneously transmit the speech signals of various talkers in a conference call. The smart home device must be capable of intelligently recognizing all those different scenarios and adjusting its signal processing accordingly. As a consequence, tests have to take into account all the different situations, the different environments, and the different user expectations.
The Laboratory Setup
The large number of scenarios, rooms, and situations requires varied and realistic test setups. The only efficient way of setting up such tests is to simulate the user environment appropriately in a lab. This mainly requires:
• Realistic background noise simulation of prerecorded background noise scenarios
• Realistic simulation of the acoustic room properties, which mainly means reproducing the correct reverberation and reflection patterns found in different rooms at various locations of the smart home device
• Reproducibility of the setups in different labs
Furthermore, testing must take into account the space and budget restrictions typically found in companies and test labs. Therefore, special rooms with variable acoustics are not a generally workable solution. Simulation techniques exist that allow background noise as well as room simulation with limited effort but still with a high degree of realism.
The setups applied are described in the European Telecommunications Standards Institute (ETSI) standards TS 103 224 and TS 103 557 (draft). The general setup used in a laboratory environment for both background noise and reverberation simulation is shown in Figure 1.
An eight-loudspeaker setup for generating background noise and reverberation is part of the reproduction system. Background noise and reverberation have to be simulated correctly at the position of the smart home device in the lab. A microphone array positioned close to the device to be tested is used to calibrate the reproduction systems in advance of the tests. The calibration allows correctly reproducing any pre-recorded real sound field at the position of the device under test (DUT).
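The calibration step can be pictured, in strongly simplified form, as inverting the measured transfer path from a loudspeaker to a calibration microphone. The sketch below shows a generic regularized inversion per frequency bin for a single loudspeaker and a single microphone. It is not the procedure defined in the ETSI specifications, which handle several loudspeakers and microphones at once, and the names `regularized_inverse` and `beta` are assumptions made for illustration.

```python
def regularized_inverse(h_bins, beta=1e-3):
    """Per-bin equalizer G = conj(H) / (|H|^2 + beta).

    h_bins: complex transfer function values, one per frequency bin
            (loudspeaker to microphone), assumed to be measured beforehand.
    beta:   small positive constant that limits the gain where |H| is tiny.
    """
    return [h.conjugate() / (abs(h) ** 2 + beta) for h in h_bins]

# Hypothetical measured response with one very weak bin
h = [1.0 + 0.0j, 0.5 - 0.5j, 0.01 + 0.0j]
g = regularized_inverse(h)

# The equalized response H*G is close to 1 where H is strong
# and the equalizer gain stays bounded where H is weak
for hk, gk in zip(h, g):
    print(round(abs(hk * gk), 3), round(abs(gk), 2))
```

The regularization constant trades accuracy against loudspeaker drive level: a smaller `beta` flattens the response more closely but demands more gain in bins where the path is weak.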
Background noise as well as room reverberation can be recorded in real rooms. The recordings are made with the same microphone array that is later used for equalization and calibration of the sound field in the laboratory. This method provides several advantages:
• By using the microphone array various background noises can be recorded in various situations, stored and made available to all test labs. The background noises used in the test labs are identical and can be reproduced in various labs with almost the same high degree of accuracy.
• In the same way room impulse responses can be recorded at the microphone positions in realistic rooms at various positions. Like background noises, impulse responses of realistic rooms can be used in test labs in an identical way.
• The methods do not require a completely anechoic chamber, as laid out in ETSI TS 103 224 and TS 103 557. Acoustically treated, quiet rooms found in almost every laboratory can be used to reproduce the sound field of background noises as well as the reverberation.
• The background noises can be reproduced physically correctly (in magnitude and phase) up to 3 kHz and with correct magnitude up to 20 kHz in the area of the smart home device to be tested.
• The accuracy of reverberation simulation is in a similar range as for background noise.
• The equalization and calibration procedures described in the above-mentioned ETSI standards (TS 103 224 and TS 103 557) allow a completely automated equalization in the lab at the position where the device under test is to be used.
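The role of a recorded room impulse response can be illustrated with a plain convolution: a dry speech signal convolved with the impulse response yields the signal as it would arrive at that microphone position in the room. The sketch below is a minimal single-channel illustration with made-up sample values; the multi-loudspeaker reproduction described in the ETSI specifications is more involved.

```python
def convolve(dry, rir):
    """Direct-form linear convolution of a dry signal with an impulse response."""
    out = [0.0] * (len(dry) + len(rir) - 1)
    for i, x in enumerate(dry):
        for j, h in enumerate(rir):
            out[i + j] += x * h
    return out

# Hypothetical impulse response: direct sound plus two decaying reflections
rir = [1.0, 0.0, 0.5, 0.25]
dry = [1.0, 2.0]

print(convolve(dry, rir))  # [1.0, 2.0, 0.5, 1.25, 0.5]
```

Real impulse responses are thousands of samples long, so an FFT-based convolution would normally replace the nested loop.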
The Smart Home Test Suite
Besides appropriate room and noise simulation techniques, a smart home test suite has to reflect the various applications smart home devices are used for. The main areas that have to be covered by test suites are:
• Speech recognition testing
• Speech communication testing
• Audio quality testing
To add to the complexity, these tests have to be executed while simulating various room characteristics and background noises, moving talkers, interfering talkers, and other multiple-talker scenarios. Testing under all these conditions clearly requires highly automated test procedures, which allow a proper evaluation of smart devices under the time and resource constraints typically found in laboratories.
The simulation of varying talker positions as well as interfering talkers can be automated to a high degree using a turntable solution (see Figure 2). The smart home device is positioned on the turntable. One or more artificial heads (Head-and-Torso Simulator, HATS) are placed around the turntable. The orientation of the talker with regard to the DUT is simulated by turning the device itself on the turntable rather than turning the artificial head. At the same time, the room and background noise simulation has to be rotated virtually to match the new direction. Varying distances from the talker to the smart device are simulated by adjusting the speech level of the artificial head and the amount of reverberation accordingly.
Interfering talkers or multiple-talker scenarios are simulated by positioning a second artificial head in the same room, producing a second talker’s voice signal. With this setup, direction-dependent tests, tests in various room and background noise simulations, interfering/multiple-talker tests, and audio performance tests can be highly automated. This setup can generally be used for all types of tests described in this article.
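How much the speech level has to change for a simulated distance can be estimated with the inverse distance law: for a point source in a free field, the direct sound drops by about 6 dB per doubling of distance, while the reverberant part in a room stays roughly constant, so the speech becomes relatively more reverberant. The sketch below covers only the level part. The function name and the 1 m reference distance are assumptions, and an actual test system may use measured data instead.

```python
import math

def direct_level_offset_db(distance_m, reference_m=1.0):
    """Level change of the direct sound relative to the reference distance,
    assuming a point source in a free field (inverse distance law)."""
    return 20.0 * math.log10(reference_m / distance_m)

for d in (0.5, 1.0, 2.0, 4.0):
    print(d, round(direct_level_offset_db(d), 1))
# 0.5 6.0
# 1.0 0.0
# 2.0 -6.0
# 4.0 -12.0
```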
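The need for automation becomes obvious once the test conditions are enumerated. The following sketch builds a test matrix from a few hypothetical condition lists; the room names, noise names, angles, and distances are placeholders, not values prescribed by any standard.

```python
from itertools import product

# Hypothetical condition lists; a real test plan would define its own
rooms = ["living room", "kitchen", "office"]
noises = ["silence", "kitchen noise", "street noise", "children playing"]
angles_deg = range(0, 360, 45)   # turntable positions
distances_m = [0.5, 1.0, 3.0]    # simulated talker distances

conditions = list(product(rooms, noises, angles_deg, distances_m))
print(len(conditions))  # 288
```

Even this small set yields 288 combinations before different talkers, utterances, or interfering talkers are added.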
Speech Recognition Testing
Speech recognition testing of smart home devices mainly targets the performance of the device itself, not the generic speech recognition engine found at the back end. Typically, a relatively small set of commands, sentences, and so forth is therefore used to verify the performance of the smart home device in conjunction with the network-based speech recognition engine.
The tests used to evaluate the performance of speech recognition need to provide flexible and easy integration of test sequences. The test system must be capable of controlling and calibrating an artificial head, and it must be able to seamlessly and synchronously control background noise and room simulation so that an automated speech recognition test can be performed. For complete automation of such tests, access to the output of the speech recognition engine is indispensable.
The HEAD acoustics Voice Control Analysis System (VoCAS) test system provides the desired features and allows users to easily integrate the access to the speech recognition engine by scripting. In combination with the room and background noise simulation, smart home devices can be tested in various conditions including various rooms, various background noise simulation, and various positions of the talker during the tests. A typical test setup is shown in Figure 3.
The test results include, for example, the recognition and failure rate for various types of talkers, utterances, conditions, and so forth, which are generated fully automatically. By this procedure, potential weaknesses of the DUT are discovered and isolated in an easy and reliable way. This type of diagnostics allows users to fix potential issues and to ensure reliable speech recognition within the given constraints.
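A minimal sketch of how recognition rates per condition could be derived from logged results is shown below. The log format, the condition labels, and the exact-match comparison are assumptions for illustration and do not reflect how VoCAS processes its data.

```python
from collections import defaultdict

def recognition_rates(results):
    """Return the recognition rate per condition.

    results: iterable of (condition, expected_text, recognized_text) tuples,
             e.g. collected from the recognition engine's output.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, expected, recognized in results:
        totals[condition] += 1
        # Count a hit only on an exact match, ignoring case and outer spaces
        if recognized is not None and recognized.strip().lower() == expected.strip().lower():
            hits[condition] += 1
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical log entries
log = [
    ("quiet, 1 m", "turn on the light", "turn on the light"),
    ("quiet, 1 m", "play some music", "play some music"),
    ("kitchen noise, 3 m", "turn on the light", "turn on the night"),
    ("kitchen noise, 3 m", "play some music", "play some music"),
]
print(recognition_rates(log))
# {'quiet, 1 m': 1.0, 'kitchen noise, 3 m': 0.5}
```

The failure rate for a condition is one minus its recognition rate; grouping by talker or utterance instead of condition works the same way.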
Speech Communication Testing
The procedures for testing speech communication quality in general are well established. Appropriate standards are available from ETSI: ES 202 738, ES 202 740, and TS 102 925 describe transmission requirements for narrowband, wideband, super-wideband, and fullband hands-free and conferencing terminals from a quality of service (QoS) perspective as perceived by the user. These standards are “user-centric” and not so much driven by the technical limitations of device designs. The measurements described in those standards allow the investigation and optimization of speech quality under various aspects:
• Listening speech quality is described by standard parameters (e.g., frequency response, loudness rating, distortion, and others), but perceptual-based models are applicable as well.
• Speech quality in the presence of background noise is evaluated by objective procedures that rate the speech (S-MOS), the noise (N-MOS), and the overall quality (G-MOS) based on a perceptual model. These procedures are described in the relevant ETSI standards (EG 202 396-3; TS 103 106; TS 103 281) and implemented by HEAD acoustics as the 3QUEST algorithm, which has been used successfully in recent years.
• Echo performance of the device includes the measurement of the stability of echo cancellation under various conditions. In the HEAD acoustics test suite, a perception-based echo performance measurement is available as well. EQUEST allows the evaluation of the echo performance including the talker’s own voice, which may mask the echo signal, and as such is much more realistic than traditional echo tests.
• Double-talk capability tests based on speech signals allow users to verify the double-talk performance when two talkers are talking simultaneously in a conversation.
All these measurements can be conducted for testing smart home devices and allow the implementation of superior performance in such devices. Appropriate background noises can be generated as described. An easy way to summarize all the test results is shown in Figure 4. The most important measurement results within their performance limits can be displayed in so-called “quality pies” according to Recommendation ITU-T P.505. Each slice corresponds to one parameter, which contributes to the various aspects of quality.
Just by visual inspection of the graphs, even non-experts can identify areas in which a device performs more or less well. But there is more to smart home testing. It also requires the investigation of those properties with varying talker distances, and of the multi-talker environment in cases where more than one talker is present in front of the device. The relevant procedures for testing and optimizing these parameters can be found in Recommendation ITU-T P.340, which is currently under revision.
Based on this standard, which should also be implemented for smart home testing, the multi-talker scenario can be evaluated in great detail for various talker positions and, with the new revision, in the presence of background noise as well.
Finally, all the relevant performance parameters, mainly listening speech quality and echo performance with varying talker positions, have to be tested in the presence of room reverberation. Currently, there are no standard test procedures available. However, the room simulation technique proposed by HEAD acoustics in standardization (see Figure 1) can be used for this purpose. Appropriate tests are provided.
Audio Quality Testing
Compared to speech recognition testing and speech communication testing, the measurement of audio quality is probably the easiest task in testing smart home devices. For audio measurements, the test signals can be stored somewhere in the network (e.g., YouTube), stored locally on the device itself, or transmitted to the device (e.g., via Bluetooth). Tests are made using the built-in loudspeaker or satellite loudspeakers, if connected to the smart home device. For analysis, the test system is triggered by the test stimuli (e.g., level or level/frequency trigger). Typically, the classic test signals (i.e., sinusoidal signals or noise signals) are used. Basic tests for verifying the audio quality performance are the well-established measurements of:
• Frequency response
• Directivity (in combination with turntable)
• Rub & Buzz
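As an illustration of the first item, the magnitude response at a single frequency can be estimated by comparing the stimulus and the recorded response at that frequency. The sketch below uses a single-bin DFT on a sinusoidal stimulus. The sampling rate, the tone frequency, and the simulated "recording" (a copy of the stimulus at half amplitude) are assumptions standing in for a real measurement.

```python
import cmath
import math

def tone_component(signal, freq_hz, fs_hz):
    """Complex amplitude of `signal` at `freq_hz` (single-bin DFT)."""
    n = len(signal)
    return sum(x * cmath.exp(-2j * math.pi * freq_hz * k / fs_hz)
               for k, x in enumerate(signal)) * 2.0 / n

def gain_db(stimulus, response, freq_hz, fs_hz):
    """Magnitude response in dB at one stimulus frequency."""
    return 20.0 * math.log10(abs(tone_component(response, freq_hz, fs_hz)) /
                             abs(tone_component(stimulus, freq_hz, fs_hz)))

fs = 8000
f = 1000
n = 800  # integer number of periods, so no spectral leakage
stimulus = [math.sin(2 * math.pi * f * k / fs) for k in range(n)]
response = [0.5 * s for s in stimulus]  # stand-in for a recorded DUT output
print(round(gain_db(stimulus, response, f, fs), 2))  # -6.02
```

Repeating the measurement over a set of frequencies gives the frequency response; repeating it over turntable angles gives the directivity.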
However, these tests are not sufficient. More perceptual-based testing is required to determine the audio quality as perceived by a “typical” user. Unfortunately, perceptual models developed for audio quality such as ITU-R BS.1387-1 (PEAQ) are not suitable for testing audio played back over loudspeakers. Research is ongoing to provide properly validated objective test procedures that can be used instead of listening tests for the evaluation of audio quality. Currently, no standardized method is available. Meanwhile, hearing models (e.g., the Relative Approach) can be used to evaluate distortions such as Rub & Buzz in a perceptually correct way. This can be seen as an interim step until perceptual models for audio quality are available. An example of such analysis is shown in Figure 5.
The test of smart home devices needs to cover a variety of different technologies, which in the past were treated separately for separate types of devices. In smart devices, many functionalities are combined. The test of these devices requires sophisticated, flexible, yet fully automated test procedures that help to evaluate the devices under all operational conditions. The aim of these tests is to optimize and validate the performance of smart home devices for the benefit of the user in his or her living environment.
When using a highly automated test system in conjunction with appropriate databases, test and optimization of the devices can be performed in a seamless and reproducible manner.
This article was originally published in audioXpress, January 2019.
About the Author
Hans W. Gierlich started his professional career in 1983 at the Institute for Communication Engineering at RWTH Aachen. In February 1988, he received a Ph.D. in electrical engineering. In 1989, Hans joined HEAD acoustics GmbH in Aachen as vice president. Since 1999, he has been head of the HEAD acoustics Telecom Division, and in 2014 he was appointed to the board of directors. Hans is mainly involved in acoustics, speech signal processing and its perceptual effects, QoS and QoE topics, measurement technology, and speech transmission quality. He is active in various standardization bodies such as ITU-T, 3GPP, GCF, IEEE, TIA, CTIA, DKE, and VDA and is chairman of the ETSI Technical Committee for “Speech and Multimedia Transmission Quality.”
About HEAD acoustics – Telecom Division
HEAD acoustics was founded in 1986 and has been involved in noise and vibration, electroacoustic and voice and audio quality testing since its inception. HEAD acoustics is based in Herzogenrath, Germany, with affiliates in China, France, Great Britain, Italy, Japan, South Korea, and the US, as well as a world-wide network of representatives. The Telecom Division of HEAD acoustics manufactures telecom test equipment and provides consulting services in the field of voice and audio quality. Moreover, HEAD acoustics closely co-operates with ETSI, ITU-T, 3GPP, TIA, CTIA, GSMA and other standardization bodies with regard to the development of quality standards for voice transmission and speech communication. In many partnership projects, HEAD acoustics has proven its competence and capabilities in conducting tests and optimizing communication products with respect to speech and audio quality under end-to-end as well as mouth-to-ear scenarios.
ETSI TS 103 224: Speech and multimedia Transmission Quality (STQ); A sound field reproduction method for terminal testing including a background noise database
ETSI DRAFT TS 103 557: Speech and multimedia Transmission Quality (STQ); Methods for reproducing reverberation for communication device measurements
ETSI ES 202 738: Speech and multimedia Transmission Quality (STQ); Transmission requirements for narrowband wireless terminals (hands-free) from a QoS perspective as perceived by the user
ETSI ES 202 740: Speech and multimedia Transmission Quality (STQ); Transmission requirements for wideband wireless terminals (hands-free) from a QoS perspective as perceived by the user
ETSI TS 102 925: Speech and multimedia Transmission Quality (STQ); Transmission requirements for Super-Wideband/Full-band hands-free and conferencing terminals from a QoS perspective as perceived by the user
ETSI EG 202 396-3: Speech and multimedia Transmission Quality (STQ); Speech Quality performance in the presence of background noise; Part 3: Background noise transmission - Objective test methods
ETSI TS 103 106: Speech and multimedia Transmission Quality (STQ); Speech quality performance in the presence of background noise: Background noise transmission for mobile terminals - objective test methods
ETSI TS 103 281: Speech and multimedia Transmission Quality (STQ); Speech quality in the presence of background noise: Objective test methods for super-wideband and full-band terminals
Recommendation ITU-T P.340, Annex B: Objective test methods for multitalker scenarios
Recommendation ITU-R BS.1387-1: Method for objective measurements of perceived audio quality
M. Schäfer: Auditory Assessment of Audio Systems; ITG 2018, Oldenburg, Germany
K. Genuit: Objective Evaluation of Acoustic Quality Based on a Relative Approach; InterNoise 1996, Liverpool, UK