Figure 1. Logical diagram of the existing (a, left) and modified (b, right) hardware setups.
of sound event detection [5–7, 9]. These algorithms are
designed to handle challenging scenarios like background
noise, overlapping events, and low signal-to-noise ratios
[8, 13, 15]. Additionally, there is active research on integrating sound event detection into a wide array of applications, from smart home technology [10] and security systems [4] to automated facility maintenance [11].
Noise reduction Noise reduction for audio signal pro-
cessing involves the development of techniques and algo-
rithms aimed at removing or mitigating unwanted noise
from audio recordings while preserving the quality of the
desired signal. Traditional approaches usually involve fil-
tering [1], spectral subtraction [2], statistical methods [16],
etc. More recently, data-driven approaches leveraging machine learning and deep learning have been developed [3, 12, 18].
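To make one of the traditional approaches concrete, magnitude spectral subtraction can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the exact algorithms cited above: the frame length, spectral floor, and block-wise (non-overlapping) framing are our own simplifying assumptions.

```python
import numpy as np

def spectral_subtraction(signal, noise_estimate, frame_len=512, floor=0.01):
    """Minimal magnitude spectral subtraction (illustrative sketch).

    signal: 1-D noisy waveform; noise_estimate: 1-D noise-only segment.
    """
    # Average the noise magnitude spectrum over the noise-only segment.
    noise_frames = [noise_estimate[i:i + frame_len]
                    for i in range(0, len(noise_estimate) - frame_len + 1, frame_len)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = np.zeros(len(signal), dtype=float)
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        spec = np.fft.rfft(signal[i:i + frame_len])
        mag = np.abs(spec)
        # Subtract the estimated noise magnitude, keeping a small spectral floor
        # to avoid negative magnitudes.
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        # Reuse the noisy phase, as is standard in spectral subtraction.
        out[i:i + frame_len] = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)),
                                            n=frame_len)
    return out
```

In practice, overlapping windows and smoothed noise estimates are used to reduce the "musical noise" artifacts this simple block-wise version produces.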
Hardware and sensor technology Various tools and de-
vices have been invented to remove or mitigate noise during
the capturing and processing of audio data. For instance,
high-quality microphones and sensors sensitive to the tar-
get audio signals can minimize the capture of unwanted
noise [14]. Advancements in signal processing hardware,
including dedicated digital signal processors (DSPs) and
specialized integrated circuits, also play a significant role
in noise reduction [17].
3. Methodology
The proposed method consists of two parts: at-recording
hardware modification and post-recording algorithm im-
provement for sound event detection. We first introduce the
existing audio recording setup in Section 3.1, and move on
to the details of each part in Section 3.2 and Section 3.3.
3.1. Existing audio recording setup
The existing audio recording setup we use includes a
voice assistant-enabled smart device, a speaker, and a mi-
crophone, as shown in Figure 1a. The speaker and mi-
crophone are connected to and controlled by a computer.
In a typical audio recording experiment, we play a synthesized wake word and question/request to the voice assistant-enabled smart device through the speaker, wait for the device's response, and use the microphone to record the entire conversation, which usually lasts 20 to 30 seconds. Some example conversations are shown in Figure 2,
with an Alexa-enabled device in the setup.
Recordings collected using this setup can be used to evaluate a voice assistant's performance with an appropriate downstream technique. For instance, to measure the response latency,
we can perform sound event detection on the waveform to
find the gap between the end of the question/request and
the start of the response. Similarly, to measure the response
quality, we can first perform speech recognition to extract
the response text, and then assess the answer's quality manually or with an automated system.
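The latency measurement above can be sketched with a simple energy-based sound event detector that finds the longest silent gap inside the recording. The frame length and energy threshold below are illustrative assumptions, not the detector we actually use.

```python
import numpy as np

def response_latency(waveform, sample_rate, frame_len=1024, threshold=0.02):
    """Estimate the silent gap between question end and response start.

    Returns the longest run of low-energy frames that lies between two
    active regions, in seconds -- a crude proxy for response latency.
    """
    n_frames = len(waveform) // frame_len
    frames = waveform[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames**2, axis=1))
    active = rms > threshold  # True where speech/audio is present

    # Track the longest run of inactive frames bounded by active frames,
    # so leading/trailing silence is not counted as latency.
    longest, current, seen_active = 0, 0, False
    for a in active:
        if a:
            if seen_active:
                longest = max(longest, current)
            seen_active = True
            current = 0
        else:
            current += 1
    return longest * frame_len / sample_rate
```

A real detector would also need to distinguish the question/response gap from pauses within the response, e.g. by keeping only the first qualifying gap.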
However, audio recorded using the setup in Figure 1a is subject to unpredictable and uncontrolled background noise, leading to degraded audio quality. As pointed out in Section 1, this noise negatively impacts the performance of downstream algorithms and systems that take these recordings as input, such as the sound event detection algorithm and the speech recognition system mentioned above, and in turn leads to inaccurate performance measurements and questionable conclusions.
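This degradation can be quantified with a standard signal-to-noise ratio. The helper below is an illustrative addition of our own (it assumes the clean signal and the noise component are separately available, which holds in controlled experiments but not in the field):

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB between a clean signal and its added noise."""
    return 10 * np.log10(np.sum(signal**2) / np.sum(noise**2))
```

Lower SNR in the recorded audio corresponds directly to harder conditions for the downstream detection and recognition systems.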
3.2. At-recording Hardware Modification
We make one simple modification to the setup in Fig-
ure 1a: the voice assistant-enabled device is modified such
that its response is directly streamed out using a wire and
sent to the computer, as shown in Figure 1b. Audio collected in this way is completely free of background noise, because the signal travels through the wire instead of the air. Note that we make this modification in such a way that it does not affect the device's ability to produce an audible response for humans. In other words, the modified setup captures the device response in two parallel ways: the usual audio response, which is audible to humans and recorded by the microphone, and an additional audio response, inaudible to humans, that is streamed out directly over the wire, as shown in Figure 1b. We shall refer to the first type as “recorded audio” and the second as “streamed audio” in the remainder of the paper.
We show the waveforms of such a pair of recorded and streamed audio in Figure 3. There are three main differences between them: first, the recorded audio includes the wake word and question/request played to the device through the speaker, while the streamed audio does not contain them, because the wire connects only the device and the computer; second, the gap between the wake word,