Noise-free audio signal processing in a noisy environment: a hardware and algorithm solution
Yarong Feng*, Zongyi Liu, Shunyan Luo, Yuan Ling, Shujing Dong, Shuyi Wang, Bruce Ferry
Customer Experience and Business Trends, Amazon.com
2121 7th Ave, Seattle, WA, 98121
yarongf@*, joeliu@, shunyl@, yualing@, shujdong@, wanshuyi@, [email protected]
Abstract
Dealing with background noise is a challenging task in audio signal processing, negatively impacting algorithm performance and system robustness. In this paper, we propose a simple solution that combines a recording hardware modification with an algorithm improvement to tackle this challenge. The proposed solution can produce clean, noise-free, high-quality audio recordings even in noisy recording environments. Experimental results show that the proposed solution leads to better sound event detection accuracy and speech recognition results.
1. Introduction
Audio signal processing is an important field that deals
with the manipulation and enhancement of audio signals,
including everything from music and speech to environmen-
tal sounds. It plays a vital role in our daily lives, impacting
various industries and technologies, such as telecommuni-
cations, automotive systems, voice assistants, and more.
Two widely used techniques in audio signal processing
are sound event detection and speech recognition. Sound
event detection focuses on the identification and categoriza-
tion of distinct acoustic events within audio signals. It in-
volves analyzing audio signals to determine when specific
sounds or events occur, and classifying them into prede-
fined categories, such as car horns, dog barks, or doorbells.
Speech recognition is a technology that enables a computer
or machine to convert spoken language into written text or
commands, facilitating human-computer interaction and the
automation of various tasks via voice input.
A challenging task in audio signal processing is dealing
with noise. Noise, usually in the form of unwanted back-
ground sounds, introduces distortions and disruptions to au-
dio signals, which degrade their quality. For example, noise
can significantly impact sound event detection results by in-
troducing false positive events or missing actual events. It
can also limit the ability of automated speech recognition
systems to accurately transcribe spoken words. Effective
noise reduction techniques and robust algorithms are essen-
tial for improving the accuracy and performance of such
systems in noisy conditions.
Depending on when they are applied, noise reduction techniques can be split into two types: at-recording and post-recording. At-recording techniques are applied during the recording phase and aim to prevent noise from entering the audio signal in the first place.
This involves selecting a quiet recording environment, using
proper microphone techniques, employing soundproofing
and acoustic treatment, and ensuring clean power sources.
These practices can help capture cleaner audio with a higher
signal-to-noise ratio. Post-recording noise reduction tech-
niques are applied after the audio has been recorded. These
methods include digital signal processing tools and software
that can analyze the audio, identify noise components, and
reduce or remove them while preserving the desired audio
content. Noise reduction algorithms, spectral subtraction,
and noise filtering are examples of post-processing tech-
niques. Post-recording noise reduction is valuable when the
recording environment is less than ideal or when dealing
with unexpected noise.
In this paper, we propose a novel and simple solution
combining at-recording and post-recording techniques to
improve recorded audio quality. We show that the solution
leads to improved performance on both sound event detec-
tion and speech recognition tasks.
The paper is organized as follows: Section 2 introduces related work, Section 3 explains the proposed methodology in detail, experimental results are shown in Section 4, and Section 5 concludes the paper.
2. Related Work
Sound Event Detection Recent research efforts focus
on developing advanced algorithms, such as deep learning
models and neural networks, to enhance the performance
(a) Diagram of the existing hardware setup. (b) Diagram of the modified hardware setup.
Figure 1. Logical diagram of the existing (left) and modified (right) hardware setups.
of sound event detection [5–7, 9]. These algorithms are
designed to handle challenging scenarios like background
noise, overlapping events, and low signal-to-noise ratios
[8, 13, 15]. Additionally, there is active research on the in-
tegration of sound event detection into a wide array of ap-
plications, from smart home technology [10], security sys-
tems [4], to automated facility maintenance [11].
Noise reduction Noise reduction for audio signal pro-
cessing involves the development of techniques and algo-
rithms aimed at removing or mitigating unwanted noise
from audio recordings while preserving the quality of the
desired signal. Traditional approaches usually involve fil-
tering [1], spectral subtraction [2], statistical methods [16],
etc. More recently, data-driven approaches leveraging deep
learning and machine learning are developed [3, 12, 18].
Hardware and sensor technology Various tools and de-
vices have been invented to remove or mitigate noise during
the capturing and processing of audio data. For instance,
high-quality microphones and sensors sensitive to the tar-
get audio signals can minimize the capture of unwanted
noise [14]. Advancements in signal processing hardware,
including dedicated digital signal processors (DSPs) and
specialized integrated circuits, also play a significant role
in noise reduction [17].
3. Methodology
The proposed method consists of two parts: at-recording
hardware modification and post-recording algorithm im-
provement for sound event detection. We first introduce the
existing audio recording setup in Section 3.1, and move on
to the details of each part in Section 3.2 and Section 3.3.
3.1. Existing audio recording setup
The existing audio recording setup we use includes a
voice assistant-enabled smart device, a speaker, and a mi-
crophone, as shown in Figure 1a. The speaker and mi-
crophone are connected to and controlled by a computer.
In a typical audio recording experiment, we play a synthesized wake word and question/request to the voice assistant-enabled smart device through the speaker, wait for its response, and use the microphone to record the entire conversation, which usually lasts 20 to 30 seconds. Some example conversations are shown in Figure 2, with an Alexa-enabled device in the setup.
Audios collected using this setup can be used to evaluate
a voice assistant’s performance, leveraging the appropriate
technique. For instance, to measure the response latency,
we can perform sound event detection on the waveform to
find the gap between the end of the question/request and
the start of the response. Similarly, to measure the response
quality, we can first perform speech recognition to extract
the response text, and then assess the quality of answer man-
ually or using an automated system.
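As a concrete illustration of the latency measurement, the sketch below assumes a hypothetical sound event detector that returns (onset, offset, label) tuples with "question" and "response" labels; the label names and detector are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

Event = Tuple[float, float, str]  # (onset in seconds, offset in seconds, label)

def response_latency(events: List[Event]) -> float:
    """Gap between the end of the question/request and the start of the response."""
    question_end = max(off for on, off, lab in events if lab == "question")
    response_start = min(on for on, off, lab in events if lab == "response")
    return response_start - question_end

# Example: question ends at 4.2 s, response starts at 5.0 s -> latency of 0.8 s.
events = [(1.0, 1.6, "wake_word"), (1.8, 4.2, "question"), (5.0, 9.3, "response")]
print(response_latency(events))
```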
However, audio recorded using the setup in Figure 1a is subject to unpredictable and uncontrolled background noise, leading to degraded quality. As pointed out in Section 1, this noise negatively impacts the performance of downstream algorithms and systems that take these recordings as input, such as the sound event detection algorithm and the speech recognition system mentioned above, which in turn leads to inaccurate performance measurements and questionable conclusions.
3.2. At-recording Hardware Modification
We make one simple modification to the setup in Fig-
ure 1a: the voice assistant-enabled device is modified such
that its response is directly streamed out using a wire and
sent to the computer, as shown in Figure 1b. Audio collected in this way is completely free of background noise, because the signal now travels through the wire instead of the air. Note that we make this modification in such a way that it does not affect the device's ability to produce an audible response for a human listener. In other words, the modified setup captures the device response in two parallel ways: the usual audio response, which is audible to humans and is recorded by the microphone, and an additional audio response, inaudible to humans, that is streamed out directly over the wire, as shown in Figure 1b. We refer to the first type as “recorded audio” and the second type as “streamed audio” in the remainder of the paper.
We show the waveform of such a pair of recorded and
streamed audio in Figure 3. There are three main dif-
ferences between them: first, the recorded audio includes
the wake word and question/request played to the device
through the speaker, while the streamed audio does not con-
tain them, because the wire is only connecting the device
and the computer; second, the gaps between the wake word, question, and response segments in the recorded audio contain ambient noise, while the streamed audio is completely free of noise between sound events (represented by an almost flat line); third, the recorded audio and the streamed audio may not be in perfect sync. Because the two are collected in different ways, there is a non-negligible chance that they are out of sync in the generated audio files. For instance, in Figure 3, the streamed audio is slightly ahead of the recorded audio, with the difference most visible around second 7.5.
(a) Example conversation 1.
Synthesized user: Alexa, what’s today’s date?
Alexa: Today is November 1st, 2021.
(b) Example conversation 2.
Synthesized user: Alexa, play baby shark.
Alexa: Baby Shark by Pink Fong, on Amazon Music.
Alexa: (music playing).
Figure 2. Waveform and text of example conversations collected in audio recording experiments. Blue blocks in the waveform represent speech segments, and pink ones represent music.
Figure 3. Waveforms of the recorded (top, blue) and streamed (bottom, red) audio for the same conversation.
3.3. Post-recording Algorithm Improvement for
Sound Event Detection
For the purpose of speech recognition, the streamed audio collected in the modified setup can be used directly to obtain high quality response text. However, to measure response latency using sound event detection, streamed audio alone is insufficient because it only contains the response part of the conversation. Its benefit, though, is that sound event detection can be run on it with simple off-the-shelf algorithms, without advanced noise reduction or source separation, because it can be considered completely noise-free. In contrast, recorded audio contains the complete conversation, but requires more advanced and robust algorithms to ensure the sound event detection results are not negatively impacted by background noise. Moreover, as explained in Section 3.2, one cannot directly combine the recorded audio and the streamed audio, because the two can be out of sync.
Based on these observations, we propose a simple algorithmic improvement that brings the best of both worlds together and addresses the out-of-sync issue at the same time. We first align the recorded audio and the streamed
audio using cross-correlation, bringing them in sync. Given two real-valued one-dimensional arrays $x$ and $y$ of lengths $n_x$ and $n_y$ respectively, their cross-correlation is an array $z$ such that
$$z[k] = \sum_{l=0}^{n_x - 1} x_l \, y_{l - k + N - 1}, \qquad k = 0, 1, \ldots, n_x + n_y - 2,$$
where $N = \max(n_x, n_y)$, $y_m = 0$ when $m$ is outside the range of $y$, and $k$ is the lag index. To align $x$ and $y$, we simply take $k^* = \arg\max_k z[k]$ and shift $x$ by $k^*$.
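The alignment step maps directly onto standard library routines. Below is a minimal sketch using SciPy's full cross-correlation, assuming both signals are mono NumPy arrays sampled at the same rate; the function name and the zero-meaning step are illustrative choices, not details from the paper.

```python
import numpy as np
from scipy import signal

def align_by_cross_correlation(recorded: np.ndarray, streamed: np.ndarray) -> int:
    """Return the lag (in samples) of `recorded` relative to `streamed`."""
    # Full cross-correlation z[k] as defined above; zero-mean the inputs so the
    # correlation peak reflects waveform shape rather than any DC offset.
    x = recorded - recorded.mean()
    y = streamed - streamed.mean()
    z = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return int(lags[np.argmax(z)])

# A positive lag means the streamed content appears that many samples later in the
# recording; shifting the recorded audio by the lag brings the two into sync.
```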
Then we run an off-the-shelf sound event detection al-
gorithm on both the recorded audio and the streamed au-
dio, and refer to the results as recorded sound events and
streamed sound events, respectively. Lastly, we update the
recorded sound events using the more accurate streamed
sound events. Note that in the last step, only the recorded
sound events in the response part are updated, with the wake
word and question part intact. This is feasible because we
know the rough time when the question/request ends in the
recorded audio, as we control the start play time of the ques-
tion and also know its length. The complete proposed algorithm is shown in Figure 4. Note that when the recording environment is noisy, the alignment between the streamed audio and the recorded audio may not be accurate. In such cases, one can optionally pass the recorded audio through a band-pass filter to remove some of the noise before running the alignment.
Figure 4. Proposed algorithm diagram.
Figure 5. An intuitive guide to the scale of SNR and its mapping to real-world scenarios.
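To make the flow in Figure 4 concrete, the sketch below strings the steps together under stated assumptions: detect_events stands in for whatever off-the-shelf sound event detector is used, align_by_cross_correlation is the helper sketched above, and the band-pass corner frequencies are illustrative values rather than settings taken from the paper.

```python
import numpy as np
from scipy import signal

def bandpass(audio: np.ndarray, sr: int, low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
    """Optional noise-limiting band-pass applied before alignment (cutoffs are illustrative)."""
    sos = signal.butter(4, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    return signal.sosfiltfilt(sos, audio)

def merge_sound_events(recorded: np.ndarray, streamed: np.ndarray, sr: int,
                       question_end_sec: float, detect_events, noisy: bool = False):
    """Detect events on both signals, then replace the response-part events of the
    recorded audio with the noise-free streamed events, after aligning the two."""
    ref = bandpass(recorded, sr) if noisy else recorded
    lag = align_by_cross_correlation(ref, streamed)   # lag in samples (see sketch above)
    offset_sec = lag / sr                             # maps streamed time onto recorded time

    recorded_events = detect_events(recorded, sr)     # [(onset_sec, offset_sec, label), ...]
    streamed_events = detect_events(streamed, sr)

    # Keep the wake-word/question events from the recording, and take the response
    # events from the streamed audio, shifted into the recording's time axis.
    kept = [e for e in recorded_events if e[1] <= question_end_sec]
    shifted = [(on + offset_sec, off + offset_sec, lab) for on, off, lab in streamed_events]
    return kept + shifted
```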
4. Experiment Results
To demonstrate the effectiveness of the proposed solu-
tion, we run experiments to collect data across a variety of
questions/requests using the modified setup in Section 3.2.
We use an Apple HomePod mini as the voice assistant-enabled device in all experiments for convenience. In practice, one can use any voice assistant-enabled device to run the experiment, as long as the device can be modified as described in Section 3.2. In each experiment, the wake word “Hey Siri” and a randomly selected question/request from a predefined set are
played. We perform sound event detection using the pro-
posed algorithm in Section 3.3 and speech recognition using
AWS Transcribe. We use the existing setup in Section 3.1
with recorded audio only as the baseline. We introduce the
detailed experiment settings in the following sections.
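As a reference point, transcription with AWS Transcribe can be driven programmatically. The snippet below is a minimal boto3 sketch, in which the S3 location, job name, and file format are placeholders and the audio is assumed to have already been uploaded to S3.

```python
import time
import boto3

# Placeholder S3 location; AWS Transcribe reads its input from S3, not from local files.
MEDIA_URI = "s3://example-bucket/audio/response-001.wav"
JOB_NAME = "response-001"

transcribe = boto3.client("transcribe")
transcribe.start_transcription_job(
    TranscriptionJobName=JOB_NAME,
    Media={"MediaFileUri": MEDIA_URI},
    MediaFormat="wav",
    LanguageCode="en-US",
)

# Poll until the job finishes; the transcript JSON is then available at TranscriptFileUri.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=JOB_NAME)
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

if status == "COMPLETED":
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```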
Figure 6. Example audio waveforms at different SNR values. Top (blue): original recorded audio; middle (green): SNR = 1 (high noise); bottom (white): SNR = 20 (low noise).
4.1. Sound Event Detection Results
To simulate a realistic acoustic environment for recording, we manually added background noise to the recorded audio at different signal-to-noise ratio (SNR) levels. SNR is a common measure of audio quality and is computed as the ratio of the signal power to the noise power. For our experiments, we simulated 20 distinct SNR settings (SNR ranging from 1 to 20) to measure the impact of different levels of noise. Figure 5 gives an intuitive guide to the SNR settings we chose. Figure 6 shows some example audio clips at different SNR levels and their waveforms.
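The SNR-controlled mixing can be sketched as follows. The paper defines SNR as the ratio of signal power to noise power, so this snippet scales a noise clip to hit a target linear power ratio; if the settings were instead meant in decibels, only the scaling line would change.

```python
import numpy as np

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr: float) -> np.ndarray:
    """Mix `noise` into `clean` so that signal power / noise power equals `snr` (linear ratio)."""
    # Loop or trim the noise clip to match the signal length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (snr * p_noise))  # so that p_signal / (scale**2 * p_noise) == snr
    return clean + scale * noise

# Example: snr=1 mixes in noise as strong as the signal (high noise),
# snr=20 mixes in much weaker noise (low noise), matching the settings above.
```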
We run 140 experiments and collect one recorded audio
and one streamed audio from each. In the baseline method,
for each recorded audio, we obtain its noisy version under each SNR setting (such as recorded_snr1, recorded_snr2, etc.), and run an off-the-shelf sound event detection algorithm on it to compute the response latency. In the proposed solution, for each audio pair, we simply use the algorithm in Section 3.3 to get the updated sound events and compute the response latency. Under each SNR setting, the recorded audio in the proposed solution is replaced with the corresponding noisy version.

SNR   MAE in seconds, streamed (proposed)   MAE in seconds, recorded (baseline)
1     0.08                                  1.25
2     0.08                                  0.85
3     0.08                                  0.46
4     0.08                                  0.23
5     0.08                                  0.19
6     0.08                                  0.18
7     0.08                                  0.20
8     0.08                                  0.15
9     0.08                                  0.16
10    0.08                                  0.14
15    0.08                                  0.26
20    0.08                                  0.19
Table 1. MAE using streamed audio versus recorded audio under different SNR settings.
We use Mean Absolute Error (MAE) to measure the difference between the computed response latency and the ground truth in each setting. As shown in Table 1, the response latency computed using streamed audio has an MAE of 80 ms in all SNR settings. This is to be expected, because streamed audio is completely noise-free and the sound event detection results on it are not affected by the noise level. In contrast, the response latency computed using recorded audio produces a larger MAE as the noise gets stronger (SNR gets lower). At SNR = 5 and higher, the MAE stabilizes around 200 ms. Note that even the smallest MAE using recorded audio among all SNR settings (140 ms at SNR = 10) is much higher than the MAE using streamed audio. This suggests that our proposed solution can still improve algorithm performance even in an acoustic environment that has been optimized for noise reduction (i.e., with high SNR).
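For completeness, the metric reported in Table 1 is just the mean absolute deviation of the computed latencies from the ground-truth latencies; a minimal sketch:

```python
import numpy as np

def mae(predicted_latencies, ground_truth_latencies) -> float:
    """Mean Absolute Error between computed and ground-truth response latencies (in seconds)."""
    pred = np.asarray(predicted_latencies, dtype=float)
    truth = np.asarray(ground_truth_latencies, dtype=float)
    return float(np.mean(np.abs(pred - truth)))
```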
4.2. Speech Recognition Results
We collect data from 80 experiments. For each experiment, we run AWS Transcribe on the recorded audio and the streamed audio to obtain the response text. As no ground truth is available for this task, we compute the Word Error Rate (WER) between the recorded response text and the streamed response text to illustrate the difference between the two. Across the 80 experiments, the average un-normalized WER between the recorded and streamed response texts is 18, meaning that on average 18 words per conversation are different. “Different” could mean word displacement, substitution, insertion, or deletion. During manual inspection, we found that many of these differences are caused by transcription errors on the recorded audio. We provide a few examples in Table 2, with the transcription errors highlighted in red.
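An un-normalized WER of this kind is simply the word-level edit distance between the two transcripts. The sketch below is one way to compute it; it counts substitutions, insertions, and deletions, and the exact counting convention used in the paper is an assumption here.

```python
def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Un-normalized word error count: minimum substitutions + insertions + deletions."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming edit distance over words (single rolling row).
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + cost)    # match or substitution
            prev = cur
    return dp[-1]

# Example based on Table 2: "home pod" transcribed as "home pie" is one substitution.
print(word_edit_distance("there are no timers on home pod",
                         "there are no timers on home pie"))  # -> 1
```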
5. Conclusion
We propose a hardware modification solution to record noise-free, high-quality audio even in noisy environments. In addition, we propose an algorithm to perform accurate sound event detection using the recordings collected in the proposed setting. Experimental results show that the proposed solution produces very accurate sound event detection results even in environments with low SNR. There is also qualitative evidence that the proposed solution leads to better speech recognition results.
References
[1] A.S. Abutaleb. An adaptive filter for noise cancelling. IEEE
Transactions on Circuits and Systems, 35(10):1201–1209,
1988. 2
[2] Steven Boll. Suppression of acoustic noise in speech us-
ing spectral subtraction. IEEE Transactions on acoustics,
speech, and signal processing, 27(2):113–120, 1979. 2
[3] Sebastian Braun and Ivan Tashev. Data augmentation and
loss normalization for deep noise suppression. In Interna-
tional Conference on Speech and Computer, pages 79–86.
Springer, 2020. 2
[4] S. Chandrakala and S. L. Jayalakshmi. Environmental audio
scene and sound event recognition for autonomous surveil-
lance: A survey and comparative studies. ACM Comput.
Surv., 52(3), jun 2019. 2
[5] Yarong Feng, Zongyi Joe Liu, Yuan Ling, and Bruce Ferry.
A two-stage lstm based approach for voice activity detec-
tion with sound event classification. In 2022 IEEE Interna-
tional Conference on Consumer Electronics (ICCE), pages
1–6, 2022. 2
[6] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang,
Wenwu Wang, and Mark D. Plumbley. Panns: Large-scale
pretrained audio neural networks for audio pattern recogni-
tion, 2020. 2
[7] Shunyan Luo, Yarong Feng, Zongyi Joe Liu, Yuan Ling,
Shujing Dong, and Bruce Ferry. High precision sound event
detection based on transfer learning using transposed convo-
lutions and feature pyramid network. In 2023 IEEE Interna-
tional Conference on Consumer Electronics (ICCE), pages
1–6, 2023. 2
[8] Annamaria Mesaros, Aleksandr Diment, Benjamin Elizalde,
Toni Heittola, Emmanuel Vincent, Bhiksha Raj, and Tuomas
Virtanen. Sound event detection in the dcase 2017 challenge.
IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 27(6):992–1006, 2019. 2
Streamed response text: hey siri. what’s the weather tomorrow? expect cloudy skies tomorrow. daytime temperatures will hover around 75 degrees with overnight lows around 62.
Recorded response text: hey siri. what’s the weather tomorrow? expect? cloudy skies tomorrow. daytime temperatures will hover around 75 degrees with overnight lows around 62.
Streamed response text: hey siri. cancel all timers. there are no timers on home pod.
Recorded response text: hey siri, cancel. all timers. there are no timers on home pie.
Streamed response text: hey siri. what’s fifteen hundred times twelve. 1500 times 12 is 18,000.
Recorded response text: hey siri. what’s 1500 times 12, 1000, 500 times 12 is 18,000.
Table 2. Qualitative speech recognition results comparison.
[9] Annamaria Mesaros, Toni Heittola, Tuomas Virtanen, and
Mark D. Plumbley. Sound event detection: A tutorial. IEEE
Signal Processing Magazine, 38(5):67–83, sep 2021. 2
[10] Sharnil Pandya and Hemant Ghayvat. Ambient acoustic
event assistive framework for identification, detection, and
recognition of unknown acoustic events of a residence. Ad-
vanced Engineering Informatics, 47:101238, 2021. 2
[11] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo,
Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi. Mimii
dataset: Sound dataset for malfunctioning industrial machine
investigation and inspection, 2019. 2
[12] Chandan KA Reddy, Vishak Gopal, Ross Cutler, Ebrahim
Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matu-
sevych, Robert Aichner, Ashkan Aazami, Sebastian Braun,
et al. The interspeech 2020 deep noise suppression chal-
lenge: Datasets, subjective testing framework, and challenge
results. arXiv preprint arXiv:2005.13981, 2020. 2
[13] Francesca Ronchini and Romain Serizel. A benchmark of
state-of-the-art sound event detection systems evaluated on
synthetic soundscapes. In ICASSP 2022 - 2022 IEEE Inter-
national Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), pages 1031–1035, 2022. 2
[14] Thomas Stark, Ilona Anderson, Peter Nopp, Ernst Aschbacher, Alexander Möltner, Yassaman Khajehnouri, Rudolf Hagen, Andreas Radeloff, and Kristen Rak. Microphone directionality and wind noise reduction enhance speech perception in users of the med-el sonnet audio processor. Cochlear Implants International, 21(1):53–65, 2020. PMID: 31524107. 2
[15] Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin
Salamon. Sound event detection in synthetic domestic en-
vironments. In ICASSP 2020 - 2020 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 86–90, 2020. 2
[16] Volker Stahl, Alexander Fischer, and Rolf Bippus. Quantile
based noise estimation for spectral subtraction and wiener
filtering. In 2000 IEEE International Conference on Acous-
tics, Speech, and Signal Processing. Proceedings (Cat. No.
00CH37100), volume 3, pages 1875–1878. IEEE, 2000. 2
[17] Akihiko Sugiyama, Ryoji Miyahara, and Kouji Oosugi. A
noise robust hearable device with an adaptive noise canceller
and its dsp implementation. In ICASSP 2019 - 2019 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 2722–2726, 2019. 2
[18] Hao Zhang, Ke Tan, and DeLiang Wang. Deep learning for
joint acoustic echo and noise cancellation with nonlinear dis-
tortions. In Interspeech, pages 4255–4259, 2019. 2