Robust Direction-of-Arrival estimation and spatial filtering in noisy and reverberant environments

Document Type
Doctoral Thesis
Chakrabarty, Soumitro

The advent of multi-microphone setups on a plethora of commercial devices in recent years has generated newfound interest in the development of robust microphone array signal processing methods. These methods are generally used either to estimate parameters associated with the acoustic scene or to extract the signal(s) of interest. In most practical scenarios, the sources are located in the far field of a microphone array, where the main spatial information of interest is the direction-of-arrival (DOA) of the plane waves originating from the source positions. The focus of this thesis is to incorporate robustness against missing, imperfect, or erroneous information regarding the DOAs of the sound sources within a microphone array signal processing framework.

The DOAs of sound sources are important information in themselves; however, they are most often used as parameters for a subsequent processing method. One of the most important microphone array signal processing techniques that uses the DOAs of sound sources is spatial filtering, where this information serves tasks such as speech enhancement, source separation, and noise reduction. Therefore, there are two main points where robustness can be introduced. It can be introduced at the DOA estimation stage, where robust estimators can be developed for applications in adverse acoustic environments, or in the subsequent spatial filtering framework, where a mechanism to account for uncertainty or a complete lack of DOA information is developed. This thesis investigates both options and explores three main approaches to incorporating robustness against DOA information errors.

The first approach deals with DOA estimation. A supervised learning-based DOA estimation method is proposed in this thesis that takes the phase component of the short-time Fourier transform (STFT) coefficients of the microphone signals at each time frame as input. A supervised learning approach generally has the advantage that it can be adapted to different acoustic conditions via training, making it more robust than classical signal processing based methods. In the thesis, a convolutional neural network (CNN) based classification approach for both single and multiple speech source localization is developed. The proposed method is trained with synthesized noise signals to simplify training data generation. The design aspects of the proposed CNN are also investigated empirically.
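
The phase-map input feature described above can be sketched as follows. This is an illustrative numpy sketch, not the exact feature pipeline of the thesis: the array size, FFT length, and window are assumptions, and the CNN classifier itself is omitted.

```python
import numpy as np

def phase_map(frames, n_fft=256):
    """Phase of the STFT coefficients of one time frame across M mics.

    frames : (M, n_fft) array, one windowed time frame per microphone.
    Returns an (M, n_fft // 2 + 1) matrix of phase values, the kind of
    2-D "image" a CNN classifier could take as input. (Illustrative
    sketch; parameters are assumptions, not the thesis's settings.)
    """
    spec = np.fft.rfft(frames * np.hanning(n_fft), n=n_fft, axis=1)
    return np.angle(spec)

# Example: 4-microphone array, each channel an integer-sample-delayed
# copy of the same noise signal (a crude far-field plane-wave stand-in).
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
frames = np.stack([np.roll(x, d) for d in range(4)])
features = phase_map(frames)
print(features.shape)  # (4, 129)
```

The inter-channel phase differences in such a map vary linearly with frequency at a slope set by the time delays, which is the spatial cue a classifier can learn to map to DOA classes.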

The second approach deals with incorporating robustness to DOA estimation errors within a spatial filtering framework. For an objective understanding of the influence of DOA estimation errors on spatial filtering performance, an analysis method is developed in this thesis for the recently proposed informed spatial filtering (ISF) framework. ISF is a flexible spatial filtering approach in which sound sources from different directions are captured according to a user-defined desired directional response function, using instantaneous estimates of parameters such as the source DOAs. The analysis method is used to investigate how DOA information errors affect the directional response obtained at the output of the spatial filter compared to the desired directional response. Experimental analysis demonstrates severe deviation of the obtained response from the desired one in the presence of DOA errors, which gives an objective understanding of the influence of the errors and highlights the need for robustness against them. The ISF framework uses the DOA estimates directly to formulate the steering vectors from which the spatial filter weights are computed, so DOA estimation errors lead to a severe degradation in performance. As a solution, a Bayesian approach to ISF is proposed that accounts for the unreliability of DOA estimates in noisy and reverberant conditions: the final spatial filter is given as a weighted sum of multiple directional filters, with the DOA estimates used only to determine the weighting factors. The improved robustness of the proposed approach is shown using the analysis method described earlier.
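
The weighted-sum idea behind the Bayesian approach can be illustrated with a minimal sketch. The posterior weights and the per-direction filters below are placeholders, not the thesis's actual formulation; the point is only that the combined filter hedges over candidate DOAs instead of committing to one estimate.

```python
import numpy as np

def bayesian_filter(filters, log_post):
    """Combine per-direction spatial filters with posterior weights.

    filters  : (D, M) complex array, one precomputed filter per
               candidate DOA.
    log_post : (D,) unnormalised log-posterior of each candidate DOA
               given the (possibly erroneous) DOA estimate.
    The final filter is the posterior-weighted sum of the directional
    filters, so a single wrong DOA estimate does not fully determine
    the result. (Illustrative sketch of the weighted-sum idea only.)
    """
    w = np.exp(log_post - log_post.max())
    w /= w.sum()          # normalised posterior weights
    return w @ filters    # (M,) combined spatial filter

# Example: 8 candidate DOAs, 4 microphones. A sharply peaked posterior
# at candidate 3 makes the combined filter match that direction's filter.
rng = np.random.default_rng(1)
filters = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))
log_post = np.full(8, -50.0)
log_post[3] = 0.0
h = bayesian_filter(filters, log_post)
print(np.allclose(h, filters[3]))  # True: other weights are ~1e-22
```

With a flat posterior the same function instead averages over all directions, which is the behaviour that buys robustness when the DOA estimate is unreliable.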

The final approach deals with multi-microphone speech enhancement using a deep neural network (DNN) based time-frequency masking approach in which, rather than using DOA information, the estimated masks are used to compute the relative transfer function that forms the steering vector of the spatial filter. DNN-based mask estimation has been shown to be effective for both single-channel and multi-channel speech enhancement. In contrast to most existing methods, which apply a single-channel DNN to each microphone signal separately, the proposed approach utilizes the multi-channel recordings to exploit both the spatial and the spectro-temporal characteristics of the speech and noise signals for discriminative learning and mask estimation. The estimation of different types of ideal masks, as well as their subsequent application to speech enhancement, is investigated. Through experimental analysis, the proposed method is shown to be more robust to different angular positions of the desired speech source in noisy and reverberant environments than ideal fixed beamformers.
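
The mask-driven beamforming chain can be sketched as follows: the mask selects frames for the speech and noise spatial covariance matrices, and the principal eigenvector of the speech covariance serves as a steering-vector surrogate. This is a minimal numpy illustration for a single frequency bin under simplifying assumptions; the DNN mask estimator is omitted, and the MVDR formulation is one common choice, not necessarily the thesis's exact filter.

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask):
    """MVDR weights derived from a time-frequency mask, not a DOA.

    Y           : (M, T) complex STFT coefficients of one frequency bin.
    speech_mask : (T,) values in [0, 1], near 1 where speech dominates.
    (Illustrative sketch under simplifying assumptions.)
    """
    Phi_s = (speech_mask * Y) @ Y.conj().T / speech_mask.sum()
    noise_mask = 1.0 - speech_mask
    Phi_n = (noise_mask * Y) @ Y.conj().T / noise_mask.sum()
    Phi_n += 1e-6 * np.eye(Y.shape[0])   # diagonal loading for stability
    # Principal eigenvector of the speech covariance as steering vector.
    _, eigvecs = np.linalg.eigh(Phi_s)
    d = eigvecs[:, -1]
    w = np.linalg.solve(Phi_n, d)
    return w / (d.conj() @ w)            # distortionless toward d

# Example: 3 mics; speech frames share one spatial signature d_true.
rng = np.random.default_rng(2)
T = 200
d_true = np.array([1.0, 0.8 + 0.2j, 0.5 - 0.4j])
d_true /= np.linalg.norm(d_true)
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
Y = 0.1 * (rng.standard_normal((3, T)) + 1j * rng.standard_normal((3, T)))
mask = np.zeros(T)
mask[:100] = 1.0                         # speech active in first half
Y[:, :100] += np.outer(d_true, s[:100])
w = mask_based_mvdr(Y, mask)
print(abs(w.conj() @ d_true))  # ~1: nearly distortionless toward d_true
```

Because the steering vector is estimated from the data itself, the filter needs no explicit DOA and adapts to whatever angular position the speech source occupies, which is the robustness property the paragraph above describes.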

All the mentioned approaches are extensively evaluated with both simulated and measured room impulse responses. The experimental results demonstrate superior performance of the proposed approaches compared to related existing methods.
