Many people with hearing loss struggle to understand speech in noisy environments, making noise robustness critical for hearing-assistive devices. Recently developed haptic hearing aids, which convert audio to vibration, can improve speech-in-noise performance for cochlear implant (CI) users and assist those unable to access hearing-assistive devices. They are typically body-worn rather than head-mounted, allowing additional space for batteries and microprocessors, and so can deploy more sophisticated noise-reduction techniques. The current study assessed whether a real-time-feasible dual-path recurrent neural network (DPRNN) can improve tactile speech-in-noise performance. Audio was converted to vibration on the wrist using a vocoder method, either with or without noise reduction. Performance was tested for speech in a multi-talker noise (recorded at a party) with a 2.5-dB signal-to-noise ratio. An objective assessment showed the DPRNN improved the scale-invariant signal-to-distortion ratio by 8.6 dB and substantially outperformed traditional noise-reduction (log-MMSE). A behavioural assessment in 16 participants showed the DPRNN improved tactile-only sentence identification in noise by 8.2%. This suggests that advanced techniques like the DPRNN could substantially improve outcomes with haptic hearing aids. Low-cost haptic devices could soon be an important supplement to hearing-assistive devices such as CIs or offer an alternative for people who cannot access CI technology.
Some previous studies have avoided the issue of noise robustness by converting the clean speech signal to haptic stimulation, rather than the speech-in-noise signal that would be received by the microphone of a haptic hearing aid6,7,8. While these studies, which have shown substantial benefits to speech-in-noise performance, provide a proof-of-concept, they leave unaddressed a major obstacle in the translation of laboratory benefits to the real world. Other studies have converted the speech-in-noise signal to haptic stimulation and have also shown considerable benefits to speech recognition3,4,9. The method used to reduce the impact of noise in these studies was envelope expansion, which enhances high intensity parts of the haptic signal. While envelope expansion is effective when speech is considerably more intense than background noise3,4,9, it breaks down at more challenging (low or negative) signal-to-noise ratios (SNRs)3,4. If haptic signal extraction strategies that are robust in these more challenging listening scenarios can be developed, this would substantially widen the group of people who could benefit from haptic hearing aids.
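The envelope-expansion idea described above can be sketched very simply: raising a peak-normalised amplitude envelope to a power greater than one boosts intense (speech-dominated) segments relative to quieter (noise-dominated) ones. The exponent and normalisation below are illustrative assumptions, not the parameters used in the cited studies.

```python
import numpy as np

def expand_envelope(env, gamma=2.0):
    """Illustrative envelope expander: a power law on the peak-normalised
    envelope suppresses low-intensity (likely noise) segments while
    preserving high-intensity (likely speech) peaks.
    gamma and the normalisation are assumptions for illustration."""
    peak = np.max(np.abs(env))
    if peak == 0:
        return env
    norm = np.abs(env) / peak          # map into [0, 1]
    return peak * norm ** gamma        # quiet parts shrink, peaks survive

# A toy envelope: strong speech peaks over a weak noise floor
env = np.array([0.1, 0.9, 0.15, 1.0, 0.08])
expanded = expand_envelope(env, gamma=2.0)
```

After expansion the peak-to-mean ratio of the envelope increases, which is why the approach helps at favourable SNRs but fails once the noise envelope is comparable in level to the speech envelope.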
One advantage of haptic hearing aids compared to other hearing-assistive devices, such as CIs or acoustic hearing aids, is that they are typically worn on the body (e.g., the wrist) rather than the head. They therefore have considerably more flexibility in form factor and available design space. This allows the use of larger, more powerful batteries and microprocessors that can deploy more sophisticated noise-reduction techniques15. The current study explored whether an advanced real-time-feasible noise-reduction method, using a recurrent neural network, can effectively extract the haptic speech signal from background noise at challenging SNRs and improve tactile speech perception.
Conventional recurrent neural networks are ineffective at modelling long sequences due to optimization issues, such as vanishing and exploding gradients23. The long short-term memory (LSTM) architecture was developed to overcome these deficiencies24,25. The DPRNN builds on this with residual, layer-normalized LSTM blocks that alternate between local and global processing paths, improving performance over a single LSTM layer or a stack of LSTM layers. The DPRNN uses an encoder-masker-decoder architecture, and recent work has shown that, across several metrics, it can outperform other state-of-the-art networks, such as ConvTasNet22. Furthermore, the DPRNN is more efficient than alternative cutting-edge architectures, such as the Dual-Path Transformer Neural Network26 and the Dual-Path Convolution Recurrent Network27, in both the number of trainable parameters and the computational load.
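The key structural idea of the dual-path scheme can be illustrated without the learned layers: a long sequence is split into short overlapping chunks stacked into a 3D tensor, and processing then alternates between the intra-chunk (local) axis and the inter-chunk (global) axis, so each recurrent pass only ever sees a short sequence. In the sketch below the LSTMs are replaced by stand-in callables to keep it numpy-only; a real DPRNN applies (B)LSTM layers with layer normalization in each path. The chunk and hop sizes are illustrative assumptions.

```python
import numpy as np

def segment(x, chunk=4, hop=2):
    """Split a [features, time] matrix into 50%-overlapping chunks,
    giving a [features, chunk, n_chunks] tensor (zero-padded at the end).
    This is the segmentation step of the dual-path scheme."""
    F, T = x.shape
    n_chunks = int(np.ceil(max(T - chunk, 0) / hop)) + 1
    pad = (n_chunks - 1) * hop + chunk - T
    xp = np.pad(x, ((0, 0), (0, pad)))
    return np.stack([xp[:, i*hop : i*hop + chunk] for i in range(n_chunks)],
                    axis=-1)

def dual_path_pass(chunks, intra_rnn, inter_rnn):
    """One dual-path block: the local (intra-chunk) path runs along the
    within-chunk axis, the global (inter-chunk) path along the chunk-index
    axis. In a real DPRNN each path is an LSTM with layer normalization;
    here the paths are stand-in callables for illustration."""
    local = chunks + intra_rnn(chunks)     # residual local path
    return local + inter_rnn(local)        # residual global path

x = np.random.randn(3, 10)                 # [features, time]
chunks = segment(x)                        # -> shape [3, 4, 4]
placeholder = lambda c: np.zeros_like(c)   # stand-in for a learned RNN
out = dual_path_pass(chunks, placeholder, placeholder)
```

Because both recurrences run over sequences of roughly the square root of the input length, the scheme keeps memory paths short while still propagating information across the whole utterance via the inter-chunk pass.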
For audio-to-tactile conversion, the audio was processed with the quasi-causal DPRNN before being converted to vibro-tactile stimulation on the wrist using a previously developed tactile vocoder approach3,4,9,10,14. The tactile vocoder filters the audio into several frequency bands, and the amplitude envelope from each band is used to modulate the amplitude of one of several vibro-tactile tones. The tactile vocoder approach has been shown to be effective for transferring phoneme information10,14 and improving speech-in-noise performance for CI users3,4,9.
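A minimal sketch of the tactile vocoder idea is shown below: the audio is band-pass filtered into a few frequency bands, each band's amplitude envelope is extracted via the analytic signal, and each envelope amplitude-modulates a fixed vibro-tactile carrier tone. The band edges, number of bands, carrier frequencies, and FFT-based filtering are illustrative assumptions, not the published vocoder parameters (a real device would use causal filters).

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Crude FFT brick-wall band-pass (illustrative only)."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(f < lo) | (f > hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

def envelope(x):
    """Amplitude envelope via the analytic signal (FFT-based Hilbert)."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return np.abs(np.fft.ifft(X * h))

def tactile_vocoder(audio, fs, bands, carriers):
    """Sum of vibro-tactile tones, each amplitude-modulated by the
    envelope of one audio band. Band edges and carrier frequencies
    here are assumptions for illustration."""
    t = np.arange(len(audio)) / fs
    out = np.zeros(len(audio))
    for (lo, hi), fc in zip(bands, carriers):
        env = envelope(bandpass_fft(audio, fs, lo, hi))
        out += env * np.sin(2 * np.pi * fc * t)
    return out

fs = 16000
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 500 * t)        # toy stand-in for speech
bands = [(100, 1000), (1000, 4000)]        # illustrative band edges
carriers = [120.0, 220.0]                  # illustrative tactile tones (Hz)
vib = tactile_vocoder(audio, fs, bands, carriers)
```

The carriers sit in the low-frequency range where skin vibration sensitivity is best, so the speech envelope information is carried at frequencies the wrist can actually resolve.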
The quasi-causal DPRNN was assessed both objectively and behaviourally. The objective assessment aimed to establish whether the quasi-causal DPRNN could effectively extract speech from multi-talker noise and outperform traditional noise-reduction methods. The multi-talker noise used was representative of environments that listeners with hearing loss typically find most challenging, with competing speech and other fluctuating and transient sounds. The behavioural assessment tested whether the objectively measured denoising of the tactile speech-in-noise signal by the DPRNN translated into improvements in tactile speech identification. In objective testing, the quasi-causal DPRNN was compared to log-MMSE29, a leading traditional noise-reduction method30. Note that the expander approach used in previous studies of tactile speech in noise is not suitable for such an objective comparison, because it intentionally alters the speech signal and therefore cannot be directly compared to methods for which clean speech serves as the reference. Furthermore, as discussed, the expander approach has been shown to be ineffective at the challenging SNRs on which this study focuses. In addition to log-MMSE, the quasi-causal DPRNN was compared to causal and non-causal implementations (all trained using the same data) to establish the effect of including future information.
There were two stages to the objective assessment of the quasi-causal DPRNN method. In the first stage, the SI-SDR of the tactile vocoded speech in noise (referenced to time-aligned clean speech) was calculated for the causal, quasi-causal, and non-causal DPRNNs, as well as for the traditional log-MMSE noise reduction method. Performance was assessed for male and female speech either with a stationary noise, which was filtered to match the international long-term average speech spectrum (ILTASS)34, or with a multi-talker noise recording from a party. The results for the three SNRs tested are shown in Fig. 1.
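The SI-SDR metric used in this first stage has a standard definition: the estimate is projected onto the (time-aligned) clean reference to obtain the target component, and the energy of that component is compared with the energy of the residual distortion. The sketch below follows that standard formulation; the mean subtraction and the toy signals are implementation assumptions for illustration.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB: project the
    estimate onto the reference to get the target component, then
    compare its energy with that of the residual distortion."""
    est = estimate - np.mean(estimate)
    ref = reference - np.mean(reference)
    alpha = np.dot(est, ref) / np.dot(ref, ref)   # optimal scaling factor
    target = alpha * ref
    distortion = est - target
    return 10 * np.log10(np.sum(target**2) / np.sum(distortion**2))

# Toy example: a lightly corrupted estimate should score higher
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
noisy = ref + 0.5 * rng.standard_normal(16000)
denoised = ref + 0.1 * rng.standard_normal(16000)
```

The scale invariance means rescaling the estimate leaves the score unchanged, so the metric rewards waveform fidelity rather than level matching, which suits comparisons across noise-reduction methods with different output gains.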
SI-SDRs for the tactile vocoded signal for speech in either the stationary speech-shaped (ILTASS) noise (left panel) or multi-talker party noise (right panel). SI-SDR scores are shown with no noise reduction (blue), with the log-MMSE method (orange), as well as with the causal (green), quasi-causal (red), and non-causal (purple) DPRNN methods. Performance is shown for three SNRs. The mean and median values are illustrated by a cross and a solid line, respectively, while the top and bottom edges of the box show the upper (0.75) and lower (0.25) quartile. Outliers (values of more than 1.5 times the interquartile range) are shown as filled diamonds.
eSTOI scores (left panels) and SI-SDRs (right panels) for the audio signals (without tactile vocoding), with speech in either the stationary ILTASS (upper panels) or multi-talker party (lower panels) noise. SI-SDRs and eSTOI scores are shown with no noise reduction, with the log-MMSE method, and with the three DPRNN methods. As in Fig. 1, the box plots show the median, quartiles, and outliers.
Percentage of sentences correctly identified with and without the quasi-causal DPRNN noise reduction (NR) and with (light red) and without (dark blue) multi-talker (party) background noise at 2.5 dB SNR. Results are shown in box plots, with the horizontal line inside the box showing the median and the top and bottom edges of the box showing the upper (0.75) and lower (0.25) quartile, as in Fig. 1. Outliers (values of more than 1.5 times the interquartile range) are shown as unfilled circles. The whiskers connect the upper and lower quartiles to the maximum and minimum non-outlier value. The dashed grey line shows chance performance.
The objective assessment for tactile speech-in-noise showed that the quasi-causal DPRNN noise reduction method substantially outperforms traditional noise reduction methods, both for stationary and multi-talker background noises. As expected, the largest gains were found for multi-talker background noise (7.6 dB SI-SDR compared to 2.9 dB SI-SDR at an SNR of 2.5 dB), where traditional methods are known to particularly struggle. Behavioural testing confirmed that the effectiveness of the quasi-causal DPRNN shown in the objective assessment translates to a substantial improvement in tactile sentence identification, without impairing performance for speech in quiet. The quasi-causal DPRNN improved behavioural speech-in-noise scores by 8.2%, which amounted to a recovery of around half of the performance that was lost when noise was added. The acoustic characteristics of the multi-talker noise and the SNR of 2.5 dB are representative of scenarios in which people with hearing loss commonly struggle in their everyday lives. The improved noise-robustness with the real-time-feasible quasi-causal DPRNN could dramatically increase the utility of haptic hearing aids, both when used as stand-alone sensory substitution devices and when used for sensory augmentation with CIs and other hearing-assistive devices.