Carnegie Mellon University
June 14, 2024

Robust and Generalizable Learned Representations of Multichannel Speech

By Shinji Watanabe

This work will investigate learned representations of single- and multichannel audio to better solve downstream tasks. The work has two broad components: 1) foundational research on learning general-purpose representations of single- and multichannel speech, and 2) grounding these representations in tasks to solve practical problems of interest to Apple. Two tasks of immediate interest are better speaker identification (who is talking to a device?) and better intent detection (is detected speech directed at a device?). More broadly, the work will have impact in automatic speech recognition (ASR) and across many applications of speech processing in Apple products (e.g., voice isolation, acoustic scene analysis, and spatial audio).