June 14, 2024
Robust and Generalizable Learned Representations of Multichannel Speech
This work will investigate learned representations of single- and multichannel audio to better solve downstream tasks. There are two broad components to the work: 1) foundational research on learning general-purpose representations of single- and multichannel speech, and 2) grounding these representations in tasks to solve practical problems of interest to Apple. Two tasks of immediate interest are better speaker identification (who is talking to a device?) and better intent detection (is detected speech directed to a device?). More broadly, the work will have impact in automatic speech recognition (ASR) and across many applications of speech processing in Apple products (e.g., voice isolation, acoustic scene analysis, and spatial audio).
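To make the task-grounding idea concrete, the sketch below (illustrative only, not part of the proposal) shows the typical interface such a representation provides: multichannel audio is encoded into a single fixed-size utterance embedding, which a downstream task such as speaker identification can score with cosine similarity. The encoder here is a hand-crafted stand-in (per-channel log band energies, pooled over time and channels); a learned model would replace it, but the input/output contract would be the same. All function names and the feature choice are assumptions made for this example.

```python
import numpy as np

def frame_features(channel: np.ndarray, frame_len: int = 160, n_bands: int = 4) -> np.ndarray:
    """Toy per-frame features for one channel: log energies in a few FFT bands.
    Returns an array of shape (num_frames, n_bands). A learned encoder would
    replace this hand-crafted featurizer."""
    n = (len(channel) // frame_len) * frame_len
    frames = channel[:n].reshape(-1, frame_len)          # (T, frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (T, frame_len//2 + 1)
    bands = np.array_split(spec, n_bands, axis=1)        # n_bands groups of bins
    return np.log(np.stack([b.mean(axis=1) for b in bands], axis=1) + 1e-8)

def multichannel_embedding(audio: np.ndarray) -> np.ndarray:
    """audio: (channels, samples). Pool per-channel frame features over time
    and channels into one fixed-size, unit-norm utterance embedding."""
    feats = np.stack([frame_features(ch) for ch in audio])  # (C, T, D)
    emb = feats.mean(axis=(0, 1))                           # (D,)
    return emb / (np.linalg.norm(emb) + 1e-8)

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Speaker-ID style scoring: cosine similarity of two unit-norm embeddings."""
    return float(np.dot(a, b))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    utterance = rng.standard_normal((2, 16000))  # 2 channels, 1 s at 16 kHz
    e = multichannel_embedding(utterance)
    print(e.shape, cosine_score(e, e))
```

The point of the interface is that the downstream task never sees raw waveforms, only the pooled embedding, so the same representation can back speaker identification, intent detection, or other scoring tasks without changing the task-side code.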