Speaker Diarization answers the question, "Who spoke when?" Currently, speakers in a meeting are identified through channel endpoints, whether through PSTN or VoIP. When speakers in the same meeting are speaking from the same room or device, they are identified as one speaker in the meeting transcript. Webex Meetings recordings are provided with transcriptions, which is very useful, but without knowing who said what it is more difficult for humans to skim through the content and for AI solutions to provide accurate results. Being able to answer "Who spoke when?" would therefore allow colleagues who might have missed the meeting to quickly catch up with what was said, as well as provide automatic highlights and summaries.

In this post we'll cover:

- A Fingerprint for your voice: we will discuss our approach to building the deep neural network responsible for transforming audio inputs into voice fingerprints.
- Clustering: after transforming a sequence of audio inputs into a sequence of voice fingerprints, we'll show how we solved the problem of assigning a speaker label to each segment and grouping segments from the same speakers together.
- Data pipeline: all AI models require data in order to learn the task, and in this section we'll share insights on the data we have available and the strategies we adopted to label it automatically.
- Integration with Webex: in this section we will talk about the work we've done in order to deploy the speaker diarization system to production as an additional module to our meeting transcription pipeline.
Assigning speaker labels to an audio file is straightforward and can be divided into 3 steps:

1. Split Audio: the first thing we want to do is split the audio input into smaller audio chunks of the same length and discard all segments that do not contain voice, thereby discarding silence and background noise. We use an off-the-shelf solution: the WebRTC Voice Activity Detector.
2. Compute Voice Fingerprints: the next step involves transforming each audio chunk into a "Voice Fingerprint". These fingerprints are 256-dimensional vectors.
3. Cluster: the objective of this step is to group together segments that are similar to each other, so that segments from the same speaker receive the same label.
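To make these steps concrete, here is a minimal Python sketch of the pipeline. It is illustrative only: the chunk length, the VAD aggressiveness, the clustering algorithm (agglomerative clustering over cosine distances) and its distance threshold are assumptions rather than our production settings, and `fingerprint()` is a placeholder for the neural network described below.

```python
import numpy as np
import webrtcvad
from sklearn.cluster import AgglomerativeClustering

SAMPLE_RATE = 16000      # webrtcvad expects 16-bit mono PCM at 8/16/32/48 kHz
FRAME_MS = 30            # webrtcvad accepts 10, 20 or 30 ms frames
CHUNK_SECONDS = 1.5      # illustrative fixed chunk length


def split_audio(pcm: bytes, vad_level: int = 2):
    """Step 1: cut the recording into fixed-length chunks and keep only
    the ones in which the voice activity detector finds speech."""
    vad = webrtcvad.Vad(vad_level)
    frame_bytes = int(SAMPLE_RATE * FRAME_MS / 1000) * 2       # 2 bytes per sample
    chunk_bytes = int(SAMPLE_RATE * CHUNK_SECONDS) * 2
    for start in range(0, len(pcm) - chunk_bytes + 1, chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        frames = [chunk[i:i + frame_bytes]
                  for i in range(0, len(chunk) - frame_bytes + 1, frame_bytes)]
        voiced = sum(vad.is_speech(f, SAMPLE_RATE) for f in frames)
        if voiced / max(len(frames), 1) > 0.5:                 # mostly speech -> keep
            yield start / (2 * SAMPLE_RATE), chunk             # (start time in s, audio)


def fingerprint(chunk: bytes) -> np.ndarray:
    """Step 2 placeholder: the real system runs the fingerprint network;
    here we just return a random unit-length 256-dimensional vector."""
    v = np.random.randn(256)
    return v / np.linalg.norm(v)


def diarize(pcm: bytes):
    """Step 3: group similar fingerprints so each chunk gets a speaker label."""
    times, chunks = zip(*split_audio(pcm))
    embeddings = np.stack([fingerprint(c) for c in chunks])
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.7,
        metric="cosine", linkage="average",   # use affinity="cosine" on scikit-learn < 1.2
    ).fit_predict(embeddings)
    return list(zip(times, labels))           # [(start_time, speaker_id), ...]
```

Letting a distance threshold determine the number of clusters means the number of speakers does not have to be known in advance.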
Let's look more closely at how the voice fingerprints are computed. Meetings can occur in varied settings, with different microphones and background noises, and we do not want the quality of the diarization to depend on language, accent, gender, or age. We designed the neural network responsible for computing the voice fingerprints to be robust to these factors. This is made possible by choosing the proper neural architecture, a vast amount of training data, and data augmentation techniques.

The architecture of the neural network can be split into 2 parts: preprocessing and feature extraction. The preprocessing part transforms the 1-dimensional audio input into a 2-dimensional representation. The standard approach is to compute the Spectrogram or the Mel-frequency cepstral coefficients (MFCCs); our approach is to let the neural network learn this transformation as a sequence of 3 1-D convolutions. The reasoning behind this choice is twofold. First off, given enough data, our hope is that the learned transformation will be of higher quality for the downstream task. Second, we export the network to the ONNX format in order to speed up inference, and as of now the operations needed to compute the MFCCs are not supported. For the feature extraction we rely on an architecture commonly used for Computer Vision tasks: ResNet18.
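As an illustration of this design, below is a hedged PyTorch sketch: three 1-D convolutions stand in for the Spectrogram/MFCC front end, a ResNet18 trunk produces a 256-dimensional fingerprint, and the whole model is exported to ONNX. The kernel sizes, strides, channel counts, and the input/output names passed to the exporter are illustrative guesses, not the values used in our production model.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class VoiceFingerprinter(nn.Module):
    """Sketch of the described architecture: a learned 1-D convolutional
    front end followed by a ResNet18 trunk outputting a 256-d fingerprint."""

    def __init__(self, embedding_dim: int = 256):
        super().__init__()
        # Preprocessing: three 1-D convolutions turn the raw waveform
        # (batch, 1, samples) into a 2-D time/channel representation,
        # replacing a hand-crafted Spectrogram/MFCC front end.
        self.preprocess = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160, padding=200), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Feature extraction: an off-the-shelf ResNet18. Its first layer expects
        # a 2-D "image", so the (channels x frames) map is fed as one channel.
        self.trunk = resnet18(num_classes=embedding_dim)
        self.trunk.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of mono audio
        x = self.preprocess(waveform.unsqueeze(1))    # (batch, 64, frames)
        x = x.unsqueeze(1)                            # (batch, 1, 64, frames)
        emb = self.trunk(x)                           # (batch, 256)
        return nn.functional.normalize(emb, dim=-1)   # unit-length fingerprint


if __name__ == "__main__":
    model = VoiceFingerprinter().eval()
    dummy = torch.randn(1, 16000)                     # one second of fake 16 kHz audio
    print(model(dummy).shape)                         # torch.Size([1, 256])

    # Export to ONNX so inference does not depend on the Python front end.
    torch.onnx.export(
        model, dummy, "voice_fingerprinter.onnx",
        input_names=["waveform"], output_names=["fingerprint"],
        dynamic_axes={"waveform": {0: "batch", 1: "samples"}},
    )
```

Because the front end is built from ordinary convolutions, the export needs no MFCC-specific operators, which is exactly what makes the ONNX route viable.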