{"id":2534,"date":"2022-01-22T09:00:29","date_gmt":"2022-01-22T01:00:29","guid":{"rendered":"https:\/\/blog.ailia.ai\/uncategorized\/voicefilter-voice-separation-model-that-can-extract-voice-of-any-person\/"},"modified":"2025-05-20T21:22:42","modified_gmt":"2025-05-20T13:22:42","slug":"voicefilter-voice-separation-model-that-can-extract-voice-of-any-person","status":"publish","type":"post","link":"https:\/\/blog.ailia.ai\/en\/tips-en\/voicefilter-voice-separation-model-that-can-extract-voice-of-any-person\/","title":{"rendered":"VoiceFilter : Targeted Voice Separation Model"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\" id=\"8916\"><strong>Overview<\/strong><\/h3>\n\n\n\n<p id=\"0e88\"><em>VoiceFilter\u00a0<\/em>is a speech separation model developed by Google AI and released in May 2020. It can extract the voice of a designated person from an audio file in which multiple people are speaking at the same time.<\/p>\n\n\n\n<p><a href=\"https:\/\/arxiv.org\/abs\/1810.04826\" target=\"_blank\" rel=\"noreferrer noopener\">VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/mindslab-ai\/voicefilter\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub &#8211; mindslab-ai\/voicefilter<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"71ad\"><strong>Architecture<\/strong><\/h3>\n\n\n\n<p id=\"97e0\">Although the performance of speech recognition has improved in recent years, its accuracy in environments where multiple people are talking remains insufficient. 
To solve this problem, it is important to improve recognition by using&nbsp;<em>speech separation<\/em>&nbsp;to extract the voices of the different speakers. However, determining how many people are speaking at a given moment is itself a difficult problem. In addition, speakers need to be labeled, which requires techniques such as&nbsp;<em>deep clustering<\/em>&nbsp;or a&nbsp;<em>deep attractor network<\/em>.<\/p>\n\n\n\n<p id=\"55ea\">The method proposed in this model treats all voices other than that of the target speaker (whose voice we want to extract) as noise. In addition, it is assumed that a sample of the target speaker\u2019s voice is provided for reference. This method is similar to the traditional task of&nbsp;<em>speech separation<\/em>, but it is targeted at a designated individual. This speaker-dependent speech separation task is often referred to as&nbsp;<em>voice filtering<\/em>.<\/p>\n\n\n\n<p id=\"bc4e\">The&nbsp;<em>VoiceFilter&nbsp;<\/em>architecture uses two models: a&nbsp;<em>speaker recognition network<\/em>&nbsp;that produces speaker-discriminative embeddings (aka&nbsp;<em>d-vectors<\/em>), and a&nbsp;<em>spectrogram masking network<\/em>&nbsp;that takes as input a noisy spectrogram of several people talking together with the target speaker embedding, and extracts that speaker\u2019s voice.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"906\" height=\"470\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-51.png\" alt=\"\" class=\"wp-image-303\"\/><figcaption class=\"wp-element-caption\">Source:\u00a0<a href=\"https:\/\/github.com\/mindslab-ai\/voicefilter\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/mindslab-ai\/voicefilter<\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"2f5f\">The&nbsp;<em>speaker recognition network<\/em>&nbsp;computes the speaker embedding (d-vector) using a 3-layer LSTM. 
It takes as input a spectrogram extracted from 1600 ms windows, and outputs a speaker embedding with a fixed dimension of 256. The final d-vector is computed by sliding the window with 50% overlap and averaging the L2-normalized d-vectors obtained on each window.<\/p>\n\n\n\n<p id=\"0818\">The&nbsp;<em>VoiceFilter&nbsp;<\/em>system calculates a&nbsp;<em>magnitude spectrogram<\/em>&nbsp;from \u201cnoisy audio\u201d (a recording with the mixed voices of multiple people). This&nbsp;<em>magnitude spectrogram&nbsp;<\/em>is then multiplied by the soft mask predicted by the masking network conditioned on the d-vector, and combined with the phase of the noisy audio. An inverse STFT is then applied to obtain the output waveform.<\/p>\n\n\n\n<p id=\"a7be\"><em>Word Error Rate (WER)<\/em>&nbsp;was used to evaluate the model\u2019s accuracy on the&nbsp;<em>LibriSpeech&nbsp;<\/em>and&nbsp;<em>VCTK&nbsp;<\/em>datasets. The speech recognizer used for the WER evaluation was trained on a YouTube dataset.<\/p>\n\n\n\n<p id=\"33e7\">In the tables below,&nbsp;<em>Clean WER<\/em>&nbsp;refers to the WER for clean audio, and&nbsp;<em>Noisy WER<\/em>&nbsp;refers to the WER for noisy audio. 
Using&nbsp;<em>VoiceFilter<\/em>, the error rate for noisy audio is reduced from 55.9% to 23.4%.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"493\" height=\"393\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-50.png\" alt=\"\" class=\"wp-image-302\"\/><figcaption class=\"wp-element-caption\">Source:\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1810.04826\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/abs\/1810.04826<\/a><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"aef2\"><strong>Usage<\/strong><\/h3>\n\n\n\n<p id=\"41ec\">You can use\u00a0<em>VoiceFilter\u00a0<\/em>with the ailia SDK using the following command.\u00a0<code>mixed.wav<\/code>\u00a0refers to the audio of several people talking, and\u00a0<code>ref-voice.wav<\/code>\u00a0is the reference sample of the target speaker\u2019s voice.<\/p>\n\n\n\n<p><code>$ python3 voicefilter.py --input mixed.wav --reference_file ref-voice.wav<\/code><\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/axinc-ai\/ailia-models\/tree\/master\/audio_processing\/voicefilter\" target=\"_blank\" rel=\"noreferrer noopener\">ailia-models\/audio_processing\/voicefilter<\/a><\/p>\n\n\n\n<p id=\"498e\"><a href=\"https:\/\/axinc.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">ax Inc.<\/a>&nbsp;has developed&nbsp;<a href=\"https:\/\/ailia.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">ailia SDK<\/a>, which enables cross-platform, GPU-based rapid inference.<\/p>\n\n\n\n<p id=\"498e\">ax Inc. provides a wide range of services, from consulting and model creation to the development of AI-based applications and SDKs. 
Feel free to&nbsp;<a href=\"https:\/\/axinc.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">contact us<\/a>&nbsp;for any inquiry.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Overview VoiceFilter\u00a0is a speech separation model developed by Google AI and released in May 2020. It is capable of extracting the voice of a designated person from an audio file in which multiple people are speaking at the same time. VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking GitHub &#8211; mindslab-ai\/voicefilter Architecture Although the performance of speech recognition has increased in recent years, the accuracy in environments where multiple people are talking is not sufficient. To solve this problem, it is important to improve the recognition by using&nbsp;speech separation&nbsp;to extract the voices from different speakers. However, it is a complex problem to know how many people are currently speaking. 
Also, [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":2433,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[255],"tags":[266],"class_list":["post-2534","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tips-en","tag-ailiamodels-en"],"acf":[],"_links":{"self":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2534","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/comments?post=2534"}],"version-history":[{"count":1,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2534\/revisions"}],"predecessor-version":[{"id":2536,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2534\/revisions\/2536"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/media\/2433"}],"wp:attachment":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/media?parent=2534"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/categories?post=2534"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/tags?post=2534"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}