{"id":2498,"date":"2021-09-27T09:00:30","date_gmt":"2021-09-27T01:00:30","guid":{"rendered":"https:\/\/blog.ailia.ai\/uncategorized\/autospeech-speech-based-personal-identification-model\/"},"modified":"2025-05-20T17:18:31","modified_gmt":"2025-05-20T09:18:31","slug":"autospeech-speech-based-personal-identification-model","status":"publish","type":"post","link":"https:\/\/blog.ailia.ai\/en\/tips-en\/autospeech-speech-based-personal-identification-model\/","title":{"rendered":"AutoSpeech : Speech-based person identification model"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\" id=\"a7e4\"><strong>Overview<\/strong><\/h3>\n\n\n\n<p id=\"d8c9\"><em>AutoSpeech\u00a0<\/em>is a machine learning model that can identify individuals from their speech. By inputting two audio files and generating the feature vectors of each recording, the degree of similarity between the two files can be computed. This method can be used to match a recording against feature vectors of people voices stored in a database for identification. It can be used for voice biometric authentication, or identifying speakers in speech transcriptions.<\/p>\n\n\n\n<p><a href=\"https:\/\/arxiv.org\/abs\/2005.03215?source=post_page-----267a00f26a4a--------------------------------\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><a href=\"https:\/\/arxiv.org\/abs\/2005.03215\" target=\"_blank\" rel=\"noreferrer noopener\">AutoSpeech: Neural Architecture Search for Speaker Recognition<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"3e6f\"><strong>Architecture<\/strong><\/h3>\n\n\n\n<p id=\"1834\">There are two main tasks for speaker recognition: Speaker IDentification (SID) and Speaker Verification (SV). 
In recent years, end-to-end speaker recognition systems have emerged and achieved state-of-the-art performance.<\/p>\n\n\n\n<p id=\"bbe2\">In end-to-end speaker recognition, a Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN) is used as a feature extractor for each audio frame, and the frame-level features are then turned into a fixed-length speaker embedding (<em>d-vector<\/em>) by a temporal aggregation layer. Finally,&nbsp;<em>cosine similarity<\/em>&nbsp;between those embeddings produces the final speaker identification decision.<\/p>\n\n\n\n<p id=\"b375\"><em>VGG&nbsp;<\/em>and&nbsp;<em>ResNet&nbsp;<\/em>architectures are usually used for feature extraction. However, these architectures were designed for image classification and are not optimal for speaker recognition.<\/p>\n\n\n\n<p id=\"15be\"><em>AutoSpeech&nbsp;<\/em>uses&nbsp;<em>Neural Architecture Search (NAS)&nbsp;<\/em>to search for the best network architecture. The search space is the set of the following layers:<\/p>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"500\" height=\"122\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-28.png\" alt=\"\" class=\"wp-image-241\" style=\"width:500px;height:auto\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/arxiv.org\/abs\/2005.03215\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/abs\/2005.03215<\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"5e2f\">The NAS process uses two types of neural cells:\u00a0<em>normal cells<\/em>\u00a0that keep the spatial resolution of the feature tensor (the feature map size), and\u00a0<em>reduction cells<\/em>\u00a0that shrink it. For example, in\u00a0<em>VGG<\/em>,\u00a0<code>Conv -> Relu<\/code> corresponds to a\u00a0<em>normal cell<\/em>\u00a0and\u00a0<code>MaxPooling<\/code>\u00a0corresponds to a\u00a0<em>reduction cell<\/em>. 
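<\/p>\n\n\n\n<p>The VGG analogy can be made concrete with shapes alone: a normal cell leaves the feature map size unchanged, while a reduction cell halves the height and width. The sketch below only illustrates that shape behaviour; the ReLU stand-in and 2x2 pooling are assumptions for illustration, not the actual searched AutoSpeech cells:<\/p>

```python
import numpy as np

def normal_cell(x: np.ndarray) -> np.ndarray:
    # Stand-in for Conv -> ReLU with 'same' padding:
    # the spatial resolution (H, W) is preserved.
    return np.maximum(x, 0.0)

def reduction_cell(x: np.ndarray) -> np.ndarray:
    # Stand-in for 2x2 max pooling with stride 2:
    # halves the spatial resolution.
    h, w, c = x.shape
    x = x[:h - h % 2, :w - w % 2]  # drop odd trailing row/column
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

x = np.random.rand(64, 64, 16)
print(normal_cell(x).shape)     # (64, 64, 16)
print(reduction_cell(x).shape)  # (32, 32, 16)
```

\n\n\n\n<p>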
These cells are stacked 8 times to form the final model architecture.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"490\" height=\"346\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-29.png\" alt=\"\" class=\"wp-image-242\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/arxiv.org\/abs\/2005.03215\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/abs\/2005.03215<\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"cb7c\">The\u00a0<em>VoxCeleb1<\/em>\u00a0dataset was used for training and evaluation.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.robots.ox.ac.uk\/~vgg\/data\/voxceleb\/vox1.html\" target=\"_blank\" rel=\"noreferrer noopener\">VoxCeleb<\/a><\/p>\n\n\n\n<p id=\"ac48\">The evaluation results are shown below. 
The proposed method outperforms those based on\u00a0<em>VGG\u00a0<\/em>and\u00a0<em>ResNet<\/em>.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"523\" height=\"319\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-30.png\" alt=\"\" class=\"wp-image-244\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/arxiv.org\/abs\/2005.03215\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/abs\/2005.03215<\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"4afa\">Processing is performed on the STFT spectrum of audio files sampled at 16 kHz. The audio file is divided into frames, a feature vector is computed for each frame, and the mean over all frames is taken as the final fixed-length feature vector. The\u00a0<em>cosine similarity<\/em>\u00a0is then computed by normalizing the two feature vectors and taking their inner product.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"a068\"><strong>Usage<\/strong><\/h3>\n\n\n\n<p id=\"5184\">The following command takes two audio files as input and outputs their similarity.<\/p>\n\n\n\n<p><code>$ python3 auto_speech.py --input1 wav\/id10270\/8jEAjG6SegY\/00008.wav --input2 wav\/id10270\/x6uYqmx31kE\/00001.wav<\/code><\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/axinc-ai\/ailia-models\/tree\/master\/audio_processing\/auto_speech\" target=\"_blank\" rel=\"noreferrer noopener\">ailia-models\/audio_processing\/auto_speech<\/a><\/p>\n\n\n\n<p id=\"bf52\">Here is an example output. 
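<\/p>\n\n\n\n<p>The \u201cmatch\u201d decision itself boils down to averaging the frame-level features into one embedding per file and comparing the two embeddings with cosine similarity. Below is a minimal sketch of that computation in plain NumPy; it is not the actual ailia implementation, and the embedding size (512) and random feature values are illustrative assumptions (only the 0.260 threshold appears in the output below):<\/p>

```python
import numpy as np

def utterance_embedding(frame_features: np.ndarray) -> np.ndarray:
    # Average frame-level feature vectors (shape: [frames, dim])
    # into a single fixed-length d-vector.
    return frame_features.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Normalize each d-vector, then take the inner product.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

THRESHOLD = 0.260  # decision threshold shown in the example output

rng = np.random.default_rng(0)
emb1 = utterance_embedding(rng.standard_normal((100, 512)))  # file 1
emb2 = utterance_embedding(rng.standard_normal((120, 512)))  # file 2
score = cosine_similarity(emb1, emb2)
print('match' if score > THRESHOLD else 'unmatch')
```

\n\n\n\n<p>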
If the similarity is greater than the threshold, the two recordings are judged to come from the same person and the output is \u201c<em>match<\/em>\u201d.<\/p>\n\n\n\n<p><code>INFO auto_speech.py (229) : Start inference\u2026<br>INFO auto_speech.py (243) : similar: 0.42532125<br>INFO auto_speech.py (245) : verification: match (threshold: 0.260)<\/code><\/p>\n\n\n\n<p id=\"99d9\">The model was trained on a dataset of English speech, but let\u2019s test it on Japanese sentences using the audio file library below.<\/p>\n\n\n\n<p><a href=\"https:\/\/soundeffect-lab.info\/sound\/voice\/info-lady1.html?source=post_page-----267a00f26a4a--------------------------------\" target=\"_blank\" rel=\"noreferrer noopener\">Sound Effects Lab &#8211; Download free, commercial free, report-free sound effects<\/a><\/p>\n\n\n\n<p id=\"4a15\">Across various inferences, sentences from the same person matched with similarities between 0.41 and 0.80, while similar sentences (the same words) spoken by two different people did not match, with a similarity of 0.228. The same model can therefore also be used for other languages, in this case Japanese.<\/p>\n\n\n\n<p id=\"4a15\"><a href=\"https:\/\/axinc.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">ax Inc.<\/a>&nbsp;has developed&nbsp;<a href=\"https:\/\/ailia.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">ailia SDK<\/a>, which enables cross-platform, GPU-based rapid inference.<\/p>\n\n\n\n<p id=\"4a15\">ax Inc. provides a wide range of services from consulting and model creation to the development of AI-based applications and SDKs. 
Feel free to&nbsp;<a href=\"https:\/\/axinc.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">contact us<\/a>&nbsp;for any inquiry.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Overview AutoSpeech\u00a0is a machine learning model that can identify individuals from their speech. By inputting two audio files and generating the feature vectors of each recording, the degree of similarity between the two files can be computed. This method can be used to match a recording against feature vectors of people's voices stored in a database for identification. It can be used for voice biometric authentication, or identifying speakers in speech transcriptions. AutoSpeech: Neural Architecture Search for Speaker Recognition Architecture There are two main tasks for speaker recognition: Speaker IDentification (SID) and Speaker Verification (SV). In recent years, end-to-end speaker recognition systems have emerged and achieved state-of-the-art performance. 
In end-to-end [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":2425,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[255],"tags":[266],"class_list":["post-2498","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tips-en","tag-ailiamodels-en"],"acf":[],"_links":{"self":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2498","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/comments?post=2498"}],"version-history":[{"count":1,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2498\/revisions"}],"predecessor-version":[{"id":2500,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2498\/revisions\/2500"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/media\/2425"}],"wp:attachment":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/media?parent=2498"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/categories?post=2498"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/tags?post=2498"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}