{"id":2384,"date":"2021-05-25T09:00:00","date_gmt":"2021-05-25T01:00:00","guid":{"rendered":"https:\/\/blog.ailia.ai\/%e6%9c%aa%e5%88%86%e9%a1%9e\/dpt-segmentation-model-using-vision-transformer\/"},"modified":"2025-05-14T15:31:10","modified_gmt":"2025-05-14T07:31:10","slug":"dpt-segmentation-model-using-vision-transformer","status":"publish","type":"post","link":"https:\/\/blog.ailia.ai\/en\/tips-en\/dpt-segmentation-model-using-vision-transformer\/","title":{"rendered":"DPT : Segmentation Model Using Vision Transformer"},"content":{"rendered":"\n<p id=\"5da4\"><strong>Overview<\/strong><\/p>\n\n\n\n<p id=\"3b23\"><em>DPT (DensePredictionTransformers)<\/em>&nbsp;is a segmentation model released by Intel in March 2021 that applies&nbsp;<em>vision transformers&nbsp;<\/em>to images. It can perform image semantic segmentation with 49.02% mIoU on ADE20K, and it can also be used for monocular depth estimation with an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network.<\/p>\n\n\n\n<p><a href=\"https:\/\/arxiv.org\/abs\/2103.13413?source=post_page-----88db4842b4a7--------------------------------\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><a href=\"https:\/\/arxiv.org\/abs\/2103.13413\" target=\"_blank\" rel=\"noreferrer noopener\">Vision Transformers for Dense Prediction<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"07aa\"><strong>Architecture<\/strong><\/h3>\n\n\n\n<p id=\"96a2\">In&nbsp;<em>DPT<\/em>, v<em>ision transformers (ViT)<\/em>are used instead of convolutional network. Using transformers allows to make more detailed and globally consistent predictions compared to convolutional networks. 
In particular, performance is improved when a large amount of training data is available.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1007\" height=\"534\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-48.png\" alt=\"\" class=\"wp-image-296\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/arxiv.org\/pdf\/2103.13413\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/arxiv.org\/pdf\/2103.13413<\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"37bd\">The&nbsp;<em>encoder&nbsp;<\/em>divides the image into tiles, which are then tokenized (<em>Embed<\/em>&nbsp;in the figure above) and processed by the transformer. The step marked&nbsp;<em>Embed<\/em>&nbsp;either divides the image into tiles in a patch-based fashion, or tokenizes the pixel feature map obtained by applying&nbsp;<em>ResNet50&nbsp;<\/em>to the input image.<\/p>\n\n\n\n<p id=\"cfd0\">The&nbsp;<em>decoder&nbsp;<\/em>in DPT converts the transformer output at each resolution into an image-like representation and uses a convolutional network to generate the segmentation image.<\/p>\n\n\n\n<p id=\"37bd\">There are three model architectures defined in DPT:&nbsp;<em>ViT-Base, ViT-Large<\/em>, and&nbsp;<em>ViT-Hybrid<\/em>.&nbsp;<em>ViT-Base<\/em>&nbsp;performs patch-based embedding and has 12 transformer layers.&nbsp;<em>ViT-Large<\/em>&nbsp;performs the same embedding as&nbsp;<em>ViT-Base<\/em>, but has 24 transformer layers and a larger feature size.&nbsp;<em>ViT-Hybrid<\/em>&nbsp;performs embedding using&nbsp;<em>ResNet50&nbsp;<\/em>and has 12 transformer layers.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"78de\"><strong>DPT accuracy<\/strong><\/h3>\n\n\n\n<p id=\"14b9\"><em>DPT&nbsp;<\/em>sets a new state of the art for the semantic segmentation task on ADE20K, a large dataset with <a href=\"https:\/\/github.com\/CSAILVision\/sceneparsing\/blob\/master\/objectInfo150.csv\" target=\"_blank\" rel=\"noreferrer 
noopener\">150 classes<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"581\" height=\"284\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-49.png\" alt=\"\" class=\"wp-image-299\"\/><figcaption class=\"wp-element-caption\">\u51fa\u5178\uff1a<a href=\"https:\/\/arxiv.org\/pdf\/2103.13413\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/arxiv.org\/pdf\/2103.13413<\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"7770\">It is also the state of the art after some fine-tuning on smaller datasets such as&nbsp;<em>NYUv2<\/em>,&nbsp;<em>KITTI<\/em>, and&nbsp;<em>Pascal Context<\/em>.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1007\" height=\"326\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-47.png\" alt=\"\" class=\"wp-image-295\"\/><figcaption class=\"wp-element-caption\">\u51fa\u5178\uff1a<a href=\"https:\/\/arxiv.org\/pdf\/2103.13413\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/arxiv.org\/pdf\/2103.13413<\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"6a80\">Below is a comparison of&nbsp;<a href=\"https:\/\/medium.com\/axinc-ai\/midas-a-machine-learning-model-for-depth-estimation-e96119cc1a3c\"><em>MiDaS<\/em><\/a><em>&nbsp;<\/em>and&nbsp;<em>DPT&nbsp;<\/em>for depth estimation. DPT is able to predict the depth inmore detail. 
It can also improve the accuracy of large homogeneous regions and relative positioning within an image, which is a shortcoming of convolutional networks.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1007\" height=\"630\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-49.jpg\" alt=\"\" class=\"wp-image-297\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/arxiv.org\/pdf\/2103.13413\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/arxiv.org\/pdf\/2103.13413<\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"cccd\">Below is a comparison for the segmentation task. DPT tends to produce more detailed output at object boundaries, and less cluttered output in some cases.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1007\" height=\"431\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-49-1.jpg\" alt=\"\" class=\"wp-image-298\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/arxiv.org\/pdf\/2103.13413\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/arxiv.org\/pdf\/2103.13413<\/a><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"ddd7\"><strong>DPT Usage<\/strong><\/h3>\n\n\n\n<p id=\"8bc6\">You can use the following commands to perform segmentation and depth estimation on the input images with ailia SDK.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ python3 dense_prediction_transformers.py -i input.jpg -s output.png --task=segmentation -e 0\n$ python3 dense_prediction_transformers.py -i input.jpg -s output.png --task=monodepth -e 0\n<\/code><\/pre>\n\n\n\n<p><a href=\"https:\/\/github.com\/axinc-ai\/ailia-models\/tree\/master\/image_segmentation\/dense_prediction_transformers\" target=\"_blank\" rel=\"noreferrer noopener\">axinc-ai\/ailia-models<\/a><\/p>\n\n\n\n<p id=\"bb81\">Here is an example of the result you can expect.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"ailia MODELS : DPT\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/ZEtFZxkO-04?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p id=\"5f2b\"><a href=\"https:\/\/axinc.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">ax Inc.<\/a>&nbsp;has developed&nbsp;<a href=\"https:\/\/ailia.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">ailia SDK<\/a>, which enables fast, GPU-based, cross-platform inference.<\/p>\n\n\n\n<p id=\"5f2b\">ax Inc. provides a wide range of services, from consulting and model creation to the development of AI-based applications and SDKs. 
Feel free to&nbsp;<a href=\"https:\/\/axinc.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">contact us<\/a>&nbsp;for any inquiry.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Overview DPT (Dense Prediction Transformers)&nbsp;is a segmentation model released by Intel in March 2021 that applies&nbsp;vision transformers&nbsp;to images. It can perform image semantic segmentation with 49.02% mIoU on ADE20K, and it can also be used for monocular depth estimation with an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. Vision Transformers for Dense Prediction Architecture In&nbsp;DPT, vision transformers (ViT) are used instead of a convolutional network. Using transformers yields more detailed and globally consistent predictions than convolutional networks. In particular, performance is improved when a large amount of training data is available. 
The&nbsp;encoder&nbsp;divides the image into tiles, which are then tokenized (Embed&nbsp;in the graph [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":2109,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[255],"tags":[266],"class_list":["post-2384","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tips-en","tag-ailiamodels-en"],"acf":[],"_links":{"self":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2384","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/comments?post=2384"}],"version-history":[{"count":3,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2384\/revisions"}],"predecessor-version":[{"id":2407,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2384\/revisions\/2407"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/media\/2109"}],"wp:attachment":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/media?parent=2384"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/categories?post=2384"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/tags?post=2384"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}