DPT (Dense Prediction Transformer) is a segmentation model that utilizes the Vision Transformer architecture.


ailia.AI Editorial Team

Here’s an introduction to “DPT,” a machine learning model available for use with the ailia SDK. By combining machine learning models published in ailia MODELS with the edge inference framework of the ailia SDK, you can easily add AI functionality to your applications.

Summary of DPT

DPT (Dense Prediction Transformers) is a segmentation model that applies Transformers to images, released by Intel in March 2021. It can perform semantic segmentation and monocular depth estimation. In monocular depth estimation, it improves relative performance by up to 28% over a state-of-the-art fully convolutional network. In semantic segmentation, it achieves 49.02% mIoU on ADE20K, setting a new state of the art (SOTA).

Vision Transformers for Dense Prediction
https://arxiv.org/abs/2103.13413

Architecture of DPT

DPT uses a Vision Transformer in place of convolutional networks. By leveraging Transformers, DPT can make finer-grained and more globally consistent predictions than convolutional networks, and the benefit grows when a large amount of training data is available.

In the encoder of DPT, the image is divided into tiles and tokenized by an embedding layer, then processed by the Transformer. There are two tokenization methods: a patch-based approach that simply divides the image into tiles, and a hybrid approach that first applies ResNet50 to the input image, then divides the resulting pixel feature map into tiles and tokenizes them.

Source: https://arxiv.org/pdf/2103.13413
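As a rough illustration, the patch-based tokenization can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, position embeddings are omitted, and patch size 16 with D = 768 assumes the ViT-Base configuration.

import torch

def tokenize_patches(image, patch_size=16, embed_dim=768):
    # image: (B, 3, H, W); H and W are assumed to be multiples of patch_size
    B, C, H, W = image.shape
    # cut the image into non-overlapping patch_size x patch_size tiles
    tiles = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (B, C, H/p, W/p, p, p) -> (B, N, C*p*p), one flattened vector per tile
    tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
    # in a real model the projection and readout token are learned parameters
    embed = torch.nn.Linear(C * patch_size * patch_size, embed_dim)
    tokens = embed(tiles)                       # (B, N, D)
    readout = torch.zeros(B, 1, embed_dim)      # ViT-style readout (class) token
    return torch.cat([readout, tokens], dim=1)  # (B, N + 1, D)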

In the decoder of DPT, tokens from several stages of the Transformer are reassembled into image-like representations at multiple resolutions, and a convolutional network then fuses them to generate the segmentation image.
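The token-to-image conversion can be sketched roughly as below. The names are hypothetical, and simply dropping the readout token is only one of several readout-handling variants described in the paper.

import torch

def reassemble(tokens, h, w, in_dim=768, out_dim=256):
    # tokens: (B, 1 + h*w, D); token 0 is the readout token
    B = tokens.shape[0]
    spatial = tokens[:, 1:, :]  # simplest handling: ignore the readout token
    # restore the 2D tile layout: (B, D, h, w)
    fmap = spatial.transpose(1, 2).reshape(B, in_dim, h, w)
    # project to the decoder's channel count; the result is then resampled
    # and fused with other stages by the convolutional decoder
    proj = torch.nn.Conv2d(in_dim, out_dim, kernel_size=1)
    return proj(fmap)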

DPT defines three model architectures: ViT-Base, ViT-Large, and ViT-Hybrid. ViT-Base performs patch-based embedding and has 12 transformer layers. ViT-Large performs the same embedding as ViT-Base but has 24 transformer layers and a larger feature dimension D. ViT-Hybrid uses ResNet50 for embedding and has 12 transformer layers.
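Summarized as a small Python dict (the D values of 768 and 1024 are the standard ViT-Base and ViT-Large dimensions and an assumption here):

# Summary of the three backbones; D values assume the standard
# ViT-Base (768) and ViT-Large (1024) configurations
DPT_VARIANTS = {
    "ViT-Base":   {"embedding": "16x16 patches",        "layers": 12, "D": 768},
    "ViT-Large":  {"embedding": "16x16 patches",        "layers": 24, "D": 1024},
    "ViT-Hybrid": {"embedding": "ResNet50 feature map", "layers": 12, "D": 768},
}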

Performance of DPT

In the semantic segmentation task, DPT has achieved state-of-the-art performance on ADE20K, a large-scale dataset with 150 classes.

Source: https://arxiv.org/pdf/2103.13413

Furthermore, DPT achieves state-of-the-art performance when fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context.

Source: https://arxiv.org/pdf/2103.13413

Comparing MiDaS and DPT on depth estimation, DPT predicts depth details more accurately. It also improves predictions in large homogeneous regions and the relative spatial arrangement within the image, both of which are challenging for convolutional networks.

Source: https://arxiv.org/pdf/2103.13413

Comparing segmentation models, DPT tends to produce more detailed outputs at object boundaries, and in some cases its predictions are also less cluttered.

Source: https://arxiv.org/pdf/2103.13413

How to use DPT

To use DPT, run one of the following commands. You can perform segmentation or depth estimation on an input image.

$ python3 dense_prediction_transformers.py -i input.jpg -s output.png --task=segmentation -e 0
$ python3 dense_prediction_transformers.py -i input.jpg -s output.png --task=monodepth -e 0
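To call the model directly from Python, the general pattern with the ailia SDK looks roughly like the sketch below. The model file names, the 384x384 input size, and the normalization are assumptions for illustration; the actual preprocessing lives in dense_prediction_transformers.py.

import ailia
import cv2
import numpy as np

# Rough sketch of direct inference with the ailia Python API.
# File names, input size, and normalization are assumed, not verified.
net = ailia.Net("dpt_hybrid.onnx.prototxt", "dpt_hybrid.onnx", env_id=0)

img = cv2.imread("input.jpg")
img = cv2.resize(img, (384, 384)).astype(np.float32) / 255.0
x = np.expand_dims(img.transpose(2, 0, 1), 0)  # HWC -> NCHW

out = net.predict(x)  # e.g. a relative depth map for the monodepth task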

axinc-ai/ailia-models
https://github.com/axinc-ai/ailia-models

Here is an example of execution.

https://www.youtube.com/watch?v=ZEtFZxkO-04

AX Corporation develops ailia SDK, a cross-platform AI inference engine that enables fast inference using GPUs. We provide a total solution for AI, including consulting, model creation, SDK provision, development of AI-based applications and systems, and support. Please feel free to contact us for inquiries.