{"id":2508,"date":"2021-10-20T09:00:30","date_gmt":"2021-10-20T01:00:30","guid":{"rendered":"https:\/\/blog.ailia.ai\/uncategorized\/yolox-object-detection-model-beyond-yolov5\/"},"modified":"2025-05-20T17:33:10","modified_gmt":"2025-05-20T09:33:10","slug":"yolox-object-detection-model-beyond-yolov5","status":"publish","type":"post","link":"https:\/\/blog.ailia.ai\/en\/tips-en\/yolox-object-detection-model-beyond-yolov5\/","title":{"rendered":"YOLOX : Object detection model exceeding YOLOv5"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\" id=\"259e\"><strong>Overview<\/strong><\/h3>\n\n\n\n<p id=\"b61d\"><em>YOLOX\u00a0<\/em>is a state-of-the-art object detection model released in August 2021, which combines performance beyond\u00a0<a href=\"https:\/\/medium.com\/axinc-ai\/yolov5-the-latest-model-for-object-detection-b13320ec516b\">YOLOv5<\/a>\u00a0with a permissive Apache license.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"427\" height=\"147\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-23.png\" alt=\"\" class=\"wp-image-230\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/github.com\/Megvii-BaseDetection\/YOLOX\/blob\/main\/assets\/logo.png\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/Megvii-BaseDetection\/YOLOX\/blob\/main\/assets\/logo.png<\/a><\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"243\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-26.jpg\" alt=\"\" class=\"wp-image-234\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/github.com\/Megvii-BaseDetection\/YOLOX\/blob\/main\/assets\/demo.png\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/Megvii-BaseDetection\/YOLOX\/blob\/main\/assets\/demo.png<\/a><\/figcaption><\/figure>\n\n\n\n<p><a 
href=\"https:\/\/arxiv.org\/abs\/2107.08430\" target=\"_blank\" rel=\"noreferrer noopener\">YOLOX: Exceeding YOLO Series in 2021<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/Megvii-BaseDetection\/YOLOX\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub &#8211; Megvii-BaseDetection\/YOLOX<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"bf91\"><strong>Architecture<\/strong><\/h3>\n\n\n\n<p id=\"758e\"><em>YOLOX&nbsp;<\/em>is an anchor-free version of the conventional&nbsp;<em>YOLO<\/em>&nbsp;that introduces a&nbsp;<em>decoupled head<\/em>&nbsp;and&nbsp;<em>SimOTA<\/em>. The model won first place in the&nbsp;<em>Streaming Perception Challenge<\/em>&nbsp;at the&nbsp;<em>CVPR 2021 Autonomous Driving Workshop<\/em>.<\/p>\n\n\n\n<p id=\"d8f1\">Because the existing&nbsp;<a href=\"https:\/\/medium.com\/axinc-ai\/yolov4-a-machine-learning-model-to-detect-the-position-and-type-of-an-object-4f108ed0507b\"><em>YOLOv4<\/em><\/a>&nbsp;and&nbsp;<a href=\"https:\/\/medium.com\/axinc-ai\/yolov5-the-latest-model-for-object-detection-b13320ec516b\"><em>YOLOv5<\/em><\/a>&nbsp;pipelines are over-optimized for anchor-based detection, YOLOX instead uses&nbsp;<a href=\"https:\/\/medium.com\/axinc-ai\/yolov3-a-machine-learning-model-to-detect-the-position-and-type-of-an-object-60f1c18f8107\"><em>YOLOv3-SPP<\/em><\/a>&nbsp;as its baseline. This baseline was then updated with the more recent&nbsp;<em>YOLOv5<\/em>-style architecture, which adopts an advanced&nbsp;<em>CSPNet<\/em>&nbsp;backbone and an additional&nbsp;<em>PAN<\/em>&nbsp;head.<\/p>\n\n\n\n<p id=\"327b\">In object 
detection models, the tasks of&nbsp;<em>classification&nbsp;<\/em>and&nbsp;<em>regression&nbsp;<\/em>(calculation of bounding box positions) are performed simultaneously, which is known to cause a conflict between the two tasks and reduce accuracy. To solve this problem, the concept of a&nbsp;<em>decoupled head<\/em>&nbsp;was introduced. The conventional&nbsp;<em>YOLO<\/em>&nbsp;series backbone and feature pyramids still use a classic&nbsp;<em>coupled head<\/em>, but&nbsp;<em>YOLOX&nbsp;<\/em>has been updated to use a&nbsp;<em>decoupled head<\/em>, achieving higher accuracy.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"772\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-26-1.jpg\" alt=\"\" class=\"wp-image-236\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/arxiv.org\/pdf\/2107.08430.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/pdf\/2107.08430.pdf<\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"73ac\"><em>YOLOX<\/em>&nbsp;was trained on a dataset that was strongly augmented using the Mosaic and MixUp strategies. 
The authors also use&nbsp;<em>SimOTA<\/em>, an advanced label-assignment strategy and a simplified version of&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2103.14259\" rel=\"noreferrer noopener\" target=\"_blank\">OTA<\/a>, to optimize the loss.<\/p>\n\n\n\n<p id=\"20d7\">The contribution of each newly introduced technique is as follows.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"544\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-27.png\" alt=\"\" class=\"wp-image-235\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/arxiv.org\/pdf\/2107.08430.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/pdf\/2107.08430.pdf<\/a><\/figcaption><\/figure>\n\n\n\n<p id=\"5b24\">The benchmark results of\u00a0<em>YOLOX\u00a0<\/em>are shown below.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"580\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-26.png\" alt=\"\" class=\"wp-image-237\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/arxiv.org\/pdf\/2107.08430.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/pdf\/2107.08430.pdf<\/a><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"a51b\"><strong>YOLOX model variants<\/strong><\/h3>\n\n\n\n<p id=\"0a5e\"><em>YOLOX<\/em>\u00a0comes in two categories of variants:\u00a0<em>Standard Models<\/em>\u00a0for high accuracy and\u00a0<em>Light Models<\/em>\u00a0for edge devices.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"1176\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-28.jpg\" alt=\"\" class=\"wp-image-238\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/github.com\/Megvii-BaseDetection\/YOLOX\" target=\"_blank\" rel=\"noreferrer 
noopener\">https:\/\/github.com\/Megvii-BaseDetection\/YOLOX<\/a><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"27bc\"><strong>YOLOX performance<\/strong><\/h3>\n\n\n\n<p id=\"cff6\">Inference time and mAP50 were measured on the COCO 2017 validation set.\u00a0<em>YOLOX-s<\/em>\u00a0achieves the same accuracy as\u00a0<em>YOLOv4\u00a0<\/em>in half the processing time.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"850\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-25.png\" alt=\"\" class=\"wp-image-233\"\/><figcaption class=\"wp-element-caption\">mAP50 of YOLOX<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"850\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-24.png\" alt=\"\" class=\"wp-image-232\"\/><figcaption class=\"wp-element-caption\">Inference time of YOLOX<\/figcaption><\/figure>\n\n\n\n<p id=\"c9df\">The following repository and ailia SDK 1.2.8 were used to measure mAP and inference time.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/rafaelpadilla\/Object-Detection-Metrics\/\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub &#8211; rafaelpadilla\/Object-Detection-Metrics<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"c56e\"><strong>CVPR 2021 Autonomous Driving Workshop Streaming Perception Challenge<\/strong><\/h3>\n\n\n\n<p id=\"f04e\">The link below is the leaderboard of the\u00a0<em>Streaming Perception Challenge<\/em>\u00a0at the\u00a0<em>CVPR 2021 Autonomous Driving Workshop<\/em>, in which\u00a0<em>YOLOX\u00a0<\/em>won first place under the name\u00a0<em>BaseDet<\/em>.<\/p>\n\n\n\n<p><a 
href=\"https:\/\/eval.ai\/web\/challenges\/challenge-page\/800\/overview\" target=\"_blank\" rel=\"noreferrer noopener\">EvalAI: Evaluating state of the art in AI<\/a><\/p>\n\n\n\n<p id=\"9098\">This challenge used the\u00a0<a href=\"https:\/\/www.cs.cmu.edu\/~mengtial\/proj\/streaming\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Argoverse-HD<\/em><\/a>\u00a0dataset, which extends the\u00a0<a href=\"https:\/\/www.argoverse.org\/data.html\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Argoverse 1.1<\/em><\/a>\u00a0autonomous driving dataset with COCO-style 2D bounding box annotations. The dataset contains 1,250,000 bounding boxes annotated on car frontal-camera videos.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"955\" height=\"475\" src=\"https:\/\/blog.ailia.ai\/wp-content\/uploads\/image-24.jpg\" alt=\"\" class=\"wp-image-231\"\/><figcaption class=\"wp-element-caption\">Source: <a href=\"https:\/\/www.cs.cmu.edu\/~mengtial\/proj\/streaming\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.cs.cmu.edu\/~mengtial\/proj\/streaming\/<\/a><\/figcaption><\/figure>\n\n\n\n<p><a href=\"https:\/\/www.cs.cmu.edu\/~mengtial\/proj\/streaming\/\" target=\"_blank\" rel=\"noreferrer noopener\">Streaming Perception<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"8879\"><strong>Usage<\/strong><\/h3>\n\n\n\n<p id=\"3846\">YOLOX can be run with the ailia SDK using the following command to detect objects in a webcam video stream.<\/p>\n\n\n\n<p><code>$ 
python3 yolox.py -v 0<\/code><\/p>\n\n\n\n<p id=\"80ba\">By default,\u00a0<em>YOLOX-s<\/em>\u00a0is used. Other models, including the tiny models, can be selected with the\u00a0<code>-m<\/code>\u00a0option.<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/axinc-ai\/ailia-models\/tree\/master\/object_detection\/yolox\" target=\"_blank\" rel=\"noreferrer noopener\">ailia-models\/object_detection\/yolox<\/a><\/p>\n\n\n\n<p id=\"41ab\"><a href=\"https:\/\/axinc.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">ax Inc.<\/a>&nbsp;has developed&nbsp;<a href=\"https:\/\/ailia.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">ailia SDK<\/a>, which enables cross-platform, GPU-based rapid inference.<\/p>\n\n\n\n<p id=\"120a\">ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to&nbsp;<a href=\"https:\/\/axinc.jp\/en\/\" rel=\"noreferrer noopener\" target=\"_blank\">contact us<\/a>&nbsp;for any inquiry.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Overview YOLOX\u00a0is a state-of-the-art object detection model released in August 2021, which combines performance beyond\u00a0YOLOv5\u00a0with a permissive Apache license. YOLOX: Exceeding YOLO Series in 2021 GitHub &#8211; Megvii-BaseDetection\/YOLOX Architecture YOLOX&nbsp;is an object detection model that is an anchor-free version of the conventional&nbsp;YOLO&nbsp;and introduces&nbsp;decoupled head&nbsp;and&nbsp;SimOTA. This model was awarded first place of the&nbsp;Streaming Perception Challenge&nbsp;at&nbsp;CVPR2021 Automatic Driving Workshop. 
Since the existing&nbsp;YOLOv4&nbsp;and&nbsp;YOLOv5&nbsp;pipelines are over-optimized for the use of anchors, YOLOX has been improved with&nbsp;YOLOv3-SPP&nbsp;as a baseline.&nbsp;YOLOv3-SPP&nbsp;was updated to use the advanced&nbsp;YOLOv5&nbsp;architecture that adopts an advanced&nbsp;CSPNet&nbsp;backbone and an additional&nbsp;PAN head. In object detection models, the tasks of&nbsp;classification&nbsp;and&nbsp;regression&nbsp;(calculation of bounding box positions) are performed simultaneously, which is known to cause conflicts and reduce accuracy. [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":2438,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[255],"tags":[266],"class_list":["post-2508","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tips-en","tag-ailiamodels-en"],"acf":[],"_links":{"self":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2508","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/comments?post=2508"}],"version-history":[{"count":1,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2508\/revisions"}],"predecessor-version":[{"id":2510,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/posts\/2508\/revisions\/2510"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/media\/2438"}],"wp:attachment":[{"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/media?parent=2508"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/categories?post=2508"},{"t
axonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.ailia.ai\/en\/wp-json\/wp\/v2\/tags?post=2508"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}