Abstract:
Real-time object detection is a computer vision task that involves identifying and localizing objects of interest within an image or video. Many challenges need to be addressed in object detection, including occlusions, scale variations, background clutter, object deformations and variations, limited data, real-time processing demands, imbalanced classes, and the need to adapt to new object categories. This project proposes a Transformer-based object detection model to tackle these challenges. The proposed model adapts Transformers, originally designed for natural language processing, to the object detection task. It leverages the self-attention mechanism in Transformers for feature extraction rather than
relying on convolutional neural networks. This allows the model to effectively capture both global and local features and to learn complex spatial relationships between objects. Furthermore, the fully connected layers in the conventional object detection method are replaced with a Transformer-based detection head in the proposed model. This modification allows the model
to utilize the strengths of Transformers in processing the extracted features
and generating precise bounding box predictions. In addition, the model can learn complex object representations and handle object occlusion, scale variation, and other challenging scenarios more effectively. This adaptation enhances the model's capability to accurately detect and localize objects in a variety of real-world applications. The performance of the proposed Transformer-based
object detection model is evaluated through experiments on widely recognized
object detection benchmarks like COCO. Additionally, proprietary datasets
like Next wealth are used to gauge the model’s performance. The results of
these evaluations exhibit significant enhancements in metrics such as mean
average precision and localization accuracy compared with other state-of-the-art methods. The Transformer-based object detection model demonstrates promising outcomes, showcasing improved accuracy and its capability to handle challenging scenarios and complex object interactions effectively.
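To make the two ideas summarized above concrete (self-attention-based feature extraction in place of a CNN backbone, and a Transformer-based detection head in place of fully connected layers), the sketch below shows one minimal, DETR-style way they can be combined in PyTorch. It is an illustrative sketch only, not the proposed model: the class name, layer sizes, number of object queries, and all hyperparameters are assumptions.

```python
# Illustrative sketch only: a minimal Transformer detector in PyTorch.
# Patch embedding + self-attention replaces a CNN backbone, and a
# Transformer decoder with learned object queries replaces the fully
# connected detection head. All names and sizes are hypothetical.
import torch
import torch.nn as nn


class TransformerDetector(nn.Module):
    def __init__(self, num_classes=80, num_queries=100, d_model=256,
                 patch=16, img_size=224):
        super().__init__()
        # Self-attention-based feature extraction: embed image patches
        # instead of extracting features with a convolutional backbone.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, d_model))
        # Transformer encoder-decoder processes the extracted features.
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=6,
                                          num_decoder_layers=6,
                                          batch_first=True)
        # Learned object queries, one per candidate detection.
        self.query_embed = nn.Parameter(torch.zeros(1, num_queries, d_model))
        # Prediction heads: class logits (+1 for "no object") and
        # normalized (cx, cy, w, h) bounding boxes.
        self.class_head = nn.Linear(d_model, num_classes + 1)
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, images):                      # images: (B, 3, H, W)
        x = self.patch_embed(images)                # (B, d_model, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)            # (B, N, d_model)
        x = x + self.pos_embed
        queries = self.query_embed.expand(images.size(0), -1, -1)
        hs = self.transformer(src=x, tgt=queries)   # (B, num_queries, d_model)
        return self.class_head(hs), self.box_head(hs).sigmoid()


# Example usage on a dummy batch.
model = TransformerDetector()
logits, boxes = model(torch.randn(2, 3, 224, 224))
print(logits.shape, boxes.shape)  # (2, 100, 81) and (2, 100, 4)
```

In this kind of design, each object query attends to the full set of patch features, which is what allows the detector to reason about global context and object interactions rather than only local receptive fields.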