Abstract:
Traffic accidents are among the most serious risks in today’s busy society. Each year, they
cause a large number of casualties, illnesses, and deaths, in addition to huge financial losses.
Given the rapid growth of embedded video surveillance systems for monitoring traffic accidents,
there is a need to deploy systems that offer both high detection accuracy and high speed. Recent
vision-based accident detection methods have been highly successful thanks to the capabilities of
deep convolutional neural networks (CNNs), which have long been the preferred architecture for
computer vision tasks. However, current CNN-based approaches ignore contextual information and
treat all image pixels as equally important when classifying accidents. This can lead to low
accuracy and detection delays.
This study uses a Vision Transformer-based accident detection method in place of a CNN to
improve detection speed and achieve high accuracy. Unlike convolutional networks, Transformers
process an image as a sequence of patches and selectively focus on different visual components
according to context. Additionally, the Transformer’s attention mechanism addresses the issue
of low detection probability, enabling early accident identification. In this project, traffic
accidents were detected from video footage using the Vision Transformer (ViT-B/32) model. To
support root-cause analysis of accidents, additional roadside actions are also classified. On a
publicly accessible dataset, the Vision Transformer achieves a classification accuracy of about
92%. The resulting system couples video-based accident detection with an SMS service that
delivers notifications to the appropriate authorities.
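
To make the patch-based view of an image concrete, the following is a minimal sketch of how a frame could be cut into the non-overlapping 32 x 32 patches implied by the "/32" in ViT-B/32. The 224 x 224 input resolution, the helper name `split_into_patches`, and the synthetic single-channel frame are assumptions made for illustration only; they are not taken from this work's implementation.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (a list of rows) into non-overlapping
    square patches, returned in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Take patch_size rows, and patch_size columns from each row.
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


IMAGE_SIZE = 224  # assumed input resolution (not stated in the abstract)
PATCH_SIZE = 32   # patch side length indicated by the "/32" in ViT-B/32

# Synthetic single-channel frame: each pixel stores its own flat index.
image = [[r * IMAGE_SIZE + c for c in range(IMAGE_SIZE)]
         for r in range(IMAGE_SIZE)]

patches = split_into_patches(image, PATCH_SIZE)
num_patches = len(patches)  # (224 / 32) ** 2 = 49 patches per frame
```

Under these assumptions, each frame becomes a sequence of 49 patches. In a full Vision Transformer, each patch would then be flattened and linearly projected into a token before attention is applied, a step this sketch omits.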