Abstract:
Automatically generating natural-language descriptions of an image's content is a difficult task: unlike humans, machines cannot do it readily. Implementing this capability, however, would substantially change how machines interact with us. Recent advances in object recognition from photographs have produced a paradigm for captioning images based on the relationships among the objects they contain. This paper presents several image-caption-generation models built on pre-trained neural networks, with an emphasis on how different CNN architectures, combined with an LSTM, influence sentence synthesis. A combination of such networks is well suited to generating a caption from an image. The quality of the generated captions is evaluated using the BLEU metric.
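The abstract does not specify how BLEU is computed; as an illustrative sketch (not the paper's implementation), sentence-level BLEU is the geometric mean of modified n-gram precisions between a candidate caption and a reference caption, scaled by a brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty. `reference` and
    `candidate` are token lists; no smoothing is applied."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        precisions.append(overlap / total)
    # Brevity penalty discourages captions shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "a dog runs in the park".split()
print(bleu(ref, ref))  # identical caption scores 1.0
```

A full evaluation would typically use a library implementation (e.g. NLTK's `sentence_bleu`) with smoothing and multiple references; this sketch only shows the metric's core structure.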