Abstract
Automatic image captioning is a challenging deep learning task that combines computer vision, to understand the contents of an image, with natural language generation, to compose a coherent description of that image. Image captioning for English is well developed and highly accurate, with some recent work surpassing human-level performance. Arabic image captioning, however, has received little attention, and the few published papers report relatively low performance. Researchers attribute this to the morphological complexity of the Arabic language and to the lack of large, robust benchmark datasets comparable to those available for English. Our proposed framework uses an improved text preprocessing pipeline that incorporates a word segmenter to alleviate some of the morphological complexity of Arabic. We also build neural network architectures that include techniques not previously explored in the Arabic image captioning literature, such as attention mechanisms and transformers. Our approach outperforms the most recent published work on Arabic image captioning, improving the BLEU-1 score from 33 to 44.3 and the BLEU-4 score from 6 to 15.6.