Abstract
Vision and language are two of the most critical human faculties. If we are to develop more useful Artificial Intelligence (AI) systems, these modalities will need to work in tandem. Although we are still far from the ultimate goal of synergistic integration of vision and language, several practical applications at the intersection of computer vision (CV) and natural language processing (NLP) have experienced a huge upsurge in recent years, accelerated by advances in deep learning and the ready availability of both benchmark and real-world datasets. In this dissertation, we address several interesting and important applications at this intersection, such as automated image captioning and the classification of objects and actions in images, with significant potential impact in problem domains such as information retrieval and product marketing. First, we propose an approach to speed up image caption retrieval, guided by the top object detected in an image. Second, we propose an approach to classify the action in an image without executing explicit action classifiers on the image: we first detect objects in the image and then, with the aid of the top objects and their associated word embeddings (obtained by training on a natural language corpus), infer the most probable action. Next, we propose a model to guess objects in an image in situations where datasets for training classifiers for such objects are unavailable. Finally, we conduct a similarity study on consumer products using both visual and textual features. We believe that these studies and the proposed models will provide practitioners with insights they can apply in designing AI systems for specific applications.
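The embedding-based action inference described above can be illustrated with a minimal sketch. The toy embeddings, vocabulary, and scoring rule below are illustrative assumptions, not the dissertation's actual model: here each candidate action is scored by its average cosine similarity to the embeddings of the top detected objects, whereas the real approach would use embeddings trained on a large natural language corpus (e.g., word2vec or GloVe) and its own scoring scheme.

```python
import numpy as np

# Hypothetical toy word embeddings; in practice these would come from a
# model trained on a natural language corpus (e.g., word2vec or GloVe).
EMB = {
    "dog":      np.array([0.9, 0.1, 0.0]),
    "frisbee":  np.array([0.7, 0.3, 0.1]),
    "running":  np.array([0.8, 0.2, 0.1]),
    "sleeping": np.array([0.1, 0.9, 0.0]),
    "cooking":  np.array([0.0, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def infer_action(detected_objects, candidate_actions, emb=EMB):
    """Return the candidate action whose embedding is, on average,
    most similar to the embeddings of the top detected objects."""
    def score(action):
        return np.mean([cosine(emb[action], emb[obj]) for obj in detected_objects])
    return max(candidate_actions, key=score)

# With "dog" and "frisbee" detected, "running" scores highest.
print(infer_action(["dog", "frisbee"], ["running", "sleeping", "cooking"]))
```

This sketch shows why no explicit action classifier runs on the image itself: once objects are detected, the action is inferred purely in embedding space.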