Carleton University - School of Computer Science Honours Project
Fall 2020
SVG Image Captioning with Convolutional Neural Network and Long Short-Term Memory
Hien Le
ABSTRACT
Automatically generating natural language descriptions of the content observed in an image (i.e. image captioning) is a challenging task in both computer vision and natural language processing. Image captioning has wide-ranging applications, for example in human-computer interaction. This project applies popular image captioning methods to caption pixel-based images of simple objects as scalable vector graphics (SVG). SVG images are defined by mathematical equations over points, lines and shapes rather than by a pixel grid, which makes them resolution-independent and infinitely scalable. The task of image captioning can be divided into two separate modules: an image-based model and a language-based model. The image-based model (the encoder) is often a pre-trained Convolutional Neural Network (CNN) used to extract features from the pixel-based images. The language-based model (the decoder) is usually a Long Short-Term Memory (LSTM) based Recurrent Neural Network (RNN) used to translate the features extracted by the image-based model into an SVG string. The resulting CNN-LSTM image captioning model achieves an accuracy of 82% on this task.
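To make the decoder's output concrete, the following is a minimal sketch of what an SVG "caption" for a simple object might look like and how it could be split into a token sequence for an LSTM to predict. The helper names (circle_svg, tokenize, target_sequence), the tokenization rule and the [START]/[END] markers are illustrative assumptions, not the project's actual preprocessing.

```python
import re
import xml.etree.ElementTree as ET


def circle_svg(cx, cy, r, size=100):
    """Return an SVG string for one circle on a size x size canvas (hypothetical helper)."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">'
        f'<circle cx="{cx}" cy="{cy}" r="{r}"/>'
        f'</svg>'
    )


# Assumed tokenization: quoted values, tag delimiters, "=" and names are each one token.
TOKEN_PATTERN = re.compile(r'"[^"]*"|</|/>|[<>=]|[A-Za-z_][\w:.-]*')


def tokenize(svg):
    """Split an SVG string into a flat list of tokens, skipping whitespace."""
    return TOKEN_PATTERN.findall(svg)


def target_sequence(svg):
    """Wrap the tokens in start/end markers, as a sequence decoder would typically expect."""
    return ["[START]"] + tokenize(svg) + ["[END]"]


if __name__ == "__main__":
    svg = circle_svg(50, 50, 30)
    # The shape is stored as geometry (centre and radius), not as pixels,
    # so the same string describes the circle at any rendering resolution.
    circle = ET.fromstring(svg)[0]
    print(circle.attrib)
    print(target_sequence(svg))
```

Under these assumptions, the decoder would be trained to emit such a token sequence one step at a time, conditioned on the image features produced by the CNN encoder.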