Carleton University - School of Computer Science Honours Project

Fall 2020

SVG Image Captioning with Convolutional Neural Network and Long Short-Term Memory

Hien Le

ABSTRACT

Automatically generating a natural language description of the content of an image (i.e. image captioning) is a challenging task that spans both computer vision and natural language processing. Image captioning has broad applications, for example in human-computer interaction. This project uses popular image captioning methods to caption pixel-based images of simple objects as scalable vector graphics (SVG). SVG images are composed of mathematical descriptions of points, lines and shapes rather than a pixel grid, which makes them resolution-independent and infinitely scalable. The task of image captioning can be divided into two modules: an image-based model and a language-based model. The image-based model (the encoder) is often a pre-trained Convolutional Neural Network (CNN) used to extract features from the pixel-based images. The language-based model (the decoder) is usually a Long Short-Term Memory (LSTM) based Recurrent Neural Network (RNN) used to translate the features extracted by the image-based model into an SVG string. The resulting CNN-LSTM image captioning model achieves an accuracy of 82%.
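Because the decoder emits an SVG string rather than an English sentence, the "caption" has to be represented as a sequence of tokens the LSTM can predict one step at a time. The abstract does not say how the project tokenizes SVG, so the following is only a minimal sketch under assumptions: the `<start>`/`<end>` markers, the regular expression, the example circle and the helper names `tokenize_svg` and `build_vocab` are all hypothetical.

```python
import re

# Hypothetical boundary tokens for the decoder (an assumption, not from the project).
START, END = "<start>", "<end>"

# Assumed token classes: names (tags/attributes), unsigned numbers, SVG punctuation.
TOKEN_PATTERN = re.compile(r'[A-Za-z_][\w\-]*|\d+(?:\.\d+)?|[<>/="]')


def tokenize_svg(svg):
    """Split an SVG string into tokens, wrapped in start/end markers."""
    return [START] + TOKEN_PATTERN.findall(svg) + [END]


def build_vocab(sequences):
    """Assign each distinct token an integer id in first-seen order."""
    vocab = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


# A simple object described by geometry rather than pixels.
svg = '<svg><circle cx="32" cy="32" r="10"/></svg>'
tokens = tokenize_svg(svg)
vocab = build_vocab([tokens])

# Integer ids are what an LSTM decoder would be trained to predict.
ids = [vocab[t] for t in tokens]
```

In a CNN-LSTM setup of the kind described above, the encoder's image features would condition the decoder, which would then predict these ids one at a time until it produces the end marker. This sketch ignores whitespace, signed numbers and any character outside the assumed token classes.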