Image Inpainting implementation using Deep Learning models

Pawan A.
5 min read · Aug 5, 2020


I’m Pawan, currently pursuing my undergraduate degree in Computer Science at Bennett University. A few weeks back I joined as an intern under Leadingindia.ai and worked on a project titled “Image Inpainting using Deep Learning”. Before this project, I was vague about the concepts of Deep Learning and Neural Networks, and with the guidance of my mentor and various online resources I was exposed to heaps of knowledge. This has been a great experience for me, and I’d like to share some insights about it.

I’d like to thank my team: Yashvi Chauhan, Mohammed Ghalib Qureshi, Bharat Ahuja, Ritwik Puri, Souvik Mishra, and Giridhar, who gave great support throughout the project, and Dr. Suneet Gupta, our mentor, for always being open and available to us.

About the Organization

Leadingindia.ai is a project initiated to equip faculty members and students with industry-driven artificial intelligence and deep learning tools. The Royal Academy of Engineering, UK, under the Newton Bhabha Fund, has approved a nationwide initiative in India on “AI and Deep Learning Skilling and Research”. University College London, Brunel University London, and Bennett University, India are collaborators on the project.

Project

What is Image Inpainting?

Inpainting Example

Looking at the images above, you might think they are simply two different photographs, but that is where the magic of image inpainting comes in. Inpainting refers to removing noise or an unwanted object from an image and filling in the missing pixels plausibly. The technique is mainly used in image editing: because it can crop and patch images seamlessly, companies like Adobe use it in Photoshop. There are two major ways to achieve inpainting, a classical image processing approach and a deep learning approach; we chose the latter.

Deep learning approach

Compared with the traditional image processing approach, the deep learning approach is more complex but produces better results on a given set of data. We used different models in this process with various datasets. Inpainting is usually implemented either with an autoencoder-decoder network or by passing the image through a Generative Adversarial Network (GAN) architecture.

Auto Encoder and Decoder Model

Dataset: the CIFAR-10 image dataset, curated by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class: 50,000 training images and 10,000 test images.
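To give an idea of how the training data is prepared, here is a minimal sketch of loading CIFAR-10 and pairing each image with a randomly placed square hole. The mask size, placement, and the `make_masked` helper are illustrative assumptions, not the exact preprocessing we used.

```python
import numpy as np
from tensorflow.keras.datasets import cifar10

# Load CIFAR-10: 50,000 training and 10,000 test images of size 32x32x3
(x_train, _), (x_test, _) = cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

def make_masked(images, hole=8):
    """Zero out a random square patch in each image and return
    (masked images, binary masks). The patch size is illustrative."""
    masked = images.copy()
    masks = np.ones_like(images)
    h, w = images.shape[1:3]
    for img, m in zip(masked, masks):
        y = np.random.randint(0, h - hole)
        x = np.random.randint(0, w - hole)
        img[y:y + hole, x:x + hole, :] = 0.0
        m[y:y + hole, x:x + hole, :] = 0.0
    return masked, masks

x_train_masked, train_masks = make_masked(x_train)
x_test_masked, test_masks = make_masked(x_test)
```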

Architecture for Auto Encoder and Decoder
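The exact layer configuration is the one shown in the figure above; the sketch below is a minimal convolutional encoder-decoder along the same lines, written in Keras, with filter counts and depth chosen purely for illustration. The network takes the masked image as input and is trained to reconstruct the original image.

```python
from tensorflow.keras import layers, Model

def build_autoencoder(input_shape=(32, 32, 3)):
    """A small convolutional encoder-decoder for 32x32 inpainting.
    Filter counts and depth are illustrative, not the exact project model."""
    inp = layers.Input(shape=input_shape)

    # Encoder: compress the masked image into a compact representation
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)                      # 16x16
    x = layers.Conv2D(128, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)                      # 8x8

    # Decoder: upsample back to the original resolution
    x = layers.Conv2D(128, 3, activation="relu", padding="same")(x)
    x = layers.UpSampling2D(2)(x)                      # 16x16
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.UpSampling2D(2)(x)                      # 32x32
    out = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(x)

    return Model(inp, out)

autoencoder = build_autoencoder()
autoencoder.compile(optimizer="adam", loss="mse")
# Training pairs: masked images as input, original images as target
# autoencoder.fit(x_train_masked, x_train,
#                 validation_data=(x_test_masked, x_test),
#                 batch_size=64, epochs=20)
```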

After the implementation, we created a website where we can upload an image and the mask for that image, i.e. the portion of the image that should be removed and replaced. The process flow map of the website is given below.

Process flow map of the website
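The article does not go into the web stack we used, so the snippet below is only one possible way to realise the flow above: a hypothetical Flask endpoint that accepts the uploaded image and mask, runs the trained model, and returns the inpainted result. The route name, helper function, and the assumption that the model is the 32x32 autoencoder from earlier are all illustrative.

```python
import io
import numpy as np
from PIL import Image
from flask import Flask, request, send_file

app = Flask(__name__)
# `autoencoder` is assumed to be the trained inpainting model from above,
# e.g. autoencoder = build_autoencoder(); autoencoder.load_weights("inpaint.h5")

def to_array(file_storage, size=(32, 32)):
    """Read an uploaded file into a normalized float array."""
    img = Image.open(file_storage).convert("RGB").resize(size)
    return np.asarray(img, dtype="float32") / 255.0

@app.route("/inpaint", methods=["POST"])  # hypothetical route name
def inpaint():
    image = to_array(request.files["image"])
    mask = to_array(request.files["mask"])
    masked = image * mask                          # zero out the region to be filled
    pred = autoencoder.predict(masked[None, ...])[0]
    out = Image.fromarray((pred * 255).astype("uint8"))
    buf = io.BytesIO()
    out.save(buf, format="PNG")
    buf.seek(0)
    return send_file(buf, mimetype="image/png")
```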

Results: The model gave very good results on the major performance metrics: loss, val_loss, dice_coeff, and val_dice_coeff.

Results for Auto Encoder and Decoder architecture
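Here, dice_coeff and val_dice_coeff are the Dice coefficients on the training and validation sets. A common Keras-style implementation, assuming the coefficient is computed over flattened pixel values with a small smoothing term, looks like this (a sketch, not necessarily the exact formulation we used):

```python
import tensorflow.keras.backend as K

def dice_coeff(y_true, y_pred, smooth=1.0):
    """Dice coefficient between target and predicted images, computed over
    flattened pixel values; the smoothing term avoids division by zero."""
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    # A loss version that can be minimized during training
    return 1.0 - dice_coeff(y_true, y_pred)

# autoencoder.compile(optimizer="adam", loss=dice_loss, metrics=[dice_coeff])
```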

Bilinear GAN Model

This time we used a GAN model because it can generate more refined and visually appealing results. When we remove and replace the missing patches, we need information about the surrounding region and about how the new patch should fit in; for that, a global and local attention module was used to obtain the global dependency information and the local similarity information among the features.

Global and local attention-based model architecture
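The exact module combines a global branch and a local branch, as in the figure above. As a rough illustration of just the global part, the sketch below is a simplified self-attention layer in which every spatial position of a feature map attends to every other position; the channel reduction factor and residual connection are assumptions for illustration, and the local-similarity branch is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

class GlobalAttention(layers.Layer):
    """Simplified self-attention over spatial positions, capturing the
    global dependency information described above. The local-similarity
    branch of the actual module is omitted for brevity."""
    def __init__(self, channels):
        super().__init__()
        self.q = layers.Conv2D(channels // 8, 1)   # queries
        self.k = layers.Conv2D(channels // 8, 1)   # keys
        self.v = layers.Conv2D(channels, 1)        # values

    def call(self, x):
        b = tf.shape(x)[0]
        h, w = tf.shape(x)[1], tf.shape(x)[2]
        q = tf.reshape(self.q(x), [b, h * w, -1])
        k = tf.reshape(self.k(x), [b, h * w, -1])
        v = tf.reshape(self.v(x), [b, h * w, -1])
        # Attention weights between every pair of spatial positions
        attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True), axis=-1)
        out = tf.matmul(attn, v)                   # weighted sum over all positions
        return x + tf.reshape(out, tf.shape(x))    # residual connection
```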

Furthermore, to fine-tune the images output by the global and local attention-based model, a BCNN architecture was used. Combining both architectures gave us the final Bilinear GAN.
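The details of the fine-tuning network are not reproduced here. The characteristic operation of a bilinear CNN is bilinear pooling, the outer product of two feature streams averaged over spatial locations, and the sketch below shows that operation with illustrative shapes, as an assumption about the kind of building block involved rather than the exact project architecture.

```python
import tensorflow as tf

def bilinear_pool(feat_a, feat_b):
    """Bilinear pooling of two feature maps of shape (batch, H, W, C):
    take the outer product of the two feature vectors at every spatial
    location and average over locations, giving a (batch, C_a, C_b)
    descriptor that captures pairwise feature interactions."""
    b = tf.shape(feat_a)[0]
    ca = feat_a.shape[-1]
    cb = feat_b.shape[-1]
    a = tf.reshape(feat_a, [b, -1, ca])             # (batch, H*W, C_a)
    bm = tf.reshape(feat_b, [b, -1, cb])            # (batch, H*W, C_b)
    pooled = tf.matmul(a, bm, transpose_a=True)     # (batch, C_a, C_b)
    return pooled / tf.cast(tf.shape(a)[1], tf.float32)
```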

Datasets:

CelebA: CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images cover large pose variations and background clutter. We trained on 27,650 images from CelebA.

Places2: The Places dataset is a large-scale database for scene understanding. It contains more than 10 million images comprising 400+ unique scene categories, with 5,000 to 30,000 training images per class, consistent with real-world frequencies of occurrence. For our use, we trained on 20,000 images from this dataset.

Results:

For the Places dataset, we trained for 320,000 iterations; the images below give an idea of the results.

As mentioned earlier, the masked image in (A) is given as input, and the evolution of the inpainted (generated) images is compared with the ground-truth image (B).

After training on the CelebA training set for 470,000 iterations with a batch size of 4, we obtained quite good results from the model. Here are some of the output images.
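On the training side, these runs are counted in generator/discriminator update iterations rather than epochs. As a rough sketch only, one such adversarial update step could look like the following, assuming hypothetical `generator` and `discriminator` models, a standard non-saturating GAN loss, and an added L1 reconstruction term (a common choice for inpainting); none of these specifics are taken from the project code.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
gen_opt = tf.keras.optimizers.Adam(1e-4)
disc_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(generator, discriminator, masked, real):
    """One adversarial update: the generator inpaints the masked image,
    the discriminator scores real vs. generated images."""
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(masked, training=True)
        real_logits = discriminator(real, training=True)
        fake_logits = discriminator(fake, training=True)

        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator: fool the discriminator and stay close to the ground truth
        g_loss = bce(tf.ones_like(fake_logits), fake_logits) + \
                 100.0 * tf.reduce_mean(tf.abs(real - fake))

    gen_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                                generator.trainable_variables))
    disc_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                                 discriminator.trainable_variables))
    return g_loss, d_loss

# Iteration-based training, e.g. 470,000 iterations with batch size 4:
# for step in range(470_000):
#     masked, real = next(data_iterator)   # batches of 4 images
#     train_step(generator, discriminator, masked, real)
```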
