A Neural Algorithm of Artistic Style

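The previous few posts used GANs to generate art images in different styles. GANs are a broad research direction in their own right, and generating stylized images is only one application. The two papers I read recently are both by Gatys: A Neural Algorithm of Artistic Style (2015) and Image Style Transfer Using Convolutional Neural Networks (2016, accepted at CVPR). My impression after reading them is that they cover essentially the same material, with the 2016 paper probably a refinement of the 2015 one. This post focuses on the first paper; the two do not differ much.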

First, a result from my own implementation.

content picture

style picture

idea

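The idea of the paper is fairly simple and can be summarized with a few keywords: texture transfer, artistic style, separating content from style, convolutional neural networks. Put together: transfer the texture of an artistic-style image onto another image, on the premise that a convolutional network can separate the content of an image from its style.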

the paper tells

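The authors spend some space on the strengths of convolutional networks. In the feed-forward pass, a convolutional network processes image information through a hierarchy of computational units. The units in each layer can be understood as a bank of image filters that extract features from the input (feature maps), and different layers extract different feature maps. As depth increases, the learned feature maps represent what the image depicts more and more explicitly. In other words, repeated extraction means the deep layers capture the content of the image rather than its exact pixel values. To verify this, the feature maps can be visualized by reconstructing an image from them.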

content representation and style representation

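As noted above, the feature maps computed in low layers mostly reflect pixel values, so the feature responses of high layers are generally chosen as the content representation.
The style representation is built to capture texture information. Here one can simply treat the texture of the style image as its style. It is obtained from the correlations between the different filter responses in each layer, taken over several layers, which becomes clear when computing the style loss.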

filter images at each processing stage

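For the content representation, the high layers may lose the exact pixel values, but the outlines are still there. For the style representation, the learned textures become more specific and clearer with depth.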

loss function

content loss

$L_{content}(p,x,l)=\frac{1}{2}\sum_{ij}(F_{ij}^l-P_{ij}^l)^2$
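
The formula above can be sketched in a few lines of NumPy. This is only an illustration: it assumes the feature maps of layer l have already been flattened into matrices of shape (number of filters, number of positions), with F taken from the generated image and P from the original image. The function name `content_loss` and the toy numbers are made up for the example.

```python
import numpy as np

def content_loss(F, P):
    """Half the summed squared difference between two feature matrices.

    F: features of the generated image, shape (filters, positions)
    P: features of the original content image, same shape
    """
    return 0.5 * np.sum((F - P) ** 2)

# Toy layer with 2 filters and 3 spatial positions (made-up numbers).
P = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])
F = np.array([[1.0, 2.0, 5.0],
              [0.0, 0.0, 0.0]])

# Only two entries differ (by 2 and by 1): 0.5 * (2**2 + 1**2) = 2.5
loss = content_loss(F, P)
```

The loss is zero exactly when the generated image produces the same responses as the content image in that layer, which is what the optimization pushes toward.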

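In the content loss, p and x denote the original image and the generated image, P and F are their filter responses in layer l, i indexes the filter and j the position within its feature map. This loss is easy to understand. For background on correlations, see here.

On understanding the Gram matrix: it can be viewed as the uncentered covariance matrix between features (the mean is not subtracted). In a feature map, each number is the response of a particular filter at a particular position and represents the strength of that feature. The Gram matrix computes the correlation between every pair of features (which two appear together, which one falls as the other rises). Its diagonal entries also reflect how much of each feature appears in the image. The Gram matrix therefore helps capture the overall style of an image, and with it the difference in style between two images can be measured.

Gram matrix computation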
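
A small NumPy sketch of the Gram matrix described above, assuming the feature maps of one layer are stacked as rows of a matrix F (one row per filter, one column per spatial position). The name `gram_matrix` and the toy values are illustrative only.

```python
import numpy as np

def gram_matrix(F):
    """Inner products between all pairs of vectorised feature maps.

    F has shape (filters, positions). Dividing the result by the number
    of positions would give the uncentered covariance between filters.
    """
    return F @ F.T

# Three filters, three positions (made-up numbers).
# Filter 2 is exactly twice filter 0; filter 1 fires where they do not.
F = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [2.0, 0.0, 4.0]])
G = gram_matrix(F)

# Diagonal: how strongly each filter responds overall.
energy = np.diag(G)

# Shuffling the spatial positions does not change G, so the Gram matrix
# discards where a feature occurs and keeps only which features co-occur.
G_shuffled = gram_matrix(F[:, [2, 0, 1]])
```

In this toy case the entry for filters 0 and 2 is large because they always fire together, while every entry involving filter 1 and another filter is zero.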

style loss

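The style loss is based on the correlations between different filter responses. These feature correlations are measured with the Gram matrix, whose core operation is the inner product between vectorised feature maps.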

$G_{ij}^l=\sum_kF_{ik}^lF_{jk}^l$

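With $N_l$ filters in layer l, each producing a feature map of size $M_l$ (height times width), the style loss of each layer can be written as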

$E_l=\frac{1}{4N_l^2M_l^2}\sum_{ij}(G_{ij}^l-A_{ij}^l)^2$
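
As a hedged illustration of the per-layer formula above, the sketch below computes $E_l$ from two feature matrices with NumPy. It assumes both images give feature matrices of the same shape (N filters by M positions); the name `layer_style_loss` and the toy inputs are made up for the example.

```python
import numpy as np

def layer_style_loss(F, S):
    """E_l: squared Gram-matrix difference, scaled by 1 / (4 N^2 M^2).

    F: layer features of the generated image, shape (N, M)
    S: layer features of the style image, same shape
    """
    N, M = F.shape
    G = F @ F.T   # Gram matrix of the generated image
    A = S @ S.T   # Gram matrix of the style image
    return np.sum((G - A) ** 2) / (4.0 * N ** 2 * M ** 2)

# Two filters, two positions (made-up numbers).
F = np.array([[1.0, 0.0],
              [0.0, 1.0]])
S = np.array([[1.0, 1.0],
              [0.0, 0.0]])

# G = [[1, 0], [0, 1]], A = [[2, 0], [0, 0]], squared difference sums to 2,
# and 4 * N^2 * M^2 = 64, so E_l = 2 / 64 = 0.03125.
E = layer_style_loss(F, S)
```

Because only Gram matrices are compared, two images with the same feature statistics but different spatial layouts give a loss of zero.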

the total style loss

$L_{style}(a,x)=\sum_{l=0}^Lw_lE_l$

a and x are the original (style) image and the generated image, and $A$ and $G$ are their respective style representations.
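
The weighted sum over layers is simple enough to sketch in plain Python. The per-layer errors below are made-up numbers; the weights follow the paper's choice of $w_l = 1/5$ for each of the five layers that contribute and 0 for all others. The name `style_loss` is illustrative.

```python
def style_loss(layer_errors, weights):
    """Weighted sum of the per-layer style errors E_l."""
    return sum(w * e for w, e in zip(weights, layer_errors))

# Hypothetical E_l values for five layers, each weighted by 1/5.
layer_errors = [0.5, 1.0, 1.5, 2.0, 2.5]
weights = [0.2] * 5
total_style = style_loss(layer_errors, weights)
```

Setting some weights to zero is how the paper restricts the style representation to a subset of layers.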

total loss

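Content and style are in a trade-off: if the generated image leans toward the content, the style comes out less well, and vice versa. During optimization, $\alpha$ and $\beta$ can therefore be adjusted to control the balance between content and style.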

$L_{total}(p,a,x)=\alpha L_{content}(p,x)+\beta L_{style}(a,x)$
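
To make the trade-off concrete, here is a small plain-Python sketch with made-up loss values. The default weights mirror the code later in this post, which uses a content weight of 1 and `style_weight = 100`; the function name `total_loss` and the sample numbers are assumptions for illustration.

```python
def total_loss(content_loss, style_loss, alpha=1.0, beta=100.0):
    """Weighted combination of content and style losses."""
    return alpha * content_loss + beta * style_loss

# Hypothetical loss values at some step of the optimization.
c, s = 4.0, 0.05

# Fraction of the total loss that comes from the style term as beta grows.
style_share = []
for beta in (1.0, 100.0, 1000.0):
    total = total_loss(c, s, alpha=1.0, beta=beta)
    style_share.append(beta * s / total)
```

With the same raw losses, a larger $\beta$ makes the style term dominate the gradient, so the result drifts toward the style image; a larger $\alpha$ does the opposite.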

trade-off result

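The rows show the trade-off ratio, that is, whether the generated image leans toward style or content.
The columns show the effect of different layers, that is, how the style representation depends on the layer.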

problems

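Judging from my own style transfer results, a sketch has too many details that cannot be reproduced, and the quality of the generated image is poor. The good results in the paper use abstract art, oil paintings and the like; this kind of image works better because its texture features are easier to learn.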

code

This mainly reproduces the paper, using the pretrained VGG19 model in pytorch. The convolutional layers used are '0', '5', '10', '19' and '28'.
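
The numeric layer names are positions inside `vgg19.features`. The pure-Python sketch below rebuilds that ordering without loading the model, assuming torchvision's non-batch-norm VGG19 layout in which every convolution is followed by a ReLU and each block ends with a max-pool. The helper `vgg19_feature_names` is hypothetical and not part of torchvision.

```python
# Channel widths of VGG19 (configuration "E"); "M" marks a max-pool.
VGG19_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
             512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]

def vgg19_feature_names(cfg=VGG19_CFG):
    """Map each index in vgg19.features to a name such as 'conv4_1'."""
    names = {}
    index, block, conv = 0, 1, 1
    for item in cfg:
        if item == "M":
            names[str(index)] = "pool{}".format(block)
            index += 1
            block += 1
            conv = 1
        else:
            # Each convolution is immediately followed by a ReLU.
            names[str(index)] = "conv{}_{}".format(block, conv)
            names[str(index + 1)] = "relu{}_{}".format(block, conv)
            index += 2
            conv += 1
    return names

names = vgg19_feature_names()
selected = [names[i] for i in ["0", "5", "10", "19", "28"]]
```

Under that assumption the five indices are the first convolution of each block, conv1_1 through conv5_1, which are the layers the paper uses for the style representation. The paper takes the content representation from conv4_2, which would be index '21' in this layout.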

#encoding=utf-8
from __future__ import division
import numpy as np
import torch
import torch.nn as nn
import torchvision
from torchvision import models
from torchvision.transforms import transforms
from PIL import Image


device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


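def load_image(image_path, transform=None, max_size=None, shape=None):
    # Force 3 channels so the 3-channel Normalize also works for RGBA or grayscale files.
    image = Image.open(image_path).convert('RGB')

    # Rescale so that the longer side equals max_size (aspect ratio is kept).
    if max_size:
        scale = max_size / max(image.size)
        size = tuple(int(s * scale) for s in image.size)
        # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter.
        image = image.resize(size, Image.LANCZOS)

    # Resize to an explicit (width, height), e.g. to match the content image.
    if shape:
        image = image.resize(tuple(shape), Image.LANCZOS)

    # Convert to a normalized tensor and add a batch dimension.
    if transform:
        image = transform(image).unsqueeze(0)

    return image.to(device)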

class VGGNet(nn.Module):
    def __init__(self):
        super(VGGNet, self).__init__()
        # Indices into vgg19.features: the first convolution of each block.
        self.layers = ['0', '5', '10', '19', '28']
        self.vgg = models.vgg19(pretrained=True).features

    def forward(self, x):
        """Return the feature maps produced at the selected layers."""
        features = []
        for name, layer in self.vgg._modules.items():
            x = layer(x)
            if name in self.layers:
                features.append(x)
        return features

# vgg = VGGNet()
# print(vgg)

def train():
    transform = transforms.Compose([
        transforms.ToTensor(),
        # ImageNet statistics expected by the pretrained VGG19.
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])
    max_size = 400
    steps = 20000
    style_weight = 100
    log_step = 1000
    sample_step = 500
    lr = 0.003
    style_ = 'people'
    content = load_image('PNG/content.png', transform, max_size=max_size)
    # PIL expects (width, height), while the tensor is laid out as (N, C, H, W).
    style = load_image('PNG/people.jpg', transform,
                       shape=(content.size(3), content.size(2)))
    # Optimize the pixels of the target image, starting from the content image.
    target = content.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([target], lr=lr, betas=(0.5, 0.999))
    vgg = VGGNet().to(device).eval()

    # Content and style features never change, so compute them once without gradients.
    with torch.no_grad():
        content_feature = vgg(content)
        style_feature = vgg(style)

    for step in range(steps):
        target_feature = vgg(target)
        style_loss = 0
        content_loss = 0
        # Unlike the paper, the content loss here is accumulated over all five layers.
        for ft, fc, fs in zip(target_feature, content_feature, style_feature):
            _, c, h, w = ft.size()
            content_loss += torch.mean((ft - fc) ** 2)
            # Gram matrices of the target and style features.
            ft = ft.view(c, h * w)
            fs = fs.view(c, h * w)
            ft = torch.mm(ft, ft.t())
            fs = torch.mm(fs, fs.t())
            style_loss += torch.mean((ft - fs) ** 2 / (c * h * w))
        loss = content_loss + style_weight * style_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (step + 1) % log_step == 0:
            print('step: [{}/{}], content loss:{:.4f} , style loss:{:.4f} ,'
                  .format(step + 1, steps, content_loss.item(), style_loss.item()))
        if (step + 1) % sample_step == 0:
            # De-normalization written as another Normalize: mean' = -mean/std, std' = 1/std.
            denorm = transforms.Normalize((-2.12, -2.04, -1.80), (4.37, 4.46, 4.44))
            print('target shape', target.shape)
            # Drop the batch dimension; detach so saving is not tracked by autograd.
            img = target.detach().clone().squeeze(0)
            print('image shape', img.shape)
            img = denorm(img).clamp_(0, 1)
            torchvision.utils.save_image(img, './PNG/output-{}-{}.png'.format(step + 1, style_))

if __name__ == '__main__':
    train()
# nohup python dcgan.py >dcgan.output 2>&1 &
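
The constants passed to `denorm` in the code above come from inverting the normalization. Normalize computes y = (x - mean) / std, so x = (y - (-mean / std)) / (1 / std), which is itself a Normalize with mean' = -mean / std and std' = 1 / std. The NumPy sketch below checks this arithmetic; the helper `normalize` only mimics the per-channel formula and is not the torchvision transform.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Parameters of the inverse transform, one value per channel.
inv_mean = -mean / std
inv_std = 1.0 / std

def normalize(x, m, s):
    """Per-channel (x - m) / s for an array of shape (3, H, W)."""
    return (x - m[:, None, None]) / s[:, None, None]

# Round trip on a random image with values in [0, 1).
rng = np.random.default_rng(0)
x = rng.random((3, 4, 4))
y = normalize(x, mean, std)
x_back = normalize(y, inv_mean, inv_std)
```

Rounded to two decimals, `inv_mean` and `inv_std` match the hard-coded (-2.12, -2.04, -1.80) and (4.37, 4.46, 4.44), so the saved image is back in the [0, 1] range up to that rounding.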