Docs - Granular Neural Style Transfer

abstract

Neural Style Transfer (NST) synthesizes images by recombining structural content with artistic style using deep convolutional neural networks. In standard formulations, style loss is aggregated uniformly across multiple convolutional layers, limiting explicit control over spatial-scale contributions.

We propose a Layer-Wise Style Weighting (LWSW) framework that introduces independent coefficients for each selected VGG-19 feature layer. Because shallow layers encode fine textures while deeper layers capture higher-level abstractions, redistributing their contributions enables controllable transitions between texture-dominant and structure-dominant stylization.

example

content: Tübingen, Germany

style: The Starry Night, Van Gogh

output: stylized result

introduction

Early work on NST showed that deep convolutional networks implicitly separate content and style representations. Content is preserved through high-level feature activations, while style is encoded through second-order feature correlations captured by Gram matrices.

In conventional NST, the style loss is computed across multiple convolutional layers and combined using a single global coefficient. Convolutional neural networks, however, are inherently hierarchical: early layers capture low-level textures, while deeper layers encode larger-scale abstractions.

Because receptive field sizes increase with depth, uniform aggregation of style losses restricts fine-grained control over spatial-scale stylization. LWSW introduces a layer-wise weighting mechanism to explicitly exploit this hierarchy and enable interpretable multi-scale stylization control.
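To make the link between depth and spatial scale concrete, the sketch below computes the theoretical receptive field at each of the five style layers. It assumes the standard VGG-19 layout (3x3 convolutions with stride 1, 2x2 max-pooling with stride 2); `VGG19_BLOCKS` and the function name are illustrative and not taken from the LWSW code.

```python
# Number of 3x3 conv layers in each of the five VGG-19 blocks (standard layout);
# each block is followed by a 2x2, stride-2 max-pool.
VGG19_BLOCKS = (2, 2, 4, 4, 4)


def style_layer_receptive_fields(blocks=VGG19_BLOCKS):
    """Return {layer_name: receptive field in input pixels} for relu{b}_1."""
    rf, jump = 1, 1  # receptive field and spacing between adjacent units, in pixels
    fields = {}
    for b, n_convs in enumerate(blocks, start=1):
        for c in range(1, n_convs + 1):
            rf += (3 - 1) * jump  # 3x3 conv, stride 1
            if c == 1:
                fields[f"relu{b}_1"] = rf  # ReLU does not change the receptive field
        rf += (2 - 1) * jump  # 2x2 max-pool
        jump *= 2             # stride 2 doubles the spacing
    return fields


print(style_layer_receptive_fields())
```

Under these assumptions a unit at relu1_1 sees a 3-pixel-wide patch, while a unit at relu5_1 sees a patch more than 150 pixels wide, which is why the two ends of the stack respond to such different scales.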

VGG-19 style layers

layer     depth     captures
relu1_1   shallow   Fine texture: brush strokes, grain, fine detail
relu2_1   shallow   Small patterns: repeating motifs, colour patches
relu3_1   mid       Mid-level shapes: edges, regional structure
relu4_1   deep      Large structures: spatial composition
relu5_1   deep      Global composition: high-level abstraction
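In a PyTorch-style implementation these layers are usually picked by position in a sequential feature stack. The sketch below derives those positions from the standard VGG-19 configuration, assuming a stack ordered conv, ReLU, ..., max-pool with no batch normalization (the layout used by torchvision's `vgg19().features`). Whether the LWSW server uses this layout is an assumption, and the names are illustrative.

```python
# Standard VGG-19 configuration: integers are conv output channels, "M" is max-pool.
VGG19_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
             512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]


def style_layer_indices(cfg=VGG19_CFG):
    """Map relu{block}_1 to its position in a conv/ReLU/pool sequential stack."""
    indices = {}
    idx, block, conv_in_block = 0, 1, 0
    for item in cfg:
        if item == "M":
            idx += 1              # the pooling layer occupies one slot
            block += 1
            conv_in_block = 0
        else:
            conv_in_block += 1
            if conv_in_block == 1:
                indices[f"relu{block}_1"] = idx + 1  # ReLU follows its conv
            idx += 2              # one slot for the conv, one for the ReLU
    return indices


print(style_layer_indices())
```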

methodology

Stylization is formulated as an optimization over the pixels of the output image x, minimizing a total loss measured in feature space:

L(x) = L_content(x) + L_style(x)

Standard NST applies a single global scalar across all style layers:

L_style = α · Σ_l E_l

This scalar aggregation treats all convolutional depths equally, preventing selective emphasis on texture vs. structure.
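The per-layer term E_l is not defined on this page. Assuming it follows the standard Gram-matrix formulation from the original NST algorithm, with F^l the N_l x M_l feature matrix of the generated image at layer l (N_l channels, M_l spatial positions) and A^l the Gram matrix of the style image at the same layer, the uniform style loss can be written out as:

```latex
G^l_{ij} = \sum_{k=1}^{M_l} F^l_{ik} F^l_{jk},
\qquad
E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2,
\qquad
L_{\text{style}} = \alpha \sum_{l} E_l
```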

LWSW replaces this with per-layer coefficients, enabling independent control of each spatial scale:

L_style = Σ_l α_l · E_l
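A minimal NumPy sketch of this weighted sum follows. It assumes E_l is the usual scaled squared distance between Gram matrices and uses random arrays in place of real VGG-19 activations; all function names are illustrative rather than taken from the LWSW implementation.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of a (channels, height, width) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T


def layer_style_loss(generated, style):
    """E_l: scaled squared distance between the two Gram matrices.

    Assumes both feature maps have the same shape.
    """
    c, h, w = generated.shape
    diff = gram_matrix(generated) - gram_matrix(style)
    return float(np.sum(diff ** 2) / (4.0 * c ** 2 * (h * w) ** 2))


def lwsw_style_loss(generated_feats, style_feats, alphas):
    """L_style = sum over layers of alpha_l * E_l."""
    return sum(
        alphas[name] * layer_style_loss(generated_feats[name], style_feats[name])
        for name in alphas
    )


# Random arrays stand in for VGG-19 activations at two layers.
rng = np.random.default_rng(0)
gen = {"relu1_1": rng.normal(size=(4, 8, 8)), "relu5_1": rng.normal(size=(6, 2, 2))}
sty = {"relu1_1": rng.normal(size=(4, 8, 8)), "relu5_1": rng.normal(size=(6, 2, 2))}

texture_only = lwsw_style_loss(gen, sty, {"relu1_1": 1.0, "relu5_1": 0.0})
balanced = lwsw_style_loss(gen, sty, {"relu1_1": 0.5, "relu5_1": 0.5})
print(texture_only, balanced)
```

Setting a layer's coefficient to zero removes that scale from the objective entirely, which is the kind of selective emphasis a single global scalar cannot express.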

The weights α_l are normalized before the optimization run so that their relative ratios determine emphasis, independent of absolute magnitude.
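The page does not say which normalization is used. One simple choice, assumed here, is to divide each weight by the sum so that the coefficients add up to 1. The function name and the slider values below are hypothetical.

```python
STYLE_LAYERS = ("relu1_1", "relu2_1", "relu3_1", "relu4_1", "relu5_1")


def normalize_weights(raw):
    """Scale non-negative per-layer weights so they sum to 1."""
    missing = [name for name in STYLE_LAYERS if name not in raw]
    if missing:
        raise ValueError(f"missing weights for: {missing}")
    if any(raw[name] < 0 for name in STYLE_LAYERS):
        raise ValueError("layer weights must be non-negative")
    total = sum(raw[name] for name in STYLE_LAYERS)
    if total == 0:
        raise ValueError("at least one layer weight must be positive")
    return {name: raw[name] / total for name in STYLE_LAYERS}


# Hypothetical slider settings: the same ratios at two different magnitudes.
small = normalize_weights({"relu1_1": 4, "relu2_1": 2, "relu3_1": 1, "relu4_1": 1, "relu5_1": 0})
large = normalize_weights({"relu1_1": 40, "relu2_1": 20, "relu3_1": 10, "relu4_1": 10, "relu5_1": 0})
print(small)
print(small == large)
```

Both settings normalize to the same coefficients, so only the ratios between sliders matter, as described above.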

stylization profiles

how to use

1. Upload images

Drag or click to upload a content image (the photo whose structure you want to keep) and a style image (the artwork whose texture you want to apply).

2. Set layer weights

Adjust the five LWSW sliders, one per VGG-19 style layer. Each sets how much its layer contributes to the style loss. A higher weight on relu1_1 emphasizes fine texture; a higher weight on relu5_1 emphasizes abstract structure.

3. Tune optimization

Iterations sets how many steps the optimizer runs; more steps generally improve quality at the cost of a longer wait. Style weight sets the overall strength of the style relative to content.

4. Run

Click "run style transfer". The server processes the request synchronously using L-BFGS optimization, so the page waits until the run finishes. When it completes, the output appears alongside the inputs and can be downloaded as a PNG.
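For readers curious what an L-BFGS run looks like in code, here is a toy PyTorch sketch of the closure pattern that `torch.optim.LBFGS` requires. It is not the server's implementation: a single fixed random convolution stands in for VGG-19, the images are random tensors, and `style_weight`, the step count, and the strong-Wolfe line search are assumed settings.

```python
import torch

torch.manual_seed(0)

# Stand-in feature extractor: one fixed random conv + ReLU instead of VGG-19.
extractor = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU())
for p in extractor.parameters():
    p.requires_grad_(False)


def gram(feats):
    """Gram matrix of a (1, C, H, W) activation, scaled by spatial size."""
    _, c, h, w = feats.shape
    flat = feats.reshape(c, h * w)
    return flat @ flat.t() / (h * w)


content = torch.rand(1, 3, 16, 16)
style = 0.25 * torch.rand(1, 3, 16, 16)   # darker image, so its statistics differ
target_feats = extractor(content)
target_gram = gram(extractor(style))

style_weight = 1e3                         # overall style strength (assumed value)
x = content.clone().requires_grad_(True)   # optimize pixels, starting from the content image
optimizer = torch.optim.LBFGS([x], max_iter=20, line_search_fn="strong_wolfe")
history = []


def closure():
    """Re-evaluate the loss; L-BFGS may call this several times per step."""
    optimizer.zero_grad()
    feats = extractor(x)
    content_loss = torch.nn.functional.mse_loss(feats, target_feats)
    style_loss = torch.nn.functional.mse_loss(gram(feats), target_gram)
    loss = content_loss + style_weight * style_loss
    loss.backward()
    history.append(loss.item())
    return loss


for _ in range(5):                         # a small budget, playing the role of "iterations"
    optimizer.step(closure)

print(f"initial loss {history[0]:.4f}, best loss {min(history):.4f}")
```

Because the closure is evaluated multiple times per step, the number of forward and backward passes is larger than the step count, which is one reason more iterations mean a noticeably longer wait.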

evaluation metrics

LPIPS

Learned Perceptual Image Patch Similarity. Measures the perceptual distance between the stylized output and the content image using deep network features. Lower values indicate a more content-faithful output.

SSIM

Structural Similarity Index. Compares luminance, contrast, and structure between two images. Used to evaluate how well content structure is preserved under stylization; higher values indicate better preservation.
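As a rough illustration of what SSIM measures, the sketch below evaluates the SSIM formula once over whole grayscale images. This is a simplified global variant, assumed here for brevity: standard SSIM averages the same expression over local sliding windows, and the page does not state which implementation the evaluation uses. The function name and test images are illustrative.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """SSIM computed once over whole grayscale images (no sliding window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


rng = np.random.default_rng(0)
content = rng.random((32, 32))
noisy = np.clip(content + rng.normal(scale=0.2, size=content.shape), 0.0, 1.0)

print(global_ssim(content, content))  # identical images score at the maximum
print(global_ssim(content, noisy))    # a degraded copy scores lower
```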

paper

Layer-Wise Weighting for Granular Control in Neural Artistic Stylization

Chetan Tyagi & Linh Le, University of Alberta, February 2026

CMPUT 414


