Why you should Double-DIP for Natural Image Decomposition

Unsupervised Image Decomposition via Coupled Deep-Image-Priors

Erez Posner
Towards Data Science


Many computer vision tasks aspire to decompose an image into its constituent components. In image segmentation, the image is decomposed into meaningful sub-regions, e.g. foreground and background. In transparency separation, the image is separated into its superimposed reflection and transmission layers. Another example is image dehazing, where the goal is to separate a foggy image into its underlying haze-free image and the obscuring fog layer.

While these tasks appear unrelated at first, they can all be viewed as special cases of image decomposition into separate layers, as visualized in Figure 1: image segmentation (separation into foreground and background layers); transparent layer separation (into reflection and transmission layers); image dehazing (separation into a clear image and a haze map); and more.

In this post, we are going to focus on “Double-DIP”, a unified framework for unsupervised layer decomposition of a single image, based on several “Deep-image-Prior” (DIP) networks.

For the enthusiastic reader:
For more details on “Deep-image-Prior” (DIP), check out my previous post.

Figure 1: A unified framework for image decomposition.

Some Intuition

“Double-DIP” is mainly built on top of “Deep Image Prior” (DIP), a work by Ulyanov et al. In DIP, the authors showed that the structure of a DIP network is sufficient to capture the low-level statistics of a single natural image.

The input to the DIP network is random noise, and it trains to reconstruct a single image (which serves as its sole output training example). This network was shown to be quite powerful for solving image restoration tasks like denoising, super-resolution and inpainting, in an unsupervised way. An example of image denoising is shown below, taken from my previous post on Deep Image Prior:
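To make this concrete, here is a minimal sketch of a DIP-style restoration loop. This is my own toy illustration, not the paper’s code: the tiny network, the noise shapes and the hyperparameters are illustrative stand-ins for the hourglass architecture used in the DIP paper.

```python
import torch
import torch.nn as nn

# Toy stand-in for the encoder-decoder ("hourglass") generator of the DIP paper.
dip_net = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)

noisy = torch.rand(1, 3, 256, 256)   # placeholder for the corrupted observation
z = torch.rand(1, 32, 256, 256)      # fixed random noise input
optimizer = torch.optim.Adam(dip_net.parameters(), lr=0.01)

for step in range(2000):
    optimizer.zero_grad()
    out = dip_net(z)                    # current reconstruction
    loss = ((out - noisy) ** 2).mean()  # MSE against the single training image
    loss.backward()
    optimizer.step()

# Stopped early enough, `out` approximates the clean image:
# the network fits natural image structure before it fits the noise.
```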

Figure 2: (Left) Clean image x* restoration result using Deep Image Prior, starting from random initialization up to convergence; (Right) the noisy image x^

The authors of “Double-DIP” observed that when a combination of multiple DIPs is employed to reconstruct an image, those DIPs tend to “split” the image, in a way similar to how a human would naturally split it. In addition, they demonstrated how this approach can be utilized for additional computer vision tasks, including image dehazing, Fg/Bg segmentation of images and videos, watermark removal, and transparency separation in images and videos.

“Double-DIP”: Unsupervised Image Decomposition via Coupled Deep-Image-Priors

The key observation behind Double-DIP is that the distribution of small patches within each decomposed layer is “simpler” (more uniform) than in the original mixed image. Let’s clarify this with an example:

Figure 3: The complexity of mixtures of layers vs. the simplicity of the individual components

Observe the illustrative example in Figure 3a. Two different textures, X and Y, are mixed to form a more complex image Z, which exhibits layer transparency. The distribution of small patches and colors inside each pure texture (X and Y) is simpler than the distribution of patches and colors in the combined image (Z). Moreover, the similarity of patches across the two textures is very weak.

The claim that an image can be separated naturally and simply into its sub-components is derived from information theory, where it was proven that for two independent random variables X and Y, the entropy of their sum Z = X + Y is greater than each of their individual entropies.
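As a quick sanity check of this claim, here is a toy numeric sketch (my own illustration, not from the paper): we draw two independent random variables, sum them, and compare empirical entropies over a shared binning.

```python
import numpy as np

def entropy_bits(samples, bins=128, value_range=(-8.0, 8.0)):
    """Empirical Shannon entropy (bits) over a shared binning."""
    hist, _ = np.histogram(samples, bins=bins, range=value_range)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)     # stands in for "texture X" intensities
y = rng.uniform(-2.0, 2.0, 100_000)   # stands in for "texture Y" intensities
z = x + y                             # the superimposed mixture

print(entropy_bits(x), entropy_bits(y), entropy_bits(z))
# The entropy of z exceeds that of x and of y individually:
# the mixture is statistically more complex than either component.
```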

Single DIP vs. Coupled DIPs

The difference between using a single DIP network to learn pure images vs. mixed images is shown in Figure 4 below.

Figure 4. MSE Reconstruction Loss of a single DIP network, as a function of time

Figure 4 shows the MSE reconstruction loss of a single DIP network as a function of time (training iterations), for each of the three images in Figure 3a.

  • The orange plot is the loss of a DIP trained to reconstruct the texture image X
  • The blue plot — a DIP trained to reconstruct the texture Y
  • The green plot — a DIP trained to reconstruct their superimposed mixture (image transparency).

Note the larger loss and longer convergence time of the mixed image, compared to the losses of its individual components. In fact, the loss of the mixed image is larger than the sum of the two individual losses. This is tied to the fact that the distribution of patches in the mixed image is more complex and diverse (larger entropy; smaller internal self-similarity) than in either of its individual components.

Finally, applying multiple DIPs showed that they tend to “split” the image patches among themselves: similar small patches inside the image tend to all be generated by a single DIP network. In other words, each DIP captures a different component of the internal statistics of the image.

“Double-DIP” Framework

Figure 5 demonstrates the Double-DIP framework: two DIPs decompose the input image I into layers y_1 and y_2, which are then recomposed according to a learned mask m, reconstructing an approximation of I.

Figure 5. Double-DIP Framework
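In code, the recomposition step looks roughly like the following sketch. The make_dip helper is a hypothetical stand-in of my own for the paper’s DIP architecture; only the mixing formula at the end reflects the framework itself.

```python
import torch
import torch.nn as nn

def make_dip(out_channels):
    # Toy DIP builder; the real implementation uses a deeper hourglass network.
    return nn.Sequential(
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, out_channels, 3, padding=1), nn.Sigmoid(),
    )

dip1, dip2 = make_dip(3), make_dip(3)   # one DIP per layer (y_1 and y_2)
mask_net = make_dip(1)                  # a third network produces the mask m

z1 = torch.rand(1, 32, 256, 256)        # independent noise inputs
z2 = torch.rand(1, 32, 256, 256)
zm = torch.rand(1, 32, 256, 256)

y1, y2, m = dip1(z1), dip2(z2), mask_net(zm)
reconstructed = m * y1 + (1 - m) * y2   # I(x) = m(x)*y_1(x) + (1-m(x))*y_2(x)
```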

What is a good image decomposition?

There are infinitely many possible decompositions of an image into layers. The authors propose the following characteristics as defining a meaningful decomposition:

  • The recovered layers, when recombined, should reconstruct the input image
  • Each layer should be as “simple” as possible
  • There should be no dependence or correlation between the recovered layers

These criteria translate into the losses used to train the network. The first criterion is enforced via a “Reconstruction Loss”, which measures the error between the reconstructed image and the input image. The second criterion is obtained by employing multiple DIPs (one per layer). The third criterion is enforced by an “Exclusion Loss” between the outputs of the different DIPs (minimizing their correlation).

Each DIP network reconstructs a different layer y_i of the input image I. The input to each DIP is randomly sampled uniform noise z_i. The DIP outputs y_i are mixed using a weight mask m to form a reconstructed image

I(x) = m(x)·y_1(x) + (1 − m(x))·y_2(x)

which should be as close as possible to the input image I. The optimization loss is therefore

Loss = Loss_Reconst + α·Loss_Excl + β·Loss_Reg

The first element, the reconstruction loss, is defined as

Loss_Reconst = ||I − (m·y_1 + (1 − m)·y_2)||

The second element is the exclusion loss, which minimizes the correlation between the gradients of y_1 and y_2. Lastly, a mask regularization term pulls the mask m to be as close as possible to a binary image.
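Putting the three terms together, here is a hedged sketch of the combined objective. The exclusion term below is a simplified single-scale stand-in for the multi-scale exclusion loss of Zhang et al. that the paper adopts; alpha and beta are illustrative weights of my own, and the mask regularizer shown is the binarization variant used for segmentation (discussed below).

```python
import torch
import torch.nn.functional as F

def spatial_gradients(img):
    # Horizontal and vertical finite differences.
    gx = img[..., :, 1:] - img[..., :, :-1]
    gy = img[..., 1:, :] - img[..., :-1, :]
    return gx, gy

def exclusion_loss(y1, y2):
    # Simplified single-scale stand-in for the exclusion loss of Zhang et al.:
    # penalize locations where both layers have large gradients at once.
    g1x, g1y = spatial_gradients(y1)
    g2x, g2y = spatial_gradients(y2)
    return (torch.tanh(g1x.abs()) * torch.tanh(g2x.abs())).mean() \
         + (torch.tanh(g1y.abs()) * torch.tanh(g2y.abs())).mean()

def double_dip_loss(I, y1, y2, m, alpha=0.1, beta=0.001):
    # alpha and beta are illustrative weights, not the paper's values.
    reconstructed = m * y1 + (1 - m) * y2
    loss_reconst = F.mse_loss(reconstructed, I)      # criterion 1: reconstruction
    loss_excl = exclusion_loss(y1, y2)               # criterion 3: independence
    loss_reg = 1.0 / ((m - 0.5).abs().sum() + 1e-6)  # pull m toward a binary mask
    return loss_reconst + alpha * loss_excl + beta * loss_reg
```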

Results

As mentioned, this approach is applicable to many computer vision tasks, such as image segmentation, transparent layer separation, image dehazing, and more.

Segmentation

Image segmentation into foreground and background can be expressed as a decomposition of the image into a foreground layer, denoted y_1, and a background layer, denoted y_2. These two layers, combined with a binary mask m(x), yield the input image, as formulated in Equation 1:

I(x) = m(x)·y_1(x) + (1 − m(x))·y_2(x)

Equation 1

This formulation naturally fits the Double-DIP framework, subject to y_1 and y_2 complying with natural image priors and each being ‘simpler’ to generate than I.

The regularization term here encourages the segmentation mask to be binary, and is defined as

Loss_Reg(m) = (Σ_x |m(x) − 0.5|)^(−1)

The results below show the advantage of Double-DIP: high-quality segmentation based solely on layer decomposition, without any additional training data.

Figure 6. Foreground/Background separation results

Watermark Removal

Watermarks are widely used for copyright protection of photos and videos. Double-DIP removes watermarks by treating them as a special case of image reflection, where y_1 is the clean image and y_2 is the watermark.

Here, in contrast to image segmentation, the mask m is not binary. The inherent transparent-layer ambiguity is resolved in one of two practical ways: (i) when only one watermarked image is available, the user provides a crude hint (a bounding box) around the location of the watermark; (ii) given a few images which share the same watermark (2–3 typically suffice), the ambiguity is resolved on its own, as sketched below.
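Here is a hedged sketch of how option (ii) could be wired up; the identifiers and structure are my own assumptions, not the authors’ code. The watermark layer y_2 and the mask m are generated once and shared across all K images, while each image gets its own clean-layer DIP; the sharing is what resolves the layer ambiguity.

```python
import torch
import torch.nn as nn

def make_dip(out_channels):  # same toy builder as in the earlier sketches
    return nn.Sequential(
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, out_channels, 3, padding=1), nn.Sigmoid(),
    )

K = 3
clean_dips = nn.ModuleList([make_dip(3) for _ in range(K)])  # y_1^k, one per image
watermark_dip, mask_net = make_dip(3), make_dip(1)           # shared y_2 and m

z_clean = [torch.rand(1, 32, 256, 256) for _ in range(K)]
z_wm = torch.rand(1, 32, 256, 256)
z_m = torch.rand(1, 32, 256, 256)

y2, m = watermark_dip(z_wm), mask_net(z_m)
reconstructions = [m * dip(z) + (1 - m) * y2                 # Equation 1, per image
                   for dip, z in zip(clean_dips, z_clean)]
```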

Figure 7 below visualizes the remarkable watermark-removal results of Double-DIP.

Figure 7. Watermark removal from a single image

Transparent Layers Separation

In the case of image reflection, each pixel value I(x) is a combination of a pixel from the transmission layer y_1(x) and the corresponding pixel in the reflection layer y_2(x). This can again be formulated as in Equation 1, where m(x) is the reflective mask.

The animation below shows a successful separation for real transparent images.

Conclusion

Double-DIP provides a unified framework for unsupervised layer decomposition of a single image, with no need for an additional dataset. This framework is applicable to a wide variety of computer vision tasks.

If you’re interested in the source code, it can be found in my Double-DIP — GitHub repository.

As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn.

Till then, see you in the next post! 😄
