Disentangling Disentangled Representations: Towards Improved Latent Units via Diffusion Models

Accepted to WACV 2025
Youngjun Jun, Jiwoo Park, Kyobin Choo, Tae Eun Choi, Seong Jae Hwang
Yonsei University



Abstract


Disentangled representation learning (DRL) aims to break down observed data into its core intrinsic factors for a profound understanding of the data. In real-world scenarios, manually defining and labeling these factors is non-trivial, making unsupervised methods attractive. Recently, there have been limited explorations of utilizing diffusion models (DMs), which are already mainstream in generative modeling, for unsupervised DRL. These methods implement their own inductive bias to ensure that each latent unit input to the DM expresses only one distinct factor. In this context, we design Dynamic Gaussian Anchoring to enforce attribute-separated latent units for more interpretable DRL. This unconventional inductive bias explicitly delineates the decision boundaries between attributes while also promoting the independence among latent units. Additionally, we propose Skip Dropout, a technique that easily modifies the denoising U-Net to be more DRL-friendly, addressing its uncooperative nature with the disentangling feature extractor. Our methods, which carefully consider the latent unit semantics and the distinct DM structure, enhance the practicality of DM-based disentangled representations, demonstrating state-of-the-art disentanglement performance on both synthetic and real data, as well as advantages in downstream tasks.

Motivation


Visualization of Latent Units


[Figure 1]

Figure 1. Visualization of latent units in EncDiff and Ours. The top figures visualize the latent unit representing object color for multiple data points, using the dimensionality reduction method PaCMAP. The bottom figures show the results of conditionally generating data by sampling the latent unit from the blue and red regions, respectively. (a) In EncDiff, the boundary between color regions is ambiguous, so a latent unit representing red can be sampled even in the blue region, and vice versa. (b) With our proposed method, we obtain interpretable latent units by clearly defining the boundaries between attributes, ensuring that only one color appears in each color region.
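
As a concrete illustration, this kind of per-unit visualization can be produced with the pacmap package. The following is a minimal sketch, assuming latents of shape (N, num_units, dim) from a trained encoder and ground-truth factor labels; the function and variable names are ours, not the paper's.

```python
# Minimal sketch: projecting one latent unit to 2D with PaCMAP and coloring it
# by a ground-truth factor (e.g., object color). Shapes and names are assumptions.
import pacmap
import matplotlib.pyplot as plt

def plot_latent_unit(latents, factor_values, unit_idx):
    """latents: (N, num_units, dim) latent units for N images;
    factor_values: (N,) ground-truth values of one factor."""
    unit = latents[:, unit_idx, :]                        # one latent unit per image
    embed = pacmap.PaCMAP(n_components=2).fit_transform(unit)
    plt.scatter(embed[:, 0], embed[:, 1], c=factor_values, cmap="coolwarm", s=4)
    plt.title(f"Latent unit {unit_idx}, colored by ground-truth factor")
    plt.show()
```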



Method


Pipeline


[Figure 2]

Figure 2. Training framework with the proposed methods. (a) During diffusion model training, each feature unit produced by the feature extractor is shifted toward the mean of its selected anchor Gaussian, and the shifted features become the condition for the diffusion model. To ensure the diffusion U-Net effectively utilizes the conditions created by the feature extractor, a skip dropout strategy is employed. (b) The Gaussian anchoring process involves: i) initializing the Gaussian mixture, ii) performing HDDC with the EM algorithm, iii) adjusting the number of Gaussians by splitting them according to the splitting criteria, and iv) filtering out unnecessary Gaussians.
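
To make the caption concrete, below is a minimal PyTorch sketch of the two components as described above. Everything here (function names, `anchor_means`, the shift scale, the dropout probability) is our own illustrative assumption, not the paper's implementation.

```python
# Illustrative sketch of Dynamic Gaussian Anchoring (conditioning step) and
# Skip Dropout; all names and hyperparameters are assumptions.
import torch
import torch.nn as nn

def anchor_shift(units, anchor_means, shift_scale=0.5):
    """Shift each latent unit toward the mean of its nearest anchored Gaussian
    before it conditions the diffusion model.
    units: (B, num_units, dim); anchor_means: (num_units, K, dim)."""
    shifted = torch.empty_like(units)
    for u in range(units.shape[1]):
        dists = torch.cdist(units[:, u], anchor_means[u])      # (B, K)
        nearest = anchor_means[u][dists.argmin(dim=1)]         # (B, dim)
        shifted[:, u] = units[:, u] + shift_scale * (nearest - units[:, u])
    return shifted

class SkipDropout(nn.Module):
    """Randomly zero a U-Net skip connection during training, so the decoder
    leans on the disentangled condition rather than on encoder shortcuts."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, skip):
        if self.training and torch.rand(()) < self.p:
            return torch.zeros_like(skip)
        return skip
```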



Visualization


Latent Interchange


[Figure 3]


Figure 3. Latent interchange results. This figure shows the results of conditional generation using latent units as the condition, where a single latent unit of the source image is replaced with a latent unit from the target image. The first and second rows show the source and target images, respectively. The third to sixth rows show the source image with one of its attributes (e.g., floor hue, camera angle) changed to that of the target. The left panel shows the Shapes3D dataset; the right panel shows the MPI3D dataset.
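
The interchange itself reduces to swapping one unit in the conditioning tensor. A sketch follows, where `sample` is a placeholder for a conditional diffusion sampler we do not show:

```python
# Sketch of latent interchange; `sample` is a hypothetical conditional sampler.
import torch

@torch.no_grad()
def interchange(src_units, tgt_units, unit_idx, sample):
    """src_units, tgt_units: (num_units, dim) conditions from two images;
    returns the source image regenerated with one unit taken from the target."""
    mixed = src_units.clone()
    mixed[unit_idx] = tgt_units[unit_idx]    # replace a single latent unit
    return sample(mixed.unsqueeze(0))        # generate with the edited condition
```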



Attention Map


[Figure 4]


Figure 4. Attention map visualizations on Shapes3D and MPI3D-toy datasets. These results allow us to verify how well the image regions highlighted by the attention maps correspond to the factors represented by the latent units. In both datasets, the denoising network focuses on the proper positions associated with the factors represented by the latent units (e.g., object size).
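
Maps like these are typically read off the cross-attention layers, where latent units act as keys/values and image tokens as queries. A hedged sketch using forward hooks is given below; it assumes each hooked module returns attention probabilities of shape (B, heads, HW, num_units), which will differ across U-Net implementations:

```python
# Sketch: collecting cross-attention maps between image tokens and latent units.
# The hooked modules and their output shape are assumptions.
import torch

@torch.no_grad()
def collect_attention(unet, x_t, t, cond, attn_modules):
    maps = []
    def hook(_module, _inputs, out):
        maps.append(out.detach())            # assumed shape: (B, heads, HW, num_units)
    handles = [m.register_forward_hook(hook) for m in attn_modules]
    unet(x_t, t, cond)                       # one denoising forward pass
    for h in handles:
        h.remove()
    # average over layers and heads -> (B, HW, num_units); reshape HW to H x W to plot
    return torch.stack(maps).mean(dim=(0, 2))
```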





Latent Interpolation


[Figure 5]


Figure 5. Visualization of latent interpolation on the Cars3D and CelebA datasets. (a) For CelebA, we observe natural transitions between two images in terms of hair color, hair style, skin color, background color, gender, and smile. (b) Similarly, in Cars3D, we observe smooth changes in vehicle type, color, azimuth, and elevation.
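
Such trajectories come from linearly blending the two images' latent-unit conditions. A sketch, reusing the hypothetical `sample` from the interchange example:

```python
# Sketch of latent interpolation between two conditions.
import torch

@torch.no_grad()
def interpolate(units_a, units_b, sample, steps=8):
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        cond = (1 - alpha) * units_a + alpha * units_b   # lerp in latent-unit space
        frames.append(sample(cond.unsqueeze(0)))
    return frames
```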



Quantitative Results


Cars3D, Shapes3D & MPI3D-toy


[Table 1]

Table 1. Comparison with baselines on the FactorVAE score and DCI disentanglement metrics (mean ± std). Bold indicates the best, and underline indicates the second-best.
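
For reference, the DCI disentanglement term in Table 1 can be computed along these lines. This is a sketch of the standard Eastwood & Williams formulation, not the paper's exact evaluation code; the FactorVAE score is omitted:

```python
# Sketch of the DCI disentanglement score: build a (codes x factors) importance
# matrix with per-factor regressors, then reward codes whose importance is
# concentrated on a single factor (1 - normalized entropy), weighted by relevance.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def dci_disentanglement(codes, factors):
    """codes: (N, num_codes) representations; factors: (N, num_factors) labels."""
    num_codes, num_factors = codes.shape[1], factors.shape[1]
    R = np.zeros((num_codes, num_factors))
    for j in range(num_factors):
        reg = GradientBoostingRegressor().fit(codes, factors[:, j])
        R[:, j] = np.abs(reg.feature_importances_)
    P = R / (R.sum(axis=1, keepdims=True) + 1e-11)   # per-code distribution over factors
    H = -(P * np.log(P + 1e-11) / np.log(num_factors)).sum(axis=1)
    rho = R.sum(axis=1) / R.sum()                    # weight codes by total importance
    return float((rho * (1.0 - H)).sum())
```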



CelebA


[Table 2]

Table 2. Comparison of disentanglement and generation quality using the TAD and FID metrics (mean ± std) on the real-world dataset CelebA.
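
The FID half of this table can be reproduced with off-the-shelf tooling. Below is a minimal torchmetrics sketch with random tensors standing in for CelebA and generated batches; TAD, which depends on CelebA attribute labels and per-attribute AUROCs, is not sketched here:

```python
# Sketch: FID with torchmetrics; random tensors stand in for real/generated images.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048, normalize=True)  # float images in [0, 1]
real_images = torch.rand(64, 3, 64, 64)   # stand-in for CelebA batches
fake_images = torch.rand(64, 3, 64, 64)   # stand-in for generated samples
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute().item():.2f}")
```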