In the CVFAD Workshop at CVPR 2022

Rank in Style: A Ranking-based Approach to Find Interpretable Directions

Umut Kocasarı1*, Kerem Zaman1*, Mert Tiftikci1*, Enis Simsar2, Pinar Yanardag1
1Boğaziçi University 2Technical University of Munich

Video

Abstract

Recent work such as StyleCLIP aims to harness the power of CLIP embeddings for controlled manipulations. Although these models are capable of manipulating images based on a text prompt, the success of the manipulation often depends on careful selection of the appropriate text for the desired manipulation. This limitation makes it particularly difficult to perform text-based manipulations in domains where the user lacks expertise, such as fashion. To address this problem, we propose a method for automatically determining the most successful and relevant text-based edits using a pre-trained StyleGAN model. Our approach consists of a novel mechanism that uses CLIP to guide beam-search decoding, and a ranking method that identifies the most relevant and successful edits based on a list of keywords. We also demonstrate the capabilities of our framework in several domains, including fashion.
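
As an illustration of the CLIP-guided reranking idea mentioned in the abstract, below is a minimal sketch (not our released implementation) that scores candidate text prompts against a generated image using the openai CLIP package; the prompts and the image file name are placeholders.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_similarity(image, texts):
    # Cosine similarity between a single image and a list of candidate texts.
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image_input).float()
        txt_feat = model.encode_text(text_tokens).float()
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).squeeze(0)  # one score per candidate text

# Rerank candidate prompts (beams) by similarity to a generated sample.
candidates = ["a red floral dress", "a leather jacket", "a striped shirt"]
image = Image.open("generated_sample.png")  # placeholder for a StyleGAN output
scores = clip_similarity(image, candidates)
reranked = [candidates[i] for i in scores.argsort(descending=True).tolist()]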


Framework of RankInStyle

  • To find the best channels for performing manipulations, we rank the channels by the value of \( \mathcal{V}_{R}\, \mathcal{V}_{E} \) (see the sketch after this list).
  • We compute the relevance as the similarity between the generated image and the keyword in the CLIP embedding space. \begin{equation} \mathcal{V}_{R} = \mathcal{S}_{\text{CLIP}}(G(s), t) \end{equation}
  • Since relevance alone is not enough to assess whether a manipulation is successful, we also measure editability, the increase in relevance after the manipulation, normalized by how far the image moves in CLIP space. \begin{equation} \mathcal{V}_{E} = \frac{\mathcal{S}_{\text{CLIP}}(G(s+\alpha), t) - \mathcal{S}_{\text{CLIP}}(G(s), t)}{ L_2(\text{CLIP}(G(s+\alpha)) - \text{CLIP}(G(s)))} \end{equation}
  • We rerank beams using the CLIP similarity between the generated image and the candidate text.
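
The channel-ranking step above can be summarized in a short sketch. This is an outline under stated assumptions rather than the exact implementation: generator, clip_image_embed, and clip_text_embed are hypothetical helpers (an image from a StyleSpace code, and L2-normalized CLIP image/text embeddings), and relevance is evaluated on the edited image so that different channels can be compared.

import torch

def rank_channels(generator, clip_image_embed, clip_text_embed,
                  s, text, alpha=5.0, top_k=10):
    # Rank StyleSpace channels by V_R * V_E for a target text prompt.
    t = clip_text_embed(text)                    # normalized CLIP text embedding
    base_emb = clip_image_embed(generator(s))    # CLIP embedding of the unedited image
    base_rel = (base_emb * t).sum()              # S_CLIP(G(s), t)

    scores = []
    for c in range(s.shape[-1]):                 # perturb one channel at a time
        s_edit = s.clone()
        s_edit[..., c] += alpha
        edit_emb = clip_image_embed(generator(s_edit))
        v_r = (edit_emb * t).sum()               # relevance (here: of the edited image)
        # Editability: relevance gain per unit movement in CLIP image space.
        v_e = (v_r - base_rel) / (edit_emb - base_emb).norm().clamp(min=1e-8)
        scores.append((v_r * v_e).item())

    return torch.tensor(scores).topk(top_k).indices  # indices of the best channels

The returned indices are the channels that would then be edited with the same step size \( \alpha \) to produce manipulations such as the ones shown below.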

Highlights

  • Unlike previous work, our method finds directions by generating domain-related descriptions in an unsupervised fashion.
  • Our method outperforms SeFa and GANSpace in the semantic meaningfulness and disentanglement of the performed manipulations, as shown by our human evaluation.

Example manipulations with RankInStyle

Manipulations produced by our method on FFHQ
Manipulations produced by our method on AFHQ Cats

Comparison with Other Supervised and Unsupervised Methods


Human Evaluation

Human evaluation results (mean and standard deviation) for RankInStyle, SeFa, and GANSpace

Poster

BibTeX

@InProceedings{Kocasari_2022_CVPR,
    author    = {Kocasari, Umut and Zaman, Kerem and Tiftikci, Mert and Simsar, Enis and Yanardag, Pinar},
    title     = {Rank in Style: A Ranking-Based Approach To Find Interpretable Directions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2022},
    pages     = {2294-2298}
}

Acknowledgments

This publication was produced with support from the 2232 International Fellowship for Outstanding Researchers Program of TUBITAK (Project No: 118C321).