In the CVFAD Workshop at CVPR 2022

Rank in Style: A Ranking-based Approach to Find Interpretable Directions

Umut Kocasarı1*, Kerem Zaman1*, Mert Tiftikci1*, Enis Simsar2, Pinar Yanardag1
1Boğaziçi University 2Technical University of Munich

Video

Abstract

Recent work such as StyleCLIP aims to harness the power of CLIP embeddings for controlled manipulations. Although these models are capable of manipulating images based on a text prompt, the success of the manipulation often depends on careful selection of the appropriate text for the desired manipulation. This limitation makes it particularly difficult to perform text-based manipulations in domains where the user lacks expertise, such as fashion. To address this problem, we propose a method for automatically determining the most successful and relevant text-based edits using a pre-trained StyleGAN model. Our approach consists of a novel mechanism that uses CLIP to guide beam-search decoding, and a ranking method that identifies the most relevant and successful edits based on a list of keywords. We also demonstrate the capabilities of our framework in several domains, including fashion.
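
As an illustration of the CLIP-guided reranking idea mentioned in the abstract, below is a minimal sketch (not our released implementation) that scores candidate text prompts against a generated image using the openai CLIP package; the prompts and the image file name are placeholders.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_similarity(image, texts):
    # Cosine similarity between a single image and a list of candidate texts.
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image_input).float()
        txt_feat = model.encode_text(text_tokens).float()
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).squeeze(0)  # one score per candidate text

# Rerank candidate prompts (beams) by similarity to a generated sample.
candidates = ["a red floral dress", "a leather jacket", "a striped shirt"]
image = Image.open("generated_sample.png")  # placeholder for a StyleGAN output
scores = clip_similarity(image, candidates)
reranked = [candidates[i] for i in scores.argsort(descending=True).tolist()]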


Framework of RankInStyle

  • To find the best channels for performing manipulations, we rank the channels by the value of \( \mathcal{V}_{R}\, \mathcal{V}_{E} \) (see the sketch after this list).
  • We compute the relevance as the similarity between the generated image and the keyword in the CLIP embedding space. \begin{equation} \mathcal{V}_{R} = \mathcal{S}_{\text{CLIP}}(G(s), t) \end{equation}
  • Since relevance alone is not enough to assess whether a manipulation is successful, we also measure editability, the increase in relevance after the manipulation, normalized by how far the image moves in CLIP space. \begin{equation} \mathcal{V}_{E} = \frac{\mathcal{S}_{\text{CLIP}}(G(s+\alpha), t) - \mathcal{S}_{\text{CLIP}}(G(s), t)}{ L_2(\text{CLIP}(G(s+\alpha)) - \text{CLIP}(G(s)))} \end{equation}
  • We rerank beams using the CLIP similarity between the generated image and the candidate text.
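
The channel-ranking step above can be summarized in a short sketch. This is an outline under stated assumptions rather than the exact implementation: generator, clip_image_embed, and clip_text_embed are hypothetical helpers (an image from a StyleSpace code, and L2-normalized CLIP image/text embeddings), and relevance is evaluated on the edited image so that different channels can be compared.

import torch

def rank_channels(generator, clip_image_embed, clip_text_embed,
                  s, text, alpha=5.0, top_k=10):
    # Rank StyleSpace channels by V_R * V_E for a target text prompt.
    t = clip_text_embed(text)                    # normalized CLIP text embedding
    base_emb = clip_image_embed(generator(s))    # CLIP embedding of the unedited image
    base_rel = (base_emb * t).sum()              # S_CLIP(G(s), t)

    scores = []
    for c in range(s.shape[-1]):                 # perturb one channel at a time
        s_edit = s.clone()
        s_edit[..., c] += alpha
        edit_emb = clip_image_embed(generator(s_edit))
        v_r = (edit_emb * t).sum()               # relevance (here: of the edited image)
        # Editability: relevance gain per unit movement in CLIP image space.
        v_e = (v_r - base_rel) / (edit_emb - base_emb).norm().clamp(min=1e-8)
        scores.append((v_r * v_e).item())

    return torch.tensor(scores).topk(top_k).indices  # indices of the best channels

The returned indices are the channels that would then be edited with the same step size \( \alpha \) to produce manipulations such as the ones shown below.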

Highlights

  • Unlike previous work, our method finds directions by generating domain-related descriptions in an unsupervised fashion.
  • Our method outperforms SeFa and GANSpace in the semantic meaningfulness and disentanglement of the performed manipulations, as shown by our human evaluation.

Example manipulations with RankInStyle

Manipulations produced by our method on FFHQ
Manipulations produced by our method on AFHQ Cats

Comparison with Other Supervised and Unsupervised Methods


Human Evaluation

Human evaluation results (mean and standard deviation) for RankInStyle, SeFa, and GANSpace

Poster

BibTeX

@InProceedings{Kocasari_2022_CVPR,
    author    = {Kocasari, Umut and Zaman, Kerem and Tiftikci, Mert and Simsar, Enis and Yanardag, Pinar},
    title     = {Rank in Style: A Ranking-Based Approach To Find Interpretable Directions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2022},
    pages     = {2294-2298}
}

Acknowledgments

This publication was produced with support from the 2232 International Fellowship for Outstanding Researchers Program of TUBITAK (Project No: 118C321).