GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation

CVPR Workshop (CVinW) 2024

$^*$Equal contribution, $\dagger$Project Lead
$^1$National Taiwan University, $^2$NVIDIA

Abstract

Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence throughout the entire video. Most existing methods require end-to-end training with dense mask annotations, which can be computationally costly and less scalable. In this work, we aim to efficiently adapt foundation segmentation models to address RVOS from weak supervision with the proposed Grounded Prompting (GroPrompt) framework. More specifically, we propose Text-Aware Prompt Contrastive Learning (TAP-CL) to enhance the association between the position prompts and the referring sentences with only box supervision, including Text-Contrastive Prompt Learning (TextCon) and Modality-Contrastive Prompt Learning (ModalCon) at the frame level and video level, respectively. With the proposed TAP-CL, our GroPrompt framework can generate temporally consistent yet text-aware position prompts describing the locations and movements of the referred object throughout the video. Experimental results on the standard RVOS benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences) demonstrate the competitive performance of our proposed GroPrompt framework given only bounding box weak supervision.

Framework

Overview of our proposed GroPrompt framework. In (a), our proposal generation takes each frame $I_t$ and the referring sentence $S^i$ to derive object queries $Q^i_t$ and produce the prompt embedding $p^i_t$ for segmentation, with another sentence $S^j$ taken as input for Text-Contrastive Prompt Learning. In (b), to handle sentence descriptions containing long-term motions or actions in referring video object segmentation, we present Modality-Contrastive Prompt Learning to align the text with the referred object at the video level.
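For readers who prefer pseudocode, the sketch below illustrates this per-frame flow under the assumption of a SAM-style foundation model with separate image, prompt, and mask-decoding components; the module names (grounding_model, prompt_encoder, image_encoder, mask_decoder) are illustrative placeholders rather than the authors' actual interfaces.

import torch

def groprompt_forward(frames, sentence, grounding_model, prompt_encoder,
                      image_encoder, mask_decoder):
    """Segment the object referred to by `sentence` in each frame of the video."""
    masks = []
    for frame in frames:                          # frame I_t
        # (a) Proposal generation: sentence-conditioned object queries
        #     yield a box proposal for the referred object.
        box = grounding_model(frame, sentence)    # B_t^i
        # The box proposal is encoded into a position prompt embedding.
        prompt_emb = prompt_encoder(box)          # p_t^i
        # The frozen foundation model decodes the mask from the image
        # features and the text-aware position prompt.
        img_feat = image_encoder(frame)           # f_t
        masks.append(mask_decoder(img_feat, prompt_emb))
    return torch.stack(masks)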

Text-Contrastive Prompt Learning.

Formally, in addition to the input sentence $S^i$, we forward another sentence $S^j$ through our GroPrompt framework to obtain the output proposal $B_t^j$ for another object at each frame. To perform contrastive learning, we leverage the prompt encoder of the foundation segmentation model to extract the prompt embeddings $p_t^i$, $p_t^j$, and $\hat{p}_t^i$ for the proposals $B_t^i$ and $B_t^j$ and the ground-truth bounding box $\hat{B}_t^i$, respectively. Taking $p_t^i$, $\hat{p}_t^i$, and $p_t^j$ as the anchor, positive, and negative samples, the frame-level triplet contrastive loss $L_{contra}^f$ is computed as follows:
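A standard triplet-margin instantiation consistent with the description above is sketched below, averaging over the $T$ frames of the video; the distance $d(\cdot,\cdot)$ in the prompt-embedding space and the margin $m$ are assumed hyperparameters rather than details taken from the paper:

$$L_{contra}^{f} = \frac{1}{T}\sum_{t=1}^{T}\max\big(d(p_t^i, \hat{p}_t^i) - d(p_t^i, p_t^j) + m,\ 0\big)$$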

Modality-Contrastive Prompt Learning.

In addition to the prompt embedding $p_t^i$ derived in Text-Contrastive Prompt Learning, we also utilize the image encoder to extract the visual features $f_t$. By performing cross-attention at each frame with the prompt embedding $p_t^i$ as the query and the visual features $f_t$ as keys and values, followed by an average pooling layer for temporal aggregation, we encode the video-level content feature $f^i$ for the referred object. As for the referring sentences $S^i$ and $S^j$, we derive the sentence-level linguistic features $z^i$ and $z^j$ from the text encoder. The video-level triplet contrastive loss $L_{contra}^v$ is then computed as follows:
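Mirroring the frame-level loss, one natural triplet-margin form takes the video-level content feature $f^i$ as the anchor and the linguistic features $z^i$ and $z^j$ as the positive and negative samples; again, the distance $d(\cdot,\cdot)$ and margin $m$ are assumptions rather than details taken from the paper:

$$L_{contra}^{v} = \max\big(d(f^i, z^i) - d(f^i, z^j) + m,\ 0\big)$$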

Experiments

Table 1. Quantitative comparison to state-of-the-art methods on the validation split of Ref-YouTube-VOS and Ref-DAVIS17. RefYT: Ref-YouTube-VOS, RefD: Ref-DAVIS, RefC: RefCOCO [29, 54], AVOS: Audio-VOS [32], AVSB: AVSBench [57], YT: YouTube-VOS 2019 [46], D: DAVIS17 [33], O: Occluded VIS [34], LV: Long-term VOS [16], G: GOT-10K [17], La: LaSOT [13], T: TrackingNet [31], B: BDD100K [53], V: VIS19 [50].

Table 2. The quantitative evaluation on A2D-Sentences, with Precision@K, Overall IoU and Mean IoU.

Table 3. The quantitative evaluation on JHMDB-Sentences, with Precision@K, Overall IoU and Mean IoU.

Visualization

Figure 2. Qualitative comparisons with state-of-the-art methods on Ref-DAVIS17, where “GT-bbox + SAM” denotes the result of prompting SAM with ground-truth bounding boxes.

Figure 3. Qualitative comparisons with state-of-the-art methods on Ref-YouTube-VOS.

BibTeX

If you find this useful for your research, please consider citing:
@inproceedings{lin2024groprompt,
  author    = {Ci-Siang Lin and I-Jieh Liu and Min-Hung Chen and Chien-Yi Wang and Sifei Liu and Yu-Chiang Frank Wang},
  title     = {GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation},
  booktitle = {CVPRW},
  year      = {2024},
}