GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation

CVPR Workshop (CVinW) 2024

$^*$Equal contribution, $\dagger$Project Lead
$^1$National Taiwan University, $^2$NVIDIA

Abstract

Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence throughout the entire video. Most existing methods require end-to-end training with dense mask annotations, which can be computationally costly and less scalable. In this work, we aim to efficiently adapt foundation segmentation models to address RVOS from weak supervision with the proposed Grounded Prompting (GroPrompt) framework. More specifically, we propose Text-Aware Prompt Contrastive Learning (TAP-CL) to enhance the association between the position prompts and the referring sentences with only box supervision, including Text-Contrastive Prompt Learning (TextCon) and Modality-Contrastive Prompt Learning (ModalCon) at the frame level and video level, respectively. With the proposed TAP-CL, our GroPrompt framework can generate temporally consistent yet text-aware position prompts describing the locations and movements of the referred object throughout the video. Experimental results on the standard RVOS benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences) demonstrate the competitive performance of our proposed GroPrompt framework given only bounding box weak supervision.

Framework

Overview of our proposed GroPrompt framework. In (a), our proposal generation takes each frame $I_t$ and the referring sentence $S^i$ to derive object queries $Q^i_t$ and produce the prompt embedding $p^i_t$ for segmentation, with another sentence $S^j$ taken as input for Text-Contrastive Prompt Learning. In (b), to handle sentence descriptions containing long-term motions or actions in referring video object segmentation, we present Modality-Contrastive Prompt Learning to align the text with the referred object at the video level.
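For readers who prefer pseudocode, the sketch below illustrates this per-frame flow under the assumption of a SAM-style foundation model with separate image, prompt, and mask-decoding components; the module names (grounding_model, prompt_encoder, image_encoder, mask_decoder) are illustrative placeholders rather than the authors' actual interfaces.

import torch

def groprompt_forward(frames, sentence, grounding_model, prompt_encoder,
                      image_encoder, mask_decoder):
    """Segment the object referred to by `sentence` in each frame of the video."""
    masks = []
    for frame in frames:                          # frame I_t
        # (a) Proposal generation: sentence-conditioned object queries
        #     yield a box proposal for the referred object.
        box = grounding_model(frame, sentence)    # B_t^i
        # The box proposal is encoded into a position prompt embedding.
        prompt_emb = prompt_encoder(box)          # p_t^i
        # The frozen foundation model decodes the mask from the image
        # features and the text-aware position prompt.
        img_feat = image_encoder(frame)           # f_t
        masks.append(mask_decoder(img_feat, prompt_emb))
    return torch.stack(masks)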

Text-Contrastive Prompt Learning.

Formally, in addition to the input sentence $S^i$, we forward another sentence $S^j$ through our GroPrompt framework to obtain the output proposal $B_t^j$ for another object at each frame. To perform contrastive learning, we leverage the prompt encoder of the foundation segmentation model to extract the prompt embeddings $p_t^i$, $p_t^j$, and $\hat{p}_t^i$ for the proposals $B_t^i$ and $B_t^j$ and the ground-truth bounding box $\hat{B}_t^i$, respectively. Taking $p_t^i$, $\hat{p}_t^i$, and $p_t^j$ as the anchor, positive, and negative samples, the frame-level triplet contrastive loss $L_{contra}^f$ is computed as follows:
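A standard triplet-margin instantiation consistent with the description above is sketched below, averaging over the $T$ frames of the video; the distance $d(\cdot,\cdot)$ in the prompt-embedding space and the margin $m$ are assumed hyperparameters rather than details taken from the paper:

$$L_{contra}^{f} = \frac{1}{T}\sum_{t=1}^{T}\max\big(d(p_t^i, \hat{p}_t^i) - d(p_t^i, p_t^j) + m,\ 0\big)$$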

Modality-Contrastive Prompt Learning.

In addition to the prompt embedding $p_t^i$ derived in Text-Contrastive Prompt Learning, we also utilize the image encoder to extract the visual features $f_t$. By performing cross-attention at each frame with the prompt embedding $p_t^i$ as the query and the visual features $f_t$ as keys and values, followed by an average pooling layer for temporal aggregation, we encode the video-level content feature $f^i$ for the referred object. As for the referring sentences $S^i$ and $S^j$, we derive the sentence-level linguistic features $z^i$ and $z^j$ from the text encoder. The video-level triplet contrastive loss $L_{contra}^v$ is then computed as follows:
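Mirroring the frame-level loss, one natural triplet-margin form takes the video-level content feature $f^i$ as the anchor and the linguistic features $z^i$ and $z^j$ as the positive and negative samples; again, the distance $d(\cdot,\cdot)$ and margin $m$ are assumptions rather than details taken from the paper:

$$L_{contra}^{v} = \max\big(d(f^i, z^i) - d(f^i, z^j) + m,\ 0\big)$$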

Experiments

Table 1. Quantitative comparison to state-of-the-art methods on the validation split of Ref-YouTube-VOS and Ref-DAVIS17. RefYT: Ref-YouTube-VOS, RefD: Ref-DAVIS, RefC: RefCOCO [29, 54], AVOS: Audio-VOS [32], AVSB: AVSBench [57], YT: YouTube-VOS 2019 [46], D: DAVIS17 [33], O: Occluded VIS [34], LV: Long-term VOS [16], G: GOT-10K [17], La: LaSOT [13], T: TrackingNet [31], B: BDD100K [53], V: VIS19 [50].

Table 2. The quantitative evaluation on A2D-Sentences, with Precision@K, Overall IoU and Mean IoU.

Table 3. The quantitative evaluation on JHMDB-Sentences, with Precision@K, Overall IoU and Mean IoU.

Visualization

Figure 2. Qualitative comparisons with state-of-the-art methods on Ref-DAVIS17, where “GT-bbox + SAM” denotes the result of prompting SAM with ground-truth bounding boxes.

Figure 3. Qualitative comparisons with state-of-the-art methods on Ref-YouTube-VOS.

BibTeX

If you find this useful for your research, please consider citing:
@inproceedings{lin2024groprompt,
  author    = {Ci-Siang Lin and I-Jieh Liu and Min-Hung Chen and Chien-Yi Wang and Sifei Liu and Yu-Chiang Frank Wang},
  title     = {GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation},
  booktitle = {CVPRW},
  year      = {2024},
}