ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval (EMNLP 2025 Findings)
We propose ProPy (arXiv), a model with a systematic architectural adaptation of CLIP designed specifically for partially relevant video retrieval (PRVR).
```bash
conda create -n propy python=3.10
conda activate propy
conda install pytorch==1.12.0 torchvision==0.13.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```
We use a single RTX 3090 GPU (Driver version: 535.113.01) to run all experiments.
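After installation, a quick optional sanity check (our suggestion, not a step from the original setup) confirms that the CUDA build of PyTorch can see the GPU:

```bash
# Should print the torch version (1.12.0) and True if CUDA is usable.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```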
- Download raw videos of Charades, TVQA, ActivityNet, and QVHighlights.
- Note: you need to fill out forms for the TVQA and ActivityNet datasets.
- Compress the downloaded videos to 3 fps with width 224 using `scripts/prepare.sh` (a sketch of the underlying command follows this list).
- Note: you need to modify the corresponding paths in the script.
- Download the annotations (we convert the original annotations to a standard format) from Baidu or Google Drive, and unzip them to the `annotations` directory.
- Download the pretrained CLIP-ViT-B/32 weights to the `CLIP_weights` directory.
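For reference, the compression step in `scripts/prepare.sh` is expected to boil down to an ffmpeg call like the one below; this is a minimal sketch with placeholder paths, so check the script itself for the exact flags:

```bash
# Re-encode a video at 3 fps with width 224; height is scaled to keep the
# aspect ratio (-2 rounds it to an even value, which most codecs require).
ffmpeg -i /path/to/raw/video.mp4 -r 3 -vf "scale=224:-2" /path/to/compressed/video.mp4
```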
Modify `video_dir` in `scripts/*.sh` according to your local directories, then run:
```bash
bash scripts/prvr_{split}.sh
bash scripts/vcmr_{split}.sh
```
Checkpoints will be saved to `logs/prvr_{split}` or `logs/vcmr_{split}`.
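As a concrete example, the two steps above could look like this for a single split; the split name `charades`, the video path, and the assumption that `video_dir` is a plain shell assignment in the script are all hypothetical:

```bash
# Point video_dir at the local 3 fps videos (a shortcut for editing by hand),
# then launch PRVR training for that split.
sed -i 's|^video_dir=.*|video_dir=/data/charades/videos_3fps|' scripts/prvr_charades.sh
bash scripts/prvr_charades.sh
```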
Modify the following parameters in the same scripts to test trained models:
```bash
# for evaluation
do_train=0
do_eval=1
resume=/path/to/ckpt/ckpt.best.pth.tar
# then run
bash scripts/prvr_{split}.sh
bash scripts/vcmr_{split}.sh
```
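Putting this together with the checkpoint location from training: `resume` would typically point at the best checkpoint saved under `logs/`. The split name below is a placeholder:

```bash
# In scripts/prvr_charades.sh (hypothetical split), set:
#   do_train=0
#   do_eval=1
#   resume=logs/prvr_charades/ckpt.best.pth.tar
# then run:
bash scripts/prvr_charades.sh
```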
We provide all checkpoints and logs via Baidu and Google Drive.
PRVR results:

| split | R@1 | R@5 | R@10 | R@100 | SumR |
|---|---|---|---|---|---|
| TVR | 22.4 | 45.0 | 55.9 | 89.5 | 212.8 |
| ActivityNet | 14.9 | 34.9 | 47.5 | 82.7 | 180.0 |
| Charades | 2.6 | 8.7 | 14.8 | 50.4 | 76.5 |
| QVHighlights-val | 37.4 | 65.6 | 76.1 | 96.5 | 275.5 |
| QVHighlights-test | 35.0 | 63.2 | 73.1 | 96.2 | 267.5 |

VCMR results:

| split | R@10 (IoU=0.3) | R@100 (IoU=0.3) | R@10 (IoU=0.5) | R@100 (IoU=0.5) | R@10 (IoU=0.7) | R@100 (IoU=0.7) |
|---|---|---|---|---|---|---|
| TVR | 26.26 | 50.26 | 17.49 | 35.61 | 9.65 | 19.82 |
| ActivityNet | 28.57 | 57.42 | 20.81 | 46.22 | 12.94 | 31.85 |
| Charades | 6.8 | 23.39 | 4.73 | 18.44 | 2.26 | 9.01 |
| QVHighlights-val | 54.32 | 79.35 | 45.42 | 72.52 | 27.94 | 48.52 |
To reproduce the attention maps shown in Figure 4, run:
```bash
bash scripts/plot/plot_{split}.sh
```
These scripts select videos based on the R@1 metric, save the necessary attention weights, and then draw frame-level and event-level attention maps. Both the weights and the figures are saved to `VIS/{split}`.
This repo is built upon the following wonderful works:
