Abstract:
Zero-shot Natural Language Video Localization (NLVL) aims to automatically generate moments and corresponding pseudo queries from raw videos to train the localization model without any manual annotations. Existing approaches typically produce pseudo queries as simple words, which overlooks the complexity of queries in real-world scenarios. Given the powerful text modeling capabilities of large language models (LLMs), leveraging LLMs to generate complete queries that are closer to human descriptions is a promising solution. However, directly integrating LLMs into existing approaches introduces several issues, namely insensitivity, isolation, and lack of regulation, which prevent the full exploitation of LLMs for enhancing zero-shot NLVL performance. To address these issues, we propose BTDP, an innovative framework for Boundary-aware Temporal Dynamic Pseudo-supervision pairs generation. Our method contains two crucial operations: 1) Boundary Segmentation, which identifies both visual boundaries and semantic boundaries to generate atomic segments and activity descriptions, tackling the issue of insensitivity; and 2) Context Aggregation, which employs LLMs with a self-evaluation process to aggregate and summarize global video information into optimized pseudo moment-query pairs, tackling the issues of isolation and lack of regulation. Comprehensive experimental results on the Charades-STA and ActivityNet Captions datasets demonstrate the effectiveness of our BTDP method. © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
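The abstract describes the two operations only at a high level; the Python sketch below is purely illustrative of how such a two-stage pipeline could be wired together under our own assumptions. The `Segment` dataclass, the `boundary_score`, `describe`, `llm`, and `accept` callables, and the retry-based self-evaluation loop are hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    start: float       # segment start time in seconds
    end: float         # segment end time in seconds
    description: str   # activity description of this atomic segment


def boundary_segmentation(
    frame_times: List[float],
    boundary_score: Callable[[float, float], float],  # hypothetical visual/semantic change score
    describe: Callable[[float, float], str],          # hypothetical segment captioner
    threshold: float = 0.5,
) -> List[Segment]:
    """Cut the video into atomic segments at detected boundaries and
    attach an activity description to each segment."""
    segments: List[Segment] = []
    start = frame_times[0]
    for prev_t, cur_t in zip(frame_times, frame_times[1:]):
        if boundary_score(prev_t, cur_t) > threshold:
            segments.append(Segment(start, prev_t, describe(start, prev_t)))
            start = cur_t
    segments.append(Segment(start, frame_times[-1], describe(start, frame_times[-1])))
    return segments


def context_aggregation(
    segments: List[Segment],
    llm: Callable[[str], str],       # hypothetical LLM text interface
    accept: Callable[[str], bool],   # hypothetical self-evaluation check
    max_retries: int = 3,
) -> List[Tuple[Tuple[float, float], str]]:
    """Summarize each segment in the context of the whole video into a
    pseudo query, regenerating when the self-evaluation step rejects it."""
    global_context = " ".join(s.description for s in segments)
    pairs: List[Tuple[Tuple[float, float], str]] = []
    for seg in segments:
        prompt = (
            f"Overall video context: {global_context}\n"
            f"Target segment activity: {seg.description}\n"
            "Write one natural-language query describing this moment."
        )
        query = llm(prompt)
        for _ in range(max_retries):
            if accept(query):
                break
            query = llm(prompt)  # ask again if the self-check fails
        pairs.append(((seg.start, seg.end), query))
    return pairs
```

The resulting `(moment, query)` pairs would then serve as pseudo supervision for training a localization model, in line with the zero-shot setting described above.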
Source: Proceedings of the AAAI Conference on Artificial Intelligence
ISSN: 2159-5399
Year: 2025
Issue: 3
Volume: 39
Page: 2717-2725
Language: English