A New Memory Based on Sequence to Sequence Model for Video Captioning

Authors：

Lin, Jin-Cheng ^[1] | Zhang, Chun-Yang ^[2] (Scholars：张春阳)

Abstract：

In the field of computer vision, video captioning is an important and meaningful task: it automatically generates textual descriptions of video content. It is challenging because of the difficulty of understanding the objects and activities in a video. Benefiting from the rapid development of deep learning technology, e.g. the sequence-to-sequence model, video captioning has achieved very accurate results. However, two serious flaws remain. First, because training is highly time-consuming, pre-trained deep models are often used as visual feature extractors, so the generalization of the features produced by these pre-trained encoders is limited when the networks are employed directly in video captioning tasks. Second, each frame in the video is processed separately, ignoring the correlation of video data in the time dimension. In this work, we propose a video captioning model with an attention-memory module, built on the sequence-to-sequence model, to explore the role of capturing temporal correlations. It shows the importance of temporal structure to vision tasks by incorporating the correlation of video data when extracting features and by enhancing the time-memory capability. Our experiments are based on two of the best-known benchmark datasets in the field of video captioning: MSVD and MSR-VTT. We then employ BLEU and METEOR to evaluate the accuracy of the descriptions generated by different methods. Finally, the experimental results confirm that the proposed model makes significant improvements in description results compared with the baseline models. © 2021 IEEE.

Keywords：

Computer vision; Deep learning

Community：

[1] [Lin, Jin-Cheng] Fuzhou University, College of Mathematics and Computer Science, Fuzhou, China
[2] [Zhang, Chun-Yang] Fuzhou University, College of Mathematics and Computer Science, Fuzhou, China

Source ：

Year： 2021

Page： 470-476

Language： English

SCOPUS Cited Count： 5

ESI Highly Cited Papers on the List： 0

Affiliated Colleges：

College of Computer and Data Science / College of Software: data not clearly attributed within this college/department
