
Author:

Zheng, Ming-Kui [1] | Wu, Kong-Xian [2] | Qiu, Xin-Tao [3] | Zheng, Hai-Feng [4] | Zhao, Tie-Song [5]

Indexed by:

EI

Abstract:

Video prediction is a fundamental task in computer vision that aims to predict future frames from a sequence of historical frames. It is a dense, pixel-level prediction task with broad application value in fields such as autonomous driving, traffic flow prediction, and weather forecasting. Traditional video prediction methods typically rely on autoregressive architectures, which follow a recurrent strategy: the predicted frame at one step is fed back as the input for the next step. However, current models still face unresolved challenges. In particular, many existing approaches downsample the video data with strided convolutions, which inevitably discards pixels and neglects local details, thereby compromising the clarity of the predicted results. To mitigate this issue, non-autoregressive models have been proposed; they adopt a multi-frame-input, multi-frame-output architecture that generates future frames in parallel, breaking away from the recurrent framework and effectively avoiding the accumulation of prediction errors. Nevertheless, pressing issues remain. Objects in videos often exhibit irregular motion, and the variability of video content together with the multiple possible motion trajectories makes it difficult for network models to predict image motion accurately, resulting in blurred details in the predicted frames. To tackle this challenge, this paper introduces a novel approach that exploits the properties of wavelet transforms to learn feature-domain structure and texture separately, thereby improving the quality of video prediction. Under this separated structure, the low-frequency structural information, with details removed, exhibits stronger temporal correlation, which supports more accurate spatiotemporal prediction of image regions, while the high-frequency detail features are learned by an independent enhancement module to improve the local quality of the prediction. In addition, a two-level wavelet transform performs downsampling that reduces image resolution without losing pixel information, and the corresponding upsampling is realized by the inverse wavelet transform. This symmetric structure maximizes the retention of image information and allows subsequent frames to be predicted more accurately. Furthermore, this paper designs a multi-scale 3D decoupled convolution module that uses convolutional kernels of different sizes to learn regional features at various scales. The module decouples traditional 3D convolution into 2D and 1D convolutions; this decoupling focuses on learning the spatial and temporal characteristics of the low-frequency structure, which not only improves predictive performance but also reduces the model's parameter count and memory consumption, enabling the model to capture both short-term and long-term temporal dependencies more effectively and thus enhancing the accuracy and coherence of video prediction. Finally, a small-scale high-frequency detail enhancement module is designed to learn the decomposed high-frequency information and predict image details and textures, enhancing the local quality of video prediction. Experimental results on synthetic and real-world datasets show that the proposed algorithm achieves higher prediction accuracy than existing algorithms, with more accurate predictions of both local details and overall morphology. In particular, its MSE on the Moving MNIST dataset is 15.7, which is 34%, 20.7%, 11.3%, and 4.8% lower than that of the state-of-the-art algorithms SimVP, TAU, SwinLSTM, and VMRNN, respectively. © 2025 Science Press. All rights reserved.
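The abstract's claim that a wavelet transform can downsample without losing pixel information can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example (not the authors' code): a single-level 2D Haar transform used as a lossless downsampling step, together with its exact inverse; the paper's two-level decomposition corresponds to applying the same transform again to the low-frequency subband. All function names here are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a single-level 2-D Haar wavelet
# transform used as a lossless "downsampling" step, plus its exact inverse.
# Stacking the four subbands turns an (N, C, H, W) tensor into
# (N, 4C, H/2, W/2) without discarding any pixels.
import torch

def haar_dwt2d(x: torch.Tensor) -> torch.Tensor:
    """x: (N, C, H, W) with even H and W -> (N, 4C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-frequency structure
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return torch.cat([ll, lh, hl, hh], dim=1)

def haar_idwt2d(y: torch.Tensor) -> torch.Tensor:
    """Exact inverse of haar_dwt2d: (N, 4C, H/2, W/2) -> (N, C, H, W)."""
    ll, lh, hl, hh = torch.chunk(y, 4, dim=1)
    n, c, h, w = ll.shape
    x = ll.new_zeros(n, c, 2 * h, 2 * w)
    x[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

frames = torch.randn(2, 1, 64, 64)        # e.g. Moving MNIST-sized frames
coeffs = haar_dwt2d(frames)               # 64x64 -> 32x32, channels x4
restored = haar_idwt2d(coeffs)            # symmetric upsampling path
assert torch.allclose(frames, restored, atol=1e-5)  # no pixel information lost
```

Because the transform is invertible, the symmetric downsampling/upsampling path retains all image information, in contrast to strided convolution.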
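Similarly, the multi-scale 3D decoupled convolution described in the abstract can be sketched as follows. This is an assumed illustration of how such a module might look in PyTorch, factorizing a full 3D convolution into a 1×k×k spatial convolution followed by a k×1×1 temporal convolution, with parallel branches of different spatial kernel sizes; the authors' actual module may differ in its details.

```python
# Minimal sketch (an assumption, not the paper's exact module): a "decoupled"
# 3-D convolution that factorises a k x k x k kernel into a spatial 1 x k x k
# convolution followed by a temporal k x 1 x 1 convolution, with several
# spatial kernel sizes in parallel for multi-scale features.
# Tensor layout: (N, C, T, H, W).
import torch
import torch.nn as nn

class DecoupledConv3d(nn.Module):
    """2-D spatial conv + 1-D temporal conv instead of a full 3-D conv."""
    def __init__(self, channels: int, spatial_k: int, temporal_k: int = 3):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels,
                                 kernel_size=(1, spatial_k, spatial_k),
                                 padding=(0, spatial_k // 2, spatial_k // 2))
        self.temporal = nn.Conv3d(channels, channels,
                                  kernel_size=(temporal_k, 1, 1),
                                  padding=(temporal_k // 2, 0, 0))
        self.act = nn.GELU()

    def forward(self, x):
        return self.temporal(self.act(self.spatial(x)))

class MultiScaleDecoupledBlock(nn.Module):
    """Parallel decoupled branches with different spatial kernels (e.g. 3/5/7)."""
    def __init__(self, channels: int, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [DecoupledConv3d(channels, k) for k in kernels])
        self.fuse = nn.Conv3d(channels * len(kernels), channels, kernel_size=1)

    def forward(self, x):
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(y)  # residual fusion of the multi-scale features

x = torch.randn(1, 16, 10, 32, 32)   # (batch, channels, frames, H, W)
print(MultiScaleDecoupledBlock(16)(x).shape)  # torch.Size([1, 16, 10, 32, 32])

# Parameter comparison: one full 3x3x3 Conv3d vs. the 1x3x3 + 3x1x1 pair.
full, pair = nn.Conv3d(16, 16, 3, padding=1), DecoupledConv3d(16, 3)
print(sum(p.numel() for p in full.parameters()),   # 6928
      sum(p.numel() for p in pair.parameters()))   # 3104
```

The parameter comparison at the end illustrates the abstract's point that factorizing the 3D kernel reduces parameters and memory while still covering both spatial and temporal dimensions.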

Keyword:

Computer vision; Convolution; Image compression; Image enhancement; Image resolution; Inverse problems; Inverse transforms; Pixels; Prediction models; Textures; Video analysis; Video recording; Wavelet transforms; Weather forecasting

Community:

  • [ 1 ] [Zheng, Ming-Kui] School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
  • [ 2 ] [Wu, Kong-Xian] School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
  • [ 3 ] [Qiu, Xin-Tao] School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
  • [ 4 ] [Zheng, Hai-Feng] School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
  • [ 5 ] [Zhao, Tie-Song] School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China

Reprint's Address:

Email:


Related Keywords:

Related Article:

Source:

Chinese Journal of Computers

ISSN: 0254-4164

Year: 2025

Issue: 8

Volume: 48

Page: 1832-1847

Cited Count:

WoS CC Cited Count:

SCOPUS Cited Count:

ESI Highly Cited Papers on the List: 0

WanFang Cited Count:

Chinese Cited Count:


Affiliated Colleges:
