
Author:

Zheng, Ming-Kui [1] | Wu, Kong-Xian [2] | Qiu, Xin-Tao [3] | Zheng, Hai-Feng [4] | Zhao, Tie-Song [5]

Indexed by:

EI

Abstract:

Video prediction is a fundamental task in computer vision that aims to predict future frames from a sequence of historical frames. It is a dense, pixel-level prediction task with broad application value in fields such as autonomous driving, traffic flow prediction, and weather forecasting. Traditional video prediction methods typically rely on autoregressive architectures, which follow a recurrent strategy: the predicted frame at one step is fed back as the input for the next step. However, current models still face unresolved challenges. In particular, many existing approaches downsample the video data with strided convolutions, which inevitably discards pixels and neglects local details, thereby compromising the clarity of the predicted results. To mitigate this issue, non-autoregressive models have been proposed; they adopt a multi-frame-input, multi-frame-output architecture that generates future frames in parallel, breaking away from the recurrent framework and effectively avoiding the accumulation of prediction errors. Nevertheless, pressing issues remain. Objects in videos often exhibit irregular motion, and the variability of video content together with the multiple possible motion trajectories makes it difficult for network models to predict image motion accurately, resulting in blurred details in the predicted frames. To tackle this challenge, this paper introduces a novel approach that exploits the properties of wavelet transforms to learn feature-domain structure and texture separately, thereby improving the quality of video prediction. Under this separated structure, the low-frequency structural information, with details removed, exhibits stronger temporal correlation, which supports more accurate spatiotemporal prediction of image regions, while the high-frequency detail features are learned by an independent enhancement module to improve the local quality of the prediction. In addition, a two-level wavelet transform performs downsampling that reduces image resolution without losing pixel information, and the corresponding upsampling is realized by the inverse wavelet transform. This symmetric structure maximizes the retention of image information and allows subsequent frames to be predicted more accurately. Furthermore, this paper designs a multi-scale 3D decoupled convolution module that uses convolutional kernels of different sizes to learn regional features at various scales. The module decouples traditional 3D convolution into 2D and 1D convolutions; this decoupling focuses on learning the spatial and temporal characteristics of the low-frequency structure, which not only improves predictive performance but also reduces the model's parameter count and memory consumption, enabling the model to capture both short-term and long-term temporal dependencies more effectively and thus enhancing the accuracy and coherence of video prediction. Finally, a small-scale high-frequency detail enhancement module is designed to learn the decomposed high-frequency information and predict image details and textures, enhancing the local quality of video prediction. Experimental results on synthetic and real-world datasets show that the proposed algorithm achieves higher prediction accuracy than existing algorithms, with more accurate predictions of both local details and overall morphology. In particular, its MSE on the Moving MNIST dataset is 15.7, which is 34%, 20.7%, 11.3%, and 4.8% lower than that of the state-of-the-art algorithms SimVP, TAU, SwinLSTM, and VMRNN, respectively. © 2025 Science Press. All rights reserved.
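The abstract's claim that a wavelet transform can downsample without losing pixel information can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example (not the authors' code): a single-level 2D Haar transform used as a lossless downsampling step, together with its exact inverse; the paper's two-level decomposition corresponds to applying the same transform again to the low-frequency subband. All function names here are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a single-level 2-D Haar wavelet
# transform used as a lossless "downsampling" step, plus its exact inverse.
# Stacking the four subbands turns an (N, C, H, W) tensor into
# (N, 4C, H/2, W/2) without discarding any pixels.
import torch

def haar_dwt2d(x: torch.Tensor) -> torch.Tensor:
    """x: (N, C, H, W) with even H and W -> (N, 4C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-frequency structure
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return torch.cat([ll, lh, hl, hh], dim=1)

def haar_idwt2d(y: torch.Tensor) -> torch.Tensor:
    """Exact inverse of haar_dwt2d: (N, 4C, H/2, W/2) -> (N, C, H, W)."""
    ll, lh, hl, hh = torch.chunk(y, 4, dim=1)
    n, c, h, w = ll.shape
    x = ll.new_zeros(n, c, 2 * h, 2 * w)
    x[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

frames = torch.randn(2, 1, 64, 64)        # e.g. Moving MNIST-sized frames
coeffs = haar_dwt2d(frames)               # 64x64 -> 32x32, channels x4
restored = haar_idwt2d(coeffs)            # symmetric upsampling path
assert torch.allclose(frames, restored, atol=1e-5)  # no pixel information lost
```

Because the transform is invertible, the symmetric downsampling/upsampling path retains all image information, in contrast to strided convolution.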
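Similarly, the multi-scale 3D decoupled convolution described in the abstract can be sketched as follows. This is an assumed illustration of how such a module might look in PyTorch, factorizing a full 3D convolution into a 1×k×k spatial convolution followed by a k×1×1 temporal convolution, with parallel branches of different spatial kernel sizes; the authors' actual module may differ in its details.

```python
# Minimal sketch (an assumption, not the paper's exact module): a "decoupled"
# 3-D convolution that factorises a k x k x k kernel into a spatial 1 x k x k
# convolution followed by a temporal k x 1 x 1 convolution, with several
# spatial kernel sizes in parallel for multi-scale features.
# Tensor layout: (N, C, T, H, W).
import torch
import torch.nn as nn

class DecoupledConv3d(nn.Module):
    """2-D spatial conv + 1-D temporal conv instead of a full 3-D conv."""
    def __init__(self, channels: int, spatial_k: int, temporal_k: int = 3):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels,
                                 kernel_size=(1, spatial_k, spatial_k),
                                 padding=(0, spatial_k // 2, spatial_k // 2))
        self.temporal = nn.Conv3d(channels, channels,
                                  kernel_size=(temporal_k, 1, 1),
                                  padding=(temporal_k // 2, 0, 0))
        self.act = nn.GELU()

    def forward(self, x):
        return self.temporal(self.act(self.spatial(x)))

class MultiScaleDecoupledBlock(nn.Module):
    """Parallel decoupled branches with different spatial kernels (e.g. 3/5/7)."""
    def __init__(self, channels: int, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [DecoupledConv3d(channels, k) for k in kernels])
        self.fuse = nn.Conv3d(channels * len(kernels), channels, kernel_size=1)

    def forward(self, x):
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(y)  # residual fusion of the multi-scale features

x = torch.randn(1, 16, 10, 32, 32)   # (batch, channels, frames, H, W)
print(MultiScaleDecoupledBlock(16)(x).shape)  # torch.Size([1, 16, 10, 32, 32])

# Parameter comparison: one full 3x3x3 Conv3d vs. the 1x3x3 + 3x1x1 pair.
full, pair = nn.Conv3d(16, 16, 3, padding=1), DecoupledConv3d(16, 3)
print(sum(p.numel() for p in full.parameters()),   # 6928
      sum(p.numel() for p in pair.parameters()))   # 3104
```

The parameter comparison at the end illustrates the abstract's point that factorizing the 3D kernel reduces parameters and memory while still covering both spatial and temporal dimensions.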

Keyword:

Computer vision; Convolution; Image compression; Image enhancement; Image resolution; Inverse problems; Inverse transforms; Pixels; Prediction models; Textures; Video analysis; Video recording; Wavelet transforms; Weather forecasting

Community:

  • [ 1 ] [Zheng, Ming-Kui] School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
  • [ 2 ] [Wu, Kong-Xian] School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
  • [ 3 ] [Qiu, Xin-Tao] School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
  • [ 4 ] [Zheng, Hai-Feng] School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
  • [ 5 ] [Zhao, Tie-Song] School of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China

Reprint's Address:

Email:


Related Keywords:

Related Article:

Source:

Chinese Journal of Computers

ISSN: 0254-4164

Year: 2025

Issue: 8

Volume: 48

Page: 1832-1847

Cited Count:

WoS CC Cited Count:

SCOPUS Cited Count:

ESI Highly Cited Papers on the List: 0

WanFang Cited Count:

Chinese Cited Count:


Affiliated Colleges:
