The training and inference performance optimization of Resnet50 on CUDA RTX 4090 GPU using DALI and AMP

Author:

Lin, JunJie [1] (林俊杰)

Indexed by:

EI, Scopus

Abstract:

As a deep residual network, Resnet50 is of considerable practical value in image classification, object recognition, and semantic image recognition. In this paper, an Nvidia RTX 4090 GPU is used to conduct detailed performance testing and bottleneck analysis of Resnet50 training and inference, measuring both computation latency and data-processing latency under different batch sizes. To evaluate the overall acceleration of Resnet50, we apply two optimizations on top of GPU computing: first, mixed precision (AMP) to improve GPU training and inference efficiency, and second, DALI to optimize data preprocessing and reduce data-loading latency. The experimental results show that at a batch size of 256, mixed precision improves computational performance by about 90% over FP32, but the overall end-to-end improvement is small. When mixed precision and DALI are used together to optimize both GPU computation and data loading, overall training and inference performance improve by 1.4 and 2.5 times, respectively. These results show that mixed precision alone cannot improve the overall computing efficiency of the system, because data-loading time frequently limits end-to-end performance. End users therefore obtain a significant speedup only when GPU computation and data-loading latency are optimized together. This work helps in evaluating and improving the acceleration of GPU-based deep neural networks. © 2024 SPIE.
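The abstract's central argument, that faster GPU compute alone barely changes end-to-end throughput when data loading dominates, can be illustrated with a small Amdahl-style model. This is a minimal sketch, not the paper's methodology: the per-batch timings are hypothetical, the 1.9x compute factor only loosely mirrors the "about 90%" mixed-precision figure, the 3x loading factor for DALI is an assumption, and the two modes assume either a fully serial loop or perfect overlap (prefetching) of loading and compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepTimes:
    """Hypothetical per-batch latencies in milliseconds for one step."""
    load_ms: float     # reading, decoding and augmenting one batch
    compute_ms: float  # GPU forward/backward time for the same batch


def step_time(t: StepTimes, overlapped: bool) -> float:
    """Wall time per batch.

    Serial loop: the GPU waits for each batch, so the two stages add up.
    Overlapped loop (prefetching): the slower stage sets the pace.
    """
    return max(t.load_ms, t.compute_ms) if overlapped else t.load_ms + t.compute_ms


def end_to_end_speedup(base: StepTimes, opt: StepTimes, overlapped: bool) -> float:
    """Ratio of baseline to optimized per-batch wall time."""
    return step_time(base, overlapped) / step_time(opt, overlapped)


# Illustrative numbers only, NOT measurements from the paper.
FP32 = StepTimes(load_ms=300.0, compute_ms=200.0)
AMP_ONLY = StepTimes(load_ms=300.0, compute_ms=200.0 / 1.9)  # compute ~1.9x faster
AMP_DALI = StepTimes(load_ms=100.0, compute_ms=200.0 / 1.9)  # loading also assumed 3x faster

if __name__ == "__main__":
    for name, cfg in [("AMP only", AMP_ONLY), ("AMP + DALI", AMP_DALI)]:
        for overlapped in (False, True):
            mode = "overlapped" if overlapped else "serial"
            speedup = end_to_end_speedup(FP32, cfg, overlapped)
            print(f"{name:10s} {mode:10s} speedup = {speedup:.2f}x")
```

In this toy model, speeding up compute alone gives only about 1.2x in the serial loop and no gain at all once loading is prefetched, because loading remains the slower stage; shrinking the loading time as well is what moves the end-to-end number.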

Keywords:

Batch data processing; Computer vision; Data handling; Deep neural networks; Efficiency; Graphics processing unit; Image classification; Semantics

Affiliation:

[1] Lin, JunJie: Maynooth College of Engineering, Fuzhou University, Fuzhou, Fujian 350108, China

Reprint Author:

Lin, JunJie (林俊杰)

Version:

The training and inference performance optimization of Resnet50 on CUDA RTX 4090 GPU using DALI and AMP
2024, Proceedings of SPIE - The International Society for Optical Engineering

Related Papers:

UGCM-LU: A Unified Stream and Batch Graph Computing Model with Local Update for Community Detection
2025, 19th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2024
A Unified Stream and Batch Graph Computing Model for Community Detection
2023, 17th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2022
Preparation of thioacetamide modified silica gel and its application in the extraction of copper and lead
2012, Polymeric Materials Science and Engineering
Combining hybrid modelling and transfer learning to simulate fed-batch bioprocess under uncertainty
2025, 14th IFAC Symposium on Dynamics and Control of Process Systems, including Biosystems, DYCOPS 2025

Source:

ISSN: 0277-786X

Year: 2024

Volume: 13184

Language: English

ESI Highly Cited Papers on the List: 0

Affiliated Colleges:

College of Electrical Engineering and Automation (records not explicitly assigned to a department of this college)
