Detailed Information


Prediction-based GPU sharing for distributed training

Full metadata record
dc.contributor.author: Shin, Changyong
dc.contributor.author: Go, Younghun
dc.contributor.author: Yoo, Yeonho
dc.contributor.author: Jeong, Jinwoo
dc.contributor.author: Hwang, Jaehyun
dc.contributor.author: Yang, Gyeongsik
dc.contributor.author: Yoo, Chuck
dc.date.accessioned: 2026-03-04T01:30:16Z
dc.date.available: 2026-03-04T01:30:16Z
dc.date.issued: 2026-08
dc.identifier.issn: 0167-739X
dc.identifier.issn: 1872-7115
dc.identifier.uri: https://scholarworks.dongguk.edu/handle/sw.dongguk/63845
dc.description.abstract: GPU sharing aims to enhance the efficiency of GPU utilization by running distributed deep learning training jobs concurrently. However, GPU sharing poses a significant challenge: the increase in job completion time (JCT) caused by interference between jobs is inconsistent, complicating job scheduling. Our experiments reveal that the degree of JCT increase varies by as much as 3.7x. While previous studies have analyzed this JCT inconsistency problem, none of them have been able to minimize the inconsistency. We propose TensorShare, a proactive GPU sharing technique that leverages a deep learning model to predict the extent of JCT increase. This study defines a new metric, called GPU SLA, which represents the upper threshold of JCT increase. TensorShare then introduces a novel scheduler that proactively identifies which jobs meet GPU SLA while minimizing the JCT increase. Our evaluation shows that TensorShare improves GPU SLA satisfaction rates by 26.1x-47.3x and reduces the JCT increase by 37%-60%. Furthermore, we evaluate TensorShare with large language models that are not included in training TensorShare's prediction model, achieving 7x and 10.3x improvements in GPU SLA satisfaction and JCT inconsistency, respectively.
dc.format.extent: 14
dc.language: English
dc.language.iso: ENG
dc.publisher: ELSEVIER
dc.title: Prediction-based GPU sharing for distributed training
dc.type: Article
dc.publisher.location: Netherlands
dc.identifier.doi: 10.1016/j.future.2026.108413
dc.identifier.wosid: 001689925100001
dc.identifier.bibliographicCitation: Future Generation Computer Systems, v.181, pp 1-14
dc.citation.title: Future Generation Computer Systems
dc.citation.volume: 181
dc.citation.startPage: 1
dc.citation.endPage: 14
dc.type.docType: Article
dc.description.isOpenAccess: Y
dc.description.journalRegisteredClass: scie
dc.description.journalRegisteredClass: scopus
dc.relation.journalResearchArea: Computer Science
dc.relation.journalWebOfScienceCategory: Computer Science, Theory & Methods
dc.subject.keywordAuthor: Cloud computing
dc.subject.keywordAuthor: GPU Sharing
dc.subject.keywordAuthor: Service level agreement
dc.subject.keywordAuthor: Performance prediction
dc.subject.keywordAuthor: GPU Scheduling
Files in This Item
There are no files associated with this item.
Appears in Collections
ETC > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Yoo, Yeon Ho
College of Advanced Convergence Engineering (Department of Computer Science and Artificial Intelligence)