Detailed Information


Prediction-based GPU sharing for distributed training

Full metadata record
dc.contributor.author: Shin, Changyong
dc.contributor.author: Go, Younghun
dc.contributor.author: Yoo, Yeonho
dc.contributor.author: Jeong, Jinwoo
dc.contributor.author: Hwang, Jaehyun
dc.contributor.author: Yang, Gyeongsik
dc.contributor.author: Yoo, Chuck
dc.date.accessioned: 2026-03-04T01:30:16Z
dc.date.available: 2026-03-04T01:30:16Z
dc.date.issued: 2026-08
dc.identifier.issn: 0167-739X
dc.identifier.issn: 1872-7115
dc.identifier.uri: https://scholarworks.dongguk.edu/handle/sw.dongguk/63845
dc.description.abstract: GPU sharing aims to enhance the efficiency of GPU utilization by running distributed deep learning training jobs concurrently. However, GPU sharing poses a significant challenge: the increase in job completion time (JCT) caused by interference between jobs is inconsistent, complicating job scheduling. Our experiments reveal that the degree of JCT increase varies by as much as 3.7x. While previous studies have analyzed this JCT inconsistency problem, none of them have been able to minimize the inconsistency. We propose TensorShare, a proactive GPU sharing technique that leverages a deep learning model to predict the extent of JCT increase. This study defines a new metric, called GPU SLA, which represents the upper threshold of JCT increase. TensorShare then introduces a novel scheduler that proactively identifies which jobs meet GPU SLA while minimizing the JCT increase. Our evaluation shows that TensorShare improves GPU SLA satisfaction rates by 26.1x-47.3x and reduces the JCT increase by 37%-60%. Furthermore, we evaluate TensorShare with large language models that are not included in training TensorShare's prediction model, achieving 7x and 10.3x improvements in GPU SLA satisfaction and JCT inconsistency, respectively.
dc.format.extent: 14
dc.language: English
dc.language.iso: ENG
dc.publisher: ELSEVIER
dc.title: Prediction-based GPU sharing for distributed training
dc.type: Article
dc.publisher.location: Netherlands
dc.identifier.doi: 10.1016/j.future.2026.108413
dc.identifier.wosid: 001689925100001
dc.identifier.bibliographicCitation: Future Generation Computer Systems, v.181, pp 1-14
dc.citation.title: Future Generation Computer Systems
dc.citation.volume: 181
dc.citation.startPage: 1
dc.citation.endPage: 14
dc.type.docType: Article
dc.description.isOpenAccess: Y
dc.description.journalRegisteredClass: scie
dc.description.journalRegisteredClass: scopus
dc.relation.journalResearchArea: Computer Science
dc.relation.journalWebOfScienceCategory: Computer Science, Theory & Methods
dc.subject.keywordAuthor: Cloud computing
dc.subject.keywordAuthor: GPU Sharing
dc.subject.keywordAuthor: Service level agreement
dc.subject.keywordAuthor: Performance prediction
dc.subject.keywordAuthor: GPU Scheduling
Files in This Item
There are no files associated with this item.
Appears in Collections
ETC > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Yoo, Yeon Ho
College of Advanced Convergence Engineering (Department of Computer Science and Artificial Intelligence)