Prediction-based GPU sharing for distributed training
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Shin, Changyong | - |
| dc.contributor.author | Go, Younghun | - |
| dc.contributor.author | Yoo, Yeonho | - |
| dc.contributor.author | Jeong, Jinwoo | - |
| dc.contributor.author | Hwang, Jaehyun | - |
| dc.contributor.author | Yang, Gyeongsik | - |
| dc.contributor.author | Yoo, Chuck | - |
| dc.date.accessioned | 2026-03-04T01:30:16Z | - |
| dc.date.available | 2026-03-04T01:30:16Z | - |
| dc.date.issued | 2026-08 | - |
| dc.identifier.issn | 0167-739X | - |
| dc.identifier.issn | 1872-7115 | - |
| dc.identifier.uri | https://scholarworks.dongguk.edu/handle/sw.dongguk/63845 | - |
| dc.description.abstract | GPU sharing aims to enhance the efficiency of GPU utilization by running distributed deep learning training jobs concurrently. However, GPU sharing poses a significant challenge: the increase in job completion time (JCT) caused by interference between jobs is inconsistent, complicating job scheduling. Our experiments reveal that the degree of JCT increase varies by as much as 3.7x. While previous studies have analyzed this JCT inconsistency problem, none of them have been able to minimize the inconsistency. We propose TensorShare, a proactive GPU sharing technique that leverages a deep learning model to predict the extent of JCT increase. This study defines a new metric, called GPU SLA, which represents the upper threshold of JCT increase. TensorShare then introduces a novel scheduler that proactively identifies which jobs meet GPU SLA while minimizing the JCT increase. Our evaluation shows that TensorShare improves GPU SLA satisfaction rates by 26.1x-47.3x and reduces the JCT increase by 37%-60%. Furthermore, we evaluate TensorShare with large language models that are not included in training TensorShare's prediction model, achieving 7x and 10.3x improvements in GPU SLA satisfaction and JCT inconsistency, respectively. | - |
| dc.format.extent | 14 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.publisher | ELSEVIER | - |
| dc.title | Prediction-based GPU sharing for distributed training | - |
| dc.type | Article | - |
| dc.publisher.location | Netherlands | - |
| dc.identifier.doi | 10.1016/j.future.2026.108413 | - |
| dc.identifier.wosid | 001689925100001 | - |
| dc.identifier.bibliographicCitation | Future Generation Computer Systems, v.181, pp 1 - 14 | - |
| dc.citation.title | Future Generation Computer Systems | - |
| dc.citation.volume | 181 | - |
| dc.citation.startPage | 1 | - |
| dc.citation.endPage | 14 | - |
| dc.type.docType | Article | - |
| dc.description.isOpenAccess | Y | - |
| dc.description.journalRegisteredClass | scie | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.relation.journalResearchArea | Computer Science | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Theory & Methods | - |
| dc.subject.keywordAuthor | Cloud computing | - |
| dc.subject.keywordAuthor | GPU Sharing | - |
| dc.subject.keywordAuthor | Service level agreement | - |
| dc.subject.keywordAuthor | Performance prediction | - |
| dc.subject.keywordAuthor | GPU Scheduling | - |
