Detailed Information

RNGD: A 5nm Tensor-Contraction Processor for Power-Efficient Inference on Large Language Models

Authors
Lee, Sang Min; Kim, Hanjoon; Yeon, Jeseung; Kim, Minho; Park, Changjae; Bae, Byeongwook; Cha, Yojung; Choe, Wooyoung; Choi, Jonguk; Choi, Younggeun; Han, Ki Jin; Hwang, Seokha; Jang, Kiseok; Jeon, Jaewoo; Jeong, Hyunmin; Jung, Yeonsu; Kim, Hyewon; Kim, Sewon; Kim, Suhyung; Kim, Won; Kim, Yongseung; Kim, Youngsik; Kwon, Hyukdong; Lee, Jeong Ki; Lee, Juyun; Lee, Kyungjae; Lee, Seokho; Noh, Minwoo; Park, Junyoung; Seo, Jimin; Paik, June
Issue Date
Feb-2025
Publisher
IEEE
Keywords
Matrix Algebra; Parallel Architectures; Pipeline Processing Systems; Problem Oriented Languages; Computational Task; Data Locality; High Memory Bandwidth; Language Model; Machine Learning Models; Matrix Multiplication; Power; Power Efficient; Tensor Contraction; Traditional Architecture; Tensors
Citation
2025 IEEE International Solid-State Circuits Conference (ISSCC), pp. 284-286
Pages
3
Indexed
SCOPUS
Journal Title
2025 IEEE International Solid-State Circuits Conference (ISSCC)
Start Page
284
End Page
286
URI
https://scholarworks.dongguk.edu/handle/sw.dongguk/58072
DOI
10.1109/ISSCC49661.2025.10904727
ISSN
0193-6530
2376-8606
Abstract
There is a need for an AI accelerator optimized for large language models (LLMs) that combines high memory bandwidth with dense compute while minimizing power consumption. Traditional architectures [1]-[4] typically map tensor contractions, the core computational task in machine learning models, onto matrix multiplication units. However, this approach often fails to fully exploit the parallelism and data locality inherent in tensor contractions. In this work, tensor contraction is used as the primitive instead of matrix multiplication, enabling massive parallelism and time-axis pipelining similar to vector processors. Large coarse-grained PEs can be split into smaller compute units called slices, as illustrated in Fig. 16.2.1. Depending on the configuration of the fetch network connecting the slices, they can function either as one large processing element or as small, independent compute units. Input data are continuously fetched in a pipelined manner through the fetch network, allowing high throughput and efficient data reuse. Since the operation units compute deterministically as configured, accurate cost models for performance and energy can be developed for optimization. The chip specifications are also shown in Fig. 16.2.1. © 2025 IEEE.
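The abstract contrasts treating tensor contraction itself as the primitive with the traditional lowering of contractions onto matrix multiplication units. The NumPy sketch below is a minimal illustration of that distinction only; it assumes nothing about RNGD's actual ISA or programming model. It expresses a batched, multi-head attention-style contraction directly as one contraction, then reproduces it by peeling off the outer axes and issuing plain 2D matrix multiplications, the style of lowering the abstract attributes to traditional architectures.

import numpy as np

# Illustrative sketch only -- not RNGD's ISA. It contrasts
# "tensor contraction as a primitive" with lowering the same
# contraction onto 2D matrix-multiplication calls.
rng = np.random.default_rng(0)
B, H, L, D = 2, 4, 8, 16          # batch, heads, sequence length, head dim

Q = rng.standard_normal((B, H, L, D))
K = rng.standard_normal((B, H, L, D))

# Contraction expressed directly as one primitive:
# S[b,h,q,k] = sum_d Q[b,h,q,d] * K[b,h,k,d]
S_contract = np.einsum("bhqd,bhkd->bhqk", Q, K)

# The same contraction lowered onto matmul units: the b and h axes
# become an outer loop, and each step is an isolated 2D matmul,
# hiding the reuse and parallelism across (b, h) from the hardware.
S_matmul = np.empty((B, H, L, L))
for b in range(B):
    for h in range(H):
        S_matmul[b, h] = Q[b, h] @ K[b, h].T

assert np.allclose(S_contract, S_matmul)

The single einsum exposes the full iteration space (b, h, q, k, d) at once, which is the kind of parallelism and data locality the abstract says a contraction-native architecture can exploit and a matmul-by-matmul lowering obscures.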
Files in This Item
There are no files associated with this item.
Appears in Collections
College of Engineering > Department of Electronics and Electrical Engineering > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Han, Ki Jin
College of Engineering (Department of Electronics and Electrical Engineering)