Survey and Evaluation of Converging Architecture in LLMs Based on Footsteps of Operations

Abstract

Large language models (LLMs), which have emerged from advances in natural language processing (NLP), enable chatbots, virtual assistants, and numerous domain-specific applications. These models, often comprising billions of parameters, leverage the Transformer architecture and attention mechanisms to process context effectively and address long-term dependencies more efficiently than earlier approaches, such as recurrent neural networks (RNNs). Notably, since the introduction of Llama, the architectural development of LLMs has significantly converged, predominantly settling on a Transformer-based decoder-only architecture. The evolution of LLMs has been driven by advances in high-bandwidth memory, specialized accelerators, and optimized architectures, enabling models to scale to billions of parameters. However, this scaling also introduces new challenges: meeting compute and memory efficiency requirements across diverse deployment targets, ranging from data center servers to resource-constrained edge devices. To address these challenges, we survey the evolution of LLMs at two complementary levels: architectural trends and their underlying operational mechanisms. Furthermore, we quantify how hyperparameter settings influence inference latency by profiling kernel-level execution on a modern GPU architecture. Our findings reveal that identical models can exhibit varying performance depending on hyperparameter configurations and deployment contexts, emphasizing the need for scalable and efficient solutions. The insights distilled from this analysis guide the optimization of performance and efficiency within these converged LLM architectures, thereby extending their applicability across a broader range of environments.
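
As a rough illustration of the kind of kernel-level latency profiling the abstract refers to (not the paper's actual measurement harness), the following hypothetical PyTorch sketch times a single scaled-dot-product attention call under different sequence-length and head-count settings using CUDA events; the specific function names and hyperparameter values are assumptions chosen for the example.

```python
import torch

# Hypothetical sketch: measure how two hyperparameters (sequence length and
# number of attention heads) affect the latency of one attention kernel call.
# Assumes PyTorch >= 2.0 and an available CUDA device.

def profile_attention(seq_len: int, n_heads: int, head_dim: int = 128,
                      batch: int = 1, iters: int = 50) -> float:
    device = "cuda"
    q = torch.randn(batch, n_heads, seq_len, head_dim,
                    device=device, dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # Warm-up so kernel selection and caching do not skew the measurement.
    for _ in range(10):
        torch.nn.functional.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.nn.functional.scaled_dot_product_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean latency in milliseconds

if __name__ == "__main__":
    for seq_len in (512, 2048, 8192):
        for n_heads in (16, 32):
            ms = profile_attention(seq_len, n_heads)
            print(f"seq_len={seq_len:5d} heads={n_heads:2d} -> {ms:.3f} ms")
```

Sweeping such settings on a given GPU makes visible how the same attention operation can dominate or recede in the latency profile as hyperparameters change, which is the kind of effect the survey quantifies.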

Publication
IEEE Open Journal of the Computer Society (OJCS)
Insu Choi
Ph.D. Candidate · AI Accelerators & Computer Architecture

My research interests include AI/ML, AI accelerators, and memory reliability.