Congratulations to Ph.D. student Tianao Ge and his advisor, Assistant Professor Hongyuan Liu, on receiving a Best Paper Award at ASPLOS 2024, a top-tier conference in computer architecture! Their paper, “ngAP: Non-blocking Large-scale Automata Processing on GPUs”, was written in collaboration with Dr. Tong Zhang from Samsung Electronics. Of the 193 papers accepted at the conference, only six were honored with this prestigious award.

ASPLOS – the ACM International Conference on Architectural Support for Programming Languages and Operating Systems – is the premier academic forum for multidisciplinary computer systems research spanning hardware, software, and their interaction. It focuses on computer architecture, programming languages, operating systems, and associated areas such as networking and storage.


Paper Summary

Read the paper ➡️

Finite automata serve as the compute kernel for various emerging applications and are widely used in areas such as machine learning, network intrusion detection, graph processing, and bioinformatics. However, despite the increasing computational power of GPUs, their potential for automata processing has not been fully realized. This work identifies three main challenges that limit GPU throughput:

1) The available parallelism is insufficient, leaving GPU threads underutilized.

2) Automata workloads involve significant redundant computation, since a portion of states repeatedly matches the same symbols.

3) The mapping between threads and states switches dynamically, leading to poor data locality.

The key observation is that existing engines process automata in a “one-symbol-at-a-time” manner, which serializes execution. To address these challenges, the study proposes a non-blocking automata processing approach, which allows different symbols in the input stream to be processed in parallel and further supports three optimizations:

1) Partial computations are prefetched to increase the opportunity to process multiple symbols simultaneously, better utilizing GPU threads.

2) Repeated calculations are stored in a memoization table, so redundant computation can be replaced with table lookups.

3) Some computations are privatized to preserve the mapping between threads and states, improving data locality.
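To make the baseline and the memoization idea concrete, here is a minimal CPU-side sketch (not the paper's ngAP implementation; the automaton, state names, and helper are hypothetical): a classic one-symbol-at-a-time NFA loop, with a table that caches the transition of an entire active-state set on a symbol, loosely analogous to optimization 2 above.

```python
# Hypothetical toy sketch for illustration only; ngAP's actual GPU
# engine is far more sophisticated (non-blocking, parallel symbols).

def run_nfa(transitions, start_states, accept_states, stream):
    """Process `stream` one symbol at a time.

    transitions: dict mapping (state, symbol) -> set of next states.
    Returns the input positions at which an accept state is active.
    """
    memo = {}  # (frozenset of active states, symbol) -> frozenset
    active = frozenset(start_states)
    matches = []
    for pos, sym in enumerate(stream):
        key = (active, sym)
        if key not in memo:  # compute once; repeated symbols hit the cache
            nxt = set()
            for s in active:
                nxt |= transitions.get((s, sym), set())
            memo[key] = frozenset(nxt)
        active = memo[key]
        if active & accept_states:
            matches.append(pos)
    return matches


# Tiny example NFA that reports every position where "ab" ends.
TRANSITIONS = {
    (0, "a"): {0, 1},  # state 0 self-loops and also starts a match
    (0, "b"): {0},
    (0, "x"): {0},
    (1, "b"): {2},     # state 2 is the accepting state
}

print(run_nfa(TRANSITIONS, {0}, {2}, "xabab"))  # -> [2, 4]
```

Note how the loop body depends on the previous `active` set, which is exactly the serialization the paper's non-blocking approach breaks by letting different input symbols proceed in parallel.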

Experimental results showed that the proposed method outperformed state-of-the-art GPU automata processing engines by an average of 7.9 times across 20 applications. The open-source research artifact of this work passed the evaluation of the ASPLOS 2024 Artifact Evaluation Committee and received all three badges: “Functional,” “Available,” and “Reproduced.”


The methods proposed in this paper successfully overcome the major bottlenecks encountered in processing large-scale automata on GPUs, significantly accelerating domain applications centered around automata. The advancements in GPU-accelerated automata processing offer significant insights into managing irregular applications on GPUs, while also establishing a robust foundation for making general-purpose GPUs “more general-purpose”.

First author

Tianao Ge, an IEEE student member, is a second-year Ph.D. student in the Microelectronics Thrust at the Hong Kong University of Science and Technology (Guangzhou), supervised by Dr. Hongyuan Liu. He obtained a master’s degree from Sun Yat-sen University in 2022 under the guidance of Dr. Xianwei Zhang. He received his bachelor’s degree in engineering from Wuhan University of Technology in 2020.


Hongyuan Liu is an Assistant Professor at the Microelectronics Thrust, Function Hub of the Hong Kong University of Science and Technology (Guangzhou). He works in the broad areas of computer architecture and parallel computing, especially emphasizing domain-specific computations on GPUs. Hongyuan received his Ph.D. degree in computer science from William & Mary.

Personal homepage: