Yuheng Huang

Ph.D. Candidate, The University of Tokyo

I am seeking faculty positions worldwide.

{First Name} dot {Last Name} {42} at {gmail} dot com

Short Bio: I’m currently a Ph.D. candidate at Momentum Lab, Dept. of Computer Science at The University of Tokyo, working under the supervision of Prof. Lei Ma. My Ph.D. studies are supported by the IST-RA from the department and the Research Fellowship for Young Scientists from JSPS. I chose to “master out” and graduated from the University of Alberta. My studies at UofA were supported by AMII. During my graduate studies, I was grateful to learn from Prof. Tianyi Zhang and Dr. Felix Juefei-Xu. Before joining Momentum Lab, I was fortunate to work with Prof. Haoyu Wang, Prof. Dan Pei, and Prof. Yuanchun Li.

Research Interests: The uncertainty and complexity inherent in AI-driven systems make it difficult to fully understand them from first principles, posing significant challenges for their trustworthy and efficient deployment. My long-term research goal is to advance the scientific understanding of AI systems from a software engineering perspective and, ultimately, to enable the systematic development of more trustworthy AI.

My past research is guided by the philosophy of decoding complexity through systematic interaction. While the global internals of modern AI systems often remain opaque, actionable insights and local structural knowledge can be uncovered through disciplined observational interaction, such as fuzzing, internal probing, and boundary exploration. By systematically investigating how these systems behave under diverse conditions, we extract informative clues about their reliability, limitations, and vulnerabilities. These insights enable rigorous characterization, targeted improvement, and more dependable deployment of AI systems, even without requiring complete interpretability of their underlying mechanisms.

Research Experience: I have experience in the trustworthiness and reliability assurance of complex AI systems, with a particular focus on foundation models and techniques for testing, analysis, monitoring, and verification. I also have experience developing human-computer interaction (HCI) solutions for machine learning development, as well as experience spanning the AI stack and blockchain technologies. During my Master’s and Ph.D. studies, I actively collaborated with industry partners on the design and deployment of AI applications, including foundation model-based systems.

Research Interests

Quality Assurance of Complex AI Systems. This involves the exploration of how we can interpret, analyze, enhance, and safeguard AI systems.
Robustness of AI Models. This involves a special focus on standalone AI modules such as CNNs, RNNs, and Transformers (LLMs).
Machine Learning Operations. This involves designing interactive tools to alleviate the burdens of different stakeholders in leveraging, developing, and operating state-of-the-art AI techniques.

Education

The University of Tokyo

2024 - present

Ph.D. Candidate Computer Science

Supervised by Prof. Lei Ma

University of Alberta

2021 - 2023

M.Sc. Electrical and Computer Engineering

Supervised by Prof. Lei Ma

Beijing University of Posts and Telecommunications

2017 - 2021

B.Sc. Computer Science

Supervised by Prof. Haoyu Wang and Prof. Yuanchun Li

Recent Publications

† indicates Corresponding Author and Project Leader, * indicates Equal Contribution.

Code-MUE: Measuring Code LLMs' Uncertainty through Execution-based Semantic Interaction Graphs, 2026, (ISSTA'26) ACM SIGSOFT International Symposium on Software Testing and Analysis Acceptance Rate: 23.6%

SE4AI LLM AI Trustworthiness Black-box Analysis

Xiaoning Ren , Yinxing Xue , Lei Ma , Yuheng Huang†

Summary: Inspired by Finding 8 of *Look Before You Leap*, this work introduces a lightweight, black-box paradigm for Code LLM risk assessment that shifts uncertainty estimation from surface-level textual similarity to execution semantics. Based on the observation that syntactic consensus can obscure substantial functional divergence, it proposes grounding model confidence in the runtime behavior of generated programs as a promising direction for building more trustworthy code automation.

code pdf

CAM: A Causality-based Analysis Framework for Multi-Agent Code Generation Systems, 2026, (ISSTA'26) ACM SIGSOFT International Symposium on Software Testing and Analysis Acceptance Rate: 23.6%

SE4AI LLM Code Generation AI Trustworthiness Black-box Analysis

Zongyi Lyu , Zhenlan Ji† , Songqiang Chen , Liwen Wang , Yuheng Huang , Shuai Wang , Shing-Chi Cheung

code pdf

Datura: Progressive Red Teaming Testing for Tool Invocation Chain in LLM Agents, 2026, (ISSTA'26) ACM SIGSOFT International Symposium on Software Testing and Analysis Acceptance Rate: 23.6%

SE4AI LLM Testing AI Trustworthiness AI-enabled System

Yuchen Shao , Ziqun Bao , Yuheng Huang , Yuling Shi , Mingyu Weng , Yiwen Sun , Long Yang , Lei Ma , Ting Su , Chengcheng Wan†

Understanding the Fundamental Design Decisions of Retrieval-Augmented Generation Systems, 2026, (TOSEM'26) ACM Transactions on Software Engineering and Methodology

SE4AI LLM Empirical Study

Shengming Zhao , Yuchen Shao , Yuheng Huang , Jiayang Song , Zhijie Wang , Chengcheng Wan , Lei Ma

code pdf

Foundation Models for Autonomous Driving Systems: An Initial Roadmap, 2026, (TOSEM'26) ACM Transactions on Software Engineering and Methodology

Survey ADS

Xiongfei Wu , Mingfei Cheng , Xiaoning Ren , Qiang Hu† , Jianlang Chen , Yuheng Huang , Maxime Cordy , Yao Zhang , Xiaofei Xie , Lei Ma , Yves Le Traon

pdf

Comfrey: Mitigating Integration Failures in LLM-enabled Software at Run-Time, 2026, (ICSE'26) 48th IEEE/ACM International Conference on Software Engineering Acceptance Rate: 24%

SE4AI LLM Empirical Study AI-enabled System

Yuchen Shao , Yuheng Huang , Jiazhen Zou , Yuling Shi , Long Yang , Lei Ma , Ting Su , Chengcheng Wan†

code pdf

DRIVENCE: Realistic Driving Sequence Synthesis for Testing Multi-sensor Fusion Perception Systems, 2026, (TSE'26) Transactions on Software Engineering

SE4AI ADS Testing

Xinyu Gao , Zhijie Wang , Yang Feng† , Chaolan Wang , Zhehua Zhou , Yuheng Huang , Lei Ma† , Zhenyu Chen† , Baowen Xu

code pdf

AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling, 2025, (TOSEM'25) ACM Transactions on Software Engineering and Methodology

SE4AI LLM Testing

Yuheng Huang , Jiayang Song , Qiang Hu , Felix Juefei-Xu , Lei Ma

Summary: We propose characterizing LLM behavior in an unsupervised manner by jointly analyzing internal neuron-level representations and external uncertainty signals. By capturing complementary behavioral cues, this approach enables more effective and sample-efficient testing.

code pdf

Risk Assessment Framework for Code LLMs via Leveraging Internal States, 2025, (FSE'25 Industry Track) 2025 The ACM International Conference on the Foundations of Software Engineering Acceptance Rate: 27%

SE4AI LLM AI Trustworthiness

Yuheng Huang , Lei Ma , Keizaburo Nishikino , Takumi Akazaki

Summary: The internal hidden states of a code model contain critical signals regarding the trustworthiness of its outputs. However, effectively leveraging these signals from modern, massive LLMs requires an approach that matches their sheer complexity. Because a model's generative power emerges from massive scale, we argue that its risk assessment framework should scale accordingly, by applying scalable pre-training approaches to these highly informative internal layers.

code pdf

VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation, 2025, (FSE'25) 2025 The ACM International Conference on the Foundations of Software Engineering

SE4AI Multimodal Testing

Zhijie Wang , Zhehua Zhou , Jiayang Song , Yuheng Huang , Zhan Shu , Lei Ma

code pdf

Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models, 2025, (TSE'25) Transactions on Software Engineering

SE4AI AI Trustworthiness LLM Black-box Analysis Empirical Study

Yuheng Huang , Jiayang Song , Zhijie Wang , Shengming Zhao , Huaming Chen , Felix Juefei-Xu , Lei Ma

Summary: This is an early exploratory study that leverages external uncertainty analysis to evaluate model trustworthiness, highlighting that the open-ended, generative mechanisms of LLMs require fundamentally different uncertainty measurements than the fixed-class probability outputs of classical DNNs.

code pdf

Multilingual Blending: Large Language Model Safety Alignment Evaluation with Language Mixture, 2025, (NAACL'25 Findings) 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics Acceptance Rate: 37%

AI Safety LLM

Jiayang Song , Yuheng Huang , Zhehua Zhou , Lei Ma

pdf

TESTEVAL: Benchmarking Large Language Models for Test Case Generation, 2025, (NAACL'25 Findings) 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics Acceptance Rate: 37%

AI4SE Benchmark Testing

Wenhan Wang* , Chenyuan Yang* , Zhijie Wang* , Yuheng Huang , Zhaoyang Chu , Da Song , Lingming Zhang , An Ran Chen , Lei Ma

code pdf

Are LLMs Correctly Integrated into Software Systems?, 2025, (ICSE'25) 47th IEEE/ACM International Conference on Software Engineering Acceptance Rate: 22%

SE4AI LLM Empirical Study AI-enabled System

Yuchen Shao , Yuheng Huang , Jiawei Shen , Lei Ma , Ting Su , Chengcheng Wan†

code pdf

Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models, 2025, (ICSE'25) 47th IEEE/ACM International Conference on Software Engineering Acceptance Rate: 22%

LLM Code Generation Empirical Study

Zhijie Wang* , Zijie Zhou* , Da Song* , Yuheng Huang , Shengmai Chen , Lei Ma , Tianyi Zhang

code pdf

PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement, 2024, (CHI'24) The ACM CHI Conference on Human Factors in Computing Systems Acceptance Rate: 26.4%

HCI AI Trustworthiness Multimodal

Zhijie Wang , Yuheng Huang , Da Song , Lei Ma , Tianyi Zhang

code pdf

LUNA: A Model-Based Universal Analysis Framework for Large Language Models, 2023, (TSE'24) Transactions on Software Engineering

SE4AI AI Trustworthiness LLM White-box Analysis

Da Song , Xuan Xie , Jiayang Song , Derui Zhu , Yuheng Huang , Felix Juefei-Xu , Lei Ma

pdf

Generation-Based Differential Fuzzing for Deep Learning Libraries, 2023, (TOSEM'23) ACM Transactions on Software Engineering and Methodology

AI4SE Testing

Jiawei Liu , Yuheng Huang , Zhijie Wang , Lei Ma , Chunrong Fang† , Mingzheng Gu , Xufan Zhang , Zhenyu Chen

pdf

PatchCensor: Patch Robustness Certification for Transformers via Exhaustive Testing, 2023, (TOSEM'23) ACM Transactions on Software Engineering and Methodology

SE4AI AI Trustworthiness Testing

Yuheng Huang , Lei Ma , Yuanchun Li†

Summary: This work demonstrates that for DNN models (specifically, Transformers), systematic and rigorous exhaustive testing can move beyond empirical evaluations to provide verifiable, certified results regarding a model's robustness against patch attacks.

code pdf

DeepLens: Interactive Out-of-distribution Data Detection in NLP Models, 2023, (CHI'23) The ACM CHI Conference on Human Factors in Computing Systems Acceptance Rate: 28.4%

HCI AI Trustworthiness

Da Song* , Zhijie Wang* , Yuheng Huang , Lei Ma , Tianyi Zhang

pdf

DeepSeer: Interactive RNN Explanation and Debugging via State Abstraction, 2023, (CHI'23) The ACM CHI Conference on Human Factors in Computing Systems Acceptance Rate: 28.4%

HCI AI Trustworthiness White-box Analysis

Zhijie Wang , Yuheng Huang , Da Song , Lei Ma , Tianyi Zhang

pdf

An Exploratory Study of AI System Risk Assessment from the Lens of Data Distribution and Uncertainty, 2022, Preprint

SE4AI AI Trustworthiness Black-box Analysis Empirical Study

Zhijie Wang , Yuheng Huang , Lei Ma , Haruki Yokoyama , Susumu Tokumoto , Kazuki Munakata

pdf

Understanding (Mis)Behavior on the EOSIO Blockchain, 2020, (SIGMETRICS'20) Proceedings of the ACM on Measurement and Analysis of Computing Systems Acceptance Rate: 15%

Measurement Blockchain Empirical Study

Yuheng Huang , Haoyu Wang† , Lei Wu† , Gareth Tyson , Xiapu Luo , Run Zhang , Xuanzhe Liu , Gang Huang , Xuxian Jiang

Summary: This work performed a large-scale measurement study of the EOSIO ecosystem through graph analysis. It reveals EOSIO's superficial prosperity and pervasive malicious activities.

code pdf video
