Code-MUE: Measuring Code LLMs' Uncertainty through Execution-based Semantic Interaction Graphs,
2026,
(ISSTA'26) ACM SIGSOFT International Symposium on Software Testing and Analysis
Acceptance Rate: 23.6%
SE4AI
LLM
AI Trustworthiness
Black-box Analysis
Xiaoning Ren, Yinxing Xue, Lei Ma, Yuheng Huang†
Summary: Inspired by Finding 8 of *Look Before You Leap*, this work introduces a lightweight, black-box paradigm for Code LLM risk assessment that shifts uncertainty estimation from surface-level textual similarity to execution semantics. By demonstrating that syntactic consensus can mask substantial functional divergence, it suggests that grounding a model’s confidence in the runtime behavior of its generated programs is a promising direction for building trustworthy code automation.
CAM: A Causality-based Analysis Framework for Multi-Agent Code Generation Systems,
2026,
(ISSTA'26) ACM SIGSOFT International Symposium on Software Testing and Analysis
Acceptance Rate: 23.6%
SE4AI
LLM
Code Generation
AI Trustworthiness
Black-box Analysis
Zongyi Lyu, Zhenlan Ji, Songqiang Chen, Liwen Wang, Yuheng Huang, Shuai Wang, Shing-Chi Cheung
Datura: Progressive Red Teaming Testing for Tool Invocation Chain in LLM Agents,
2026,
(ISSTA'26) ACM SIGSOFT International Symposium on Software Testing and Analysis
Acceptance Rate: 23.6%
SE4AI
LLM
Testing
AI Trustworthiness
AI-enabled System
Yuchen Shao, Ziqun Bao, Yuheng Huang, Yuling Shi, Mingyu Weng, Yiwen Sun, Long Yang, Lei Ma, Ting Su, Chengcheng Wan
Understanding the Fundamental Design Decisions of Retrieval-Augmented Generation Systems,
2026,
(TOSEM'26) ACM Transactions on Software Engineering and Methodology
SE4AI
LLM
Empirical Study
Shengming Zhao, Yuchen Shao, Yuheng Huang, Jiayang Song, Zhijie Wang, Chengcheng Wan, Lei Ma
Foundation Models for Autonomous Driving Systems: An Initial Roadmap,
2026,
(TOSEM'26) ACM Transactions on Software Engineering and Methodology
Survey
ADS
Xiongfei Wu, Mingfei Cheng, Xiaoning Ren, Qiang Hu, Jianlang Chen, Yuheng Huang, Maxime Cordy, Yao Zhang, Xiaofei Xie, Lei Ma, Yves Le Traon
Comfrey: Mitigating Integration Failures in LLM-enabled Software at Run-Time,
2026,
(ICSE'26) 48th IEEE/ACM International Conference on Software Engineering
Acceptance Rate: 24%
SE4AI
LLM
Empirical Study
AI-enabled System
Yuchen Shao, Yuheng Huang, Jiazhen Zou, Yuling Shi, Long Yang, Lei Ma, Ting Su, Chengcheng Wan
DRIVENCE: Realistic Driving Sequence Synthesis for Testing Multi-sensor Fusion Perception Systems,
2026,
(TSE'26) IEEE Transactions on Software Engineering
SE4AI
ADS
Testing
Xinyu Gao, Zhijie Wang, Yang Feng, Chaolan Wang, Zhehua Zhou, Yuheng Huang, Lei Ma, Zhenyu Chen, Baowen Xu
AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling,
2025,
(TOSEM'25) ACM Transactions on Software Engineering and Methodology
SE4AI
LLM
Testing
Yuheng Huang, Jiayang Song, Qiang Hu, Felix Juefei-Xu, Lei Ma
Summary: We propose characterizing LLM behavior in an unsupervised manner by jointly analyzing internal neuron-level representations and external uncertainty signals. Because these two sources capture complementary behavioral cues, the approach enables more effective and sample-efficient testing.
Risk Assessment Framework for Code LLMs via Leveraging Internal States,
2025,
(FSE'25 Industry Track) 2025 The ACM International Conference on the Foundations of Software Engineering
Acceptance Rate: 27%
SE4AI
LLM
AI Trustworthiness
Yuheng Huang, Lei Ma, Keizaburo Nishikino, Takumi Akazaki
Summary: The internal hidden states of a code model carry critical signals about the trustworthiness of its outputs. Leveraging these signals in modern, massive LLMs, however, requires an approach that matches their complexity. Because a model's generative power emerges from scale, we argue that its risk assessment framework should scale as well, by applying scalable pre-training approaches to these highly informative internal layers.
VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation,
2025,
(FSE'25) 2025 The ACM International Conference on the Foundations of Software Engineering
SE4AI
Multimodal
Testing
Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, Lei Ma
Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models,
2025,
(TSE'25) IEEE Transactions on Software Engineering
SE4AI
AI Trustworthiness
LLM
Black-box Analysis
Empirical Study
Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, Lei Ma
Summary: This early exploratory study leverages external uncertainty analysis to evaluate model trustworthiness, highlighting that the open-ended, generative mechanisms of LLMs require fundamentally different uncertainty measurements than the fixed-class probability outputs of classical DNNs.
Multilingual Blending: Large Language Model Safety Alignment Evaluation with Language Mixture,
2025,
(NAACL'25 Findings) 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
Acceptance Rate: 37%
AI Safety
LLM
Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma
TESTEVAL: Benchmarking Large Language Models for Test Case Generation,
2025,
(NAACL'25 Findings) 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
Acceptance Rate: 37%
AI4SE
Benchmark
Testing
Wenhan Wang*, Chenyuan Yang*, Zhijie Wang*, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, Lei Ma
Are LLMs Correctly Integrated into Software Systems?,
2025,
(ICSE'25) 47th IEEE/ACM International Conference on Software Engineering
Acceptance Rate: 22%
SE4AI
LLM
Empirical Study
AI-enabled System
Yuchen Shao, Yuheng Huang, Jiawei Shen, Lei Ma, Ting Su, Chengcheng Wan
Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models,
2025,
(ICSE'25) 47th IEEE/ACM International Conference on Software Engineering
Acceptance Rate: 22%
LLM
Code Generation
Empirical Study
Zhijie Wang*, Zijie Zhou*, Da Song*, Yuheng Huang, Shengmai Chen, Lei Ma, Tianyi Zhang
PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement,
2024,
(CHI'24) The ACM CHI Conference on Human Factors in Computing Systems
Acceptance Rate: 26.4%
HCI
AI Trustworthiness
Multimodal
Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, Tianyi Zhang
LUNA: A Model-Based Universal Analysis Framework for Large Language Models,
2023,
(TSE'24) IEEE Transactions on Software Engineering
SE4AI
AI Trustworthiness
LLM
White-box Analysis
Da Song, Xuan Xie, Jiayang Song, Derui Zhu, Yuheng Huang, Felix Juefei-Xu, Lei Ma
Generation-Based Differential Fuzzing for Deep Learning Libraries,
2023,
(TOSEM'23) ACM Transactions on Software Engineering and Methodology
AI4SE
Testing
Jiawei Liu, Yuheng Huang, Zhijie Wang, Lei Ma, Chunrong Fang, Mingzheng Gu, Xufan Zhang, Zhenyu Chen
PatchCensor: Patch Robustness Certification for Transformers via Exhaustive Testing,
2023,
(TOSEM'23) ACM Transactions on Software Engineering and Methodology
SE4AI
AI Trustworthiness
Testing
Yuheng Huang, Lei Ma, Yuanchun Li
Summary: This work demonstrates that for DNN models (specifically, Transformers), systematic and rigorous exhaustive testing can move beyond empirical evaluation to provide verifiable, certified results regarding a model's robustness against patch attacks.
DeepLens: Interactive Out-of-distribution Data Detection in NLP Models,
2023,
(CHI'23) The ACM CHI Conference on Human Factors in Computing Systems
Acceptance Rate: 28.4%
HCI
AI Trustworthiness
Da Song*, Zhijie Wang*, Yuheng Huang, Lei Ma, Tianyi Zhang
DeepSeer: Interactive RNN Explanation and Debugging via State Abstraction,
2023,
(CHI'23) The ACM CHI Conference on Human Factors in Computing Systems
Acceptance Rate: 28.4%
HCI
AI Trustworthiness
White-box Analysis
Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, Tianyi Zhang
An Exploratory Study of AI System Risk Assessment from the Lens of Data Distribution and Uncertainty,
2022,
Preprint
SE4AI
AI Trustworthiness
Black-box Analysis
Empirical Study
Zhijie Wang, Yuheng Huang, Lei Ma, Haruki Yokoyama, Susumu Tokumoto, Kazuki Munakata
Understanding (Mis)Behavior on the EOSIO Blockchain,
2020,
(SIGMETRICS'20) Proceedings of the ACM on Measurement and Analysis of Computing Systems
Acceptance Rate: 15%
Measurement
Blockchain
Empirical Study
Yuheng Huang, Haoyu Wang, Lei Wu, Gareth Tyson, Xiapu Luo, Run Zhang, Xuanzhe Liu, Gang Huang, Xuxian Jiang
Summary: This work presents a large-scale measurement study of the EOSIO ecosystem based on graph analysis. It reveals that EOSIO's apparent prosperity is superficial and that malicious activity is pervasive.