On many occasions, I have met colleagues from AI, systems, security, or formal methods at conferences and introduced myself as someone working on SE4AI (Software Engineering for Artificial Intelligence) or AI trustworthiness assurance. Almost invariably, the conversation proceeds with a moment of polite acknowledgment, followed by a brief pause that is difficult to miss. Underneath the friendly nods and professional gestures, there is often a sense of puzzlement: what exactly is this area supposed to be?
I usually attempt a further explanation, unpacking these phrases into something more concrete. The responses, however, tend to be along the lines of:
"OK ..."
"Well ..."
"Fair enough ..."
In retrospect, I find these reactions entirely reasonable. Had I encountered someone claiming to work on SE4AI six years ago, I would likely have responded with the same skepticism. At that time, I had just begun an internship in AI safety at MSRA and had started reading one of the most famous surveys in the field. The promise of bringing SE rigor into the development and assurance of learning-based systems seemed both natural and necessary. Yet, it was far from clear what such rigor would actually look like in practice. Moreover, after several years of effort and hardship, I still find myself unsure, at times, how to situate SE4AI research within the broader AI-related community. Now, standing at a crossroads in my career, I feel it is worthwhile to step back and re-examine this persistent question.
Let’s begin with the definition of SE4AI. The central premise is to treat AI systems as a new class of software artifacts that derive much of their functionality from data-driven learning processes. Such software carries a substantial degree of uncertainty in both behavior and performance. From this perspective, it is natural to ask whether established software engineering principles can be brought to bear on the development and assurance of such systems. In practice, this has led to a broad body of research aimed at testing, analyzing, repairing, improving, and monitoring AI-enabled systems throughout the software lifecycle.
This sounds very promising. However, the agenda is difficult to carry out. When the object of study is a complex AI-enabled system, the scope of analysis is almost inevitably restricted to particular architectures, deployment settings, or data regimes. At that point, questions about generalizability surface almost immediately, and reviewers will almost certainly ask to what extent the resulting findings generalize to other AI systems. On the other hand, if one attempts to address lifecycle-level concerns in a more practice-oriented manner, spanning different stages of model development, the research quickly becomes entangled with the complexity of real-world pipelines, full of dirty work and engineering practice. The natural question from reviewers then becomes: what, precisely, has been learned about the nature of learning-enabled systems beyond the engineering process itself?
In light of this tension, it is perhaps unsurprising that many SE4AI studies have turned toward a safer regime: one that mirrors the experimental conventions of mainstream AI research. Within this setting, the analysis is typically confined to the training, evaluation, and post-hoc analysis of a model. Objectives are crisply defined, metrics are standardized, and benchmark datasets are readily available through public repositories.
Up to this point, there is nothing problematic. However, the absence of a well-understood mechanistic account of modern AI systems introduces a more subtle risk. In attempting to reason about these systems, we, as SE people, often rely on analogies drawn from classical software, importing concepts such as testing, debugging, repairing, or specification into the context of model development. Analogy, of course, plays a constructive role in cross-domain exploration. Many research directions originate precisely from mapping a familiar conceptual framework onto an unfamiliar class of systems. The problem arises when such mappings remain at the level of terminology rather than being grounded in rigorous analysis. We begin to build analogies on top of analogies, chasing shadows that were never there.
As a short and incomplete summary, much of the model-centric work that currently falls under the SE4AI banner exhibits one (or more) of the following patterns:
It builds upon analogies for which the underlying AI literature provides limited empirical or theoretical support, with subsequent efforts directed less toward strengthening the underlying analysis and more toward optimizing surrogate metrics.
It (partially) redefines problems that have already been extensively studied in the AI community.
It directly wraps an AI method in a fancy software engineering story.
Sadly, for some of the work in the domain that fits these patterns, there is a question that is difficult to dismiss.
This question has remained with me for years and, as suggested in the title, constitutes the fundamental motivation behind this essay. If a substantial portion of the technical problems addressed under the SE4AI banner has already been studied, or could be studied more directly, in other domains, then what role does SE4AI uniquely serve? Why should a researcher invest attention in SE4AI papers rather than in the corresponding work produced within the core AI community?
While these concerns have been discussed widely within the community (e.g., this and this), they nevertheless point to a set of questions that we, as SE4AI researchers, may need to revisit repeatedly.
Another serious issue can be observed in some work in the domain, where experimental protocols, evaluation pipelines, and methods are adopted in a manner that resembles scientific exploration but turns out to be neither scientific nor practical.
I personally think this is totally acceptable: funding cycles must be completed, and students must progress toward graduation. Perhaps this is even a cheap and acceptable way to train researchers. However, if we would like to exert a meaningful influence beyond our immediate subcommunity, then perhaps we need to reflect on our advantages and our position.
At this point, I am too junior to offer prescriptive guidance. Nevertheless, I will offer some preliminary thoughts, at least for my friends, collaborators, and students, on a few working principles that may help situate SE4AI research within a broader scientific context.
For work that is directly concerned with AI models:
It is essential to engage substantively with the machine learning literature, or, at the very least, to understand the basic assumptions commonly made there.
In many cases, meaningful contributions may arise from introducing domain-specific constraints, abstractions, and knowledge that enable existing AI systems to be constructed effectively in practice. A notable example is how Prof. Xiong’s insights have been integrated into DeepSeek-Coder’s training (though that work belongs to the AI4SE domain).
For work that extends beyond model internals:
It may be worthwhile to examine research problems at the system level, where interactions among multiple models, software components, and environments occur. In such settings, the relevant failure modes often emerge not from the mechanism of any individual model, but from the composition of learning-enabled components within a broader software pipeline.
It may be worthwhile to engage directly with real-world AI development lifecycles and get one’s hands dirty. Extracting generalizable insights from these interactions may ultimately prove as important for trustworthy system design as advances in model architecture or training methodology.
Admittedly, the above two scenarios may appear narrowly tied to particular systems or engineering contexts. Yet it is still, from my point of view, a practical way for us, as SE/systems people, to approach the problem. Designing systems governed by intricate protocols, reasoning about components whose internal mechanisms are only partially observable, and exploring large configuration spaces in a disciplined and systematic manner are precisely the kinds of problems our community has long confronted. While individual models may remain black boxes at a mechanistic level, the surrounding system, including its interfaces, contracts, monitoring mechanisms, and integration logic, is not beyond our scope. This is where, I hope, our SE expertise can contribute most meaningfully to the advancement of trustworthy and reliable AI systems.
2026-02-21
Yuheng Huang
Tokyo, Japan
Ideas expressed in this blog are my own and do not represent any other person’s views.