That’s where a language model could offer a meaningful solution, according to new research detailed in “Learning From Accident Reports Using Language Models: Human Errors and Error Mechanisms in Aviation and Railway Accidents,” which was published in ASME’s Journal of Computing and Information Science in Engineering in January.
Corresponding author Lukman Irshad, a NASA contractor and research engineer, took time to answer some questions about the language model methodology that could identify human errors and related conditions from historical aviation and railway accident reports, which could then be translated into other domains for which there is no such history.

In the study, you mention that existing methods for human error assessment are highly expert-driven and rely on historical knowledge. What are the pros and cons to that?
Some of the advantages of expert-driven assessments are that they capture deep domain experience, operational context, and lessons learned that aren’t always available in structured datasets. However, the downside is that these methods don’t scale well as systems become more complex, and they can introduce subjectivity across analysts. Additionally, these assessments can be resource-intensive and time-consuming. They also struggle when novel systems or operations emerge, especially in areas where historical precedents or established expertise are limited. Our work is designed to complement, not replace, experts by helping them extract human error-related knowledge more systematically from historical data, not only within a given domain of interest but also from analogous domains where more extensive precedents may exist.


So why use a language model for human error assessment? Was there something in particular that led you to pursue this topic?
We have extremely rich safety-related information captured in the narrative text of accident, incident, and lessons-learned reports, but this information is difficult to analyze systematically at scale. Language models are well-suited for this challenge because they can extract patterns and relationships without requiring analysts to manually sift through the text. What motivated us was recognizing that safety-critical industries generate thousands of narrative reports every year, yet much of the insight remains locked in unstructured text. We wanted to use modern natural language processing (NLP) to unlock that knowledge and support timelier, evidence-based human error analysis.


Why did your team choose aviation and railway as its two domains for comparison rather than other transportation domains?
Since this project is funded by NASA’s Aeronautics Research Mission Directorate, aviation was the obvious place to start. Also, there is a huge amount of high-quality accident and incident report data already available in the aviation domain. We chose railway as the comparison domain because it is also a mature, safety-critical transportation system with a long history of structured incident reporting. The two domains share relevant operational parallels: the operator’s role in managing the system, complying with regulations, following operational procedures, and having dedicated infrastructure. At the same time, they differ enough to provide meaningful contrasts in human error patterns. This combination makes them ideal for exploring how knowledge extracted through NLP can generalize across domains.


Can you give us a high-level walkthrough of your methodology for extracting human error-related knowledge from incident reports?
For this study, we used a three-stage methodology: data pre-processing, hazard theme identification, and human error-related knowledge elicitation. In the pre-processing stage, we made sure we were pulling from the correct data sources and narrative fields, and we cleaned the data so any anomalies were handled upfront. The next stage, hazard theme identification, uses topic modeling, a technique that automatically uncovers abstract topics within a dataset. An expert then reviews those topics to check their quality, accuracy, and relevance. Once the filtering is done, we group similar topics together so we end up with a clear set of meaningful hazard themes. Finally, in the knowledge elicitation stage, we create queries based on those finalized themes and use semantic search to pull relevant information from the dataset. Unlike simple keyword matching, semantic search takes the context of the query into account when searching the documents. Depending on what we find, we may refine the queries to improve the results, or, if the information is solid, the expert records the human errors, the conditions that led to them, and any contributing factors.
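To make those two NLP stages more concrete, here is a minimal sketch assuming a list of cleaned narrative reports. The library choices (scikit-learn for the topic modeling, sentence-transformers for the semantic search), the model name, and the sample report texts are illustrative assumptions, not the toolchain used in the study.

```python
# Illustrative sketch of stages 2 and 3 (not the study's actual code):
# topic modeling to surface hazard themes, then semantic search with
# theme-based queries. Sample reports are invented for demonstration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sentence_transformers import SentenceTransformer, util

# In practice this would be thousands of cleaned narrative reports.
reports = [
    "Flight crew failed to comply with the published approach procedure.",
    "Train operator misread the wayside signal and passed it at danger.",
    "Pilot did not maintain the assigned airspeed on final approach.",
    "Engineer exceeded the authorized track speed in the work zone.",
]

# Stage 2: topic modeling surfaces candidate hazard themes, which an
# expert then filters and merges into a final set of themes.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(reports)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:6]]
    print(f"Topic {k}: {', '.join(top)}")

# Stage 3: semantic search with an expert-written query per theme.
# Embedding similarity captures meaning, so hits need not share keywords.
model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = model.encode(reports, convert_to_tensor=True)
query_emb = model.encode("failure to maintain speed", convert_to_tensor=True)
for hit in util.semantic_search(query_emb, corpus_emb, top_k=3)[0]:
    print(f"{hit['score']:.2f}  {reports[hit['corpus_id']]}")
```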


The study notes that there were only three human errors that were common across both domains: “failure to comply with operating procedure,” “failure to maintain speed,” and “misrepresenting a signal.” Can you explain the significance?
Having more or fewer common errors isn’t really the important part. What matters is that we found any shared errors at all. That tells us we can sometimes use analogous datasets to extract human error-related insights even when a specific domain doesn’t have much historical data of its own.
A good example is the difference between autonomous surface vehicles and autonomous aircraft. The surface domain has advanced much more quickly and has already generated large amounts of operational data, while autonomous aviation is still emerging. Our findings suggest that if you’re analyzing ground controller errors in autonomous aviation, you can draw on knowledge extracted from the autonomous surface vehicle domain, as long as those insights are properly contextualized before being applied.


You mention in the study that the majority of the errors and their associated conditions and mechanisms can be used to inform safe operations across domains. Why is that?
We discuss this in detail in the paper, but simply put, even when tasks look different on the surface, the underlying cognitive constructs that shape human performance tend to be similar. As a result, the error-producing conditions and mechanisms can also be similar across domains when the tasks share comparable cognitive demands. By understanding those deeper, structural drivers of human error, safety practitioners can transfer lessons from one domain to another, even when the outward tasks don’t match exactly.


In what ways are you looking to or could you apply the data generated from this methodology?
There are several ways the data produced by this methodology can be applied. Since the approach is meant to be used early in system design, the results can directly complement traditional early-phase hazard assessments by helping analysts incorporate human considerations from the very beginning. The structured insights we extract, such as error-producing conditions, contributing factors, and cognitive mechanisms, can inform safety requirements, the development of standard operating procedures, and the design of training programs.


What are the practical implications of this methodology for hazard assessments moving forward?
One of the most important practical implications of this methodology is its ability to support hazard assessments for novel systems that don’t yet have historical precedents. By drawing from analogous domains with richer operational histories, the method helps experts fill gaps in knowledge and experience, resulting in more complete and context-aware human error assessments for emerging systems.
Beyond novel technologies, the methodology also has broader implications for hazard analysis in general. It allows analysts to complement expert-driven assessments with evidence-based, data-driven insights extracted from historical narrative reports. This combination not only strengthens the analytical foundation behind human error assessments but can also make the overall process less time-consuming and less resource-intensive for complex systems.


What’s next for you and your work? Could your methodology and language model approach be of use in any other possible applications?
In terms of future directions, we are currently exploring how this methodology could be incorporated into an interactive chat-based assistant within a hazard analysis tool. The idea is to allow users to directly query the system and extract human error-related knowledge in real time. This capability is being developed as part of a new tool called the Hazard Analysis DESigner (HADES), which aims to make hazard analysis model-based, computational, and data-informed.
As for applications, we have already applied insights from this study to integrate human element considerations into the functional hazard assessment of a remotely operated drone conducting hurricane recovery and relief operations. This demonstrated not only that the methodology scales to real-world scenarios, but also that it can meaningfully enhance early-stage assessments for emerging, complex systems.


Is there anything that we might have missed or that you would like to highlight?
If there is one key takeaway from this interview, it’s that the methodology proposed in this study is not intended to replace human analysts. Instead, it is designed to complement them by serving as an exploratory tool that expands their ability to investigate the human error space beyond their immediate experience or domain expertise. The goal is to empower analysts with faster, broader, and more systematic access to insights, not to remove them from the process.


Louise Poirier is managing editor of Mechanical Engineering magazine.
