Why software testing needs explainable AI
Applications that use artificial intelligence and machine learning techniques present unique challenges to testers. These systems are largely black boxes that process data through many layers of algorithms, sometimes hundreds of them, and return a result.
While testing can be a complex endeavor for any application, at a fundamental level it involves making sure that the results returned are those expected for a given input. With AI/ML systems, that's a problem. The software returns an answer, but testers have no independent way to determine whether it's the correct one, because they don't necessarily know what the right answer should be for a given set of inputs.
In fact, some application results may be laughable. Individual recommendations from e-commerce engines are often wide of the mark, but as long as they collectively induce shoppers to add items to their carts, the engine is considered a business success. So how do you determine whether your ML application achieves the needed level of success before deployment?
So the definition of a correct answer depends not only on the application, but also on how accurate the answer is required to be. If it has to be exact, testing is straightforward; if not, how close is close enough? And will it always be close enough?
That's ultimately the black hole for testers. If you don't have a working statistical definition of accuracy that's based on the needs of the problem domain, you can't tell objectively whether or not a result is correct.
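Such a definition can be encoded directly in a test. Here is a minimal sketch in Python, assuming a numeric output; the tolerance and required pass rate are hypothetical values a team would derive from the problem domain, not prescribed standards.

```python
# A minimal sketch of a statistical pass/fail gate for ML outputs.
# The tolerance and required pass rate are hypothetical, domain-driven choices.

def within_tolerance(predicted: float, expected: float, tolerance: float) -> bool:
    """A single result 'passes' if it lands within the agreed tolerance."""
    return abs(predicted - expected) <= tolerance

def accuracy_gate(predictions, expectations, tolerance=0.05, required_pass_rate=0.95):
    """The test passes only if enough individual results are close enough."""
    passes = [
        within_tolerance(p, e, tolerance)
        for p, e in zip(predictions, expectations)
    ]
    pass_rate = sum(passes) / len(passes)
    return pass_rate >= required_pass_rate, pass_rate

# Example: 95% of predictions must fall within +/-0.05 of the expected values.
ok, rate = accuracy_gate([0.98, 1.07, 2.5], [1.0, 1.0, 2.48])
print(f"pass={ok}, observed pass rate={rate:.2f}")
```

The specific numbers matter less than the fact that the pass criteria are explicit, so a result can be judged objectively.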
It gets worse from there. Testers may have no idea whether an answer is right or wrong, even for a binary answer. In some cases it might be possible to go back to the training data and find a similar example, but in many others there is still no obvious way to validate the result.
Does it matter? Yes, and probably more so than in traditional business applications. The vast majority of results in a traditional business application can easily be classified as correct or incorrect. Testers don't need to know how the underlying algorithms operate, although it would be useful if they did.
ML applications aren't that transparent. A result may seem correct, but biased or misrepresentative training data could make it wrong. Wrong answers can also come from an ill-suited ML model that occasionally or systematically produces less-than-optimal answers. That's where explainable AI (XAI) can help.
Explainable AI explained
XAI is a way of allowing an AI or ML application to explain why it came up with a particular result. By providing a defined path from input to output, XAI can allow a tester to understand the logic connecting inputs to outputs that may otherwise be impenetrable.
XAI is a young field, and most commercial AI/ML applications have not yet reached the point of adopting it; the techniques behind the term are still vaguely defined. But while application users gain confidence when they have some rationale for a result, any explanation also helps development and testing teams validate the algorithms and training data and make sure the results accurately reflect the problem domain.
A fascinating example of an early XAI effort comes from Pepper, the SoftBank robot that responds to tactile stimulation. Pepper has been programmed to talk through its instructions as it is executing them. Talking through the instructions is a form of XAI, in that it enables users to understand why the robot is performing specific sequences of activities. Pepper will also identify contradictions or ambiguities through this process and knows when to ask for additional clarification.
Imagine how such a feature could assist testers. Using test data, the tester can obtain a result, then ask the application how it arrived at that result, walking through how the input data was processed so that the tester can document why the result is valid.
But that's just scratching the surface; XAI has to serve multiple constituents. For developers, it can help validate the technical approach and algorithms used. For testers, it helps confirm correctness and quality. For end users, it is a way of establishing trust in the application.
The three legs of the XAI stool
So how does XAI work? There is a long way to go here, but a few techniques show some promise. XAI rests on the principles of transparency, interpretability, and explainability.
- Transparency means that you can look into the algorithms to clearly discern how they are processing input data. While that may not tell you how those algorithms are trained, it provides insight into the path to the results and is intended for the design and development teams to interpret.
- Interpretability is how the results are presented for human understanding. In other words, if you have an application and are getting a particular result, you should be able to see and understand how that result was achieved, based on the input data and processing algorithms. There should be a logical pathway between data inputs and result outputs (see the sketch after this list).
- Explainability remains a vague concept while researchers try to define exactly how it might work. We might want to support queries into our results, or to get detailed explanations of more specific phases of the processing. But until there is better consensus, this feature remains a gray area.
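To make the interpretability leg more concrete, here is a minimal sketch that traces the logical pathway from inputs to an output for a single prediction, using a decision tree because its per-prediction path can be read directly; the toy data, labels, and feature names are purely illustrative.

```python
# A minimal sketch of an interpretable input-to-output pathway, traced through
# a decision tree. The data, labels, and feature names are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[25, 40_000], [52, 95_000], [37, 60_000], [29, 30_000]])
y = np.array([0, 1, 1, 0])                      # hypothetical deny/approve labels
feature_names = ["age", "income"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

sample = np.array([[45, 80_000]])
node_ids = clf.decision_path(sample).indices     # nodes visited for this sample
for node in node_ids:
    if clf.tree_.children_left[node] == -1:      # leaf node: final decision
        print(f"leaf -> predicted class {clf.predict(sample)[0]}")
    else:                                        # internal node: the rule applied
        f = clf.tree_.feature[node]
        t = clf.tree_.threshold[node]
        direction = "<=" if sample[0, f] <= t else ">"
        print(f"{feature_names[f]} = {sample[0, f]} {direction} {t:.1f}")
```

Most production ML models do not expose their logic this directly, which is exactly why the techniques in the next section exist.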
XAI techniques
Several techniques can help AI/ML applications with explainability. These tend to rely on quantitative measures to stand in for a qualitative explanation of a particular result.
Two common techniques are Shapley values and integrated gradients. Both offer quantitative measures that assess what each feature, or set of features, contributes to a particular result.
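As an illustration of the idea behind Shapley values, here is a minimal, self-contained sketch that computes them exactly for a toy scoring model; the model, feature names, and baseline are hypothetical, and real projects typically rely on approximation libraries such as SHAP because exact computation grows exponentially with the number of features.

```python
# A toy, exact Shapley-value computation for a simple scoring function.
# The model, feature names, and baseline are illustrative only.
from itertools import combinations
from math import factorial

def model(features: dict) -> float:
    # Hypothetical scoring model: a weighted sum of three inputs.
    return 3.0 * features["income"] + 2.0 * features["age"] - 1.5 * features["debt"]

def shapley_values(instance: dict, baseline: dict) -> dict:
    """Average each feature's marginal contribution over all feature subsets."""
    names = list(instance)
    n = len(names)
    values = {}
    for f in names:
        others = [x for x in names if x != f]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_f = {x: (instance[x] if x in subset or x == f else baseline[x])
                          for x in names}
                without_f = {x: (instance[x] if x in subset else baseline[x])
                             for x in names}
                total += weight * (model(with_f) - model(without_f))
        values[f] = total
    return values

print(shapley_values({"income": 1.2, "age": 0.5, "debt": 0.8},
                     {"income": 0.0, "age": 0.0, "debt": 0.0}))
```

Integrated gradients takes a related approach for differentiable models, attributing the change in output to each feature by accumulating gradients along a path from a baseline input to the actual input.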
Similarly, the contrastive explanations method is an after-the-fact computation that tries to explain individual results in terms of why one outcome occurred rather than a competing one. In other words, why did the application return this result and not that one?
Once again, this is a quantitative measure that rates the likelihood of one result over another. The numbers indicate the relative strength of each input's influence on the result.
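The full contrastive explanations method involves an optimization to find so-called pertinent positives and negatives, but the underlying question, why this result and not that one, can be illustrated with a much simpler sketch for a linear classifier; the weight vectors and feature names below are hypothetical.

```python
# A simplified illustration of a contrastive question ("why deny and not approve?")
# for a linear classifier. This is not the full contrastive explanations method;
# the weights and feature names are hypothetical.
import numpy as np

feature_names = ["late_payments", "utilization", "tenure"]
x = np.array([2.0, 0.7, 1.5])                  # one applicant's (scaled) features

# Hypothetical per-class weight vectors of a linear model.
weights = {
    "deny":    np.array([1.2, 2.0, -0.3]),
    "approve": np.array([-0.5, -1.0, 1.1]),
}

scores = {cls: w @ x for cls, w in weights.items()}
predicted = max(scores, key=scores.get)
contrast = min(scores, key=scores.get)

# Per-feature difference in contribution: positive values push toward the
# predicted class relative to the contrast class.
delta = (weights[predicted] - weights[contrast]) * x
for name, d in zip(feature_names, delta):
    print(f"{name}: {d:+.2f} toward '{predicted}' over '{contrast}'")
```

Features with large positive differences are the ones pushing the model toward the predicted result rather than the competing one.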
Data gets you only partway there
Ultimately, because AI/ML applications rely on data, and explainability has to manipulate that data with quantitative methods, we have no way beyond data science to provide explanations. The problem is that numerical weights can play a role in interpretability, but they are still a long way from true explainability.
AI/ML development teams need to understand and apply techniques such as these, for their own benefit and for the benefit of testers and users. In particular, without an explanation of the result at some level, it can be impossible for testers to determine whether or not the result returned is correct.
To assure the quality and integrity of AI/ML applications, testers have to have a means of determining how results are derived. XAI is a start, but it's going to take some time to fully realize this technique.