It has been widely shown that machine learning (ML) models are vulnerable to adversarial attacks, in which a malicious user modifies the input to a model (e.g. an image, or electronic health record data) in such a way that the changes are imperceptible to the human eye, yet cause the ML model to produce an incorrect output. A model's susceptibility to these attacks reduces people's trust in machine learning and is a significant barrier to wider adoption of ML in sensitive domains such as healthcare.
We develop two novel, state-of-the-art explainability-based techniques that detect adversarial attacks, allowing us to build machine learning pipelines that are robust to such attacks. Our detection methods work by inspecting the parts of the input that the ML model deems important when making its decision.
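To make the idea concrete, the following is a minimal sketch of explainability-based attack detection, assuming a gradient-based saliency map as the explanation and a simple linear classifier as the downstream detector; the names (`compute_saliency`, `SaliencyDetector`) and the toy model are illustrative assumptions, not the implementation described here.

```python
# Sketch: detect adversarial inputs from the model's own explanations.
# Assumption: a gradient-based saliency map stands in for the explainability
# method, and a logistic-regression-style detector stands in for the trained
# attack-detection model.
import torch
import torch.nn as nn


def compute_saliency(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Gradient of the top predicted logit w.r.t. the input: a simple proxy
    for 'which parts of the input the model deems important'."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    logits.max(dim=1).values.sum().backward()
    return x.grad.abs().detach()


class SaliencyDetector(nn.Module):
    """Binary classifier over flattened saliency maps: clean (0) vs adversarial (1)."""

    def __init__(self, n_features: int):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)

    def forward(self, saliency: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(saliency.flatten(start_dim=1)))


if __name__ == "__main__":
    # Toy classifier and random inputs purely for illustration.
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
    detector = SaliencyDetector(n_features=32)

    x = torch.randn(8, 32)                  # batch of inputs (clean or attacked)
    saliency = compute_saliency(model, x)   # explanation of the model's decision
    p_adv = detector(saliency)              # probability each input is adversarial
    print(p_adv.squeeze(1))
```

In practice the detector would be trained on saliency maps computed for known clean and adversarially perturbed examples, so that it learns the characteristic differences between the two.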
We evaluate our adversarial attack detection models on medical datasets, achieving accuracies of 77% on the MIMIC-III electronic health record dataset and 100% on the MIMIC-CXR chest X-ray dataset. We also develop a method that can detect new, unseen adversarial attacks, achieving an accuracy of 87% on the MIMIC-CXR dataset. Integrating these techniques into machine learning pipelines could greatly increase the robustness of ML models to attacks, a prerequisite for the use of ML in healthcare.