Glossary of Threats
Adversarial Attacks
Backdoor attacks
A hacker injects "backdoors" to their victim’s ML models during model training to train the model with poisoned data in order to manipulate outputs, steal and hijack entire systems.
Data Stealing
The unauthorized acquisition of data with the intention of misusing it through methods like hacking, phishing, malware, data breaches, and insider threats.
Denial of Service (DoS)
An adversarial attack that shuts down machines or networks to prevent them from functioning normally.
Evasion
The most common adversarial attack on ML and AI models performed during inference. An evasion happens by designing an input that seems normal, but is wrongly classified by ML or AI models. For example, an attacker could mislead an LLM by adding universal adversarial prefixes to their prompt, which tricks the LLM into processing manipulated requests and potentially accessing restricted information.
Model Stealing
A technique that allows adversaries to create models that imitate the functionality of black-box (defined below under Cyber Attacks) ML models. The attacker then queries the stolen models to gain insights without accessing the original models.
Poisoning
Involves manipulating a training dataset by introducing, modifying, or deleting specific data points. Attackers poison data to introduce biases, errors, or vulnerabilities into ML models, negatively impacting their decisions or predictions during deployment.
Gen AI/LLM Risks
Hallucination
When a model generates misleading, nonsensical or incorrect output. Models often lack the capacity to respond "I don't know" and instead generate false information with unwavering confidence.
Jailbreak
Written phrases and creative prompts that bypass or trick a model's safeguards to draw out prohibited information that would otherwise be blocked by content filters and guidelines. Unlike Prompt Injection, which aims at the system outputs, Jailbreaking seeks to compromise the model's alignment.
PII
Also known as Personally Identifiable Information, PII risks violating privacy laws or stipulations when used to prompt or train GenAI models. This not only risks the deletion of data containing PII, but also entire models.
Prompt Injection
Meant to elicit unintended behaviors, Prompt Injections are attacks that impact outputs like search rankings, website content and chatbot behaviors. A Direct Prompt Injection is intentional, while an Indirect one involves a user who unknowingly "injects" commands and text.
Toxicity
When LLMs produce manipulative images and text, potentially leading to the spread of disinformation and other harmful consequences.
Trustworthiness
Data Drifts
When the accuracy of AI models drifts within days. Business KPIs are negatively affected when production data differs from training data.
Biases
Refer to AI systems that produce biased results, usually reflective of human societal biases.
Explainability
The level at which human users are able to comprehend and trust the results created by AI, based on tracking how the AI made decisions and reached conclusions.
Fairness
In the context of AI, fairness is the process of correcting and eliminating algorithmic biases (about race, ethnicity, gender, sexual orientation, etc.).
Out-of-Distribution (OOD)
Data that deviates from patterns AI models were trained on, which leads models to behave in unexpected ways.
Weak Spots
AI can have several weak spots which may be exploited and pose risks.
Cyber Attacks on AI
Blackbox or Graybox
When attackers have either partial or no knowledge of a model, other than its input.
Malicious Code
A breed of code that can be used to corrupt files, erase hard drives, or allow attackers to access systems. Malicious code includes trojan horses, worms, and macros, and spreads by visiting infected websites or downloading infected attachments or files.
Malware Injection
When malware is injected into an established software program, website, or database using methods like SQL injection and command injection.
White Box
Also known as XAI attacks, this is when attackers know everything about the deployed model, e.g., inputs, model architecture, and specific model internals like weights or coefficients. Compared to blackbox attacks, white box attacks allow attackers more opportunity to gain information once they are able to access network gradients of XAI models.
Trojan
Attacks that embed malicious code within training datasets and updates which seem benign. After attacking the AI system, these hidden payloads manipulate the model’s decision-making, causing data exfiltration and output poisoning.
Privacy
Data Extraction
When trained attack models are tasked with determining whether a data point is in a training set that can expose essential information like private API keys.
Model Inversion
An attack in which a machine learning model - an inversion model - is trained on the target model's output to predict the original dataset of the target model and infer sensitive information.
Private Data Leakage
When an LLM discloses information that should have remained confidential, leading to privacy and security breaches.