Machine Learning – Black Box Attacks and Transferability

Adversary Knowledge

  • White-box = the adversary has complete knowledge of the targeted model, including its parameter values, architecture, training method, and in some cases its training data
  • Black-box = the adversary has no knowledge of the ML model beyond input-output samples, either drawn from the training data or obtained by using the target model as an oracle

 

Threat model for black-box attacks

  • Adversarial capabilities = adversary has no knowledge of 
    • Training data
    • Model architecture
    • Model parameters
    • Model scores
  • However, the adversary can query the model and observe its outputs. Example: attacking models hosted on GCP or AWS
  • Adversarial goal = force an ML model remotely accessible through an API to misclassify (AWS, GCP)

 

Black-box attack steps

  • The adversary queries the remote ML system (e.g., AWS) and obtains labels for a set of inputs
  • The adversary uses this labeled data to train a local substitute for the remote system
  • The adversary selects new inputs for further queries to the remote ML system
  • The adversary then uses the local substitute to craft adversarial examples, which are also misclassified by the remote ML system (see the sketch below)
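
A minimal sketch of these steps in PyTorch, assuming the remote oracle returns only hard labels. The oracle is simulated here by a locally built network purely so the example runs end to end, and all shapes, architectures, and the FGSM step size are illustrative choices rather than details from the notes.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the remote MLaaS oracle (e.g., an AWS/GCP endpoint): in a real
# attack this would be an API call that returns hard labels only.
oracle = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
def remote_predict(x):
    with torch.no_grad():
        return oracle(x).argmax(dim=-1)

# Step 1: query the remote system to obtain labels for a set of inputs.
queries = torch.rand(256, 1, 28, 28)            # illustrative MNIST-sized inputs
labels = remote_predict(queries)

# Step 2: train a local substitute on the (input, oracle label) pairs.
substitute = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64),
                           nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(substitute.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    F.cross_entropy(substitute(queries), labels).backward()
    opt.step()

# Steps 3-4: craft adversarial examples against the substitute (FGSM) and
# check whether they also fool the remote oracle.
def fgsm(model, x, y, eps=0.1):
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

x_adv = fgsm(substitute, queries, labels)
transfer_rate = (remote_predict(x_adv) != labels).float().mean().item()
print(f"fraction of adversarial inputs misclassified remotely: {transfer_rate:.2f}")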

 

Adversarial Example Transferability

  • Transferability = the ability of an attack, crafted against a surrogate model, to be effective against a different, unknown target model
  • Adversarial examples have a transferability property: an adversarial example remains effective even against models other than the one used to generate it

 

Transferability Example

  • Adversarial samples crafted to mislead model A are likely to mislead model B
  • Intra-technique transferability = models A and B are trained using the same ML technique
  • Cross-technique transferability = models A and B are trained using different ML techniques (see the sketch below)
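
A toy sketch of the two notions on a synthetic 2-D Gaussian dataset: FGSM examples are crafted only against a small surrogate network (model A), then evaluated against a second, independently trained network (intra-technique) and a decision tree (cross-technique). Dataset, architectures, and the perturbation size are all illustrative assumptions.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.tree import DecisionTreeClassifier

# Toy two-class dataset: Gaussians centered at -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (500, 2)),
               rng.normal(1, 1, (500, 2))]).astype("float32")
y = np.array([0] * 500 + [1] * 500)

def train_mlp(hidden):
    net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 2))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    Xt, yt = torch.tensor(X), torch.tensor(y)
    for _ in range(200):
        opt.zero_grad()
        F.cross_entropy(net(Xt), yt).backward()
        opt.step()
    return net

model_A = train_mlp(16)                     # surrogate used to craft attacks
model_B = train_mlp(32)                     # same technique, different model
tree = DecisionTreeClassifier().fit(X, y)   # different technique entirely

# FGSM adversarial examples crafted only against model A.
Xt, yt = torch.tensor(X), torch.tensor(y)
Xt.requires_grad_(True)
F.cross_entropy(model_A(Xt), yt).backward()
X_adv = (Xt + 0.5 * Xt.grad.sign()).detach()

def err(pred):
    return (pred != y).mean()   # misclassification rate on the crafted inputs

print("intra-technique transfer error:", err(model_B(X_adv).argmax(1).numpy()))
print("cross-technique transfer error:", err(tree.predict(X_adv.numpy())))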

 

Results on ML-as-a-Service Systems

Train a substitute model on data obtained from a limited number of queries to the target model, use the substitute to generate adversarial examples, and test them against the remote platforms.

 

Takeaways from Black-box attacks

  • Both intra-technique and cross-technique adversarial sample transferability are consistently strong phenomena across the space of ML techniques
  • Black-box attacks are possible in practical settings against any unknown ML classifier
  • Black-box attacks against classifiers hosted by Amazon and Google achieve high misclassification rates by training a substitute model

 


Backdoor Attacks on Neural Networks

Machine Learning pipeline and backdoor (trojan) attacks

Definition of backdoor attacks:

  • Hidden patterns trained into DNN models.
  • Triggered by specific inputs, causing unexpected behavior.

Backdoor/Trojan Trigger:

  • A small input pattern that, when added to an input, causes the model to produce the target label.

Trojan Target Label:

  • The output label the attacker wants the backdoored model to produce for triggered inputs (see the sketch below).
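
As a concrete illustration, the sketch below stamps a small white square (the trigger) onto a fraction of training images and relabels those samples to the attacker's target label; the patch size, poison rate, and TARGET_LABEL are illustrative assumptions, not values from the referenced papers.

import torch

TARGET_LABEL = 7          # trojan target label the attacker wants

def add_trigger(x):
    # Stamp a small white square in the bottom-right corner (the trigger).
    x = x.clone()
    x[..., -3:, -3:] = 1.0
    return x

def poison(images, labels, rate=0.05):
    # Poison a fraction of the training set: add the trigger and relabel
    # those samples to the attacker's target class.
    n = int(rate * len(images))
    idx = torch.randperm(len(images))[:n]
    images, labels = images.clone(), labels.clone()
    images[idx] = add_trigger(images[idx])
    labels[idx] = TARGET_LABEL
    return images, labels

# Usage: the victim unknowingly trains on the poisoned set.
clean_x = torch.rand(1000, 1, 28, 28)
clean_y = torch.randint(0, 10, (1000,))
poisoned_x, poisoned_y = poison(clean_x, clean_y)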

 

Backdoor / Trojaning Attack

Attacker Capabilities:

  • Label manipulation.
  • Input feature corruption.

Attacker Goals:

  • Availability poison: Lower accuracy on benign inputs.
  • Integrity poison: Misclassification on specific inputs.

 

Poisoning Attack

Attacker Capabilities

  • Label Manipulation: The attacker can modify the training labels only
  • Input Manipulation: The attacker is more powerful and can corrupt the input features of training points

 

Attacker Goals

  • Availability poison: Decrease accuracy on benign inputs (see the label-flipping sketch below)
  • Integrity poison: Cause misclassification of certain inputs
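
A minimal sketch of the weaker, label-only capability producing an availability poison, using a synthetic scikit-learn dataset; the flip rate and the logistic-regression victim are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def flip_labels(labels, rate):
    # Label manipulation only: flip a fraction of the binary training labels.
    rng = np.random.default_rng(0)
    labels = labels.copy()
    idx = rng.choice(len(labels), int(rate * len(labels)), replace=False)
    labels[idx] = 1 - labels[idx]
    return labels

clean_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
poisoned_acc = LogisticRegression(max_iter=1000).fit(X_tr, flip_labels(y_tr, 0.3)).score(X_te, y_te)
print(f"benign-test accuracy: clean={clean_acc:.2f}, poisoned={poisoned_acc:.2f}")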

 

Poisoning Attack vs Backdoor Attack

Poisoning attacks:

  • Availability poison: Decrease accuracy on benign inputs
  • Integrity poison: Cause misclassification of certain inputs

 

Backdoor attacks:

  • A type of integrity poison
  • The key difference is the trigger
  • A DNN backdoor is a hidden pattern trained into a DNN that produces unexpected behavior if and only if a specific trigger is added to an input (see the evaluation sketch below)
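
The "if and only if the trigger is present" property is what backdoor evaluations measure: clean accuracy should stay high while triggered inputs are sent to the target label. A short sketch of the two metrics, assuming a trained model and an add_trigger/TARGET_LABEL pair like those in the poisoning sketch above (both hypothetical here).

import torch

def clean_accuracy(model, x, y):
    # Benign behavior: untriggered inputs are still classified correctly.
    with torch.no_grad():
        return (model(x).argmax(1) == y).float().mean().item()

def attack_success_rate(model, x, target_label, add_trigger):
    # Backdoor behavior: triggered inputs go to the attacker's target label.
    with torch.no_grad():
        return (model(add_trigger(x)).argmax(1) == target_label).float().mean().item()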

 

Backdoor Attacks

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

Trojaning Attack on Neural Networks

 

 
