Machine Learning – Black Box Attacks and Transferability

Adversary Knowledge

  • White-box = the adversary has complete knowledge of the targeted model, including its parameter values, architecture, training method, and in some cases its training data
  • Black-box = the adversary has no knowledge of the ML model beyond input-output samples, either drawn from the training data or obtained by using the target model as an oracle

 

Threat model for black-box attacks

  • Adversarial capabilities = adversary has no knowledge of 
    • Training data
    • Model architecture
    • Model parameters
    • Model scores
  • However, the adversary can query the model and observe its outputs. Example: attacking models hosted on GCP or AWS
  • Adversarial goal = force an ML model remotely accessible through an API to misclassify (AWS, GCP)

 

Black-box attack steps

  • The adversary queries the remote ML system (e.g., AWS) and obtains labels for a set of inputs
  • The adversary uses this labeled data to train a local substitute for the remote system
  • The adversary selects new inputs for further queries to the remote ML system
  • The adversary then uses the local substitute to craft adversarial examples, which are also misclassified by the remote ML system (see the sketch below)
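
A minimal sketch of these steps in PyTorch, assuming the remote oracle returns only hard labels. The oracle is simulated here by a locally built network purely so the example runs end to end, and all shapes, architectures, and the FGSM step size are illustrative choices rather than details from the notes.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the remote MLaaS oracle (e.g., an AWS/GCP endpoint): in a real
# attack this would be an API call that returns hard labels only.
oracle = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
def remote_predict(x):
    with torch.no_grad():
        return oracle(x).argmax(dim=-1)

# Step 1: query the remote system to obtain labels for a set of inputs.
queries = torch.rand(256, 1, 28, 28)            # illustrative MNIST-sized inputs
labels = remote_predict(queries)

# Step 2: train a local substitute on the (input, oracle label) pairs.
substitute = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64),
                           nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(substitute.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    F.cross_entropy(substitute(queries), labels).backward()
    opt.step()

# Steps 3-4: craft adversarial examples against the substitute (FGSM) and
# check whether they also fool the remote oracle.
def fgsm(model, x, y, eps=0.1):
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

x_adv = fgsm(substitute, queries, labels)
transfer_rate = (remote_predict(x_adv) != labels).float().mean().item()
print(f"fraction of adversarial inputs misclassified remotely: {transfer_rate:.2f}")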

 

Adversarial Example Transferability

  • Transferability = the ability of an attack, crafted against a surrogate model, to be effective against a different, unknown target model
  • Adversarial examples have a transferability property: an adversarial example remains effective even against models other than the one used to generate it

 

Transferability Example

  • Adversarial samples crafted to mislead model A are likely to mislead model B
  • Intra-technique transferability = models A and B are trained using the same ML technique
  • Cross-technique transferability = models A and B are trained using different ML techniques (see the sketch below)
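
A toy sketch of the two notions on a synthetic 2-D Gaussian dataset: FGSM examples are crafted only against a small surrogate network (model A), then evaluated against a second, independently trained network (intra-technique) and a decision tree (cross-technique). Dataset, architectures, and the perturbation size are all illustrative assumptions.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.tree import DecisionTreeClassifier

# Toy two-class dataset: Gaussians centered at -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (500, 2)),
               rng.normal(1, 1, (500, 2))]).astype("float32")
y = np.array([0] * 500 + [1] * 500)

def train_mlp(hidden):
    net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 2))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    Xt, yt = torch.tensor(X), torch.tensor(y)
    for _ in range(200):
        opt.zero_grad()
        F.cross_entropy(net(Xt), yt).backward()
        opt.step()
    return net

model_A = train_mlp(16)                     # surrogate used to craft attacks
model_B = train_mlp(32)                     # same technique, different model
tree = DecisionTreeClassifier().fit(X, y)   # different technique entirely

# FGSM adversarial examples crafted only against model A.
Xt, yt = torch.tensor(X), torch.tensor(y)
Xt.requires_grad_(True)
F.cross_entropy(model_A(Xt), yt).backward()
X_adv = (Xt + 0.5 * Xt.grad.sign()).detach()

def err(pred):
    return (pred != y).mean()   # misclassification rate on the crafted inputs

print("intra-technique transfer error:", err(model_B(X_adv).argmax(1).numpy()))
print("cross-technique transfer error:", err(tree.predict(X_adv.numpy())))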

 

Results on ML-as-a-Service Systems

Train a substitute model on data obtained from a limited number of queries to the target model, use the substitute to generate adversarial examples, and test them against the remote platforms.

 

Takeaways from Black-box attacks

  • Both intra-technique and cross-technique adversarial sample transferability are consistently strong phenomena across the space of ML techniques
  • Black-box attacks are possible in practical settings against any unknown ML classifier
  • Black-box attacks against classifiers hosted by Amazon and Google achieve high misclassification rates by training a substitute model

 


Backdoor Attacks on Neural Networks

Machine Learning pipeline and backdoor (trojan) attacks

Definition of backdoor attacks:

  • Hidden patterns trained into DNN models.
  • Triggered by specific inputs, causing unexpected behavior.

Backdoor/Trojan Trigger:

  • A small input pattern that, when added to an input, causes the model to produce the target label.

Trojan Target Label:

  • The output label the attacker wants the backdoored model to produce for triggered inputs (see the sketch below).
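
As a concrete illustration, the sketch below stamps a small white square (the trigger) onto a fraction of training images and relabels those samples to the attacker's target label; the patch size, poison rate, and TARGET_LABEL are illustrative assumptions, not values from the referenced papers.

import torch

TARGET_LABEL = 7          # trojan target label the attacker wants

def add_trigger(x):
    # Stamp a small white square in the bottom-right corner (the trigger).
    x = x.clone()
    x[..., -3:, -3:] = 1.0
    return x

def poison(images, labels, rate=0.05):
    # Poison a fraction of the training set: add the trigger and relabel
    # those samples to the attacker's target class.
    n = int(rate * len(images))
    idx = torch.randperm(len(images))[:n]
    images, labels = images.clone(), labels.clone()
    images[idx] = add_trigger(images[idx])
    labels[idx] = TARGET_LABEL
    return images, labels

# Usage: the victim unknowingly trains on the poisoned set.
clean_x = torch.rand(1000, 1, 28, 28)
clean_y = torch.randint(0, 10, (1000,))
poisoned_x, poisoned_y = poison(clean_x, clean_y)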

 

Backdoor / Trojaning Attack

Attacker Capabilities:

  • Label manipulation.
  • Input feature corruption.

Attacker Goals:

  • Availability poison: Lower accuracy on benign inputs.
  • Integrity poison: Misclassification on specific inputs.

 

Poisoning Attack

Attacker Capabilities

  • Label Manipulation: The attacker can modify the training labels only
  • Input Manipulation: The attacker is more powerful and can corrupt the input features of training points

 

Attacker Goals

  • Availability poison: Decrease accuracy on benign inputs (see the label-flipping sketch below)
  • Integrity poison: Cause misclassification of certain inputs
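
A minimal sketch of the weaker, label-only capability producing an availability poison, using a synthetic scikit-learn dataset; the flip rate and the logistic-regression victim are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def flip_labels(labels, rate):
    # Label manipulation only: flip a fraction of the binary training labels.
    rng = np.random.default_rng(0)
    labels = labels.copy()
    idx = rng.choice(len(labels), int(rate * len(labels)), replace=False)
    labels[idx] = 1 - labels[idx]
    return labels

clean_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
poisoned_acc = LogisticRegression(max_iter=1000).fit(X_tr, flip_labels(y_tr, 0.3)).score(X_te, y_te)
print(f"benign-test accuracy: clean={clean_acc:.2f}, poisoned={poisoned_acc:.2f}")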

 

Poisoning Attack vs Backdoor Attack

Poisoning attacks:

  • Availability poison: Decrease accuracy on benign inputs
  • Integrity poison: Cause misclassification of certain inputs

 

Backdoor attacks:

  • A type of integrity poison
  • The key difference is the trigger
  • A DNN backdoor is a hidden pattern trained into a DNN that produces unexpected behavior if and only if a specific trigger is added to an input (see the evaluation sketch below)
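
The "if and only if the trigger is present" property is what backdoor evaluations measure: clean accuracy should stay high while triggered inputs are sent to the target label. A short sketch of the two metrics, assuming a trained model and an add_trigger/TARGET_LABEL pair like those in the poisoning sketch above (both hypothetical here).

import torch

def clean_accuracy(model, x, y):
    # Benign behavior: untriggered inputs are still classified correctly.
    with torch.no_grad():
        return (model(x).argmax(1) == y).float().mean().item()

def attack_success_rate(model, x, target_label, add_trigger):
    # Backdoor behavior: triggered inputs go to the attacker's target label.
    with torch.no_grad():
        return (model(add_trigger(x)).argmax(1) == target_label).float().mean().item()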

 

Backdoor Attacks

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

Trojaning Attack on Neural Networks

 

 
