Getting to know a black-box model:

A two-dimensional example of Jacobian-based adversarial attacks and Jacobian-based data augmentation

Adrian Botta
Towards Data Science

--

Mirror Lake and Lone Eagle Peak in Colorado 7/2018

As the hype around AI has grown, so have discussions around adversarial examples. An adversarial example, also referred to as an attack, is an input that has been crafted to be misclassified by a machine learning model. These inputs are usually high dimensional, such as a photo, an audio sample, a string of text, or even software code.

There are several wonderful blogs that cover adversarial attacks and defenses at a theoretical and introductory level. To understand adversarial examples, it helps to first understand:

  1. How machine learning models make decisions
  2. The ‘data manifold’ and higher-dimensional spaces
  3. Adversarial noise, or a perturbation

Since these concepts are challenging to visualize in a high-dimensional space, we’ll walk through some of the core techniques used in adversarial attacks with a simple two-dimensional example. This will help us build intuition for what happens in higher dimensions.

We’ll build a logistic regression classifier, which will act as the model we intend to attack or trick: our ‘victim’ model. Then, we will walk through how to use a gradient-based method to attack our victim, and then a black-box version of our victim. (All of the code used to produce this post can be seen here.)

Building a victim model

Let’s borrow Martín Pellarolo’s example of fitting a logistic regression to a subset of the iris data set. We’ll call the two input variables X1 and X2 and the classes class 0 and class 1 to keep it simple.

Subset of iris data set
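For reference, here is a minimal sketch of how such a subset could be prepared with scikit-learn. The exact feature columns and preprocessing used in the original example may differ; this is just one plausible setup.

import numpy as np
from sklearn import datasets

# Keep only two classes and two features so everything stays two-dimensional
iris = datasets.load_iris()
mask = iris.target < 2          # classes 0 and 1 only
X = iris.data[mask][:, :2]      # the two input variables, X1 and X2
y = iris.target[mask]           # class 0 or class 1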

Our goal when training a machine learning model is to determine the line in this two-dimensional space that best separates the two classes. Luckily, it’s an easy task given that the two classes are visibly separated and there’s not much overlap. To do this, we’ll fit a logistic regression, which models the probability of a data point belonging to class 1. Using the sigmoid function (represented as g) and some parameters θ, we’ll fit this probability distribution to our data.

Sigmoid function with parameters θ

By changing the parameters in the matrix θ, we can adjust the function g to best fit our data X.

def sigmoid(X, theta):
    # g(theta^T x) = 1 / (1 + e^(-theta^T x))
    return 1 / (1 + np.exp(-np.dot(X, theta[0])))

We’ll use binary cross entropy loss as the loss function to determine how close the model’s predictions are to the ground truth.

Binary Cross-entropy Loss
def loss(X, theta, y):
    # mean over all points of: -y * log(h) - (1 - y) * log(1 - h)
    h = sigmoid(X, theta)
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

The partial derivative of the loss function with respect to (w.r.t.) θ tells us the direction in which we need to change the values of θ to change the loss. In this case, we want to minimize the loss.

Partial derivative of the loss function w.r.t. θ
# dJ/dtheta = (1/m) * X^T (h - y)
h = sigmoid(X, theta)
gradient_wrt_theta = np.dot(X.T, (h - y)) / y.shape[0]

Once we’ve minimized the loss function by making directed updates to θ, our victim model is trained!
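Putting the pieces together, a gradient-descent loop might look like the sketch below. The learning rate and iteration count are assumptions for illustration, not values from the original post, and no intercept term is added here.

theta = np.zeros((1, X.shape[1]))   # the shape sigmoid() above expects
lr = 0.1                            # assumed learning rate
for _ in range(10000):              # assumed number of iterations
    h = sigmoid(X, theta)
    gradient_wrt_theta = np.dot(X.T, (h - y)) / y.shape[0]
    theta[0] -= lr * gradient_wrt_theta   # step against the gradient to reduce the loss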

Victim model’s probability distribution for class 1

The graph above shows the model’s probability that any point in this space belongs to class 1; the probability of belonging to class 0 is simply 1 − P(y=1). It also shows our model’s decision boundary at a probability threshold of 0.5. If a point is above the line, the probability of it belonging to class 1 is below 50%, and since the model ‘decides’ at this threshold, it will assign 0 as its label prediction.
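In code, that decision rule is just a threshold on the sigmoid output (a small sketch reusing the functions defined above):

probs = sigmoid(X, theta)            # P(y = 1) for every point
preds = (probs >= 0.5).astype(int)   # above the threshold -> class 1, otherwise class 0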

Attacking the victim model

The objective of a Jacobian-based or gradient-based attack, described in Explaining and Harnessing Adversarial Examples by Goodfellow et al., is to move a point over a victim model’s decision boundary. In our example, we’ll take a point that is normally classified as class 0 and ‘push’ it over the victim model’s decision boundary so that it is classified as class 1. This change to the original point is also called a perturbation when working with higher-dimensional data, because we’re making a very small change to the input.

‘Pushing’ a data point over the victim model’s decision boundary

As you may recall, when training the logistic regression, we used the loss function and its derivative w.r.t. θ to determine how θ needs to change to minimize the loss. As an attacker with full knowledge of how the victim model works, we can instead determine how to change the loss by changing the other input to the function: X.

Partial derivative of the loss function w.r.t. X

The derivative of the loss function w.r.t. X tells us exactly in which direction we need to change the values of X to change the victim model’s loss.

# dJ/dX = (1/m) * (h - y) * theta  (one row per data point)
h = sigmoid(X, theta)
gradient_wrt_X = np.dot(np.expand_dims((h - y), 1), theta) / y.shape[0]

Since we want to attack the model, we need to maximize its loss. Changing the values of X essentially moves X in the two-dimensional space. But direction is only one component of an adversarial perturbation; we also need to decide how large a step (represented as epsilon) to take in that direction to cross the decision boundary.

# Normalizing gradient vectors to make sure the step size is consistent
# (necessary for our 2-d example, but not called for in the paper)
gradient_magnitudes = np.linalg.norm(gradient_wrt_X, axis=1, keepdims=True)
grads_norm = gradient_wrt_X / gradient_magnitudes

# Creating the adversarial perturbation
epsilon = 0.5  # The step size can be adjusted
X_advs = X + grads_norm * epsilon

An adversary must consider which data points or inputs to use, as well as the smallest epsilon necessary to successfully push a point over the decision boundary. If the adversary starts with points that are very far into the class 0 region, they’ll need a larger and more noticeable perturbation to convert them into adversarial examples. Let’s convert some of the class 0 points that are closest to the decision boundary. (Note: Other techniques allow for creating adversarial examples from random noise - paper & article)
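Before declaring victory, we can check that the perturbed points really do cross the boundary by applying the same 0.5 threshold as before (a quick sanity check reusing sigmoid, theta, and X_advs from above):

preds_before = (sigmoid(X, theta) >= 0.5).astype(int)
preds_after = (sigmoid(X_advs, theta) >= 0.5).astype(int)
# For the class 0 points we chose near the boundary, the label should flip from 0 to 1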

‘Pushing’ several data points over the victim model’s decision boundary using epsilon = 0.5

We’ve successfully created some adversarial examples!

…Well, you may be thinking:

  1. “We just moved points around. These points are just different points now…”
  2. “… and we were able to do this because we knew everything about the victim model. What if we don’t know how a victim model works?”

and you would be right. We’ll come back to point 1, but the techniques from Practical Black-Box Attacks against Machine Learning by Nicolas Papernot et al. will help us with point 2.

Attacking Black-Box Models:

When we know everything about a model, we refer to it as a ‘white-box’ model. In contrast, when we know nothing about how a model works, we refer to it as a ‘black-box’ model. We can think of a black-box model as an API that we ping by sending inputs and receiving outputs (labels, class numbers, etc.). Understanding black-box attacks is vital because they prove that models hidden behind an API may seem safe but are in fact still vulnerable to attack.

Papernot’s paper introduces Jacobian-based dataset augmentation, a technique that aims to train another model, called the substitute model, to share very similar decision boundaries with the victim model. Once the substitute model is trained to have almost the same decision boundaries as the victim model, an adversarial perturbation created to move a point over the substitute model’s decision boundary will likely also cross the victim model’s decision boundary. The technique achieves this by exploring the space around the victim model’s decision boundary and observing how the victim responds.

Training a substitute model

This technique can be described as a child learning to annoy their parents. The child starts with no preconception of what makes their parents angry, but they can test their parents by picking a random set of actions over the course of a week and noting how their parents respond to those actions. While a parent may exhibit a non-binary response to each of these, let’s pretend that the child’s actions are either bad or good (two classes). After the first week, the child has learned a bit about what bothers their parents and makes an educated guess as to what else would bother them. The next week, the child dials down the actions that were successful and takes the actions that weren’t successful a step further. The child repeats this each week, noting their parents’ responses and adjusting their understanding of what will bother their parents, until they know exactly what annoys them and what doesn’t.

Jacobian-based dataset augmentation works in the same way: a small random sample of the initial data is taken, labeled by the black-box model, and used to train a very poor first substitute model. Adversarial examples are then created from that dataset using the gradient-based attack from earlier; each new point is a step in the direction of the substitute model’s gradient, and the black-box model is queried to determine whether it classifies the new data points the same way as the substitute model.

Substitute model’s decision boundary converges to that of the black-box model

The augmented data is labeled by the black-box model and used to train a better substitute model. Just like the child, the substitute model gets a more precise understanding of where the black-box model’s decision boundary is. After a few iterations of this, the substitute model shares almost the exact same decision boundaries as the black-box model.
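Here is a rough sketch of that loop in the spirit of the paper. The choice of logistic regression as the substitute, the sample size, the number of rounds, and the step size lambda are all assumptions for illustration, and query_black_box is a hypothetical stand-in for whatever API access we have to the victim model.

from sklearn.linear_model import LogisticRegression

X_sub = X[np.random.choice(len(X), 10, replace=False)]  # small initial sample (size is arbitrary)
lam = 0.1                                               # augmentation step size (arbitrary)
for _ in range(5):                                      # a few augmentation rounds
    y_sub = query_black_box(X_sub)                      # 1. label the current points with the black-box model
    substitute = LogisticRegression().fit(X_sub, y_sub) # 2. train/refresh the substitute (assumes both classes appear)
    # 3. step each point in the direction of the substitute's gradient w.r.t. X
    #    (the same quantity we computed by hand earlier: (h - y) * theta), using the sign step from the paper
    h = substitute.predict_proba(X_sub)[:, 1]
    grads = np.expand_dims(h - y_sub, 1) * substitute.coef_
    X_sub = np.vstack([X_sub, X_sub + lam * np.sign(grads)])  # 4. keep both the old and the new points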

The substitute model doesn’t even need to be the same type of ML model as the black-box model. In fact, a simple Multi-Layer Perceptron is enough to learn decision boundaries close to those of a complex Convolutional Neural Network. Ultimately, with a small sample of data and a few iterations of data augmentation and labeling, a black-box model can be successfully attacked.

The curse of the dot-product:

Now, back to point 1: You’re right, I was moving points in a 2-dimensional space. While that was for the sake of keeping the example simple, adversarial attacks take advantage of a property of neural networks that amplifies signals. Andrej Karpathy explains the effect of the dot-product in more detail here.

The dot-product amplifies the signal from small adversarial perturbations

In our two-dimensional example, to move a point across the victim model’s decision boundary, we need to move it with step size epsilon. When a neural network’s weight matrices are multiplied with a normal input, the products of each weight and each input value are summed. With an adversarial input, however, the added adversarial signal accumulates across that sum, growing with the total input dimensionality. This means that, to achieve the step size needed to cross the decision boundary, we can make smaller changes to each individual value of X as the number of input dimensions increases. The larger the input dimensionality, the harder it is for us to notice the adversarial perturbation; this effect is one of the reasons why adversarial examples for MNIST are more noticeable than adversarial examples for ImageNet.
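A tiny numerical illustration of this argument (the dimensions and epsilon are arbitrary): perturbing each coordinate by epsilon in the direction of the weight vector shifts the dot product by epsilon times the L1 norm of the weights, which grows with dimensionality even though each per-coordinate change stays tiny.

import numpy as np

epsilon = 0.01
for n in (2, 1000):                           # a 2-d input vs. a 1000-d input
    w = np.random.randn(n)                    # weight vector
    x = np.random.randn(n)                    # "clean" input
    x_adv = x + epsilon * np.sign(w)          # tiny per-coordinate perturbation
    shift = np.dot(w, x_adv) - np.dot(w, x)   # equals epsilon * sum(|w|)
    print(n, shift)                           # the shift grows with dimensionality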

Gradient-based attacks have proven to be effective techniques that exploit the way deep learning models turn high-dimensional inputs into probability distributions. Black-box attacks demonstrate that as long as we have access to a victim model’s inputs and outputs, we can create a good enough copy of the model to use for an attack.

However, these techniques have weaknesses. To use a gradient-based attack, we need to know exactly how inputs are embedded (turned into a machine-readable format like a vector). For instance, an image is usually represented as a 2-d or 3-d matrix of pixel values, a consistent representation of the information. On the other hand, unstructured data like text may be embedded using secret pre-trained or learned word embeddings, and since we can’t take the derivative of something w.r.t. a word, we need to know how that word is being represented. Training a substitute model also requires a set of possibly detectable queries to the black-box model. And researchers are finding more and more ways to defend their models against adversarial attacks. Regardless, we must be proactive in understanding the vulnerabilities of our models.
