UNIVERSIDADE FEDERAL DO ESPÍRITO SANTO

CENTRO TECNOLÓGICO

PROGRAMA DE PÓS-GRADUAÇÃO EM INFORMÁTICA

Jacson Rodrigues Correia da Silva

Copycat CNN: Convolutional Neural Network Extraction
Attack with Unlabeled Natural Images

Vitória, ES

2023


Doctoral thesis submitted to the Programa de Pós-Graduação em Informática of the
Universidade Federal do Espírito Santo, in partial fulfillment of the requirements
for the degree of Doctor in Computer Science.

Supervisor: Prof. Dr. Thiago Oliveira dos Santos
Co-supervisor: Prof. Dr. Alberto Ferreira de Souza


Cataloging record made available by the Sistema Integrado de
Bibliotecas - SIBI/UFES and prepared by the author

C824c
Correia da Silva, Jacson Rodrigues, 1985-
Copycat CNN : Convolutional Neural Network Extraction
Attack with Unlabeled Natural Images / Jacson Rodrigues
Correia da Silva. - 2023.
174 leaves : ill.

Supervisor: Thiago Oliveira dos Santos.
Co-supervisor: Alberto Ferreira De Souza.
Thesis (Doctorate in Computer Science) - Universidade
Federal do Espírito Santo, Centro Tecnológico.

1. Artificial intelligence. 2. Neural networks (Computer science). I.
Oliveira dos Santos, Thiago. II. De Souza, Alberto Ferreira. III.
Universidade Federal do Espírito Santo. Centro Tecnológico. IV.
Title.

CDU: 004


Approved on April 25, 2023.

Prof. Dr. Thiago Oliveira dos Santos
Supervisor

Prof. Dr. Claudine Santos Badue Gonçalves
Internal Member

Prof. Dr. Thomas Walter Rauber
Internal Member

Prof. Dr. Jurandy Gomes de Almeida Junior
External Member, remote participation

Prof. Dr. Eduardo José da Silva Luz
External Member, remote participation

UNIVERSIDADE FEDERAL DO ESPÍRITO SANTO
Vitória/ES, April 25, 2023



Acknowledgements

There are many moments of struggle and battles that seem to have no end.
Everyone’s life has many obstacles, which can become lighter or be overcome thanks to the
direct and indirect support of the people around us. At this moment, I thank God, who
has always been by my side through life’s coincidences, manifested through the hands of
the people around me, even those who may not believe in Him. He has given me strength
when I needed it and guided my thoughts, enabling me to move forward.

I am very grateful to my wife, who has been with me throughout this entire journey,
providing me with strong support, affection, and love. Moreover, I am grateful to my
son, who has brought a new perspective to my world. Understanding his mind can be
challenging at times, but through him, I have gained a deeper understanding of myself. I
love you and I understand, by your gestures, how much you love me too.

Many thanks to my parents, who have always been there for me and my family,
offering their unwavering presence and support when we needed them most. I also want
to express my gratitude to my sisters and brothers-in-law, who have given me their time,
attention, affection, and support, whether they are near or far.

I am extremely grateful to my supervisor, who consistently demonstrated his
willingness to guide and listen to me. He always made intelligent choices and approached
situations with an open mind, charting new paths and guiding me through the difficulties
encountered on this journey. I would also like to express my heartfelt thanks to the team
at NXP Semiconductors, who provided me with invaluable life experiences and several
opportunities for growth and learning.

Finally, I want to express my gratitude to everyone who helped me, directly
or indirectly, during this process; to NVIDIA, for providing a GPU used
in this research; and to the AWS Cloud Credit for Research program, for providing
cloud computing resources.




Abstract
Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance
on a variety of problems in recent years, leading many companies to develop neural-based
products that require expensive data acquisition, annotation, and model generation.
To protect their models from being copied or attacked, companies often deliver them
as black boxes accessible only through APIs, which must be secure, robust, and reliable
across different problem domains. However, recent studies have shown that state-of-the-art
CNNs have vulnerabilities: simple perturbations of the input images can change the
model’s response, and even images unrecognizable to humans can be classified with
high confidence. These methods need access to the model parameters,
but other studies show how to generate a copy (imitation) of a model using its
output probabilities (soft labels) and problem domain data. With such a surrogate model, an
adversary can attack the target model with a higher chance of success. We
further explore these vulnerabilities. Our hypothesis is that, by using publicly available
images (accessible to everyone) and responses that any model must provide (even a black
box), it is possible to copy a model while achieving high performance. We therefore propose
a method called Copycat to explore CNN classification models. The model is copied
in two stages: first, the target is queried with random natural images, such as those
from ImageNet, and each image is annotated with the class of maximum probability
(hard label). Then, these labeled images are used to train a Copycat model that should
achieve performance similar to that of the target model. We evaluated this hypothesis on
seven real-world problems and against a cloud-based API. All Copycat models achieved
more than 96.4% of the performance (F1-Score) of their target models. After achieving
these results, we performed several experiments to consolidate and evaluate our method.
Furthermore, concerned about this vulnerability, we also analyzed several existing
defenses against the Copycat method. In these experiments, defenses that detect attack
queries did not work against our method, but defenses that use watermarking could
identify the target model’s Intellectual Property. Thus, the method proved effective at
model extraction and resistant to the defenses from the literature, being identified
only by watermark-based defenses.

Keywords: Deep learning. Convolutional neural networks. Stealing network knowledge.
Knowledge distillation. Model extraction. Model stealing. Model compression.
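
The two stages described in the abstract (query a black box for hard labels, then train a copy on those labels) can be sketched with a toy analogue. This is only an illustration under strong assumptions and not the thesis implementation: the target here is a hidden linear classifier rather than a CNN behind an API, the queries are random vectors rather than natural images, and the names `query_target` and `train_copy`, as well as every size and hyperparameter, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: a hidden linear classifier stands in for the black-box target.
# The attacker never reads _W_secret; only query_target() is available.
_W_secret = rng.normal(size=(8, 3))

def query_target(x):
    """Return only the hard label (index of the maximum score) per input."""
    return np.argmax(x @ _W_secret, axis=1)

def train_copy(x, y, n_classes, epochs=300, lr=0.5):
    """Fit a softmax-regression copy on (input, stolen label) pairs."""
    w = np.zeros((x.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = x @ w
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        w -= lr * x.T @ (p - onehot) / len(x)        # cross-entropy gradient step
    return w

# Stage 1: query the target with unlabeled data and keep its hard labels.
x_attack = rng.normal(size=(5000, 8))
stolen_labels = query_target(x_attack)

# Stage 2: train the copy using only the stolen labels.
w_copy = train_copy(x_attack, stolen_labels, n_classes=3)

# Fraction of held-out inputs on which copy and target give the same label.
x_test = rng.normal(size=(1000, 8))
agreement = np.mean(np.argmax(x_test @ w_copy, axis=1) == query_target(x_test))
print(f"copy/target agreement: {agreement:.3f}")
```

The sketch measures agreement between copy and target, which is a simpler quantity than the relative F1-Score reported in the thesis; it is used here only because the toy problem has no ground-truth labels.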


List of Figures

Figure 1 – Examples of smart devices and APIs
Figure 2 – Illustration of a Convolutional Neural Network
Figure 3 – Samples of adversarial examples
Figure 4 – Samples of adversarial examples
Figure 5 – Images that are unrecognizable to humans but reach high confidence in state-of-the-art Deep Neural Networks (DNNs)
Figure 6 – An example of 2-D convolution
Figure 7 – An example of max-pooling
Figure 8 – Architecture of LeNet-5
Figure 9 – AlexNet Architecture
Figure 10 – VGG-16 Architecture
Figure 11 – Example of a residual block for ResNet network
Figure 12 – Example of the LRP method
Figure 13 – Example of LRP method composite strategy
Figure 14 – Watermark example
Figure 15 – Illustration of the behavior of a watermarked image in a clean model and in a watermarked model
Figure 16 – Overview of the Copycat creation
Figure 17 – Network architecture of an illustrative example of Copycat
Figure 18 – Training dataset (ODD) of illustrative example of Copycat
Figure 19 – Attack dataset (NPDD) of illustrative example of Copycat
Figure 20 – Feature map and classification space of the Oracle in the illustrative example of Copycat
Figure 21 – Several runs of the illustrative example of Copycat
Figure 22 – Illustrative Block Diagram of Adaptive Misinformation
Figure 23 – PRADA’s Algorithm
Figure 24 – Frequency of neuron activations in a small network
Figure 25 – Three random samples of watermarked images for CIFAR10, MNIST, and FashionMNIST
Figure 26 – t-SNE mapping of the ODD and NPDD-SL points to the classification space of the Oracle for the DIG10 and FER7 problems
Figure 27 – Relative F1-Score of the Copycats
Figure 28 – Distribution of labels after querying the Oracle
Figure 29 – Data curve performance of Copycats CC-VGG-NPDD-SL
Figure 30 – Performances of Copycat CC-VGG-NPDD-SL using the Framework
Figure 31 – Relative F1-Score of the Oracles on the seven problems
Figure 32 – Relative F1-Score of Copycats (100k) over the Oracle
Figure 33 – Relative F1-Score of Copycats (500k images) over the Oracle
Figure 34 – Heatmaps generated with LRP for three problems
Figure 35 – PDD heatmaps generated with LRP on Copycat Framework
Figure 36 – NPDD heatmaps generated with LRP on Copycat Framework
Figure 37 – Pearson correlation distribution of PDD heatmaps on Copycat Framework
Figure 38 – Pearson correlation distribution of PDD heatmaps on Copycat Framework
Figure 39 – Pearson correlation distribution of NPDD heatmaps on Copycat Framework
Figure 40 – Pearson correlation distribution of NPDD heatmaps on Copycat Framework
Figure 41 – Performance of Copycats over Azure API
Figure 42 – Threshold selection for ADMIS
Figure 43 – ADMIS: Oracle results
Figure 44 – ADMIS: results of ADMIS attack datasets
Figure 45 – ADMIS CIFAR10: images labels per class
Figure 46 – ADMIS Flowers17: images labels per class
Figure 47 – ADMIS MNIST: images labels per class
Figure 48 – ADMIS FashionMNIST: images labels per class
Figure 49 – ADMIS: Copycat - NPDD of 100k queries
Figure 50 – ADMIS: Copycat - NPDD of 300k queries
Figure 51 – ADMIS: Copycat - NPDD of 500k queries
Figure 52 – ADMIS: CIFAR10 Oracles results
Figure 53 – ADMIS: results of ADMIS attack datasets for CIFAR10 experiments
Figure 54 – ADMIS: Copycat - NPDD of 100k queries for CIFAR10 experiments
Figure 55 – ADMIS: Copycat - NPDD of 300k queries for CIFAR10 experiments
Figure 56 – ADMIS: Copycat - NPDD of 500k queries for CIFAR10 experiments
Figure 57 – PRADA: MNIST, Small Architecture. False Positive and Detection rates for Statistical test value
Figure 58 – PRADA: VGG16 Model, MNIST dataset. False Positive and Detection Rates for statistical test value
Figure 59 – PRADA: False Positive Rate for statistical test value
Figure 60 – PRADA: False Positive Rate for statistical test value
Figure 61 – PRADA: False Positive Rate for statistical test value
Figure 62 – EWE: Oracles results
Figure 63 – EWE: Copycat results
Figure 64 – EWE: difference of label distribution between attacks on watermarked and clean models
Figure 65 – PDD heatmaps generated with LRP on Copycat Framework for ACT101
Figure 66 – PDD heatmaps generated with LRP on Copycat Framework for DIG10
Figure 67 – PDD heatmaps generated with LRP on Copycat Framework for FER7
Figure 68 – PDD heatmaps generated with LRP on Copycat Framework for GOC9
Figure 69 – PDD heatmaps generated with LRP on Copycat Framework for PED2
Figure 70 – PDD heatmaps generated with LRP on Copycat Framework for SHN10
Figure 71 – PDD heatmaps generated with LRP on Copycat Framework for SIG30
Figure 72 – NPDD heatmaps generated with LRP on Copycat Framework for ACT101
Figure 73 – NPDD heatmaps generated with LRP on Copycat Framework for DIG10
Figure 74 – NPDD heatmaps generated with LRP on Copycat Framework for FER7
Figure 75 – NPDD heatmaps generated with LRP on Copycat Framework for GOC9
Figure 76 – NPDD heatmaps generated with LRP on Copycat Framework for PED2
Figure 77 – NPDD heatmaps generated with LRP on Copycat Framework for SHN10
Figure 78 – NPDD heatmaps generated with LRP on Copycat Framework for SIG30
Figure 79 – Confusion Matrices for ACT101
Figure 80 – Confusion Matrices for DIG10
Figure 81 – Confusion Matrices for FER7
Figure 82 – Confusion Matrices for GOC9
Figure 83 – Confusion Matrices for PED2
Figure 84 – Confusion Matrices for SHN10
Figure 85 – Confusion Matrices for SIG30


List of Tables

Table 1 – LRP rules and usage suggestions
Table 2 – Comparison between related works and Copycat
Table 3 – Details of the problems, their respective datasets, and the number of images in each domain splits
Table 4 – Datasets and architectures used in experiments
Table 5 – Number of images per class of the Flowers17 PDD
Table 6 – PRADA’s target models architecture
Table 7 – Comparison of F1-Scores for Oracle and Baseline models, and Copycats Performance relative to them
Table 8 – Performance of the Copycat using AlexNet architecture
Table 9 – Analysis of APIs costs
Table 10 – PRADA: MNIST dataset trained on the Small architecture
Table 11 – PRADA: Results for MNIST Dataset trained on VGG-16
Table 12 – PRADA: GTSRB Dataset trained on the Small architecture
Table 13 – PRADA: GTSRB Dataset trained on VGG-16
Table 14 – PRADA: GOC9 problem trained on VGG-16
Table 15 – EWE: Watermarked success rates of the Oracles
Table 16 – EWE: Watermarked success rates of the Copycat models
Table 17 – EWE: Watermarked success rates of the Copycat models after fine-tuning process
Table 18 – Classification reports for ACT101
Table 19 – Classification reports for DIG10
Table 20 – Classification reports for FER7
Table 21 – Classification reports for GOC9
Table 22 – Classification reports for PED2
Table 23 – Classification reports for SHN10
Table 24 – Classification reports for SIG30


Acronyms and Abbreviations

General:

API Application Programming Interface

CNN Convolutional Neural Network

DNN Deep Neural Network

DL Deep Learning

FGSM Fast Gradient Sign Method

FEA Feature Extraction Algorithm

GAN Generative Adversarial Network

ILSVRC ImageNet Large Scale Visual Recognition Challenge

IP Intellectual Property

JBDA Jacobian-Based Dataset Augmentation

LRP Layer-wise Relevance Propagation

MSP Maximum Softmax Probabilities

ML Machine Learning

MLP Multi Layer Perceptron

MLaaS Machine Learning as a Service

RBF Radial Basis Function

RL Representation Learning

SGD Stochastic Gradient Descent

TN Target Network

SNNS Soft Nearest Neighbor Loss

t-SNE t-Distributed Stochastic Neighbor Embedding

XAI Explainable Artificial Intelligence

Problems:


ACT101 Human Action Classification Problem, 101 Categories

DIG10 Handwritten Digit Classification, 10 categories

FER7 Facial Expression Recognition, 7 categories

GOC9 General Object Detection, 9 categories

PED2 Pedestrian Detection, 2 categories

SHN10 Street View House Number Classification, 10 categories

SIG30 Traffic Sign Classification, 30 categories

Models:

BL Baseline model

BL-Alex-* Baseline model generated on AlexNet Architecture

BL-VGG-* Baseline model generated on VGG-16 Architecture

CC Copycat model

CC-Alex-* Copycat model generated on AlexNet Architecture

CC-VGG-* Copycat model generated on VGG-16 Architecture

Datasets:

ODD Original Domain Dataset

ODD-OL Original Domain Dataset with Original Labels

PDD Problem Domain Dataset

PDD-OL Problem Domain Dataset with Original Labels

PDD-SL Problem Domain Dataset with Stolen Labels

TDD Test Domain Dataset

TDD-OL Test Domain Dataset with Original Labels

NPDD Non-Problem Domain Dataset

NPDD-SL Non-Problem Domain Dataset with Stolen Labels

NPDD+PDD Non-Problem Domain Dataset joined to Problem Domain Dataset

NPDD+PDD-SL Non-Problem Domain Dataset joined to Problem Domain Dataset,
both with Stolen Labels


OL Original Labels

*-OL Suffix of Original Labels Dataset

SL Stolen Labels

*-SL Suffix of Stolen Labels Dataset


Contents

1 Introduction
1.1 Model extraction and attacks
1.2 Hypothesis
1.3 Objectives
1.4 Publications
1.5 Outline

2 Theoretical Background
2.1 Convolutional Neural Networks and Large Public Databases
2.2 Analysis of CNNs

3 Related Works
3.1 Model extraction and attacks
3.2 Defense methods

4 Copycat Convolutional Neural Network
4.1 Attack formulation
4.2 Copycat Model Training
4.3 Simplified example of Copycat method

5 Experimental Methodology
5.1 Datasets Organization and Baselines Setup
5.1.1 Datasets for Baselines and Tests
5.1.2 Datasets for Generating the Copycats
5.2 Investigated Problems
5.2.1 Human Action Recognition – ACT101
5.2.2 Handwritten Digits Classification – DIG10
5.2.3 Facial Expression Recognition – FER7
5.2.4 General-Object Classification – GOC9
5.2.5 Pedestrian Classification – PED2
5.2.6 Street View House Numbers Classification – SHN10
5.2.7 Traffic Signs Classification – SIG30
5.3 Used Architectures
5.4 Metric
5.5 General Setup
5.6 Model Extraction Experiments
5.6.1 Analysis of Datasets Distributions in the Classification Space of the Black-Box Model
5.6.2 Copycat from the Same Architecture
5.6.3 Analysis of the Relationship Between Number of Queries and Copycat Performance
5.6.4 Copycat from a Different Architecture
5.6.5 Robustness of the Copycat Model
5.6.6 Analysis of the Attention-Region in the Input Images
5.6.7 Analysis of Attack Viability and APIs Costs
5.7 Defense Experiments
5.7.1 ADMIS: Adaptive Misinformation
5.7.2 PRADA
5.7.3 EWE: Entangled Watermarks

6 Experimental Results
6.1 Model Extraction
6.1.1 Analysis of Datasets Distributions in the Classification Space of the Black-Box Model
6.1.2 Copycat from the Same Architecture
6.1.3 Analysis of the Relationship Between Number of Queries and Copycat Performance
6.1.4 Copycat from a Different Architecture
6.1.5 Robustness of the Copycat Model
6.1.6 Analysis of the Attention-Region in the Input Images
6.1.7 Analysis of Attack Viability and APIs Costs
6.2 Defenses
6.2.1 ADMIS: Adaptive Misinformation
6.2.1.1 Copycat with the proposed datasets
6.2.1.2 Copycat with the usual NPDD dataset
6.2.1.3 Copycat of CIFAR10 using only out-of-distribution data
6.2.2 PRADA
6.2.2.1 MNIST on Small Architecture
6.2.2.2 MNIST on VGG16 Architecture
6.2.2.3 GTSRB on Small Architecture
6.2.2.4 GTSRB on VGG16 Architecture
6.2.2.5 GOC9
6.2.2.6 Discussion
6.2.3 EWE: Entangled Watermarks
6.3 Limitations

7 Conclusion

Bibliography

Appendix
APPENDIX A Code for running the simple example of the Copycat Method
APPENDIX B LRP Heatmaps of TDD
APPENDIX C LRP Heatmaps of NPDD
APPENDIX D Confusion Matrices and Classification Reports
D.1 ACT101
D.2 DIG10
D.3 FER7
D.4 GOC9
D.5 PED2
D.6 SHN10
D.7 SIG30



1 Introduction

Currently, several smart tools and robots use Machine Learning (ML)
techniques at their core. Products such as Echo with Alexa [1a], Google Home, Nest,
or Android with Google Assistant [1b], and HomePod with Siri [1c] (Figure 1) can process and
respond to commands and requests in natural language, recognize people and
objects, and perform other tasks that make them convenient for users. Additionally,
many of these products have access to cloud-based APIs, which allow integration with
other platforms and ML services (Machine Learning as a Service – MLaaS). MLaaS offerings
such as Microsoft Cognitive Services Computer Vision API [1d], Google Cloud Vision API [1e],
and IBM Watson Visual Recognition [1f] (Figure 1) provide access to powerful machine learning
models and algorithms, which can extract useful information from user data to process
a request. Among these systems are Representation Learning (RL) methods,
which are built to also learn representations of the data, thus simplifying the extraction
of useful information for classifiers and predictors (BENGIO; COURVILLE; VINCENT,
2013).

Figure 1 – Examples of smart devices on the left and APIs on the right (image sources: i,
ii, iii, iv, v, vi, vii, viii – source links are available in the digital version of this
document).

An important method in RL is a DNN [2] called Convolutional Neural Network (CNN)
(LECUN et al., 1989b). It has one or more convolutional layers, which use a mathematical
operation called convolution to analyze the input data. A deep CNN is usually made
up of multiple layers (Figure 2), starting with convolutional and pooling layers and
ending with fully connected layers [3]. The first layers identify and
extract features from the input data, such as edges, shapes, and textures, and pass the
resulting feature maps to the next layers. The fully connected layers then use those
features to make the final prediction.

1 Source links:
a https://developer.amazon.com/en-US/alexa
b https://developers.google.com/assistant
c https://developer.apple.com/siri/
d https://azure.microsoft.com/en-us/products/cognitive-services/computer-vision
e https://cloud.google.com/vision
f https://www.ibm.com/en/watson
2 A Deep Neural Network is a multilayer feedforward network. Unlike shallow networks, which
have only a few layers, the depth of this network comes from the total length of its chain
of layers, hence the name DNN (GOODFELLOW; BENGIO; COURVILLE, 2016).

Image sources for Figure 1:
i https://upload.wikimedia.org/wikipedia/commons/0/0f/Android_phones.jpg
ii https://images-na.ssl-images-amazon.com/images/I/61bEzxKzZIL._AC_SL1000_.jpg
iii https://upload.wikimedia.org/wikipedia/commons/4/49/Google_Home_with_Home_Hub_and_Home_Mini_on_table.jpg
iv https://www.apple.com/newsroom/images/product/homepod/standard/Apple_homepod-mini-white-10132020_big.jpg.large.jpg
v https://miro.medium.com/max/700/1*tkAiRWvYmAi_RuRhRYgeiQ.jpeg
vi https://upload.wikimedia.org/wikipedia/commons/a/a8/Microsoft_Azure_Logo.svg
vii https://connectoricons-prod.azureedge.net/releases/v1.0.1444/1.0.1444.2347/cognitiveservicescomputervision/icon.png
viii https://freepngimg.com/download/logo/73271-ibm-dachlawinen-text-kunststoff-watson-schwarz-vorsicht.png

Given the state-of-the-art performance achieved by CNNs on a variety of problems,
these networks power many neural-based systems that recognize facial expressions,
traffic signs, product names, objects, and speech, among several other tasks.
Consequently, many companies are investing large amounts of money to build products
with CNNs. However, there are at least three high costs associated with doing so:

i. acquiring and annotating large-scale training datasets;
ii. computational power for training the models[4], which can last for days or even
months; and
iii. experts to prepare the data and to design, implement, and train the model.

[Figure 2 diagram. Feature Extraction: Input, followed by two Convolutional Layers and
two Pooling Layers, each producing Feature Maps. Classification: Fully Connected Layers,
Logits, Softmax, ŷ.]

Figure 2 – Illustration of a Convolutional Neural Network for classification. In Feature
Extraction, the input image is fed to the network, which will use Convolutional
Layers (usually followed by an activation function like ReLU) and Pooling
Layers to extract its features and send to the Fully Connected Layers. Then,
in Classification, the final prediction ŷ is calculated by applying the Softmax
function to the outputs of the last Fully Connected Layer. The values before
applying the Softmax are usually called logits in the context of DNN.

3 Although the acronym MLP – Multilayer Perceptron – refers to a Neural Network with
fully connected layers, the term Fully Connected Layers is more popular in the Deep
Learning area and is widely used to refer to the final layer of a Deep Neural Network.
4 An ML model is an architecture, such as CNN, with its parameters already adjusted by a
training process.

Therefore, due to the resources and money invested in creating these models, it is
in the best interests of these companies to protect their models' Intellectual Property
(IP) (i. e., model parameters, patented algorithms, unique datasets, and other copyrighted
assets) against attacks or copies. There are at least two motivations for such attacks.
First, a company that lacks resources but needs a more accurate model can gain a
competitive advantage by copying a competing model. Second, an adversary may need a
copy[5] of the target model to access its parameters for other purposes, such as
tricking malware and spam detection or misleading autonomous navigation systems
(PAPERNOT; MCDANIEL; GOODFELLOW, 2016). In the first case, the objective is to achieve
the same accuracy (proportion of correct predictions) as the target model and, in the
second case, the objective is to achieve high fidelity (similarity between the models'
responses) to the target model (JAGIELSKI et al., 2020). For these and other reasons,
models are usually not delivered as white-boxes (with parameters, architecture and other
details visible to users). Instead, they are usually provided as black-boxes, where it is
possible only to send the input data and receive either the probabilities[6] of the model
(soft-labels) or the label of the predicted class, i. e., the output with the highest
probability (hard-label).
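The distinction between accuracy and fidelity can be illustrated with a minimal sketch. The function names and the toy label lists below are hypothetical and chosen only for illustration; the metrics actually used in this work may be computed differently (for example, averaged per class).

```python
import numpy as np


def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the ground-truth labels."""
    predictions = np.asarray(predictions)
    ground_truth = np.asarray(ground_truth)
    return float(np.mean(predictions == ground_truth))


def fidelity(surrogate_predictions, target_predictions):
    """Fraction of inputs on which the surrogate agrees with the target model."""
    surrogate_predictions = np.asarray(surrogate_predictions)
    target_predictions = np.asarray(target_predictions)
    return float(np.mean(surrogate_predictions == target_predictions))


# Toy hard-labels for five test inputs (illustrative values only).
y_true = [0, 1, 2, 1, 0]
target_pred = [0, 1, 2, 2, 0]      # target model: 4 of 5 correct
surrogate_pred = [0, 1, 1, 2, 0]   # surrogate: 3 of 5 correct, agrees with target on 4

print(accuracy(target_pred, y_true))
print(accuracy(surrogate_pred, y_true))
print(fidelity(surrogate_pred, target_pred))
```

In this toy case the surrogate reproduces one of the target's mistakes, which raises its fidelity without raising its accuracy.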

Examples of these black-box models are often available, as shown at the beginning
of this chapter, in smart tools, robots, or in the cloud as MLaaS Application Programming
Interfaces (APIs). In this scenario, note that a user who owns a robot can query it for
free an unlimited number of times, whereas one who uses it as MLaaS must pay
for each query. Users and developers expect these models to provide accurate and
robust generalizations for new queries. For instance, in an object recognition model, new
images in the problem domain should be correctly identified, and small perturbations in
the input images should not change the object's classification. This is exactly where an
attack can occur: the model's knowledge can be extracted from its responses to produce
a surrogate (substitute) model. This process (attack) is called model extraction.

1.1 Model extraction and attacks
Model extraction consists of labeling a dataset with the target model responses
(soft-labels or hard-labels) and using this dataset to train a surrogate (substitute) model.
Some studies in this area worked on extracting knowledge from a larger model to a smaller
one, with fewer parameters, calling this process model compression. Bucilă, Caruana
and Niculescu-Mizil (2006) presented a method for compressing large, complex ensembles
(sets of smaller models combined to achieve performance equivalent to a larger, more
complex model) into smaller models without significant loss of performance. In addition,
Ba and Caruana (2014) explored transferring a deep neural network to a shallow neural
network. The dataset for training the substitute model was composed of images from the
problem domain labeled by the logits[7] of the target model. Based on these studies of
model compression, Hinton, Vinyals and Dean (2015) observed that the outputs
(probabilities after the Softmax function) of a model provide important information about
how it generalizes over the input data. So they proposed a method called knowledge
distillation, where instead of using logits, they modified the Softmax function
(Equation 3.2). Their intention was to smooth out the model's output probabilities and
provide better information about its internal knowledge. For this type of model
extraction, they also used original and synthetic data from the problem domain.

5 The term copy will be used in the sense of imitating a model and not in the sense of
producing an exact copy of it.
6 The term “probabilities” is commonly used in the literature to refer to the network
output after applying the Softmax operation.
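The smoothing effect of the modified Softmax can be sketched as follows. This is a minimal illustration that assumes the temperature-scaled form commonly attributed to Hinton, Vinyals and Dean (2015), in which each logit is divided by a temperature T before normalization; the function name and the example logits are hypothetical.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (assumed form, T > 0)."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()              # subtract the maximum for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()


logits = np.array([5.0, 2.0, 0.5])                  # illustrative values only
p_standard = softmax_with_temperature(logits, 1.0)  # ordinary Softmax
p_smoothed = softmax_with_temperature(logits, 4.0)  # higher T, flatter distribution

# The hard-label (arg max) is the same in both cases; only the soft-labels change.
hard_label = int(np.argmax(p_standard))
```

A higher temperature spreads probability mass over the non-predicted classes, which is the additional information a distilled model can learn from; a hard-label discards it entirely.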

Other important works investigated ways to attack machine learning models.
Szegedy et al. (2014) explored some vulnerabilities in neural networks and showed that,
using a simple optimization procedure, it is possible to find adversarial examples, i. e.,
by adding small perturbations to an image already classified correctly by the neural
network, it is possible to generate a new image that the network is no longer able to
classify correctly (Figure 3). Later, Goodfellow, Shlens and Szegedy (2015) proposed the
fast gradient sign method, which uses the sign of the cost function's gradient to create
adversarial examples against DNNs (Figure 4). Afterwards, Papernot et al. (2016) used
forward derivatives to build adversarial saliency maps[8] and craft adversarial samples
against the DNN model. These methods need access to the model parameters (white-box) and
also need problem domain data.
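The fast gradient sign method can be sketched on a toy model for which the input gradient has a closed form. The example below assumes a logistic-regression classifier with hand-picked weights (hypothetical values, not a DNN and not one of the models studied here); the step size, written λ in Figure 4, is called `epsilon` in the code.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def fgsm(x, y, w, b, epsilon):
    """One fast-gradient-sign step against a logistic-regression model.

    For the cross-entropy cost J, the gradient with respect to the input is
    (p - y) * w, where p = sigmoid(w . x + b) is the predicted probability.
    """
    p = sigmoid(np.dot(w, x) + b)
    grad_x = (p - y) * w
    return x + epsilon * np.sign(grad_x)


# Hypothetical white-box model and input (illustrative values only).
w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.2, 0.1, 0.4])
y = 1.0                                   # true class of x

x_adv = fgsm(x, y, w, b, epsilon=0.3)
p_clean = sigmoid(np.dot(w, x) + b)       # above 0.5: classified as class 1
p_adv = sigmoid(np.dot(w, x_adv) + b)     # below 0.5: classification flipped
```

Each input component moves by exactly `epsilon` in the direction that increases the cost, which is why the method needs the gradient and therefore white-box access.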

After these works, Papernot et al. (2017) formulated a new strategy capable of
crafting adversarial examples against black-box models provided as MLaaS. The first step
is to extract the target model using problem domain images. Then, this surrogate model
is used to generate the adversarial examples using the two methods cited in the previous
paragraph. At each generation of new examples, the target model is consulted, generating
new labels that are used to improve the performance of the substitute model. However,
the objective was not the model extraction, but only to generate a surrogate model to
craft adversarial examples (Figure 4).

On the other hand, Nguyen, Yosinski and Clune (2015) studied the vulnerability
presented by Szegedy et al. (2014) and found that it is easy to generate images that are
completely unrecognizable to humans, but that DNNs recognize with high confidence
(Figure 5). They used evolutionary algorithms and gradient ascent to find images that are
assigned, with a high degree of confidence[9], to each class in the tested DNN models.
Additionally, the authors also tried to avoid this misclassification by adding a garbage
class, but without success.

7 The term logits is commonly used to refer to the raw outputs of the network before the
application of the Softmax function; more details in Section 2.1.
8 The saliency map is a measure of how much each input pixel contributes to the final
predictions of the network.
9 High confidence means that the highest probability of the model is a value close to 100%.

[Figure 3 images: panels (a), (b), (c)]

Figure 3 – Adversarial examples generated in (SZEGEDY et al., 2014). In (a) and (b),
the left images are the correctly predicted samples, the center images are the
difference between the correct image and the incorrectly predicted image, magnified
by 10x (values shifted by 128 and clamped), and the right images are the
adversarial examples classified as “ostrich, Struthio camelus”. In (c), results are
presented for a binary car classifier. The images on the left are recognized as
cars and the images on the right are not recognized. The center images are the
magnified absolute value of the difference between the two images.

[Figure 4 images: x, “panda”, 57.7% of confidence; + .007 × sgn(∇xJ(θ, x, y)),
“nematode”, 8.2% of confidence; = x + λ sgn(∇xJ(θ, x, y)), “gibbon”, 99.3% of confidence]

Figure 4 – A demonstration of an adversarial example generated by the fast gradient sign
method. Let θ be the parameters of a model, x the input to the model, y the desired
prediction of x and J(θ, x, y) the cost used to train the neural network. By
adding the imperceptibly small vector (center image) provided by the sign of the
gradient of the cost function with respect to the input, the network classification was
changed to a “gibbon”. The λ = .007 corresponds to the magnitude of the
smallest bit of an 8 bit image encoding after model conversion to real numbers
(GOODFELLOW; SHLENS; SZEGEDY, 2015).

(a) Images obtained by evolutionary algorithms (b) Images obtained by gradient ascent

Figure 5 – Images provided by (NGUYEN; YOSINSKI; CLUNE, 2015) that are unrecogniz-
able to humans, but that deceived state-of-the-art DNNs with high confidence.

In parallel with these works, Deng et al. (2009) released a large publicly available
image dataset called ImageNet, which provided 3.2 million images to be used in image
recognition systems. The following year, Russakovsky et al. (2015a) started the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which has become one of
the most important competitions in the area. Later, another large dataset was also
released, the Microsoft COCO, published in (LIN et al., 2014), with 2.5 million labeled
instances in 328 thousand images. Note that even with the existence of these datasets,
which contain a very large number of random images of different classes, the related
works used only original and synthetic data from the problem domain. Additionally, most
of them used the target model as a white-box, i. e., accessing its parameters, instead
of using it as a black-box.

1.2 Hypothesis
As presented so far, the cited works have shown that the information provided in
the responses of the target models allows extracting their knowledge, i. e., they showed
that model extraction attacks can be done through logits, soft-labels and hard-labels to
generate surrogate models. These works also showed that the information of the
target models can be explored using synthetic data, but for that purpose it was necessary
to have data from the problem domain of the attacked model. Moreover, they showed that
even images not recognizable by humans can produce high-confidence responses in DNN
models. Additionally, these works suggest that such images belong to the feature and/or
the classification space of the model and can provide important information about it.

Thereby, our hypothesis is that:

Model extraction can be performed using random natural images provided in large
public image datasets (such as ImageNet and Microsoft COCO) and only the hard-labels
of the model, even if the images are not from the problem domain of the attacked model.

Furthermore, we believe that the generated surrogate model can still be fine-tuned
and achieve better accuracy and fidelity when the adversary has data from the problem
domain. We limited the scope of this research to image classification using CNN models.

1.3 Objectives
To test our hypothesis, we proposed as a main objective a model extraction
method, called Copycat, which was tested on CNNs for image classification. In the Copycat
method[10], the target model, i. e., the Oracle $f(\cdot)$, is queried with random
non-labeled images acquired from large public datasets, such as ImageNet or Microsoft
COCO. The pairs of images and their hard-labels provided by the Oracle are used to
generate a fake dataset $\mathrm{NPDD} = \{(x_1, \ell_{x_1}), \ldots, (x_n, \ell_{x_n})\}$,
where $n = |\mathrm{NPDD}|$, $x$ is the image and $\ell_x = \arg\max_p f(x)[p]$ is the
hard-label provided by the Oracle. Then the surrogate model, named Copycat model, is
trained with NPDD. Afterwards, a problem domain dataset is labeled by the Oracle to
generate the second dataset
$\mathrm{PDD} = \{(x_1, \ell_{x_1}), \ldots, (x_m, \ell_{x_m})\}$, where
$m = |\mathrm{PDD}|$, which is used to fine-tune the Copycat model.
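The labeling step that produces the fake dataset can be sketched as follows. This is only an illustration of the formulation above, not the implementation used in this work: `toy_oracle` is a hypothetical stand-in (a fixed random linear classifier followed by Softmax) for a real black-box model, and the random arrays stand in for natural images taken from a public dataset.

```python
import numpy as np


def hard_label(oracle, x):
    """Return arg max_p f(x)[p], the only information kept from the Oracle."""
    return int(np.argmax(oracle(x)))


def build_fake_dataset(oracle, images):
    """Pair each unlabeled image with the hard-label returned by the Oracle."""
    return [(x, hard_label(oracle, x)) for x in images]


# Hypothetical Oracle with 3 classes over 4x4 single-channel inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))


def toy_oracle(x):
    logits = W @ x.ravel()
    exp_z = np.exp(logits - logits.max())
    return exp_z / exp_z.sum()


# Stand-ins for non-problem-domain images (random arrays, for illustration).
npd_images = [rng.random((4, 4)) for _ in range(8)]
npdd = build_fake_dataset(toy_oracle, npd_images)
```

The surrogate would then be trained on `npdd` with ordinary supervised learning, and the same labeling function could be applied to problem domain images to build the fine-tuning set.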

To achieve the main objective, we explored two major topics in this study: attacks
with the Copycat method and defenses against it. For this, our secondary objectives for
evaluating the attacks with the Copycat were:

• analysis of model extraction with the Copycat method on seven different problems
and a real-world MLaaS API;

• verification of the capabilities, limitations and robustness of the Copycat method;
• quantitative and qualitative comparison between Oracles and Copycat models.

And our secondary objectives for evaluating the defenses against the Copycat
method were:

• analysis of methods that detect out-of-distribution queries; and
• analysis of a watermark method that protects the model’s IP.

10 The reader can access an interactive visualization of a model extraction attack with the Copycat
method at <https://jeiks.github.io/copycat-cnn-explainer/>


1.4 Publications
To date, the development of this project has contributed to the literature in
significant ways. The contributions were published in two articles:

i. Correia-Silva, J. R.; Berriel, R. F.; Badue, C.; De Souza, A. F.; Oliveira-Santos, T.
Copycat CNN: Stealing Knowledge by Persuading Confession with Random Non-
Labeled Data. In: International Joint Conference on Neural Networks (IJCNN).
2018. p. 1–8. (Qualis A1).

ii. Correia-Silva, J. R.; Berriel, R. F.; Badue, C.; De Souza, A. F.; Oliveira-Santos,
T. Copycat CNN: Are random non-Labeled data enough to steal knowledge from
black-box models? Pattern Recognition, v. 113, p. 107830, 2021. ISSN 0031-3203.
(Qualis A1).

The first work (Correia-Silva et al., 2018) presented the Copycat method, the
results on three initial problems (facial expression, object, and crosswalk
classification), and the model extraction results against the Microsoft Azure Cognitive
Services[11]. To the best of our knowledge, this was the first time that black-box model
extraction of a deep CNN was investigated using out-of-distribution images obtained from
public datasets and labeled only with hard-labels.

The second work (Correia-Silva et al., 2021) consolidated the Copycat method
by performing an extensive evaluation study aimed at providing a better understanding
of the behavior of the Copycat model in black-box attacks. To achieve this purpose, it
was composed of the following experiments: (i) extraction with less information about
the target black-box model; (ii) feature space analysis of out-of-distribution images;
(iii) analysis of the number of queries needed to extract a model; (iv) model extraction
for a slightly different but small architecture; (v) robustness of the Copycat method;
(vi) comparison between Oracle and surrogate model based on their inputs; (vii)
discussion of the cost of a real extraction attack on an MLaaS.

The results and analysis regarding the defenses against the Copycat method have
not yet been published.

1.5 Outline
The remainder of the text is organized as follows:

• The theoretical background is described in Chapter 2, covering Convolutional
Neural Networks, large datasets, and methods for analyzing neural networks.

11 The Microsoft Azure Cognitive Services: <https://azure.microsoft.com/en-us/pricing/details/cognitive-services/emotion-api/>

• Chapter 3 presents an overview of the related works, covering techniques for model
extraction and attacks, as well as defense approaches against such attacks.

• In Chapter 4, the Copycat method formulation is presented, followed by a discussion
of relevant constraints. Additionally, an illustrative example of a simple model
extraction using the Copycat method is provided.

• Chapter 5 explains our experimental methodology, starting with details about the
datasets, the investigated problems and architectures, and the metrics. Next, the
model extraction experiments using the Copycat method are described. These
experiments include: (i) analysis of dataset distributions in the black-box model’s
classification space, (ii) application of the Copycat method on the same architecture
to several different problems, (iii) analysis of the number of images needed to
successfully attack the proposed problems with the Copycat method, (iv) application
of the Copycat method in a slightly different and small architecture, (v) verification
of the reproducibility and robustness of the Copycat model, (vi) comparison of
the target model and the Copycat model in relation to the input image relevance
pixels, and (vii) analysis of a Copycat attack in a real-world API and discussion of
the viability and costs of such attacks. Finally, three defense approaches explored
in our work are presented.

• Chapter 6 presents the results of the experiments described in Chapter 5 and also
discusses the limitations of our method.

• Finally, Chapter 7 draws the concluding remarks and presents the future directions
of our research.



2 Theoretical Background

This chapter presents the necessary theoretical concepts to contextualize the scope
of our work. The objective is to provide base knowledge on the subject, presenting the
concepts that underlie the Copycat method. This chapter is divided into two
sections. As our method works on extracting CNN models using large image databases, the
first section presents a description of CNNs and related model architectures and
databases. Then, as we performed a qualitative analysis of the surrogate models, the
second section presents the respective methods used in our work. Additionally, in the
next chapter, we provide an overview of related works and compare them with our proposed
method.

2.1 Convolutional Neural Networks and Large Public Databases
According to Goodfellow, Bengio and Courville (2016), computer programs have
traditionally been used to solve problems that can be described using a set of formal
mathematical rules. Thus, these programs rely on pre-coded knowledge within their
code. However, given the difficulty of coding them for more complex problems (such as
high-dimensional problems), researchers sought to give programs the ability to
acquire knowledge on their own, by identifying patterns in received data. This ability is
commonly called Machine Learning. The use of these methods allows computers to solve
more complex problems and find solutions that seem subjective. However, the performance
of these machine learning algorithms depends on the representation of the data provided
to them. Thus, manual work is usually required to provide the pieces of information
contained in the representation, called features, so that these programs produce useful
responses.

To make these problems harder, in some cases ideal results are achieved
only with specific features, which are often difficult to obtain or identify. In these
cases, a solution would be for machine learning algorithms to learn not only to give
answers from a ready-made representation, but also to learn the representation of the
problem. This field is known as Representation Learning, and it is used in ML to provide
better performance than previous methods with hand-designed representations. In this
field, the initial challenge is to extract enough features from the raw data. For this
purpose, a solution adopted is to create a deep hierarchy capable of extracting simple
features first and building up to complicated concepts. This method is commonly referred
to as Deep Learning (DL), which has achieved human-like abilities to recognize objects or
speech.

An example of a DL method is a kind of feedforward deep network, or Multi
Layer Perceptron (MLP), which is a mathematical function that maps input values to
output values. This function is formed by the composition of several simpler functions,
which provide a new representation of the input data. An MLP network can be a shallow
network when it is formed by few layers, or contain many layers, forming a
Deep Neural Network. However, an MLP has some limitations in image processing. Let an
image $X_{m,n}$ be a matrix of pixels composed of $m$ rows and $n$ columns:

\[
X =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}
\tag{2.1}
\]

where $x_{ij}$ represents the pixel value in row $i$ and column $j$. In an MLP, this
matrix is flattened into a vector of numbers,
$X = [x_{11}, \ldots, x_{1n}, x_{21}, \ldots, x_{2n}, \ldots, x_{m1}, \ldots, x_{mn}]$.
This format, for example, does not fully capture the spatial relationships between pixels
in an image.
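The loss of spatial structure caused by flattening can be seen in a short sketch. The 3 × 4 matrix and the helper `flat_index` are hypothetical and serve only to illustrate the row-by-row flattening described above.

```python
import numpy as np

m, n = 3, 4
X = np.arange(m * n).reshape(m, n)   # toy "image" with m rows and n columns
flat = X.reshape(-1)                 # row-by-row flattening, as in an MLP input


def flat_index(i, j, n_cols):
    """Position of pixel (i, j) in the flattened vector (0-based indices)."""
    return i * n_cols + j


# Horizontal neighbours stay adjacent, but vertical neighbours end up n apart.
horizontal_gap = flat_index(1, 3, n) - flat_index(1, 2, n)
vertical_gap = flat_index(2, 2, n) - flat_index(1, 2, n)
```

Two pixels that touch vertically in the image are separated by `n` positions in the vector, so the MLP receives no explicit indication that they are neighbours.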

Unlike an MLP, a Convolutional Neural Network (LECUN; BENGIO, 1995) is
a DNN that can treat the image as a two-dimensional matrix (considering only one
channel, like a grayscale image) instead of a one-dimensional vector. CNNs exploit the
spatial invariance of the objects in the image, seeking to learn useful representations.
This type of network has the following principles: (i) translation invariance (or
translation equivariance), i. e., the earliest layers of the network must extract from
the image the same characteristics of a patch of interest, regardless of where it appears
in the image; and (ii) locality, i. e., the earliest layers of the network must extract
features from local regions, without taking into account distant regions of the image.
Eventually, these simple representations can be aggregated to form concepts of the whole
picture. Following these principles, deeper layers should be able to capture longer-range
features of the image (ZHANG et al., 2021).

A CNN can be made up of multiple convolutional layers, pooling layers, and
fully connected layers. It uses a mathematical operation called convolution to process the
input data. This operation embodies the principles of translation invariance and locality
(for more details, see (ZHANG et al., 2021)). The discrete convolution function is:

\[
(K * X)(i, j) = \sum_{k} \sum_{l} X(i - k, j - l)\, K(k, l) \tag{2.2}
\]

where $K$ is the kernel (or filter), and $k$ and $l$ index its rows and columns. However,
to avoid the kernel flipping that convolution implies, libraries often implement
cross-correlation instead of convolution (GOODFELLOW; BENGIO; COURVILLE, 2016):

\[
(K * X)(i, j) = \sum_{k} \sum_{l} X(i + k, j + l)\, K(k, l) \tag{2.3}
\]

In ML, the algorithm learns the appropriate values of the kernel. With
cross-correlation, it simply learns a kernel that is flipped relative to the one it would
learn with convolution, which does not change the final result of the network
(GOODFELLOW; BENGIO; COURVILLE, 2016). The output
of the convolutional layer is called a feature map, as it can be regarded as the learned
representations (features) in the spatial dimensions for the subsequent layer. An
illustrative example of a cross-correlation (convolution without kernel flipping) between
two 2-D matrices is presented in Figure 6. Note that the output matrix is smaller than the
input matrix due to the convolution operation. To avoid this, padding pixels can be added
around the input matrix. Another factor that affects the size of the output is
the stride of the kernel (in Figure 6, the stride is one).
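As a worked complement to the paragraph above, the usual relation between input and output sizes can be written as follows. The symbols $n_{\text{in}}$ (input side), $f$ (kernel side), $p$ (padding added on each border) and $s$ (stride) are introduced here only for illustration and are not part of the notation used elsewhere in this text; a square input and a square kernel are assumed.

```latex
n_{\text{out}} = \left\lfloor \frac{n_{\text{in}} + 2p - f}{s} \right\rfloor + 1
```

For the example of Figure 6, $n_{\text{in}} = 3$, $f = 2$, $p = 0$ and $s = 1$ give $n_{\text{out}} = 2$, which matches the 2 × 2 output. With $s = 1$, the output keeps the input size when $2p = f - 1$.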

[Figure 6 diagram: Input = [[0, 1, 2], [3, 4, 5], [6, 7, 8]] * Kernel = [[0, 1], [2, 3]]
= Output (feature map) = [[19, 25], [37, 43]], with
0x0+1x1+3x2+4x3 = 19, 1x0+2x1+4x2+5x3 = 25, 3x0+4x1+6x2+7x3 = 37, 4x0+5x1+7x2+8x3 = 43]

Figure 6 – An example of 2-D convolution without kernel flipping (cross-correlation). The
shaded portions are the first output element, as well as the input and kernel
elements used for the output operation: 0× 0 + 1× 1 + 3× 2 + 4× 3 = 19.
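The computation in Figure 6 can be reproduced with a short sketch. The function below is a plain, loop-based implementation of Equation 2.3 with no padding and a stride of one, written for illustration only; deep learning libraries use optimized routines instead.

```python
import numpy as np


def cross_correlate2d(X, K):
    """2-D cross-correlation (Equation 2.3), no padding, stride of one."""
    k_rows, k_cols = K.shape
    out_rows = X.shape[0] - k_rows + 1
    out_cols = X.shape[1] - k_cols + 1
    out = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            # Element-wise product of the kernel with the window starting at (i, j).
            out[i, j] = np.sum(X[i:i + k_rows, j:j + k_cols] * K)
    return out


X = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=float)
K = np.array([[0, 1], [2, 3]], dtype=float)

feature_map = cross_correlate2d(X, K)
print(feature_map)  # [[19. 25.] [37. 43.]], as in Figure 6

# Sliding the kernel flipped along both axes gives the convolution-style result.
flipped_result = cross_correlate2d(X, K[::-1, ::-1])
```

Since the kernel values are learned, a network trained with cross-correlation simply ends up with the flipped version of the kernel it would have learned with convolution.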

After a convolutional layer, an activation function such as ReLU can be applied
to add non-linearity to the network. Then, a pooling layer can also be used to mitigate
the sensitivity of the convolutional layer to location and to reduce the spatial
resolution of the representations. The pooling layer slides a fixed-size window of pixels
over the convolution output matrix with a defined stride, performing a single operation
over the pixels in the window, such as their average or maximum value. An example of
maximum pooling (often called max-pooling) is shown in Figure 7.

[Figure 7 diagram: Input = [[0, 1, 2], [3, 4, 5], [6, 7, 8]], 2x2 Max-Pooling,
Output (reduced feature map) = [[4, 5], [7, 8]], with
max(0,1,3,4) = 4, max(1,2,4,5) = 5, max(3,4,6,7) = 7, max(4,5,7,8) = 8]

Figure 7 – An example of maximum pooling (generally called max-pooling) with a window
size of 2x2. The highlighted portions are the first output element, as well as
the input elements used for the output operation: max(0, 1, 3, 4) = 4.
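The max-pooling operation of Figure 7 can be sketched in the same way. The function below is a minimal illustration; a stride of one is assumed because the figure shows four outputs for a 3 × 3 input, and the function name and parameters are hypothetical.

```python
import numpy as np


def max_pool2d(X, size=2, stride=1):
    """Max-pooling with a square window and no padding."""
    out_rows = (X.shape[0] - size) // stride + 1
    out_cols = (X.shape[1] - size) // stride + 1
    out = np.empty((out_rows, out_cols), dtype=X.dtype)
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride
            out[i, j] = X[r:r + size, c:c + size].max()
    return out


X = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
pooled = max_pool2d(X, size=2, stride=1)
print(pooled)  # [[4 5] [7 8]], as in Figure 7
```

Replacing `.max()` with `.mean()` would give average pooling, the subsampling variant mentioned for LeNet-5.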

After the feature maps are generated by the previous layers, they are flattened
into a single column vector and provided as inputs to the first Fully Connected layer:

\[
z_j = \sigma\left( \sum_{i=1}^{n} W_{ji}\, r_i + b_j \right) \tag{2.4}
\]

where $z = \{z_j\}_{j=1}^{d}$ is the output vector and $d$ is the total number of possible
classes, $r = \{r_i\}_{i=1}^{n}$ is the feature vector of size $(n, 1)$, $W$ is the matrix
of weights of size $(d, n)$, $b = \{b_j\}_{j=1}^{d}$ is the bias vector, and $\sigma$ is
the activation function, applied element-wise to produce the output vector $z$.

These results can go through several Fully Connected layers, but in this example
we are only considering one. Finally, these non-normalized output values produced by
the network are converted into a probability distribution by the normalized exponential
function Softmax:

\[
\hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{d} e^{z_k}} \tag{2.5}
\]

where $\hat{y}_j$ is the predicted probability of the input belonging to the $j$-th class.
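Equations 2.4 and 2.5 can be combined in a short sketch. The weights, bias and feature vector below are hypothetical values chosen for illustration, and the identity is assumed as the activation of this last layer so that its outputs play the role of logits.

```python
import numpy as np


def fully_connected(r, W, b, activation=lambda v: v):
    """Equation 2.4: z = sigma(W r + b), with sigma applied element-wise."""
    return activation(W @ r + b)


def softmax(z):
    """Equation 2.5, with the maximum subtracted for numerical stability."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()


r = np.array([1.0, 2.0, 3.0, 4.0])             # feature vector, n = 4
W = np.array([[0.1, 0.0, -0.1, 0.2],
              [0.0, 0.3, 0.1, -0.2],
              [-0.2, 0.1, 0.0, 0.1]])          # weights, d = 3 classes
b = np.array([0.0, 0.1, -0.1])                 # bias vector

z = fully_connected(r, W, b)                   # logits: [0.6, 0.2, 0.3]
y_hat = softmax(z)                             # probabilities (soft-labels)
predicted_class = int(np.argmax(y_hat))        # hard-label
```

Because Softmax preserves the ordering of its inputs, the hard-label can be read from either the logits or the probabilities.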

The values before being converted by the Softmax function are commonly called
logits. In statistics, the logit is the logarithm of the odds (log-odds). In
logistic regression, the logit is the inverse of the Logistic Sigmoid function
$\phi(z) = \frac{1}{1 + e^{-z}}$, i. e.,
$\mathrm{logit}(\phi(z)) = \log\left(\frac{\phi(z)}{1 - \phi(z)}\right)$
(HASTIE; TIBSHIRANI; FRIEDMAN, 2009). In DNNs,
the ReLU (NAIR; HINTON, 2010) function (or a similar one) is commonly used as the
activation function. Therefore, the statistical concept of logits does not strictly align
with this context. However, possibly by derivation from logistic regression, the term
logits is commonly used in DNNs to refer to the raw outputs of the network before the
application of the Softmax function.
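The inverse relation between the logit and the Logistic Sigmoid can be checked in a few steps, using only the definition of $\phi(z)$ given above:

```latex
\begin{aligned}
1 - \phi(z) &= 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}, \\
\frac{\phi(z)}{1 - \phi(z)} &= \frac{1}{e^{-z}} = e^{z}, \\
\mathrm{logit}(\phi(z)) &= \log\left(e^{z}\right) = z .
\end{aligned}
```

The logit therefore recovers the value that was fed to the sigmoid, which is the sense in which the pre-Softmax values of a network are loosely called logits.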

LeCun (1989) published the work that started the CNN design and showed that
minimizing the number of free parameters of the network (i. e., using shared weights)
enhances its generalization. Afterwards, LeCun et al. (1989a) introduced the first CNN, a
backpropagation learning network fed directly with images to recognize handwritten zip
codes. This network used groups of shared weights and convolution functions performed
by feature maps. Its architecture was composed of two convolutional layers and two fully
connected layers. In the next work, LeCun et al. (1989b) improved the CNN architecture
using some findings of the Neocognitron, a model developed by Fukushima (1980) based
on the discoveries about the visual cortex published by Hubel and Wiesel (1968). This
network architecture had two convolutional layers, two subsampling layers, and one fully
connected layer. Later, LeCun et al. (1995) compared ML algorithms for handwritten digit
recognition and proposed LeNet-1. Lastly, LeCun et al. (1998) proposed a model for
recognizing handwritten digits, LeNet-5 (Figure 8). Several techniques were compared in
this work and the CNN outperformed all of them. This network had three convolutional
layers, two subsampling layers, one fully connected layer, and one Radial Basis Function
(RBF) layer, where hand-drawn prototypes of digits were used to aid pattern recognition.

[Figure 8 diagram: INPUT 32x32; C1: feature maps 6@28x28; S2: f. maps 6@14x14;
C3: f. maps 16@10x10; S4: f. maps 16@5x5; C5: layer 120; F6: layer 84; OUTPUT 10.
Connection labels: Convolutions, Subsampling, Convolutions, Subsampling, Full connection,
Full connection, Gaussian connections.]

Figure 8 – Architecture of LeNet-5. The convolutional layers are labeled CN, the
subsampling (average pooling) layers are labeled SN, and fully-connected layers are
labeled FN, where N is the layer index. Source: (LECUN et al., 1998)
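The feature map sizes shown in Figure 8 can be reproduced with a small sketch. The kernel sizes and strides below (5 × 5 convolutions with a stride of one, 2 × 2 subsampling with a stride of two, no padding) are not stated in this text; they are assumed here from the original LeNet-5 description in (LECUN et al., 1998). The helper name is hypothetical.

```python
def output_size(n_in, kernel, stride=1, padding=0):
    """Side length of the output of a convolution or pooling layer."""
    return (n_in + 2 * padding - kernel) // stride + 1


# (layer name, kernel side, stride): assumed LeNet-5 settings.
layers = [("C1", 5, 1), ("S2", 2, 2), ("C3", 5, 1), ("S4", 2, 2), ("C5", 5, 1)]

size = 32                      # INPUT 32x32
sizes = {}
for name, kernel, stride in layers:
    size = output_size(size, kernel, stride)
    sizes[name] = size

print(sizes)  # {'C1': 28, 'S2': 14, 'C3': 10, 'S4': 5, 'C5': 1}
```

Under these assumptions C5 produces 1 × 1 maps, which is consistent with Figure 8 drawing it as a layer of 120 units.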

Over time, another important work emerged in the area. Deng et al. (2009) released
a large publicly available image dataset called ImageNet, which provided 3.2 million
images to be used in image recognition systems. They believed that a large dataset was a
missing resource for the development of large-scale, advanced image search algorithms and
improved image analysis techniques. ImageNet was created from web images, which were
manually labeled with their corresponding category. Its organization was based on the
hierarchical structure of WordNet (FELLBAUM, 1998). The ImageNet images are distributed in
1000 categories, containing mammals, birds, fish, reptiles, amphibians, vehicles,
furniture, musical instruments, tools, flowers, fruits, and others. The following year,
Russakovsky et al. (2015a) started the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC), which has become one of the most important competitions in the area. This
challenge has been held annually for several years and has become a benchmark for
large-scale object recognition. Several works with CNNs gained prominence by participating
in this competition. Throughout this section, we will present the most important details
about the CNNs that were used in our work.

Krizhevsky, Sutskever and Hinton (2012) published AlexNet, the first large-scale
network that outperformed conventional computer vision methods, winning the ILSVRC
2012 by a large margin over previous work. Its architecture is similar to LeNet, but it
is deeper, having 5 convolutional layers followed by 3 fully connected layers (Figure 9).
Unlike LeNet, they used ReLU (NAIR; HINTON, 2010) instead of the hyperbolic tangent as
the activation function. Furthermore, they also used Dropout (SRIVASTAVA et al., 2014)
in their network and image augmentation during network training.
1 Page of Visual Geometry Group at Oxford University: <https://www.robots.ox.ac.uk/~vgg/>


[Figure 9 diagram: blocks labeled Input Image, Conv + ReLU (×5), Local Normalization (×2), Max Pooling (×3), FC + ReLU (×3), Softmax, Output]

Figure 9 – AlexNet Architecture with 5 convolutional layers (Conv) followed by 3 fully
connected layers (FC).

Afterwards, the Visual Geometry Group at Oxford University¹ introduced the concept of
blocks in their network, called VGG. A block follows the sequence: (i) a convolutional
layer with padding to maintain the resolution, (ii) a non-linear function, such as ReLU,
and (iii) a pooling layer, such as max-pooling, to reduce the resolution. The main difference
from AlexNet is the grouping of convolutional layers with non-linear transformations, which
leave the dimensionality unchanged, followed by the resolution-reduction step. Its first network
was VGG-11, which had 5 blocks. As a whole, it had 8 convolutional layers followed by 3 fully
connected layers, for a total of 11 layers. Later, Simonyan and Zisserman (2015) proposed
VGG-16 (Figure 10), which also had 5 blocks, but with 13 convolutional layers instead
of 8. Their work showed the importance of network depth for good performance.
VGG-16 won first place in object localization and second place in image classification at
ILSVRC 2014.

LeNet, AlexNet, and VGG networks all share a common design pattern. They
extract the features through a sequence of convolutions and pooling layers and process the
representations in the fully connected layers. However, different network architectures were
proposed later, such as NiN (LIN; CHEN; YAN, 2014) with Network in Network blocks,
GoogLeNet (SZEGEDY et al., 2015) with inception blocks, and ResNet (HE et al., 2016a)
with residual blocks (for more details, see (ZHANG et al., 2021)). ResNet was a CNN that
popularized residual connections and reached a depth of up to 152 layers without
compromising the generalization power of the model. It won the image classification
task at ILSVRC 2015 (HE et al., 2016a).

The idea behind the ResNet network was to avoid degrading the performance of deeper

[Figure 10 diagram: blocks labeled Input Image, Conv + ReLU (×13), Max Pooling (×5), FC + ReLU (×3), Softmax, Output]

Figure 10 – VGG-16 architecture with 13 convolutional layers (Conv) distributed in 5
VGG blocks, followed by 3 fully connected layers (FC).

neural networks, where the addition of more layers worsened their performance. For this
purpose, the solution presented by He et al. (2016a) was to add layers with the identity
function to the network. In the residual block, the shortcut connection skips one or more
layers of the network to add the input to the block result. An illustrative example of a
residual block is shown in Figure 11. In this example, x represents the input, and f(x)
corresponds to the result of the operations performed within the dashed block. The residual
block operates by adding the input x to the transformed output, giving f(x) + x. The resulting
sum is then passed through an activation function, generating the final result of the
residual block. This process facilitates the learning of residual information, allowing the
gradients to flow directly through the identity mapping. Residual networks and related
architectures use different variations of the residual block, such as the bottleneck block (HE
et al., 2016a), the pre-activation block (HE et al., 2016b), and the transformer block (RADFORD
et al., 2019).
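The computation of a residual block can be illustrated with a few lines of NumPy. This is only a minimal sketch: it assumes that f(x) is built from two dense matrices with a ReLU in between, as a stand-in for the convolutional layers of an actual ResNet block, and the names `residual_block`, `w1` and `w2` are hypothetical.

```python
import numpy as np


def relu(v):
    """Element-wise rectified linear unit."""
    return np.maximum(v, 0.0)


def residual_block(x, w1, w2):
    """Return activation(f(x) + x) for a toy block.

    f(x) is layer -> activation -> layer, using dense matrices
    (an assumption made here to keep the sketch short).
    """
    fx = relu(x @ w1) @ w2      # f(x): result of the operations inside the block
    return relu(fx + x)         # shortcut adds the input, then the activation is applied
```

When both weight matrices are zero, f(x) vanishes and the block returns the activation of its input, which shows why extra residual blocks can fall back to an identity-like mapping instead of degrading the signal.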

These networks and several other CNNs not mentioned here have achieved remarkable
results in representation learning and image processing (for
more details, see (GU et al., 2018; KHAN et al., 2020; ZHANG et al., 2021)). However,
these networks have vulnerabilities that allow model extraction and attacks, which will be
described in Chapter 3, Section 3.1.

Attacks on machine learning models are often due to the inherent complexity and
hyper-dimensionality of these models, which often contain hidden details that are difficult
for humans to understand. Therefore, the number of studies aiming
to understand these models and their responses has increased in the literature. The next
section describes related methods, focusing on those that were used in our work.



[Figure 11 diagram: Input x → Convolutional Layer → Activation Function → Convolutional Layer, producing f(x); an identity shortcut carries x to the addition (+), giving f(x)+x, which is followed by an Activation Function]

Figure 11 – Example of a residual block for ResNet network.

2.2 Analysis of CNNs
One of the challenges of working with ML methods is dealing with their high
dimensionality, because the training data becomes less dense in these high dimensional
spaces (ANOWAR; SADAOUI; SELIM, 2021). After passing through convolutional, pooling,
and fully connected layers, the input data is transformed into a feature space with a
different dimensionality, where each feature or combination of features represents a different
aspect of the input. These subspaces can have varying dimensions and capture different
levels of abstraction or complexity in the data. Sometimes, one way to interpret the
behavior of certain methods is to visualize their subspaces and attempt to gain insights
from them. A commonly adopted option to visualize them is using techniques to reduce
their dimensionality.

Dimensionality reduction techniques aim to reduce data complexity, improve data
quality or even provide qualitative means of data analysis. There are two main types of
dimensionality reduction: (i) feature selection, which identifies the most informative features
and eliminates the rest, and (ii) feature extraction, which uses algebraic transformations
to combine features into a generally smaller set of new features. According to Anowar, Sadaoui
and Selim (2021), Feature Extraction Algorithms (FEAs) are best suited for complex and sparse
real-life datasets. Moreover, FEA techniques preserve intrinsic properties or the structure
of the original features.

There are several FEA techniques, but no single one is always the best choice, because
the choice depends on the data (their characteristics, quality, and size) and on the purpose
of use. Among them, t-Distributed Stochastic Neighbor Embedding (t-SNE) (MAATEN; HINTON, 2008)
is a method that reduces high-dimensional data to low-dimensional data, usually 2-D or 3-D,
with the important property of preserving the significant structure of the original data.
Given that it is used to explain and visualize data, providing an intuition of how data is
organized in a high-dimensional space, it was chosen to be used in our work. Furthermore,
Anowar, Sadaoui and Selim (2021) also point out that other FEA techniques are not suitable
for visualizing high-dimensional data and may not preserve the data structure. In contrast,
t-SNE is useful for visualizing high-dimensional data because it preserves the relationships
within the data structure.

This method uses two sets of probabilities to relate the similarities between points
in the high-dimensional space to those in the low-dimensional space. Initially, the Euclidean
distance between each pair of data points is converted into conditional probabilities P(a|b),
as in Stochastic Neighbor Embedding (SNE) (MAATEN; HINTON, 2008):

$$P(a|b) = \frac{\exp\left(-\|x_a - x_b\|^2 / 2\sigma^2\right)}{\sum_{k \neq a} \exp\left(-\|x_a - x_k\|^2 / 2\sigma^2\right)} \tag{2.6}$$

It represents the similarity between $x_a$ and $x_b$, i. e., how close $x_a$ is to $x_b$ considering
a Gaussian distribution around it with a given variance $\sigma^2$. It then uses a Student's
t-distribution with one degree of freedom to obtain the second set of probabilities Q(a|b)
between the target pair of points $y_a$ and $y_b$ in the low-dimensional space:

$$Q(a|b) = \frac{\left(1 + \|y_a - y_b\|^2\right)^{-1}}{\sum_{k \neq a} \left(1 + \|y_a - y_k\|^2\right)^{-1}} \tag{2.7}$$

So, if the pair of high-dimensional data points is correctly mapped to the low-dimensional
space, P(a|b) and Q(a|b) become equal. Therefore, the final objective is to
minimize the difference between these two probabilities by minimizing the sum of the
Kullback–Leibler (KL) divergence:

$$KL(P \| Q) = \sum_{a,b} P(a|b) \log \frac{P(a|b)}{Q(a|b)} \tag{2.8}$$

For more details of t-SNE, including how to obtain the variance $\sigma^2$ and how to minimize
KL by gradient descent, see (MAATEN; HINTON, 2008).
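As a rough illustration of Equations 2.6 to 2.8, the following NumPy sketch computes the two probability matrices and their KL divergence for a small set of points. It assumes a single fixed σ for all points (the original method tunes one value per point), uses the row-wise normalization over k ≠ a written above, and takes the difference between low-dimensional points in both the numerator and the denominator of Q. The function names are illustrative and do not come from any t-SNE library.

```python
import numpy as np


def pairwise_sq_dists(z):
    """Matrix of squared Euclidean distances between the rows of z."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def high_dim_probs(x, sigma=1.0):
    """Gaussian similarities in the original space, normalized per row."""
    aff = np.exp(-pairwise_sq_dists(x) / (2.0 * sigma ** 2))
    np.fill_diagonal(aff, 0.0)                    # exclude the k = a term
    return aff / aff.sum(axis=1, keepdims=True)


def low_dim_probs(y):
    """Student-t (one degree of freedom) similarities in the embedding."""
    aff = 1.0 / (1.0 + pairwise_sq_dists(y))
    np.fill_diagonal(aff, 0.0)                    # exclude the k = a term
    return aff / aff.sum(axis=1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """Sum of P * log(P / Q) over all pairs with non-zero P."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))
```

In a full t-SNE run, the low-dimensional points would be updated by gradient descent so that the value returned by `kl_divergence` decreases; that optimization loop is left out of this sketch.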

Another way of analyzing ML methods, mainly neural networks, is through methods
of eXplainable Artificial Intelligence (XAI), a term coined by DARPA (GUNNING; AHA,
2019). In the vision domain, these methods usually provide a matrix that represents the
importance of each pixel of the input image in the model response. This matrix is called a
heatmap, where each pixel provides information about its relevance score (contribution) to
the final response of the model. Some XAI methods generate heatmaps in a deterministic
way, based on the behavior of the already trained neural network, i. e., on the results of
the network's internal operations on an input image. On the other hand, there are XAI
methods that generate random or perturbed (original image with noise) inputs to explore
the different classifications (or output probabilities) of the neural network and provide a
final heatmap (ARRAS; OSMAN; SAMEK, 2022).



Among several methods, such as Class Saliency Map, Grad-CAM, Gradient×Input,
Integrated Gradients, Excitation Backprop, Guided Backpropagation, and others (for a
brief overview of XAI methods, see (ARRAS; OSMAN; SAMEK, 2022; HOLZINGER
et al., 2022)), we chose to work with Layer-wise Relevance Propagation (BACH et al.,
2015; MONTAVON et al., 2017). Layer-wise Relevance Propagation (LRP) was originally
developed for CNNs and is a very popular method, which has even been extended to other
works, making it a highly applicable technique today (HOLZINGER et al., 2022). It is
a deterministic XAI method based on the operations performed in model propagation.
Specifically, it propagates the relevance score from the model output to its related input.
Furthermore, in a recent work, Arras, Osman and Samek (2022) proposed a new evaluation
paradigm for computer vision methods and tested several XAI methods. LRP was
considered to be one of the most accurate XAI computer vision methods tested under this
new evaluation paradigm.

LRP is an XAI method that can be applied to models structured as neural
networks. Besides images, it also works with video and text. In this method, the relevance
score received by a neuron must be redistributed to the lower layer in equal amount, i. e., LRP
redistributes the model's prediction score following a principle of local conservation. The
method is illustrated in Figure 12. Let R be the relevance score, and let j and k be the
indexes of neurons in two consecutive layers of the neural network, i. e., j in the
previous layer and k in the following layer. Thus, propagating the relevance scores $R_k$ at a
given layer onto the neurons of the previous layer is achieved by applying the rule:

$$R_j = \sum_k \frac{z_{jk}}{\sum_j z_{jk}} R_k \tag{2.9}$$

The term $z_{jk}$ represents how much neuron j contributed to making neuron k relevant.
The denominator enforces the conservation property (analogous to the energy conservation
principle or Kirchhoff's law in physics), where $\sum_j R_j = \sum_k R_k$. The propagation procedure
terminates once the input features have been reached.

Various rules can be used to redistribute contributions from each layer to the
previous layer. First, consider that deep rectifier networks are composed of neurons ak:

$$a_k = \max\left(0, \sum_{0,j} a_j w_{jk}\right) \tag{2.10}$$

Here, $a_k$ are the neurons of a deep rectifier network and $a_j$ are the lower-layer activations.
To include the bias in the weight matrix W, with w ∈ W, let $a_0 = 1$, so that $w_{0k}$ acts as
the bias of neuron k. Then, the first rule discussed by the LRP authors (BACH et al., 2015) is:

$$\text{(LRP-0 rule)} \qquad R_j = \sum_k \frac{a_j w_{jk}}{\sum_{0,j} a_j w_{jk}} R_k \tag{2.11}$$



Figure 12 – Illustration of the LRP method running in a neural network. At the top,
the neural network is fed with input x and produces the output F(x), where $a_j$
indicates lower-layer activations and $a_k$ indicates the deep rectifier network
neurons. Then, the LRP method uses F(x) as the final relevance
$R_k$ of the network. At the bottom, $R_k$ is back-propagated to the previous
layer, generating $R_j$. These values are propagated back again until they reach
the network input, generating the heatmap R. (Image source: LRP project
page <http://www.heatmapping.org>)

It redistributes the contributions of each input to neuron activation proportionally as they
occur. However, as described by the authors, the gradient of a deep neural network is
typically noisy, therefore this rule needs to be more robust. A first enhancement of the
basic LRP-0 rule consists of adding a small positive term ε in the denominator:

$$\text{(LRP-}\varepsilon\text{ rule)} \qquad R_j = \sum_k \frac{a_j w_{jk}}{\varepsilon + \sum_{0,j} a_j w_{jk}} R_k \tag{2.12}$$


According to the authors, as a result of this rule, the explanations are usually more concise
in terms of input features and contain less noise. Another improvement proposed by them
is to favor the effect of positive contributions over negative contributions, controlled by
the scalar λ (a high value will cause the negative contributions to disappear):

$$\text{(LRP-}\lambda\text{ rule)} \qquad R_j = \sum_k \frac{a_j \cdot (w_{jk} + \lambda w_{jk}^+)}{\sum_{0,j} a_j \cdot (w_{jk} + \lambda w_{jk}^+)} R_k \tag{2.13}$$

The notation used is $(\cdot)^+ = \max(0, \cdot)$ and $(\cdot)^- = \min(0, \cdot)$.
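The three rules above differ only in how the denominator and the weights are adjusted, so they can be sketched with one function for a single dense ReLU layer. This is a minimal illustration under stated assumptions: the layer is fully connected, the bias is handled as the $a_0 = 1$ term of the sum, and the name `lrp_dense` and its parameters are hypothetical rather than part of any LRP library. The division is undefined if a denominator is exactly zero, which the sketch does not guard against.

```python
import numpy as np


def lrp_dense(a, w, b, r_k, eps=0.0, lam=0.0):
    """Redistribute the relevance r_k of layer k onto the inputs a of layer j.

    a:   activations of the lower layer, shape (J,)
    w:   weights, shape (J, K); b: biases, shape (K,)
    eps = 0, lam = 0 -> LRP-0; eps > 0 -> LRP-eps; lam > 0 -> LRP-lambda.
    """
    w_eff = w + lam * np.maximum(w, 0.0)     # w_jk + lambda * w_jk^+
    b_eff = b + lam * np.maximum(b, 0.0)     # same treatment for the bias term
    z = a @ w_eff + b_eff                    # sum over 0,j of a_j * w_jk
    s = r_k / (z + eps)                      # relevance per unit of contribution
    return a * (w_eff @ s)                   # R_j = a_j * sum_k w_jk * s_k
```

With zero bias and `eps = 0`, the relevances returned for the lower layer add up to the relevance that entered the layer, which is the conservation property discussed earlier; a positive `eps` absorbs part of that relevance.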

Table 1 – LRP rules and usage suggestions. Additionally to the variables denoted in the
text, the index i refers to the network input and the parameters li, hi define the
box constraints of the input domain (MONTAVON et al., 2019).

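Name | Formula (rules) | Usage
LRP-0 | $R_j = \sum_k \frac{a_j w_{jk}}{\sum_{0,j} a_j w_{jk}} R_k$ | Upper layers
LRP-ε | $R_j = \sum_k \frac{a_j w_{jk}}{\varepsilon + \sum_{0,j} a_j w_{jk}} R_k$ | Middle layers
LRP-λ | $R_j = \sum_k \frac{a_j \cdot (w_{jk} + \lambda w_{jk}^+)}{\sum_{0,j} a_j \cdot (w_{jk} + \lambda w_{jk}^+)} R_k$ | Lower layers
LRP-αβ | $R_j = \sum_k \left( \alpha \frac{(a_j w_{jk})^+}{\sum_{0,j} (a_j w_{jk})^+} - \beta \frac{(a_j w_{jk})^-}{\sum_{0,j} (a_j w_{jk})^-} \right) R_k$ | Lower layers
♭ (flat)² | $R_j = \sum_k \frac{1}{\sum_j 1} R_k$ | Lower layers
w²-rule | $R_i = \sum_j \frac{w_{ij}^2}{\sum_i w_{ij}^2} R_j$ | First layer ($\mathbb{R}^d$)
zB-rule | $R_i = \sum_j \frac{x_i w_{ij} - l_i w_{ij}^+ - h_i w_{ij}^-}{\sum_i x_i w_{ij} - l_i w_{ij}^+ - h_i w_{ij}^-} R_j$ | First layer (pixels)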

Unfortunately, using only one of these rules across the entire network structure
can provide a poor explanation. Therefore, it is recommended to use a composite strategy
(Figure 13), where different rules are used in different layers. In addition to these rules,
there are other rules that can be used to get better explanations of the network, as well as
suggestions of which rule to use in which layer. These rules and usage suggestions
are presented in Table 1. For a technical and more in-depth look at the LRP method, including
a discussion of the various propagation rules, see (BACH et al., 2015; MONTAVON et al.,
2017; MONTAVON et al., 2019; LAPUSCHKIN et al., 2019).

2 Pronunciation: /flæt/ <https://dictionary.cambridge.org/dictionary/english/flat>


Figure 13 – At the top, the input and the heatmaps generated using each of the LRP-0,
LRP-ε and LRP-λ rules alone. At the bottom right is the network structure with
the rule used in each layer, and on the left is the resulting heatmap. Source:
(MONTAVON et al., 2019).



3 Related Works

The literature has shown that state-of-the-art models are susceptible to attacks,
such as model extraction and adversarial examples. Our work relied on several studies
in the field and presented a new method for model extraction that takes advantage of
previously unexplored features. Furthermore, after obtaining the results of our method,
we analyzed the defenses against such attacks, paying particular attention to the defenses
that could be effective against our method. Thus, this chapter presents works related to
our method, followed by the related types of defenses.

3.1 Model extraction and attacks
Following the taxonomy proposed by Zhang et al. (2022), there are two groups of
targets in attack methods: visual data and visual deep learning systems. In the first group,
the attack takes place by applying methods to the data instance. The second group is
formed by methods that attack datasets or models. The types of attack on datasets are
membership inference, model inversion, property inference, model memorization, and
violation in data aggregation. The attack on models is the model extraction attack,
which is where our method fits.

These attacks can occur on white-box or black-box models. In white-box models,
information about their parameters, training dataset and model architecture is available
and accessible to the adversary. In contrast, in black-box models, little information is
available to the adversary, such as only the final predictions of the model. The literature
also cites gray-box models, where the adversary has more information about the model than
in the black-box case. However, most works use only the white-box and black-box nomenclature,
and only these two terms are adopted here.

This section provides a knowledge base on model extraction performed in our
method. It begins by describing knowledge transfer works related to model compression
and knowledge distillation. Then, it covers research related to model extraction attacks.
Finally, works related to membership inference attacks are presented. These studies focused
on the classification of inputs with perturbation in deep neural networks and introduced the
concept of adversarial examples. These works showed intriguing properties in recognizing
unknown and spurious images by state-of-the-art network models.

Model compression consists of transferring knowledge from a larger model to a smaller
one, i. e., one with fewer parameters. The objective is to use the substitute (and smaller) model
on hardware with less processing power. Bucilă, Caruana and Niculescu-Mizil (2006)
transferred the knowledge of an ensemble (a set of small machine learning
models combined to provide more complex answers) to a shallow network. However, instead
of training the network using the original dataset, they generated a synthetic dataset and
labeled it with the ensemble. The larger dataset, composed of synthetic data, was used
to train a shallow network, which achieved performance similar to that of the ensemble.
Remarkably, the shallow network trained on this dataset outperformed the same network
trained on the original training set. To create the synthetic data, they introduced
a new method that matched the distribution of the original training set as closely as
possible. They evaluated the effectiveness of their model
compression on eight binary classification datasets provided in the UCI Repository (DUA;
GRAFF, 2017), but none of these datasets consisted of images.

Additionally, Ba and Caruana (2014) performed model compression from a DNN
to a shallow network. They tested their approach with the TIMIT (GAROFOLO et al.,
1993) and CIFAR-10 (KRIZHEVSKY; HINTON, 2009) datasets. The objective was to train
the surrogate models to learn the function learned by the larger model. For this purpose,
the surrogate models were not trained with the original labels of the training
datasets, but with the logits of the target models. They emphasized that learning the target
model function is easier using logits. Given training data $D = \{(x_1, z^t_1), \ldots, (x_N, z^t_N)\}$,
where $z^t_i$ is the target model logit of $x_i$, the loss function used was the mean squared error
(MSE, squared L2 norm) applied to the logits:

$$L(\hat{z}^s, \hat{z}^t) = \frac{1}{2N} \sum_{i=1}^{N} \| z^s_i - z^t_i \|_2^2 \tag{3.1}$$

where N = |D| and $\hat{z}^s$ are the substitute model logits.
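Equation 3.1 translates directly into a few lines of NumPy. The sketch below assumes that the logits of both models are already available as arrays of shape (N, number of classes); the function name is illustrative.

```python
import numpy as np


def logit_matching_loss(z_student, z_teacher):
    """Mean squared error between logits: 1/(2N) * sum_i ||z_i^s - z_i^t||_2^2."""
    z_student = np.asarray(z_student, dtype=float)
    z_teacher = np.asarray(z_teacher, dtype=float)
    n = z_student.shape[0]                       # N = |D|
    return float(np.sum((z_student - z_teacher) ** 2) / (2.0 * n))
```

The loss is zero only when the substitute reproduces the target logits exactly, so minimizing it pushes the substitute to imitate the target function rather than the original labels.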

Later, Hinton, Vinyals and Dean (2015) further studied model compression.
They argued that the probabilities of a model's output provide not only information
about the desired class (i. e., output with higher probability), but also information about
how the model tends to generalize to other classes (i. e., lower probabilities presented
in other outputs). Thus, they proposed a method called knowledge distillation, which
provides more information about the classification of the model in relation to its input.
For this, they proposed smoothing the final softmax output using a temperature T:

$$\hat{\sigma}(z_i) = \frac{e^{z_i / T}}{\sum_{j=1}^{d} e^{z_j / T}} \tag{3.2}$$

where z is the vector of logits of the target network, with d entries. Using a higher value
for T produces a smoother probability distribution across classes. The value of T must be
empirically adjusted so that the target model produces a good set of outputs for training
the substitute model. They also cited the work of Bucilă, Caruana and Niculescu-Mizil (2006),
describing it as a specific case of knowledge distillation. Some other works were built on
top of these results. For example, Chan, Ke and Lane (2015) employed knowledge distillation
to transfer the knowledge from an RNN to a DNN, and Tang, Wang and Zhang (2016) used it to
transfer the knowledge from a DNN to an LSTM.
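The effect of the temperature in Equation 3.2 can be seen with a small NumPy sketch. The function name and the example logits mentioned afterwards are illustrative assumptions, not values from the cited works.

```python
import numpy as np


def softmax_with_temperature(z, t=1.0):
    """Softmax of the logits z smoothed by the temperature t."""
    z = np.asarray(z, dtype=float) / t
    z = z - z.max()              # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()
```

For logits such as [2, 1, 0], calling the function with a temperature of 5 instead of 1 keeps the same ranking of classes but lowers the largest probability and raises the smaller ones, which is the extra information that the substitute model is trained on.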

In addition to model compression studies, Tramèr et al. (2016) explored model
extraction attacks. They exploited vulnerabilities in machine learning models, obtaining
their predictions to generate an equivalent or nearly equivalent surrogate model, i. e., a
model with accuracy close to that of the target model. Their research showed good results in
logistic regression models, SVMs, decision trees and shallow networks. They first used the
models' soft-labels and then their hard-labels. Moreover, they argued that ML models
emit data-rich output that can be exploited by adversaries, and that model extraction can be
used to steal the model for subsequent free use. In addition, they used BigML¹ and
Amazon Machine Learning² to simulate an MLaaS, providing their target models as an API
and performing the extraction. Additionally, Shi, Sagduyu and Grushin (2017) studied
the extraction of two target classifiers, Naive Bayes and SVM, that classified text in a
binary dataset. The authors generated two deep learning classifiers that achieved
high fidelity rates with respect to the target models. However, their work did not explore
copies from DNNs, and it required problem domain data. Given the cost of copying these
simpler models, their approaches did not seem scalable to deep learning models with
multiple classes and larger datasets.

Unlike our work, however, these methods generally assume that some details about
the models are known, that is, they treat the model of interest as a white box. They also
use the same training data (or problem domain data, at least) assuming one would have
access to the logits or probabilities of all classes for a given input.

Other works have also explored attacks on ML models, more specifically, on neural
networks. Szegedy et al. (2014) discovered that neural networks can be fooled by
certain input patterns called adversarial examples, i. e., input images that have their pixels
slightly modified to cause a machine learning model to produce incorrect output. Let f(·)
be the target model, x the input image, ε the pixel perturbation matrix, ŷ the predicted
class of x, and x′ the adversarial example. The normal behavior of the target model is
f(x) = ŷ. However, the adversarial example can fool the target model: f(x′) = f(x + ε) ≠ ŷ.
They used an optimization process on the input image to find small perturbations in the
pixels that caused wrong outputs in the target networks (Figure 3). Furthermore, they
found that the same input with the same perturbation can be applied to a different network,
even one trained on a different subset of the data, and also cause an incorrect classification.

1 BigML website: <https://bigml.com>
2 Amazon Machine Learning website: <https://aws.amazon.com/machine-learning/>

Later, Goodfellow, Shlens and Szegedy (2015) proposed the Fast Gradient Sign
Method (FGSM) to craft adversarial examples against DNNs. This method has been used
in several academic works. Let x be the original image correctly classified by the model,
y the desired prediction of x, x′ the adversarial example, and ε the perturbation used to
craft the adversarial example:

$$x' = x + \varepsilon, \qquad \varepsilon = \lambda \, \operatorname{sign}(\nabla_x J(\theta, x, y)) \tag{3.3}$$

where J(θ, x, y) is the cost used to train the neural network with parameters θ, and λ is
the magnitude of the perturbation added to x. A demonstration of this method (extracted
from the original article) can be seen in Figure 4.
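To make Equation 3.3 concrete, the sketch below applies the same update to a logistic regression classifier instead of a DNN. This substitution is an assumption made to keep the example self-contained: for this model with the cross-entropy cost, the gradient with respect to the input has the closed form (p − y)·w, so no automatic differentiation is needed. All names and values are illustrative.

```python
import numpy as np


def sigmoid(v):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-v))


def fgsm_logistic(x, y, w, b, lam):
    """Return x' = x + lam * sign(grad_x J) for a logistic regression model.

    J is the binary cross-entropy; its gradient w.r.t. x is (p - y) * w,
    where p is the predicted probability of class 1.
    """
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w
    return x + lam * np.sign(grad_x)
```

For example, with w = [2, −1], b = 0, x = [1, 0.5] and y = 1, the clean input is assigned to class 1, while a step of magnitude 0.8 moves it across the decision boundary. In a DNN the gradient would be obtained by backpropagation, but the update rule stays the same.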

Papernot et al. (2016) also explored adversarial examples and used forward
derivatives to build adversarial saliency maps³ to craft adversarial samples against a
DNN model. The forward derivative is defined as the Jacobian matrix of
the model with respect to its inputs. The main difference of this method is the application
of an extended version of the saliency map introduced by Simonyan, Vedaldi and Zisserman
(2014). These maps indicate which pixels most efficiently perturb the network behavior for
an input, thus allowing adversarial examples to be generated. However, these methods need
access to the model parameters (white-box) and also need problem domain data.

After these works, Papernot et al. (2017) formulated a new strategy capable of
crafting adversarial examples against black-box models provided as MLaaS. The strategy
consists of first attacking the target model (model extraction) to train a surrogate model,
a method named Jacobian-Based Dataset Augmentation (JBDA). For this purpose, a
training dataset S is built from some original images of the problem domain. Then, in an
iterative process, the surrogate model is used to craft new synthetic examples using its
Jacobian matrix. These images are then labeled by the target model to enlarge the
dataset S and fine-tune the surrogate model:

$$S_{\tau+1} \leftarrow \{\, x + \lambda \cdot \operatorname{sign}(J(x)[f(x)]) : x \in S_\tau \,\} \cup S_\tau \tag{3.4}$$

where τ is the iteration identifier, x is the original image, J(x)[f(x)] is the Jacobian of
surrogate model with respect to x corresponding to the hard-label assigned by the target
model f(·), and the term λ is a magnitude to add the sign of the Jacobian matrix on x
to generate a new image. Finally, when the training ends (τmax iterations), the surrogate
model is used to craft adversarial examples using the previous two methods of (SZEGEDY
et al., 2014) and (PAPERNOT et al., 2016). Although they use model extraction on the
target model, the surrogate model is not designed to have high accuracy or fidelity to the
target model. The intention was only to generate a substitute model capable of creating
adversarial examples. Moreover, they used only problem domain images in the whole
process.
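One augmentation step of Equation 3.4 can be sketched as follows. The sketch makes several simplifying assumptions: the surrogate is a linear model whose logits are `x @ surrogate_w`, so the Jacobian row for a class is simply the corresponding weight column; the target model is represented by a hypothetical `oracle` callable that returns hard-labels; and the retraining of the surrogate between iterations is omitted.

```python
import numpy as np


def jbda_step(s, surrogate_w, oracle, lam=0.1):
    """Return the augmented dataset S_{tau+1} built from S_tau.

    s:            current samples, shape (n, d)
    surrogate_w:  weights of a linear surrogate, shape (d, classes)
    oracle:       callable returning the target model's hard-labels for s
    """
    labels = np.asarray(oracle(s), dtype=int)     # f(x): labels from the target
    jac = surrogate_w[:, labels].T                # J(x)[f(x)] for each sample, shape (n, d)
    new_points = s + lam * np.sign(jac)           # move along the sign of the Jacobian
    return np.concatenate([new_points, s], axis=0)
```

Each call doubles the dataset, so only a few iterations are feasible before the number of queries to the target model becomes large. In the full method the surrogate would be fine-tuned on the labeled set after every step.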

Another important study was presented by Nguyen, Yosinski and Clune (2015).
They studied the vulnerability to adversarial examples presented by Szegedy et al. (2014)
and found that it is easy to generate images that are completely unrecognizable to humans,
but that DNNs recognize with high confidence (Figure 5). Their initial motivation
was to verify the similarity between human vision and computer vision. They started by
training two DNNs, one with MNIST and the other with ImageNet, and crafted adversarial
examples using evolutionary algorithms and gradient ascent. In this way, they found images
classified with a high degree of confidence as belonging to each class in the tested DNN models.
Additionally, they unsuccessfully tried to avoid this misclassification by adding a garbage
class, but it degraded the accuracy of their models. They also made a statement
about discriminative models with a high-dimensional input space: the
area of a class in the classification space can be much larger than the area occupied by the
training examples for that class. These works show that there are large high-confidence
regions in the CNN classification subspaces that are not occupied by training examples.

3 The saliency map is a measure of how much each input pixel contributes to the final predictions of the network.

More recently, and after our preliminary work, Orekondy, Schiele and Fritz (2019)
studied how an adversary can steal functionalities of a black-box target network. Similarly
to our preliminary work (Correia-Silva et al., 2018), the authors labeled a large public
dataset of images using a black-box model to generate a fake dataset. But unlike our
work, they used the model’s probabilities (soft-labels) instead of hard-labels to label their
fake dataset. They conducted experiments with four datasets and a fixed architecture
(ResNet-34) with pre-trained weights. The experiments followed three data distributions:
(i) images used to train the target networks but unlabeled; (ii) a combination of all images
(from OpenImages (KUZNETSOVA et al., 2020) and ILSVRC (RUSSAKOVSKY et al.,
2015b)) used to train all target networks; and (iii) images only from OpenImages and
ILSVRC. Furthermore, they applied one attack method with random image samples
and another one with reinforcement learning. Additionally, to verify the influence of the
architecture, they generated target networks with VGG-16 and ResNet-34 architectures
trained with one of the datasets. Subsequently, they used several architectures to attack
the target networks. Finally, they concluded that it is beneficial to use a more complex
architecture to copy the network and achieved more than 77% of copy in their experiments.
Although similar, their work has a key difference from ours: they assume access to the
probabilities (soft-labels) and not just the hard-label output.

Along the lines of this work, Mosafi, David and Netanyahu (2019a) presented an
attack with unlabeled data on models trained with MNIST and CIFAR-10. The authors
used the probabilities (soft-labels) of the target network to label the attack dataset. In a
follow-up work, Mosafi, David and Netanyahu (2019b) cited our preliminary work and decided
to use the same constraints, i.e., only the hard-labels. Additionally, they proposed a new
method to create unlabeled data. In this method, a student queries the teacher (target
model) with composite images to extract its knowledge. The teacher was trained with
CIFAR-10 and the student used ImageNet to generate each composite image c:

$$c = \frac{p}{100} \times \text{ImageNet}[i] + \left(1 - \frac{p}{100}\right) \times \text{ImageNet}[j] \tag{3.5}$$

where p ∼ U(0, 100), and i, j ∼ U(1, |ImageNet|). Their objective was to generate a
diverse dataset to better extract the knowledge of the target model. They performed three
experiments:

i. the first one used ImageNet images labeled with soft-labels provided by the target
model, achieving a performance of 98.5% (the ratio $\text{accuracy}_{\text{student}} / \text{accuracy}_{\text{teacher}}$),

ii. the second one used ImageNet images labeled with hard-labels provided by the
target model, achieving a performance of 96.7%, and

iii. the third one used the composite images labeled with hard-labels provided by the
target model, achieving a performance of 99%.
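
The composite-image construction of Equation 3.5 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `composite_image` is hypothetical, a small random array stands in for the ImageNet pool, and indices are zero-based rather than starting at 1.

```python
import numpy as np


def composite_image(images: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Blend two randomly chosen images following Equation 3.5."""
    p = rng.uniform(0.0, 100.0)                  # p ~ U(0, 100)
    i, j = rng.integers(0, len(images), size=2)  # i, j drawn uniformly (zero-based)
    weight = p / 100.0
    return weight * images[i] + (1.0 - weight) * images[j]


rng = np.random.default_rng(0)
# Stand-in image pool (assumption): 10 RGB images of 32x32 pixels, values in [0, 1).
pool = rng.random((10, 32, 32, 3))
c = composite_image(pool, rng)
```

Because the two weights are non-negative and sum to 1, each composite is a convex combination of two images and stays within the pixel range of the pool.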

Unfortunately, although they showed that their method outperformed the other tested
methods, they did not test it with other datasets. Furthermore, the total number of images
used in the attack was not reported, and a thorough analysis of the model behavior
was not performed.
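
The performance figures quoted for the three experiments are ratios between the accuracy of the student and the accuracy of the teacher (target model). A minimal sketch of this metric is given below. The function name and the accuracy values are hypothetical and are not results from the cited work.

```python
def relative_performance(student_accuracy: float, teacher_accuracy: float) -> float:
    """Student accuracy expressed as a fraction of the teacher accuracy."""
    if teacher_accuracy <= 0.0:
        raise ValueError("teacher_accuracy must be positive")
    return student_accuracy / teacher_accuracy


# Illustrative accuracies only (assumption), measured on the same test set.
ratio = relative_performance(student_accuracy=0.45, teacher_accuracy=0.90)
print(f"relative performance: {ratio:.1%}")
```

A ratio close to 100% means that the copy is nearly as accurate as the target on the evaluated test set; it does not by itself say how often the two models agree on individual images.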

Unlike related works that use the final model probabilities, our proposed method
requires access only to the hard-labels and is tested on different problem domains,
with different numbers of classes, and on different architectures. These experiments
allow us to present an initial analysis of the limitations and capabilities of copying with
natural random images and the final labels of the model. Table 2 shows a comparison between
related works and ours (Copycat).

Table 2 – Comparison between related works and Copycat (ours). Abbreviations:
Distillation (D), Adversarial Examples (A), Copy Attack (C), Problem Domain
Data (P), and Non-Problem Domain Data (N). The references for the method
numbers are given in footnote 4.

Features / Methods 1 2 3 4 5 6 7 8 9 10 Ours
Type of method D D D A A A C C C C
Black-Box X X X X X X X X
Data used in the Experiments P P P P P N P P N,P N N,P
Hard labels X X X X
ML, Shallow Networks X X X X
DNN → DNN X X X X X X X X
ML → DNN X X

4 1: (BUCILĂ; CARUANA; NICULESCU-MIZIL, 2006). 2: (HINTON; VINYALS; DEAN, 2015).
3: (TANG; WANG; ZHANG, 2016). 4: (SZEGEDY et al., 2014). 5: (GOODFELLOW; SHLENS;
SZEGEDY, 2015). 6: (NGUYEN; YOSINSKI; CLUNE, 2015). 7: (SHI; SAGDUYU; GRUSHIN, 2017).
8: (TRAMÈR et al., 2016). 9: (OREKONDY; SCHIELE; FRITZ, 2019). 10: (MOSAFI; DAVID;
NETANYAHU, 2019b).


To the best of our knowledge, Copycat CNN (CORREIA-SILVA et al., 2018) was
the first model extraction method to use random natural images and hard-labels to
extract the model knowledge. After our experiments were completed, additional related
works emerged. Therefore, for the completeness of this thesis, the studies that came
after the completion of ours are presented next. However, their analysis and exploration
will be carried out in future works.

A different attack approach was presented by Sanyal, Addepalli and Babu (2022),
who proposed the use of Generative Adversarial Networks (GANs) (GOODFELLOW et
al., 2014) for model extraction. First, they used proxy data to train the GAN models.
Afterwards, they alternately trained the surrogate model and the Generator, which used
Adversarial Loss (GOODFELLOW et al., 2014) and Diversity Loss (ADDEPALLI et al., 2020).
Another work that also used the GAN strategy was proposed by Beetham et al. (2023). The main
difference of this work was the application of Dual Students, where two student models
were used to generate images to better explore the Oracle. In this proposal, the loss
function employed in one student model is the negative of the loss of the other student
model. Both works used a much larger amount of data than our methodology.

Due to the importance of these image processing models and their associated cost,
it is necessary to find solutions to protect them against existing threats. Thus, the next
section presents the defense methods related to our work.

3.2 Defense methods
The number of current systems that use machine learning methods such as DNNs
and CNNs has grown substantially in recent years. However, some of their vulnerabilities
have already been presented, such as adversarial examples and model extraction attacks.
Therefore, several current works have