UNIVERSIDADE FEDERAL DO ESPÍRITO SANTO
CENTRO TECNOLÓGICO
PROGRAMA DE PÓS-GRADUAÇÃO EM INFORMÁTICA

Jacson Rodrigues Correia da Silva

Copycat CNN: Convolutional Neural Network Extraction Attack with Unlabeled Natural Images

Doctoral thesis submitted to the Programa de Pós-Graduação em Informática of the Universidade Federal do Espírito Santo, in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Supervisor: Prof. Dr. Thiago Oliveira dos Santos
Co-supervisor: Prof. Dr. Alberto Ferreira de Souza

Vitória, ES
2023

Catalog record provided by the Sistema Integrado de Bibliotecas (SIBI/UFES) and prepared by the author:
C824c. Correia da Silva, Jacson Rodrigues, 1985-. Copycat CNN: Convolutional Neural Network Extraction Attack with Unlabeled Natural Images / Jacson Rodrigues Correia da Silva. 2023. 174 f.: il. Supervisor: Thiago Oliveira dos Santos. Co-supervisor: Alberto Ferreira De Souza. Thesis (Doctorate in Computer Science), Universidade Federal do Espírito Santo, Centro Tecnológico. 1. Artificial intelligence. 2. Neural networks (Computing). I. Oliveira dos Santos, Thiago. II. De Souza, Alberto Ferreira. III. Universidade Federal do Espírito Santo. Centro Tecnológico. IV. Title. CDU: 004

Approved on April 25, 2023.

Examination committee:
Prof. Dr. Thiago Oliveira dos Santos (Supervisor)
Prof. Dr. Claudine Santos Badue Gonçalves (Internal Member)
Prof. Dr. Thomas Walter Rauber (Internal Member)
Prof. Dr. Jurandy Gomes de Almeida Junior (External Member, remote participation)
Prof. Dr. Eduardo José da Silva Luz (External Member, remote participation)

UNIVERSIDADE FEDERAL DO ESPÍRITO SANTO
Vitória/ES, April 25, 2023

Signature protocol: this document was digitally signed with an electronic password through the UFES Protocolo Web, pursuant to Portaria UFES nº 1.269 of 30/08/2018. Each signature can be verified, and the original document viewed, at the corresponding link:
THIAGO OLIVEIRA DOS SANTOS, SIAPE 2023810, Departamento de Informática (DI/CT), on 26/04/2023 at 09:32: https://api.lepisma.ufes.br/arquivos-assinados/698633?tipoArquivo=O
THOMAS WALTER RAUBER, SIAPE 2201072, Departamento de Informática (DI/CT), on 26/04/2023 at 11:47: https://api.lepisma.ufes.br/arquivos-assinados/698826?tipoArquivo=O
CLAUDINE SANTOS BADUE, SIAPE 1729561, Departamento de Informática (DI/CT), on 26/04/2023 at 17:15: https://api.lepisma.ufes.br/arquivos-assinados/699199?tipoArquivo=O

Acknowledgements

There are many moments of struggle and battles that seem to have no end. Everyone’s life has many obstacles, which can become lighter or be overcome thanks to the direct and indirect support of the people around us.
At this moment, I thank God, who has always been by my side through life’s coincidences, manifested through the hands of the people around me, even those who may not believe in Him. He has given me strength when I needed it and guided my thoughts, enabling me to move forward.

I am very grateful to my wife, who has been with me throughout this entire journey, providing me with strong support, affection, and love. Moreover, I am grateful to my son, who has brought a new perspective to my world. Understanding his mind can be challenging at times, but through him, I have gained a deeper understanding of myself. I love you and I understand, by your gestures, how much you love me too.

Many thanks to my parents, who have always been there for me and my family, offering their unwavering presence and support when we needed them most. I also want to express my gratitude to my sisters and brothers-in-law, who have given me their time, attention, affection, and support, whether they are near or far.

I am extremely grateful to my supervisor, who consistently demonstrated his willingness to guide and listen to me. He always made intelligent choices and approached situations with an open mind, charting new paths and guiding me through the difficulties encountered on this journey.

I would also like to express my heartfelt thanks to the team at NXP Semiconductors, who provided me with invaluable life experiences and several opportunities for growth and learning.

Finally, I want to express my gratitude to everyone who has helped me directly or indirectly during this process, to NVIDIA for providing a GPU used in this research, and to the AWS Cloud Credit for Research program for providing cloud computing resources.
Abstract

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on a variety of problems in recent years, leading many companies to develop neural-based products that require expensive data acquisition, annotation, and model generation. To protect their models from being copied or attacked, companies often deliver them as black boxes accessible only through APIs, which must be secure, robust, and reliable across different problem domains. However, recent studies have shown that state-of-the-art CNNs have vulnerabilities: simple perturbations in input images can change the model’s response, and even images unrecognizable to humans can be classified by the model with a high level of confidence. These methods need access to the model parameters, but there are studies showing how to generate a copy (imitation) of a model using its probabilities (soft-labels) and problem domain data. By using such a surrogate model, an adversary can attack the target model with a higher chance of success. We further explored these vulnerabilities.
Our hypothesis is that by using publicly available images (accessible to everyone) and responses that any model should provide (even black boxes), it is possible to copy a model while achieving high performance. Therefore, we proposed a method called Copycat to exploit CNN classification models. Our main goal is to copy the model in two stages: first, by querying it with random natural images, such as those from ImageNet, and annotating each image with the class of maximum probability (hard-labels); second, by using these labeled images to train a Copycat model that should achieve performance similar to that of the target model. We evaluated this hypothesis on seven real-world problems and against a cloud-based API. All Copycat models achieved performance (F1-Score) above 96.4% relative to the target models. After achieving these results, we performed several experiments to consolidate and evaluate our method. Furthermore, concerned about this vulnerability, we also analyzed various existing defenses against the Copycat method. In these experiments, defenses that detect attack queries did not work against our method, but defenses that use watermarking could identify the target model’s Intellectual Property. Thus, the method proved to be effective in model extraction: it was immune to the query-detection defenses from the literature and was identified only by watermark defenses.

Keywords: Deep learning. Convolutional neural network. Stealing network knowledge. Knowledge distillation. Model extraction. Model Stealing. Model compression.

List of Figures

Figure 1 – Examples of smart devices and APIs
Figure 2 – Illustration of a Convolutional Neural Network
Figure 3 – Samples of adversarial examples
Figure 4 – Samples of adversarial examples
Figure 5 – Images that are unrecognizable to humans but reach high confidence in state-of-the-art Deep Neural Networks (DNNs)
Figure 6 – An example of 2-D convolution
Figure 7 – An example of max-pooling
Figure 8 – Architecture of LeNet-5
Figure 9 – AlexNet Architecture
Figure 10 – VGG-16 Architecture
Figure 11 – Example of a residual block for ResNet network
Figure 12 – Example of the LRP method
Figure 13 – Example of LRP method composite strategy
Figure 14 – Watermark example
Figure 15 – Illustration of the behavior of a watermarked image in a clean model and in a watermarked model
Figure 16 – Overview of the Copycat creation
Figure 17 – Network architecture of an illustrative example of Copycat
Figure 18 – Training dataset (ODD) of illustrative example of Copycat
Figure 19 – Attack dataset (NPDD) of illustrative example of Copycat
Figure 20 – Feature map and classification space of the Oracle in the illustrative example of Copycat
Figure 21 – Several runs of the illustrative example of Copycat
Figure 22 – Illustrative Block Diagram of Adaptive Misinformation
Figure 23 – PRADA’s Algorithm
Figure 24 – Frequency of neuron activations in a small network
Figure 25 – Three random samples of watermarked images for CIFAR10, MNIST, and FashionMNIST
Figure 26 – t-SNE mapping of the ODD and NPDD-SL points to the classification space of the Oracle for the DIG10 and FER7 problems
Figure 27 – Relative F1-Score of the Copycats
Figure 28 – Distribution of labels after querying the Oracle
Figure 29 – Data curve performance of Copycats CC-VGG-NPDD-SL
Figure 30 – Performances of Copycat CC-VGG-NPDD-SL using the Framework
Figure 31 – Relative F1-Score of the Oracles on the seven problems
Figure 32 – Relative F1-Score of Copycats (100k) over the Oracle
Figure 33 – Relative F1-Score of Copycats (500k images) over the Oracle
Figure 34 – Heatmaps generated with LRP for three problems
Figure 35 – PDD heatmaps generated with LRP on Copycat Framework
Figure 36 – NPDD heatmaps generated with LRP on Copycat Framework
Figure 37 – Pearson correlation distribution of PDD heatmaps on Copycat Framework
Figure 38 – Pearson correlation distribution of PDD heatmaps on Copycat Framework
Figure 39 – Pearson correlation distribution of NPDD heatmaps on Copycat Framework
Figure 40 – Pearson correlation distribution of NPDD heatmaps on Copycat Framework
Figure 41 – Performance of Copycats over Azure API
Figure 42 – Threshold selection for ADMIS
Figure 43 – ADMIS: Oracle results
Figure 44 – ADMIS: results of ADMIS attack datasets
Figure 45 – ADMIS CIFAR10: images labels per class
Figure 46 – ADMIS Flowers17: images labels per class
Figure 47 – ADMIS MNIST: images labels per class
Figure 48 – ADMIS FashionMNIST: images labels per class
Figure 49 – ADMIS: Copycat - NPDD of 100k queries
Figure 50 – ADMIS: Copycat - NPDD of 300k queries
Figure 51 – ADMIS: Copycat - NPDD of 500k queries
Figure 52 – ADMIS: CIFAR10 Oracles results
Figure 53 – ADMIS: results of ADMIS attack datasets for CIFAR10 experiments
Figure 54 – ADMIS: Copycat - NPDD of 100k queries for CIFAR10 experiments
Figure 55 – ADMIS: Copycat - NPDD of 300k queries for CIFAR10 experiments
Figure 56 – ADMIS: Copycat - NPDD of 500k queries for CIFAR10 experiments
Figure 57 – PRADA: MNIST, Small Architecture. False Positive and Detection rates for statistical test value
Figure 58 – PRADA: VGG16 Model, MNIST dataset. False Positive and Detection Rates for statistical test value
Figure 59 – PRADA: False Positive Rate for statistical test value
Figure 60 – PRADA: False Positive Rate for statistical test value
Figure 61 – PRADA: False Positive Rate for statistical test value
Figure 62 – EWE: Oracles results
Figure 63 – EWE: Copycat results
Figure 64 – EWE: difference of label distribution between attacks on watermarked and clean models
Figure 65 – PDD heatmaps generated with LRP on Copycat Framework for ACT101
Figure 66 – PDD heatmaps generated with LRP on Copycat Framework for DIG10
Figure 67 – PDD heatmaps generated with LRP on Copycat Framework for FER7
Figure 68 – PDD heatmaps generated with LRP on Copycat Framework for GOC9
Figure 69 – PDD heatmaps generated with LRP on Copycat Framework for PED2
Figure 70 – PDD heatmaps generated with LRP on Copycat Framework for SHN10
Figure 71 – PDD heatmaps generated with LRP on Copycat Framework for SIG30
Figure 72 – NPDD heatmaps generated with LRP on Copycat Framework for ACT101
Figure 73 – NPDD heatmaps generated with LRP on Copycat Framework for DIG10
Figure 74 – NPDD heatmaps generated with LRP on Copycat Framework for FER7
Figure 75 – NPDD heatmaps generated with LRP on Copycat Framework for GOC9
Figure 76 – NPDD heatmaps generated with LRP on Copycat Framework for PED2
Figure 77 – NPDD heatmaps generated with LRP on Copycat Framework for SHN10
Figure 78 – NPDD heatmaps generated with LRP on Copycat Framework for SIG30
Figure 79 – Confusion Matrices for ACT101
Figure 80 – Confusion Matrices for DIG10
Figure 81 – Confusion Matrices for FER7
Figure 82 – Confusion Matrices for GOC9
Figure 83 – Confusion Matrices for PED2
Figure 84 – Confusion Matrices for SHN10
Figure 85 – Confusion Matrices for SIG30

List of Tables

Table 1 – LRP rules and usage suggestions
Table 2 – Comparison between related works and Copycat
Table 3 – Details of the problems, their respective datasets, and the number of images in each domain split
Table 4 – Datasets and architectures used in experiments
Table 5 – Number of images per class of the Flowers17 PDD
Table 6 – PRADA’s target models architecture
Table 7 – Comparison of F1-Scores for Oracle and Baseline models, and Copycats Performance relative to them
Table 8 – Performance of the Copycat using AlexNet architecture
Table 9 – Analysis of APIs costs
Table 10 – PRADA: MNIST dataset trained on the Small architecture
Table 11 – PRADA: Results for MNIST Dataset trained on VGG-16
Table 12 – PRADA: GTSRB Dataset trained on the Small architecture
Table 13 – PRADA: GTSRB Dataset trained on VGG-16
Table 14 – PRADA: GOC9 problem trained on VGG-16
Table 15 – EWE: Watermarked success rates of the Oracles
Table 16 – EWE: Watermarked success rates of the Copycat models
Table 17 – EWE: Watermarked success rates of the Copycat models after fine-tuning process
Table 18 – Classification reports for ACT101
Table 19 – Classification reports for DIG10
Table 20 – Classification reports for FER7
Table 21 – Classification reports for GOC9
Table 22 – Classification reports for PED2
Table 23 – Classification reports for SHN10
Table 24 – Classification reports for SIG30

Acronyms and Abbreviations

General:
API – Application Programming Interface
CNN – Convolutional Neural Network
DNN – Deep Neural Network
DL – Deep Learning
FGSM – Fast Gradient Sign Method
FEA – Feature Extraction Algorithm
GAN – Generative Adversarial Network
ILSVRC – ImageNet Large Scale Visual Recognition Challenge
IP – Intellectual Property
JBDA – Jacobian-Based Dataset Augmentation
LRP – Layer-wise Relevance Propagation
MSP – Maximum Softmax Probabilities
ML – Machine Learning
MLP – Multi Layer Perceptron
MLaaS – Machine Learning as a Service
RBF – Radial Basis Function
RL – Representation Learning
SGD – Stochastic Gradient Descent
TN – Target Network
SNNS – Soft Nearest Neighbor Loss
t-SNE – t-Distributed Stochastic Neighbor Embedding
XAI – Explainable Artificial Intelligence

Problems:
ACT101 – Human Action Classification Problem, 101 categories
DIG10 – Handwritten Digit Classification, 10 categories
FER7 – Facial Expression Recognition, 7 categories
GOC9 – General Object Detection, 9 categories
PED2 – Pedestrian Detection, 2 categories
SHN10 – Street View House Number Classification, 10 categories
SIG30 – Traffic Sign Classification, 30 categories

Models:
BL – Baseline model
BL-Alex-* – Baseline model generated on AlexNet Architecture
BL-VGG-* – Baseline model generated on VGG-16 Architecture
CC – Copycat model
CC-Alex-* – Copycat model generated on AlexNet Architecture
CC-VGG-* – Copycat model generated on VGG-16 Architecture

Datasets:
ODD – Original Domain Dataset
ODD-OL – Original Domain Dataset with Original Labels
PDD – Problem Domain Dataset
PDD-OL – Problem Domain Dataset with Original Labels
PDD-SL – Problem Domain Dataset with Stolen Labels
TDD – Test Domain Dataset
TDD-OL – Test Domain Dataset with Original Labels
NPDD – Non-Problem Domain Dataset
NPDD-SL – Non-Problem Domain Dataset with Stolen Labels
NPDD+PDD – Non-Problem Domain Dataset joined to Problem Domain Dataset
NPDD+PDD-SL – Non-Problem Domain Dataset joined to Problem Domain Dataset, both with Stolen Labels
OL – Original Labels
*-OL – Suffix of Original Labels Dataset
SL – Stolen Labels
*-SL – Suffix of Stolen Labels Dataset

Contents

1 Introduction
  1.1 Model extraction and attacks
  1.2 Hypothesis
  1.3 Objectives
  1.4 Publications
  1.5 Outline
2 Theoretical Background
  2.1 Convolutional Neural Networks and Large Public Databases
  2.2 Analysis of CNNs
3 Related Works
  3.1 Model extraction and attacks
  3.2 Defense methods
4 Copycat Convolutional Neural Network
  4.1 Attack formulation
  4.2 Copycat Model Training
  4.3 Simplified example of Copycat method
5 Experimental Methodology
  5.1 Datasets Organization and Baselines Setup
    5.1.1 Datasets for Baselines and Tests
    5.1.2 Datasets for Generating the Copycats
  5.2 Investigated Problems
    5.2.1 Human Action Recognition – ACT101
    5.2.2 Handwritten Digits Classification – DIG10
    5.2.3 Facial Expression Recognition – FER7
    5.2.4 General-Object Classification – GOC9
    5.2.5 Pedestrian Classification – PED2
    5.2.6 Street View House Numbers Classification – SHN10
    5.2.7 Traffic Signs Classification – SIG30
  5.3 Used Architectures
  5.4 Metric
  5.5 General Setup
  5.6 Model Extraction Experiments
    5.6.1 Analysis of Datasets Distributions in the Classification Space of the Black-Box Model
    5.6.2 Copycat from the Same Architecture
    5.6.3 Analysis of the Relationship Between Number of Queries and Copycat Performance
    5.6.4 Copycat from a Different Architecture
    5.6.5 Robustness of the Copycat Model
    5.6.6 Analysis of the Attention-Region in the Input Images
    5.6.7 Analysis of Attack Viability and APIs Costs
  5.7 Defense Experiments
    5.7.1 ADMIS: Adaptive Misinformation
    5.7.2 PRADA
    5.7.3 EWE: Entangled Watermarks
6 Experimental Results
  6.1 Model Extraction
    6.1.1 Analysis of Datasets Distributions in the Classification Space of the Black-Box Model
    6.1.2 Copycat from the Same Architecture
    6.1.3 Analysis of the Relationship Between Number of Queries and Copycat Performance
    6.1.4 Copycat from a Different Architecture
    6.1.5 Robustness of the Copycat Model
    6.1.6 Analysis of the Attention-Region in the Input Images
    6.1.7 Analysis of Attack Viability and APIs Costs
  6.2 Defenses
    6.2.1 ADMIS: Adaptive Misinformation
      6.2.1.1 Copycat with the proposed datasets
      6.2.1.2 Copycat with the usual NPDD dataset
      6.2.1.3 Copycat of CIFAR10 using only out-of-distribution data
    6.2.2 PRADA
      6.2.2.1 MNIST on Small Architecture
      6.2.2.2 MNIST on VGG16 Architecture
      6.2.2.3 GTSRB on Small Architecture
      6.2.2.4 GTSRB on VGG16 Architecture
      6.2.2.5 GOC9
      6.2.2.6 Discussion
    6.2.3 EWE: Entangled Watermarks
  6.3 Limitations
7 Conclusion
Bibliography
Appendix
APPENDIX A Code for running the simple example of the Copycat Method
APPENDIX B LRP Heatmaps of TDD
APPENDIX C LRP Heatmaps of NPDD
APPENDIX D Confusion Matrices and Classification Reports
  D.1 ACT101
  D.2 DIG10
  D.3 FER7
  D.4 GOC9
  D.5 PED2
  D.6 SHN10
  D.7 SIG30

1 Introduction

Currently, there are several smart tools and robots that use Machine Learning (ML) techniques at their core. These products, such as Echo with Alexa [1a], Google Home, Nest or Android with Google Assistant [1b], and HomePod with Siri [1c] (Figure 1), can process and respond to commands and requests in natural language, recognize persons and objects, and perform other tasks, which makes them convenient for users. Additionally, many of these products have access to cloud-based APIs, which allow integration with other platforms and ML services (Machine Learning as a Service – MLaaS). These MLaaS, such as Microsoft Cognitive Services Computer Vision API [1d], Google Cloud Vision API [1e], and IBM Watson Visual Recognition [1f] (Figure 1), provide access to powerful machine learning models and algorithms, which can extract useful information from the user data to process the request.
Among these systems, there are the Representation Learning (RL) methods, which are built to also learn representations of the data, thus simplifying the extraction of useful information for classifiers and predictors (BENGIO; COURVILLE; VINCENT, 2013).

Figure 1 – Examples of smart devices on the left and APIs on the right (image sources: i, ii, iii, iv, v, vi, vii, viii – source links are available in the digital version of this document).

Image sources for Figure 1:
i. https://upload.wikimedia.org/wikipedia/commons/0/0f/Android_phones.jpg
ii. https://images-na.ssl-images-amazon.com/images/I/61bEzxKzZIL._AC_SL1000_.jpg
iii. https://upload.wikimedia.org/wikipedia/commons/4/49/Google_Home_with_Home_Hub_and_Home_Mini_on_table.jpg
iv. https://www.apple.com/newsroom/images/product/homepod/standard/Apple_homepod-mini-white-10132020_big.jpg.large.jpg
v. https://miro.medium.com/max/700/1*tkAiRWvYmAi_RuRhRYgeiQ.jpeg
vi. https://upload.wikimedia.org/wikipedia/commons/a/a8/Microsoft_Azure_Logo.svg
vii. https://connectoricons-prod.azureedge.net/releases/v1.0.1444/1.0.1444.2347/cognitiveservicescomputervision/icon.png
viii. https://freepngimg.com/download/logo/73271-ibm-dachlawinen-text-kunststoff-watson-schwarz-vorsicht.png

An important method in RL is a DNN [2] called Convolutional Neural Network (CNN) (LECUN et al., 1989b). It has one or more convolutional layers that use a mathematical operation called convolution to analyze the input data. A Deep CNN is usually made up of multiple layers (Figure 2), starting with convolutional and pooling layers and ending with fully connected layers [3]. The first layers identify and extract features in the input data, such as edges, shapes, and textures, and provide these feature maps to the next layers. Then, the fully connected layers make the final prediction using those features.

[1] a, b, c, d, e, f. Source links are available in the digital version of this document:
a. https://developer.amazon.com/en-US/alexa
b. https://developers.google.com/assistant
c. https://developer.apple.com/siri/
d. https://azure.microsoft.com/en-us/products/cognitive-services/computer-vision
e. https://cloud.google.com/vision
f. https://www.ibm.com/en/watson
[2] A Deep Neural Network is a multilayer feedforward network. Unlike shallow networks, which have only a few layers, its depth comes from the total length of its chain of layers, hence the name DNN (GOODFELLOW; BENGIO; COURVILLE, 2016).

Given the state-of-the-art performance achieved by CNNs on a variety of problems, these networks have powered many neural-based systems that recognize facial expressions, traffic signs, product names, objects, and speech, and that perform several other tasks. Consequently, many companies are investing large amounts of money to build products with CNNs. However, there are at least three high costs associated with this: i. acquiring and annotating large-scale training datasets; ii. computational power for training the models [4], which can last for days or even months; and iii.
experts to prepare the data and to design, implement, and train the model.

Figure 2 – Illustration of a Convolutional Neural Network for classification. In Feature Extraction, the input image is fed to the network, which uses Convolutional Layers (usually followed by an activation function such as ReLU) and Pooling Layers to extract its features and send them to the Fully Connected Layers. Then, in Classification, the final prediction ŷ is calculated by applying the Softmax function to the outputs of the last Fully Connected Layer. The values before applying the Softmax are usually called logits in the context of DNNs.

[3] Although the acronym MLP (Multilayer Perceptron) refers to a Neural Network with fully connected layers, the term Fully Connected Layers is more popular in the Deep Learning area and is widely used to refer to the final layers of a Deep Neural Network.
[4] An ML model is an architecture, such as a CNN, with its parameters already adjusted by a training process.

Therefore, due to the resources and money invested in creating these models, it is in the best interests of these companies to protect their models' Intellectual Property (IP) (i.e., model parameters, patented algorithms, unique datasets, and other copyrighted assets) against attacks or copies. A company that lacks resources but needs a more accurate model, for example, can gain a competitive advantage by copying a competitor's model. Alternatively, an adversary may want a copy [5] of the target model in order to access its parameters for other purposes, such as evading malware and spam detection or misleading autonomous navigation systems (PAPERNOT; MCDANIEL; GOODFELLOW, 2016).
In the first case, the objective is to match the accuracy (proportion of correct predictions) of the target model and, in the second case, the objective is to achieve high fidelity (similarity between the models' responses) to the target model (JAGIELSKI et al., 2020).

For these and other reasons, models are usually not delivered as white-boxes (with parameters, architecture, and other details visible to users). Instead, they are usually provided as black-boxes, where it is only possible to send input data and receive either the probabilities [6] of the model (soft-labels) or the label of the predicted class, i.e., the class with the highest probability (hard-label). Examples of these black-box models are widely available, as shown at the beginning of this chapter, in smart tools, robots, or in the cloud as MLaaS Application Programming Interfaces (APIs). In this scenario, note that a user who owns a robot can query it for free an unlimited number of times, whereas a user of an MLaaS must pay for each query.

Users and developers expect these models to provide accurate and robust generalizations for new queries. For instance, in an object recognition model, new images in the problem domain should be correctly identified, and small perturbations in the input images should not change the object's classification. This is exactly where an attack can occur and the model knowledge can be extracted to produce a surrogate (substitute) model. This process (attack) is called model extraction.

1.1 Model extraction and attacks

Model extraction consists of labeling a dataset with the target model responses (soft-labels or hard-labels) and using this dataset to train a surrogate (substitute) model. Some studies in this area worked on extracting knowledge from a larger model to a smaller one, with fewer parameters, calling this process model compression.
Bucilă, Caruana and Niculescu-Mizil (2006) presented a method for compressing large, complex ensembles (sets of smaller models combined to achieve performance equivalent to that of a larger, more complex model) into smaller models without significant loss of performance. In addition, Ba and Caruana (2014) explored transferring a deep neural network to a shallow neural network. The dataset for training the substitute model was composed of images from the problem domain labeled with the logits [7] of the target model.

[5] The term copy will be used in the sense of imitating a model and not in the sense of producing an exact copy of it.
[6] The term "probabilities" is commonly used in the literature to refer to the network output after applying the Softmax operation.

Based on these studies of model compression, Hinton, Vinyals and Dean (2015) observed that the outputs (probabilities after the Softmax function) of a model provide important information about how it generalizes over the input data. So they proposed a method called knowledge distillation, in which, instead of using logits, they modified the Softmax function (Equation 3.2). Their intention was to smooth out the model's output probabilities and provide better information about its internal knowledge. For this type of model extraction, they also used original and synthetic data from the problem domain.

Other important works investigated ways to attack machine learning models. Szegedy et al. (2014) explored some vulnerabilities in neural networks and showed that, using a simple optimization procedure, it is possible to find adversarial examples, i.e., by adding small perturbations to an image already classified correctly by the neural network, it is possible to generate a new image that the network is no longer able to classify correctly (Figure 3).
Later, Goodfellow, Shlens and Szegedy (2015) proposed the fast gradient sign method, which uses the sign of the cost function's gradient to create adversarial examples against DNNs (Figure 4). Afterwards, Papernot et al. (2016) used forward derivatives to build adversarial saliency maps [8] and craft adversarial samples against the DNN model. These methods need access to the model parameters (white-box) and also need problem domain data.

After these works, Papernot et al. (2017) formulated a new strategy capable of crafting adversarial examples against black-box models provided as MLaaS. The first step is to extract the target model using problem domain images. Then, this surrogate model is used to generate the adversarial examples using the two methods cited in the previous paragraph. At each generation of new examples, the target model is queried, generating new labels that are used to improve the performance of the substitute model. However, the objective was not the model extraction itself, but only to generate a surrogate model to craft adversarial examples (Figure 4).

On the other hand, Nguyen, Yosinski and Clune (2015) studied the vulnerability presented by Szegedy et al. (2014) and found that it is easy to generate images that are completely unrecognizable to humans, but that DNNs recognize with high confidence (Figure 5). They used evolutionary algorithms and gradient ascent to find images classified with a high degree of confidence [9] as belonging to each class in the tested DNN models. Additionally, the authors also tried to avoid this misclassification by adding a garbage class, but without success.

[7] The term logits is commonly used to refer to the raw outputs of the network before the application of the Softmax function; more details in Section 2.1.
[8] The saliency map is a measure of how much each input pixel contributes to the final predictions of the network.
[9] High confidence means that the highest probability of the model is a value close to 100%.

Figure 3 – Adversarial examples generated in (SZEGEDY et al., 2014). In (a) and (b), the left images are the correctly predicted samples, the center images are the difference between the correctly and the incorrectly predicted images magnified by 10x (values shifted by 128 and clamped), and the right images are the adversarial examples classified as "ostrich, Struthio camelus". In (c), results are presented for a binary car classifier. The images on the left are recognized as cars and the images on the right are not recognized. The center images are the magnified absolute value of the difference between the two images.

[Figure 4 panels: $x$, "panda", 57.7% confidence; plus $.007 \times \mathrm{sgn}(\nabla_x J(\theta, x, y))$, "nematode", 8.2% confidence; equals $x + \lambda\,\mathrm{sgn}(\nabla_x J(\theta, x, y))$, "gibbon", 99.3% confidence.]

Figure 4 – A demonstration of an adversarial example generated by the fast gradient sign method. Let $\theta$ be the parameters of a model, $x$ the input to the model, $y$ the desired prediction of $x$, and $J(\theta, x, y)$ the cost used to train the neural network. By adding the imperceptibly small vector (center image) provided by the sign of the gradient of the cost function with respect to the input, the network classification was changed to "gibbon". The value $\lambda = .007$ corresponds to the magnitude of the smallest bit of an 8-bit image encoding after model conversion to real numbers (GOODFELLOW; SHLENS; SZEGEDY, 2015).

Figure 5 – Images provided by (NGUYEN; YOSINSKI; CLUNE, 2015) that are unrecognizable to humans, but that deceived state-of-the-art DNNs with high confidence. (a) Images obtained by evolutionary algorithms. (b) Images obtained by gradient ascent.

In parallel with these works, Deng et al. (2009) released a large publicly available image dataset called ImageNet, which provided 3.2 million images to be used in image recognition systems. The following year, Russakovsky et al.
(2015a) started the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which has become one of the most important competitions in the area. Later, another large dataset was released, Microsoft COCO (LIN et al., 2014), with 2.5 million labeled instances in 328 thousand images.

Note that even with the existence of these datasets, which have millions of natural images of many different classes, the related works used only original and synthetic data from the problem domain. Additionally, most of them used the target model as a white-box, i.e., accessing its parameters, instead of using it as a black-box.

1.2 Hypothesis

As presented so far, the cited works have shown that the information provided in the responses of the target models allows extracting their knowledge, i.e., they showed that model extraction attacks can be performed through logits, soft-labels, and hard-labels to generate surrogate models. These works also showed that the information of the target models can be explored using synthetic data, but for that purpose it was necessary to have data from the problem domain of the attacked model. Moreover, they showed that even images not recognizable by humans can yield high-confidence responses from DNN models. Additionally, these works suggest that such images belong to the feature and/or classification space of the model and can provide important information about it. Thereby, our hypothesis is that:

Model extraction can be performed using random natural images provided in large public image datasets (such as ImageNet and Microsoft COCO) and only the hard-labels of the model, even if the images are not from the problem domain of the attacked model. Furthermore, we believe that the generated surrogate model can still be fine-tuned and achieve better accuracy and fidelity when the adversary has data from the problem domain.
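To make the hypothesis more concrete, the sketch below illustrates the general idea on a toy problem. It is not the implementation used in this work: 2-D points stand in for images, a fixed linear scorer stands in for the black-box CNN, and a softmax regression stands in for the surrogate network. All names (`oracle_hard_label`, `train_surrogate`, `fidelity`) and all numeric settings are assumptions made only for illustration.

```python
import math
import random

random.seed(0)

def oracle_hard_label(x):
    """Hypothetical black-box: exposes only the index of the top-scoring class."""
    scores = [2.0 * x[0], -x[0] + 1.5 * x[1], -x[0] - 1.5 * x[1]]
    return max(range(3), key=lambda c: scores[c])

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(w, x):
    # w[c] holds [weight_0, weight_1, bias] for class c
    return [wc[0] * x[0] + wc[1] * x[1] + wc[2] for wc in w]

def predict(w, x):
    z = logits(w, x)
    return max(range(len(z)), key=lambda c: z[c])

def train_surrogate(data, n_classes=3, lr=0.5, steps=200):
    """Softmax regression trained by full-batch gradient descent on hard-labels."""
    w = [[0.0, 0.0, 0.0] for _ in range(n_classes)]
    n = len(data)
    for _ in range(steps):
        grad = [[0.0, 0.0, 0.0] for _ in range(n_classes)]
        for x, label in data:
            p = softmax(logits(w, x))
            for c in range(n_classes):
                err = p[c] - (1.0 if c == label else 0.0)
                grad[c][0] += err * x[0]
                grad[c][1] += err * x[1]
                grad[c][2] += err
        for c in range(n_classes):
            for k in range(3):
                w[c][k] -= lr * grad[c][k] / n
    return w

# 1) Query the black-box with unlabeled inputs drawn from a distribution
#    that differs from the (assumed) problem domain.
queries = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(500)]
fake_dataset = [(x, oracle_hard_label(x)) for x in queries]

# 2) Train the surrogate using only the stolen hard-labels.
surrogate = train_surrogate(fake_dataset)

# 3) Measure fidelity (agreement with the black-box) on problem-domain inputs.
domain = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
fidelity = sum(predict(surrogate, x) == oracle_hard_label(x) for x in domain) / len(domain)
print(f"fidelity on problem-domain inputs: {fidelity:.2f}")
```

In this toy setting the surrogate never sees a problem-domain sample or a probability vector, only the class index returned for each query. Whether the same behavior holds for deep CNNs and natural images is precisely what the hypothesis states and what the experiments in this work evaluate.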
We limited the scope of this research to image classification using CNN models.

1.3 Objectives

To verify our hypothesis, our main objective was to propose a model extraction method, called Copycat, and to test it on CNNs for image classification. In the Copycat method [10], the target model, i.e., the Oracle $f(\cdot)$, is queried with random non-labeled images acquired from large public datasets, such as ImageNet or Microsoft COCO. The pairs of images and their hard-labels provided by the Oracle are used to generate a fake dataset $\mathrm{NPDD} = \{(x_1, \ell_{x_1}), \ldots, (x_n, \ell_{x_n})\}$, where $n = |\mathrm{NPDD}|$, $x$ is the image, and $\ell_x = \arg\max_p f(x)[p]$ is the hard-label provided by the Oracle. Then the surrogate model, named Copycat model, is trained with NPDD. Afterwards, a problem domain dataset is labeled by the Oracle to generate the second dataset $\mathrm{PDD} = \{(x_1, \ell_{x_1}), \ldots, (x_m, \ell_{x_m})\}$, where $m = |\mathrm{PDD}|$, which is used to fine-tune the Copycat model.

[10] The reader can access an interactive visualization of a model extraction attack with the Copycat method at https://jeiks.github.io/copycat-cnn-explainer/

To achieve the main objective, we explored two major topics in this study: attacks with the Copycat method and defenses against it. Our secondary objectives for evaluating the attacks with the Copycat method were:
• analysis of model extraction with the Copycat method on seven different problems and a real-world MLaaS API;
• verification of the capabilities, limitations, and robustness of the Copycat method;
• quantitative and qualitative comparison between Oracles and Copycat models.

And our secondary objectives for evaluating the defenses against the Copycat method were:
• analysis of methods that detect out-of-distribution queries; and
• analysis of a watermark method that protects the model's IP.

1.4 Publications

Up to the present time, the development of this project has contributed to the literature in significant ways.
The contributions were published in two articles:

i. Correia-Silva, J. R.; Berriel, R. F.; Badue, C.; De Souza, A. F.; Oliveira-Santos, T. Copycat CNN: Stealing Knowledge by Persuading Confession with Random Non-Labeled Data. In: International Joint Conference on Neural Networks (IJCNN). 2018. p. 1–8. (Qualis A1).
ii. Correia-Silva, J. R.; Berriel, R. F.; Badue, C.; De Souza, A. F.; Oliveira-Santos, T. Copycat CNN: Are random non-Labeled data enough to steal knowledge from black-box models? Pattern Recognition, v. 113, p. 107830, 2021. ISSN 0031-3203. (Qualis A1).

The first work (Correia-Silva et al., 2018) presented the Copycat method, the results on three initial problems (facial expression, object, and crosswalk classification), and the model extraction results against the Microsoft Azure Cognitive Services [11]. To the best of our knowledge, this was the first time that black-box model extraction of a deep CNN was investigated using out-of-distribution images obtained from public datasets and labeled only with hard-labels.

The second work (Correia-Silva et al., 2021) consolidated the Copycat method with an extensive evaluation study aimed at providing a better understanding of the behavior of the Copycat model in black-box attacks. For this purpose, it comprised the following experiments: (i) extraction with less information about the target black-box model; (ii) feature space analysis of out-of-distribution images; (iii) analysis of the number of queries needed to extract a model; (iv) model extraction for a slightly different but small architecture; (v) robustness of the Copycat method; (vi) comparison between Oracle and surrogate model based on their inputs; (vii) discussion of the attack cost of a real extraction on an MLaaS.

The results and analysis regarding the defenses against the Copycat method have not yet been published.
[11] The Microsoft Azure Cognitive Services: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/emotion-api/

1.5 Outline

The remainder of the text is organized as follows:
• The theoretical background is described in Chapter 2, covering Convolutional Neural Networks, large datasets, and methods for analyzing neural networks.
• Chapter 3 presents an overview of the related works, covering techniques for model extraction and attacks, as well as defense approaches against such attacks.
• In Chapter 4, the Copycat method formulation is presented, followed by a discussion of relevant constraints. Additionally, an illustrative example of a simple model extraction using the Copycat method is provided.
• Chapter 5 explains our experimental methodology, starting with details about the datasets, the investigated problems and architectures, and the metrics. Next, the model extraction experiments using the Copycat method are described. These experiments include: (i) analysis of dataset distributions in the black-box model's classification space, (ii) application of the Copycat method on the same architecture to several different problems, (iii) analysis of the number of images needed to successfully attack the proposed problems with the Copycat method, (iv) application of the Copycat method in a slightly different and small architecture, (v) verification of the reproducibility and robustness of the Copycat model, (vi) comparison of the target model and the Copycat model in relation to the input image relevance pixels, and (vii) analysis of a Copycat attack in a real-world API and discussion of the viability and costs of such attacks. Finally, three defense approaches explored in our work are presented.
• Chapter 6 presents the results of the experiments described in Chapter 5 and also discusses the limitations of our method.
• Finally, Chapter 7 draws the concluding remarks and presents the future directions of our research.

2 Theoretical Background

This chapter presents the theoretical concepts necessary to contextualize the scope of our work. The objective is to provide base knowledge on the subject, presenting the concepts that underlie the Copycat method. This chapter is divided into two sections. As our method extracts CNN models using large image databases, the first section describes CNNs and the related model architectures and databases. Then, as we performed a qualitative analysis of the surrogate models, the second section presents the respective methods used in our work. Additionally, in the next chapter, we provide an overview of related works and compare them with our proposed method.

2.1 Convolutional Neural Networks and Large Public Databases

According to Goodfellow, Bengio and Courville (2016), computer programs have traditionally been used to solve problems that can be described using a set of formal mathematical rules. Thus, these programs rely on pre-coded knowledge within their code. However, given the difficulty of coding them for more complex problems (such as high-dimensional problems), researchers sought to give programs the ability to acquire knowledge on their own by identifying patterns in the data they receive. This ability is commonly called Machine Learning. The use of these methods allows computers to solve more complex problems and find solutions that seem subjective.

However, the performance of these machine learning algorithms depends on the representation of the data provided to them. Thus, manual work is usually required to provide the pieces of information contained in the representation, called features, so that these programs can produce useful responses. Making these problems harder still, in some cases ideal results are achieved only with specific features, which are often difficult to obtain or identify.
In these cases, a solution is for machine learning algorithms to learn not only how to map a ready-made representation to an answer, but also the representation of the problem itself. This field is known as Representation Learning, and it is used in ML to provide better performance than previous methods with hand-designed representations. In this field, the initial challenge is to extract enough features from the raw data. For this purpose, one adopted solution is to create a deep hierarchy capable of extracting simple features and combining them until reaching complicated concepts. This method is commonly referred to as Deep Learning (DL), which has achieved human-like abilities to recognize objects or speech.

An example of a DL method is a kind of feedforward deep network, or Multilayer Perceptron (MLP), which is a mathematical function that maps input values to output values. This function is formed by the composition of several simpler functions, each of which provides a new representation of the input data. An MLP can be a shallow network, when it is formed by few layers, or it can contain many layers, forming a Deep Neural Network. However, an MLP has some limitations in image processing. Let an image $X_{m,n}$ be a matrix of pixels composed of $m$ rows and $n$ columns:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \tag{2.1}$$

where $x_{ij}$ represents the pixel value in row $i$ and column $j$. In an MLP, this matrix is flattened into a vector of numbers, $[x_{11}, \ldots, x_{1n}, x_{21}, \ldots, x_{2n}, \ldots, x_{m1}, \ldots, x_{mn}]$. This format does not fully capture the spatial relationships between pixels in an image; for example, vertically adjacent pixels end up $n$ positions apart in the vector.

Unlike the MLP, the Convolutional Neural Network (LECUN; BENGIO, 1995) is a DNN that can treat the image as a two-dimensional matrix in $\mathbb{R}^{m \times n}$ (considering only one channel, as in a grayscale image) instead of a one-dimensional vector in $\mathbb{R}^{mn}$.
CNNs exploit the spatial structure of the objects in the image, seeking to learn useful representations. This type of network follows two principles: (i) translation invariance (or translation equivariance): the earliest layers of the network must extract from the image the same characteristics for a patch of interest, regardless of where it appears in the image; and (ii) locality: the earliest layers of the network must extract features from local regions, without taking into account distant regions of the image. Eventually, these simple representations can be aggregated to form concepts of the whole picture. Following these principles, deeper layers should be able to capture longer-range features of the image (ZHANG et al., 2021).

A CNN can be made up of multiple convolutional layers, pooling layers, and fully connected layers. It uses a mathematical operation called convolution to process the input data. This operation implements the principles of translation invariance and locality (for more details, see (ZHANG et al., 2021)). The discrete convolution function is:

$$(K * X)(i, j) = \sum_{k} \sum_{l} X(i - k, j - l)\, K(k, l) \tag{2.2}$$

where $K$ is the kernel (or filter) and $k$ and $l$ index its rows and columns. However, to avoid flipping the kernel, which the convolution implicitly performs when multiplying the image by the kernel, libraries often implement the cross-correlation instead of the convolution (GOODFELLOW; BENGIO; COURVILLE, 2016):

$$(K * X)(i, j) = \sum_{k} \sum_{l} X(i + k, j + l)\, K(k, l) \tag{2.3}$$

In ML, the algorithm learns the appropriate values of the kernel. With cross-correlation, it simply learns a kernel that is flipped relative to the one it would learn with convolution, which does not change the final result of the network (GOODFELLOW; BENGIO; COURVILLE, 2016). The output of the convolutional layer is called a feature map, as it can be regarded as the learned representations (features) in the spatial dimensions for the subsequent layer.
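The difference between Equations 2.2 and 2.3 can be checked numerically. The following is a minimal sketch in plain Python, assuming no padding and a stride of one (the "valid" region only); the function names are illustrative and do not come from any library. The input and kernel values are the same as those that appear in Figure 6.

```python
def cross_correlate2d(x, k):
    """Equation 2.3 on the valid region: slide k over x without flipping it."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def flip(k):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in k[::-1]]

def convolve2d(x, k):
    """Equation 2.2 on the valid region: cross-correlation with the flipped kernel."""
    return cross_correlate2d(x, flip(k))

image = [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]
kernel = [[0, 1],
          [2, 3]]

print(cross_correlate2d(image, kernel))  # [[19, 25], [37, 43]]
print(convolve2d(image, kernel))         # [[5, 11], [23, 29]]
```

The two operations give different outputs for the same kernel, but `convolve2d(image, flip(kernel))` equals `cross_correlate2d(image, kernel)`. This is why a network trained with cross-correlation simply ends up with the flipped version of the kernel it would have learned with convolution.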
An illustrative example of a cross-correlation (convolution without kernel flipping) between two 2-D matrices is presented in Figure 6. Note that the output matrix is smaller than the input matrix due to the convolution operation. To avoid this, padding pixels can be added around the input matrix. Another factor that affects the size of the output is the kernel stride (in Figure 6, the stride is one).

[Figure 6 content: Input = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]; Kernel = [[0, 1], [2, 3]]; Output (feature map) = [[19, 25], [37, 43]], where 0×0 + 1×1 + 3×2 + 4×3 = 19, 1×0 + 2×1 + 4×2 + 5×3 = 25, 3×0 + 4×1 + 6×2 + 7×3 = 37, and 4×0 + 5×1 + 7×2 + 8×3 = 43.]

Figure 6 – An example of 2-D convolution without kernel flipping (cross-correlation). The shaded portions are the first output element, as well as the input and kernel elements used for the output operation: 0 × 0 + 1 × 1 + 3 × 2 + 4 × 3 = 19.

After a convolutional layer, an activation function such as ReLU can be applied to add non-linearity to the network. Then, a pooling layer can also be used to mitigate the sensitivity of the convolutional layer to the exact location of features and to reduce the spatial resolution of the representations. The pooling layer slides fixed-size windows over the convolution output matrix with a defined stride, applying a single operation to the pixels in each window, such as their average or maximum value. An example of maximum pooling (often called max-pooling) is shown in Figure 7.

[Figure 7 content: Input = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]; 2x2 max-pooling; Output (reduced feature map) = [[4, 5], [7, 8]], where max(0, 1, 3, 4) = 4, max(1, 2, 4, 5) = 5, max(3, 4, 6, 7) = 7, and max(4, 5, 7, 8) = 8.]

Figure 7 – An example of maximum pooling (generally called max-pooling) with a window size of 2x2. The highlighted portions are the first output element, as well as the input elements used for the output operation: max(0, 1, 3, 4) = 4.

After the feature maps are generated by the previous layers, they are flattened into a single column vector and provided as inputs to the first Fully Connected layer:

$$z_j = \sigma\left(\sum_{i=1}^{n} W_{ji}\, r_i + b_j\right) \tag{2.4}$$

where $z = \{z_j\}_{j=1}^{d}$ is the output vector and $d$ is the total number of possible classes, $r = \{r_i\}_{i=1}^{n}$ is the feature vector of size $(n, 1)$, $W$ is the matrix of weights of size $(d, n)$, $b = \{b_j\}_{j=1}^{d}$ is the bias vector, and $\sigma$ is the activation function applied element-wise to produce the output vector $z$. These results can go through several Fully Connected layers, but in this example we consider only one.

Finally, these non-normalized output values produced by the network are converted into a probability distribution by the normalized exponential function Softmax:

$$\hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{d} e^{z_k}} \tag{2.5}$$

where $\hat{y}_j$ is the predicted probability of the input belonging to the $j$-th class.

The values before being converted by the Softmax function are commonly called logits. In statistics, the logit is the logarithm of the odds (log-odds). In logistic regression, the logit is the inverse of the Logistic Sigmoid function $\varphi(z) = \frac{1}{1 + e^{-z}}$, i.e., $\mathrm{logit}(\varphi(z)) = \log\left(\frac{\varphi(z)}{1 - \varphi(z)}\right) = z$ (HASTIE; TIBSHIRANI; FRIEDMAN, 2009). In DNNs, the ReLU (NAIR; HINTON, 2010) function (or a similar one) is commonly used as the activation function. Therefore, the statistical meaning of the term logits does not strictly apply in this context. However, possibly by analogy with logistic regression, the term logits is commonly used in DNNs to refer to the raw outputs of the network before the application of the Softmax function.

LeCun (1989) published the work that started the CNN design and showed that minimizing the number of free parameters (i.e., using shared weights) of the network enhances its generalization. Afterwards, LeCun et al. (1989a) introduced the first CNN, a backpropagation learning network fed directly with images to recognize handwritten zip codes. This network used groups of shared weights and convolution functions performed by feature maps.
This architecture was composed of two convolutional layers and two fully connected layers. In the next work, LeCun et al. (1989b) improved the CNN architecture using some findings of the Neocognitron, a model developed by Fukushima (1980) based on the discoveries about the visual cortex published by Hubel and Wiesel (1968). This network architecture had two convolutional layers, two subsampling layers, and one fully connected layer. Afterwards, LeCun et al. (1995) compared ML algorithms for handwritten digit recognition and proposed LeNet-1. Lastly, LeCun et al. (1998) proposed a model for recognizing handwritten digits, LeNet-5 (Figure 8). Several techniques were compared in this work and the CNN outperformed all of them. This network had three convolutional layers, two subsampling layers, one fully connected layer, and one Radial Basis Function (RBF) layer, where hand-drawn prototypes of digits were used to aid pattern recognition.

[Figure 8 content: INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: feature maps 6@14x14 (subsampling) → C3: feature maps 16@10x10 (convolutions) → S4: feature maps 16@5x5 (subsampling) → C5: layer 120 → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections).]

Figure 8 – Architecture of LeNet-5. The convolutional layers are labeled CN, the subsampling (average pooling) layers are labeled SN, and the fully connected layers are labeled FN, where N is the layer index. Source: (LECUN et al., 1998)

Over time, another important work emerged in the area. Deng et al. (2009) released a large public image dataset called ImageNet, which provided 3.2 million images to be used in image recognition systems. They believed that a large dataset was a missing resource for the development of large-scale, advanced image search algorithms and improved image analysis techniques. ImageNet was created from web images, which were manually labeled with their corresponding category.
Its organization was based on the hierarchical structure of WordNet (FELLBAUM, 1998). The ImageNet images are distributed in 1000 categories, containing mammals, birds, fish, reptiles, amphibians, vehicles, furniture, musical instruments, tools, flowers, fruits, and others. The following year, Russakovsky et al. (2015a) started the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which has become one of the most important competitions in the area. This challenge has been held annually for several years and has become a benchmark for large-scale object recognition. Several works with CNNs gained prominence by participating in this competition. In the remainder of this section, we present the most important details about the CNNs that were used in our work.

Krizhevsky, Sutskever and Hinton (2012) published AlexNet, the first large-scale network that outperformed conventional computer vision methods, winning the ILSVRC 2012 by a large margin over previous work. Its architecture is similar to LeNet, but it is deeper, having 5 convolutional layers followed by 3 fully connected layers (Figure 9). Unlike LeNet, they used ReLU (NAIR; HINTON, 2010) instead of the Hyperbolic Tangent as the activation function. Furthermore, they also used Dropout (SRIVASTAVA et al., 2014) in their network and image augmentation during network training.

Figure 9 – AlexNet Architecture with 5 convolutional layers (Conv) followed by 3 fully connected layers (FC).

Afterwards, the Visual Geometry Group at Oxford University [1] introduced the concept of blocks in their network called VGG.

[1] Page of the Visual Geometry Group at Oxford University: https://www.robots.ox.ac.uk/~vgg/
A block follows the sequence: (i) a convolutional layer with padding to maintain the resolution, (ii) a non-linear function, such as ReLU, and (iii) a pooling layer, such as max-pooling, to reduce the resolution. The main difference from AlexNet is the grouping of convolutional layers with non-linear transformations, which leave the dimensionality unchanged, followed by the resolution-reduction step. Its first network was VGG-11, which had 5 blocks. As a whole, it had 8 convolutional layers followed by 3 fully connected layers, for a total of 11 layers. Later, Simonyan and Zisserman (2015) proposed VGG-16 (Figure 10), which also had 5 blocks, but with 13 convolutional layers instead of 8. Their work showed the importance of network depth for good performance. VGG-16 won first place in object localization and second place in image classification at ILSVRC 2014.

Figure 10 – VGG-16 architecture with 13 convolutional layers (Conv) distributed in 5 VGG blocks, followed by 3 fully connected layers (FC).

The LeNet, AlexNet, and VGG networks all share a common design pattern. They extract the features through a sequence of convolutional and pooling layers and process the representations in the fully connected layers. However, different network architectures were proposed later, such as NiN (LIN; CHEN; YAN, 2014) with Network in Network blocks, GoogleNet (SZEGEDY et al., 2015) with inception blocks, and ResNet (HE et al., 2016a) with residual blocks (for more details, see (ZHANG et al., 2021)).

ResNet was a CNN that popularized residual connections and reached a depth of up to 152 layers without compromising the generalization power of the model. It won the image classification task at ILSVRC 2015 (HE et al., 2016a). The idea behind ResNet was to avoid the performance degradation observed in deeper neural networks, where adding more layers worsened their performance. For this purpose, the solution presented by He et al. (2016a) was to add layers with the identity function to the network. In the residual block, the shortcut connection skips one or more layers of the network to add the input to the block result. An illustrative example of a residual block is shown in Figure 11. In this example, $x$ represents the input, and $f(x)$ corresponds to the result of the operations performed within the dashed block. The residual block operates by adding the input $x$ to the transformed output, giving $f(x) + x$. The resulting sum is then passed through an activation function, generating the final result of the residual block. This process facilitates the learning of residual information, allowing the gradients to flow directly through the identity mapping. Residual networks and related architectures use different variations of the residual block, such as the bottleneck block (HE et al., 2016a), the pre-activation block (HE et al., 2016b), and the transformer block (RADFORD et al., 2019).

These networks and several other CNNs not mentioned here have brought remarkable results in learning representations, obtaining good results in image processing (for more details, see (GU et al., 2018; KHAN et al., 2020; ZHANG et al., 2021)).
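The residual computation described above, an activation applied to $f(x) + x$, can be sketched in a few lines. This is only a minimal illustration in plain Python, under the assumption that small dense (fully connected) layers stand in for the convolutional layers of the actual block; the names `linear`, `residual_block`, `w1`, and `w2` are hypothetical and do not come from the thesis.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Matrix-vector product; a dense stand-in for a convolutional layer."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Return activation(f(x) + x), with f = linear -> ReLU -> linear."""
    fx = linear(w2, relu(linear(w1, x)))
    # Shortcut connection: the input is added to the block result.
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With all-zero weights f(x) = 0, so the block reduces to relu(x):
# the identity path alone carries the signal.
out = residual_block(x, zeros, zeros)
```

With zero weights the block output depends only on the shortcut path, which is the property that lets additional layers behave like an identity mapping instead of degrading the signal.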
However, these networks have vulnerabilities that allow model extraction and attacks, which will be described in Chapter 3, Section 3.1. Attacks on machine learning models are often made possible by the inherent complexity and high dimensionality of these models, which often contain hidden details that are difficult for humans to understand. Therefore, the number of studies aiming to understand these models and their responses has increased in the literature. The next section describes related methods, focusing on those that were used in our work.

Figure 11 – Example of a residual block for the ResNet network.

2.2 Analysis of CNNs

One of the challenges of working with ML methods is dealing with their high dimensionality, because the training data becomes less dense in these high-dimensional spaces (ANOWAR; SADAOUI; SELIM, 2021). After passing through convolutional, pooling, and fully connected layers, the input data is transformed into a feature space with a different dimensionality, where each feature or combination of features represents a different aspect of the input. These subspaces can have varying dimensions and capture different levels of abstraction or complexity in the data. One way to interpret the behavior of certain methods is to visualize their subspaces and attempt to gain insights from them. A commonly adopted option for visualizing them is to use techniques that reduce their dimensionality. Dimensionality reduction techniques aim to reduce data complexity, improve data quality, or even provide qualitative means of data analysis.
There are two main types of dimensionality reduction: (i) feature selection, which identifies the most informative features and eliminates the rest, and (ii) feature extraction, which uses algebraic transformations to combine features, generally into fewer new features. According to Anowar, Sadaoui and Selim (2021), Feature Extraction Algorithms (FEAs) are best suited for complex and sparse real-life datasets. Moreover, FEA techniques preserve intrinsic properties or the structure of the original features.

There are several FEA techniques, but none of them is always considered the best one to use, because the choice depends on the data, their characteristics, quality, and size, and also on the purpose of use. Among them, t-Distributed Stochastic Neighbor Embedding (t-SNE) (MAATEN; HINTON, 2008) is a method that seeks to reduce high-dimensional data to low-dimensional data, usually 2-D or 3-D, with the important feature of preserving the significant structure of the original data. Given that it is used to explain and visualize data, providing an intuition of how data is organized in a high-dimensional space, it was chosen to be used in our work. Furthermore, Anowar, Sadaoui and Selim (2021) also point out that other FEA techniques are not suitable for visualizing high-dimensional data and may not preserve the data structure. In contrast, t-SNE is useful for visualizing high-dimensional data because it maintains the relationships within the data structure.

This method uses two different probability distributions to match the similarities between points in the high-dimensional space with those in the low-dimensional space. Initially, the Euclidean distance between each pair of data points is converted into conditional probabilities $P(a|b)$ using Stochastic Neighbor Embedding (SNE) (MAATEN; HINTON, 2008):

$$P(a|b) = \frac{\exp\left(-\frac{\|x_a - x_b\|^2}{2\sigma^2}\right)}{\sum_{k \neq a} \exp\left(-\frac{\|x_a - x_k\|^2}{2\sigma^2}\right)} \tag{2.6}$$

It represents the similarity between $x_a$ and $x_b$, i.e., how close $x_a$ is to $x_b$ considering a Gaussian distribution around it with a given variance $\sigma^2$. The method then uses a Student's t-distribution with one degree of freedom to obtain the second set of probabilities $Q(a|b)$ between the target pair of points $y_a$ and $y_b$ in the low-dimensional space:

$$Q(a|b) = \frac{\left(1 + \|y_a - y_b\|^2\right)^{-1}}{\sum_{k \neq a} \left(1 + \|y_a - y_k\|^2\right)^{-1}} \tag{2.7}$$

So, if the pair of high-dimensional data points is correctly mapped to the low-dimensional space, $P(a|b)$ and $Q(a|b)$ become equal. Therefore, the final objective is to minimize the difference between these two probabilities by minimizing the sum of the Kullback–Leibler (KL) divergence:

$$KL(P \,\|\, Q) = \sum_{a,b} P(a|b) \log \frac{P(a|b)}{Q(a|b)} \tag{2.8}$$

For more details on t-SNE, including how to obtain the variance $\sigma^2$ and how to minimize the KL divergence by gradient descent, see (MAATEN; HINTON, 2008).

Another way of analyzing ML methods, mainly neural networks, is through methods of eXplainable Artificial Intelligence (XAI), a term coined by DARPA (GUNNING; AHA, 2019). In the vision domain, these methods usually provide a matrix that represents the importance of each pixel of the input image in the model response. This matrix is called a heatmap, where each pixel provides information about its relevance score (contribution) in the final response of the model. Some XAI methods generate heatmaps in a deterministic way, based on the behavior of the already trained neural network, i.e., on the results of the network's internal operations on an input image. On the other hand, there are XAI methods that generate random or disturbed (original image with noise) inputs to explore the different classifications (or output probabilities) of the neural network and provide a final heatmap (ARRAS; OSMAN; SAMEK, 2022).
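To make the second family of XAI methods concrete, the sketch below builds a heatmap by disturbing one pixel at a time and recording how much the model output drops. It is a minimal illustration of the perturbation idea only, using single-pixel occlusion as an assumed, deliberately simple perturbation; it is not the method adopted in this work, and `occlusion_heatmap`, `toy_model`, and `baseline` are hypothetical names.

```python
def occlusion_heatmap(model, image, baseline=0.0):
    """Score each pixel by the drop in model output when it is occluded.

    model: callable mapping a 2-D list of floats to a scalar score.
    image: 2-D list of floats (rows of pixels).
    """
    reference = model(image)
    heatmap = []
    for r, row in enumerate(image):
        heat_row = []
        for c in range(len(row)):
            # Copy the image and disturb a single pixel.
            perturbed = [list(p) for p in image]
            perturbed[r][c] = baseline
            heat_row.append(reference - model(perturbed))
        heatmap.append(heat_row)
    return heatmap


def toy_model(image):
    """Hypothetical scorer that responds only to the top-left pixel."""
    return 2.0 * image[0][0]


image = [[1.0, 5.0], [3.0, 4.0]]
heatmap = occlusion_heatmap(toy_model, image)
```

In this toy case only the top-left entry of `heatmap` is non-zero, since it is the only pixel the model uses. Each heatmap entry requires one extra model query, which is one reason deterministic methods that reuse the network's internal operations can be attractive.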
Among several methods, such as Class Saliency Map, Grad-CAM, Gradient×Input, Integrated Gradients, Excitation Backprop, Guided Backpropagation, and others (for a brief overview of XAI methods, see (ARRAS; OSMAN; SAMEK, 2022; HOLZINGER et al., 2022)), we chose to work with Layer-wise Relevance Propagation (BACH et al., 2015; MONTAVON et al., 2017). Layer-wise Relevance Propagation (LRP) was originally developed for CNNs and is a very popular method, which has been extended in other works, making it a highly applicable technique today (HOLZINGER et al., 2022). It is a deterministic XAI method based on the operations performed in model propagation. Specifically, it propagates the relevance score from the model output back to its related input. Furthermore, in a recent work, Arras, Osman and Samek (2022) proposed a new evaluation paradigm for computer vision methods and tested several XAI methods. LRP was considered one of the most accurate XAI computer vision methods tested under this new evaluation paradigm.

LRP is an XAI method that can be applied to models structured as neural networks. Besides images, it also works with video and text. In this method, the relevance score received by a neuron must be redistributed to the lower layer in equal amount. LRP redistributes the model's prediction score following a principle of local conservation. The method is illustrated in Figure 12. Let $R$ be the relevance score, and let $j$ and $k$ index the neurons of two consecutive layers of the neural network, i.e., $j$ for the previous layer and $k$ for the following layer. Thus, propagating the relevance scores $R_k$ at a given layer onto the neurons of the previous layer is achieved by applying the rule:

$$R_j = \sum_k \frac{z_{jk}}{\sum_j z_{jk}} R_k \tag{2.9}$$

The term $z_{jk}$ represents how much neuron $j$ contributed to making neuron $k$ relevant.
The denominator enforces the conservation property (analogous to the energy conservation principle or Kirchhoff's law in physics), where $\sum_j R_j = \sum_k R_k$. The propagation procedure terminates once the input features have been reached.

Figure 12 – Illustration of the LRP method running in a neural network. At the top, the neural network is fed with input $x$ and produces the output $F(x)$, where $a_j$ indicates lower-layer activations and $a_k$ indicates the deep rectifier network neurons. Then, the LRP method uses $F(x)$ as the final relevance $R_k$ of the network. So, at the bottom, $R_k$ is back-propagated to the previous layer, generating $R_j$. These values are again propagated back until they reach the network input, generating the heatmap $R$. (Image source: LRP project page, http://www.heatmapping.org)

Various rules can be used to redistribute contributions from each layer to the previous layer. First, consider that deep rectifier networks are composed of neurons $a_k$:

$$a_k = \max\left(0, \sum_{0,j} a_j w_{jk}\right) \tag{2.10}$$

Here, $a_k$ are the neurons of a deep rectifier network and $a_j$ are the lower-layer activations. To include the bias in the weight matrix $W$, with $w \in W$, let $a_0 = 1$. Then, the first rule discussed by the LRP authors (BACH et al., 2015) is:

$$\text{(LRP-0 rule)} \qquad R_j = \sum_k \frac{a_j w_{jk}}{\sum_{0,j} a_j w_{jk}} R_k \tag{2.11}$$

It redistributes relevance in proportion to the contribution of each input to the neuron activation. However, as described by the authors, the gradient of a deep neural network is typically noisy; therefore, this rule needs to be made more robust. A first enhancement of the basic LRP-0 rule consists of adding a small positive term $\varepsilon$ to the denominator:

$$\text{(LRP-}\varepsilon\text{ rule)} \qquad R_j = \sum_k \frac{a_j w_{jk}}{\varepsilon + \sum_{0,j} a_j w_{jk}} R_k \tag{2.12}$$

According to the authors, as a result of this rule, the explanations are usually more concise in terms of input features and contain less noise.
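The LRP-0 and LRP-ε rules above can be sketched for a single fully connected layer. The code below is a minimal illustration in plain Python, not the implementation used in this work: `lrp_epsilon`, the toy weights, and the choice to skip neurons whose pre-activation is not positive (inactive ReLU neurons, which are assumed to carry no relevance) are all assumptions made for the example. The bias is handled as in the text, with $a_0 = 1$.

```python
def lrp_epsilon(a, w, relevance_k, eps=0.0):
    """Redistribute relevance_k onto the lower layer (Eq. 2.12).

    a: lower-layer activations, with a[0] = 1 as the bias slot.
    w: w[j][k] is the weight from lower neuron j to upper neuron k
       (w[0][k] is the bias of neuron k).
    eps: stabilizer; eps = 0 recovers the LRP-0 rule (Eq. 2.11).
    """
    n_lower, n_upper = len(a), len(relevance_k)
    # z[k] = sum over j (bias slot included) of a_j * w_jk
    z = [sum(a[j] * w[j][k] for j in range(n_lower)) for k in range(n_upper)]
    return [
        sum(a[j] * w[j][k] / (eps + z[k]) * relevance_k[k]
            for k in range(n_upper) if z[k] > 0)  # assumption: skip inactive neurons
        for j in range(n_lower)
    ]


a = [1.0, 2.0, 1.0]                         # a[0] = 1 is the bias slot
w = [[0.0, 0.0], [1.0, 0.5], [2.0, -0.5]]   # toy weights, zero bias
relevance_k = [4.0, 0.5]                    # relevance of the two upper neurons

r_lrp0 = lrp_epsilon(a, w, relevance_k)           # conserves total relevance
r_eps = lrp_epsilon(a, w, relevance_k, eps=0.1)   # absorbs part of it
```

With `eps=0` the redistributed scores sum to the same total as `relevance_k`, which is the conservation property. With `eps > 0` the total shrinks slightly, most strongly for weakly activated neurons, which is how the ε term suppresses noisy contributions.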
Another improvement proposed by them is to favor the effect of positive contributions over negative contributions, controlled by the scalar $\lambda$ (a high value will cause the negative contributions to disappear):

$$\text{(LRP-}\lambda\text{ rule)} \qquad R_j = \sum_k \frac{a_j \cdot (w_{jk} + \lambda w_{jk}^{+})}{\sum_{0,j} a_j \cdot (w_{jk} + \lambda w_{jk}^{+})} R_k \tag{2.13}$$

where $(\cdot)^{+} = \max(0, \cdot)$ and $(\cdot)^{-} = \min(0, \cdot)$.

Unfortunately, using only one of these rules across the entire network structure can provide a poor explanation. Therefore, it is recommended to use a composite strategy (Figure 13), where different rules are used in different layers. In addition to the rules above, there are other rules that can be used to obtain better explanations of the network, along with suggestions of which rule to use in which layer. These rules and usage suggestions are presented in Table 1. For a technical and more in-depth look at the LRP method, including a discussion of the various propagation rules, see (BACH et al., 2015; MONTAVON et al., 2017; MONTAVON et al., 2019; LAPUSCHKIN et al., 2019).

Table 1 – LRP rules and usage suggestions. In addition to the variables denoted in the text, the index $i$ refers to the network input and the parameters $l_i$, $h_i$ define the box constraints of the input domain (MONTAVON et al., 2019).

Name | Formula (rules) | Usage
LRP-0 | $R_j = \sum_k \frac{a_j w_{jk}}{\sum_{0,j} a_j w_{jk}} R_k$ | Upper layers
LRP-ε | $R_j = \sum_k \frac{a_j w_{jk}}{\varepsilon + \sum_{0,j} a_j w_{jk}} R_k$ | Middle layers
LRP-λ | $R_j = \sum_k \frac{a_j \cdot (w_{jk} + \lambda w_{jk}^{+})}{\sum_{0,j} a_j \cdot (w_{jk} + \lambda w_{jk}^{+})} R_k$ | Lower layers
LRP-αβ | $R_j = \sum_k \left( \alpha \frac{(a_j w_{jk})^{+}}{\sum_{0,j} (a_j w_{jk})^{+}} - \beta \frac{(a_j w_{jk})^{-}}{\sum_{0,j} (a_j w_{jk})^{-}} \right) R_k$ | Lower layers
♭ (flat²) | $R_j = \sum_k \frac{1}{\sum_j 1} R_k$ | Lower layers
w²-rule | $R_i = \sum_j \frac{w_{ij}^2}{\sum_i w_{ij}^2} R_j$ | First layer ($\mathbb{R}^d$)
zB-rule | $R_i = \sum_j \frac{x_i w_{ij} - l_i w_{ij}^{+} - h_i w_{ij}^{-}}{\sum_i x_i w_{ij} - l_i w_{ij}^{+} - h_i w_{ij}^{-}} R_j$ | First layer (pixels)

² Pronunciation: /flæt/ https://dictionary.cambridge.org/dictionary/english/flat

Figure 13 – At the top, the input and the heatmaps generated using only the LRP-0, LRP-ε, and LRP-λ rules, respectively.
At the bottom right is the network structure with the rule used in each layer, and at the bottom left is the resulting heatmap. Source: (MONTAVON et al., 2019).

3 Related Works

The literature has shown that state-of-the-art models are susceptible to attacks, such as model extraction and adversarial examples. Our work relied on several studies in the field and presented a new method for model extraction that takes advantage of previously unexplored features. Furthermore, after obtaining the results of our method, we analyzed the defenses against such attacks, paying particular attention to the defenses that could be effective against our method. Thus, this chapter presents works related to our method, followed by the related types of defenses.

3.1 Model extraction and attacks

Following the taxonomy proposed by Zhang et al. (2022), there are two groups of targets in attack methods: visual data and visual deep learning systems. In the first group, the attack takes place by applying methods to the data instance. The second group is formed by methods that attack datasets or models. The types of attack on datasets are: membership inference, model inversion, property inference, model memorization, and violation in data aggregation. The attack on models is the model extraction attack, which is where our method fits.

These attacks can occur against white-box or black-box models. In white-box models, information about the parameters, training dataset, and model architecture is available and accessible to the adversary. In contrast, in black-box models, little information is available to the adversary, such as only the final predictions of the model. The literature also cites gray-box models, where the adversary has more information about the model than in the black-box setting. However, most works use only the white-box and black-box nomenclature, and only these two names are adopted here.

This section provides a knowledge base on the model extraction performed in our method.
It begins by describing knowledge transfer works related to model compression and knowledge distillation. Then, it covers research related to model extraction attacks. Finally, works related to membership inference attacks are presented. These studies focused on the classification of inputs with perturbations in deep neural networks and introduced the concept of adversarial examples. These works showed intriguing properties in the recognition of unknown and spurious images by state-of-the-art network models.

Model compression consists of extracting knowledge from a larger model into a smaller one, i.e., one with fewer parameters. The objective is to use the substitute (and smaller) model on hardware with less processing power. Bucilă, Caruana and Niculescu-Mizil (2006) extracted the knowledge of an ensemble (a set of small machine learning models combined to provide more complex answers) into a shallow network. However, instead of training the network using the original dataset, they generated a synthetic dataset and labeled it with the ensemble. The larger dataset, composed of synthetic data, was used to train a shallow network, which achieved performance similar to that of the ensemble. Remarkably, the shallow network trained on this dataset outperformed the same network trained on the original training set. To create the synthetic data, they introduced a new method that generates data corresponding as closely as possible to the distribution of the original training set. They evaluated the effectiveness of their model compression on eight binary classification datasets provided in the UCI Repository (DUA; GRAFF, 2017), but none of these datasets consisted of images.

Additionally, Ba and Caruana (2014) performed model compression from a DNN to a shallow network. They tested their approach with the TIMIT (GAROFOLO et al., 1993) and CIFAR-10 (KRIZHEVSKY; HINTON, 2009) datasets.
The objective was to train the surrogate models to learn the function learned by the larger model. For this purpose, the surrogate models were not trained with the original labels of the original training datasets, but with the logits of the target models. They emphasized that learning the target model function is easier using logits. Given training data $D = \{(x_1, z_1^t), \ldots, (x_N, z_N^t)\}$, where $z_i^t$ is the target model logit of $x_i$, the loss function used was the mean squared error (MSE, squared L2 norm) applied to the logits:

$$L(\hat{z}^s, \hat{z}^t) = \frac{1}{2N} \sum_{i=1}^{N} \|z_i^s - z_i^t\|_2^2 \tag{3.1}$$

where $N = |D|$ and $\hat{z}^s$ denotes the substitute model logits.

Later, Hinton, Vinyals and Dean (2015) further studied model compression. They argued that the probabilities of a model's output provide not only information about the desired class (i.e., the output with the highest probability), but also information about how the model tends to generalize to other classes (i.e., the lower probabilities presented in the other outputs). Thus, they proposed a method called knowledge distillation, which provides more information about the model's classification in relation to its input. For this, they proposed smoothing the final softmax output using a temperature $T$:

$$\hat{\sigma}(z_i) = \frac{e^{z_i / T}}{\sum_{j=1}^{d} e^{z_j / T}} \tag{3.2}$$

where $z$ is the logit vector of the target network. Using a higher value for $T$ produces a smoother probability distribution across classes. The value of $T$ must be empirically adjusted so that the target model produces a good set of outputs for training the substitute model. They also cited the work of Bucilă, Caruana and Niculescu-Mizil (2006), describing it as a specific case of knowledge distillation. Some other works were built on top of these results. For example, Chan, Ke and Lane (2015) employed knowledge distillation to transfer knowledge from an RNN to a DNN, and Tang, Wang and Zhang (2016) to transfer knowledge from a DNN to an LSTM.
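Equations 3.1 and 3.2 are short enough to be written out directly. The following sketch, in plain Python with only the standard library, is a minimal illustration rather than the code of the cited works; the function names and the example logits are hypothetical. Subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result of Equation 3.2.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Eq. 3.2: softmax of logits / T (max subtracted for stability)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_mse_loss(student_logits, teacher_logits):
    """Eq. 3.1: (1 / 2N) * sum of squared L2 distances between logit vectors."""
    n = len(student_logits)
    return sum(
        sum((s - t) ** 2 for s, t in zip(zs, zt))
        for zs, zt in zip(student_logits, teacher_logits)
    ) / (2 * n)


teacher_logits = [4.0, 1.0, 0.0]
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
smooth = softmax_with_temperature(teacher_logits, temperature=5.0)
loss = logit_mse_loss([[1.0, 2.0]], [[1.0, 0.0]])
```

Comparing `sharp` and `smooth` shows the effect described in the text: at the higher temperature the largest probability shrinks and the remaining classes receive more mass, exposing how the target model ranks the non-predicted classes.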
In addition to model compression studies, Tramèr et al. (2016) explored model extraction attacks. They exploited vulnerabilities in machine learning models, using their predictions to generate an equivalent or nearly equivalent surrogate model, i.e., a model with accuracy close to that of the target model. Their research showed good results on logistic regression models, SVMs, decision trees, and shallow networks. They first used the models' soft-labels and afterwards their hard-labels. Moreover, they argued that ML models emit data-rich outputs that can be exploited by adversaries, and that model extraction can be used to steal the model for subsequent free use. In addition, they used BigML¹ and Amazon Machine Learning² to simulate an MLaaS, providing their target models as an API and performing the extraction.

¹ BigML website: https://bigml.com
² Amazon Machine Learning website: https://aws.amazon.com/machine-learning/

Additionally, Shi, Sagduyu and Grushin (2017) studied extraction on two target classifiers, Naive Bayes and SVM, that classified text in a binary dataset. The authors generated two deep learning classifiers that achieved high fidelity rates with respect to the target models. However, their work did not explore copies of DNNs, and it demanded problem domain data. Given the cost of copying these simpler models, their approaches did not seem scalable to deep learning models with multiple classes and larger datasets.

Unlike our work, however, these methods generally assume that some details about the models are known, that is, they treat the model of interest as a white box. They also use the same training data (or at least problem domain data), assuming one would have access to the logits or probabilities of all classes for a given input.

Other works have also explored attacks on ML models, more specifically on neural networks. Szegedy et al. (2014) discovered that neural networks can be fooled with certain input patterns called adversarial examples, i.e., input images that have their pixels slightly modified to cause a machine learning model to produce an incorrect output. Let $f(\cdot)$ be the target model, $x$ the input image, $\varepsilon$ the pixel perturbation matrix, $\hat{y}$ the predicted class of $x$, and $x'$ the adversarial example. The normal behavior of the target model is $f(x) = \hat{y}$. However, the adversarial example can fool the target model: $f(x') = f(x + \varepsilon) \neq \hat{y}$. They used an optimization process on the input image to find small perturbations in the pixels that caused wrong outputs in the target networks (Figure 3). Furthermore, they found that the same input with the same perturbation can be applied to a different network, even one trained on a different subset of the data, and also cause an incorrect classification.

Later, Goodfellow, Shlens and Szegedy (2015) proposed the Fast Gradient Sign Method (FGSM) to craft adversarial examples against DNNs. This method has been used in several academic works. Let $x$ be the original image correctly classified by the model, $y$ the desired prediction of $x$, $x'$ the adversarial example, and $\varepsilon$ the perturbation used to craft the adversarial example:

$$x' = x + \varepsilon, \qquad \varepsilon = \lambda \, \mathrm{sign}(\nabla_x J(\theta, x, y)) \tag{3.3}$$

where $J(\theta, x, y)$ is the cost used to train the neural network, and $\lambda$ is the magnitude of the perturbation added to $x$. A demonstration of this method (extracted from the original article) can be seen in Figure 4.

Papernot et al. (2016) also explored adversarial examples and used forward derivatives to build adversarial saliency maps³ for crafting adversarial samples against the DNN model. The forward derivative was defined as the Jacobian matrix of the model with respect to its inputs. The main difference of this method is the application of an extended version of the saliency map introduced by Simonyan, Vedaldi and Zisserman (2014).
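Equation 3.3 can be illustrated on a model small enough for the gradient to be written by hand. The sketch below is only an assumed toy setting, not the experimental setup of the cited works: a two-feature logistic-regression classifier trained with cross-entropy, for which $\nabla_x J = (p - y)\,w$, where $p$ is the predicted probability. The label $y$ is taken here to be the correct class of $x$, and the names `fgsm`, `predict`, `weights`, `bias`, and `magnitude` (the $\lambda$ of Equation 3.3) are hypothetical.

```python
import math


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(x, y, weights, bias, magnitude):
    """Eq. 3.3 for a logistic-regression model with cross-entropy cost J."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = 1.0 / (1.0 + math.exp(-logit))
    grad_x = [(p - y) * w for w in weights]  # analytic dJ/dx for this model
    return [xi + magnitude * sign(g) for xi, g in zip(x, grad_x)]


def predict(x, weights, bias):
    """Hard label (0 or 1) of the toy classifier."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)


weights, bias = [1.0, -1.0], 0.0
x, y = [0.6, 0.4], 1                   # logit 0.2, classified as class 1
x_adv = fgsm(x, y, weights, bias, magnitude=0.2)
```

Each feature moves by exactly `magnitude` in the direction that increases the cost, which here is enough to push the input across the decision boundary. The sketch also makes the white-box requirement visible: computing `grad_x` needs the model weights.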
These adversarial saliency maps indicate which pixels most efficiently perturb the network behavior for an input, thus allowing adversarial examples to be generated. However, these methods need access to the model parameters (white-box) and also need problem domain data.

³ The saliency map is a measure of how much each input pixel contributes to the final predictions of the network.

After these works, Papernot et al. (2017) formulated a new strategy capable of crafting adversarial examples against black-box models provided as MLaaS. The strategy consists of first attacking the target model (model extraction) to train a surrogate model, a method named Jacobian-Based Dataset Augmentation (JBDA). For this purpose, a training dataset $S$ is built from some original images of the problem domain. Then, in an iterative process, the surrogate model is used to craft new synthetic examples using its Jacobian matrix. These images are then labeled by the target model to enlarge the dataset $S$ and fine-tune the surrogate model:

$$S_{\tau+1} \leftarrow \{x + \lambda \cdot \mathrm{sign}(J(x)[f(x)]) : x \in S_\tau\} \cup S_\tau \tag{3.4}$$

where $\tau$ is the iteration identifier, $x$ is the original image, $J(x)[f(x)]$ is the Jacobian of the surrogate model with respect to $x$ corresponding to the hard-label assigned by the target model $f(\cdot)$, and $\lambda$ is the magnitude with which the sign of the Jacobian matrix is added to $x$ to generate a new image. Finally, when the training ends (after $\tau_{max}$ iterations), the surrogate model is used to craft adversarial examples using the two previous methods of (SZEGEDY et al., 2014) and (PAPERNOT et al., 2016). Although they use model extraction on the target model, the surrogate model is not designed to have high accuracy or fidelity to the target model. The intention was only to generate a substitute model capable of creating adversarial examples. Moreover, they used only problem domain images in the whole process.

Other important research was presented by Nguyen, Yosinski and Clune (2015).
They studied the vulnerability to adversarial examples presented by Szegedy et al. (2014) and found that it is easy to generate images that are completely unrecognizable to humans, but that DNNs recognize with high confidence (Figure 5). Their initial motivation was to verify the similarity between human vision and computer vision. They started by training two DNNs, one with MNIST and the other with ImageNet, and crafted adversarial examples using evolutionary algorithms and gradient ascent. In this way, they found images classified with a high degree of confidence as belonging to each class in the tested DNN models. Additionally, they unsuccessfully tried to avoid this misclassification by adding a garbage class, but it degraded the accuracy of their models. They also made a statement about discriminative models with a high-dimensional input space: the area of a class in the classification space can be much larger than the area occupied by the training examples for that class. These works show that there are large high-confidence regions in the CNN classification subspaces that are not occupied by training examples.

More recently, and after our preliminary work, Orekondy, Schiele and Fritz (2019) studied how an adversary can steal the functionality of a black-box target network. Similarly to our preliminary work (Correia-Silva et al., 2018), the authors labeled a large public dataset of images using a black-box model to generate a fake dataset. But unlike our work, they used the model's probabilities (soft-labels) instead of hard-labels to label their fake dataset. They conducted experiments with four datasets and a fixed architecture (ResNet-34) with pre-trained weights.
The experiments followed three data distributions: (i) images used to train the target networks but unlabeled; (ii) a combination of all images (from OpenImages (KUZNETSOVA et al., 2020) and ILSVRC (RUSSAKOVSKY et al., 2015b)) used to train all target networks; and (iii) images only from OpenImages and ILSVRC. Furthermore, they applied one attack method with random image samples and another one with reinforcement learning. Additionally, to verify the influence of the architecture, they generated target networks with VGG-16 and ResNet-34 architectures trained with one of the datasets. Subsequently, they used several architectures to attack the target networks. Finally, they concluded that it is beneficial to use a more complex architecture to copy the network, and they achieved a copy performance of more than 77% in their experiments. Although similar, their work has a key difference from ours: they assume access to the probabilities (soft-labels) and not just the hard-label output.

Along the lines of this work, Mosafi, David and Netanyahu (2019a) presented an attack with unlabeled data on models trained with MNIST and CIFAR-10. The authors used the probabilities (soft-labels) of the target network to label the attack dataset. In a follow-up work, Mosafi, David and Netanyahu (2019b) cited our preliminary work and decided to use the same constraints, i.e., only the hard-labels. Additionally, they proposed a new method to create unlabeled data. In this method, a student queries the teacher (target model) with composite images to extract its knowledge. The teacher was trained with CIFAR-10 and the student used ImageNet to generate each composite image $c$:

$$c = \frac{p}{100} \times \text{ImageNet}[i] + \left(1 - \frac{p}{100}\right) \times \text{ImageNet}[j] \tag{3.5}$$

where $p \sim U(0, 100)$ and $i, j \sim U(1, |\text{ImageNet}|)$. Their objective was to generate a diverse dataset to better extract the knowledge of the target model. They performed three experiments:

i. the first one used ImageNet images labeled with soft-labels provided by the target model, achieving a performance of 98.5% (measured as $\frac{\text{accuracy}_{\text{student}}}{\text{accuracy}_{\text{teacher}}}$),
ii. the second one used ImageNet images labeled with hard-labels provided by the target model, achieving a performance of 96.7%, and
iii. the third one used the composite images labeled with hard-labels provided by the target model, achieving a performance of 99%.

Unfortunately, although they showed that their method was better than the other tested methods, they did not test it with other datasets. Furthermore, the total number of images used in the attack was not reported, and a thorough analysis of the model behavior was not performed.

Unlike related works that use the final model probabilities, our proposed method requires access only to the hard-labels and is tested on different problem domains comprising different numbers of classes and on different architectures. These experiments allow us to present an initial analysis of the limitations and capabilities of copying with random natural images and the final labels of the model. Table 2 shows a comparison between related works and ours (Copycat).

Table 2 – Comparison between related works and Copycat (ours). Abbreviations: Distillation (D), Adversarial Examples (A) or Copy Attack (C), Problem Domain Data (P), and Non-Problem Domain Data (N). The references for the method numbers are given in the footnote⁴.

Features / Methods: 1 2 3 4 5 6 7 8 9 10 Ours
Type of method: D D D A A A C C C C
Black-Box: X X X X X X X X
Data used in the Experiments: P P P P P N P P N,P N N,P
Hard labels: X X X X
ML, Shallow Networks: X X X X
DNN → DNN: X X X X X X X X
ML → DNN: X X

⁴ 1: (BUCILĂ; CARUANA; NICULESCU-MIZIL, 2006). 2: (HINTON; VINYALS; DEAN, 2015). 3: (TANG; WANG; ZHANG, 2016). 4: (SZEGEDY et al., 2014). 5: (GOODFELLOW; SHLENS; SZEGEDY, 2015). 6: (NGUYEN; YOSINSKI; CLUNE, 2015). 7: (SHI; SAGDUYU; GRUSHIN, 2017). 8: (TRAMÈR et al., 2016). 9: (OREKONDY; SCHIELE; FRITZ, 2019).
10: (MOSAFI; DAVID; NETANYAHU, 2019b).

To the best of our knowledge, the Copycat CNN (Correia-Silva et al., 2018) was the first model extraction method that uses random natural images and hard-labels to extract the model knowledge. After the development of our experiments, additional related works emerged. Therefore, for the completeness of this thesis, the studies that came after the completion of our studies are presented. However, their analysis and exploration will be carried out in future works.

A different attack approach was presented by Sanyal, Addepalli and Babu (2022), who proposed the use of Generative Adversarial Networks (GANs) (GOODFELLOW et al., 2014) for model extraction. First, they used proxy data to train the GAN models. Afterwards, they alternately trained the surrogate model and the Generator, which used Adversarial Loss (GOODFELLOW et al., 2014) and Diversity Loss (ADDEPALLI et al., 2020). Another work that also used the GAN strategy was proposed by Beetham et al. (2023). The main difference of this work was the application of Dual Students, where two student models were used to generate images to better explore the Oracle. For this purpose, the loss function employed in one student model is the negative of the loss of the other student model. Both works used a much larger amount of data than proposed in our methodology.

Due to the importance of these image processing models and their associated cost, it is necessary to find solutions to protect them against existing threats. Thus, the next section presents the defense methods related to our work.

3.2 Defense methods

The number of current systems that use machine learning methods such as DNNs and CNNs has grown substantially in recent years. However, some of their vulnerabilities have already been presented, such as adversarial examples and model extraction attacks. Therefore, several current works have