CRF+LG: uma abordagem híbrida para o reconhecimento de entidades nomeadas em português

Nenhuma Miniatura disponível
Data
2019-02-07
Autores
Pirovani, Juliana Pinheiro Campos
Título da Revista
ISSN da Revista
Título de Volume
Editor
Universidade Federal do Espírito Santo
Resumo
Named Entity Recognition involves automatically identifying and classifying entities such as persons, places, and organizations, and it is a very important task in Information Extraction. Named Entity Recognition systems can be developed using the following approaches: linguistics, machine learning or hybrid. This work proposes the use of a hybrid approach, called CRF+LG, for Named Entity Recognition in Portuguese texts in order to explore the advantages of both linguistics and machine learning approaches. The proposed approach uses Conditional Random Fields (CRF) considering the term classification obtained by a Local Grammar (LG) as an additional informed feature. Conditional Random Fields is a probabilistic method for structured prediction. Local grammars are handmade rules to identify expressions within the text. The aim was to study this way of including the human expertise (Local Grammar) in the machine learning Conditional Random Fields approach and to analyze how it can contribute to the performance of this approach. To achieve this aim, a Local Grammar was built to recognize the 10 named entities categories of HAREM, a joint assessment for the Named Entity Recognition in Portuguese. Initially, the Golden Collection of the First and Second HAREM, considered as a reference for Named Entity Recognition systems in Portuguese, were used as training and test sets, respectively, for evaluation of the CRF+LG. After that, the proposed approach was evaluated in two other datasets. The results obtained outperform the results of systems reported in the literature that were evaluated under equivalent conditions. This gain was approximately 8 percentage points in F-measure in comparison to a system that also used CRF and 2 points in comparison to a system that used Neural Networks. Some systems that used Neural Networks presented superior results, but using massive corpora for unsupervised learning of features, which was not the case of this work. The Local Grammar built can be used individually when there is no training set available and in conjunction with other machine learning techniques to improve its performance. We also analyzed the boundaries (lower bound and upper bound) of the proposed approach. The lower bound indicates the minimum performance and the upper bound indicates the maximum gain that we can achieve for the task in question when using this approach.
Descrição
Palavras-chave
Named entity recognition , Conditional random fields , Local grammars , Reconhecimento de entidades nomeadas , Campos aleatórios condicionais , Gramáticas locais
Citação
PIROVANI, Juliana Pinheiro Campos. CRF+LG: uma abordagem híbrida para o reconhecimento de entidades nomeadas em português. 2019. 114 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Espírito Santo, Centro Tecnológico, Vitória, 2019.