Methodology for text categorization from unlabeled documents using an anaphora resolution process

Date
2010-08-30
Authors
Bossois, Débora Zupeli
Publisher
Universidade Federal do Espírito Santo
Abstract
With the constant growth of textual content in electronic format comes the need to organize this information in an operable way. Text categorization was developed for this purpose: it eases the manipulation and retrieval of information by separating it into thematic categories. There are many approaches to automatic text classification, of which supervised learning is the most traditional. Although supervised methods can be as precise as human specialists, the requirement of a pre-classified corpus can be a limiting factor in some applications. In those situations, a semi-supervised or unsupervised solution can be applied, which does not demand a complete and well-formed training set to build a classifier; instead, only unlabeled documents are supplied to the method. Supervised, semi-supervised, and unsupervised learning alike usually build a text representation based only on term occurrence, without taking semantic factors into consideration. However, many intrinsic characteristics of natural language can make the process ambiguous, one of which is the use of different terms to refer to an entity already introduced in the text. This linguistic phenomenon is called anaphora. This thesis proposes a method for building an unsupervised classifier based on the Nominal Structure of Discourse (Estrutura Nominal do Discurso, END, in Portuguese), developed by Freitas for anaphora resolution [Freitas 2005]. To this end, a bootstrapping technique for classification is implemented to obtain the initial labeled training data, which is then used to generate a classification model through supervised learning. Besides being grounded in the END, the methodology benefits directly from the anaphora resolution process, using the antecedents identified for the anaphors during the final classification phase.
This work presents the details of the proposed methodology, as well as the experiments and tests carried out to evaluate it. The results show that the anaphora resolution process is beneficial for an unsupervised learning system.
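The pipeline outlined in the abstract (bootstrap initial labels from unlabeled documents, train a supervised model on them, and use resolved antecedents at classification time) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the seed-keyword labeling stands in for the END-based bootstrapping, which the abstract does not detail, and the `antecedents` mapping is a hypothetical stand-in for the output of a real anaphora resolver. All function names, seed words, and sample documents are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def resolve_anaphora(tokens, antecedents):
    """Replace each anaphor with its antecedent.

    `antecedents` is a hypothetical anaphor -> antecedent mapping that
    stands in for the output of a real anaphora resolution process.
    """
    return [antecedents.get(t, t) for t in tokens]


def bootstrap_labels(docs, seeds):
    """Label unlabeled documents by counting seed-word hits per category.

    Assumption: seed keywords replace the END-based labeling step.
    Documents with no hits or a tie between categories stay unlabeled.
    """
    labeled = []
    for tokens in docs:
        scores = {cat: sum(t in words for t in tokens)
                  for cat, words in seeds.items()}
        best = max(scores, key=scores.get)
        is_unique = list(scores.values()).count(scores[best]) == 1
        if scores[best] > 0 and is_unique:
            labeled.append((tokens, best))
    return labeled


def train_naive_bayes(labeled):
    """Collect the counts needed by a multinomial Naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for tokens, cat in labeled:
        doc_counts[cat] += 1
        word_counts[cat].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(tokens, model):
    """Return the category with the highest Laplace-smoothed log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_cat, best_score = None, -math.inf
    for cat in doc_counts:
        total_words = sum(word_counts[cat].values())
        score = math.log(doc_counts[cat] / total_docs)
        for t in tokens:
            score += math.log((word_counts[cat][t] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


# Illustrative data only; none of it comes from the dissertation.
seeds = {"sports": {"match", "team"}, "finance": {"bank", "market"}}
raw_docs = [
    "the team won the match",
    "the bank raised rates and the market fell",
    "the team signed a striker",
    "the market rallied after the bank report",
]
docs = [tokenize(d) for d in raw_docs]
initial_training_data = bootstrap_labels(docs, seeds)
model = train_naive_bayes(initial_training_data)

new_doc = tokenize("it fell sharply today")
antecedents = {"it": "market"}  # hypothetical resolver output
label = classify(resolve_anaphora(new_doc, antecedents), model)
print(label)
```

In this toy setting, substituting the antecedent "market" for the anaphor "it" gives the classifier a content-bearing term it would otherwise never see, which is the intuition behind using resolved antecedents in the final classification phase.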
Citation
BOSSOIS, Débora Zupeli. Metodologia de categorização de textos a partir de documentos não rotulados utilizando um processo de resolução de anáforas. 2010. 109 f. Dissertação (Mestrado em Informática) - Universidade Federal do Espírito Santo, Centro Tecnológico, Vitória, 2010.