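Aproximación lingüística en el diseño de un corpus anotado en español sobre COVID-19 para sistemas de pregunta-respuesta [A linguistic approach to the design of an annotated Spanish-language corpus on COVID-19 for question-answering systems]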
Date
2023
Type
master's thesis
Authors
Barboza Hidalgo, Graciela Yislén
Abstract
Natural Language Processing (NLP) resources, together with annotated corpora, have accelerated
the development of intelligent question-answering systems (chatbots), which are trained to imitate
human linguistic behavior. Linguistic annotation of corpora is a necessary step in training
question-answering systems with machine learning methods; however, since the early days of
Artificial Intelligence (AI), there have been attempts to speed up the evolution of language
engineering by automating natural language processing tasks, often dispensing with the
contribution of Linguistics.
Semantic role labeling in Spanish has remained a marginal topic in NLP and still has numerous
unresolved problems. For this reason, the involvement of linguists, with their knowledge of the
internal structure of the language, makes it possible to improve and strengthen Machine Learning
models with relevant contributions from linguistic theory. Accordingly, this thesis created an
annotation model for thematic roles in Spanish, with a descriptive analysis of the 200 most
frequent verbs in the Corpus COVID-19, using Corpus Linguistics as the methodology and
Lexical-Functional Grammar (LFG) as the theoretical basis.
This thesis focused on the purely linguistic aspect of annotation, not on building chatbots or on
fine-tuning ChatGPT. Using this annotation model, the agreement of the human annotators was
compared with that of ChatGPT by means of Fleiss' kappa coefficient. The work concludes that
ChatGPT performed below the human annotators: it obtained a κ of 0.420 and a precision of 0.539,
whereas the humans obtained a κ of 0.600 and a precision of 0.700.
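To make the agreement measure concrete, the following is a minimal sketch of how Fleiss' kappa can be computed from a table of labels. It is not the thesis's own code: the function name `fleiss_kappa`, the role labels, and the toy data are assumptions chosen only for illustration, and the abstract does not state the tag set or the number of annotators actually used.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that each received the same number of labels.

    ratings: list of items; each item is the list of labels given by the raters.
    categories: the closed set of labels the raters could choose from.
    """
    if not ratings:
        raise ValueError("at least one rated item is required")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    if any(label not in categories for item in ratings for label in item):
        raise ValueError("found a label that is not in `categories`")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement, from the overall share of each category.
    total = n_items * n_raters
    shares = [sum(c[cat] for c in counts) / total for cat in categories]
    p_expected = sum(s ** 2 for s in shares)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical toy data: three raters label the role of four arguments.
    roles = ["AGENT", "THEME", "EXPERIENCER"]
    toy = [
        ["AGENT", "AGENT", "AGENT"],
        ["THEME", "THEME", "AGENT"],
        ["EXPERIENCER", "EXPERIENCER", "EXPERIENCER"],
        ["THEME", "THEME", "THEME"],
    ]
    print(round(fleiss_kappa(toy, roles), 3))
```

A value of 1 means the raters always agree, while values near 0 mean agreement is no better than what the category frequencies alone would produce, which is the scale on which the reported κ values of 0.420 and 0.600 are read.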
This research was empirical in nature and had few precedents for Spanish: at the time the thesis
was written, no work had been published comparing human annotations of thematic roles with those
of a large language model such as ChatGPT, nor were precedents found that offered a clearly
replicable guide for annotating thematic roles. Its design was exploratory, since it proposes an
annotation scheme of linguistic labels for thematic roles in Spanish, and the analysis consisted
of textual analysis of corpora in Spanish from the perspective of Corpus Linguistics.
Keywords
LINGUIST, CORONAVIRUS, SYNTAX, QUESTION-ANSWERING, LINGUISTIC BEHAVIOR, INTELLIGENT SYSTEM