Person:
Araujo Serna, M. Lourdes

ORCID

0000-0002-7657-4794

Last name

Araujo Serna

First name

M. Lourdes

Search results

Showing 1 - 10 of 22

Web spam detection : new classification features based on qualified link analysis and language models
(Institute of Electrical and Electronics Engineers (IEEE), 2010-09-01) Araujo Serna, M. Lourdes; Martínez Romo, Juan
Web spam is a serious problem for search engines because the quality of their results can be severely degraded by the presence of this kind of page. In this paper, we present an efficient spam detection system based on a classifier that combines new link-based features with language-model (LM)-based ones. These features are not only related to quantitative data extracted from the Web pages, but also to qualitative properties, mainly of the page links. We consider, for instance, the ability of a search engine to find, using information provided by the page for a given link, the page that the link actually points at. This can be regarded as indicative of the link reliability. We also check the coherence between a page and another one pointed at by any of its links. Two pages linked by a hyperlink should be semantically related, by at least a weak contextual relation. Thus, we apply an LM approach to different sources of information from a Web page that belongs to the context of a link, in order to provide high-quality indicators of Web spam. We have specifically applied the Kullback–Leibler divergence on different combinations of these sources of information in order to characterize the relationship between two linked pages. The result is a system that significantly improves the detection of Web spam using fewer features, on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007.
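As a rough illustration of the divergence idea mentioned in this abstract, the sketch below compares two smoothed unigram language models with the Kullback–Leibler divergence. It is not the authors' implementation: the whitespace tokenizer, the additive smoothing, and the function names `unigram_lm`, `kl_divergence` and `link_divergence` are all assumptions made for the example.

```python
import math
from collections import Counter


def unigram_lm(text, vocab, alpha=1.0):
    """Additively smoothed unigram model over a fixed vocabulary (assumed setup)."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in bits; p and q must be defined over the same vocabulary."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def link_divergence(source_text, target_text):
    """Divergence between the text around a link and the text of the linked page."""
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    return kl_divergence(unigram_lm(source_text, vocab),
                         unigram_lm(target_text, vocab))


# Hypothetical inputs: anchor context versus two candidate target pages.
anchor = "cheap flights to madrid"
related = "book cheap flights to madrid and barcelona"
unrelated = "win casino bonus pills online now"
print(link_divergence(anchor, related), link_divergence(anchor, unrelated))
```

Under these assumptions a lower value suggests that the link context and the target page use similar vocabulary; in a classifier, such a score would be one feature among several.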
Detecting malicious tweets in trending topics using a statistical analysis of language
(Elsevier, 2013-06-01) Martínez Romo, Juan; Araujo Serna, M. Lourdes
Twitter spam detection is a recent area of research in which most previous work has focused on the identification of malicious user accounts and on honeypot-based approaches. In this paper, however, we present a methodology based on two new aspects: the detection of spam tweets in isolation, without prior information about the user; and the application of a statistical analysis of language to detect spam in trending topics. Trending topics capture the emerging Internet trends and topics of discussion that are on everybody’s lips. This growing microblogging phenomenon therefore allows spammers to disseminate malicious tweets quickly and massively. In this paper we present the first work that tries to detect spam tweets in real time using language as the primary tool. We first collected and labeled a large dataset with 34 K trending topics and 20 million tweets. We then propose a reduced set of features that are hard for spammers to manipulate. In addition, we have developed a machine learning system with some orthogonal features that can be combined with other sets of features with the aim of analyzing emergent characteristics of spam in social networks. We have also conducted an extensive evaluation that shows that our system obtains an F-measure at the same level as the best state-of-the-art systems based on the detection of spam accounts. Thus, our system can be applied to Twitter spam detection in trending topics in real time, mainly because it analyzes tweets instead of user accounts.
Building a framework for fake news detection in the health domain
(San Francisco CA: Public Library of Science, 2024-07-08) Martinez Rico, Juan R.; Araujo Serna, M. Lourdes; Martínez Romo, Juan; Bongelli, Ramona
Disinformation in the medical field is a growing problem that carries a significant risk. Therefore, it is crucial to detect and combat it effectively. In this article, we provide three elements to aid in this fight: 1) a new framework that collects health-related articles from verification entities and facilitates their check-worthiness and fact-checking annotation at the sentence level; 2) a corpus generated using this framework, composed of 10335 sentences annotated in these two concepts and grouped into 327 articles, which we call KEANE (faKe nEws At seNtence lEvel); and 3) a new model for verifying fake news that combines specific identifiers of the medical domain with subject-predicate-object triplets, using Transformers and feedforward neural networks at the sentence level. This model predicts the fact-checking of sentences and evaluates the veracity of the entire article. After training this model on our corpus, we achieved remarkable results in the binary classification of sentences (check-worthiness F1: 0.749, fact-checking F1: 0.698) and in the final classification of complete articles (F1: 0.703). We also tested its performance against another public dataset and found that it performed better than most systems evaluated on that dataset. Moreover, the corpus we provide differs from other existing corpora in its duality of sentence-article annotation, which can provide an additional level of justification of the prediction of truth or untruth made by the model.
Understanding and Improving Disability Identification in Medical Documents
(IEEE, 2020) Fabregat Marcos, Hermenegildo; Martínez Romo, Juan; Araujo Serna, M. Lourdes
Disabilities are a problem that affects a large number of people in the world. Gathering information about them is crucial to improve the daily life of the people who suffer from them but, since disabilities are often strongly associated with different types of diseases, the available data are widely dispersed. In this work we review existing proposals for the problem, making an in-depth analysis, and from it we make a proposal that improves the results of previous systems. The analysis focuses on the results of the participants in the DIANN shared task (IberEval 2018), devoted to the detection of named disabilities in electronic documents. In order to evaluate the proposed systems using a common evaluation framework, a corpus of documents, in both English and Spanish, was gathered and annotated. Several teams participated in the task, either using classic methods or proposing specific approaches to deal effectively with the complexities of the task. Our aim is to provide insight for future advances in the field by analyzing the participating systems and identifying the most effective approaches and elements to tackle the problem. We have validated the lessons learned from this analysis through a new proposal that includes the most promising elements used by the participating teams. The proposed system improves, for both languages, the results obtained during the task.
Detecting Signs of Non-suicidal Self-Injury in Psychiatric Medical Reports Using Language Analysis
(Sociedad Española para el Procesamiento del Lenguaje Natural, 2022) Reneses, Blanca; Sevilla-Llewellyn-Jones, Julia; Martínez-Capella, Ignacio; Seara-Aguilar, Germán; Martínez Romo, Juan; Araujo Serna, M. Lourdes
Non-suicidal self-injury, often referred to simply as self-harm, is the act of deliberately harming one's own body, for example by cutting or burning. It is not normally intended as a suicide attempt. This work presents a system for detecting signs of non-suicidal self-injury, based on language analysis, over an annotated set of medical reports obtained from the psychiatry service of a public hospital in Madrid. Explainability and precision in predicting positive cases are the two main objectives of this work. To this end, two supervised systems of different natures have been developed. On the one hand, natural language processing techniques were used to extract various features specific to the self-harm domain, which then feed a traditional classifier. On the other hand, a deep learning system based on several layers of convolutional neural networks was implemented, given how well such networks perform in text classification tasks. The result is two high-performing supervised systems, among which we highlight the one based on a traditional classifier because it predicts the positive class better and its results are easier to explain to healthcare professionals.
Semi‑supervised incremental learning with few examples for discovering medical association rules
(BioMed Central, 2022) Sánchez‑de‑Madariaga, Ricardo; Cantero Escribano, José Miguel; Martínez Romo, Juan; Araujo Serna, M. Lourdes
Background: Association Rules are one of the main ways to represent structural patterns underlying raw data. They represent dependencies between sets of observations contained in the data. The associations established by these rules are very useful in the medical domain, for example in the predictive health field. Classic algorithms for association rule mining give rise to huge amounts of possible rules that should be filtered in order to select those most likely to be true. Most of the proposed techniques for these tasks are unsupervised. However, the accuracy provided by unsupervised systems is limited. Conversely, resorting to annotated data for training supervised systems is expensive and time‑consuming. The purpose of this research is to design a new semi‑supervised algorithm that performs like supervised algorithms but uses an affordable amount of training data. Methods: In this work we propose a new semi‑supervised data mining model that combines unsupervised techniques (Fisher’s exact test) with limited supervision. Starting with a small seed of annotated data, the model improves on the results (F‑measure) obtained using a fully supervised system (standard supervised ML algorithms). The idea is based on utilising the agreement between the predictions of the supervised system and those of the unsupervised techniques in a series of iterative steps. Results: The new semi‑supervised ML algorithm improves on the results of supervised algorithms, measured with the F‑measure, in the task of mining medical association rules, while training with an affordable amount of manually annotated data. Conclusions: Using a small amount of annotated data (which is easily achievable) leads to results similar to those of a supervised system. The proposal may be an important step for the practical development of techniques for mining association rules and generating new valuable scientific medical knowledge.
Discovering related scientific literature beyond semantic similarity: a new co-citation approach
(Springer, 2019-05-17) Rodríguez Prieto, Oscar; Araujo Serna, M. Lourdes; Martínez Romo, Juan
We propose a new approach to recommend scientific literature, a domain in which the efficient organization and search of information is crucial. The proposed system relies on the hypothesis that two scientific articles are semantically related if they are co-cited more frequently than they would be by pure chance. This relationship can be quantified by the probability of co-citation, obtained from a null model that statistically defines what we consider pure chance. Looking for article pairs that minimize this probability, the system is able to recommend a ranking of articles in response to a given article. This system belongs to the co-occurrence paradigm of the field. More specifically, it is based on co-citations, so it can produce recommendations more focused on relatedness than on similarity. Evaluation has been performed on the ACL Anthology collection and on the DBLP dataset, and a new corpus has been compiled to evaluate the capacity of the proposal to find relationships beyond similarity. Results show that the system is able to provide not only articles similar to the submitted one, but also articles presenting other kinds of relations, thus providing diversity, i.e. connections to new topics.
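To make the "more often than pure chance" idea concrete, the sketch below scores an article pair by the probability of seeing at least the observed number of co-citations under a hypergeometric null model. The abstract does not specify the null model, so the hypergeometric choice, the function name `cocitation_pvalue` and the candidate counts are assumptions for illustration only.

```python
from math import comb


def cocitation_pvalue(n_papers, cites_a, cites_b, cocited):
    """P(X >= cocited) for X ~ Hypergeometric(n_papers, cites_a, cites_b).

    n_papers: number of citing papers in the collection
    cites_a, cites_b: papers citing article A and article B respectively
    cocited: papers citing both A and B
    """
    upper = min(cites_a, cites_b)
    total = comb(n_papers, cites_b)
    tail = sum(
        comb(cites_a, k) * comb(n_papers - cites_a, cites_b - k)
        for k in range(cocited, upper + 1)
    )
    return tail / total


# Hypothetical candidates for one query article: (cites_a, cites_b, cocited).
candidates = {
    "paper_x": (20, 20, 10),
    "paper_y": (20, 200, 10),
    "paper_z": (20, 20, 1),
}
ranking = sorted(candidates, key=lambda name: cocitation_pvalue(1000, *candidates[name]))
print(ranking)
```

Sorting by ascending probability puts the pairs that are least explainable by chance first, which mirrors the ranking step described in the abstract under this assumed null model.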
Generation of social network user profiles and their relationship with suicidal behaviour
(Sociedad Española para el Procesamiento del Lenguaje Natural, 2024) Fernández Hernández, Jorge; Araujo Serna, M. Lourdes; Martínez Romo, Juan
Suicide is currently one of the leading causes of death worldwide, so being able to characterize people with this tendency can help prevent possible suicide attempts. In this work we have compiled a Spanish-language corpus, called SuicidAttempt, made up of users with or without explicit mentions of suicide attempts, using the messaging application Telegram. For each user, various demographic traits have been annotated semi-automatically using different systems, supervised in some cases and unsupervised in others. Finally, these traits have been analyzed, together with other linguistic features extracted from the users' messages, in an attempt to characterize different groups according to their relationship with suicidal behaviour. The results suggest that detecting these demographic and psycholinguistic traits makes it possible to characterize certain risk groups and to gain in-depth knowledge of the profiles of those who carry out such acts.
Experimentación basada en deep learning para el reconocimiento del alcance y disparadores de la negación [Deep learning-based experimentation for recognizing negation scope and triggers]
(Sociedad Española para el Procesamiento del Lenguaje Natural, 2019) Fabregat Marcos, Hermenegildo; Araujo Serna, M. Lourdes; Martínez Romo, Juan
The automatic detection of the different elements of negation is a frequent subject of study because of its high impact on various natural language processing tasks. This article presents a deep learning-based system with a language-independent architecture for the automatic detection of both negation cues (triggers) and negation scope in English and Spanish. For English, the system obtains results comparable to those obtained in recent work by more complex systems. For Spanish, the results obtained in the detection of negation cues stand out. Finally, the results for recognizing the scope of negation are similar to those obtained for English.
Anonimización de Informes Médicos [Anonymization of Medical Reports]
(Universidad Nacional de Educación a Distancia (España). Escuela Técnica Superior de Ingeniería Informática. Departamento de Inteligencia Artificial, 2021-09-15) Gaitán Rivas, José Antonio; Araujo Serna, M. Lourdes; Martínez, Raquel
With the aim of improving patient health and safety, there is growing interest in efficiently managing the content of electronic health records. These medical reports are written mainly in natural language, so they largely contain unstructured information, which makes Text Mining and NLP (Natural Language Processing) technologies essential for exploiting them. Appropriate techniques from these technologies help support clinical decision-making and facilitate drug repurposing, among many other benefits. However, clinical records containing protected health information (PHI) cannot be shared directly because of data protection restrictions on patients' private information. These records must therefore be anonymized or dissociated before they can be used externally, removing in whole or in part any information that could identify the patient. This work is based on the MEDDOCAN (Medical Document Anonymization) evaluation task, available at https://temu.bsc.es/meddocan , which is part of the IberLEF 2019 initiative and which set a challenge for the Spanish-speaking community with the goal of designing efficient systems for anonymizing medical documents written in Spanish. The MEDDOCAN task is structured into two subtasks: (1) identification and classification of entities (patient names, telephone numbers, etc.) and (2) detection of sensitive text. The official evaluation of the task therefore covers the results of both subtasks. The corpus consists of 1,000 clinical case studies, each accompanied by PHI annotations made by professionals. Of the 1,000 cases, 50% (500 cases) were reserved for training, 25% (250 cases) for development, and the remaining 25% (250 cases) for testing. Eighteen teams from 8 different nationalities took part in the challenge, and the best result, based on the F-score metric, was 0.9360 for subtask 1 ("Identification and classification of entities") and 0.9611 for subtask 2 ("Detection of sensitive text"). Throughout this work we study and compare the data provided by the task organizers and propose a system that implements a simple solution using Machine Learning and Text Mining techniques. Finally, we analyze the results obtained with this system and compare them with those of the task participants, setting out the advantages and disadvantages of the chosen architecture relative to those presented. The conclusions include a list of possible improvements or future implementations recommended to improve performance.
