http://www.lookfordiagnosis.com

Introduction

Ever since the first documents in digital format appeared, automatic text classification has been highly relevant to many researchers. The task is to assign documents automatically to predefined categories. Today it is used in many applications, such as web page classification and spam filtering. However, we believe it has not been applied in other projects to medical diagnosis based on the plain-text information that can be extracted from medical publications and patient histories.

Every year, thousands of documents are published in the "National Library of Medicine" and "National Institutes of Health" databases. Most are classified manually by assigning each document one or more categories from a predefined vocabulary known as MeSH (Medical Subject Headings). Over recent decades, much effort has gone into automating this process with machine learning techniques. The MeSH tree is a hierarchical structure of medical terms used to define the main topics that an article or publication deals with. We focus on the diseases branch, since it defines more than 4,000 diseases and makes it possible to search for documents related to each of them. We therefore propose to use a classification algorithm, extract documents from the MEDLINE database and use patients' medical histories to obtain an ordered list of diseases that can suggest possible diagnoses.

We have not used binary decisions from binary categorization methods, since they might leave out some interesting MeSH entries that should probably be taken into consideration. Instead, we have chosen a category ranking algorithm that produces an ordered list of all possible diagnoses, so that the user can make the final decision about which one best suits the patient history.

The training data and the ranking algorithm

We extracted the training data from the PubMed database by selecting every document about diseases that is written in Spanish, has an abstract and relates to humans. The documents were retrieved with the query diseases category[MAJR], where [MAJR] stands for MeSH Major Topic and tells the system to retrieve only documents whose main subject is a disease. The query returned 2,747,066 documents, which we downloaded by sending them to a file in MEDLINE format. We processed that file to obtain the titles and abstracts with their corresponding MeSH topics. This yielded 4,155 classes, each containing at least one training sample. From these we kept only the most important data, the case reports: a subset of 483,726 documents containing detailed information about individual cases of particular diseases.
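As an illustration of the processing step described above, here is a minimal sketch of how a file in MEDLINE format could be turned into titles, abstracts and MeSH major topics. It is not the code used for this project. The function names (parse_medline, major_topics) and the two sample records are made up for the example; the field tags (PMID, TI, AB, MH) and the asterisk that marks a major topic follow the MEDLINE display format.

```python
from typing import Dict, Iterator, List


def parse_medline(text: str) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record, mapping each field tag to its list of values."""
    record: Dict[str, List[str]] = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line separates two records.
            if record:
                yield record
            record, tag = {}, None
        elif line[:4].strip() and line[4:6] == "- ":
            # Start of a field, e.g. "TI  - Some title" or "PMID- 123".
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None and line.startswith(" "):
            # Indented continuation of the previous field value.
            record[tag][-1] += " " + line.strip()
    if record:
        yield record


def major_topics(record: Dict[str, List[str]]) -> List[str]:
    """Return the MeSH descriptors whose heading carries a '*' (major topic)."""
    topics = []
    for heading in record.get("MH", []):
        descriptor = heading.split("/")[0]
        if "*" in heading:
            topics.append(descriptor.lstrip("*"))
    return topics


# Two invented records, only to show the expected input layout.
SAMPLE = """\
PMID- 1
TI  - A case of asthma
      in a child.
AB  - We report a patient with wheezing.
MH  - *Asthma/diagnosis
MH  - Humans

PMID- 2
TI  - Gastric ulcer case report.
MH  - Stomach Ulcer/*complications
"""

if __name__ == "__main__":
    for rec in parse_medline(SAMPLE):
        text = " ".join(rec.get("TI", []) + rec.get("AB", []))
        print(major_topics(rec), "<-", text)
```

Each (text, topics) pair produced this way can serve as one training sample for the corresponding classes.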

To select a suitable ranking algorithm, we reviewed several decades of literature on text classification and category ranking. We chose the Sum of Weights (SOW) approach, which stands out from the rest for its simplicity, efficiency, speed, accuracy and capacity for incremental training.
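The general idea of ranking categories by a sum of weights can be sketched as follows. This is only an illustrative toy, not the SOW implementation used here: the class name, the tokenizer and, above all, the weighting formula (share of a category's documents that contain the term, divided by the number of categories the term occurs in) are assumptions chosen to keep the example short. It does show the two properties mentioned above: scoring is a plain sum over the terms of the input text, and training only increments counters, so new documents can be added at any time.

```python
import re
from collections import defaultdict
from typing import Iterable, List, Tuple


class SumOfWeightsRanker:
    """Toy category ranker: score(category) = sum of term-category weights."""

    def __init__(self) -> None:
        # counts[term][category] = training documents of `category` containing `term`
        self.counts = defaultdict(lambda: defaultdict(int))
        self.docs_per_category = defaultdict(int)

    @staticmethod
    def tokenize(text: str) -> List[str]:
        return re.findall(r"[a-z]+", text.lower())

    def train(self, text: str, categories: Iterable[str]) -> None:
        """Incremental training: only counters are updated."""
        terms = set(self.tokenize(text))
        for category in categories:
            self.docs_per_category[category] += 1
            for term in terms:
                self.counts[term][category] += 1

    def weight(self, term: str, category: str) -> float:
        """Assumed weight, for illustration only (see the text above)."""
        by_category = self.counts.get(term)
        if not by_category or category not in by_category:
            return 0.0
        share = by_category[category] / self.docs_per_category[category]
        return share / len(by_category)

    def rank(self, text: str) -> List[Tuple[str, float]]:
        """Return every matching category, best score first."""
        scores = defaultdict(float)
        for term in set(self.tokenize(text)):
            for category in self.counts.get(term, {}):
                scores[category] += self.weight(term, category)
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    ranker = SumOfWeightsRanker()
    ranker.train("wheezing and cough in a child", ["Asthma"])
    ranker.train("stomach pain and bleeding", ["Stomach Ulcer"])
    for category, score in ranker.rank("child with cough and wheezing"):
        print(f"{score:.2f}  {category}")
```

Because rank returns the whole ordered list rather than a yes/no decision per category, low-scoring entries remain visible to the user instead of being discarded.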

Managing different languages

Since most of the medical literature is written in Spanish, this is also the language used to train the algorithm. To produce a diagnosis from other languages, we first translate the symptoms entered by the user into Spanish by automatically calling a Google translation tool, and then feed its output to our classifier.
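The translate-then-classify flow just described can be sketched as a small wrapper. Everything named here is hypothetical: rank_in_any_language, the Translator and Ranker type aliases and the stub functions are placeholders for this example, and no real translation API is called. The actual translation service and classifier would be passed in as the two callables.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: (text, source_lang, target_lang) -> translated text
Translator = Callable[[str, str, str], str]
# Hypothetical signature: text -> ordered list of (MeSH entry, score)
Ranker = Callable[[str], List[Tuple[str, float]]]


def rank_in_any_language(
    symptoms: str,
    source_lang: str,
    training_lang: str,
    translate: Translator,
    rank: Ranker,
) -> List[Tuple[str, float]]:
    """Translate the symptoms into the training language, then rank them."""
    if source_lang != training_lang:
        symptoms = translate(symptoms, source_lang, training_lang)
    return rank(symptoms)


if __name__ == "__main__":
    # Stubs standing in for the real translation tool and classifier.
    def stub_translate(text: str, src: str, dst: str) -> str:
        return f"[{src}->{dst}] {text}"

    def stub_rank(text: str) -> List[Tuple[str, float]]:
        return [(text, 1.0)]

    print(rank_in_any_language("toux et fièvre", "fr", "es", stub_translate, stub_rank))
```

Keeping the translator behind a plain callable means the classifier never needs to know which language the user typed in.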

Our purpose

We aim to provide a new real-world application of category ranking algorithms: obtaining final diagnoses from clinical histories. Although the output of the categorization process should not be used directly to diagnose a disease without prior review, the accuracy achieved could be good enough to assist human experts. It may help them corroborate or choose a suitable MeSH entry among those provided automatically.

We must state clearly that the results we provide should never be taken as a substitute for medical advice.