Since the first digital documents appeared, automatic text categorization has attracted considerable research interest. It consists of automatically assigning documents to predefined classes, and it has been applied extensively to real-world tasks such as web page classification and spam filtering. However, as far as we know, it has never been used to assist medical diagnosis by drawing on the textual information in up-to-date medical literature and patient histories.

Every year, thousands of documents are added to the National Library of Medicine and the National Institutes of Health databases. Most of them are manually indexed by assigning each document to one or several entries in a controlled vocabulary called MeSH (Medical Subject Headings). Over the last decades, much effort has gone into automating this process with machine learning techniques. The MeSH tree is a hierarchical structure of medical terms used to define the main subjects of a medical article or report. We are most interested in the Diseases sub-tree, since it defines more than 4,000 diseases and makes it possible to search for documents related to each of them. We therefore propose to choose a classification algorithm, extract relevant training data from the MEDLINE database, and use several patient histories as test data to obtain a ranked list of diseases as possible diagnoses.

We have not used the binary decisions produced by binary categorization methods, since they might leave out MeSH entries that should be taken into consideration. Instead, we have chosen a category ranking algorithm that returns an ordered list of all possible diagnoses, so that the user can decide which one best fits the patient history.

The training data and the ranking algorithm

We extracted the training data from the PubMed database by selecting every document about diseases that is written in English, has an abstract, and relates to humans. The documents were retrieved with the query diseases category[MAJR], where [MAJR] stands for MeSH Major Topic and tells the system to retrieve only documents whose main subject is a disease. The query returned 2,747,066 documents, which we downloaded to a file in MEDLINE format. We processed that file to obtain the titles and abstracts with their corresponding MeSH topics. This yielded 4,155 classes, each containing at least one training sample. We then kept only the most important data, the case reports, a subset of 483,726 documents containing detailed information about individual cases of particular diseases.
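As an illustration only, the processing step described above could look roughly like the following Python sketch. It is not the code actually used for this project. It assumes the usual layout of the MEDLINE text format (a four-character tag, a hyphen, a space, then the value, with continuation lines indented by six spaces and records separated by blank lines), and it assumes that an asterisk in an MH line marks a major topic. The names parse_medline, to_sample and SAMPLE are hypothetical, and the two records are invented for the example.

```python
from collections import defaultdict


def to_sample(fields):
    """Reduce one parsed record to the pieces needed for training."""
    headings = fields.get("MH", [])
    # Assumption: an asterisk anywhere in an MH line marks a major topic,
    # e.g. "*Asthma/drug therapy" or "Migraine Disorders/*diagnosis".
    major = sorted({h.split("/")[0].lstrip("*") for h in headings if "*" in h})
    return {
        "pmid": " ".join(fields.get("PMID", [])),
        "title": " ".join(fields.get("TI", [])),
        "abstract": " ".join(fields.get("AB", [])),
        "major_mesh": major,
    }


def parse_medline(text):
    """Parse MEDLINE-format text into a list of training samples."""
    records, fields, tag = [], None, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if fields:
                records.append(fields)
            fields, tag = None, None
            continue
        if fields is None:
            fields = defaultdict(list)
        if line.startswith("      ") and tag is not None:
            # Continuation line: append to the last value of the current tag.
            fields[tag][-1] += " " + line.strip()
        elif len(line) > 5 and line[4] == "-":
            tag = line[:4].strip()
            fields[tag].append(line[6:].strip())
    if fields:
        records.append(fields)
    return [to_sample(r) for r in records]


# Two invented records, used only to exercise the parser.
SAMPLE = """PMID- 1
TI  - A case of asthma in a
      young adult.
AB  - We report a patient with wheezing.
MH  - *Asthma/drug therapy
MH  - Humans

PMID- 2
TI  - Second report.
MH  - Migraine Disorders/*diagnosis
"""

samples = parse_medline(SAMPLE)
for s in samples:
    print(s["pmid"], s["title"], s["major_mesh"])
```

Each resulting sample pairs the text of a document (title and abstract) with the MeSH entries that serve as its classes.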

To select a suitable ranking algorithm, we reviewed several decades of literature on text classification and category ranking. We chose the Sum of Weights (SOW) approach, which stands out for its simplicity, efficiency, speed, accuracy and capacity for incremental training.
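The exact SOW weighting is not given here, so the following Python sketch only illustrates the general idea behind a sum-of-weights ranker: every term of the input text adds a weight to each category it was seen with, and categories are ordered by their total. The weight formula, the class name SumOfWeightsRanker and the toy training texts are all assumptions made for this example, not the scheme actually used.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN.findall(text.lower())


class SumOfWeightsRanker:
    """Toy category ranker: score(category) = sum of term-category weights."""

    def __init__(self):
        self.term_cat = defaultdict(Counter)  # term -> {category: document count}
        self.cat_docs = Counter()             # category -> number of training documents

    def add(self, text, categories):
        """Incremental training: only counters are updated, nothing is refit."""
        terms = set(tokenize(text))
        for cat in categories:
            self.cat_docs[cat] += 1
            for term in terms:
                self.term_cat[term][cat] += 1

    def rank(self, text, top=10):
        """Return up to `top` (category, score) pairs, best first."""
        scores = Counter()
        n_cats = len(self.cat_docs)
        for term in set(tokenize(text)):
            cats = self.term_cat.get(term)
            if not cats:
                continue
            # Hypothetical weight: share of the category's documents that
            # contain the term, boosted when the term occurs in few categories.
            spread = math.log(1 + n_cats / len(cats))
            for cat, n in cats.items():
                scores[cat] += (n / self.cat_docs[cat]) * spread
        return scores.most_common(top)


ranker = SumOfWeightsRanker()
ranker.add("persistent cough with wheezing and shortness of breath", ["Asthma"])
ranker.add("severe headache with nausea and sensitivity to light", ["Migraine Disorders"])
ranker.add("wheezing at night relieved by an inhaler", ["Asthma"])

ranking = ranker.rank("patient reports wheezing and cough")
for category, score in ranking:
    print(f"{category}: {score:.3f}")
```

Because training only increments counters, new documents can be added at any time without reprocessing the earlier ones, which is one way to obtain the incremental training property mentioned above.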

Managing different languages

Since most medical literature is written in English, English is also the language used to train the algorithm. To produce diagnoses from other languages, we first translate the symptoms entered by the user into English by automatically calling a Google Translation Tool, and then feed its output to our classifier.
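The translate-then-classify flow can be sketched in a few lines of Python. The translator is passed in as a plain function so that no particular translation API is assumed; to_english and fake_translate are hypothetical names, and fake_translate is only a stand-in with a three-word glossary, not a call to any real service.

```python
def to_english(text, source_lang, translate):
    """Return English text, calling the external translator only when needed."""
    if source_lang.lower().startswith("en"):
        return text
    return translate(text, source_lang, "en")


def fake_translate(text, source, target):
    # Stand-in for a real translation service: a tiny word-for-word glossary.
    glossary = {"tos": "cough", "fiebre": "fever", "y": "and"}
    return " ".join(glossary.get(word, word) for word in text.lower().split())


english = to_english("tos y fiebre", "es", fake_translate)
print(english)  # this English text is what would be handed to the classifier
```

Keeping the translator behind a function argument means the classifier only ever sees English text, whichever service produces it.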

Our purpose

We aim to provide a new real-world application for category ranking algorithms: reaching final diagnoses from clinical histories. Although the output of the categorization process should not be used to diagnose a disease without prior review, the accuracy achieved could be good enough to assist human experts. It may help them to corroborate a diagnosis or to choose a suitable MeSH entry among those provided automatically.

We must stress that the results we provide should never be taken as a substitute for medical advice.

