Study of Effect of Drug Lexicons on Medication Extraction from Electronic Medical Records

E. Sirohi and P. Peissig
Marshfield Clinic Research Foundation, 1000 N. Oak Ave, Marshfield, WI 54449, USA
sirohi.ekta@marshfieldclinic.org
Pacific Symposium on Biocomputing 10:308-318 (2005)

Extraction of relevant information from free-text clinical notes is becoming increasingly important in healthcare to provide personalized care to patients. The purpose of this dictionary-based NLP study was to determine the effects of using varying drug lexicons to automatically extract medication information from electronic medical records. A convenience training sample of 52 documents, each containing at least one medication, and a randomized test sample of 100 documents were used in this study. The training and test set documents contained a total of 681 and 641 medications, respectively. Three drug lexicons were used as sources for medication extraction: the first containing drug names and generic names; the second containing drug, generic and short names; and the third containing drug, generic and short names combined with filtering techniques. Extraction with the first drug lexicon resulted in 83.7% sensitivity and 96.2% specificity for the training set, and 85.2% sensitivity and 96.9% specificity for the test set. Adding the list of short names used for drugs increased sensitivity to 95.0% but decreased specificity to 79.2% for the training set; similar results (96.4% sensitivity, 80.1% specificity) were obtained for the test set. Combining a set of filtering techniques with the second lexicon increased specificity to 98.5% and 98.8% for the training and test sets, respectively, while slightly decreasing sensitivity to 94.1% (training) and 95.8% (test). Overall, the lexicon with filtering yielded the highest precision, i.e., it extracted a high proportion of true medications while keeping the number of extracted non-medications low.

1 Introduction

With the widespread use of computers in the healthcare domain, a large array of data, coded as well as free-text, is being stored digitally. Coded data can be easily interpreted by computer applications, but free-text data poses a number of challenges [1]. Manual information extraction can be tedious, and differences in style among providers mean that documents can vary widely. Added to this is the sheer volume of clinical data that must be processed. Natural Language Processing (NLP) has shown promising results in addressing this problem by extracting and structuring text-based biomedical or clinical information.

To discover knowledge from free text, researchers have been exploring NLP systems to facilitate Information Extraction (IE) and text mining. IE techniques allow users to automatically extract pre-defined information from free-text documents. Most NLP systems identify terms in free text by matching them against entries from a lexicon [1]. Such dictionary-based entity name recognition studies extract information by searching for the term in the dictionary that is most similar (or identical) to the target term; a minimal sketch of this strategy is given below. These extraction strategies have been used in other biological domains for extracting protein and gene names from biomedical literature [2-4]. The goal is to extract from the document salient facts about pre-specified types of events, entities or relationships.
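To make the dictionary-based strategy concrete, the following is a minimal, hypothetical Python sketch of longest-match lexicon lookup over a tokenized note. It is not the system used in this study; the lexicon entries and the sample sentence are illustrative only.

```python
# Minimal sketch of dictionary-based term extraction (illustrative, not FreePharma).
# The lexicon entries and the sample note below are made up for this example.
import re

lexicon = {"metoprolol tartrate", "metoprolol", "nitroglycerin", "potassium"}
max_len = max(len(entry.split()) for entry in lexicon)

def extract_terms(text, lexicon, max_len):
    """Return lexicon terms found in text, preferring the longest match."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    found, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                match, i = phrase, i + n
                break
        if match:
            found.append(match)
        else:
            i += 1
    return found

note = "Patient remains on Metoprolol Tartrate 50 mg and nitroglycerin prn."
print(extract_terms(note, lexicon, max_len))
# -> ['metoprolol tartrate', 'nitroglycerin']
```

A production system must additionally handle spelling variants, context and dosage information, which is what motivates the lexicon quality issues examined in this study.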
These facts are then usually used to populate clinical databases, which may then be used to analyze the data for trends. IE projects are currently being designed worldwide to summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results and therapeutic treatments. Such systems can be used to assist health care providers with quality assurance studies, to support provider needs, or simply to provide improved quality of service to patients.

2 Background and Related Work

It is a well-established fact that different patients respond in different ways to the same medication. Previous studies have shown that genetics can account for 20 to 95 percent of variability in drug disposition and effects. While various non-genetic factors like age, organ function and drug interactions can affect the response to medications, there are numerous cases in which differences in drug response have been attributed to genetic variations in genes encoding drug-metabolizing enzymes, drug transporters or drug targets [5]. Clinical observations of inherited differences in drug effects gave rise to the field of pharmacogenetics, which is the study of the hereditary basis for differences in a population's response to a drug [6]. While pharmacogenetics primarily focuses on the sequence variations in candidate genes suspected of affecting drug response, pharmacogenomics focuses on evaluation of the entire genome, and the two terms are often used interchangeably [7]. Two main technologies are used to study the effect of genetic variations as a basis for differences in drug response: genotyping and phenotyping. Genotyping is the study of the genetic variations, while phenotyping is the study of observable physiological or biochemical measures.

The Marshfield Clinic Personalized Medicine Research Project (PMRP) (http://www.mfldclin.edu/pmrp) is an initiative to facilitate research in pharmacogenomics, epidemiology and population genetics. The primary goal of the project is to help researchers learn more about how genetic alterations cause diseases, how to use an individual's genetic information to predict which diseases he or she is likely to develop, and which medications work best for a particular person. One of the objectives of PMRP is to develop a framework that can link genotype data with identified phenotypes.

The Marshfield Clinic is a fully integrated health care system that provides primary, secondary and tertiary care to patients living in Central and Northern Wisconsin. The Clinic has an integrated computerized system for automating the financial, practice management, clinical and real-time decision support processes, and supports an electronic medical record that routinely captures clinical data such as laboratory results, diagnoses, procedures, immunizations and vitals. Medication information is currently stored as text in electronic clinical documents dating back to 1991. Although the medication information found in clinical notes is useful for patient care, it is not coded and has limited utility for research and computerized phenotyping. The amount of manual record abstraction needed to conduct pharmacogenomics studies is reduced significantly if coded medication information is available. Medication inventory and prescription systems are being deployed to capture coded medication information during a clinical visit.
The current study is part of the initiative that aims to capture or extract medication information from historical clinical documents and convert it into a coded format that can be used for research. Automatic extraction of medications from free-text documents requires the use of NLP systems. However, this application faces several challenges because 1) new drugs are continually being created and older ones are renamed, 2) drug names are synonymous with other drug names, 3) drug names or their synonyms often have the same name as an English word, such as the drugs Because (a contraceptive) and Duration (nasal spray), and 4) terms in free text may be ambiguous and resolve to multiple senses, depending on the context in which they are used. Some of these challenges can be at least partially overcome by using good drug lexicons.

A lexicon is a list of all the words used in a particular language or subject. In the current context, a drug lexicon contains a list of all the medications that we would like to extract from the text documents. The lexicon-based approach is similar to previous dictionary-based studies that have extracted gene and protein names from literature [2-5, 8], extracted drug names and relationships from cancer literature [9], and studied indexing of entire documents using special lexicons like the UMLS [10]. However, our study differs from these works in that we focus on the automatic extraction of medication items from clinical documents for phenotypic development, and on the issues affecting extraction performance, such as the quality of the drug lexicon. We could not find any previous reports of medication extraction from clinical data.

FreePharma® (Language & Computing; http://www.landc.be) is a software product that can automatically capture and structure medication information expressed in free-text natural language and link this information to existing drug databases. FreePharma® generates a structured XML representation of medication information derived from free-text documents, which can then be stored in databases for integration with host applications [11]. This product requires as input a drug lexicon containing a list of all medication items that must be extracted. This input lexicon is the largest factor in determining the extent and accuracy of extraction. Marshfield Clinic uses the industry's most widely used source of up-to-date drug information, First DataBank's National Drug Data File (NDDF) Plus™, which delivers descriptive, pricing and clinical information on drugs, encompassing every drug approved by the Food and Drug Administration (FDA) and over-the-counter drugs, plus information on herbals, nutraceuticals and dietary supplements [12]. We used data from this database to create the drug lexicons. The purpose of the current study was to determine the best drug lexicon to use with FreePharma® to maximize extraction of prescription items while reducing the extraction of non-medication items.

3 Methods

A sample of 52 documents was selected for this study by varying patients, service dates and providers to maximize the differences arising from varying styles of dictation and transcription. This convenience sample contained only clinic office visit notes and had been used in a pilot project prior to undertaking this study. It served as the training set because the lexicon filtering techniques were refined based on results from this dataset. Each of the documents in this set contained at least one medication, and this was verified by manual review.
The manual review also yielded a list of 681 items (285 unique items) from the documents that were considered to be "true" medications. The documents were independently reviewed by a second reviewer to ensure that none of the valid medications were missed. These medications were collected from all sections of the documents, not just the discharge summary or medical history sections. The total number of terms in the documents was 28496, including the medication items.

Based on the First DataBank NDDF data source, Marshfield Clinic's drugs database provides a drug_name, short_name and generic_name for each drug, in addition to information such as the American Hospital Formulary Service (AHFS) classification. We used these columns to create the drug lexicons. Three drug lexicons were prepared:
1. Lexicon A: unique terms from the drug_name and generic_name columns of the drugs database. Term count = 25907
2. Lexicon B: unique terms from the drug_name, generic_name and short_name columns of the drugs database. Term count = 29333
3. Lexicon C: unique terms from the drug_name, generic_name and short_name columns of the drugs database, followed by removal of items that met the filtering criteria in Table 1. Term count = 22345

Table 1. Filtering criteria for restricting non-medication terms in the drug lexicon
- Terms where the AHFS classification is 'Devices' or 'Dental Supplies'
- Terms where the generic_name is 'Organ Concentrates' or 'Homeopathic drugs'
- Terms that contain only numerical values (such as '1', '3', etc.)
- Terms that were ambiguous with general English words

Lexicon C was created by applying the filtering criteria (Table 1) to lexicon B in two steps. First, we applied just the first three criteria to lexicon B to produce an interim version of lexicon C. These criteria were developed after a careful manual examination of results and terms from the use of the first lexicon. Second, to identify drug names that were ambiguous with general English, we used a list of English words obtained from the SCOWL collection at SourceForge's wordlist website (http://wordlist.sourceforge.net/). This collection contains various wordlists, and we analyzed their effectiveness for our purposes. A wordlist was considered relevant for use if it contained the most frequently used words in general English and did not contain a high number of medication items. Based on this analysis, we combined the size 10, 20 and 35 "small" English lists to create a final list of 41,769 words. This wordlist was then compared with the interim version of lexicon C to find the set of terms common to both the lexicon and the wordlist. The set of 1170 common terms was then manually reviewed by a pharmacist to determine whether the terms should actually be removed from the drug lexicon. The pharmacist identified 21 items in the set of common terms that should remain in the drug lexicon, and the remaining 1149 were approved for removal from lexicon C.

[Figure 1. The process of medication extraction from clinical documents: clinical documents and a drug lexicon derived from the drugs database are input to FreePharma, whose XML output is processed by an extraction program to produce the extracted medications.]

We performed three independent runs with FreePharma®, using a different drug lexicon in each, to extract medications from the documents. The XML outputs from all runs were analyzed by an automated program to extract medication names, which were then compared with the original medications from the 52 documents. The document extraction process is outlined in Figure 1; a sketch of the lexicon construction is given below.
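The lexicon-building and filtering procedure described above can be outlined in a short sketch. This is a hypothetical Python outline under stated assumptions, not the code used in the study: the column names mirror the drugs database fields mentioned in the text, english_words stands in for the combined SCOWL wordlist, keep_anyway for the 21 pharmacist-approved terms, and the CSV file path is a placeholder.

```python
# Hypothetical sketch of building lexicons A, B and C (not the study's actual code).
# Column names mirror the drugs database fields described in the text.
import csv

EXCLUDED_AHFS = {"Devices", "Dental Supplies"}
EXCLUDED_GENERIC = {"Organ Concentrates", "Homeopathic drugs"}

def build_lexicons(drugs_csv_path, english_words, keep_anyway):
    with open(drugs_csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Lexicon A: drug_name + generic_name; Lexicon B: A + short_name.
    lexicon_a, lexicon_b = set(), set()
    for row in rows:
        lexicon_a.update(t for t in (row["drug_name"], row["generic_name"]) if t)
        lexicon_b.update(t for t in (row["drug_name"], row["generic_name"],
                                     row["short_name"]) if t)

    # Interim lexicon C: apply the first three Table 1 criteria
    # (devices/dental supplies, organ concentrates/homeopathic, purely numeric terms).
    lexicon_c = set()
    for row in rows:
        if row["AHFS_Category_desc"] in EXCLUDED_AHFS:
            continue
        if row["generic_name"] in EXCLUDED_GENERIC:
            continue
        for term in (row["drug_name"], row["generic_name"], row["short_name"]):
            if term and not term.strip().isdigit():
                lexicon_c.add(term)

    # Final criterion: drop terms ambiguous with general English words,
    # except those a pharmacist approved to keep.
    ambiguous = {t for t in lexicon_c if t.lower() in english_words} - set(keep_anyway)
    return lexicon_a, lexicon_b, lexicon_c - ambiguous
```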
Extraction results were summarized using the measures sensitivity, specificity and precision. Sensitivity (also known as recall in IE studies) is the likelihood of retrieving medication items, i.e., the proportion of true medications in the documents that were extracted. Specificity is the true negative rate and measures how many of the non-medication terms were correctly not extracted. Another measure used frequently in IE studies is precision. Precision measures the proportion of true medications among all terms that were extracted. This is a particularly important measure for our study, since our goal is to maximize precision, i.e., to extract the maximum number of medications while reducing the number of non-medication terms that are extracted.

We are currently in the production phase of medication extraction at the Clinic and, based on results from an internal study, we process only those document types that have a high likelihood of containing medications. We also perform weekly quality checks to eliminate or add terms to the drug lexicon. To assess the impact of lexicon filtering on an independent document set, we processed a randomized test set of 100 documents, selected from over 150,000 documents of 103 different document types that were processed with FreePharma® over a period of two weeks. Therefore, unlike the training set, this test set contained a larger variety of document types. Of these 100 documents, 21 contained no medications and the remaining 79 contained a total of 641 medication terms (266 unique items). The number of non-medication terms in the entire test set was 41751. The medication items in this test set were independently reviewed and verified. We ran FreePharma® with each of the three lexicons separately to extract medications from the test set. The extracted results were analyzed as in the training set.

4 Results

The total number of documents used for the training set was 52, each of which contained at least one medication. There were a total of 28496 terms in these documents, of which 681 were medications. Tables 2a, 2b and 2c contain results for the extraction from the training set documents. The data in Table 2a reveal that using only drug_name and generic_name yielded low sensitivity even though the specificity was fairly high. Providers often use short names for drugs (for example, Nitro for Nitroglycerin, Metoprolol for Metoprolol Tartrate) in patient notes for commonly used drugs. However, the first drug lexicon contained only brand names and generic names of drugs and therefore missed all references to short names of drugs. This was a major factor in yielding low sensitivity. Adding short names to lexicon A (Table 2b) increased the sensitivity of extraction but lowered the specificity and precision significantly. Inclusion of drug short names extracted 77 more medications than lexicon A. However, the short names list also added many terms that were ambiguous with English. This ambiguity resulted in extraction of a large number of terms that were not intended as medications in the patient documents, leading to low specificity and an even lower precision.

Table 2a. Training set results for medication extraction with Lexicon A (drug_name + generic_name). Sensitivity = 83.7% (570/681), Specificity = 96.2% (26782/27815), Precision = 35.6% (570/1603)
                              Medication items   Non-medication items   Total count
Extracted by IE process                    570                   1033          1603
Not extracted by IE process                111                  26782         26893
Total count                                681                  27815         28496
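For reference, the quoted measures can be recomputed directly from the 2x2 counts in each table. The following is a small illustrative Python check using the Table 2a counts; it is not part of the study's extraction pipeline.

```python
# Recompute sensitivity, specificity and precision from a 2x2 contingency table.
# Counts below are those reported in Table 2a (training set, Lexicon A).
def ie_measures(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)  # true medications that were extracted (recall)
    specificity = tn / (tn + fp)  # non-medications correctly not extracted
    precision = tp / (tp + fp)    # extracted terms that were true medications
    return sensitivity, specificity, precision

sens, spec, prec = ie_measures(tp=570, fp=1033, fn=111, tn=26782)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}, precision={prec:.1%}")
# Matches the values quoted in Table 2a (up to rounding).
```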
Table 2b. Training set results for medication extraction with Lexicon B (drug_name + generic_name + short_name). Sensitivity = 95.0% (647/681), Specificity = 79.2% (22033/27815), Precision = 10.1% (647/6429)
                              Medication items   Non-medication items   Total count
Extracted by IE process                    647                   5782          6429
Not extracted by IE process                 34                  22033         22067
Total count                                681                  27815         28496

Table 2c. Training set results for medication extraction with Lexicon C (drug_name + generic_name + short_name + filtering). Sensitivity = 94.1% (641/681), Specificity = 98.5% (27400/27815), Precision = 60.7% (641/1056)
                              Medication items   Non-medication items   Total count
Extracted by IE process                    641                    415          1056
Not extracted by IE process                 40                  27400         27440
Total count                                681                  27815         28496

Table 3a. Test set results for medication extraction with Lexicon A (drug_name + generic_name). Sensitivity = 85.2% (546/641), Specificity = 96.9% (40444/41751), Precision = 29.5% (546/1853)
                              Medication items   Non-medication items   Total count
Extracted by IE process                    546                   1307          1853
Not extracted by IE process                 95                  40444         40539
Total count                                641                  41751         42392

Table 3b. Test set results for medication extraction with Lexicon B (drug_name + generic_name + short_name). Sensitivity = 96.4% (618/641), Specificity = 80.1% (33431/41751), Precision = 6.9% (618/8938)
                              Medication items   Non-medication items   Total count
Extracted by IE process                    618                   8320          8938
Not extracted by IE process                 23                  33431         33454
Total count                                641                  41751         42392

Table 3c. Test set results for medication extraction with Lexicon C (drug_name + generic_name + short_name + filtering). Sensitivity = 95.8% (614/641), Specificity = 98.8% (41240/41751), Precision = 54.6% (614/1125)
                              Medication items   Non-medication items   Total count
Extracted by IE process                    614                    511          1125
Not extracted by IE process                 27                  41240         41267
Total count                                641                  41751         42392

Filtering the lexicon to reduce ambiguous non-medication items yielded high sensitivity and specificity (Table 2c), thus also increasing the corresponding precision value. Eliminating ambiguity with English words and non-medication agents (filtering criteria in Table 1) created a lexicon with fewer non-medication items. The filtering reduced the number of non-medication items that were extracted from the documents, leading to a much higher precision for extraction using lexicon C.

The total number of documents used for the test set was 100, of which only 79 contained at least one medication. There were a total of 42392 terms in these documents, of which 641 were medications. Results for the test set and the corresponding precision, specificity and sensitivity values are in Tables 3a, 3b and 3c. The test set results follow a pattern similar to that seen in the training set results. Lexicon A yielded a low sensitivity due to the absence of short names (Table 3a). Inclusion of short names (lexicon B) resulted in a higher sensitivity but decreased the specificity and precision (Table 3b). The filtered lexicon yielded both high specificity and precision (Table 3c).

5 Discussion

This study reveals some of the challenges in using drug lexicons to automatically extract medications from electronic medical records. Using existing drug sources without any attempt to remove non-specific or ambiguous terms will most likely extract many terms that are non-medication items. Such anomalies can only be revealed by a manual review of results to ascertain the quality of extraction.
However, in the current study we have shown that identifying and filtering out non-medication items from the drug lexicon significantly enhances the specificity and precision of the results.

Table 4. Extraction results summarizing the IE measures sensitivity, specificity and precision for each of the three lexicons. Lexicon A = drug_name + generic_name; Lexicon B = drug_name + generic_name + short_name; Lexicon C = Lexicon B + filtering techniques. The highest value for each measure is marked with an asterisk (*) for both the training and test sets.
              Training set                        Test set
              Lexicon A   Lexicon B   Lexicon C   Lexicon A   Lexicon B   Lexicon C
Sensitivity   83.7%       95.0%*      94.1%       85.2%       96.4%*      95.8%
Specificity   96.2%       79.2%       98.5%*      96.9%       80.1%       98.8%*
Precision     35.6%       10.1%       60.7%*      29.5%       6.9%        54.6%*

The lack of drug short names in lexicon A resulted in a low sensitivity (recall) of only 83.7% in the training set and 85.2% in the test set (Table 4). When short names were added for lexicon B, the results showed an increase in recall to 95.0% (training set) and 96.4% (test set). This indicates that the use of short names is fairly important for defining a drug lexicon. However, short name inclusion in lexicon B also lowered the specificity of the results. In the training set, 5782 non-medication items were also extracted when short names were added to the lexicon. This reduced the precision to only 10.1%, i.e., only 10.1% of all extracted terms were real medications. Similarly, for the test set, 8320 non-medication items were extracted with the lexicon containing short names (Table 3b), resulting in a lower precision of only 6.9%.

An analysis of these non-medication items revealed that several were ambiguous with words in English (Table 5). Efforts to eliminate these ambiguous terms resulted in the definition of the filtering criteria in Table 1. Use of these criteria yielded a smaller drug lexicon, lexicon C, containing only 22345 terms (compared to 29333 after inclusion of short names in lexicon B), a 23.8% reduction in lexicon size. Extraction from the training set with this lexicon yielded a high recall of 94.1%, a much higher specificity of 98.5% and the highest precision among the lexicons, 60.7%. Similarly, test set results with this lexicon yielded a high recall of 95.8%, a much higher specificity of 98.8% and the highest precision among the lexicons, 54.6%. However, results with lexicon C (Tables 2c, 3c) showed a slightly lower recall than the results with lexicon B (Tables 2b, 3b). This is because terms like iron, influenza and tetanus were removed from lexicon C during filtering, so they were not extracted as medications from expressions such as "liquid iron", "tetanus shots" and "influenza vaccine". These occurrences account for the difference in recall values between results from lexicons B and C.

Table 5. Examples of terms in drug lexicon B that contribute to ambiguity with the English language. The last four columns indicate the data as present in the First DataBank drug database.
Term in drug lexicon   Generic_name                    Drug_name           Drug_short_name   AHFS_Category_desc
The                    Lecith/Pyridox HCL/I2/Cider     The Eliminator      The               Miscellaneous Therapeutic Agents
Benefit                Nutritional Supplement          Benefit             Benefit           Electrolytic, Caloric and Water Balance
Control                Incontinence Pad, Liner, Disp   Control Pads        Control           Devices
Pain                   Acetaminophen                   Pain Reliever       Pain              Central Nervous System Agents
Sleep                  Diphenhydramine HCL             Sleep Aid Formula   Sleep             Antihistamine Drugs

Another source of missed extractions is the absence of medications from the source data from which the lexicons were constructed. In our analysis of results, we found many medications in the original documents that were not present in the First DataBank database and were therefore not incorporated into any of the lexicons and consequently not extracted. Some of these were misspelled versions of existing medications, for example, "cyclosporin" for cyclosporine and "losartin" for losartan; some others, like ASA (a common short name for Acetyl Salicylic Acid), were simply missing from the database. To improve the drug lexicon further, we would need to add or remove such terms to maximize the values of our measures. Since we are already in the production phase of the project, this is an essential step in our quality check procedure and involves active participation by pharmacists, who evaluate such terms for inclusion in or removal from the lexicon.

One of the major goals of any IE study or application is to maximize precision, i.e., to maximize the true positives while minimizing the false positives. For our study, this translates to applying additional filters to the drug lexicons or defining ways of filtering the extracted results based on other criteria, such as section headers. For example, our filtered lexicon contains the chemical term potassium. But potassium can be a laboratory test item, as in "His potassium became elevated", or a drug item, as in "Potassium 10 mEq. three tablets q.a.m." [8]. In the current set of results, all occurrences of potassium are extracted, while only some of them are actual medications. To distinguish between the two occurrences of potassium, we can extract information about the sections of the document in which they occur. Once we have that information, we can define the context in which the terms were used. In the example above, potassium from the laboratory test item would then not be extracted because it occurred in the context of a laboratory test and not a medication. This context filtering would increase the precision values.

One of the major limitations of this study is that only one of the commercially available drug sources was used to evaluate the lexicon impact on medication extraction. However, we believe that the techniques used in this study can be applied to other sources to achieve similar results. Other limitations include, but are not limited to, the use of a convenience sample as a training set and the use of only a fixed set of filtering criteria to evaluate extraction precision.

6 Conclusions

Medication extraction is crucial for the development of phenotypes and other pharmacogenomics studies. Drug dictionaries or lexicons are invaluable resources for extracting medication names from free-text documents. We have shown that drug lexicons can be used for medication extraction from clinical documents and that the precision and recall values for such studies can be considerably enhanced by defining filtering criteria to refine the drug lexicons.
Future enhancements to the drug lexicon, such as specific additions and removals and section-based filtering, would further increase the accuracy of results obtained from clinical documents.

Acknowledgements

This project was funded in part by a grant from the Marshfield Clinic. The authors acknowledge the contributions of Julie Stangl for data analysis, Peter Welch and Cynthia Motszko for data management, and Dr. Russell Wilke and Dr. Gary Plank for assistance with validating removal of wordlist items from the drug lexicon.

References

1. H. Liu, S.B. Johnson, C. Friedman, "Automatic Resolution of Ambiguous Terms Based on Machine Learning and Conceptual Relations in the UMLS", J. Am. Med. Inform. Assoc. 9(6):621-636 (2002)
2. Y. Tsuruoka, J. Tsujii, "Boosting Precision and Recall of Dictionary-based Protein Name Recognition", Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, 41-48 (2003)
3. A. Koike, T. Takagi, "Gene/Protein/Family Name Recognition in Biomedical Literature", HLT-NAACL 2004 Workshop: BioLink 2004 - Linking Biological Literature, Ontologies and Databases, 9-16 (2004)
4. L. Tanabe, W.J. Wilbur, "Tagging Gene and Protein Names in Biomedical Text", Bioinformatics 18(8):1124-1132 (2002)
5. W.E. Evans, H.L. McLeod, "Pharmacogenomics - Drug Disposition, Drug Targets and Side Effects", N. Engl. J. Med. 348(6):538-549 (2003)
6. D. Cooper, "What is Pharmacogenetics?", Pharmacogenetics in Patient Care conference, American Association for Clinical Chemistry, Nov. 6, 1998
7. L. Mancinelli, M. Cronin, W. Sadee, "Pharmacogenomics: the promise of personalized medicine", AAPS PharmSci 2:30-37 (2000)
8. O. Tuason, L. Chen, H. Liu, J.A. Blake, C. Friedman, "Biological Nomenclatures: A Source of Lexical Knowledge and Ambiguity", Pacific Symposium on Biocomputing 9:238-249 (2004)
9. T.C. Rindflesch, L. Tanabe, J.N. Weinstein, L. Hunter, "EDGAR: Extraction of Drugs, Genes and Relations from the Biomedical Literature", Pacific Symposium on Biocomputing 5:514-525 (2000)
10. S.L. Achour, M. Dojat, C. Rieux, P. Bieling, E. Lepage, "A UMLS-based Knowledge Acquisition Tool for Rule-based Clinical Decision Support System Development", J. Am. Med. Inform. Assoc. 8(4):351-360 (2001)
11. "TeSSI® for Healthcare and Life Sciences", Language & Computing (http://www.landc.be) (2002)
12. "NDDF Plus™ Documentation Feb04", First DataBank (2004)