Textractor (Clinical information extraction framework)

Textractor is built on the Apache Unstructured Information Management Architecture (UIMA) framework and uses hybrid methods that combine machine learning with pattern matching. Earlier versions of our information extraction applications relied on MMTx (an application developed by the NLM for indexing UMLS Metathesaurus concepts in biomedical text) to detect biomedical concepts in clinical text. For several reasons, we are developing a new concept extraction module based on Lucene. Early versions of this new module, called LULU, have already been evaluated and proved significantly faster than MMTx, with better accuracy. The new module is also more flexible and can be deployed on multiple machines for compute-intensive tasks.
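To illustrate the general idea of index-based concept lookup, here is a minimal Python sketch. It is not the LULU implementation and does not use Lucene: the concept IDs, the terms, and the function names (tokenize, build_index, find_concepts) are all hypothetical, chosen only to show how an inverted index narrows the candidates before an exact phrase check.

```python
import re
from collections import defaultdict

# Hypothetical concept dictionary; the IDs and terms are illustrative only
# and are not taken from the UMLS Metathesaurus.
CONCEPTS = {
    "X001": "obesity",
    "X002": "diabetes mellitus",
    "X003": "sleep apnea",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(concepts):
    """Map each token to the set of concept IDs whose term contains it."""
    index = defaultdict(set)
    for concept_id, term in concepts.items():
        for token in tokenize(term):
            index[token].add(concept_id)
    return index


def find_concepts(text, concepts, index):
    """Return sorted IDs of concepts whose full term occurs in the text."""
    tokens = tokenize(text)
    # Step 1: the index yields candidates sharing at least one token.
    candidates = set()
    for token in tokens:
        candidates |= index.get(token, set())
    # Step 2: keep candidates whose term appears as a contiguous token run.
    found = []
    for concept_id in sorted(candidates):
        term_tokens = tokenize(concepts[concept_id])
        n = len(term_tokens)
        if any(tokens[i:i + n] == term_tokens
               for i in range(len(tokens) - n + 1)):
            found.append(concept_id)
    return found


INDEX = build_index(CONCEPTS)
```

Calling find_concepts("History of diabetes mellitus and obesity.", CONCEPTS, INDEX) returns the IDs for both terms, while a sentence mentioning only "diabetes insipidus" returns nothing because the full term is not present.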

Textractor was adapted to participate in three international NLP challenges organized by the i2b2 NCBC:

2008 i2b2 NLP challenge: This challenge focused on identifying patients with obesity and/or some of its most common comorbidities. Both textual mentions and intuitive mentions of these diseases were annotated, and our development focused on the latter.

2009 i2b2 NLP challenge: The general objective of the i2b2 medication extraction challenge was to extract the list of medications found in patient clinical documents, along with attributes of these medications (dosage, route, frequency, duration, and reason(s) for the prescription).

2010 i2b2/VA NLP challenge: For this challenge, a team with several collaborators from the University of Utah and the Salt Lake City VA Medical Center worked on the three “sub-challenges”: concept extraction (problems, tests, and treatments), assertion analysis, and relation analysis. We focused on the second and third, using methods based on machine learning (multi-class SVM classifiers).
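The medication attributes targeted in the 2009 challenge (dosage, route, frequency) lend themselves to pattern matching. The following Python sketch shows one possible way to do this with regular expressions. The patterns, the small vocabularies, and the function name extract_attributes are assumptions made for illustration; they are not Textractor's actual rules, and duration and reason are left out because they usually need more than simple patterns.

```python
import re

# Hypothetical, deliberately small patterns; a real system would need
# far larger vocabularies and handling of context.
DOSAGE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml|units?)\b", re.IGNORECASE)
ROUTE = re.compile(r"\b(?:po|iv|im|sc|orally|intravenously|topically)\b", re.IGNORECASE)
FREQUENCY = re.compile(r"\b(?:daily|bid|tid|qid|qhs|prn|every\s+\d+\s+hours?)\b", re.IGNORECASE)


def extract_attributes(sentence):
    """Return the first dosage, route, and frequency found, or None for each."""
    def first(pattern):
        match = pattern.search(sentence)
        return match.group(0) if match else None

    return {
        "dosage": first(DOSAGE),
        "route": first(ROUTE),
        "frequency": first(FREQUENCY),
    }
```

For the sentence "Metformin 500 mg po bid for diabetes", this sketch yields "500 mg", "po", and "bid". Linking each attribute to the right medication name, which the challenge also required, is not attempted here.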



  • Kim, Y., Riloff, E., & Meystre, S. M. (2011). Improving classification of medical assertions in clinical notes (Vol. 2, pp. 311–316). Presented at the Proceedings of the 49th Annual Meeting
  • Meystre, S. M., Thibault, J., Shen, S., Hurdle, J. F., & South, B. R. (2010). Textractor: a hybrid system for medications and reason for their prescription extraction from clinical text documents. Journal of the American Medical Informatics Association : JAMIA, 17(5), 559–562. 
  • Meystre, S. M., Thibault, J., Shen, S., Hurdle, J. F., & South, B. R. (2010). Automatically detecting medications and the reason for their prescription in clinical narrative text documents. Studies in Health Technology and Informatics, 160(Pt 2), 944–948.
  • Meystre, S. (2009). Detecting Intuitive Mentions of Diseases in Narrative Clinical Text. Artificial Intelligence in Medicine, LNAI 5651, 216–224.
  • Meystre, S., Thibault, J., Shen, S., Hurdle, J. F., & South, B. (2009). Description of the Textractor System for Medications and Reason for their Prescription Extraction from Clinical Narrative Text Documents. i2b2 Medication Extraction Challenge Workshop.