Dynamic categorization of clinical research eligibility criteria by hierarchical clustering. J Biomed Inform 2011 Dec;44(6):927-35
Date
06/22/2011
Pubmed ID
21689783
Pubmed Central ID
PMC3183114
DOI
10.1016/j.jbi.2011.06.001
Scopus ID
2-s2.0-84855953110
42 Citations
Abstract
OBJECTIVE: To semi-automatically induce semantic categories of eligibility criteria from text and to automatically classify eligibility criteria based on their semantic similarity.
DESIGN: The UMLS semantic types and a set of previously developed semantic preference rules were utilized to create an unambiguous semantic feature representation to induce eligibility criteria categories through hierarchical clustering and to train supervised classifiers.
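The clustering step in the design can be illustrated with a minimal sketch. It assumes each criterion has already been reduced to a set of UMLS-style semantic types, and it uses Jaccard distance with average linkage. The abstract does not state which distance or linkage the authors used, so both are assumptions, as are the example criteria and the function names.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of semantic-type labels."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(c1, c2, features):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(jaccard_distance(features[i], features[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_cluster(features, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], features),
        )
        # combinations() guarantees a < b, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical criteria, each represented by illustrative semantic types.
criteria_features = [
    {"Disease or Syndrome"},                      # e.g. "history of diabetes"
    {"Disease or Syndrome", "Finding"},           # e.g. "diagnosed hypertension"
    {"Pharmacologic Substance"},                  # e.g. "currently taking warfarin"
    {"Pharmacologic Substance", "Therapeutic or Preventive Procedure"},
]

clusters = agglomerative_cluster(criteria_features, n_clusters=2)
print(clusters)
```

In this toy input the two disease-related criteria and the two medication-related criteria end up in separate clusters, which is the kind of grouping a human reviewer would then name as a category.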
MEASUREMENTS: We induced 27 categories and measured their prevalence in 27,278 eligibility criteria from 1578 clinical trials. We then compared the classification performance (i.e., precision, recall, and F1-score) of the UMLS-based feature representation and the "bag of words" feature representation across five common classifiers in Weka: J48, Bayesian Network, Naïve Bayesian, Nearest Neighbor, and an instance-based learning classifier.
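For reference, the three reported metrics can be computed per category as in the sketch below. The category labels and predictions are made up for illustration and are not taken from the study's 27 categories or its data.

```python
def precision_recall_f1(gold, predicted, category):
    """Compute precision, recall, and F1 for one category (one-vs-rest)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fp = sum(1 for g, p in pairs if g != category and p == category)
    fn = sum(1 for g, p in pairs if g == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical gold labels and classifier output for five criteria.
gold = ["disease", "disease", "medication", "age", "disease"]
predicted = ["disease", "medication", "medication", "age", "disease"]

for category in ("disease", "medication", "age"):
    p, r, f = precision_recall_f1(gold, predicted, category)
    print(f"{category}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

F1 is the harmonic mean of precision and recall, so a category scores well only when the classifier is both accurate when it assigns the label and thorough in finding the criteria that carry it.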
RESULTS: The UMLS semantic feature representation outperforms the "bag of words" feature representation in 89% of the criteria categories. Using the semantically induced categories, machine-learning classifiers required only 2000 instances to stabilize classification performance. The J48 classifier yielded the best F1-score and the Bayesian Network classifier achieved the best learning efficiency.
CONCLUSION: The UMLS is an effective knowledge source and can enable an efficient feature representation for semi-automated semantic category induction and automatic categorization for clinical research eligibility criteria and possibly other clinical text.
Author List
Luo Z, Yetisgen-Yildiz M, Weng C
Author
Jake Luo Ph.D. Associate Professor; Director, Center for Biomedical Data and Language Processing (BioDLP) in the Health Informatics & Administration department at University of Wisconsin - Milwaukee
MeSH terms used to index this publication
Algorithms
Artificial Intelligence
Biomedical Research
Cluster Analysis
Semantics
Unified Medical Language System