Medical College of Wisconsin

Dynamic categorization of clinical research eligibility criteria by hierarchical clustering. J Biomed Inform 2011 Dec;44(6):927-35

Date

06/22/2011

Pubmed ID

21689783

Pubmed Central ID

PMC3183114

DOI

10.1016/j.jbi.2011.06.001

Scopus ID

2-s2.0-84855953110 (31 citations)

Abstract

OBJECTIVE: To semi-automatically induce semantic categories of eligibility criteria from text and to automatically classify eligibility criteria based on their semantic similarity.

DESIGN: The UMLS semantic types and a set of previously developed semantic preference rules were utilized to create an unambiguous semantic feature representation to induce eligibility criteria categories through hierarchical clustering and to train supervised classifiers.
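The clustering step in the design can be illustrated with a minimal sketch. It assumes each criterion has already been reduced to a set of UMLS semantic type labels. The toy criteria, the Jaccard distance, the average-linkage rule, and the `max_distance` cutoff are all illustrative assumptions; the abstract does not state which similarity measure or linkage the authors used.

```python
from itertools import combinations

# Hypothetical toy data: each criterion is represented by the set of UMLS
# semantic types found in its text (labels chosen for illustration only).
criteria = {
    "c1": {"Disease or Syndrome"},
    "c2": {"Disease or Syndrome", "Finding"},
    "c3": {"Pharmacologic Substance"},
    "c4": {"Pharmacologic Substance", "Therapeutic or Preventive Procedure"},
    "c5": {"Laboratory Procedure", "Quantitative Concept"},
}


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(cluster_a, cluster_b, features):
    """Mean pairwise Jaccard distance between members of two clusters."""
    pairs = [(x, y) for x in cluster_a for y in cluster_b]
    total = sum(jaccard_distance(features[x], features[y]) for x, y in pairs)
    return total / len(pairs)


def agglomerate(features, max_distance):
    """Bottom-up clustering: repeatedly merge the two closest clusters
    until the closest pair is farther apart than max_distance."""
    clusters = [frozenset([name]) for name in sorted(features)]
    while len(clusters) > 1:
        best = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], features),
        )
        if average_linkage(best[0], best[1], features) > max_distance:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters


clusters = agglomerate(criteria, max_distance=0.6)
for cluster in clusters:
    print(sorted(cluster))
```

With this toy input the disease-related criteria and the medication-related criteria each merge into one group, while the laboratory criterion stays on its own. In practice a library routine would replace the quadratic loop above.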

MEASUREMENTS: We induced 27 categories and measured their prevalence in 27,278 eligibility criteria from 1,578 clinical trials. We then compared the classification performance (i.e., precision, recall, and F1-score) of the UMLS-based feature representation against the "bag of words" feature representation across five common classifiers in Weka: J48, Bayesian Network, Naïve Bayesian, Nearest Neighbor, and an instance-based learning classifier.
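For readers unfamiliar with the three metrics named above, the following sketch computes them for a single category from gold and predicted labels. The category names and the label lists are hypothetical; they are not data from the study, and the function is not the evaluation code the authors ran in Weka.

```python
def precision_recall_f1(gold, predicted, category):
    """Per-category precision, recall, and F1 from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fp = sum(1 for g, p in pairs if g != category and p == category)
    fn = sum(1 for g, p in pairs if g == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for five criteria (illustration only).
gold = ["disease", "disease", "medication", "lab", "disease"]
predicted = ["disease", "medication", "medication", "lab", "disease"]

for category in ("disease", "medication", "lab"):
    p, r, f = precision_recall_f1(gold, predicted, category)
    print(f"{category}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Computing the metrics per category, as here, is what allows a statement such as "outperforms in 89% of the criteria categories" to be made.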

RESULTS: The UMLS semantic feature representation outperforms the "bag of words" feature representation in 89% of the criteria categories. Using the semantically induced categories, machine-learning classifiers required only 2000 instances to stabilize classification performance. The J48 classifier yielded the best F1-score and the Bayesian Network classifier achieved the best learning efficiency.

CONCLUSION: The UMLS is an effective knowledge source and can enable an efficient feature representation for semi-automated semantic category induction and automatic categorization for clinical research eligibility criteria and possibly other clinical text.

Author List

Luo Z, Yetisgen-Yildiz M, Weng C

Author

Jake Luo Ph.D. Associate Professor; Director, Center for Biomedical Data and Language Processing (BioDLP) in the Health Informatics & Administration department at University of Wisconsin - Milwaukee




MeSH terms used to index this publication

Algorithms
Artificial Intelligence
Biomedical Research
Cluster Analysis
Semantics
Unified Medical Language System