Medical College of Wisconsin

Selection of models for the analysis of risk-factor trees: leveraging biological knowledge to mine large sets of risk factors with application to microbiome data. Bioinformatics 2015 May 15;31(10):1607-13

Date

01/09/2015

Pubmed ID

25568281

Pubmed Central ID

PMC4426830

DOI

10.1093/bioinformatics/btu855

Scopus ID

2-s2.0-84929616310 (10 citations)

Abstract

MOTIVATION: Establishing a statistical association between microbiome features and clinical outcomes is of growing interest because of its potential to yield insights into biological mechanisms and pathogenesis. Extracting microbiome features that are relevant to a disease is challenging, and existing variable selection methods are limited by the large number of risk-factor variables in microbiome sequence data and by their complex biological structure.

RESULTS: We propose a tree-based scanning method, Selection of Models for the Analysis of Risk factor Trees (referred to as SMART-scan), for identifying taxonomic groups that are associated with a disease or trait. SMART-scan is a model selection technique that uses a predefined taxonomy to organize the large pool of possible predictors into optimized groups, and hierarchically searches for and determines the variable groups to be used in association testing. We investigate the statistical properties of SMART-scan through simulations, in comparison with a regular single-variable analysis and three commonly used variable selection methods: stepwise regression, the least absolute shrinkage and selection operator (LASSO), and classification and regression trees (CART). When there are taxonomic group effects in the data, SMART-scan can significantly increase power by using bacterial taxonomic information to split large numbers of variables into groups. Through an application to microbiome data from a vervet monkey diet experiment, we demonstrate that SMART-scan can identify important phenotype-associated taxonomic features missed by single-variable analysis, stepwise regression, LASSO and CART.
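
The general idea of grouping predictors along a predefined taxonomy can be illustrated with a small toy sketch. This is not the published SMART-scan procedure: the taxonomy, the abundance values, the scoring rule (absolute Pearson correlation between a group's summed abundance and the phenotype) and the split rule (descend only if some child scores higher than its parent) are all assumptions chosen for illustration, and every name below is hypothetical.

```python
import math

# Hypothetical taxonomy: internal node -> children. Leaves are measured taxa.
TAXONOMY = {
    "root": ["Firmicutes", "Bacteroidetes"],
    "Firmicutes": ["taxon_a", "taxon_b"],
    "Bacteroidetes": ["taxon_c", "taxon_d"],
}

# Made-up per-sample abundances (six samples) and a made-up phenotype.
ABUNDANCES = {
    "taxon_a": [1, 0, 2, 1, 3, 2],
    "taxon_b": [0, 2, 1, 3, 2, 4],
    "taxon_c": [3, 1, 2, 3, 1, 2],
    "taxon_d": [2, 3, 1, 1, 3, 2],
}
PHENOTYPE = [1, 2, 3, 4, 5, 6]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    if ssx == 0 or ssy == 0:
        return 0.0
    return cov / math.sqrt(ssx * ssy)


def node_values(node):
    """Per-sample abundance of a node: sum over all leaves beneath it."""
    if node not in TAXONOMY:  # leaf taxon
        return ABUNDANCES[node]
    child_values = [node_values(child) for child in TAXONOMY[node]]
    return [sum(vals) for vals in zip(*child_values)]


def score(node):
    """Assumed association score: |correlation| with the phenotype."""
    return abs(pearson(node_values(node), PHENOTYPE))


def scan(node):
    """Top-down scan: keep a node as one group unless a child scores higher."""
    children = TAXONOMY.get(node, [])
    if not any(score(child) > score(node) for child in children):
        return [node]
    groups = []
    for child in children:
        groups.extend(scan(child))
    return groups


selected_groups = scan("root")
print(selected_groups)
```

In this toy data the two Firmicutes taxa are each only moderately correlated with the phenotype, but their sum tracks it exactly, so the scan keeps Firmicutes as a single group rather than testing its members separately. That is the kind of taxonomic group effect the abstract describes; the actual model selection criterion used by SMART-scan is given in the paper.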

Author List

Zhang Q, Abel H, Wells A, Lenzini P, Gomez F, Province MA, Templeton AA, Weinstock GM, Salzman NH, Borecki IB

Author

Nita H. Salzman, MD, PhD: Center Director and Professor in the Department of Pediatrics at the Medical College of Wisconsin




MeSH terms used to index this publication

Animals
Decision Trees
Gastrointestinal Tract
Humans
Logistic Models
Microbiota
Models, Statistical
Phenotype
RNA, Ribosomal
Risk Assessment
Risk Factors