Bayesian inference with incomplete multinomial data: A problem in pathogen diversity. Journal of the American Statistical Association 2010; 105 (490): 600-611
Date
07/01/2010
Abstract
With recent advances in genetic analysis, it has become feasible to classify a pathogen into genetically distinct variants, even when those variants cause apparently similar symptoms in infected subjects. The availability of such data opens up the interesting problem of studying spatio-temporal variation in the diversity of a pathogen's variants. Data on pathogen variants often suffer from (i) low cell counts, (ii) incomplete classification due to laboratory problems such as contamination, and (iii) unseen variants. Shannon entropy may be employed as a measure of variant diversity, and a Bayesian approach can be used to deal with low cell counts and unseen variants. Bayesian analysis of incomplete multinomial data may be carried out by Markov chain Monte Carlo techniques. For pathogen-variant data, however, there is often only one source of missingness: some subjects are known to be infected, but the variant infecting them is unidentified. We point out that for incomplete data with disjoint sources of missingness, Bayesian analysis can be done more efficiently by an i.i.d. sampling scheme from the posterior distribution. We illustrate the method by analyzing a dataset on the prevalence of Bartonella infection among individual colonies of prairie dogs at a study site in Colorado from 2003 to 2006.
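To make the idea of i.i.d. posterior sampling concrete, the following is a minimal sketch and not necessarily the scheme used in the paper. It assumes a symmetric Dirichlet prior, a fixed number of variant categories (an unseen variant is simply a cell with zero count), and ignorable missingness in which `m` unclassified subjects are known only to belong to one subset of categories. Under these assumptions the posterior factorizes into two independent Dirichlet pieces, so exact draws need no Markov chain. The function names, the prior weight `alpha`, and the example counts are all hypothetical.

```python
import numpy as np


def _dirichlet(rng, params, size):
    """Draw Dirichlet samples as normalized gamma variates (works for any length >= 1)."""
    g = rng.gamma(params, size=(size, len(params)))
    return g / g.sum(axis=1, keepdims=True)


def sample_posterior(counts, subset, m, alpha=1.0, size=1000, seed=0):
    """I.i.d. posterior draws of the variant probabilities.

    counts : fully classified counts per variant
    subset : indices of variants the m unclassified subjects may belong to
    m      : number of subjects classified only as "somewhere in subset"
    alpha  : symmetric Dirichlet prior weight (an assumption of this sketch)
    """
    rng = np.random.default_rng(seed)
    c = np.asarray(counts, dtype=float) + alpha  # posterior weights from complete data
    in_s = np.zeros(c.size, dtype=bool)
    in_s[list(subset)] = True

    # Top level: total mass q of the subset, plus each variant outside it.
    # The m partially classified subjects add weight to q only.
    top = np.concatenate(([c[in_s].sum() + m], c[~in_s]))
    top_draw = _dirichlet(rng, top, size)

    # Within the subset: relative shares, independent of q.
    within = _dirichlet(rng, c[in_s], size)

    p = np.empty((size, c.size))
    p[:, in_s] = top_draw[:, [0]] * within
    p[:, ~in_s] = top_draw[:, 1:]
    return p


def shannon_entropy(p):
    """Shannon entropy (natural log) of each row of p, treating 0 * log(0) as 0."""
    safe = np.where(p > 0, p, 1.0)
    return -(p * np.log(safe)).sum(axis=1)


# Hypothetical data: four variants, one of them unseen, and four subjects
# known only to carry variant 0 or variant 1.
draws = sample_posterior([5, 3, 0, 2], subset=[0, 1], m=4, size=20000)
entropy = shannon_entropy(draws)
interval = np.quantile(entropy, [0.025, 0.975])  # 95% credible interval
```

Because every draw is independent, posterior summaries of the entropy such as `entropy.mean()` or `interval` need no burn-in or convergence diagnostics. When the subset covers every category, the unclassified subjects carry no information about the relative frequencies under these assumptions, and the draws reduce to an ordinary Dirichlet posterior.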