Subjects and HELIUS cohort
Subjects were participants in the HEalthy Life in an Urban Setting (HELIUS) cohort study. This study used a stratified-random sampling approach to include between 2011 and 2015 25,000 inhabitants (18–70 years) from the city of Amsterdam, the Netherlands [7]. Stratification was done on six subgroups with different ethnic origins (African Surinamese, South Asian Surinamese, Ghanaian, Turkish, Moroccan, and Dutch). Subgroups were about equally large.
Dietary intakes assessment
As described previously [10, 11], a subsample of voluntary participants of Dutch, Moroccan, Turkish, South-Asian Surinamese and African Surinamese origin were asked to participate in the HELIUS-Dietary Patterns study, with objective to collect detailed information on their diet. Usual dietary intakes were assessed through the completion of ethnic-specific semi-quantitative food frequency questionnaires (FFQs) with a reference period of 4 weeks. These FFQs were adapted from an existing Dutch FFQ and comprised about 200 items. Food items were collapsed into 73 food groups based on similarity in nutrient profile and culinary use. In this study ethnic-specific food groups were not included in this analysis and 67 food items were used for the analyses.
16S processing
We used the 16S ribosomal RNA (rRNA) sequencing data generated in a previous study based on the HELIUS cohort [3]. In short, the composition of fecal microbiota was profiled by sequencing the V4 region of the 16S rRNA gene on a MiSeq system. The 16S rRNA gene reads were processed on a mothur pipeline (version 1.39.5) [12]. The OTU clustering was done by using the vsearch (version 2.6) [13] and a phylogenetic tree was constructed by running FastTree 2.1 [14]. The details of the sequencing and bioinformatic pipelines were described in [3].
Statistical analyses
Our analysis is based on 1090 subjects who had both fecal microbiome and FFQ data. Following [1], here we removed OTUs with fewer than 10 reads in total, as well as OTUs which were present in fewer than 1% of samples. The final OTU table contains 1090 samples and 2073 OTUs. We used four widely used methods for sequencing data analysis to quantify the strength of the associations between dietary variables (x) and OTU counts (y). Because the large number of associations (67 × 2073), we used multidplyr R package (https://github.com/hadley/multidplyr) for parallel computation. The selected methods were as follows:
ShotgunFunctionalizeR is a popular R package used in microbiome research community, and based on the Poisson generalized linear model (implemented in glm function in R) [15]. We used the glm function with log(total counts) as offset to quantify associations between dietary variables and OTU counts.
Negative binomial model, also called gamma-Poisson model, is popular for statistical modeling of OTU count data [16, 17]. Phyloseq is a popular R package used by the microbiome research community [18]. The core of Phyloseq is based on another popular R package DESeq2, which is based on negative binomial model [19]. However, when the sample size is big (above 100), the computation becomes slow in DESeq2. Therefore, in this work we used another negative binomial based R package, edgeR [20]. The observed OTU count was modeled by a negative binomial distribution with two parameters, the mean and the dispersion. OTU specific dispersion was estimated by running estimateDisp function implemented in the edgeR package [20, 21]. The associations between dietary variables and OTU counts were quantified by running glmFit function of the edgeR package [20]. The log(total counts) was used as offset.
In contrast to above methods modeling the counts with exact probabilistic distributions, others have advocated weighted linear regression analysis with precision weights derived from the mean variance relationship [22]. This approach has been implemented in the voom function of the popular R package limma [23]. The weighted linear regression was done to estimate the linear association between dietary variables and OTU counts with precision weights estimated by the voom and lmFit functions in the limma package [22, 23].
The last method, metagenomeSeq is also a popular R package used by microbiome research community [24]. It is based on the zero-inflated Gaussian model. This approach has been implemented in the fitZig function of the popular R package metagenomeSeq [24]. The cumulative sum scaling method was used to take care library size difference.
In a typical association study, the primary goal is to identify some candidate associations for future research. Therefore, regarding multiple testing we calculated false discovery rate (FDR). If an association had FDR value below 0.05, we considered it as a significant association. Since the research question is focused only on robustness of the statistical results and not on biological or epidemiological associations, we did not adjust for possible confounding or selection factors.
Simulation framework
We use y to represent the simulated microbiome data with n rows and J columns. Every column of y represents a microbe and every row of y represents a subject. Here, we simulated associations of a dietary variable, denoted as x, with gut microbiota. x is a vector of length n, and was randomly sampled from real FFQ data with replacement. The FFQ data was published in [2] and contained 214 dietary variables that were scaled to having mean 0 and standard deviation 1. For each simulated microbiome data set, we used one dietary variable and in total generated 214 simulated data sets. Our simulation framework included the steps below:
$$ \eta \left[j\right]\sim Bernoulli(0.5) $$
(1)
$$ \gamma \left[j\right]\sim {T}_7\left(0,2.5\right) $$
(2)
$$ \beta \left[j\right]=\left(1-\eta \left[j\right]\right)\times 0+\eta \left[j\right]\times \gamma \left[j\right] $$
(3)
$$ \theta \left[i,1:J\right]\sim Dirichlet\left(\pi \left[1:J\right]\right) $$
(4)
$$ \alpha \left[i,1:J\right]= logit\left(\theta \left[i,1:J\right]\right) $$
(5)
$$ logit\left(\mu \left[i,j\right]\right)=\alpha \left[i,j\right]+\beta \left[j\right]\times x\left[i\right] $$
(6)
$$ N\left[i\right]\sim Lognormal\left({\mu}_L,{\sigma}_L\right) $$
(7)
$$ y\left[i,1:J\right]\sim Multinomial\left(N\left[i\right],\mu \left[i,1:J\right]\right) $$
(8)
Our HELIUS microbiome data set had 1090 subjects and the median sequencing depth was about 50,000. To mimic HELIUS data, we simulated the 16S microbiome data sets, with each data set having 1000 subjects and mean of sequencing depth 50,000.
Spiked-in association
To introduce the spiked-in association between a dietary variable and microbe j, we need two variables η[j] and γ[j]. The indicator variable, η[j], indicates if a dietary variable influences the abundance of the microbe j. For microbe j, we randomly drew η[j] from a Bernoulli distribution with parameter 0.5 (Eq. 1). γ[j] represents the effect of the dietary variable on the abundance for OTU j, and was sampled from a t distribution with 7 degrees of freedom, location 0 and scale 2.5 [25] (Eq. 2). Then the true association between the diet and microbe j was captured by β[j] defined in Eq. 3.
Dirichlet multinomial model
In the current study, we developed a Dirichlet multinomial model to generate 16S rRNA microbiome data.
In Eq. 4, the matrix θ has n rows and J columns. θ[i, j] corresponds to the baseline abundance level for the microbe j in subject i. For subject i, we randomly drew a vector of length J from a Dirichlet distribution. In Eq. 5, the parameter of the Dirichlet distribution π is a vector of length J. We used R package DirichletMultinomial [26] and the Human Microbiome Project 16S rRNA stool data [27] to estimate the π. In Eq. 6, the true microbe j proportion in subject i, μ[i, j] was modeled as a logistic regression of x[i]. Similar to [24], library size of subject i, N[i], was randomly drawn from a lognormal distribution with mean μL and standard deviation σL (Eq. 7). μL is the logarithm of target sequencing depth (50000). We estimated σL = 0.77 based on the HMP stool 16S rRNA data set by using the fitdistr function implemented in the MASS package. Finally, for subject i, the observed microbe counts were randomly generated from a multinomial distribution (Eq. 8).
Performance evaluation
We evaluated the model performances based on metrics including true positive rate, false positive rate and error probability for identifying a significant association between microbe and dietary variable. They are calculated per simulation and defined as below:
$$ \mathrm{True}\ \mathrm{positive}\ \mathrm{rate}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(9)
$$ \mathrm{False}\ \mathrm{positive}\ \mathrm{rate}=\frac{\mathrm{FP}}{\mathrm{TN}+\mathrm{FP}} $$
(10)
$$ \mathrm{Error}\ \mathrm{probability}=\frac{FP}{\mathrm{TP}+\mathrm{FP}} $$
(11)
TP, FP, TN and FN refer to true positive, false positive, true negative and false negative, respectively. The indicator variable, η[j], is in the definition of our spike-in associations. When η[j] = 1, the dietary variable x influences the abundance of the microbe j, otherwise η[j] = 0. Therefore, a true positive finding is defined as having a significant association between the dietary variable x and microbe j with FDR < 0.05 in case the true η[j] = 1. A false positive finding is defined as having a significant association between the dietary variable x and microbe j with FDR < 0.05 in case the true η[j] = 0. A true negative finding is defined as having a association between the dietary variable x and microbe j with FDR > 0.05 in case the true η[j] = 0. A false negative finding is defined as having a association between the dietary variable x and microbe j with FDR > 0.05 in case the true η[j] = 1. The error probability quantified the probability that a significant association is false. Here we did not use “false discovery rate” but used the term “error probability” in order to avoid confusion, because we also calculated the false discovery rate during analyses of associations between OTUs and dietary variables.