Full-length 16S rRNA gene amplicon analysis of human gut microbiota using MinION™ nanopore sequencing confers species-level resolution

Background Species-level genetic characterization of complex bacterial communities has important clinical applications in both diagnosis and treatment. Amplicon sequencing of the 16S ribosomal RNA (rRNA) gene has proven to be a powerful strategy for the taxonomic classification of bacteria. This study aims to improve the method for full-length 16S rRNA gene analysis using the nanopore long-read sequencer MinION™. We compared it to the conventional short-read sequencing method in both a mock bacterial community and human fecal samples. Results We modified our existing protocol for full-length 16S rRNA gene amplicon sequencing by MinION™. A new strategy for library construction with an optimized primer set overcame PCR-associated bias and enabled taxonomic classification across a broad range of bacterial species. We compared the performance of full-length and short-read 16S rRNA gene amplicon sequencing for the characterization of human gut microbiota with a complex bacterial composition. The relative abundance of dominant bacterial genera was highly similar between full-length and short-read sequencing. At the species level, MinION™ long-read sequencing had better resolution for discriminating between members of particular taxa such as Bifidobacterium, allowing an accurate representation of the sample bacterial composition. Conclusions Our present microbiome study, comparing the discriminatory power of full-length and short-read sequencing, clearly illustrated the analytical advantage of sequencing the full-length 16S rRNA gene. Supplementary Information The online version contains supplementary material available at 10.1186/s12866-021-02094-5.


Background
Recent advances in DNA sequencing technology have had a revolutionary impact on clinical microbiology [1]. Next-generation sequencing (NGS) technology enables parallel sequencing of DNA on a massive scale to generate vast quantities of accurate data. NGS platforms are now increasingly used in the field of clinical research [2]. Metagenomic sequencing offers numerous advantages over traditional culture-based techniques that have long been the standard test for detecting pathogenic bacteria. This method is particularly useful for characterizing uncultured bacteria and novel pathogens [3].
Among sequence-based bacterial analyses, amplicon sequencing of the 16S ribosomal RNA (rRNA) gene has proven to be a reliable and efficient option for taxonomic classification [4,5]. The bacterial 16S rRNA gene contains nine variable regions (V1 to V9) that are separated by highly conserved sequences across different taxa. For bacterial identification, the 16S rRNA gene is first amplified by polymerase chain reaction (PCR) with primers annealing to conserved regions and then sequenced. The sequencing data are subjected to bioinformatic analysis in which the variable regions are used to discriminate between bacterial taxa [6].
Since the conventional parallel-type short-read sequencer cannot yield reads covering the full length of the 16S rRNA gene [7], several regions of it have been targeted for sequencing, which often causes ambiguity in taxonomic classification [8]. New sequencing platforms have overcome these technical restrictions, particularly those affecting read length. A prime example is the Min-ION™ sequencer from Oxford Nanopore Technologies, which is capable of producing long sequences with no theoretical read length limit [9][10][11]. MinION™ sequencing targets the entire 16S rRNA gene, allowing the identification of bacteria with more accuracy and sensitivity [12,13]. Furthermore, MinION™ produces sequencing data in real time, which reduces turnaround time for data processing [14,15]. Currently, the MinION™ sequencing technology has some drawbacks, such as a relatively lower throughput and higher error rates compared to other sequencing platforms [16]. Despite the lower per read accuracy, the nearly unrestricted read length provided by the MinION™ sequencer offers promise in sequence-based bacterial analyses.
Given these features of MinION™ sequencing, we had previously conducted full-length 16S rRNA gene amplicon sequencing analyses using the MinION™ platform coupled to a bioinformatics pipeline [17,18] (Additional File 1), which allowed us to identify bacterial pathogens with a total analysis time of under two hours [19]. However, we also found that the approach of using the commercial 16S Barcoding Kit (SQK-RAB204) available from Oxford Nanopore Technologies has a limited ability to detect particular taxa such as Bifidobacterium [19]. This is probably due to sequence mismatches in the primer used for 16S rRNA gene amplification [20]. Deviations or aberrancies in the Bifidobacterium composition in the human gut have been reported in several diseases including obesity, allergy, and inflammatory disorders [21]. Based on their putative health-promoting effects, several strains of Bifidobacterium have been utilized as probiotics [22]. Within these contexts, the species-level characterization of Bifidobacterium diversity in human gut microbiota is potentially important in clinical practice.
Our 16S rRNA gene sequence analysis using MinION™ has been tested only with pre-characterized mock bacterial DNA and a limited number of pathogenic bacteria from a patient-derived sample [23]. Its applicability to highly complex bacterial communities has not yet been thoroughly investigated. Therefore, in this study we modified our existing protocol for 16S rRNA gene amplicon sequencing by MinION™ and applied it to human gut microbiota with a complex bacterial composition [24], including Bifidobacterium, to determine whether full-length 16S rRNA gene sequencing with MinION™ is an effective characterization tool.

Classification of the mock bacterial community
The 16S rRNA gene sequence of Bifidobacterium has three base mismatches with the 27F forward primer provided in the commercial sequencing kit (16S Barcoding Kit, SQK-RAB204, Oxford Nanopore Technologies; Additional File 2: Supplementary Fig. S1a), which biases amplification toward underrepresentation of Bifidobacterium species (Additional File 2: Supplementary Fig. S2, Additional File 3: Supplementary Table S1-S3). To overcome this drawback, we introduced three degenerate bases to the 16S rRNA gene-specific sequences of the primer (Additional File 2: Supplementary Fig. S1b). The competence of the modified primer set was then evaluated by 16S rRNA gene sequence analysis of a tenspecies mock community. The V1-V9 region of the 16S rRNA gene was amplified by the four-primer PCR method with the rapid adapter attachment chemistry and sequenced (Fig. 1a). MinION™ sequencing generated 8651 pass reads (Table 1). Following adapter trimming and size selection, 6972 reads (80.6% of pass reads with an average read length of 1473 bp) were retained for bacterial identification. Randomly sampled 3000 reads were aligned against our in-house genome database GenomeSync [18]. Full-length 16S rRNA gene amplicon sequencing with the modified primer set led to the successful identification of all expected bacterial genera, including Bifidobacterium (Fig. 1b, Additional File 3: Supplementary Table S4). The dataset was also analyzed Fig. 1 16S rRNA gene sequence analysis using the MinION™ nanopore sequencer. a Workflow of 16S rRNA gene amplicon sequencing on the MinION™ platform. Sequencing libraries are generated by the four-primer PCR-based strategy, enabling simplified post-PCR adapter attachment. At the initial stage of PCR, the 16S rRNA gene is amplified with the inner primer pairs. The resulting PCR products are targeted for amplification with the outer primers to introduce the barcode and tag sequences at both ends, to which adapter molecules can be attached in a single-step reaction. b, c Taxonomic assignments of a mock community analyzed by MinION™ sequencing. The V1-V9 or V3-V4 region of the 16S rRNA gene was amplified from a precharacterized mock community sample comprising ten bacterial species and sequenced on the MinION™ platform. Three thousand reads were randomly selected from the processed data set and aligned directly to the reference genome database of 5850 representative bacterial species. The pie charts represent taxonomic profiles at the (b) genus and (c) species levels. Even with the full-length 16S rRNA gene analysis, species-level resolution is not possible for Bacillus and Escherichia. Slices corresponding to misclassified (assigned to bacteria not present in the mock community) or unclassified (not classified at the given level but placed in a higher taxonomic rank) reads are exploded. The relative abundance (%) of each taxon is shown  Supplementary Fig. S3a, S3b). The majority of reads were correctly classified down to the species level, demonstrating the excellent discriminatory power of the full-length sequencing method for bacterial identification (Fig. 1c, Additional File 2: Supplementary Fig. S3c). Bacillus species was an exception in the analysis with both the GenomeSync and NCBI reference database (the SILVA database does not include species level information), and discrimination of Bacillus cereus from the closely related species such as Bacillus anthracis and Bacillus thuringiensis was not achieved (Additional File 4, Additional File 6). Likewise, Escherichia coli was not reliably distinguished from Shigella and other Escherichia species sharing the high 16S rRNA gene sequence similarity to each other [25,26], and species-level resolution was not possible. We compared the resolution of full-length and shortread 16S rRNA gene amplicon sequencing for the taxonomic classification of bacteria. The V3-V4 region was amplified by four-primer PCR from the ten-species mock community DNA, and the samples were sequenced on MinION™. After removing the adapter/barcode sequences and filtering reads by length, 96,189 reads with an average length of 454.9 bp for downstream analysis were yielded (Table 1). In contrast to full-length sequencing with the highest resolution, a significant number of V3-V4 reads were misclassified or assigned to a higher taxonomic rank (Fig. 1b, c, Additional File 2: Supplementary Fig. S3). The three alignment tools worked with some differences in assigning the V3-V4 sequences. This was notable for alignments against the GenomeSync database, where most V3-V4 reads derived from Enterococcus faecalis and Escherichia coli were not correctly assigned to each taxon, as more than one species produced the same similarity score for the sequence read queries and the reads were ranked at the lowest common ancestor (Additional File 3: Supplementary  Table S5, Additional File 7). We could not classify these bacteria even at the phylum level. The results suggest an analytical problem such as database errors, which may give rise to assigning a distantly related organism to the query sequences. The classifications were not affected by increasing the number of analyzed reads to 10,000 (Additional File 2: Supplementary Fig. S4, Additional File 3: Supplementary Table S6). These classification problems were solved, for the most part, by the V1-V9 long-read sequencing. Thus, regardless of program and database used, the full-length 16S rRNA gene sequencing appeared to give better resolution for bacterial identification.
For seven of the ten bacterial strains constituting the mock community, each subset of V1-V9 sequencing reads classified to the specific genus was assigned with a high degree of accuracy (> 90%) to the corresponding species against both the GenomeSync (Fig. 2) and NCBI 16S rRNA database (Additional File 2: Supplementary  Fig. S5). V3-V4 short-read sequencing showed a discriminatory power comparable to that of V1-V9 full-length sequencing in the classification of Deinococcus, Rhodobacter, and Streptococcus. However, the V3-V4 region was not suitable for species-level identification of some genera such as Clostridium and Staphylococcus. These results suggest a lower resolution of the V3-V4 region for species-level classification, emphasizing the advantage of long-read sequencing for obtaining an accurate representation of the sample bacterial composition.

Classification of human fecal bacteria
We assessed the performance of our full-length 16S rRNA gene amplicon sequencing approach in the context of a highly complex bacterial community. The V1-V9 region was amplified by four-primer PCR from six human fecal samples (F1-F6) and analyzed by MinION™ sequencing. (Table 2). The reads were mapped against the GenomeSync reference database for taxonomic assignment. In Fig. 3, the numbers of species detected are plotted against the numbers of reads analyzed. The curve started to plateau at around 20,000 reads. There was a highly significant correlation between the read numbers 20,000 and 30,000 (Pearson's correlation coefficient r > 0.999, Additional File 8: Supplementary Table  S7). Based on these observations, randomly sampled 20, 000 reads were used in further analysis to determine the bacterial composition of the human gut.
For comparison, amplicon sequencing of the V3-V4 region was also conducted using the MinION™ (Table 2) and the Illumina MiSeq™ platform ( Table 3). The processed reads from each data set were allocated to the reference bacterial genome using our bioinformatics pipeline to determine the bacterial compositions (Additional File 9 for V1-V9 MinION™ sequencing, Additional File 10 for V3-V4 MinION™ sequencing, and Additional File 11 for V3-V4 MiSeq™ sequencing). From MiSeq™ sequencing data, the bacterial composition was also analyzed by the operational taxonomic unit (OTU)based approach using the QIIME 2 (ver. 2019.7) pipeline (Additional File 2: Supplementary Fig. S6, Additional File 12) [27,28]. Although Bacteroides was underrepresented in the OTU-based analysis, the two analytical methods (our bioinformatics pipeline and OTU-based method) produced similar taxonomic profiles in the dominant phylotypes for the MiSeq™ data.
This result confirmed the validity of our method for the taxonomic classification of the bacterial community.
The three sequencing methods (V1-V9 MinION™ sequencing, V3-V4 MinION™ sequencing, and V3-V4 MiSeq™ sequencing) revealed similar profiles for the six fecal samples at the genus level (Fig. 4). Statistically significant similarities have been found in the relative genus abundances across these sequencing methods. Thus, at the genus level, V1-V9 full-length MinION™ sequencing exhibited a discriminatory power comparable to that of high-quality short-read sequencing with MiSeq™ technology.
The species-level taxonomic resolution achieved by fulllength sequencing of the 16S rRNA gene using MinION™ While genus classification using long versus short reads was relatively comparable, we observed considerable differences across amplified regions in the species-level profiling of human gut microbiota. As shown in Fig. 5, the number of ambiguous reads that were not assigned to species but could be classified at a higher level was significantly greater in the V3-V4 data set in comparison than in the V1-V9 data set. The MinION™ V3-V4 data had the highest proportion of ambiguous reads. In comparison with the V3-V4 reads sequenced on the MiSeq™ platform (Table 3), the MinION™ V3-V4 reads had lower average quality scores ( Table 2). The poorer read quality gave rise to assigning multiple species to a query sequence, leading to the increased number of reads not classified at the species level (Additional File 10).
When species compositions of the dominant taxa (Bifidobacterium, Blautia, and Bacteroides) were analyzed, the V1-V9 and V3-V4 sequencing produced comparable results for Blautia (Additional File 2: Supplementary Fig.  S7, Additional File 8: Supplementary Table S8) and Bacteroides genus (Additional File 2: Supplementary Fig. S8, Additional File 8: Supplementary Table S9) in most of the fecal samples. For Bifidobacterium, there appeared to be considerable deviations in the relative species abundances depending on the sequencing method used (Fig. 6, Additional File 8: Supplementary Table S10). Notably, most of the Bifidobacterium reads generated by V1-V9 MinION™ sequencing were classified into the Bifidobacterium species that were isolated from human sources [21,29]. A significant number of the V3-V4 reads, however, were assigned erroneously to Bifidobacterium species of non-human origin in the direct read mapping approach using the relatively shallow Genome-Sync reference database (Additional File 2: Supplementary Fig. S9). Consistently, the OTU-based classification analysis for V3-V4 MiSeq™ data using the QIIME 2 pipeline also revealed a lower resolution of short-read sequencing for taxonomic separation of Bifidobacterium genus. Except for Bifidobacterium longum, Bifidobacterium species could not be reliably identified by the V3-V4 sequencing strategy and they were ranked at the genus level (Additional File 2: Supplementary Fig. S10).
These results suggest that MinION™ long-read sequencing, which targets the full-length 16S rRNA gene, can provide better resolution for discriminating between members of particular taxa such as Bifidobacterium.

Discussion
16S rRNA gene amplicon sequencing is a powerful strategy for taxonomic classification of bacteria and has been extensively employed for analyzing samples from environmental and clinical sources [5,30,31]. We assessed the performance of MinION™ sequencing by comparing the resolution of the V1-V9 and V3-V4 reads for the taxonomic classification of bacteria. Due to the errorprone nature of MinION™ sequencing, the existing OTU-based approach, requiring at least 97% sequence identity threshold, has been considered unsuitable for taxonomic classification of MinION™ reads [32,33]. Instead, the reads were analyzed by the direct read mapping method that assigns sequences to taxonomic bins based on the similarity to a reference database [14,15]. Long-read MinION™ sequencing with the optimized primer set successfully identified Bifidobacterium species leading to a better representation of the species composition of the mock community. For improving the classification results, the reads were filtered by length to  eliminate those outside the expected size range. Typically, extremely short reads possess only one primerbinding site, suggesting that they are derived from incomplete sequencing. There also exist unexpectedly longer reads with a continuous sequence structure in which two 16S rRNA gene amplicons are linked end-to-end. Because these reads can potentially result in unclassified reads or misclassification, they were eliminated before alignment to the reference sequences of the bacterial genome.
We also modified library construction for MinION™ sequencing with a four-primer PCR strategy, which enabled ligase-free adapter attachment to occur in a singlestep reaction. The four-primer PCR generates amplicons with particular chemical modifications at the 5′ ends to which adapter molecules can be attached nonenzymatically. Unlike the ligation-based approach, the PCR products amplified by the four-primer method are subjected directly to the adapter attachment reaction without repairing their 5′ ends, substantially reducing the time required for sample preparation. Furthermore, because the protocol is free of Nanopore's transposasebased technology (e.g. Rapid Sequencing Kit, SQK-RAD004) that cleaves DNA molecules to produce chemically modified ends for library construction, the PCR products are kept intact, enabling sequencing of the entire amplified region. Thus, the four-primer PCR-based method allowed us to perform amplicon sequencing on the MinION™ platform with user-defined arbitrary primer pairs, taking advantage of the rapid adapter attachment chemistry. This method can be applied to a wide range of sequence-based analyses, including detection of functional genetic markers like antimicrobial resistance genes and identification of genetic variations in targeted loci [11,34,35].
Our present microbiome study, comparing the discriminatory power of the V1-V9 and V3-V4 reads sequenced on the MinION™ platform, clearly illustrated the advantage of sequencing the entire 16S rRNA gene. The full-length 16S rRNA gene sequencing provided better resolution than short-read sequencing for discriminating between members of certain bacterial taxa, including Bifidobacterium, Clostridium, Enterococcus, and Staphylococcus. Consistently, comprehensive microbiome studies using a sequencing data set consisting of different regions of the 16S rRNA gene have shown that the choice of the regions to be sequenced substantially affects the classification results, and some bacterial species are identified only by sequencing the entire 16S rRNA gene [6,36,37]. It is important to note, however, that even full-length 16S rRNA gene analysis fails to discriminate some closely related species such as members of Bacillus cereus group and Escherichia, indicating the limitations of the 16S rRNA gene amplicon sequencing alone in species allocation. Long read sequencing targeting other phylogenetic markers may be an alternative to 16S rRNA gene amplicon sequencing and provide better resolution for bacterial identification.
In the analysis of the human fecal samples, we used the taxonomic resolution of the V3-V4 region sequenced with MiSeq™, which generates highly accurate reads, as a benchmark for the taxonomic resolution of the fulllength 16S rRNA gene sequenced with MinION™. The relative abundance of dominant bacterial taxa was highly similar at the genus level between full-length MinION™ and short-read MiSeq™ sequencing. Despite the lower read quality, the full-length sequencing by MinION™ enabled reliable identification of bacterial genera with an accuracy comparable to MiSeq™ technology. The MiSeq™ platform enables 16S rRNA gene sequencing on a massive scale with reduced cost (approximately 20 USD per sample). Considering a low equipment price (1000 USD) and an affordable per-run cost (approximately 50 USD per sample), the MinION™ sequencer could be a Our study has some limitations. The use of the GenomeSync reference database with a limited number of sequences could not allow a comprehensive survey of bacterial communities. In addition to the database size, the analysis methods may also influence the identification results. In our bioinformatics pipeline, a query sequence is directly mapped to the reference genome and assigned to a specific taxon according to the alignment score. When more than one taxon is identified, the read is assigned to the lowest common ancestor. It seemed that this approach worked well for the V1-V9 sequencing with better resolution, and the reads had a higher probability of being assigned down to the species level. However, as evident in the classification of Enterococcus and Escherichia, the V3-V4 sequence Fig. 4 Comparison of taxonomic profiles of human gut microbiota between sequencing methodologies. Six fecal samples (F1-F6) were analyzed by sequencing the entire 16S rRNA gene using MinION™ (N_V1-V9). For comparison, the V3-V4 region was sequenced on MinION™ (N_V3-V4) or MiSeq™ platforms (I_V1-V9). Randomly sampled 20,000 reads from each data set were allocated to the reference genome database of 5850 representative bacterial species. A heat map shows the relative genus abundance (%) of classified reads. The 15 most abundant taxa are shown. The Pearson correlation coefficient (r) between sequencing methods was computed. Asterisks indicate significant correlations at P < 0.05 alignment against GenomeSync produced a higher proportion of ambiguous identification. In the case of bifidobacteria, non-human species were erroneously identified by the V3-V4 short-read sequencing in the analysis of human fecal samples. Such improbable errors were not observed in QIIME2 analysis, in which species-level calls were not made and the reads were ranked at the genus level as more specific classification was impossible. Due to the low discriminatory power of the V3-V4 region, the direct read mapping against a flawed or incomplete reference database may potentially give rise to assigning an unrelated organism marked as a top hit. Since the taxonomic assignment is a critical step for analyzing the bacterial diversity and community composition, the reference database quality and the alignment algorithm must be further evaluated for each sequencing data set.

Conclusions
Our modified protocol for 16S rRNA gene amplicon sequencing overcame known limitations, such as the primer-associated bias toward the underrepresentation of Bifidobacterium, and enabled taxonomic classification across a broad range of bacterial species. Benchmarking with MiSeq™ sequencing technology demonstrated the analytical advantage of sequencing the full-length 16S rRNA gene with MinION™, which could counteract the lower sequence accuracy and provide better resolution. With the recent progress in nanopore sequencing chemistry and base-calling algorithms, sequencing accuracy is continuously improving [38,39]. This will soon enable us to exploit the full potential of MinION™ long-read sequencing technology. High-quality long sequences will allow better discrimination between closely related species in sequence-based bacterial analyses.

Fecal DNA
DNA was extracted from six human fecal samples using the NucleoSpin® Microbial DNA Kit (Macherey-Nagel, Düren, Germany), as described previously [40]. Briefly, human feces stored using the Feces Collection Kit (Techno Suruga Lab, Shizuoka, Japan) were subjected to mechanical disruption by bead-beating, and DNA was isolated using silica membrane spin columns. Extracted DNA was purified with the Agencourt AMPure® XP (Beckman Coulter, Brea, CA, USA).

16S rRNA gene sequencing on the MinION™ platform
Four-primer PCR with rapid adapter attachment chemistry generated 16S rRNA gene amplicons with modified 5′ ends for simplified post-PCR adapter attachment following the manufacturer's instructions with slight modifications. For amplification of the V1-V9 region of the 16S rRNA gene, the following inner primers were used, with 16S rRNA gene-specific sequences underlined: forward primer (S-D-Bact-0008-c-S-20 [41]) with anchor sequence 5′-TTTCTGTTGG TGCTGATATTGCAGRGTTYGATYMTGGCTCAG-3′ and reverse primer (1492R) with anchor sequence 5′-ACTTGCCTGTCGCTCTATCTTCCGGYTACCTT GTTACGACTT-3′. For amplification of the V3-V4 region, the following inner primers were used, with 16S rRNA gene-specific sequences underlined: 341F with anchor sequence 5′-TTTCTGTTGGTGCTGATA TTGCCCTACGGGNGGCWGCAG-3′ and 806R with anchor sequence 5′-ACTTGCCTGTCGCTCTAT CTTCGGACTACHVGGGTWTCTAAT-3′. PCR amplification of 16S rRNA genes was conducted using the KAPA2G™ Robust HotStart ReadyMix PCR Kit (Kapa Biosystems, Wilmington, MA, USA) in a total volume of 25 μl containing inner primer pairs (50 nM each) and the barcoded outer primer mixture (3%) from the PCR Barcoding Kit (SQK-PBK004; Oxford Nanopore Technologies, Oxford, UK). Amplification was performed with the following PCR conditions: initial denaturation at 95°C for 3 min, 5 cycles of 95°C for 15 s, 55°C for 15 s, and 72°C for 30 s, 30 cycles of 95°C for 15 s, 62°C for 15 s, and 72°C for 30 s, followed by a final extension at 72°C for 1 min. Amplified DNA was purified using AMPure® XP (Beckman Coulter) and quantified by a NanoDrop® 1000 (Thermo Fischer Scientific, Waltham, MA, USA). A total of 100 ng of DNA was incubated with 1 μl of Rapid Adapter at room temperature for 5 min. The prepared DNA library (11 μl) was mixed with 34 μl of Sequencing Buffer, 25.5 μl of Loading Beads, and 4.5 μl of water, loaded onto the R9.4 flow cell (FLO-MIN106; Oxford Nanopore Technologies), and sequenced on the Min-ION™ Mk1B. MINKNOW software ver. 1.11.5 (Oxford Nanopore Technologies) was used for data acquisition.

Bioinformatics analysis
Albacore software ver. 2.3.4 (Oxford Nanopore Technologies) was used for basecalling the MinION™ sequencing data (FAST5 files) to generate pass reads (FASTQ format) with a mean quality score > 7. The adapter and barcode sequences were trimmed using the EPI2ME Fastq Barcoding workflow ver. 3.10.4 (Oxford Nanopore Technologies). The reads were filtered by size using SeqKit software ver. 0.10.0 [42], retaining 1300-1950 bp sequences for the V1-V9 region and 350-600 bp sequences for the V3-V4 region, based on the size distribution of 16S rRNA gene sequences in the SILVA database ver. 132 [43,44]. The average Phred quality score was assessed using Nano-Plot ver. 1.27.0 [45]. The processed reads from each set were analyzed using our bioinformatics pipeline [17], as described previously [14,15]. Briefly, FASTQ files were converted to FASTA files. Simple repetitive sequences were masked using the TANTAN program ver. 18 with default parameters [46]. To remove reads derived from human DNA, we searched each read against the human genome (GRCh38) using minimap2 ver. 2.14 with a map-ont-option [47]. Then, unmatched reads were regarded as reads derived from bacteria. For each read, a minimap2 search with 5850 representative bacterial genome sequences stored in the GenomeSync database (Additional File 1) [18] was performed. For each read, the species showing the highest minimap2 score were assigned to the query sequence. When more than one species showed the same similarity score, the reads were classified at any higher taxonomic rank covering all the identified species. Taxa were determined based on the NCBI taxonomy database [48]. Low-abundance taxa with less than 0.01% of total reads were discarded from the analysis.

Statistical analyses
Differences between groups were evaluated by oneway analysis of variance (ANOVA) followed by Dunnett's test for multiple comparisons. The Pearson correlation coefficient was computed to compare the bacterial compositions analyzed by different sequencing methods. Statistical significance was defined by a P-value < 0.05. Statistical analyses were performed with Prism8 (GraphPad Software, Inc. La Jolla, CA, USA).