-
1.
Pre- and post-sequencing recommendations for functional annotation of human fecal metagenomes.
Treiber, ML, Taft, DH, Korf, I, Mills, DA, Lemay, DG
BMC bioinformatics. 2020;(1):74
Abstract
BACKGROUND Shotgun metagenomes are often assembled prior to annotation of genes, which biases the functional capacity of a community towards its most abundant members. For an unbiased assessment of community function, short reads need to be mapped directly to a gene or protein database. The ability to detect genes in short read sequences is dependent on pre- and post-sequencing decisions. The objective of the current study was to determine how library size selection, read length and format, protein database, e-value threshold, and sequencing depth impact gene-centric analysis of human fecal microbiomes when using DIAMOND, an alignment tool that is up to 20,000 times faster than BLASTX. RESULTS Using metagenomes simulated from a database of experimentally verified protein sequences, we find that read length, e-value threshold, and the choice of protein database dramatically impact detection of a known target, with best performance achieved with longer reads, stricter e-value thresholds, and a custom database. Using publicly available metagenomes, we evaluated library size selection, paired end read strategy, and sequencing depth. Longer read lengths were achievable by merging paired ends when the sequencing library was size-selected to enable overlaps. When paired ends could not be merged, a congruent strategy in which both ends are independently mapped was acceptable. Sequencing depths of 5 million merged reads minimized the error of abundance estimates of specific target genes, including an antimicrobial resistance gene. CONCLUSIONS Shotgun metagenomes of DNA extracted from human fecal samples sequenced using the Illumina platform should be size-selected to enable merging of paired end reads and should be sequenced in the PE150 format with a minimum sequencing depth of 5 million merge-able reads to enable detection of specific target genes.
Expecting the merged reads to be 180-250 bp in length, the appropriate e-value threshold for DIAMOND would then need to be more strict than the default. Accurate and interpretable results for specific hypotheses will be best obtained using small databases customized for the research question.
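The effect of a stricter e-value threshold can be illustrated with a minimal sketch that post-filters alignment hits. It assumes the hits are in the 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the function names, the toy rows, and the cutoff of 1e-10 are hypothetical and chosen for illustration only, not taken from the study.

```python
import csv
import io

# Column order of the 12-column BLAST tabular layout assumed for the hits.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue):
    """Yield hits (as dicts) whose e-value is at or below max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) <= max_evalue:
            yield hit


def count_hits_per_target(hits):
    """Count how many retained reads map to each target sequence."""
    counts = {}
    for hit in hits:
        counts[hit["sseqid"]] = counts.get(hit["sseqid"], 0) + 1
    return counts


# Toy input: three made-up hits, one of which has a weak e-value.
EXAMPLE = (
    "read1\tgeneA\t95.0\t60\t3\t0\t1\t180\t10\t69\t1e-25\t110\n"
    "read2\tgeneA\t70.0\t40\t12\t0\t1\t120\t5\t44\t1e-3\t45\n"
    "read3\tgeneB\t88.0\t55\t6\t0\t1\t165\t20\t74\t1e-12\t80\n"
)

# Illustrative cutoff only; the appropriate value depends on read length
# and database, as the abstract notes.
strict_counts = count_hits_per_target(filter_hits(io.StringIO(EXAMPLE), 1e-10))
print(strict_counts)
```

With the illustrative cutoff, the weak hit for read2 is dropped, so each target retains one read.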
-
2.
Tracing CLL-biased stereotyped immunoglobulin gene rearrangements in normal B cell subsets using a high-throughput immunogenetic approach.
Colombo, M, Bagnara, D, Reverberi, D, Matis, S, Cardillo, M, Massara, R, Mastracci, L, Ravetti, JL, Agnelli, L, Neri, A, et al
Molecular medicine (Cambridge, Mass.). 2020;(1):25
Abstract
BACKGROUND B cell receptor Immunoglobulin (BcR IG) repertoire of Chronic Lymphocytic Leukemia (CLL) is characterized by the expression of quasi-identical BcR IG. These are observed in approximately 30% of patients, defined as stereotyped receptors and subdivided into subsets based on specific VH CDR3 aa motifs and phylogenetically related IGHV genes. Although relevant to CLL ontogeny, the distribution of CLL-biased stereotyped immunoglobulin rearrangements (CBS-IG) in normal B cells has not so far been specifically addressed using modern sequencing technologies. Here, we have investigated the presence of CBS-IG in splenic B cell subpopulations (s-BCS) and in CD5+ and CD5- B cells from the spleen and peripheral blood (PB). METHODS Fractionation of splenic B cells into 9 different B cell subsets and that of spleen and PB into CD5+ and CD5- cells were carried out by FACS sorting. cDNA sequences of BcR IG gene rearrangements were obtained by NGS. Identification of amino acid motifs typical of CLL stereotyped subsets was carried out on IGHV1-carrying gene sequences and statistical evaluation was subsequently performed to assess stereotype distribution. RESULTS CBS-IG, which represented on average 0.26% of IGHV1 gene-expressing sequences, were detected in all of the BCS investigated. CBS-IG were more abundant in splenic and circulating CD5+ B cells (0.57%) compared to CD5- B cells (0.17%). In all instances, most CBS IG did not exhibit somatic hypermutation similar to CLL stereotyped receptors. However, compared to CLL, they exhibited a different CLL subset distribution and a broader utilization of the genes of the IGHV1 family. CONCLUSIONS CBS-IG receptors appear to represent a part of the "public" BcR repertoire in normal B cells. This repertoire is observed in all BCS excluding the hypothesis that CLL stereotyped BcR accumulate in a specific B cell subset, potentially capable of originating a leukemic clone.
The different relative representation of CBS-IG in normal B cell subgroups suggests the requirement for additional selective processes before a full transformation into CLL is achieved.
-
3.
Evaluation of molecular inversion probe versus TruSeq® custom methods for targeted next-generation sequencing.
Almomani, R, Marchi, M, Sopacua, M, Lindsey, P, Salvi, E, Koning, B, Santoro, S, Magri, S, Smeets, HJM, Martinelli Boneschi, F, et al
PloS one. 2020;(9):e0238467
Abstract
Resolving the genetic architecture of painful neuropathy will lead to better disease management strategies. We aimed to develop a reliable method to re-sequence multiple genes in a large cohort of painful neuropathy patients at low cost. In this study, we compared sensitivity, specificity, targeting efficiency, performance and cost effectiveness of Molecular Inversion Probes-Next generation sequencing (MIPs-NGS) and TruSeq® Custom Amplicon-Next generation sequencing (TSCA-NGS). Capture probes were designed to target nine sodium channel genes (SCN3A, SCN8A-SCN11A, and SCN1B-SCN4B). One hundred sixty-six patients with diabetic and idiopathic neuropathy were tested by both methods; 70 patients were validated by Sanger sequencing. Sensitivity, specificity and performance of both techniques were comparable, and in agreement with Sanger sequencing. The average targeted regions coverage for MIPs-NGS was 97.3% versus 93.9% for TSCA-NGS. MIPs-NGS has a more versatile assay design and is more flexible than TSCA-NGS. MIPs-NGS is more than 5 times cheaper than TSCA-NGS when 500 or more samples are tested. In conclusion, MIPs-NGS is a reliable, flexible, and relatively inexpensive method to detect genetic variations in a large cohort of patients. In our centers, MIPs-NGS is currently implemented as a routine diagnostic tool for screening of sodium channel genes in painful neuropathy patients.
-
4.
The Battle to Sequence the Bread Wheat Genome: A Tale of the Three Kingdoms.
Guan, J, Garcia, DF, Zhou, Y, Appels, R, Li, A, Mao, L
Genomics, proteomics & bioinformatics. 2020;(3):221-229
Abstract
In the year 2018, the world witnessed the finale of the race to sequence the genome of the world's most widely grown crop, the common wheat. Wheat has been known to bear a notoriously large and complicated genome of a polyploid nature. A decade-long competition to sequence the wheat genome, initiated with a single consortium of multiple countries taking a conventional strategy similar to that for sequencing Arabidopsis and rice, became ferocious over time as both sequencing technologies and genome assembling methodologies advanced. At different stages, multiple versions of genome sequences of the same variety (e.g., Chinese Spring) were produced by several groups with their special strategies. Finally, 16 years after the rice genome was finished and 9 years after that of maize, the wheat research community now possesses its own reference genome. Armed with these genomics tools, wheat will reestablish itself as a model for polyploid plants in studying the mechanisms of polyploidy evolution, domestication, genetic and epigenetic regulation of homoeolog expression, as well as defining its genetic diversity and breeding on the genome level. The enhanced resolution of the wheat genome should also help accelerate development of wheat cultivars that are more tolerant to biotic and/or abiotic stresses with better quality and higher yield.
-
5.
Enhancing of Particle Swarm Optimization Based Method for Multiple Motifs Detection in DNA Sequences Collections.
Som-In, S, Kimpan, W
IEEE/ACM transactions on computational biology and bioinformatics. 2020;(3):990-998
Abstract
Genome sequence data consists of DNA sequences or input sequences. Each one includes nucleotides with chemical structures presented as characters: 'A', 'C', 'G', and 'T', and groups of motif sequences, called Transcription Factor Binding Sites (TFBSs), which are subsequences of DNA that lead to protein synthesis. The detection of TFBSs is an important problem for bioinformatics research. Because motif sequences in TFBSs share similar patterns, computational algorithms for TFBSs detection have been improved to reduce the resources used in laboratory settings. Metaheuristic algorithms are an important approach that has been continually improved to detect TFBSs with greater precision and recall. This paper proposes PSO_HD by applying Particle Swarm Optimization (PSO) as a pre-process and using Hamming distance to improve the efficiency of detecting TFBSs with more precision and recall. In order to measure its efficiency, the paper compares the TFBSs detection using the PSO_HD algorithm with relevant algorithms in eight datasets. F-score is used as the measurement and compared to the related algorithms. The experimental results show that the PSO_HD algorithm gives the highest average F-score, indicating that the PSO_HD algorithm can improve the efficiency of detecting TFBSs with more precision and recall.
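The two building blocks named in this abstract, Hamming-distance matching and the F-score, can be sketched as follows. This is not the PSO_HD algorithm itself (the particle swarm step is omitted entirely); the function names, the toy sequence, and the mismatch tolerance are hypothetical and serve only to show how candidate sites might be matched and scored.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_sites(sequence, motif, max_mismatches):
    """Start positions of windows within max_mismatches of the motif."""
    k = len(motif)
    return [i for i in range(len(sequence) - k + 1)
            if hamming(sequence[i:i + k], motif) <= max_mismatches]


def f_score(predicted, true):
    """Harmonic mean of precision and recall over predicted site positions."""
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


# Toy example: scan a made-up sequence for a made-up motif, allowing one mismatch.
sites = find_sites("ACGTTGACGATGACA", "TGACG", max_mismatches=1)
print(sites)                 # positions of the matching windows
print(f_score(sites, [4]))   # score against a hypothetical single true site
```

Allowing one mismatch finds the exact occurrence plus one near match, which raises recall at the cost of precision when only one site is real.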
-
6.
Pretreatment Tumor DNA Sequencing of KIT and PDGFRA in Endosonography-Guided Biopsies Optimizes the Preoperative Management of Gastrointestinal Stromal Tumors.
Hedenström, P, Andersson, C, Sjövall, H, Enlund, F, Nilsson, O, Nilsson, B, Sadik, R
Molecular diagnosis & therapy. 2020;(2):201-214
Abstract
BACKGROUND Neoadjuvant tyrosine kinase inhibitor (TKI) therapy increases the chance of organ-preserving, radical resection in selected patients with gastrointestinal stromal tumors (GISTs). We aimed to evaluate systematic, immediate DNA sequencing of KIT and PDGFRA in pretreatment GIST tissue to guide neoadjuvant TKI therapy and optimize preoperative tumor response. METHODS All patients who were candidates for neoadjuvant therapy of a suspected GIST [the study cohort (SC)] were prospectively included from January 2014 to March 2018. Patients were subjected to pretreatment endosonography-guided fine-needle biopsy (EUS-FNB) or transabdominal ultrasound-guided needle biopsy (TUS-NB), followed by immediate tumor DNA sequencing (< 2 weeks). A historic (2006-2013) reference cohort (RC) underwent work-up without sequencing before neoadjuvant imatinib (n = 42). The rate of optimal neoadjuvant therapy (TherapyOPTIMAL) was calculated, and the induced tumor size reduction (Tumor RegressionMAX, %) was evaluated by computed tomography (CT) scan. RESULTS The success rate of pretreatment tumor DNA sequencing in the SC (n = 81) was 77/81 (95%) [EUS-FNB 71/74 (96%); TUS-NB 6/7 (86%)], with mutations localized in KIT (n = 58), PDGFRA (n = 18), or neither gene, wild type (n = 5). In patients with a final indication for neoadjuvant therapy, the TherapyOPTIMAL was higher in the SC compared with the RC [61/63 (97%) versus 33/42 (79%), p = 0.006], leading to a significantly higher Tumor RegressionMAX in patients treated with TKI (27% vs. 19%, p = 0.015). CONCLUSIONS Pretreatment endosonography-guided biopsy sampling followed by immediate tumor DNA sequencing of KIT and PDGFRA is highly accurate and valuable in guiding neoadjuvant TKI therapy in GIST. This approach minimizes maltreatment with inappropriate regimens and leads to improved tumor size reduction before surgery.
-
7.
PepQuery enables fast, accurate, and convenient proteomic validation of novel genomic alterations.
Wen, B, Wang, X, Zhang, B
Genome research. 2019;(3):485-493
Abstract
Massively parallel or second-generation sequencing-based genomic studies continuously identify new genomic alterations that may lead to novel protein sequences, which are attractive candidates for disease biomarkers and therapeutic targets after proteomic validation. Integrative proteogenomic methods have been developed to use mass spectrometry (MS)-based proteomics data for such validation. These methods replace the reference sequence database in proteomic database searching with a customized protein database that incorporates sample- or disease-specific sequences derived from DNA or RNA sequencing, thus enabling the identification of novel protein sequences. Although useful, this spectrum-centric approach requires a full evaluation of all possible spectrum-peptide pairs, which is time-consuming, error-prone, and difficult to apply. Here, we present PepQuery, a peptide-centric approach that focuses on only novel DNA or protein sequences of interest. PepQuery allows quick and easy proteomic validation of genomic alterations without customized database construction. We demonstrated the sensitivity and specificity of the approach in validating completely novel proteins, novel splice junctions, and single amino acid variants using simulations and experimental data. Notably, enabling unrestricted modification searching in PepQuery reduced false positives by up to 95%. We implemented PepQuery as both web-based and stand-alone applications. The web version provides direct access to more than half a billion MS/MS spectra from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and other cancer proteomic studies. The stand-alone version supports batch analysis and user-provided MS/MS data. PepQuery will increase the usage of proteogenomics beyond the proteomics community and will broaden the application of proteogenomics in personalized medicine.
-
8.
The Reliability of DNA Sequences in Public Databases Belonging to the Most Economically Important Shiitake Culinary-Medicinal Mushroom Lentinus edodes (Agaricomycetes) in Asia.
Yang, RH, Wu, YY, Tang, LH, Li, CH, Shang, JJ, Li, Y, Song, Y, Huang, WH, Tao, XS, Tan, Q, et al
International journal of medicinal mushrooms. 2019;(12):1223-1239
Abstract
Large numbers of DNA sequences deposited in the International Nucleotide Sequence Databases (INSD) are erroneously annotated. The erroneous information may lead to misleading conclusions or cause great economic losses to farmers. Lentinus edodes (= Lentinula edodes (Berk.) Pegler) is one of the most important and popular culinary-medicinal mushrooms with a high nutritional value. In this study, experimental and in silico methods were used to correct the sequences annotated as L. edodes in the INSD. A total of 3,426 nucleotide entries were retrieved from public databases, including 140 different types of genetic sequences. Excluding 1,893 genome sequences, the most abundant signatures represented by ITS (258) and IGS1 (259) sequences accounted for 33.23% of the total entries. A total of 3,058 sequences were annotated correctly, 350 were indeterminate, and 18 were annotated erroneously based on the two methods. Correction of sequences will be beneficial for species identification and annotation. Phylogenetic analysis based on ITS sequences suggested that L. edodes segregates into four clades. The isolates from China were distributed into two clades. In L. edodes, the intraspecific variation of the ITS2 sequences was much higher than that of the ITS1 sequences. In addition, the genetic diversity of the L. edodes sequences from China was much higher than that of any other region included in this study. The northwest and southwest regions of China were L. edodes diversity centers.
-
9.
Carotenoid Cleavage Dioxygenases: Identification, Expression, and Evolutionary Analysis of This Gene Family in Tobacco.
Zhou, Q, Li, Q, Li, P, Zhang, S, Liu, C, Jin, J, Cao, P, Yang, Y
International journal of molecular sciences. 2019;(22)
Abstract
Carotenoid cleavage dioxygenases (CCDs) selectively catalyze carotenoids, forming smaller apocarotenoids that are essential for the synthesis of apocarotenoid flavor, aroma volatiles, and phytohormone ABA/SLs, as well as responses to abiotic stresses. Here, 19, 11, and 10 CCD genes were identified in Nicotiana tabacum, Nicotiana tomentosiformis, and Nicotiana sylvestris, respectively. For this family, we systematically analyzed phylogeny, gene structure, conserved motifs, gene duplications, cis-elements, subcellular and chromosomal localization, miRNA-target sites, expression patterns with different treatments, and molecular evolution. CCD genes were classified into two subfamilies and nine groups. Gene structures, motifs, and tertiary structures showed similarities within the same groups. Subcellular localization analysis predicted that CCD family genes are cytoplasmic and plastid-localized, which was confirmed experimentally. Evolutionary analysis showed that purifying selection dominated the evolution of these genes. Meanwhile, seven positive sites were identified on the ancestor branch of the tobacco CCD subfamily. Cis-regulatory elements of the CCD promoters were mainly involved in light-responsiveness, hormone treatment, and physiological stress. Different CCD family genes were predominantly expressed separately in roots, flowers, seeds, and leaves and exhibited divergent expression patterns with different hormones (ABA, MeJA, IAA, SA) and abiotic (drought, cold, heat) stresses. This study provides a comprehensive overview of the NtCCD gene family and a foundation for future functional characterization of individual genes.
-
10.
i6mA-DNCP: Computational Identification of DNA N6-Methyladenine Sites in the Rice Genome Using Optimized Dinucleotide-Based Features.
Kong, L, Zhang, L
Genes. 2019;(10)
Abstract
DNA N6-methyladenine (6mA) plays an important role in regulating the gene expression of eukaryotes. Accurate identification of 6mA sites may assist in understanding genomic 6mA distributions and biological functions. Various experimental methods have been applied to detect 6mA sites in a genome-wide scope, but they are too time-consuming and expensive. Developing computational methods to rapidly identify 6mA sites is needed. In this paper, a new machine learning-based method, i6mA-DNCP, was proposed for identifying 6mA sites in the rice genome. Dinucleotide composition and dinucleotide-based DNA properties were first employed to represent DNA sequences. After a specially designed DNA property selection process, a bagging classifier was used to build the prediction model. The jackknife test on a benchmark dataset demonstrated that i6mA-DNCP could obtain 84.43% sensitivity, 88.86% specificity, 86.65% accuracy, a 0.734 Matthews correlation coefficient (MCC), and a 0.926 area under the receiver operating characteristic curve (AUC). Moreover, three independent datasets were established to assess the generalization ability of our method. Extensive experiments validated the effectiveness of i6mA-DNCP.
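The dinucleotide composition feature and the reported evaluation metrics can be illustrated with a short sketch. It is not the i6mA-DNCP implementation: the DNA property features, the property selection step, and the bagging classifier are omitted, and the function names and toy confusion-matrix counts are hypothetical.

```python
from itertools import product
from math import sqrt

# All 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_composition(seq):
    """Frequency of each dinucleotide among the overlapping pairs in seq."""
    seq = seq.upper()
    total = len(seq) - 1  # number of overlapping pairs
    if total <= 0:
        return [0.0] * len(DINUCLEOTIDES)
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-ACGT symbols are ignored
            counts[pair] += 1
    return [counts[d] / total for d in DINUCLEOTIDES]


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and MCC from confusion counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


features = dinucleotide_composition("ACGT")  # pairs: AC, CG, GT
metrics = binary_metrics(tp=8, fn=2, tn=9, fp=1)  # toy counts, not the paper's
print(features)
print(metrics)
```

Each sequence becomes a 16-element vector regardless of its length, which is what allows a standard classifier to be trained on it.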