-
1.
Prediction and Evolution of the Molecular Fitness of SARS-CoV-2 Variants: Introducing SpikePro.
Pucci, F, Rooman, M
Viruses. 2021;(5)
Abstract
The understanding of the molecular mechanisms driving the fitness of the SARS-CoV-2 virus and its mutational evolution is still a critical issue. We built a simplified computational model, called SpikePro, to predict the SARS-CoV-2 fitness from the amino acid sequence and structure of the spike protein. It contains three contributions: the inter-human transmissibility of the virus predicted from the stability of the spike protein, the infectivity computed in terms of the affinity of the spike protein for the ACE2 receptor, and the ability of the virus to escape from the human immune response based on the binding affinity of the spike protein for a set of neutralizing antibodies. Our model reproduces well the available experimental, epidemiological and clinical data on the impact of variants on the biophysical characteristics of the virus. For example, it is able to identify circulating viral strains that, by increasing their fitness, recently became dominant at the population level. SpikePro is a useful, freely available instrument which predicts rapidly and with good accuracy the dangerousness of new viral strains. It can be integrated and play a fundamental role in the genomic surveillance programs of the SARS-CoV-2 virus that, despite all the efforts, remain time-consuming and expensive.
-
2.
Causal Inference in Microbiome Medicine: Principles and Applications.
Lv, BM, Quan, Y, Zhang, HY
Trends in microbiology. 2021;(8):736-746
Abstract
Microorganisms that colonize the mammalian skin and cavity play critical roles in various physiological functions of the host. Numerous studies have revealed strong associations between the microbiota and multiple diseases. However, association does not mean causation. To clarify the mechanisms underlying microbiota-mediated diseases, research is moving from associative analyses to causation studies. In this article, we first introduce the principles of the computational methods for causal inference, and then discuss the applications of these methods in microbiome medicine. Furthermore, we examine the reliability of theoretically inferred causality by the interventionist framework. Finally, we show the potential of confirmed causality in microbiota-targeted therapy, especially in personalized dietary intervention. We conclude that a comprehensive understanding of the causal relationships between diets, microbiota, host targets, and diseases is critical to future microbiome medicine.
-
3.
An Introduction to Next Generation Sequencing Bioinformatic Analysis in Gut Microbiome Studies.
Gao, B, Chi, L, Zhu, Y, Shi, X, Tu, P, Li, B, Yin, J, Gao, N, Shen, W, Schnabl, B
Biomolecules. 2021;(4)
Abstract
The gut microbiome is a microbial ecosystem which expresses 100 times more genes than the human host and plays an essential role in human health and disease pathogenesis. Since most intestinal microbial species are difficult to culture, next generation sequencing technologies have been widely applied to study the gut microbiome, including 16S rRNA, 18S rRNA, internal transcribed spacer (ITS) sequencing, shotgun metagenomic sequencing, metatranscriptomic sequencing and viromic sequencing. Various software tools have been developed to analyze the different types of sequencing data. In this review, we summarize commonly used computational tools for gut microbiome data analysis, which have extended our understanding of the gut microbiome in health and disease.
-
4.
iPhosH-PseAAC: Identify Phosphohistidine Sites in Proteins by Blending Statistical Moments and Position Relative Features According to the Chou's 5-Step Rule and General Pseudo Amino Acid Composition.
Awais, M, Hussain, W, Khan, YD, Rasool, N, Khan, SA, Chou, KC
IEEE/ACM transactions on computational biology and bioinformatics. 2021;(2):596-610
Abstract
Protein phosphorylation is one of the key mechanisms in prokaryotes and eukaryotes and is responsible for various biological functions such as protein degradation, intracellular localization, a multitude of cellular processes, molecular association, cytoskeletal dynamics, and enzymatic inhibition/activation. Phosphohistidine (PhosH) has a key role in a number of biological processes, ranging from central metabolism to signalling in eukaryotes and bacteria. Thus, identification of phosphohistidine sites in a protein sequence is crucial, but experimental identification can be expensive, time-consuming, and laborious. To address this problem, we propose a novel computational model, iPhosH-PseAAC, for prediction of phosphohistidine sites in a given protein sequence using pseudo amino acid composition (PseAAC), statistical moments, and position relative features. The results of the proposed predictor are validated through self-consistency testing, 10-fold cross-validation, and jackknife testing. Self-consistency validation gave 100 percent accuracy, whereas 10-fold cross-validation gave 94.26 percent accuracy and jackknife testing gave 97.07 percent accuracy. Thus, the proposed model iPhosH-PseAAC predicts PhosH sites in given proteins with high accuracy.
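The three validation schemes named in this abstract differ only in how the data are split, which a short Python sketch can make concrete. The code below is an illustration under stated assumptions, not the iPhosH-PseAAC pipeline: the one-dimensional nearest-centroid classifier, the function names and the toy data are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Callable, Dict, List, Sequence

Predictor = Callable[[float], int]


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_nearest_centroid(xs: Sequence[float], ys: Sequence[int]) -> Predictor:
    """Toy stand-in classifier (hypothetical): assign the label of the closest class mean."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for x, label in zip(xs, ys):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))


def cross_validated_accuracy(xs: Sequence[float], ys: Sequence[int], k: int) -> float:
    """Train on k-1 folds, test on the held-out fold, and pool the correct predictions."""
    correct = 0
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit_nearest_centroid(train_x, train_y)
        correct += sum(predict(xs[i]) == ys[i] for i in test_idx)
    return correct / len(xs)


def self_consistency_accuracy(xs: Sequence[float], ys: Sequence[int]) -> float:
    """Train and test on the same data; this gives an optimistic upper bound."""
    predict = fit_nearest_centroid(xs, ys)
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)


# Toy data (made up for illustration only).
features = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
labels = [0, 0, 0, 1, 1, 1]

three_fold = cross_validated_accuracy(features, labels, k=3)
jackknife = cross_validated_accuracy(features, labels, k=len(features))  # leave-one-out
resubstitution = self_consistency_accuracy(features, labels)
```

Jackknife testing is the special case where the number of folds equals the number of samples. In practice the samples would usually be shuffled or stratified before splitting, since contiguous folds over label-sorted data can leave a class out of the training set.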
-
5.
MiDAS-Meaningful Immunogenetic Data at Scale.
Migdal, M, Ruan, DF, Forrest, WF, Horowitz, A, Hammer, C
PLoS computational biology. 2021;(7):e1009131
Abstract
Human immunogenetic variation in the form of HLA and KIR types has been shown to be strongly associated with a multitude of immune-related phenotypes. However, association studies involving immunogenetic loci most commonly involve simple analyses of classical HLA allelic diversity, resulting in limitations regarding the interpretability and reproducibility of results. Here we present MiDAS, a comprehensive R package for immunogenetic data transformation and statistical analysis. MiDAS recodes input data in the form of HLA alleles and KIR types into biologically meaningful variables, allowing HLA amino acid fine mapping, analyses of HLA evolutionary divergence as well as experimentally validated HLA-KIR interactions. Further, MiDAS enables comprehensive statistical association analysis workflows with phenotypes of diverse measurement scales. MiDAS thus closes the gap between the inference of immunogenetic variation and its efficient utilization to make relevant discoveries related to immune and disease biology. It is freely available under an MIT license.
-
6.
Identification of unique subtype-specific interaction features in Class II zinc-dependent HDAC subtype binding pockets: A computational study.
Ukey, S, Choudhury, C, Sharma, P
Journal of biosciences. 2021
Abstract
Zinc-dependent HDAC subtypes (ZnHDACs) exhibit differential expression in various cancer types and significantly contribute to oncogenic cell transformation, and hence are interesting anticancer drug targets. The approved pan-HDAC inhibitors (PHIs) lack subtype specificity and inhibit all ZnHDACs, causing severe side-effects. Considering the distinct tissue distribution and roles of individual ZnHDACs in specific cancer types, it is crucial to rationally design subtype-specific inhibitors (SSIs) for enhanced efficacy and reduced side-effects. Numerous approaches have already been applied to designing SSIs, especially for Class I ZnHDACs, whereas Class II and III ZnHDACs are relatively unexplored although equally important in disease pathogenesis. This study attempts to decipher the specificity-rendering interaction features of six different ZnHDACs by robust analyses of reported experimental data, employing computational methods such as homology modelling, docking, pharmacophore analysis, and molecular dynamics (MD) simulations. Experimentally validated SSIs (activity < 1000 nM) of different ZnHDACs and 8 approved PHIs were docked to 40 MD-generated conformations of each ZnHDAC, followed by MM-GBSA binding energy estimations. Sequences, structures, physicochemical properties, and interaction patterns of the binding sites obtained from docking were exhaustively compared to identify unique subtype-specific interaction features for each Class II ZnHDAC. To further validate the stabilities of these features, 20 ns MD simulations were performed on 12 complexes (each Class II ZnHDAC bound to one SSI and one PHI) in explicit water models. Distinct pharmacophoric patterns were observed in the binding pockets of each subtype despite high sequence similarities. The presence of amide, ketone, hydroxyl and carboxyl groups, and of moieties occupying additional sub-pockets and interacting with Zn2+, etc., in the SSIs affects the orientations of the binding site residues (BSRs) owing to subtype-specific protein-ligand interactions. Stable and unique residue interactions specific for an HDAC subtype are, e.g., E329 for HDAC4, S904 for HDAC5, W496, S563 and I569 for HDAC6, M793 for HDAC9, and E302 for HDAC10. Such unique interaction features and pharmacophoric patterns can be utilized for subtype-specific ZnHDAC inhibitor design.
-
7.
Missense3D-DB web catalogue: an atom-based analysis and repository of 4M human protein-coding genetic variants.
Khanna, T, Hanna, G, Sternberg, MJE, David, A
Human genetics. 2021;(5):805-812
Abstract
The interpretation of human genetic variation is one of the greatest challenges of modern genetics. New approaches are urgently needed to prioritize variants, especially those that are rare or lack a definitive clinical interpretation. We examined 10,136,597 human missense genetic variants from GnomAD, ClinVar and UniProt. We were able to perform large-scale atom-based mapping and phenotype interpretation of 3,960,015 of these variants onto 18,874 experimental and 84,818 in-house predicted three-dimensional coordinates of the human proteome. We demonstrate that 14% of amino acid substitutions from the GnomAD database that could be structurally analysed are predicted to affect protein structure (n = 568,548, of which 566,439 are rare or extremely rare) and may, therefore, have a yet unknown disease-causing effect. The same is true for 19.0% (n = 6266) of variants of unknown clinical significance or conflicting interpretation reported in the ClinVar database. The results of the structural analysis are available in the dedicated web catalogue Missense3D-DB (http://missense3d.bc.ic.ac.uk/). For each of the 4M variants, the results of the structural analysis are presented in a friendly concise format that can be included in clinical genetic reports. A detailed report of the structural analysis is also available for the non-experts in structural biology. Population frequency and predictions from SIFT and PolyPhen are included for a more comprehensive variant interpretation. This is the first large-scale atom-based structural interpretation of human genetic variation and offers geneticists and the biomedical community a new approach to genetic variant interpretation.
-
8.
Optimization of theoretical maximal quantity of cells to immobilize on solid supports in the rational design of immobilized derivatives strategy.
Castillo-Alfonso, F, Rojas, MM, Salgado-Bernal, I, Carballo, ME, Olivares-Hernández, R, González-Bacerio, J, Guisán, JM, Del Monte-Martínez, A
World journal of microbiology & biotechnology. 2021;(1):9
Abstract
Current worldwide challenges are to increase food production and decrease environmental contamination by industrial emissions. For this, bacteria can produce plant growth promoter phytohormones and mediate the bioremediation of sewage by heavy metals removal. We developed a Rational Design of Immobilized Derivatives (RDID) strategy, applicable for protein, spore and cell immobilization and implemented in the RDID1.0 software. In this work, we propose new algorithms to optimize the theoretical maximal quantity of cells to immobilize (tMQCell) on solid supports, implemented in the RDIDCell software. The main modifications to the preexisting algorithms are related to the sphere packing theory and exclusive immobilization on the support surface. We experimentally validated the new tMQCell parameter by electrostatic immobilization of ten microbial strains on AMBERJET® 4200 Cl- porous solid support. All predicted tMQCell match the practical maximal quantity of cells to immobilize with a 10% confidence. The values predicted by the RDIDCell software are more accurate than the values predicted by the RDID1.0 software. 3-indoleacetic acid (IAA) production by one bacterial immobilized derivative was higher (~2.6 μg IAA-like indoles/10^8 cells) than that of the cell suspension (1.5 μg IAA-like indoles/10^8 cells), and higher than the tryptophan amount added as indole precursor. Another bacterial immobilized derivative was more active (22 μg Cr(III)/10^8 cells) than the resuspended cells (14.5 μg Cr(III)/10^8 cells) in bioconversion of Cr(VI) to Cr(III). The optimized RDID strategy can be used to synthesize bacterial immobilized derivatives with useful biotechnological applications.
-
9.
A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides.
Charoenkwan, P, Chotpatiwetchkul, W, Lee, VS, Nantasenamat, C, Shoombuatong, W
Scientific reports. 2021;(1):23782
Abstract
Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906-0.910) and 2-17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlabstack.pythonanywhere.com/SCMTPP in order to allow easy access to the model. SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and guiding experimental characterization of TPPs.
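As a rough illustration of the features mentioned in this abstract, the Python sketch below computes a g-gap dipeptide composition and a scoring-card-style weighted sum. It assumes the common definition of a g-gap dipeptide (two residues separated by g intervening positions) and uses made-up propensity values; the actual SCMTPP scores, the value of g and the decision threshold are not given in the abstract, so every number and function name here is hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(sequence: str, g: int = 0) -> Dict[str, float]:
    """Return the frequency of each of the 400 residue pairs separated by g positions."""
    seq = sequence.upper()
    step = g + 1
    # Keep only pairs in which both residues are standard amino acids.
    pairs = [
        seq[i] + seq[i + step]
        for i in range(len(seq) - step)
        if seq[i] in AMINO_ACIDS and seq[i + step] in AMINO_ACIDS
    ]
    counts = Counter(pairs)
    total = len(pairs)
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


def scoring_card_score(sequence: str, propensity: Dict[str, float], g: int = 0) -> float:
    """Weighted sum of dipeptide frequencies; pairs without a score contribute zero."""
    composition = g_gap_dipeptide_composition(sequence, g)
    return sum(freq * propensity.get(pair, 0.0) for pair, freq in composition.items())


# Hypothetical propensity scores, for illustration only.
toy_propensity = {"AC": 900.0, "CA": 300.0}

adjacent = g_gap_dipeptide_composition("ACAC", g=0)  # pairs: AC, CA, AC
one_gap = g_gap_dipeptide_composition("ACAC", g=1)   # pairs: AA, CC
toy_score = scoring_card_score("ACAC", toy_propensity, g=0)
```

A sequence would then be labelled thermophilic when its score exceeds a chosen threshold. Because the score is a plain weighted sum, each dipeptide's contribution can be read off directly, which is the sense in which a scoring card is interpretable.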
-
10.
Generating functional protein variants with variational autoencoders.
Hawkins-Hooker, A, Depardieu, F, Baur, S, Couairon, G, Chen, A, Bikard, D
PLoS computational biology. 2021;(2):e1008736
Abstract
The vast expansion of protein sequence databases provides an opportunity for new protein design approaches which seek to learn the sequence-function relationship directly from natural sequence variation. Deep generative models trained on protein sequence data have been shown to learn biologically meaningful representations helpful for a variety of downstream tasks, but their potential for direct use in the design of novel proteins remains largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the luxA bacterial luciferase. We propose separate VAE models to work with aligned sequence input (MSA VAE) and raw sequence input (AR-VAE), and offer evidence that while both are able to reproduce patterns of amino acid usage characteristic of the family, the MSA VAE is better able to capture long-distance dependencies reflecting the influence of 3D structure. To confirm the practical utility of the models, we used them to generate variants of luxA whose luminescence activity was validated experimentally. We further showed that conditional variants of both models could be used to increase the solubility of luxA without disrupting function. Altogether 6/12 of the variants generated using the unconditional AR-VAE and 9/11 generated using the unconditional MSA VAE retained measurable luminescence, together with all 23 of the less distant variants generated by conditional versions of the models; the most distant functional variant contained 35 differences relative to the nearest training set sequence. These results demonstrate the feasibility of using deep generative models to explore the space of possible protein sequences and generate useful variants, providing a method complementary to rational design and directed evolution approaches.
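The "35 differences relative to the nearest training set sequence" reported in this abstract is a distance to the closest training example. The Python sketch below shows one plausible way to compute such a figure, assuming position-wise mismatch counting over pre-aligned, equal-length sequences; the abstract does not state the exact distance used, and the function names and toy sequences are hypothetical.

```python
from typing import Iterable


def count_differences(a: str, b: str) -> int:
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_training_distance(variant: str, training_set: Iterable[str]) -> int:
    """Smallest number of differences between a variant and any training sequence."""
    return min(count_differences(variant, t) for t in training_set)


# Toy aligned sequences (made up for illustration only).
training = ["MKQA", "AAAA"]
distance = nearest_training_distance("MKLA", training)
```

For unaligned sequences of different lengths, an edit distance or an alignment-based identity measure would be needed instead of this position-wise count.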