-
1.
De novo sequencing of proteins by mass spectrometry.
Vitorino, R, Guedes, S, Trindade, F, Correia, I, Moura, G, Carvalho, P, Santos, MAS, Amado, F
Expert review of proteomics. 2020;(7-8):595-607
Abstract
INTRODUCTION: Proteins are crucial for every cellular activity, and unraveling their sequence and structure is a key step toward fully understanding their biology. Early methods of protein sequencing were mainly based on enzymatic or chemical degradation of peptide chains. With the completion of the Human Genome Project and the expansion of the information available for each protein, various databases containing this sequence information were formed. AREAS COVERED: De novo protein sequencing, shotgun proteomics and other mass-spectrometric techniques, along with the various software tools currently available for proteogenomic analysis. Emphasis is placed on the methods for de novo sequencing, together with the potential and shortcomings of using databases for interpretation of protein sequence data. EXPERT OPINION: As mass-spectrometry sequencing performance improves with better software and hardware optimizations, combined with user-friendly interfaces, de novo protein sequencing becomes imperative in shotgun proteomic studies. Issues regarding unknown or mutated peptide sequences, as well as unexpected post-translational modifications (PTMs) and their identification through false discovery rate searches using the target/decoy strategy, need to be addressed. Ideally, de novo sequencing should become integrated in standard proteomic workflows as an add-on to conventional database search engines, which would then be able to provide improved identification.
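The target/decoy strategy mentioned in this abstract can be illustrated with a minimal sketch. It assumes each peptide-spectrum match (PSM) is a `(score, is_decoy)` pair and uses the simple decoys/targets ratio as the estimate; the function name `decoy_fdr` and the example scores are hypothetical, and the review itself does not prescribe a particular formula.

```python
def decoy_fdr(psms, threshold):
    """Estimate the FDR among PSMs scoring >= threshold.

    psms: iterable of (score, is_decoy) pairs (hypothetical layout).
    Estimate used here: (# decoy hits) / (# target hits) above the threshold.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    # With no target hits there is nothing to report, so return 0.0.
    return decoys / targets if targets else 0.0


# Made-up scores for illustration only.
example_psms = [
    (95.0, False), (90.0, False), (88.0, False), (85.0, True),
    (80.0, False), (60.0, True), (55.0, False),
]

print(decoy_fdr(example_psms, 80.0))
```

Lowering the score threshold admits more decoy hits, so the estimated FDR rises; in practice a threshold is usually chosen so that the estimate stays below a preset level.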
-
2.
Use of Chou's 5-steps rule to predict the subcellular localization of gram-negative and gram-positive bacterial proteins by multi-label learning based on gene ontology annotation and profile alignment.
Bouziane, H, Chouarfia, A
Journal of integrative bioinformatics. 2020;(1):51-79
Abstract
To date, many proteins generated by large-scale genome sequencing projects are still uncharacterized and subject to intensive investigations by both experimental and computational means. Knowledge of protein subcellular localization (SCL) is of key importance for protein function elucidation. However, it remains a challenging task, especially for multi-site proteins, which are known to shuttle between cell compartments to perform their biological functions, and for proteins that do not have significant homology to proteins of known subcellular locations. Due to their low cost and reasonable accuracy, machine learning-based methods have gained much attention in this context with the availability of a plethora of biological databases and annotated proteins for analysis and benchmarking. Various predictive models have been proposed to tackle the SCL problem, using different protein sequence features pertaining to the subcellular localization; however, the overwhelming majority of them focus on a single localization and cover very limited cellular locations. Prediction was basically established on sorting signals, amino acid compositions, and homology. To improve the prediction quality, the focus is currently on knowledge extracted from annotation databases, such as protein-protein interactions and Gene Ontology (GO) functional domain annotation, which has recently become widely adopted and essential information for learning systems. To address this problem, in the present study we treated the SCL prediction task as a multi-label learning problem and tried to label both single-site and multi-site unannotated bacterial protein sequences by mining protein homology relationships using both GO terms of protein homologs and PSI-BLAST profiles.
The experiments using 5-fold cross-validation tests on the benchmark datasets showed a significant improvement in the results obtained by the proposed consensus multi-label prediction model, which discriminates six compartments for Gram-negative and five compartments for Gram-positive bacterial proteins.
-
3.
Machine-learning approach expands the repertoire of anti-CRISPR protein families.
Gussow, AB, Park, AE, Borges, AL, Shmakov, SA, Makarova, KS, Wolf, YI, Bondy-Denomy, J, Koonin, EV
Nature communications. 2020;(1):3784
Abstract
CRISPR-Cas systems are adaptive bacterial and archaeal immunity systems that have been harnessed for the development of powerful genome editing and engineering tools. In the incessant host-parasite arms race, viruses evolved multiple anti-defense mechanisms, including diverse anti-CRISPR proteins (Acrs) that specifically inhibit CRISPR-Cas and therefore have enormous potential for application as modulators of genome editing tools. Most Acrs are small and highly variable proteins, which makes their bioinformatic prediction a formidable task. We present a machine-learning approach for comprehensive Acr prediction. The model shows high predictive power when tested against an unseen test set and was employed to predict 2,500 candidate Acr families. Experimental validation of top candidates revealed two unknown Acrs (AcrIC9, IC10), and three other top candidates were coincidentally identified and found to possess anti-CRISPR activity. These results substantially expand the repertoire of predicted Acrs and provide a resource for experimental Acr discovery.
-
4.
Small design from big alignment: engineering proteins with multiple sequence alignment as the starting point.
Wang, T, Liang, C, Hou, Y, Zheng, M, Xu, H, An, Y, Xiao, S, Liu, L, Lian, S
Biotechnology letters. 2020;(8):1305-1315
Abstract
Multiple sequence alignment (MSA) is a fundamental way to gain information that cannot be obtained from the analysis of any individual sequence included in the alignment. It provides ways to investigate the relationship between sequence and function from an evolutionary perspective. Thus, the MSA of proteins can be employed as a reference for protein engineering. In this paper, we review recent advances to highlight how protein engineering has benefited from the MSA of proteins. These methods include (1) engineering the thermostability or solubility of proteins by bringing them closer to the consensus sequence of the alignment through site mutations; (2) structure-based engineering of proteins with comparative modeling; (3) creating paleoenzymes featuring high thermostability and promiscuity by constructing ancestral sequences derived from multiple sequence alignment; and (4) incorporating site mutations targeting the evolutionarily coupled sites identified from multiple sequence alignment.
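Method (1) in this abstract, moving a protein toward the consensus of an alignment, can be sketched as follows. This is a minimal illustration under stated assumptions: the alignment is a list of equal-length strings with `-` as the gap character, the consensus is the most frequent non-gap residue per column, and positions are numbered by alignment column. The function names and the toy alignment are hypothetical and not taken from the review.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Return the most frequent non-gap residue for each alignment column."""
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        # An all-gap column stays a gap in the consensus.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


def consensus_mutations(target, alignment, gap="-"):
    """List substitutions (e.g. 'R2K') that move target toward the consensus."""
    cons = consensus(alignment, gap)
    return [
        f"{t}{i + 1}{c}"
        for i, (t, c) in enumerate(zip(target, cons))
        if t != c and t != gap and c != gap
    ]


# Toy alignment for illustration only.
toy_msa = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]
print(consensus(toy_msa))
print(consensus_mutations("MRVLG", toy_msa))
```

Real consensus design would normally weight sequences and check candidate mutations against structure; this sketch only shows the column-wise majority idea.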
-
5.
Seq2seq Fingerprint with Byte-Pair Encoding for Predicting Changes in Protein Stability upon Single Point Mutation.
Kawano, K, Koide, S, Imamura, C
IEEE/ACM transactions on computational biology and bioinformatics. 2020;(5):1762-1772
Abstract
The engineering of stable proteins is crucial for various industrial purposes. Several machine learning methods have been developed to predict changes in the stability of proteins corresponding to single point mutations. To improve the prediction accuracy, we propose a new unsupervised descriptor for protein sequences, which is based on a sequence-to-sequence (seq2seq) neural network model combined with a sequence-compression method called byte-pair encoding (BPE). Our results demonstrate that BPE can encode a protein sequence into a sequence of shorter length, thereby enabling efficient training of the seq2seq model. Furthermore, we implement a basic predictor using the proposed descriptor, and our experimental results demonstrate that the predictor achieves state-of-the-art accuracy in tests for proteins that are not included in the training data.
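The claim that byte-pair encoding shortens a protein sequence can be illustrated with a minimal BPE sketch. It assumes the generic algorithm (repeatedly merge the most frequent adjacent token pair) applied to amino-acid letters; the function names, the tie-breaking rule, and the toy sequences are illustrative assumptions and not the authors' implementation.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn an ordered list of merges from a corpus of sequences."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def encode(seq, merges):
    """Apply the learned merges, in order, to a new sequence."""
    tokens = list(seq)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy sequences for illustration only.
toy_merges = learn_bpe(["AKAKAK", "AKLAK"], num_merges=2)
print(toy_merges)
print(encode("AKAKAK", toy_merges))
```

The encoded token list is shorter than the residue-level sequence and still concatenates back to the original string, which is the property the abstract relies on for more efficient seq2seq training.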
-
6.
IPRO+/-: Computational Protein Design Tool Allowing for Insertions and Deletions.
Chowdhury, R, Grisewood, MJ, Boorla, VS, Yan, Q, Pfleger, BF, Maranas, CD
Structure (London, England : 1993). 2020;(12):1344-1357.e4
Abstract
Insertions and deletions (indels) in protein sequences alter the residue spacing along the polypeptide backbone and consequently open up possibilities for tuning protein function in a way that is inaccessible by amino acid substitution alone. We describe an optimization-based computational protein redesign approach centered around predicting beneficial combinations of indels along with substitutions and also obtain putative substrate-docked structures for these protein variants. This modified algorithmic capability would be of interest for enzyme engineering and broadly inform other protein design tasks. We highlight this capability by (1) identifying active variants of a bacterial thioesterase enzyme ('TesA) with experimental corroboration, (2) recapitulating existing active TEM-1 β-Lactamase sequences of different sizes, and (3) identifying shorter 4-Coumarate:CoA ligases with enhanced in vitro activities toward non-native substrates. A separate PyRosetta-based open-source tool, Indel-Maker (http://www.maranasgroup.com/software.htm), has also been created to construct computational models of user-defined protein variants with specific indels and substitutions.
-
7.
Amino Acid Encoding Methods for Protein Sequences: A Comprehensive Review and Assessment.
Jing, X, Dong, Q, Hong, D, Lu, R
IEEE/ACM transactions on computational biology and bioinformatics. 2020;(6):1918-1931
Abstract
As the first step of machine-learning based protein structure and function prediction, amino acid encoding plays a fundamental role in the final success of those methods. Unlike protein sequence encoding, amino acid encoding can be used in both residue-level and sequence-level prediction of protein properties by combining it with different algorithms. However, it has not attracted enough attention in the past decades, and there have been no comprehensive reviews and assessments of encoding methods so far. In this article, we make a systematic classification and present a comprehensive review and assessment of various amino acid encoding methods. These methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from the five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks using large-scale benchmark datasets. The results show that the evolution-based position-dependent encoding method PSSM achieved the best performance, and that the structure-based and machine-learning encoding methods also show some potential for further application; the neural network based distributed representation of amino acids in particular may bring new light to this area. We hope that the review and assessment are useful for future studies in amino acid encoding.
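The first category named in this abstract, binary encoding, is commonly realized as one-hot encoding over the 20 standard amino acids. The sketch below assumes that reading; the alphabet order, the function name `one_hot`, and the choice to map unknown residues to an all-zero row are illustrative assumptions.

```python
# The 20 standard amino acids in alphabetical one-letter order (an assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode each residue as a 20-dimensional 0/1 vector.

    Residues outside the standard alphabet (e.g. 'X') become all-zero rows.
    """
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        if aa in INDEX:
            row[INDEX[aa]] = 1
        rows.append(row)
    return rows


encoded = one_hot("ACY")
print(len(encoded), len(encoded[0]))
```

Each residue gets its own row, so the same encoding can feed residue-level predictors directly or be pooled for sequence-level ones.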
-
8.
Protein Remote Homology Detection and Fold Recognition Based on Sequence-Order Frequency Matrix.
Liu, B, Chen, J, Guo, M, Wang, X
IEEE/ACM transactions on computational biology and bioinformatics. 2019;(1):292-300
Abstract
Protein remote homology detection and fold recognition are two critical tasks for the studies of protein structures and functions. Currently, the profile-based methods achieve the state-of-the-art performance in these fields. However, the widely used sequence profiles, like position-specific frequency matrix (PSFM) and position-specific scoring matrix (PSSM), ignore the sequence-order effects along protein sequence. In this study, we have proposed a novel profile, called sequence-order frequency matrix (SOFM), to extract the sequence-order information of neighboring residues from multiple sequence alignment (MSA). Combined with two profile feature extraction approaches, top-n-grams and the Smith-Waterman algorithm, the SOFMs are applied to protein remote homology detection and fold recognition, and two predictors called SOFM-Top and SOFM-SW are proposed. Experimental results show that SOFM contains more information content than other profiles, and these two predictors outperform other state-of-the-art methods. It is anticipated that SOFM will become a very useful profile in the studies of protein structures and functions.
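For context, a position-specific frequency matrix (PSFM) of the kind this abstract contrasts with SOFM can be sketched as plain per-column residue frequencies. This is a simplified illustration: it assumes an alignment given as equal-length strings, ignores gaps, and applies no sequence weighting or pseudocounts, all of which real profile tools typically add. The function name `psfm` and the toy alignment are hypothetical, and the sketch does not implement SOFM itself.

```python
from collections import Counter

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"


def psfm(alignment, alphabet=STANDARD_AA):
    """Return one {residue: frequency} dict per alignment column.

    Gaps and non-standard symbols are ignored when counting.
    """
    matrix = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in alphabet)
        total = sum(counts.values())
        matrix.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in alphabet}
        )
    return matrix


# Toy alignment for illustration only.
toy_alignment = ["MKV", "MRV", "MKI", "MK-"]
toy_profile = psfm(toy_alignment)
print(toy_profile[1]["K"], toy_profile[1]["R"])
```

Each column is computed independently of its neighbors, which is the limitation the abstract points to when it says such profiles ignore sequence-order effects.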
-
9.
ProtDet-CCH: Protein Remote Homology Detection by Combining Long Short-Term Memory and Ranking Methods.
Liu, B, Li, S
IEEE/ACM transactions on computational biology and bioinformatics. 2019;(4):1203-1210
Abstract
As one of the most challenging tasks in sequence analysis, protein remote homology detection has been extensively studied. Methods based on discriminative models and ranking approaches have achieved state-of-the-art performance, and these two kinds of methods are complementary. In this study, three LSTM models, ULSTM, BLSTM, and CNN-BLSTM, have been applied to construct predictors for protein remote homology detection. They are able to automatically extract local and global sequence-order information. Combined with PSSMs, the CNN-BLSTM achieved the best performance among the three LSTM-based models. We named this method CNN-BLSTM-PSSM. Finally, a new method called ProtDet-CCH was proposed by combining CNN-BLSTM-PSSM with the ranking method HHblits. Tested on a widely used SCOP benchmark dataset, ProtDet-CCH achieved an ROC score of 0.998 and an ROC50 score of 0.982, significantly outperforming other existing state-of-the-art methods. Experimental results on two updated SCOPe independent datasets showed that ProtDet-CCH can achieve stable performance. Furthermore, our method can provide useful insights for studying the features and motifs of protein families and superfamilies. It is anticipated that ProtDet-CCH will become a very useful tool for protein remote homology detection.
-
10.
PepQuery enables fast, accurate, and convenient proteomic validation of novel genomic alterations.
Wen, B, Wang, X, Zhang, B
Genome research. 2019;(3):485-493
Abstract
Massively parallel or second-generation sequencing-based genomic studies continuously identify new genomic alterations that may lead to novel protein sequences, which are attractive candidates for disease biomarkers and therapeutic targets after proteomic validation. Integrative proteogenomic methods have been developed to use mass spectrometry (MS)-based proteomics data for such validation. These methods replace the reference sequence database in proteomic database searching with a customized protein database that incorporates sample- or disease-specific sequences derived from DNA or RNA sequencing, thus enabling the identification of novel protein sequences. Although useful, this spectrum-centric approach requires a full evaluation of all possible spectrum-peptide pairs, which is time-consuming, error-prone, and difficult to apply. Here, we present PepQuery, a peptide-centric approach that focuses on only novel DNA or protein sequences of interest. PepQuery allows quick and easy proteomic validation of genomic alterations without customized database construction. We demonstrated the sensitivity and specificity of the approach in validating completely novel proteins, novel splice junctions, and single amino acid variants using simulations and experimental data. Notably, enabling unrestricted modification searching in PepQuery reduced false positives by up to 95%. We implemented PepQuery as both web-based and stand-alone applications. The web version provides direct access to more than half a billion MS/MS spectra from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and other cancer proteomic studies. The stand-alone version supports batch analysis and user-provided MS/MS data. PepQuery will increase the usage of proteogenomics beyond the proteomics community and will broaden the application of proteogenomics in personalized medicine.