-
1.
Imbalance Data Processing Strategy for Protein Interaction Sites Prediction.
Wang, B, Mei, C, Wang, Y, Zhou, Y, Cheng, MT, Zheng, CH, Wang, L, Zhang, J, Chen, P, Xiong, Y
IEEE/ACM transactions on computational biology and bioinformatics. 2021;(3):985-994
Abstract
Protein-protein interactions play essential roles in various biological processes. Identifying protein interaction sites can help researchers understand life activities and will therefore be helpful for drug design. However, the number of experimentally determined protein interaction sites is far less than that of protein sites in protein-protein interactions or protein complexes. Therefore, the negative and positive samples are usually imbalanced, which is common but biases the prediction of protein interaction sites by computational approaches. In this work, we presented three imbalance data processing strategies to reconstruct the original dataset, and then extracted protein features from the evolutionary conservation of amino acids to build a predictor for identification of protein interaction sites. On a dataset with 10,430 surface residues but only 2,299 interface residues, the imbalance data processing strategies can clearly reduce the prediction bias, and therefore improve the prediction performance for protein interaction sites. The experimental results show that our prediction models can achieve better prediction performance, such as a prediction accuracy of 0.758 or an F-measure of 0.737, which demonstrates the effectiveness of our method.
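The abstract does not say which three imbalance processing strategies were used. As an illustration only, the sketch below shows random undersampling, one common way to rebalance a dataset, together with the F-measure the abstract reports. The function names `undersample` and `f_measure` and the toy data are assumptions, not taken from the paper.

```python
import random


def undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size.

    Illustrative only: the paper's own strategies are not described in the abstract.
    Labels are assumed to be 1 (interface residue) or 0 (non-interface residue).
    """
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    if len(pos) <= len(neg):
        minority, majority, minority_label = pos, neg, 1
    else:
        minority, majority, minority_label = neg, pos, 0
    # Keep every minority sample and an equally sized random subset of the majority.
    kept = rng.sample(majority, len(minority))
    balanced = [(s, minority_label) for s in minority]
    balanced += [(s, 1 - minority_label) for s in kept]
    rng.shuffle(balanced)
    return balanced


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data: 20 positives and 80 negatives (hypothetical, not the paper's dataset).
    toy_samples = list(range(100))
    toy_labels = [1] * 20 + [0] * 80
    balanced = undersample(toy_samples, toy_labels)
    print(len(balanced), sum(y for _, y in balanced))
```

Undersampling discards data from the majority class, so in practice it is often compared against oversampling or ensemble approaches before one is chosen.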
-
2.
Investigation of Potential Genetic Biomarkers and Molecular Mechanism of Ulcerative Colitis Utilizing Bioinformatics Analysis.
Zhang, J, Wang, X, Xu, L, Zhang, Z, Wang, F, Tang, X
BioMed research international. 2020;:4921387
Abstract
OBJECTIVES To reveal the molecular mechanisms of ulcerative colitis (UC) and provide potential biomarkers for UC gene therapy. METHODS We downloaded the GSE87473 microarray dataset from the Gene Expression Omnibus (GEO) and identified the differentially expressed genes (DEGs) between UC samples and normal samples. Then, a module partition analysis was performed based on a weighted gene coexpression network analysis (WGCNA), followed by pathway and functional enrichment analyses. Furthermore, we investigated the hub genes. Finally, data validation was performed to ensure the reliability of the hub genes. RESULTS Between the UC group and normal group, 988 DEGs were identified. The DEGs were clustered into 5 modules using WGCNA. These DEGs were mainly enriched in functions such as the immune response, the inflammatory response, and chemotaxis, and in KEGG pathways such as cytokine-cytokine receptor interaction, the chemokine signaling pathway, and complement and coagulation cascades. The hub genes, including dual oxidase maturation factor 2 (DUOXA2), serum amyloid A (SAA) 1 and SAA2, TNFAIP3-interacting protein 3 (TNIP3), C-X-C motif chemokine (CXCL1), solute carrier family 6 member 14 (SLC6A14), and complement decay-accelerating factor (CD antigen CD55), were revealed as potential tissue biomarkers for UC diagnosis or treatment. CONCLUSIONS This study provides supportive evidence that DUOXA2, A-SAA, TNIP3, CXCL1, SLC6A14, and CD55 might be used as potential biomarkers for tissue biopsy of UC, especially SLC6A14 and DUOXA2, which may be new targets for UC gene therapy. Moreover, the DUOX2/DUOXA2 and CXCL1/CXCR2 pathways might play an important role in the progression of UC through the chemokine signaling pathway and inflammatory response.
-
3.
RPiRLS: Quantitative Predictions of RNA Interacting with Any Protein of Known Sequence.
Shen, WJ, Cui, W, Chen, D, Zhang, J, Xu, J
Molecules (Basel, Switzerland). 2018;(3)
Abstract
RNA-protein interactions (RPIs) have critical roles in numerous fundamental biological processes, such as post-transcriptional gene regulation, viral assembly, cellular defence and protein synthesis. As the amount of available experimental RNA-protein binding data has increased rapidly due to high-throughput sequencing methods, it is now possible to measure and understand RNA-protein interactions by computational methods. In this study, we integrate a sequence-based derived kernel with regularized least squares to perform prediction. The derived kernel exploits the contextual information around an amino acid or a nucleic acid as well as the repetitive conserved motif information. We propose a novel machine learning method, called RPiRLS, to predict the interaction between any RNA and protein of known sequences. For the RPiRLS classifier, each protein sequence comprises up to 20 diverse amino acids, but for the RPiRLS-7G classifier, each protein sequence is represented using a 7-letter reduced alphabet based on physicochemical properties. We evaluated both methods on a number of benchmark data sets and compared their performance with two newly developed, state-of-the-art methods, RPI-Pred and IPMiner. On the non-redundant benchmark test sets extracted from the PRIDB, the RPiRLS method outperformed RPI-Pred and IPMiner in terms of accuracy, specificity and sensitivity. Further, RPiRLS achieved an accuracy of 92% on the prediction of lncRNA-protein interactions. The proposed method can also be extended to construct RNA-protein interaction networks. The RPiRLS web server is freely available at http://bmc.med.stu.edu.cn/RPiRLS.
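The abstract mentions a 7-letter reduced alphabet but does not list the groups. The sketch below assumes the seven-group classification commonly used in conjoint-triad style encodings; whether RPiRLS-7G uses exactly this grouping is an assumption, and the names `GROUPS_7` and `reduce_sequence` are hypothetical.

```python
# Assumed grouping (conjoint-triad style); not confirmed by the abstract.
GROUPS_7 = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]

# Map each of the 20 standard amino acids to a group digit "1".."7".
REDUCE = {aa: str(index + 1) for index, group in enumerate(GROUPS_7) for aa in group}


def reduce_sequence(sequence):
    """Rewrite a protein sequence in the 7-letter alphabet; unknown residues become 'X'."""
    return "".join(REDUCE.get(residue, "X") for residue in sequence.upper())


if __name__ == "__main__":
    print(reduce_sequence("AKDC"))
```

A reduced alphabet shrinks the space of possible k-mers (7^k instead of 20^k), which is one plausible reason for offering it alongside the full 20-letter representation.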
-
4.
HEMEsPred: Structure-Based Ligand-Specific Heme Binding Residues Prediction by Using Fast-Adaptive Ensemble Learning Scheme.
Zhang, J, Chai, H, Gao, B, Yang, G, Ma, Z
IEEE/ACM transactions on computational biology and bioinformatics. 2018;(1):147-156
Abstract
Heme is an essential biomolecule that widely exists in numerous extant organisms. Accurately identifying heme binding residues (HEMEs) is of great importance in disease progression and drug development. In this study, a novel predictor named HEMEsPred was proposed for predicting HEMEs. First, several sequence- and structure-based features, including amino acid composition, motifs, surface preferences, and secondary structure, were collected to construct feature matrices. Second, a novel fast-adaptive ensemble learning scheme was designed to overcome the serious class-imbalance problem as well as to enhance the prediction performance. Third, we further developed ligand-specific models, considering that different heme ligands vary significantly in their roles, sizes, and distributions. Statistical tests confirmed the effectiveness of the ligand-specific models. Experimental results on benchmark datasets demonstrated good robustness of our proposed method. Furthermore, our method also showed good generalization capability and outperformed many state-of-the-art predictors on two independent testing datasets. The HEMEsPred web server was available at http://www.inforstation.com/HEMEsPred/ for free academic use.
-
5.
PCVMZM: Using the Probabilistic Classification Vector Machines Model Combined with a Zernike Moments Descriptor to Predict Protein-Protein Interactions from Protein Sequences.
Wang, Y, You, Z, Li, X, Chen, X, Jiang, T, Zhang, J
International journal of molecular sciences. 2017;(5)
Abstract
Protein-protein interactions (PPIs) are essential for the processes of most living organisms. Thus, detecting PPIs is extremely important for understanding the molecular mechanisms of biological systems. Although a large amount of PPI data has been generated by high-throughput technologies for a variety of organisms, the whole interactome is still far from complete. In addition, the high-throughput technologies for detecting PPIs have some unavoidable defects, including time consumption, high cost, and high error rate. In recent years, with the development of machine learning, computational methods have been broadly used to predict PPIs and can achieve a good prediction rate. In this paper, we present PCVMZM, a computational method based on a Probabilistic Classification Vector Machines (PCVM) model and a Zernike moments (ZM) descriptor for predicting PPIs from protein amino acid sequences. Specifically, a Zernike moments (ZM) descriptor is used to extract protein evolutionary information from the Position-Specific Scoring Matrix (PSSM) generated by the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST). Then, the PCVM classifier is used to infer the interactions among proteins. When applied to PPI datasets of Yeast and H. pylori, the proposed method achieves average prediction accuracies of 94.48% and 91.25%, respectively. To further evaluate the performance of the proposed method, the state-of-the-art support vector machine (SVM) classifier is used for comparison with the PCVM model. Experimental results on the Yeast dataset show that the PCVM classifier performs better than the SVM classifier. The experimental results indicate that our proposed method is robust, powerful and feasible, and can be used as a helpful tool for proteomics research.
-
6.
PSFM-DBT: Identifying DNA-Binding Proteins by Combing Position Specific Frequency Matrix and Distance-Bigram Transformation.
Zhang, J, Liu, B
International journal of molecular sciences. 2017;(9)
Abstract
DNA-binding proteins play crucial roles in various biological processes, such as DNA replication and repair, transcriptional regulation and many other biological activities associated with DNA. Experimental techniques for identifying DNA-binding proteins are both time consuming and expensive. Effective methods for identifying these proteins based only on protein sequences are in high demand. The key for sequence-based methods is to represent protein sequences effectively. Various previous studies have reported that evolutionary information is crucial for DNA-binding protein identification. In this study, we employed four methods to extract the evolutionary information from the Position Specific Frequency Matrix (PSFM), including Residue Probing Transformation (RPT), Evolutionary Difference Transformation (EDT), Distance-Bigram Transformation (DBT), and Trigram Transformation (TT). The PSFMs were converted into fixed-length feature vectors by these four methods and then each combined with Support Vector Machines (SVMs); four predictors for identifying these proteins were constructed, including PSFM-RPT, PSFM-EDT, PSFM-DBT, and PSFM-TT. Experimental results on a widely used benchmark dataset, PDB1075, and an independent dataset, PDB186, showed that these four methods achieved state-of-the-art performance, and PSFM-DBT outperformed other existing methods in this field. For practical applications, a user-friendly webserver of PSFM-DBT was established, which is available at http://bioinformatics.hitsz.edu.cn/PSFM-DBT/.
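The abstract states that the PSFM is converted into a fixed-length feature vector but gives no formula. The sketch below assumes one common form of a distance-bigram transformation: for each distance d and each ordered pair of columns (i, j), average the product of column i at position k and column j at position k + d. The exact definition in the paper may differ, and the function name `distance_bigram` and its parameters are hypothetical.

```python
def distance_bigram(psfm, max_distance):
    """Convert an L x N frequency matrix into max_distance * N * N features.

    Assumed form, not taken from the paper:
        feature(d, i, j) = sum_k psfm[k][i] * psfm[k + d][j] / (L - d)
    The output length depends only on N and max_distance, not on L.
    """
    length = len(psfm)
    columns = len(psfm[0])
    features = []
    for distance in range(1, max_distance + 1):
        span = length - distance  # number of valid (k, k + distance) position pairs
        for i in range(columns):
            for j in range(columns):
                if span <= 0:
                    # Sequence shorter than the distance: no pairs to average.
                    features.append(0.0)
                    continue
                total = sum(
                    psfm[k][i] * psfm[k + distance][j] for k in range(span)
                )
                features.append(total / span)
    return features


if __name__ == "__main__":
    # Toy matrix with 3 positions and a 2-letter alphabet (hypothetical data).
    toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print(distance_bigram(toy, max_distance=2))
```

With a real PSFM over 20 amino acids, this yields 400 features per distance, so sequences of any length map to vectors of the same size, which is what a standard SVM requires.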
-
7.
Bioinformatics resources and tools for conformational B-cell epitope prediction.
Sun, P, Ju, H, Liu, Z, Ning, Q, Zhang, J, Zhao, X, Huang, Y, Ma, Z, Li, Y
Computational and mathematical methods in medicine. 2013;:943636
Abstract
Identification of epitopes that invoke strong humoral responses is an essential issue in the field of immunology. Localizing epitopes by experimental methods is expensive in terms of time, cost, and effort; therefore, computational methods, which feature low cost and high speed, have been employed to predict B-cell epitopes. In this paper, we review recent advances in bioinformatics resources and tools for conformational B-cell epitope prediction, including databases, algorithms, web servers, and their applications in solving problems in related areas. To stimulate the development of better tools, some promising directions are also extensively discussed.