ProtDML: label-aware representation learning for broad-spectrum protein function prediction

Abstract

Protein function prediction is crucial for narrowing the gap between the rapid growth of protein sequences and the limited availability of experimental annotations. In recent years, protein language models (pLMs) that learn representations from large-scale sequence data have emerged as scalable frameworks for modeling protein function. However, most pLMs rely on self-supervised pretraining and struggle to capture correlations among functional labels, which constrains their applicability to annotation-specific tasks. To address this limitation, we present ProtDML, a protein function prediction framework based on distance metric learning that explicitly models label correlations. ProtDML learns a task-specific distance metric using a similarity-weighted pull-push objective designed to encode co-annotation structure. This approach yields a label-aware feature space with improved intra-class compactness and inter-class separability, thereby enhancing the modeling of protein multifunctionality. Furthermore, we introduce a multi-view strategy that integrates complementary sequence-derived protein representations in sequence-only environments. On Pfam multi-label classification, ProtDML outperforms sequence-based baselines by effectively modeling domain co-occurrence patterns and transforming the embedding space into label-distinct clusters with clearer boundaries. Structural generalization on SCOP-based benchmarks demonstrates that ProtDML's function-informed metric learning can recover latent structural similarities, even without explicit structural supervision. On the CAFA3 benchmark, ProtDML achieves the lowest semantic distance across all ontologies, indicating robust performance in large-scale and imbalanced function prediction. A SARS-CoV-2 case study further illustrates its applicability to real-world viral data. Overall, ProtDML is model-agnostic and complements pretrained pLM embeddings, enabling accurate and scalable sequence-based protein function prediction.
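The abstract names a similarity-weighted pull-push objective but does not give its form. The following is a minimal illustrative sketch of that general idea, not the authors' formulation. It assumes, purely for illustration, that pairwise label similarity is the Jaccard index between binary label vectors, that distances are Euclidean, and that the push term uses a hinge with a hypothetical `margin` parameter. The function names `jaccard_similarity` and `pull_push_loss` are likewise hypothetical.

```python
import numpy as np


def jaccard_similarity(Y):
    """Pairwise Jaccard similarity between rows of a binary label matrix Y (n, L).

    Assumption for this sketch: label overlap is measured with the Jaccard index.
    """
    Y = Y.astype(float)
    inter = Y @ Y.T                                  # shared labels per pair
    counts = Y.sum(axis=1)                           # labels per sample
    union = counts[:, None] + counts[None, :] - inter
    # Pairs with an empty union (no labels on either side) get similarity 0.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


def pull_push_loss(Z, Y, margin=1.0):
    """Similarity-weighted contrastive-style loss over all distinct pairs.

    Z: (n, d) embeddings; Y: (n, L) binary labels.
    Pairs with high label similarity are pulled together (squared distance),
    pairs with low similarity are pushed apart up to `margin` (squared hinge).
    """
    S = jaccard_similarity(Y)
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)    # pairwise Euclidean distance
    pull = S * D ** 2
    push = (1.0 - S) * np.maximum(0.0, margin - D) ** 2
    off_diag = ~np.eye(len(Z), dtype=bool)           # ignore self-pairs
    return float((pull + push)[off_diag].mean())


# Toy data: two proteins per label.
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
Z_clustered = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
Z_mixed = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 0.1], [5.0, 5.1]])

loss_clustered = pull_push_loss(Z_clustered, Y)
loss_mixed = pull_push_loss(Z_mixed, Y)
```

In this toy setup the embedding that groups proteins by shared label gives a lower loss than the one that mixes labels, which is the behaviour a label-aware metric is meant to reward. For multi-label proteins the Jaccard weight takes intermediate values, so partially co-annotated pairs are pulled together less strongly than fully co-annotated ones.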

Keywords

protein representation learning; protein function prediction; multi-label classification; distance metric learning; PSI-BLAST; LANGUAGE; GENES
Authors
Kan, Yejin; Yi, Gangman
DOI
10.1093/bib/bbag293
Publication date
2026-05
Type
Article
Journal
Briefings in Bioinformatics
Volume 27, Issue 3