Data-centric AI for Small Molecule Drug Discovery
Date
2025-03-24
Abstract
Data-centric AI focuses on improving the quality and utility of data to enhance model performance, rather than relying solely on innovations in model architecture. This shift from a model-driven to a data-driven paradigm has gained attention in drug discovery, where poor data quality and limited data quantity pose significant challenges for model training. To address these challenges, we introduced WelQrate, a high-quality benchmark dataset for small molecule drug discovery, supported by a professional curation pipeline and a standardized evaluation framework aimed at bridging the gap between the biochemistry and AI communities. We found that dataset quality strongly influences model evaluation during benchmarking, which underscores the importance of improving data quality. Our experiments revealed two significant challenges in real-world drug discovery data that must be carefully considered when developing new algorithms: extreme class imbalance, caused by the low percentage of active compounds, and structural distribution shift, which arises because much of chemical space remains unexplored relative to known drugs. To address the imbalance issues in the WelQrate datasets, we developed ScaffAug, a novel data-driven augmentation framework that uses a diffusion model to generate new training samples from molecular scaffolds, targeting under-represented structures and minority classes. We demonstrated that incorporating these generated molecules through a self-training strategy can significantly improve the performance of activity predictors.
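The idea of favoring under-represented scaffolds and minority classes when choosing what to augment can be illustrated with a minimal sketch. This is not the ScaffAug implementation: the function name `scaffold_sampling_weights`, the toy records, and the weighting rule (mean inverse class frequency per scaffold) are all assumptions made for illustration, and scaffolds are taken as precomputed strings rather than derived from molecules.

```python
from collections import Counter


def scaffold_sampling_weights(records):
    """Return a normalized sampling weight per scaffold.

    `records` is a list of (scaffold, label) pairs, where label 1 means
    active and 0 means inactive. Hypothetical helper, not part of ScaffAug.
    """
    scaffold_counts = Counter(scaffold for scaffold, _ in records)
    label_counts = Counter(label for _, label in records)

    raw = {}
    for scaffold, label in records:
        # Each molecule contributes 1 / (scaffold size * class size), so a
        # scaffold's total equals the mean inverse class frequency of its
        # members, independent of how many molecules share the scaffold.
        contribution = 1.0 / (scaffold_counts[scaffold] * label_counts[label])
        raw[scaffold] = raw.get(scaffold, 0.0) + contribution

    total = sum(raw.values())
    return {scaffold: weight / total for scaffold, weight in raw.items()}


# Toy data (assumed): two inactive scaffolds of different sizes and one
# rare scaffold carrying the only active compound.
records = [("c1ccccc1", 0)] * 6 + [("c1ccncc1", 0)] * 3 + [("c1ccoc1", 1)]

active_fraction = sum(label for _, label in records) / len(records)
weights = scaffold_sampling_weights(records)
```

On this toy data, the single active scaffold receives most of the sampling mass, while the two inactive scaffolds are weighted equally despite their different sizes. A generative model could then be conditioned on scaffolds drawn according to these weights.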
Keywords
Data-centric AI, Drug Discovery