Data-centric AI for Small Molecule Drug Discovery

dc.contributor.advisor: Derr, Tyler
dc.contributor.advisor: Berger, Matthew
dc.creator: Wang, Xin
dc.creator.orcid: 0009-0001-1519-7569
dc.date.accessioned: 2025-06-06T09:45:35Z
dc.date.available: 2025-06-06T09:45:35Z
dc.date.created: 2025-05
dc.date.issued: 2025-03-24
dc.date.submitted: May 2025
dc.description.abstract: Data-centric AI focuses on improving the quality and utility of data to enhance model performance rather than solely emphasizing innovations in model architecture. The shift from a model-driven to a data-driven paradigm has gained attention in drug discovery, where data quality issues and data quantity limitations present significant challenges for model training. Therefore, we introduced WelQrate, a high-quality benchmark dataset for small molecule drug discovery, supported by a professional curation pipeline and standardized evaluation framework aimed at bridging the gap between the biochemistry and AI communities. We found that dataset quality greatly influences model evaluation during benchmarking, stressing the importance of enhancing data quality. Our experiments revealed two significant challenges in real-world drug discovery data that must be carefully considered when developing new algorithms: extreme class imbalance due to low percentages of active compounds and structural distribution shift resulting from the unexplored chemical space compared to known drugs. Examining imbalance issues in WelQrate datasets, we developed ScaffAug, a novel data-driven augmentation framework that utilizes a diffusion model to generate novel training samples based on molecular scaffolds with under-represented structures and minor classes. We demonstrated that incorporating these generated molecules through a self-training strategy can significantly augment the performance of activity predictors.
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/1803/19748
dc.language.iso: en
dc.subject: Data-centric AI
dc.subject: Drug Discovery
dc.title: Data-centric AI for Small Molecule Drug Discovery
dc.type: Thesis
dc.type.material: text
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Vanderbilt University Graduate School
thesis.degree.level: Masters
thesis.degree.name: MS
Files
Original bundle
Name: WANG-THESIS-2025.pdf
Size: 5.61 MB
Format: Adobe Portable Document Format