Topology-based and Conformation-based Decoys Database

An unbiased database for training and benchmarking machine-learning scoring functions, providing not only 155 target-specific datasets but also a decoy generation interface.



About ToCoDDB

datasets for machine-learning scoring functions.

Machine-learning-based scoring functions (MLSFs) have attracted extensive attention because they can potentially improve accuracy in binding affinity prediction and/or structure-based virtual screening (SBVS) compared with classical scoring functions (SFs). Developing accurate MLSFs for SBVS against a given target requires a large, unbiased dataset with structurally diverse actives and decoys. However, most existing datasets used to develop MLSFs were originally designed for traditional SFs and may suffer from hidden biases (artificial enrichment, analogue bias, domain bias and noncausal bias) and data insufficiency. We therefore developed a new database, the Topology-based and Conformation-based Decoys Database (ToCoDDB), which not only provides 155 target-specific unbiased datasets but can also generate unbiased and expandable datasets for training and benchmarking MLSFs.


In ToCoDDB, the biological targets and their active ligands were collected from the scientific literature and from existing datasets for traditional SFs, including DUD-E, DEKOIS and LIT-PCBA; duplicated targets were manually checked and merged. The final ToCoDDB, built from multiple data sources, contains nearly 2.8 million compounds for 155 targets, which makes it the largest and most target-diverse database for training MLSFs.

In ToCoDDB, two protocols were designed to generate and debias decoys by tweaking the actives for a specific target: (1) a conditional recurrent neural network (cRNN) was used to generate virtual decoy molecules that have physicochemical properties similar to those of the actives but are topologically dissimilar, and (2) the actives were docked into the binding pocket, and the docking conformations of the actives with high docking scores were regarded as decoy conformations.
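The two selection criteria described above can be illustrated with a minimal Python sketch. It is not the ToCoDDB implementation: the cRNN generator and the docking step are omitted, and the `Molecule` class, the function names, the fingerprint representation (a set of on-bit indices) and every threshold are hypothetical choices made for illustration only. The first filter keeps a candidate as a topology-based decoy when its properties are close to those of the active but its Tanimoto similarity is low. The second takes the highest-scoring docking poses of an active as conformation-based decoys, following the wording above; what a high score means depends on the docking program.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Molecule:
    """Hypothetical container; values would come from a cheminformatics toolkit."""
    name: str
    mol_weight: float             # molecular weight in Da
    logp: float                   # octanol-water partition coefficient
    fingerprint: FrozenSet[int]   # on-bit indices of a topological fingerprint


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two bit sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_topology_decoy(
    active: Molecule,
    candidate: Molecule,
    max_similarity: float = 0.4,   # assumed cutoff, not taken from ToCoDDB
    mw_tolerance: float = 40.0,    # assumed tolerance in Da
    logp_tolerance: float = 1.5,   # assumed tolerance in logP units
) -> bool:
    """Keep candidates that match the active's properties but not its topology."""
    properties_match = (
        abs(active.mol_weight - candidate.mol_weight) <= mw_tolerance
        and abs(active.logp - candidate.logp) <= logp_tolerance
    )
    dissimilar = tanimoto(active.fingerprint, candidate.fingerprint) < max_similarity
    return properties_match and dissimilar


def select_conformation_decoys(
    poses: List[Tuple[str, float]], n_decoys: int
) -> List[str]:
    """Return the ids of the n_decoys poses with the highest docking scores."""
    ranked = sorted(poses, key=lambda pose: pose[1], reverse=True)
    return [pose_id for pose_id, _ in ranked[:n_decoys]]


# Toy data, invented for this example.
active = Molecule("active", 350.0, 2.5, frozenset({1, 2, 3, 4, 5, 6}))
analogue = Molecule("analogue", 355.0, 2.6, frozenset({1, 2, 3, 4, 5, 7}))
decoy = Molecule("decoy", 345.0, 2.2, frozenset({1, 20, 21, 22, 23, 24}))
heavy = Molecule("heavy", 480.0, 2.5, frozenset({30, 31}))

kept = [m.name for m in (analogue, decoy, heavy) if is_topology_decoy(active, m)]
poses = [("pose_1", -9.1), ("pose_2", -4.2), ("pose_3", -6.0)]
conformation_decoys = select_conformation_decoys(poses, n_decoys=2)
```

In this toy data only `decoy` passes the filter: `analogue` is too similar to the active and `heavy` falls outside the molecular-weight tolerance.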

The performance of the InteractionGraphNet (IGN) model, a graph-based MLSF, on our database was recorded and is provided as a benchmark. To our knowledge, ToCoDDB is the first database for benchmarking the performance of MLSFs.

ToCoDDB provides decoys for a large number of targets. However, users who cannot find their targets of interest in our database can freely use the automatic dataset generation interface to generate the corresponding target-specific datasets.

Statistics

overview of the database

Super-families: 28+

Targets: 155

Actives: 33,328

Decoys: 2,409,721
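The counts above imply a strong class imbalance, which matters when training an MLSF. The short Python sketch below derives database-wide averages from the reported totals. The function and key names are hypothetical, and the averages say nothing about how actives and decoys are distributed across individual targets.

```python
# Totals as reported in the statistics above.
TARGETS = 155
ACTIVES = 33_328
DECOYS = 2_409_721


def summarize(targets: int, actives: int, decoys: int) -> dict:
    """Compute database-wide averages from the overall counts."""
    return {
        "decoys_per_active": decoys / actives,
        "actives_per_target": actives / targets,
        "decoys_per_target": decoys / targets,
    }


summary = summarize(TARGETS, ACTIVES, DECOYS)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

On average there are roughly 72 decoys for every active, so a model trained on these data would typically need class weighting or resampling.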

Contact

You can contact us through email

Prof. Tingjun Hou

tingjunhou@zju.edu.cn

Zhejiang University

Hangzhou, Zhejiang, China

Prof. Dongsheng Cao

oriental-cds@163.com

Central South University

Changsha, Hunan, China

Xujun Zhang

xujunzhang@zju.edu.cn


If you use ToCoDDB, please cite ...


Copyright © 2021-2023 Tingjun Hou's Group. All Rights Reserved. | Powered by Python 3.7, Django 3.1.7, TensorFlow 2.0.0, RDKit 2019.03.1, Bootstrap 4.6