ML-PLIC

Cracking the code of protein-ligand interactions (PLIs) is of great importance for structure-based drug design and target fishing. From a physical and biochemical standpoint, PLIs can be characterized by different representations, such as the energy terms of scoring functions and protein-ligand interaction fingerprints, which machine learning (ML) algorithms can use to capture and learn the mode of interaction. Here, we present ML-PLIC, an ML-based protein-ligand interaction capturer that automatically characterizes PLIs and builds ML-based scoring functions (MLSFs) to identify potential binders of a specific protein target through virtual screening.

The Pipeline module integrates the individual Docking, Descriptors, Modelling, and Screening modules. Users can submit jobs for ligand docking, descriptor generation, modelling with one of three ML algorithms (eXtreme Gradient Boosting, Support Vector Machine, and Random Forest), and virtual screening, providing all of the required files once up front.

For the specific role of each module, see the Help page.




Pipeline Panel

[Upload fields: a protein structure (.pdb), ligand files (.mol2), and an email address]
  • Docking

parameter | type | description
protein repair

repair the protein structure

bonds_hydrogens: build bonds and add hydrogens;

bonds: build a single bond from each atom with no bonds to its closest neighbor;

hydrogens: add hydrogens;

checkhydrogens: add hydrogens only if there are none already;

None: do not make any repairs;

del_nonstd_residue

delete every non-standard residue

yes: delete any residue whose name is not in ['CYS', 'ILE', 'SER', 'VAL', 'GLN', 'LYS', 'ASN', 'PRO', 'THR', 'PHE', 'ALA', 'HIS', 'GLY', 'ASP', 'LEU', 'ARG', 'TRP', 'GLU', 'TYR', 'MET'];

no: no deletion;

protein cleanup

remove unnecessary atoms

non-polar hydrogens: merge charges and remove non-polar hydrogens;

lone pairs: merge charges and remove lone pairs;

water residues: remove water residues;

chains: remove chains composed entirely of residues of types other than the standard 20 amino acids;

standard: all the options mentioned above;

ligand repair

repair the ligand structure

bonds_hydrogens: build bonds and add hydrogens to any non-bonded atoms;

bonds: build a single bond from each atom with no bonds to its closest neighbor;

hydrogens: add hydrogens (PyBabel is used to add all hydrogens, not just polar hydrogens);

add_charge

add Gasteiger partial atomic charges

yes: add Gasteiger partial atomic charges;

no: do not add charges; the input ligand should already carry partial atomic charges;

ligand cleanup

remove unnecessary atoms

non-polar hydrogens: merge non-polar hydrogens by adding the charge of each non-polar hydrogen to the carbon it is bonded to, then remove it from the ligand;

lone pairs: merge lone pairs by adding the charge of each lone pair to the atom it is 'bonded' to, then remove the lone pair;

standard: all the options mentioned above;

search_space_size
Float

the size of the search space, restricting where the movable atoms may lie

exhaustiveness
Integer

exhaustiveness of the global search (run time is roughly proportional to this value)

num_binding
Integer

maximum number of binding modes to generate
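The search parameters above correspond to the settings of an AutoDock-Vina-style docking engine, and the preparation options mirror those of AutoDockTools. As an illustration only (ML-PLIC runs docking on the server), the sketch below shows the same settings with the AutoDock Vina Python bindings; the file names and box center are placeholders, and the receptor and ligand are assumed to be already prepared in PDBQT format.

```python
# Minimal docking sketch with the AutoDock Vina Python bindings.
# File names and binding-site coordinates are placeholders.
from vina import Vina

v = Vina(sf_name='vina')
v.set_receptor('protein_prepared.pdbqt')         # repaired and cleaned receptor
v.set_ligand_from_file('ligand_prepared.pdbqt')  # ligand with Gasteiger charges

# search_space_size: a cubic box restricting where the movable atoms may lie
v.compute_vina_maps(center=[15.0, 22.5, 8.0], box_size=[25.0, 25.0, 25.0])

# exhaustiveness: effort of the global search; n_poses: num_binding
v.dock(exhaustiveness=8, n_poses=9)
v.write_poses('docked_poses.pdbqt', n_poses=9, overwrite=True)
```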

  • Descriptors

  • Chosen Descriptors
Energy terms from scoring functions
Interaction Fingerprint
Molecular Fingerprint
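The descriptors themselves are computed by the server, but for orientation the sketch below shows what the Molecular Fingerprint block boils down to: one fixed-length bit vector per ligand, here a Morgan (ECFP-like) fingerprint generated with RDKit. The SMILES string is a placeholder.

```python
# Morgan (ECFP-like) molecular fingerprint with RDKit: a fixed-length
# bit vector usable as ML input. The ligand is a placeholder.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin as a stand-in
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)  # radius 2

features = np.zeros((2048,))
DataStructs.ConvertToNumpyArray(fp, features)      # bit vector -> numpy array
print(int(features.sum()), 'bits set')
```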
  • Modelling-preprocessing

type | method | parameter | description
Standardization
Integer

rescale the descriptors before modelling

StandardScaler: removes the mean and scales the data to unit variance;

MinMaxScaler: rescales the data set such that all feature values are in the range [0, 1];

MaxAbsScaler: scales each feature by its maximum absolute value, mapping values into the range [-1, 1];

RobustScaler: the centering and scaling statistics of this scaler are based on percentiles and are therefore not influenced by a small number of very large marginal outliers;

Normalizer: rescales the vector for each sample to have unit norm, independently of the distribution of the samples;

Feature selection
Integer

feature selection/dimensionality reduction on sample sets

SelectFromModel: a Random Forest classifier computes impurity-based feature importances, which are then used to discard irrelevant features; the parameter is the number of trees;

Chi-squared: select the best features based on a univariate chi-squared test; the parameter is the number of features to keep;

ANOVA: select the best features based on a univariate ANOVA F-test; the parameter is the number of features to keep;

Mutual_info: select the best features based on univariate mutual information; the parameter is the number of features to keep;

Retain_all: no feature selection; the model is trained on all features;
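The scalers and selectors listed above correspond directly to scikit-learn classes. A minimal sketch of the preprocessing step, assuming random placeholder data in place of real descriptors, chaining a RobustScaler with an ANOVA-based SelectKBest:

```python
# Preprocessing sketch with scikit-learn: scaling followed by univariate
# feature selection, mirroring the options above. X and y are placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))     # 200 complexes x 50 descriptors
y = rng.integers(0, 2, size=200)   # binder / non-binder labels

prep = Pipeline([
    ('scale', RobustScaler()),                  # robust to marginal outliers
    ('select', SelectKBest(f_classif, k=20)),   # ANOVA: keep the 20 best features
])
X_reduced = prep.fit_transform(X, y)
print(X_reduced.shape)  # (200, 20)
```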

  • Hyper-parameters

parameter | type | value range | description

eXtreme Gradient Boosting (XGBoost)
tuning times
Integer

[0, 1000] number of hyper-parameter optimization iterations;

the larger the value, the more likely it is that better hyper-parameters are found, and the longer the search takes;

n_estimators
Integer

[1, 9999] number of boosting rounds;

increasing this value improves the learning capacity of the model but also increases the risk of overfitting;

learning rate
Float

(0, 1] step-size shrinkage used in each update to prevent overfitting;

the learning rate shrinks the feature weights to make the boosting process more conservative;

subsample
Float

(0, 1] subsample ratio of the training instances;

decreasing this value can prevent overfitting;

max depth
Integer

[1, 9999] maximum depth of a tree;

Increasing this value will make the model more complex and more likely to overfit;

gamma
Float

[0, 9999] minimum loss reduction required to make a further partition on a leaf node of the tree;

The larger gamma is, the more conservative the algorithm will be;

min child weight
Float

[0, 9999] minimum sum of instance weight (hessian) needed in a child;

The larger min_child_weight is, the more conservative the algorithm will be;

colsample_bytree
Float

(0, 1] the subsample ratio of columns when constructing each tree;

decreasing this value can prevent overfitting;

colsample_bylevel
Float

(0, 1] the subsample ratio of columns when constructing each level;

decreasing this value can prevent overfitting;

colsample_bynode
Float

(0, 1] the subsample ratio of columns when constructing each node (split);

decreasing this value can prevent overfitting;

alpha
Float

[0, 9999] L1 regularization term on weights;

increasing this value makes the model more conservative;

lambda
Float

[0, 9999] L2 regularization term on weights;

increasing this value makes the model more conservative;
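The parameters above match the xgboost scikit-learn interface, where alpha and lambda are exposed as reg_alpha and reg_lambda. The sketch below approximates the "tuning times" setting with the n_iter of a RandomizedSearchCV; the server may use a different optimizer, and the data and ranges are illustrative placeholders.

```python
# Randomized hyper-parameter search over the XGBoost parameters listed
# above. X, y and all ranges are illustrative placeholders.
import numpy as np
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

param_space = {
    'n_estimators':     randint(50, 1000),
    'learning_rate':    uniform(0.01, 0.29),  # (0.01, 0.3]
    'subsample':        uniform(0.5, 0.5),    # [0.5, 1.0]
    'max_depth':        randint(2, 12),
    'gamma':            uniform(0.0, 5.0),
    'min_child_weight': uniform(0.0, 10.0),
    'colsample_bytree': uniform(0.5, 0.5),
    'reg_alpha':        uniform(0.0, 1.0),    # alpha: L1 term
    'reg_lambda':       uniform(0.0, 1.0),    # lambda: L2 term
}
search = RandomizedSearchCV(
    XGBClassifier(eval_metric='logloss'),
    param_space,
    n_iter=25,          # "tuning times"
    scoring='roc_auc',
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```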

Support Vector Machine (SVM)

tuning times
Integer

[0, 1000] number of hyper-parameter optimization iterations;

the larger the value, the more likely it is that better hyper-parameters are found;

C
Float

(0, 9999] regularization parameter;

the regularization strength is inversely proportional to C, so decreasing this value makes the model more conservative;

gamma
Float

(0, 9999] kernel coefficient;

larger values make the decision boundary more sensitive to individual training samples and more prone to overfitting;
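A minimal scikit-learn sketch for the two SVM hyper-parameters, here as a small grid search over C and gamma with an RBF kernel; the data and grid values are illustrative placeholders.

```python
# Grid search over the SVM hyper-parameters C and gamma.
# X, y and the grid values are illustrative placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

grid = GridSearchCV(
    SVC(kernel='rbf'),
    {'C': [0.1, 1, 10, 100], 'gamma': [1e-3, 1e-2, 1e-1, 1]},
    scoring='roc_auc',
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```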

Random Forest (RF)

tuning times
Integer

[0, 1000] number of hyper-parameter optimization iterations;

the larger the value, the more likely it is that better hyper-parameters are found;

n_estimators
Integer

[1, 9999] the number of trees in the forest;

increasing this value generally improves model stability at the cost of longer training times;

max depth
Integer

[1, 9999] maximum depth of a tree;

Increasing this value will make the model more complex and more likely to overfit;

min samples leaf
Integer

[1, 9999] the minimum number of samples required to be at a leaf node;

increasing this value makes the model less complex and less likely to overfit;

min samples split
Integer

[2, 9999] the minimum number of samples required to split an internal node;

Increasing this value will make the model less complex and less likely to overfit;

min impurity decrease
Float

[0, 9999] a node will be split only if the split induces a decrease of the impurity greater than or equal to this value;

Increasing this value will make the model less complex and less likely to overfit;

max features
StrOrInt

[sqrt, log2] ∪ [1, 19999] the number of features to consider when looking for the best split;

Decreasing this value will make the model less complex and less likely to overfit;
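A minimal scikit-learn sketch instantiating a Random Forest with the hyper-parameters listed above; the data and parameter values are illustrative placeholders, not tuned settings.

```python
# Random Forest with the hyper-parameters from the table above.
# X, y and the parameter values are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

clf = RandomForestClassifier(
    n_estimators=500,
    max_depth=10,
    min_samples_leaf=2,
    min_samples_split=4,
    min_impurity_decrease=0.0,
    max_features='sqrt',
)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy; use cross-validation in practice
```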