Modelling

In this module, users can choose a machine learning (ML) algorithm to train a customized scoring function based on the descriptors they submitted.

The dataset provided by the user is split into a training set and a test set according to a user-specified ratio and is preprocessed with sklearn. Three ML algorithms are provided: eXtreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), and Random Forest. Users can choose an algorithm and configure its hyper-parameter optimization: which hyper-parameters to optimize, their ranges, how the search space is generated, and the number of optimization rounds. According to these settings, the server uses hyperopt to find the optimal hyper-parameter combination and trains the chosen algorithm with 10-fold cross-validation before evaluating it on the test set.
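As a rough illustration of this workflow (not the server's actual code), the sketch below splits a toy dataset, standardizes it with sklearn, and lets hyperopt tune two XGBoost hyper-parameters with 10-fold cross-validation. The dataset, hyper-parameter choices, and ranges are all assumptions made for the example.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Toy stand-in for the user's descriptor table.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Split by a user-specified ratio, then standardize (fit on the training set only).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Search space over two hyper-parameters (the user picks which ones to tune).
space = {
    "max_depth": hp.quniform("max_depth", 1, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

def objective(params):
    model = XGBClassifier(
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        n_estimators=100,
        eval_metric="logloss",
    )
    # 10-fold cross-validation on the training set; hyperopt minimizes this loss.
    return 1.0 - cross_val_score(model, X_tr, y_tr, cv=10).mean()

# max_evals corresponds to the "tuning times" setting described below.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=Trials())
print("best hyper-parameters:", best)
```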

The suggestions given for each hyper-parameter on this page, together with the information provided on the result page (the model performance, the hyper-parameters of the model, the feature importances, the validation curve, the loss vs. hyper-parameter scatter plot, and the hyper-parameter vs. tuning-round scatter plot), offer users directions for improving model accuracy. Thanks to the multiple options offered by this server, users can run this module repeatedly to refine their models and achieve better performance.


Modelling Panel

Panel inputs: the descriptor file (.csv), the training/test split ratio (a value in (0, 1]), and an e-mail address for receiving the results.
  • Preprocessing

Standardization (parameter: Integer): mean removal and variance scaling. Available methods:

- StandardScaler: removes the mean and scales the data to unit variance;
- MinMaxScaler: rescales the data set so that all feature values lie in the range [0, 1];
- MaxAbsScaler: scales each feature so that its absolute values lie in the range [0, 1];
- RobustScaler: its centering and scaling statistics are based on percentiles and are therefore not influenced by a small number of very large marginal outliers;
- Normalizer: rescales the vector for each sample to have unit norm, independently of the distribution of the samples.

Feature selection (parameter: Integer): used for feature selection/dimensionality reduction on sample sets. Available methods:

- SelectFromModel: a Random Forest classifier computes impurity-based feature importances, which in turn can be used to discard irrelevant features; the parameter means the number of trees;
- Chi-squared: selects the best features based on univariate statistical tests; the parameter means the number of selected features;
- ANOVA: selects the best features based on univariate statistical tests; the parameter means the number of selected features;
- Mutual_info: selects the best features based on univariate statistical tests; the parameter means the number of selected features;
- Retain_all: no feature selection; the model will be trained with all features.
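These options map directly onto sklearn. Below is a minimal sketch of what each choice might look like in code, using toy data; the values of k and the tree count are illustrative, not the server's defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (SelectFromModel, SelectKBest, chi2,
                                       f_classif, mutual_info_classif)
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   RobustScaler, StandardScaler)

rng = np.random.default_rng(0)
X = rng.random((100, 20))      # toy descriptors (non-negative, which chi2 requires)
y = rng.integers(0, 2, 100)    # toy binary labels

# Standardization: each class implements one of the options listed above.
X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance
X_mm  = MinMaxScaler().fit_transform(X)     # features rescaled to [0, 1]
X_ma  = MaxAbsScaler().fit_transform(X)     # absolute values in [0, 1]
X_rb  = RobustScaler().fit_transform(X)     # percentile-based centering/scaling
X_nm  = Normalizer().fit_transform(X)       # unit-norm rows (per sample)

# Feature selection: the integer parameter is the number of selected
# features (univariate tests) or the number of trees (SelectFromModel).
X_anova = SelectKBest(f_classif, k=10).fit_transform(X_std, y)
X_mi    = SelectKBest(mutual_info_classif, k=10).fit_transform(X_std, y)
X_chi2  = SelectKBest(chi2, k=10).fit_transform(X_mm, y)  # chi2 needs non-negative input
X_sfm   = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0)
).fit_transform(X_std, y)
```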

  • Setting Hyper-parameters

Each entry below gives the hyper-parameter, its value type, its allowed range, and a tuning suggestion; the available hyper-parameters depend on the chosen algorithm.

XGBoost:

- tuning times (Integer, [0, 1000]): the number of hyper-parameter optimization rounds. The larger the value, the more likely it is to find better hyper-parameters, and the longer the search will take;
- n_estimators (Integer, [1, 9999]): number of boosting rounds. Increasing this value improves the learning ability of the model but also increases the risk of overfitting;
- learning rate (Float, (0, 1]): step size shrinkage used in updates to prevent overfitting; it shrinks the feature weights to make the boosting process more conservative;
- subsample (Float, (0, 1]): subsample ratio of the training instances. Decreasing this value can prevent overfitting;
- max depth (Integer, [1, 9999]): maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit;
- gamma (Float, [0, 9999]): minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be;
- min child weight (Float, [0, 9999]): minimum sum of instance weight (Hessian) needed in a child. The larger min_child_weight is, the more conservative the algorithm will be;
- colsample_bytree (Float, (0, 1]): subsample ratio of columns when constructing each tree. Decreasing this value can prevent overfitting;
- colsample_bylevel (Float, (0, 1]): subsample ratio of columns when constructing each level. Decreasing this value can prevent overfitting;
- colsample_bynode (Float, (0, 1]): subsample ratio of columns when constructing each node (split). Decreasing this value can prevent overfitting;
- alpha (Float, [0, 9999]): L1 regularization term on weights. Increasing this value makes the model more conservative;
- lambda (Float, [0, 9999]): L2 regularization term on weights. Increasing this value makes the model more conservative.
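As a concrete illustration, the XGBoost ranges above can be expressed as a hyperopt search space. This is a sketch only: the sub-ranges and the choice of distributions (uniform, quniform, loguniform) are assumptions about how a search space might be generated, not the server's configuration.

```python
from hyperopt import hp

# One possible hyperopt search space over the XGBoost hyper-parameters above.
# The sub-ranges are narrower than the full table ranges for practicality.
xgb_space = {
    "n_estimators":      hp.quniform("n_estimators", 1, 1000, 1),    # [1, 9999] in the table
    "learning_rate":     hp.loguniform("learning_rate", -5, 0),      # (0, 1]
    "subsample":         hp.uniform("subsample", 0.5, 1.0),          # (0, 1]
    "max_depth":         hp.quniform("max_depth", 1, 15, 1),         # [1, 9999]
    "gamma":             hp.uniform("gamma", 0.0, 5.0),              # [0, 9999]
    "min_child_weight":  hp.uniform("min_child_weight", 0.0, 10.0),  # [0, 9999]
    "colsample_bytree":  hp.uniform("colsample_bytree", 0.5, 1.0),   # (0, 1]
    "colsample_bylevel": hp.uniform("colsample_bylevel", 0.5, 1.0),  # (0, 1]
    "colsample_bynode":  hp.uniform("colsample_bynode", 0.5, 1.0),   # (0, 1]
    "alpha":             hp.uniform("alpha", 0.0, 5.0),              # L1 term, [0, 9999]
    "lambda":            hp.uniform("lambda", 0.0, 5.0),             # L2 term, [0, 9999]
}
```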

Support Vector Machine:

- tuning times (Integer, [0, 1000]): the number of hyper-parameter optimization rounds. The larger the value, the higher the probability of finding better hyper-parameters;
- C (Float, (0, 9999]): regularization parameter. The regularization strength is inversely proportional to C, so increasing this value makes the model fit the training data more closely and more likely to overfit;
- gamma (Float, (0, 9999]): kernel coefficient. No specific suggestion.
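A minimal sketch of how one (C, gamma) combination might be scored with 10-fold cross-validation on standardized data; hyperopt would propose many such combinations and keep the best. The dataset and values here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in for the user's (standardized) training descriptors.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

# Evaluate one candidate hyper-parameter combination with 10-fold CV.
model = SVC(C=1.0, gamma=0.01, kernel="rbf")
print(cross_val_score(model, X, y, cv=10).mean())
```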

Random Forest:

- tuning times (Integer, [0, 1000]): the number of hyper-parameter optimization rounds. The larger the value, the higher the probability of finding better hyper-parameters;
- n_estimators (Integer, [1, 9999]): the number of trees in the forest. Increasing this value improves the learning ability of the model but also increases the risk of overfitting;
- max depth (Integer, [1, 9999]): maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit;
- min samples leaf (Integer, [1, 9999]): the minimum number of samples required to be at a leaf node. Increasing this value makes the model less complex and less likely to overfit;
- min samples split (Integer, [2, 9999]): the minimum number of samples required to split an internal node. Increasing this value makes the model less complex and less likely to overfit;
- min impurity decrease (Float, [0, 9999]): a node will be split only if the split induces a decrease of the impurity greater than or equal to this value. Increasing this value makes the model less complex and less likely to overfit;
- max features (StrOrInt, [sqrt, log2] ∪ [1, 19999]): the number of features to consider when looking for the best split. Decreasing this value makes the model less complex and less likely to overfit.
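Note that max features mixes string options with an integer range. With hyperopt, such a mixed space is typically expressed with hp.choice; the sketch below is an assumed encoding, with an illustrative upper bound on the integer branch.

```python
from hyperopt import hp

# max_features may be "sqrt", "log2", or an integer; hp.choice samples one
# of the alternatives, with the integer branch drawn from its own sub-range.
# Cast the sampled value to int before passing it to the model.
max_features_space = hp.choice("max_features", [
    "sqrt",
    "log2",
    hp.quniform("max_features_int", 1, 100, 1),  # illustrative upper bound
])
```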